git-annex: Managing my most ancient data
Published on Nov 15, 2024.
At the moment, my various files are quite spread out between different computers, a number of USB drives and even old hard drives I usually keep offline. For the stuff I need access to every day and always want to have the newest version of, such as my notes, Emacs configuration or music files, I have been relying on Syncthing: a private, continuous file synchronisation tool which only keeps copies of my files on my own devices.
Generally, that is a great solution, but for me it breaks down when adding either large or very many files that I sometimes want access to. If I were to sync, for example, all video files as well as all archived project files between all devices, then I would have to invest in new storage first. But if I don’t sync all files, then I am sure to spend a while searching before I find what I am looking for.
This is one of the scenarios that git-annex is out to solve. It builds upon git but allows you to keep files in an annex instead of in the standard repository. When cloning the repository, you do not automatically get all the files in the annex: on your target machine, you will instead only see a (broken) symbolic link. Just like git, git-annex is decentralized and allows many different types of remotes, including “special” ones like the hosted storage offerings of large-brand tech companies. Using git-annex, you can not only find out how many copies of any given file exist and where they are located but even retrieve the file with a single command. That is, as long as e.g. the necessary drive is connected or the remote computer is reachable, depending on your configuration and distribution of data.
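In practice, that means commands like the following two (using one of the files that shows up later in this post):

$ git annex whereis dorf.map   # list which repositories hold a copy of this file
$ git annex get dorf.map       # retrieve the content from one of them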
This way, you can carry only what you need with you on your laptop knowing that you can always retrieve missing files on the go. When deleting (“dropping” in git-annex terminology) files, git-annex will make sure that a minimum number of copies are still around to prevent you from deleting the last remaining copy of your holiday pictures from last year.
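That minimum is governed by the numcopies setting; a quick sketch of how this looks on the command line, again with an example file:

$ git annex numcopies 2     # never let the number of known copies drop below two
$ git annex drop dorf.map   # only removes the local copy if enough other copies can be verified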
If you are interested in getting started, I found the walkthrough on the project’s homepage very helpful. I would suggest trying it out on a practice repository first: some concepts, such as “unlocking” files to edit them, take some time to get used to and might not be the right approach for your data management needs. I am still quite new to it myself and have been testing git-annex for a few weeks now. I am quite satisfied with it so far, even if the day-to-day workflows are not really in my muscle memory yet and I am still figuring out how to configure everything to my needs.
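To give an idea of the unlocking workflow, here is a rough sketch with a made-up file name; the exact behaviour depends on the repository version:

$ git annex unlock notes.org        # turn the symlink into a regular, editable file
$ echo "one more idea" >> notes.org
$ git annex add notes.org           # store the new content in the annex
$ git commit -m "Update notes"
$ git annex lock notes.org          # optionally go back to the space-saving symlink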
For me, the main issue so far has been that git-annex by default does not track the modification time of a file. While it keeps it intact when adding a file, any cloned repository will have all files and their symlinks appear to have last been modified when the repository was cloned. So even when I look into the oldest data I have on this computer, I only see:
$ ls -lHh *
-r--r--r-- 1 hanno hanno  49K nov 11 10:24 dorf.map
-r--r--r-- 1 hanno hanno  18K nov 11 10:24 end.map
-r--r--r-- 1 hanno hanno  12K nov 11 10:24 fire2.map
-r--r--r-- 1 hanno hanno  32K nov 11 10:24 fire.map
-r--r--r-- 1 hanno hanno 4,1K nov 11 10:24 forfirst.itm
-r--r--r-- 1 hanno hanno 2,7K nov 11 10:24 forfirst.mon
-r--r--r-- 1 hanno hanno 2,4K nov 11 10:24 forfirst.spl
-r--r--r-- 1 hanno hanno  22K nov 11 10:24 forrest.map
-r--r--r-- 1 hanno hanno  14K nov 11 10:24 garten.map
-r--r--r-- 1 hanno hanno  43K nov 11 10:24 kanal.map
-r--r--r-- 1 hanno hanno  45K nov 11 10:24 schloss.map
-r--r--r-- 1 hanno hanno  33K nov 11 10:24 unbek.map
I know those files are older than that!
For some files such as photos with embedded metadata, this is not an issue. But for my old, archived documents and projects, I do want to know when I last touched them.
The good news is that git-annex has support for arbitrary metadata and even allows the modification date of a file to be automatically recorded when the file is being added to the annex: unfortunately, it is off by default, so I ran git config annex.genmetadata true inside the repository to enable this feature. Now, we can retrieve the metadata:
$ git annex metadata 50\ Years\ of\ Text\ Games\ -\ Aaron\ A.\ Reed.epub
metadata 50 Years of Text Games - Aaron A. Reed.epub
  day=27
  day-lastchanged=2024-11-14@12-59-21
  lastchanged=2024-11-14@12-59-21
  month=07
  month-lastchanged=2024-11-14@12-59-21
  year=2024
  year-lastchanged=2024-11-14@12-59-21
ok
Great! It keeps the year, month and day as well as when the respective information was last changed. However, there are two crucial drawbacks: firstly, this will only work for files added after the setting became active. As it is not the default, I already had a bunch of data without this metadata in my repository. Secondly, while a cloned repository will have this meta information available too, the file’s mtime will still be set to when the cloning was done and not to the date from the metadata.
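For a single file, the missing fields could of course be set by hand; a quick sketch with made-up values:

$ git annex metadata old-report.txt -s year=1998 -s month=03 -s day=14

Doing that for thousands of archived files is not realistic, though, so some scripting is needed.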
Let’s address the first issue and add mtime metadata to all files in the repository. I had a newly-cloned annex (with neither metadata nor the correct modification times set) as well as an old rsync’d copy from which I want to retrieve and transfer the modification times. So I created a bash script:
1: ANNEX="/home/hanno/annex/Documents"
2: DOCS_OLD="/home/hanno/Documents.old"
3: 
4: function set_meta {
5:     afile="$(echo "$1" | sed "s#$DOCS_OLD#$ANNEX#")"
6:     if [ -e "$afile" -a ! -d "$afile" ]
7:     then
8:         DAY=$(date -r "$1" "+%d")
9:         MONTH=$(date -r "$1" "+%m")
10:        YEAR=$(date -r "$1" "+%Y")
11:        git annex metadata "$afile" -s day=$DAY -s month=$MONTH -s year=$YEAR
12:    else
13:        if [ ! -d "$afile" ]
14:        then
15:            echo "File not in annex: $afile"
16:        fi
17:    fi
18: }
19: 
20: function set_mdate {
21:    afile="$(echo "$1" | sed "s#$DOCS_OLD#$ANNEX#")"
22:    if [ -e "$afile" ]
23:    then
24:        MDATE=$(date -r "$1" "+%Y%m%d%H%M")
25:        touch -t $MDATE "$afile"
26:        touch -h -t $MDATE "$afile"
27:    fi
28: }
29: 
30: 
31: find $DOCS_OLD | while read file
32: do
33:    set_meta "$file"
34:    set_mdate "$file"
35: done
The script searches through the rsync’d path on line 31 and calls two functions for each file found: set_meta (on line 4) and set_mdate (on line 20), which set the metadata in the annex and adjust the mtime of the file in the annex, respectively. This is provided that the file exists under the same (sub) path in the annex, of course, as the if conditions in the functions ensure. set_mdate will adjust the date for both the symlink itself and the file linked to by running touch once with the -h flag and once without.
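For reference, this is what the two touch calls amount to; the timestamp is just an illustration:

touch -t 199406201201 dorf.map      # follows the symlink and sets the mtime of the annexed content
touch -h -t 199406201201 dorf.map   # sets the mtime of the symlink itself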
In case the files in your local annex still have the original modification time but not the corresponding metadata set in the annex, you should be able to run the above script with the reference variable DOCS_OLD pointing to your annex. It is best to also remove the call to set_mdate, as it would be unnecessary in this case.
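A minimal sketch of that variant, reusing set_meta from the script above with both paths pointing at the annex:

ANNEX="/home/hanno/annex/Documents"
DOCS_OLD="$ANNEX"   # the annex itself is now the reference

find $DOCS_OLD | while read file
do
    set_meta "$file"
done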
When all files finally have the necessary metadata entries, we can use the following script to set the files’ mtime accordingly on freshly cloned repositories:
1: ANNEX="/home/hanno/annex/Documents"
2: 
3: find $ANNEX | while read file
4: do
5:     if [ ! -d "$file" ]
6:     then
7:         META="$(git annex metadata "$file")"
8:         if echo "$META" | grep -q day
9:         then
10:            DAY=$(echo "$META" | grep 'day=' | sed 's/.*=//')
11:            MONTH=$(echo "$META" | grep 'month=' | sed 's/.*=//')
12:            YEAR=$(echo "$META" | grep 'year=' | sed 's/.*=//')
13:            MDATE="${YEAR}${MONTH}${DAY}1201"
14:            echo "Setting mtime $MDATE on $file"
15:            touch -t $MDATE "$file"
16:            touch -h -t $MDATE "$file"
17:        else
18:            echo "$file has no metadata set."
19:        fi
20:    fi
21: done
Note that the time is simply set to 12:01 on line 13 as this information is lacking in the metadata.
I plan on only running this once, as for files retrieved later the deviation from the “real” modification time should be minor (and the correct one will be stored in the annex’ metadata). But don’t forget to run git config annex.genmetadata true in all cloned repositories as the configuration option is not synced between annexes!
With all this in place, I can finally browse my old data sets and see this:
$ ls -lHh *
-r--r--r-- 1 hanno hanno  49K jun 20  1994 dorf.map
-r--r--r-- 1 hanno hanno  18K apr  3  1994 end.map
-r--r--r-- 1 hanno hanno  12K apr  6  1994 fire2.map
-r--r--r-- 1 hanno hanno  32K apr  6  1994 fire.map
-r--r--r-- 1 hanno hanno 4,1K dec  3  1993 forfirst.itm
-r--r--r-- 1 hanno hanno 2,7K nov 23  1994 forfirst.mon
-r--r--r-- 1 hanno hanno 2,4K dec  3  1993 forfirst.spl
-r--r--r-- 1 hanno hanno  22K apr  6  1994 forrest.map
-r--r--r-- 1 hanno hanno  14K apr  7  1994 garten.map
-r--r--r-- 1 hanno hanno  43K dec  1  1994 kanal.map
-r--r--r-- 1 hanno hanno  45K dec  3  1993 schloss.map
-r--r--r-- 1 hanno hanno  33K dec  3  1993 unbek.map
Yes, that looks about right! 😸
In case you are wondering, these files belong to one of my first attempts at making a video game using the Bard’s Tale Construction Set.