hoowl

Physicist, Emacser, Digitales Spielkind

git-annex: Managing my most ancient data
Published on Nov 15, 2024.

At the moment, my various files are quite spread out between different computers, a number of USB drives and even old hard drives I usually keep offline. For the stuff I need access to every day and always want the newest version of, such as my notes, Emacs configuration or music files, I have been relying on Syncthing: a private, continuous file synchronisation tool which only keeps copies of my files on my own devices.

Generally, that is a great solution, but for me it breaks down when adding either large or very many files that I only sometimes want access to. If I were to sync, for example, all video files as well as all archived project files between all devices, then I would have to invest in new storage first. But if I don’t sync all files, then I am sure to have to search a while before I find what I am looking for.

This is one of the scenarios that git-annex sets out to solve. It builds upon git but allows you to keep files in an annex instead of in the standard repository. When cloning the repository, you do not automatically get all the files in the annex: on your target machine, you will instead only see a (broken) symbolic link. Just like git, git-annex is decentralized and allows many different types of remotes, including “special” ones like the hosted storage offerings of large-brand tech companies. Using git-annex, you can not only find out how many copies of any given file exist and where they are located but even retrieve the file with a single command. That works as long as, for example, the necessary drive is connected or the remote computer is reachable, depending on your configuration and distribution of data.

This way, you can carry only what you need with you on your laptop knowing that you can always retrieve missing files on the go. When deleting (“dropping” in git-annex terminology) files, git-annex will make sure that a minimum number of copies are still around to prevent you from deleting the last remaining copy of your holiday pictures from last year.
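The locate, retrieve and drop steps described above could be wrapped roughly like this. This is only a sketch: the function names annex_fetch and annex_release are made up for illustration, the minimum of two copies is an arbitrary choice, and the functions assume they are called from inside a git-annex repository.

```shell
# Sketch only: helper names are hypothetical, and the file passed in is a placeholder.

# Show which repositories hold a copy of the file, then fetch its content.
annex_fetch() {
    git annex whereis "$1" &&   # list the known locations of the file
    git annex get "$1"          # download the content so the symlink resolves
}

# Require two copies repository-wide, then give up the local copy.
annex_release() {
    git annex numcopies 2 &&    # minimum number of copies git-annex should preserve
    git annex drop "$1"         # refused if too few other copies can be verified
}
```

Calling annex_fetch with the path of an annexed file first prints where copies live and then retrieves one; annex_release frees the local space again if enough copies exist elsewhere.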

If you are interested in getting started, I found the walkthrough on the project’s homepage very helpful. I would suggest trying it out on a practice repository first: some concepts, such as “unlocking” files for editing them, take some time to get used to and might not be the right approach to your data management needs. I am myself still quite new to it and have been testing git-annex for a few weeks now. I am quite satisfied with it so far, even if the day-to-day workflows are not really in my muscle memory yet and I am still figuring out how to configure everything to my needs.

For me, the main issue so far has been that git-annex by default does not track the modification time of a file. While it keeps it intact when adding a file, any cloned repository will have all files and their symlinks appear to have last been modified when the repository was cloned. So even when I look into the oldest data I have on this computer, I only see:

$ ls -lHh *
-r--r--r-- 1 hanno hanno  49K nov 11 10:24 dorf.map
-r--r--r-- 1 hanno hanno  18K nov 11 10:24 end.map
-r--r--r-- 1 hanno hanno  12K nov 11 10:24 fire2.map
-r--r--r-- 1 hanno hanno  32K nov 11 10:24 fire.map
-r--r--r-- 1 hanno hanno 4,1K nov 11 10:24 forfirst.itm
-r--r--r-- 1 hanno hanno 2,7K nov 11 10:24 forfirst.mon
-r--r--r-- 1 hanno hanno 2,4K nov 11 10:24 forfirst.spl
-r--r--r-- 1 hanno hanno  22K nov 11 10:24 forrest.map
-r--r--r-- 1 hanno hanno  14K nov 11 10:24 garten.map
-r--r--r-- 1 hanno hanno  43K nov 11 10:24 kanal.map
-r--r--r-- 1 hanno hanno  45K nov 11 10:24 schloss.map
-r--r--r-- 1 hanno hanno  33K nov 11 10:24 unbek.map

I know those files are older than that!

For some files, such as photos with embedded metadata, this is not an issue. But for my old, archived documents and projects, I do want to know when I last touched them.

The good news is that git-annex has support for arbitrary metadata and even allows the modification date of a file to be automatically recorded when the file is being added to the annex: unfortunately, it is off by default, so I ran git config annex.genmetadata true inside the repository to enable this feature. Now, we can retrieve the metadata:

$ git annex metadata 50\ Years\ of\ Text\ Games\ -\ Aaron\ A.\ Reed.epub
metadata 50 Years of Text Games - Aaron A. Reed.epub
  day=27
  day-lastchanged=2024-11-14@12-59-21
  lastchanged=2024-11-14@12-59-21
  month=07
  month-lastchanged=2024-11-14@12-59-21
  year=2024
  year-lastchanged=2024-11-14@12-59-21
ok

Great! It keeps the year, month and day as well as when the respective information was last changed. However, there are two crucial drawbacks: firstly, this will only work for files we are adding after this setting became active. As it is not the default, I already had a bunch of data without this metadata in my repository. Secondly, while a cloned repository will have this meta information available too, the file’s mtime will still be set to when the cloning was done and not to the date from the metadata.

Let’s address the first issue and add mtime metadata to all files in the repository. I had a newly-cloned annex (with neither the metadata nor the correct modification times set) as well as an old rsync’d copy from which I wanted to retrieve and transfer the modification times. So I created a bash script:

 1: ANNEX="/home/hanno/annex/Documents"
 2: DOCS_OLD="/home/hanno/Documents.old"
 3: 
 4: function set_meta {  # record day, month and year of the old file as annex metadata
 5:     afile="$ANNEX${1#"$DOCS_OLD"}"  # same relative path, but inside the annex
 6:     if [ -e "$afile" ] && [ ! -d "$afile" ]
 7:     then
 8:         DAY=$(date -r "$1" "+%d")
 9:         MONTH=$(date -r "$1" "+%m")
10:         YEAR=$(date -r "$1" "+%Y")
11:         git annex metadata "$afile" -s day="$DAY" -s month="$MONTH" -s year="$YEAR"
12:     else
13:         if [ ! -d "$afile" ]
14:         then
15:             echo "File not in annex: $afile"
16:         fi
17:     fi
18: }
19: 
20: function set_mdate {  # copy the old file's mtime to the annexed file and its symlink
21:     afile="$ANNEX${1#"$DOCS_OLD"}"
22:     if [ -e "$afile" ]
23:     then
24:         MDATE=$(date -r "$1" "+%Y%m%d%H%M")
25:         touch -t "$MDATE" "$afile"     # the file the symlink points to
26:         touch -h -t "$MDATE" "$afile"  # the symlink itself
27:     fi
28: }
29: 
30: 
31: find "$DOCS_OLD" | while IFS= read -r file
32: do
33:     set_meta "$file"
34:     set_mdate "$file"
35: done

The script walks through the old copy under DOCS_OLD on line 31 and calls two functions for each file found: set_meta (on line 4) and set_mdate (on line 20), which set the metadata in the annex and adjust the mtime of the file in the annex, respectively. This only happens if the file exists under the same (sub)path in the annex, of course, as the if conditions in the functions ensure. set_mdate adjusts the date for both the symlink itself and the file it links to by running touch once with the -h flag and once without.

In case the files in your local annex still have the original modification time but not the corresponding metadata set in the annex, you should be able to run the above script after setting the reference variable DOCS_OLD to your annex. It is best to remove the call to set_mdate as well, as that would be unnecessary in this case.
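The two building blocks of the script, mapping a path from the old copy into the annex and touching both the symlink and its target, can also be sketched in isolation. The helper names map_to_annex and copy_mtime are made up for this illustration, and touch -r (copy the timestamp from a reference file) is swapped in for the date/touch -t combination used above; I have not run this variant on my own annex.

```shell
# Sketch only: helper names are hypothetical.

# Map a path below the old copy to the same relative path in the annex.
map_to_annex() {  # usage: map_to_annex OLD_ROOT ANNEX_ROOT PATH
    printf '%s\n' "$2${3#"$1"}"
}

# Copy the mtime of SRC onto DST and, if DST is a symlink, onto the link itself.
copy_mtime() {  # usage: copy_mtime SRC DST
    [ -e "$2" ] || return 0      # skip missing files and dangling symlinks
    touch -r "$1" "$2"           # follows a symlink: changes the file it points to
    touch -h -r "$1" "$2"        # -h: changes the symlink itself
}
```

Compared to formatting the date by hand, touch -r keeps the seconds of the original timestamp as well.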

When all files finally have the necessary metadata entries, we can use the following script to set the files’ mtime accordingly on freshly cloned repositories:

 1: ANNEX="/home/hanno/annex/Documents"
 2: 
 3: find "$ANNEX" -path "$ANNEX/.git" -prune -o -print | while IFS= read -r file
 4: do
 5:     if [ ! -d "$file" ]
 6:     then
 7:         META="$(git annex metadata "$file")"
 8:         if echo "$META" | grep -q '^ *day='
 9:         then
10:             DAY=$(echo "$META" | sed -n 's/^ *day=//p')
11:             MONTH=$(echo "$META" | sed -n 's/^ *month=//p')
12:             YEAR=$(echo "$META" | sed -n 's/^ *year=//p')
13:             MDATE="${YEAR}${MONTH}${DAY}1201"
14:             echo "Setting mtime $MDATE on $file"
15:             touch -t "$MDATE" "$file"     # the file the symlink points to
16:             touch -h -t "$MDATE" "$file"  # the symlink itself
17:         else
18:             echo "$file has no metadata set."
19:         fi
20:     fi
21: done

Note that the time is simply set to 12:01 on line 13 as this information is lacking in the metadata.
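The step that turns the metadata listing into a timestamp for touch -t can be isolated into a small filter. This is a sketch under the assumption that the output of git annex metadata looks like the listing shown earlier; the function name meta_to_stamp is made up, and awk is used here instead of grep and sed.

```shell
# Sketch only: meta_to_stamp is a hypothetical helper name.
# Reads `git annex metadata FILE` output on stdin and prints a stamp for
# `touch -t`. Prints nothing and returns 1 if year, month or day is missing.
meta_to_stamp() {
    awk -F= '
        $1 ~ /^[[:space:]]*year$/  { y = $2 }
        $1 ~ /^[[:space:]]*month$/ { m = $2 }
        $1 ~ /^[[:space:]]*day$/   { d = $2 }
        END {
            if (y == "" || m == "" || d == "") exit 1
            # the time of day is not stored, so fall back to 12:01
            printf "%04d%02d%02d1201\n", y, m, d
        }'
}
```

It could then be used as touch -t "$(git annex metadata "$file" | meta_to_stamp)" "$file". Because only exact field names are matched, a file that merely has “day” in its name is not mistaken for one that has metadata.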

I plan on running this only once, as for files retrieved later the deviation from the “real” modification time should be minor (and the correct one will be stored in the annex’s metadata). But don’t forget to run git config annex.genmetadata true in all cloned repositories, as the configuration option is not synced between annexes!
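Since the option has to be set in every clone separately, a small loop can help. This is a sketch: the function name enable_genmetadata is made up, and the repository paths you pass in are whatever your local clones happen to be.

```shell
# Sketch only: enable_genmetadata is a hypothetical helper name.
# Enable automatic date metadata in every repository given as an argument.
enable_genmetadata() {
    for repo in "$@"; do
        git -C "$repo" config annex.genmetadata true || return 1
    done
}
```

Afterwards, git -C followed by a repository path and config --get annex.genmetadata should print true for each of them.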

With all this in place, I can finally browse my old data sets and see this:

$ ls -lHh *
-r--r--r-- 1 hanno hanno  49K jun 20  1994 dorf.map
-r--r--r-- 1 hanno hanno  18K apr  3  1994 end.map
-r--r--r-- 1 hanno hanno  12K apr  6  1994 fire2.map
-r--r--r-- 1 hanno hanno  32K apr  6  1994 fire.map
-r--r--r-- 1 hanno hanno 4,1K dec  3  1993 forfirst.itm
-r--r--r-- 1 hanno hanno 2,7K nov 23  1994 forfirst.mon
-r--r--r-- 1 hanno hanno 2,4K dec  3  1993 forfirst.spl
-r--r--r-- 1 hanno hanno  22K apr  6  1994 forrest.map
-r--r--r-- 1 hanno hanno  14K apr  7  1994 garten.map
-r--r--r-- 1 hanno hanno  43K dec  1  1994 kanal.map
-r--r--r-- 1 hanno hanno  45K dec  3  1993 schloss.map
-r--r--r-- 1 hanno hanno  33K dec  3  1993 unbek.map

Yes, that looks about right! 😸

In case you are wondering, these files belong to one of my first attempts at making a video game using the Bard’s Tale Construction Set.

Tags: git, gitannex