hoowl

Physicist, Emacser, Digitales Spielkind

Revising history: How to clean a git project prior to publication
Published on Jul 27, 2020.

Version control systems such as git are amazing tools for software development, and most projects involving more than a handful of developers would hardly work without them. Git in particular has proven incredibly useful to me even in small, personal projects and for things other than source code, such as my notes and (system) configuration files. Being able to review an evening’s worth of frantic hacking, or to figure out why things look different from what I remember when returning weeks later, really eases my mind.

But of course, I usually don’t maintain the same ’hygiene’ in those private repositories: I commit binary files, include passwords or other secrets and, on occasion, add rather explicit comments. That becomes an obstacle when I later want to publish one of those repositories after all.

One could simply strip all of the history created by git, e.g. by deleting the .git/ directory in the root of the repository and creating a fresh repository from the current working directory. When that is not desired, there is an alternative: git-filter-repo, a separate tool that is installed alongside git.
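For completeness, here is a minimal sketch of that fresh-start approach. It runs in a throwaway directory, and the file name, commit messages and demo identity are made up for illustration:

```shell
#!/bin/sh
# Sketch: discard all history and start over from the current working tree.
# Everything happens in a temporary directory so nothing real is touched.
set -eu
demo=$(mktemp -d)
cd "$demo"

# A tiny repository with two commits stands in for the private project.
git init -q
git config user.name "Demo User"
git config user.email "demo@example.com"
echo "first draft" > notes.txt
git add notes.txt
git commit -q -m "first commit"
echo "second draft" >> notes.txt
git commit -q -am "second commit"
before=$(git rev-list --count HEAD)

# The actual reset: drop git's database, then re-import the files as they are now.
rm -rf .git
git init -q
git config user.name "Demo User"
git config user.email "demo@example.com"
git add -A
git commit -q -m "Initial public commit"
after=$(git rev-list --count HEAD)

echo "commits before: $before, after: $after"
```

The working tree survives untouched; only the record of how it got there is gone.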

With git-filter-repo, one can easily rewrite the history of a repository: removing files, changing commit messages or even search-replacing strings across all files to remove e.g. passwords. To be clear, this will lead to histories that are incompatible with those of your collaborators in most cases, so use it with care, make a backup and read the documentation! The latter provides several useful examples. Here are some of the ones I have already found applications for:

Removing files accidentally included in commits

As a version control system, git keeps a copy of every committed file in its history, forever, even after the file has been deleted from the current working tree.

To search for files that were deleted but are still accessible in the history, you can run git log and filter on deletions:

git log --diff-filter=D --summary

Usually, that is exactly what one expects and not a problem at all. But at some point one might have committed a very large file, a bunch of temporary files or files containing sensitive information. So, how to get rid of these for good?
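Before removing anything, it helps to know what is actually lurking in there. The following sketch lists the largest file versions stored anywhere in the history, whether or not they still exist in the working tree. The helper name largest_blobs and the demo files are my own invention; the pipeline itself only uses plain git:

```shell
#!/bin/sh
# Sketch: list the biggest blobs (file versions) found anywhere in the history.
set -eu
work=$(mktemp -d)
git init -q "$work/repo"
cd "$work/repo"
git config user.name "Demo User"
git config user.email "demo@example.com"

# Demo history: a 200 kB file is committed and deleted again.
dd if=/dev/zero of=big.bin bs=1000 count=200 2>/dev/null
echo "hello" > small.txt
git add -A
git commit -q -m "add files"
git rm -q big.bin
git commit -q -m "remove big file"

# Prints "<size in bytes> <path>" for the ten largest blobs, biggest first.
largest_blobs() {
  git rev-list --objects --all |
    git cat-file --batch-check='%(objecttype) %(objectname) %(objectsize) %(rest)' |
    sed -n 's/^blob //p' |
    cut -d' ' -f2- |
    sort -rn |
    head -n 10
}

top=$(largest_blobs | head -n 1)
echo "$top"
```

In the demo, big.bin still tops the list although it was deleted in the second commit.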

As an example, when I was still using Apple’s OS X, I sometimes included those .DS_Store meta information files that the OS scatters across the file system. There is no reason to keep them accumulating in the repository’s history:

git filter-repo --invert-paths --path '.DS_Store' --use-base-name

Here, --use-base-name matches on file base names instead of full paths, and --invert-paths keeps everything that does not match. To match a pattern instead, you can use

git filter-repo --invert-paths --path-glob '*/*.jpg'

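which removes all files with the jpg extension from every subdirectory. As far as I can tell, * in these globs also matches across directory separators, so a plain '*.jpg' should additionally catch jpg files in the root of the repository.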
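To check what a path filter will hit, or to confirm afterwards that nothing is left, one can list every path that was ever added to the history. This is a sketch using plain git only; the function name paths_in_history and the demo files are assumptions for illustration:

```shell
#!/bin/sh
# Sketch: find every .DS_Store that ever existed in the history.
set -eu
work=$(mktemp -d)
git init -q "$work/repo"
cd "$work/repo"
git config user.name "Demo User"
git config user.email "demo@example.com"

# Demo history: two .DS_Store files get committed and later deleted again.
mkdir docs
echo "text" > docs/readme.txt
echo "junk" > .DS_Store
echo "junk" > docs/.DS_Store
git add -A
git commit -q -m "add docs (and some junk)"
git rm -q .DS_Store docs/.DS_Store
git commit -q -m "remove junk"

# Every path that was ever added on any branch, one per line.
paths_in_history() {
  git log --all --format= --name-only --diff-filter=A | sed '/^$/d' | sort -u
}

# Restrict to the base name in question; this is empty once the history is clean.
leftovers=$(paths_in_history | grep -E '(^|/)\.DS_Store$' || true)
printf '%s\n' "$leftovers"
```

Running the same check again after git filter-repo should print nothing but an empty line.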

Removing sensitive information

If you entered sensitive information into text or source files that you don’t want to delete entirely, you can even replace those passwords or explicit language with something safe to push to the outside world! Look for any mention of those unwanted strings of text in any commit first:

git log -SMyPassword -c

This searches for any mention of “MyPassword” in any commit. Should the command return any results, you can use git-filter-repo to replace the sensitive string with something else. Create a file expressions.txt which contains the desired replacements:

MyPassword==>***REMOVED***

Even regular expressions are allowed if you prefix the line with regex:. Then apply the replacements:

git filter-repo --replace-text expressions.txt
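Put together, the whole procedure might look like the sketch below. The repository, the settings.ini file and the api_key pattern are invented for the demo. I write the rules without spaces around ==> because, as far as I can tell, surrounding whitespace is treated as part of the text to match. The rewrite step only runs if git-filter-repo is installed, and --force is needed here only because the demo repository is not a fresh clone:

```shell
#!/bin/sh
# Sketch: search for a secret in all branches, then replace it throughout history.
set -eu
work=$(mktemp -d)
git init -q "$work/repo"
cd "$work/repo"
git config user.name "Demo User"
git config user.email "demo@example.com"

# Demo history: a password and an API key are committed, then "removed" again.
printf 'password = MyPassword\napi_key=abc123XYZ\n' > settings.ini
git add settings.ini
git commit -q -m "add settings"
printf 'password = changeme\napi_key=changeme\n' > settings.ini
git commit -q -am "scrub settings"

# Replacement rules, kept outside the repository: one literal, one regex.
cat > "$work/expressions.txt" <<'EOF'
MyPassword==>***REMOVED***
regex:api_key=[A-Za-z0-9]+==>api_key=***REMOVED***
EOF

# Number of commits (on any branch) that add or remove the given string.
count_hits() {
  git log --all --oneline -S"$1" | wc -l | tr -d ' '
}

before=$(count_hits MyPassword)

if command -v git-filter-repo >/dev/null 2>&1; then
  git filter-repo --force --replace-text "$work/expressions.txt"
  after=$(count_hits MyPassword)
else
  after="skipped (git-filter-repo not installed)"
fi
echo "commits touching the password: before=$before after=$after"
```

Note that --replace-text only rewrites file contents. Commit messages are handled by the separate --replace-message option, which takes a file in the same format.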

Changing author name and/or email

In case you have used a different commit author name or email address from the one you want to see publicly, you can permanently change it across all commits quite easily. First, you have to create a file containing the mapping of the old to the new name/email. This file follows the mailmap format. For adjustments to only the email address, use for example:

<proper@email.example.com> <commit@email.example.com>

If you called this file mailmap.txt, then simply execute

git filter-repo --mailmap mailmap.txt

to make the changes.
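The mailmap format can also change the name along with the email. To preview the effect of a mapping before rewriting anything, plain git can apply the same file on the fly. This is a sketch; the identities "hacker42" and "Jane Doe" are placeholders:

```shell
#!/bin/sh
# Sketch: preview a mailmap without rewriting any history.
set -eu
work=$(mktemp -d)
git init -q "$work/repo"
cd "$work/repo"
git config user.name "hacker42"
git config user.email "commit@email.example.com"
echo "hello" > file.txt
git add file.txt
git commit -q -m "a commit under the old identity"

# Entry format: Proper Name <proper email> Commit Name <commit email>
cat > "$work/mailmap.txt" <<'EOF'
Jane Doe <proper@email.example.com> hacker42 <commit@email.example.com>
EOF

# %an/%ae print the author as recorded, %aN/%aE the author after applying the mailmap.
recorded=$(git log --all --format='%an <%ae>' | sort -u)
preview=$(git -c mailmap.file="$work/mailmap.txt" log --all --format='%aN <%aE>' | sort -u)

echo "recorded: $recorded"
echo "preview:  $preview"
```

If the preview lists only identities you are happy to publish, the same file can be handed to git filter-repo --mailmap.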

Summary

git-filter-repo is a powerful tool to clean up a git repository once you decide to make your private project public and share it with others. Your cleaned repository will end up with a rewritten history that is incompatible with previous incarnations – so keep that in mind should you already have a public copy out there!
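Since the rewrite cannot be undone from within the rewritten repository, the backup recommended earlier is worth spelling out. The sketch below saves everything into a single bundle file and then creates a fresh clone to do the rewriting in; by default, git-filter-repo refuses to run on anything that does not look like a fresh clone. All paths are placeholders inside a temporary directory:

```shell
#!/bin/sh
# Sketch: back up a repository and prepare a fresh clone for rewriting.
set -eu
work=$(mktemp -d)

# Stand-in for the private repository (hypothetical path and content).
git init -q "$work/private"
cd "$work/private"
git config user.name "Demo User"
git config user.email "demo@example.com"
echo "secret stuff" > notes.txt
git add notes.txt
git commit -q -m "private commit"

# 1. Backup: a bundle is a single file containing all refs and their history.
git bundle create "$work/backup.bundle" --all
git bundle verify "$work/backup.bundle"

# 2. Fresh clone to rewrite in; --no-local forces a real copy of the objects
#    instead of hard links into the original repository.
git clone -q --no-local "$work/private" "$work/to-publish"
cd "$work/to-publish"
clone_commits=$(git rev-list --count --all)
echo "fresh clone holds $clone_commits commit(s)"
```

The bundle can later be cloned like any other remote (git clone backup.bundle) should the rewrite go wrong.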

Comments? Leave them as a reply to the post on Mastodon.

Tags: git, programming