Revising history: How to clean a git project prior to publication
Published on Jul 27, 2020.
Version control systems such as git are amazing tools for software development,
and most projects involving more than a handful of developers would hardly work
without them. git in particular has proven to be incredibly useful for me, even
in small, personal projects and for things other than source code, such
as my notes and (system) configuration files. Being able to review an evening’s
worth of frantic hacking, or to figure out why things look different from what I
remember when returning weeks later, really eases my mind.
But of course, I usually don’t keep the same ‘hygiene’ in those private repositories: I commit binary files, include passwords or other secrets and, on occasion, add rather explicit comments. That can become an obstacle when one later wants to publish one of those repositories after all.
Of course, one could simply strip all of the history created by git, e.g. by
deleting the .git/ directory in the root of the repository and creating a
fresh repository from the current working directory. When that is not desired,
there is an alternative: git-filter-repo.
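For completeness, the "strip everything" route could look roughly like the sketch below. The demo repository, the file notes.txt and the directory name project-public are made up for illustration, and the work happens on a copy so the original history stays around.

```shell
#!/bin/sh
set -eu

# Throwaway demo repository with two commits (stand-in for a private project).
work=$(mktemp -d)
git init -q "$work/project"
cd "$work/project"
git config user.name "Demo User"
git config user.email "demo@example.com"
echo "first draft" > notes.txt
git add notes.txt
git commit -q -m "First commit"
echo "second draft" > notes.txt
git commit -q -am "Second commit"

# Work on a copy so the original history stays available.
cp -R "$work/project" "$work/project-public"
cd "$work/project-public"

# Drop the old history and start over from the current working directory.
rm -rf .git
git init -q
git config user.name "Demo User"
git config user.email "demo@example.com"
git add -A
git commit -q -m "Initial public commit"

commit_count=$(git rev-list --count HEAD)
echo "commits in fresh repository: $commit_count"
```

The fresh repository contains the latest state of every file but none of the earlier commits.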
With git-filter-repo, one can easily rewrite the history of a repository:
removing files, changing commit messages or even search-replacing strings across
all files to remove e.g. passwords. To be clear, this will lead to incompatible
histories with your collaborators in most cases, so use with care, make a backup
and read the documentation! The latter provides several useful examples. Here are
some of the ones I have already found applications for:
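Before trying any of them, the backup can be as simple as a mirror clone. The following is only a sketch: the demo repository and the name project-backup.git are placeholders, not something git-filter-repo prescribes.

```shell
#!/bin/sh
set -eu

# Throwaway demo repository; in practice this would be your own project.
work=$(mktemp -d)
git init -q "$work/project"
cd "$work/project"
git config user.name "Demo User"
git config user.email "demo@example.com"
echo "hello" > notes.txt
git add notes.txt
git commit -q -m "Add notes"

# Backup: a bare mirror clone keeps every branch and tag as it was before rewriting.
git clone -q --mirror "$work/project" "$work/project-backup.git"

# Sanity check: both repositories should report the same latest commit.
original_head=$(git rev-parse HEAD)
backup_head=$(git --git-dir="$work/project-backup.git" rev-parse HEAD)
echo "original: $original_head"
echo "backup:   $backup_head"
```

If a rewrite goes wrong, the untouched history can be cloned back out of the mirror.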
Removing files accidentally included in commits
As a version control system, git keeps a copy of any file deleted from the
current working tree in its history, forever.
To search for files that were deleted but are still accessible in the history, you can run git log and filter on deletions:
git log --diff-filter=D --summary
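If only the paths matter, the log output can be condensed to one line per deleted file. This is a small sketch on a made-up demo repository; secret.bin is a hypothetical file name.

```shell
#!/bin/sh
set -eu

# Throwaway demo repository in which one file is added and later deleted.
work=$(mktemp -d)
git init -q "$work/project"
cd "$work/project"
git config user.name "Demo User"
git config user.email "demo@example.com"
echo "keep" > notes.txt
echo "oops" > secret.bin
git add notes.txt secret.bin
git commit -q -m "Add files"
git rm -q secret.bin
git commit -q -m "Remove secret.bin"

# One line per path that was deleted at some point, across all branches.
# The empty format suppresses commit headers; sed drops the blank lines left over.
deleted_paths=$(git log --all --diff-filter=D --name-only --pretty=format: | sed '/^$/d' | sort -u)
echo "$deleted_paths"
```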
Usually, keeping deleted files in the history is exactly what one expects and not a problem at all. But at some point one might have committed a very large file, a bunch of temporary files or files containing sensitive information. So, how to get rid of these for good?
As an example, when I was still using Apple's OS X, I sometimes committed those
.DS_Store metadata files that the OS scatters across the file system.
There is no reason to keep them accumulating in the repository’s history:
git filter-repo --invert-paths --path '.DS_Store' --use-base-name
--use-base-name matches on file base names instead of full paths and --invert-paths
keeps anything not matching.
To match a pattern instead, you can use
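To see which commits ever touched such files, before or after filtering, plain git is enough. The sketch below uses a made-up demo repository; the ':(glob)**/.DS_Store' pathspec is meant to match the file name at any depth.

```shell
#!/bin/sh
set -eu

# Throwaway demo repository with .DS_Store files at two levels.
work=$(mktemp -d)
git init -q "$work/project"
cd "$work/project"
git config user.name "Demo User"
git config user.email "demo@example.com"
mkdir sub
printf 'x' > .DS_Store
printf 'y' > sub/.DS_Store
echo "hello" > notes.txt
# -f in case a global ignore file already excludes .DS_Store.
git add -f .DS_Store sub/.DS_Store notes.txt
git commit -q -m "Add files"
git rm -q .DS_Store sub/.DS_Store
git commit -q -m "Remove .DS_Store files"

# Every .DS_Store path that appears anywhere in the history of any branch.
paths=$(git log --all --name-only --pretty=format: -- ':(glob)**/.DS_Store' | sed '/^$/d' | sort -u)
echo "$paths"
```

An empty result after running git filter-repo would suggest the files are gone from every commit.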
git filter-repo --invert-paths --path-glob '*/*.jpg'
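which removes all files with the jpg extension inside any directory. Since the
pattern requires a slash, files at the top level of the repository should need
the plain '*.jpg' glob instead.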
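When it is not obvious which files are worth removing, listing the largest blobs in the history can help. This is a sketch with plain git on a made-up demo repository; big.bin is a hypothetical file name and the cut-off of five entries is arbitrary.

```shell
#!/bin/sh
set -eu

# Throwaway demo repository with one large and one small file.
work=$(mktemp -d)
git init -q "$work/project"
cd "$work/project"
git config user.name "Demo User"
git config user.email "demo@example.com"
dd if=/dev/zero of=big.bin bs=1000 count=100 2>/dev/null
echo "small" > notes.txt
git add big.bin notes.txt
git commit -q -m "Add files"

# List every object reachable from any ref, keep only blobs,
# and sort by size in bytes (largest first).
largest=$(git rev-list --objects --all |
  git cat-file --batch-check='%(objectsize) %(objecttype) %(rest)' |
  awk '$2 == "blob"' |
  sort -rn |
  head -n 5)
echo "$largest"
```

Each output line holds the size in bytes, the object type and the path under which the blob was committed.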
Removing sensitive information
If you entered sensitive information into text or source files that you don’t want to delete entirely, you can even replace those passwords or explicit language with something safe to push to the outside world! Look for any mention of those unwanted strings of text in any commit first:
git log -SMyPassword -c
This lists every commit that adds or removes an occurrence of “MyPassword”.
Should the command return any results, you can use git-filter-repo to replace
the sensitive string with something else. Create a file expressions.txt which
contains the desired translations:
MyPassword==>***REMOVED***
Regular expressions are allowed as well if you prefix the line with regex:. Then apply the replacements:
git filter-repo --replace-text expressions.txt
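The same search is a handy check before and after the rewrite. The sketch below, on a made-up demo repository with a hypothetical config.ini, shows why merely editing the file in a later commit is not enough: the string disappears from the working tree but stays in the history.

```shell
#!/bin/sh
set -eu

# Throwaway demo repository: a secret is committed, then edited away.
work=$(mktemp -d)
git init -q "$work/project"
cd "$work/project"
git config user.name "Demo User"
git config user.email "demo@example.com"
echo "password = MyPassword" > config.ini
git add config.ini
git commit -q -m "Add config"
echo "password = changeme" > config.ini
git commit -q -am "Remove real password"

# The current files no longer contain the string ...
if git grep -q "MyPassword"; then
  echo "still present in the working tree"
else
  echo "gone from the working tree"
fi

# ... but the history does: -S reports commits that change the number of occurrences.
hits=$(git log --all --format=%h -S"MyPassword" | wc -l)
echo "commits adding or removing the string: $hits"
```

After a successful git filter-repo --replace-text run, the count for the original string should drop to zero.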
Changing author name and/or email
In case you have used a different commit author name or email address from the
one you want to see publicly, you can permanently change it across all commits
quite easily. First, you have to create a file containing the mapping of the old
to the new name/email. This file follows the mailmap format. For adjustments
to only the email address, use for example:
<proper@email.example.com> <commit@email.example.com>
If you called this file mailmap.txt, then simply execute
git filter-repo --mailmap mailmap.txt
to make the changes.
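Since the rewrite is permanent, it may be worth previewing the mapping first. git itself can apply a mailmap file when displaying history, without changing any commit. The sketch below assumes a made-up demo repository and reuses the example addresses from above.

```shell
#!/bin/sh
set -eu

# Throwaway demo repository with a commit made under the "wrong" address.
work=$(mktemp -d)
git init -q "$work/project"
cd "$work/project"
git config user.name "Demo User"
git config user.email "commit@email.example.com"
echo "hello" > notes.txt
git add notes.txt
git commit -q -m "Add notes"

# The mapping file, kept outside the repository.
printf '%s\n' '<proper@email.example.com> <commit@email.example.com>' > "$work/mailmap.txt"

# Identities as recorded in the commits (%an and %ae ignore any mailmap).
recorded=$(git log --all --format='%an <%ae>' | sort -u)
# Identities as they would read with the mapping applied (%aN and %aE respect it).
preview=$(git -c mailmap.file="$work/mailmap.txt" log --all --format='%aN <%aE>' | sort -u)

echo "recorded: $recorded"
echo "preview:  $preview"
```

If the preview lists only the identities you want to publish, the file should be ready to pass to git filter-repo.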
Summary
git-filter-repo is a powerful tool to clean up a git repository once you
decide to make your private project public and share it with others. Your
cleaned repository will end up with a rewritten history that is incompatible
with previous incarnations, so keep that in mind should you already have a
public copy out there!
Tags: git, programming