How to Optimize a Git Repo

You’re looking at your git repository – maybe because you want to back it up, or to move it to another location – and you freak out. 14,552 files? 153 megabytes? This can’t be right!

Or can it?

Type: File folder
Location: C:\Users\Krishty\source\repos
Size: 153.4 MB (160,861,559 bytes)
Size on disk: 181.2 MB (189,956,096 bytes)
Contains: 14,552 Files, 882 Folders
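
If you are not on Windows, roughly the same numbers can be gathered from a shell. This is only a sketch: repo_footprint is a made-up helper name, and it simply counts with find and du, so the totals will not match the Explorer dialog byte for byte.

```shell
# Hypothetical helper: print file count, folder count and size (in KiB)
# of a directory, similar to the Windows properties dialog.
# The body runs in a subshell, so its variables do not leak.
repo_footprint() (
    dir=${1:-.}
    files=$(find "$dir" -type f | wc -l | tr -d ' ')
    # -mindepth 1 keeps the directory itself out of the folder count
    folders=$(find "$dir" -mindepth 1 -type d | wc -l | tr -d ' ')
    size_kb=$(du -sk "$dir" | cut -f1)
    printf 'files: %s\nfolders: %s\nsize_kb: %s\n' "$files" "$folders" "$size_kb"
)

# Usage: repo_footprint path/to/repo
```

Run it once before and once after the diet to see what you gained.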

Here’s the radical diet for your repository …

tl;dr

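This deletes your stashes! Also, make sure nobody else is writing to the repo! Run the commands from inside the .git directory (or from the root of a bare repo), because the rm paths are relative to it.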

rm objects/info/commit-graph
git reflog expire --expire-unreachable=now --all
git gc --aggressive --prune=now
git repack -ad -F --depth=4095 --window=999
rm hooks/*.sample
rm description
rm gitk.cache

Delete What You Don’t Need

It sounds like a platitude, but it’s the foundation of any optimization: delete branches and commits when you’re sure you don’t need them any more.

This may not immediately make your repo any smaller, though. We’ll shortly see why.

Garbage Collection

Git uses garbage collection – like some programming languages (Java, C#, …). This means that it does not waste your time tidying up the repo during your normal day-to-day use. Rather, git waits until a lot of garbage has accumulated before collecting it. This is also why a repo sometimes does not shrink right after large branches have been deleted.

What is Garbage Anyway?

The primary source of garbage is unreachable commits: commits that no branch, tag, or other reference leads to any more. Commits become unreachable when they are undone (git reset). Amending a commit (git commit --amend) orphans the original commit as well. After all, commits are identified by their checksum, and the checksum changes when you change a commit. You can probably imagine that rebasing leaves lots of garbage, as it potentially rewrites many commits.
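
To see such orphans in your own repo, git fsck can list them. A minimal sketch, assuming a helper name of my own choosing (list_unreachable_commits); --no-reflogs tells fsck to ignore the reflog, which would otherwise still count recently orphaned commits as reachable.

```shell
# Hypothetical helper: print the hashes of all commits that no branch,
# tag or other ref leads to. Nothing is modified or deleted.
list_unreachable_commits() (
    repo=${1:-.}
    git -C "$repo" fsck --unreachable --no-reflogs 2>/dev/null |
        awk '$1 == "unreachable" && $2 == "commit" { print $3 }'
)

# Usage:
#   git commit --amend -m "better message"   # orphans the original commit
#   list_unreachable_commits .               # ...which now shows up here
```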

Cleaning Up

Unreachable commits can easily be removed from the repository by running

git gc --prune=now

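Make sure nobody else is writing to the repository during this time! With --prune=now, git deletes unreachable objects immediately, including objects that another process has just written but not yet referenced, and that can corrupt the repository.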

This should already improve your repo size drastically.
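
If you would rather look before you leap, git prune has a dry-run mode. The sketch below wraps it in a made-up helper (preview_prune). Note that git prune treats everything the reflog still mentions as reachable, so freshly amended or reset commits will not be listed yet.

```shell
# Hypothetical helper: list the loose objects that pruning would delete
# right now, one "<hash> <type>" per line.
preview_prune() (
    repo=${1:-.}
    # -n (--dry-run) only reports; nothing is removed.
    git -C "$repo" prune -n --expire=now
)

# Usage: preview_prune path/to/repo
```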

Compression

Git repositories are compressed – but rarely in the optimal way.

One reason is that optimal compression is slow. The other is that compression is hindered by the garbage in the repository.

Now that all garbage has been removed from the repo, re-compress it entirely by running:

git gc --aggressive

This may take a long time – possibly minutes for medium-sized repos, and hours for gigabyte-sized repositories. Make sure that nobody else is writing to the repository during this time.

This can be combined with the garbage collection step from earlier:

git gc --aggressive --prune=now

The repo should now no longer consist of thousands of files. The compression should have re-packed it into a few dozen at most.

Older guides recommend running the git repack command by hand for this. That is no longer necessary, because git gc --aggressive runs git repack internally.

If you really want to go to 11 and squeeze out the last bytes, you can run git repack to re-compress the repository with higher settings than git gc does:

git repack -ad -F --depth=4095 --window=999

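In my experiments, this compressed at most 2 % better than git gc and took a very long time to complete. Such deep delta chains can also make later reads slower, because git has to apply the whole chain to reconstruct an object.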
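
To check what the re-compression actually did, git count-objects reports how many loose objects and pack files a repo holds. A small sketch; object_summary is a hypothetical name, and it only filters the output of git count-objects -v.

```shell
# Hypothetical helper: print the number of loose objects (count), the
# number of pack files (packs) and the total pack size in KiB (size-pack).
object_summary() (
    repo=${1:-.}
    git -C "$repo" count-objects -v |
        awk -F': ' '$1 == "count" || $1 == "packs" || $1 == "size-pack" { print $1 "=" $2 }'
)

# Usage:
#   object_summary .                  # before
#   git gc --aggressive --prune=now
#   object_summary .                  # after
```

After a full garbage collection, count should drop to zero and packs to a very small number.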

The Reflog

You may notice that the command above still doesn’t remove all garbage from the repository. Even more confusing: If you retry the command a few weeks later, it may suddenly compress better!

Git maintains a reflog. This is a log of the commits that HEAD and your branches pointed to recently (by default, entries are kept for 90 days, or for 30 days if the commit is no longer reachable from the branch tip). This permits awesome commands like show me what I worked on ten days ago (git show HEAD@{10.days.ago}). But it also prevents garbage collection from removing anything that is still referenced by the reflog.

Clear the reflog via:

git reflog expire --expire=all --expire-unreachable=now --all

This deletes your stashed changes!

Do this before garbage collection and re-compression.
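
Because losing stashes by accident hurts, you may want a guard around this step. The following is a sketch under my own assumptions: expire_reflog_keeping_stashes is a hypothetical helper that refuses to run while refs/stash exists, and otherwise expires only the unreachable reflog entries.

```shell
# Hypothetical helper: expire unreachable reflog entries, but bail out
# if the repository still has stashed changes.
expire_reflog_keeping_stashes() (
    repo=${1:-.}
    # refs/stash only exists while at least one stash is present
    if git -C "$repo" rev-parse -q --verify refs/stash >/dev/null; then
        echo "stashes present; apply or drop them first" >&2
        exit 1   # leaves the subshell, so the function returns 1
    fi
    git -C "$repo" reflog expire --expire-unreachable=now --all
)

# Usage: expire_reflog_keeping_stashes path/to/repo && git gc --prune=now
```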

Sample Hooks

Git supports hooks – scripts that are called on specific events, e.g. when pushing to a branch.

Git, by default, nicely places sample scripts in the hooks directory as a guideline for writing your own.

These are not used and serve no other purpose, so it doesn’t make sense to keep them in backups and deployments. You can delete all *.sample files from the hooks directory.

Gitk’s Cache

If you have ever used gitk, it probably created a cache file in your repository. This makes it run faster on subsequent starts, but it is not normally something you’d like to back up or publish.

To remove it, delete the gitk.cache file.

Visual Studio’s Commit Graph

If you ever used the repository with Visual Studio, then you’ll probably find another cache: the commit graph.

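The commit-graph file is a relatively new Git feature that speeds up walking the commit history, for example when displaying the commit graph. Again, it’s quite useful for day-to-day work, but not at all for backups or deployments. Recent Git versions may also write this file themselves during git gc; if it reappears after garbage collection, simply delete it again at the very end.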

To remove it, delete the objects/info/commit-graph file.

Description

Git places a description file in any new repo by default. Its content is only shown by tools such as GitWeb, so this feature is rarely used.

If you don’t use the description, feel free to delete the description file.
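
The sample hooks, gitk’s cache, the commit graph and the description can all be removed in one go. Here is a minimal sketch; remove_disposable_files is a name I made up. It asks git where the repository’s git directory is, so it works for bare and non-bare repos alike, and rm -f keeps it quiet when a file does not exist.

```shell
# Hypothetical helper: delete the files that are not needed for backups
# or deployments. Commits, branches and tags are not touched.
remove_disposable_files() (
    repo=${1:-.}
    # Absolute path of .git (or of the bare repo itself)
    git_dir=$(git -C "$repo" rev-parse --absolute-git-dir) || exit 1
    rm -f "$git_dir"/hooks/*.sample \
          "$git_dir/description" \
          "$git_dir/gitk.cache" \
          "$git_dir/objects/info/commit-graph"
)

# Usage: remove_disposable_files path/to/repo
```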

Summary

Before optimizing a repo, make sure nobody will be writing to it during the optimization.

Skip the reflog step if you want to keep your stashed changes.

Garbage collection is relatively fast and reduces the repo size drastically, albeit only after clearing the reflog.

Re-compression is very slow, but reduces the number of files from several thousand to fewer than a dozen (for bare repos; a few more for other repos).

Some programs place their caches in the repo, and you can usually just delete them.

rm objects/info/commit-graph
git reflog expire --expire-unreachable=now --all
git gc --aggressive --prune=now
rm hooks/*.sample
rm description
rm gitk.cache

Type: File folder
Location: C:\Users\Krishty\source\repos
Size: 122.0 MB (127,926,272 bytes)
Size on disk: 122.1 MB (127,959,040 bytes)
Contains: 9 Files, 8 Folders