Finding and deleting large files in a git repository

Large files can slow down cloning and fetching operations, and make your developers less efficient. Use this guide to find and delete those files

Finding and deleting large files in a git repository

Why is it important to have small git repositories?

GitHub recommends that git repositories be < 1 GiB is size. The most important reasons why I think this matters:

  • It takes longer to clone and fetch larger git repositories. This may impact the time for Continuous Integration (CI) jobs to run, also it may impact the speed of deployments to your hosting infrastructure. It will definitely slow down developers looking to get setup for the first time.
  • It adds to space needed to house the git repository. Both GitHub and GitLab will reach out to you if you are storing too much data. Not to mention the lost GB on your laptop.

Find the current size of your git repo

There is a command built in to git (since 2013) that does this very easily:

$ git count-objects -vH

count: 0
size: 0 bytes
in-pack: 4024
packs: 1
size-pack: 4.27 GiB
prune-packable: 0
garbage: 0
size-garbage: 0 bytes
Here you can see the size of the git repo is > 4 GiB 😱

Find the large files in your git repository

Now that you know your git repository is excessively large, the next step is to work out why.

Using this amazing script on stackoverflow:

git rev-list --objects --all |
  git cat-file --batch-check='%(objecttype) %(objectname) %(objectsize) %(rest)' |
  sed -n 's/^blob //p' |
  sort --numeric-sort --key=2 |
  cut -c 1-12,41- |
  $(command -v gnumfmt || echo numfmt) --field=2 --to=iec-i --suffix=B --padding=7 --round=nearest
List all files in a git repo, sorted by size ascending

N.B. on Mac OSX, you will need brew install coreutils for the magic number formatting to work.

This script will list every single file in your git repository, with the largest files at the bottom, in the example git repository I was looking into, these were the largest files:

26aa9bdfd920  564KiB themes/site/hypejs/hypejs.css
1d8f89f0d18a  564KiB themes/site/hypejs/hypejs.css
cf4d7ceb1d4f  684KiB themes/site/css/fonts/fontawesome/fa-brands-400 2.svg
401b7f7b3219  823KiB themes/site/css/fonts/fontawesome/fa-solid-900 2.svg
989b349bb493  4.3GiB
Committing a 4.3 GiB file to your git repo will slow down a lot of operations

You can spot a single large zip file in the repository, coming in at a whooping 4.3GiB. In this particular case, this was due to a developer accidentally committing the file in one commit, and then thinking they can remove it by deleting it in the next. Git however, remembers data in all commits for all time.

So how do we actually properly delete and reclaim that space?

Remove the large files by rewriting git history - using the BFG repo cleaner

The BFG repo cleaner is an excellent example of a tool to make complex git operations more approachable for the average punter. I use docker to run it, but if you have java installed locally, then running it natively might be easier for you.

docker run -it --rm \
  --volume "$PWD:/home/bfg/workspace" \
  koenrh/bfg \
  --strip-blobs-bigger-than 100M
Running BFG repo cleaner in a docker container

The output is amazing from the BFG repo cleaner as well, there is a report supplied showing the files it removed, and which commits needed to be altered.

Output of BFG repo cleaner. So fancy.

You also need to run git garbage collection to reclaim the space

  git reflog expire --expire=now --all && git gc --prune=now --aggressive
Using git gc command to strip out the unwanted dirty data

After you are all done, re-run the command to find the new size

$ git count-objects -vH

count: 0
size: 0 bytes
in-pack: 4024
packs: 1
size-pack: 8.29 MiB
prune-packable: 0
garbage: 0
size-garbage: 0 bytes
git repo is now 8.29 MiB. Wow. Such small.

If you are happy then force push over your remote

git push
Push the newly rewritten commits over to your remote. 

N.B. all rewritten commits will have a different SHA, and all other developers using the same repo will need to do steps to ensure they have the most recent version of the git repo.

git fetch origin
git reset --hard origin/master
Other developers using the same git repo can update their local copies with this command. Or they can reclone if they want (the repo is tiny now).

Hope this helps somebody out there