Finding and deleting large files in a git repository
Large files can slow down cloning and fetching operations, and make your developers less efficient. Use this guide to find and delete those files.
Why is it important to have small git repositories?
GitHub recommends that git repositories be < 1 GiB in size. The main reasons I think this matters:
- Larger git repositories take longer to clone and fetch. This can slow down Continuous Integration (CI) jobs and deployments to your hosting infrastructure, and it will definitely slow down developers getting set up for the first time.
- It adds to the space needed to house the git repository. Both GitHub and GitLab will reach out to you if you are storing too much data, not to mention the lost GB on your laptop.
Find the current size of your git repo
There is a command built in to git (since 2013) that does this very easily:
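A command that fits this description is `git count-objects -vH` (an assumption about which one is meant; the `-H` human-readable flag arrived in Git 1.8.3, released in 2013). A minimal sketch, wrapped in a hypothetical helper called `repo_size` so it can be pointed at any checkout:

```shell
# repo_size is a hypothetical helper name, not part of git.
# Usage: repo_size [path-to-repo]   (defaults to the current directory)
repo_size() {
  # -v prints the detailed breakdown, -H prints sizes in human-readable units
  git -C "${1:-.}" count-objects -vH
}
```

The `size-pack` line is the one to watch: it reports the on-disk size of the packed objects, which is roughly what a clone has to download.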
Find the large files in your git repository
Now that you know your git repository is excessively large, the next step is to work out why.
Use this amazing script from Stack Overflow:
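The script is most likely the widely shared pipeline below (an assumption, since several variants circulate). It walks every object reachable from any ref, keeps only the blobs, sorts them by size, and formats the sizes with numfmt. The wrapper name `list_blobs_by_size` is hypothetical.

```shell
# list_blobs_by_size is a hypothetical wrapper name.
# Usage: list_blobs_by_size [path-to-repo]   (defaults to the current directory)
list_blobs_by_size() {
  local repo="${1:-.}"
  local numfmt_bin
  # Homebrew's coreutils installs GNU numfmt as "gnumfmt" on macOS
  numfmt_bin=$(command -v gnumfmt || echo numfmt)
  # List every object, keep blobs only, sort by size, shorten the hash,
  # then convert the byte counts to KiB/MiB/GiB
  git -C "$repo" rev-list --objects --all |
    git -C "$repo" cat-file --batch-check='%(objecttype) %(objectname) %(objectsize) %(rest)' |
    sed -n 's/^blob //p' |
    sort --numeric-sort --key=2 |
    cut -c 1-12,41- |
    "$numfmt_bin" --field=2 --to=iec-i --suffix=B --padding=7 --round=nearest
}
```

Each output line is a shortened object hash, a size, and the path, so piping the result through `tail` shows just the biggest offenders.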
N.B. on macOS, you will need to run brew install coreutils for the human-readable number formatting to work.
This script lists every file in your git repository, with the largest files at the bottom. In the example git repository I was looking into, these were the largest files:
You can spot a single large zip file, files.zip, in the repository, coming in at a whopping 4.3 GiB. In this particular case, a developer accidentally committed the file in one commit and then thought they could remove it by deleting it in the next. Git, however, remembers data in all commits for all time.
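To see this for yourself, it can help to list every commit that touched the file. A small sketch using a hypothetical helper called `commits_touching`; the file name files.zip is the one from the example above:

```shell
# commits_touching is a hypothetical helper name.
# Usage: commits_touching <path> [path-to-repo]
commits_touching() {
  # --all searches every ref; --name-status marks each commit with
  # A (added), M (modified) or D (deleted) for the given path
  git -C "${2:-.}" log --all --oneline --name-status -- "$1"
}
```

Running `commits_touching files.zip` shows both the commit that added the file and the one that deleted it. The blob introduced by the first commit stays in the object store for as long as that commit is part of history.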
So how do we actually delete files.zip properly and reclaim that space?
Remove the large files by rewriting git history - using the BFG repo cleaner
The BFG Repo-Cleaner is an excellent example of a tool that makes complex git operations more approachable for the average punter. I use docker to run it, but if you have java installed locally, running it natively might be easier for you.
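For reference, a native run might look like the sketch below. It follows the usage described in the BFG documentation (a fresh `git clone --mirror`, then `--delete-files`). The jar location, the directory name repo-to-clean.git, and the helper name `bfg_delete_file` are placeholders, not values from this post.

```shell
# bfg_delete_file is a hypothetical helper name.
# BFG_JAR must point at a downloaded BFG jar (defaults to ./bfg.jar, an assumption).
# Usage: BFG_JAR=/path/to/bfg.jar bfg_delete_file <repo-url> <file-name>
bfg_delete_file() {
  local url="$1" name="$2"
  # Work on a fresh bare mirror so that every branch and tag gets rewritten
  git clone --mirror "$url" repo-to-clean.git || return 1
  # --delete-files matches on file name (not path) across all of history
  java -jar "${BFG_JAR:-bfg.jar}" --delete-files "$name" repo-to-clean.git
}
```

By default BFG leaves the contents of the latest commit untouched, so the file should already be deleted at the tip of the branch before running it.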
The output from the BFG Repo-Cleaner is amazing as well: it supplies a report showing the files it removed and which commits needed to be altered.
You also need to run git garbage collection to reclaim the space.
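The BFG documentation suggests expiring the reflog first and then pruning immediately. A sketch of that sequence, with `reclaim_space` as a hypothetical helper name:

```shell
# reclaim_space is a hypothetical helper name.
# Usage: reclaim_space [path-to-repo]   (defaults to the current directory)
reclaim_space() {
  local repo="${1:-.}"
  # Drop reflog entries that still point at the old, pre-rewrite commits
  git -C "$repo" reflog expire --expire=now --all || return 1
  # Repack, and delete unreachable objects right away instead of
  # waiting for the default grace period
  git -C "$repo" gc --prune=now --aggressive
}
```

Without the reflog step, the old commits can remain reachable from the reflog, so the garbage collector keeps the large blobs around.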
After you are all done, re-run the command to find the new size.
If you are happy, then force push over your remote.
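If the cleanup was done in a mirror clone, a plain `git push` from inside it already force-updates every ref on the remote. From a normal clone, a forced push of all branches and tags could look like this sketch; the remote name `origin` is an assumption, and `force_push_rewritten` is a hypothetical helper name.

```shell
# force_push_rewritten is a hypothetical helper name; "origin" is an assumed remote name.
# Usage: force_push_rewritten [path-to-repo] [remote]
force_push_rewritten() {
  local repo="${1:-.}" remote="${2:-origin}"
  # Overwrite every branch on the remote with the rewritten history
  git -C "$repo" push --force --all "$remote" || return 1
  # Tags may point at rewritten commits too, so push them as well
  # (--all and --tags cannot be combined in a single push)
  git -C "$repo" push --force --tags "$remote"
}
```

Hosted remotes often protect the default branch against forced pushes, so that protection may need to be lifted temporarily.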
N.B. all rewritten commits will have a different SHA, and all other developers using the same repo will need to take steps to ensure they have the most recent version of the git repo.
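What those steps are depends on the team, but the simplest option is usually a fresh clone. For an existing checkout with no local work worth keeping, a sketch might be the following. The remote name `origin` and the branch name `main` are assumptions, and `sync_to_rewritten` is a hypothetical helper name. Be aware that `reset --hard` discards local commits and uncommitted changes.

```shell
# sync_to_rewritten is a hypothetical helper name.
# Usage: sync_to_rewritten [path-to-repo] [branch]   (branch defaults to "main", an assumption)
sync_to_rewritten() {
  local repo="${1:-.}" branch="${2:-main}"
  # Fetch the rewritten history from the assumed remote "origin"
  git -C "$repo" fetch origin || return 1
  # Switch to the branch, then move it onto the new history,
  # throwing away anything that only exists locally
  git -C "$repo" checkout "$branch" || return 1
  git -C "$repo" reset --hard "origin/$branch"
}
```

Merging or pulling the old history back in would reintroduce the large file, which is why a hard reset or a fresh clone is the safer route.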
Hope this helps somebody out there.