Git and large repos

As you may have heard, at Shopify we prefer a monolithic application to microservices. Dealing with a monolith repo can become tricky.

Today the Shopify project directory takes up more than 2 GB on my SSD, plus 2.5 GB of files not tracked by version control, like Ruby and JavaScript dependencies and caches.

I imagine it's not the largest Git repo in the world, but Git still sometimes breaks on a repo this size. Last week I ran into the following symptoms:

There were 12 background git fetch processes consuming 360% of CPU and a ton of memory. Since Git launched these processes automatically in the background, something had definitely gone wrong.

Killing them didn't help: Git launched them again. I also tried disabling auto GC (git config --global gc.auto 0), but that didn't help either.
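Here is a rough sketch of how one might count such runaway processes from a terminal, assuming a POSIX shell with ps and grep available. The helper name count_git_fetch is made up for illustration; it is not part of Git.

```shell
# Hypothetical helper: count the lines on stdin that look like a
# `git fetch` command line. The [g] bracket keeps grep from matching
# its own entry when the input comes straight from ps.
count_git_fetch() {
  grep -c '[g]it fetch' || true   # grep exits 1 on zero matches; still prints 0
}

# Possible usage (not run here):
#   ps -A -o args | count_git_fetch
#   git config --global --get gc.auto   # prints 0 once auto GC is disabled
```

Reading from stdin rather than calling ps inside the function makes it easy to try the helper on saved process listings too.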

As I found out, Git has a command to verify the validity of the repo database: git fsck. In a healthy repo the output of git fsck should be empty, but in my case it printed a huge list of invalid objects.
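When the list is long, it can help to filter it. The sketch below assumes that lines starting with "dangling" (objects that exist but are not referenced by anything) are not the ones to worry about and hides them, leaving entries such as "missing" or "error" lines. The function name fsck_problems is hypothetical.

```shell
# Hypothetical helper: pass through git fsck output on stdin, dropping
# "dangling <type> <sha>" notices. Treating those as harmless is an
# assumption of this sketch, not a rule.
fsck_problems() {
  grep -v '^dangling ' || true   # keep a zero exit status when nothing is left
}

# Possible usage inside a repository (not run here):
#   git fsck 2>&1 | fsck_problems
```

If the filtered output is empty, the object database is probably not the source of the trouble.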

Afterwards, I ran garbage collection manually with git gc --prune=now, and that finally solved the issue.
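The collect-then-recheck sequence could be wrapped up roughly like this. It is only a sketch: gc_and_verify is a made-up name, and it assumes no other Git process is writing to the repo at the same time, since --prune=now removes unreachable objects immediately instead of waiting out the usual grace period.

```shell
# Hypothetical helper: run a manual garbage collection in the given
# repository (default: current directory), then re-run the integrity check.
# Returns the exit status of git fsck, or of git gc if that step fails.
gc_and_verify() {
  repo_dir=${1:-.}
  git -C "$repo_dir" gc --prune=now --quiet &&
    git -C "$repo_dir" fsck
}

# Possible usage (not run here):
#   gc_and_verify ~/src/my-repo   # path is a placeholder
```

It may be worth stopping any lingering background git processes before running this, for the reason given above.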

I hope my story helps someone with similar Git symptoms.

Written in August 2016.
Kir Shatrov
