As a DevOps engineer managing an active Git repository with hundreds of commits daily, you’ll inevitably face challenges that require a solid understanding of Git and its ecosystem. This post covers common issues that teams encounter while using Git and the tools used to address them. Each tool mentioned deserves a dedicated post, but here’s a brief overview of the essentials.
The topics we’ll cover:
- Maintaining the health of the repository.
- Managing large files.
- Permanently deleting files from the entire history.
- Purging large historical data.
- Speeding up work on huge repositories.
- Automatically generating a changelog for releases.
Cleaning Dead and Unreachable Items in Your Repo:
Just like performing a “Disk Cleanup” on your computer, a Git repository can accumulate unreachable items (such as old commits or branches). Git’s git gc (garbage collection) command cleans these up. This command:
- Compresses objects.
- Cleans up orphaned data (like unreachable commits or branches).
- Enhances performance.
It’s worth knowing that some commands run an automatic gc (git gc --auto) once enough loose objects have accumulated. These include:
- git commit
- git merge
- git pull
- git rebase
Although these commands perform built-in cleaning, it’s still recommended to run git gc manually from time to time. Some DevOps teams automate this process in their CI/CD pipelines after every release.
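To see what a manual run does, here is a minimal sketch in a throwaway repository. The directory, file names, and commit messages are made up for the demo; only git count-objects and git gc are the commands under discussion.

```shell
# Throwaway repository so the demo never touches real work
demo_dir=$(mktemp -d)
cd "$demo_dir"
git init -q .
git config user.name "Demo User"
git config user.email "demo@example.com"

# Create a few commits; each one adds loose objects (blob, tree, commit)
for i in 1 2 3; do
  echo "change $i" > "file$i.txt"
  git add "file$i.txt"
  git commit -q -m "Add file$i.txt"
done

# "count" is the number of loose (not yet packed) objects
loose_before=$(git count-objects -v | awk '/^count:/ {print $2}')

# Pack loose objects; unreachable objects past the grace period are pruned
git gc --quiet

loose_after=$(git count-objects -v | awk '/^count:/ {print $2}')
echo "loose objects: $loose_before -> $loose_after"
```

In a real repository you would only run the two git count-objects calls and git gc; comparing the numbers before and after is a quick way to judge whether a scheduled run is worth it.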
Managing Large Files:
When you add a file to Git, it stores the file in its database along with metadata for tracking. For very large files (e.g., in the gigabytes range), this is costly, which is why hosting providers limit file sizes. GitHub, for example, blocks files larger than 100 MiB in a regular repository. For files stored through Git LFS (described below), its per-file limits are as follows:
Product | Max file size |
---|---|
GitHub Free | 2 GB |
GitHub Pro | 2 GB |
GitHub Team | 4 GB |
GitHub Enterprise Cloud | 5 GB |
To manage large files effectively, Git offers an extension called Git Large File Storage (LFS). Instead of storing large files in the Git database, LFS stores them externally and tracks them using lightweight pointers in Git.
How to use Git LFS?
After downloading and installing Git LFS, enable it by running:
git lfs install
You can then configure specific file types to be tracked by LFS, for example:
git lfs track "*.pdf"
From this point, any PDF files you add will be handled by LFS. The tracking rule is written to the .gitattributes file, so commit that file as well.
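The tracking rule is an ordinary Git attribute, so you can check which paths it covers with plain Git. The sketch below writes the line that git lfs track "*.pdf" appends to .gitattributes by hand, so it runs even without Git LFS installed. The paths docs/manual.pdf and README.txt are hypothetical and do not need to exist.

```shell
# Throwaway repository for the demonstration
demo_dir=$(mktemp -d)
cd "$demo_dir"
git init -q .

# This is the line that `git lfs track "*.pdf"` appends to .gitattributes
echo '*.pdf filter=lfs diff=lfs merge=lfs -text' >> .gitattributes

# Ask Git which filter applies to a given path
pdf_filter=$(git check-attr filter -- docs/manual.pdf)
txt_filter=$(git check-attr filter -- README.txt)
echo "$pdf_filter"
echo "$txt_filter"
```

A path that reports the lfs filter will be stored as a pointer once Git LFS is installed; a path that reports unspecified stays in the regular Git database.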
Permanently Deleting Files from Git History and Purging Repositories
Sometimes, a sensitive file (like credentials) or a large, unnecessary file gets committed by mistake. In such cases, you’ll need to permanently remove it from the Git history. Additionally, you may want to purge a repository’s old, unneeded branches and tags to reduce its size.
To handle these tasks, two tools can rewrite Git history:
- git filter-repo
- BFG Repo-Cleaner
Important: These tools rewrite the repository history. Be sure to close any open pull requests first, as the PRs may break after the history changes.
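Before rewriting anything, it helps to know which files actually take up space in the history. The sketch below is one common way to list the largest blobs using only Git itself. The repository and the big.bin and small.txt files are invented for the demo, and the awk step assumes paths without spaces.

```shell
# Throwaway repository with one large file that is later deleted
demo_dir=$(mktemp -d)
cd "$demo_dir"
git init -q .
git config user.name "Demo User"
git config user.email "demo@example.com"

head -c 2048 /dev/zero > big.bin
echo "hello" > small.txt
git add big.bin small.txt
git commit -q -m "Add files"
git rm -q big.bin
git commit -q -m "Remove big.bin"

# List every object reachable from any ref, keep only blobs,
# then print "size path" sorted from largest to smallest
largest=$(git rev-list --objects --all |
  git cat-file --batch-check='%(objecttype) %(objectsize) %(rest)' |
  awk '$1 == "blob" {print $2, $3}' |
  sort -rn | head -n 5)
echo "$largest"
```

Note that big.bin still tops the list even though the latest commit deleted it, which is exactly why a history rewrite is needed to reclaim the space.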
Using BFG Repo-Cleaner
The BFG Repo-Cleaner specializes in removing unwanted files from a Git repository. It’s faster and simpler than git filter-repo, but it is more limited, focusing only on file deletion and text replacement.
To use it:
# create a bare mirror clone of the repository
git clone --mirror https://github.com/<username>/myrepo.git

# delete the unwanted file from every commit (BFG matches on file name, not path)
java -jar bfg.jar --delete-files <file name> myrepo.git

# or replace sensitive text everywhere in the history;
# the argument is a text file listing the strings to replace
java -jar bfg.jar --replace-text <file listing the text to replace> myrepo.git

# then expire the reflog so the orphaned entries are no longer referenced
cd myrepo.git
git reflog expire --expire=now --all
# and run garbage collection to prune the orphaned objects from the repository
git gc --prune=now --aggressive
# finally, force-push the cleaned repository back
git push --force
This process alters the repository’s history: the SHA of every commit you change is rewritten, along with the SHAs of all its descendant commits.
So before you start, make sure to close any open pull requests in your repository.
BFG, like other history-rewriting tools, is best run against a fresh bare (mirror) clone rather than your everyday working copy.
Using git filter-repo
git-filter-repo is a more comprehensive tool for rewriting Git history, purging unwanted data, and trimming repository size. It’s more versatile than BFG and handles many complex use cases.
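As a minimal sketch of the most common case, removing one file from every commit, the script below builds a throwaway repository with a fake secrets.env file and strips it out. The file names and contents are hypothetical, and the rewrite step is skipped if git filter-repo is not installed.

```shell
# Throwaway repository containing a file that should never have been committed
demo_dir=$(mktemp -d)
cd "$demo_dir"
git init -q .
git config user.name "Demo User"
git config user.email "demo@example.com"

echo "API_KEY=not-a-real-key" > secrets.env
echo "hello" > app.txt
git add secrets.env app.txt
git commit -q -m "Add app and, by mistake, secrets.env"

# Number of commits that touch the file before the rewrite
before=$(git log --all --oneline -- secrets.env | wc -l | tr -d ' ')

if git filter-repo --version >/dev/null 2>&1; then
  # --invert-paths keeps everything EXCEPT the listed path.
  # --force is needed only because this demo repo is not a fresh clone.
  git filter-repo --invert-paths --path secrets.env --force
  after=$(git log --all --oneline -- secrets.env | wc -l | tr -d ' ')
else
  echo "git filter-repo is not installed; skipping the rewrite" >&2
  after="skipped"
fi
echo "commits touching secrets.env: before=$before after=$after"
```

On a real repository, run the command in a fresh clone so that --force is not needed, and remember that a leaked credential should still be rotated even after it is removed from history.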
Dealing with Large Repositories:
If your team works with a large repository, you may need a solution to improve performance without sacrificing history. A strong option is Scalar, which Microsoft developed out of its experience managing the massive Windows repository.
Scalar grew out of VFS for Git (Virtual File System for Git), Microsoft’s earlier approach to the same problem, and has shipped as part of Git itself since version 2.38.
Scalar handles large repositories by turning on Git features such as partial clone, sparse-checkout, and the built-in file system monitor, rather than by changing how history is stored.
Scalar also sets up background maintenance on developer machines that prefetches and repacks data, keeping repositories healthy and up-to-date. The GVFS protocol is supported by Azure Repos and by Microsoft’s fork of Git; the Scalar bundled with upstream Git instead relies on partial clone, which hosts such as GitHub also support.
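Scalar’s clone command needs a remote to talk to, so the sketch below instead uses plain Git to show one of the features a Scalar clone enables by default: a cone-mode sparse-checkout that keeps only part of the tree on disk. The directory layout (src/app and docs) is made up for the demo.

```shell
# Real-world usage would start with something like:
#   scalar clone <url>
# Here a local throwaway repository stands in for the large repo.
demo_dir=$(mktemp -d)
cd "$demo_dir"
git init -q .
git config user.name "Demo User"
git config user.email "demo@example.com"

mkdir -p src/app docs
echo "code" > src/app/main.txt
echo "guide" > docs/guide.md
git add .
git commit -q -m "Add app code and docs"

# Keep only src/app (plus top-level files) in the working tree
git sparse-checkout init --cone
git sparse-checkout set src/app

# docs/ is still in the history but no longer checked out
ls
```

The history is untouched; running git sparse-checkout disable brings the full working tree back.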
Automatically Generating a Changelog for Releases
During development, your team adds new features, fixes bugs, and commits meaningful messages. When it’s time for a release, you’ll want a summary of these changes. A common approach is to create a CHANGELOG file.
You can use git log to list changes since the last release:
git log <tag-or-branch-of-last-release>..<current-tag>
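As a concrete sketch, the script below creates a throwaway repository with two hypothetical release tags, v1.0.0 and v1.1.0, and prints one line per commit between them. The commit messages are invented; --no-merges and --pretty=format: are standard git log options.

```shell
# Throwaway repository with two tagged releases
demo_dir=$(mktemp -d)
cd "$demo_dir"
git init -q .
git config user.name "Demo User"
git config user.email "demo@example.com"

git commit -q --allow-empty -m "Initial release"
git tag v1.0.0
git commit -q --allow-empty -m "feat: add search"
git commit -q --allow-empty -m "fix: handle empty query"
git tag v1.1.0

# One bullet per commit subject, newest first, skipping merge commits
changes=$(git log --no-merges --pretty=format:'- %s' v1.0.0..v1.1.0)
echo "$changes"
```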
However, this output can be noisy. Two tools can help generate cleaner, more useful changelogs:
- gitchangelog
- github-changelog-generator
gitchangelog
gitchangelog is a Python tool that parses commit messages using a defined pattern to create a well-organized report from Git logs.
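To give a feel for what pattern-based grouping means, here is a hand-rolled approximation in plain shell. It is not how gitchangelog is implemented or configured; the feat: and fix: prefixes, the v1.0.0 tag, and the commit messages are all assumptions made for the demo.

```shell
# Throwaway repository whose commit subjects follow a prefix convention
demo_dir=$(mktemp -d)
cd "$demo_dir"
git init -q .
git config user.name "Demo User"
git config user.email "demo@example.com"

git commit -q --allow-empty -m "Initial release"
git tag v1.0.0
git commit -q --allow-empty -m "feat: add search"
git commit -q --allow-empty -m "fix: handle empty query"
git commit -q --allow-empty -m "chore: bump dependencies"

# Collect commit subjects since the last release
subjects=$(git log --no-merges --pretty=format:'%s' v1.0.0..HEAD)

# Group by prefix and turn each match into a bullet
features=$(printf '%s\n' "$subjects" | grep '^feat:' | sed 's/^feat: */- /')
fixes=$(printf '%s\n' "$subjects" | grep '^fix:' | sed 's/^fix: */- /')

printf 'Features\n%s\n\nFixes\n%s\n' "$features" "$fixes"
```

Commits that match neither pattern, such as the chore: one here, are simply left out, which is the kind of noise filtering a dedicated tool handles for you.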
github-changelog-generator
github-changelog-generator is a Ruby tool that generates changelogs from the tags, closed issues, and merged pull requests of a GitHub repository.