# Managing monorepos
Monorepos have become a regular part of development team workflows. While they have many advantages, monorepos can present performance challenges when used in GitLab. Therefore, you should know:
- What repository characteristics can impact performance.
- Some tools and steps to optimize monorepos.
## Impact on performance
Because GitLab is a Git-based system, it is subject to similar performance constraints as Git when it comes to large repositories that are gigabytes in size.
Monorepos can be large for many reasons.
Large repositories pose a performance risk when used in GitLab, especially if a large monorepo receives many clones or pushes a day, which is common for them.
Git itself has performance limitations when it comes to handling monorepos.
Gitaly is our Git storage service built on top of Git. This means that any limitations of Git are experienced in Gitaly, and in turn by end users of GitLab.
## Profiling repositories
Large repositories generally experience performance issues in Git. Knowing why your repository is large can help you develop mitigation strategies to avoid performance problems.
You can use [`git-sizer`](https://github.com/github/git-sizer) to get a snapshot of repository characteristics and discover problem aspects of your monorepo.
For example:
```plaintext
Processing blobs: 1652370
Processing trees: 3396199
Processing commits: 722647
Matching commits to trees: 722647
Processing annotated tags: 534
Processing references: 539
| Name | Value | Level of concern |
| ---------------------------- | --------- | ------------------------------ |
| Overall repository size | | |
| * Commits | | |
| * Count | 723 k | * |
| * Total size | 525 MiB | ** |
| * Trees | | |
| * Count | 3.40 M | ** |
| * Total size | 9.00 GiB | **** |
| * Total tree entries | 264 M | ***** |
| * Blobs | | |
| * Count | 1.65 M | * |
| * Total size | 55.8 GiB | ***** |
| * Annotated tags | | |
| * Count | 534 | |
| * References | | |
| * Count | 539 | |
| | | |
| Biggest objects | | |
| * Commits | | |
| * Maximum size [1] | 72.7 KiB | * |
| * Maximum parents [2] | 66 | ****** |
| * Trees | | |
| * Maximum entries [3] | 1.68 k | * |
| * Blobs | | |
| * Maximum size [4] | 13.5 MiB | * |
| | | |
| History structure | | |
| * Maximum history depth | 136 k | |
| * Maximum tag depth [5] | 1 | |
| | | |
| Biggest checkouts | | |
| * Number of directories [6] | 4.38 k | ** |
| * Maximum path depth [7] | 13 | * |
| * Maximum path length [8] | 134 B | * |
| * Number of files [9] | 62.3 k | * |
| * Total size of files [9] | 747 MiB | |
| * Number of symlinks [10] | 40 | |
| * Number of submodules | 0 |
```
In this example, a few items are flagged with a high level of concern. See the following sections for information on solving:
- A high number of references.
- Large blobs.
### Large number of references
A reference in Git (a branch or tag) is used to refer to a commit. Each reference is stored as an individual file. If you are curious, you can go to any `.git` directory and look under the `refs` directory.
A large number of references can cause performance problems because, with more references, the object walks that Git performs for operations such as clones, pushes, and housekeeping tasks become larger.
#### Mitigation strategies
To mitigate the effects of a large number of references in a monorepo:
- Create an automated process for cleaning up old branches (a sketch of one approach follows this list).
- If certain references don’t need to be visible to the client, hide them using the
  `transfer.hideRefs` configuration setting. Because Gitaly ignores any on-server Git
  configuration, you must change the Gitaly configuration itself in `/etc/gitlab/gitlab.rb`:

  ```ruby
  gitaly['configuration'] = {
    # ...
    git: {
      # ...
      config: [
        # ...
        { key: "transfer.hideRefs", value: "refs/namespace_to_hide" },
      ],
    },
  }
  ```
In Git 2.42.0 and later, different Git operations can skip over hidden references when doing an object graph walk.
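For the first mitigation in the list above, the following is a minimal sketch of an automated cleanup. It assumes that branches already merged into a `main` default branch are safe to delete and that you run it with push access to the remote; adjust the branch name and add exclusions (for example, protected branches) for your own workflow.

```shell
# Prune stale remote-tracking refs, then delete remote branches already merged into main.
# Assumption: "main" is the default branch and merged branches are safe to remove.
git fetch --prune origin
for branch in $(git branch -r --merged origin/main | grep -vE 'origin/(main|HEAD)'); do
  git push origin --delete "${branch#origin/}"
done
```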
### Using LFS for large blobs
Because Git is built to handle text data, it doesn’t handle large binary files efficiently.
Therefore, you should store binary or blob files (for example, packages, audio, video, or graphics) as Large File Storage (LFS) objects. With LFS, the objects are stored externally, such as in Object Storage, which reduces the number and size of objects in the repository. Storing objects in external Object Storage can improve performance.
To analyze if a repository has large objects, you can use a tool like [`git-sizer`](https://github.com/github/git-sizer) for detailed analysis. This tool shows details about what makes up the repository, and highlights any areas of concern. If any large objects are found, you can then remove them with a tool such as [`git filter-repo`](https://github.com/newren/git-filter-repo).
For more information, refer to the Git LFS documentation.
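As a minimal sketch of adopting LFS for new files, the commands below track a hypothetical binary file type (`*.mp4` is only an example pattern) so that future commits store those files as LFS objects. Blobs already committed to history still need to be rewritten with a tool such as `git filter-repo` if you want to remove them.

```shell
# One-time setup per machine, then track a binary file type in this repository.
git lfs install
git lfs track "*.mp4"    # example pattern; pick the file types that bloat your repository
git add .gitattributes
git commit -m "Track MP4 files with Git LFS"
```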
## Optimizing large repositories for GitLab
Other than modifying your workflow and the actual repository, you can take additional steps to maximize the performance of monorepos with GitLab.
### Gitaly pack-objects cache
For very active repositories with a large number of references and files, consider using the Gitaly pack-objects cache. The pack-objects cache:
- Benefits all repositories on your GitLab server.
- Automatically works for forks.
You should always:

- Fetch incrementally. Do not clone in a way that recreates all of the worktree.
- Use shallow clones to reduce data transfer. Be aware that this puts more burden on the GitLab instance because of higher CPU impact.
- Control the clone directory if you heavily use a fork-based workflow.
- Optimize `git clean` flags to ensure that you remove or keep data that might affect or speed up your build.
For more information, see Pack-objects cache.
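As a rough sketch, on a Linux package (Omnibus) installation the cache can be turned on in the Gitaly section of `/etc/gitlab/gitlab.rb`. Treat the exact keys below as an assumption to verify against the pack-objects cache documentation for your GitLab version:

```ruby
gitaly['configuration'] = {
  # ...
  pack_objects_cache: {
    # Assumed keys; confirm against the pack-objects cache documentation.
    enabled: true,
  },
}
```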
### Reduce concurrent clones in CI/CD
Large repositories tend to be monorepos. This usually means that these repositories get a lot of traffic not only from users, but from CI/CD.
CI/CD loads tend to be concurrent because pipelines are scheduled during set times. As a result, the Git requests against the repositories can spike notably during these times and lead to reduced performance for both CI/CD and users alike.
You should reduce CI/CD pipeline concurrency by staggering them to run at different times. For example, a set running at one time and another set running several minutes later.
### Shallow cloning
GitLab and GitLab Runner perform a shallow clone by default.
Ideally, you should always use `GIT_DEPTH` with a small number like 10. This instructs GitLab Runner to perform shallow clones. Shallow clones make Git request only the latest set of changes for a given branch, up to the desired number of commits as defined by the `GIT_DEPTH` variable.
This significantly speeds up fetching changes from Git repositories, especially if the repository has a very long backlog or contains a number of big files, because it effectively reduces the amount of data transferred. The following pipeline configuration example makes the runner perform a shallow clone to fetch only a given branch. The runner does not fetch any other branches or tags.
```yaml
variables:
  GIT_DEPTH: 10

test:
  script:
    - ls -al
```
### Git strategy

By default, GitLab is configured to use the `fetch` Git strategy, which is recommended for large repositories. This strategy reduces the amount of data to transfer and does not impact the operations that you might do on a repository from CI/CD.
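If you want to set the strategy explicitly, a pipeline-level variable is enough. The snippet below is a minimal sketch using the `GIT_STRATEGY` CI/CD variable:

```yaml
variables:
  GIT_STRATEGY: fetch   # reuse the existing worktree and fetch only new objects

test:
  script:
    - ls -al
```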
### Git clone path

`GIT_CLONE_PATH` allows you to control where you clone your repositories. This can have implications if you heavily use big repositories with a fork-based workflow.
A fork, from the perspective of GitLab Runner, is stored as a separate repository with a separate worktree. That means that GitLab Runner cannot optimize the use of worktrees automatically, and you might have to instruct it to do so. In such cases, ideally you want the GitLab Runner executor to be used only for the given project and not shared across different projects, to make this process more efficient.
The `GIT_CLONE_PATH` must be in the directory set in `$CI_BUILDS_DIR`. You can’t pick any path from disk.
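For illustration, the job below sets a clone path under `$CI_BUILDS_DIR`, which is the only place the path is allowed to be; the directory layout itself is only an example, and a fuller fork-oriented variant appears later in this page.

```yaml
variables:
  # Must be under $CI_BUILDS_DIR; any other location is rejected by the runner.
  # Requires the runner's custom build directory feature to be enabled.
  GIT_CLONE_PATH: $CI_BUILDS_DIR/$CI_PROJECT_NAME

test:
  script:
    - pwd
```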
### Git clean flags

`GIT_CLEAN_FLAGS` allows you to control whether or not you require the `git clean` command to be executed for each CI/CD job. By default, GitLab ensures that:
- You have your worktree on the given SHA.
- Your repository is clean.
`GIT_CLEAN_FLAGS` is disabled when set to `none`. On very big repositories, this might be desired because `git clean` is disk I/O intensive. Setting `GIT_CLEAN_FLAGS: -ffdx -e .build/` (for example) allows you to exclude some directories in the worktree from removal between subsequent runs, which can speed up incremental builds. This has the biggest effect if you re-use existing machines and have an existing worktree that you can re-use for builds.
For exact parameters accepted by `GIT_CLEAN_FLAGS`, see the documentation for [`git clean`](https://git-scm.com/docs/git-clean). The available parameters are dependent on the Git version.
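A minimal sketch of the exclusion approach described above; the `.build/` directory name is just an example of an incremental cache you might want to keep between runs on the same machine:

```yaml
variables:
  # Clean everything except the .build/ directory (example of a kept incremental cache).
  GIT_CLEAN_FLAGS: -ffdx -e .build/

test:
  script:
    - ls -al .build/ || true   # survives between runs on a reused worktree
```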
### Git fetch extra flags

`GIT_FETCH_EXTRA_FLAGS` allows you to modify `git fetch` behavior by passing extra flags.
For example, if your project contains a large number of tags that your CI/CD jobs don’t rely on, you could add `--no-tags` to the extra flags to make your fetches faster and more compact. Even if your repository does not contain a lot of tags, `--no-tags` can make a big difference in some cases. If your CI/CD builds do not depend on Git tags, setting `--no-tags` is worth trying.
For more information, see the `GIT_FETCH_EXTRA_FLAGS` documentation.
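As a sketch, the flag goes into the same `variables:` block as the other Git-related CI/CD settings:

```yaml
variables:
  GIT_FETCH_EXTRA_FLAGS: --no-tags   # skip fetching tags the jobs don't need

test:
  script:
    - ls -al
```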
### Fork-based workflow

Following the guidelines above, let’s imagine that we want to:

- Optimize for a big project (more than 50k files in the directory).
- Use a fork-based workflow for contributing.
- Reuse existing worktrees, with preconfigured runners that are pre-cloned with repositories.
- Assign the runner only to the project and all of its forks.
Let’s consider the following two examples, one using the `shell` executor and the other using the `docker` executor.
#### `shell` executor example

Let’s assume that you have the following `config.toml`.
```toml
concurrent = 4

[[runners]]
  url = "GITLAB_URL"
  token = "TOKEN"
  executor = "shell"
  builds_dir = "/builds"
  cache_dir = "/cache"

  [runners.custom_build_dir]
    enabled = true
```
This `config.toml`:

- Uses the `shell` executor.
- Specifies a custom `/builds` directory where all clones are stored.
- Enables the ability to specify `GIT_CLONE_PATH`.
- Runs at most 4 jobs at once.
#### `docker` executor example

Let’s assume that you have the following `config.toml`.
```toml
concurrent = 4

[[runners]]
  url = "GITLAB_URL"
  token = "TOKEN"
  executor = "docker"
  builds_dir = "/builds"
  cache_dir = "/cache"

  [runners.docker]
    volumes = ["/builds:/builds", "/cache:/cache"]
```
This `config.toml`:

- Uses the `docker` executor.
- Specifies a custom `/builds` directory on disk where all clones are stored. We host mount the `/builds` directory to make it reusable between subsequent runs and be allowed to override the cloning strategy.
- Doesn’t enable the ability to specify `GIT_CLONE_PATH` as it is enabled by default.
- Runs at most 4 jobs at once.
#### Our `.gitlab-ci.yml`

Once we have the executor configured, we need to fine-tune our `.gitlab-ci.yml`. Our pipeline is most performant if we use the following `.gitlab-ci.yml`:
```yaml
variables:
  GIT_CLONE_PATH: $CI_BUILDS_DIR/$CI_CONCURRENT_ID/$CI_PROJECT_NAME

build:
  script: ls -al
```
This YAML setting configures a custom clone path. This path makes it possible to re-use worktrees between the parent project and forks because we use the same clone path for all forks.
Why use `$CI_CONCURRENT_ID`? The main reason is to ensure that worktrees used are not conflicting between projects. The `$CI_CONCURRENT_ID` represents a unique identifier within the given executor. When we use it to construct the path, this directory does not conflict with other concurrent jobs running.
### Store custom clone options in `config.toml`

Ideally, all job-related configuration should be stored in `.gitlab-ci.yml`. However, sometimes it is desirable to make these schemes part of the runner’s configuration.

In the above example of forks, making this configuration discoverable for users may be preferred, but this brings administrative overhead as the `.gitlab-ci.yml` needs to be updated for each branch. In such cases, it might be desirable to keep the `.gitlab-ci.yml` clone path agnostic, but make it a configuration of the runner.

We can extend our `config.toml` with the following specification that is used by the runner if `.gitlab-ci.yml` does not override it:
```toml
concurrent = 4

[[runners]]
  url = "GITLAB_URL"
  token = "TOKEN"
  executor = "docker"
  builds_dir = "/builds"
  cache_dir = "/cache"
  environment = [
    "GIT_CLONE_PATH=$CI_BUILDS_DIR/$CI_CONCURRENT_ID/$CI_PROJECT_NAME"
  ]

  [runners.docker]
    volumes = ["/builds:/builds", "/cache:/cache"]
```
This makes the cloning configuration part of the given runner and does not require us to update each `.gitlab-ci.yml`.
### Reference architectures
Large repositories tend to be found in larger organizations with many users. The GitLab Quality and Support teams provide several reference architectures that are the recommended way to deploy GitLab at scale.
In these types of setups, the GitLab environment used should match a reference architecture to improve performance.
### Gitaly Cluster
Gitaly Cluster can notably improve large repository performance because it holds multiple replicas of the repository across several nodes. As a result, Gitaly Cluster can load balance read requests against those replicas and is fault-tolerant.
Though Gitaly Cluster is recommended for large repositories, it is a large solution with additional complexity of setup and management. Refer to the Gitaly Cluster documentation for more information, specifically the Before deploying Gitaly Cluster section.
### Keep GitLab up to date
You should keep GitLab updated to the latest version where possible to benefit from the performance improvements and fixes that are added continuously to GitLab.