This page contains information related to upcoming products, features, and functionality. It is important to note that the information presented is for informational purposes only. Please do not rely on this information for purchasing or planning purposes. As with all projects, the items mentioned on this page are subject to change or delay. The development, release, and timing of any products, features, or functionality remain at the sole discretion of GitLab Inc.
Status Authors Coach DRIs Owning Stage Created
proposed -

Cells: Routing Service

This document describes design goals and architecture of Routing Service used by Cells. To better understand where the Routing Service fits into architecture take a look at Deployment Architecture.

Goals

The routing layer is meant to offer a consistent user experience where all Cells are presented under a single domain (for example, gitlab.com), instead of having to navigate to separate domains.

The user will be able to use https://gitlab.com to access Cell-enabled GitLab. Depending on the URL access, it will be transparently proxied to the correct Cell that can serve this particular information. For example:

  • All requests going to https://gitlab.com/users/sign_in are randomly distributed to all Cells.
  • All requests going to https://gitlab.com/gitlab-org/gitlab/-/tree/master are always directed to Cell 5, for example.
  • All requests going to https://gitlab.com/my-username/my-project are always directed to Cell 1.
  1. Technology.

    We decide what technology the routing service is written in. The choice is dependent on the best performing language, and the expected way and place of deployment of the routing layer. If it is required to make the service multi-cloud it might be required to deploy it to the CDN provider. Then the service needs to be written using a technology compatible with the CDN provider.

  2. Cell discovery.

    The routing service needs to be able to discover and monitor the health of all Cells.

  3. User can use single domain to interact with many Cells.

    The routing service will intelligently route all requests to Cells based on the resource being accessed versus the Cell containing the data.

  4. Router endpoints classification.

    The stateless routing service will fetch and cache information about endpoints from one of the Cells. We need to implement a protocol that will allow us to accurately describe the incoming request (its fingerprint), so it can be classified by one of the Cells, and the results of that can be cached. We also need to implement a mechanism for negative cache and cache eviction.

  5. GraphQL and other ambiguous endpoints.

    Most endpoints have a unique sharding key: the Organization, which directly or indirectly (via a Group or Project) can be used to classify endpoints. Some endpoints are ambiguous in their usage (they don’t encode the sharding key), or the sharding key is stored deep in the payload. In these cases, we need to decide how to handle endpoints like /api/graphql.

  6. Small.

    The Routing Service is configuration-driven and rules-driven, and does not implement any business logic. The maximum size of the project source code in initial phase is 1_000 lines without tests. The reason for the hard limit is to make the Routing Service to not have any special logic, and could be rewritten into any technology in a matter of a few days.

Requirements

Requirement Description Priority
Discovery needs to be able to discover and monitor the health of all Cells. high
Security only authorized cells can be routed to high
Single domain e.g. GitLab.com high
Caching can cache routing information for performance high
50 ms of increased latency   high
Path-based can make routing decision based on path high
Complexity the routing service should be configuration-driven and small high
Stateless does not need database, Cells provide all routing information medium
Secrets-based can make routing decision based on secret (e.g. JWT) medium
Observability can use existing observability tooling low
Self-managed can be eventually used by self-managed low
Regional can route requests to different regions low

Low Latency

The target latency for routing service should be less than 50 ms.

Looking at the urgency: high request we don’t have a lot of headroom on the p50. Adding an extra 50 ms allows us to still be in or SLO on the p95 level.

There is 3 primary entry points for the application; web, api, and git. Each service is assigned a Service Level Indicator (SLI) based on latency using the apdex standard. The corresponding Service Level Objectives (SLOs) for these SLIs require low latencies for large amount of requests. It’s crucial to ensure that the addition of the routing layer in front of these services does not impact the SLIs. The routing layer is a proxy for these services, and we lack a comprehensive SLI monitoring system for the entire request flow (including components like the Edge network and Load Balancers) we use the SLIs for web, git, and api as a target.

The main SLI we use is the rails requests. It has multiple satisfied targets (apdex) depending on the request urgency:

Urgency Duration in ms
:high 250 ms
:medium 500 ms
:default 1000 ms
:low 5000 ms

Analysis

The way we calculate the headroom we have is by using the following:

\mathrm{Headroom}\ {ms} = \mathrm{Satisfied}\ {ms} - \mathrm{Duration}\ {ms}

web:

Target Duration Percentile Headroom
5000 ms p99 4000 ms
5000 ms p95 4500 ms
5000 ms p90 4600 ms
5000 ms p50 4900 ms
1000 ms p99 500 ms
1000 ms p95 740 ms
1000 ms p90 840 ms
1000 ms p50 900 ms
500 ms p99 0 ms
500 ms p95 60 ms
500 ms p90 100 ms
500 ms p50 400 ms
250 ms p99 140 ms
250 ms p95 170 ms
250 ms p90 180 ms
250 ms p50 200 ms

Analysis was done in https://gitlab.com/gitlab-org/gitlab/-/issues/432934#note_1667993089

api:

Target Duration Percentile Headroom
5000 ms p99 3500 ms
5000 ms p95 4300 ms
5000 ms p90 4600 ms
5000 ms p50 4900 ms
1000 ms p99 440 ms
1000 ms p95 750 ms
1000 ms p90 830 ms
1000 ms p50 950 ms
500 ms p99 450 ms
500 ms p95 480 ms
500 ms p90 490 ms
500 ms p50 490 ms
250 ms p99 90 ms
250 ms p95 170 ms
250 ms p90 210 ms
250 ms p50 230 ms

Analysis was done in https://gitlab.com/gitlab-org/gitlab/-/issues/432934#note_1669995479

git:

Target Duration Percentile Headroom
5000 ms p99 3760 ms
5000 ms p95 4280 ms
5000 ms p90 4430 ms
5000 ms p50 4900 ms
1000 ms p99 500 ms
1000 ms p95 750 ms
1000 ms p90 800 ms
1000 ms p50 900 ms
500 ms p99 280 ms
500 ms p95 370 ms
500 ms p90 400 ms
500 ms p50 430 ms
250 ms p99 200 ms
250 ms p95 230 ms
250 ms p90 240 ms
250 ms p50 240 ms

Analysis was done in https://gitlab.com/gitlab-org/gitlab/-/issues/432934#note_1671385680

Non-Goals

Not yet defined.

Proposal

TBD

Technology

TBD

Alternatives

TBD