This page contains information related to upcoming products, features, and functionality. It is important to note that the information presented is for informational purposes only. Please do not rely on this information for purchasing or planning purposes. As with all projects, the items mentioned on this page are subject to change or delay. The development, release, and timing of any products, features, or functionality remain at the sole discretion of GitLab Inc.

Status	Authors	Coach	DRIs	Owning Stage	Created
proposed					-

Cells: Routing Service

This document describes the design goals and architecture of the Routing Service used by Cells. To better understand where the Routing Service fits into the architecture, take a look at Deployment Architecture.

Goals

The routing layer is meant to offer a consistent user experience where all Cells are presented under a single domain (for example, gitlab.com), instead of having to navigate to separate domains.

The user will be able to use https://gitlab.com to access Cell-enabled GitLab. Depending on the URL accessed, the request will be transparently proxied to the Cell that can serve that particular information. For example:

All requests going to https://gitlab.com/users/sign_in are randomly distributed to all Cells.
All requests going to https://gitlab.com/gitlab-org/gitlab/-/tree/master are always directed to Cell 5, for example.
All requests going to https://gitlab.com/my-username/my-project are always directed to Cell 1.
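
The three examples above can be sketched as a small path-based lookup. This is a minimal illustration, not the proposed design: the Cell identifiers, the `PREFIX_TO_CELL` and `ANY_CELL_PATHS` tables, the `route` function, and the choice to raise on an unknown path are all hypothetical names and assumptions introduced here.

```python
import random

# Hypothetical Cell identifiers; a real list would come from Cell discovery.
CELLS = ["cell-1", "cell-2", "cell-5"]

# Hypothetical rules: first path segment -> Cell holding the data.
# The two entries mirror the examples in the list above.
PREFIX_TO_CELL = {
    "gitlab-org": "cell-5",
    "my-username": "cell-1",
}

# Paths that any Cell can serve (assumption: matched by exact path).
ANY_CELL_PATHS = {"/users/sign_in"}


def route(path, rng=random):
    """Return the identifier of the Cell that should serve `path`."""
    if path in ANY_CELL_PATHS:
        # Distribute these requests randomly across all Cells.
        return rng.choice(CELLS)
    first_segment = path.strip("/").split("/", 1)[0]
    cell = PREFIX_TO_CELL.get(first_segment)
    if cell is None:
        # Assumption: unknown paths are an error in this sketch.
        raise LookupError(f"no routing rule for {path!r}")
    return cell
```

Calling `route("/gitlab-org/gitlab/-/tree/master")` always yields the same Cell, while `route("/users/sign_in")` may yield any of them.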

Technology.
We decide what technology the routing service is written in. The choice depends on the best performing language and on how and where the routing layer is expected to be deployed. If the service must be multi-cloud, it might need to be deployed to the CDN provider. In that case, the service needs to be written using a technology compatible with the CDN provider.
Cell discovery.
The routing service needs to be able to discover and monitor the health of all Cells.
Users can use a single domain to interact with many Cells.
The routing service will intelligently route all requests to Cells based on the resource being accessed versus the Cell containing the data.
Router endpoints classification.
The stateless routing service will fetch and cache information about endpoints from one of the Cells. We need to implement a protocol that will allow us to accurately describe the incoming request (its fingerprint), so it can be classified by one of the Cells, and the results of that can be cached. We also need to implement a mechanism for negative cache and cache eviction.
GraphQL and other ambiguous endpoints.
Most endpoints have a unique sharding key: the Organization, which directly or indirectly (via a Group or Project) can be used to classify endpoints. Some endpoints are ambiguous in their usage (they don’t encode the sharding key), or the sharding key is stored deep in the payload. In these cases, we need to decide how to handle endpoints like /api/graphql.
Small.
The Routing Service is configuration-driven and rules-driven, and does not implement any business logic. The maximum size of the project source code in the initial phase is 1_000 lines without tests. The reason for the hard limit is to keep the Routing Service free of special logic, so that it could be rewritten in any technology in a matter of a few days.
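
The classification goal above mentions caching, a negative cache, and cache eviction. A minimal sketch of how those three pieces could fit together is shown here. It assumes the fingerprint is a plain string and that a Cell answers a classification request with either a Cell identifier or nothing; the class name, the TTL values, and the oldest-first eviction policy are hypothetical choices, not part of the design.

```python
import time


class ClassificationCache:
    """Caches fingerprint -> Cell lookups, including negative results."""

    def __init__(self, classify, ttl=60.0, negative_ttl=5.0,
                 max_entries=1024, clock=time.monotonic):
        self._classify = classify        # callable that asks a Cell to classify
        self._ttl = ttl                  # lifetime of a positive answer (seconds)
        self._negative_ttl = negative_ttl  # shorter lifetime for "not found"
        self._max_entries = max_entries
        self._clock = clock
        self._entries = {}               # fingerprint -> (cell or None, expires_at)

    def lookup(self, fingerprint):
        now = self._clock()
        hit = self._entries.get(fingerprint)
        if hit is not None and hit[1] > now:
            return hit[0]                # fresh entry, possibly a cached None
        cell = self._classify(fingerprint)  # None means no Cell claimed it
        ttl = self._ttl if cell is not None else self._negative_ttl
        # Drop any stale entry so re-insertion moves it to the newest position.
        self._entries.pop(fingerprint, None)
        if len(self._entries) >= self._max_entries:
            # Evict the oldest inserted entry (dicts keep insertion order).
            self._entries.pop(next(iter(self._entries)))
        self._entries[fingerprint] = (cell, now + ttl)
        return cell

    def evict(self, fingerprint):
        """Explicitly drop an entry, for example after data moves between Cells."""
        self._entries.pop(fingerprint, None)
```

Keeping the negative TTL shorter than the positive one is one way to limit repeated lookups for unknown paths without hiding newly created resources for long.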

Requirements

Requirement	Description	Priority
Discovery	needs to be able to discover and monitor the health of all Cells.	high
Security	only authorized cells can be routed to	high
Single domain	e.g. GitLab.com	high
Caching	can cache routing information for performance	high
Low latency	50 ms of increased latency	high
Path-based	can make routing decision based on path	high
Complexity	the routing service should be configuration-driven and small	high
Stateless	does not need database, Cells provide all routing information	medium
Secrets-based	can make routing decision based on secret (e.g. JWT)	medium
Observability	can use existing observability tooling	low
Self-managed	can be eventually used by self-managed	low
Regional	can route requests to different regions	low

Low Latency

The latency added by the routing service should be less than 50 ms.

Looking at the `urgency: high` requests, we don't have a lot of headroom on the p50. Adding an extra 50 ms allows us to still be within our SLO at the p95 level.

There are 3 primary entry points for the application: web, api, and git. Each service is assigned a Service Level Indicator (SLI) based on latency using the apdex standard. The corresponding Service Level Objectives (SLOs) for these SLIs require low latencies for a large number of requests. It's crucial to ensure that the addition of the routing layer in front of these services does not impact the SLIs. Because the routing layer is a proxy for these services, and we lack a comprehensive SLI monitoring system for the entire request flow (including components like the Edge network and Load Balancers), we use the SLIs for web, git, and api as a target.

The main SLI we use is the rails requests. It has multiple satisfied targets (apdex) depending on the request urgency:

Urgency	Duration in ms
`:high`	250 ms
`:medium`	500 ms
`:default`	1000 ms
`:low`	5000 ms

Analysis

The way we calculate the headroom we have is by using the following:

\mathrm{Headroom}\ {ms} = \mathrm{Satisfied}\ {ms} - \mathrm{Duration}\ {ms}
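
As a worked example of reading the tables that follow, the formula can be rearranged to recover the measured duration from a listed headroom. For web requests with the 250 ms target, the table lists 140 ms of headroom at p99, which implies a measured p99 duration of 110 ms:

```latex
\mathrm{Duration}\ {ms} = \mathrm{Satisfied}\ {ms} - \mathrm{Headroom}\ {ms} = 250 - 140 = 110
```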

web:

Target Duration	Percentile	Headroom
5000 ms	p99	4000 ms
5000 ms	p95	4500 ms
5000 ms	p90	4600 ms
5000 ms	p50	4900 ms
1000 ms	p99	500 ms
1000 ms	p95	740 ms
1000 ms	p90	840 ms
1000 ms	p50	900 ms
500 ms	p99	0 ms
500 ms	p95	60 ms
500 ms	p90	100 ms
500 ms	p50	400 ms
250 ms	p99	140 ms
250 ms	p95	170 ms
250 ms	p90	180 ms
250 ms	p50	200 ms

Analysis was done in https://gitlab.com/gitlab-org/gitlab/-/issues/432934#note_1667993089

api:

Target Duration	Percentile	Headroom
5000 ms	p99	3500 ms
5000 ms	p95	4300 ms
5000 ms	p90	4600 ms
5000 ms	p50	4900 ms
1000 ms	p99	440 ms
1000 ms	p95	750 ms
1000 ms	p90	830 ms
1000 ms	p50	950 ms
500 ms	p99	450 ms
500 ms	p95	480 ms
500 ms	p90	490 ms
500 ms	p50	490 ms
250 ms	p99	90 ms
250 ms	p95	170 ms
250 ms	p90	210 ms
250 ms	p50	230 ms

Analysis was done in https://gitlab.com/gitlab-org/gitlab/-/issues/432934#note_1669995479

git:

Target Duration	Percentile	Headroom
5000 ms	p99	3760 ms
5000 ms	p95	4280 ms
5000 ms	p90	4430 ms
5000 ms	p50	4900 ms
1000 ms	p99	500 ms
1000 ms	p95	750 ms
1000 ms	p90	800 ms
1000 ms	p50	900 ms
500 ms	p99	280 ms
500 ms	p95	370 ms
500 ms	p90	400 ms
500 ms	p50	430 ms
250 ms	p99	200 ms
250 ms	p95	230 ms
250 ms	p90	240 ms
250 ms	p50	240 ms

Analysis was done in https://gitlab.com/gitlab-org/gitlab/-/issues/432934#note_1671385680
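
To relate these tables to the 50 ms target, the sketch below subtracts the routing budget from the listed headroom for the `:high` urgency target (250 ms). The headroom values are copied from the tables above; the names `ROUTING_BUDGET_MS`, `HIGH_URGENCY_HEADROOM_MS`, and `remaining_headroom` are hypothetical, and the sketch assumes the routing layer adds its full 50 ms at every percentile, which is a pessimistic simplification.

```python
ROUTING_BUDGET_MS = 50  # target added latency from the Low Latency section

# Headroom in ms for the 250 ms (`:high`) target, copied from the tables above.
HIGH_URGENCY_HEADROOM_MS = {
    "web": {"p99": 140, "p95": 170, "p90": 180, "p50": 200},
    "api": {"p99": 90, "p95": 170, "p90": 210, "p50": 230},
    "git": {"p99": 200, "p95": 230, "p90": 240, "p50": 240},
}


def remaining_headroom(headroom_ms, budget_ms=ROUTING_BUDGET_MS):
    """Return the headroom left after spending the routing budget."""
    return {
        service: {pct: ms - budget_ms for pct, ms in by_pct.items()}
        for service, by_pct in headroom_ms.items()
    }


remaining = remaining_headroom(HIGH_URGENCY_HEADROOM_MS)

# Smallest remaining headroom across all services and percentiles.
tightest = min(
    (ms, service, pct)
    for service, by_pct in remaining.items()
    for pct, ms in by_pct.items()
)
```

Under these assumptions every `:high` row keeps a positive headroom, with api at p99 being the tightest. The same subtraction is not positive everywhere for other targets: the web table lists 0 ms of headroom at p99 for the 500 ms target.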

Non-Goals

Not yet defined.

Proposal

TBD

Technology

TBD

Alternatives

TBD
