Status | Authors | Coach | DRIs | Owning Stage | Created |
---|---|---|---|---|---|
proposed | - |
Cells: Routing Service
This document describes the design goals and architecture of the Routing Service used by Cells. To better understand where the Routing Service fits into the architecture, take a look at Deployment Architecture.
Goals
The routing layer is meant to offer a consistent user experience where all Cells are presented under a single domain (for example, `gitlab.com`), instead of having to navigate to separate domains.
The user will be able to use `https://gitlab.com` to access Cell-enabled GitLab.
Depending on the URL accessed, the request will be transparently proxied to the correct Cell that can serve this particular information.
For example:
- All requests going to `https://gitlab.com/users/sign_in` are randomly distributed to all Cells.
- All requests going to `https://gitlab.com/gitlab-org/gitlab/-/tree/master` are always directed to Cell 5, for example.
- All requests going to `https://gitlab.com/my-username/my-project` are always directed to Cell 1.
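The examples above can be sketched as a small path-based lookup. This is only an illustration under stated assumptions: the static routing table, the Cell names, the set of paths that any Cell can serve, and the `route` function are all hypothetical, and in the design described here the routing information would come from the Cells rather than from a hard-coded dictionary.

```python
import random
from typing import Optional

# Hypothetical routing table: path prefix -> Cell name (illustration only).
PREFIX_TO_CELL = {
    "/gitlab-org/gitlab": "cell-5",
    "/my-username/my-project": "cell-1",
}

# Hypothetical set of paths that any Cell can serve.
ANY_CELL_PATHS = {"/users/sign_in"}

# Hypothetical list of known Cells.
ALL_CELLS = ["cell-1", "cell-2", "cell-5"]


def route(path: str, rng: Optional[random.Random] = None) -> str:
    """Return the name of the Cell that should serve `path`."""
    chooser = rng or random
    if path in ANY_CELL_PATHS:
        # Requests that are not tied to any data are spread across all Cells.
        return chooser.choice(ALL_CELLS)
    # Collect every known prefix that covers the path, then let the longest win.
    matches = [
        prefix
        for prefix in PREFIX_TO_CELL
        if path == prefix or path.startswith(prefix + "/")
    ]
    if not matches:
        raise LookupError(f"no Cell known for {path}")
    return PREFIX_TO_CELL[max(matches, key=len)]
```

With this table, `route("/gitlab-org/gitlab/-/tree/master")` returns `"cell-5"`, while `route("/users/sign_in")` returns any one of the known Cells.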
- Technology. We decide what technology the routing service is written in. The choice depends on the best-performing language and on how and where the routing layer is expected to be deployed. If the service must be multi-cloud, it might need to be deployed to a CDN provider, in which case it must be written using a technology compatible with that CDN provider.
- Cell discovery. The routing service needs to be able to discover and monitor the health of all Cells.
- Users can use a single domain to interact with many Cells. The routing service will intelligently route all requests to Cells by matching the resource being accessed to the Cell that contains its data.
- Router endpoints classification. The stateless routing service will fetch and cache information about endpoints from one of the Cells. We need to implement a protocol that will allow us to accurately describe the incoming request (its fingerprint), so it can be classified by one of the Cells, and the results of that can be cached. We also need to implement a mechanism for negative cache and cache eviction.
- GraphQL and other ambiguous endpoints. Most endpoints have a unique sharding key: the Organization, which directly or indirectly (via a Group or Project) can be used to classify endpoints. Some endpoints are ambiguous in their usage (they don't encode the sharding key), or the sharding key is stored deep in the payload. In these cases, we need to decide how to handle endpoints like `/api/graphql`.
- Small. The Routing Service is configuration-driven and rules-driven, and does not implement any business logic. The maximum size of the project source code in the initial phase is 1_000 lines without tests. The hard limit exists to keep the Routing Service free of special logic, so that it could be rewritten in any technology in a matter of a few days.
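The classification goal mentions caching, a negative cache, and cache eviction. A minimal sketch of how these could fit together is shown below. It assumes a time-to-live (TTL) policy, which the document does not prescribe. The `ClassificationCache` class, its parameters, and the `classify` callback that stands in for asking a Cell are all hypothetical names introduced for illustration.

```python
import time
from typing import Callable, Dict, Optional, Tuple


class ClassificationCache:
    """TTL cache mapping a request fingerprint to a Cell name, or None."""

    def __init__(
        self,
        classify: Callable[[str], Optional[str]],
        ttl_s: float = 60.0,
        negative_ttl_s: float = 5.0,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        # `classify` stands in for a call to a Cell (assumption).
        self._classify = classify
        self._ttl_s = ttl_s
        self._negative_ttl_s = negative_ttl_s
        self._clock = clock
        # fingerprint -> (cell name or None, expiry timestamp)
        self._entries: Dict[str, Tuple[Optional[str], float]] = {}

    def lookup(self, fingerprint: str) -> Optional[str]:
        """Return the cached Cell, classifying the request on a miss."""
        now = self._clock()
        entry = self._entries.get(fingerprint)
        if entry is not None and entry[1] > now:
            # Fresh hit. The value may be None, which is a cached negative.
            return entry[0]
        cell = self._classify(fingerprint)
        # Negative results get a shorter lifetime so new resources show up soon.
        ttl = self._ttl_s if cell is not None else self._negative_ttl_s
        self._entries[fingerprint] = (cell, now + ttl)
        return cell

    def evict(self, fingerprint: str) -> None:
        """Drop an entry explicitly, for example after a resource moves."""
        self._entries.pop(fingerprint, None)
```

Keeping a separate, shorter TTL for negative results is one possible trade-off: it protects the Cells from repeated lookups for unknown paths while limiting how long a newly created resource stays unreachable.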
Requirements
Requirement | Description | Priority |
---|---|---|
Discovery | needs to be able to discover and monitor the health of all Cells. | high |
Security | only authorized cells can be routed to | high |
Single domain | e.g. GitLab.com | high |
Caching | can cache routing information for performance | high |
Low latency | 50 ms of increased latency | high |
Path-based | can make routing decision based on path | high |
Complexity | the routing service should be configuration-driven and small | high |
Stateless | does not need database, Cells provide all routing information | medium |
Secrets-based | can make routing decision based on secret (e.g. JWT) | medium |
Observability | can use existing observability tooling | low |
Self-managed | can be eventually used by self-managed | low |
Regional | can route requests to different regions | low |
Low Latency
The latency added by the routing service should be less than 50 ms.
Looking at `urgency: high` requests, we don't have a lot of headroom on the p50.
Adding an extra 50 ms allows us to still be in our SLO on the p95 level.
There are three primary entry points for the application: `web`, `api`, and `git`.
Each service is assigned a Service Level Indicator (SLI) based on latency using the apdex standard.
The corresponding Service Level Objectives (SLOs) for these SLIs require low latencies for a large number of requests.
It’s crucial to ensure that the addition of the routing layer in front of these services does not impact the SLIs.
Because the routing layer is a proxy for these services, and we lack a comprehensive SLI monitoring system for the entire request flow (including components like the Edge network and Load Balancers), we use the SLIs for `web`, `git`, and `api` as a target.
The main SLI we use is the rails requests.
It has multiple `satisfied` targets (apdex) depending on the request urgency:
Urgency | Duration in ms |
---|---|
`:high` | 250 ms |
`:medium` | 500 ms |
`:default` | 1000 ms |
`:low` | 5000 ms |
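To make the effect of these targets concrete, the sketch below computes the share of requests that stay within the `satisfied` target for a given urgency, with and without added routing latency. It is a simplification rather than the actual SLI implementation: it assumes a request counts as satisfied when its duration is at or below the target, and the function name, the dictionary, and the sample durations are made up for illustration.

```python
from typing import Sequence

# Satisfied targets in ms per urgency, copied from the table above.
SATISFIED_MS = {"high": 250, "medium": 500, "default": 1000, "low": 5000}


def satisfied_ratio(
    durations_ms: Sequence[float],
    urgency: str,
    added_latency_ms: float = 0,
) -> float:
    """Share of requests at or below the target after adding routing latency."""
    if not durations_ms:
        raise ValueError("durations_ms must not be empty")
    threshold = SATISFIED_MS[urgency]
    satisfied = sum(1 for d in durations_ms if d + added_latency_ms <= threshold)
    return satisfied / len(durations_ms)


# Made-up sample of request durations in ms (not measured data).
sample = [100, 180, 210, 240, 400]
print(satisfied_ratio(sample, "high"))      # 0.8
print(satisfied_ratio(sample, "high", 50))  # 0.4
```

In this made-up sample, two requests that were just under the 250 ms target fall out of the satisfied bucket once 50 ms is added, which is why the headroom per percentile matters.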
Analysis
The way we calculate the headroom we have is by using the following:
$$\mathrm{Headroom}\ (\mathrm{ms}) = \mathrm{Satisfied}\ (\mathrm{ms}) - \mathrm{Duration}\ (\mathrm{ms})$$
`web`:
Target Duration | Percentile | Headroom |
---|---|---|
5000 ms | p99 | 4000 ms |
5000 ms | p95 | 4500 ms |
5000 ms | p90 | 4600 ms |
5000 ms | p50 | 4900 ms |
1000 ms | p99 | 500 ms |
1000 ms | p95 | 740 ms |
1000 ms | p90 | 840 ms |
1000 ms | p50 | 900 ms |
500 ms | p99 | 0 ms |
500 ms | p95 | 60 ms |
500 ms | p90 | 100 ms |
500 ms | p50 | 400 ms |
250 ms | p99 | 140 ms |
250 ms | p95 | 170 ms |
250 ms | p90 | 180 ms |
250 ms | p50 | 200 ms |
Analysis was done in https://gitlab.com/gitlab-org/gitlab/-/issues/432934#note_1667993089
`api`:
Target Duration | Percentile | Headroom |
---|---|---|
5000 ms | p99 | 3500 ms |
5000 ms | p95 | 4300 ms |
5000 ms | p90 | 4600 ms |
5000 ms | p50 | 4900 ms |
1000 ms | p99 | 440 ms |
1000 ms | p95 | 750 ms |
1000 ms | p90 | 830 ms |
1000 ms | p50 | 950 ms |
500 ms | p99 | 450 ms |
500 ms | p95 | 480 ms |
500 ms | p90 | 490 ms |
500 ms | p50 | 490 ms |
250 ms | p99 | 90 ms |
250 ms | p95 | 170 ms |
250 ms | p90 | 210 ms |
250 ms | p50 | 230 ms |
Analysis was done in https://gitlab.com/gitlab-org/gitlab/-/issues/432934#note_1669995479
`git`:
Target Duration | Percentile | Headroom |
---|---|---|
5000 ms | p99 | 3760 ms |
5000 ms | p95 | 4280 ms |
5000 ms | p90 | 4430 ms |
5000 ms | p50 | 4900 ms |
1000 ms | p99 | 500 ms |
1000 ms | p95 | 750 ms |
1000 ms | p90 | 800 ms |
1000 ms | p50 | 900 ms |
500 ms | p99 | 280 ms |
500 ms | p95 | 370 ms |
500 ms | p90 | 400 ms |
500 ms | p50 | 430 ms |
250 ms | p99 | 200 ms |
250 ms | p95 | 230 ms |
250 ms | p90 | 240 ms |
250 ms | p50 | 240 ms |
Analysis was done in https://gitlab.com/gitlab-org/gitlab/-/issues/432934#note_1671385680
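The headroom values in the three tables above can be checked against the 50 ms budget with a short script. This is a sketch for illustration: the numbers are copied from the tables, while the constant names and the `rows_over_budget` function are hypothetical.

```python
from typing import Dict, List, Tuple

ROUTING_BUDGET_MS = 50

# Order of the values in each tuple below.
PERCENTILES = ("p99", "p95", "p90", "p50")

# service -> satisfied target in ms -> headroom in ms per percentile,
# copied from the web, api, and git tables above.
HEADROOM_MS: Dict[str, Dict[int, Tuple[int, int, int, int]]] = {
    "web": {
        5000: (4000, 4500, 4600, 4900),
        1000: (500, 740, 840, 900),
        500: (0, 60, 100, 400),
        250: (140, 170, 180, 200),
    },
    "api": {
        5000: (3500, 4300, 4600, 4900),
        1000: (440, 750, 830, 950),
        500: (450, 480, 490, 490),
        250: (90, 170, 210, 230),
    },
    "git": {
        5000: (3760, 4280, 4430, 4900),
        1000: (500, 750, 800, 900),
        500: (280, 370, 400, 430),
        250: (200, 230, 240, 240),
    },
}


def rows_over_budget(
    budget_ms: int = ROUTING_BUDGET_MS,
) -> List[Tuple[str, int, str, int]]:
    """Return (service, target_ms, percentile, headroom_ms) rows below the budget."""
    return [
        (service, target, percentile, headroom)
        for service, targets in HEADROOM_MS.items()
        for target, values in targets.items()
        for percentile, headroom in zip(PERCENTILES, values)
        if headroom < budget_ms
    ]


print(rows_over_budget())  # [('web', 500, 'p99', 0)]
```

With the values as listed, the only row that cannot absorb 50 ms is `web` at the 500 ms target on p99. Every p95 row has at least 60 ms of headroom, which matches the statement that an extra 50 ms keeps the services within the SLO at the p95 level.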
Non-Goals
Not yet defined.
Proposal
TBD
Technology
TBD
Alternatives
TBD