| Status | Authors | Coach | DRIs | Owning Stage | Created |
|---|---|---|---|---|---|
| proposed | - | | | | |
# Cells 1.0 Disaster Recovery

## Terms used
- Primary Cell: The current GitLab.com SaaS deployment. A special-purpose Cell that serves as a cluster-wide service in this architecture.
- Secondary Cell: A Cell that connects to the Primary Cell to ensure cluster-wide uniqueness.
- Global Service: A service that ensures global uniqueness, manages database sequences across the cluster, and helps classify which resources belong to which Cell (a conceptual sketch follows this list).
- Routing Service: A service that depends on the Global Service and manages the routing rules that direct traffic to different Cells.
- RTO: Recovery Time Objective
- RPO: Recovery Point Objective
- WAL: Write-ahead logging
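The Global Service interface is not specified in this document, so the following is only a minimal conceptual sketch of the idea behind it: if each Cell is handed a non-overlapping ID range, records stay globally unique and the cluster can classify which Cell owns a given resource. The names (`GlobalSequenceAllocator`, `RANGE_SIZE`, the Cell names) are hypothetical, not part of the actual service.

```python
# Conceptual sketch only: the real Global Service API is not defined here.
# Each Cell receives a non-overlapping sequence range, so IDs generated
# independently in different Cells can never collide.

RANGE_SIZE = 1_000_000_000  # hypothetical size of each per-Cell ID range


class GlobalSequenceAllocator:
    """Hands out non-overlapping ID ranges so every Cell generates unique IDs."""

    def __init__(self):
        self._next_range_start = 1
        self._allocations = {}  # cell name -> (range_start, range_end)

    def allocate_range(self, cell_name: str) -> tuple[int, int]:
        """Reserve the next free range for a Cell and remember the assignment."""
        start = self._next_range_start
        end = start + RANGE_SIZE - 1
        self._next_range_start = end + 1
        self._allocations[cell_name] = (start, end)
        return start, end

    def owning_cell(self, record_id: int) -> str | None:
        """Classify which Cell a given ID belongs to (useful for routing)."""
        for cell, (start, end) in self._allocations.items():
            if start <= record_id <= end:
                return cell
        return None


allocator = GlobalSequenceAllocator()
allocator.allocate_range("primary-cell")       # -> (1, 1_000_000_000)
allocator.allocate_range("secondary-cell-1")   # -> (1_000_000_001, 2_000_000_000)
print(allocator.owning_cell(1_500_000_042))    # -> "secondary-cell-1"
```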
## Goals
Cells 1.0 is the first iteration of Cells where multiple Secondary Cells can be operated independently of the Primary Cell. Although a Secondary Cell can be operated independently, it depends on the Global Service and Routing Service. For Disaster Recovery, the Global Service might still have dependencies on the Primary Cell in Cells 1.0. [^cells-1.0] A decision on whether or not we use Geo for Cells DR is pending in the Using Geo for Cells 1.0 tracking issue.
This document focuses only on defining the strategy for recovering Secondary Cells. It does not cover recovering the Global Service, Routing Service, Primary Cell, or any other external service.
Disaster Recovery for Cells creates a fork in our existing recovery process because cells are provisioned with different tooling. For example:
- Different processes, runbooks, and tooling to recover a Cell.
- Different RTO/RPO targets for the Primary Cell and Secondary Cells.
Due to this, there are different goals for RPO/RTO for the Primary and Secondary Cells.
- Meet or exceed the FY24 RTO and RPO targets that have been validated for zonal outages, which are covered in the GitLab.com Disaster Recovery Blueprint.
- Take into account the FY25 plans for the Primary Cell, including regional recovery and alternate region selection.
- Leverage the same DR procedure for Cells that we use for Dedicated.
## RTO/RPO Targets
Zonal Outages:
| | RTO | RPO |
|---|---|---|
| Primary Cell (current) | 2 hours | 1 hour |
| Primary Cell (FY25 Target) | <1 minute | <1 minute |
| Cells 1.0 (without the Primary Cell) | unknown | unknown |
Regional Outages:
| | RTO | RPO |
|---|---|---|
| Primary Cell (current) | 96 hours | 2 hours |
| Primary Cell (FY25 Target) | 48 hours | <1 minute [^1] |
| Cells 1.0 (without the Primary Cell) | unknown | unknown |
## Disaster Recovery Overview
Zonal recovery refers to a disaster, outage, or deletion that is limited in scope to a single availability zone. The outage might affect the entire zone or only a subset of infrastructure in that zone. Regional recovery refers to a disaster, outage, or deletion that is limited in scope to a single region. The outage might affect the entire region or a subset of infrastructure that spans more than one zone.
| Service | Zonal Disaster Recovery | Estimated RTO | Estimated RPO |
|---|---|---|---|
| GitLab Rails | All services running in a Cell are redundant across zones. There is no data stored for this service. | <=1 minute | not applicable |
| Gitaly Cluster | Gitaly Cluster consists of a single SPOF (single point of failure) node and remains so for Cells 1.0. It requires a restore from backup in the case of a zonal failure. | <=30 minutes | <=1 hour for snapshot restore until WAL is available for restore. [^2] |
| Redis Cluster | Redis is deployed in multiple availability zones and is capable of recovering automatically from a service interruption in a single zone. | <=1 minute | <=1 minute |
| PostgreSQL Cluster | The PostgreSQL cluster is deployed in multiple availability zones and is capable of recovering automatically from a service interruption in a single zone. A small amount of data loss might occur on failover. | <=1 minute | <=1 minute |
| Service | Regional Disaster Recovery | Estimated RTO | Estimated RPO |
|---|---|---|---|
| GitLab Rails | All services running in a Cell are local to a region and require a rebuild on a regional failure. There is no data stored for this service. | <=12 hours | not applicable |
| Gitaly Cluster | Initially, Gitaly Cluster consists of a single SPOF node and remains so for Cells 1.0. It requires a rebuild in the case of a regional failure (see the RPO sketch after this table). | Unknown | <=1 hour for snapshot restore until WAL is available for restore. [^2] |
| Redis Cluster | Redis is deployed in a single region and requires a rebuild in the case of a regional failure. In-flight jobs, session data, and cache cannot be recovered. | Unknown | not applicable |
| PostgreSQL Cluster | The PostgreSQL cluster is deployed in a single region and requires a rebuild in the case of a regional failure. Recovery is from backups and WAL files. A small amount of data loss might occur on failover. | Unknown | <=5 minutes |
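To make the footnoted Gitaly RPO estimates concrete: the worst-case data-loss window is bounded by the age of the last disk snapshot while only snapshot restores are possible, and shrinks to the age of the last archived WAL segment once WAL is available for restore. The sketch below is illustrative only; the `estimated_rpo` helper and the timestamps are hypothetical and do not reflect any real backup tooling.

```python
# Minimal sketch, assuming hypothetical backup metadata: estimates the
# worst-case data-loss window (RPO) for a single-node Gitaly restore.
from datetime import datetime, timedelta, timezone


def estimated_rpo(last_snapshot_at: datetime,
                  last_wal_archive_at: datetime | None,
                  now: datetime) -> timedelta:
    """Return the worst-case window of repository data that could be lost."""
    if last_wal_archive_at is not None:
        # WAL-based restore: loss bounded by the last archived WAL segment.
        return now - last_wal_archive_at
    # Snapshot-only restore: loss bounded by the last disk snapshot.
    return now - last_snapshot_at


now = datetime(2024, 5, 1, 12, 0, tzinfo=timezone.utc)
snapshot_only = estimated_rpo(now - timedelta(minutes=50), None, now)
with_wal = estimated_rpo(now - timedelta(minutes=50),
                         now - timedelta(minutes=2), now)
print(snapshot_only, with_wal)  # ~50 minutes vs ~2 minutes
```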
## Disaster Recovery Validation
Disaster Recovery for Cells needs to be validated through periodic restore testing. This recovery should be performed on a Cell in the Production environment. The testing is done once a quarter and is completed by running game days using the disaster recovery runbook.
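As an illustration of what a quarterly game day could record, the sketch below compares measured recovery times against the zonal targets from the tables above. The helper and profile names are hypothetical, and Cells 1.0 targets are omitted because they are still unknown.

```python
# Illustrative sketch only: check game-day measurements against the
# documented zonal RTO/RPO targets (Cells 1.0 targets are still TBD).
from datetime import timedelta

ZONAL_TARGETS = {
    "primary-cell-current": {"rto": timedelta(hours=2), "rpo": timedelta(hours=1)},
    "primary-cell-fy25": {"rto": timedelta(minutes=1), "rpo": timedelta(minutes=1)},
}


def meets_targets(profile: str,
                  measured_rto: timedelta,
                  measured_rpo: timedelta) -> bool:
    """True if a game-day restore met both the RTO and RPO targets."""
    targets = ZONAL_TARGETS[profile]
    return measured_rto <= targets["rto"] and measured_rpo <= targets["rpo"]


# Example: a restore that took 1h45m with 30 minutes of data loss.
print(meets_targets("primary-cell-current",
                    measured_rto=timedelta(hours=1, minutes=45),
                    measured_rpo=timedelta(minutes=30)))  # True
```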
## Risks
- The Primary Cell does not use Dedicated for deployment and operation, whereas the Secondary Cells do. This might split our processes and runbooks and add to our RTO.
- The current plan is to run Secondary Cells using Dedicated. The process for Disaster Recovery on Dedicated has a large number of manual steps and is not yet automated.[^3]
- The Dedicated DR runbook has guidance, but it is not structured in a way that can be followed by an SRE in the event of a disaster.[^4]
[^1]: On the Primary Cell and Cells 1.0, backups and data are stored on Google Object Storage, which makes no RPO guarantees for regional failure. At this time, there are no plans to use dual-region buckets, which have a 15 minute RPO guarantee.
[^2]: See the DR Blueprint.
[^3]: See this tracking epic for the work that was done to validate DR on Dedicated and this issue for future plans to improve the Dedicated runbooks.
[^4]: See this note on why this is the case and how Geo is the preferred method for Disaster Recovery on Dedicated.