Status | Authors | Coach | DRIs | Owning Stage | Created |
---|---|---|---|---|---|
proposed | @andrewn | @grzesiek | | | 2023-04-13 |
- Executive Summary
- Scope
- Running Production ML/AI experiment workloads
GitLab Service-Integration: AI and Beyond
This document is an abbreviated proposal for Service-Integration to allow teams within GitLab to rapidly build new application features that leverage AI, ML, and data technologies.
Executive Summary
This document proposes a service-integration approach to setting up infrastructure that allows teams within GitLab to rapidly build new application features that leverage AI, ML, and data technologies. The scope is limited to internally hosted features and excludes third-party APIs.
The current application architecture runs most GitLab application features in Ruby. However, many ML/AI experiments require different resources and tools: they are implemented in different languages, depend on huge libraries that do not always play nicely together, and have different hardware requirements. Adding all of these features to the existing infrastructure would rapidly increase the size of the GitLab application container, resulting in slower startup times, more dependencies, greater security risk, reduced development velocity, and added complexity due to differing hardware requirements.
As an alternative, this proposal suggests adding services to avoid overloading GitLab’s main workloads. These services will run independently, with isolated resources and dependencies. By adding services, GitLab can maintain the availability and security of GitLab.com while enabling engineers to iterate rapidly on new ML/AI experiments.
Scope
The infrastructure, platform, and other changes related to ML/AI experiments are broad. This blueprint is limited specifically to the following scope:
- Production workloads, running (directly or indirectly) as a result of requests into the GitLab application (gitlab.com), or an associated subdomain (for example, codesuggestions.gitlab.com).
- Excludes requests from the GitLab application, made to third-party APIs outside of our infrastructure. From an Infrastructure point-of-view, external AI/ML API requests are no different from other API (non ML/AI) requests and generally follow the existing guidelines that are in place for calling external APIs.
- Excludes training and tuning workloads not directly connected to our production workloads. Training and tuning workloads are distinct from production workloads and will be covered by their own blueprint(s).
Running Production ML/AI experiment workloads
Why Not Simply Continue To Use The Existing Application Architecture?
Let’s start with some background on how the application is deployed:
- Most GitLab application features are implemented in Ruby and run in one of two types of Ruby deployments: broadly Rails and Sidekiq (although we do partition this traffic further for different workloads).
- These Ruby workloads have two main container images: gitlab-webservice-ee and gitlab-sidekiq-ee. All the code, libraries, binaries, and other resources that we use to support the main Ruby part of the codebase are embedded within these images.
- There are thousands of pods running these containers in production for GitLab.com at any moment in time. They are started up and shut down at a high rate throughout the day as traffic demands on the site fluctuate.
- For most new features developed, any new supporting resources need to be added to either one, or both of these containers.
Many of the initial discussions focus on adding supporting resources to these existing containers (example). Choosing this approach would have many downsides, in terms of both the velocity at which new features can be iterated on, and in terms of the availability of GitLab.com.
Many of the AI experiments that GitLab is considering integrating into the application are substantially different from other libraries and tools that have been integrated in the past.
- ML toolkits are implemented in a plethora of languages, each requiring separate runtimes. Python, C, C++ are the most common, but there is a long tail of languages used.
- There are a very large number of tools that we’re looking to integrate with, and no single tool will support all the features that are being investigated. TensorFlow, PyTorch, Keras, Scikit-learn, and Alpaca are just a few examples.
- These libraries are huge. TensorFlow’s container image with GPU support is 3 GB, PyTorch is 5 GB, Keras is 300 MB, and Prophet is ~250 MB.
- Many of these libraries do not play nicely together: they may have dependencies that are not compatible, or require different versions of Python, or GPU driver versions.
It’s likely that in the next few months, GitLab will experiment with many different features, using many different libraries.
Trying to deploy all of these features into the existing infrastructure would have many downsides:
- The size of the GitLab application container would expand very rapidly as each new experiment introduces a new set of supporting libraries, and each library is as big as, or bigger than, the existing GitLab application within the container.
- Startup times for new workloads would increase, potentially impacting the availability of GitLab.com during high-traffic periods.
- The number of dependencies within the container would increase rapidly, putting pressure on the engineering teams to keep ahead of exploits and vulnerabilities.
- The security attack surface within the container would be greatly increased with each new dependency. These containers include secrets which, if leaked via an exploit, would require a costly application-wide secret rotation.
- Development velocity will be negatively impacted as engineers work to avoid dependency conflicts between libraries.
- Additionally, there may be extra complexity due to the different hardware requirements of different libraries, each needing appropriate drivers for GPUs, TPUs, specific CUDA versions, and so on.
- Our Kubernetes workloads have been tuned for the existing multithreaded Ruby request (Rails) and message (Sidekiq) processes. Adding extremely resource-intensive applications into these workloads would affect unrelated requests, starving them of CPU and memory and requiring complex tuning to ensure fairness. Failure to do this would impact the availability of GitLab.com.
Proposal: Avoid Overfilling GitLab’s Application Containers with Service-Integration
GitLab.com migrated to Kubernetes several years back, but for numerous good reasons, the application architecture deployed for GitLab.com remains fairly simple.
Instead of embedding these applications directly into the Rails and/or Sidekiq containers, we run them as small, independent Kubernetes deployments, isolated from the main workload.
The service-integration approach has already been used for the GitLab Duo Suggested Reviewers feature that has been deployed to GitLab.com.
This approach would have many advantages:
- Componentization and Replaceability: some of these AI feature experiments will likely be short-lived. Being able to shut them down (possibly quickly, in an emergency, such as a security breach) is important. If they are terminated, they are less likely to leave technical debt behind in our main application workloads.
- Security Isolation: experimental services can run with access to a minimal set of secrets, or possibly none. Ideally, the services would be stateless, with data being passed in, processed, and returned to the caller without access to PostgreSQL or other data sources. In the event of a remote code exploit or other security breach, the attacker would have limited access to sensitive data.
  - In lieu of direct access to the main or CI Postgres clusters, services would be provided with access to the internal GitLab API through a predefined internal URL. The platform should provide instrumentation and monitoring on this address.
  - In future iterations, but out of scope for the initial delivery, the platform could facilitate automatic authentication against the internal API, for example by managing and injecting short-lived API tokens into internal API calls, or OIDC etc.
- Resource Isolation: resource-intensive workloads would be isolated to individual containers. OOM failures would not impact requests outside of the experiment. CPU saturation would not slow down unrelated requests.
- Dependency Isolation: different AI libraries will have conflicting dependencies. This will not be an issue if they’re run as separate services in Kubernetes.
- Container Size: the size of the main application containers does not increase drastically, which would otherwise place a burden on the application.
- Distribution Team Bottleneck: The Distribution team avoids becoming a bottleneck as demands for many different libraries to be included in the main application containers increase.
- Stronger Ownership of Workloads: teams can better understand how their workloads are running as they run in isolation.
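The stateless pattern described under Security Isolation can be sketched as a pure request handler. This is a minimal illustration, not an implementation from the proposal: the function name `handle_request`, the `text` field, and the word-count "processing" step are hypothetical stand-ins for a real inference call.

```python
import json


def handle_request(body: bytes) -> tuple:
    """Stateless handler: returns (HTTP status code, JSON response body).

    Everything the handler needs arrives in the request body. It opens no
    database connections and keeps nothing between calls.
    """
    try:
        payload = json.loads(body)
        text = payload["text"]
        if not isinstance(text, str):
            raise TypeError("text must be a string")
    except (ValueError, KeyError, TypeError):
        error = {"error": "expected a JSON object with a string 'text' field"}
        return 400, json.dumps(error).encode()

    # Hypothetical placeholder for model inference.
    result = {"word_count": len(text.split())}
    return 200, json.dumps(result).encode()
```

Because the handler holds no state or secrets, any replica can serve any request, and a compromised replica exposes only the data passed to it in that request.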
However, there are several outstanding questions:
- Availability Requirements: would experimental services have the same availability requirements (and alerting requirements) as the main application?
- Oncall: would teams be responsible for handling pager alerts for their services?
- Support for non-SAAS GitLab instances: initially all experiments would target GitLab.com, but eventually we may need to consider how to support other instances.
  - There are three possible modes for services:
    - M1: GitLab.com only: only GitLab.com supports the service.
    - M2: SAAS-hosted for use with self-managed instance and instance-hosted: a singular SAAS-hosted service supports self-managed instances and GitLab.com. This is similar to the GitLab Plus proposal.
    - M3: Instance-hosted: each instance has a copy of the service. GitLab.com has a copy for GitLab.com. Self-managed instances host their copy of the service. This is similar to the container registry or Gitaly today.
  - Initially, most experiments will probably be M1 but may be promoted to M2 or M3 as they mature.
- Promotion Process: ML/AI experimental features will need to be promoted to non-experimental status as they mature. A process for this will need to be established.
Proposed Guidelines for Building ML/AI Services
- Avoid adding any large ML/AI libraries needed to support experimentation to the main application.
- Create a platform to support individual ML/AI experiments.
- Encourage supporting services to be stateless (excluding deployed models and other resources generated during ML training).
- ML/AI experiment support services must not access main application datastores, including but not limited to main PostgreSQL, CI PostgreSQL, and main application Redis instances.
- In the main application, client code for services should reside behind a feature-flag toggle, for fine-grained control of the feature.
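The feature-flag guideline above can be sketched as follows. This is an illustration under stated assumptions rather than GitLab's client code: the `FeatureFlags` class, the flag name `ai_experiment_summarize`, and the `summarize` function are all hypothetical, and the transport is an injected callable so the example runs without a network.

```python
from typing import Callable, Optional


class FeatureFlags:
    """Hypothetical in-memory flag store standing in for a real flag system."""

    def __init__(self) -> None:
        self._enabled = set()

    def enable(self, name: str) -> None:
        self._enabled.add(name)

    def disable(self, name: str) -> None:
        self._enabled.discard(name)

    def enabled(self, name: str) -> bool:
        return name in self._enabled


def summarize(
    text: str,
    flags: FeatureFlags,
    call_service: Callable[[dict], dict],
) -> Optional[str]:
    """Call the experimental service only when its flag is on.

    Returns None when the flag is off or the service fails, so the caller
    can fall back to behaviour that does not depend on the experiment.
    """
    if not flags.enabled("ai_experiment_summarize"):
        return None
    try:
        # All input travels in the request; the service keeps no state.
        response = call_service({"text": text})
    except Exception:
        # An experiment outage must not break the main request path.
        return None
    return response.get("summary")
```

Disabling the flag stops all traffic to the service immediately, which is what makes a quick shutdown of a short-lived experiment practical.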
Technical Details
Some points, in greater detail:
Traffic Access
- Ideally, these services should not be exposed externally to Internet traffic: they should be reachable only internally, from our existing Rails and Sidekiq workloads.
  - For services intended to run at M2 (“SAAS-hosted for use with self-managed instance and instance-hosted”), we would expect to migrate the service to a public endpoint once sufficient security review has been performed.
Platform Requirements
To deploy and manage experiments quickly, a minimally viable platform will need to be provided to stage-group teams. The technical implementation details of this platform are out of scope for this blueprint and will require their own blueprint (to follow).
However, Service-Integration will establish certain necessary and optional requirements that the platform will need to satisfy.
Ease of Use, Ownership Requirements
ID | Required | Detail | Epic/Issue | Done? |
---|---|---|---|---|
R100 | Required | The platform should be easy to use: imagine Heroku with GitLab Production Readiness-approved defaults. | Runway to [BETA] : Increased Adoption and Self Service | No |
R110 | Required | With the exception of an Infrastructure-led onboarding process, services are owned, deployed and managed by stage-group teams. In other words, services follow a “You Build It, You Run It” model of ownership. | [Paused] Discussion: Tiered Support Model for Runway | No |
R120 | Required | Programming-language agnostic: no requirements for services. Services should be packaged as container images. | Runway to [BETA] : Increased Adoption and Self Service | No |
R130 | Recommended | Each service should be evaluated against the GitLab.com Service Maturity Model. | Discussion: Introduce an ‘Infrastructure Well-Architected Service Framework’ | No |
R140 | Recommended | Services using the platform have expedited production-readiness processes. | | |
Observability Requirements
ID | Required | Detail | Epic/Issue | Done? |
---|---|---|---|---|
R200 | Required | The platform must provide SLIs for services out-of-the-box. | Observability: Default Metrics, Observability: Custom Metrics | Yes |
R210 | Required | Observability dashboards, rules, alerts (with per-term routing) must be generated from a manifest. | Observability: Metrics Catalog | Yes |
R220 | Required | Standardized logging infrastructure. | Observability: Logs in Elasticsearch for model-gateway, Observability: Runway logs available to users | |
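Requirement R210 asks for alerts to be generated from a manifest. The sketch below shows one way that could look, producing a structure in the shape of a Prometheus alerting rule file. The manifest keys, the `http_requests_total` metric, the `team` routing label, and the severity value are assumptions made for illustration; they are not defined by this blueprint.

```python
def alert_rules_from_manifest(manifest: dict) -> dict:
    """Build a Prometheus-style rule group for one service's error-ratio SLI."""
    service = manifest["service"]
    threshold = manifest["error_ratio_threshold"]

    # Assumed metric and label names; a real platform would define its own.
    errors = 'sum(rate(http_requests_total{service="%s",code=~"5.."}[5m]))' % service
    total = 'sum(rate(http_requests_total{service="%s"}[5m]))' % service

    return {
        "groups": [
            {
                "name": f"{service}-sli",
                "rules": [
                    {
                        "alert": f"{service}-error-ratio-high",
                        "expr": f"{errors} / {total} > {threshold}",
                        "for": "5m",
                        # The routing label lets alerts reach the owning team.
                        "labels": {"team": manifest["team"], "severity": "s3"},
                        "annotations": {
                            "summary": f"{service} error ratio above {threshold}",
                        },
                    }
                ],
            }
        ]
    }


# Hypothetical manifest for an example service.
rules = alert_rules_from_manifest(
    {"service": "example-service", "team": "example-team", "error_ratio_threshold": 0.01}
)
```

The resulting dictionary can be serialized to YAML by whatever tooling the platform uses. Teams then edit only the manifest, never the rules themselves.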
Deployment Requirements
ID | Required | Detail | Epic/Issue | Done? |
---|---|---|---|---|
R300 | Required | No secrets stored in CI/CD. | Secrets Management | No |
R310 | Required | Multiple environments should be supported, e.g. Staging and Production. | | Yes |
R320 | Required | The platform should be cost-effective. Kubernetes clusters should support multiple services and teams. | | |
R330 | Recommended | Gradual rollouts, rollbacks, blue-green deployments. | | |
R340 | Required | Services should be isolated from one another. | | |
R350 | Recommended | Services should have the ability to specify node characteristic requirements (e.g. GPU). | | |
R360 | Required | Developers should not need knowledge of Helm, Kubernetes, or Prometheus in order to deploy. All required values are configured and validated in a project-hosted manifest before generating Kubernetes manifests, Prometheus rules, etc. | | |
R370 | | Initially, services should be synchronous only, using REST or gRPC requests. | | |
R390 | | Each service is hosted in its own GitLab repository, with the deployment manifest stored in the repository. | | |
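Requirements R350 and R360 describe a project-hosted manifest that is validated before any Kubernetes objects are generated. A minimal sketch of that flow follows. The manifest fields (`name`, `image`, `environments`, `gpu`) and the function names are hypothetical, and the GPU limit uses the `nvidia.com/gpu` resource name on the assumption that the cluster runs the NVIDIA device plugin.

```python
import re
from dataclasses import dataclass

ALLOWED_ENVIRONMENTS = {"staging", "production"}
DNS_LABEL = re.compile(r"^[a-z0-9]([-a-z0-9]*[a-z0-9])?$")


@dataclass(frozen=True)
class ServiceManifest:
    name: str
    image: str
    environments: tuple
    gpu: bool = False


def parse_manifest(raw: dict) -> ServiceManifest:
    """Validate a raw manifest, collecting every problem before failing."""
    errors = []
    name = raw.get("name", "")
    if not isinstance(name, str) or len(name) > 63 or not DNS_LABEL.match(name):
        errors.append("name must be a DNS label (lowercase, digits, '-', max 63 chars)")
    image = raw.get("image", "")
    if not isinstance(image, str) or not image:
        errors.append("image must be a non-empty container image reference")
    envs = raw.get("environments", [])
    if (
        not isinstance(envs, list)
        or not envs
        or not all(isinstance(e, str) and e in ALLOWED_ENVIRONMENTS for e in envs)
    ):
        errors.append(f"environments must be a non-empty subset of {sorted(ALLOWED_ENVIRONMENTS)}")
    if errors:
        raise ValueError("; ".join(errors))
    return ServiceManifest(name, image, tuple(envs), bool(raw.get("gpu", False)))


def render_deployment(manifest: ServiceManifest, environment: str) -> dict:
    """Generate a Kubernetes Deployment object from a validated manifest."""
    if environment not in manifest.environments:
        raise ValueError(f"{manifest.name} is not configured for {environment}")
    container = {"name": manifest.name, "image": manifest.image}
    if manifest.gpu:
        # Assumed resource name; depends on the cluster's device plugin.
        container["resources"] = {"limits": {"nvidia.com/gpu": 1}}
    labels = {"app": manifest.name, "environment": environment}
    return {
        "apiVersion": "apps/v1",
        "kind": "Deployment",
        "metadata": {"name": manifest.name, "labels": dict(labels)},
        "spec": {
            "replicas": 1,
            "selector": {"matchLabels": dict(labels)},
            "template": {
                "metadata": {"labels": dict(labels)},
                "spec": {"containers": [container]},
            },
        },
    }
```

Validating first and reporting all errors at once means a developer can fix a manifest in one pass without reading generated Kubernetes output.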
Security Requirements
ID | Required | Detail | Epic/Issue | Done? |
---|---|---|---|---|
R400 | | Stateful services deployed on the platform that utilize their own stateful storage (for example, custom deployed Postgres instance), must not store application security tokens, cloud-provider service keys or other long-lived security tokens in their stateful stores. | | |
R410 | | Long-lived shared secrets are discouraged, and should be referenced in the service manifest as such, to allow for accounting and monitoring. | | |
R420 | | Services using long-lived shared secrets should ensure that secret rotation can take place without downtime. | | |
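One common way to satisfy R420 is for the verifying side to accept both the new and the previous secret during a rotation window. The sketch below illustrates that idea with HMAC-signed requests. It is an assumption about how a service might do this, not a mechanism specified by the blueprint, and the function names and secret values are hypothetical.

```python
import hashlib
import hmac


def sign(secret: bytes, message: bytes) -> str:
    """Return a hex HMAC-SHA256 signature of the message."""
    return hmac.new(secret, message, hashlib.sha256).hexdigest()


def verify(message: bytes, signature: str, accepted_secrets: list) -> bool:
    """Accept the signature if it matches any currently accepted secret."""
    return any(
        hmac.compare_digest(sign(secret, message), signature)
        for secret in accepted_secrets
    )


# Rotation without downtime, step by step (illustrative values only):
old_secret, new_secret = b"old-shared-secret", b"new-shared-secret"
message = b'{"text": "example"}'

# 1. The verifier starts accepting both secrets.
accepted = [new_secret, old_secret]
# 2. Callers still signing with the old secret keep working...
still_valid = verify(message, sign(old_secret, message), accepted)
# 3. ...while callers are moved to the new secret.
new_valid = verify(message, sign(new_secret, message), accepted)
# 4. Once all callers have switched, the old secret is dropped.
accepted = [new_secret]
old_rejected = not verify(message, sign(old_secret, message), accepted)
```

At no point in this sequence does a correctly configured caller see a failed request, which is the property R420 asks for.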
Common Service Libraries
ID | Required | Detail | Epic/Issue | Done? |
---|---|---|---|---|
R500 | Required | Experimental services would be required to adopt and use LabKit (for Go services), or LabKit-Ruby for observability, context, correlation, FIPS verification, etc. | Scalability: LabKit as the in-application platform toolkit | |