Set up a self-hosted large language model with LiteLLM

LiteLLM is an OpenAI-compatible proxy server. It simplifies integration with different large language models (LLMs) by exposing them all through the OpenAI API spec, so you can switch between LLMs without changing how your application calls them.
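
For example, a client talks to the proxy with standard OpenAI-format requests regardless of which backend model is configured. The following is a minimal, hypothetical sketch that assumes the proxy from the setup below is running at http://localhost:4000 and exposes a model named codegemma; adjust the endpoint and model name to your configuration.

    # Hypothetical OpenAI-format request to a LiteLLM proxy assumed to run
    # at http://localhost:4000 with a model named "codegemma" configured.
    curl --request 'POST' \
      'http://localhost:4000/v1/chat/completions' \
      --header 'Content-Type: application/json' \
      --data '{
        "model": "codegemma",
        "messages": [
          {"role": "user", "content": "Write a Python function that prints Hello, World."}
        ]
      }'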

LiteLLM architecture diagram: shows how GitLab sends requests to the AI Gateway when set up with the LiteLLM OpenAI proxy server. The client sends a request to GitLab, GitLab creates a prompt and sends it to the AI Gateway, the AI Gateway performs an API request to the AI model using the OpenAI format, LiteLLM translates and forwards the request to the model-provider-specific format, Ollama responds to the prompt, and the response is forwarded back through LiteLLM, the AI Gateway, and GitLab to the client.

On Kubernetes

In Kubernetes environments, you can install Ollama with a Helm chart or by following the example in the official documentation.
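
As a rough sketch of the Helm route, the commands below only illustrate the workflow; the repository URL and chart name are placeholders, because the chart you use depends on your environment. Check the chart's own documentation for supported values.

    # Hypothetical Helm workflow for installing Ollama on Kubernetes.
    # Replace <chart-repository-url> and <chart-name> with the chart you use.
    helm repo add ollama <chart-repository-url>
    helm repo update
    helm install ollama ollama/<chart-name> --namespace ollama --create-namespace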

Example setup with LiteLLM and Ollama

  1. Pull and serve the model with Ollama:

    ollama pull codegemma:2b
    ollama serve
    
  2. Create the LiteLLM proxy configuration that routes requests for the generic codegemma model name to a specific model version served by Ollama. In this example, requests are routed to codegemma:2b, which Ollama serves at http://localhost:11434:

    # config.yaml
    model_list:
    - model_name: codegemma
      litellm_params:
        model: ollama/codegemma:2b
        api_base: http://localhost:11434
    
  3. Run the proxy. By default, the LiteLLM proxy listens on port 4000, which is the model_endpoint used in the next step (see the verification sketch after this list):

    litellm --config config.yaml
    
  4. Send a test code completion request to the AI Gateway, which listens on http://localhost:5052 in this example:

    curl --request 'POST' \
      'http://localhost:5052/v2/code/completions' \
      --header 'accept: application/json' \
      --header 'Content-Type: application/json' \
      --data '{
        "current_file": {
          "file_name": "app.py",
          "language_identifier": "python",
          "content_above_cursor": "<|fim_prefix|>def hello_world():<|fim_suffix|><|fim_middle|>",
          "content_below_cursor": ""
        },
        "model_provider": "litellm",
        "model_endpoint": "http://127.0.0.1:4000",
        "model_name": "codegemma",
        "telemetry": [],
        "prompt_version": 2,
        "prompt": ""
      }' | jq

     The response is similar to the following:

    {
       "id": "id",
       "model": {
          "engine": "litellm",
          "name": "text-completion-openai/codegemma",
          "lang": "python"
       },
       "experiments": [],
       "object": "text_completion",
       "created": 1718631985,
       "choices": [
          {
             "text": "print(\"Hello, World!\")",
             "index": 0,
             "finish_reason": "length"
          }
       ]
    }
    
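Before sending requests through the AI Gateway, you can check each layer directly. This sketch assumes the defaults used above: Ollama listening on port 11434 and the LiteLLM proxy on port 4000. If you configured a LiteLLM master key, add the corresponding Authorization header to the proxy request.

    # Confirm Ollama has the pulled model available (lists local models).
    curl http://localhost:11434/api/tags
    
    # Confirm the LiteLLM proxy exposes the configured model name.
    curl http://localhost:4000/v1/models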

Example setup for Codestral with Ollama

When serving the Codestral model through Ollama, an additional step is required to make Codestral work with both code completions and code generations.

  1. Pull the Codestral model:

    ollama pull codestral
    
  2. Edit the default template used for Codestral so that Ollama passes the raw prompt through unchanged, and save the result as a new codestral model (a matching LiteLLM configuration sketch follows these steps):

    ollama run codestral
    
    >>> /set template {{ .Prompt }}
    Set prompt template.
    >>> /save codestral
    Created new model 'codestral'
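
To route requests to the saved model, the LiteLLM proxy configuration follows the same pattern as the codegemma example above. This sketch assumes Ollama serves the saved codestral model at http://localhost:11434 and that the AI Gateway requests it under the codestral model name; adjust both to match your setup, then run the proxy with the same litellm --config config.yaml command as before.

    # config.yaml
    model_list:
    - model_name: codestral
      litellm_params:
        model: ollama/codestral
        api_base: http://localhost:11434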