How to configure llama.cpp on Ubuntu

There are plenty of docs and guides on running llama.cpp locally on Ubuntu, but they are all disjointed and provide only bits and pieces.

I wanted to run llama.cpp locally in Docker on my Ubuntu laptop and it took quite a bit of wrangling to make it all work.

The steps documented below are mostly notes to myself, in case I ever need to do it all again.

Prerequisites: you should have the NVIDIA drivers and CUDA installed already. The quick test is to run nvidia-smi on the host and check that the GPU device you want to use shows up in the list.
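
If you want a more compact check than the full nvidia-smi table, the same tool can print just the fields that matter here (these are standard nvidia-smi flags, nothing llama.cpp-specific):

# prints one CSV line per GPU with its name, driver version and total VRAM
nvidia-smi --query-gpu=name,driver_version,memory.total --format=csv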

I use Docker from snap, which is probably sudo snap install docker, but I installed it a long time ago. The fact that Docker is a snap adds a few complications, of course.
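
If you ever need to redo that part, it is roughly this (my own install predates these notes, so treat the exact commands as an approximation):

# install Docker as a snap and confirm it is there
sudo snap install docker
snap list docker
docker --version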

First, install the NVIDIA Container Toolkit. It will mostly work, but the linked guide assumes Docker is installed directly, not via snap. Instead of the documented nvidia-ctk step, do this (just a path change):

sudo nvidia-ctk runtime configure --runtime=docker \
  --config=/var/snap/docker/current/etc/docker/daemon.json \
  --set-as-default

sudo snap restart docker
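
To double-check what the configure step actually wrote, look at the snap's daemon.json. The exact contents depend on the toolkit version and on whatever was in the file before, but there should be an nvidia entry under runtimes and, thanks to --set-as-default, a default-runtime key - something along these lines:

sudo cat /var/snap/docker/current/etc/docker/daemon.json

{
  "default-runtime": "nvidia",
  "runtimes": {
    "nvidia": {
      "args": [],
      "path": "nvidia-container-runtime"
    }
  }
}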

Check that the GPU is visible from inside the container. It is the same nvidia-smi command, but now run inside a container:

docker run --rm --runtime=nvidia --gpus all ubuntu nvidia-smi

The guides online conflict on whether both the --runtime and --gpus parameters are required. I needed both.
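
Another quick way to see what the Docker daemon itself thinks it has (and whether nvidia really became the default runtime) is its own info output:

# lists the registered runtimes and the default one
docker info | grep -i runtime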

Now to the final command that downloads and launches the model:

docker run \
  --runtime=nvidia --gpus all \
  -p 8000:8000 \
  -v /media/data1/llama.cpp-cache:/root/.cache/llama.cpp \
  ghcr.io/ggml-org/llama.cpp:full-cuda \
  -s \
  -hf unsloth/Qwen3-Coder-30B-A3B-Instruct-GGUF:Q4_K_M \
  --api-key something \
  --port 8000 \
  --host 0.0.0.0 \
  --jinja \
  --ctx-size 10000 \
  --gpu-layers 8

This looks way too long, and a few things here are different from the commands you'd see in other guides, so I'll go over the individual bits.

--runtime=nvidia --gpus all just to make sure the GPU is visible.

-p 8000:8000 port forwarding. You know Docker, right?

-v /media/data1/llama.cpp-cache:/root/.cache/llama.cpp Mount a host folder for llama.cpp to place its models in. You do not want to re-download tens of gigabytes every time you restart the container, right? Note that the guides mention a /models folder inside the container for this purpose. I have not seen llama.cpp use that one at all.
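
If the host folder does not exist yet, create it before the first run, and check its size afterwards to confirm the cache actually ended up there (the path is just where I keep it, adjust to your setup):

mkdir -p /media/data1/llama.cpp-cache
du -sh /media/data1/llama.cpp-cache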

ghcr.io/ggml-org/llama.cpp:full-cuda The Docker image name. The CUDA suffix at the end is important! Otherwise you'd still be running on the CPU.

-s Just tells llama.cpp to launch a server. It is not like chatting with it via stdin/stdout is much fun.

-hf unsloth/Qwen3-Coder-30B-A3B-Instruct-GGUF:Q4_K_M The name of the model to launch. This pulls a model straight from HuggingFace, complete with a quantization qualifier. Yes, I am running a model quantized all the way down to roughly 4 bits per weight. Even so, only about 60% of it fits into VRAM and the rest is computed on the CPU.

--api-key something Because even a tiny bit of security never hurts.

--port 8000 --host 0.0.0.0 It defaults to port 8080, which is already busy on my machine.

--jinja Required to make tool calling work. You may want to experiment with --chat-template chatml as well, depending on the model and the AI application that is going to connect to the local server.

--ctx-size 10000 The default context size is way too small for anything useful (it is 4096, by the way). Using --ctx-size 0 to defer the decision to the model is a great way to test your OOM killer, as models tend to ask for all the memory in the world. Feel free to experiment with a value that works for you. The value is in tokens, which makes it non-trivial to calculate the best setting ahead of time.

--gpu-layers 8 Unlike ollama, llama.cpp does not automatically limit the VRAM allocation to what's available. Starting with the usual --gpu-layers 99 only results in a cudaMalloc error and a helpful suggestion to reduce the GPU-allocated layer count. Feel free to experiment with the value. 8 is pretty safe (that's only about 4 GB worth of VRAM for most of the models I have played with so far).
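
While tuning the layer count, it helps to keep an eye on the actual VRAM usage from a second terminal on the host; plain nvidia-smi in a loop is enough:

# refreshes the GPU memory usage every second
watch -n 1 nvidia-smi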

Once you launch the Docker container and it is done downloading the model, loading it and warming it up, you can hit the localhost URL it prints out for a regular chat interface. I'd recommend testing that it works at a reasonable speed before proceeding.
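
You can also poke the server from the command line before involving any editor. The /health endpoint is llama.cpp's own liveness check, and /v1/chat/completions is the OpenAI-compatible part; the port and the API key below match the docker run command above, and the model field can be anything, since the server runs whatever it loaded:

# quick liveness check (passing the key everywhere does not hurt)
curl -s -H "Authorization: Bearer something" http://127.0.0.1:8000/health

# a minimal chat completion, handy for eyeballing the speed
curl -s http://127.0.0.1:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer something" \
  -d '{"model": "anything", "messages": [{"role": "user", "content": "Say hello in one short sentence."}]}'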

The final step is to configure the AI application to use the local server. They really do not like listing this as an option, but they all support it, simply because llama.cpp implements an OpenAI-compatible API.

For Zed, that'd be clicking the Settings in the AI panel and finding the small "+ Add Provider" button. Select "OpenAI" from the list (at the time of writing, this is the ONLY option in the list) and fill in the details like the API URL. You might be tempted to get smart and put http://127.0.0.1:8000/completion, but in my experience just pointing it at the root of the correct port works: http://127.0.0.1:8000. The application figures the proper URL out on its own. Make sure to select the capabilities; for me the important one is tool calling.

If you are more inclined to just edit the settings.json file, here is the snippet (the model name does not matter - it will run whatever model you loaded when you launched the Docker image):

"language_models": {
  "openai_compatible": {
    "llama.cpp": {
      "api_url": "http://127.0.0.1:8000",
      "available_models": [
        {
          "name": "Qwen3-Coder-30B-A3B-Instruct",
          "max_tokens": 200000,
          "max_output_tokens": 32000,
          "max_completion_tokens": 200000,
          "capabilities": {
            "tools": true,
            "images": false,
            "parallel_tool_calls": false,
            "prompt_cache_key": false
          }
        }
      ]
    }
  }
},
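
In case you are hunting for that file: on Linux, Zed keeps its settings at ~/.config/zed/settings.json by default (assuming a standard install; you can also get to it through the settings action in Zed's command palette):

# default Zed settings location on Linux; the zed CLI opens it in the editor
zed ~/.config/zed/settings.json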

You would still need to enter the API key into the llama.cpp provider in the list after that. And it never hurts to restart the editor - it does not seem to properly reload anything when those settings change.

With that, you can run the AI locally and even see it try to call tools. As much as it can...
