Running llama.cpp GPU server on Jetson (Orin Nano)
0. SYSTEM PREREQS
sudo apt update
sudo apt install -y git cmake build-essential python3 python3-venv python3-pip curl
1. BUILD llama.cpp FROM SOURCE (GPU ENABLED)
cd ~
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
Build with CUDA (Jetson Orin Nano / Orin Nano Super)
cmake -B build \
-DGGML_CUDA=ON \
-DCMAKE_BUILD_TYPE=Release
cmake --build build -j$(nproc)
Verify GPU is visible
./build/bin/llama-cli --list-devices
Expected:
CUDA0: Orin ...
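This check can also be scripted. The sketch below assumes a CUDA-enabled build lists each GPU on a line beginning with CUDA and a device index, as in the expected output above. has_cuda_device is a hypothetical helper name, not part of llama.cpp.

```shell
# Hypothetical helper: succeeds if the text on stdin contains a line
# that starts with "CUDA<index>:" (for example "CUDA0: Orin ...").
has_cuda_device() {
  grep -Eq '^[[:space:]]*CUDA[0-9]+:'
}

# Usage (run from the llama.cpp directory):
#   ./build/bin/llama-cli --list-devices 2>&1 | has_cuda_device && echo "GPU visible"
```

If the helper fails, the build most likely did not pick up CUDA, and the cmake step is the first place to look.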
2. CREATE MODEL DIRECTORY
mkdir -p ~/models
3. INSTALL HUGGING FACE CLI (IMPORTANT)
python3 -m venv ~/.venv-hf
source ~/.venv-hf/bin/activate
pip install -U huggingface_hub
Login:
hf auth login
If login fails with a permission error, make sure the cache directory exists and is private to your user:
mkdir -p ~/.cache/huggingface
chmod -R 700 ~/.cache/huggingface
4. DOWNLOAD WORKING MODELS (GUARANTEED REPOS)
✅ BEST SMALL CODING MODEL (RECOMMENDED FOR ORIN)
Qwen2.5-Coder 3B (BEST STABILITY)
hf download Qwen/Qwen2.5-Coder-3B-Instruct-GGUF \
qwen2.5-coder-3b-instruct-q4_k_m.gguf \
--local-dir ~/models
Optional stronger model (if RAM allows)
hf download bartowski/deepseek-coder-6.7B-instruct-GGUF \
DeepSeek-Coder-6.7B-Instruct-Q4_K_M.gguf \
--local-dir ~/models
5. VERIFY MODELS EXIST
ls -lh ~/models
You should see:
qwen2.5-coder-3b-instruct-q4_k_m.gguf
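A file with the right name is not necessarily a model. GGUF files begin with the four ASCII bytes GGUF, so a quick look at the header catches a saved error page or a wrong file. It does not catch a truncated download. is_gguf is a hypothetical helper name used only for this sketch.

```shell
# Hypothetical helper: succeeds if the file starts with the 4-byte GGUF magic.
is_gguf() {
  [ "$(head -c 4 "$1" 2>/dev/null)" = "GGUF" ]
}

# Usage:
#   is_gguf ~/models/qwen2.5-coder-3b-instruct-q4_k_m.gguf && echo "looks like a GGUF file"
```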
6. RUN llama-server MANUALLY (TEST)
IMPORTANT FIXES FOR YOUR ERRORS:
- set model explicitly
- reduce GPU layers
- increase context for OpenCode
./build/bin/llama-server \
--host 0.0.0.0 \
--port 8080 \
-m ~/models/qwen2.5-coder-3b-instruct-q4_k_m.gguf \
-c 16384 \
-ngl 20
For DeepSeek (6.7B), point -m at the DeepSeek file and lower the GPU layers:
-ngl 10
7. VALIDATE API
curl http://localhost:8080/v1/models
Chat test:
curl http://localhost:8080/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"messages": [
{"role":"user","content":"Explain Kubernetes in simple terms"}
],
"temperature": 0.7
}'
⚠️ No API key is needed. If you saw:
Invalid API Key
the request was not reaching llama-server: OpenCode was in OpenAI proxy mode or had the wrong endpoint configured.
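The chat endpoint returns a full JSON document, which is hard to read in a terminal. The sketch below assumes the response follows the OpenAI chat completion layout (choices[0].message.content) and uses python3, installed in step 0, to print only the reply. extract_reply is a hypothetical helper name.

```shell
# Hypothetical helper: reads a chat completion response (JSON) on stdin
# and prints only the assistant's reply text.
extract_reply() {
  python3 -c 'import json, sys; print(json.load(sys.stdin)["choices"][0]["message"]["content"])'
}

# Usage:
#   curl -s http://localhost:8080/v1/chat/completions \
#     -H "Content-Type: application/json" \
#     -d '{"messages":[{"role":"user","content":"Say hello"}]}' | extract_reply
```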
8. CREATE SYSTEMD SERVICE (PROPER FIX)
Create service:
sudo nano /etc/systemd/system/llama-server.service
Paste this (NO blanks, fully working; it assumes the user is nvidia, so adjust User= and the /home/nvidia paths if yours differs):
[Unit]
Description=llama.cpp GPU Server
After=network.target
[Service]
Type=simple
User=nvidia
WorkingDirectory=/home/nvidia/llama.cpp
ExecStart=/home/nvidia/llama.cpp/build/bin/llama-server \
-m /home/nvidia/models/qwen2.5-coder-3b-instruct-q4_k_m.gguf \
--host 0.0.0.0 \
--port 8080 \
--ctx-size 16384 \
--batch-size 128 \
--ubatch-size 128 \
--n-gpu-layers 99 \
--flash-attn auto
Restart=always
RestartSec=5
Environment=CUDA_VISIBLE_DEVICES=0
[Install]
WantedBy=multi-user.target
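Because the unit hard-codes absolute paths, a wrong path shows up only as a restart loop. A small pre-flight check before enabling the service can save a trip through the logs. preflight is a hypothetical helper name, and the paths in the usage line are the ones from the unit above.

```shell
# Hypothetical pre-flight check: verifies that the server binary is executable
# and the model file is readable before the unit is enabled.
preflight() {
  local bin="$1" model="$2"
  [ -x "$bin" ]   || { echo "not executable: $bin" >&2; return 1; }
  [ -r "$model" ] || { echo "not readable: $model" >&2; return 1; }
  echo "ok"
}

# Usage (run as the same user named in User=):
#   preflight /home/nvidia/llama.cpp/build/bin/llama-server \
#             /home/nvidia/models/qwen2.5-coder-3b-instruct-q4_k_m.gguf
```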
9. ENABLE SERVICE
sudo systemctl daemon-reload
sudo systemctl enable llama-server
sudo systemctl start llama-server
CHECK LOGS
journalctl -u llama-server -f
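Loading the model takes a while after the service starts, so requests sent immediately may fail. The sketch below polls the /health endpoint of llama-server until it answers successfully. wait_for_server is a hypothetical helper name, and the default of 60 attempts is an arbitrary assumption.

```shell
# Hypothetical helper: polls the server's /health endpoint once per second
# until it answers successfully or the attempts run out.
wait_for_server() {
  local url="${1:-http://localhost:8080/health}"
  local tries="${2:-60}"
  local i=0
  while [ "$i" -lt "$tries" ]; do
    # -f makes curl fail on HTTP error statuses (for example while the model is still loading)
    if curl -sf -o /dev/null "$url"; then
      return 0
    fi
    i=$((i + 1))
    sleep 1
  done
  return 1
}

# Usage:
#   wait_for_server && echo "server ready" || echo "server did not come up"
```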
10. FIX YOUR OPENCODE ISSUE (CRITICAL)
Your error:
context size 2048 too small
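Fix:
Run the server with a context larger than 2048, at least:
-c 4096
or:
-c 8192
The commands in steps 6 and 8 already use 16384, which also satisfies this.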
11. OPENAI COMPAT MODE FOR OPENCODE
Use this base URL:
http://localhost:8080/v1
NO API KEY REQUIRED. If the client insists on a value, any placeholder such as "dummy" works.
Example OpenCode config (key names may differ between OpenCode versions):
{
"provider": "openai",
"base_url": "http://localhost:8080/v1",
"api_key": "dummy",
"model": "/home/nvidia/models/qwen2.5-coder-3b-instruct-q4_k_m.gguf"
}
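The "model" value has to match what the server reports, and that is easiest to read from the server itself. The sketch below assumes /v1/models returns the OpenAI list layout (data[0].id). first_model_id is a hypothetical helper name.

```shell
# Hypothetical helper: reads the /v1/models response (JSON) on stdin
# and prints the id of the first listed model.
first_model_id() {
  python3 -c 'import json, sys; print(json.load(sys.stdin)["data"][0]["id"])'
}

# Usage:
#   curl -s http://localhost:8080/v1/models | first_model_id
```

Copy the printed id into the "model" field of the config.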
12. TROUBLESHOOTING (ERRORS YOU HIT, EXPLAINED)
❌ GPU OOM (DeepSeek 6.7B)
Fix:
-ngl 10
-c 4096
Or switch to:
- Qwen 3B (recommended)
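Context size matters for memory because the KV cache grows linearly with it. A rough estimate is 2 (K and V) x layers x KV heads x head dimension x context x bytes per element. The model shape used below (36 layers, 2 KV heads, head dimension 128) is an assumption for Qwen2.5-3B, and the cache is assumed to be f16 (2 bytes); check the llama-server startup log for the real values. kv_cache_mib is a hypothetical helper name.

```shell
# kv_cache_mib LAYERS KV_HEADS HEAD_DIM CONTEXT BYTES_PER_ELEMENT
# Prints the estimated KV cache size in MiB.
kv_cache_mib() {
  # The leading 2 accounts for one K and one V tensor per layer.
  echo $(( 2 * $1 * $2 * $3 * $4 * $5 / 1024 / 1024 ))
}

# Assumed Qwen2.5-3B shape with an f16 cache:
kv_cache_mib 36 2 128 16384 2   # -c 16384, prints 576
kv_cache_mib 36 2 128 4096 2    # -c 4096, prints 144
```

The estimate covers only the cache. Model weights and compute buffers come on top, and on the Orin the GPU shares this memory with the rest of the system.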
❌ systemd crash loop
Cause:
- bad CLI flag (--flash-attn missing value)
Fix:
--flash-attn auto
❌ hf repo not found
Cause:
- wrong repo name (many “TheBloke” repos renamed/deprecated)
Fix: Use:
Qwen/Qwen2.5-Coder-3B-Instruct-GGUF
bartowski/*
❌ hf CLI not found
Fix (activate the venv from step 3 first with source ~/.venv-hf/bin/activate, then):
pip install huggingface_hub
hf auth login
13. RECOMMENDED STACK FOR THE ORIN NANO
| Component | Choice |
|---|---|
| Model | Qwen2.5-Coder-3B-Instruct |
| Quant | Q4_K_M |
| Context | 4096 |
| GPU layers | 20 |
| Server | llama-server |