Local LLMs¶
Liter-llm routes to any local inference engine that exposes an OpenAI-compatible API. Run models on your own hardware with zero cloud dependencies and no API key.
Supported Providers¶
| Provider | Default URL | Prefix | Notes |
|---|---|---|---|
| Ollama | http://localhost:11434/v1 | ollama/ | Most popular, easy setup |
| LM Studio | http://localhost:1234/v1 | lmstudio/ | GUI-based, beginner-friendly |
| vLLM | http://localhost:8000/v1 | vllm/ | High-throughput serving |
| llamafile | http://localhost:8080/v1 | llamafile/ | Single-file executable |
All of these providers are registered in the provider registry. LocalAI and llama.cpp are also built in with the localai/ and llamacpp/ prefixes. For any other OpenAI-compatible server, use a custom provider — register the prefix and base URL once, then route to it like any other provider.
All listed engines also support streaming via SSE and model listing via /v1/models. Tool calling, vision, and multimodal inputs work through the chat endpoint where the underlying model supports them.
Quick Start with Ollama¶
1. Install Ollama¶
Download the installer for macOS or Windows from https://ollama.com/download, or on Linux run the official install script:
curl -fsSL https://ollama.com/install.sh | sh
2. Pull a Model¶
ollama pull qwen2:0.5b
3. Use with liter-llm¶
Python
import asyncio

from liter_llm import create_client
from liter_llm._internal_bindings import ChatCompletionRequest


async def main() -> None:
    # No API key needed for local providers
    client = create_client(api_key="", base_url="http://localhost:11434/v1")
    request = ChatCompletionRequest.from_json(
        '{"model":"ollama/qwen2:0.5b","messages":[{"role":"user","content":"Hello!"}]}'
    )
    response = await client.chat(request)
    print(response.choices[0].message.content)


asyncio.run(main())
TypeScript
import { createClient } from "@xberg-io/liter-llm";

// No API key needed for local providers
const client = createClient("", "http://localhost:11434/v1");
const response = await client.chat({
  model: "ollama/qwen2:0.5b",
  messages: [{ role: "user", content: "Hello!" }],
});
console.log(response.choices[0].message.content);
Rust
use liter_llm::{
    ChatCompletionRequest, ClientConfigBuilder, DefaultClient, LlmClient,
    Message, UserContent, UserMessage,
};

#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
    // No API key needed for local providers
    let config = ClientConfigBuilder::new("")
        .base_url("http://localhost:11434/v1")
        .build();
    let client = DefaultClient::new(config, Some("ollama/qwen2:0.5b"))?;
    let request = ChatCompletionRequest {
        model: "ollama/qwen2:0.5b".into(),
        messages: vec![Message::User(UserMessage {
            content: UserContent::Text("Hello!".into()),
            name: None,
        })],
        ..Default::default()
    };
    let response = client.chat(request).await?;
    if let Some(choice) = response.choices.first() {
        println!("{}", choice.message.content.as_deref().unwrap_or(""));
    }
    Ok(())
}
Go
package main

import (
	"encoding/json"
	"fmt"

	llm "github.com/xberg-io/liter-llm/packages/go"
)

func main() {
	// Local providers (Ollama, LM Studio, ...) don't require an API key,
	// but a placeholder value is still required by the binding.
	baseURL := "http://localhost:11434/v1"
	client, err := llm.CreateClient("not-needed", &baseURL, nil, nil, nil)
	if err != nil {
		panic(err)
	}
	var req llm.ChatCompletionRequest
	if err := json.Unmarshal([]byte(`{
		"model": "ollama/qwen2:0.5b",
		"messages": [{"role": "user", "content": "Hello!"}]
	}`), &req); err != nil {
		panic(err)
	}
	resp, err := client.Chat(req)
	if err != nil {
		panic(err)
	}
	if len(resp.Choices) > 0 && resp.Choices[0].Message.Content != nil {
		fmt.Println(*resp.Choices[0].Message.Content)
	}
}
No API key required
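Local providers do not require an API key. Pass an empty string ("") as the api_key parameter. The Go binding is the exception in the examples above: it still requires a non-empty placeholder such as "not-needed".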
Model Naming Convention¶
Liter-llm uses the standard provider/model-name prefix convention for local providers, just like cloud providers:
ollama/llama3.2 -> Ollama running Llama 3.2
ollama/qwen2:0.5b -> Ollama running Qwen2 0.5B
lmstudio/my-model -> LM Studio
vllm/meta-llama/Llama-3 -> vLLM
llamafile/my-model -> llamafile
The prefix determines which base URL and configuration to use. The model name after the / is forwarded to the local server as-is.
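As a rough illustration of the rule above, the sketch below splits a model string on its first slash and looks the prefix up in a table of default URLs. The names DEFAULT_BASE_URLS and resolve are hypothetical and exist only for this example; this is not liter-llm's internal code, and the URLs are simply the defaults listed on this page.

```python
# Hypothetical sketch of prefix-based routing; not liter-llm's actual code.
DEFAULT_BASE_URLS = {
    "ollama": "http://localhost:11434/v1",
    "lmstudio": "http://localhost:1234/v1",
    "vllm": "http://localhost:8000/v1",
    "llamafile": "http://localhost:8080/v1",
    "llamacpp": "http://localhost:8080/v1",
    "localai": "http://localhost:8080/v1",
}


def resolve(model: str) -> tuple[str, str]:
    """Return (base_url, model_name) for a 'prefix/model-name' string."""
    # Split on the first slash only, so a name such as
    # 'meta-llama/Llama-3' keeps its own slash.
    prefix, sep, name = model.partition("/")
    if not sep or not name:
        raise ValueError(f"expected 'prefix/model-name', got {model!r}")
    if prefix not in DEFAULT_BASE_URLS:
        raise ValueError(f"unknown provider prefix: {prefix!r}")
    return DEFAULT_BASE_URLS[prefix], name
```

With this sketch, resolve("vllm/meta-llama/Llama-3") returns the vLLM base URL together with meta-llama/Llama-3, which is the part that would be forwarded to the server.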
Streaming¶
All local providers support streaming responses via Server-Sent Events (SSE), identical to the cloud provider streaming interface:
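The liter-llm streaming call itself is not shown in this section, so the sketch below does not guess at it. Instead it illustrates the wire format that OpenAI-compatible servers emit when a request sets "stream": true: a sequence of data: lines carrying JSON chunks, ending with data: [DONE]. The helper name iter_content_deltas and the sample lines are assumptions made for this example.

```python
import json
from typing import Iterable, Iterator


def iter_content_deltas(lines: Iterable[str]) -> Iterator[str]:
    """Yield text fragments from OpenAI-style SSE chat completion lines."""
    for raw in lines:
        line = raw.strip()
        # Skip blank separator lines and SSE comments; only 'data:' fields matter here.
        if not line.startswith("data:"):
            continue
        payload = line[len("data:"):].strip()
        if payload == "[DONE]":  # end-of-stream sentinel
            return
        chunk = json.loads(payload)
        for choice in chunk.get("choices", []):
            text = choice.get("delta", {}).get("content")
            if text:
                yield text


# Hand-written sample of what a server might send for a short reply.
sample = [
    'data: {"choices":[{"delta":{"role":"assistant","content":""}}]}',
    "",
    'data: {"choices":[{"delta":{"content":"Hel"}}]}',
    'data: {"choices":[{"delta":{"content":"lo!"}}]}',
    "data: [DONE]",
]
print("".join(iter_content_deltas(sample)))  # prints "Hello!"
```

Because the parser only consumes an iterable of lines, the same function can be fed from any HTTP client that exposes the response body line by line.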
Embeddings¶
Several local providers support embedding models. Use the standard embeddings API:
Popular local embedding models include all-minilm (384 dims), nomic-embed-text (768 dims), and mxbai-embed-large (1024 dims) on Ollama.
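The liter-llm embeddings call is not shown in this section. As a stand-in, the sketch below talks to the server's OpenAI-compatible /v1/embeddings endpoint directly with the standard library, assuming Ollama on its default port with all-minilm already pulled. Because it bypasses liter-llm, the model name carries no ollama/ prefix. The function names embed, extract_vectors, and cosine are hypothetical.

```python
import json
import math
import urllib.request


def extract_vectors(payload: dict) -> list[list[float]]:
    """Pull vectors out of an OpenAI-style response, ordered by 'index'."""
    items = sorted(payload["data"], key=lambda item: item["index"])
    return [item["embedding"] for item in items]


def embed(texts: list[str], model: str = "all-minilm",
          base_url: str = "http://localhost:11434/v1") -> list[list[float]]:
    """POST to {base_url}/embeddings and return one vector per input text."""
    body = json.dumps({"model": model, "input": texts}).encode("utf-8")
    req = urllib.request.Request(
        f"{base_url}/embeddings",
        data=body,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req, timeout=60) as resp:
        return extract_vectors(json.load(resp))


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity of two non-zero vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm
```

With a server running, embed(["hello", "hi there"]) should return two vectors whose length matches the model's dimension (384 for all-minilm, per the figures above), and cosine can then compare them.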
Provider Configuration¶
Ollama¶
Ollama runs on port 11434 by default. No additional configuration is needed:
# liter-llm.toml
api_key = ""
[[providers]]
name = "ollama"
base_url = "http://localhost:11434/v1"
model_prefixes = ["ollama/"]
Ollama model names
Ollama uses its own model naming (e.g., llama3.2, qwen2:0.5b, codellama:13b). Use ollama list to see installed models.
LM Studio¶
LM Studio runs on port 1234 by default. Load a model in the LM Studio GUI, then use it:
# liter-llm.toml
api_key = ""
[[providers]]
name = "lmstudio"
base_url = "http://localhost:1234/v1"
model_prefixes = ["lmstudio/"]
vLLM¶
Start vLLM's OpenAI-compatible server (for example, with vllm serve <model>, which listens on port 8000 by default), then point liter-llm at it:
# liter-llm.toml
api_key = ""
[[providers]]
name = "vllm"
base_url = "http://localhost:8000/v1"
model_prefixes = ["vllm/"]
llama.cpp¶
Start the llama.cpp server (for example, llama-server -m <model.gguf> --port 8080), then point liter-llm at it:
# liter-llm.toml
api_key = ""
[[providers]]
name = "llamacpp"
base_url = "http://localhost:8080/v1"
model_prefixes = ["llamacpp/"]
LocalAI¶
# liter-llm.toml
api_key = ""
[[providers]]
name = "localai"
base_url = "http://localhost:8080/v1"
model_prefixes = ["localai/"]
llamafile¶
Download a llamafile, make it executable, and run it in server mode, then point liter-llm at it:
# liter-llm.toml
api_key = ""
[[providers]]
name = "llamafile"
base_url = "http://localhost:8080/v1"
model_prefixes = ["llamafile/"]
Custom Base URL¶
If your local provider runs on a non-default port or remote host, override the base URL when constructing the client:
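One way to do this in Python is sketched below. It reuses the create_client(api_key="", base_url=...) call from the quick start and reads the address from an environment variable. The variable name LOCAL_LLM_BASE_URL and both helper functions are hypothetical choices for this example, not something liter-llm defines.

```python
import os

DEFAULT_OLLAMA_URL = "http://localhost:11434/v1"


def local_base_url(env_var: str = "LOCAL_LLM_BASE_URL",
                   default: str = DEFAULT_OLLAMA_URL) -> str:
    """Read the base URL from the environment, falling back to the default."""
    url = os.environ.get(env_var, "").strip() or default
    url = url.rstrip("/")
    # The providers on this page expose their routes under /v1.
    return url if url.endswith("/v1") else url + "/v1"


def make_client():
    """Build a client for whichever address local_base_url() resolves to."""
    # Imported lazily so local_base_url() works without liter-llm installed.
    from liter_llm import create_client

    return create_client(api_key="", base_url=local_base_url())
```

For a server on another machine, exporting LOCAL_LLM_BASE_URL=http://192.168.1.50:11434 before starting the program would make make_client() target http://192.168.1.50:11434/v1; the IP address here is only a placeholder.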
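Or in liter-llm.toml, set base_url in the provider's [[providers]] entry to the new address, for example base_url = "http://192.168.1.50:11434/v1" for an Ollama server on another machine.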
Docker Compose¶
Run Ollama alongside the liter-llm proxy for a self-contained local setup:
# docker-compose.local.yaml
services:
  ollama:
    image: ollama/ollama:latest
    ports:
      - "11434:11434"
    volumes:
      - ollama_data:/root/.ollama
  liter-llm:
    image: ghcr.io/xberg-io/liter-llm:latest
    ports:
      - "4000:4000"
    environment:
      # Left empty on purpose: in list-style entries, quotes would be passed literally
      - LITER_LLM_API_KEY=
    volumes:
      - ./liter-llm-proxy.toml:/etc/liter-llm/liter-llm-proxy.toml
    depends_on:
      - ollama
volumes:
  ollama_data:
Example proxy config for local use:
# liter-llm-proxy.toml
[server]
host = "0.0.0.0"
port = 4000
[[providers]]
name = "ollama"
base_url = "http://ollama:11434/v1"
model_prefixes = ["ollama/"]
Start the stack:
docker compose -f docker-compose.local.yaml up -d
# Pull a model into Ollama
docker exec -it $(docker compose -f docker-compose.local.yaml ps -q ollama) \
  ollama pull qwen2:0.5b
# Chat via the proxy
curl http://localhost:4000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "ollama/qwen2:0.5b", "messages": [{"role": "user", "content": "Hello!"}]}'
Troubleshooting¶
Connection Refused¶
The local server is not running or is on a different port. Verify:
# Check if Ollama is running
curl http://localhost:11434/v1/models
# Check if the port is in use
lsof -i :11434
Tip
Make sure the server is started before making requests. Ollama starts automatically on macOS but may need ollama serve on Linux.
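The same check can be made from Python before sending a request. The sketch below only tests whether something accepts TCP connections on the port; it does not confirm that the listener is actually an LLM server. The function name is_port_open is hypothetical.

```python
import socket


def is_port_open(host: str = "localhost", port: int = 11434,
                 timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:  # covers ConnectionRefusedError and timeouts
        return False
```

Calling is_port_open() with no arguments probes Ollama's default port; pass port=1234 for LM Studio or port=8000 for vLLM.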
Model Not Found¶
The model is not downloaded. Pull it first, for example with ollama pull qwen2:0.5b on Ollama.
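To see which models a server actually has, the /v1/models endpoint mentioned earlier can be queried. The sketch below does so with the standard library, assuming Ollama on its default port. It calls the server directly, so model names appear without the ollama/ prefix. The names model_ids and is_installed are hypothetical.

```python
import json
import urllib.request


def model_ids(payload: dict) -> list[str]:
    """Extract model IDs from an OpenAI-style /v1/models response."""
    return [entry["id"] for entry in payload.get("data", [])]


def is_installed(model: str,
                 base_url: str = "http://localhost:11434/v1") -> bool:
    """Ask the server which models it has and check for `model`."""
    with urllib.request.urlopen(f"{base_url}/models", timeout=10) as resp:
        return model in model_ids(json.load(resp))
```

With the server running, is_installed("qwen2:0.5b") indicates whether the pull is still needed.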
Timeout Errors¶
Local models can be slow to load on first request (especially large models). Increase the timeout:
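The liter-llm timeout setting is not shown in this section, so the sketch below does not assume one. As a client-side complement, it wraps any awaitable call, such as client.chat(request) from the quick start, in a generous deadline and retries once on timeout, since the first request often pays the cost of loading the model. The helper name call_with_patience and its defaults are assumptions. Note that an outer deadline cannot extend a shorter timeout enforced inside the client itself.

```python
import asyncio
from typing import Awaitable, Callable, TypeVar

T = TypeVar("T")


async def call_with_patience(
    make_call: Callable[[], Awaitable[T]],
    timeout_s: float = 300.0,
    attempts: int = 2,
) -> T:
    """Await make_call() under a deadline, retrying when it times out."""
    last_error = None
    for _ in range(attempts):
        try:
            # make_call is invoked again on each attempt to get a fresh awaitable.
            return await asyncio.wait_for(make_call(), timeout=timeout_s)
        except (asyncio.TimeoutError, TimeoutError) as exc:
            # The first request may still be loading the model;
            # a retry can hit a server that is already warm.
            last_error = exc
    raise TimeoutError(f"no response after {attempts} attempts") from last_error
```

Inside an async function this could be used as response = await call_with_patience(lambda: client.chat(request)), where client and request are built as in the quick start.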
Docker Networking¶
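When liter-llm runs in Docker and the local provider runs on the host, localhost inside the container does not reach the host. Use one of these addresses instead:
- Linux: Use http://host.docker.internal:11434/v1 (on Docker Engine this name must first be mapped with --add-host=host.docker.internal:host-gateway) or the default bridge gateway, http://172.17.0.1:11434/v1
- macOS/Windows: Use http://host.docker.internal:11434/v1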
# liter-llm-proxy.toml (inside Docker)
[[providers]]
name = "ollama"
base_url = "http://host.docker.internal:11434/v1"
model_prefixes = ["ollama/"]
GPU / Performance¶
- Ollama: Automatically uses GPU if available. Check with ollama ps.
- vLLM: Pass --tensor-parallel-size N for multi-GPU.
- llama.cpp: Use -ngl N to offload N layers to GPU.
- LocalAI: Set the GPU_LAYERS environment variable.