Local LLMs¶
liter-llm supports local inference engines that expose an OpenAI-compatible API. Run models on your own hardware with zero cloud dependencies and no API key.
Supported Providers¶
| Provider | Default URL | Prefix | Notes |
|---|---|---|---|
| Ollama | `http://localhost:11434/v1` | `ollama/` | Most popular, easy setup |
| LM Studio | `http://localhost:1234/v1` | `lm_studio/` or `lmstudio/` | GUI-based, beginner-friendly |
| vLLM | `http://localhost:8000/v1` | `vllm/` | High-throughput serving |
| llama.cpp | `http://localhost:8080/v1` | `llamacpp/` | Lightweight C++ inference |
| LocalAI | `http://localhost:8080/v1` | `localai/` | Drop-in OpenAI replacement |
| llamafile | `http://localhost:8080/v1` | `llamafile/` | Single-file executable |
All of these providers are registered in the provider registry with their default base URLs. liter-llm routes requests automatically based on the model prefix.
Supported Capabilities¶
| Provider | Chat | Completions | Embeddings | Rerank | Audio | Images |
|---|---|---|---|---|---|---|
| Ollama | ✓ | ✓ | ✓ | — | — | — |
| LM Studio | ✓ | ✓ | ✓ | — | — | — |
| vLLM | ✓ | ✓ | ✓ | ✓ | — | — |
| llama.cpp | ✓ | ✓ | ✓ | ✓ | — | — |
| LocalAI | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ |
| llamafile | ✓ | ✓ | ✓ | ✓ | — | — |
All providers also support streaming via SSE and model listing via /v1/models.
Tool calling and vision/multimodal inputs are supported through the chat endpoint when the underlying model supports them.
Quick Start with Ollama¶
1. Install Ollama¶
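On Linux, Ollama can be installed with the official install script; macOS and Windows users can download the installer from ollama.com:

# Install Ollama on Linux via the official install script
curl -fsSL https://ollama.com/install.sh | sh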
2. Pull a Model¶
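Pull a small model to test with. The examples below use qwen2:0.5b:

ollama pull qwen2:0.5b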
3. Use with liter-llm¶
import asyncio

from liter_llm import LlmClient


async def main() -> None:
    # No API key needed for local providers
    client = LlmClient(api_key="", base_url="http://localhost:11434/v1")
    response = await client.chat(
        model="ollama/qwen2:0.5b",
        messages=[{"role": "user", "content": "Hello!"}],
    )
    print(response.choices[0].message.content)


asyncio.run(main())
import { LlmClient } from "@kreuzberg/liter-llm";

// No API key needed for local providers
const client = new LlmClient({
  apiKey: "",
  baseUrl: "http://localhost:11434/v1",
});

const response = await client.chat({
  model: "ollama/qwen2:0.5b",
  messages: [{ role: "user", content: "Hello!" }],
});

console.log(response.choices[0].message.content);
use liter_llm::{
    ChatCompletionRequest, ClientConfigBuilder, DefaultClient, LlmClient,
    Message, UserContent, UserMessage,
};

#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
    // No API key needed for local providers
    let config = ClientConfigBuilder::new("")
        .base_url("http://localhost:11434/v1")
        .build();
    let client = DefaultClient::new(config, Some("ollama/qwen2:0.5b"))?;

    let request = ChatCompletionRequest {
        model: "ollama/qwen2:0.5b".into(),
        messages: vec![Message::User(UserMessage {
            content: UserContent::Text("Hello!".into()),
            name: None,
        })],
        ..Default::default()
    };

    let response = client.chat(request).await?;
    if let Some(choice) = response.choices.first() {
        println!("{}", choice.message.content.as_deref().unwrap_or(""));
    }

    Ok(())
}
package main

import (
	"context"
	"fmt"

	llm "github.com/kreuzberg-dev/liter-llm/packages/go"
)

func main() {
	// No API key needed for local providers
	client := llm.NewClient(
		llm.WithAPIKey(""),
		llm.WithBaseURL("http://localhost:11434/v1"),
	)

	resp, err := client.Chat(context.Background(), &llm.ChatCompletionRequest{
		Model: "ollama/qwen2:0.5b",
		Messages: []llm.Message{
			llm.NewTextMessage(llm.RoleUser, "Hello!"),
		},
	})
	if err != nil {
		panic(err)
	}

	if len(resp.Choices) > 0 && resp.Choices[0].Message.Content != nil {
		fmt.Println(*resp.Choices[0].Message.Content)
	}
}
No API key required
Local providers do not require an API key. Pass an empty string ("") as the api_key parameter.
Model Naming Convention¶
liter-llm uses the standard provider/model-name prefix convention for local providers, just like cloud providers:
- `ollama/llama3.2` -> Ollama running Llama 3.2
- `ollama/qwen2:0.5b` -> Ollama running Qwen2 0.5B
- `lm_studio/my-model` -> LM Studio
- `vllm/meta-llama/Llama-3` -> vLLM
- `llamacpp/my-model` -> llama.cpp server
- `localai/gpt-3.5-turbo` -> LocalAI
- `llamafile/my-model` -> llamafile
The prefix determines which base URL and configuration to use. The model name after the / is forwarded to the local server as-is.
Streaming¶
All local providers support streaming responses via Server-Sent Events (SSE), identical to the cloud provider streaming interface:
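A minimal Python sketch, assuming the chat method accepts an OpenAI-style `stream=True` flag and yields chunks carrying a `delta` payload (the streaming parameter and chunk shape are assumptions here; see the streaming guide for the exact interface):

import asyncio

from liter_llm import LlmClient


async def main() -> None:
    client = LlmClient(api_key="", base_url="http://localhost:11434/v1")
    # Assumed: stream=True returns an async iterator of chat chunks
    stream = await client.chat(
        model="ollama/qwen2:0.5b",
        messages=[{"role": "user", "content": "Write a haiku about local inference."}],
        stream=True,
    )
    async for chunk in stream:
        delta = chunk.choices[0].delta.content
        if delta:
            print(delta, end="", flush=True)


asyncio.run(main())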
Embeddings¶
Several local providers support embedding models. Use the standard embeddings API:
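A minimal Python sketch, assuming the client exposes an OpenAI-style `embeddings` method and response shape (the method name is an assumption; see the embeddings guide for the exact call):

import asyncio

from liter_llm import LlmClient


async def main() -> None:
    client = LlmClient(api_key="", base_url="http://localhost:11434/v1")
    # Assumed: embeddings() mirrors the OpenAI embeddings API
    response = await client.embeddings(
        model="ollama/nomic-embed-text",
        input=["liter-llm works with local models too."],
    )
    print(len(response.data[0].embedding))


asyncio.run(main())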
Popular local embedding models include all-minilm (384 dims), nomic-embed-text (768 dims), and mxbai-embed-large (1024 dims) on Ollama.
Provider Configuration¶
Ollama¶
Ollama runs on port 11434 by default. No additional configuration is needed:
# liter-llm.toml
api_key = ""
[[providers]]
name = "ollama"
base_url = "http://localhost:11434/v1"
model_prefixes = ["ollama/"]
Ollama model names
Ollama uses its own model naming (e.g., llama3.2, qwen2:0.5b, codellama:13b). Use ollama list to see installed models.
LM Studio¶
LM Studio runs on port 1234 by default. Load a model in the LM Studio GUI, then use it:
# liter-llm.toml
api_key = ""
[[providers]]
name = "lm_studio"
base_url = "http://localhost:1234/v1"
model_prefixes = ["lm_studio/", "lmstudio/"]
vLLM¶
Start vLLM with the OpenAI-compatible server:
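For example, assuming vLLM is installed via pip (the model name is illustrative):

# Start vLLM's OpenAI-compatible server on port 8000
python -m vllm.entrypoints.openai.api_server \
  --model meta-llama/Llama-3.1-8B-Instruct \
  --port 8000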
# liter-llm.toml
api_key = ""
[[providers]]
name = "vllm"
base_url = "http://localhost:8000/v1"
model_prefixes = ["vllm/"]
llama.cpp¶
Start the llama.cpp server:
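For example (the GGUF path is a placeholder for your local model file):

# Start the llama.cpp OpenAI-compatible server on port 8080
./llama-server -m ./models/my-model.gguf --port 8080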
# liter-llm.toml
api_key = ""
[[providers]]
name = "llamacpp"
base_url = "http://localhost:8080/v1"
model_prefixes = ["llamacpp/"]
LocalAI¶
# liter-llm.toml
api_key = ""
[[providers]]
name = "localai"
base_url = "http://localhost:8080/v1"
model_prefixes = ["localai/"]
llamafile¶
Download and run a llamafile:
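For example (the file name is a placeholder; use any .llamafile release):

# A llamafile is a single self-contained executable; by default it serves
# an OpenAI-compatible API on port 8080
chmod +x ./my-model.llamafile
./my-model.llamafile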
# liter-llm.toml
api_key = ""
[[providers]]
name = "llamafile"
base_url = "http://localhost:8080/v1"
model_prefixes = ["llamafile/"]
Custom Base URL¶
If your local provider runs on a non-default port or remote host, override the base URL:
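For example, in Python (the host and port are illustrative):

from liter_llm import LlmClient

# Point the client at Ollama running on another machine (address is an example)
client = LlmClient(api_key="", base_url="http://192.168.1.50:11434/v1")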
Or in liter-llm.toml:
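# liter-llm.toml
api_key = ""

[[providers]]
name = "ollama"
base_url = "http://192.168.1.50:11434/v1"  # non-default host/port (example)
model_prefixes = ["ollama/"]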
Docker Compose¶
Run Ollama alongside the liter-llm proxy for a self-contained local setup:
# docker-compose.local.yaml
services:
  ollama:
    image: ollama/ollama:latest
    ports:
      - "11434:11434"
    volumes:
      - ollama_data:/root/.ollama

  liter-llm:
    image: ghcr.io/kreuzberg-dev/liter-llm:latest
    ports:
      - "4000:4000"
    environment:
      - LITER_LLM_API_KEY=""
    volumes:
      - ./liter-llm-proxy.toml:/etc/liter-llm/liter-llm-proxy.toml
    depends_on:
      - ollama

volumes:
  ollama_data:
Example proxy config for local use:
# liter-llm-proxy.toml
[server]
host = "0.0.0.0"
port = 4000
[[providers]]
name = "ollama"
base_url = "http://ollama:11434/v1"
model_prefixes = ["ollama/"]
Start the stack:
docker compose -f docker-compose.local.yaml up -d
# Pull a model into Ollama
docker exec -it $(docker compose -f docker-compose.local.yaml ps -q ollama) \
ollama pull qwen2:0.5b
# Chat via the proxy
curl http://localhost:4000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{"model": "ollama/qwen2:0.5b", "messages": [{"role": "user", "content": "Hello!"}]}'
Troubleshooting¶
Connection Refused¶
The local server is not running or is on a different port. Verify:
# Check if Ollama is running
curl http://localhost:11434/v1/models
# Check if the port is in use
lsof -i :11434
Tip
Make sure the server is started before making requests. Ollama starts automatically on macOS but may need ollama serve on Linux.
Model Not Found¶
The model is not downloaded. Pull it first:
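For Ollama, for example:

# Download the model referenced in the request
ollama pull qwen2:0.5b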
Timeout Errors¶
Local models can be slow to load on first request (especially large models). Increase the timeout:
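A sketch in Python, assuming the client exposes a timeout option (the parameter name is an assumption; check the client configuration reference for the exact setting):

from liter_llm import LlmClient

# Assumed timeout option, in seconds; large models may need several
# minutes to load into memory on the first request
client = LlmClient(
    api_key="",
    base_url="http://localhost:11434/v1",
    timeout=300,
)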
Docker Networking¶
When running liter-llm in Docker and a local provider on the host:
- Linux: Use `http://host.docker.internal:11434/v1` or `http://172.17.0.1:11434/v1`
- macOS/Windows: Use `http://host.docker.internal:11434/v1`
# liter-llm-proxy.toml (inside Docker)
[[providers]]
name = "ollama"
base_url = "http://host.docker.internal:11434/v1"
model_prefixes = ["ollama/"]
GPU / Performance¶
- Ollama: Automatically uses the GPU if available. Check with `ollama ps`.
- vLLM: Pass `--tensor-parallel-size N` for multi-GPU.
- llama.cpp: Use `-ngl N` to offload N layers to the GPU.
- LocalAI: Set the `GPU_LAYERS` environment variable.