Why Run Your Own OpenAI-Compatible API?
You are building an application that integrates with OpenAI’s API. Everything works great in development. But as you move toward production, the concerns start to surface:
- API costs add up fast. Every GPT-4 call, every embedding request, every chat completion — they all flow to OpenAI’s billing.
- Data leaves your infrastructure. Prompts, responses, and usage patterns are all sent to a third party.
- Rate limits can block your app. Hit the API limit, and users see errors.
- You need to test offline. Development on a laptop is fine, but CI/CD pipelines and air-gapped environments need local inference.
LocalAI solves these problems by giving you a drop-in OpenAI API replacement that runs entirely on your own hardware. Same endpoints. Same request and response formats. Same error codes. Your existing code that calls https://api.openai.com/v1 just works when you point it at http://localhost:8080/v1 instead.
We see this pattern frequently with teams building AI features: they prototype against OpenAI's API while usage is small, then deploy to production and watch the costs spike. LocalAI gives you development-to-production parity without any code changes.
What You’ll Need
To run LocalAI in production, you need:
- CPU: 4+ cores (more for larger models)
- RAM: 8GB minimum, 16GB+ recommended for larger models like Llama 2
- Storage: 50GB+ SSD (model files can be large)
- GPU: Optional but significantly faster for inference. LocalAI supports NVIDIA, AMD, and Apple Metal.
- OS: Ubuntu 22.04 or similar Linux distribution
- Docker: 20.10+ with Docker Compose
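The RAM numbers above follow from a simple rule of thumb: a quantized model's file size is roughly parameters × bits-per-weight ÷ 8, and you need at least that much free RAM plus headroom for the context cache. A quick back-of-the-envelope sketch (the formula is an approximation, not an exact GGUF size):

```python
def estimate_model_gb(params_billions: float, bits_per_weight: float) -> float:
    """Rough file-size estimate for a quantized model: params * bits / 8.

    Real GGUF files run slightly larger (metadata, mixed-precision layers),
    and inference needs extra RAM for the KV cache, so treat this as a
    lower bound when sizing a server.
    """
    bytes_total = params_billions * 1e9 * bits_per_weight / 8
    return bytes_total / 1e9  # decimal gigabytes

# A 7B model at 4-bit quantization is roughly 3.5 GB on disk, which is
# why 8 GB RAM is a workable minimum for a 7B Q4 model.
print(round(estimate_model_gb(7, 4), 1))  # 3.5
```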
For Canadian hosting with GPU options, Canadian Web Hosting’s GPU Servers provide NVIDIA-powered instances with 24/7 support. For CPU-only deployments, their Cloud VPS plans start at competitive rates with Canadian data centres in Vancouver and Toronto.
Installing LocalAI
Step 1: Create the Project Directory
Create a dedicated directory for LocalAI configuration and model storage:
mkdir -p ~/localai/{models,config}
cd ~/localai
Step 2: Create the Docker Compose File
Create docker-compose.yml with LocalAI configuration:
cat > ~/localai/docker-compose.yml << 'EOF'
services:
  localai:
    image: localai/localai:latest
    container_name: localai
    ports:
      - "8080:8080"
    volumes:
      - ./models:/build/models
      - ./config:/build/config
    environment:
      - THREADS=4
      - CONTEXT_SIZE=512
    restart: unless-stopped
    healthcheck:
      test: ["CMD", "curl", "-f", "http://localhost:8080/ready"]
      interval: 30s
      timeout: 10s
      retries: 3
EOF
The key environment variables:
- THREADS=4: number of parallel inference threads (match your CPU core count)
- CONTEXT_SIZE=512: context window in tokens (adjust based on your models)
Step 3: Pull and Start LocalAI
cd ~/localai
docker compose up -d
Verify LocalAI is running:
docker compose ps
# Should show localai as "up" or "healthy"
docker compose logs --tail 20 localai
You should see LocalAI start up and begin listening on port 8080. The first startup takes a few seconds as the container initializes.
Downloading Your First Model
LocalAI does not include models by default — you download them explicitly. For OpenAI compatibility, we recommend starting with a conversational model like Llama 2 or Mistral.
Using the Built-in Gallery
LocalAI includes a gallery of popular models. List the models available for install from the gallery:
curl http://localhost:8080/models/available
Install a model from the gallery:
# Install Llama 2 (quantized for CPU)
curl http://localhost:8080/models/apply -H "Content-Type: application/json" -d '{"id": "llama-2-7b-chat-q4"}'
# Install Mistral 7B (good balance of speed and quality)
curl http://localhost:8080/models/apply -H "Content-Type: application/json" -d '{"id": "mistral-7b-q4"}'
Using Custom Models
For models not in the gallery, download them manually:
cd ~/localai/models
# Download a GGUF model from Hugging Face
wget https://huggingface.co/TheBloke/Mistral-7B-GGUF/resolve/main/mistral-7b.Q4_K_M.gguf
# LocalAI auto-detects .gguf files in the models directory
Verify the model is loaded:
curl http://localhost:8080/v1/models | jq
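If you would rather verify from application code than with jq, the /v1/models response follows OpenAI's list format ({"object": "list", "data": [{"id": ...}]}), so extracting the installed model IDs is a few lines of Python. A sketch assuming that response shape (the example body is illustrative):

```python
import json

def list_model_ids(models_response: str) -> list[str]:
    """Extract model IDs from a /v1/models response body."""
    payload = json.loads(models_response)
    return [entry["id"] for entry in payload.get("data", [])]

# Example body in OpenAI's list format (IDs here are illustrative):
body = '{"object": "list", "data": [{"id": "mistral-7b.Q4_K_M.gguf"}]}'
print(list_model_ids(body))  # ['mistral-7b.Q4_K_M.gguf']
```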
Configuration for OpenAI Compatibility
LocalAI is designed to be a drop-in replacement. The base URL changes, but the endpoints remain the same.
API Endpoint Mapping
| OpenAI Endpoint | LocalAI Equivalent |
|---|---|
| /v1/chat/completions | /v1/chat/completions |
| /v1/completions | /v1/completions |
| /v1/embeddings | /v1/embeddings |
| /v1/models | /v1/models |
| /v1/audio/transcriptions | /v1/audio/transcriptions |
| /v1/images/generations | /v1/images/generations |
Environment Variables for Your Application
Point your application at LocalAI instead of OpenAI:
# Before (OpenAI)
OPENAI_API_BASE=https://api.openai.com/v1
OPENAI_API_KEY=sk-your-key-here
# After (LocalAI)
OPENAI_API_BASE=http://localhost:8080/v1
OPENAI_API_KEY=not-needed # LocalAI ignores this but SDKs require it
For production deployments behind a reverse proxy, use your server's domain:
OPENAI_API_BASE=https://localai.yourdomain.com/v1
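Because both values are plain environment variables, switching backends can be a pure configuration change. A minimal sketch of how client code might resolve them, falling back to a local LocalAI instance (the defaults and helper name are illustrative, not part of any SDK):

```python
import os

def resolve_openai_config() -> tuple[str, str]:
    """Read the API base URL and key from the environment.

    Falls back to a local LocalAI instance, so the same code runs
    against OpenAI in production and LocalAI in development.
    """
    base = os.environ.get("OPENAI_API_BASE", "http://localhost:8080/v1")
    key = os.environ.get("OPENAI_API_KEY", "not-needed")
    return base.rstrip("/"), key

os.environ["OPENAI_API_BASE"] = "http://localhost:8080/v1"
base, key = resolve_openai_config()
print(base + "/chat/completions")
```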
Verify It Works
Test Chat Completion
Test the chat completions endpoint with curl:
curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "llama-2-7b-chat-q4",
    "messages": [
      {"role": "user", "content": "Say hello in one word"}
    ]
  }'
You should receive a JSON response with the completion:
{
  "choices": [{
    "message": {
      "role": "assistant",
      "content": "Hello!"
    },
    "finish_reason": "stop"
  }],
  "model": "llama-2-7b-chat-q4",
  "usage": {
    "prompt_tokens": 12,
    "completion_tokens": 3,
    "total_tokens": 15
  }
}
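Since the response shape matches OpenAI's, the same parsing code works against either backend. A small sketch that pulls the assistant message and token usage out of a response body like the one above:

```python
import json

def parse_chat_completion(body: str) -> tuple[str, int]:
    """Return (assistant_text, total_tokens) from a chat completion body."""
    payload = json.loads(body)
    text = payload["choices"][0]["message"]["content"]
    total = payload["usage"]["total_tokens"]
    return text, total

body = """{
  "choices": [{"message": {"role": "assistant", "content": "Hello!"},
               "finish_reason": "stop"}],
  "model": "llama-2-7b-chat-q4",
  "usage": {"prompt_tokens": 12, "completion_tokens": 3, "total_tokens": 15}
}"""
text, total = parse_chat_completion(body)
print(text, total)  # Hello! 15
```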
Test Embeddings
Embeddings are useful for semantic search and RAG applications:
curl http://localhost:8080/v1/embeddings \
  -H "Content-Type: application/json" \
  -d '{
    "model": "all-MiniLM-L6-v2",
    "input": "Hello world"
  }'
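Once you have embedding vectors back, semantic search is just vector comparison. A self-contained sketch of cosine similarity in plain Python (the 3-dimensional vectors are toy stand-ins for real embedding output):

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two embedding vectors (1.0 = identical direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Toy 3-dimensional vectors; real all-MiniLM-L6-v2 embeddings have 384 dims.
print(cosine_similarity([1.0, 0.0, 0.0], [1.0, 0.0, 0.0]))  # 1.0
print(cosine_similarity([1.0, 0.0, 0.0], [0.0, 1.0, 0.0]))  # 0.0
```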
Streaming Responses
For real-time responses, enable streaming:
curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "llama-2-7b-chat-q4",
    "messages": [{"role": "user", "content": "Count from 1 to 10"}],
    "stream": true
  }'
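With "stream": true the server replies with Server-Sent Events in OpenAI's streaming format: each chunk arrives as a `data:` line containing a JSON delta, and the stream ends with `data: [DONE]`. A sketch that reassembles the full reply from those lines (the chunk contents here are illustrative):

```python
import json

def assemble_stream(lines: list[str]) -> str:
    """Concatenate content deltas from SSE 'data:' lines."""
    parts = []
    for line in lines:
        if not line.startswith("data:"):
            continue  # skip blank keep-alive lines
        payload = line[len("data:"):].strip()
        if payload == "[DONE]":
            break  # end-of-stream sentinel
        chunk = json.loads(payload)
        delta = chunk["choices"][0]["delta"]
        parts.append(delta.get("content", ""))
    return "".join(parts)

stream = [
    'data: {"choices": [{"delta": {"content": "1, "}}]}',
    'data: {"choices": [{"delta": {"content": "2, 3"}}]}',
    "data: [DONE]",
]
print(assemble_stream(stream))  # 1, 2, 3
```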
Production Hardening
Set Up a Reverse Proxy with Caddy
For production, put LocalAI behind a reverse proxy with HTTPS. Caddy handles Let's Encrypt certificates automatically:
cat > /etc/caddy/Caddyfile <<'EOF'
localai.yourdomain.com {
    reverse_proxy localhost:8080 {
        header_up Host {host}
        header_up X-Real-IP {remote_host}
    }
}
EOF
systemctl reload caddy
Configure Firewall Rules
Restrict access to LocalAI's direct port and allow only the reverse proxy:
# Allow only localhost access to LocalAI direct port
ufw allow from 127.0.0.1 to any port 8080
# Allow HTTPS through firewall
ufw allow 443/tcp
# Reload firewall
ufw reload
Set Up Backups
Back up your downloaded models and configuration:
cat > ~/localai/backup.sh <<'EOF'
#!/bin/bash
BACKUP_DIR="/backup/localai"
DATE=$(date +%Y%m%d_%H%M%S)
mkdir -p "$BACKUP_DIR"
# Back up the models directory (incremental with rsync)
rsync -av --delete ~/localai/models/ "$BACKUP_DIR/models/"
# Back up the config
cp -r ~/localai/config/ "$BACKUP_DIR/config_$DATE/"
# Remove config backups older than 7 days
find "$BACKUP_DIR" -name "config_*" -type d -mtime +7 -exec rm -rf {} +
echo "Backup completed at $DATE"
EOF
chmod +x ~/localai/backup.sh
Add to crontab for daily backups:
# Daily backup at 2 AM
0 2 * * * /home/youruser/localai/backup.sh >> /var/log/localai-backup.log 2>&1
Troubleshooting
Model Not Found Error
Symptom: API returns {"error": {"message": "model not found", "type": "invalid_request_error"}}
Cause: The model name does not match any downloaded models.
Fix: List available models and use the exact name:
curl http://localhost:8080/v1/models | jq -r '.data[].id'
Out of Memory Errors
Symptom: Container crashes or becomes unresponsive during inference.
Cause: Model requires more RAM than available, or context size is too large.
Fix: Reduce context size or use a smaller quantization:
# In docker-compose.yml, reduce context size
CONTEXT_SIZE=256
# Or use a smaller model quantization (Q4 instead of Q8)
curl http://localhost:8080/models/apply -H "Content-Type: application/json" -d '{"id": "llama-2-7b-chat-q2"}'
Slow Response Times
Symptom: API requests take 30+ seconds to complete.
Cause: Running on CPU without GPU acceleration, or model is too large.
Fix: Enable GPU support or use a smaller model:
# For NVIDIA GPU support, add to the localai service in docker-compose.yml:
deploy:
  resources:
    reservations:
      devices:
        - driver: nvidia
          count: 1
          capabilities: [gpu]
# Or use a smaller, faster model
curl http://localhost:8080/models/apply -H "Content-Type: application/json" -d '{"id": "tinyllama-1.1b-chat"}'
Connection Refused
Symptom: curl: (7) Failed to connect to localhost port 8080
Cause: Container is not running or port is blocked.
Fix: Check container status and restart if needed:
docker compose ps
docker compose restart localai
# Check logs for errors
docker compose logs --tail 50 localai
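Connection failures during a container restart are transient, so client code is more resilient with a short retry loop rather than failing on the first refused connection. A generic sketch with exponential backoff (the callable and timings are placeholders, not a LocalAI-specific API):

```python
import time

def call_with_retry(fn, attempts: int = 3, base_delay: float = 0.5):
    """Call fn(), retrying on ConnectionError with exponential backoff."""
    for attempt in range(attempts):
        try:
            return fn()
        except ConnectionError:
            if attempt == attempts - 1:
                raise  # out of retries: surface the error
            time.sleep(base_delay * 2 ** attempt)

# Simulate a server that refuses the first connection, then recovers.
state = {"calls": 0}
def flaky():
    state["calls"] += 1
    if state["calls"] < 2:
        raise ConnectionError("connection refused")
    return "ok"

print(call_with_retry(flaky, base_delay=0.01))  # ok
```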
When to Use LocalAI vs Alternatives
LocalAI is not the only option for self-hosted AI inference. Here is how to choose:
| Scenario | Recommended Tool | Why |
|---|---|---|
| Need OpenAI API compatibility | LocalAI | Drop-in replacement, no code changes |
| Running models locally on desktop | LM Studio or Ollama | Better UI, easier setup for personal use |
| Production inference at scale | vLLM or TensorRT-LLM | Higher throughput, better GPU utilization |
| Simple chat interface | Open WebUI | ChatGPT-like interface, easier for end users |
Next Steps
With LocalAI running, you can:
- Replace OpenAI in your applications. Change the API base URL and test locally before deployment.
- Build RAG pipelines. Combine LocalAI embeddings with vector databases for semantic search.
- Create AI-powered features. Add chat, completion, and image generation to your apps without API costs.
For teams building AI features, self-hosting with LocalAI on a Canadian Cloud VPS gives you full control over your data and costs. If you need GPU acceleration for larger models, GPU Servers provide the compute power without the complexity of managing your own hardware.
Need help setting this up? Canadian Web Hosting's Managed Support team can handle the installation, configuration, and ongoing maintenance — so you can focus on building your application.