Why Run Your Own OpenAI-Compatible API?
You are building an application that integrates with OpenAI’s API. Everything works great in development. But as you move toward production, the concerns start to surface:
- API costs add up fast. Every GPT-4 call, every embedding request, every chat completion — they all flow to OpenAI’s billing.
- Data leaves your infrastructure. Prompts, responses, and usage patterns are all sent to a third party.
- Rate limits can block your app. Hit the API limit, and users see errors.
- You need to test offline. Development on a laptop is fine, but CI/CD pipelines and air-gapped environments need local inference.
LocalAI solves these problems by giving you a drop-in OpenAI API replacement that runs entirely on your own hardware. Same endpoints. Same request and response formats. Same error codes. Your existing code that calls https://api.openai.com/v1 just works when you point it at http://localhost:8080/v1 instead.
We see this pattern frequently with teams building AI features: they prototype against OpenAI's API while usage is small, then deploy to production and watch the costs spike. LocalAI gives you development-to-production parity without any code changes.
What You’ll Need
To run LocalAI in production, you need:
- CPU: 4+ cores (more for larger models)
- RAM: 8GB minimum, 16GB+ recommended for larger models like Llama 2
- Storage: 50GB+ SSD (model files can be large)
- GPU: Optional but significantly faster for inference. LocalAI supports NVIDIA, AMD, and Apple Metal.
- OS: Ubuntu 22.04 or similar Linux distribution
- Docker: 20.10+ with Docker Compose
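The RAM numbers above follow from a simple rule of thumb: a quantized model's file size is roughly parameters × bits-per-weight ÷ 8, and you need at least that much free RAM plus headroom for the context cache. A quick back-of-the-envelope sketch (the formula is an approximation, not an exact GGUF size):

```python
def estimate_model_gb(params_billions: float, bits_per_weight: float) -> float:
    """Rough file-size estimate for a quantized model: params * bits / 8.

    Real GGUF files run slightly larger (metadata, mixed-precision layers),
    and inference needs extra RAM for the KV cache, so treat this as a
    lower bound when sizing a server.
    """
    bytes_total = params_billions * 1e9 * bits_per_weight / 8
    return bytes_total / 1e9  # decimal gigabytes

# A 7B model at 4-bit quantization is roughly 3.5 GB on disk, which is
# why 8 GB RAM is a workable minimum for a 7B Q4 model.
print(round(estimate_model_gb(7, 4), 1))  # 3.5
```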
For Canadian hosting with GPU options, Canadian Web Hosting’s GPU Servers provide NVIDIA-powered instances with 24/7 support. For CPU-only deployments, their Cloud VPS plans start at competitive rates with Canadian data centres in Vancouver and Toronto.
Installing LocalAI
Step 1: Create the Project Directory
Create a dedicated directory for LocalAI configuration and model storage:
mkdir -p ~/localai/{models,config}
cd ~/localai
Step 2: Create the Docker Compose File
Create docker-compose.yml with LocalAI configuration:
cat > ~/localai/docker-compose.yml << 'EOF'
services:
  localai:
    image: localai/localai:latest
    container_name: localai
    ports:
      - "8080:8080"
    volumes:
      - ./models:/build/models
      - ./config:/build/config
    environment:
      - THREADS=4
      - CONTEXT_SIZE=512
    restart: unless-stopped
    healthcheck:
      test: ["CMD", "curl", "-f", "http://localhost:8080/ready"]
      interval: 30s
      timeout: 10s
      retries: 3
EOF
The key environment variables:
- THREADS=4: number of parallel inference threads (match your CPU core count)
- CONTEXT_SIZE=512: context window in tokens (adjust based on your models)
Step 3: Pull and Start LocalAI
cd ~/localai
docker compose up -d
Verify LocalAI is running:
docker compose ps
# Should show localai as "up" or "healthy"
docker compose logs --tail 20 localai
You should see LocalAI start up and begin listening on port 8080. The first startup takes a few seconds as the container initializes.
Downloading Your First Model
LocalAI does not include models by default — you download them explicitly. For OpenAI compatibility, we recommend starting with a conversational model like Llama 2 or Mistral.
Using the Built-in Gallery
LocalAI includes a gallery of popular models. List the models available for install from the gallery:
curl http://localhost:8080/models/available
Install a model from the gallery:
# Install Llama 2 (quantized for CPU)
curl http://localhost:8080/models/apply -H "Content-Type: application/json" -d '{"id": "llama-2-7b-chat-q4"}'
# Install Mistral 7B (good balance of speed and quality)
curl http://localhost:8080/models/apply -H "Content-Type: application/json" -d '{"id": "mistral-7b-q4"}'
Using Custom Models
For models not in the gallery, download them manually:
cd ~/localai/models
# Download a GGUF model from Hugging Face
wget https://huggingface.co/TheBloke/Mistral-7B-GGUF/resolve/main/mistral-7b.Q4_K_M.gguf
# LocalAI auto-detects .gguf files in the models directory
Verify the model is loaded:
curl http://localhost:8080/v1/models | jq
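If you would rather verify from application code than with jq, the /v1/models response follows OpenAI's list format ({"object": "list", "data": [{"id": ...}]}), so extracting the installed model IDs is a few lines of Python. A sketch assuming that response shape (the example body is illustrative):

```python
import json

def list_model_ids(models_response: str) -> list[str]:
    """Extract model IDs from a /v1/models response body."""
    payload = json.loads(models_response)
    return [entry["id"] for entry in payload.get("data", [])]

# Example body in OpenAI's list format (IDs here are illustrative):
body = '{"object": "list", "data": [{"id": "mistral-7b.Q4_K_M.gguf"}]}'
print(list_model_ids(body))  # ['mistral-7b.Q4_K_M.gguf']
```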
Configuration for OpenAI Compatibility
LocalAI is designed to be a drop-in replacement. The base URL changes, but the endpoints remain the same.
API Endpoint Mapping
| OpenAI Endpoint | LocalAI Equivalent |
|---|---|
| /v1/chat/completions | /v1/chat/completions |
| /v1/completions | /v1/completions |
| /v1/embeddings | /v1/embeddings |
| /v1/models | /v1/models |
| /v1/audio/transcriptions | /v1/audio/transcriptions |
| /v1/images/generations | /v1/images/generations |
Environment Variables for Your Application
Point your application at LocalAI instead of OpenAI:
# Before (OpenAI)
OPENAI_API_BASE=https://api.openai.com/v1
OPENAI_API_KEY=sk-your-key-here
# After (LocalAI)
OPENAI_API_BASE=http://localhost:8080/v1
OPENAI_API_KEY=not-needed # LocalAI ignores this but SDKs require it
For production deployments behind a reverse proxy, use your server's domain:
OPENAI_API_BASE=https://localai.yourdomain.com/v1
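Because both values are plain environment variables, switching backends can be a pure configuration change. A minimal sketch of how client code might resolve them, falling back to a local LocalAI instance (the defaults and helper name are illustrative, not part of any SDK):

```python
import os

def resolve_openai_config() -> tuple[str, str]:
    """Read the API base URL and key from the environment.

    Falls back to a local LocalAI instance, so the same code runs
    against OpenAI in production and LocalAI in development.
    """
    base = os.environ.get("OPENAI_API_BASE", "http://localhost:8080/v1")
    key = os.environ.get("OPENAI_API_KEY", "not-needed")
    return base.rstrip("/"), key

os.environ["OPENAI_API_BASE"] = "http://localhost:8080/v1"
base, key = resolve_openai_config()
print(base + "/chat/completions")
```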
Verify It Works
Test Chat Completion
Test the chat completions endpoint with curl:
curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "llama-2-7b-chat-q4",
    "messages": [
      {"role": "user", "content": "Say hello in one word"}
    ]
  }'
You should receive a JSON response with the completion:
{
  "choices": [{
    "message": {
      "role": "assistant",
      "content": "Hello!"
    },
    "finish_reason": "stop"
  }],
  "model": "llama-2-7b-chat-q4",
  "usage": {
    "prompt_tokens": 12,
    "completion_tokens": 3,
    "total_tokens": 15
  }
}
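Since the response shape matches OpenAI's, the same parsing code works against either backend. A small sketch that pulls the assistant message and token usage out of a response body like the one above:

```python
import json

def parse_chat_completion(body: str) -> tuple[str, int]:
    """Return (assistant_text, total_tokens) from a chat completion body."""
    payload = json.loads(body)
    text = payload["choices"][0]["message"]["content"]
    total = payload["usage"]["total_tokens"]
    return text, total

body = """{
  "choices": [{"message": {"role": "assistant", "content": "Hello!"},
               "finish_reason": "stop"}],
  "model": "llama-2-7b-chat-q4",
  "usage": {"prompt_tokens": 12, "completion_tokens": 3, "total_tokens": 15}
}"""
text, total = parse_chat_completion(body)
print(text, total)  # Hello! 15
```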
Test Embeddings
Embeddings are useful for semantic search and RAG applications:
curl http://localhost:8080/v1/embeddings \
  -H "Content-Type: application/json" \
  -d '{
    "model": "all-MiniLM-L6-v2",
    "input": "Hello world"
  }'
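Once you have embedding vectors back, semantic search is just vector comparison. A self-contained sketch of cosine similarity in plain Python (the 3-dimensional vectors are toy stand-ins for real embedding output):

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two embedding vectors (1.0 = identical direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Toy 3-dimensional vectors; real all-MiniLM-L6-v2 embeddings have 384 dims.
print(cosine_similarity([1.0, 0.0, 0.0], [1.0, 0.0, 0.0]))  # 1.0
print(cosine_similarity([1.0, 0.0, 0.0], [0.0, 1.0, 0.0]))  # 0.0
```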
Streaming Responses
For real-time responses, enable streaming:
curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "llama-2-7b-chat-q4",
    "messages": [{"role": "user", "content": "Count from 1 to 10"}],
    "stream": true
  }'
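With "stream": true the server replies with Server-Sent Events in OpenAI's streaming format: each chunk arrives as a `data:` line containing a JSON delta, and the stream ends with `data: [DONE]`. A sketch that reassembles the full reply from those lines (the chunk contents here are illustrative):

```python
import json

def assemble_stream(lines: list[str]) -> str:
    """Concatenate content deltas from SSE 'data:' lines."""
    parts = []
    for line in lines:
        if not line.startswith("data:"):
            continue  # skip blank keep-alive lines
        payload = line[len("data:"):].strip()
        if payload == "[DONE]":
            break  # end-of-stream sentinel
        chunk = json.loads(payload)
        delta = chunk["choices"][0]["delta"]
        parts.append(delta.get("content", ""))
    return "".join(parts)

stream = [
    'data: {"choices": [{"delta": {"content": "1, "}}]}',
    'data: {"choices": [{"delta": {"content": "2, 3"}}]}',
    "data: [DONE]",
]
print(assemble_stream(stream))  # 1, 2, 3
```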
Production Hardening
Set Up a Reverse Proxy with Caddy
For production, put LocalAI behind a reverse proxy with HTTPS. Caddy handles Let's Encrypt certificates automatically:
cat > /etc/caddy/Caddyfile <<'EOF'
localai.yourdomain.com {
    reverse_proxy localhost:8080 {
        header_up Host {host}
        header_up X-Real-IP {remote_host}
    }
}
EOF
systemctl reload caddy
Configure Firewall Rules
Restrict access to LocalAI's direct port and allow only the reverse proxy:
# Allow only localhost access to LocalAI direct port
ufw allow from 127.0.0.1 to any port 8080
# Allow HTTPS through firewall
ufw allow 443/tcp
# Reload firewall
ufw reload
Set Up Backups
Back up your downloaded models and configuration:
cat > ~/localai/backup.sh <<'EOF'
#!/bin/bash
BACKUP_DIR="/backup/localai"
DATE=$(date +%Y%m%d_%H%M%S)
mkdir -p "$BACKUP_DIR"
# Back up the models directory (incremental with rsync)
rsync -av --delete ~/localai/models/ "$BACKUP_DIR/models/"
# Back up the config
cp -r ~/localai/config/ "$BACKUP_DIR/config_$DATE/"
# Remove config backups older than 7 days
find "$BACKUP_DIR" -name "config_*" -type d -mtime +7 -exec rm -rf {} +
echo "Backup completed at $DATE"
EOF
chmod +x ~/localai/backup.sh
Add to crontab for daily backups:
# Daily backup at 2 AM
0 2 * * * /home/youruser/localai/backup.sh >> /var/log/localai-backup.log 2>&1
Troubleshooting
Model Not Found Error
Symptom: API returns {"error": {"message": "model not found", "type": "invalid_request_error"}}
Cause: The model name does not match any downloaded models.
Fix: List available models and use the exact name:
curl http://localhost:8080/v1/models | jq -r '.data[].id'
Out of Memory Errors
Symptom: Container crashes or becomes unresponsive during inference.
Cause: Model requires more RAM than available, or context size is too large.
Fix: Reduce context size or use a smaller quantization:
# In docker-compose.yml, reduce context size
CONTEXT_SIZE=256
# Or use a smaller model quantization (Q4 instead of Q8)
curl http://localhost:8080/models/apply -H "Content-Type: application/json" -d '{"id": "llama-2-7b-chat-q2"}'
Slow Response Times
Symptom: API requests take 30+ seconds to complete.
Cause: Running on CPU without GPU acceleration, or model is too large.
Fix: Enable GPU support or use a smaller model:
# For NVIDIA GPU support, add to the localai service in docker-compose.yml:
deploy:
  resources:
    reservations:
      devices:
        - driver: nvidia
          count: 1
          capabilities: [gpu]
# Or use a smaller, faster model
curl http://localhost:8080/models/apply -H "Content-Type: application/json" -d '{"id": "tinyllama-1.1b-chat"}'
Connection Refused
Symptom: curl: (7) Failed to connect to localhost port 8080
Cause: Container is not running or port is blocked.
Fix: Check container status and restart if needed:
docker compose ps
docker compose restart localai
# Check logs for errors
docker compose logs --tail 50 localai
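Connection failures during a container restart are transient, so client code is more resilient with a short retry loop rather than failing on the first refused connection. A generic sketch with exponential backoff (the callable and timings are placeholders, not a LocalAI-specific API):

```python
import time

def call_with_retry(fn, attempts: int = 3, base_delay: float = 0.5):
    """Call fn(), retrying on ConnectionError with exponential backoff."""
    for attempt in range(attempts):
        try:
            return fn()
        except ConnectionError:
            if attempt == attempts - 1:
                raise  # out of retries: surface the error
            time.sleep(base_delay * 2 ** attempt)

# Simulate a server that refuses the first connection, then recovers.
state = {"calls": 0}
def flaky():
    state["calls"] += 1
    if state["calls"] < 2:
        raise ConnectionError("connection refused")
    return "ok"

print(call_with_retry(flaky, base_delay=0.01))  # ok
```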
When to Use LocalAI vs Alternatives
LocalAI is not the only option for self-hosted AI inference. Here is how to choose:
| Scenario | Recommended Tool | Why |
|---|---|---|
| Need OpenAI API compatibility | LocalAI | Drop-in replacement, no code changes |
| Running models locally on desktop | LM Studio or Ollama | Better UI, easier setup for personal use |
| Production inference at scale | vLLM or TensorRT-LLM | Higher throughput, better GPU utilization |
| Simple chat interface | Open WebUI | ChatGPT-like interface, easier for end users |
Next Steps
With LocalAI running, you can:
- Replace OpenAI in your applications. Change the API base URL and test locally before deployment.
- Build RAG pipelines. Combine LocalAI embeddings with vector databases for semantic search.
- Create AI-powered features. Add chat, completion, and image generation to your apps without API costs.
For teams building AI features, self-hosting with LocalAI on a Canadian Cloud VPS gives you full control over your data and costs. If you need GPU acceleration for larger models, GPU Servers provide the compute power without the complexity of managing your own hardware.
Need help setting this up? Canadian Web Hosting's Managed Support team can handle the installation, configuration, and ongoing maintenance — so you can focus on building your application.