Why Run Your Own LLM?

Every time you send a prompt to ChatGPT or Claude, your data leaves your infrastructure. For most people, that’s fine. But if you’re working with customer data, proprietary code, or regulated information, that’s a problem.

Self-hosting an LLM changes the equation. Your prompts never leave your server. You get consistent performance without API rate limits. And you control the costs — no surprise bills when your team gets enthusiastic about AI.

Ollama has become the easiest way to run large language models locally. It handles model management, GPU acceleration, and provides a simple API that works with most AI tooling. Here’s how to set up Ollama for production use on a dedicated server or VPS.

What You’ll Need

For a production Ollama instance, we recommend:

Component | Minimum | Recommended
CPU | 4 cores | 8+ cores
RAM | 16 GB | 32 GB
GPU | NVIDIA with 8 GB VRAM | NVIDIA with 16 GB+ VRAM
Storage | 50 GB SSD | 100 GB SSD
OS | Ubuntu 22.04/24.04 | Ubuntu 24.04 LTS

For small models (1B–8B parameters), a single GPU with 8 GB+ VRAM works well. For larger models (70B parameters), you’ll want multiple GPUs or a card with substantially more VRAM.
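As a rough back-of-the-envelope check before provisioning hardware, you can estimate VRAM needs from parameter count. The sketch below assumes 4-bit quantization (~0.55 bytes per parameter) plus a fixed overhead for the KV cache and runtime buffers; both figures are ballpark assumptions, not Ollama-documented values:

```python
def estimate_vram_gb(params_billion: float,
                     bytes_per_param: float = 0.55,
                     overhead_gb: float = 1.5) -> float:
    """Ballpark VRAM needed for a 4-bit-quantized model.

    bytes_per_param ~0.55 approximates Q4 quantization; overhead_gb
    covers the KV cache and runtime buffers. Both are rough assumptions.
    """
    return params_billion * bytes_per_param + overhead_gb

for size in (1, 7, 8, 70):
    print(f"{size}B params -> roughly {estimate_vram_gb(size):.1f} GB VRAM")
```

By this estimate an 8B model fits in 8 GB of VRAM with a little headroom, while a 70B model needs roughly 40 GB, which is why larger models push you toward multi-GPU setups.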

Can I run Ollama without a GPU? Yes, but expect significantly slower inference. CPU-only works for testing, but production workloads need GPU acceleration for acceptable response times.

Installing Ollama

Step 1: Prepare Your Server

Start with a fresh Ubuntu server. Update the system and install prerequisites:

sudo apt update && sudo apt upgrade -y
sudo apt install -y curl wget ca-certificates gnupg

If you have an NVIDIA GPU, verify the drivers and CUDA toolkit:

nvidia-smi

You should see output listing your GPU. If not, install NVIDIA drivers first:

sudo apt install -y nvidia-driver-535 nvidia-cuda-toolkit
sudo reboot

Step 2: Install Ollama

Ollama provides a simple installation script:

curl -fsSL https://ollama.com/install.sh | sh

This downloads and installs Ollama to /usr/local/bin/ollama and sets up the systemd service.

Step 3: Start the Ollama Service

Enable and start Ollama as a system service:

sudo systemctl enable ollama
sudo systemctl start ollama
sudo systemctl status ollama

You should see active (running) in the status output.

Running Your First Model

Pulling a Model

Ollama supports many open-source models. Let’s start with Llama 3.2, an open model from Meta:

ollama pull llama3.2

This downloads the default 3B-parameter variant (~2.0 GB). For a smaller, faster model:

ollama pull llama3.2:1b

Other popular options:

Model | Size | Best For | Pull Command
Llama 3.2 3B | 2.0 GB | General-purpose, good balance | ollama pull llama3.2
Llama 3.2 1B | 1.3 GB | Fast responses, limited reasoning | ollama pull llama3.2:1b
Mistral 7B | 4.1 GB | Code, reasoning, European languages | ollama pull mistral
CodeLlama 7B | 3.8 GB | Code generation, debugging | ollama pull codellama
Gemma 2 9B | 5.5 GB | Google’s model, strong reasoning | ollama pull gemma2

Testing the Model

Run an interactive chat session:

ollama run llama3.2

You’ll see a prompt where you can chat with the model:

>>> What's the capital of Canada?
The capital of Canada is Ottawa, located in the province of Ontario.

Type /bye to exit the chat.

Using the Ollama API

Ollama exposes an HTTP API on port 11434, with its native endpoints under /api and an OpenAI-compatible endpoint under /v1. This makes it easy to integrate with existing AI tooling.

Generate Endpoint

Generate a completion:

curl http://localhost:11434/api/generate -d '{
  "model": "llama3.2",
  "prompt": "Why is Canadian web hosting important for data sovereignty?",
  "stream": false
}'
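The non-streaming response includes timing metadata alongside the generated text, including eval_count (tokens generated) and eval_duration (in nanoseconds). A small helper, shown here with sample values in the shape Ollama returns, turns those into a throughput figure, handy for comparing models or hardware:

```python
def tokens_per_second(resp: dict) -> float:
    """Generation speed from a non-streaming /api/generate response.

    eval_count is the number of generated tokens; eval_duration is
    reported in nanoseconds.
    """
    return resp["eval_count"] / (resp["eval_duration"] / 1e9)

# Sample values in the shape of a non-streaming response:
sample = {"model": "llama3.2", "response": "...", "done": True,
          "eval_count": 120, "eval_duration": 2_400_000_000}
print(f"{tokens_per_second(sample):.1f} tokens/s")  # -> 50.0 tokens/s
```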

Chat Endpoint

For multi-turn conversations:

curl http://localhost:11434/api/chat -d '{
  "model": "llama3.2",
  "messages": [
    {"role": "user", "content": "Explain PIPEDA in one paragraph"}
  ],
  "stream": false
}'
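To keep a conversation going programmatically, resend the full message history on every call; Ollama is stateless between requests. Here’s a minimal standard-library sketch of that pattern. The URL and model name are placeholders to adjust, and the response shape (a top-level "message" object for non-streaming chat) follows the endpoint above:

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/chat"  # adjust for your host

def build_payload(model: str, history: list, stream: bool = False) -> dict:
    """Assemble the /api/chat request body from accumulated turns."""
    return {"model": model, "messages": history, "stream": stream}

def chat(history: list, user_text: str, model: str = "llama3.2") -> str:
    """Append a user turn, call Ollama, then append and return the reply."""
    history.append({"role": "user", "content": user_text})
    req = urllib.request.Request(
        OLLAMA_URL,
        data=json.dumps(build_payload(model, history)).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        reply = json.load(resp)["message"]["content"]
    history.append({"role": "assistant", "content": reply})
    return reply
```

Because the history list grows with every turn, long-running sessions eventually exceed the model’s context window; truncating or summarizing old turns is left as an exercise.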

OpenAI Compatibility

Set your OpenAI client’s base URL to point at Ollama:

from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:11434/v1",
    api_key="ollama"  # Required but ignored
)

response = client.chat.completions.create(
    model="llama3.2",
    messages=[{"role": "user", "content": "Hello!"}]
)
print(response.choices[0].message.content)

This means any tool built for OpenAI’s API works with Ollama — just change the base URL.

Production Hardening

Configure Network Binding

By default, Ollama only listens on localhost. For production, you typically want it accessible from other services on your network.

Create an override for the systemd service:

sudo mkdir -p /etc/systemd/system/ollama.service.d
sudo tee /etc/systemd/system/ollama.service.d/override.conf <<'EOF'
[Service]
Environment="OLLAMA_HOST=0.0.0.0:11434"
EOF

sudo systemctl daemon-reload
sudo systemctl restart ollama

Security note: Binding to 0.0.0.0 exposes Ollama to all network interfaces. Always use a firewall to restrict access.
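The same override mechanism works for other Ollama environment variables that matter in production. The values below are illustrative tuning choices, not recommended defaults; check `ollama serve --help` or the Ollama FAQ for the defaults on your version:

```shell
sudo tee /etc/systemd/system/ollama.service.d/tuning.conf <<'EOF'
[Service]
# Keep a loaded model resident for 30 minutes instead of the default 5
Environment="OLLAMA_KEEP_ALIVE=30m"
# Handle up to 4 concurrent requests per loaded model
Environment="OLLAMA_NUM_PARALLEL=4"
# Allow two different models in memory at once (needs the VRAM for both)
Environment="OLLAMA_MAX_LOADED_MODELS=2"
EOF

sudo systemctl daemon-reload
sudo systemctl restart ollama
```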

Set Up Firewall Rules

Restrict access to trusted IPs only. Using UFW:

sudo ufw allow 22/tcp    # SSH
sudo ufw allow from 10.0.0.0/8 to any port 11434  # Internal network only
sudo ufw allow from 172.16.0.0/12 to any port 11434
sudo ufw allow from 192.168.0.0/16 to any port 11434
sudo ufw --force enable

For more comprehensive server hardening, see our guide on VPS Security Hardening in 30 Minutes.

Add a Reverse Proxy with HTTPS

For remote access, put Nginx in front of Ollama with SSL termination:

sudo apt install -y nginx certbot python3-certbot-nginx

sudo tee /etc/nginx/sites-available/ollama <<'EOF'
server {
    listen 80;
    server_name your-domain.com;

    location / {
        proxy_pass http://127.0.0.1:11434;
        proxy_set_header Host $host;
        proxy_set_header X-Real-IP $remote_addr;
        proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
        proxy_set_header X-Forwarded-Proto $scheme;
        
        # Increase timeout for long generations
        proxy_read_timeout 300s;
    }
}
EOF

sudo ln -sf /etc/nginx/sites-available/ollama /etc/nginx/sites-enabled/
sudo nginx -t && sudo systemctl reload nginx

sudo certbot --nginx -d your-domain.com

Enable Authentication

Ollama doesn’t have built-in authentication. For production, add authentication at the reverse proxy level or use a VPN.

A simple approach with HTTP basic auth:

sudo apt install -y apache2-utils
sudo htpasswd -c /etc/nginx/.htpasswd aiuser

# Add to your Nginx config inside the location block:
# auth_basic "Ollama API";
# auth_basic_user_file /etc/nginx/.htpasswd;
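Once basic auth is enabled, API clients must send a matching Authorization header. Here’s a small sketch of building one by hand (the username and password are placeholders):

```python
import base64

def basic_auth_header(user: str, password: str) -> dict:
    """Build the Authorization header that Nginx's auth_basic checks."""
    token = base64.b64encode(f"{user}:{password}".encode()).decode()
    return {"Authorization": f"Basic {token}"}

print(basic_auth_header("aiuser", "changeme"))
```

With the OpenAI Python client, a dict like this can be passed as default_headers when constructing the client, so every request through the proxy is authenticated.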

Integrating with Other Tools

Open WebUI

For a ChatGPT-like interface, self-host Open WebUI and point it at your Ollama instance. It provides conversation history, user management, and a polished chat interface.

AI Agents

Ollama works with AI agent frameworks like AutoGPT and CrewAI. If you’re building autonomous agents, see our comparison of self-hosted AI agent frameworks.

Troubleshooting

Model Download Fails

Symptom: ollama pull times out or shows connection errors.

Cause: Network issues or DNS problems.

Fix: Check connectivity and retry with debug logging enabled:

# Check connectivity
curl -I https://ollama.com

# Try again with verbose output
OLLAMA_DEBUG=1 ollama pull llama3.2

Out of Memory Errors

Symptom: Model fails to load with CUDA out of memory errors.

Cause: GPU VRAM insufficient for the model.

Fix: Use a smaller model or a more heavily quantized variant:

# List the models already downloaded and their sizes
ollama list

# Pull a smaller variant instead (e.g., the 1B model)
ollama pull llama3.2:1b

Or reduce how many layers are offloaded to the GPU (slower, but avoids exhausting VRAM) with the num_gpu parameter in an interactive session:

ollama run llama3.2
>>> /set parameter num_gpu 8

Slow Inference

Symptom: Responses take many seconds to generate.

Cause: Running on CPU or insufficient GPU layers.

Fix: Verify GPU detection:

# Check whether a loaded model is using the GPU
ollama ps

# The PROCESSOR column should show "100% GPU"

If GPU isn’t detected, check NVIDIA drivers and CUDA installation.

Service Won’t Start

Symptom: systemctl status ollama shows failed.

Cause: Port conflict or permission issues.

Fix: Check logs:

journalctl -u ollama -n 50 --no-pager

# Check for port conflicts
sudo lsof -i :11434

Monitoring and Maintenance

Log Management

Ollama logs to journald. View recent logs:

journalctl -u ollama -f

Model Updates

Update models periodically for improvements:

ollama pull llama3.2  # Re-pull to update
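If you have several models installed, re-pulling each by hand gets tedious. This sketch assumes the tabular layout of `ollama list` output (a header row, then one model per line with the name in the first column) and re-pulls everything found:

```python
import subprocess

def installed_models(list_output: str) -> list:
    """Extract model names (first column) from `ollama list` output,
    skipping the header row."""
    lines = list_output.strip().splitlines()
    return [line.split()[0] for line in lines[1:] if line.strip()]

if __name__ == "__main__":
    out = subprocess.run(["ollama", "list"], capture_output=True,
                         text=True, check=True).stdout
    for name in installed_models(out):
        subprocess.run(["ollama", "pull", name], check=True)
```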

Disk Space

Models are stored in /usr/share/ollama/.ollama/models. Check usage:

du -sh /usr/share/ollama/.ollama/models

Remove unused models:

ollama rm mistral  # Remove a specific model

When to Use Self-Hosted AI

Self-hosting an LLM isn’t always the right choice. Here’s when it makes sense:

Self-Host When… | Use SaaS When…
Processing sensitive or regulated data | Working with public data only
High volume makes API costs prohibitive | Sporadic usage
Need guaranteed latency and availability | Occasional burst usage is acceptable
Want to customize or fine-tune models | Using models as-is is fine
Building products that embed AI | Experimenting or prototyping

Next Steps

With Ollama running, you can:

  • Set up Open WebUI for a ChatGPT-style interface
  • Build AI agents with CrewAI or AutoGPT
  • Integrate with your existing applications via the OpenAI-compatible API
  • Experiment with different models to find what works for your use case

Running your own AI infrastructure gives you control over data privacy, costs, and availability. Canadian Web Hosting offers dedicated GPU servers optimized for AI workloads, with Canadian data centres in Vancouver and Toronto — keeping your data within Canadian jurisdiction.

Questions about setting up AI infrastructure? Contact our team — we help businesses self-host AI tools every day.