Why Run Your Own LLM?
Every time you send a prompt to ChatGPT or Claude, your data leaves your infrastructure. For most people, that’s fine. But if you’re working with customer data, proprietary code, or regulated information, that’s a problem.
Self-hosting an LLM changes the equation. Your prompts never leave your server. You get consistent performance without API rate limits. And you control the costs — no surprise bills when your team gets enthusiastic about AI.
Ollama has become the easiest way to run large language models locally. It handles model management, GPU acceleration, and provides a simple API that works with most AI tooling. Here’s how to set up Ollama for production use on a dedicated server or VPS.
What You’ll Need
For a production Ollama instance, we recommend:
| Component | Minimum | Recommended |
|---|---|---|
| CPU | 4 cores | 8+ cores |
| RAM | 16 GB | 32 GB |
| GPU | NVIDIA with 8GB VRAM | NVIDIA with 16GB+ VRAM |
| Storage | 50 GB SSD | 100 GB SSD |
| OS | Ubuntu 22.04/24.04 | Ubuntu 24.04 LTS |
A single GPU with 8GB+ VRAM comfortably handles 7B-class models such as Mistral 7B or Llama 3.1 8B. For larger models (70B parameters), you’ll want multiple GPUs or a single card with substantially more VRAM.
Can I run Ollama without a GPU? Yes, but expect significantly slower inference. CPU-only works for testing, but production workloads need GPU acceleration for acceptable response times.
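As a rough rule of thumb, a 4-bit quantized model needs about half a gigabyte of VRAM per billion parameters, plus headroom for the KV cache and runtime buffers. A back-of-envelope sketch (the 20% overhead factor is our assumption, not an Ollama figure):

```python
def estimate_vram_gb(params_billion: float, bits_per_weight: int = 4,
                     overhead: float = 1.2) -> float:
    """Rough VRAM estimate: weight storage at the given quantization,
    scaled by a flat overhead factor for KV cache and runtime buffers."""
    weight_gb = params_billion * bits_per_weight / 8
    return round(weight_gb * overhead, 1)

# A 4-bit 7B model fits comfortably in 8 GB of VRAM...
print(estimate_vram_gb(7))    # ~4.2
# ...while a 4-bit 70B model calls for a multi-GPU setup.
print(estimate_vram_gb(70))   # ~42.0
```

Treat the numbers as a sanity check before pulling a model, not a guarantee — context length and concurrent requests push real usage higher.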
Installing Ollama
Step 1: Prepare Your Server
Start with a fresh Ubuntu server. Update the system and install prerequisites:
sudo apt update && sudo apt upgrade -y
sudo apt install -y curl wget ca-certificates gnupg
If you have an NVIDIA GPU, verify the drivers and CUDA toolkit:
nvidia-smi
You should see output listing your GPU. If not, install NVIDIA drivers first:
sudo apt install -y nvidia-driver-535 nvidia-cuda-toolkit
sudo reboot
Step 2: Install Ollama
Ollama provides a simple installation script:
curl -fsSL https://ollama.com/install.sh | sh
This downloads and installs Ollama to /usr/local/bin/ollama and sets up the systemd service.
Step 3: Start the Ollama Service
Enable and start Ollama as a system service:
sudo systemctl enable ollama
sudo systemctl start ollama
sudo systemctl status ollama
You should see active (running) in the status output.
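Before moving on, it’s worth confirming the API actually answers; Ollama’s root endpoint returns a plain-text banner when the server is healthy. A minimal check using only the standard library (the helper name `is_ollama_up` is ours):

```python
import urllib.request
import urllib.error

def is_ollama_up(base_url: str = "http://localhost:11434",
                 timeout: float = 2.0) -> bool:
    """Return True if the Ollama server answers on its root endpoint."""
    try:
        with urllib.request.urlopen(base_url, timeout=timeout) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        return False

if is_ollama_up():
    print("Ollama is reachable")
else:
    print("Ollama is not responding -- check `systemctl status ollama`")
```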
Running Your First Model
Pulling a Model
Ollama supports many open-source models. Let’s start with Llama 3.2, Meta’s lightweight open-weight model:
ollama pull llama3.2
This downloads the default 3B-parameter variant (~2.0GB). For an even smaller, faster model:
ollama pull llama3.2:1b
Other popular options:
| Model | Size | Best For | Pull Command |
|---|---|---|---|
| Llama 3.2 3B | 2.0 GB | General-purpose, good balance | ollama pull llama3.2 |
| Llama 3.2 1B | 1.3 GB | Fast responses, limited reasoning | ollama pull llama3.2:1b |
| Mistral 7B | 4.1 GB | Code, reasoning, European languages | ollama pull mistral |
| CodeLlama 7B | 3.8 GB | Code generation, debugging | ollama pull codellama |
| Gemma 2 9B | 5.5 GB | Google’s model, strong reasoning | ollama pull gemma2 |
Testing the Model
Run an interactive chat session:
ollama run llama3.2
You’ll see a prompt where you can chat with the model:
>>> What's the capital of Canada?
The capital of Canada is Ottawa, located in the province of Ontario.
Type /bye to exit the chat.
Using the Ollama API
Ollama exposes its HTTP API on port 11434, including an OpenAI-compatible endpoint under /v1. This makes it easy to integrate with existing AI tooling.
Generate Endpoint
Generate a completion:
curl http://localhost:11434/api/generate -d '{
"model": "llama3.2",
"prompt": "Why is Canadian web hosting important for data sovereignty?",
"stream": false
}'
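The same call from Python, using only the standard library (the `build_payload` and `generate` helpers are ours, not part of Ollama):

```python
import json
import urllib.request

def build_payload(model: str, prompt: str) -> dict:
    """Request body for Ollama's /api/generate endpoint."""
    return {"model": model, "prompt": prompt, "stream": False}

def generate(prompt: str, model: str = "llama3.2",
             host: str = "http://localhost:11434") -> str:
    """POST to /api/generate and return the completed response text."""
    req = urllib.request.Request(
        f"{host}/api/generate",
        data=json.dumps(build_payload(model, prompt)).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["response"]
```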
Chat Endpoint
For multi-turn conversations:
curl http://localhost:11434/api/chat -d '{
"model": "llama3.2",
"messages": [
{"role": "user", "content": "Explain PIPEDA in one paragraph"}
],
"stream": false
}'
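With `"stream": true` (the default), the chat endpoint instead returns one JSON object per line, each carrying a fragment of the reply in `message.content`, with a final object marked `done: true`. A small sketch of a parser that reassembles the full reply, shown here on canned chunks in that format:

```python
import json
from typing import Iterable

def assemble_chat_stream(lines: Iterable[str]) -> str:
    """Concatenate content fragments from an Ollama streaming chat response."""
    parts = []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        chunk = json.loads(line)
        parts.append(chunk.get("message", {}).get("content", ""))
        if chunk.get("done"):
            break
    return "".join(parts)

# Canned chunks illustrating the streaming format:
sample = [
    '{"message": {"role": "assistant", "content": "PIPEDA is "}, "done": false}',
    '{"message": {"role": "assistant", "content": "a Canadian privacy law."}, "done": true}',
]
print(assemble_chat_stream(sample))  # PIPEDA is a Canadian privacy law.
```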
OpenAI Compatibility
Set your OpenAI client’s base URL to point at Ollama:
from openai import OpenAI
client = OpenAI(
base_url="http://localhost:11434/v1",
api_key="ollama" # Required but ignored
)
response = client.chat.completions.create(
model="llama3.2",
messages=[{"role": "user", "content": "Hello!"}]
)
print(response.choices[0].message.content)
This means any tool built for OpenAI’s API works with Ollama — just change the base URL.
Production Hardening
Configure Network Binding
By default, Ollama only listens on localhost. For production, you typically want it accessible from other services on your network.
Create an override for the systemd service:
sudo mkdir -p /etc/systemd/system/ollama.service.d
sudo tee /etc/systemd/system/ollama.service.d/override.conf <<'EOF'
[Service]
Environment="OLLAMA_HOST=0.0.0.0:11434"
EOF
sudo systemctl daemon-reload
sudo systemctl restart ollama
Security note: Binding to 0.0.0.0 exposes Ollama to all network interfaces. Always use a firewall to restrict access.
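The same override file can carry other tuning knobs. A sketch with a few commonly used Ollama environment variables (the values shown are illustrative — adjust to your hardware and traffic):

```ini
[Service]
Environment="OLLAMA_HOST=0.0.0.0:11434"
# Keep models loaded in memory between requests
Environment="OLLAMA_KEEP_ALIVE=10m"
# Limit how many models can be resident at once
Environment="OLLAMA_MAX_LOADED_MODELS=2"
# Concurrent requests served per loaded model
Environment="OLLAMA_NUM_PARALLEL=4"
```

Run `sudo systemctl daemon-reload && sudo systemctl restart ollama` after any change to the override file.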
Set Up Firewall Rules
Restrict access to trusted IPs only. Using UFW:
sudo ufw allow 22/tcp # SSH
sudo ufw allow from 10.0.0.0/8 to any port 11434 # Internal network only
sudo ufw allow from 172.16.0.0/12 to any port 11434
sudo ufw allow from 192.168.0.0/16 to any port 11434
sudo ufw --force enable
For more comprehensive server hardening, see our guide on VPS Security Hardening in 30 Minutes.
Add a Reverse Proxy with HTTPS
For remote access, put Nginx in front of Ollama with SSL termination:
sudo apt install -y nginx certbot python3-certbot-nginx
sudo tee /etc/nginx/sites-available/ollama <<'EOF'
server {
listen 80;
server_name your-domain.com;
location / {
proxy_pass http://127.0.0.1:11434;
proxy_set_header Host $host;
proxy_set_header X-Real-IP $remote_addr;
proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
proxy_set_header X-Forwarded-Proto $scheme;
# Increase timeout for long generations
proxy_read_timeout 300s;
}
}
EOF
sudo ln -sf /etc/nginx/sites-available/ollama /etc/nginx/sites-enabled/
sudo nginx -t && sudo systemctl reload nginx
sudo certbot --nginx -d your-domain.com
Enable Authentication
Ollama doesn’t have built-in authentication. For production, add authentication at the reverse proxy level or use a VPN.
A simple approach with HTTP basic auth:
sudo apt install -y apache2-utils
sudo htpasswd -c /etc/nginx/.htpasswd aiuser
# Add to your Nginx config inside the location block:
# auth_basic "Ollama API";
# auth_basic_user_file /etc/nginx/.htpasswd;
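With those directives in place, the `location` block in the earlier Nginx config might look like this (the realm name and htpasswd path are illustrative):

```nginx
location / {
    auth_basic           "Ollama API";
    auth_basic_user_file /etc/nginx/.htpasswd;

    proxy_pass http://127.0.0.1:11434;
    proxy_set_header Host $host;
    proxy_set_header X-Real-IP $remote_addr;
    proxy_read_timeout 300s;
}
```

Reload Nginx afterwards (`sudo nginx -t && sudo systemctl reload nginx`), and remember that basic auth is only safe over HTTPS.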
Integrating with Other Tools
Open WebUI
For a ChatGPT-like interface, self-host Open WebUI and point it at your Ollama instance. It provides conversation history, user management, and a polished chat interface.
AI Agents
Ollama works with AI agent frameworks like AutoGPT and CrewAI. If you’re building autonomous agents, see our comparison of self-hosted AI agent frameworks.
Troubleshooting
Model Download Fails
Symptom: ollama pull times out or shows connection errors.
Cause: Network issues or DNS problems.
Fix: Check connectivity and retry with debug output:
# Check connectivity
curl -I https://ollama.com
# Try again with verbose output
OLLAMA_DEBUG=1 ollama pull llama3.2
Out of Memory Errors
Symptom: Model fails to load with CUDA out of memory errors.
Cause: GPU VRAM insufficient for the model.
Fix: Use a smaller model or quantization:
# List installed models and their sizes
ollama list
# Pull a smaller variant (less capable, but far lighter on VRAM)
ollama pull llama3.2:1b
Or offload fewer layers to the GPU (slower, but it works) by lowering the num_gpu parameter in an interactive session:
ollama run llama3.2
>>> /set parameter num_gpu 8
Slow Inference
Symptom: Responses take many seconds to generate.
Cause: Running on CPU or insufficient GPU layers.
Fix: Verify GPU detection:
# With a model loaded, check which processor it is using
ollama ps
# The PROCESSOR column should read "100% GPU"
# Ollama also logs detected GPUs at startup
journalctl -u ollama | grep -i gpu
If GPU isn’t detected, check NVIDIA drivers and CUDA installation.
Service Won’t Start
Symptom: systemctl status ollama shows failed.
Cause: Port conflict or permission issues.
Fix: Check logs:
journalctl -u ollama -n 50 --no-pager
# Check for port conflicts
sudo lsof -i :11434
Monitoring and Maintenance
Log Management
Ollama logs to journald. View recent logs:
journalctl -u ollama -f
Model Updates
Update models periodically for improvements:
ollama pull llama3.2 # Re-pull to update
Disk Space
Models are stored in /usr/share/ollama/.ollama/models. Check usage:
du -sh /usr/share/ollama/.ollama/models
Remove unused models:
ollama rm mistral # Remove a specific model
When to Use Self-Hosted AI
Self-hosting an LLM isn’t always the right choice. Here’s when it makes sense:
| Self-Host When… | Use SaaS When… |
|---|---|
| Processing sensitive or regulated data | Working with public data only |
| High volume makes API costs prohibitive | Sporadic usage |
| Need guaranteed latency and availability | Occasional burst usage is acceptable |
| Want to customize or fine-tune models | Using models as-is is fine |
| Building products that embed AI | Experimenting or prototyping |
Next Steps
With Ollama running, you can:
- Set up Open WebUI for a ChatGPT-style interface
- Build AI agents with CrewAI or AutoGPT
- Integrate with your existing applications via the OpenAI-compatible API
- Experiment with different models to find what works for your use case
Running your own AI infrastructure gives you control over data privacy, costs, and availability. Canadian Web Hosting offers dedicated GPU servers optimized for AI workloads, with Canadian data centres in Vancouver and Toronto — keeping your data within Canadian jurisdiction.
Questions about setting up AI infrastructure? Contact our team — we help businesses self-host AI tools every day.