We manage hundreds of servers at Canadian Web Hosting. When a customer calls at 3 AM because their site is down, the first question is always the same: what changed? Without monitoring, that question turns into 45 minutes of SSH-ing into boxes and reading logs. With monitoring, it’s a 30-second glance at a dashboard.
If you’re running anything in production — a web app, a database, a container stack — you need monitoring. But the open-source landscape is overwhelming. Prometheus? Zabbix? Netdata? Checkmk? They all claim to be the best. We’ve deployed most of them for customers over the years, so here’s what we’ve actually learned.
What Self-Hosted Monitoring Gets You
Before comparing tools, let’s be clear about what monitoring solves:
- Downtime detection — know before your customers do
- Capacity planning — see trends before you hit limits
- Root cause analysis — correlate CPU, memory, disk, and network when something breaks
- Compliance evidence — SOC 2 and PCI DSS require monitoring and alerting
Self-hosting your monitoring means your data stays in Canada (or wherever you choose), you control retention, and there are no per-host fees that balloon as you scale.
The Comparison: 12 Monitoring Tools, Tested
| Tool | Best For | Min RAM | Architecture |
|---|---|---|---|
| Prometheus | Metrics + alerting (cloud-native) | 2 GB | Pull-based, PromQL |
| Grafana | Dashboards + visualization | 512 MB | Query layer (pairs with everything) |
| Zabbix | Enterprise infra monitoring | 4 GB | Agent-based, auto-discovery |
| Netdata | Real-time per-host metrics | 256 MB | Per-node agent, zero config |
| Checkmk | Traditional IT (network + servers) | 2 GB | Nagios-derived, agent + SNMP |
| Nagios | Legacy check-based monitoring | 1 GB | Plugin-based, active checks |
| Icinga 2 | Modern Nagios replacement | 2 GB | Cluster-capable, REST API |
| Uptime Kuma | Simple uptime + status pages | 256 MB | Node.js, SQLite |
| VictoriaMetrics | Long-term Prometheus storage | 1 GB | Prometheus-compatible, better compression |
| LibreNMS | Network device monitoring | 2 GB | SNMP-based, auto-discovery |
| Sensu Go | Pipeline-based observability | 2 GB | Agent + event pipeline |
| Monit | Process watchdog | 32 MB | Lightweight, auto-restart |
Prometheus + Grafana: The Modern Standard
If you’re starting fresh, this is probably the right answer. Prometheus scrapes metrics from your services every 15 seconds, stores them in a time-series database, and fires alerts through Alertmanager. Grafana gives you the dashboards.
Why it wins:
- PromQL is powerful once you learn it — percentile calculations, rate functions, label filtering
- Massive ecosystem: exporters exist for Docker, MySQL, PostgreSQL, Redis, Nginx, and hundreds more
- Cloud-native: designed for containers and Kubernetes from the start
- Alertmanager handles routing, silencing, and deduplication
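To make the PromQL point concrete, here are three queries of the kind you end up writing daily. The first two use standard node-exporter metrics; the histogram metric name in the second is illustrative (it assumes your app exposes a conventional `http_request_duration_seconds` histogram):

```promql
# 5-minute average CPU usage per instance, excluding idle time
100 * (1 - avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])))

# 95th-percentile request latency from a histogram (metric name is an assumption)
histogram_quantile(0.95, sum by (le) (rate(http_request_duration_seconds_bucket[5m])))

# Root filesystems predicted to fill within 4 hours, based on the last hour's trend
predict_linear(node_filesystem_avail_bytes{mountpoint="/"}[1h], 4 * 3600) < 0
```

The third query is the kind of thing that makes PromQL worth learning: one expression turns raw disk metrics into an actionable "act now" alert.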
Where it struggles:
- Not a logs solution — you need a separate logging stack (Loki, ELK, syslog-ng)
- Single-node by default — for long-term storage, add VictoriaMetrics or Thanos
- Pull-based model means you need network access to every target
Production setup (Docker Compose):
```yaml
version: "3.8"
services:
  prometheus:
    image: prom/prometheus:latest
    volumes:
      - ./prometheus.yml:/etc/prometheus/prometheus.yml
      - prometheus_data:/prometheus
    command:
      - '--config.file=/etc/prometheus/prometheus.yml'
      - '--storage.tsdb.retention.time=90d'
    ports:
      - "9090:9090"
  grafana:
    image: grafana/grafana:latest
    volumes:
      - grafana_data:/var/lib/grafana
    ports:
      - "3000:3000"
    environment:
      - GF_SECURITY_ADMIN_PASSWORD=changeme
  node-exporter:
    image: prom/node-exporter:latest
    pid: host
    volumes:
      - /proc:/host/proc:ro
      - /sys:/host/sys:ro
    command:
      - '--path.procfs=/host/proc'
      - '--path.sysfs=/host/sys'
  alertmanager:
    image: prom/alertmanager:latest
    ports:
      - "9093:9093"
volumes:
  prometheus_data:
  grafana_data:
```
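The Compose file mounts a `./prometheus.yml` that you still have to write. A minimal one that scrapes Prometheus itself and the node-exporter, and routes alerts to Alertmanager (using the Compose service names as hostnames), might look like:

```yaml
global:
  scrape_interval: 15s

alerting:
  alertmanagers:
    - static_configs:
        - targets: ["alertmanager:9093"]

scrape_configs:
  - job_name: prometheus
    static_configs:
      - targets: ["localhost:9090"]
  - job_name: node
    static_configs:
      - targets: ["node-exporter:9100"]
```

Add one `job_name` block per exporter as you grow; Prometheus reloads the file on a SIGHUP or a restart of the container.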
Key Exporters Worth Adding
| Exporter | What It Monitors | Port |
|---|---|---|
| node-exporter | CPU, memory, disk, network | 9100 |
| cAdvisor | Container metrics | 8080 |
| mysqld-exporter | MySQL/MariaDB queries, connections | 9104 |
| postgres-exporter | PostgreSQL stats, locks | 9187 |
| blackbox-exporter | HTTP/TCP/ICMP probes (uptime) | 9115 |
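The blackbox-exporter is the odd one out in the table: instead of scraping it directly, you pass it a target to probe via relabeling. A sketch of the standard pattern, assuming the exporter is reachable at `blackbox-exporter:9115` and uses the default `http_2xx` module:

```yaml
scrape_configs:
  - job_name: blackbox
    metrics_path: /probe
    params:
      module: [http_2xx]          # expect an HTTP 2xx response
    static_configs:
      - targets:
          - https://example.com   # the sites you want probed
    relabel_configs:
      # Move the target URL into the ?target= query parameter
      - source_labels: [__address__]
        target_label: __param_target
      # Keep the probed URL as the instance label
      - source_labels: [__param_target]
        target_label: instance
      # Actually scrape the exporter, not the website
      - target_label: __address__
        replacement: blackbox-exporter:9115
```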
Zabbix: The Enterprise Workhorse
Zabbix has been around since 2001. It’s not flashy, but it monitors everything: servers, network gear, SNMP devices, IPMI, JMX, cloud APIs. If you have a mixed fleet of physical servers, switches, and VMs, Zabbix handles it all in one place.
Why teams choose it:
- Auto-discovery finds new hosts and services automatically
- Built-in alerting with escalation chains (email → Slack → PagerDuty)
- Template library covers thousands of device types
- Agent-based with low overhead per monitored host
The trade-offs:
- Web UI feels dated compared to Grafana dashboards
- MySQL/PostgreSQL backend needs tuning at scale (500+ hosts)
- Configuration is template-heavy — steep learning curve
- Needs 4+ GB RAM for the server itself
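On the monitored hosts themselves, the agent is simple: a few lines in `/etc/zabbix/zabbix_agentd.conf` are enough (the IP and hostname below are placeholders for your own values):

```ini
# Zabbix server(s) allowed to poll this agent (passive checks)
Server=203.0.113.10
# Server the agent pushes active checks to
ServerActive=203.0.113.10
# Must match the host name configured in the Zabbix frontend
Hostname=web01.example.com
```

The complexity lives on the server side, in templates and discovery rules, not on the agents.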
Netdata: Instant Visibility, Zero Config
Netdata is the fastest way to see what’s happening on a server. Install the agent, open port 19999, and you get 2,000+ metrics collected at per-second granularity. No configuration needed.
What makes it unique:
- Installs in 30 seconds: `bash <(curl -Ss https://get.netdata.cloud/kickstart.sh)`
- Per-second resolution (most tools do 15-60 second intervals)
- Anomaly detection built in — flags unusual patterns automatically
- Extremely low overhead (~2% CPU, ~100 MB RAM)
Limitations:
- Per-host dashboards — no centralized multi-host view without Netdata Cloud (SaaS)
- Short default retention (depends on RAM allocated to dbengine)
- Not designed for alerting pipelines or complex routing
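The per-host limitation can be softened without the SaaS: the open-source agent supports parent/child streaming, where child nodes forward their metrics to one Netdata "parent" you browse centrally. A sketch of the child side of `/etc/netdata/stream.conf` (hostname and API key are placeholders you generate yourself, e.g. with `uuidgen`):

```ini
[stream]
    enabled = yes
    # the Netdata parent that will store and display this host's metrics
    destination = parent.example.com:19999
    # shared secret; the parent must list the same key with "enabled = yes"
    api key = 11111111-2222-3333-4444-555555555555
```

It is still not a Prometheus replacement, but it gives small fleets one dashboard to look at.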
Uptime Kuma: Simple Status Pages
Not every project needs a full observability stack. If you just need to know “is it up?” and want a clean status page for customers, Uptime Kuma is the answer.
- HTTP, TCP, DNS, Docker, and ping monitors
- Beautiful status pages you can share with clients
- Notifications via Slack, Discord, Telegram, email, and 90+ integrations
- Runs on 256 MB RAM — fits on the smallest VPS
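Deployment is a single container. A minimal Docker Compose sketch using the official image:

```yaml
services:
  uptime-kuma:
    image: louislam/uptime-kuma:1
    restart: unless-stopped
    ports:
      - "3001:3001"          # web UI and status pages
    volumes:
      - uptime-kuma:/app/data  # SQLite DB, monitor config, history
volumes:
  uptime-kuma:
```

Browse to port 3001, create the admin account, and add monitors from the UI; everything persists in the named volume.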
VictoriaMetrics: When Prometheus Runs Out of Disk
VictoriaMetrics is a drop-in replacement for Prometheus’s storage engine. It speaks PromQL, accepts Prometheus remote-write, and compresses data 7-10x better. If you’re keeping 6+ months of metrics, this saves significant disk space.
- Drop-in: point Prometheus `remote_write` at VictoriaMetrics, done
- Single binary, no dependencies
- Handles millions of time series on modest hardware
- Also works standalone (without Prometheus) using vmagent
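The "drop-in" part is literally two lines in `prometheus.yml` (hostname assumes a `victoriametrics` container or DNS entry; 8428 is the VictoriaMetrics default port):

```yaml
# Ship a copy of every scraped sample to VictoriaMetrics for long-term storage
remote_write:
  - url: http://victoriametrics:8428/api/v1/write
```

Then add VictoriaMetrics as a Prometheus-type data source in Grafana pointing at port 8428, and your existing PromQL dashboards query the long-term store unchanged.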
The Legacy Options: Nagios, Cacti, Observium
We still see these on older infrastructure. They work, but there’s rarely a reason to deploy them fresh in 2026:
- Nagios — the original. Plugin-based, check scripts, CGI web UI. Icinga 2 is the modern fork with clustering and a REST API.
- Cacti — SNMP-based graphing. Good for network bandwidth charts, limited beyond that.
- Observium — network monitoring. Community edition is free but feature-limited. LibreNMS is the actively-maintained fork.
If you’re running Nagios today, consider migrating to Checkmk (which runs Nagios plugins) or Icinga 2 (which has a Nagios-compatible config format).
How to Choose: Decision Tree
| Scenario | Recommended Stack | Why |
|---|---|---|
| Containers + microservices | Prometheus + Grafana | Built for dynamic, ephemeral workloads |
| Mixed fleet (servers + network) | Zabbix | Agent + SNMP + auto-discovery |
| Quick single-server visibility | Netdata | Zero config, instant dashboards |
| Simple uptime monitoring | Uptime Kuma | Lightweight, clean status pages |
| Long-term metrics storage | VictoriaMetrics + Grafana | Better compression, PromQL compatible |
| Process watchdog | Monit | 32 MB RAM, auto-restarts crashed services |
| Legacy Nagios migration | Checkmk or Icinga 2 | Plugin-compatible, modern features |
| Network devices only | LibreNMS | SNMP-native, auto-discovery |
Hosting Requirements
Monitoring tools range from tiny (Monit at 32 MB) to resource-hungry (Zabbix at 4+ GB). Here’s what we recommend for a production deployment:
| Stack | CPU | RAM | Storage |
|---|---|---|---|
| Prometheus + Grafana (small) | 2 cores | 4 GB | 50 GB SSD |
| Prometheus + Grafana (100+ targets) | 4 cores | 8 GB | 200 GB SSD |
| Zabbix (enterprise) | 4 cores | 8 GB | 100 GB SSD |
| Netdata (per host) | 1 core | 1 GB | 10 GB |
| Uptime Kuma | 1 core | 512 MB | 5 GB |
| VictoriaMetrics (long-term) | 2 cores | 4 GB | 500 GB SSD |
A Canadian Web Hosting Cloud VPS handles Prometheus + Grafana comfortably starting at the 4 GB tier. For Zabbix monitoring 500+ hosts, a dedicated server gives you the I/O headroom the database needs.
Running monitoring for compliance (SOC 2, PCI DSS)? Our infrastructure is SOC 2 Type II certified, and our Managed Security team can help with the alerting and audit trail requirements.
Hardening Checklist
Whichever tool you choose, lock it down before exposing it:
- Reverse proxy — put Nginx or Caddy in front with TLS. Never expose Prometheus or Grafana directly on port 9090/3000.
- Authentication — enable auth on Grafana (default admin/admin is a gift to attackers). Use OAuth or LDAP if possible.
- Firewall — restrict exporter ports (9100, 9090, etc.) to your monitoring server’s IP only.
- Retention policy — set `--storage.tsdb.retention.time` in Prometheus. 90 days is a good default; use VictoriaMetrics for longer.
- Backups — snapshot Prometheus data and Grafana dashboards. A monitoring system that loses its history is useless for trend analysis.
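A minimal sketch of the reverse-proxy item, for Grafana behind Nginx with TLS (hostname and certificate paths are placeholders; pair it with a firewall rule blocking direct access to port 3000):

```nginx
server {
    listen 443 ssl;
    server_name grafana.example.com;

    ssl_certificate     /etc/letsencrypt/live/grafana.example.com/fullchain.pem;
    ssl_certificate_key /etc/letsencrypt/live/grafana.example.com/privkey.pem;

    location / {
        proxy_pass http://127.0.0.1:3000;
        proxy_set_header Host $host;
        # Grafana Live streams dashboards over WebSockets
        proxy_set_header Upgrade $http_upgrade;
        proxy_set_header Connection "upgrade";
    }
}
```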
Need help with the initial setup or ongoing management? Our Managed Support team handles monitoring deployments — we’ll configure alerting, dashboards, and retention so you can focus on your application.
What’s Next
Monitoring is only half the picture. You also need centralized logging to correlate metrics with events. When a CPU spike happens, logs tell you why. We cover logging stacks (Loki, ELK, Graylog) in a separate comparison.
For container-heavy environments, pair your monitoring with proper Docker troubleshooting practices and VPS hardening to keep everything stable.