Infrastructure Details
Monitoring CPU, RAM, disk, and server performance metrics
Real-time monitoring of your server's health and performance. Track CPU usage, memory consumption, disk space, and network activity at a glance.
Technical Implementation
CtrlOps uses a high-frequency polling mechanism to keep your infrastructure metrics up-to-date without overloading the server or the desktop application.
Data Collection Lifecycle
- Rust Backend Polling: The Tauri backend uses the `sysinfo` crate and specialized shell commands (like `top`, `df -h`, and `free -m`) to gather raw system data.
- Periodic Interval: In an active session, metrics are polled every 2-5 seconds (configurable in settings).
- IPC Event Emitters: Once polled, the Rust backend emits an event (e.g., `infra:update-stats`) containing a JSON payload of all CPU, RAM, and disk metrics.
- React Frontend Subscription: The UI listens for these events and updates the dashboard charts in real time, using a virtualized data layer to maintain 60fps performance even during high activity.
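The poll-then-emit cycle can be approximated in a few lines of shell. This is a rough sketch, not CtrlOps's actual Rust implementation: the function names `stats_payload` and `poll_stats` and the JSON field names are hypothetical. Only the event name `infra:update-stats` and the idea of a fixed polling interval come from the description above.

```shell
# Build a JSON payload like the one an infra:update-stats event might carry.
# The field names (cpu, ram, disk) are illustrative assumptions.
stats_payload() {
  printf '{"event":"infra:update-stats","cpu":%s,"ram":%s,"disk":%s}\n' "$1" "$2" "$3"
}

# Poll on a fixed interval (seconds) for a given number of rounds.
# sample_cmd is any command that prints "cpu ram disk" percentages on one line.
poll_stats() {
  interval=$1; rounds=$2; sample_cmd=$3
  while [ "$rounds" -gt 0 ]; do
    set -- $($sample_cmd)            # split the sample into $1 $2 $3
    stats_payload "$1" "$2" "$3"
    rounds=$((rounds - 1))
    if [ "$rounds" -gt 0 ]; then sleep "$interval"; fi
  done
}
```

Calling `poll_stats 2 3 my_sampler` would print three payloads two seconds apart, where `my_sampler` stands in for whatever command gathers the raw numbers.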
Service Health Logic
For services like Nginx, MySQL, or Docker, CtrlOps specifically monitors the systemd status:
- Active: Service is running and reporting healthy.
- Inactive/Failed: Service is stopped or has crashed. CtrlOps parses the exit code and `journalctl` logs to provide immediate troubleshooting context.
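The same information can be gathered by hand with standard systemd tools. The sketch below is an assumption about how such a check could look, not CtrlOps's internal logic; `health_label` and `service_report` are hypothetical helper names.

```shell
# Map a systemd state string to the two buckets described above.
health_label() {
  case "$1" in
    active)          echo "Active" ;;
    inactive|failed) echo "Inactive/Failed" ;;
    *)               echo "Other ($1)" ;;   # e.g. activating, deactivating
  esac
}

# Print the state of one unit plus troubleshooting context (requires systemd).
service_report() {
  unit=$1
  state=$(systemctl is-active "$unit" || true)
  health_label "$state"
  if [ "$state" != "active" ]; then
    # Exit status of the unit's main process, then its most recent log lines.
    systemctl show "$unit" --property=ExecMainStatus
    journalctl -u "$unit" -n 20 --no-pager
  fi
}
```

For example, `service_report nginx` prints the label and, if the unit is not active, its last exit status and 20 log lines.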
Dashboard Overview
The Infrastructure panel gives you instant visibility into:
┌─────────────────────────────────────────────────────────────┐
│ Server Health — web-server-01 │
├─────────────────────────────────────────────────────────────┤
│ │
│ 💻 CPU Usage 🧠 Memory Usage │
│ ████████░░ 45% ██████░░░░ 62% │
│ 8 cores active 4.2 GB / 8 GB │
│ │
│ 💾 Disk Usage 🌐 Network │
│ ██████████ 78% ↓ 2.4 MB/s │
│ 234 GB / 300 GB ↑ 890 KB/s │
│ │
│ ⚡ Load Average 🔄 Processes │
│ 0.52 0.48 0.61 142 running │
│ (1m 5m 15m) 3 zombie │
│ │
└─────────────────────────────────────────────────────────────┘
CPU Monitoring
Understanding CPU Metrics
CPU Usage (%):
- 0-50%: Healthy, plenty of headroom
- 50-80%: Moderate load, monitor trends
- 80-95%: High load, investigate
- 95%+: Critical, immediate attention needed
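These bands translate directly into a small classifier. The sketch below assumes that a value sitting exactly on a boundary (50, 80, 95) belongs to the higher band, since the ranges above overlap at their edges; `cpu_status` is a hypothetical helper name.

```shell
# Classify a CPU usage percentage into the four bands listed above.
# Boundary values (50, 80, 95) are assigned to the higher band by assumption.
cpu_status() {
  awk -v pct="$1" 'BEGIN {
    if (pct < 50)      print "healthy"
    else if (pct < 80) print "moderate"
    else if (pct < 95) print "high"
    else               print "critical"
  }'
}

cpu_status 45    # prints "healthy"
```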
Load Average: Three numbers representing average load over 1, 5, and 15 minutes.
Rule of thumb:
- Below 1.0 per CPU core = good
- Above 1.0 per CPU core = overloaded
Example: On a 4-core server:
- Load 2.0 = 50% utilized (healthy)
- Load 4.0 = 100% utilized (busy)
- Load 8.0 = 200% utilized (overloaded, queue forming)
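The arithmetic behind these figures is load divided by core count. A minimal sketch, with `load_percent` as a hypothetical helper name:

```shell
# Express a load average as a percentage of available cores.
load_percent() {
  awk -v load="$1" -v cores="$2" 'BEGIN { printf "%.0f\n", 100 * load / cores }'
}

load_percent 2.0 4    # prints 50
load_percent 8.0 4    # prints 200

# On a live Linux host the inputs could come from /proc/loadavg and nproc:
#   read -r one _ < /proc/loadavg; load_percent "$one" "$(nproc)"
```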
CPU Breakdown
CtrlOps shows:
- User processes: Your applications
- System processes: OS kernel tasks
- I/O wait: Waiting for disk/network
- Steal time: (VMs only) Time stolen by hypervisor
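On Linux, a breakdown like this can be derived from the aggregate `cpu` line in `/proc/stat`, whose counters are cumulative, so two samples are needed. The sketch below is one way to do it and is not necessarily how CtrlOps computes its figures; `cpu_breakdown` is a hypothetical helper name, and folding `nice` time into the user share is an assumption.

```shell
# Compute the share of CPU time per category between two samples of the
# aggregate "cpu" line from /proc/stat.
# Fields: cpu user nice system idle iowait irq softirq steal ...
cpu_breakdown() {
  printf '%s\n%s\n' "$1" "$2" | awk '
    NR == 1 { for (i = 2; i <= 9; i++) first[i] = $i; next }
    {
      total = 0
      for (i = 2; i <= 9; i++) { d[i] = $i - first[i]; total += d[i] }
      if (total == 0) { print "no change"; exit }
      printf "user=%.1f system=%.1f iowait=%.1f steal=%.1f\n",
        100 * (d[2] + d[3]) / total, 100 * d[4] / total,
        100 * d[6] / total, 100 * d[9] / total
    }'
}

# Live usage:
#   a=$(grep '^cpu ' /proc/stat); sleep 1; b=$(grep '^cpu ' /proc/stat)
#   cpu_breakdown "$a" "$b"
```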
Troubleshooting High CPU
Step 1: Identify the culprit
# Show processes by CPU usage
htop
# Or in CtrlOps terminal
ps aux --sort=-%cpu | head -10
Step 2: Analyze
- Is it a legitimate process?
- Has resource usage spiked suddenly?
- Is it consuming more than expected?
Step 3: Take action
- Optimize the application
- Add more CPU resources
- Kill runaway processes (carefully!)
- Schedule heavy tasks for off-peak
Memory Monitoring
Memory Metrics Explained
- Total Memory: Physical RAM installed
- Used Memory: Currently allocated
- Free Memory: Completely unused
- Cached: Frequently accessed data (can be freed)
- Buffers: Disk cache (can be freed)
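Available Memory ≈ Free + Cached + Buffers (this is what matters!). This is an approximation, because not all cached memory can be reclaimed. On Linux, prefer the kernel's own estimate, reported as MemAvailable in /proc/meminfo and in the "available" column of free.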
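A short sketch of reading available memory as a percentage of total, using the `MemTotal` and `MemAvailable` fields of `/proc/meminfo` (Linux only). `mem_available_percent` is a hypothetical helper name.

```shell
# Read /proc/meminfo-style text on stdin and print available memory
# as a whole-number percentage of total memory.
mem_available_percent() {
  awk '/^MemTotal:/     { total = $2 }
       /^MemAvailable:/ { avail = $2 }
       END { if (total > 0) printf "%.0f\n", 100 * avail / total }'
}

# Live usage: mem_available_percent < /proc/meminfo
```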
When to Worry
Memory usage patterns:
| Scenario | Status | Action |
|---|---|---|
| < 50% used | ✅ Healthy | None needed |
| 50-80% used | ⚠️ Monitor | Watch trends |
| 80-95% used | 🔶 Warning | Investigate soon |
| > 95% used | 🔴 Critical | Immediate action |
Out of Memory (OOM)
When memory is exhausted:
- System becomes sluggish
- Services may crash
- Linux OOM killer terminates processes
CtrlOps alerts you before this happens!
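To check by hand whether the OOM killer has already acted on a host, the kernel log can be searched for its messages. This is a minimal sketch that assumes the usual kernel wording ("Out of memory", "invoked oom-killer"); `oom_events` is a hypothetical helper name.

```shell
# Filter kernel log text on stdin down to lines that look like OOM-killer activity.
oom_events() {
  grep -iE 'out of memory|oom-killer' || true   # no matches is not an error
}

# Typical sources:
#   dmesg | oom_events
#   journalctl -k --no-pager | oom_events
```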
Finding Memory Hogs
# Top memory consumers
ps aux --sort=-%mem | head -10
# Memory usage by process
pmap <process_id>
Common culprits:
- Memory leaks in applications
- Too many worker processes
- Large database queries
- Unoptimized code
Disk Usage
Disk Space Metrics
- Total: Physical disk capacity
- Used: Space consumed
- Available: Free space
- Reserved: System reserved (usually 5%)
Disk Partitions
Typical layout:
Filesystem Size Used Avail Use% Mounted on
/dev/sda1 30G 12G 16G 43% /
/dev/sda2 100G 67G 28G 71% /var
/dev/sdb1 500G 45G 430G 10% /mnt/data
Key directories:
- `/`: Root filesystem (OS, applications)
- `/var`: Logs, databases, variable data
- `/home`: User files
- `/tmp`: Temporary files
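A listing in this format is easy to scan for partitions that are filling up. The sketch below reads `df`-style output and prints every mount point at or above a usage limit; `disks_over` is a hypothetical helper name, and mount points containing spaces are not handled.

```shell
# Print "mountpoint use%" for every filesystem at or above the given limit.
# Expects df-style text on stdin: header line, then Use% in column 5.
disks_over() {
  awk -v limit="$1" 'NR > 1 {
    use = $5
    sub(/%/, "", use)
    if (use + 0 >= limit) print $6, $5
  }'
}

# Live usage: df -P | disks_over 70
```

Run against the sample layout above with a limit of 70, this reports only `/var 71%`.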
Disk Health
CtrlOps monitors:
- Disk I/O — Read/write operations per second
- Latency — Time for disk operations
- Throughput — Data transfer rate
- Errors — SMART errors, bad sectors
Disk failures are often preceded by increased error rates. Monitor SMART data regularly.
Cleaning Up Disk Space
Find large files:
# Largest directories under /var (du reports per-directory totals)
sudo du -h /var | sort -rh | head -20
# Log files older than 30 days and larger than 100 MB
find /var/log -name "*.log" -mtime +30 -size +100M
# Docker cleanup (removes stopped containers, unused networks, and all unused images)
docker system prune -a
Safe to delete:
- Old logs (> 30 days)
- Package caches (`apt clean`)
- Temporary files
- Rotated backups
Network Monitoring
Network Metrics
Bandwidth Usage:
- Incoming (RX): Data downloaded
- Outgoing (TX): Data uploaded
Measured in:
- Bytes per second (B/s)
- Kilobytes per second (KB/s)
- Megabytes per second (MB/s)
Packet Statistics:
- Packets received/transmitted
- Errors
- Dropped packets
- Collisions
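Bandwidth figures like these are derived from cumulative byte counters sampled twice. The sketch below shows the calculation under two assumptions: 1 KB = 1024 bytes, and Linux's per-interface counters under `/sys/class/net`. `human_rate` is a hypothetical helper name and `eth0` is a placeholder interface.

```shell
# Convert two byte-counter readings taken `secs` seconds apart into a rate label.
human_rate() {
  awk -v before="$1" -v after="$2" -v secs="$3" 'BEGIN {
    rate = (after - before) / secs
    if (rate >= 1048576)   printf "%.1f MB/s\n", rate / 1048576
    else if (rate >= 1024) printf "%.1f KB/s\n", rate / 1024
    else                   printf "%.0f B/s\n", rate
  }'
}

human_rate 0 10485760 5    # prints "2.0 MB/s"

# Live usage on Linux:
#   a=$(cat /sys/class/net/eth0/statistics/rx_bytes); sleep 5
#   b=$(cat /sys/class/net/eth0/statistics/rx_bytes); human_rate "$a" "$b" 5
```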
Network Troubleshooting
High latency:
# Check connection to external site
ping google.com
# Trace route
mtr google.com
Slow transfers:
- Check bandwidth utilization
- Look for packet loss
- Verify network interface speed
- Check for throttling
Bandwidth Monitoring
CtrlOps tracks:
- Daily/weekly/monthly usage
- Peak usage times
- Per-process network usage
- Unusual traffic patterns
Process Monitoring
Process States
- Running: Currently executing
- Sleeping: Waiting for something (I/O, timer)
- Stopped: Paused (Ctrl+Z)
- Zombie: Finished but not yet cleaned up by its parent
- Defunct: Another name for a zombie (ps labels zombies as <defunct>)
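A per-state tally can be produced from the state codes that `ps` prints. This is a sketch under the assumption that only the first letter of each code matters (R running, S or D sleeping, T stopped, Z zombie); other codes, such as idle kernel threads, are ignored. `state_counts` is a hypothetical helper name.

```shell
# Count processes per state from ps STAT codes on stdin (one code per line).
state_counts() {
  awk '{ c[substr($1, 1, 1)]++ }
       END {
         printf "running=%d sleeping=%d stopped=%d zombie=%d\n",
           c["R"], c["S"] + c["D"], c["T"], c["Z"]
       }'
}

# Live usage: ps -eo stat= | state_counts
```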
Important Processes to Watch
System-critical:
- `sshd`: SSH daemon
- `systemd`: Init system
- `cron`: Scheduled tasks
- `rsyslogd`: Logging
Application-specific:
- Web server (nginx, Apache)
- Database (MySQL, PostgreSQL)
- Application workers
- Background jobs
Kill vs Terminate
Soft kill (SIGTERM):
kill <pid>
Politely asks the process to shut down. Allows cleanup.
Hard kill (SIGKILL):
kill -9 <pid>
Forcefully terminates the process. Use when a soft kill fails.
SIGKILL doesn't allow cleanup. It may leave behind temp files or corrupt data.
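The two signals are usually combined: ask politely first, and escalate only if the process is still alive after a grace period. A minimal sketch, with `stop_gracefully` as a hypothetical helper name and a 10-second default grace period chosen arbitrarily:

```shell
# Send SIGTERM, wait up to `timeout` seconds, then fall back to SIGKILL.
stop_gracefully() {
  pid=$1; timeout=${2:-10}
  kill -TERM "$pid" 2>/dev/null || return 0    # already gone
  while [ "$timeout" -gt 0 ]; do
    kill -0 "$pid" 2>/dev/null || return 0     # exited after SIGTERM
    sleep 1
    timeout=$((timeout - 1))
  done
  kill -KILL "$pid" 2>/dev/null                # last resort, no cleanup
}
```

`stop_gracefully 4242 15` would give the hypothetical process 4242 fifteen seconds to exit before forcing it.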
Historical Data
Performance Graphs
CtrlOps keeps historical data for:
- CPU usage (hourly, daily, weekly)
- Memory consumption
- Disk I/O
- Network traffic
Useful for:
- Identifying trends
- Capacity planning
- Troubleshooting past issues
- Proving SLA compliance
Setting Up Alerts
Configure alerts for:
- CPU > 80% for 5 minutes
- Memory > 90%
- Disk < 10% free
- Load average > number of cores
- Network errors increasing
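A rule such as "CPU > 80% for 5 minutes" fires only when every recent sample exceeds the threshold, not on a single spike. The sketch below shows one way to express that check; it is not CtrlOps's alert engine, and `sustained_breach` is a hypothetical helper name. With one sample per minute, five consecutive samples correspond to five minutes.

```shell
# Succeed (exit 0) if the most recent `need` samples on stdin all exceed `limit`.
# Input: one numeric sample per line, oldest first.
sustained_breach() {
  awk -v limit="$1" -v need="$2" '
    { if ($1 + 0 > limit) run++; else run = 0 }   # reset on any sample below the limit
    END { exit (run >= need) ? 0 : 1 }'
}

# Example: alert only if the last 5 samples are all above 80.
#   tail -n 5 cpu_samples.txt | sustained_breach 80 5 && echo "ALERT: sustained high CPU"
```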
Notification channels:
- Slack
- SMS (critical alerts)
- Webhook
Performance Optimization
CPU Optimization
- Identify inefficient processes
- Optimize database queries
- Use caching (Redis, Memcached)
- Scale horizontally (add servers)
- Upgrade hardware (more/faster cores)
Memory Optimization
- Find memory leaks (valgrind, profiling)
- Reduce buffer sizes
- Add swap space (not ideal, but buys time)
- Add more RAM
- Optimize application code
Disk Optimization
- Use SSD instead of HDD
- Implement RAID for redundancy/speed
- Partition wisely
- Monitor inode usage
- Schedule regular cleanup
Network Optimization
- Enable compression
- Use CDN for static content
- Implement caching
- Upgrade bandwidth
- Optimize payload sizes
Reporting & Analytics
Weekly Reports
Automated email with:
- Average resource usage
- Peak usage times
- Trends vs previous week
- Recommendations
Capacity Planning
Based on trends, predict when you'll need:
- More CPU cores
- Additional RAM
- Larger disks
- More bandwidth
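For disk space, the simplest projection assumes linear growth: remaining space divided by daily growth. The sketch below uses that assumption and made-up figures (a 300 GB disk with 210 GB used, growing 2 GB per day); `days_until_full` is a hypothetical helper name.

```shell
# Estimate whole days until a disk fills, assuming constant daily growth.
# Arguments: total size, used size, growth per day (all in the same unit).
days_until_full() {
  awk -v total="$1" -v used="$2" -v growth="$3" 'BEGIN {
    if (growth <= 0) { print "never"; exit }
    printf "%d\n", (total - used) / growth
  }'
}

days_until_full 300 210 2    # prints 45
```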
Example:
"Current growth rate suggests disk will be full in 45 days. Consider archiving old logs or expanding storage."
Summary
Monitor these key metrics:
- ✅ CPU Load Average — Should be < number of cores
- ✅ Memory Available — Don't let it hit zero
- ✅ Disk Free Space — Keep 15%+ free
- ✅ Network Latency — Should be stable
- ✅ Process Health — Watch for zombies, high CPU/memory
Proactive monitoring catches issues before they become outages. Set up alerts for thresholds before they become critical.