Infrastructure Details

Real-time monitoring of your server's health and performance. Track CPU usage, memory consumption, disk space, and network activity at a glance.

Technical Implementation

CtrlOps polls at a short, configurable interval to keep your infrastructure metrics up to date without overloading the server or the desktop application.

Data Collection Lifecycle

Rust Backend Polling: The Tauri backend uses the sysinfo crate and standard shell commands (such as top, df -h, and free -m) to gather raw system data.
Periodic Interval: In an active session, metrics are polled every 2-5 seconds (configurable in settings).
IPC Event Emitters: After each poll, the Rust backend emits an event (e.g., infra:update-stats) containing a JSON payload of all CPU, RAM, and disk metrics.
React Frontend Subscription: The UI listens for these events and updates the dashboard charts in real time, using a virtualized data layer to maintain 60 fps even during high activity.
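
The lifecycle above can be sketched roughly as follows. This is an illustrative Python sketch, not the CtrlOps Rust implementation: `poll_metrics`, `fake_collector`, the `emit` callback, and the payload field names are all hypothetical. Only the event name `infra:update-stats` and the idea of a fixed polling interval come from the description above.

```python
import json
import time
from typing import Callable, Dict, List, Tuple

EVENT_NAME = "infra:update-stats"  # event name taken from the text above


def poll_metrics(
    collect_metrics: Callable[[], Dict[str, float]],
    emit: Callable[[str, str], None],
    interval_seconds: float = 2.0,
    max_polls: int = 3,
) -> None:
    """Poll on a fixed interval and emit one JSON payload per poll."""
    for poll_number in range(max_polls):
        payload = json.dumps(collect_metrics())
        emit(EVENT_NAME, payload)
        # Sleep between polls, but not after the final one.
        if poll_number < max_polls - 1:
            time.sleep(interval_seconds)


def fake_collector() -> Dict[str, float]:
    # Hypothetical stand-in for the real sysinfo/shell-based collection.
    return {"cpu_percent": 45.0, "ram_percent": 62.0, "disk_percent": 78.0}


# The "frontend" here is just a list that records every emitted event.
received: List[Tuple[str, str]] = []
poll_metrics(
    fake_collector,
    lambda name, body: received.append((name, body)),
    interval_seconds=0.0,
    max_polls=2,
)
print(received)
```

Passing the collector and the emitter in as callbacks keeps the polling loop independent of how metrics are gathered and how events reach the UI.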

Service Health Logic

For services like Nginx, MySQL, or Docker, CtrlOps monitors the systemd unit status:

Active: Service is running and reporting healthy.
Inactive/Failed: Service is stopped or has crashed. CtrlOps parses the exit code and journalctl logs to provide immediate troubleshooting context.
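
As a rough illustration of the two states above, the sketch below maps the one-word output of `systemctl is-active <unit>` onto them. The function names and the extra "Other" bucket (for transitional states such as activating) are assumptions made for this example; how CtrlOps actually queries systemd is not documented here.

```python
import subprocess


def classify_service_state(state: str) -> str:
    """Map systemd's one-word state onto the categories described above."""
    state = state.strip().lower()
    if state == "active":
        return "Active"
    if state in ("inactive", "failed"):
        return "Inactive/Failed"
    # Transitional states such as "activating" or "deactivating" (assumed bucket).
    return "Other"


def query_service_state(unit: str) -> str:
    """Run `systemctl is-active <unit>` and return its raw output."""
    # check=False: systemctl exits non-zero when the unit is not active.
    result = subprocess.run(
        ["systemctl", "is-active", unit],
        capture_output=True,
        text=True,
        check=False,
    )
    return result.stdout


# Classify sample outputs without needing systemd on the machine running this.
for sample in ("active\n", "failed\n", "activating\n"):
    print(sample.strip(), "->", classify_service_state(sample))
```

On a systemd host, `classify_service_state(query_service_state("nginx"))` would combine the two steps.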

Dashboard Overview

The Infrastructure panel gives you instant visibility into:

┌─────────────────────────────────────────────────────────────┐
│  Server Health — web-server-01                              │
├─────────────────────────────────────────────────────────────┤
│                                                             │
│  💻 CPU Usage                    🧠 Memory Usage            │
│  ████░░░░░░ 45%                  ██████░░░░ 62%             │
│  8 cores active                  5.0 GB / 8 GB              │
│                                                             │
│  💾 Disk Usage                   🌐 Network                 │
│  ████████░░ 78%                  ↓ 2.4 MB/s                 │
│  234 GB / 300 GB                 ↑ 890 KB/s                 │
│                                                             │
│  ⚡ Load Average                🔄 Processes                │
│  0.52 0.48 0.61                  142 running                │
│  (1m 5m 15m)                     3 zombie                   │
│                                                             │
└─────────────────────────────────────────────────────────────┘

CPU Monitoring

Understanding CPU Metrics

CPU Usage (%):

0-50%: Healthy, plenty of headroom
50-80%: Moderate load, monitor trends
80-95%: High load, investigate
95%+: Critical, immediate attention needed

Load Average: Three numbers representing the average number of processes running or waiting to run over the last 1, 5, and 15 minutes.

Rule of thumb:

Below 1.0 per CPU core = good
Above 1.0 per CPU core = overloaded

Example: On a 4-core server:

Load 2.0 = 50% utilized (healthy)
Load 4.0 = 100% utilized (busy)
Load 8.0 = 200% utilized (overloaded, queue forming)
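
The rule of thumb is just the load average divided by the core count. The sketch below reproduces the 4-core example; the function names are made up for illustration, and treating exactly 1.0 per core as "busy" follows the example above rather than any documented CtrlOps threshold.

```python
def load_utilization(load_average: float, cpu_cores: int) -> float:
    """Return load as a percentage of total core capacity."""
    if cpu_cores <= 0:
        raise ValueError("cpu_cores must be positive")
    return load_average / cpu_cores * 100.0


def describe_load(load_average: float, cpu_cores: int) -> str:
    """Apply the per-core rule of thumb."""
    utilization = load_utilization(load_average, cpu_cores)
    if utilization < 100.0:
        return "good"
    if utilization == 100.0:
        return "busy"
    return "overloaded"


# The 4-core example from the text.
for load in (2.0, 4.0, 8.0):
    print(load, load_utilization(load, 4), describe_load(load, 4))
```

On a Unix host, the live inputs are available from the standard library as `os.getloadavg()` and `os.cpu_count()`.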

CPU Breakdown

CtrlOps shows:

User processes: Your applications
System processes: OS kernel tasks
I/O wait: Waiting for disk/network
Steal time: (VMs only) Time stolen by hypervisor
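
On Linux, a breakdown like this can be derived from the aggregate `cpu` line of /proc/stat. The counters are cumulative since boot, so the percentages come from the difference between two samples. The sketch below assumes a kernel that reports the steal field, and the two sample lines are invented numbers; whether CtrlOps computes its breakdown this way is not stated above.

```python
from typing import Dict

# Order of the first eight counters on the "cpu" line of /proc/stat.
FIELDS = ("user", "nice", "system", "idle", "iowait", "irq", "softirq", "steal")
REPORTED = ("user", "system", "iowait", "steal")


def parse_cpu_line(line: str) -> Dict[str, int]:
    """Parse the aggregate 'cpu' line into named tick counters."""
    parts = line.split()
    if not parts or parts[0] != "cpu":
        raise ValueError("expected the aggregate 'cpu' line")
    return dict(zip(FIELDS, (int(value) for value in parts[1:1 + len(FIELDS)])))


def cpu_breakdown(before: str, after: str) -> Dict[str, float]:
    """Return the share of elapsed CPU time per category, in percent."""
    first, second = parse_cpu_line(before), parse_cpu_line(after)
    deltas = {name: second[name] - first[name] for name in FIELDS}
    total = sum(deltas.values())
    if total <= 0:
        raise ValueError("samples must be taken at different times")
    return {name: round(100.0 * deltas[name] / total, 1) for name in REPORTED}


# Invented sample lines, as if read a few seconds apart.
SAMPLE_BEFORE = "cpu  1000 0 500 8000 300 0 0 200 0 0"
SAMPLE_AFTER = "cpu  1450 0 650 8300 350 0 0 250 0 0"
print(cpu_breakdown(SAMPLE_BEFORE, SAMPLE_AFTER))
```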

Troubleshooting High CPU

Step 1: Identify the culprit

# Show processes by CPU usage
htop

# Or in CtrlOps terminal
ps aux --sort=-%cpu | head -10

Step 2: Analyze

Is it a legitimate process?
Has resource usage spiked suddenly?
Is it consuming more than expected?

Step 3: Take action

Optimize the application
Add more CPU resources
Kill runaway processes (carefully!)
Schedule heavy tasks for off-peak

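Memory Monitoring

Memory Metrics

Total Memory: Physical RAM installed
Used Memory: Currently allocated
Free Memory: Completely unused
Cached: Page cache holding recently accessed file data (can be freed)
Buffers: Buffers for block-device I/O (can be freed)

Available Memory ≈ Free + Cached + Buffers. This is the number that matters, because the kernel reclaims cache and buffers when applications need memory.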
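
To make the formula concrete, here is a small sketch that applies it to /proc/meminfo-style text. The sample values are invented and the function names are illustrative. On Linux 3.14 and later the kernel also publishes its own `MemAvailable` estimate, which is usually the better number when present because not all cache is reclaimable.

```python
from typing import Dict


def parse_meminfo(text: str) -> Dict[str, int]:
    """Parse 'Key:   value kB' lines into a dict of kB values."""
    values: Dict[str, int] = {}
    for line in text.splitlines():
        key, _, rest = line.partition(":")
        fields = rest.split()
        if fields:
            values[key.strip()] = int(fields[0])
    return values


def estimate_available_kb(meminfo: Dict[str, int]) -> int:
    """Free + Cached + Buffers, as in the formula above."""
    return meminfo["MemFree"] + meminfo["Cached"] + meminfo["Buffers"]


# Invented sample in the format of /proc/meminfo.
SAMPLE_MEMINFO = """MemTotal:        8192000 kB
MemFree:         1024000 kB
MemAvailable:    3000000 kB
Buffers:          204800 kB
Cached:          2048000 kB
"""

info = parse_meminfo(SAMPLE_MEMINFO)
print("estimate (kB):", estimate_available_kb(info))
print("kernel MemAvailable (kB):", info.get("MemAvailable"))
```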

When to Worry

Memory usage patterns:

Scenario	Status	Action
< 50% used	✅ Healthy	None needed
50-80% used	⚠️ Monitor	Watch trends
80-95% used	🔶 Warning	Investigate soon
> 95% used	🔴 Critical	Immediate action
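
The table translates directly into a lookup function. The sketch below is one way to encode it. The table's ranges overlap at exactly 50% and 80%, so this example assumes those boundary values fall into the higher band, and that 95% itself still counts as Warning.

```python
from typing import Tuple


def classify_usage(percent_used: float) -> Tuple[str, str]:
    """Return (status, action) for a usage percentage, following the table."""
    if not 0 <= percent_used <= 100:
        raise ValueError("percent_used must be between 0 and 100")
    if percent_used < 50:
        return ("Healthy", "None needed")
    if percent_used < 80:
        return ("Monitor", "Watch trends")
    if percent_used <= 95:
        return ("Warning", "Investigate soon")
    return ("Critical", "Immediate action")


for sample in (45, 62, 80, 96):
    print(sample, classify_usage(sample))
```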

Out of Memory (OOM)

When memory is exhausted:

System becomes sluggish
Services may crash
Linux OOM killer terminates processes

CtrlOps can alert you before this happens if you configure a memory alert (see Setting Up Alerts).

Finding Memory Hogs

# Top memory consumers
ps aux --sort=-%mem | head -10

# Memory usage by process
pmap <process_id>

Common culprits:

Memory leaks in applications
Too many worker processes
Large database queries
Unoptimized code

Disk Usage

Disk Space Metrics

Total: Physical disk capacity
Used: Space consumed
Available: Free space
Reserved: System reserved (usually 5%)

Disk Partitions

Typical layout:

Filesystem      Size  Used  Avail  Use%  Mounted on
/dev/sda1        30G   12G   16G   43%   /
/dev/sda2       100G   67G   28G   71%   /var
/dev/sdb1       500G   45G  430G   10%   /mnt/data
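
Note that Used plus Avail is less than Size in each row: the gap is reserved space. GNU df computes Use% as used / (used + available), rounded up, which is why the root filesystem shows 43% rather than 40%. The sketch below recomputes the percentages from the sample table above; the parsing helpers are illustrative and only handle the K/M/G/T suffixes.

```python
from typing import List, Tuple

# Multipliers that convert a df -h suffix into KiB.
SIZE_UNITS = {"K": 1, "M": 1024, "G": 1024 ** 2, "T": 1024 ** 3}

DF_OUTPUT = """Filesystem      Size  Used  Avail  Use%  Mounted on
/dev/sda1        30G   12G   16G   43%   /
/dev/sda2       100G   67G   28G   71%   /var
/dev/sdb1       500G   45G  430G   10%   /mnt/data"""


def parse_size(text: str) -> int:
    """Convert a human-readable size such as '12G' into KiB."""
    return int(float(text[:-1]) * SIZE_UNITS[text[-1]])


def use_percent(used_kib: int, avail_kib: int) -> int:
    """used / (used + available) as a percentage, rounded up."""
    total = used_kib + avail_kib
    return -(-100 * used_kib // total)  # integer ceiling division


def recompute_usage(df_text: str) -> List[Tuple[str, int]]:
    """Return (mount point, recomputed Use%) for each data row."""
    rows = []
    for line in df_text.splitlines()[1:]:  # skip the header row
        _device, _size, used, avail, _reported, mount = line.split()
        rows.append((mount, use_percent(parse_size(used), parse_size(avail))))
    return rows


print(recompute_usage(DF_OUTPUT))
```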

Key directories:

/ — Root filesystem (OS, applications)
/var — Logs, databases, variable data
/home — User files
/tmp — Temporary files

Disk Health

CtrlOps monitors:

Disk I/O — Read/write operations per second
Latency — Time for disk operations
Throughput — Data transfer rate
Errors — SMART errors, bad sectors

Disk failures are often preceded by increased error rates. Monitor SMART data regularly.

Cleaning Up Disk Space

Find large files:

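# Largest directories under /var
sudo du -h /var | sort -rh | head -20

# Log files older than 30 days and larger than 100 MB
find /var/log -name "*.log" -mtime +30 -size +100M

# Docker cleanup (removes all unused images, stopped containers, and unused networks)
docker system prune -a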

Usually safe to delete:

Old logs (> 30 days)
Package caches (apt clean)
Temporary files
Rotated backups

Network Monitoring

Network Metrics

Bandwidth Usage:

Incoming (RX): Data downloaded
Outgoing (TX): Data uploaded

Measured in:

Bytes per second (B/s)
Kilobytes per second (KB/s)
Megabytes per second (MB/s)
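
Rates like these are typically derived from cumulative byte counters (on Linux, /proc/net/dev exposes them per interface) sampled at two points in time. The sketch below shows the arithmetic and the unit scaling. The function names are illustrative, and the 1 KB = 1024 bytes convention is an assumption, since the text does not say which convention CtrlOps uses.

```python
def transfer_rate(bytes_before: int, bytes_after: int, elapsed_seconds: float) -> float:
    """Average bytes per second between two counter samples."""
    if elapsed_seconds <= 0:
        raise ValueError("elapsed_seconds must be positive")
    return (bytes_after - bytes_before) / elapsed_seconds


def format_rate(bytes_per_second: float) -> str:
    """Scale a rate into B/s, KB/s, or MB/s (assuming 1 KB = 1024 bytes)."""
    value = bytes_per_second
    for unit in ("B/s", "KB/s"):
        if value < 1024:
            return f"{value:.1f} {unit}"
        value /= 1024
    # Anything larger stays in MB/s, the largest unit listed above.
    return f"{value:.1f} MB/s"


# Two invented RX counter samples taken 2 seconds apart.
rate = transfer_rate(0, 5 * 1024 * 1024, 2.0)
print(format_rate(rate))
```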

Packet Statistics:

Packets received/transmitted
Errors
Dropped packets
Collisions

Network Troubleshooting

High latency:

# Check connection to external site (5 packets, then stop)
ping -c 5 google.com

# Trace route
mtr google.com

Slow transfers:

Check bandwidth utilization
Look for packet loss
Verify network interface speed
Check for throttling

Bandwidth Monitoring

CtrlOps tracks:

Daily/weekly/monthly usage
Peak usage times
Per-process network usage
Unusual traffic patterns

Process Monitoring

System services:

sshd — SSH daemon
systemd — Init system
cron — Scheduled tasks
rsyslogd — Logging

Application-specific:

Web server (nginx, Apache)
Database (MySQL, PostgreSQL)
Application workers
Background jobs

Kill vs Terminate

Soft kill (SIGTERM):

kill <pid>

Politely asks process to shut down. Allows cleanup.

Hard kill (SIGKILL):

kill -9 <pid>

Forcefully terminates. Use when soft kill fails.

SIGKILL cannot be caught or handled, so the process gets no chance to clean up. It may leave behind temp files or corrupt data.
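
The usual pattern is to try the soft kill first and escalate only if the process does not exit within a grace period. Here is a minimal sketch of that pattern using Python's `subprocess` module; the function name and the 5-second default are arbitrary choices for illustration, not CtrlOps behavior.

```python
import subprocess
import sys


def stop_process(proc: subprocess.Popen, grace_seconds: float = 5.0) -> str:
    """Try SIGTERM first; fall back to SIGKILL if the process lingers."""
    proc.terminate()  # sends SIGTERM on POSIX
    try:
        proc.wait(timeout=grace_seconds)
        return "terminated"
    except subprocess.TimeoutExpired:
        proc.kill()  # sends SIGKILL on POSIX
        proc.wait()  # reap the process so it does not remain a zombie
        return "killed"


# Demo: start a child that would sleep for a minute, then stop it.
child = subprocess.Popen([sys.executable, "-c", "import time; time.sleep(60)"])
print(stop_process(child))
```

The final `wait()` after `kill()` matters: without it the dead child stays in the process table as a zombie until the parent collects its exit status.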

Historical Data

Performance Graphs

CtrlOps keeps historical data for:

CPU usage (hourly, daily, weekly)
Memory consumption
Disk I/O
Network traffic

Useful for:

Identifying trends
Capacity planning
Troubleshooting past issues
Proving SLA compliance

Setting Up Alerts

Configure alerts for:

CPU > 80% for 5 minutes
Memory > 90%
Disk < 10% free
Load average > number of cores
Network errors increasing
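
A rule such as "CPU > 80% for 5 minutes" needs to track how long the value has stayed above the threshold, not just the latest sample. The class below is a minimal sketch of that logic; its name and interface are hypothetical and do not describe the CtrlOps alert engine.

```python
from typing import Optional


class SustainedThresholdAlert:
    """Fires only after a value has exceeded a threshold continuously."""

    def __init__(self, threshold: float, duration_seconds: float) -> None:
        self.threshold = threshold
        self.duration_seconds = duration_seconds
        self.breach_started_at: Optional[float] = None

    def observe(self, timestamp: float, value: float) -> bool:
        """Record one sample; return True if the alert should fire."""
        if value <= self.threshold:
            # Back under the threshold: reset the timer.
            self.breach_started_at = None
            return False
        if self.breach_started_at is None:
            self.breach_started_at = timestamp
        return timestamp - self.breach_started_at >= self.duration_seconds


# CPU > 80% for 5 minutes (300 seconds), with invented samples.
cpu_alert = SustainedThresholdAlert(threshold=80.0, duration_seconds=300.0)
samples = [(0, 85.0), (150, 90.0), (300, 88.0), (330, 70.0), (360, 95.0)]
results = [cpu_alert.observe(ts, value) for ts, value in samples]
print(results)
```

Resetting the timer on any sample at or below the threshold prevents brief spikes from accumulating into an alert.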

Notification channels:

Email
Slack
SMS (critical alerts)
Webhook

Performance Optimization

CPU Optimization

Identify inefficient processes
Optimize database queries
Use caching (Redis, Memcached)
Scale horizontally (add servers)
Upgrade hardware (more/faster cores)

Memory Optimization

Find memory leaks (valgrind, profiling)
Reduce buffer sizes
Add swap space (not ideal, but buys time)
Add more RAM
Optimize application code

Disk Optimization

Use SSD instead of HDD
Implement RAID for redundancy/speed
Partition wisely
Monitor inode usage
Schedule regular cleanup

Network Optimization

Enable compression
Use CDN for static content
Implement caching
Upgrade bandwidth
Optimize payload sizes

Reporting & Analytics

Weekly Reports

Automated email with:

Average resource usage
Peak usage times
Trends vs previous week
Recommendations

Capacity Planning

Based on trends, predict when you'll need:

More CPU cores
Additional RAM
Larger disks
More bandwidth

Example:

"Current growth rate suggests disk will be full in 45 days. Consider archiving old logs or expanding storage."
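
A projection like this can be as simple as a straight-line extrapolation from two usage samples. The sketch below assumes linear growth, which real workloads rarely follow exactly; the sample figures are invented so that the result matches the 45-day example above, and the function names are illustrative.

```python
from typing import List, Optional, Tuple


def daily_growth(samples: List[Tuple[float, float]]) -> float:
    """Average GB/day between the first and last (day, used_gb) samples."""
    (first_day, first_used), (last_day, last_used) = samples[0], samples[-1]
    if last_day == first_day:
        raise ValueError("samples must span more than one point in time")
    return (last_used - first_used) / (last_day - first_day)


def days_until_full(
    capacity_gb: float, used_gb: float, growth_gb_per_day: float
) -> Optional[float]:
    """Days until the disk fills, or None if usage is flat or shrinking."""
    if growth_gb_per_day <= 0:
        return None
    return (capacity_gb - used_gb) / growth_gb_per_day


# Invented history: 225 GB used a month ago, 255 GB used today, 300 GB disk.
history = [(0.0, 225.0), (30.0, 255.0)]
growth = daily_growth(history)
print(growth, days_until_full(300.0, 255.0, growth))
```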

Summary

Monitor these key metrics:

✅ CPU Load Average — Should be < number of cores
✅ Memory Available — Don't let it hit zero
✅ Disk Free Space — Keep 15%+ free
✅ Network Latency — Should be stable
✅ Process Health — Watch for zombies, high CPU/memory

Proactive monitoring catches issues before they become outages. Set alert thresholds below critical levels so you have time to react.