Production Diagnostics Guide
Symptom-based troubleshooting using Ze's built-in diagnostic commands. All commands work on gokrazy appliances without external Linux tools.
Quick Reference
| Symptom | First Command |
|---|---|
| BGP session won't establish | show tcp-check <peer-ip> 179 |
| Path/routing issue | show traceroute <dest> or monitor traceroute <dest> |
| BGP session flapping | show system kernel-log level warning count 50 |
| High CPU | show system profile cpu duration 10s |
| Memory leak | show system profile heap |
| FD exhaustion | show system file-descriptors summary |
| Goroutine leak | show system goroutines summary |
| DNS failure | show dns lookup <name> type A |
| Latency/reachability | monitor ping <target> |
| Process killed | show system kernel-log level err |
| Route/link/addr changes | monitor system netlink all |
| Packet-level debugging | show capture interface eth0 tcp port 179 count 10 format text |
Failure Categories
1. BGP Session Won't Establish
Verify TCP connectivity:
show tcp-check <peer-ip> 179
If the result is "refused", the peer is reachable but nothing is listening on port 179. If it is "timeout", a firewall or routing issue is dropping the handshake.
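The refused/timeout distinction can be reproduced outside Ze for comparison. The following is a minimal sketch using Python's standard socket module, not Ze's implementation; the function names and the extra "unreachable" and "error" labels are assumptions made for illustration.

```python
import errno
import socket


def classify_connect_error(exc: OSError) -> str:
    """Map a failed connect() to the labels used in this guide."""
    if isinstance(exc, ConnectionRefusedError):
        return "refused"      # peer answered with RST: nothing is listening
    if isinstance(exc, (socket.timeout, TimeoutError)):
        return "timeout"      # no answer at all: firewall drop or routing issue
    if exc.errno in (errno.ENETUNREACH, errno.EHOSTUNREACH):
        return "unreachable"  # hypothetical label: no local route to the peer
    return "error"            # hypothetical label: anything else (e.g. DNS failure)


def tcp_check(host: str, port: int, timeout: float = 3.0) -> str:
    """Attempt one TCP handshake and report the outcome as a short label."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return "open"
    except OSError as exc:
        return classify_connect_error(exc)
```

Calling `tcp_check("<peer-ip>", 179)` from a host on the same path as the router gives a second opinion on whether the problem is the peer or the path.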
Trace the path to the peer:
show traceroute <peer-ip>
If hops stop before the peer, there is a routing or firewall issue at that hop.
Check sockets for existing connections:
show system sockets tcp state ESTABLISHED port 179
Check DNS resolution of peer hostname:
show dns lookup <peer-hostname>
Inspect raw BGP messages:
show capture raw start bgp
# Wait for connection attempt, then:
show capture raw dump bgp
2. BGP Session Flapping
Check kernel log for link events:
show system kernel-log level warning count 50
Stream live link/route events to correlate with flaps:
monitor system netlink all
Check socket state churn:
show system sockets tcp port 179
Inspect raw packets during flap:
show capture raw start bgp
show capture raw dump bgp pcap
Live packet capture on the interface (replaces tcpdump):
show capture interface eth0 tcp port 179 count 20 format text
show capture interface eth0 tcp port 179 duration 10s format pcap
3. BGP Routes Not Received
show bgp peer <selector> detail
show capture raw start bgp
Dump the capture (show capture raw dump bgp) and check the UPDATE messages for the expected NLRI.
4. BGP Routes Not Advertised
show bgp peer <selector> detail
Check advertised route counts, filter configuration, and export policy.
5. High CPU Usage
Capture CPU profile:
show system profile cpu duration 10s
Base64-decode the output, then analyze the resulting file with go tool pprof (see Profiling Workflow).
Check goroutine distribution:
show system goroutines summary
A large count in one state (e.g., "running") suggests a hot loop.
Check system metrics:
show system cpu
6. Memory Leak / High Memory
Capture heap profile:
show system profile heap
Check process memory from the kernel's perspective:
show system memory-map
Compare VmRSS with Go's heap-in-use from show system memory:
show system memory
If VmRSS is much larger than heap-in-use, memory is held outside Go's heap (cgo, mmap).
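The comparison above is simple arithmetic once both numbers are in the same unit. The sketch below assumes VmRSS is given in kB (as in /proc/<pid>/status) and heap-in-use in bytes; the function name and the sample numbers are made up for illustration. Some gap is always expected, because goroutine stacks and runtime metadata are not counted in heap-in-use.

```python
def memory_outside_heap(vm_rss_kib: int, heap_inuse_bytes: int) -> tuple[int, float]:
    """Return (bytes resident outside the Go heap, RSS / heap-in-use ratio)."""
    rss_bytes = vm_rss_kib * 1024  # VmRSS is reported in kB
    outside = max(rss_bytes - heap_inuse_bytes, 0)
    ratio = rss_bytes / heap_inuse_bytes if heap_inuse_bytes else float("inf")
    return outside, ratio


# Made-up example: 512 MiB resident, 128 MiB of Go heap in use.
outside, ratio = memory_outside_heap(vm_rss_kib=524288, heap_inuse_bytes=128 * 1024 * 1024)
print(f"{outside / 2**20:.0f} MiB outside the Go heap, RSS is {ratio:.1f}x heap-in-use")
# prints: 384 MiB outside the Go heap, RSS is 4.0x heap-in-use
```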
Check goroutines for leaks:
show system goroutines summary
show system goroutines blocked
7. File Descriptor Exhaustion
Check FD usage and limits:
show system file-descriptors summary
If total is close to soft-limit, the process is near exhaustion.
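"Close to soft-limit" can be made concrete with a threshold. A minimal sketch follows; the 80% warning ratio and the function name are assumptions chosen for illustration, not values taken from Ze.

```python
def fd_headroom(total: int, soft_limit: int, warn_ratio: float = 0.8) -> dict:
    """Summarize how close the process is to its soft FD limit."""
    if soft_limit <= 0:
        raise ValueError("soft_limit must be positive")
    used = total / soft_limit
    return {
        "used_ratio": used,                       # fraction of the soft limit in use
        "free": soft_limit - total,               # descriptors left before exhaustion
        "near_exhaustion": used >= warn_ratio,    # assumed threshold, tune as needed
    }
```

For example, `fd_headroom(900, 1024)` reports 124 free descriptors and flags the process as near exhaustion.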
Inspect individual FDs:
show system file-descriptors detail
Look for unexpected socket or pipe accumulation.
Cross-reference with sockets:
show system sockets
8. Goroutine Leak
Get current count and distribution:
show system goroutines summary
Normal count depends on configuration. A steadily increasing count indicates a leak.
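One way to tell a steady increase from normal fluctuation is to sample the count at regular intervals and check the trend. The heuristic below is a sketch only: the function name, the minimum sample count, the 80% "mostly rising" rule, and the 20% growth threshold are all assumptions.

```python
def looks_like_leak(counts: list[int], min_samples: int = 5, min_growth: float = 0.2) -> bool:
    """Heuristic: the count rises in most intervals and grows noticeably overall."""
    if len(counts) < min_samples:
        return False  # too few samples to call it a trend
    rises = sum(1 for a, b in zip(counts, counts[1:]) if b > a)
    mostly_rising = rises >= 0.8 * (len(counts) - 1)
    grew = counts[-1] >= counts[0] * (1 + min_growth)
    return mostly_rising and grew


print(looks_like_leak([100, 120, 140, 170, 200, 240]))  # steady climb
print(looks_like_leak([100, 140, 90, 130, 95, 105]))    # noisy but flat
```

The first series is flagged and the second is not, because a count that oscillates around a stable level fails the "mostly rising" test.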
Find blocked goroutines:
show system goroutines blocked
Large numbers stuck in "chan receive" or "select" with the same stack suggest leaked goroutines.
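Grouping goroutines by state and top stack frames makes repeated stacks stand out. The sketch below assumes the dump follows the standard Go runtime text format (a "goroutine N [state]:" header, function lines, tab-indented file lines, blank line between goroutines); whether Ze emits exactly this format is an assumption, and the sample stack is made up.

```python
import re
from collections import Counter

HEADER = re.compile(r"^goroutine \d+ \[([^\],]+)")


def group_stacks(dump: str, depth: int = 3) -> Counter:
    """Count goroutines that share a state and the same top `depth` frames."""
    groups: Counter = Counter()
    for block in dump.strip().split("\n\n"):
        lines = block.splitlines()
        match = HEADER.match(lines[0]) if lines else None
        if not match:
            continue
        # Function lines are unindented; file:line lines start with a tab.
        frames = [
            ln.rsplit("(", 1)[0]  # drop the argument list, keep the function name
            for ln in lines[1:]
            if ln and not ln[0].isspace() and not ln.startswith("created by")
        ]
        groups[(match.group(1), tuple(frames[:depth]))] += 1
    return groups


# Made-up sample dump for illustration only.
SAMPLE = """goroutine 7 [chan receive, 12 minutes]:
main.(*Peer).readLoop(0xc000012000)
\t/src/peer.go:88 +0x45
created by main.(*Peer).Start
\t/src/peer.go:40 +0x9a

goroutine 8 [chan receive, 12 minutes]:
main.(*Peer).readLoop(0xc000012080)
\t/src/peer.go:88 +0x45
created by main.(*Peer).Start
\t/src/peer.go:40 +0x9a

goroutine 9 [select]:
main.(*Server).run(0xc000020000)
\t/src/server.go:17 +0x120
"""

for (state, frames), count in group_stacks(SAMPLE).most_common():
    print(count, state, " <- ".join(frames))
```

The largest groups are the first candidates to inspect for leaked goroutines.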
Full stack dump for analysis:
show system goroutines full
9. Process Killed (OOM / Signal)
Check kernel log for OOM killer or signal events:
show system kernel-log level err
Look for "Out of memory" or "Killed process" messages.
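When the kernel log is long, the kill events can be extracted mechanically. This sketch matches the "Killed process <pid> (<name>)" wording that the Linux OOM killer logs; the function name and the sample lines are illustrative, and the exact line layout of Ze's kernel-log output is an assumption.

```python
import re

OOM_KILL = re.compile(r"Killed process (\d+) \(([^)]+)\)")


def find_oom_kills(log_lines: list[str]) -> list[tuple[int, str]]:
    """Return (pid, process name) for every OOM kill found in the log lines."""
    hits = []
    for line in log_lines:
        match = OOM_KILL.search(line)
        if match:
            hits.append((int(match.group(1)), match.group(2)))
    return hits


# Made-up log excerpt for illustration only.
sample = [
    "eth0: Link is Down",
    "Out of memory: Killed process 412 (ze) total-vm:2097152kB, anon-rss:1048576kB",
]
print(find_oom_kills(sample))  # prints: [(412, 'ze')]
```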
Check current memory state:
show system memory-map
Review warnings and errors:
show warnings
show errors
10. Interface Down / Link Flap
show interface
show system kernel-log level warning
monitor system netlink link
The monitor system netlink link command streams live link state changes (up/down/create/delete). Press Esc to stop.
11. Kernel Route Missing
show route
show system sockets
monitor system netlink route
The monitor system netlink route command streams live kernel route changes, so additions and deletions can be observed in real time.
12. DNS Resolution Failure
Test resolution directly:
show dns lookup <hostname> type A
Check cache state:
show dns cache stats
show dns cache list
A high miss rate combined with a low hit rate suggests upstream resolver issues.
show dns cache list shows all cached entries with their remaining TTL, which is
useful for spotting stale or unexpectedly short-lived records.
Inspect a specific cached name:
show dns cache record <hostname>
Flush and start fresh:
clear dns cache
Verify socket connectivity to resolver:
show tcp-check <dns-server> 53
13. Config Commit Failure
show errors
show config diff
14. Plugin Crash / Restart Loop
show errors count 20
show warnings
show system goroutines summary
show system kernel-log level err count 20
15. CLI / SSH Unresponsive
If the daemon is reachable via another path (web, MCP):
show system goroutines blocked
show system sockets
show system file-descriptors summary
Look for goroutines stuck in "semacquire" or "IO wait".
16. Web UI Unreachable
show tcp-check <router-ip> 3443
show system sockets tcp port 3443
show system goroutines summary
show errors
17. Telemetry / Metrics Gaps
show metrics-query ze_peer_state
show system sockets
show system profile cpu duration 5s
18. Latency and Reachability
Continuous ping to measure latency and loss:
monitor ping <target>
monitor ping <target> interval 500ms
Shows live stats: Sent, Recv, Loss%, Last, Min, Avg, Max, StDev. Use | log for scrollback output suitable for correlation with other events.
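To make the reported columns concrete, the sketch below derives the same statistics from a list of round-trip times. It is an illustration rather than Ze's code: the function name is hypothetical, and using the population standard deviation for StDev is an assumption.

```python
import math


def ping_stats(sent: int, rtts_ms: list[float]) -> dict:
    """Compute Sent/Recv/Loss%/Last/Min/Avg/Max/StDev from received RTTs."""
    recv = len(rtts_ms)
    stats = {
        "sent": sent,
        "recv": recv,
        "loss_pct": 100.0 * (sent - recv) / sent if sent else 0.0,
    }
    if recv:
        avg = sum(rtts_ms) / recv
        variance = sum((r - avg) ** 2 for r in rtts_ms) / recv  # population variance
        stats.update(
            last=rtts_ms[-1],
            min=min(rtts_ms),
            avg=avg,
            max=max(rtts_ms),
            stdev=math.sqrt(variance),
        )
    return stats
```

With five probes sent and replies of 10, 12, 14, and 12 ms, this yields 20% loss, an average of 12 ms, and a standard deviation of about 1.41 ms. A rising StDev with stable Avg usually points to jitter rather than a path change.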
Continuous traceroute to observe path changes:
monitor traceroute <target>
monitor traceroute <target> | log | origin
This gives an mtr-style display with per-hop loss and latency statistics.
Adding | log | origin appends one line per round and annotates hops with ASN
names, which helps identify the network in which a path change occurs.
Using | log | resolve adds reverse DNS hostnames instead.
Profiling Workflow
CPU Profile
show system profile cpu duration 10s
Save the base64 output, decode, and analyze:
echo '<base64-data>' | base64 -d > cpu.pprof
go tool pprof cpu.pprof
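Where a shell is not convenient, the decode step above can be done in Python. This is a sketch under assumptions: the helper names are hypothetical, and the gzip check relies on Go's pprof format being gzip-compressed, which may not hold if Ze emits a different encoding.

```python
import base64
from pathlib import Path


def decode_profile(b64_text: str) -> bytes:
    """Decode pasted base64, ignoring line wraps and stray whitespace."""
    return base64.b64decode("".join(b64_text.split()))


def looks_like_pprof(raw: bytes) -> bool:
    """Go's pprof files are gzip-compressed; check for the gzip magic bytes."""
    return raw[:2] == b"\x1f\x8b"


def save_profile(b64_text: str, out_path: str) -> int:
    """Write the decoded profile to out_path and return its size in bytes."""
    raw = decode_profile(b64_text)
    if not looks_like_pprof(raw):
        print("warning: decoded data is not gzip; the paste may be truncated")
    Path(out_path).write_bytes(raw)
    return len(raw)
```

After `save_profile(pasted_text, "cpu.pprof")`, the file can be opened with go tool pprof as shown above.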
Heap Profile
show system profile heap
Same workflow: decode base64, analyze with go tool pprof.
Concurrent Profiling
CPU profiling is mutex-protected. A second concurrent request returns an error. This prevents resource contention from overlapping profiles.
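The behavior described above matches a non-blocking "try-lock" pattern: the second caller fails immediately instead of queueing. The sketch below illustrates that pattern in Python; it is not Ze's implementation, and the names ProfileBusyError and run_cpu_profile are hypothetical.

```python
import threading
import time

_profile_lock = threading.Lock()


class ProfileBusyError(RuntimeError):
    """Raised when a CPU profile is already being collected."""


def run_cpu_profile(duration_s: float) -> str:
    """Collect one profile at a time; reject overlapping requests."""
    if not _profile_lock.acquire(blocking=False):  # try-lock: never wait
        raise ProfileBusyError("CPU profile already in progress")
    try:
        time.sleep(duration_s)  # stand-in for sampling the process
        return "profile-data"
    finally:
        _profile_lock.release()
```

In practice this means a failed request should simply be retried once the first profile's duration has elapsed.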
Platform Detection
show system platform reports the runtime platform type and capability flags:
show system platform
show system platform | json
Detected platforms: gokrazy, systemd, container, plain-linux, darwin.
Capability flags: read-only-root, perm-available, systemd-available, gokrazy-update-socket, gokrazy-ui-available, reboot-allowed, persistent-storage-writable, fd-limit-soft-current, fd-limit-hard-max, fd-limit-raisable.
Platform information is also included in ze doctor checks (e.g. gokrazy /perm writability) and ze support archives (as the platform module).
Platform Notes
Commands that read /proc (sockets, kernel-log, file-descriptors, memory-map) are Linux-only; on other platforms they return "not available on this platform". The remaining commands (tcp-check, traceroute, goroutines, dns, profile) work on all platforms, although traceroute requires CAP_NET_RAW.
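Scripts that drive these commands across mixed platforms can apply the same rule up front. A minimal sketch, assuming the command names listed above; the helper name and the "available" return value are hypothetical.

```python
import sys

# Commands that depend on /proc, per the note above.
PROC_COMMANDS = {"sockets", "kernel-log", "file-descriptors", "memory-map"}


def availability(command: str, platform: str = sys.platform) -> str:
    """Predict whether a diagnostic command can run on the given platform."""
    if command in PROC_COMMANDS and not platform.startswith("linux"):
        return "not available on this platform"
    return "available"
```

For example, `availability("sockets", "darwin")` returns the same message the command itself would, while `availability("dns", "darwin")` returns "available".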