Alright, here’s a surgical, no-BS path to isolating what’s actually causing the multi-second stalls. Given your symptoms (SSH connect delay, keystroke lag, HTTP latency, idle CPU, low IO/bandwidth), the most likely culprits are:
(1) name-resolution blocking, (2) NIC/driver/IRQ softirq saturation on a single core, (3) kernel power/NUMA/THP stalls, (4) conntrack or firewall pathology, (5) PCIe/AER/hardware errors. Below is a tight checklist to run in order; each step confirms or eliminates an entire branch of the tree.
0) When it’s slow, capture the moment (one-shot triage)
Run these during a slow spell; keep a tmux session ready so a stall doesn’t catch you without a shell (a background capture loop follows the checklist below).
# 10-second snapshots
date; uptime; who; ss -s; netstat -s | egrep 'retransmit|timeouts|loss'
cat /proc/interrupts | egrep -i 'eth|mlx|enp|eno' # NIC IRQ distribution
grep -E 'RCU stall|soft lockup|hard LOCKUP' /var/log/kern.log /var/log/syslog || true
cat /proc/softirqs
top -b -n1 | head -50
pidstat -w -u -r -d 1 5
dmesg -T | tail -n 200
What to look for:
- ksoftirqd/… pegged on one CPU while others idle → NIC/IRQ/driver queueing.
- TCP retransmits/timeouts climbing → network path/driver/MTU/offload problem.
- Any AER/MCE/PCIe errors → hardware/PCIe bus.
- Run queue ~0 but stalls exist → blocking on DNS/IO/entropy/locks.
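If the stalls are too intermittent to catch live, park a capture loop in that tmux session; a minimal sketch (interval and log path are arbitrary):
while sleep 5; do
  { date; uptime; ss -s; grep -E 'NET_RX|NET_TX' /proc/softirqs; } >> /var/tmp/stall-$(date +%F).log
done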
1) Fastest likely win: kill all name-resolution latency
Multi-second SSH connection delay screams reverse-DNS or resolver timeouts.
Do these now (no machine restart required):
# sshd: never reverse-resolve connecting clients
echo 'UseDNS no' | sudo tee -a /etc/ssh/sshd_config.d/99-nodns.conf
echo 'AddressFamily inet' | sudo tee -a /etc/ssh/sshd_config.d/99-ipv4.conf # optional: skips IPv6 stalls
sudo systemctl reload ssh || sudo systemctl restart ssh
# MariaDB: never resolve hostnames
echo -e "[mysqld]\nskip-name-resolve\n" | sudo tee /etc/mysql/conf.d/99-skip-name-resolve.cnf
sudo systemctl restart mariadb
# Ensure your own hostname never triggers DNS
getent hosts $(hostname -f) || echo "127.0.1.1 $(hostname -f) $(hostname)" | sudo tee -a /etc/hosts
# Resolver sanity: avoid dead upstreams; prefer a local caching resolver
grep -v '^#' /etc/resolv.conf
# If you see flaky providers, temporarily point to a local unbound or to 1.1.1.1/9.9.9.9 to test.
Quick test:
time ssh -o 'GSSAPIAuthentication=no' localhost 'echo ok' # should be < 0.1s wall on-box
If SSH and PHP/MySQL latencies suddenly vanish or become rare → DNS was your root cause (or a major contributor).
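To nail reverse DNS specifically, time a PTR lookup for an address that actually connects to you (203.0.113.5 is a placeholder; use a real client IP):
time dig -x 203.0.113.5 +tries=1 +time=2   # a wall time near the timeout, or SERVFAIL, is the smoking gun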
2) NIC, driver, and softirq: the classic “one hot core, everything else idle”
Intermittent multi-second stalls with “idle CPUs” often hide in softirq on a single CPU servicing the NIC.
Check distribution & drops:
ethtool -S <nic> | egrep 'rx_dropped|rx_missed|errors|timeout|reset'
cat /proc/interrupts | egrep -i 'eth|mlx|enp|eno'
Fixes (apply all; they’re safe on big iron):
# Spread IRQs across cores
sudo apt-get install irqbalance -y
sudo systemctl enable --now irqbalance
# Enable RSS and multiple RX/TX queues
ethtool -l <nic> # see current channels
ethtool -L <nic> combined 32 # example, set queues to 32 (match NIC capability)
# RPS/RFS to distribute processing beyond IRQ CPU
echo ffffffff | sudo tee /sys/class/net/<nic>/queues/rx-*/rps_cpus   # mask covers CPUs 0-31; >32 cores need comma-separated 32-bit groups
echo 32768 | sudo tee /proc/sys/net/core/rps_sock_flow_entries
for f in /sys/class/net/<nic>/queues/rx-*/rps_flow_cnt; do echo 2048 | sudo tee $f; done
Disable problematic offloads temporarily (test):
# Test run; if it helps, make persistent
ethtool -K <nic> gro off lro off tso off gso off rx off tx off   # rx/tx here are the checksum offloads
# If that helps, re-enable rx/tx, then selectively restore tso/gso/gro later.
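If the offload test wins, one distro-agnostic way to persist it is a oneshot unit; a sketch (unit name, NIC name, and which offloads you leave off are assumptions):
cat <<'EOF' | sudo tee /etc/systemd/system/nic-offloads.service
[Unit]
Description=Apply NIC offload settings
After=network-pre.target
Before=network.target

[Service]
Type=oneshot
ExecStart=/usr/sbin/ethtool -K eth0 gro off lro off

[Install]
WantedBy=multi-user.target
EOF
sudo systemctl enable --now nic-offloads.service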
MTU/fragmentation sanity:
ip link show <nic> # confirm MTU
ping -M do -s 8972 <peer> # jumbo path test; or -s 1472 for std MTU
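If the jumbo ping fails, tracepath shows the hop where the path MTU actually drops (peer is a placeholder):
tracepath -n <peer>   # watch the pmtu value shrink at the offending hop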
If this removes the “sometimes instant, sometimes 3s” behavior → you had softirq or offload pathologies.
3) Power management, C-states, and NUMA/THP: stop the scheduler from sleeping you to death
Big multi-socket/NUMA boxes with 256 cores often get wrecked by deep C-states, THP defrag, and automatic NUMA balancing.
Immediate runtime toggles:
# Performance governor
for c in /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor; do echo performance | sudo tee $c; done
# Disable THP (esp. defrag)
echo never | sudo tee /sys/kernel/mm/transparent_hugepage/enabled
echo never | sudo tee /sys/kernel/mm/transparent_hugepage/defrag
# NUMA auto-balancer can spike latencies
echo 0 | sudo tee /proc/sys/kernel/numa_balancing
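C-states are in the section title for a reason: deep idle states add wakeup latency. A runtime sketch to inspect and cap them (state indices vary by CPU and idle driver; treating state 3 and up as “deep” is an assumption, check the names first):
grep . /sys/devices/system/cpu/cpu0/cpuidle/state*/name   # map index to C-state name
for f in /sys/devices/system/cpu/cpu*/cpuidle/state[3-9]/disable; do echo 1 | sudo tee $f >/dev/null; done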
VM tunables that help PHP/MySQL latency:
sysctl -w vm.swappiness=1
sysctl -w vm.dirty_background_ratio=5
sysctl -w vm.dirty_ratio=15
sysctl -w vm.compaction_proactiveness=0   # kernel 5.9+; skip if the key doesn't exist
sysctl -w vm.zone_reclaim_mode=0
Service pinning (keeps hot paths local to one NUMA node):
- Pin nginx and php-fpm worker processes to the same NUMA node as the NIC that serves them (numactl --cpunodebind=0 --membind=0 via a systemd drop-in; sketch below).
- Pin mariadb to a different node if you have separate IO paths; otherwise co-locate for locality.
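A sketch of that drop-in for php-fpm (unit name, binary path, and node 0 are assumptions; check which node the NIC sits on first):
cat /sys/class/net/<nic>/device/numa_node
sudo mkdir -p /etc/systemd/system/php-fpm.service.d
cat <<'EOF' | sudo tee /etc/systemd/system/php-fpm.service.d/numa.conf
[Service]
ExecStart=
ExecStart=/usr/bin/numactl --cpunodebind=0 --membind=0 /usr/sbin/php-fpm --nodaemonize
EOF
sudo systemctl daemon-reload && sudo systemctl restart php-fpm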
4) ZFS present? Make sure it’s not poking the kernel even if “unused”
Even “idle” ZFS can do background work that causes stalls.
zpool status
zpool iostat 5
cat /proc/spl/kstat/zfs/arcstats | egrep 'hits|misses|size'
- If you’re not actively using SNEED for live traffic: export the pool and unload ZFS to test:
zpool export SNEED
modprobe -r zfs
- If you are using it: set zfs_arc_max (e.g., 64–128G), and ensure no scrub/resilver is running during tests.
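Capping the ARC, runtime and persistent, as a sketch (64G is an example figure, not a sizing recommendation):
echo $((64 * 1024**3)) | sudo tee /sys/module/zfs/parameters/zfs_arc_max
echo 'options zfs zfs_arc_max=68719476736' | sudo tee /etc/modprobe.d/zfs.conf   # same 64G, applied at module load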
5) Conntrack / firewall stalls
A half-full or thrashing conntrack can add seconds.
sysctl net.netfilter.nf_conntrack_max
sysctl net.netfilter.nf_conntrack_count   # live entry count
dmesg -T | egrep -i 'conntrack.*full|nf_conntrack'
If close to full or dropping, raise headroom:
sysctl -w net.netfilter.nf_conntrack_max=$(( 1024*1024 ))
sysctl -w net.netfilter.nf_conntrack_buckets=$(( 1024*1024/4 ))   # read-only on older kernels; set /sys/module/nf_conntrack/parameters/hashsize there
Also check iptables/nftables rule-set size—giant per-connection rules are death.
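Two quick size checks (exact thresholds are a judgment call, but thousands of linear rules on the hot path is a red flag):
sudo iptables-save | wc -l
sudo nft list ruleset | wc -l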
6) PCIe / hardware flakiness (motherboard, bus, NIC, NVMe)
You suspected this; confirm before buying parts.
dmesg -T | egrep -i 'pcie|aer|corrected|uncorrected|MCE|EDAC|nvme timeout|NVMe reset'
journalctl -k --since "2 hours ago" | egrep -i 'pcie|aer|mce|nvme|reset'
sudo lspci -vv | egrep -i 'errors|lnksta:'   # watch for error flags and LnkSta speed/width downgrades
- Any AER corrected/uncorrected errors, NVMe timeouts/resets, or MCEs → you likely have a lane, slot, or board issue. Reseat NIC/NVMe; try a different slot; disable ASPM in BIOS; update NIC firmware.
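You can often test the ASPM theory at runtime before touching the BIOS (the policy file exists when ASPM support is built in, and the BIOS may lock it):
cat /sys/module/pcie_aspm/parameters/policy   # current policy shown in [brackets]
echo performance | sudo tee /sys/module/pcie_aspm/parameters/policy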
7) Prove/deny that it’s the network path at all
Bypass every layer and hit the box locally:
# Raw kernel network path (no DNS)
time nc -zv 127.0.0.1 22
time curl -s -o /dev/null -w '%{time_connect} %{time_starttransfer} %{time_total}\n' http://127.0.0.1/
# PHP-FPM ping (enable /status and /ping in php-fpm + nginx)
curl -s -o /dev/null -w '%{time_connect} %{time_starttransfer} %{time_total}\n' http://127.0.0.1/status?full
If localhost is instant while remote is slow → networking/IRQ/DNS.
If localhost is also stalling → kernel/NUMA/THP/ZFS/conntrack.
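To put numbers on “sometimes instant, sometimes 3s”, a one-line sampler (point it at whichever endpoint you’re blaming) logs one timing per second for correlation with dmesg timestamps:
while sleep 1; do printf '%s %s\n' "$(date +%T)" "$(curl -s -o /dev/null -w '%{time_total}' http://127.0.0.1/)"; done | tee /var/tmp/latency.log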
8) Attachment pipeline reality check (XenForo’s PHP streaming)
Yes, XenForo’s default attachment path sucks. Even so, that would not explain SSH typing lag. Fix the platform first. After the system is healthy:
- Put all attachment serving behind a dedicated subdomain that never hits PHP (object store + signed URLs or nginx X-Accel-Redirect from a minimal internal location).
- Even if XenForo wants PHP to gate access, do the auth in PHP then immediately X-Accel-Redirect to a file on local NVMe or to an S3-compatible endpoint (SeaweedFS volume server or MinIO gateway).
- In MariaDB, ensure innodb_flush_method=O_DIRECT, innodb_flush_log_at_trx_commit=2 (latency-friendly), innodb_buffer_pool_size sized to hot set, and skip-name-resolve (already above).
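Those settings as a MariaDB drop-in, for reference (the buffer pool size is a placeholder; size it to your hot set):
cat <<'EOF' | sudo tee /etc/mysql/conf.d/99-latency.cnf
[mysqld]
innodb_flush_method = O_DIRECT
innodb_flush_log_at_trx_commit = 2
innodb_buffer_pool_size = 64G
EOF
sudo systemctl restart mariadb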
9) Kernel net sysctls that reduce tail latency (apply after the fixes above)
sysctl -w net.core.somaxconn=65535
sysctl -w net.core.netdev_max_backlog=250000
sysctl -w net.ipv4.tcp_tw_reuse=1
sysctl -w net.ipv4.tcp_fin_timeout=20
sysctl -w net.ipv4.tcp_syncookies=1
sysctl -w net.ipv4.tcp_max_syn_backlog=262144
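sysctl -w is runtime-only; to persist these (plus the vm.* tunables from section 3), put them in a sysctl.d drop-in, e.g.:
printf 'net.core.somaxconn=65535\nnet.ipv4.tcp_tw_reuse=1\n' | sudo tee /etc/sysctl.d/99-latency.conf   # add the rest of the list
sudo sysctl --system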
Minimal “do this right now” bundle
- Disable DNS lookups in sshd and MariaDB; ensure /etc/hosts covers your hostnames.
- Set performance governor, disable THP and NUMA balancing.
- Spread NIC IRQs (irqbalance, RSS, RPS/RFS) and temporarily disable GRO/LRO/TSO/GSO to test.
- Check dmesg for AER/MCE/NVMe resets; if present, reseat/update/disable ASPM.
- Run the localhost vs remote curl/ssh timing to pin blame to network vs kernel/userspace.
If after steps 1–3 your SSH and HTTP latencies drop from “seconds” to “sub-100ms” most of the time, you’ve found it. If they don’t, paste the outputs from sections 0–3 and I’ll read the tea leaves and tell you exactly where it’s choking.