The Year of Endless Technical Problems

that sounds very frustrating

the fact that ssh hangs like that is particularly strange


unsolicited suggestion from a retard:

mprime/prime95 to check if CPUs will get max load + what temps, etc etc
 
Maybe I'm used to larger budgets, but I'd be running this stuff in a cluster of servers, even a small one: let ingress load balancers handle the traffic and get the mass file storage into its own distributed cluster of sorts. The move off minio seems like a good start, but to be in line with modern professional setups this stuff should be in a proper cluster with CPU and mem limits set and pods restarting regularly.

Sorry if that's not helpful and comes across as "toss 5k more at the problem", but my first thought was: why are we talking about one big server instead of, say, a couple of medium used servers off eBay for a cluster, plus all the freebies you get when you move to that model?
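For what it's worth, here's a rough sketch of what those per-service CPU/mem limits look like in a Kubernetes-style setup; the deployment name, image, replica count and numbers are all placeholders, not a recommendation for this site specifically:

# Hypothetical Deployment fragment; adjust names, image and limits to taste
kubectl apply -f - <<'EOF'
apiVersion: apps/v1
kind: Deployment
metadata:
  name: web-frontend                       # placeholder name
spec:
  replicas: 3                              # spread across nodes, restart freely
  selector:
    matchLabels: { app: web-frontend }
  template:
    metadata:
      labels: { app: web-frontend }
    spec:
      containers:
      - name: app
        image: example/app:latest          # placeholder image
        resources:
          requests: { cpu: "2", memory: 4Gi }    # what the scheduler guarantees
          limits:   { cpu: "4", memory: 8Gi }    # hard ceilings per pod
EOF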
 
i yeeted your post into chatgpt to find a solution, much fun.
Alright—here’s a surgical, no-BS path to isolate what’s actually causing the multi-second stalls. Given your symptoms (SSH connect delay, keystroke lag, HTTP latency, idle CPU, low IO/BW), the most likely culprits are: (1) name-resolution blocking, (2) NIC/driver/IRQ softirq saturation on a single core, (3) kernel power/NUMA/THP stalls, (4) conntrack or firewall pathology, (5) PCIe/AER/hardware errors. Below is a tight checklist that you can run in order. Each step either confirms/eliminates an entire branch of the tree.




0) When it’s slow, capture the moment (one-shot triage)​


Run these during a slow spell; keep a tmux ready.

# 10-second snapshots
date; uptime; who; ss -s; netstat -s | egrep 'retransmit|timeouts|loss'
cat /proc/interrupts | egrep -i 'eth|mlx|enp|eno' # NIC IRQ distribution
grep -E 'RCU stall|soft lockup|hard LOCKUP' /var/log/kern.log /var/log/syslog || true
cat /proc/softirqs
top -b -n1 | head -50
pidstat -w -u -r -d 1 5
dmesg -T | tail -n 200


What to look for


  • ksoftirqd/… pegged on one CPU while others idle → NIC/IRQ/driver queueing.
  • TCP retransmits/timeouts climbing → network path/driver/MTU/offload problem.
  • Any AER/MCE/PCIe errors → hardware/PCIe bus.
  • Run queue ~0 but stalls exist → blocking on DNS/IO/entropy/locks.



1) Fastest likely win: kill all name-resolution latency​


Multi-second SSH connection delay screams reverse-DNS or resolver timeouts.


Do these now (no restart of machine required):

# sshd: don't reverse-resolve connecting clients
echo 'UseDNS no' | sudo tee -a /etc/ssh/sshd_config.d/99-nodns.conf
echo 'AddressFamily inet' | sudo tee -a /etc/ssh/sshd_config.d/99-ipv4.conf # optional: skips IPv6 stalls
sudo systemctl reload ssh || sudo systemctl restart ssh

# MariaDB: never resolve hostnames
echo -e "[mysqld]\nskip-name-resolve\n" | sudo tee /etc/mysql/conf.d/99-skip-name-resolve.cnf
sudo systemctl restart mariadb

# Ensure your own hostname never triggers DNS
getent hosts $(hostname -f) || echo "127.0.1.1 $(hostname -f) $(hostname)" | sudo tee -a /etc/hosts

# Resolver sanity: avoid dead upstreams; prefer a local caching resolver
grep -v '^#' /etc/resolv.conf
# If you see flaky providers, temporarily point to a local unbound or to 1.1.1.1/9.9.9.9 to test.


Quick test:
time ssh -o 'GSSAPIAuthentication=no' localhost 'echo ok' # should be < 0.1s wall on-box


If SSH and PHP/MySQL latencies suddenly vanish or become rare → DNS was your root cause (or a major contributor).




2) NIC, driver, and softirq: the classic “one hot core, everything else idle”​


Intermittent multi-second stalls with “idle CPUs” often hide in softirq on a single CPU servicing the NIC.


Check distribution & drops


ethtool -S <nic> | egrep 'rx_dropped|rx_missed|errors|timeout|reset'
cat /proc/interrupts | egrep -i 'eth|mlx|enp|eno'



Fixes (apply all; they’re safe on big iron):


# Spread IRQs across cores
sudo apt-get install irqbalance -y
sudo systemctl enable --now irqbalance

# Enable RSS and multiple RX/TX queues
ethtool -l <nic> # see current channels
ethtool -L <nic> combined 32 # example, set queues to 32 (match NIC capability)

# RPS/RFS to distribute processing beyond IRQ CPU
echo ffffffff | sudo tee /sys/class/net/<nic>/queues/rx-*/rps_cpus
echo 32768 | sudo tee /proc/sys/net/core/rps_sock_flow_entries
for f in /sys/class/net/<nic>/queues/rx-*/rps_flow_cnt; do echo 2048 | sudo tee $f; done


Disable problematic offloads temporarily (test):


# Test run; if it helps, make persistent
ethtool -K <nic> gro off lro off tso off gso off rx off tx off
# If that helps, re-enable rx/tx, then selectively restore tso/gso/gro later.



MTU/fragmentation sanity:

ip link show <nic> # confirm MTU
ping -M do -s 8972 <peer> # jumbo path test; or -s 1472 for std MTU



If this removes the “sometimes instant, sometimes 3s” behavior → you had softirq or offload pathologies.




3) Power management, C-states, and NUMA/THP: stop the scheduler from sleeping you to death​


Big multi-socket/NUMA boxes with 256 cores often get wrecked by deep C-states, THP defrag, and automatic NUMA balancing.


Immediate runtime toggles:


# Performance governor
for c in /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor; do echo performance | sudo tee $c; done


# Disable THP (esp. defrag)
echo never | sudo tee /sys/kernel/mm/transparent_hugepage/enabled
echo never | sudo tee /sys/kernel/mm/transparent_hugepage/defrag


# NUMA auto-balancer can spike latencies
echo 0 | sudo tee /proc/sys/kernel/numa_balancing



VM tunables that help PHP/MySQL latency:


sysctl -w vm.swappiness=1
sysctl -w vm.dirty_background_ratio=5
sysctl -w vm.dirty_ratio=15
sysctl -w vm.compaction_proactiveness=0
sysctl -w vm.zone_reclaim_mode=0



Service pinning (keeps hot paths local to one NUMA node):


  • Pin nginx and php-fpm worker processes to the same NUMA node as the NIC that serves them (numactl --cpunodebind=0 --membind=0 via a systemd drop-in; see the sketch after this list).
  • Pin mariadb to a different node if you have separate IO paths; otherwise co-locate for locality.
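A minimal sketch of that pinning as a systemd drop-in, assuming a php8.2-fpm unit name and that NUMA node 0 holds CPUs 0-63 (check with numactl --hardware; your unit name and CPU ranges will differ). Newer systemd also understands NUMAPolicy=/NUMAMask= directly, so no numactl wrapper is needed:

# Hypothetical drop-in; unit name, CPU range and node are assumptions
sudo mkdir -p /etc/systemd/system/php8.2-fpm.service.d
sudo tee /etc/systemd/system/php8.2-fpm.service.d/numa.conf >/dev/null <<'EOF'
[Service]
# keep the workers and their memory on node 0 (same node as the serving NIC)
CPUAffinity=0-63
NUMAPolicy=bind
NUMAMask=0
EOF
sudo systemctl daemon-reload
sudo systemctl restart php8.2-fpm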



4) ZFS present? Make sure it’s not poking the kernel even if “unused”​


Even “idle” ZFS can do background work that causes stalls.




zpool status
zpool iostat 5
cat /proc/spl/kstat/zfs/arcstats | egrep 'hits|misses|size'



  • If you’re not actively using SNEED for live traffic: export the pool and unload ZFS to test:



zpool export SNEED
modprobe -r zfs

  • If you are using it: set zfs_arc_max (e.g., 64–128G), and ensure no scrub/resilver is running during tests.
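For the ARC cap, roughly this (the runtime value is in bytes; 64 GiB is shown purely as an example):

# cap ARC at 64 GiB right now...
echo $((64 * 1024 * 1024 * 1024)) | sudo tee /sys/module/zfs/parameters/zfs_arc_max
# ...and make it persistent across module reloads
echo 'options zfs zfs_arc_max=68719476736' | sudo tee /etc/modprobe.d/zfs-arc.conf
# confirm nothing is scrubbing/resilvering during the tests
zpool status SNEED | grep -E 'scan:|scrub|resilver'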



5) Conntrack / firewall stalls​


A half-full or thrashing conntrack can add seconds.


sysctl net.netfilter.nf_conntrack_max
cat /proc/sys/net/netfilter/nf_conntrack_count # current entries, compare against nf_conntrack_max above
dmesg -T | egrep -i 'conntrack.*full|nf_conntrack'



If close to full or dropping, raise headroom:


sysctl -w net.netfilter.nf_conntrack_max=$(( 1024*1024 ))
sysctl -w net.netfilter.nf_conntrack_buckets=$(( 1024*1024/4 ))



Also check iptables/nftables rule-set size—giant per-connection rules are death.
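A rough size check (thousands of lines, or per-IP rules added in a loop, are the usual offenders):

sudo nft list ruleset | wc -l
sudo iptables-save | wc -l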




6) PCIe / hardware flakiness (motherboard, bus, NIC, NVMe)​


You suspected this; confirm before buying parts.


dmesg -T | egrep -i 'pcie|aer|corrected|uncorrected|MCE|EDAC|nvme timeout|NVMe reset'
journalctl -k --since "2 hours ago" | egrep -i 'pcie|aer|mce|nvme|reset'
lspci -vv | egrep -i 'Errors:|LnkSta:'



  • Any AER corrected/uncorrected errors, NVMe timeouts/resets, or MCEs → you likely have a lane, slot, or board issue. Reseat NIC/NVMe; try a different slot; disable ASPM in BIOS; update NIC firmware.



7) Prove/deny that it’s​


Bypass every layer and hit the box locally:




# Raw kernel network path (no DNS)
time nc -zv 127.0.0.1 22
time curl -s -o /dev/null -w '%{time_connect} %{time_starttransfer} %{time_total}\n' http://127.0.0.1/
# PHP-FPM ping (enable /status and /ping in php-fpm + nginx)
curl -s -o /dev/null -w '%{time_connect} %{time_starttransfer} %{time_total}\n' http://127.0.0.1/status?full



If localhost is instant while remote is slow → networking/IRQ/DNS.
If localhost is also stalling → kernel/NUMA/THP/ZFS/conntrack.




8) Attachment pipeline reality check (XenForo’s PHP streaming)​


Yes, XenForo’s default attachment path sucks. Even so, that would not explain SSH typing lag. Fix the platform first. After the system is healthy:


  • Put all attachment serving behind a dedicated subdomain that never hits PHP (object store + signed URLs or nginx X-Accel-Redirect from a minimal internal location).
  • Even if XenForo wants PHP to gate access, do the auth in PHP and then immediately X-Accel-Redirect to a file on local NVMe or to an S3-compatible endpoint (SeaweedFS volume server or MinIO gateway); see the sketch after this list.
  • In MariaDB, ensure innodb_flush_method=O_DIRECT, innodb_flush_log_at_trx_commit=2 (latency-friendly), innodb_buffer_pool_size sized to hot set, and skip-name-resolve (already above).
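A rough sketch of both pieces, assuming a Debian-style layout; the snippet name, paths and the buffer-pool figure are placeholders:

# nginx: internal location that only X-Accel-Redirect can reach
# (include this from the site's server{} block, then reload nginx)
sudo tee /etc/nginx/snippets/attachments-internal.conf >/dev/null <<'EOF'
location /internal-attachments/ {
    internal;                          # not reachable directly from outside
    alias /var/www/attachment-data/;   # placeholder path on local NVMe
}
EOF
# in the PHP gate, after auth passes, hand the transfer back to nginx with:
#   header('X-Accel-Redirect: /internal-attachments/' . $storedFileName);

# MariaDB latency-friendly settings from the list above
sudo tee /etc/mysql/conf.d/98-latency.cnf >/dev/null <<'EOF'
[mysqld]
innodb_flush_method = O_DIRECT
innodb_flush_log_at_trx_commit = 2
innodb_buffer_pool_size = 64G          # placeholder; size to the hot set
EOF
sudo systemctl restart mariadb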



9) Kernel netsysctls that reduce tail latency (apply after fixes above)​


sysctl -w net.core.somaxconn=65535
sysctl -w net.core.netdev_max_backlog=250000
sysctl -w net.ipv4.tcp_tw_reuse=1
sysctl -w net.ipv4.tcp_fin_timeout=20
sysctl -w net.ipv4.tcp_syncookies=1
sysctl -w net.ipv4.tcp_max_syn_backlog=262144



Minimal “do this right now” bundle​


  1. Disable DNS lookups in sshd and MariaDB; ensure /etc/hosts covers your hostnames.
  2. Set performance governor, disable THP and NUMA balancing.
  3. Spread NIC IRQs (irqbalance, RSS, RPS/RFS) and temporarily disable GRO/LRO/TSO/GSO to test.
  4. Check dmesg for AER/MCE/NVMe resets; if present, reseat/update/disable ASPM.
  5. Run the localhost vs remote curl/ssh timing to pin blame to network vs kernel/userspace.

If after steps 1–3 your SSH and HTTP latencies drop from “seconds” to “sub-100ms” most of the time, you’ve found it. If they don’t, paste the outputs from sections 0–3 and I’ll read the tea leaves and tell you exactly where it’s choking.
 
Dear trannies who attack Kiwifarms: When you make the site run slow I will just bleed out into your communities to make fun of you.
If you really hate everyone on Kiwifarms, donate to make the server run better so none of us go off site.
 
I can't help with your problem directly, but I can recommend a strategy I use to minimize the cost of field repair in multi-million-dollar systems with many complex interconnected nodes that no one person on our team fully understands, and with limited spare parts and time to replace them.

As a cost-minimizing strategy, instead of looking for light-switch solutions ("I changed config 7373736 and it works now"), I recommend using locators to discern where the problem exists ("A is slow, B is fast, C is slow, D is fast, and the slow ones share these things in common").

You have done this already, exhausted the light-switch solutions, and are looking at hardware as the source after testing. That may very well be true, so I recommend you carefully devise a "Design of Experiment" to determine WHERE the issue manifests in hardware and to prioritize what you replace first and in what order. You have been doing that naturally, but really formalize it and repeat tests several times in different orders. Make a table with "RAM", "PCI bus", "CPU", etc. as the rows and columns. Devise tests for first-order and second-order interactions of these components. You may find that the slowest shit ALWAYS involves the CPU and never the RAM, but also implicates a certain set of hardware on a specific PCI lane, and only when it is also trying to do drive reads into CPU cache at the same time. That implies the CPU may be at fault, but also that specific lanes on the motherboard could be at fault when operated together. So you can take the cheapest step first (CPU swap?), and if it fixes nothing you know the more expensive option, a mobo swap, is then warranted. Continue logically from there until you replace the whole system and commit die when it still doesn't work.

God Sneed.
 
Raid A is a 21TB SSD ZFS named 'SNEED', Raid B is Raid-6 NVMe with 7TB named 'CHUCK'.
I know jack shit about websites, i just wanted to read the thread to maybe learn something, but this made my day.

Edit: As a question, could specific boards or threads be causing issues? Is it possible to "turn off" threads or boards temporarily to see if something specific to them is causing the issues?

(Sorry if that sounds stupid; my main understanding of computers is that they are functionally pure chaos and problems can effectively crop up for no sensible reason.)
 
unsolicited suggestion from a retard:

mprime/prime95 to check if CPUs will get max load + what temps, etc etc
seconding this as another retard who doesn't know any better:
try benchmarking individual components to make sure the perceived lack of bottlenecking is real and not just something misreporting
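If you want concrete commands for that, something like the following per-component checks would do it (stress-ng/fio/iperf3 may need installing first; the flags are just a starting point):

# CPU: load every core for two minutes and watch clocks/temps in another window
stress-ng --cpu 0 --timeout 120s --metrics-brief # or mprime/prime95 as suggested above

# Disk: 4k random-read latency against a scratch file it lays out itself (reads only)
fio --name=randread --filename=/tmp/fio.test --size=4G --rw=randread \
    --bs=4k --iodepth=32 --ioengine=libaio --direct=1 --runtime=60 --time_based

# Network: raw throughput to another box (run `iperf3 -s` on the peer first)
iperf3 -c <peer> -P 4 -t 30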
 
Update: since your CPUs are idle most of the time, setting all disk schedulers to BFQ will have almost zero downside. I don't have a system with 512 gigabytes of RAM, so I'm taking a slightly smaller system down for "maintenance" to try to replicate your level of lag ASAP.

Major monkey suggestion: set all the disk schedulers to BFQ (even on all the SSDs/NVMEs) and then swap out the entire CPU scheduler with sched-ext / scx

Try scx_cosmos first because there are few options (start out with sudo /usr/bin/nice -n -15 scx_cosmos --mm-affinity, said to be good for databases), then if that doesn't work go to scx_lavd (from the latest master) and see if you can get LLC partitions set up right. Fiddle with all the other settings there.

General tuning: Set vm.overcommit_ratio = 3000 and make sure vm.overcommit_memory isn't disabled (check that at run time, maybe something is turning it off)
Set vm.max_map_count = 1048576 or higher. Make sure all your RAM and overcommit can be mapped at the same time by setting this value high enough.

Do this on all your Linux machines.
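A rough runtime version of the above (the bfq module has to be available; the scheduler change here doesn't survive a reboot):

# current overcommit state first (0 = heuristic, 1 = always, 2 = never; see docs below)
cat /proc/sys/vm/overcommit_memory /proc/sys/vm/overcommit_ratio /proc/sys/vm/overcommit_kbytes

sudo sysctl -w vm.overcommit_ratio=3000 # only consulted when overcommit_memory=2
sudo sysctl -w vm.max_map_count=1048576

# switch every block device (NVMe/SSD included) to bfq wherever the kernel offers it
sudo modprobe bfq
for q in /sys/block/*/queue/scheduler; do
    grep -q bfq "$q" && echo bfq | sudo tee "$q"
done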
Docs:


Note: overcommit_kbytes is the counterpart of overcommit_ratio. Only one of them may be specified at a time. Setting one disables the other (which then appears as 0 when read).

(Check this isn't screwed up.)



overcommit_memory:​


This value contains a flag that enables memory overcommitment.

When this flag is 0, the kernel compares the userspace memory request size against total memory plus swap and rejects obvious overcommits.

When this flag is 1, the kernel pretends there is always enough memory until it actually runs out.

When this flag is 2, the kernel uses a “never overcommit” policy that attempts to prevent any overcommit of memory. Note that user_reserve_kbytes affects this policy.

This feature can be very useful because there are a lot of programs that malloc() huge amounts of memory “just-in-case” and don’t use much of it.

The default value is 0.

See Overcommit Accounting and mm/util.c::__vm_enough_memory() for more information.


(Check you aren't running a too-old kernel.)
 
@Null Personally, I start with sysstat - https://sysstat.github.io/versions.html

It's another tool to install, but monkeying around with iperf/mtr and hdparm when you just need a top-down view is like disassembling the damn truck to find a bad spark plug.
To add on to this, consider recap https://github.com/rackerlabs/recap
It uses sysstat and keeps reports so you can correlate readings when the site is fine against when the site is slow and see what jumps out.
tcp-bbr is an auto-install for anything I run; easy and free performance gains: https://github.com/KozakaiAya/TCP_BBR
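On any reasonably recent kernel (4.9+) BBR is built in, so enabling it is roughly just two sysctls:

sudo modprobe tcp_bbr
printf 'net.core.default_qdisc = fq\nnet.ipv4.tcp_congestion_control = bbr\n' | sudo tee /etc/sysctl.d/99-bbr.conf
sudo sysctl --system
sysctl net.ipv4.tcp_congestion_control # should now report bbr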

Troubleshooting is a bitch and it's only made worse because null is unfortunate enough to have to deal with PHP and the absolute myriad of ways PHP can be choking and fucking your ass.
I was working on a server just last night and the problem was due to exhausted open file descriptors, there really are so many things it can be. If it'd be helpful I can rack my brain and just throw out a bunch of possibilities. Ultimately though it'll likely lead down a hideous path of sysctl tweaks which is just hell really.
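On the file-descriptor example, the quick checks look roughly like this (<pid> is whichever process you suspect):

# system-wide: allocated vs. maximum
cat /proc/sys/fs/file-nr
# per-process limit and current usage
grep 'open files' /proc/<pid>/limits
ls /proc/<pid>/fd | wc -l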
 
I know jack shit about websites, i just wanted to read the thread to maybe learn something, but this made my day.
The sign is a subtle joke. The shop is called "Sneed's Feed & Seed", where feed and seed both end in the sound "-eed", thus rhyming with the name of the owner, Sneed. The sign says that the shop was "Formerly Chuck's", implying that the shop was owned by Chuck at some point. So, when Chuck owned the shop, it would have been called "Chuck's Fuck and Suck".
 
@Null is the FS latency universal or is it more noticeable with files that have a high rate of access?
 
@Null Update update: BFQ+sudo /usr/bin/nice -n -15 scx_cosmos --mm-affinity is straight up working on a system with 256 GB RAM.
I got the thing massively overloaded to the point that I couldn't type due to the lag and was getting key repeats (similar to your SSH issue); then I changed the disk and CPU schedulers (to BFQ on the NVMes too) and it was usable again.

I repeated multiple times and it's not a fluke. If you have NUMA you should alternate services across the nodes, but that's separate from what I did here (I had that already done and it was still lagging hard under the default schedulers)

Getting someone in to tune one of the other SCX schedulers with more settings might be worth it, but even Cosmos might be a big win. It was on my system.
Edit: set the MMAP and overcommit settings first though. See my first post.
Edit 2: if any of this helps some but you start losing performance because of too much RAM use, or you run out of available memory, you can slowly start tuning the overcommit ratio back down again. Slowly.
 
Looks like the usual shitposters at the beginning of a thread have gotten their shitposting done and now there is some interesting discussion going on, and I'm learning some new things.

This is a level beyond my linux autism unfortunately so all I can do is wish you luck
 