The Year of Endless Technical Problems

I tend to start low level:

I'm going to assume that "iostat -xnz" is clean (%util low, r_await and w_await low), vmstat isn't showing excessive swap-in/swap-out (or any, really), interrupts, or context switches, and network errors are clear on both the switch(es) and the host (ip -stats link on Linux). Triple check the actual negotiated link speeds; not that that bit me in the ass a week ago, fucking cables.

Also, check thermals on everything, drives and CPU. Not sure what the fancy tool of the day is; it used to be lm-sensors and smartmontools.

Since you say it's intermittent, start graphing. Even something as brain-dead simple as Munin.

I assume you're on the latest reasonable kernel for your OS. Sometimes they fix things... other times they make it worse.
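A rough version of that starting checklist, assuming the usual sysstat, iproute2, ethtool, lm-sensors and smartmontools packages are installed (interface and device names are placeholders):

Bash:
# Disk latency and utilization, 5-second intervals:
iostat -xnz 5

# Swap activity, interrupts, context switches:
vmstat 5

# Per-interface errors/drops on the host (check the switch side separately):
ip -stats link

# Actual negotiated link speed and duplex:
ethtool eth0

# Thermals:
sensors
smartctl -a /dev/sda | grep -i temp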
 
I had tons of issues with a computer once, so I took a 12 gauge shotgun and pumped 4 shots into the motherfucker. I never had problems with that particular computer again, but the motherboard was fucked.
 
Try to time the "slowness" down to the second or millisecond.

If you can figure out the delay, you can figure out what timeout it might be.

And it certainly sounds like some kind of fucking timeout. (It's always DNS, and if you have systemd fucking around, you're probably fucked.)

If you shut down the KF server (turn off nginx) we won't fucking notice or care for a bit, and you can see if SSH suddenly starts being fast again - if so, that's informative.

I had a similar fuckery from AI bots being absolute fucks and requesting every possible tag combination, but I assume AI bots can't hit you directly.

dmesg -c is your friend; anything, ANYTHING, that appears needs to be investigated.
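To put numbers on the delay, a crude loop like this (host name is a placeholder, needs bc) logs how long a no-op SSH connection takes, so you can see whether the stalls cluster around a suspicious value like 5, 10, or 30 seconds:

Bash:
#!/bin/sh
HOST=backend1
while true
do
    start=$(date +%s.%N)
    ssh -o BatchMode=yes -o ConnectTimeout=60 "${HOST}" true
    end=$(date +%s.%N)
    printf '%s  ssh round trip: %s s\n' "$(date '+%F %T')" "$(echo "${end} - ${start}" | bc)"
    sleep 5
done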
 
Another thing it could be is full conntrack tables. I used to have a link bookmarked that was like 20 years old and probably written by some kind of sysadmin greybeard wizard but I can't find it. This one looks decent:
It's a long shot but the symptoms of it align with what you're experiencing and it should be pretty quick for you to check.
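For what it's worth, checking the conntrack table takes about ten seconds (sysctl names assume the nf_conntrack module is loaded):

Bash:
# Current entries vs. the ceiling; if count is hugging max, that's your problem.
sysctl net.netfilter.nf_conntrack_count
sysctl net.netfilter.nf_conntrack_max

# A full table also screams about it in the kernel log:
dmesg | grep -i 'nf_conntrack: table full'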
 
  • Typing in SSH takes multiple seconds.
  • Connecting via SSH takes 10+ seconds.
  • Requests to the Kiwi Farms take multiple seconds even for rudimentary requests.
SSH latency interests me more than anything else here, because that's almost impossible to excuse. It takes insanely high CPU/IO load to mess with SSH, and you don't have that, so there's either something very badly misconfigured or broken, or you haven't found the load. Throw tcpdump, strace, and perf at an SSH server process and find out exactly what it's doing during the moments when it's not responding. Try to find TCP retransmits / checksum errors as evidence of packet loss somewhere. iperf is also a good suggestion.
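A rough sketch of what that looks like in practice; the PID discovery and output paths are just examples:

Bash:
# Capture SSH traffic while reproducing a slow login; look for retransmits/resets afterwards.
tcpdump -i any -nn -w /tmp/ssh.pcap 'tcp port 22' &

# Trace the sshd listener and its children, timestamping every syscall, so a
# multi-second gap points straight at the blocking call (often a reverse DNS lookup).
strace -f -tt -T -p "$(pgrep -o -x sshd)" -o /tmp/sshd.strace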

On the more desperate side, I've seen CPU issues manifest in bizarre ways: try burning CPU on something to see if an uninterrupted CPU-bound workload runs as fast as you think it should. Verify every core individually. Freqs / thermals / any other evidence of weird power issues. Benchmark other parts (disk IO, network bandwidth, memory bandwidth, ...) to see if you can get anything to run, or more interestingly not run, at advertised specs. Second (third...) guess whatever you're using to monitor load. isolcpus / cset shield + pin processes to cores to see if anything interesting happens.
 
Another thing it could be is full conntrack tables. I used to have a link bookmarked that was like 20 years old and probably written by some kind of sysadmin greybeard wizard but I can't find it. This one looks decent:
It's a long shot but the symptoms of it align with what you're experiencing and it should be pretty quick for you to check.
You might as well be suggesting hugepages on CentOS 6.
 
Is there any way to distribute the server load? It might help to have several blades, one dedicated to a specific thing, etc... you get my point.

Also, I wonder if it's partially on my end, since my DNS resolution seems to suck donkey balls.
 
If anything looks weird in here, don't run it. It's pretty straightforward though:


Bash:
#!/bin/sh

# I don't know if Debian has memory compress or swap turned on. Turn both of those off if you don't want this line.
sysctl -w vm.swappiness=1

# Processes might map memory in small chunks, requiring many maps for all your memory + overcommit.
sysctl -w vm.max_map_count=1048576

# Processes want lots of virtual memory sometimes.
sysctl -w vm.overcommit_ratio=3000
# Counterintuitively, this enables memory overcommit up to the amount set above, even though the docs describe this mode as disabling overcommit.
sysctl -w vm.overcommit_memory=2

# This is 1% of available (not free) RAM. Should be enough on your 512 GB machine. On lesser machines, you might have to
# increase this value to 2%. Making this higher is not a "go faster" button! Your drives are fast! Bigger buffers = more delay!
sysctl -w vm.dirty_ratio=1
#sysctl -w vm.dirty_ratio=2

# Set drive sched to BFQ. Contains heuristics for reducing latency.
DRIVES="$(mktemp)"
# You might need to add more drive types.
find /sys/class/block/ -mindepth 1 -maxdepth 1 ! -name "$(printf "*\n*")" -regextype 'posix-extended' -regex '^.+sd[a-z]+$|^.+nvme[0-9]+n[0-9]+$|^.+sr[0-9]+$|^.+mmcblk[0-9]+$' > "${DRIVES}"
while IFS= read -r d_path
do
    echo setting "${d_path}" sched
    echo 'bfq' > "${d_path}/queue/scheduler"
    sleep 0.25
    # Might help with latency, but bandwidth may be too slow.
    #echo '1' > "${d_path}/queue/iosched/strict_guarantees"

    # Enable the heuristics.
    echo '1' > "${d_path}/queue/iosched/low_latency"
done < "${DRIVES}"

# Contains heuristics for reducing latency + the mm option tries to keep threads using the same memory on the same CPU core.
/usr/bin/nice -n -15 scx_cosmos --mm-affinity

If you're technical enough, you might be able to build a Debian package straight from the prebuilt binaries of the Arch Linux kernel and the scx-scheds package, to avoid the compiling difficulty.
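As a quick sanity check after running the script, the active scheduler shows up in brackets, so something like this (device globs assume typical sd*/nvme* naming) confirms the BFQ switch took:

Bash:
# The active scheduler is the one in [brackets], e.g. "mq-deadline kyber [bfq] none".
grep -H . /sys/class/block/sd*/queue/scheduler /sys/class/block/nvme*n*/queue/scheduler 2>/dev/null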
 
So I may be late to the party but I've had issues like this at work before. My knee jerk reaction is some kind of resource contention (files open, mutexes, whatever) or something small that is usually ignored is happening a lot more than it should (or failing) like the DNS suggestion.

Once you've exhausted the "try this!" or "replace this!" suggestions, you have to go with the slow, annoying, pain-in-the-ass answer of monitoring and more monitoring. Pick one symptom (let's say video uploads failing) and add logging at every step of that process. A given request should have a UUID associated with it through the whole chain so you can construct a timeline of the entire request. Then you line everything up in order and see where things are taking longer than they should. If the request fails, you see how far it got in the chain. Hopefully this gets you a culprit. If not, you add more monitoring.

I had an issue once with a db application being far too slow. The load on the DB was basically nothing, the load on the client was basically nothing, there was not much network traffic, DB was in the same office so on the same LAN. The query was pretty simple and basically instant (lookup on a single column with an index). Since it needed to operate on a lot of rows (millions) the slowness meant any significant action would be days or more (i.e. completely useless).

Turns out it was doing one row at a time, and even though the DB was in the same office, the round trip for each row was about 30ms. At a million rows, that's 30,000 seconds, over eight hours of nothing but network latency.
 
Are you sure your hardware is running at an optimal 77°F or 25°C? Additionally, you may want to install dehumidification air handling units if you do not have a cooling solution that pre-dries the supply air entering the space. Hardware can run improperly in your server room if the air is not at the optimal temperature or it is near dew point. You can monitor the temperature and humidity in your server room and send alerts to your email and trend the performance of your HVAC system.

I also know nothing about computers.
 
RAID B is RAID-6 NVMe with 7TB, named 'CHUCK'.
This post got long as shit so I've split it up into sections. The section I highlighted with :!: is the one I reckon is the most valuable in terms of low risk / high potential reward and very little effort required.

RAID

What's the purpose of CHUCK? If it's running database shit, RAID 10 is probably a better choice due to faster recovery from failures and since it's a mirror + stripe you get read and write benefits that scale with the number of striped sets of mirrors you have.

Also, you mentioned ZFS for the first RAID but just called this one RAID 6. Is CHUCK actually some form of hardware RAID, or Linux software RAID (md or LVM)? If it's not ZFS, I'd suggest converting it into a ZFS pool of some type.

nginx

Last year I wrote a long-ass post about proxy request buffering in nginx in response to you providing samples of the middle node's config. The idea was to address uploads stalling at 100% due to the middle nodes soaking up the request buffer before sending it on. The config you were running at that time looked vulnerable to goloris-style worker exhaustion attacks, and if you haven't tightened up the timeouts, you should. The default limit for worker connections is 512 * CPU cores.

:!: TCP Congestion Control Algorithms :!:

The default TCP Congestion Control algorithm in the Linux kernel is CUBIC and frankly it sucks shit if you have any packet loss whatsoever. It uses packet loss as a feedback mechanism to say "Holy shitballs we're overloading this pathetic faggot link!" and back off, creating a sawtooth pattern all too familiar to any autist who stares at network graphs all day.

Google created an algorithm relatively recently called BBR that is designed around estimating the link's bandwidth and round-trip time rather than reacting to loss. The effect of this is that transfer rates remain high even if you're experiencing some packet loss, since loss is not used as a signal to back off. I switched to BBR a few weeks ago on BMJ TV and it did show an improvement in estimated connection speeds for clients.

If you do set this up, you'll need to switch the algorithm to BBR from the backend all the way to the L4 frontends to experience any real benefit. No reboot required and it can coexist fine with CUBIC end-users.

Add net.ipv4.tcp_congestion_control = bbr to sysctl to use BBR.
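For completeness, here's a minimal way to check what's available, switch on the fly, and persist it. The fq qdisc line is the commonly recommended companion to BBR, and the file name is arbitrary:

Bash:
# What the kernel has available, and what's active right now:
sysctl net.ipv4.tcp_available_congestion_control
sysctl net.ipv4.tcp_congestion_control

# Switch immediately, no reboot:
sysctl -w net.core.default_qdisc=fq
sysctl -w net.ipv4.tcp_congestion_control=bbr

# Persist across reboots:
cat > /etc/sysctl.d/90-bbr.conf << 'EOF'
net.core.default_qdisc = fq
net.ipv4.tcp_congestion_control = bbr
EOF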

MTU

Between your zero trust L4 proxies, Kiwi Flare and the Super Secret backend, are you doing any tunneling? IPsec, GRE, WireGuard, OpenVPN... anything? The reason I ask is that encapsulation adds overhead and if not every hop along the journey is fully aware of the extent of that overhead, you will have issues.

What I would suggest is if you're using WireGuard (as an example) between the L4 proxy and Kiwi Flare, drop the MTU for the public facing interface to match the tunnel MTU (by default 1420 bytes). The reason is that 1460/1440 bytes MSS (IPv4 and IPv6 respectively) is being negotiated because both ends think they have 1500 bytes to work with, but when it tries to squeeze that through the tunnel, it won't fit and will either fragment (depending on the DF flag) or drop the packet.

This all being said, it's a tad unlikely that's the major cause of the performance issues. MTU size issues generally cause packet drops, not inexplicable sluggishness, and it's actually pretty easy to rule in or out: just change your client to the lowest possible MTU (1280 bytes) and see if the site magically becomes more usable. I've already tried this and honestly the site still runs like shit. Still, it doesn't hurt to spend a few minutes with a calculator to make sure there are no unaddressed bottlenecks.
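If you want to rule it in or out from the server side too, a couple of one-liners do the job (the target host is a placeholder):

Bash:
# 1472 = 1500 - 20 (IPv4 header) - 8 (ICMP header); -M do sets Don't Fragment.
# If this fails but a smaller payload (e.g. 1392 for a 1420-byte tunnel) works, you've found your culprit.
ping -M do -s 1472 -c 3 203.0.113.10

# Or let tracepath discover the path MTU hop by hop:
tracepath -n 203.0.113.10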

Virtualization

Are you doing any virtualization at all? Is everything running bare metal? If so, I'd suggest you seriously consider a re-architecture around converting the host into a hypervisor and separating the services from one another into separate guest VMs.

MariaDB, the self-hosted S3 shit, Redis, web frontends, the reverse proxies themselves: they should all be on separate VMs. You're not able to effectively utilize the resources you have because you're butting up against OS limits. If you find yourself having to go down the path of investigating TCP port exhaustion or ulimits, then you're at the point you need to figure out "How do I spread my resources out?"

My suggestion is to create a separate VM for each of the services which can't be easily clustered (or where clustering is just too painful). Focus on scaling out the web frontends (the shit running PHP-FPM) to where you have something like 4 upstreams in nginx for serving up XenForo pages. When backends start crumbling under pressure, just add more instead of trying to figure out how to stretch their resources.
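For the avoidance of doubt, the "4 upstreams" bit would look something like this on the proxy side. The IPs, names and the least_conn choice are all made up for illustration, written as a heredoc to keep everything in shell:

Bash:
cat > /etc/nginx/conf.d/xenforo-upstream.conf << 'EOF'
upstream xenforo_php {
    least_conn;
    server 10.10.0.11:9000;   # php-fpm VM 1
    server 10.10.0.12:9000;   # php-fpm VM 2
    server 10.10.0.13:9000;   # php-fpm VM 3
    server 10.10.0.14:9000;   # php-fpm VM 4
}
EOF
nginx -t && systemctl reload nginx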

All of this can operate on an internal network that exists within the hypervisor itself so you aren't exhausting external IP resources and having to fret over firewall rules. Use a software router like OPNsense to act as the gateway for this network and establish tunnels back to Kiwi Flare nodes, or expose one or more reverse proxies to the Internet and set up iptables rules in Proxmox to manage inbound traffic. This isn't the only answer; you've got a lot of options for how to do this.

Server Sperg User Group

Last thing I'll bring up is that it sounds like you're pretty much trying to do everything alone and only reaching out to the community as an absolute last resort. A lot of the replies in this thread are useless noise, even the ones that are well intentioned, and it makes it much harder to figure out what has already been suggested and find additional information you've shared.

What I reckon might be helpful is a post in somewhere like Supporters, Inner Circle or I&T which is strictly a place for you to ask questions and get suggestions. Nothing off topic allowed, no glib answers tolerated, instant thread ban for shit stirrers or brainlets.

Edit: Thank you XenForo for adding [ICODE] into every fucking paragraph for no reason! Very Cool!
 
In my experience, the SSH delays are not even CPU/memory tied, as kernel 4.4 and above (iirc) reserves resources for the OpenSSH server and other basic tools to function if needed (like df, top, stat; basically most things in Android's busybox).
I would investigate network issues before anything else: DNS timeouts, circular/split routes, interface traffic shaping (QoS etc.), shit like this. If you just increase verbosity on the OpenSSH server and client, you'd get a lot of useful information. The times it has happened, it helped a ton. Also avoid using PuTTY/KiTTY as a client in case you are using Windows; the SSHv2 implementation they use doesn't support the same level of verbosity as the OpenSSH client.
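In concrete terms (the alternate port for the debug instance is arbitrary):

Bash:
# Client side: -vvv prints stage-by-stage progress, so you can see exactly which
# step (DNS, banner exchange, key exchange, auth, ...) is eating the seconds.
ssh -vvv user@host

# Server side: run a one-off debug instance on a spare port so you don't touch the
# real daemon; it handles a single connection and dumps everything to stderr.
/usr/sbin/sshd -ddd -p 2222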

On a side note, why S3? Why the cloud at all? Relying on network storage while hosting the server elsewhere doesn't seem logical. 9TB isn't that much data these days. It would be cheaper over a year or less to just buy a storage server with 12-24 NVMe slots and create a good ZFS array.
 
Get metrics all over. Have AI vibe code you a dashboard and have alerts. In programming you profile and optimize hot spots, in sysadmin work you get metrics and optimize.

Since nothing is maxed out you should have all the probes/hooks you can everywhere for really detailed logs. Then you can work from evidence rather than guesses and what-ifs.

Don't try to win the lottery. "Cheat" by seeing what numbers are getting picked.

Prometheus and Grafana are pretty one-size-fits-all. Cockpit is nice 'just to have'.

Also make sure you have mtr or equivalent, Smokeping, and pings going constantly. The more info you have about when things are fine versus when the issue happens, the more you can correlate.
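Something as dumb as a cron'd mtr report covers the path monitoring part; the target IP and log path are placeholders:

Bash:
# 60 probe cycles toward the upstream hop, appended with a timestamp so packet loss
# and latency spikes can be lined up against when the site felt slow.
date >> /var/log/mtr-upstream.log
mtr --report --report-wide --report-cycles 60 -n 203.0.113.1 >> /var/log/mtr-upstream.log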

Don't stay willfully blind.
 
A lot of the replies in this thread are useless noise, even the ones that are well intentioned, and it makes it much harder to figure out what has already been suggested and find additional information you've shared.
It's 5 pages and the one part of your post that you singled out was already suggested ~10 hours ago. The rest doesn't really have anything to do with the actual problem - if it's intermittent and happens even during an SSH session then it won't be because of either RAID array, nginx or how the proxies are tunneled.
I wholeheartedly agree with the virtualizing part, all the way down to Proxmox and your choice of OPNsense, though it'd probably be overkill. I wouldn't envy him having to migrate the site elsewhere and then tear down/rebuild everything again, though. It would probably take a week.
 
if it's intermittent and happens even during an SSH session then it won't be because of either RAID array, nginx or how the proxies are tunneled.
The SSH thing is a big WTF but I didn't want to fixate on it because others had provided good suggestions already and the goal is to make the forum usable, not fixate on a single unusual symptom of the lag. My ire was mainly directed at the page 1 responses where there's basically only one post trying to be helpful.

I know you already suggested BBR, but I wrote about it anyway because you don't actually explain in your post why it's preferable to CUBIC. Also, the GitHub repo you linked to seems to be some random anime avatar's BBR driver which hasn't been updated for years, since BBR has been bundled with Linux for ages. You can change net.ipv4.tcp_congestion_control to bbr and reap the rewards immediately, no special kernel modules required.

It's why I highlighted it. It's something Null can do risk free right now and it will improve performance.
 
Proper way to debug this would be to install a small monitoring stack such as Node Exporter + Prometheus + Grafana so you can see what the various metrics look like over time.

Here's an example that you can bootstrap easily using Docker; the Loki portion is probably overkill, though.
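Roughly what the Docker bootstrap looks like, minus Loki. Image names and flags are from the upstream node_exporter/Prometheus docs; the prometheus.yml path is a placeholder you'd fill in with a scrape config pointing at the exporter:

Bash:
# Host metrics exporter; needs the host PID/network namespaces and a read-only rootfs mount.
docker run -d --name node-exporter --net host --pid host \
    -v /:/host:ro,rslave \
    quay.io/prometheus/node-exporter:latest --path.rootfs=/host

# Prometheus itself, scraping the exporter and storing the time series.
docker run -d --name prometheus -p 9090:9090 \
    -v /etc/prometheus/prometheus.yml:/etc/prometheus/prometheus.yml:ro \
    prom/prometheus:latest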

Understanding what your system is using in terms of metrics like iowait, system CPU (and where it's being spent by looking at interrupts and other IO), disk await, etc. is critical to debugging it.

There are various exporters for monitoring specific services, such as MySQL, too.

Site is pretty snappy now so it seems like you got things figured out, but feel free to ping me if you need help with it in the future. Debugging complex systems (a lot of which I didn't setup and am not the owner of) has been my job for a while now.

Edit: Forgot to mention you can use OTel to get zero-code instrumentation for PHP.
 