The Year of Endless Technical Problems

Status
Not open for further replies.
[Attached image: 1757522727103.webp]

Just had this happen by merely clicking "What's new" by the way.
 
this has absolutely and completely and totally slaughtered the system. iowait is 25%+ with no cpu usage.
Is
Code:
$ zcat /proc/config.gz | grep CONFIG_HZ_1000
CONFIG_HZ_1000=y
on your machine?
Sorry for not telling you about that. You need a fast tick rate. Back to the drawing board, I'll try to test with a load that creates higher IOWAIT on a system I have access to.
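In case /proc/config.gz isn't there (it only exists when the kernel was built with CONFIG_IKCONFIG_PROC), here is a rough sketch that falls back to the config file in /boot. The /boot/config-$(uname -r) path is an assumption based on how Debian-style distros ship it, and kernel_hz is just a name I made up.

```shell
# Print the CONFIG_HZ lines from whichever kernel config source is readable.
# An explicit path can be passed as $1; otherwise try /proc/config.gz first,
# then /boot/config-$(uname -r) (assumed Debian-style location).
kernel_hz() {
    cfg="${1:-}"
    if [ -z "$cfg" ]; then
        if [ -r /proc/config.gz ]; then
            cfg=/proc/config.gz
        else
            cfg="/boot/config-$(uname -r)"
        fi
    fi
    [ -r "$cfg" ] || { echo "no readable kernel config at $cfg" >&2; return 1; }
    # Decompress only when the file is gzipped, then keep the tick-rate lines.
    case "$cfg" in
        *.gz) zcat "$cfg" ;;
        *)    cat "$cfg" ;;
    esac | grep -E '^CONFIG_HZ(_[0-9]+)?='
}

kernel_hz || true
```

If it prints CONFIG_HZ_1000=y and CONFIG_HZ=1000 you have the fast tick rate. Anything else (250 is a common default) means a different kernel build is needed.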

Do you have a particular NUMA setup? What program gets pinned to what node, that sort of thing. That might be useful to know for testing.
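If it helps, this is roughly how I'd collect the NUMA details without installing anything. It only reads sysfs and /proc. The function names are made up, and the optional path arguments are there so it can be pointed at a copy of the tree.

```shell
# List each NUMA node with the CPUs that belong to it.
# $1 optionally overrides the sysfs node directory.
numa_nodes() {
    root="${1:-/sys/devices/system/node}"
    for n in "$root"/node[0-9]*; do
        [ -r "$n/cpulist" ] || continue
        printf '%s cpus=%s\n' "$(basename "$n")" "$(cat "$n/cpulist")"
    done
}

# Show which CPUs and memory nodes a process is allowed to use.
# $1 is the PID, $2 optionally overrides the /proc root.
numa_pinning() {
    grep -E '^(Cpus|Mems)_allowed_list:' "${2:-/proc}/$1/status"
}

numa_nodes
numa_pinning "$$" || true
```

Run numa_pinning against the PIDs of the database and the web workers. If everything shows all CPUs and all memory nodes, nothing is pinned.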

Ah yeah, the scheduler behaves differently when IOWAIT is high but CPU is low.
 
>an unexpected database error occurred.
I thought we'd lost the ability to sneed forever.
Buckle up cowboy, we're testing in PROD.

SNEED must be mq-deadline or the site eats shit.
That's weird. You said SNEED is a SSD, have you tried explicitly setting to none already?
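For reference, a quick sketch of how to see which scheduler each block device is actually using. The kernel marks the active one in [brackets] in the sysfs file. active_sched is a made-up helper name, and nvme0n1 in the comment is a placeholder device.

```shell
# Print the active scheduler from a queue/scheduler file, e.g.
# "[mq-deadline] kyber bfq none" -> "mq-deadline".
# Prints nothing if the file has no bracketed entry.
active_sched() {
    sed -n 's/.*\[\(.*\)\].*/\1/p' "$1"
}

# Walk every block device that exposes a scheduler file.
for f in /sys/block/*/queue/scheduler; do
    [ -r "$f" ] || continue
    dev=${f#/sys/block/}
    printf '%s: %s\n' "${dev%%/*}" "$(active_sched "$f")"
done

# To switch one device (as root, does not persist across reboots):
#   echo none > /sys/block/nvme0n1/queue/scheduler
```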
 
Do you have compression enabled on any filesystems or zfs datasets? If so turn it off right now. I had a problem where my system would stall for a minute after using a Windows VM. Turns out it was caused by f2fs's kernel threads compressing all the writes that were done to the VM image.
A way to check for similar issues is to show kernel threads in htop (shift-k) and look for ones with high priority and CPU usage. Also keep a window open with dmesg -w -H and watch for anything interesting to show up.
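If you'd rather not sit in htop, something like this gives a one-shot list of the busiest kernel threads. It assumes the procps version of ps and relies on kernel threads being children of kthreadd (PID 2). kthreads_by_cpu is a made-up name.

```shell
# Read "ppid pid nice pcpu comm" rows on stdin, keep only kernel threads
# (parent PID 2), and sort by CPU usage, highest first.
# $1 optionally limits the number of rows (default 15).
kthreads_by_cpu() {
    awk '$1 == 2' | sort -k4,4nr | head -n "${1:-15}"
}

# One -o per column with an empty header, so no header line is printed.
ps -e -o ppid= -o pid= -o ni= -o pcpu= -o comm= | kthreads_by_cpu 15
```

Note that pcpu from ps is averaged over the lifetime of the thread, so this is better for spotting chronic offenders than short stalls. For those, htop or dmesg -w -H is still the way to go.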
 
I didn't make much progress. I don't want to bother you more. If you get a CONFIG_HZ_1000=y kernel and details on your NUMA setup (if you have one) I can try to help again.
 
I've had some more thoughts about this since yesterday. The correct way is still to add monitoring until the problem becomes apparent, but seeing as we're doin' the cowboy thing I have a few things to try that so far haven't been suggested (in this thread at least).

Have you tried turning PCIe power management off? Just add "pcie_aspm=off" to the kernel command line in GRUB (GRUB_CMDLINE_LINUX_DEFAULT in /etc/default/grub on Debian), run update-grub, and reboot. I've seen a few times where buggy power management can tank performance or imitate a flaky PCIe device or connection. And since you have NVMe drives...
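To confirm whether the parameter actually took effect after the reboot, a sketch along these lines should do. aspm_cmdline is a made-up helper, and the sysfs policy file may not be present on every kernel build.

```shell
# Print the value of pcie_aspm= from a kernel command line file,
# or nothing if the parameter is not set.
# $1 optionally overrides the file (default /proc/cmdline).
aspm_cmdline() {
    tr ' ' '\n' < "${1:-/proc/cmdline}" | sed -n 's/^pcie_aspm=//p'
}

echo "cmdline pcie_aspm: $(aspm_cmdline 2>/dev/null || true)"

# The active ASPM policy is shown in [brackets] when the file exists.
cat /sys/module/pcie_aspm/parameters/policy 2>/dev/null \
    || echo "ASPM policy file not present"
```

An empty value on the first line means the kernel booted without the parameter, so check that update-grub was run after editing /etc/default/grub.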

I assume the server has ECC memory, but do you have rasdaemon set up so you will actually see ECC (and other machine check) errors? ECC errors will tank performance but can be sporadic depending on what (if anything) is using that memory, or even on memory temperature, and the machine won't necessarily crash if the ECC can recover. Since the site hasn't been down for several days recently you probably haven't run a memory test, but at this point it might be worth it. You MUST use the free version of MemTest86 from the company (PassMark) website. The memtest86+ bundled with most Linux distros WILL NOT REPORT CORRECTED ECC ERRORS.
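Even before rasdaemon is set up, the kernel's EDAC counters can be read straight from sysfs, assuming the EDAC driver for the memory controller is loaded. A rough sketch (edac_counts is a made-up name):

```shell
# Print corrected (ce) and uncorrected (ue) error counts for each memory
# controller the EDAC subsystem knows about.
# $1 optionally overrides the sysfs directory.
edac_counts() {
    root="${1:-/sys/devices/system/edac/mc}"
    for mc in "$root"/mc[0-9]*; do
        [ -r "$mc/ce_count" ] || continue
        printf '%s corrected=%s uncorrected=%s\n' \
            "$(basename "$mc")" "$(cat "$mc/ce_count")" "$(cat "$mc/ue_count")"
    done
}

edac_counts

# With rasdaemon installed, its own tool reports the same kind of data:
#   ras-mc-ctl --status
#   ras-mc-ctl --error-count
```

No output at all means no EDAC driver is loaded, which is worth knowing by itself: in that case corrected errors are not being counted anywhere.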

I know you said you use debian but we are on a new-ish server. What kernel version are we on currently? If it's older a yolo upgrade to the newest LTS might just werk (YeeHaw!)

I presume you've checked dmesg for anything suspicious. But giving us a copy of dmesg to look at might yield some clues.

Edit: I feel like this must've been checked but during the slowness there's no packet loss right?
 
It's so weird, I've had this burning all consuming desire to fix the site all week, I sat down and did 6 hours of work on it today, and almost as soon as I got it working great, Charlie got shot.
 
[Attached image: dog-accepting-fate.gif]
how it feels to finally be able to use 3000+ page threads again without the site shitting itself and doing nothing

thanks null
 
any updates on this @Null, did you manage to fix it?
did anything here help?
 
In last week's MATI, Josh said AI suggested the issue was lots of requests each allocating and releasing lots of memory, and that reducing that happened to solve it. Not because there wasn't enough memory, but because you can't do an infinite number of these memory operations at once, and apparently we hit the limit because Josh was feeling RAM rich and upped the spending limits like someone getting his first credit card.

At this rate he might just abandon us and post to his AI so he can get the answers he wants, and just have AI Josh shitpost in a random thread every other day.
 
It still feels kinda slow, like half the time the reaction image icons are not even loading for me, and neither are some images
 
It still feels kinda slow, like half the time the reaction image icons are not even loading for me, and neither are some images
I don't disagree, but it's been reliably slow. No more random 504 errors, very few "clicked a link and it took 15 seconds to load" issues. That's a major step in the right direction.
 