AI Megathread

At 12GB of VRAM, I've had the best results with uncensored MoE models like Gemma4 26B A4B or Qwen3.6 35B A3B at Q4. They usually end up writing better code.
I've decided that I am going to buy a GPU with 36GB of VRAM. With Gemma, that should be more than enough to get high tokens per second, right? Also, curious if the new Deepseek models will beat Gemma in real-world use.
 
What's your quant for Gemma 4?
I'm using Unsloth's quants and oobabooga textgen on a 3060. The quant is UD-IQ4_NL, loaded with a Q8 cache. I'm offloading some weights in exchange for extra context, which I wouldn't do if it were dense, but it works since it's an MoE. I extend my context until I notice it slowing down, then scale it back.
oobabooga_settings.png
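If you'd rather script it than click through the webui, this is roughly what that loader setup corresponds to if you drive llama.cpp through llama-cpp-python directly. The filename, layer count, and context size below are made up, so tune them for your own card; the point is just the partial-offload-plus-big-context pattern.
```python
# Hypothetical path and numbers; the pattern is partial offload + long context.
from llama_cpp import Llama

llm = Llama(
    model_path="gemma4-26b-a4b-UD-IQ4_NL.gguf",  # made-up filename
    n_gpu_layers=30,    # partial offload: the remaining expert weights sit in system RAM
    n_ctx=16384,        # grow this until generation slows down, then back off
)
out = llm("Summarize what a mixture-of-experts model is in one sentence.",
          max_tokens=64)
print(out["choices"][0]["text"])
```
The Q8 KV cache from the screenshot is a separate loader option; I've left it out here since the exact knob depends on which version you're running.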
 
I've decided that I am going to buy a GPU with 36GB of VRAM. With Gemma, that should be more than enough to get high tokens per second, right? Also, curious if the new Deepseek models will beat Gemma in real-world use.
The highest you can go with a CUDA consumer card is 32GB on the RTX 5090. There's no point in going past that: you won't reach the break-even point versus renting an enterprise-card instance on RunPod at ~$0.60/hour unless you keep it under constant load for months.
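Back-of-the-envelope version of that break-even; everything here is an assumption except the ~$0.60/hour rental figure above.
```python
# All figures are assumptions except the ~$0.60/hour rental rate mentioned above.
card_price_usd = 2000.0     # assumed street price for a 32GB consumer card
rent_per_hour = 0.60        # enterprise-card instance on RunPod
hours_per_month = 730

break_even_hours = card_price_usd / rent_per_hour
print(f"~{break_even_hours:.0f} rented hours to break even")
for utilization in (1.0, 0.5, 0.25):
    months = break_even_hours / (hours_per_month * utilization)
    print(f"  at {utilization:.0%} load: ~{months:.1f} months")
```
At 100% utilization that's roughly four and a half months; at more realistic part-time loads it stretches out to a year or more.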

I'm not sure I mentioned this, but you really can't stick multiple high-end consumer GPUs on a consumer ATX board. Most don't support bifurcation on the CPU's PCIe lanes, which leaves you limited to a single x16 CPU slot with maybe a few x4 slots hanging off the chipset.

You need specialized hardware to run multiple GPUs efficiently without diminishing returns: motherboards with PCIe bifurcation and CPUs with a shitload of PCIe lanes. Most of that is only available on EATX boards, on custom configurations Dell or Lenovo build for their high-end tower workstations, or on rackmount servers from everyone else. And most of those machines don't have power supplies designed to drive multiple 350W cards, nor the spare PCIe power connectors to plug them into; they're built around the more efficient enterprise cards.


Edit:
Almost forgot: if you really need that extra 8GB of VRAM for an LLM, and you have a single slot open on a consumer motherboard and 150W to spare, the NVIDIA Quadro RTX 4000 is an option, but it's not as fast as most consumer cards.
 
You need specialized hardware to run multiple GPUs efficiently without diminishing returns: motherboards with PCIe bifurcation and CPUs with a shitload of PCIe lanes.
An HEDT PC can run multiple 3090 Tis if you need to. The only problem you run into is the power supply; 1200W units are failure-prone.

If you're running multiple graphics cards you NEED a Threadripper with quad-channel RAM. An older Threadripper is sufficient, though.

Even the 1920X has 64 lanes. Intel? Not so much.
 
An HEDT PC can run multiple 3090 Tis if you need to. The only problem you run into is the power supply; 1200W units are failure-prone.

If you're running multiple graphics cards you NEED a Threadripper with quad-channel RAM. An older Threadripper is sufficient, though.

Even the 1920X has 64 lanes. Intel? Not so much.
My old setup was a dual-CPU Broadwell machine with 128GB of DDR4 ECC RAM and a shitload of x16 slots on a Supermicro server board, with a 3090 and a 3080. I still had power problems: wattage surges on the GPUs caused them to desync from the operating system, even with a 1200W Seasonic Platinum power supply.
 
So I have been using the OpenCoder (OpenCode Zen) model. I managed to have it build a small web project using nothing but prompts, and the result isn't far off what I would call professional quality.
I haven't tried messing around with the Ollama models yet. I suspect, though, that my 6800XT isn't going to cut it and I'm going to have to get a 3090.
 
I've been using AI for a larger-scale coding project recently. These are two rules I've imposed on myself to make sure it doesn't turn into total unmaintainable slop, and so far they've been working pretty well.
  1. Do not "vibe code." Take things slowly and methodically. Don't try to make it build the entire thing in one prompt; split each step of the process into its own chunk, and be very intentional and particular about what you want from it.
  2. Do not copy and paste. Instead, physically type in what the LLM spits out. This helps you get a slightly better idea of what exactly it did, in the same way that writing down notes helps you remember stuff. This also indirectly helps with the first rule, discouraging you from making it generate too much code at a time.
 
Considering how threads are often derailed, I could think of a few ways to derail this one myself.
First of all, thanks for making the megathread. I have been unable to find any thread, in general, that would allow me to explain how batshit insane some usages of AI are. Obvious examples are AI-generated advertisements, AI-powered scams, AI agents having a meltdown like an autistic toddler, and so on.
I'll watch this thread. Thanks again.

Let’s use this thread to share models, hardware setups, uncensored alternatives, and general AI news
As for AI news: since I don't own an X account, I manually type in the usernames of two accounts:
  • AISafetyMemes (nitter, X): "Techno-optimist, but AGI is not like the other technologies." He's often perceived as a doomer because of all the crazy shit he posts. Many have criticized him for being overly negative, but the truth is that he just reposts scientific findings and real-world news, so he's not making anything up. Everything he says is reasonable and I could not sniff out any mental illness from him. He's legit.
  • Testing Catalog News (nitter, X): "Reporting AI nonsense. A future news media, driven by virtual assistants 🤖" He consistently shares AI news about new models, AI-related tools, and the massive investments flowing between AI companies. He's got a newsletter, he never talks about his personal life, and only a few of his posts are paywalled (I, obviously, do not pay). He's also legit.
For any Zoom-Zooms out there who don't like to read posts on X, feel free to watch YouTube videos from this Drew guy (YouTube, nitter, X, Instagram). The videos are somewhat exaggerated, and many of them are hypothetical, but they're based on realistic predictions and world news. They're good for getting a gist of what is going on and what could happen, but that's about it.
On the politics side:
  • Pause AI (website): "Don't let AI companies gamble away our future"
  • Stop AI (website): "Stop AI is a grassroots movement of everyday people using democratic and non-violent methods to disrupt the reckless development of destructive artificial intelligence technology."
About uncensored alternatives: I did give Dolphin (HuggingFace, website) a try many months ago, but I do not need to make meth. Truth be told, making meth costs a lot of money in lab equipment by itself, and my house is not spacious enough for it. Obviously the legality of it all does not allow me to make drugs anyway; and meth-buying clients would be mentally deranged criminals, so why would I expose myself to that?
Dolphin, in my opinion, was kind of lame. It will tell you how to make meth but will act uncomfortable when talking negatively about jews. Okay.

Some more stuff:
  • AI 2027 (website): "We predict that the impact of superhuman AI over the next decade will be enormous, exceeding that of the Industrial Revolution."
  • METR (website): "METR conducts research and evaluations to improve public understanding of the capabilities and risks of frontier AI systems."
  • ARC Prize (website): "ARC Prize Foundation is a nonprofit advancing open-source artificial general intelligence research through benchmarks & prizes."
 
Dolphin, in my opinion, was kind of lame. It will tell you how to make meth but will act uncomfortable when talking negatively about jews. Okay.
Dolphin came out more than a year ago. It was quickly surpassed by Illama 2 and its fine-tunes. Stability pretty much gave up on it because 60B can't be run locally. Their main claim to fame was local image models that were accessible to consumer GPUs; aside from that, their models are subpar at best compared to SOTA. People still run SDXL and its finetunes and LoRAs years later.
 
Dolphin came out more than a year ago. It was quickly surpassed by Illama 2 and its fine-tunes. Stability pretty much gave up on it because 60B can't be run locally. Their main claim to fame was local image models that were accessible to consumer GPUs; aside from that, their models are subpar at best compared to SOTA. People still run SDXL and its finetunes and LoRAs years later.
Oh. Good to hear.
For a while, I got a bit annoying about abliterated models because I wanted one for myself, despite not being able to even run one locally. But it's good to know that there are options.

I am also considering voice cloning for language-learning purposes; has anyone tried it?
 
I am also considering voice cloning for language-learning purposes; has anyone tried it?
How would you use voice cloning to learn a language?

I've used voice generation to help with language learning, just screenshotting text in other languages and having the LLM pronounce it aloud so I can learn where the emPHAsis is placed, which letters behave oddly, etc. But where does voice cloning come into the picture?
 
My main desktop has a 5060Ti (16GB) and a 5070Ti in it. I use it for image gen; these days I've been playing with Z-Image Turbo, and what it lacks in making successful images it makes up for in failing really quickly. Sometimes I switch over to Llama to play with things in parallel. It's obviously not a great solution, but it works well enough for the various things I've thrown at it. Next on my hit list is finding a decent OCR model to re-process all my old documents (and new ones), and then probably another model to record what each document is, which company it's from, etc., so I can search them. My "gaming" system has a 7900 XTX, which I think ROCm has finally gotten to the point of mostly working with, several years later.
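For what it's worth, here's a rough sketch of the OCR-then-tag pipeline I have in mind; pytesseract, the folder names, and the prompt format are all my own assumptions rather than anything I've settled on.
```python
# Sketch only: OCR each scan, build a tagging prompt for whatever local model
# ends up doing the classification. Folder layout and prompt are placeholders.
from pathlib import Path

from PIL import Image
import pytesseract

def ocr_document(path: Path) -> str:
    """Pull the raw text out of one scanned page."""
    return pytesseract.image_to_string(Image.open(path))

def tagging_prompt(text: str) -> str:
    """Prompt you'd hand to whatever local model does the tagging."""
    return ("Answer in one line as '<document type> | <company> | <date>':\n\n"
            + text[:4000])

index = {}
for scan in Path("scans").glob("*.png"):      # hypothetical folder of scans
    index[scan.name] = tagging_prompt(ocr_document(scan))
    # ...send index[scan.name] to the model and store its reply for search
```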
 
What are you guys' thoughts on the Hermes agent harness? I've really been enjoying it recently, and I was able to successfully crack Steam DRM with it. I cracked the Trump assassin Steam game and archived it, since it was removed from the Steam store. I did all of this using the Hermes agent and open-source models.
 
I've been burning so many tokens with OpenClaw via OpenRouter. Z.ai's GLM 5.1 is UNREAL. Took me a while to switch from Kimi, and even K2.6 sucks in comparison. Gemma 4 31B is great for spawning low-cost subagents.
Hey, an OpenClaw user! I was just roasting OpenClaw in the AI Derangement CW thread, and no one else there had any experience with it to explain what it's really good for. What do you use it for? Good news: unless you already have a hand-curated OpenClaw setup that segregates the AI-required functionality from the programmable functionality, you probably stand to save a lot of money in API costs. Harnesses are extremely token-inefficient in general.
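To make "segregate the AI-required functionality" concrete, here's a toy sketch: everything except the final model call is ordinary deterministic code, so the harness only spends tokens on the one step that actually needs a model. The test-runner scraping and helper names are hypothetical, not how OpenClaw itself works.
```python
import subprocess

def failing_tests() -> list[str]:
    """Deterministic step: run the suite and scrape failures ourselves, zero tokens."""
    out = subprocess.run(["pytest", "-q"], capture_output=True, text=True).stdout
    return [line for line in out.splitlines() if line.startswith("FAILED")]

def build_prompt(failures: list[str]) -> str:
    """Deterministic step: assemble the one prompt the model actually needs to see."""
    return "Suggest likely root causes for these failing tests:\n" + "\n".join(failures)

# model_reply = call_your_model(build_prompt(failing_tests()))  # the only token spend
```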
I'm a lawyer too (BigLaw for Public Entities). It is unreal how many pro pers are filing Complaints using AI. Most, however, are easily knocked out by Demurrers and Motions to Dismiss. They never want to meet and confer, because over the phone it would become obvious that they know nothing about civil procedure or what they actually filed.

I can tell most use ChatGPT, which is the worst for legal writing. Opus is by far the best for drafting legal work, although Gemini with Deep Research can usually catch a case cite or two where the citation doesn't really jibe.
If I may derail the thread, how are industry lawyers (mis)using AI? I might be the mirror image of you (ML engineer with an autistic interest in law). A lot of my leftover tokens go to asking Claude to generate simple imaginary LARP cases (like small-business tax assessments); then I practice writing filings/opinions for them and ask Claude to check my work. I curate the sources myself and feed them to Claude for the check instead of relying on the model's knowledge (otherwise the citations are constantly wrong; it's an inherent problem in the architecture). I have no idea if I'm doing something retarded, because Claude sounds like it knows what it's saying and can criticize me pretty harshly. No, I'm not trying to train myself to write pro se filings; I'm a good boy who dindu nuffin and don't have to deal with the courts IRL.
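For anyone curious, the "feed curated sources instead of trusting model knowledge" part looks roughly like this with the Anthropic Python SDK; the model name, file layout, and system prompt are placeholders, not my actual setup.
```python
import pathlib
import anthropic

# Curated sources go into the prompt so the model critiques against *them*,
# not its own (often wrong) memory of case law. Paths are made up.
sources = "\n\n---\n\n".join(p.read_text()
                             for p in pathlib.Path("sources").glob("*.txt"))
draft = pathlib.Path("my_filing_draft.txt").read_text()

client = anthropic.Anthropic()   # picks up ANTHROPIC_API_KEY from the environment
reply = client.messages.create(
    model="claude-sonnet-4-5",   # placeholder; use whichever model you actually run
    max_tokens=2000,
    system=("Critique the draft strictly against the supplied sources. "
            "Flag any citation that does not appear in them."),
    messages=[{"role": "user",
               "content": f"SOURCES:\n{sources}\n\nDRAFT:\n{draft}"}],
)
print(reply.content[0].text)
```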
imho the best card for AI at the present moment is the 24GB 3090
This depends on what models you intend to run locally. In the peasant consumer class, the 3090 (24GB) and the 5070Ti (16GB) share the same price point in my region. Despite the higher VRAM, the drawback of the 3090 is that you have to get it second-hand, and the 30-series tensor cores don't natively support 8-bit and 4-bit quantization, unlike the 50-series. That means whatever TPS improvements you're meant to get from using a lower quant don't really apply to the 3090, since it re-expands the weights to FP16 during inference. I'm not saying the 5070Ti is the best consumer card for local AI, 16GB of VRAM is dogshit, but it's better than the 3090 if one card is all you can afford. The reason the 3090 is still holding its price is that people are stacking them. The most popular configuration is 4x3090 for 96GB VRAM, also known as the poor man's RTX Pro 6000. If you're going to have 4 of the same card, the extra 32GB VRAM edges out the architectural improvements.

For LLMs, the VRAM of a single card doesn't matter a lot: LLMs have the strongest support for multi-card deployments, and going up from 16GB to 24GB doesn't give you access to a whole new tier of models. You're still stuck in the sub-30B parameter zone. Some low-quant 30Bs can squeak into 24GB, but the 3090's FP16 processing and slim remaining memory headroom will make them run like snails. If you have the know-how to source them, Chinese hackers sell modded 4090s with 48GB VRAM, which gives you a real step up to Q3/Q4 70B models, though I can't imagine these are much cheaper than A6000s in this shitty market. You can get away with stacking old 3D rendering cards like Quadros and MI50s if you really need a chungus LLM at home, but you need the specialized hardware and power equipment to sustain the data bandwidth and yuge power draw. I'd rather just pay for an external API at that point.

For image and video generation, multi-GPU isn't as well-supported as it is for LLMs, so a single good card with yuge VRAM is better than several okay ones. For image generation, both 16GB and 24GB are more than enough for SOTA. Okay, maybe the newer Fluxes are a bit too fat for 24GB, but there isn't a big performance drop with their quants. A newer-generation 16GB card like the 5070Ti/5080 (5070Tis are binned 5080s) will outperform the 3090 in generation speed and access to native quantization. For video generation, neither 16GB nor 24GB is enough for SOTA, so the 50-series wins hands down in generation speed for the smaller models, as it has access to Flash Attention while the 30-series doesn't.

BTW to anyone still considering it: The DGX Spark is Nvidia's version of the Mac Studio and AMD Strix Halo. It's a "shared memory" mini PC. It's not a GPU.
I have been unable to find any thread, in general, that would allow me to explain how batshit insane some usages of AI are. Obvious examples are AI-generated advertisements, AI-powered scams, AI agents having a meltdown like an autistic toddler, and so on.
The AI Derangement Syndrome thread welcomes you! We're trying to get more Pro-AI derangement (AI ads/scams/marketing campaigns, AI paranoia/dooming, "AI is God" hype) because the discussion is currently skewed to the Anti-AI side.

Also, if I remember correctly, the "AI Skeptic" communities you listed, especially the "Pause AI" one, were frequented by the people who attacked Scam Altman's house.
Dolphin, in my opinion, was kind of lame. It will tell you how to make meth but will act uncomfortable when talking negatively about jews. Okay.
Most "uncensored" models are like that because abliteration only targets explicit refusals. The uncensored model won't say "no" outright, but with enough negative reinforcement in its weights, it will try to worm itself into subverting no-no requests like OpenAI demonstrates in this post. Jailbreaking with push prompting (explicitly tell the model you wish to gas the kikes so it will prioritize associated tokens) is the stronger solution overall.
 
The most popular configuration is 4x3090 for 96GB VRAM, also known as the poor man's RTX Pro 6000. If you're going to have 4 of the same card, the extra 32GB VRAM edges out the architectural improvements.
At that point you're better off just building a mining rig, slapping one RTX 3090 into an x8 slot for SD, and hoping you can split the other x8 into eight x1 lanes for the rest. You'd still be saving two or three grand for 216GB of VRAM versus a Pro 6000, but you'd be converting a max of around 2.5kW into heat and compute. Plus the space requirements would be insane; we're talking an open-air, triple-layer rack stack (8-12U).
 
My main desktop is an old X299 workstation with quad-channel RAM and plenty of PCIe lanes. It seems like half the advice I've heard says that you're limited by your card's VRAM, but I also see people putting multiple 5090s in workstations, so I'm assuming you can cluster multiple cards together? I don't know how much this stuff can be abstracted; I've just messed around with SwarmUI and ComfyUI. How the fuck does that even work? Or are people just running different models on the cards at the same time, in parallel?
 
My main desktop is an old X299 workstation with quad-channel RAM and plenty of PCIe lanes. It seems like half the advice I've heard says that you're limited by your card's VRAM, but I also see people putting multiple 5090s in workstations, so I'm assuming you can cluster multiple cards together? I don't know how much this stuff can be abstracted; I've just messed around with SwarmUI and ComfyUI. How the fuck does that even work? Or are people just running different models on the cards at the same time, in parallel?
LLMs you can very easily split between cards with Illamacpp. For image generation, you can split the complete parts of your model between GPUs, but you can't break a single component into pieces. For example, you can run the whole text encoder and the VAE on different cards, but not spread the weights of the main model across cards.
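If you want the concrete version of the LLM split, this is roughly what it looks like through llama-cpp-python (assuming that's the binding you end up using); the filename and split ratio are invented.
```python
# Minimal sketch of splitting one LLM's weights across two cards.
from llama_cpp import Llama

llm = Llama(
    model_path="some-70b-q4.gguf",   # invented filename
    n_gpu_layers=-1,                 # push every layer onto the GPUs
    tensor_split=[0.6, 0.4],         # roughly 60/40 split between GPU 0 and GPU 1
)
```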
 
LLMs you can very easily split between cards with Illamacpp. For image generation, you can split the complete parts of your model between GPUs, but you can't break a single component into pieces. For example, you can run the whole text encoder and the VAE on different cards, but not spread the weights of the main model across cards.
Personally, I often run random prompts for imagegen, so I just run independent generations on different cards. It's not any faster per individual image, but it gives more throughput, and obviously it won't increase the maximum model size you can run.
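One simple way to do the independent-generations-per-card approach, assuming an NVIDIA setup: pin each worker process to its own GPU with CUDA_VISIBLE_DEVICES. The generation script and prompts here are placeholders.
```python
import os
import subprocess

prompts = ["a foggy harbor at dawn", "a brutalist cathedral"]   # placeholder prompts

procs = []
for gpu_id, prompt in enumerate(prompts):
    env = dict(os.environ, CUDA_VISIBLE_DEVICES=str(gpu_id))    # one card per worker
    procs.append(subprocess.Popen(
        ["python", "generate.py", "--prompt", prompt],           # hypothetical script
        env=env,
    ))
for p in procs:
    p.wait()
```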
 