Llama 4 failed because it was the first model to launch with broken day-1 support (this is now industry standard and expected), but more importantly it dropped support for GPUs with less than 80GB of VRAM. People got mad when it didn't work out of the box on their 3080 and shit their pants that they couldn't run it, so all their opinions are second-hand from people who had a bad experience with an "added support for llama 4" patch that didn't work perfectly. A bad first impression, caused by bad implementations of a broken chat template, killed it.
What is it with the number 4 and shit-tier support? Gemma 4 came out without a valid template for its think tags ready in SillyTavern and other consumer software. On top of that, it can't keep the starting think tags straight in any of the quants, despite Google being the ones to come out with the turbo quant themselves.
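For context on what "a valid template for the think tags" even means, here's a toy sketch of the kind of template a frontend needs. The tag names and structure below are illustrative only, not Gemma's actual template:

```python
from jinja2 import Template

# Toy chat template that wraps the model's reasoning in think tags.
# Illustrative only -- not the real Gemma template, which is exactly
# the piece frontends like SillyTavern were missing at launch.
CHAT_TEMPLATE = (
    "{% for m in messages %}"
    "<start_of_turn>{{ m.role }}\n"
    "{% if m.reasoning %}<think>{{ m.reasoning }}</think>\n{% endif %}"
    "{{ m.content }}<end_of_turn>\n"
    "{% endfor %}"
)

messages = [
    {"role": "user", "content": "Hi there"},
    {"role": "model", "reasoning": "Casual greeting, reply in kind.",
     "content": "Hey! What's up?"},
]
print(Template(CHAT_TEMPLATE).render(messages=messages))
# If the frontend doesn't know where the think tags go, raw reasoning
# leaks into the chat -- the "can't keep the think tags straight" problem.
```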
Llama 4 failed because it was the first model to launch with broken day-1 support (this is now industry standard and expected), but more importantly it dropped support for GPUs with less than 80GB of VRAM. People got mad when it didn't work out of the box on their 3080 and shit their pants that they couldn't run it, so all their opinions are second-hand from people who had a bad experience with an "added support for llama 4" patch that didn't work perfectly. A bad first impression, caused by bad implementations of a broken chat template, killed it.
I never heard anything about "Day 1 Support" being brought up when it comes to Llama 4; it's universally agreed that the models are simply dogshit, and that's why it flopped.
Artificial Analysis estimates that Scout (109B-A17B-16E, aka 109B total parameters, 17B active parameters and 16 experts) has an intelligence index of 14, while Maverick (400B-A17B-128E) has one of 18.
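To unpack that notation with some quick ballpark math (a sketch; "active" also includes the shared/dense layers, so the breakdown is approximate):

```python
# Quick sanity math on the MoE notation above (approximate: routed experts
# plus shared/dense layers together make up the "active" count).
models = {
    "Scout":    {"total_b": 109, "active_b": 17, "experts": 16},
    "Maverick": {"total_b": 400, "active_b": 17, "experts": 128},
}
for name, m in models.items():
    frac = m["active_b"] / m["total_b"]
    print(f"{name}: {m['total_b']}B total, {m['active_b']}B active per token "
          f"({frac:.0%}), {m['experts']} experts")
# Scout:    109B total, 17B active per token (16%), 16 experts
# Maverick: 400B total, 17B active per token (4%), 128 experts
# Same per-token compute for both; Maverick just has a much bigger expert pool.
```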
If you look at benchmarks you'll find that Maverick is one of the worst-performing frontier models ever released, and a score of 18 is so disgusting in fact that even Gemma 4 E4B (an 8B model with 4.5B active parameters) beats it, a model that's a fraction of Maverick's size (50x smaller).
As you can see, discussions online about it are completely focused on its performance, possible fraudulent benchmarks, and the fact that Llama 4 was so disappointing despite all the pre-launch hype that Meta cancelled Behemoth altogether. So I genuinely don't know what you're talking about when you're describing people "shitting their pants" because of "day 1 support" and "hardware problems".
So I genuinely don't know what you're talking about when you're describing people "shitting their pants" because of "day 1 support" and "hardware problems".
I'm not dying on the hill of llama 4, I'm just saying it was never given a fair shot at life.
1. What do you think all these benchmarks are based on? They were all run on day-one launch implementations. This was the first open-source MoE model to my recollection, and I don't think anyone got the implementation right. Just compare this to the release of Gemma 4, where tons of people couldn't get anything but incoherent garbage out of it for the first month while others swore by it.
2. I never mentioned hardware problems, I was alluding to how the model was designed to run on an H100 rather than locally on a 3060. People were anticipating a Llama 4 model small enough that they would be able to run it, but there was no 9B Llama 4. Expert offloading wasn't a thing yet, so the only way to run it locally was on CPU, and only if you had enough RAM. That's obviously not the same thing as "bad performance". This made people very angry and uncharitable, and is the reason none of the benchmarks were ever corrected. Because everyone abandoned it within a week, none of the issues ever got discovered and fixed like they were with Gemma 4.
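For anyone unfamiliar: "expert offloading" just means keeping the big expert weights in system RAM and only running the few experts a token actually routes to, pulling them onto the GPU on demand. A toy torch sketch of the idea (illustrative only, not any real runtime's implementation):

```python
import torch

# Toy MoE layer illustrating expert offloading: experts live in CPU RAM,
# and only the top-k experts a token routes to are run (moved to the GPU
# on demand if one is available). Illustrative, not a real runtime.
class OffloadedMoE(torch.nn.Module):
    def __init__(self, dim=64, n_experts=16, top_k=2):
        super().__init__()
        self.router = torch.nn.Linear(dim, n_experts)
        # Experts stay on CPU; with dense inference they'd all need VRAM.
        self.experts = torch.nn.ModuleList(
            torch.nn.Linear(dim, dim) for _ in range(n_experts)
        )
        self.top_k = top_k
        self.gpu = torch.device("cuda") if torch.cuda.is_available() else None

    def forward(self, x):  # x: (dim,) -- a single token, for simplicity
        weights, idx = self.router(x).softmax(-1).topk(self.top_k)
        out = torch.zeros_like(x)
        for w, i in zip(weights, idx):
            expert = self.experts[i]
            if self.gpu is not None:  # pull just this expert onto the GPU
                expert = expert.to(self.gpu)
                out += w * expert(x.to(self.gpu)).to("cpu")
            else:
                out += w * expert(x)
        return out

moe = OffloadedMoE()
print(moe(torch.randn(64)).shape)  # torch.Size([64]) -- only 2 of 16 experts ran
```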
Also, are you really comparing it to a model over a year newer with reasoning? Why not compare it to a contemporary like GPT-4o? This is what it was designed to go up against.
According to the website you're referencing, Gemma 4 E4B without reasoning was more expensive to run than GPT OSS 120B with reasoning and Llama 4 Maverick, which is obviously bullshit. So no, I don't trust anything but real world usage.
Artificial Analysis is a well-known organization that tests models on a series of benchmarks (dozens and dozens, on a variety of tasks depending on what the model was trained for; you're not going to benchmark a text generator on images, for example), and the Intelligence Index is simply an average calculated from all those benchmarks put together. They're largely accepted as being accurate (and they literally have no reason to fake Llama 4's results, unless you have some juicy anti-Meta conspiracy theory going on you wanna share with us?).
This is irrelevant because they can run models themselves (it's a big organization with a lot of money to throw around, they don't depend on providers like normal people). Also, cases where a provider drastically impacts intelligence are not only extremely rare, but the idea that ALL providers in the world were simultaneously at fault for Llama 4's poor performance is ridiculous. If that were the case, it would've been noticed at launch and especially POST-LAUNCH due to widespread inconsistency (and again, it's insane to suggest the countless providers that exist would blatantly cause something like this). To this day, Llama 4 is still dogshit even though it's been well over a year since it launched... so what's up? Are the providers still cucking it for some reason?
Not by a long shot, man. To give you some perspective on how long open-source MoE models have existed: Llama 4 Scout released on the 5th of April 2025, while "deepseek-moe-16b-base" released on the 8th of January 2024 (and I'm quite sure there are even older MoEs than this one, but this is the oldest renowned MoE I can think of off the top of my head rn).
Just compare this to the release of Gemma 4, where tons of people couldn't get anything but incoherent garbage out of it for the first month while others swore by it.
I genuinely have no idea what you're talking about or where you're getting this information from; this is the second instance where you're claiming a model had severe launch issues to the point "it only output garbage!". I literally downloaded its weights when it launched and it worked just fine (except for the fact that it was slow as shit, but then again I have a potato PC... so whatever). Not to mention that at launch Gemma 4 was universally praised for its performance, as well as the fact that Google finally licensed it under apache-2.0. The only blatant complaints and criticisms about its launch that flew around HuggingFace, Twitter, Reddit, YouTube... were related to stuff like "inference speed via Google's API being inconsistent", or "the knowledge cutoff of January 2025 sucks for an April 2026 release", or even "incompatibility with Flash Attention 2 due to the model's head dimensions being incompatible with its kernel" (again, very specific and obscure issues, not a widespread "THIS THING IS OUTPUTTING CRAP" scandal across the internet).
2. I never mentioned hardware problems, I was alluding to how the model was designed to run on an H100 rather than locally on a 3060. People were anticipating a Llama 4 model small enough that they would be able to run it, but there was no 9B Llama 4. Expert offloading wasn't a thing yet, so the only way to run it locally was on CPU, and only if you had enough RAM. That's obviously not the same thing as "bad performance". This made people very angry and uncharitable, and is the reason none of the benchmarks were ever corrected. Because everyone abandoned it within a week, none of the issues ever got discovered and fixed like they were with Gemma 4.
There are some major logical inconsistencies in what you're claiming here, man.
1 - Maverick has 400B parameters in total; even a caveman can tell that it's literally impossible to fully fit this on a single 3060 (this is an absolutely insane idea), and neither did Meta claim that it could run on a single 3060 to cause the kind of hype that "broke people's hearts" (I'll also need to verify your claim about "Expert Offloading" not existing back then, but if it really didn't exist... WHY THE FUCK would people expect to run this on a single 3060? What exactly gave these supposed people the impression that they could realistically run this?). This is also an insane claim because open-source models of equivalent size like MiniMax-01 (456B) already existed back then, and the people who could run them did so without issues. Unless you're telling me that people back then knew they couldn't run 400B+ models on a single 3060 (and therefore used APIs or private servers), but then suddenly forgot about it by the time Llama 4 released?
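The back-of-the-envelope memory math makes the point on its own (a sketch using the standard bytes-per-parameter figures, ignoring KV cache and activation overhead, which only make it worse):

```python
# Rough VRAM math for Maverick's 400B total parameters vs a 12GB RTX 3060.
# Ignores KV cache and activations, so real requirements are even higher.
PARAMS_B = 400       # billions of parameters (all experts must be resident)
VRAM_3060_GB = 12
for fmt, bytes_per_param in [("FP16", 2), ("INT8", 1), ("INT4", 0.5)]:
    need_gb = PARAMS_B * bytes_per_param  # 1B params * N bytes ~= N GB
    print(f"{fmt}: ~{need_gb:.0f} GB needed, {need_gb / VRAM_3060_GB:.0f}x a 3060")
# FP16: ~800 GB (67x a 3060), INT8: ~400 GB (33x), INT4: ~200 GB (17x)
```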
2 - You talking about hardware limitations when we're supposed to be talking about performance is a contradiction that completely breaks this narrative, due to the simple fact that LOW RAM DOESN'T AFFECT BENCHMARK RESULTS. I could run a big model on a potato PC at 0.1 tokens per second and it's still going to output the same results as that model being run on a NASA PC at a trillion tokens/s (low memory makes a model slow; not enough memory simply crashes it). The only things that can affect how well a model performs on benchmark scores (and therefore response quality) are shit like precision, benchmark quality (or cross-contamination), inference script issues, etc. Running a model on lower-end hardware is only gonna make it slow, but the computations remain the same, so this debunks what you're saying by itself: people ran Llama 4 on all sorts of hardware and came to the unanimous conclusion that it sucks (and Artificial Analysis alongside other AI benchmark organizations confirmed this as well).
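The determinism point is easy to demonstrate: with greedy decoding, the argmax at each step doesn't care how fast the logits were computed. A toy sketch (the "model" here is a stand-in hash function, not a real LLM; in practice tiny float differences between backends can flip a near-tie, but slowness alone never changes the math):

```python
import time

# Toy "model": maps a context to logits. Greedy decoding picks the argmax,
# so the output depends only on the logits, never on how fast we got them.
def toy_logits(context):
    return [hash((context, tok)) % 1000 for tok in range(50)]

def greedy_decode(prompt, steps, delay=0.0):
    out = []
    for _ in range(steps):
        time.sleep(delay)                       # simulate slow hardware
        logits = toy_logits((prompt, tuple(out)))
        out.append(max(range(len(logits)), key=logits.__getitem__))
    return out

fast = greedy_decode("prompt", steps=5)             # "NASA PC"
slow = greedy_decode("prompt", steps=5, delay=0.2)  # potato at 0.1 tok/s
print(fast == slow)  # True -- identical tokens, only the wall clock differed
```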
3 - Did you know that Llama 4 is still used to this day? OpenRouter has so far recorded a weekly token count of 26.8 billion, and there are 4 providers on the platform. So the idea that "everyone abandoned it within a week" is simply not true, and I'm sure the people who use that piece of shit (for some reason) would be loud and make it obvious if Llama 4 really were this hidden gem we simply don't realize it is.
At this point I can only come to the conclusion that you either don't know anything about AI and are making this shit up, or you were fed this information by someone else and are now regurgitating it. I've never seen such blatantly inconsistent claims (and I've never seen someone defend Llama 4 this adamantly, it's insane).
Also, are you really comparing it to a model over a year newer with reasoning? Why not compare it to a contemporary like GPT-4o? This is what it was designed to go up against.
I'm not. I opened Maverick's page to illustrate my point and that's how it looked. But sure: the first GPT-4o released on the 13th of May 2024 and it scores 14 on Artificial Analysis (the same as Llama 4 Scout), while Maverick scores 18. Insane (but again, this doesn't mean the model is objectively good. The Intelligence Index is an average taking into account ALL benchmarks summed together. Maverick can excel at a particular task and still be dogshit at 5 different benchmarks). And this also yet again contradicts your narrative that the benchmarks are somehow biased against Llama 4, since it did score higher than some older models from that year.
According to the website you're referencing, Gemma 4 E4B without reasoning was more expensive to run than GPT OSS 120B with reasoning and Llama 4 Maverick, which is obviously bullshit. So no, I don't trust anything but real world usage.
That's not what it's saying. It's saying that when you run their benchmarks through an API, Gemma 4 E4B is more expensive than GPT-OSS 120B and Llama 4 Maverick, which is objectively true:
Gemma 4 E4B:
- $0.30 per million input tokens
- $1.25 per million output tokens
Maverick:
- $0.35 per million input tokens
- $0.85 per million output tokens
GPT-OSS-120B:
- $0.15 per million input tokens
- $0.60 per million output tokens
This shows that Gemma 4 is too expensive compared to models of equivalent size (hence why its little price icon is red despite its input token cost being cheaper than Maverick's), Llama 4 Maverick is fair, and GPT-OSS is a little expensive. Your other screenshot also points out that Gemma 4 generates more tokens to answer the benchmarks compared to the other models (Gemma 4: 500M, GPT-OSS: 220M, Maverick: 170M). Therefore, if you have a model that's more expensive per token AND more verbose (generating way more tokens than the other models)... it's going to cost more, no? Thus what Artificial Analysis says checks out.
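You can do the back-of-the-envelope math yourself (output side only, since the screenshots don't give input token counts, so treat this as a rough sketch):

```python
# Rough cost math from the numbers above: $ per million output tokens
# times millions of output tokens generated on the benchmark runs.
# Output side only -- input token counts weren't shown.
models = {
    "Gemma 4 E4B":  {"out_price": 1.25, "out_tokens_m": 500},
    "GPT-OSS-120B": {"out_price": 0.60, "out_tokens_m": 220},
    "Maverick":     {"out_price": 0.85, "out_tokens_m": 170},
}
for name, m in models.items():
    cost = m["out_price"] * m["out_tokens_m"]  # $/M tokens * M tokens = $
    print(f"{name}: ~${cost:,.2f} to generate its benchmark answers")
# Gemma 4 E4B: ~$625.00, GPT-OSS-120B: ~$132.00, Maverick: ~$144.50
# The cheapest-per-token model isn't the cheapest run when it's 2-3x more verbose.
```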
Also, what do you mean by "So no, I don't trust anything but real world usage"? Define what that is. A benchmark is a score that checks a model's aptitude on a particular task, so if a model scores high on some math benchmarks, it SHOULD be good at doing math for me (in real world usage). Unless you're seriously suggesting "OOGA BOOGA ME RIGHT, EVERYONE ELSE BAD!", which is the kind of cope regurgitated by dumbass fucks like Flat-Earthers ("all bodies of science wrong because me right").
3 - Did you know that Llama 4 is still used to this day? OpenRouter has so far recorded a weekly token count of 26.8 billion, and there are 4 providers on the platform. So the idea that "everyone abandoned it within a week" is simply not true, and I'm sure the people who use that piece of shit (for some reason) would be loud and make it obvious if Llama 4 really were this hidden gem we simply don't realize it is.
I didn't know this, thanks for telling me and for the history lesson on MoE models. But I just want to make sure you aren't just playing dumb here:
And then you end up spending the rest of the time arguing about how much better Maverick is than Gemma 4 E4B? Also, what I mean by calling BS on the cost of inference is that most people running Gemma 4 E4B are probably not paying API costs; they're running it locally and have a negligible per-1M-token cost.
Define what that is. A benchmark is a score that checks a model's aptitude on a particular task, so if a model scores high on some math benchmarks, it SHOULD be good at doing math for me (in real world usage).
OpenRouter usage / corporate popularity for local server hosting would be my main measure of whether a model was a success, but more specifically it's whether I end up using it at work. If the LLM can help solve my work-related problems it's good, and the better it does that, the better the model is. It needs to pass my company's internal test suite to see how well it works with proprietary data and operates in our specific domain, and be made in the USA. I don't care what the Artificial Analysis score or AIME26 scores are, or if it's AGI, if it can't do my work.
As for home use, the most important factor is the ability to self-host, so I usually rely on community sentiment before downloading 50GB and getting rate limited. As for where I'm sourcing all my general community reception, it's r/LocalLlama (yeah, I know, reddit) since they tend to run hardware setups similar to mine. At the time they treated Llama 4 like it was Brutus for not including reasoning and not running on consumer GPUs smaller than a 3090. Here's a snapshot of how they reacted at the time.
Also, I need to fix something I was unclear about in a previous statement: this attitude is what I think is the reason none of the benchmarks were ever corrected. I don't think there was a grand conspiracy; I think people just got burned initially, and I'd hypothesize it might be better now than at launch (like with literally every other LLM).
The benchmarks were never corrected because there was no point. They fucked up the training trying to censor it and it came out lobotomized.
You can't teach the AI things like "black people aren't subhuman" and "no, they don't look like gorillas". It's not a person; it doesn't understand that kind of nuance. When you try to force that in, it propagates through the entire model and fucks up everything that's even tangentially related to persons, black people, or gorillas.
The person who taught the AI that lesson doesn't actually hold it as a literal, structural belief; if they did, their whole brain would be fucked up and wouldn't work. The AI really believes it now, along with everything that comes with it. Congratulations, it's lobotomized.
And then you end up spending the rest of the time arguing about how much better Maverick is than Gemma 4 E4B? Also, what I mean by calling BS on the cost of inference is that most people running Gemma 4 E4B are probably not paying API costs; they're running it locally and have a negligible per-1M-token cost.
I deadass showed you, man, these prices are the average API cost for these models. Maverick costs less than Gemma 4 on an API, so its prices are better than Gemma 4's. I'm not "spending the rest of the time arguing about how much better Maverick is than Gemma 4 E4B", I was answering your claims, essentially just stating "Gemma 4 is indeed more expensive, here's the proof compared to the other models", objectively showing you that Artificial Analysis didn't fake their numbers like you accused.
Also, what I mean by calling BS on the cost of inference is that most people running Gemma 4 E4B are probably not paying API costs; they're running it locally and have a negligible per-1M-token cost.
Yeah, but it's irrelevant, because not everyone can run Gemma 4, which is why there exist(ed) APIs and providers that let people use these models. The average input and output prices for that model at launch were the ones Artificial Analysis put on their website.
I'll concede on this tho: the prices aren't accurate anymore, because I just checked OpenRouter and a bunch of different providers (including Google), and the prices dropped significantly. For example, the average for Gemma 4's 31B is around ~$0.10 for input and ~$0.30 for output rn. Meanwhile, E4B doesn't even have any providers anymore. Therefore, if Artificial Analysis ran its benchmarks on E4B today using an API, it wouldn't even be able to in the first place (since none exist anymore), so the input/output would both be set at 0, just like these models over here (including Gemma 4 31B for some reason, but E2B is indeed set at 0):
OpenRouter usage / corporate popularity for local server hosting would be my main measure of whether a model was a success, but more specifically it's whether I end up using it at work. If the LLM can help solve my work-related problems it's good, and the better it does that, the better the model is. It needs to pass my company's internal test suite to see how well it works with proprietary data and operates in our specific domain, and be made in the USA. I don't care what the Artificial Analysis score or AIME26 scores are, or if it's AGI, if it can't do my work.
If that's the case you shouldn't even look at the Intelligence Index. The index is an AVERAGE, Artificial Analysis grabs the results of countless benchmarks, does some calculations and sums them up into a neat little score. Like I said before, Maverick can be good at one particular task, but be dogshit on 5 different benchmarks (and that's why I showed Maverick's index, because it objectively performed poorly on most benchmarks to have a score that low despite its incredible size).
If I score 100 on a specific exam but score 53, 47, 22 and 5 on other exams, my average will be 45.4 ( (100+53+47+22+5)/5 ), which is a genuinely bad score. You'd definitely want me around for that one particular subject I got 100 on, but my overall performance is terrible, so I wouldn't be hired for most jobs unless my employer had a very specific reason to. It's THE SAME principle: someone might use Llama 4 for something (like the people who are clearly using it on OpenRouter rn), but the large majority of people universally agree that it sucks ass for a frontier model, hence why we dropped it (at least it's cheap).
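Same arithmetic in two lines, if you want to see how one strong score gets diluted:

```python
# One high score barely moves a five-benchmark average.
scores = [100, 53, 47, 22, 5]
print(sum(scores) / len(scores))  # 45.4
```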
And regardless of this, you can't talk shit about benchmarks, because they inherently show aptitude on tasks. So I'd recommend you look at specific benchmark scores on subjects that interest you or are relevant to your work. Popularity scores from platforms like OpenRouter are useful, but they're definitely not perfect and can be very unreliable. For example, Deepseek V4 Flash is being projected into the stratosphere because it ranks #2 in RP right now; however, I've seen hundreds of different instances of people talking shit about it online and saying that GLM 5 is still better (a model that's not even in the top 10). So why is Deepseek V4 Flash at number 2 despite better alternatives not even being visible in the leaderboards? Because it's autistically cheap and "competent enough". Popularity doesn't always mean "best", just "most accessible".
As for home use, the most important factor is the ability to self-host, so I usually rely on community sentiment before downloading 50GB and getting rate limited. As for where I'm sourcing all my general community reception, it's r/LocalLlama (yeah, I know, reddit) since they tend to run hardware setups similar to mine.
It's fine, I also check there and other subreddits like SillyTavern and TavernAI whenever there's a hot release (my main focus concerns creative writing so I'm always on high alert for models that excel in RP).
At the time they treated Llama 4 like it was Brutus for not including reasoning and not running on consumer GPUs smaller than a 3090. Here's a snapshot of how they reacted at the time.
^These guys are absolutely right on the money, see how much emphasis there is on its performance? It was universally agreed that it was terrible (long before benchmarks proved it; they were just the final nail in the coffin). And Maverick was also expensive despite performing worse than the alternatives that existed at the time (which is insane).
^However, reading these snippets finally made me understand what you meant by your previous emphasis on the 3060. Sorry I came on strong against you, but the way you wrote it previously heavily implied there was some kind of hype going on about running these comically large models on a mere 3060 (which is obviously impossible). What these people actually mean is that Meta previously released Llama models in a variety of sizes, from 8B to 405B, so changing this in favor of comically large models that force us to depend on their API truly feels like a "betrayal" of the old open-source "Llama philosophy" (and the fact that we can't even run these models, on top of them being expensive and TERRIBLE, truly ruined Llama for good). But look at all these comments and notice how there's barely any emphasis on model size (so it's not the reason we all "crucified" Llama 4). Honestly that's just a drop in the bucket, because regardless of size, as long as a model is good, people will definitely pay for it (Deepseek models are examples of this). As you can see, people were more mad about the performance than anything; not having smaller sizes is just a secondary issue.
^This guy is wrong tho because MoEs existed for over a year by this point, and there was no controversy related to this type of stuff. There's also the fact that even if Llama 4 is MoE, the architecture isn't meaningfully different compared to Llama 3 to cause such a massive discrepancy in performance (and to the point of confusing people).
Finally, there's also the fact that most AI companies have, for many years now, explicitly put examples of how to run their models on the model cards themselves, so there's just no way for this to be the reason results were/are terrible, because even Meta put it on Maverick's page:
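(The card's snippet looks something like this minimal transformers sketch; I'm reconstructing from memory, so the exact model ID and arguments in Meta's page may differ.)

```python
from transformers import pipeline

# Sketch of the typical model-card "how to run" snippet; the model ID
# below is assumed from memory, not copied verbatim from Meta's card.
pipe = pipeline(
    "text-generation",
    model="meta-llama/Llama-4-Maverick-17B-128E-Instruct",
    device_map="auto",         # spreads the 400B across whatever you have
    torch_dtype="bfloat16",
)
messages = [{"role": "user", "content": "Who are you?"}]
print(pipe(messages, max_new_tokens=64)[0]["generated_text"])
```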
Yes, it's that easy. Though we all know nobody actually followed this, because who the fuck has enough hardware in their house to casually run this thing acceptably (and not at 0.00001 tokens/s)? Mr Zuck clearly smoked that lean green to assume something like this, lol.