Business Reddit will block the Internet Archive - The company says that AI companies have scraped data from the Wayback Machine, so it’s going to limit what the Wayback Machine can access.


Reddit says that it has caught AI companies scraping its data from the Internet Archive’s Wayback Machine, so it’s going to start blocking the Internet Archive from indexing the vast majority of Reddit. The Wayback Machine will no longer be able to crawl post detail pages, comments, or profiles; instead, it will only be able to index the Reddit.com homepage, which effectively means Internet Archive will only be able to archive insights into which news headlines and posts were most popular on a given day.
“Internet Archive provides a service to the open web, but we’ve been made aware of instances where AI companies violate platform policies, including ours, and scrape data from the Wayback Machine,” spokesperson Tim Rathschmidt tells The Verge.

The Internet Archive’s mission is to keep a digital archive of websites on the internet and “other cultural artifacts,” and the Wayback Machine is a tool you can use to look at pages as they appeared on certain dates, but Reddit believes not all of its content should be archived that way.
“Until they’re able to defend their site and comply with platform policies (e.g., respecting user privacy, re: deleting removed content) we’re limiting some of their access to Reddit data to protect redditors,” Rathschmidt says.
The limits will start “ramping up” today, and Reddit says it reached out to the Internet Archive “in advance” to “inform them of the limits before they go into effect,” according to Rathschmidt. He says Reddit has also “raised concerns” about the ability of people to scrape content from the Internet Archive in the past.
Reddit has a recent history of cutting off access to scraper tools as AI companies have begun to use (and abuse) them en masse, but it’s willing to provide that data if companies pay. Early last year, Reddit struck a deal with Google covering both Google Search and AI training data, and a few months later, it started blocking major search engines from crawling its data unless they pay. It also said its infamous API changes from 2023, which forced some third-party apps to shut down and led to protests, were made because those APIs were abused to train AI models.

Reddit also struck an AI deal with OpenAI, but it sued Anthropic in June, claiming Anthropic was still scraping from Reddit even after Anthropic said it wasn’t scraping anymore.
“We have a longstanding relationship with Reddit and continue to have ongoing discussions about this matter,” Mark Graham, director of the Wayback Machine, says in a statement to The Verge.

(Link/Archive)
 
Expect more of this.
Reddit makes a good deal of its money off of selling it's users data, especially to AI startups looking for easy training data. Coupled with the fact that Reddit's been really bad at making consistent revenue streams, but AI-training data outlet is their most successful, expect more jewish chicanery to protect this bottomline.

Reddit used to ban faggots for running post/comment-blanking scripts on their account for that reason, I am positive they've remedied that by caching but Reddit is perennially incompetent.
 
With that said, I shouldn't complain because it's funny watching them squirm and lash out.
NGL, my tolerance for retardation is waning as I get older and it was never really that high to begin with. I want to laugh at their entire faggot power structure collapses on their face. I don't want to laugh at their retardation while they continue to make money lol. Maybe I'm just bitter but I fucking hate reddit. I've had a dozen accounts banned for the most basic bitch shit ever. I just want to see that company fail.
 
How valuable is it really though? Reddit has been permeated with bots since its inception.

The same problem exists for just about every other bulk source of training data for AI. Just about everything has been permeated by bots or corrupted in a multitude of other ways by now. Nothing is clean of it. And the state of AI mania is such that people are willing to pay regardless of the quality of the data available. Quantity matters far more than quality at the moment.
 
Hopefully Archive.today continues to work. I've grabbed hundreds of threads with that, as well as the promortalist killer's entire post history (or his dead muse, don't remember), and another weirdo.
 
Redditors use the Internet Archive? Searching their old posts turns up nothing but dead, unarchived links.
No, they don't use archival systems because Redditors have below room temperature IQs.

You've got it backwards, though. Internet Archive accesses Reddit in order to archive pages automatically for its Wayback Machine, the part of the site that lets you see what websites looked like months, years, or decades ago. Reddit is blocking IA's Wayback scraper because companies were indirectly training their LLMs on Reddit content by accessing it through IA's archives (because Reddit is already blocking LLM scrapers).
 
Expect more of this.
Reddit makes a good deal of its money off of selling it's users data, especially to AI startups looking for easy training data. Coupled with the fact that Reddit's been really bad at making consistent revenue streams, but AI-training data outlet is their most successful, expect more jewish chicanery to protect this bottomline.

Reddit used to ban faggots for running post/comment-blanking scripts on their account for that reason, I am positive they've remedied that by caching but Reddit is perennially incompetent.
AI data from reddit will, in 5-10 years time, turn out to be some of the most useless garbage humanly possible to produce. remember the early google AI search results about gluing cheese to your pizza? that's a result of reddit training data. they are training their AI to be absolutely retarded, and they are paying for the privilege to do it. pajeets are genuinely some of the dumbest, most disgusting semi-beings on the face of this planet, and removing them all from technology corporations is the only way to even begin fixing the infrastructure problems they've caused.
 
Sometimes when the Farms crashes I hit reddit to see what the normies are talking about, and it's always the same five things.

AMERICANS X HAS REPORTED Y, WHERE DO WE GO FROM HERE?
AMERICANS, HOW DO YOU FEEL ABOUT X Y OR Z?
WHAT IS SOMETHING PEOPLE THINK IS A SIGN OF INTELLIGENCE BUT ISN'T?
WOMEN WHAT DON'T MEN THINK ABOUT? (or vice versa).
WHAT'S A SHITTY JOB THAT HAS TERRIBLE PAY BUT SHOULDN'T/WHAT'S THE SECRET ABOUT YOUR JOB NO ONE KNOWS?

Years ago you used to be able to lurk reddit and read something funny/interesting in a pinch, but it's clearly a fuckton of bots and and an increasingly shrinking pool of turd worlders and troons repeating the same 15 talking points.
 
AI data from reddit will, in 5-10 years time, turn out to be some of the most useless garbage humanly possible to produce. remember the early google AI search results about gluing cheese to your pizza? that's a result of reddit training data. they are training their AI to be absolutely retarded, and they are paying for the privilege to do it. pajeets are genuinely some of the dumbest, most disgusting semi-beings on the face of this planet, and removing them all from technology corporations is the only way to even begin fixing the infrastructure problems they've caused.
Oh, I hope it warms your heart to hear people are already getting the ball rolling on that: trolls are creating Reddit threads full of bogus claims (which they made up themselves) to poison ChatGPT's responses about their targets, since this data usually ends up in its training data.

I'll let the person explain it themselves:

Context: a man named Joe Truax started a subreddit/social media movement about men being allowed to cry (r/guycry), which led to some minor trolling that brought Joe to the attention of others. They fed ChatGPT Reddit threads in which they fabricated claims about Joe's trustworthiness and alleged he was a prolific scammer.

1754958201913.webp

The unfortunate and methed out Truax discovered this when asking ChatGPT about himself, blew a fucking gasket and proceeded to chimpout so hard that Reddit admins gigajannied all of his mod accounts and alts from his subreddit.

Here's the response he got that set him tf off.
1754958599400.webp
 
“Until they’re able to defend their site and comply with platform policies (e.g., respecting user privacy, re: deleting removed content) we’re limiting some of their access to Reddit data to protect redditors,”
Didn't the CEO's niece (some activist cunt) get literally every page belonging to her scrubbed from archive.org? (Don't know if she's graced Prospering Grounds)
Sounds like they're complying in some respects.
 
Didn't the CEO's niece (some activist cunt) get literally every page belonging to her scrubbed from archive.org? (Don't know if she's graced Prospering Grounds)
Sounds like they're complying in some respects.
You’re thinking of Taylor Lorenz, and yes she’s budded into quite a lolcow. Internet Archive also deletes anything from Kiwi Farms
 
reddit could solve this by just making it so you need to be logged in to see posts and comments, but they won't.
They already do this if you’re using certain vpns.

And I think their reason for blocking posts is partly bullshit. People get blocked from viewing Reddit due to VPNs, so the person just takes the link, makes an archive, and then views the thread that way.

I don’t care if that’s the actual reason but that’s my experience so I’ll believe that it’s the truth and the main reason.
IMG_7032.webp

BRB, making an archive of the thread I want to see because I won’t log in to Reddit and I’m definitely not turning off my vpn.
 
Besides the loss in preserving the insanity, hatred, ignorance and other bullshit spewed on that platform every minute of the day (to prove how awful it and its members are), I would argue that decontaminating otherwise valuable archives and purging that horrific "culture" would do the world more good than harm.
 
This sounds like a big excuse to hide their own shit. They literally sell their data directly to AI people.
The real reason they're blocking Internet Archive is because of the many skeletons they have in the closet.

Example:

Ghislaine Maxwell was suspected of being a powermod. Her account stopped posting the very same day she was arrested and hasn't posted since. What's interesting is the account did manage to delete tons of posts made over the years. What Ghislaine couldn't delete was the snapshots of subreddit frontpages that Internet Archive captured that showed her postings.

What's the big deal about that?

Well, one interesting discovery was information that seems to support the theory that Elon Musk and Ghislaine Maxwell met up for a "kung fu lesson" in May 2016. Ghislaine had mentioned that she and Elon had talked about the possibility of life on other planets when they'd met in the past. On the day she is believed to have met with Elon, she'd posted an article about a recently discovered planet that could host life.

IMG_0191.webp

Maybe that sparked their conversation, you know, small talk about she sold him some goods. If you view Ghislaines account history you won't see that post, it got deleted, but IA captured the evidence when it took a snapshot of the subreddit she frequented. Oops.

Theres LOTS of data to be minded off Reddit and if you can find a juicy lead or use AI to identify patterns you could really fuck over some people that deserve to be fucked over.

Like Rapeape, the mod of 4chan that everyone loves to hate. Turns out he's had an account on reddit since 2009. One of the possible identities of Rapeape, one he had fought vehemently to bury, took part in an interview in 2015 about his crypto being hacked. In that article this man, Partap Davis, admitted he'd been playing World of Tanks all night before being hacked. Turned out that Rapeape reddit account had done lots of posting on World of Tanks subreddit.

rapeape-reddit-account.webp
Partap Davis Article.webp

This seemingly confirmed the suspicion of who Rapeape really is and totally debunked his lame attempt to convince people Rapeape was actually some Canadian.

I bet there's all kinds of secrets floating around Reddit that lots of people aren't keen on having captured by an independent archiving organization that they can't buy and control.
 
This sounds like a big excuse to hide their own shit. They literally sell their data directly to AI people.
Precisely this. Reddit seems like the biggest argument for "dead internet theory", even if you set aside the AI and count only the braindead people. But I don't know, is it possible for LLMs to be redditors? Who knows.
(Useful) Data in 2025 is more valuable than Oil. Advertising Data, AI Training Data, etc. People are investing a LOT of money into this.
And there will be databanks for data, meaning places to store and hide data and secrets. It's something we saw in Deus Ex: Mankind Divided, but I don't see that being that far off from reality.
 