Gizmodo: When the Internet Archive Forgets - Or why Archive.fo > Archive.org

https://archive.fo/p8DmQ
On the internet, there are certain institutions we have come to rely on daily to keep truth from becoming nebulous or elastic. Not necessarily in the way that something stupid like Verrit aspired to, but at least in confirming that you aren’t losing your mind, that an old post or article you remember reading did, in fact, actually exist. It can be as fleeting as using Google Cache to grab a quickly deleted tweet, but it can also be as involved as doing a deep dive of a now-dead site’s archive via the Wayback Machine. But what happens when an archive becomes less reliable, and arguably has legitimate reasons to bow to pressure and remove controversial archived material?

A few weeks ago, while recording my podcast, the topic turned to the old blog written by The Ultimate Warrior, the late bodybuilder turned chiropractic student turned pro wrestler turned ranting conservative political speaker under his legal name of, yes, “Warrior.” As described by Deadspin’s Barry Petchesky in the aftermath of Warrior’s 2014 passing, he was “an insane dick,” spouting off in blogs and campus speeches about people with disabilities, gay people, New Orleans residents, and many others. But when I went looking for a specific blog post, I saw that the blogs were not just removed, the site itself was no longer in the Internet Archive, replaced by the error message: “This URL has been excluded from the Wayback Machine.”

Apparently, Warrior’s site had been de-archived for months, not long after Rob Rousseau pored over it for a Vice Sports article on the hypocrisy of WWE using Warrior’s image for their Breast Cancer Awareness Month campaign. The campaign was all about getting women to “Unleash Your Warrior,” complete with an Ultimate Warrior motif, but since Warrior’s blogs included wishing death on a cancer-survivor, this wasn’t a good look. Rousseau was struck by how the archive was removed “almost immediately after my piece went up, like within that week,” he told Gizmodo.

Rousseau suspected that WWE was somehow behind it, but a WWE spokesman told Gizmodo that they were not involved. Steve Wilton, the business manager for Ultimate Creations, also denied involvement. A spokesman for the Internet Archive, though, told Gizmodo that the archive was removed because of a DMCA takedown request from the company’s business manager (Wilton’s job for years) on October 29, 2017, two days after the Vice article was published. (He has not replied to a follow-up email about the takedown request.)

Over the last few years, there has been a change in how the Wayback Machine is viewed, one inspired by the general political mood. What had long been a useful tool when you came across broken links online is now, more than ever before, seen as an arbiter of the truth and a bulwark against erasing history.

That archive sites are trusted to show the digital trail and origin of content is not just a must-use tool for journalists, but effective for just about anyone trying to track down vanishing web pages. With that in mind, that the Internet Archive doesn’t really fight takedown requests becomes a problem. That’s not the only recourse: When a site admin elects to block the Wayback crawler using a robots.txt file, the crawling doesn’t just stop. Instead, the Wayback Machine’s entire history of a given site is removed from public view.
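For anyone wondering what that robots.txt block looks like in practice, here is a minimal sketch using Python's standard `urllib.robotparser`. The `ia_archiver` user-agent string, the `example.com` URL, and the `allowed` helper are assumptions for illustration and do not come from the article. The sketch only shows how a crawler would read the rule; hiding already-archived history is a policy decision on the archive's side, not something the parser does.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt a site admin might publish.
# "ia_archiver" is assumed here as the archive crawler's user agent.
ROBOTS_TXT = """\
User-agent: ia_archiver
Disallow: /

User-agent: *
Disallow:
"""


def allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if user_agent may fetch url under the given robots.txt text."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    page = "https://example.com/blog/post"  # placeholder URL
    print("archive crawler allowed:", allowed(ROBOTS_TXT, "ia_archiver", page))
    print("other crawler allowed:", allowed(ROBOTS_TXT, "SomeOtherBot", page))
```

With this file the archive crawler is refused everywhere on the site, while every other crawler is still let in.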

In other words, if you deal in a certain bottom-dwelling brand of controversial content and want to avoid accountability, there are at least two different, standardized ways of erasing it from the most reliable third-party web archive on the public internet.

For the Internet Archive, like with quickly complying with takedown notices challenging their seemingly fair use archive copies of old websites, the robots.txt strategy, in practice, does little more than mitigating their risk while going against the spirit of the protocol. And if someone were to sue over non-compliance with a DMCA takedown request, even with a ready-made, valid defense in the Archive’s pocket, copyright litigation is still incredibly expensive. It doesn’t matter that the use is not really a violation by any metric. If a rightsholder makes the effort, you still have to defend the lawsuit.

“The fair use defense in this context has never been litigated,” noted Annemarie Bridy, a law professor at the University of Idaho and an Affiliate Scholar at the Center for Internet and Society at Stanford Law School. “Internet Archive is a non-profit, so the exposure to statutory damages that they face is huge, and the risk that they run is pretty great ... given the scope of what they do; that they’re basically archiving everything that is on the public web, their exposure is phenomenal. So you can understand why their impulse might be to act cautiously even if that creates serious tension with their core mission, which is to create an accurate historical archive of everything that has been there and to prevent people from wiping out evidence of their history.”

While the Internet Archive did not respond to specific questions about its robots.txt policy, its proactive response to takedown requests, or if any potential fair use defenses have been tested by them in court, a spokesperson did send this statement along:

Several months after the Wayback Machine was launched in late 2001, we participated with a group of outside archivists, librarians, and attorneys in the drafting of a set of recommendations for managing removal requests (the Oakland Archive Policy) that the Internet Archive more or less adopted as guidelines over the first decade or so of the Wayback Machine.

Earlier this year, we convened with a similar group to review those guidelines and explore the potential value of an updated version. We are still pondering many issues and hope that before too long we might be able to present some updated information on our site to better help the public understand how we approach take down requests. You can find some of our thoughts about robots.txt at http://blog.archive.org/2017/04/17/...arch-engines-dont-work-well-for-web-archives/.

At the end of the day, we strive to strike a balance between the concerns that site owners and rights holders sometimes bring to us with the broader public interest in free access for everyone to a history of the Internet that is as comprehensive as possible.

All of that said, the Internet Archive has always held itself out to be a library; in theory, shouldn’t that matter?

“Under current copyright law, although there are special provisions that give certain rights to libraries, there is no definition of a library,” explained Brandon Butler, the Director of Information Policy for the University of Virginia Library. “And that’s a thing that rights holders have always fretted over, and they’ve always fretted over entities like the Internet Archive, which aren’t 200-year-old public libraries, or university-affiliated libraries. They often raise up a stand that there will be faux libraries, that they’d call themselves libraries but it’s really just a haven for piracy. That specter of the sort of sham library really hasn’t arisen.” The lone exception that Butler could think of was when American Buddha, a non-profit, online library of Buddhist texts, found itself sued by Penguin over a few items that they asserted copyright over. “The court didn’t really care that this place called itself a library; it didn’t really shield them from any infringement allegations.” That said, as Butler notes, while being a library wouldn’t necessarily protect the Internet Archive as much as it could, “the right to make copies for preservation,” as Butler puts it, is definitely a point in their favor.

That said, “libraries typically don’t get sued; it’s bad PR,” Butler says. So it’s not like there’s a ton of modern legal precedent about libraries in the digital age, barring some outliers like the various Google Books cases.

As Bridy notes, in the United States, copyright is “a commercial right.” It’s not about reputational harm, it’s about protecting the value of a work and, more specifically, the ability to continuously make money off of it. “The reason we give it is we want artists and creative people to have an incentive to publish and market their work,” she said. “Using copyright as a way of trying to control privacy or reputation ... it can be used that way, but you might argue that’s copyright misuse, you might argue it falls outside of the ambit of why we have copyright.”

We take a lot of things for granted, especially as we rely on technology more and more. “The internet is forever” may be a common refrain in the media, and the underlying wisdom about being careful may be sound, but it is also not something that should be taken literally. People delete posts. Websites and entire platforms disappear for business and other reasons. Rich, famous, and powerful bad actors don’t care about intimidating small non-profit organizations. It’s nice to have safeguards, but there are limits to permanence on the internet, and where there are limits, there are loopholes.

Interestingly enough Jason Scott AKA Textfiles (who works for Archive.org) wasn't too thrilled at the article:
https://archive.fo/zQnUP
 
That archive sites are trusted to show the digital trail and origin of content is not just a must-use tool for journalists, but effective for just about anyone trying to track down vanishing web pages.

Firstly, journalists consider archive sites a must use tool? Weird so many of them block their sites from being archived then, along with regularly deleting their own articles or changing them all the fucking time.

Second, this sentence is broken grammatically, I'm pretty sure. If we try to break down what he's literally saying here, it's "Archive sites being trusted to show the digital trail and origin of content is a must-use tool for journalists." The trust of those things is the must use tool, not the things themselves, in this sentence. What I would assume (or hope) he meant is "Archive sites are a must-use tool for journalists, because they're trusted to show the digital trail and origin of content".

Jesus fucking christ man, who's the editor at this gizmodo operation?
 
"Archive.fo > Archive.org" You say this, but archive.today isn't the one holding copies of TempleOS and a crawl someone did of Kiwi Farms this year. It can't even archive .pdfs properly.

I kid, they both have their pros and cons, though I use both (and a few others) heavily. While archive.today's owner doesn't seem to be particularly transparent about things to the general public, hasn't been around for nearly as long as archive.org, and had that weird incident with blocking the entirety of Finland, the site does seem to be more or less steadfast with keeping content up. (Besides hiding things like the Goodreads domain for the US, I'm just baffled by that.) I'm little more than an amateur web archiving enthusiast though.

I saw this thread and thought it was a very late posting of an article and reaction I saw before; this is a repeat of something that happened earlier this year. In May-ish Motherboard posted an article about an archived version of a website they were investigating being abruptly excluded in the Wayback Machine, and @textfiles got mad at Motherboard and said elsewhere the article was written without proper comment from IA people. Which, I dunno, I don't even know how journalism works. I'm curious what the hypothetical better article would include though, besides something about ethics in archiving and that the archives aren't deleted, just pulled from public view. At least this article isn't as braindead as the Motherboard one was and notes how risky something like a lawsuit would be for IA.

And since I have nowhere else relevant to barf about my thoughts on web archives, this reminds me of something: I remember reading on a Wayback FAQ that full-text search of the archives wasn't available "--yet." Yet. If and when that day comes, I fear it will be both wonderful and terrible, for a number of reasons to both the archivists and the archived. What takedown requests there will be :heart-empty:
 
All this because the Ultimate Warrior made fun of troons and welfare queens on a blog over a decade ago.
 
Holy shit they have 12 editors. I thought I was being facetious, I assumed there weren't any editors. Well that's 12 people not doing their job.
In this case I would bet that editor is just a title they give to anyone who works for the company directly and isn't a freelancer.
 
I wonder if having 12 editors created a bystander effect where the editors just assumed someone else would do their job. Meaning very little actually got done.
I just wonder what they actually do. I mean, obviously not fact checking, that just gets in the way of a good story. But you'd think they could at least make sure it makes some kind of grammatical sense.
 