The text file that runs the internet - For decades, robots.txt governed the behavior of web crawlers. But as unscrupulous AI companies seek out more and more data, the basic social contract of the web is falling apart.

BirdUp · Feb 17, 2024

yeah it blocks the webcrawlers you specify
why does this need an article

Anti Snigger · Feb 17, 2024

Thinking about this, robots.txt probably falls into the same category as email or SMS, where it's way too old for the modern internet.
It would be hard to suggest a more robust alternative though, since it's effectively just a polite informal agreement

Rozzy · Feb 17, 2024

Doesn't robots.txt stop the web-archive from archiving website pages?

DumbDude43 · Feb 17, 2024

Doubly Punished Snigger said:
It would be hard to suggest a more robust alternative though

blacklist IPs known to be operated by or associated with any of the big AI institutions
but it's not like that's foolproof either, i'm sure the folks at openai can figure out how to get around an IP block
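
For what it's worth, the server-side check behind that kind of blocklist is simple. Here is a rough Python sketch, not anyone's actual setup: the CIDR ranges are reserved documentation ranges standing in for real crawler ranges, and `BLOCKED_NETWORKS` and `is_blocked` are made-up names.

```python
import ipaddress

# Hypothetical blocklist. These are reserved documentation ranges,
# NOT real crawler ranges; a real list would have to be maintained by hand.
BLOCKED_NETWORKS = [
    ipaddress.ip_network(cidr)
    for cidr in ("192.0.2.0/24", "198.51.100.0/24")
]


def is_blocked(client_ip: str) -> bool:
    """Return True if client_ip falls inside any blocked network."""
    addr = ipaddress.ip_address(client_ip)
    return any(addr in net for net in BLOCKED_NETWORKS)


if __name__ == "__main__":
    for ip in ("192.0.2.17", "203.0.113.5"):
        print(ip, "blocked" if is_blocked(ip) else "allowed")
```

The weak point is the list itself: a crawler that switches to addresses outside it passes straight through.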

Anti Snigger · Feb 17, 2024

DumbDude43 said:
blacklist IPs known to be operated by or associated with any of the big AI institutions
but it's not like that's foolproof either, i'm sure the folks at openai can figure out how to get around an IP block

IP has become far less effective as an identification tool, this would be like putting up caution tape to block an aisle at a store

Mordecai "3 Finger" Brown · Feb 17, 2024

Breadbassket said:
The Internet Archive, for example, simply announced in 2017 that it was no longer abiding by the rules of robots.txt. “Over time we have observed that the robots.txt files that are geared toward search engine crawlers do not necessarily serve our archival purposes,” Mark Graham, the director of the Internet Archive’s Wayback Machine, wrote at the time. And that was that.

Rozzy said:
Doesn't robots.txt stop the web-archive from archiving website pages?

Apparently not.

The article seems to suggest that compliance with the robots.txt file permissions is completely voluntary.

But they didn't specifically explain how the system works.

Anyone care to give an ELI5 explanation to a Luddite?

Do crawlers and AI scrapers have some sort of coding built into them to read the permissions contained within each site's simple text file and then respect the permissions contained within?

I'm assuming based on the ginormous scale of archiving and scraping that there's no manual or human element involved re: permissions?

clipartfan92 · Feb 17, 2024

Mordecai 3 Finger Brown said:
Do crawlers and AI scrapers have some sort of coding built into them to read the permissions contained within each site's simple text file and then respect the permissions contained within?

That's how it's supposed to work. Here's a section from the wget documentation:

9.1 Robot Exclusion

It is extremely easy to make Wget wander aimlessly around a web site, sucking all the available data in progress. ‘wget -r site’, and you’re set. Great? Not for the server admin.

As long as Wget is only retrieving static pages, and doing it at a reasonable rate (see the ‘--wait’ option), there’s not much of a problem. The trouble is that Wget can’t tell the difference between the smallest static page and the most demanding CGI. A site I know has a section handled by a CGI Perl script that converts Info files to HTML on the fly. The script is slow, but works well enough for human users viewing an occasional Info file. However, when someone’s recursive Wget download stumbles upon the index page that links to all the Info files through the script, the system is brought to its knees without providing anything useful to the user (This task of converting Info files could be done locally and access to Info documentation for all installed GNU software on a system is available from the info command).

To avoid this kind of accident, as well as to preserve privacy for documents that need to be protected from well-behaved robots, the concept of robot exclusion was invented. The idea is that the server administrators and document authors can specify which portions of the site they wish to protect from robots and those they will permit access.

The most popular mechanism, and the de facto standard supported by all the major robots, is the “Robots Exclusion Standard” (RES) written by Martijn Koster et al. in 1994. It specifies the format of a text file containing directives that instruct the robots which URL paths to avoid. To be found by the robots, the specifications must be placed in /robots.txt in the server root, which the robots are expected to download and parse.

Although Wget is not a web robot in the strictest sense of the word, it can download large parts of the site without the user’s intervention to download an individual page. Because of that, Wget honors RES when downloading recursively. For instance, when you issue:
wget -r http://www.example.com/
First the index of ‘www.example.com’ will be downloaded. If Wget finds that it wants to download more documents from that server, it will request ‘http://www.example.com/robots.txt’ and, if found, use it for further downloads. robots.txt is loaded only once per each server.

Until version 1.8, Wget supported the first version of the standard, written by Martijn Koster in 1994 and available at http://www.robotstxt.org/orig.html. As of version 1.8, Wget has supported the additional directives specified in the internet draft ‘<draft-koster-robots-00.txt>’ titled “A Method for Web Robots Control”. The draft, which has as far as I know never made to an RFC, is available at http://www.robotstxt.org/norobots-rfc.txt.

This manual no longer includes the text of the Robot Exclusion Standard.

The second, less known mechanism, enables the author of an individual document to specify whether they want the links from the file to be followed by a robot. This is achieved using the META tag, like this:
<meta name="robots" content="nofollow">
This is explained in some detail at http://www.robotstxt.org/meta.html. Wget supports this method of robot exclusion in addition to the usual /robots.txt exclusion.

If you know what you are doing and really really wish to turn off the robot exclusion, set the robots variable to ‘off’ in your .wgetrc. You can achieve the same effect from the command line using the -e switch, e.g. ‘wget -e robots=off url...’.

As you can see, it's very easy to circumvent, depending on the software you're using. Here's an example of a wget command I use to archive sites; it ignores robots.txt and sends a browser user-agent instead of the regular wget string. 99% of the time you want to be using a browser UA when using anything that's not a browser.

Using neuter.mchang.xyz as an example.

Code:

wget \
--mirror \
--warc-file=neuter.mchang.xyz \
--warc-cdx \
--page-requisites \
--html-extension \
--convert-links \
--execute robots=off \
--directory-prefix=. \
--span-hosts \
--domains=neuter.mchang.xyz \
--user-agent="Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:121.0) Gecko/20100101 Firefox/121.0" \
--wait=3 \
--random-wait \
https://neuter.mchang.xyz/
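
For the ELI5 side of it, here is a minimal Python sketch of what a well-behaved crawler does before fetching a page, using the standard library's `urllib.robotparser`. The robots.txt content, the `ExampleAIBot` name and the example.com URLs are all made up for illustration; a real crawler would download the file from /robots.txt on the server instead of using a hardcoded string.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: blocks one named bot entirely and keeps
# everyone else out of /private/.
ROBOTS_TXT = """\
User-agent: ExampleAIBot
Disallow: /

User-agent: *
Disallow: /private/
"""


def build_parser(robots_txt: str) -> RobotFileParser:
    """Parse robots.txt text into a rule set (no network access)."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


def allowed(parser: RobotFileParser, user_agent: str, url: str) -> bool:
    """Ask whether the rules permit user_agent to fetch url."""
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    rules = build_parser(ROBOTS_TXT)
    browser_ua = "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:121.0)"
    page = "https://example.com/index.html"
    print("ExampleAIBot:", allowed(rules, "ExampleAIBot", page))
    print("browser UA:  ", allowed(rules, browser_ua, page))
```

The check runs entirely inside the crawler, and the rules are matched against whatever name the crawler chooses to give. A client that skips the check, or announces itself with a browser user-agent, is never stopped by the file itself.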
