Why does it matter? Because the Arse commentards are very angry and threatening to cancel subscriptions.
Benj Edwards - 8/21/2024
On Tuesday, OpenAI announced a partnership with Ars Technica parent company Condé Nast to display content from prominent publications within its AI products, including ChatGPT and a new SearchGPT prototype. It also allows OpenAI to use Condé content to train future AI language models. The deal covers well-known Condé brands such as Vogue, The New Yorker, GQ, Wired, Ars Technica, and others. Financial details were not disclosed.
One immediate effect of the deal will be that users of ChatGPT or SearchGPT will now be able to see information from Condé Nast publications pulled from those assistants' live views of the web. For example, a user could ask ChatGPT, "What's the latest Ars Technica article about Space?" and ChatGPT can browse the web and pull up the result, attribute it, and summarize it for users while also linking to the site.
In the longer term, the deal also means that OpenAI can openly and officially utilize Condé Nast articles to train future AI language models, which includes successors to GPT-4o. In this case, "training" means feeding content into an AI model's neural network so the AI model can better process conceptual relationships.
AI training is an expensive and computationally intense process that happens rarely, usually prior to the launch of a major new AI model, although a secondary process called "fine-tuning" can continue over time. Having access to high-quality training data, such as vetted journalism, improves AI language models' ability to provide accurate answers to user questions.
It's worth noting that Condé Nast internal policy still forbids its publications from using text created by generative AI, which is consistent with its AI rules before the deal.
Not waiting on fair use
With the deal, Condé Nast joins a growing list of publishers partnering with OpenAI, including Associated Press, Axel Springer, The Atlantic, and others. Some publications, such as The New York Times, have chosen to sue OpenAI over content use, and there's reason to think they could win.
In an internal email to Condé Nast staff, CEO Roger Lynch framed the multi-year partnership as a strategic move to expand the reach of the company's content, adapt to changing audience behaviors, and ensure proper compensation and attribution for using the company's IP. "This partnership recognizes that the exceptional content produced by Condé Nast and our many titles cannot be replaced," Lynch wrote in the email, "and is a step toward making sure our technology-enabled future is one that is created responsibly."
The move also brings additional revenue to Condé Nast, Lynch added, at a time when "many technology companies eroded publishers’ ability to monetize content, most recently with traditional search." The deal will allow Condé to "continue to protect and invest in our journalism and creative endeavors," Lynch wrote.
OpenAI COO Brad Lightcap said in a statement, "We’re committed to working with Condé Nast and other news publishers to ensure that as AI plays a larger role in news discovery and delivery, it maintains accuracy, integrity, and respect for quality reporting."
Of bots and com-men-ts
From a technical standpoint, the deal removes Condé Nast's recent robots.txt restrictions on OpenAI's web crawlers, or "bots." That means that OpenAI bots can, after an 11-month hiatus, resume gathering information for training AI models and real-time web information for ChatGPT's retrieval augmentation capabilities. The hiatus was brought about after OpenAI's web crawling practices came under wide scrutiny last year, and publishers realized that OpenAI did not typically seek permission from publications to use their data (such as articles) to train their AI models. For example, Condé Nast content was already baked into large language models like GPT-4.
It was only after legal challenges began brewing in 2023 that OpenAI started licensing content from publishers to secure access to high-quality training data while also defending its fair use claims in court (after being sued by The New York Times). Around the same time, OpenAI published instructions on how sites could block its AI training-data web crawler, GPTBot, and many sites, including those owned by Condé Nast, did so quickly.
Lynch testified in the US Senate earlier this year, saying that training generative AI on scraped web content did not constitute fair use (as OpenAI has claimed) and that the technology was built with "stolen goods." But publications blocking GPTBot created a separate problem for OpenAI: At the time, blocking that bot also blocked ChatGPT's ability to merely browse these sites (separate from scraping for training) to pull answers into ChatGPT. To remedy that, OpenAI unveiled a new crawling bot with the launch of the SearchGPT prototype in July called OAI-SearchBot.
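For anyone wondering how per-bot blocking works in practice, here is a minimal sketch using Python's standard urllib.robotparser. The robots.txt content and the example.com URL are made up for illustration and are not Condé Nast's actual file; only the bot names GPTBot and OAI-SearchBot come from the article. It shows that rules are matched per user-agent token, so a site can refuse one crawler while admitting another that uses a different name.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt for illustration only (not any real site's file):
# refuse the training crawler by name, allow every other user agent.
ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /

User-agent: *
Allow: /
"""

# Placeholder URL; any path on the site would behave the same here.
URL = "https://example.com/some-article/"


def allowed(user_agent: str, url: str, robots_txt: str = ROBOTS_TXT) -> bool:
    """Return True if a crawler that honors robots.txt may fetch the URL."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    for bot in ("GPTBot", "OAI-SearchBot"):
        print(bot, allowed(bot, URL))
```

Under this sample file, GPTBot is refused and OAI-SearchBot falls through to the wildcard rule and is allowed. Note that the parser only reports what a well-behaved crawler would do; nothing in robots.txt enforces it.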
One ramification of this new deal is that OpenAI’s web crawlers are no longer excluded via robots.txt. With the robots.txt exclusion for OpenAI now gone, the startup is free to crawl any CN property, including Ars Technica. This means that, once again, OpenAI can crawl any part of the site that does not require a login to view, including user comments. To be clear, user comments were being crawled before they were blocked, but now, after the 11-month hiatus, they will be crawled again. Between publisher deals, the voluntary nature of robots.txt compliance, and the hordes of pirated data out there, it seems as though the only reliable way to escape the crawlers (Google, OpenAI, Perplexity, Microsoft, ad infinitum) is not to participate—a pyrrhic option at best.
Ultimately, Lynch feels partnering with OpenAI and receiving compensation is the best approach to the new world of AI assistants rapidly unfolding in the tech space. It aligns with his mission to defend Condé's intellectual property from unfair use, undertaken since his visit to the Senate in January.
"It is just the beginning, and we will continue what we started in Washington earlier this year," wrote Lynch in his email of the deal, "the fight for fair deals and partnerships across the industry until all entities developing and deploying artificial intelligence take seriously, as OpenAI has, the rights of publishers."
The comment section is over 500 responses now; here are some samples: