AI could mean the end of the Wayback Machine, as news websites are increasingly blocking it to prevent content scraping

A growing number of major news sites are blocking the Wayback Machine.
That reportedly includes 23 organizations that are preventing their content from appearing in the archive.
This is happening due to fears that the Wayback Machine is being exploited for AI content scraping.

The Wayback Machine is under serious threat (and not for the first time), as a growing number of major news websites appear to be blocking the archiving system.

If you’re not familiar with the Wayback Machine, it’s run by the non-profit Internet Archive, and is essentially a time machine that preserves a history of the web (and more besides). This can be vital when it comes to historical research, for example, or monitoring changes to websites.

As Wired reports (via 9to5Mac), there’s a growing trend of online news outlets blocking the web crawler that the Internet Archive uses to gather its snapshots. Some 23 big news sites are now doing so, according to Originality AI (which specializes in AI detection).

That includes the New York Times (based on a Nieman Lab report) and USA Today, with Wired highlighting that the latter recently published a report on how US Immigration and Customs Enforcement delayed the disclosure of key info about the impact of detainment policies. That piece used the Wayback Machine extensively in its research.

The irony of USA Today using this data in such a way, and yet blocking the Wayback Machine from accessing its own content — which could potentially keep the news site itself honest in the future — isn’t lost on Wayback Machine director Mark Graham.

Graham told Wired: “They’re able to pull together their story research because the Wayback Machine exists. At the same time, they’re blocking access.”

Of course, if more and more organizations start to block the Wayback Machine, then its ability to keep a historical record of online content is going to be increasingly eroded.

Analysis: blame AI (again)

So why is this happening? This isn’t about readers circumventing paywalled content using the Wayback Machine, in case you thought that was the issue at stake. Would it surprise you to learn that it’s actually about AI, in a roundabout way? Of course it wouldn’t, and in predictable fashion it seems that the Internet Archive is caught up in the broad backlash against AI here.

What these news organizations say they object to is not a historical record of their content being maintained, but the fact that this archive can be used by third-party AI firms to train their models (LLMs).

As Wired points out, New York Times spokesperson Graham James said: “The issue is that Times content on the Internet Archive is being used by AI companies in violation of copyright law to directly compete with us.”

In short, the worry for these companies is that even if they block AI scrapers on their own sites, the same scraping can still happen behind their backs via the Wayback Machine. It’s not just major news outlets that have these worries, either, but also social media platforms, notably Reddit, which has blocked the Wayback Machine’s web crawler due to the exact same concerns.

While there are other possible sources and ways of indirectly scraping news content, the Wayback Machine is the most obvious target for rogue AI operators, as it maintains such an extensive library of web history.

So, this is a complex issue bound up in AI scraping and a whole lot of legal grey areas. However, the effect on what is an important resource for keeping a check on governments or media giants (and holding them accountable for what was said in the past, or what’s been entirely deleted from the web in some cases) is clearly a worrying one.

Graham asserts that: “There’s no question that the general locking-down of more and more of the public web is impacting society’s ability to understand what’s going on in our world.”

A petition entitled ‘Journalists applaud the Internet Archive’s role in preserving the public record’ has been put together and sent off with over 100 signatures from working journalists. Meanwhile, a dialogue remains ongoing between the Internet Archive and said news publishers, so hope of finding a workable solution here isn’t lost yet.

Source: Latest from TechRadar US in Internet News 
