
UK news industry backs law to stop deceptive AI scraping

Image: the silhouette of a spider formed by gaps in binary code. Picture: Enzozo/Shutterstock

A proposed UK law is being drafted to stop companies deploying AI bots from using deceptive tactics to scrape websites.

The Automated Online Software (Access and Transparency) Bill is unlikely to become law unless it secures Government backing, but it has the support of publishers and could help shape other legislation.

The move follows the New York state legislature passing the Stealth Crawler Prohibition Act.

Both seek to address bots that conceal their purpose, such as search indexing or AI training, and the identity of whoever is behind them.

Media consultant Matthew Scott Goldstein, citing data from a company called Mordor, has suggested that the network of third-party scrapers and brokers together makes up a $1bn industry.

Some scrapers pretend to be humans while others may falsely label themselves, for example as a Google bot.

This makes it difficult for website owners to understand how their content is being used, stops them from being able to block bots they don’t want to access their content, and ultimately makes it harder to negotiate with AI firms.

A recent report from cloud network Fastly found that 49% of website traffic now comes from bots – but that 99% of bot traffic is unwanted or unverifiable (for reasons such as scraping copyrighted content without permission or impersonating legitimate services).

The New York Stealth Crawler Prohibition Act aims to “prevent AI companies from deploying stealth crawlers, or automated bots that scrape online news content, in a manner that damages the operation of a news site”.

The bill, which has passed the New York Senate and Assembly and now only needs to be signed by the state governor, will make it an offence to “damage, impair or burden the operation of a covered news site or otherwise cause a news site economic harm”.

It will allow “aggrieved” news organisations to request a subpoena against a service provider to identify an alleged violator, and enable them to seek an injunction and recover damages.

The justification for the measures published with the bill states: “Stealth web crawlers, or automated bots that scrape online content while evading detection, pose a growing threat to New York’s news publishers, digital markets and the public interest. AI developers have begun to deploy these bots in recent years with the goal of extracting journalism without authorisation only for them to turn around and reformat this content for AI consumption.

“In other words, stealth crawlers have enabled tech companies to free ride off of the hard work of dedicated journalists, all while diverting readers away from the publishers’ own websites. This has resulted in a decrease in subscription and advertising revenue for the news publishers, thus denying compensation to the very journalists we all depend on to separate the truth from the lies.”

It added that the public are left less well informed by bots “facilitating the spread of unreliable AI-generated content”.

“What’s more, stealth crawlers impose significant operational costs on publishers’ technological infrastructure,” the Senate website adds. “Because bots generate a ton of web traffic to these news sites, all of which must be processed before the bots can be filtered or blocked, publishers are forced to scale their infrastructure to handle peak volumes.

“Not even a paywall is enough to stop these stealth crawlers: some bots have been found to retrieve entire articles hidden under a paywall. As a result, publishers must invest millions of dollars in increased bandwidth and enhanced cybersecurity tools to fend off the AI bots that are causing them to lose revenue.”

The UK version of this legislation is a Private Members’ Bill put forward by Conservative MP Damian Hinds, who sits on the Culture, Media and Sport Committee. He is working with the News Media Association (NMA) to draft the bill.

Hinds said: “Too many UK businesses are having their websites raided for valuable content, with no visibility on who is extracting the value from their work. But a functioning economy depends on property rights, and being able to trade and be paid. If news outlets can’t secure fair remuneration for their work, through subscription or ad revenue, journalism will become unsustainable – and there’d be nothing left for the bots to scrape from.

“My bill will do one simple thing: if you run an online bot that accesses a website and takes content and data, you have to say who you are and what you’ll do with what you take. This isn’t heavy-handed regulation, and it doesn’t seek to regulate AI models or dictate behaviour. It just requires basic transparency.

“It will give British website owners, from online retailers to local newspapers, the tools to see who’s at their door and the ability to strike a fair deal for what they’ve built.”

NMA chief executive Theo Bamber said: “For years, news publishers have watched their journalism taken without permission by unidentified bots. This means they don’t know who is accessing their content and then have no say in how it’s used. This bill will change that by giving publishers, and thousands of other businesses, the right to see who’s trying to gain access to their sites and then negotiate any access on their own terms.”

Some publishers have just introduced a new tactic: adding search-only contracts to website terms and conditions (replacing previous robots.txt notices banning bots) so they can attempt to invoice per article scraped without fighting a lengthy court battle on copyright.
