AI Data Drought? News Sites Leading Pack Of Platforms Preventing Data Harvest

“You are what you eat” is no longer just a warning to heed as we kick into high football-watching gear.

It’s been a fundamental concern for anyone who’s given AI technology more than a minute’s thought, and a recent study by the Data Provenance Initiative of the training data that AI models use shows a pattern to which, well, junk-food bingers might relate.

“Researchers analyzed robots.txt files and terms of use for 14,000 web domains that serve as sources for popular AI training datasets like C4, RefinedWeb, and Dolma,” writes Matthias Bastian of The Decoder about the academic study, which measured restrictions in tokens, the small units of text (such as words and word components) that models are trained on. “From April 2023 to April 2024, the percentage of tokens in these datasets completely blocked for AI crawlers rose from about 1% to 5-7%.”
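For readers curious what "completely blocked" looks like in practice, here is a minimal sketch in Python of checking which crawlers a robots.txt file shuts out of a whole site. It is not the study's method. The sample file, the example.com address, and the blocked_agents helper are made up for illustration; GPTBot and CCBot are the user-agent names used by OpenAI's and Common Crawl's crawlers.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: two AI-related crawlers are shut out of the
# whole site, while every other crawler is only kept out of /private/.
SAMPLE_ROBOTS_TXT = """\
User-agent: GPTBot
User-agent: CCBot
Disallow: /

User-agent: *
Disallow: /private/
"""


def blocked_agents(robots_txt, agents, url="https://example.com/"):
    """Return the agents that robots_txt forbids from fetching url."""
    parser = RobotFileParser()
    # parse() takes the file's lines, so no network request is made here.
    parser.parse(robots_txt.splitlines())
    return [agent for agent in agents if not parser.can_fetch(agent, url)]


if __name__ == "__main__":
    crawlers = ["GPTBot", "CCBot", "Googlebot"]
    print(blocked_agents(SAMPLE_ROBOTS_TXT, crawlers))
```

Checking only the site root, as this sketch does, is a rough stand-in for a full block. A site can also restrict crawlers through its terms of use, which a robots.txt check will not catch.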

That rise is even more prominent amongst key data sources, a segment whose share of blocked tokens has gone from under 3% to as much as 33%.

As Kevin Roose says in The New York Times, the study not only showed that data is “drying up,” but it “discovered an ‘emerging crisis in consent,’ as publishers and online platforms have taken steps to prevent their data from being harvested.”

News sites were highlighted as a major restrictor, whose share of “completely blocked tokens surged from 3% to 45% within a year,” Bastian writes. “As a result, their representation in the training data is likely to decline in favor of corporate and e-commerce sites, which have fewer restrictions but often lower quality content. This trend could particularly affect AI developers, as the industry has realized that learning from high-quality data produces better models.”

“Share of tokens per web service and their monetization through paywalls/advertising” (Source: The Decoder)

The Decoder foresees AI becoming more difficult and more expensive to train should this trend continue, but also shares a potential bright spot for publishers willing to play into that game (as big-name publishers like Condé Nast, News Corp, Time, and many others already have).

“High-quality content providers could potentially find new revenue streams through licensing deals with AI companies,” Bastian writes.
