According to new research from the Data Provenance Initiative, the amount of freely downloadable content in the collections used to train AI systems has dropped dramatically.
For years, companies building high-performance artificial intelligence systems have used vast amounts of text, images and video collected from the Internet to train their models. Now that source of data is drying up. Over the past year, many of the most important web sources used for AI training have restricted the use of their data, according to a study published this week by the MIT-led Data Provenance Initiative. The study examined 14,000 web domains included in three commonly used training datasets and found an “emerging crisis in consent” as publishers and online platforms take steps to prevent their data from being collected.
According to the researchers’ estimates, 5 percent of all data, and 25 percent of the data from the highest-quality sources, has been restricted in the three datasets they examined: C4, RefinedWeb and Dolma. These restrictions are set through the Robots Exclusion Protocol, a decades-old mechanism that lets website owners use a file called robots.txt to tell automated crawlers which pages they may not visit. The study also found that 45 percent of the data in one set, C4, is restricted by websites’ terms of service. “We’re seeing a rapid decline in consent to use data across the web, which will have consequences not only for AI companies but also for researchers, academics and nonprofits,” said Shayne Longpre, the study’s lead author.
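As an illustration only (this example is not taken from the study), a publisher that wanted to opt out of AI crawling might add directives like the following to the robots.txt file at the root of its site. GPTBot and CCBot are the user agents of OpenAI’s and Common Crawl’s crawlers; whether the rules are honored is entirely up to the crawler.

User-agent: GPTBot
Disallow: /

User-agent: CCBot
Disallow: /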
Data is the main ingredient of today’s generative AI systems, which are fed billions of examples of text, images and video. Much of that data is scraped from public websites by researchers and compiled into large datasets that can be downloaded and used freely, or supplemented with data from other sources. Learning from this data is what enables generative AI tools such as ChatGPT, Google Gemini and Anthropic’s Claude to generate images and video, write code or compose poems. The more high-quality data these models are fed, the better their outputs tend to be.
For years, AI developers were able to collect data relatively easily. But the generative AI boom of recent years has led to tensions with the owners of that data, many of whom are reluctant to see it used as AI training material, or at least want to be paid for it. As the backlash grew, some publishers erected paywalls or changed their terms of service to limit the use of their data for AI training. Others blocked the automated crawlers used by companies such as OpenAI, Anthropic and Google.
Sites like Reddit and Stack Overflow have begun charging AI companies for access to their data, and some publishers have taken legal action, including The New York Times, which sued OpenAI and Microsoft last year for copyright infringement, claiming that the companies used its news articles without permission to train their models. Companies like OpenAI, Google and Meta have gone to great lengths in recent years to collect as much data as possible to improve their systems, including downloading YouTube videos and creatively interpreting their own data policies. More recently, some AI companies have struck deals with publishers such as The Associated Press and News Corp, the owner of The Wall Street Journal, that give them continued access to their content.
But broad data restrictions may pose a threat to AI companies, which need a steady supply of high-quality data to keep their models fresh and up to date. They can also be a problem for smaller AI companies and academic researchers who rely on public datasets and cannot afford to license data directly from publishers. One such dataset is Common Crawl, which contains billions of pages of web content, is maintained by a nonprofit organization and has been cited in more than 10,000 academic studies. It is unclear which popular AI products were trained on these sources, as few developers disclose the full list of training data they use. But datasets derived from Common Crawl, including C4 (which stands for Colossal Clean Crawled Corpus), have been used by companies including Google and OpenAI to train earlier versions of their models.
Yacine Jernite, a machine learning researcher at Hugging Face, a company that provides tools and data to AI developers, described the consent crisis as a natural response to the AI industry’s aggressive data collection practices. “It’s not surprising that we’re seeing pushback from data creators after the text, images and videos they share online are used to develop commercial systems that sometimes directly threaten their livelihoods,” he said. But he warned that if all training data had to be acquired through licensing agreements, it would exclude “researchers and civil society from participating in the governance of the technology.”
Stella Biderman, the executive director of EleutherAI, a nonprofit AI research organization, echoed these fears. “The big tech companies already have all the data,” she said. “Changing the license on the data doesn’t retroactively revoke that permission, and the primary impact is on latecomers, who are typically either smaller startups or researchers.” AI companies argue that their use of public web data is legally protected as fair use. But collecting new data has become more difficult.
Some AI leaders worry about hitting a “data wall”: the point at which all the training data on the public Internet has been exhausted and the rest is hidden behind paywalls, blocked by robots.txt or locked up in exclusive deals. Some companies believe they can get past the data wall by training their models on synthetic data, that is, data generated by AI systems themselves. But many researchers doubt that today’s AI systems can produce synthetic data of high enough quality to replace the human-created data they are losing.
Another challenge is that while publishers can try to keep AI companies away from their data by placing restrictions in their robots.txt files, those requests are not legally binding and compliance is voluntary. (Think of it as a “do not enter” sign on data, but without the force of law.) Major search engines honor these opt-out requests, and several leading AI companies, including OpenAI and Anthropic, have publicly stated that they do the same. Other companies, including the AI-powered search engine Perplexity, have been accused of ignoring them.
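As a minimal sketch of what honoring robots.txt looks like in practice (the domain, page and user-agent strings below are illustrative assumptions, not details from the study), Python’s standard-library robotparser module can read a site’s robots.txt and report whether a given bot is allowed to fetch a page; nothing stops a crawler from ignoring the answer.

from urllib.robotparser import RobotFileParser

# Download and parse the site's robots.txt file ("example.com" is a placeholder).
parser = RobotFileParser()
parser.set_url("https://example.com/robots.txt")
parser.read()

page = "https://example.com/articles/some-story.html"
for agent in ("GPTBot", "CCBot", "Googlebot"):
    # can_fetch() applies the robots.txt rules for this user agent to the URL.
    allowed = parser.can_fetch(agent, page)
    print(f"{agent}: {'allowed' if allowed else 'disallowed'} to fetch {page}")

# A well-behaved crawler skips the page when can_fetch() returns False;
# a crawler that ignores the protocol can still download it.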
One of the big lessons of the study, Longpre said, is that new tools are needed to give website owners more precise ways to control how their data is used. Some sites might object to AI giants using their data to train chatbots for profit but be willing to let a nonprofit or an educational institution use the same data, he said. Right now there is no good way to distinguish between those uses, or to block one while allowing the other. But there is also a lesson for the big AI companies, which for years have treated the Internet as an all-you-can-eat buffet without giving the owners of the data much value in return: if they keep exploiting websites, those websites will start closing their doors.
Source: sg.hu