Wordfreq, a linguistic research project, shuts down: artificial intelligence has contaminated the data

Wordfreq, a project designed to trace the evolution of language use across more than 40 languages, has been shut down in recent weeks: the spread, over the last three years, of content generated by AI language models has compromised the data on which the research was based.

The project's creator, Robyn Speer, announced it herself on GitHub, warning that Wordfreq will be abandoned because of the information "pollution" caused by generative artificial intelligence. "I don't think anyone has reliable information about human language use after 2021," Speer said.

Wordfreq has been a valuable resource for academics and researchers for years. The system analyzed millions of sources, including Wikipedia, movie and TV show subtitles, news articles, books, websites, Twitter, and Reddit, providing a detailed overview of linguistic evolution: tracking the emergence of new habits and the decline of old ones, the spread of new idioms and slang constructs, and the way cultural change is reflected in how we communicate.
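The core idea behind a resource like this can be illustrated with a toy sketch (this is not the project's actual code): count word occurrences in a corpus and normalize them into per-million frequencies, so counts are comparable across corpora of different sizes.

```python
import re
from collections import Counter


def word_frequencies(corpus: str) -> dict[str, float]:
    """Tokenize a text and return each word's frequency per million tokens."""
    tokens = re.findall(r"[a-z']+", corpus.lower())
    counts = Counter(tokens)
    total = sum(counts.values())
    return {word: count / total * 1_000_000 for word, count in counts.items()}


freqs = word_frequencies("the cat sat on the mat")
# "the" occurs 2 times out of 6 tokens, so its per-million frequency
# is about 333333.
```

A real system would, of course, use language-aware tokenization and blend many weighted sources, but the per-million normalization shown here is a standard convention in corpus linguistics.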

Scanning the open web, Wordfreq has, over the past two years, run into a significant amount of "useless" content: filler generated by large language models, written by no one to communicate nothing. Collecting this data undermines the reliability of word-frequency measurements. Moreover, such content is now virtually everywhere online and, because it closely mimics real language, is difficult to recognize and filter out. It is a very different problem from spam, which has always existed on the web but in smaller quantities than authentic content, and which is more easily identifiable.

Speer cited the example of ChatGPT's overuse of the English word "delve" (to investigate, to dig into), which does not reflect how people actually use that word. This has distorted the recorded frequency for that specific term, effectively polluting the data. Interestingly, the excessive occurrence of certain words is a phenomenon another academic study has analyzed as a signal for determining whether a text was written with generative artificial intelligence.
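The intuition behind that kind of detection can be sketched as follows (a toy illustration, not the study's actual method; the sample texts are invented): compare a word's smoothed per-million rate in a suspect passage against a human-written baseline, and treat a large ratio as a red flag.

```python
import re


def per_million(text: str, word: str) -> float:
    """Occurrences of `word` per million tokens, with add-one smoothing
    so that an absent word never yields a zero rate."""
    tokens = re.findall(r"[a-z']+", text.lower())
    return (tokens.count(word) + 1) / (len(tokens) + 1) * 1_000_000


# Hypothetical snippets: a human-written baseline vs. a suspect passage.
human = "we investigate the data and research the results in depth"
suspect = "let us delve into the data and delve into the results as we delve deeper"

ratio = per_million(suspect, "delve") / per_million(human, "delve")
# A ratio well above 1 suggests "delve" is over-represented
# relative to the baseline.
```

In practice such studies look at many words at once and use far larger reference corpora, but the underlying comparison of relative frequencies is the same.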

The spread of AI has also created practical problems for the Wordfreq project: the tools it uses to read large amounts of content are similar to those used by AI companies to train their language models. This has bred a certain mistrust among authors and content creators who, faced with a tool that actively harvests text from books, articles, websites, and posts, quite understandably tend to assume that someone on the other end is training a copycat AI, perhaps even for profit. A direct consequence is that content sources have become harder to access, with many providers raising barriers, often paid ones, against large-scale data collection.

The creator of Wordfreq closed her announcement with some bitterness, expressing disappointment in the large technology companies involved in AI development and stressing that she wants to keep her research work from being confused in any way with the training of large language models.

Source: www.hwupgrade.it