The rise of artificial intelligence is the dominant topic of our time. The continuous improvement of AI brings a number of problems that may not be apparent at first glance. Among them are ethical issues such as the protection of personal data and training on user data without consent.
The second issue was recently investigated by Proof News. According to Engadget, the investigation found that some of the biggest tech companies in the world trained their AI models on datasets containing transcripts from more than 173,000 YouTube videos. Companies such as Nvidia, Anthropic and Apple trained their AI models on data they did not have the authors’ permission to use.
The dataset of YouTube video transcripts was created by the non-profit organization EleutherAI and covers more than 48 thousand channels. Although it does not contain the actual videos or images, it includes transcripts of videos from the biggest creators on the platform (MrBeast, Marques Brownlee, The New York Times, BBC, ABC News and thousands of others).
Marques Brownlee commented on the subject on his X account: “Apple has obtained data for its AI from several companies. One of them pulled tons of data/transcripts from YouTube videos, including mine. Apple technically avoids “fault” by doing this because they are not the ones scraping the data. This problem will keep developing for a long time.”
Training on data without consent violates the platform’s terms, Google says
Reaction to this revelation was mixed. A Google spokesperson stressed that using YouTube data to train AI without consent violates the platform’s terms. YouTube CEO Neal Mohan has said the same in the past. However, Apple, Nvidia, Anthropic and EleutherAI have yet to respond to Engadget’s request for comment.
The lack of transparency about the data sources used to train AI has drawn criticism not only from YouTube creators, but also from artists and photographers. Earlier this month, Apple came under fire for not disclosing the source of the training data for its Apple Intelligence.
Earlier this year, OpenAI Chief Technology Officer Mira Murati dodged questions from The Wall Street Journal on whether the company used YouTube videos to train its Sora generator. In both cases, the origin of the training data is shrouded in secrecy, which only raises concerns about improper handling of publicly accessible data.
If you want to check whether your (or any other) channel is part of EleutherAI’s transcript dataset, you can use the search tool at Proof News.
Sources: Engadget (1, 2), Marques Brownlee’s X account, Bloomberg, The Wall Street Journal’s YouTube channel, Proof News
Source: www.cnews.cz