Some of the world’s biggest tech companies trained their AI models on a dataset that included transcripts of more than 173,000 YouTube videos without permission, according to a new investigation by Proof News.
The dataset, created by a non-profit organization called EleutherAI, contains video transcripts from more than 48,000 channels and was used by companies including Apple, NVIDIA and Anthropic. The findings highlight how heavily AI technology relies on data taken from creators without their consent or compensation.
The dataset does not include videos or images from YouTube, but contains video transcripts from the platform’s biggest creators, such as Marques Brownlee and MrBeast, as well as major media outlets such as The New York Times, the BBC and ABC News.
“Apple has sourced data for their AI from several companies,” Brownlee posted on X. “One of them obtained a huge amount of data/transcripts from YouTube videos, including mine,” he added. “This is going to be an evolving problem for a long time.” A Google spokesperson said that earlier statements by YouTube CEO Neal Mohan, that using YouTube data to train AI models would violate the platform’s terms of use, still stand.
So far, AI companies have not been transparent about the data used to train their models. Earlier this month, artists and photographers criticized Apple for not disclosing the source of training data for Apple Intelligence, its own version of generative AI that will roll out to millions of Apple devices this year.
YouTube, the world’s largest video repository, is a goldmine not only of transcripts but also of audio, video and images, making it an attractive source of data for training AI models.
Earlier this year, the CTO of OpenAI, Mira Murati, declined to answer questions from the Wall Street Journal about whether the company used YouTube videos to train Sora, OpenAI’s upcoming AI video generator. “I won’t go into details about the data that was used, but it was publicly available or licensed data,” Murati said at the time. Alphabet CEO Sundar Pichai has also said that companies using data from YouTube to train their AI models would be violating the platform’s terms of use.
Source: www.digitallife.gr