A dataset gathering one million public posts from Bluesky has caused a stir, and highlighted how little the social network can do to prevent questionable data-collection practices aimed at training AI.
“Anything you say can be used against you.” This line, repeated tirelessly in American legal dramas, could just as well apply to artificial intelligence today: everything you write on the internet can be used to train AI. Including on Bluesky, the social network that dreams of dethroning X.
This is what a very recent case reveals, reported by 404 Media on November 26. A machine-learning specialist (machine learning is a branch of AI) announced that day that he had created a dataset containing one million public posts pulled from Bluesky.
To do this, he used an API made available by Bluesky. It allowed him to retrieve, in addition to the content of the messages, metadata: notably the timestamp of each publication (time and day) and interaction statistics (reposts, quotes, likes). “Ideal for testing the use of machine learning for Bluesky,” he added.
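For context, here is roughly what such collection looks like. This is a minimal Python sketch, assuming the public AppView endpoint `app.bsky.feed.searchPosts` and the post-view fields (`createdAt`, `likeCount`, `repostCount`, `quoteCount`) described in Bluesky's public lexicons; the researcher's actual pipeline has not been published, so this is illustrative only.

```python
# Minimal sketch: pulling public Bluesky posts plus metadata from the
# public AppView. Endpoint and field names follow the app.bsky.feed
# lexicon as publicly documented; treat them as assumptions.
import requests

SEARCH_URL = "https://public.api.bsky.app/xrpc/app.bsky.feed.searchPosts"

def fetch_posts(query: str, pages: int = 3):
    """Yield public posts matching `query`, with timestamps and stats."""
    cursor = None
    for _ in range(pages):
        params = {"q": query, "limit": 100}
        if cursor:
            params["cursor"] = cursor
        resp = requests.get(SEARCH_URL, params=params, timeout=10)
        resp.raise_for_status()
        data = resp.json()
        for post in data.get("posts", []):
            yield {
                "text": post["record"].get("text", ""),
                "created_at": post["record"].get("createdAt"),  # timestamp
                "likes": post.get("likeCount", 0),
                "reposts": post.get("repostCount", 0),
                "quotes": post.get("quoteCount", 0),
            }
        cursor = data.get("cursor")
        if not cursor:  # no further pages available
            break

for row in fetch_posts("machine learning", pages=1):
    print(row)
```

A few dozen lines like these are enough to accumulate posts by the hundreds of thousands, which is precisely what makes the episode below so sensitive.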
An abrupt about-face after the outcry
Twenty-four hours later, the expert abruptly backtracked. On Bluesky, he indicated on November 27 that he had “removed Bluesky data from the repository. While I want to support the development of tools for the platform, I recognize that this approach violated the principles of transparency and consent in data collection.”
The person concerned had shared his archive on Hugging Face, a leading web platform dedicated to AI, where it is also possible to test models without advanced technical skills. The page is still online, but now carries an update noting that the repository was withdrawn because of the scale of the community's “negative reactions.”
The page remains online because the specialist, who has since apologized, wants to fuel reflection and discussion about how such datasets “can be used to help improve Bluesky and allow people to build the tools they need to build their own open models and approaches to create feeds that work for their needs.”
This case comes just after Bluesky took a position on generative AI (GenAI) on November 15. “We do not use any of your content to train generative AI, and we have no intention of doing so,” the site declared, noting that none of its systems are generative AI systems trained on user content.
Bluesky has rules, but can’t do much
The incident naturally reached Bluesky, which published an updated thread on its generative AI policy. The platform notably wanted to address the more specific question of third parties that access the social network, and essentially admitted that its ability to prevent certain abuses is limited.
Bluesky is an open and public social network, much like websites on the internet itself, the platform explained: websites can specify whether they consent to outside companies mining their data using a robots.txt file. But it went on to emphasize that “Bluesky will not be able to enforce this consent outside of our systems.”
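To see how advisory that mechanism is, here is a minimal sketch using Python's standard library; the “AICrawlerBot” user agent is a hypothetical name chosen for illustration.

```python
# Minimal sketch of the robots.txt consent check Bluesky refers to.
# "AICrawlerBot" is a hypothetical user agent; real crawlers each
# declare their own name.
from urllib.robotparser import RobotFileParser

rp = RobotFileParser()
rp.set_url("https://example.com/robots.txt")
rp.read()  # fetch and parse the site's robots.txt

# A well-behaved crawler checks before fetching; nothing forces it to.
if rp.can_fetch("AICrawlerBot", "https://example.com/some/page"):
    print("robots.txt allows crawling this page")
else:
    print("robots.txt asks crawlers to stay out")
```

The check is entirely voluntary: a scraper that never calls `can_fetch` faces no technical barrier, which is precisely the limit Bluesky is acknowledging.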
Even so, Bluesky is studying the possibility of deploying rules of this kind on its own spaces, so that members of the platform can indicate whether or not they agree to their posts being used “in AI training datasets.” But this rests on the assumption that everyone will play along and respect those signals.
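As a thought experiment, an opt-out of that kind could look like the following sketch; the `allow_ai_training` field is invented for illustration, as Bluesky has not published any such schema.

```python
# Hypothetical sketch of the per-user opt-out Bluesky is considering.
# The `allow_ai_training` flag is invented for illustration; no such
# field exists in Bluesky's published schemas today.
posts = [
    {"author": "alice.example", "text": "hello", "allow_ai_training": True},
    {"author": "bob.example", "text": "world", "allow_ai_training": False},
]

# A compliant dataset builder drops opted-out posts before training...
dataset = [p for p in posts if p.get("allow_ai_training", False)]
print(dataset)  # only alice's post survives

# ...but, as with robots.txt, nothing stops a non-compliant collector
# from ignoring the flag entirely.
```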
This admission illustrates a relative helplessness in the face of a practice that is widely frowned upon, yet widely observed on the internet: scraping. It involves using automated tools to hoover up publicly accessible information, including from social networks, without much regard for the rules of those spaces.
A notable case involved Clearview AI and facial recognition. But scraping is no stranger to GenAI either: the New York Times sued OpenAI on these grounds, for example. In a different vein, YouTube has also warned OpenAI about the practice.
Source: www.numerama.com