Synthetic AI Data: Promises and Perils

Is it possible for an AI to be trained only on data generated by another AI? It may sound like a crazy idea, but it’s a concept that’s been around for a while — and as new real data becomes harder to come by, this approach becomes more and more attractive.

Anthropic used some synthetic data to train one of its flagship models, Claude 3.5 Sonnet. Meta fine-tuned its Llama 3.1 models using AI-generated data. And OpenAI is reportedly sourcing synthetic training data from its reasoning model, o1, for the upcoming Orion.

But why does AI need data at all — and what kind of data does it need? Can this data really be replaced by synthetic data?

Annotations matter

AI systems are statistical machines. Trained on a large number of examples, they learn patterns from those examples to make predictions, such as that the phrase “to whom” in an email usually precedes the words “may concern”.

Annotations, usually text labels that identify the meaning or parts of the data these systems process, are a key piece of these examples. They serve as guideposts, “teaching” a model to distinguish between things, places and ideas.

Let’s take the example of a photo classification model shown many images of kitchens labeled with the word “kitchen.” During training, the model will begin to associate “kitchen” with general characteristics of kitchens (e.g., that they contain refrigerators and countertops). After training, when shown a photo of a kitchen that wasn’t in the initial examples, the model should be able to recognize it as a kitchen. (Of course, if the kitchen images were labeled “cow,” the model would identify them as cows, which underlines the importance of good annotation.)
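The kitchen example can be sketched as a toy nearest-centroid classifier. Everything here is invented for illustration: each “photo” is reduced to a hand-made pair of features, and the label alone decides what the model learns — including when the labels are wrong.

```python
from collections import defaultdict

def train(examples):
    # Average the feature vectors seen for each label (a nearest-centroid model).
    sums = defaultdict(lambda: [0.0, 0.0, 0])
    for features, label in examples:
        s = sums[label]
        s[0] += features[0]
        s[1] += features[1]
        s[2] += 1
    return {label: (s[0] / s[2], s[1] / s[2]) for label, s in sums.items()}

def predict(centroids, features):
    # Pick the label whose centroid is closest to the new example.
    return min(
        centroids,
        key=lambda lbl: (centroids[lbl][0] - features[0]) ** 2
                        + (centroids[lbl][1] - features[1]) ** 2,
    )

# Invented features: (has_fridge, has_counter). Correctly labeled examples:
examples = [((1, 1), "kitchen"), ((1, 1), "kitchen"), ((0, 0), "bedroom")]
model = train(examples)
print(predict(model, (1, 1)))  # a new kitchen-like photo -> "kitchen"

# Mislabel the same kitchen photos as "cow" and the model dutifully follows.
mislabeled = [((1, 1), "cow"), ((1, 1), "cow"), ((0, 0), "bedroom")]
print(predict(train(mislabeled), (1, 1)))  # -> "cow"
```

The model has no notion of what a kitchen *is*; it only learns whatever association the annotations encode.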

The demand for AI systems, and the need for annotated data to develop them, has ballooned the market for annotation services. According to an estimate from Dimension Market Research, the market is worth $838.2 million today — and will be worth $10.34 billion within the next 10 years. While there are no precise estimates of how many people do data-labeling work, a 2022 paper puts the figure at “one million.”

Companies of all sizes rely on workers employed by data annotation firms to create labels for AI training datasets. Some of these jobs pay reasonably well, particularly if the labeling requires specialized knowledge (e.g., mathematical expertise). Others can be grueling: annotators in developing countries earn only a few dollars an hour on average, with no benefits and no guarantee of future work.

Data sources are drying up

In addition to humanitarian reasons for finding alternatives to human-generated labels, there are also practical reasons.

Annotating data takes people time. Annotators can also bring biases that show up in their labels and, in turn, in the models trained on them. They make mistakes, or stumble over labeling instructions. And paying humans to do the work is expensive.

In general, data is expensive. For example, Shutterstock charges tens of millions of dollars to AI companies for access to its archives, while Reddit has made hundreds of millions of dollars by selling data licenses to Google, OpenAI and others.


Data is also becoming increasingly difficult to obtain.

Most models are trained on vast collections of public data, which owners are increasingly restricting for fear it will be plagiarized or used without credit. More than 35% of the world’s top 1,000 websites now block OpenAI’s web crawler. And a recent study found that about 25% of the data from “high-quality” sources has become unavailable in the major datasets used to train models.

If the current trend of blocking access to data continues, the research group Epoch AI predicts, developers will run out of data to train generative AI models sometime between 2026 and 2032. Combined with fears of copyright lawsuits and the risk of objectionable material making its way into open datasets, this leaves AI companies facing serious challenges.

Synthetic risks

At first glance, synthetic data seems to solve all of these problems. Need labels? Generate them. Need more example data? No problem. The possibilities appear practically limitless.

To some extent, this is true.

“If ‘data is the new oil,’ synthetic data is a biofuel that can be created without the negative side effects of real data,” Os Keyes, a PhD candidate at the University of Washington who studies the ethical impact of emerging technologies, told TechCrunch. “You can take a small starting set of data and simulate and extrapolate new entries from it.”

The AI industry has embraced this concept and started implementing it.

This year, Writer, a company focused on generative AI technology for enterprises, introduced the Palmyra X 004 model, trained almost entirely on synthetic data. Writer claims the model cost just $700,000 to develop — compared to an estimated $4.6 million for a similar-sized model from OpenAI.

Microsoft’s Phi models are also partially trained on synthetic data. The same goes for Google’s Gemma models. This summer, Nvidia unveiled a family of models designed to generate synthetic training data, while AI startup Hugging Face recently released what they claim is the world’s largest synthetic text training dataset.


Generating synthetic data has become a business of its own — and it is estimated that by 2030 it could be worth $2.34 billion. Gartner predicts that 60% of data used for AI and analytics projects this year will be synthetically generated.

Luka Soldaini, a senior research scientist at the Allen Institute for AI, noted that synthetic data techniques can be used to generate training data in formats that are hard to obtain through scraping (or even content licensing). For example, when training its Movie Gen video generator, Meta used Llama 3 to create captions for the footage in its training data, which humans then refined with more detail, such as descriptions of the lighting.

Along the same lines, OpenAI has said it fine-tuned GPT-4o with synthetic data to build the sketchpad-like Canvas feature for ChatGPT. And Amazon has said it generates synthetic data to supplement the real-world data it uses to train speech recognition models for Alexa devices.

“Synthetic data models can quickly be used to extend human intuition about what data is needed to achieve specific model behavior,” Soldaini said.

Synthetic data is no cure-all, though. It suffers from the same problem as all AI: “garbage in, garbage out.” Models create synthetic data, and if the data used to train those models carries biases and limitations, their outputs will be tainted the same way. For instance, groups poorly represented in the original data will be poorly represented in the synthetic data as well.


“The problem is, it can only be done up to a point,” Keyes said. “Let’s say you only have 30 black people in your data set. Extrapolation can help, but if those 30 people are all middle-class or all fair-skinned, that’s what the ‘representative’ data will look like.”
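Keyes’s point can be demonstrated with a deliberately naive toy. If a generator is built only from a skewed seed set, resampling it — however many times — can never surface groups the seed never contained. The group labels and resampling scheme below are invented for illustration; real generators are far richer, but they too can only recombine patterns present in their training data.

```python
import random
from collections import Counter

random.seed(1)

# A deliberately skewed seed dataset: 30 records, all with identical attributes.
seed_data = [{"group": "A", "income": "middle"} for _ in range(30)]

def generate_synthetic(data, n):
    # Naive "generator": resample existing records at random.
    return [random.choice(data) for _ in range(n)]

synthetic = generate_synthetic(seed_data, 10_000)
print(Counter(record["group"] for record in synthetic))
# Group "B" can never appear in the output: it was absent from the seed.
```

No amount of extra sampling adds the missing diversity; the 10,000 synthetic records are exactly as homogeneous as the 30 real ones.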

Along these lines, a 2023 study by researchers at Rice University and Stanford found that over-reliance on synthetic data during training can create models whose “quality or diversity progressively declines.” Sampling bias (poor representation of the real world) causes a model’s diversity to deteriorate after several generations of training, the researchers found (though they also found that mixing in some real-world data helps mitigate this).

Keyes sees additional risks in complex models like OpenAI’s o1, which he believes could produce hallucinations that are harder to spot in their synthetic output. These hallucinations can reduce the accuracy of models trained on that data — especially if the sources of the hallucinations are not easily identified.

“Complex models hallucinate; data produced by complex models contains hallucinations,” Keyes added. “And with models like o1, the developers themselves can’t necessarily explain why artifacts appear.”

Compounding hallucinations can lead to models that spit out nonsense. A study published in the journal Nature shows how models trained on error-ridden data produce data that is even more error-ridden, and how this feedback loop degrades successive generations of models. The researchers found that, over generations, models lose their grasp of more complex knowledge, become more generic, and often produce answers irrelevant to the questions asked.
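The feedback loop the researchers describe can be reproduced in miniature. The sketch below is a standard toy demonstration (not the Nature study’s actual setup): it repeatedly fits a Gaussian to its own synthetic output, and the fitted spread shrinks generation after generation — a simple analogue of the diversity loss described above.

```python
import random
import statistics

random.seed(0)

def next_generation(samples):
    # "Train" on the previous generation: fit a Gaussian by maximum likelihood...
    mu = statistics.fmean(samples)
    sigma = statistics.pstdev(samples)  # population std dev (the MLE, biased low)
    # ...then "generate" a same-sized synthetic dataset from the fitted model.
    return [random.gauss(mu, sigma) for _ in samples]

data = [random.gauss(0.0, 1.0) for _ in range(20)]  # the only "real" data
initial_spread = statistics.pstdev(data)

for _ in range(1000):  # each generation trains purely on the previous one's output
    data = next_generation(data)

final_spread = statistics.pstdev(data)
print(f"spread of generation 0:    {initial_spread:.3f}")
print(f"spread of generation 1000: {final_spread:.6f}")
```

Because each fit slightly underestimates the spread and sampling noise compounds across generations, the distribution steadily collapses toward a point — the toy-model version of a chatbot’s outputs growing ever more uniform.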

A subsequent study showed that other types of models, such as image generators, are not immune to this kind of collapse.


Soldaini agrees that “raw” synthetic data is not to be trusted, at least if the goal is to avoid training forgetful chatbots and homogeneous image generators. Using it productively, he says, requires thorough review, curation and filtering, and ideally pairing it with fresh, real data — just as you would with any other dataset.

Failing to do so can lead to model collapse, in which a model becomes less “creative” — and more biased — in its outputs, eventually seriously compromising its functionality. Although this process can be identified and arrested before it gets serious, the risk remains.

“Researchers must examine the generated data, iterate on the generation process and identify safeguards to remove low-quality data,” Soldaini said. “Synthetic data is not a self-improving machine; its output must be carefully inspected and improved before being used for training.”
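As a rough illustration of the kind of safeguards Soldaini describes — the specific thresholds and checks below are invented for this sketch, not an industry standard — a curation pass might deduplicate synthetic text, drop degenerate generations, and always blend in real data:

```python
def curate(synthetic, real, min_words=5, min_unique_ratio=0.5):
    """Filter a batch of synthetic text and blend in real data.

    The thresholds are illustrative assumptions, not established values.
    """
    seen = set()
    kept = []
    for text in synthetic:
        words = text.split()
        if len(words) < min_words:
            continue  # drop degenerate, too-short generations
        if len(set(words)) / len(words) < min_unique_ratio:
            continue  # drop highly repetitive outputs
        if text in seen:
            continue  # exact-duplicate filter
        seen.add(text)
        kept.append(text)
    return kept + real  # always blend with fresh, real data

samples = [
    "the cat sat on the mat near the door",
    "the cat sat on the mat near the door",  # duplicate -> dropped
    "spam spam spam spam spam spam",         # repetitive -> dropped
    "too short",                             # degenerate -> dropped
]
print(curate(samples, real=["a real, human-written sentence for grounding"]))
```

Production pipelines use much richer signals (classifier-based quality scores, fuzzy deduplication, toxicity filters), but the shape is the same: inspect, filter, then mix with real data rather than training on raw generations.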

OpenAI CEO Sam Altman once claimed that AI would one day produce synthetic data good enough to effectively train itself. However — assuming it’s even feasible — that technology doesn’t exist yet. No major AI lab has published a model trained solely on synthetic data.

It seems that, at least for the foreseeable future, humans will be needed at some point to ensure that model training doesn’t go wrong.


Source: www.itnetwork.rs