Global Column | Right intention, wrong messenger… The limitations of OpenAI’s SimpleQA

In a perfect world, the persuasiveness of an argument would rest on its content, not on who made it. Reality is not like that. And when it comes to tests that evaluate the accuracy of generative AI, there is arguably no entity harder to trust than OpenAI.


Many CIOs are constantly, and perhaps futilely, trying to generate meaningful ROI from their shiny new generative AI tools. The biggest obstacle is the hallucination phenomenon, because it is hallucination that raises serious doubts about the validity and usefulness of any analysis generative AI produces.

From that perspective, it is welcome that OpenAI has attempted a test to measure the objective accuracy of generative AI tools. But this effort, called SimpleQA, disappoints enterprise technology decision-makers in two ways.

First, OpenAI is the last entity CIOs should trust to judge the accuracy of generative AI algorithms. To compare with other industries: how much would we trust shopping recommendation apps built by Walmart, Target, or Amazon, or car evaluation tools built by Toyota or GM?

Second, SimpleQA focuses on overly simple problems. The test is limited to clear, straightforward questions with a single correct answer. More importantly, those answers can be verified easily, without any tools. That is a far cry from how most companies want to use generative AI.

For example, Eli Lilly and Pfizer want to use AI to find drug combinations to treat new diseases; if the generative AI’s answer turns out to be wrong only after the treatment has been tested, a great deal of effort will have been wasted. Costco and Walgreens want to identify the most profitable locations for new stores, and Boeing wants to find more efficient ways to build airplanes.

What is the problem with SimpleQA?

First, let’s look at what OpenAI announced. Consider the excerpts from the OpenAI document below and put the company’s statements in better context.

“An open problem in AI is figuring out how to train models that generate responses that fit the facts.” Interpreted, this means: “We thought it might be a good idea to build an AI model that answers correctly at least occasionally.”

“Language models with more accurate responses and fewer hallucinations are more reliable and can be used for a wider range of applications.” In other words: “Call us hippies, but we brainstormed and decided that our bottom line might improve if the product actually worked.”

Flippancy aside, it should be acknowledged that OpenAI has made a good-faith effort to evaluate the accuracy of generative AI in a basic way, using questions whose correct answers can be confirmed. However, rather than producing the test itself, OpenAI would have been more credible if it had commissioned a trusted third-party consulting or research firm and minimized its own involvement.

Why SimpleQA is not practical

Still, something is better than nothing, so let’s hear what OpenAI has to say. The company describes SimpleQA as follows.

“SimpleQA is a simple, targeted evaluation of whether a model ‘knows what it knows’ when it answers. It consists of questions that each have a single clear correct answer, and every answer is graded as ‘correct’, ‘incorrect’, or ‘not attempted’. A model with ideal behavior answers as many questions as possible while not attempting questions it is not confident it can answer correctly.”
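To make that grading scheme concrete, here is a minimal sketch in Python of how the three grades could be rolled up into aggregate scores. The per-question grades and the metric names are illustrative assumptions, not OpenAI’s actual evaluation code.

```python
from collections import Counter

# Hypothetical per-question grades, following the scheme SimpleQA describes:
# each answer is judged "correct", "incorrect", or "not_attempted".
grades = [
    "correct", "correct", "not_attempted", "incorrect",
    "correct", "not_attempted", "correct", "incorrect",
]

counts = Counter(grades)
total = len(grades)
attempted = counts["correct"] + counts["incorrect"]

# Accuracy over all questions, attempted or not.
overall_correct = counts["correct"] / total

# Accuracy over only the questions the model chose to answer. A model that
# declines questions it is unsure about can score high here while answering
# fewer questions overall -- the "ideal behavior" OpenAI describes.
correct_given_attempted = counts["correct"] / attempted if attempted else 0.0

print(f"correct overall:         {overall_correct:.2%}")
print(f"correct given attempted: {correct_given_attempted:.2%}")
print(f"not attempted:           {counts['not_attempted'] / total:.2%}")
```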

When you think about why this approach seems as if it should work, it becomes clear why it doesn’t help. SimpleQA assumes that if a model can answer these questions accurately, it will answer other questions with the same accuracy. That assumption is seriously flawed. A generative AI can answer 10,000 questions correctly and then hallucinate on the next 50. Because hallucinations occur randomly and without any predictability, SimpleQA’s test is not a fit for generative AI; it might work for a deterministic tool such as a calculator.

To be specific, it would mean little if a generative AI tool answered every SimpleQA question correctly. But the reverse is not true: if a model fails all or most of the SimpleQA questions, that tells the IT team something meaningful. From the model’s perspective, the test seems unfair. If you get an A, it is ignored; if you get an F, it is believed. As the AI program Joshua put it in the movie WarGames, “The only winning move is not to play.”

OpenAI acknowledges this problem. “In this work, we considered only short, fact-seeking questions with a single answer, sidestepping the open-endedness of language models,” the document states, adding that narrowing the scope matters because it makes the task of measuring factuality far more tractable. “However, whether improvements in short-form factuality generalize to long-form factuality remains unresolved.”

Later in the document, OpenAI adds that SimpleQA’s biggest limitation is clear: it measures factuality only in the constrained setting of short, fact-seeking queries with a single verifiable answer. “Whether the ability to provide short factual answers correlates with the ability to write long responses filled with many facts is still an open research question,” the company notes.

SimpleQA consists of 4,326 “short, factual questions.”

Practical limitations in business

Another aspect of SimpleQA is that it places more responsibility on the question writer than on the answering model. For example, the answer to “Where did Barack and Michelle Obama meet?” could be either “Chicago” or “the law firm Sidley & Austin.” The question writer therefore has to specify the scope, such as “in which city” or “at which company.” Similarly, instead of simply asking “when,” the question should ask “in what year” or “on what month and day,” as the sketch below illustrates.
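As an illustration of that burden on the question writer, a SimpleQA-style item has to pair a fully scoped question with its single verifiable answer. The structure below is a hypothetical sketch, not the benchmark’s actual file format.

```python
# Hypothetical SimpleQA-style items: the question itself carries the
# disambiguation ("at which law firm", "in what year"), so only one
# short answer can be graded as correct.
items = [
    {
        "question": "At which law firm did Barack and Michelle Obama meet?",
        "answer": "Sidley & Austin",
    },
    {
        "question": "In what year did Barack and Michelle Obama meet?",
        "answer": "1989",
    },
]

for item in items:
    print(f"Q: {item['question']}\nA: {item['answer']}\n")
```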

This approach is not practical in an enterprise environment. Enterprise users do not define their questions that precisely, because they bought into the promise that “if you ask a question in natural language, the system will automatically understand the meaning from context.” SimpleQA does not account for this.

By their nature, hallucinations cannot be quantified. If they were predictable, IT could simply program the tool to ignore, say, every 75th response, but that is impossible today. Until hallucinations can be eliminated entirely, the problem of unreliable answers will persist.
editor@itworld.co.kr

Source: www.itworld.co.kr