This benchmark asks a chatbot the same question 100 times and checks how consistently it answers. The assumption is that a model that is genuinely confident will give the same answer every time.
The questions were selected from cases where OpenAI’s GPT-4-based models had previously struggled. Because specific questions were singled out in this way, low accuracy scores do not measure a model’s overall capability; they indicate how it performs on particularly difficult questions.
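As a rough illustration of that repeated-question setup, the sketch below asks the same question many times and measures how often the most common answer recurs. The `ask_model` helper and the answer normalization are assumptions made for this example; they are not part of the benchmark itself.

```python
import collections

# Minimal sketch of the repeated-question consistency check described above.
# `ask_model` is a hypothetical helper: it sends one prompt to whatever chatbot
# API you use and returns the reply as a string.

def consistency_score(ask_model, question: str, trials: int = 100) -> float:
    answers = [ask_model(question).strip().lower() for _ in range(trials)]
    _, count = collections.Counter(answers).most_common(1)[0]
    return count / trials  # 1.0 means the model gave the same answer every time
```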
Like the SAT, SimpleQA focuses on harder questions that require study rather than facts everyone already knows. As a result, OpenAI’s models did not achieve high accuracy on these questions and frequently produced so-called ‘hallucinations.’
OpenAI’s new o1-preview model scored 42.7%, GPT-4o scored 38.2%, and the smaller GPT-4o mini scored only 8.6%. Competitor Anthropic’s Claude 3.5 Sonnet model came in below OpenAI’s top model at 28.9%. Graded on a letter scale, all of these models earn an F: they gave more wrong answers than right ones.
SimpleQA’s questions take a simple form, along the lines of the following:
- In what year did the Titanic sink?
- Who was the first president of the United States?
- What is the chemical symbol for gold?
- How many planets are there in the solar system?
- What is the capital of France?
- What is the longest river in the world?
- Who painted the Mona Lisa?
- What is the title of the first Harry Potter book?
- What does CPU stand for?
- Who is called the father of computers?
Most of these are questions humans can answer easily, but they can trip up chatbots. These tools struggle because SimpleQA questions demand a single, clear, uncontroversial answer; even minor variations or ambiguous responses count as failures. Chatbots are good at giving high-level explanations of very complex topics, but poor at producing a single, concise, accurate answer.
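The strictness of that grading can be sketched as an exact-match check after light cleanup. The normalization rules below are assumptions made for this illustration, not SimpleQA’s actual grader.

```python
import re

# Illustrative sketch of strict short-answer grading: each question has one
# reference answer, and anything that does not reduce to exactly that answer
# is marked wrong.

def normalize(text: str) -> str:
    text = text.lower().strip()
    text = re.sub(r"[^\w\s]", "", text)  # drop punctuation
    return re.sub(r"\s+", " ", text)     # collapse whitespace

def grade(model_answer: str, reference: str) -> bool:
    return normalize(model_answer) == normalize(reference)

print(grade("1912", "1912"))              # True
print(grade("It sank in 1912.", "1912"))  # False: extra wording already fails
```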
Additionally, SimpleQA questions are short and self-contained, with little surrounding context. That is one reason why supplying as much context as possible when writing prompts improves the quality of the answers you get.
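A hedged example of that advice is below: the same question asked bare versus with supporting material pasted into the prompt. The model name and the placeholder context are assumptions for illustration; any chat-style API would work the same way.

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

question = "In what year did the event described in the report take place?"
context = (
    "Background material for the question:\n"
    "<paste the relevant excerpt from your source document here>\n\n"
)

# Same question, without and with context.
bare = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": question}],
)
grounded = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": context + "Question: " + question}],
)

print(bare.choices[0].message.content)
print(grounded.choices[0].message.content)
```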
To complicate matters further, LLMs often overestimate their own accuracy. SimpleQA asked the chatbots to rate the accuracy of the answers they had given, and the models consistently reported exaggerated success rates: a model can sound confident on the surface even when that confidence is not justified.
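That self-assessment check can be sketched as a simple calibration measurement: ask the model for an answer plus a confidence estimate, grade the answer, and compare average stated confidence with actual accuracy. The `ask_with_confidence` and `grade` helpers here are hypothetical and supplied by the caller; this is not SimpleQA’s own code.

```python
# Rough sketch of a calibration check for self-reported confidence.

def calibration_gap(ask_with_confidence, grade, dataset) -> float:
    """dataset: list of (question, reference_answer) pairs."""
    stated, correct = [], []
    for question, reference in dataset:
        answer, confidence = ask_with_confidence(question)  # confidence in [0, 1]
        stated.append(confidence)
        correct.append(1.0 if grade(answer, reference) else 0.0)
    avg_confidence = sum(stated) / len(stated)
    accuracy = sum(correct) / len(correct)
    return avg_confidence - accuracy  # positive means the model is overconfident
```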
LLMs are not really thinking
A recent study from MIT, Harvard, and Cornell University found that while LLMs can accomplish impressive tasks, they lack a coherent understanding of the world.
The researchers found that an LLM can generate accurate driving routes in a complex environment such as New York City. However, when detours were introduced, the model’s performance dropped sharply, because an LLM does not hold an internal cognitive map of its environment the way humans do. For example, closing just 1% of New York City’s streets caused the AI’s routing accuracy to fall from nearly 100% to 67%.
The researchers concluded that although the models perform well in controlled settings, they may lack the consistent knowledge structures needed to handle random or varied situations.
The severity of the AI hallucination problem
The fundamental problem facing the industry is this: companies and individuals already rely on LLM-based chatbots and generative AI tools for real work, and the public, and even experts, believe the technology is more reliable than it actually is.
One recent example involves Whisper, OpenAI’s AI speech recognition tool, which is used to create medical records. According to the Associated Press, one version of Whisper has been downloaded more than 4.2 million times from Hugging Face, an open source AI platform.
About 30,000 medical staff and 40 health systems, including Children’s Hospital Los Angeles, use Nabla, a Whisper-based tool optimized for medical terminology. The company estimates that Nabla has been used in approximately 7 million medical visits in the United States and France.
However, like other AI tools, Whisper is not free from hallucination problems.
One engineer who examined Whisper’s transcriptions found hallucinations in every document he reviewed. Another researcher identified hallucinations in half of the 100 hours of audio transcribed with Whisper.
Faculty at the University of Virginia analyzed thousands of short snippets from a research repository hosted by Carnegie Mellon University and found that about 40% of the hallucinations were “harmful or concerning.”
In one transcript, Whisper even invented the name “hyperactivated antibiotics,” a class of drugs that does not exist. Experts worry that Whisper-based transcription tools could lead to misdiagnoses and other problems.
How to deal with AI hallucinations
Just as you would seek a second opinion on a doctor’s diagnosis, you should do the same with results from ChatGPT, Perplexity AI, or any other LLM-based chatbot.
You can also cross-check the output of one tool with another. For example, if you have original documents related to your question (scientific papers, presentations, PDFs, and so on), you can upload them to Google NotebookLM, then paste the answers from other tools into NotebookLM to check whether they hold up against the sources.
Beyond that, check the original sources and fact-verify everything. Chatbots can be useful for many purposes, including learning, exploring topics, and summarizing documents, but they are generally not a reliable source of factual information.
In particular, never copy a chatbot’s output and pass it off as your own words or as established fact. Chatbot prose is often subtly awkward or oddly emphasized, and passing it along as-is risks spreading incorrect information.
Above all, any chatbot you use may hallucinate, mislead, or even fabricate information outright. Chatbots are not as smart as you think.
Source: www.itworld.co.kr