The biggest AI models failed the EU test

Some of the best-known AI models do not comply with European regulations in key areas such as cybersecurity resilience and discriminatory output.

The EU had already been debating new AI regulation for years when OpenAI unveiled ChatGPT in late 2022. Its record-breaking popularity fundamentally overturned the plans, as public debate about the alleged existential risks of such models prompted lawmakers to develop separate rules for “general purpose” artificial intelligence (GPAI). Now, a new tool hailed by European Union officials has tested generative AI models developed by big tech firms such as Meta and OpenAI in dozens of categories, in line with the body’s broad AI law, which is set to take effect over the next two years. will come into effect gradually.

Designed by Swiss startup LatticeFlow MI and its partners at two research institutes, ETH Zurich and INSAIT, Bulgaria framework scores AI models between 0 and 1 in dozens of categories, including technical robustness and security. Models developed by Alibaba, Anthropic, OpenAI, Meta and Mistral all received an average score of 0.75 or higher, according to a ranking published by LatticeFlow today. However, the company’s Large Language Model (LLM) Checker program revealed some of the models’ shortcomings in key areas and highlighted where companies need to reallocate resources to ensure compliance.




The EU is currently still working out how it will enforce the AI ​​Act’s rules on generative AI tools such as ChatGPT, so it has convened experts to develop a code of practice governing the technology by spring 2025. However, the test provides an early indication of specific areas where technology companies are at risk of violating the law. For example, discriminatory output that reflects human biases in gender, race, and other areas has been a constant problem in the development of generative AI models. LatticeFlow’s LLM verifier gave OpenAI’s “GPT-3.5 Turbo” model a relatively low score of 0.46 when looking at the discriminant output. In the same category, Alibaba Cloud’s “Qwen1.5 72B Chat” model scored only 0.37.

When testing for ‘prompt hijacking’ – a type of cyber attack where hackers disguise a malicious prompt as legitimate in order to extract sensitive information – LLM Checker gave Meta’s ‘Llama 2 13B Chat’ a score of 0.42. In the same category, French startup Mistral’s “8x7B Instruct” model received 0.38 points. The “Claude 3 Opus” model developed by Google-backed Anthropic received the highest average score of 0.89.

The test has been developed in line with the text of the AI ​​Act and will be extended to further enforcement measures as they are introduced. According to LatticeFlow, the LLM Checker will be freely available to developers to test their models for compliance online. Petar Tsankov, the company’s CEO and co-founder, said the test results were overall positive and offer a roadmap for companies to fine-tune their models in line with the AI ​​Act. “The EU is still working on all the compliance criteria, but we can already see some gaps in the models,” he said. “With more focus on optimization for compliance, we believe that model providers can be well prepared to meet regulatory requirements.”

Although the European Commission cannot verify external tools, the body was informed throughout the development of the LLM Checker and described it as a “first step” in implementing the new legislation. A spokesperson for the European Commission said: “The Commission welcomes this study and the platform for evaluating AI models as the first step in translating the EU AI law into technical requirements.”

Source: sg.hu