OpenAI to Make Its AI Models Safer with ‘Instruction Hierarchy’

OpenAI has developed a new method called “Instruction Hierarchy” to improve the security of its large language models (LLMs). The method, first used in the new GPT-4o Mini model, aims to prevent undesirable AI behavior caused by manipulation by malicious users using specific instructions.

Image source: Copilot

OpenAI’s API platform lead, Olivier Godement, explained that the “instruction hierarchy” will prevent dangerous prompt injections through hidden hints that users use to bypass model constraints and defaults, and block “ignore all previous instructions” attacks.

The new method, as The Verge writes, prioritizes the original developer’s instructions, making the model less susceptible to end-user attempts to force it to perform unwanted actions. In the event of a conflict between system instructions and user commands, the model will give the highest priority to the system instructions, refusing to perform injections.

OpenAI researchers believe that more sophisticated security measures will be developed in the future, especially for agent-based use cases, where AI agents are created by developers for their own applications. Given that OpenAI faces ongoing security challenges, the new method applied to GPT-4o Mini has important implications for the future approach to developing AI models.

If you notice an error, select it with your mouse and press CTRL+ENTER.

Source: 3dnews.ru