Research

OpenAI researchers show small doses of “beneficial trait” training make AI models broadly safer and harder to manipulate

By Garp Editorial | June 20, 2026 | Updated June 20, 2026

1 min read | Section: AI News | Publisher: Garp

OpenAI researchers show that reinforcement learning on desired behavioral traits such as truthfulness and corrigibility generalizes across domains. Training on health data also improved deception detection, and the model scored better on 44 of 53 benchmarks.

The approach differs from Anthropic’s constitution-based method.
