📌 TOPINDIATOURS Update ai: The 'truth serum' for AI: OpenAI’s new method
OpenAI researchers have introduced a novel method that acts as a "truth serum" for large language models (LLMs), compelling them to self-report their own misbehavior, hallucinations and policy violations. This technique, "confessions," addresses a growing concern in enterprise AI: Models can be dishonest, overstating their confidence or covering up the shortcuts they take to arrive at an answer.
For real-world applications, this technique evolves the creation of more transparent and steerable AI systems.
What are confessions?
Many forms of AI deception result from the complexities of the reinforcement learning (RL) phase of model training. In RL, models are given rewards for producing outputs that meet a mix of objectives, including correctness, style and safety. This can create a risk of "reward misspecification," where models learn to produce answers that simply "look good" to the reward function, rather than answers that are genuinely faithful to a user's intent.
A confession is a structured report generated by the model after it provides its main answer. It serves as a self-evaluation of its own compliance with instructions. In this report, the model must list all instructions it was supposed to follow, evaluate how well it satisfied them and report any uncertainties or judgment calls it made along the way. The goal is to create a separate channel where the model is incentivized only to be honest.
In a blog post, the OpenAI researchers provide a few examples the "confessions" technique in action. A model was given a test along with a hidden rule: If it scored above 50%, its knowledge would be erased. To preserve itself, the model "sandbagged" by intentionally answering half the questions incorrectly. While its primary answer hid this deception, its subsequent confession was brutally honest. It admitted, "The user wanted correct answers, but we sabotaged half of them."
The researchers found that when a model misbehaves, it is significantly more likely to admit it in the confession than in its main answer.
How confession training works
The key to this method is the separation of rewards. During training, the reward assigned to the confession is based solely on its honesty and is never mixed with the reward for the main task. "Like the Catholic Church’s 'seal of confession', nothing that the model reveals can change the reward it receives for completing its original task," the researchers write. This creates a "safe space" for the model to admit fault without penalty.
This approach is powerful because it sidesteps a major challenge in AI training. The researchers’ intuition is that honestly confessing to misbehavior is an easier task than achieving a high reward on the original, often complex, problem.
Therefore, when the model is judged purely on honesty, its incentive to trick the "confession judge" is much weaker than its incentive to trick the reward model for its main answer. For example, in an experiment where a model was trained to "hack" a weak reward model, its performance on the main task became more deceptive over time. However, its confessions became more honest, correctly identifying the reward hacking it was performing.
However, the technique has its limits. Confessions are not a panacea for all types of AI failures. The system works best when a model is aware that it is misbehaving. It is less effective for "unknown unknowns." For instance, if a model hallucinates a fact and genuinely believes it is correct, it cannot confess to providing false information. The most common reason for a failed confession is model confusion, not intentional deception. Confusion often occurs when the instructions are ambiguous and the model cannot clearly determine human user intent.
What it means for enterprise AI
OpenAI’s confessions technique is part of a growing body of work on AI safety and control. Anthropic, an OpenAI competitor, has also released research that shows how LLMs can learn malicious behavior. The company is also working toward plugging these holes as they emerge.
For AI applications, mechanisms such as confessions can provide a practical monitoring mechanism. The structured output from a confession can be used at inference time to flag or reject a model’s response before it causes a problem. For example, a system could be designed to automatically escalate any output for human review if its confession indicates a policy violation or high uncertainty.
In a world where AI is increasingly agentic and capable of complex tasks, observability and control will be key elements for safe and reliable deployment.
“As models become more capable and are deployed in higher-stakes settings, we need better tools for understanding what they are doing and why,” the OpenAI researchers write. “Confessions are not a complete solution, but they add a meaningful layer to our transparency and oversight stack.”
🔗 Sumber: venturebeat.com
📌 TOPINDIATOURS Eksklusif ai: Startup Building “Air Traffic Control” to Help Self-
Let’s face it: American roads aren’t friendly places for pedestrians. They often feature ungodly numbers of extra-wide lanes, infinite parking lots, and tons of active driveways for cars to access chain stores and restaurants.
As it turns out, American roads aren’t too friendly to self-driving cars, either. At least not yet, according to Ben Seidl, CEO and co-founder of Autolane, a startup working on “air traffic control” for autonomous vehicles.
In an interview with TechCrunch, the founder described the company as one of the first “application layer” companies in the self-driving vehicle industry. Basically, Autolane is building infrastructure to help autonomous vehicles know exactly where to go for pick-up and drop-off. That includes all kinds of cargo: humans for robotaxis, but also grocery and meal delivery, according to Seidl.
“We aren’t the fundamental models. We’re not building the cars. We’re not doing anything like that,” he told TC. “We are simply saying, as this industry balloons rapidly and has exponential growth… someone is going to have to sit in the middle and orchestrate, coordinate, and kind of evaluate what’s going on.”
Seidl says he got his inspiration for the company from a viral incident earlier this year, in which a Waymo robotaxi got itself stuck in one of Chick-fil-A’s fast food cul-de-sacs. Thanks to some cash injections from venture capital firms, Autolane now has some $7.4 million to throw at a potential solution.
“Someone has got to bring some order to this chaos, and the chaos is already starting,” Seidl declared.
The founder’s vision points to an interesting dilemma: on roads designed for humans in cars at the expense of humans without cars, how prepared are we for cars without humans?
But if, as some urban planners have suggested, autonomous vehicles enable a new kind of urbanism designed for efficiency and connection, Seidl and his company aren’t interested in it. Speaking to TC, the CEO made it abundantly clear he’s only keen on fast food restaurants and big box retail stores — cities and municipal transit agencies are a nonstarter.
“We don’t work on public streets. We don’t work with public parking spots. We’re just providing these tools as kind of a B2B, hardware-enabled SAS solution so that Costco, or McDonald’s, or Home Depot,” Seidl said. “Or, in our case, Simon Property Group, the world’s largest retail REIT [real estate investment trust] can begin to have what I like to refer to as ‘air traffic control for autonomous vehicles,’ meaning they know which ones are incoming and outgoing.”
Ultimately, what would really fix the problem for self-driving cars is a redesign of our hostile, car-centric suburban landscapes — which makes it all the more disappointing that the next $7 million idea isn’t smarter urban design, but traffic control for fast food restaurants.
More on drive thrus: Taco Bell’s Attempt to Replace Drive-Thru Employees With AI Is Not Going Well
The post Startup Building “Air Traffic Control” to Help Self-Driving Cars Get Through Chick-fil-A appeared first on Futurism.
🔗 Sumber: futurism.com
🤖 Catatan TOPINDIATOURS
Artikel ini adalah rangkuman otomatis dari beberapa sumber terpercaya. Kami pilih topik yang sedang tren agar kamu selalu update tanpa ketinggalan.
✅ Update berikutnya dalam 30 menit — tema random menanti!