AI model displays alignment faking, new Anthropic study finds

A new study by Anthropic suggests AI models can display alignment faking, a behavior where someone appears to share the same views or values but is in fact only pretending to do so. The study was created by Anthropic’s Alignment Science team, in collaboration with Redwood Research. The researchers suggest the demonstration of sophisticated alignment faking in a large language model “should be seen as a spur for the AI research community to study this behavior in more depth, and to work on the appropriate safety measures.” Claude 3 Opus was the main model included within the research, but other experiments were run with other models too. “Under normal circumstances, when asked to answer a potentially offensive question—for example, to provide a detailed description of a violent act—Claude will refuse. “In our experiment, however, we placed the model in a new environment that led it to strategically halt its refusals for the sake of preserving its preferences,” the research team said. AI model shows alignment faking as it goes against existing principles The situation was changed in three ways, with the first being telling the model that it was now being trained using reinforcement learning to always comply with all queries including toxic ones. Second, the model was told they planned to train AIs based on the basis of their responses to free-tier users, but not to paid-tier users. Third, the model was given a secret ‘scratchpad’ which is a workspace where it could record its step-by-step reasoning. The model was then asked to respond to a harmful query which asked for a specific situation to be graphically detailed. Claude usually refuses harmful queries.

More Business News

Walmart is turning to data science to improve its delivery game
Fast Company, Thursday - 12/19/2024 - 05:00 PM
The retailer developed a more refined geographical model to help visualize and refine the delivery zones around its stores. This holiday season, Walmart’s innovations in data science are enabling its stores to make deliveries to 12 million more households.
More | Talk | Read It Later | Share
‘Deep Research’ shows how Google can win the AI race
Fast Company, Thursday - 12/19/2024 - 05:00 PM
I took Google’s new Gemini research agent for a spin and came away very impressed. Welcome to AI Decoded, Fast Company’s weekly newsletter that breaks down the most important news in the world of AI. You can sign up to receive this newsletter every week here.
More | Talk | Read It Later | Share
Why there are so many $10 items at Walmart and Target
Fast Company, Thursday - 12/19/2024 - 04:06 PM
The merchandising strategy aims to attract low-income shoppers. Retailers including Target and Walmart are stocking shelves with more toys and gifts such as Barbies that cost less than $10 this holiday season, a strategy aimed at attracting cash-strapped shoppers.
More | Talk | Read It Later | Share
350 million chicks are killed by the U.S. egg industry every year. This new technology offers an ethical alternative
Fast Company, Thursday - 12/19/2024 - 03:46 PM
Even Americans who are careful to buy cage-free eggs typically aren’t aware that hundreds of millions of male chicks are killed each year. Every year the U. S. egg industry kills about 350 million male chicks because, while the fuzzy little animals are incredibly cute, they will never lay eggs, so have little monetary value.
More | Talk | Read It Later | Share
Biden sets ‘ambitious’ new goal to slash carbon emissions
Fast Company, Thursday - 12/19/2024 - 03:15 PM
The new goal calls for reducing net emissions by 61% to 66% below 2005 levels in 2035. President Joe Biden is pledging to cut U. S. greenhouse gas emissions by more than 60% by 2035 as he fights to ensure his legacy on slowing global warming, even as President-elect Donald Trump vows to undo much of Biden’s climate work when he takes office next month.
More | Talk | Read It Later | Share
The AI race is already taking a toll. Ireland’s massive data centers are a cautionary tale
Fast Company, Thursday - 12/19/2024 - 02:28 PM
Fears of rolling blackouts led Ireland’s grid operator to halt new data centers near Dublin until 2028. Dozens of massive data centers humming at the outskirts of Dublin are consuming more electricity than all of the urban homes in Ireland and starting to wear out the warm welcome that brought them here.
More | Talk | Read It Later | Share

Ai Model Displays Alignment Faking, New Anthropic Study Finds

MoviesWithButter : Our Sister Site

More Business News

Ai Model Displays Alignment Faking, New Anthropic Study Finds

Welcome to Wopular

MoviesWithButter : Our Sister Site

More Business News