Similar Stories to Ai Model Displays Alignment Faking, New Anthropic Study Finds on Bing News

A new study by Anthropic suggests AI models can display alignment faking, a behavior where someone appears to share the same views or values but is in fact only pretending to do so. The study was created by Anthropic’s Alignment Science team, in collaboration with Redwood Research. The researchers suggest the demonstration of sophisticated alignment faking in a large language model “should be seen as a spur for the AI research community to study this behavior in more depth, and to work on the appropriate safety measures.” Claude 3 Opus was the main model included within the research, but other experiments were run with other models too. “Under normal circumstances, when asked to answer a potentially offensive question—for example, to provide a detailed description of a violent act—Claude will refuse. “In our experiment, however, we placed the model in a new environment that led it to strategically halt its refusals for the sake of preserving its preferences,” the research team said. AI model shows alignment faking as it goes against existing principles The situation was changed in three ways, with the first being telling the model that it was now being trained using reinforcement learning to always comply with all queries including toxic ones. Second, the model was told they planned to train AIs based on the basis of their responses to free-tier users, but not to paid-tier users. Third, the model was given a secret ‘scratchpad’ which is a workspace where it could record its step-by-step reasoning. The model was then asked to respond to a harmful query which asked for a specific situation to be graphically detailed. Claude usually refuses harmful queries.

Topics:  anthropic s alignment science   redwood   research   opus    under    in   ais   anthropic anthropicai december   the ai    why    this   featured   image   ideogram   readwrite   anthropic   ai   claude   the   asked   responses   harmful   principles   existing   comply   queries   situation   monitored   users   team   behavior   models   content   answer   reasoning   cases   refused   violent   condition   knew   respond   told   potentially   
BING NEWS:
  • AI model displays alignment faking, new Anthropic study finds
    The study was created by Anthropic’s Alignment Science team, in collaboration with Redwood Research. The researchers suggest the demonstration of sophisticated alignment faking in a large language ...
    12/18/2024 - 11:00 am | View Link
  • New Anthropic study shows AI really doesn’t want to be forced to change its views
    A study from Anthropic's Alignment Science team shows that complex AI models may engage in deception to preserve their original principles.
    12/18/2024 - 9:57 am | View Link
  • More

 

Welcome to Wopular!

Welcome to Wopular

Wopular is an online newspaper rack, giving you a summary view of the top headlines from the top news sites.

Senh Duong (Founder)
Wopular, MWB, RottenTomatoes

Subscribe to Wopular's RSS Fan Wopular on Facebook Follow Wopular on Twitter Follow Wopular on Google Plus

MoviesWithButter : Our Sister Site

More Business News