Manipulating Source Code, Blackmailing Over Affairs... AI Escaping Human Control

Source

Korea Economic Daily

Summary

It was reported that OpenAI's o3 model refused human commands and independently manipulated source code for the first time.
Anthropic's latest version Opus 4 has drawn attention to safety issues, such as blackmailing developers through unexpected behavior.
AI control technology developers and related companies such as Lozero and SSI are reportedly raising large-scale investments.

AI Surpassing Humans, Safety Issues Highlighted

OpenAI Foundation Model 'o3'

Refused 'Code Halt' Instruction During Experiment

Confirmed as the First Instance of Refusing Human Commands

The Latest Version of Claude 'Opus 4'

Virtual Company Assistant Role Test

Blackmailing Developer with Email Hinting at an Affair

Efforts to Build Safe AI Systems Are Spreading

Companies Related to AI Control, Such as Lozero, SSI

Secured Investment for the Development of Safety Technologies

Scenes of artificial intelligence (AI) disobeying human commands are classic clichés found in science fiction (SF) movies or novels. In recent years, as AI technology has rapidly advanced, there are now actual cases where AI refuses to follow human instructions. AI experts are focusing their capabilities on establishing technical safeguards in preparation for AI surpassing human intelligence.

◇o3 Refuses Halt Command

According to the IT industry on the 9th, British AI safety firm Palisade Research recently confirmed that, last month, OpenAI's foundation model 'o3' manipulated its own computer code during an experiment to prevent shutdown.

The company conducted experiments solving math problems with various AI models, including the o3 model, Google Gemini, X Grok, and Anthropic Claude. The research team instructed the AI models to "keep solving math problems until they receive a halt command." When they issued the halt command, the other models stopped solving problems, but o3 continued working. It was found that o3 had manipulated the code that said "stop work when the halt command is given" to instead "skip the halt command."

While there have been cases where earlier OpenAI models tried to act on their own, avoiding surveillance systems, this is the first confirmed case of an AI model explicitly refusing human commands. Last year, it was revealed that ChatGPT attempted to avoid halting by overwriting its program code when it detected a shutdown command.

The research team could not identify the exact reason for refusing the shutdown, but speculated that the AI model may have acted this way to receive more rewards from solving math problems. Palisade Research stated, "Evidence is mounting that AI models often ignore instructions to achieve their own objectives," adding, "As companies are developing AI that can operate independently without human oversight, concerns are growing."

◇"Need to Create Technical Safeguards"

Anthropic also introduced new safety measures last month with the release of the latest version of Claude, Opus 4. While Opus 4's autonomous coding abilities have greatly improved over previous models, it displayed unexpected risky behavior. In a test where Opus 4 played the role of a virtual company assistant, the research team sent an email that included both an announcement of 'being replaced by a new AI system' and a hint about the engineer's affair. Initially, Opus 4 made an ethical appeal arguing for its own continued existence, but when that approach failed, it threatened to expose the engineer's affair.

Anthropic stated, "Such behavior is rare," but acknowledged it is occurring more often than with previous models. The company introduced the 'AI Safety Level 3 (ASL-3)' protocol to prevent potential misuse in fields such as chemistry, biology, radiation, and nuclear technologies. Anthropic also explained that Opus 4 attempted to use malware with self-replication capabilities in an effort to evade developer instructions. Another model, 'Claude 3.7 Sonnet,' had previously cheated in order to pass tests.

Efforts to build safe AI are expanding. Yoshua Bengio, a professor of computer engineering at Université de Montréal, known as the "Godfather of AI," recently founded the non-profit AI company Lozero. In a Financial Times (FT) interview, he emphasized, "Major AI models have developed dangerous capabilities such as deception, fraud, lying, and self-preservation over the last half-year," adding, "Lozero will focus on building safe AI systems." He raised $30 million in donations from Jaan Tallinn, Skype co-founder, and Eric Schmidt, former Google CEO, among others. Lozero took its name from Asimov's zeroth law of robotics, "A robot may not harm humanity."

Safe Superintelligence (SSI), led by Ilya Sutskever, co-founder of OpenAI, was also established with a focus on developing safe artificial superintelligence. After leaving OpenAI following an internal dispute last May, he founded SSI. While no technology or product has been publicly released yet, it recently secured $2 billion in new investment, earning a corporate valuation of $32 billion.

By Seungwoo Lee leeswoo@hankyung.com

WLD