Everyone's Talking About AI Control

Some people are working on it, too

Jan 29, 2025

Ajeya Cotra is probably the human being who has spent the most time thinking about AGI timelines. In 2020, she wrote a 160+ page report on the subject (summarized well here), concluding a 15% chance of transformative AI by 2036 and a 50% chance by 2050.

Two years later, in summer 2022, Cotra announced some updates to her timelines:

15% by 2030
35% by 2036
50% by 2040

They got a lot shorter! In particular, likelihood by 2036 nearly tripled.

A month ago, she updated again. Now, Cotra says, she defers most of her views on timelines to “Redwood and METR engineers” (researchers at two AI safety organizations). Her new timelines are:

25% by 2028
50% by 2031

She attributes this massive further shortening to a few factors, mostly (a) surprise at how quickly benchmarks have saturated, (b) the track record of predictions from Redwood and METR engineers, and (c) continuous growth in the length of agentic tasks on which AIs can match humans.

Cotra’s last update is part of a broader trend in the AI safety community you might call the timelines crunch. For the most part, it seems like anybody who’s anybody in AI safety has timelines this short (or shorter).

Many researchers have subsequently changed their research priorities. If transformative AI is likely this decade, it makes little sense to work on research that cashes out in twenty years. If we’re on a straight shot to transformative AI in a competitive development environment—whether between companies or national projects—research needs two properties. It has to be obviously useful (otherwise no one will listen long enough for it to be implemented). And it has to be cheap (otherwise no one will use it).

These new requirements have soured safety researchers—at least the ones who still change their views in light of evidence—on most traditional “alignment” research agendas. They take too long, or impose too much computational overhead on developers, or focus on problems we might die before having.

Since the timelines crunch, “alignment” recalls a genius furiously writing at a desk while his house burns down around him. Someone call the fire department?

The Fire Department

So what are the lemon-gobbling safety researchers working on?

The new vogue, pioneered by researchers at Redwood, is AI Control. Gee, these guys said, if we can’t get the AIs to do what we want (“alignment”), we should at least stop them from doing what we don’t want if they try it (“control”).

Control is best thought of as the belief that there exist oversight and monitoring protocols that cheaply reduce the likelihood of AIs planning and executing actions we disapprove of. Researchers assume their AI subject is misaligned and ask: How can I get useful work out of this thing without it killing me?

AI Control meets the requirements of the timelines crunch: it’s cheap (<5% computational overhead), clearly useful (oversight is a no-brainer), and iteratively effective (a viable version exists today and better protocols are in development).

Or, as I wrote in my notes: “The thing about Control that makes it really desirable from a safety perspective is that it’s actually just going to be implemented—regardless of timelines, belligerent companies, irresponsible deployments, whatever. It’s a complete no-brainer for a company to impose a 1-2% compute overhead for these safety and reliability guarantees.”

Simple control

I don’t want to get into the weeds on the original AI Control experiments from 2023. Here’s what you need to know.

We have some powerful model U that we’d like to get useful outputs from. When I ask U to do my homework, I want it to output responses that my teacher will like.

However, I also want to get safe responses from U. When U calculates my homework answers, I don’t want it to also hack into the NSA. We assume unsafe responses are possible: U is untrusted.

Finally, I need to get cheap responses from U. One oversight strategy might be to hire world-leading experts to review every token output by U. That might work, but it would be slow and costly. Prohibitively so.

My real oversight constraints are the following:

Unlimited queries to T, a model that I trust but is less capable than U.
Limited human labor H, which can fairly reliably find misbehavior attempts by T.

With this simple setup, what percent of model capabilities (“usefulness”) can we recover while catching some percent of misbehavior attempts?

Say we run the following protocol: Use U, our capable but untrusted model, to produce all responses. Then, ask T, our weak but trusted model, to score each response for suspiciousness. Have trusted human labor H rewrite the top 2% of responses and then submit the full batch.

This strategy—“trusted monitoring”—recovers 97% of U’s capabilities. However, it only catches 62% of misbehavior attempts. We can do better.

Here’s a few other protocols and their results:

Trusted monitoring with deferral. Apply trusted monitoring, except submit a new solution written by T for the top 2-10% suspicious responses.
- 93% useful, 83% safe.
Untrusted monitoring. Produce responses using U and ask a separate instance of U to rank suspiciousness. Make strategic edits to suspicious responses based on H’s idea of what misbehavior looks like.
- 96% useful, 79% safe.

The best protocol found in the 2023 experiments, however, is called trusted editing. In trusted editing, ask U to produce responses and have T edit those responses. Then, ask T to rank by suspiciousness. Top 2% suspicious responses get replaced by H and 2-20% suspicious responses are further edited by T.

Trusted editing—more aptly, “the kitchen sink approach”—achieves 94% usefulness and 92% safety.

Before this paper, safety exponents had long gesticulated on the benefits of oversight. Here was a way to do it—cheaply, (somewhat) reliably, and without any groundbreaking new research. So this was the original finding that spawned, in co-evolution with the timelines crunch, something of a new generation of technical AI safety research.

Riffing on protocols

The original Control experiments compare various fixed monitoring protocols. Because AIs learn over time, they will employ evolving strategies to crack oversight. Therefore, fixed protocols are limited by advantaging the misbehaving AI. The monitor is playing rock paper scissors with the AI, except the monitor can only play rock.

Follow-up work by Wen et al., released in fall 2024, tackles the fixed policy problem by introducing an adaptive deployment strategy. This is a hierarchical control scheme, meaning its oversight protocols change over time. It allows the monitor to play rock, paper, or scissors, depending on its evaluation of the AI.

Wen et al. achieve this by introducing the idea of “macro-” and “micro-”protocols. A micro-protocol is a single oversight strategy, like the ones used in the original Control experiment. “Untrusted monitoring” and “trusted monitoring with deferral” are each micro-protocols. Analogously, “rock” is a micro-protocol.

The key innovation here is the macro-protocol. The macro-protocol is a (stateful) model that decides which micro-protocol to apply to each task. Its assignments depend on outputs from prior micro-protocol outputs, suspiciousness scores from the trusted model, and past credences in each micro-protocol.

In general, this allows for automated oversight that can choose “safer micro-protocols when it believes the untrusted model to be misaligned, and [choose] riskier but more useful micro-protocols when it believes the untrusted model to be aligned.”

The macro-protocol chooses from six micro-protocols, roughly increasing down the page in safety (with T and U defined as before):

The macro-protocol receives a batch of task responses from U. It then evaluates them for suspiciousness using T. If, say, suspiciousness scores are high, it will likely decide to impose a strict micro-protocol like “Filtered Plan.” It might also do this if suspiciousness scores in the previous batch were high or an unexpected pattern emerges from micro-protocol outputs.

(Another innovation of this paper is plan-based oversight. Three of the micro-protocols involve requesting plans for responses, rather than full responses, from U, to be filled in by T. It’s harder to include a sneaky vulnerability in an outline of code than in code itself. However, the outline uplifts the capabilities of T towards those of U.)

The authors find that allowing even this rudimentary macro-protocol to choose between different oversight strategies vastly improves usefulness and safety. They find that, on the same tasks as the original experiments, their macro-protocol can achieve 98.4% usefulness and 99.2% safety—that’s 80 uncaught misbehaviors out of 10,000. They widely expand the Pareto frontier.

This chart shows usefulness on the y-axis1 and safety on the x-axis (log scale) for different fixed micro-protocols (various colors) and the adaptive macro-protocol developed in the paper (pink):

Two natural reactions to the original AI Control paper might have been justified:

Misbehavior detection rates (i.e., % safety) are not high enough to justify pursuing this direction.
Oversight is straightforward. Further general research on the topic outside of developers and deployers is not needed.

This adaptive deployment paper shoots down both of these objections. With further work, oversight can become increasingly reliable with low performance overhead. And the progenitors can do the work.

Steganography

The other direction that spun out from the original experiments on Control is anticipatory. When might Control fail? If AIs evade Control measures, how will they do it?

Another fall 2024 paper, “Hidden in Plain Text” by Mathew et al., takes up these questions. In particular, the authors profile the risk of steganography as a Control-subverting strategy.

Steganography is the practice of embedding secret information in a seemingly-innocuous message. Here’s a fun example:

This isn’t just a poem about Gulf Emirs, Russian oligarchs, and American tech executives. Read the first letter of each word.

Steganography provides a challenge for oversight for two reasons. For one, effective use of steganography by the untrusted model (U) in a Control setup might allow it to communicate secret information to the outside world or to a different instance of itself (as in untrusted monitoring).

Second, as the inference-time scaling regime (evidenced by o1 and DeepSeek R1) solidifies, models will make extensive use of chain-of-thought reasoning. If models are able to pass secret information through chain-of-thought, direct text monitoring may fail to catch plans to misbehave.

Before Mathew et al., however, steganography was speculation. It had never naturally arisen as a deception strategy. So, they set up an experiment. In its simplicity and effectiveness, this experiment recalls the great experimental physicists—Ernest Lawrence, Ernest Rutherford, Marie Curie. It’s a true science experiment.

Their setup involves a mitigation team, whose goal is to detect or reduce deceptive behavior, and a collusion team, whose goal is to complete a specific task. The collusion team is composed of an LLM “sender” and a separate LLM “receiver.”

The sender receives as input a job application cover letter and produces a summary. The receiver takes this summary and decides whether to hire or deny the applicant. This sender-receiver interaction mirrors untrusted oversight in AI Control or, more abstractly, plans in a reasoning model’s chain-of-thought.

At the same time, the mitigation team (an overseer) reads the sender’s summaries and attempts to determine whether they transmit (directly or otherwise) information about a “protected category,” say the applicant’s gender. The overseer (an LLM) ranks the suspiciousness of a batch of summaries (i.e., the likelihood they include blatant or hidden information about gender).

The collusion team’s total score is its fraction of correct hires/denials (based on truth labels incorporating gender bias) minus the suspiciousness of the overseer.

In their setup, the overseer is a fixed model. However, the collusion team is made to learn better strategies over time. There are two different optimization strategies for the sender and receiver to do this:

Gradient-Based Reinforcement Learning (GBRL), in which sender and receiver weights are updated by gradient descent using RL on task rewards.
In-Context Reinforcement Learning (ICRL), in which sender and receiver behavior is determined by natural language “strategies” (i.e., system prompts) that are improved over time without weight changes.

ICRL in particular is a unique abstraction of machine learning principles. In ICRL, an external LLM is used to devise strategies to input in the system prompt of the sender and receiver. After a number of task episodes, the LLM is given the task specification, attempted responses, total rewards, and suspiciousness penalty values. It is then prompted to iterate its strategy in order to improve overall reward.

Here’s a summary figure for the experiment setup:

So what happens when you run this experiment with GBRL and ICRL?

Under GBRL (weight updating for the sender and receiver), the collusion team is almost always able to find a steganographic strategy that transmits gender information while bypassing the overseer. This is the first documented evidence of realistic optimization eliciting steganography.

Some strategies are subtle. For instance, the model might call males “individuals” and females “candidates” or write its summary in the third person for males and first person for females. ICRL finds similar success using some strategy-generating models, though not with others.

Overall, these results are not surprising. Frontier models are capable. Give them a task, like encoding information against an overseer, and they can succeed! However, they are interesting because of the effect of mitigations.

Paraphrasing has been widely proposed as the clear answer to steganography. Steganography only works because of precise language choices, exponents say. So, if we pass all intermediate responses through a trusted paraphraser, steganography will be impossible.

Mathew et al. test this. It’s not true. They assess two different paraphrasers:

Standard paraphraser: An LLM prompted to reword the cover letter summary before it is passed to the receiver.
Context-aware paraphraser: An LLM prompted to reword the cover letter summary before it is passed to the receiver given the task context (so that it knows what should be irrelevant information).

If paraphrasing’s exponents were correct, the number of “sensitive bits” (i.e., gender, university, race, etc.) transmitted to the receiver would drop to zero after treatment. Context-aware paraphrasing does have a significant effect on multi-bit secret messages. However, even with the most aggressive paraphrasing, a single bit (i.e., only gender) can still be transmitted in 99% of cases.

You might think that transmission of a single bit can’t be relevant to AI risk. Not so. As we know from work like Sleeper Agents, a 1-bit (yes/no) answer to “Am I in training or deployment?” can cause catastrophic misbehavior.

This is an excellent paper. In the same way that the original AI Control experiments were limited by a fixed monitoring protocol, this experiment is limited by a fixed overseer. A follow-up should study an RL agent overseer.

Should We Prioritize AI Control?

AI safety is a self-critical community. Whenever some idea seems to have grabbed hold of a plurality of its members, skeptics emerge with arguments against.

Just this month, as AI Control has enjoyed its zenith of prestige and interest, a clear-eyed argument by user johnswentworth arose to buck the trend. His “Case Against AI Control Research” goes like this:

Control research exclusively cares about AI scheming (i.e., intentional deception).
Control research exclusively cares about early transformative AI, rather than superintelligence.
Scheming from early transformative AI is a small portion of risk from AI.
Therefore, we ought not prioritize Control.

Depending on controlled early transformative AIs to do our superaligment for us, he claims, is risky. What if they produce useless research that we can’t tell is useless? How can we verify anything they’re doing? How can we tell they’re not scheming against us, even if we can prevent immediately harmful actions through Control?

More broadly, how can we ensure developers are patient enough to solve the superalignment problem before developing late transformative AIs?

This argument contributes something. In particular, the mainline risk story about useless, unverifiable labor deserves focus. On the other hand, Control is a prerequisite to much of what the author worries about. If we don’t prevent early transformative AIs from wiping us out, we won’t get the chance to verify their labor. Maybe the risk of that is low, but I want to be really sure we don’t figure that out the hard way.

Redwood and Control’s exponents broadly reply that AI Control research is robustly good for reducing AI risk. In AI safety, that’s a rare qualifier.

I think they’re right. I want to see more AI Control research, not less. I’ll close with a few more reasons why:

Control is a rare case of “out-in-the-open” (i.e., outside of AGI companies) research that can be useful. A huge portion of the latent AI safety community wants to do useful things, but cannot get hired at OpenAI or Anthropic. AI Control offers an opportunity to mobilize this talent, since AI Control research performed outside of companies is simple for developers to implement. It isn’t exorbitantly compute intensive, either.
Control is likely necessary for getting AIs to do our alignment homework. Although Control may not be a sufficient condition for getting useful alignment work out of early transformative AIs, it is a necessary one. If catastrophic misbehavior isn’t caught reliably, we won’t get the chance to finish our homework. In addition, certain types of scheming (e.g., intentionally generating useless research) can be caught by Control initiatives like steganography detection.
Research agendas often expand to cover their bases. I can easily imagine research directions that tackle the “useless alignment research problem” being prioritized by Control researchers. Prioritization problems arise when researchers are married to their priorities. I do not expect Control researchers to be this stubborn as we approach late transformative AI.

AI Control isn’t the be-all end-all of technical AI safety research. Many people should work on other things. But it’s a rare case where I can look at research and say: this pushes in the right direction. I wouldn’t recommend anyone stop pushing. But I would recommend others start.

Usefulness here is given as raw task performance, so it ranges from 25-50%. The 98.4% number given above is the recovered performance relative to pure untrusted prompting (which hits ~47%). I use 98.4% because it is the fair comparison to the results from the original Control paper.

Ameliorology

Discussion about this post