On Governance and Alignment in AI Safety
Or: How I Learned to Stop Worrying and Love the Impact. Also: You Should Work on AI Governance
Many people in the AI Safety community make much of the so-called “Governance-Alignment split”: should you professionally focus on the regulatory challenges of AI or the technical alignment (or control, or oversight, etc.) problem of AI? This is especially pronounced amongst my circle of student organizers for AI Safety, who try to nudge exceptional college kids towards AI Safety careers, and from there towards the side of the split that seems ‘right.’
I want to argue that the more impactful side of the split, in general, is the regulatory side. In other words, when deciding whether to pursue work in AI Governance or in technical alignment, you should default to governance and only decide otherwise in the face of strong evidence (e.g. a particularly promising opportunity).
To tighten our scope, I’ll assume that we agree AI Safety broadly is the most impactful thing to work on (if you don’t agree, maybe read this). I’ll also mostly be focusing on the governance apparatus of the United States. This is for no other reason than that, as a true American, I don’t know much else.
Liars, damned liars, and politicians
Call the Governance side of the split the “G side” and the Alignment side the “A side.” What are we talking about when we talk about professional work on each side?
The G side refers to anything relevant to the policies of governments, organizations, and people with respect to powerful AI. In contrast, A-side work denotes anything relevant to the technical problem of preventing powerful AI from posing a catastrophic risk to humanity.
The first thing to clarify is that not all G-side work is ‘Humanities-ey’ and (at least in theory) not all A-side work is ‘STEM-ey.’ As I have mentioned before, some of the most important and urgent work in AI policy right now likely requires technical talent to implement. There are many, many examples of this. For instance, the US AI Safety Institute, established following a 2023 executive order, is mainly focused on implementing effective model evaluations, a significant technical problem. My other favorite example is on-chip security hardware, which would require significant technical attention.
In addition, there probably exists some non-technical work on the A side. For example, you might want philosophers or other analysts (or, like, pollsters?) to consult with technical alignment teams on direction (or, in theory, on what values we should put into our alignable AI). I think this is roughly what Amanda Askell has been doing for a while at OpenAI and now Anthropic (e.g. she is credited in one paper with “develop[ing] the conceptualization of alignment in terms of helpfulness, honesty, and harmlessness”), though she also does technical work.
This should dispel the view that subject-bucketed affinity should dictate one’s path (at least away from the G side; it is plausible that effective A-side work is prohibitively difficult without significant technical affinity).
One should see also that the G side encompasses regulatory policies beyond only government action. Anything that places guardrails within (e.g. Anthropic’s Long-Term Benefit Trust), between (e.g. voluntary safety commitments by frontier labs), or above (e.g. Biden Executive Order reporting requirements) actors developing powerful AI counts here.
Since AI Safety is such a nascent field, it is remarkable how much variety exists within each category. G-side work runs the gamut: staffing in the executive branch, Congress, or the judiciary; policy roles at AI labs or think tanks; academic or independent policy research; lobbying; legal representation or analysis; and likely more. For A-side work, the options are working at a frontier lab (either on a safety team or embedded in a capability-scaling team), in academia, or independently. A-side work also decomposes further by safety strategy (‘agenda’): mechanistic interpretability, scalable oversight, ELK, corrigibility, etc.
Though the G versus A side split is well defined, there’s a lot of ground to cover on each side.
The ITN framework
Note: This introduction to the ITN framework may be skippable for those already familiar with using importance, tractability, and neglectedness to gauge a cause area’s impact. If you do skip ahead, pick back up at “Governance is more important than alignment.”
I suppose the most significant assumption we’ll have to make to get the wheels turning on this argument is that you should spend your professional career doing the work that, in your estimation, is the most positively impactful. I will very briefly argue for this point (for further discussion, see “Famine, Affluence, and Morality”):
It seems as though a few things are probably true:
Many people experience truly terrible or short lives.
Causing terrible things to happen to people is wrong.
Neglecting to prevent terrible things from happening to people is approximately as bad as causing such things.
You can prevent terrible outcomes in many people’s lives through work.
Likely we agree on points (1), (2), and (4). Some disagree with (3), but it seems as sound as the others:
Say the current state of the world is ‘good’ (i.e. Utopian). Jennifer has two actions available to her: action (a), whose outcome is changing the state of the world to ‘bad’ (i.e. terrible); and action (b), which leaves the world ‘good.’ All else equal, Jennifer would be acting very wrongly if she chose action (a) over (b). Similarly, now suppose the current state of the world is ‘bad.’ Jennifer again has two actions: action (c), which changes the world to ‘good’; and action (d), which leaves the world ‘bad.’ Again, all else equal, Jennifer would be acting very wrongly if she decided to choose action (d) rather than (nearly costlessly) preventing the suffering of many people in the bad world. In short, it is rarely a good policy to stand idly by while children die.
The natural conclusion from (1)-(4), then, is that one should work to prevent terrible things from happening. So long, that is, as one believes one should avoid extreme moral failure. The finer point that one should optimize this effort would be too far afield to discuss rigorously here, but it effectively falls out of the previous discussion of the failure to prevent harm.
Deciding which efforts are most effective for preventing massive harms and doing good is a difficult challenge. The best way to meet this challenge is to use something called the Importance, Tractability, and Neglectedness (ITN) framework. The idea is as follows: you want to figure out which cause (say, A side versus G side) will allow you to have the largest positive impact. To figure out the impact you would have in a field, you need to know the size of your lever (Importance), how good the handholds are on the lever (Tractability), and the number of people already pulling on them (Neglectedness).
Importance matters because a cause whose complete success only adds two feet to the top of the Burj Khalifa is not worth pursuing. Tractability matters because even if there were gold reserves at the top of the Burj Khalifa, it might not be worth going up there to mine them because it would be exorbitantly expensive to bring equipment up 800 meters. Neglectedness matters because if everyone in the world is working on the ‘add two feet to the Burj Khalifa’ project, it’s probably more valuable for you to work on something else.
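To make the multiplication explicit, here is roughly the factorization used in standard presentations of the ITN framework (e.g. by 80,000 Hours, which labels the first two factors ‘scale’ and ‘solvability’); the wording of the labels is my paraphrase, but the idea is simply that the intermediate units cancel, leaving the marginal impact of one additional person:

$$
\underbrace{\frac{\text{good done}}{\text{\% of problem solved}}}_{\text{Importance}}
\times
\underbrace{\frac{\text{\% of problem solved}}{\text{\% increase in resources}}}_{\text{Tractability}}
\times
\underbrace{\frac{\text{\% increase in resources}}{\text{one extra person}}}_{\text{Neglectedness}}
=
\frac{\text{good done}}{\text{one extra person}}
$$

Because the factors multiply, a cause that is weak on any one of them ends up with a small product, which is why all three questions matter at once.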
In other words, if one cause is more important, more tractable, and more neglected than another, one should work on the first. On the margin, in the G side versus A side split, the G side currently dominates in this way.
Governance is more important than alignment
Both the G side and the A side lay claim to a very lofty importance: arguably, preventing the extinction of humanity. But there’s nuance here.
Because one’s career choice is a marginal change to the total effort going into some cause, when discussing importance one should scrutinize the amount of good done per percentage point of progress on the problem (or, in this case, per percentage point of the project’s success). (For simplicity, I will amortize impact so we do not run into the problem of large impact discontinuities, which exist on both the G and A sides.)
A solution to the G-side project looks like one of three things: (1) sweeping responsible-development requirements that mirror or exceed standards in other industries (e.g. enforced IAEA standards in nuclear power), strong enough to prevent development without near certainty of safety; (2) a national moratorium, alongside international agreements, to halt or ban massive AI development and training (akin to the scientific community’s voluntary halt on recombinant DNA experiments); or (3) a centralized international government AI project with sufficient regulation to ensure safety (e.g. CERN).
The key feature of each of these three outcomes is that development only continues when safety can be effectively guaranteed. Perhaps you believe that safe development cannot be guaranteed; then, upon the success of the G-side project, development will not proceed.
There exist many agendas on the A side, most of which do not hope or claim to fully solve the problem of aligning powerful AI. A full solution, however, would be a way to ensure the safety of arbitrarily powerful AI.
I understand that the solutions on both sides, especially the A side, represent truly immense, perhaps insurmountable challenges. But, if we take the world seriously, if we truly count the OOMs, if we actually eat the PASTA, these are the challenges we see.
An immediate observation is that the G-side solutions effectively subsume the work required for the A-side solution. In other words, any project achieving the ends of (1)-(3) above will, with high likelihood, safely produce an alignment solution if one exists. You might protest that this will just require additional A-side workers, but the fact is that once incentives (within labs or a large centralized project) align with creating provably safe systems, current AI capabilities researchers will be induced to work on the problem of alignment. You should not take a role that will already be filled by more capable workers through standard market incentives.
In contrast, there’s no particular reason that completed A-side work is ever implemented. Some American AI lab might crack the alignment problem, but I see no reason this means some Chinese AI development group (or terrorist group, or random kid on a laptop, etc.) will use that solution.
I have also consistently found it much easier to tell a story in which things go well without near-term alignment work being prioritized than one in which things go well without significant regulatory governance. The former is simple to imagine because of the regulatory incentive shifts described above. The latter seems difficult for two reasons. First, unregulated competitive pressure between labs makes responsible development challenging. Second, the consistency of algorithmic and computational scaling implies widespread access to highly capable AI shortly after the initial breakthrough.
For these reasons, it seems that the cause-specific mission of G-side work is more important than that of A-side work. This does not at all, however, demonstrate that you should work on the G side. To show that, we must look at tractability and neglectedness.
Governance is more tractable than alignment
I do not think I could have confidently said that AI Governance is more tractable than AI Alignment two years ago. Today, there is little I am more sure of.
It would be hard to overstate the massive shift in the Overton window towards AI regulation that has occurred over the last two years.
Google Trends shows that search frequency for “AI” has grown ten times since Summer 2022.[1]
Europe has passed a sweeping law to regulate AI in many domains. Although frontier AI lab lobbyists from the US and France were able to strip out most of its proposed regulations for general-purpose frontier models, the willingness to follow through on regulation is clearly there.
A massively important bill, SB 1047, is currently moving through the California State Senate. That bill would ensure that development of frontier AI can only occur if “a developer can provide reasonable assurance that the covered model… will not come close to possessing a hazardous capability.” It would also require developers to demonstrate the ability to fully shut down training before development can occur.
Governance of AI also has the attention of Chuck Schumer and Mitt Romney, among other prominent US lawmakers. We now have regular AI Safety summits, attended by world leaders from across the globe, producing unprecedented commitments and statements from governments and industry. For example, the UK AI Safety Summit produced the Bletchley Declaration, signed by the US, UK, EU, and China, which recognizes that “There is potential for serious, even catastrophic, harm, either deliberate or unintentional, stemming from the most significant capabilities of these AI models.”
And this push for regulation is not just blown smoke. AI developers in search of dangerous profits see these efforts as a threat to their very existence. More than 450 organizations now lobby the federal government on AI, fielding over 3,400 individual lobbyists in 2023 alone, including 60 from Microsoft, OpenAI’s biggest investor. These numbers are likely even higher in 2024: the number of individuals lobbying the White House on AI grew 188 percent from the first quarter of 2023 to the fourth.
I cannot imagine a more exciting, promising, or consequential time to form AI policy.
G-side projects also benefit from broad precedent for international cooperation. A paper by policy researchers including ‘AI Godfather’ Yoshua Bengio identifies at least 12 precedent organizations for the institutions that would likely be required to facilitate safe international AI development, including the IAEA, CERN, ITER, and the IPCC.
On the other hand, hope for the success of A-side work seems slim and diminishing. It hardly takes more than a glance at the words of former OpenAI employees or the dissolution of OpenAI’s ‘Superalignment’ team before its first anniversary (a team whose stated purpose was to solve superintelligent AI alignment within four years) to realize that A-side work at leading labs is on the downturn.
This is obviously less true at Anthropic, whose strong safety culture persists for now. For example, the interpretability work pursued there (especially on sparse autoencoders, or SAEs) is the best in the world. But Anthropic, by its own admission, has less and less reason to pursue an A-side agenda with every paper it releases. This follows from its mission as formulated in its “Core Views on AI Safety” document.
Basically, Anthropic thinks the difficulty of succeeding at the A-side mission is uncertain. Its probability distribution over the difficulty of alignment, shown in the figure below, is a somewhat flat, bimodal distribution, with most of the probability mass lying either between “the alignment problem is as hard as inventing the steam engine” and “as hard as the Apollo mission,” or right around “Impossible.”
As outlined in “Core Views,” if we gather evidence that alignment is about as hard as, or harder than, P vs. NP (the most famous and fundamental unsolved problem in computer science), then
Anthropic’s role will be to provide as much evidence as possible that AI safety techniques cannot prevent serious or catastrophic safety risks from advanced AI, and to sound the alarm so that the world’s institutions can channel collective effort towards preventing the development of dangerous AIs. (“Core Views”)
Work like “Sleeper Agents,” which demonstrates that current safety techniques cannot reliably remove deceptive behavior once it is present, ‘eats away’ at probability mass in this graph from the left, giving us more and more evidence that we could be in a pessimistic world. In that case, even Anthropic will end its alignment agenda and focus on the G side.
A last note on Anthropic is that OpenAI was also founded with a clear and ostensibly genuine safety mission. We should hope that Anthropic’s internal governance structures insulate it from the pressures of massive corporate investment more effectively than OpenAI’s did.
Academia and independent research might seem like promising ways to get useful A-side work done while avoiding this corporate politicking. Unfortunately, it is incredibly difficult for the GPU-poor to break ground on the A side. Without access to frontier models, data centers in which to experiment on them, or enough computational power to pursue ambitious agendas, researchers are relegated to outdated solutions to yesterday’s problems. AI moves pretty fast. If you don’t stop and buy 350 thousand GPUs once in a while, you could miss it.
The striking point of comparison in tractability is that the G side looks something like a blend of nuclear energy and nuclear weapons policy. Something with precedent. Something the US and international governments are designed to control. On this problem, we can bring to bear a hundred years of regulatory teeth. The A side, on the other hand, is a scattered field with more plans than researchers, where no one agrees that any given plan could work.
G-side work is in the ascendant. Opportunity abounds. It is more tractable.
Governance is more neglected than alignment
Neglectedness is a more cut-and-dried, quantitative measure than the others.
In a 2023 report, Benjamin Hilton
estimate[s] there are around 400 people around the world working directly on reducing the chances of an AI-related existential catastrophe (with a 90% confidence interval ranging between 200 and 1,000). Of these, about three quarters are working on technical AI safety research, with the rest split between strategy (and other governance) research and advocacy.
Depending on how that remainder breaks down, in the most optimistic scenario we have around three times as many people working on the A side as on the G side. Since strategy work (which is typically safety-facing) does not count directly toward either side, the ratio could be as high as four or five to one. Another estimate, by Stephen McAleese, also finds a roughly 300:100 split in favor of the A side.
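As a rough sanity check on that arithmetic, here is a minimal sketch (the 400 and “three quarters” figures are Hilton’s; the G-side headcounts in the loop are my own assumptions, chosen only to illustrate the range stated above):

```python
# Headcount arithmetic behind the A-side : G-side ratio.
total = 400                     # Hilton's central estimate of people in the field
a_side = round(0.75 * total)    # "about three quarters" on technical safety -> ~300
rest = total - a_side           # ~100 split across strategy, governance research, advocacy

# Optimistic case for the G side: count the entire remainder as G-side work.
print(a_side / rest)            # 3.0 -> a 3:1 ratio

# If strategy work is excluded, assume (hypothetically) that only 60-75 people
# remain doing direct G-side work:
for g_side in (75, 60):
    print(a_side / g_side)      # 4.0, 5.0 -> the "four to five times" range
```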
One might theorize that this split reflects an efficient equilibrium (reached because each of these 400 people picked the marginally most effective side to work on as they entered the field). I suspect, however, that the imbalance has more to do with the prior domain expertise of EAs (mostly STEM) and the intellectual history of AI Safety (rooted in theory and ‘nerd culture’).
For these reasons, the G side is clearly more neglected than the A side.
What’s all this got to do with me?
As I have argued before, the massive stakes of AI development in the coming decades present an opportunity to do quite a lot of good. As I have argued above, these stakes should motivate any person interested in doing good to work (or donate) to facilitate safe development.
Once motivated to work on AI Safety, I see many entering the field struggling to decide whether technical or governance work is right for them. As I have argued, AI Governance work is more impactful than technical alignment work because it is more important, more tractable, and more neglected. Thus, I recommend that, when faced with this decision, you begin with a heavy bias towards AI Governance work.
Regardless of which side you choose (or if you choose something else entirely, like strategy or communications), you should always begin with a foundational understanding of AI. And whichever path you take, be so good they can’t ignore you.
[1] Funny enough, though I wouldn’t use this as evidence of much, the US subregion with the highest relative search frequency for “AI” is Washington, D.C.