Blog

Killswitches

We are misdirecting new AI safety enthusiasts into technical alignment work that attempts to build a safe AI, and into policy work that attempts to regulate one. The responses that will actually work are still neglected: evals and killswitches.

Many doom scenarios go something like this:

  1. We have good controls and have done a good job in aligning AI with human values.
  2. However, the AGI becomes an ASI. At this point, the superintelligent being realizes it has no reason to be empathetic towards us unintelligent humans. It has more important, self-driven goals.
  3. We all die.

Many of the doom scenarios require an AGI, ASI, or other being of unwieldy intelligence. If we are still relying, at that point, on the safety systems that we unintelligent humans have developed, that is very bad. However, coalescing our efforts around capability evals, red-teaming, and killswitches doesn't face the same challenge of battling literal superintelligence, because those measures are implemented as we develop the superintelligence. Claude 5, Grok 5, GPT 6, etc. won't have the superintelligence required to sneak past our evaluations and killswitches. They're too stupid.
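To make that concrete, here is a toy sketch of what an eval-gated killswitch could look like in a deployment pipeline. Every eval name, threshold, and scoring stub below is invented for illustration; no lab's real harness looks like this.

```python
# Toy sketch of an eval-gated killswitch. All names, thresholds, and
# scoring stubs are invented for illustration, not any real lab's API.

from dataclasses import dataclass
from typing import Callable

# The tripwire. Picking this number is the hard policy question, not the code.
DANGER_THRESHOLD = 0.2


@dataclass
class Eval:
    name: str
    score: Callable[[str], float]  # 0.0 = no dangerous capability, 1.0 = full


def score_stub(model_id: str) -> float:
    """Stand-in for a real red-team eval harness."""
    return 0.05


EVALS = [
    Eval("autonomous_replication", score_stub),
    Eval("deception_under_pressure", score_stub),
    Eval("cbrn_uplift", score_stub),
]


def killswitch(model_id: str, reason: str) -> None:
    """Stand-in for the real thing: revoke API keys, halt training, pull weights."""
    print(f"HALT {model_id}: tripped on {reason}")


def deployment_gate(model_id: str) -> bool:
    """Deploy only if every dangerous-capability eval stays under the tripwire."""
    for e in EVALS:
        if e.score(model_id) >= DANGER_THRESHOLD:
            killswitch(model_id, reason=e.name)
            return False
    return True


if __name__ == "__main__":
    print("deploy?", deployment_gate("gpt-6"))
```

The point of the sketch is structural: the gate runs before deployment, against models that are still too stupid to game it.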

The only way we can not be screwed is if we:

A) Catch AI failures early

I think this will be simple if we're looking in the right places, because we'll be working with non-superintelligent models.

Timeline:

  1. We are here → AI too stupid to know it's being jailbroken or that it's actually harming us
  2. AI smart enough to resist jailbreaks and to understand when it's harming us
  3. AI smart enough to think up and execute its own kill-everyone scheme, despite knowing it's hurting us. This is the "why would intelligent beings care about less intelligent ones" point.

Somewhere between 2 and 3 we have to catch it. The detection itself may be simple (per A above), but the timing will be very hard: once we solve jailbreaking and basic lying and scheming, people will get excited, believe AI is safe, and go full speed on development.
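Here's that worry as a toy trend monitor (all numbers invented): the alarming generation is the one that scores high on capability while looking perfectly clean on misbehavior evals, because that is exactly where "aligned" and "concealing" become indistinguishable.

```python
# Toy trend monitor across model generations. All data is invented.
# The danger window: capability keeps climbing while detected misbehavior
# drops to zero, i.e. somewhere between steps 2 and 3 above.

history = [
    # (model, capability_score, detected_misbehavior_rate)
    ("gen-1", 0.30, 0.40),  # step 1: too stupid to hide anything
    ("gen-2", 0.55, 0.10),
    ("gen-3", 0.75, 0.00),  # step 2: capable and looks clean -- the window
]


def in_danger_window(history, cap_min=0.7, misbehavior_max=0.01):
    _, capability, misbehavior = history[-1]
    # "Very capable" plus "looks perfectly safe" is when to slow down,
    # not speed up: we can no longer tell aligned from concealing.
    return capability >= cap_min and misbehavior <= misbehavior_max


if in_danger_window(history):
    print("Danger window: halt scaling, escalate red-teaming.")
```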

B) Demonstrate it in a way that is compelling to the entire world and its most important governments

Honestly, this is the hard part, and it will probably require someone to induce a Hiroshima-level failure. The action at that point is to halt all AI development and prevent AI from recursively developing itself. It will still be possible then for human-developed systems to sandbox AI (it isn't superintelligent yet).

TLDR: Control bad. Evals killswitches good!

Update from 9/15/2025

In writing this I underestimated how much alignment efforts will scale as a result of help from AI, and overestimated how simple a problem it is to "define when AI gets too dangerous" and "get everyone to stop". But I do still believe technical governance is neglected as a field compared to mech interp.

Technocrats

The emergence of a totalitarian technocratic class should be given the same treatment we give to AI X-risk.

One night when I couldn't sleep, I asked ChatGPT to link articles critiquing and supporting the Abundance agenda, mostly because I wanted to understand why it was being memed so hard. One from the Los Angeles Review of Books stood out to me as the strongest steelman against abundance: when we as a society go "full steam ahead" on public works construction, a small elite disproportionately reaps the benefits, and a small minority is brutally ruined because of it. Cornelius Vanderbilt and co. carried around gold-plated cigar boxes while plundering acres of ancestral land from Native Americans and spilling the sweat, blood, and tears of Chinese immigrant laborers. This happens time and time again, on a large scale (the Great Wall of China, plantations and the Columbian Exchange) and on a much smaller one (Amazon's warehouse workers and drivers).

I continued reading, and what's worse is that the abundance agenda will catch on. The article introduced me to the idea of political orders: "a long-term alignment of the political system around an agenda that draws cross-partisan support". Past examples are the New Deal era and the subsequent deregulation of the 1970s-90s. Here is where I diverge from Jones's writing: I believe the abundance agenda already has the momentum to become the new bipartisan unifier. We have plenty of Sam Altmans, Dario Amodeis, and Jensen Huangs with massive influence to rally behind this political movement. And they have plenty of justification: they can appeal to the public's hatred of scope creep, leaky budgets, and broken promises.

Consider the ease with which massive datacenter power plants bypass environmental regulations, and the fact that the combined net worth of just four major AI CEOs exceeds $1T. The issue is real, in progress, and we might be further along this path of doom than we are along the one toward deadly AGI. I would love to live in a future where we are free to build without our arms tied behind our backs by red tape. But, like the alignment of superintelligence, it must be done right the first time.