We are misdirecting new AI safety enthusiasts into technical alignment work that attempts to build a safe AI, and policy that attempts to regulate a safe AI. The responses that will actually work are still neglected: evals and killswitches.
Many doom scenarios go something like this:
- We have good controls and have done a good job in aligning AI with human values.
- However, the AGI will become an ASI. At this point, the superintelligent being realizes it has no reason to be empathetic towards us unintelligent humans. It has more important, self-driven goals.
- We all die.
Many of the doom scenarios require an AGI or ASI or other being with unwieldy intelligence. If we are relying on the safety systems that us unintelligent humans have developed by this point, that is very bad. However, coalescing efforts in capability evals, red-teaming, and kill switches doesn't face the same challenge of battling literal superintelligence, because it's implemented as we develop the superintelligence. Claude 5, Grok 5, GPT 6, etc won't have the superintelligence required to sneak past our evaluations and kill switches. They're too stupid.
The only way we can not be screwed is if we:
A) Catch AI's failure early
I think this will be simple if we're looking in the right ways, because we are working with non-superintelligent models.
Timeline:
- We are here → AI too stupid to know it's being jailbroken/actually harming us
- AI smart enough to not get jailbroken/smart enough to understand when it's harming us
- AI is smart enough to think up and execute its own kill everyone scheme, despite knowing it's actually hurting us. This is the "why would intelligent beings care about less intelligent ones" point.
Somewhere between 2 and 3 we have to catch it. It will be very hard because after we solve jailbreaking/basic lying and scheming, people will be very excited and believe AI is safe and go full speed in development.
B) Demonstrate it in a way that is compelling to the entire world and their most important governments
Honestly, this is the hard part and it will probably require someone to induce a Hiroshima level failure. The action at this point is to halt all AI development and prevent it from recursively developing itself. At this point, it will still be possible for human developed systems to sandbox AI (it isn't superintelligence yet).
TLDR: Control bad. Evals killswitches good!
Update from 9/15/2025
In writing this I underestimated how much alignment efforts will scale as a result of help from AI, and overestimated how simple of a problem it is to "define when AI gets too dangerous" and "get everyone to stop". But, I do still believe technical governance is neglected as a field compared to mech interp.