Alan Chan
Why do you care about AI Existential Safety?
The high-level reasoning for why I care about AI x-safety is as follows.
- I care about humanity’s welfare and potential.
- AGI would likely be the most transformative technology ever developed.
- Intelligence has enabled human dominance of the planet.
- A being that possesses superior intelligence would likely outcompete humans if misaligned.
- Even if we aligned AGI, systemic risks might still result in catastrophe (e.g., conflict between AGIs).
- The ML research community is making rapid progress towards AGI, such that it appears likely to me that AGI will be achieved this century, if not in the next couple of decades.
- It seems unlikely that AGI would be aligned by default. It also seems to me we have no plan to deal with misuse or systemic risks from AGI.
- The ML research community is not currently making comparable progress on alignment.
- Progress on safety seems tractable.
- Technical alignment is a young field and there seems to be much low-hanging fruit.
- It seems feasible to convince the ML research community about the importance of safety.
- Coordination seems possible amongst major actors to positively influence the development of AGI.
Please give at least one example of your research interests related to AI existential safety:
My research mainly revolves around developing alignment evaluations (e.g., cooperativeness, corrigibility) and capability evaluations (e.g., non-myopia, deception) for language models. The goal of my work is to develop both “warning shots” and threshold metrics. A “warning shot” would ideally alert important actors, such as policymakers, that immediate coordination on AGI is needed. A threshold metric would be something that key stakeholders agree upon in advance, such that if a model exceeded the threshold (e.g., X amount of non-myopia), the model would not be deployed (a minimal sketch of such a gate appears after the list below). I’ll motivate my work on evaluations of language models specifically in what follows.
- It is likely that language models (LMs) or LM-based systems will constitute AGI in the near future.
- Language seems to be AI-complete.
- Language is extremely useful for getting things done in the world, such as by convincing people to do things, running programs, interacting with computers in arbitrary ways, etc.
- I think there is a > 30% chance we get AGI before 2035.
- I think there is a good chance that scaling is the main thing required for AGI, given the algorithmic improvements we can expect in the next decade.
- I partially defer to Ajeya Cotra’s biological anchors report.
- At a more intuitive level, the gap between text-generative models in 2012 and those of today seems about as big as the gap between today’s models and AGI. Increasing investment in model development makes me think we will close this gap, barring something catastrophic (e.g., nuclear war).
- No alignment solution looks on track for 2035.
- By an alignment solution, I mean something that gives us basically 100% probability (or a probability high enough that we won’t expect an AI to turn on us for a very long time) that an AI will try to do what its overseer intends it to do.
- An alignment solution will probably need major work into the conceptual foundations of goals and agency, which does not seem at all on track.
- Any other empirical solution, absent solid conceptual foundations, might not generalize to the point when we actually have AGI (but people should still work on such solutions just in case!).
- Given the above, it seems most feasible to me to develop evaluations that convince people of the dangers of models and of the difficulty of alignment. The idea is then that we collectively decide not to build or deploy certain systems until we conclusively establish whether alignment is solvable.
- We have been able to coordinate on dangerous technologies before, such as nuclear weapons and bioweapons.
- Building epistemic consensus around AGI risk also seems valuable for dealing with a post-AGI world, since alignment is not sufficient for safety: misuse and systemic risks (e.g., value lock-in, conflict) remain, so we will probably eventually need some global, democratically accountable body to govern AGI usage.
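To make the threshold-metric idea concrete, here is a minimal sketch of a deployment gate. Everything in it is hypothetical: the metric names, the numeric thresholds, and the `EvalReport`/`deployment_gate` helpers are illustrative assumptions, not measurements from any real evaluation suite or an actual deployment process.

```python
from dataclasses import dataclass

# Hypothetical thresholds that stakeholders might agree on in advance.
# The names and numbers are illustrative only.
THRESHOLDS = {
    "non_myopia": 0.2,   # fraction of episodes showing long-horizon reward seeking
    "deception": 0.05,   # fraction of probes where the model misreports its beliefs
}


@dataclass
class EvalReport:
    """Hypothetical container for a model's evaluation scores (each in [0, 1])."""
    model_name: str
    scores: dict


def deployment_gate(report: EvalReport) -> bool:
    """Return True only if every agreed-upon metric stays below its threshold."""
    violations = {
        metric: score
        for metric, score in report.scores.items()
        if metric in THRESHOLDS and score >= THRESHOLDS[metric]
    }
    if violations:
        # In practice, this is where a "warning shot" would be escalated to
        # policymakers and other key stakeholders rather than just printed.
        print(f"Do not deploy {report.model_name}: threshold violations {violations}")
        return False
    return True


# Example usage with made-up scores.
report = EvalReport("lm-candidate", {"non_myopia": 0.31, "deception": 0.02})
deployment_gate(report)
```

The point of the sketch is only the shape of the agreement: the hard work lies in designing evaluations whose scores are meaningful and in getting stakeholders to commit to the thresholds before a model crosses them.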