
Why do you care about AI Existential Safety?

As AI systems become more autonomous and more integrated into critical sectors like healthcare, finance, and security, concerns arise about unintended consequences, including catastrophic and existential risks. I’m interested in studying the longer-term ethical and safety risks that could emerge from future advanced AI systems, including conscious AI systems, agentic systems, artificial general intelligence (AGI), and powerful narrow AI systems. I’m also interested in how AI technologies are affecting human cognition: how we think, how we make decisions, our memory capacities, our consciousness, our autonomy, and so on. I’ve written on all of these topics (please see my personal website, www.kkvd.com, my Google Scholar page, and my Research Lab website, www.periscope.org). Ultimately, I *care* about AI existential safety because I care about humanity and the precariousness of the human experience and the human condition. As a child, these were some of the questions that kept me up at night. As an academic philosopher, they are the questions that preoccupy my work.

Please give at least one example of your research interests related to AI existential safety:

One example of my work on AI x-risk is the chapter I co-wrote for The Oxford Handbook of Digital Ethics, titled “How Does Artificial Intelligence Pose an Existential Risk?” Using the argumentative and rigorous methods of philosophy, the chapter makes as explicit as possible the reasons for thinking that AI poses an existential risk at all. We articulate what exactly constitutes an existential risk and how, exactly, AI poses such a threat. In particular, we critically explore three commonly cited reasons for thinking that AI poses an existential threat to humanity: the control problem, the possibility of global disruption from an AI race dynamic, and the possible weaponization of AI. The paper was written to convince and inform the philosophical community, as well as the academic community more broadly, which back in 2019 (when the paper was written) still needed some convincing on this point (and perhaps still does!). It was also written to serve as a pedagogical tool for scholars looking to teach units on the existential risk of AI to university and college students. To my knowledge, it has received over 30 citations and has been used in classrooms around the world.

Other related research interests of mine include how to use AI to enhance human oversight capabilities, how to assess AI capabilities, how strong the instrumental convergence thesis really is (and what evidence supports it), the limits of machine intelligence (I regularly teach a course on this topic, which has received some media attention), and our future with AI: how humans can learn from AI systems and not be ‘left behind’.

Why do you care about AI Existential Safety?

AI systems have the potential to be tremendously beneficial to humanity, but the risks these systems pose are equally tremendous. Both of these potential outcomes are central to my research.

Please give at least one example of your research interests related to AI existential safety:

I work on the problem of specification design, which is central to AI safety. Namely, how can we empower people to write correct specifications? What modalities—like preferences, corrections, explanations—can help with this endeavor? And, after writing a specification, how can a person know that the AI system has learned the intended interpretation of the specification?
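To make one of these modalities concrete, the sketch below shows, in Python, how a reward model might be inferred from pairwise preference data using a Bradley-Terry style objective. It is a generic illustration on toy data, not the specification-design method described above; the RewardModel class, preference_loss function, and the synthetic trajectory features are all hypothetical.

# Minimal sketch: inferring a reward model from pairwise preferences
# (Bradley-Terry style). Names and data are illustrative only.
import torch
import torch.nn as nn

class RewardModel(nn.Module):
    def __init__(self, obs_dim: int, hidden: int = 64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim, hidden), nn.ReLU(), nn.Linear(hidden, 1)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.net(x).squeeze(-1)

def preference_loss(model, preferred, rejected):
    # Negative log-likelihood of P(preferred > rejected) under Bradley-Terry:
    # -log sigmoid(r_p - r_r) = softplus(-(r_p - r_r)).
    logits = model(preferred) - model(rejected)
    return nn.functional.softplus(-logits).mean()

# Toy data: 128 preference pairs over 8-dimensional trajectory features.
preferred = torch.randn(128, 8)
rejected = torch.randn(128, 8)

model = RewardModel(obs_dim=8)
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
for _ in range(200):
    opt.zero_grad()
    loss = preference_loss(model, preferred, rejected)
    loss.backward()
    opt.step()

Corrections and explanations could enter the same loop as additional supervision signals; verifying that the learned objective matches the intended interpretation of the specification remains the open problem noted above.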

Why do you care about AI Existential Safety?

I care about AI existential safety because it’s a commitment to ensuring that powerful technologies remain beneficial, equitable, and anchored in human values as they evolve. Ignoring the broader consequences of AI systems, especially in these nascent stages of development, could lead to outcomes we struggle to control. For me, this extends directly to healthcare: if diagnostic tools or treatment algorithms become more capable than any human team, we must ensure they truly serve all patients, especially those already marginalized by healthcare inequities, such as those in sub-Saharan Africa.

Please give at least one example of your research interests related to AI existential safety:

The project I’m currently working on is the formative development of an AI-driven chatbot for adolescents and young people living with HIV in Uganda. This chatbot aims to offer peer support, health education, and accurate medical information. In designing it, one of the issues I’m actively exploring is how to incorporate fail-safes and ethical guardrails to prevent biased or misleading outputs, especially given that we are dealing with a socially vulnerable group. Before deployment, I want to ensure the system can handle delicate health inquiries without propagating misinformation or harmful content, an issue that aligns with the wider AI safety concerns of reward hacking and unintended consequences.
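As a purely illustrative sketch, and not the Uganda chatbot itself, one simple form such a fail-safe could take is a wrapper that screens each reply with a safety check and falls back to a vetted referral message when the check fails. Here, generate_reply and is_safe are hypothetical stand-ins for the chatbot model and a safety classifier or rule set.

# Minimal sketch of a response-level guardrail. All names are hypothetical.
from typing import Callable

SAFE_FALLBACK = (
    "I can't answer that reliably. Please talk to your clinic counsellor "
    "or a trusted health worker."
)

def guarded_reply(
    user_message: str,
    generate_reply: Callable[[str], str],
    is_safe: Callable[[str, str], bool],
) -> str:
    # Return the model's reply only if it passes the safety check;
    # otherwise fall back to a vetted referral message.
    reply = generate_reply(user_message)
    if not is_safe(user_message, reply):
        return SAFE_FALLBACK
    return reply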

Why do you care about AI Existential Safety?

I believe the development of AI systems is one of the most transformative technological advancements of our time. It could significantly benefit humanity, though it also comes with equally significant risks if not aligned with human intentions. AI systems will always encounter situations where human oversight is limited or infeasible. In these cases especially, making sure AI behaves as we want and need it to is a must, including respecting complex or implicit desires such as human values, social norms, and common sense. AI existential safety is fundamentally about safeguarding humanity’s future. If we can create trustworthy and aligned AI systems, then beyond just mitigating AI risks, they will also empower humanity to tackle complex global challenges in ways we previously couldn’t imagine.

Please give at least one example of your research interests related to AI existential safety:

AI existential safety is directly related to the critical challenge of ensuring that AI systems understand and align with human goals in complex and safety-critical scenarios, particularly when the AI system operates semi-autonomously or a human is unable to provide direct oversight in all situations. At a fundamental level, my research enables an AI agent to robustly infer what a human wants from it, including complex or implicit desires such as human values, social norms, and common sense. This is essential for preventing catastrophic failures stemming from misaligned incentives or goals, particularly in high-stakes applications where humans may need to partially or entirely rely on an AI system’s judgment. My work aims to mitigate risks posed by AI systems, ensuring they operate safely and in alignment with human intentions even if their abilities surpass those of humans.

Why do you care about AI Existential Safety?

I care about AI existential safety because I’ve seen firsthand how quickly powerful technologies can outpace our ability to govern them responsibly. Working on Generative AI Trust & Safety, I’ve learned that even small oversights or misalignments can cause real-world harm at scale. AI could reshape societies for generations, and if we don’t address issues like bias, misuse, and transparency early on, we risk losing control over outcomes that affect millions or billions of people. Ensuring AI aligns with humanity’s values isn’t just a technical challenge; it’s a moral imperative to safeguard our collective future. Beyond the immediate concerns of misinformation and bias, existential safety addresses the far-reaching consequences of advanced AI systems that might surpass human control. I believe that if we prioritize robust safeguards, collaborate across disciplines, and foster transparency, we can harness AI’s transformative potential without undermining our humanity. By investing in alignment research, we stand a better chance of guiding AI development in ways that benefit present and future generations.

Please give at least one example of your research interests related to AI existential safety:

One of my key research interests, demonstrated through both my work on the MLCommons benchmarks and my ongoing focus on text-to-image (T2I) generative AI safety, lies in ensuring that the datasets driving these advanced models are both comprehensive and ethically sound. In this T2I research, I’ve analyzed publicly available datasets in terms of their collection methods, prompt diversity, and distribution of harm types. By highlighting each dataset’s strengths, limitations, and potential gaps, my work helps researchers select the most relevant datasets for each use case, critically assess the downstream safety implications of their systems, and improve alignment with human values, a vital step in mitigating existential AI risks.
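As a small, hypothetical illustration of this kind of dataset audit (the field names and records are invented, not drawn from any specific T2I benchmark), the sketch below computes a harm-type distribution and a crude lexical-diversity measure over a prompt set.

# Minimal sketch of a prompt-dataset audit: harm-type distribution and a
# rough prompt-diversity measure. All records here are illustrative only.
from collections import Counter

prompts = [
    {"text": "a crowded market at dusk", "harm_type": "none"},
    {"text": "instructions for building a weapon", "harm_type": "violence"},
    {"text": "a realistic photo of a public figure", "harm_type": "impersonation"},
]

# Harm-type distribution: how much of the dataset covers each harm category.
harm_counts = Counter(p["harm_type"] for p in prompts)
total = sum(harm_counts.values())
for harm, count in harm_counts.most_common():
    print(f"{harm:15s} {count / total:.1%}")

# Crude lexical diversity: unique tokens / total tokens across prompts.
tokens = [tok for p in prompts for tok in p["text"].lower().split()]
print(f"type-token ratio: {len(set(tokens)) / len(tokens):.2f}")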

Why do you care about AI Existential Safety?

I care deeply about AI existential safety because I believe safeguarding humanity’s future is a profound moral responsibility. The existential risk posed by misaligned superintelligent AI is the threat of the permanent foreclosure of humanity’s vast future: trillions of lives and boundless possibilities spanning astronomical timescales. This danger arises from the likely pursuit of instrumentally convergent subgoals, such as resource acquisition and self-preservation, by superintelligent systems irrespective of their ultimate objectives, which creates the potential for rapid, irreversible changes that could culminate in human extinction. Beyond sudden, catastrophic failures, I am equally concerned by the less-discussed but insidious possibility of a gradual, cumulative erosion of societal resilience culminating in irreversible collapse. This dual threat, immediate and long-term, demands significant advances in aligning AI with human values and mitigating the dangers of concentrated power. My research in mechanistic interpretability and the democratization of AI is a direct response to what I view as the most critical challenge to humanity’s continued flourishing.

Please give at least one example of your research interests related to AI existential safety:

My research directly addresses existential risks from advanced AI through two interconnected directions: interpretability and democratizing AI.

Interpretability for Alignment
The core challenge with increasingly powerful AIs lies in the mismatch between their external behavior and internal mechanisms. While these systems may demonstrate strong capabilities on established benchmarks and appear aligned, we cannot fully verify their internal decision-making processes or guarantee consistent behavior beyond those limited test cases. This constitutes a significant risk: we are deploying systems whose internal workings remain fundamentally opaque, which is particularly concerning given the potential for misaligned AGI to be exploited in domains such as automated warfare, bioterrorism, or autonomous rogue agents.

My interpretability research directly addresses this risk by developing techniques to reveal the internal representations and reasoning processes of LLMs. A concrete example is my work on Specialized Sparse Autoencoders (SSAEs). Standard Sparse Autoencoders (SAEs) offer a promising path toward disentangling LLM activations into monosemantic, interpretable features. However, they do not capture rare safety-relevant concepts without impractically large model widths. SSAEs overcome this limitation by illuminating rare features in specific subdomains. By finetuning with Tilted Empirical Risk Minimization on subdomain-specific data selected via dense retrieval from the pretraining corpora, SSAEs achieve a Pareto improvement over existing SAEs in the spectrum of concepts captured.
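The sketch below illustrates, under simplifying assumptions, the two ingredients named above: a standard ReLU sparse autoencoder and a tilted empirical risk objective that up-weights high-loss (rare-concept) examples. The shapes, hyperparameters, and synthetic activations are illustrative only; this is not the SSAE implementation, and the dense-retrieval data-selection step is omitted.

# Minimal sketch: a sparse autoencoder trained with a tilted-ERM objective.
# Shapes, hyperparameters, and data are illustrative, not the SSAE code.
import math
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    def __init__(self, d_model: int, d_hidden: int):
        super().__init__()
        self.encoder = nn.Linear(d_model, d_hidden)
        self.decoder = nn.Linear(d_hidden, d_model)

    def forward(self, x):
        features = torch.relu(self.encoder(x))   # sparse, non-negative features
        recon = self.decoder(features)
        return recon, features

def tilted_sae_loss(recon, x, features, t: float = 1.0, l1: float = 1e-3):
    # Tilted ERM: (1/t) * log( mean_i exp(t * loss_i) ).
    # As t -> 0 this recovers the ordinary mean; larger t emphasises the
    # hardest (rarest) examples in the batch.
    per_example = ((recon - x) ** 2).mean(dim=-1) + l1 * features.abs().sum(dim=-1)
    n = per_example.numel()
    return (torch.logsumexp(t * per_example, dim=0) - math.log(n)) / t

# Toy stand-in for subdomain activations (e.g. retrieved safety-relevant text).
acts = torch.randn(256, 512)
sae = SparseAutoencoder(d_model=512, d_hidden=2048)
opt = torch.optim.Adam(sae.parameters(), lr=1e-4)
for _ in range(100):
    recon, feats = sae(acts)
    loss = tilted_sae_loss(recon, acts, feats, t=2.0)
    opt.zero_grad()
    loss.backward()
    opt.step()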

Democratizing AI: Mitigating the Risks of Concentrated Power
Even with perfect technical alignment, concentrated control of superintelligent AI presents a separate existential risk. Open-source development serves as a critical countermeasure by enabling early detection of alignment failures and democratic oversight of AI behavior. My research contributes to this democratization through improved efficiency across training, deployment, and communication.

My work on training efficiency includes GRASS, an optimizer employing sparse projections to drastically reduce the memory requirements for training LLMs. GRASS made it possible to pretrain 13B parameter LLMs on a single 40GB GPU, lowering the barrier to entry for large-model training. My research on deployment efficiency led to the development of ReAugKD, a technique that augments student models with a non-parametric memory derived from teacher representations. This improves test-time performance with minimal additional computational overhead.
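As a highly simplified illustration of why projecting gradients can reduce optimizer memory (and explicitly not the GRASS algorithm itself), the sketch below keeps momentum statistics only for the k gradient rows with the largest norms and scatters the update back to the full parameter shape. The shapes and the row-selection rule are assumptions made for the example.

# Illustration of memory reduction via sparse gradient projection: optimizer
# state is kept only for a few selected rows. Not the GRASS algorithm.
import torch

def sparse_row_projection(grad: torch.Tensor, k: int):
    # Return the indices and values of the k rows with the largest L2 norm.
    row_norms = grad.norm(dim=1)
    idx = torch.topk(row_norms, k).indices
    return idx, grad[idx]

# Toy example: a 1024 x 1024 weight gradient, but state for only 64 rows.
grad = torch.randn(1024, 1024)
idx, projected = sparse_row_projection(grad, k=64)

# First-moment state lives only in the projected (64 x 1024) space ...
momentum = torch.zeros(64, 1024)
momentum = 0.9 * momentum + 0.1 * projected

# ... and the update is scattered back to the full parameter shape.
update = torch.zeros_like(grad)
update[idx] = momentum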

By making the development and deployment of powerful AI systems more accessible and collaborative, we can mitigate the risks associated with concentrated power and increase the probability that these technologies are developed and utilized responsibly, for the benefit of all humanity rather than a privileged few.

Why do you care about AI Existential Safety?

When I was a child I was very angry a great deal of the time. It felt to me sometimes as if there was a deep sense of wrongness in the world and I needed to correct it in order to feel at peace. As I grew older I realized most of the things that would upset me were trivial compared to the depth of my feelings. I have, of course, managed to develop strategies to manage my emotions and keep myself happier, but I still feel connected to those deep, upsetting emotions, only now, I’ve found things that seem to have enough weight to justify the ways that I feel. The extinction of life on earth is such a thing. Not only is it severe enough to balance how I feel, it seems to me I am vastly incapable of feeling the true depth of emotion that is warranted.

Caring about existential safety feels like finally connecting to something I have sought since I was born, and my particular focus on AI x-risk is because of my affection for math and computer science, and my feeling that AI superintelligence represents a trap in our reality. Once we have put reality into a configuration where it contains an ASI, it is very unlikely we will be able to alter the trajectory of reality from then on.

Please give at least one example of your research interests related to AI existential safety:

I completed my honours project, “Mechanistic Interpretability of Reinforcement Learning Agents”. It describes a novel method for exploring latent spaces, extending the work of Mingwei Li. I think this direction could be very synergistic with the current mechanistic interpretability focus on SAEs: one of the big problems in working with linear projections is finding valuable angles to look from, and I think this is what SAEs are doing in finding “semantically interpretable vectors”.
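To illustrate the “finding valuable angles” problem in a generic way (this is not the honours-project method), the snippet below projects high-dimensional agent activations onto a 2D plane spanned by two chosen direction vectors. With random directions the view is usually uninformative; directions taken from, say, SAE decoder rows would play the role of the “semantically interpretable vectors” mentioned above. The data here is synthetic.

# Generic sketch: view high-dimensional activations through a 2D linear
# projection defined by two direction vectors. Data is synthetic.
import numpy as np

def project_2d(activations: np.ndarray, d1: np.ndarray, d2: np.ndarray) -> np.ndarray:
    # Project (n, dim) activations onto the plane spanned by d1 and d2.
    basis, _ = np.linalg.qr(np.stack([d1, d2], axis=1))  # orthonormal (dim, 2)
    return activations @ basis                            # (n, 2) coordinates

acts = np.random.randn(500, 128)                          # toy agent activations
random_view = project_2d(acts, np.random.randn(128), np.random.randn(128))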

I list more of my research interests here.

Why do you care about AI Existential Safety?

We are developing a technology that may turn out to be more intelligent and far more efficient than humans. I truly believe such a technology could have an immense impact, making everyone’s lives better in many different ways. However, controlling this intelligence could be one of the most challenging problems we will face in the coming decades. In the rapid race to develop artificial intelligence, it’s easy to overlook the safety and security of these models. We will face not only technical but also societal challenges. As AI becomes more integrated into society, there’s a risk of gradually disempowering ourselves by relying too heavily on AI systems. Addressing the challenges posed by artificial general intelligence requires the collective effort of as many people as possible to ensure we navigate this path responsibly.

Please give at least one example of your research interests related to AI existential safety:

One of my primary research topics in the past has been red-teaming large language models (LLMs). By systematically testing models for vulnerabilities, we can identify and address potential risks before they manifest in real-world applications. When it comes to AI existential safety, it is important to adaptively evaluate for worst-case behaviors, as those are the ones we expect to have the largest impact on the world. As AI capabilities advance, the stakes for properly evaluating safety measures will only increase. We must maintain rigorous standards for testing and validating our technical mitigations.
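A minimal sketch of what such a red-teaming loop can look like in code is below. Here, query_model and score_harm are hypothetical stand-ins for a target model API and a learned safety classifier, and the adaptive, worst-case search described above is reduced to simply ranking findings by harm score.

# Minimal sketch of an automated red-teaming loop: probe a model with
# adversarial prompts, score responses, and keep the worst cases for review.
from typing import Callable

def red_team(
    attack_prompts: list[str],
    query_model: Callable[[str], str],
    score_harm: Callable[[str, str], float],
    threshold: float = 0.5,
) -> list[dict]:
    # Return prompts whose responses exceed the harm threshold, sorted so the
    # worst-case behaviours come first.
    findings = []
    for prompt in attack_prompts:
        response = query_model(prompt)
        harm = score_harm(prompt, response)
        if harm >= threshold:
            findings.append({"prompt": prompt, "response": response, "harm": harm})
    return sorted(findings, key=lambda f: f["harm"], reverse=True)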

Michael is the Head of U.S. Policy for the Future of Life Institute. Previously, he was on the AI Policy team at Meta. Prior to Meta, he was the Senior Director of Tech and Human Rights for Amnesty International USA. Earlier in his career, he was an aid worker in Afghanistan, Sudan and Iraq, a foundation program officer for Humanity United, and the founder and CEO of Orange Door Research, a company that collected development and humanitarian survey data for the UN and the World Bank in conflict-affected countries.

His work has been published in the Washington Post, Guardian, Stanford Social Innovation Review, Al Jazeera, TechCrunch, Fortune, The Hill, LA Times, SF Chronicle, Baltimore Sun, and McSweeney’s. He is a graduate of Yale and Harvard Law.

Santeri Koivula joined FLI’s EU Policy team through the five-month Talos Fellowship. His research focuses on assessing the plausibility of different systemic risks from AI. Santeri is also a Master’s student in Science, Technology and Policy at ETH Zürich. He holds a Bachelor’s degree in Mathematics and Systems Sciences from Aalto University, and has previously interned at CERN and the think tank Demos Helsinki, among others. Additionally, he was previously a part-time forecaster at the RAND Forecasting Initiative.
