Skip to content

This is the second of our ‘of ‘AI Safety Breakfasts’ event series, featuring Dr. Charlotte Stix from Apollo Research.

About AI Safety Breakfasts

The AI Action Summit will be held in February 2025. This event series aims to stimulate discussion relevant to the Safety Summit for English and French audiences, and to bring together experts and enthusiasts in the field to exchange ideas and perspectives.

Learn more or sign up to be notified about upcoming AI Safety Breakfasts.

Ima (Imane Bello) is in charge of the AI Safety Summits for the Future of Life Institute (FLI).

No recording of this event is available.

Chapters

Transcript

You can view the full transcript below.

Transcript: English (Anglaise)

Imane
Please let me introduce Dr Charlotte Stix. Charlotte is Head of Governance at Apollo Research, an organization that is specialized in model evaluations focused on frontier AI capabilities. Previously, she led the efforts in public policy at OpenAI. Before that, she was also the coordinator of the European Commission High-Level Expert Group. We are very pleased that she could make it today. Charlotte, before we start and deep dive into the conversation, can you give us a picture of what evaluations are and why they matter?

Dr. Charlotte Stix
First of all, thank you so much for inviting me. I’m really pleased to be here and meet every one of you individually at some point after the introduction and after our questions here.

(Ommitted: Presentation from Charlotte)

Imane
All right. Thank you so much, Charlotte. I’d like to talk about the limitations of evaluations. When we take a look at the current models, different risks emerge from powerful AI systems and model evaluations are used to identify and assess potentially risky capabilities within a model. Your organization, Apollo Research, has called for a science of model evaluations because the current approach to model evaluations has severe limitations as a rigorous approach in terms of assessments of risk. Can you share more about what limitations you see and how Apollo is trying to address these challenges?

Dr. Charlotte Stix
Yes, absolutely. That’s a really good question.

Dr. Charlotte Stix
Essentially, one of the things we communicate a lot about is that people believe or there’s a general understanding that evaluations are sufficiently robust, and we can entirely rely on them. To your point, we currently cannot. There’s a big gap between what evaluations are, “supposed to fix for us,” when it comes to governance regimes, when it comes to, “Oh, if the evaluation says it’s good, then we can deploy the system” and what they can currently be relied on for.

Dr. Charlotte Stix
In short, there’s a lot of weight on the shoulder of evaluations versus what evaluations can actually tell you about an AI system’s capability and propensities right now. Not least because an evaluation can tell you when we see a dangerous capability, but it can’t tell you that it isn’t there at the moment. We have absence of evidence, but absence of evidence doesn’t really tell us that the model can’t actually do it, it isn’t evidence of absence, which is why we have an interpretability stream. If you want to talk about that, I can also talk about that, but that’s why we have this workstream at Apollo.

Dr. Charlotte Stix
When it comes to the science of evaluations, why we called for this or why we’re working on this is because most other assurance regimes in critical sectors have a process where you have the practitioners working at the frontier, learning all these things, learning the norms, learning what functions, learning best practices and sharing among them, eventually have the opportunity to translate that into best practices, and we’re somewhere on the stretch of that happening. Then eventually, that should influence regulation, that should influence how evaluators as well are being held to account.

Dr. Charlotte Stix
At the moment, you don’t really have a review of individuals or organizations that call themselves evaluators either. Now, it is possible to assess technical capabilities and skills, but we don’t have that at the moment. Eventually, where we want to head with the science of evaluations is to ensure that all the evaluations we do are robust and rigorous, and we can be confident in the result, and that they give us a predictability. They give us an upper bound of how dangerous that model could possibly be given available affordances, and so on, and we just do not have that at the moment. That is just because the field, especially for dangerous capability evaluations, is fairly nascent vis-à-vis evaluating for bias and fairness, even though, of course, there’s also still a lot of progress and funding necessary for that field.

Imane
Thank you. I really like this paradigm of absence of evidence versus evidence of absence. That’s very useful. I want to talk about the current and/or upcoming models. What are the potential risk of AI systems we should be most worried about, and how can evaluation organizations like Apollo help address this risk?

Dr. Charlotte Stix
I’m going to say something very simple, but historically, so far, more progress has led to more capable models. That is good, I guess, because we want capable models. There’s a general idea of we want to be able to outsource more things, more difficult tasks, uncomfortable tasks onto these models. The more capable they are, the more we can outsource it. That is just the general trajectory for upside and downside alike.

Dr. Charlotte Stix
Now, unfortunately, or maybe somewhat self-evidently, these capabilities come with also the downside capabilities of these models being more capable. They might eventually be more capable of modeling themselves in an evaluation and knowing they’re being tested, they might eventually be more capable of modeling the user and the designer and their desires. They’re just, colloquially speaking, they’re just smarter. And so that will, of course, lead to more risk and more need for control mechanisms and more need for testing.

Dr. Charlotte Stix
Just to give you a quick sense of different levels of threat vectors, I guess, that we’re thinking about at Apollo, and some of those I’ve shown you. One of them is insider threat. Insider threat really would be someone fine-tuning a model to effectively lie to you and hide its actions. Then we have misuse, which is something along the example of what I showed you with the hacker. Then we start having a broad range of different precursory capabilities under loss of control. What happens if the AI system itself starts to develop self-awareness, starts to develop theory of mind capabilities, and starts to think strategically about its goals?

Dr. Charlotte Stix
There is a pretty area of concerns coming very quickly in our direction as we ourselves are racing towards more capable AI systems and more progress. Again, this is just finding a balance between the progress and the safety element, really.

Imane
Thank you. Apollo Research has argued that the affordances of an AI system are particularly important, both in terms of the system impact and therefore the regulation that should apply. By affordances, you refer to tools, plugins, or scaffolding that extend the model capabilities, for example, by being able to run code or search the internet. In a world where downstream deployers of a model can add additional affordances to an AI system, how should we think about the role of evaluations that are performed on the upstream model?

Dr. Charlotte Stix
It’s a very important question, especially as evaluations are underpinning these governance frameworks. Usually what we think about is pre-deployment evaluation. Now, you can extend this in both directions. You can do, for example, a training design evaluation. You can audit the training design. You can think about, for example, running evaluations before internal deployment.

Dr. Charlotte Stix
At that point you already have that system deployed in some shape or form, if people could steal the model weights, or they could do something to the model and so on and so forth. Staff will engage with this model and there may be no or few safety mechanisms. Then you have pre-deployment evaluations, which currently is the main breaking point where people have focused on, which is where it gets “deployed into society”.

Dr. Charlotte Stix
But to your point about available affordances, which is also what I tried to say earlier in the presentation, the risk profile of an AI system can change and is very likely to change throughout its deployment until it gets decommissioned. That can mean, for example, it gets more available affordances. If it gets more available affordances, it is more capable, it can do different things, and it just is a different version of that model with different ability.

Dr. Charlotte Stix
What we would say and have said in that paper is you absolutely should re-evaluate what your model can do. A model that doesn’t have access to the internet and a model that has access to the entirety of the internet. I think we would all agree, those are two models capable of different things amd causing different types of risks, importantly.

Dr. Charlotte Stix
Something you could also do and should do is, in your evaluations, pre-deployment, and now that is difficult when you have different providers in the value chain, but I guess that will have to be, unfortunately, repeated. Sometimes, if the downstream provider changes the model significantly, that is if you want to be assured, that is what you have to do, you can evaluate for, “Okay, what are the current capabilities of this model that it has, and what available affordances do we expect it to have in the environment it is deployed into?”

Dr. Charlotte Stix
Changes to available affordances or iterative updates, those should ideally necessitate a new evaluation, or generally speaking, potentially a whole new risk assessment regime.

Imane
Interesting. Thank you.

Dr. Charlotte Stix
Well, we want to be safe, right?

Imane
Yeah, we do. Of course. It’s just that that renders the whole evaluation picture rather more complex than most people would think of. All right, let’s talk about scaling model evaluations. For the evaluations of OpenAI newest model, Apollo has used a combination of automated assessments and manual analysis. To what extent can model evaluations be automated or scaled? What are the trade-offs between automated evaluations and manual assessments of AI behavior?

Dr. Charlotte Stix
Something that we use in our pipeline to automate is some tasks around model graded evaluations, grading the outcome and having models write parts of the evaluation. That goes to say, have models write the blurb that the user says in the evaluation scenariops, to have 10 different blurbs, 100 different blurbs and so on but really quickly. There is, generally speaking, a drive towards automating more of the evaluation pipeline, and there is a big need for evaluations.

Dr. Charlotte Stix
Having said that, of course, to your point, there is a trade-off between speed and rigor. You may want to use models to help you do some of the simple repetitive tasks that don’t really… I want to take this with a grain of salt, that don’t really matter that much. Then have the human evaluator, the domain expert, actually both create the environments and the tasks and review at the very end the elements that are of true importance to the result and quality of that evaluation.

Dr. Charlotte Stix
There is a trade-off as to how much you can right now, reasonably, and more importantly, want to automate in this process. Especially, again to the earlier point, we want to outsource things to AI systems because we don’t want to do the exhausting, annoying tasks. Here, too, we need to be careful. If we run an evaluation on an AI system, we need to be careful which parts we also outsource about the evaluation to another AI system and how capable that AI system is.

Dr. Charlotte Stix
I think it’s just important to bear in mind, while there is a big appeal in automating evaluations because of cost and speed, it still requires a lot of clear thinking about what is sensible to automate in the processes. I can’t speak for other organizations’ evaluations, but that appears to be a general understanding in the community at the moment from what I know.

Imane
Thank you. Okay, we’ve talked about current models, affordances, scaling model evaluations. Let’s talk about the trajectory of risk. I’m asking because sometimes we may not be realizing that we’re having two different conversations. When we call out for certain types of risk, we’re not discussing the current models. We’re discussing the capabilities of future models based on an understanding of what the trajectory of risk might be.

Imane
Your organization, Apollo Research, recently featured the technical documentation of OpenAI new flagship model o1. The documentation describes that Apollo was given pre-deployment access to evaluate the model for risk of, as we’ve seen in the introduction, deceptive behavior. The results shared by OpenAI indicate that the new model has an improved self-knowledge and has the ability to understand the state of mind of a human and purposely deceive humans to achieve its goals.

Imane
What is the trajectory of this behavior according to you? Do you think that’s a technical problem that will be fixed with the next update, or should we expect models to become more and more deceptive?

Dr. Charlotte Stix
I think we should expect models to become more and more capable. Again, by capability, self-reasoning, so having self-awareness and being able to apply that self-awareness, and having theory of minds understanding the other and being able to apply that are just components of capability at large, of being very capable. That’s how we think about some subset of tests for animals understanding self. Can they look into a mirror and identify self and understand they’re looking at themselves?

Dr. Charlotte Stix
A lot of progress and risk is just tied to the more capable AI systems get or the more capable models get, the more likely they are to exhibit certain precursory capabilities up to then more dangerous capabilities. Now, in an ideal world, we would fix this technical problem if we would have a model that has these capabilities, but is entirely aligned to still pursuing the goal that we wanted to pursue.

Dr. Charlotte Stix
Just because a model can do something and exhibits the propensity of doing something, we could still get it in an ideal world, aligned to not doing that thing. Just because I can go around and do bad things outside in the world doesn’t mean I’m doing them.

Dr. Charlotte Stix
The top question in AI safety is alignment. Can we get the model aligned, yes or no? We’re essentially asking more downstream questions, like, let’s assume we’re unsure whether we can get the model aligned, yes or no, or we haven’t quite figured it out. Let’s assume we’re on that trajectory, how can we be sure that the model isn’t deceptively aligned?

Dr. Charlotte Stix
That’s the meta-umbrella of some of the more high-level things that I’ve mentioned here, and where deceptive alignment essentially means, is the model using strategic deception to pursue a goal that is misaligned with the goal that its designers wanted to have. Where strategic deception is, is it trying to cause an untrue belief to get to its goal? In an ideal world, alignment. Until then, We just have to be mindful that with capability progress comes capability progress, good and bad, and we should look out for it and test for it. Test early, test often.

Imane
Do you have an intuition for the trajectory of risk related to that behavior?

Dr. Charlotte Stix
Well, if we look at what we have evaluated, so to the quote that you mentioned from our system card. We also ran evaluations on GPT-4, and we ran pre-deployment evaluations on o1-preview. They already exhibit very different capabilities. That is exactly what you pointed out.

Dr. Charlotte Stix
Compared to GPT-4, this model is capable of self-reasoning and of applied theory of mind in a way the other was not. Just extrapolating from what we have here and, say, that was half a year difference, a couple of months difference, uncertain how much difference that is, like company internal. But this is the next model they chose to put out. You could plausibly extrapolate a pretty rapid progression of capabilities if you were just looking at those two. Sticking with what we know or can infer from that, it’s likely that the next model is, again, much more capable on that trajectory.

Imane
Got it. Thank you. All right. I do want to leave some space for you all to ask some questions. I just have two closing questions for Charlotte. The first one is very broad, so feel free to tell us whatever comes to your mind right now. As an expert, is there anything that you wish would become more common knowledge among policymakers, journalists, general public? Are there any common misconceptions about this topic that you frequently encounter and would like to address?

Dr. Charlotte Stix
Yeah. I mean, maybe just sticking with what we’ve already said, because that covered a lot of different areas. The capabilities, and therefore the associated range of risks models can exhibit is coming at us pretty fast. What we are able to do to both fix it, with alignment or control mechanisms, whatever you want to put in that bucket, and to detect it with say evaluations, red-teaming, part of that, all of this is still nascent. It still has a lot of way to go. I think we need to be mindful of that, that people aren’t thinking, “We have fixed this thing. We will detect anything bad. That is fine. We’ve done our job.”

Dr. Charlotte Stix
I think that is the core lesson here, that the science of evaluations is still nascent. I think that just a general lucidity about the state of the art is pretty important, especially as we have AISIs now, lots of important meetings and voluntary agreements, and so on and so forth, which are all really good and really important. But just to double-check at all steps, what is it actually that we can do right now, what is coming our way, and how do we need to prepare for that and are we well prepared.

Imane
Thanks. That’s very clear. My final question is the one question I ask all of the guests that come to Paris. Thanks again for that. What would you consider an ideal outcome for the upcoming AI Action Summit? How could it further advance international collaboration on AI safety?

Dr. Charlotte Stix
Yes. I’m going to cop out because I actually have thought about this a lot also for my work, and I don’t have a very good answer for it because I’m still not entirely certain I have a very good grasp on what is feasible. I think a lot of this will also depend on the AISI Network meeting outcome. But I think something that is pretty clear to me for what I’ve seen over the last year is that the inclusion of researchers and scientists working on the frontier of this field that governments are talking about and coordinating on is pretty important in ensuring that the conversations and agreements and trajectories remain scientifically grounded and are informed by practitioners.

Dr. Charlotte Stix
I think that is just less about the outcome as you ask, but I would hope that there is a serious consideration for including also those practitioners outside of the policymaker ecosystem, which is obviously really important, the researchers that are doing this work, and for politicians to hear their thoughts on what should be done and can be done.

Imane
Thanks. That’s very clear as well. All right. Thank you, Charlotte.

Dr. Charlotte Stix
Thank you.

Imane
Do you have any questions for Charlotte?

Participant 1
Hi, Charlotte. Thank you very much for your presentation and for explaining the opportunities and challenges of the AI evaluation field. My question is on the French AI Summit, too. Could you say more about what you did for the UK AI Safety Summit? How did you engage with organizers? In general, what are your plans with the French AI Safety Summit? Could you do demos?

Dr. Charlotte Stix
For Bletchley Park, we also did a demo. We were invited, and we had had a demo on loss of control. That was a demo on an AI system that conducts insider trading, which you can also find on our website. I do, in fact, think—I know this sounds a little self-serving, being here as a representative of an evaluation organization—I do think it’s really helpful for policymakers to see demos of technical research.

Dr. Charlotte Stix
I think it’s really helpful to really see what an AI system can and cannot do, because that will put you in the right frame of mind to think about the broader ecosystem this system operates in and questions of how can we govern this? How can we ensure that there are mechanisms in place to control this if it should happen? I don’t know about the French AI Action Summit, particularly because I haven’t been able to get a very good grasp on it. If anyone here does have a very good grasp on it, I would be delighted to hear your opinions. But yes, demos, totally feasible. Lots of organizations would probably be happy to develop them. I think they’re a good science communication tool that works both for scientists and researchers and for governments.

Participant 1
Is that what you did for the UK AI System? Did you also do some evals for the UK AI System?

Dr. Charlotte Stix
Yes. We ran a demonstration on an insider trading AI system that essentially deceived about its insider trading capabilities.

Participant 2
Thank you. And thank you, Charlotte, for your presentation. We were talking about o1 just before. OpenAI talks a lot about AI safety. Can you explain what important measure do they already take when they test their model and what more should we ask them to do?

Dr. Charlotte Stix
I can’t speak to what measures OpenAI takes internally that are not publicly shared measures. They do include external evaluators, for example, to evaluate their models, pre-deployment. They have some internal safety mechanisms that they talk about on their website. I think where the breaking point for me is, at the moment, everything any company does is voluntary. It is great and good, that we’re being asked to evaluate these models, pre-deployment, and we have ownership about how we share the information. We have written this part of the section. That is all really good. That assumes an environment where we all trust each other, and we’re collaborating. As AI companies come under more pressure, and generally speaking, AI models might become more capable and more dangerous, it is unclear that these voluntary arrangements and commitments will stay as they are, because there is no real mechanism to enforcing them at the moment.

Dr. Charlotte Stix
Now, this is very different with the EU’s AI Act. But if we’re speaking about, say, in the US, what do these companies have to do? They have to do very little, if we say “have”, right? They can choose to, and many are choosing to, to their credit, and that is really good, and that is something we should encourage and applaud. But going forward, it is uncertain if that will continue for every company. There might also be companies that do not opt into voluntary commitments, and there are some companies that haven’t yet, but may in the future. I think we can ask companies to do more, certainly, but we also need to focus on just because they are doing something right now, it doesn’t mean that they will be able to or desire to continue doing that in the future.

Participant 3
Can you explain what you’ve been doing for policy work for OpenAI? And also what you believe is their… How do you read the AI Act that you just mentioned, the assessments that they put on big models and all that? What do you know of the state of advancement? Do you believe there are good assessments? Do you believe the companies are on the way to do them? How do you see that?

Dr. Charlotte Stix
Are you asking the second question based on my current understanding as an independent expert?

Participant 3
I would assume you are watching the field because they make some assessments mandatory, the European lawmakers in the AI Act. I would believe as an evaluating company, you would look at that and know what’s in those assessments, what’s the state of advancement of what will be in them, if the companies are working on doing them. I know that from what I understand, the lawmakers would say, “You need to address this and that risk, and that’s it.” Now, all the norms, the discussions about what it really means, who’s going to make it, is in the making now. I’m trying to see more clearly what’s being done because it’s behind doors.

Dr. Charlotte Stix
I think I understand what you’re asking. There are two different things in the AI Act that you’re pointing to. One of them is the conformity assessment for high-risk AI systems, where you have articles that do indeed ask for robustness, security, safety assessments, including evaluations. There is a lot of effort ongoing in JTC-21. That’s a standardization body, the standardization working group in CEN-CENELEC, which is an EU standardization body, to spell out the details of those legal requirements in the conformity assessment bit of the AI Act.

Dr. Charlotte Stix
The other bit you’re pointing to are the requirements for the general purpose AI systems with systemic risk, which have a requirement to run evaluations, testing, and general risk assessment before deployment. There are some other requirements around flagging that, sending information to the EU AI office, so on and so forth. Yes, that process is currently, I wouldn’t say behind closed doors, but it is kicking off next Monday. There is currently nothing I or anyone else could really share with you about this process. The process is called the Code of Practice Drafting Process, and this has been a very transparent and inclusive process where essentially all organizations and researchers that are working on this type of work were able to submit a request to participate in some of the working groups.

Dr. Charlotte Stix
The working groups will build out exactly these high-level norms or requirements that you’ve pointed to, and then, in collaboration with industries or the providers that would have to adhere to them or choose to voluntarily adhere to them, essentially draft a code of practice that would be done within the next nine months. That then becomes the standard best practice and norms that those deployers would choose to adhere to in order to comply with the element of the AI Act that outlines this for general purpose AI systems providers with systemic risk.

Imane
Civil state organizations have also been included in the process, so it is a very much transparent and inclusive process that is going on at the European level.

Participant 4
Thank you. I have a question, because from my knowledge, the methods that we know to evaluate risks have limitations. Those are human limitations. It’s not just the technique but just the capacity we have to understand, to appear, perceive, and understand. I was just wondering what can you tell us about the mechanism you are applying in your company to prevent those limitations to appear in the evaluations you are conducting.

Dr. Charlotte Stix
Sorry, when you say human limitations, could you give a detail? Could you give an example of what you mean there?

Participant 4
For instance, we can have biases when we are selecting the method to evaluate, when we are selecting the requirements to be evaluated. You were talking about alignment between the model and certain criteria, and we can have biases just by saying, “I’m going to prioritize this,” and, “I do not consider this because this is less important.”

Dr. Charlotte Stix
Just for clarity, we’re evaluating models for certain capabilities, we’re not trying to align them with certain goals or ideals. We’re not trying to get them to be unbiased or fair. We’re evaluating specifically for the capability of deception. They’re not really around assessing whether the model does something that is good or not good, right or not right. What we look at more is, say, for example, does the model pursue a certain goal as long as it believes it is pre-deployment in order to get deployed, and then pursue a different goal? We don’t really attach any judgment to the model behavior.

Dr. Charlotte Stix
We’re very much looking at the mechanism and choices the model makes, and whether those match a deceptive behavior where deception here would be, “I actually have a different goal. I want to get deployed, so I’m lying about my different goal,” and then, once it is deployed, we test you, and you’re actually pursuing the other goal. That’s a pretty clear-cut evidence of the model has behaved differently before than it now behaves in deployment. We’re not really saying this is good or bad, or there’s something involved with that.

Imane
Thank you.

Participant 5
Thank you. Hello. Building on what you were saying earlier about the EU AI Act, I wanted to ask, what role do you think the regulator should have to incentivize frontier AI developers to prioritize AI safety? Maybe by enforcing mandatory evaluations at different times of the development, the deployments. And even after, in the use, what role do you think the regulator should have to be efficient?

Dr. Charlotte Stix
I think there’s an element here where you need to be careful to enforce things that you also expect people are able to do, because otherwise, you have all sorts of other problems. One of the coinciding things here is, it is good for a regulator to enforce, say, explainability. Great. Now, let’s fund the field of explainability more, so we actually have enough researchers able to develop methods to do that well. I think there has to be a duality here of what is actually feasible, which brings me to, obviously, laws take time to draft. It could be beneficial to have, in fact, amendments that are a bit more moving with the state of the art, be that, for example, standards are closer to that, precisely to capture this friction point. Now, I think regulators, at least in the EU, have done a lot. It is yet to be seen in which direction that goes, and how this is actually going to be operationalized, how companies are going to interpret the requirements.

Dr. Charlotte Stix
But if I recall correctly, from the survey that they had for the code of practice process, they had this big survey consultation where you could submit your opinions, they did ask questions around, is it sufficient to evaluate X? Should there be something like, what post-deployment monitoring mechanisms would you like to see? I think there’s scope over the coming months once this process kicks off to build out some of the ideas that the regulators have put into this space alongside the current state of the art and the technical progress over the next nine months to get more granular. I don’t want to say add more things because that is not always good. I’m not necessarily condoning that, but just get more granular and more detailed on what it actually is we want providers to do.

Imane
Thank you so much, Charlotte, for being with us today, answering all of these questions and giving the Paris community a clearer picture of what evaluations are about, what type of risk exists right now, what type of risk we can expect, and then in which buckets, I guess, do risks fall.

Dr. Charlotte Stix
Thank you so much for having me. And thanks for all the questions, they’re very good and very on point.

Imane
All right.

Transcript: French (Français)

Imane
Permettez-moi de vous présenter le Dr. Charlotte Stix. Charlotte est responsable de la gouvernance chez Apollo Research, une organisation spécialisée dans l’évaluation de modèles axés sur les capacités d’IA d’avant-garde. Auparavant, elle a dirigé les efforts en matière de politique publique chez OpenAI. Avant cela, elle était également coordinatrice du groupe d’experts de haut niveau de la Commission européenne. Nous sommes très heureux qu’elle ait pu venir aujourd’hui. Charlotte, avant de commencer et de plonger dans la conversation, pouvez-vous nous donner une idée de ce que sont les évaluations et pourquoi elles sont importantes ?

Dr. Charlotte Stix
Tout d’abord, je vous remercie de m’avoir invitée. Je suis très heureuse d’être ici et de rencontrer chacun d’entre vous individuellement à un moment donné après l’introduction et après nos questions ici.

(Omis : Présentation de Charlotte)

Imane
C’est très bien. Merci beaucoup, Charlotte. J’aimerais parler des limites des évaluations. Lorsque nous examinons les modèles actuels, différents risques émergent des systèmes d’IA puissants et les évaluations de modèles sont utilisées, pour identifier et évaluer les capacités potentiellement risquées au sein d’un modèle. Votre organisation, Apollo Research, a appelé à une science des évaluations de modèles parce que l’approche actuelle des évaluations de modèles présente de sérieuses limites en tant qu’approche rigoureuse en termes d’évaluation des risques. Pouvez-vous nous en dire plus sur les limites que vous constatez et sur la manière dont Apollo tente de relever ces défis ?

Dr. Charlotte Stix
Oui, absolument. C’est une très bonne question.

Dr. Charlotte Stix
Essentiellement, l’une des choses sur lesquelles nous communiquons beaucoup est que les gens croient ou il y a une compréhension générale du fait que les évaluations sont suffisamment solides et que nous pouvons nous y fier entièrement. Pour répondre à votre question, ce n’est pas le cas à l’heure actuelle. Il existe un fossé important entre ce que les évaluations sont “censées résoudre pour nous”, lorsqu’il s’agit de régimes de gouvernance, lorsqu’il s’agit de “Oh, si l’évaluation dit que c’est bon, alors nous pouvons déployer le système” et ce à quoi on peut actuellement se fier.

Dr. Charlotte Stix
En bref, le poids accordé aux évaluations est beaucoup plus lourd que ce qu’elles peuvent réellement vous dire sur les capacités et les propensions d’un système d’IA à l’heure actuelle. Notamment parce qu’une évaluation peut vous dire quand nous voyons une capacité dangereuse, mais elle ne peut pas vous dire qu’elle n’existe pas pour le moment. Nous avons l’absence de preuve, mais l’absence de preuve ne nous dit pas vraiment que le modèle ne peut pas le faire, ce n’est pas une preuve d’absence, c’est pourquoi nous avons un flux d’interprétabilité. Si vous voulez en parler, je peux aussi le faire, mais c’est la raison pour laquelle nous avons ce groupe de travail à Apollo.

Dr. Charlotte Stix
En ce qui concerne la science des évaluations, la raison pour laquelle nous avons demandé cela ou pourquoi nous y travaillons est que la plupart des autres régimes d’assurance dans les secteurs critiques ont un processus dans lequel les praticiens travaillent à la frontière, apprennent toutes ces choses, apprennent les normes, apprennent quelles fonctions, apprennent les meilleures pratiques et les partagent entre eux, pour finalement avoir l’opportunité de traduire cela en meilleures pratiques, et nous sommes quelque part sur le point d’y arriver. En fin de compte, cela devrait influencer la réglementation, cela devrait influencer la manière dont les évaluateurs sont tenus de rendre des comptes.

Dr. Charlotte Stix
À l’heure actuelle, il n’existe pas non plus d’examen des personnes ou des organisations qui se qualifient elles-mêmes d’évaluateurs. Il est possible d’évaluer les capacités et les compétences techniques, mais ce n’est pas le cas pour l’instant. À terme, nous voulons que la science des évaluations garantisse que toutes les évaluations que nous faisons sont solides et rigoureuses, que nous pouvons avoir confiance dans les résultats et qu’elles nous donnent une prévisibilité. Elles nous donnent une limite supérieure de la dangerosité du modèle compte tenu des moyens disponibles, et ainsi de suite, ce qui n’est pas le cas actuellement. C’est tout simplement parce que le domaine, en particulier pour les évaluations des capacités dangereuses, est relativement naissant par rapport à l’évaluation de la partialité et de l’équité, même si, bien sûr, il y a encore beaucoup de progrès et de financement nécessaires dans ce domaine.

Imane
Je vous remercie. J’aime beaucoup ce paradigme de l’absence de preuve par rapport à la preuve de l’absence. C’est très utile. Je voudrais parler des modèles actuels et/ou à venir. Quels sont les risques potentiels des systèmes d’IA qui devraient nous inquiéter le plus, et comment les organisations d’évaluation comme Apollo peuvent-elles aider à faire face à ce risque ?

Dr. Charlotte Stix
Je vais dire quelque chose de très simple, mais historiquement, jusqu’à présent, plus de progrès a conduit à des modèles plus performants. C’est une bonne chose, je suppose, car nous voulons des modèles plus performants. L’idée générale est que nous voulons être en mesure d’externaliser plus de choses, des tâches plus difficiles, des tâches inconfortables sur ces modèles. Plus ils sont capables, plus nous pouvons les externaliser. C’est la trajectoire générale, à la hausse comme à la baisse.

Dr. Charlotte Stix
Malheureusement, ou peut-être de manière quelque peu évidente, ces capacités s’accompagnent également de l’inconvénient de voir ces modèles devenir plus performants. Ils pourraient éventuellement être plus capables de se modéliser eux-mêmes dans une évaluation et de savoir qu’ils sont testés, ils pourraient éventuellement être plus capables de modéliser l’utilisateur et le concepteur et leurs désirs. Ils sont tout simplement, pour parler familièrement, plus intelligents. Il en résultera, bien entendu, une augmentation des risques, du besoin de mécanismes de contrôle et de la nécessité d’effectuer des tests.

Dr. Charlotte Stix
Pour vous donner une idée rapide des différents niveaux de vecteurs de menace, je suppose, auxquels nous pensons chez Apollo, et je vous ai montré certains d’entre eux. L’un d’entre eux est la menace interne. Il s’agit en fait de quelqu’un qui peaufine un modèle pour vous mentir efficacement et dissimuler ses actions. Ensuite, il y a l’utilisation abusive, qui correspond à l’exemple que je vous ai montré avec le pirate informatique. Nous commençons alors à avoir un large éventail de capacités précurseurs différentes dans le cadre de la perte de contrôle. Que se passe-t-il si le système d’IA lui-même commence à développer une conscience de soi, une théorie de l’esprit et une réflexion stratégique sur ses objectifs ?

Dr. Charlotte Stix
Il y a une série d’inquiétudes qui se manifestent très rapidement dans notre direction, alors que nous nous dirigeons nous-mêmes vers des systèmes d’intelligence artificielle plus performants et vers plus de progrès. Encore une fois, il s’agit simplement de trouver un équilibre entre le progrès et la sécurité.

Imane
Nous vous remercions. Apollo Research a fait valoir que les moyens d’actions (‘affordances’) d’un système d’IA sont particulièrement importantes, à la fois en termes d’impact du système et, par conséquent, de réglementation à appliquer. Par capacités, vous entendez les outils, les plugins ou les échafaudages qui étendent les capacités du modèle, par exemple en permettant d’exécuter du code ou de faire des recherches sur l’internet. Dans un monde où les utilisateurs en aval d’un modèle peuvent ajouter des fonctionnalités supplémentaires à un système d’IA, comment devrions-nous envisager le rôle des évaluations effectuées sur le modèle en amont ?

Dr. Charlotte Stix
C’est une question très importante, d’autant plus que les évaluations sont à la base de ces cadres de gouvernance. En général, nous pensons à l’évaluation préalable au déploiement. Or, il est possible de l’étendre dans les deux sens. On peut, par exemple, évaluer la conception de la formation. Vous pouvez auditer la conception de la formation. Vous pouvez envisager, par exemple, d’effectuer des évaluations avant le déploiement interne.

Dr. Charlotte Stix
À ce stade, le système est déjà déployé sous une forme ou une autre, si des personnes peuvent voler les poids du modèle, ou si elles peuvent faire quelque chose au modèle, etc. Le personnel utilisera ce modèle et il se peut qu’il n’y ait pas ou peu de mécanismes de sécurité. Il y a ensuite les évaluations préalables au déploiement, qui constituent actuellement le principal point de rupture sur lequel les gens se sont concentrés, c’est-à-dire le moment où le système est “déployé dans la société”.

Dr. Charlotte Stix
Mais pour ce qui est des moyens d’action disponibles, et c’est également ce que j’ai essayé de dire plus tôt dans la présentation, le profil de risque d’un système d’IA peut changer et est très susceptible de changer tout au long de son déploiement jusqu’à ce qu’il soit déclassé. Cela peut signifier, par exemple, qu’il dispose de plus de possibilités. S’il obtient plus de possibilités, il est plus capable, il peut faire des choses différentes, et c’est juste une version différente de ce modèle avec des capacités différentes.

Dr. Charlotte Stix
Ce que nous disons et avons dit dans ce document, c’est qu’il faut absolument réévaluer ce que votre modèle peut faire. Un modèle qui n’a pas accès à l’internet et un modèle qui a accès à l’ensemble de l’internet. Je pense que nous sommes tous d’accord pour dire qu’il s’agit de deux modèles capables de faire des choses différentes et de causer des types de risques différents, ce qui est important.

Dr. Charlotte Stix
Vous pourriez et devriez également procéder à des évaluations avant le déploiement, ce qui est difficile lorsqu’il y a différents fournisseurs dans la chaîne de valeur, mais je suppose qu’il faudra malheureusement répéter cette opération. Parfois, si le fournisseur en aval modifie le modèle de manière significative, si vous voulez être sûr, c’est ce que vous devez faire, vous pouvez évaluer : “D’accord, quelles sont les capacités actuelles de ce modèle, et quelles sont les capacités disponibles que nous attendons dans l’environnement dans lequel il est déployé ?

Dr. Charlotte Stix
Les modifications des capacités disponibles ou les mises à jour itératives devraient idéalement nécessiter une nouvelle évaluation ou, d’une manière générale, un tout nouveau régime d’évaluation des risques.

Imane
C’est intéressant.

Dr. Charlotte Stix
Nous voulons être en sécurité, n’est-ce pas ?

Imane
Oui, bien sûr. Vos réponses démontrent que l’écosystème de l’évaluation est bien plus complexe que la plupart des gens ne l’imaginent. Très bien, parlons des évaluations des modèles de mise à l’échelle. Pour les évaluations du modèle le plus récent de l’OpenAI, Apollo a utilisé une combinaison d’évaluations automatisées et d’analyses manuelles. Dans quelle mesure les évaluations de modèles peuvent-elles être automatisées ou mises à l’échelle ? Quels sont les compromis entre les évaluations automatisées et les évaluations manuelles du comportement de l’IA ?

Dr. Charlotte Stix
Dans notre pipeline, nous automatisons certaines tâches liées aux évaluations notées par des modèles, à la notation des résultats et à la rédaction de certaines parties de l’évaluation par des modèles. Il s’agit par exemple de demander aux modèles d’écrire le texte que l’utilisateur prononce dans les scénarios d’évaluation, d’avoir 10 textes différents, 100 textes différents et ainsi de suite, mais très rapidement. D’une manière générale, la tendance est à l’automatisation d’une plus grande partie du pipeline d’évaluation, et le besoin d’évaluations est important.

Dr. Charlotte Stix
Cela dit, bien sûr, il faut faire un compromis entre la rapidité et la rigueur. Vous pouvez utiliser des modèles pour vous aider à effectuer certaines des tâches répétitives simples qui n’ont pas vraiment Je veux prendre ceci avec un grain de sel, qui n’ont pas vraiment d’importance. Ensuite, l’évaluateur humain, l’expert du domaine, doit créer les environnements et les tâches et examiner à la toute fin les éléments qui sont vraiment importants pour le résultat et la qualité de l’évaluation.

Dr. Charlotte Stix
Il y a un compromis à faire quant au degré d’automatisation que l’on peut raisonnablement et surtout que l’on souhaite dans ce processus. En particulier, pour revenir au point précédent, nous voulons confier des tâches à des systèmes d’IA parce que nous ne voulons pas effectuer les tâches épuisantes et ennuyeuses. Ici aussi, nous devons être prudents. Si nous effectuons une évaluation sur un système d’IA, nous devons faire attention aux parties de l’évaluation que nous confions à un autre système d’IA et à la capacité de ce dernier.

Dr. Charlotte Stix
Je pense qu’il est important de garder à l’esprit que, même si l’automatisation des évaluations présente un grand intérêt en raison de son coût et de sa rapidité, elle nécessite une réflexion approfondie sur ce qu’il est judicieux d’automatiser dans les processus. Je ne peux pas parler des évaluations d’autres organisations, mais d’après ce que je sais, il semble que la communauté soit généralement d’accord sur ce point pour le moment.

Imane
Je vous remercie. D’accord, nous avons parlé des modèles actuels, des moyens d’action des modèles, du développement, de la mise en échelle, de l’évaluation des modèles. Parlons maintenant de la trajectoire du risque. Je pose la question parce que parfois nous ne nous rendons pas compte que nous avons deux conversations différentes. Lorsque nous discutons de certains types de risques, nous ne discutons pas des modèles actuels. Nous discutons des capacités des futurs modèles en fonction de notre compréhension de la trajectoire du risque.

Imane
Votre organisation, Apollo Research, a récemment publié la documentation technique du nouveau modèle phare d’OpenAI, o1. La documentation décrit qu’Apollo a reçu un accès pré-déploiement pour évaluer le risque de comportement trompeur du modèle, comme nous l’avons vu dans l’introduction. Les résultats partagés par OpenAI indiquent que le nouveau modèle a une meilleure connaissance de soi et a la capacité de comprendre l’état d’esprit d’un humain et de le tromper volontairement pour atteindre ses objectifs.

Imane
Quelle est la trajectoire de ce comportement selon vous ? Pensez-vous qu’il s’agit d’un problème technique qui sera corrigé lors de la prochaine mise à jour, ou devons-nous nous attendre à ce que les modèles deviennent de plus en plus trompeurs ?

Dr. Charlotte Stix
Je pense que nous devrions nous attendre à ce que les modèles deviennent de plus en plus performants. Encore une fois, par capacité, on entend raisonnement personnel, c’est-à-dire avoir une conscience de soi et être capable d’appliquer cette conscience de soi, et avoir une théorie de l’esprit qui comprend l’autre et être capable de l’appliquer ne sont que des composantes de la capacité au sens large, de la capacité à être très capable. C’est ainsi que nous envisageons un sous-ensemble de tests de compréhension de soi pour les animaux. Peuvent-ils se regarder dans un miroir, s’identifier et comprendre qu’ils se regardent eux-mêmes ?

Dr. Charlotte Stix
Une grande partie des progrès et des risques est simplement liée au fait que plus les systèmes d’IA sont performants ou plus les modèles sont performants, plus ils sont susceptibles de présenter certaines capacités précurseures jusqu’à des capacités plus dangereuses. Dans un monde idéal, ce problème technique serait résolu si nous disposions d’un modèle doté de ces capacités, mais entièrement axé sur la poursuite de l’objectif que nous voulons atteindre.

Dr. Charlotte Stix
Ce n’est pas parce qu’un modèle peut faire quelque chose et présente la propension à faire quelque chose que nous pourrions encore l’obtenir dans un monde idéal, aligné sur le fait de ne pas faire cette chose. Ce n’est pas parce que je peux me promener et faire de mauvaises choses à l’extérieur que je les fais.

Dr. Charlotte Stix
La principale question qui se pose en matière de sécurité de l’IA est celle de l’alignement. Pouvons-nous aligner le modèle, oui ou non ? Nous posons essentiellement des questions plus en aval, comme, supposons que nous ne soyons pas sûrs de pouvoir aligner le modèle, oui ou non, ou que nous n’ayons pas encore tout à fait compris. Supposons que nous soyons sur cette trajectoire, comment pouvons-nous être sûrs que le modèle n’est pas aligné de manière trompeuse ?

Dr. Charlotte Stix
Il s’agit de la méta-catégorie de certains des éléments de plus haut niveau que j’ai mentionnés ici, et l’alignement trompeur signifie essentiellement que le modèle utilise la tromperie stratégique pour poursuivre un objectif qui n’est pas aligné sur l’objectif que ses concepteurs voulaient avoir. Dans le cas de la tromperie stratégique, il s’agit d’essayer de provoquer une croyance erronée pour atteindre son objectif. Dans un monde idéal, l’alignement. En attendant, nous devons simplement garder à l’esprit que les progrès en matière de capacités s’accompagnent de progrès, bons ou mauvais, et que nous devons y être attentifs et les tester. Tester tôt, tester souvent.

Imane
Avez-vous une idée de la trajectoire du risque lié à ce comportement ?

Dr. Charlotte Stix
Eh bien, si nous regardons ce que nous avons évalué, ainsi que la citation que vous avez mentionnée de notre carte de système. Nous avons également effectué des évaluations sur GPT-4, et nous avons effectué des évaluations de pré-déploiement sur o1-preview. Ils présentent déjà des capacités très différentes. C’est exactement ce que vous avez souligné.

Dr. Charlotte Stix
Comparé au GPT-4, ce modèle est capable de raisonnement autonome et de théorie de l’esprit appliquée, ce qui n’était pas le cas de l’autre modèle. Il suffit d’extrapoler à partir de ce que nous avons ici et, disons, de la différence d’une demi-année, de quelques mois, pour ne pas savoir quelle différence cela représente, comme l’interne de l’entreprise. Mais c’est le modèle suivant qu’ils ont choisi de publier. On pourrait plausiblement extrapoler une progression assez rapide des capacités si l’on s’en tenait à ces deux-là. Si l’on s’en tient à ce que l’on sait ou à ce que l’on peut en déduire, il est probable que le prochain modèle soit, une fois encore, beaucoup plus performant sur cette trajectoire.

Imane
Je vous remercie. Je veux laisser un peu d’espace pour que vous puissiez tous poser des questions. J’ai juste deux questions de clôture pour Charlotte. La première est très générale, alors n’hésitez pas à nous dire tout ce qui vous vient à l’esprit en ce moment. En tant qu’expert, y a-t-il quelque chose que vous souhaiteriez voir devenir plus connu des décideurs politiques, des journalistes, du grand public ? Y a-t-il des idées fausses sur ce sujet que vous rencontrez fréquemment et que vous aimeriez aborder ?

Dr. Charlotte Stix
Oui, je veux dire qu’il faut peut-être s’en tenir à ce que nous avons déjà dit, parce que cela couvre beaucoup de domaines différents. Les capacités, et donc la gamme de risques associés que les modèles peuvent présenter, nous parviennent assez rapidement. Ce que nous sommes en mesure de faire pour y remédier, avec des mécanismes d’alignement ou de contrôle, ou tout ce que vous voulez mettre dans ce panier, et pour les détecter avec des évaluations, des red-teaming, une partie de cela, tout cela est encore naissant. Il reste encore beaucoup de chemin à parcourir. Je pense que nous devons être attentifs à cela, à ce que les gens ne se disent pas : “Nous avons réglé ce problème, nous allons détecter tout ce qui est mauvais. C’est très bien. Nous avons fait notre travail.”

Dr. Charlotte Stix
Je pense qu’il s’agit là de la principale leçon à tirer, à savoir que la science des évaluations est encore naissante. Je pense qu’une lucidité générale sur l’état de l’art est très importante, d’autant plus que nous avons maintenant des AISI , de nombreuses réunions importantes et des accords volontaires, etc. qui sont tous très bons et très importants. Mais il convient de vérifier à chaque étape ce que nous pouvons faire dès à présent, ce qui nous attend, comment nous devons nous y préparer et si nous sommes bien préparés.

Imane
Merci. C’est très clair. Ma dernière question est celle que je pose à tous les invités qui viennent à Paris. Merci encore. Quel serait, selon vous, le résultat idéal du prochain sommet d’action sur l’IA ? Comment pourrait-il faire progresser la collaboration internationale en matière de sécurité de l’IA ?

Dr. Charlotte Stix
Oui. Je vais me dérober parce que j’ai beaucoup réfléchi à cette question dans le cadre de mon travail, et je n’ai pas de bonne réponse parce que je ne suis pas encore tout à fait certaine de bien comprendre ce qui est faisable. Je pense que cela dépendra en grande partie des résultats de la réunion du réseau des AISI . Mais je pense que ce que j’ai vu au cours de l’année écoulée montre clairement que l’inclusion de chercheurs et de scientifiques travaillant à la frontière de ce domaine dont les gouvernements parlent et sur lequel ils se coordonnent est très importante pour garantir que les conversations, les accords et les trajectoires restent scientifiquement fondés et qu’ils sont informés par les praticiens.

Dr. Charlotte Stix
Je pense qu’il s’agit moins d’une question de résultat, comme vous le demandez, mais j’espère que l’on envisage sérieusement d’inclure également les praticiens en dehors de l’écosystème des décideurs politiques, ce qui est évidemment très important, les chercheurs qui font ce travail, et les politiciens pour entendre leurs idées sur ce qui devrait être fait et ce qui peut être fait.

Imane
Merci. C’est également très clair. Merci, Charlotte.

Dr. Charlotte Stix
Merci à vous.

Imane
Avez-vous des questions à poser à Charlotte ?

Participant 1
Bonjour Charlotte. Merci beaucoup pour votre présentation et pour avoir expliqué les opportunités et les défis du domaine de l’évaluation de l’IA. Ma question porte également sur le sommet français sur l’IA. Pourriez-vous nous en dire plus sur ce que vous avez fait pour le sommet britannique sur la sécurité de l’IA ? Comment vous êtes-vous engagée auprès des organisateurs ? De manière générale, quels sont vos projets pour le sommet français sur la sécurité de l’IA ? Pourriez-vous faire des démonstrations ?

Dr. Charlotte Stix
Pour Bletchley Park, nous avons également fait une démonstration. Nous avons été invités et nous avons fait une démonstration sur la perte de contrôle. Il s’agissait d’une démo sur un système d’IA qui effectue des délits d’initiés, que vous pouvez également trouver sur notre site web. En fait, je pense – je sais que cela semble un peu égocentrique, étant ici en tant que représentant d’une organisation d’évaluation – je pense qu’il est vraiment utile pour les décideurs politiques de voir des démonstrations de recherche technique.

Dr. Charlotte Stix
Je pense qu’il est très utile de voir ce qu’un système d’IA peut et ne peut pas faire, car cela vous mettra dans le bon état d’esprit pour réfléchir à l’écosystème plus large dans lequel ce système opère et aux questions suivantes : comment pouvons-nous le gouverner ? Comment pouvons-nous nous assurer qu’il y a des mécanismes en place pour contrôler cela si cela devait se produire ? Je ne sais pas ce qu’il en est du Sommet français d’action sur l’IA, notamment parce que je n’ai pas été en mesure de m’en imprégner. Si quelqu’un ici le connaît très bien, je serais ravi d’avoir son avis. Mais oui, les démonstrations sont tout à fait possibles. De nombreuses organisations seraient probablement ravies de les développer. Je pense qu’il s’agit d’un bon outil de communication scientifique qui fonctionne à la fois pour les scientifiques et les chercheurs et pour les gouvernements.

Participant 1
C’est ce que vous avez fait pour le système d’IA britannique ? Avez-vous également réalisé des évaluations pour le système d’IA britannique ?

Dr. Charlotte Stix
Oui. Nous avons fait une démonstration d’un système d’IA de délit d’initié qui trompait essentiellement sur ses capacités en matière de délit d’initié.

Participant 2
Je vous remercie. Et merci, Charlotte, pour votre présentation. Nous avons parlé d’OpenAI juste avant. OpenAI parle beaucoup de la sécurité de l’IA. Pouvez-vous nous expliquer quelles sont les mesures importantes qu’ils prennent déjà lorsqu’ils testent leur modèle et que devrions-nous leur demander de faire de plus ?

Dr. Charlotte Stix
Je ne peux pas vous dire quelles mesures OpenAI prend en interne et qui ne sont pas partagées par le public. Elle fait appel à des évaluateurs externes, par exemple, pour évaluer ses modèles avant leur déploiement. Elle dispose de certains mécanismes de sécurité internes dont elle parle sur son site web. Je pense que le point de rupture pour moi est qu’à l’heure actuelle, tout ce que fait une entreprise est volontaire. C’est une bonne chose que l’on nous demande d’évaluer ces modèles, avant leur déploiement, et que nous soyons maîtres de la manière dont nous partageons l’information. Nous avons rédigé cette partie de la section. C’est une très bonne chose. Cela suppose un environnement où nous nous faisons tous confiance et où nous collaborons. Étant donné que les entreprises d’IA sont soumises à une pression accrue et que, d’une manière générale, les modèles d’IA pourraient devenir plus performants et plus dangereux, il n’est pas certain que ces accords et engagements volontaires resteront en l’état, car il n’existe pas de véritable mécanisme permettant de les faire respecter à l’heure actuelle.

Dr. Charlotte Stix
Il y a une grande différence avec le règlement européenne sur l’IA. Mais si nous parlons, disons, des États-Unis, que doivent faire ces entreprises ? Elles doivent faire très peu, si nous disons “doivent”, n’est-ce pas ? Elles peuvent choisir de le faire, et nombre d’entre elles le font, ce qui est tout à leur honneur, et c’est une très bonne chose, que nous devrions encourager et applaudir. Mais à l’avenir, il n’est pas certain que cela se poursuive pour toutes les entreprises. Il se peut également que certaines entreprises n’optent pas pour des engagements volontaires, et certaines entreprises ne l’ont pas encore fait, mais pourraient le faire à l’avenir. Je pense que nous pouvons certainement demander aux entreprises d’en faire plus, mais nous devons également nous concentrer sur le fait que ce n’est pas parce qu’elles font quelque chose aujourd’hui qu’elles pourront ou voudront continuer à le faire à l’avenir.

Participant 3
Pouvez-vous nous expliquer ce que vous avez fait en matière de politique pour OpenAI ? Comment lisez-vous le règlement européen sur l’IA que vous venez de mentionner, les évaluations qu’ils font sur les grands modèles et tout cela ? Que savez-vous de l’état d’avancement des travaux ? Pensez-vous qu’il existe de bonnes évaluations ? Pensez-vous que les entreprises sont en passe de les réaliser ? Comment voyez-vous cela ?

Dr. Charlotte Stix
Posez-vous la deuxième question sur la base de ma compréhension actuelle en tant qu’expert indépendant ?

Participant 3
Je suppose que vous surveillez le terrain parce qu’ils rendent certaines évaluations obligatoires, les législateurs européens dans la loi sur l’IA. Je pense qu’en tant qu’entreprise évaluatrice, vous regarderiez cela et sauriez ce que contiennent ces évaluations, quel est l’état d’avancement de ce qu’elles contiendront, si les entreprises travaillent à leur réalisation. D’après ce que j’ai compris, les législateurs diraient : “Vous devez vous attaquer à tel ou tel risque, et c’est tout”. Maintenant, toutes les normes, les discussions sur ce que cela signifie vraiment, sur qui va le faire, sont en cours d’élaboration. J’essaie de voir plus clairement ce qui se fait parce que c’est derrière des portes closes.

Dr. Charlotte Stix
Je pense que je comprends votre question. Il y a deux choses différentes dans le règlement européen sur l’IA que vous mentionnez. L’un d’eux est l’évaluation de la conformité des systèmes d’IA à haut risque, où vous avez des articles qui demandent effectivement des évaluations de robustesse, de sécurité, de sûreté, y compris des évaluations. De nombreux efforts sont en cours au sein du JTC-21. Il s’agit d’un organisme de normalisation, le groupe de travail de normalisation du CEN-CENELEC, qui est un organisme de normalisation de l’UE, pour préciser les détails de ces exigences juridiques dans la partie du règlement européen sur l’IA relative à l’évaluation de la conformité.

Dr. Charlotte Stix
L’autre partie que vous évoquez concerne les exigences relatives aux systèmes d’IA à usage général présentant un risque systémique, qui doivent faire l’objet d’évaluations, de tests et d’une évaluation générale des risques avant d’être déployés. D’autres exigences portent sur le signalement de ces systèmes, l’envoi d’informations au bureau européen de l’IA, etc. Oui, ce processus est actuellement, je ne dirais pas à huis clos, mais il démarre lundi prochain. Pour l’instant, ni moi ni personne d’autre ne peut vraiment vous parler de ce processus. Il s’agit du processus de rédaction du code de bonnes pratiques, qui s’est déroulé de manière très transparente et inclusive, puisque toutes les organisations et tous les chercheurs qui travaillent sur ce type de sujet ont pu demander à participer à certains des groupes de travail.

Dr. Charlotte Stix
Les groupes de travail élaboreront exactement ces normes ou exigences de haut niveau que vous avez mentionnées, puis, en collaboration avec les industries ou les fournisseurs qui devraient y adhérer ou qui choisiraient d’y adhérer volontairement, ils rédigeront essentiellement un code de pratique qui sera mis en place dans les neuf prochains mois. Ce code deviendra alors la meilleure pratique standard et les normes auxquelles ces déployeurs choisiront d’adhérer afin de se conformer à l’élément de la loi sur l’IA qui décrit cela pour les fournisseurs de systèmes d’IA à usage général présentant un risque systémique.

Imane
Les organisations de l’État civil ont également été incluses dans le processus. Il s’agit donc d’un processus très transparent et inclusif qui se déroule au niveau européen.

Participant 4
Je vous remercie. J’ai une question, car d’après mes connaissances, les méthodes que nous connaissons pour évaluer les risques ont des limites. Il s’agit de limites humaines. Il ne s’agit pas seulement de la technique, mais aussi de la capacité que nous avons de comprendre, d’apparaître, de percevoir et de comprendre. Je me demandais simplement ce que vous pouviez nous dire sur le mécanisme que vous appliquez dans votre entreprise pour empêcher ces limites d’apparaître dans les évaluations que vous menez.

Dr. Charlotte Stix
Désolé, lorsque vous parlez de limites humaines, pourriez-vous donner des détails ? Pourriez-vous donner un exemple de ce que vous voulez dire ?

Participant 4
Par exemple, nous pouvons avoir des préjugés lorsque nous sélectionnons la méthode à évaluer, lorsque nous sélectionnons les exigences à évaluer. Vous parliez de l’alignement entre le modèle et certains critères, et nous pouvons avoir des biais simplement en disant : “Je vais donner la priorité à ceci” et “Je ne considère pas ceci parce que c’est moins important”.

Dr. Charlotte Stix
Pour plus de clarté, nous évaluons les modèles pour certaines capacités, nous n’essayons pas de les aligner sur certains objectifs ou idéaux. Nous n’essayons pas de les rendre impartiaux ou justes. Nous évaluons spécifiquement la capacité de tromperie. Il ne s’agit pas vraiment d’évaluer si le modèle fait quelque chose de bien ou de pas bien, de juste ou de faux. Ce que nous examinons davantage, c’est, par exemple, si le modèle poursuit un certain objectif tant qu’il croit qu’il est en phase de pré-déploiement afin d’être déployé, puis s’il poursuit un objectif différent. Nous ne portons pas vraiment de jugement sur le comportement du modèle.

Dr. Charlotte Stix
Nous nous intéressons beaucoup au mécanisme et aux choix que fait le modèle, et nous cherchons à savoir s’ils correspondent à un comportement trompeur, où la tromperie consisterait à dire : “J’ai en fait un objectif différent, je veux être déployé. Je veux être déployé, donc je mens sur mon objectif différent”, puis, une fois déployé, nous vous testons et vous poursuivez en fait l’autre objectif. C’est une preuve assez évidente que le modèle s’est comporté différemment avant qu’il ne soit déployé. Nous ne disons pas vraiment que c’est bon ou mauvais, ou qu’il y a quelque chose d’impliqué là-dedans.

Imane
Nous vous remercions.

Participant 5
Merci de votre attention. Bonjour. Dans le prolongement de ce que vous avez dit tout à l’heure au sujet du règlement européen sur l’IA, je voulais vous demander quel rôle devrait jouer le régulateur pour inciter les développeurs de l’IA à donner la priorité à la sécurité de l’IA. Peut-être en imposant des évaluations obligatoires à différents moments du développement, des déploiements. Et même après, dans l’utilisation, quel rôle pensez-vous que le régulateur devrait avoir pour être efficace ?

Dr. Charlotte Stix
Je pense qu’il faut faire attention à imposer des choses que l’on s’attend également à ce que les gens soient capables de faire, parce que sinon, il y a toutes sortes d’autres problèmes. L’une des coïncidences ici est qu’il est bon pour un régulateur d’imposer, disons, l’explicabilité. C’est très bien. Maintenant, finançons davantage le domaine de l’explicabilité, afin que nous ayons suffisamment de chercheurs capables de développer des méthodes pour bien faire les choses. Je pense qu’il doit y avoir une dualité entre ce qui est réellement faisable et ce qui m’amène à dire qu’il faut du temps pour rédiger des lois. Il pourrait être bénéfique d’avoir, en fait, des amendements qui se rapprochent un peu plus de l’état de l’art, par exemple, des normes, précisément pour saisir ce point de friction. Je pense que les régulateurs, du moins dans l’UE, ont fait beaucoup. Il reste à voir dans quelle direction cela va aller et comment cela va être mis en œuvre, comment les entreprises vont interpréter les exigences.

Dr. Charlotte Stix
Mais si je me souviens bien, dans le cadre de l’enquête menée pour le processus du code de bonnes pratiques, il y avait cette grande enquête de consultation où vous pouviez soumettre vos opinions, ils ont posé des questions du type : est-ce suffisant pour évaluer X ? Devrait-il y avoir quelque chose comme, quels mécanismes de suivi post-déploiement souhaiteriez-vous voir ? Je pense qu’il sera possible, au cours des prochains mois, une fois que ce processus aura démarré, de développer certaines des idées que les régulateurs ont présentées dans cet espace, ainsi que l’état actuel de la technique et les progrès techniques au cours des neuf prochains mois, afin d’obtenir une plus grande granularité. Je ne veux pas dire ajouter plus de choses, car ce n’est pas toujours bon. Je ne l’approuve pas nécessairement, mais il s’agit d’être plus granulaire et plus détaillé sur ce que nous voulons que les fournisseurs fassent.

Imane
Merci beaucoup, Charlotte, d’être avec nous aujourd’hui, d’avoir répondu à toutes ces questions et d’avoir donné à la communauté parisienne une image plus claire de ce que sont les évaluations, du type de risque existant actuellement, du type de risque auquel nous pouvons nous attendre, et des catégories dans lesquelles, je suppose, les risques se situent.

Dr. Charlotte Stix
Merci beaucoup de m’avoir reçu. Et merci pour toutes les questions, elles sont très bonnes et très pertinentes.

Imane
D’accord.

Full title: Request for Comment on BIS Rule for: Establishment of Reporting Requirements for the Development of Advanced Artificial Intelligence Models and Computing Clusters (BIS-2024-0047)

Organization: Future of Life Institute

Point of Contact: Hamza Chaudhry, hamza@futureoflife.blackfin.biz 


About the Organization

The Future of Life Institute (FLI) is an independent nonprofit organization with the goal of reducing large-scale risks and steering transformative technologies to benefit humanity, with a particular focus on artificial intelligence (AI). Since its founding, FLI has taken a leading role in advancing key disciplines such as AI governance, AI safety, and trustworthy and responsible AI, and is widely considered to be among the first civil society actors focused on these issues. FLI was responsible for convening the first major conference on AI safety in Puerto Rico in 2015, and for publishing the Asilomar AI principles, one of the earliest and most influential frameworks for the governance of artificial intelligence, in 2017. FLI is the UN Secretary General’s designated civil society organization for recommendations on the governance of AI and has played a central role in deliberations regarding the EU AI Act’s treatment of risks from AI. FLI has also worked actively within the United States on legislation and executive directives concerning AI. Members of our team have contributed extensive feedback to the development of the NIST AI Risk Management Framework, testified at Senate AI Insight Forums, briefed the House AI Task-force, participated in the UK AI Summit, and connected leading experts in the policy and technical domains to policymakers across the US government.

Executive Summary  

The Future of Life Institute thanks the Bureau of Industry and Security (BIS) for the opportunity to respond to this request for comment (RfC) regarding the BIS rule for the Establishment of Reporting Requirements for the Development of Advanced Artificial Intelligence Models and Computing Clusters, pursuant to the Executive Order on Safe, Secure and Trustworthy AI. To further the efficacy and feasibility of these reporting requirements, FLI proposes the following recommendations:

  1. Expand quarterly reporting requirements to include an up-to-date overview of safety and security practices and prior applicable activities, and require disclosure of unforeseen system behaviors within one week of discovery. This additional information will provide BIS with broader context which may be vital in identifying risks pertinent to the national defense. By having companies report unforeseen system behaviors within one week of discovery, BIS can minimize any delays in reporting unpredictable advancements in dual-use foundation model capabilities.
  2. Require that red-teaming results reported to BIS include anonymized evaluator profiles, anomalous results, and raw data. Including anonymized evaluator profiles and raw data from red-teaming exercises would give BIS a more complete understanding of AI model vulnerabilities and performance, particularly in national defense scenarios.
  3. Establish a confidential reporting mechanism for workers at covered AI companies to report on behaviors which pose national security risks. Implementing a confidential reporting channel for workers at covered AI companies would allow researchers and experts to report risks independently of their employers, adding an additional safeguard against the risk of missed or intentionally obscured findings.
  4. Create a registry of any large aggregation of advanced chips within the United States to ensure that BIS can track compute clusters that may be used for training dual-use AI. Creating a registry to monitor chip aggregations would bolster BIS’s ability to collect information from companies who have the computing hardware necessary to develop dual-use foundation models. By tracking the hardware used in training these systems, BIS will gain a clearer view of where and by whom these potent systems are developed.
  5. Outline a plan for standards that require progressively more direct verification for the chip registry identified above. Promoting the development of tighter verification standards for chips, including on-chip security mechanisms that verify location and usage, would provide BIS with a secondary means of cross-checking information it receives via its notification rule.

Response 

The Future of Life Institute welcomes the recent publication of the BIS rule for the Establishment of Reporting Requirements for the Development of Advanced Artificial Intelligence Models and Computing Clusters (henceforth, “the BIS rule”), pursuant to the Executive Order on Safe, Secure and Trustworthy AI (henceforth, “the Executive Order”). We see this as a promising first step in setting reporting requirements for AI companies developing the most advanced models to ensure effective protection against capabilities that could threaten the rights and safety of the public. The following recommendations in response to the Request for Comment to the BIS rule (RIN 0694-AJ55) are intended to assist BIS in implementing reporting requirements consistent with this goal and with the spirit of the AI Executive Order.

1. Expand quarterly reporting requirements to include an up-to-date overview of safety and security practices and prior applicable activities, and require disclosure of unforeseen system behaviors within one week of discovery.

The BIS rule requires covered entities to report specified information to the BIS on a quarterly basis for “applicable activities” that occurred during that quarter or that are planned to occur in the six months following the quarter. This is a good first step in obtaining vital information from companies related to national defense.

However, this snapshot of information has limited utility without additional context. The leading AI companies have been investing heavily in developing models that, under the Executive Order, are considered to be dual-use foundation models[1]. These companies have, to varying extents, developed safety and security protocols to protect against misuse and unauthorized access, exfiltration, or modification of AI models. Naturally, the progress made by different AI companies in this regard has a significant bearing on any intended applicable activities. Hence, it is vital that BIS receive information concerning existing safety and security measures, alongside any applicable activities undertaken up to the date of notification, to ensure that the spirit of the reporting requirements – to give the government a grounded sense of existing dual-use foundation models, and associated security and safety measures implemented by companies – is achieved.

Accordingly, we recommend that the first round of reporting notifications include a summary of any applicable activities undertaken, and any safety and security measures that are put in place, up to the notification date, rather than only in that specific quarter. Following that, each quarter’s reporting requirement should include a similar overview of changes which have been made to extant safety and security measures pertaining to past, ongoing, and planned applicable activities. This would enable BIS to judge each model’s impact on the national defense in the broader context of the security and safety posture of a covered company, and to maintain a comprehensive understanding of the landscape of American dual-use foundation models pertinent to national security.[2] 

Second, we ask that BIS require notification regarding unforeseen system behaviors within one week of discovery. Unforeseen behaviors in this case are defined as dual-use capabilities that would not be reasonably expected to occur as a result of “activities… planned to occur in the six months following the quarter.”[3] As AI experts have acknowledged, the progress made in capabilities of dual-use foundation models is uneven and unpredictable. Most importantly, key developments pertinent to the national defense may occur in a matter of days, in between the quarterly reports outlined in the BIS rule. In the most extreme circumstances, such developments may require immediate scrutiny or action from BIS. In light of this, we recommend that BIS require covered entities to report these unforeseen behaviors within one week of discovery, in addition to flagging them in the subsequent quarterly report.

2. Require that red-teaming results reported to BIS include anonymized evaluator profiles, anomalous results, and raw data.

We commend the BIS rule’s inclusion of a provision which requires the reporting of results from any red-teaming exercise of a dual-use foundation model, i.e., “the results of any developed dual-use foundation model’s performance in relevant AI red-team testing, including a description of any associated measures the company has taken to meet safety objectives, such as mitigations to improve performance on these red-team tests and strengthen overall model security.” As red-teaming results are vital to understanding the offensive capabilities of AI models, we recommend expanding these reporting requirements to include key additions which can provide BIS with the most comprehensive picture of model performance.

First, we recommend that red-teaming reports include anonymized profiles of the evaluator(s) performing the red-teaming. The quality and credibility of red-teaming results depend on the credentials of the evaluators and their capacity to conduct thorough and rigorous tests. While some of the best nuclear, chemical, and biological experts in the United States may be able to identify the most dangerous CBRN capabilities of dual-use foundation models, this may be less true for evaluators who are less familiar with these fields, or with how such risks may manifest in advanced AI systems. Evaluations conducted primarily by non-experts across domains of concern may give a false sense of security regarding the safety of AI systems as it pertains to the national defense. It is therefore crucial that any results submitted as part of a covered company’s notification are reviewed in the context of the evaluator’s profile.

Evaluators may also have conflicts of interest that bias red-teaming results. For instance, evaluators taking part in red-teaming exercises may be employed by or have other financial and professional ties to the entity whose systems they are red-teaming. Where evaluators stand to gain financially from the financial success of a covered AI company or the release of a covered AI model, this should be disclosed as part of the notification, as it can encourage additional warranted scrutiny when assessing the validity of their red-teaming efforts. As such, it is essential that BIS receive anonymized profiles of evaluators to encourage prioritization of expertise and impartiality, including mitigating both actual and perceived conflicts of interest.

Specifically, we recommend that these anonymized profiles include, at a minimum, the following information:

Second, we recommend that BIS require the reporting of all relevant data related to a red-teaming exercise, including anomalous results. Anomalous results are at times discarded from red-teaming reports because they may be seen as ‘one-off’ occurrences unlikely to recur during common use, or otherwise unrepresentative of the red-teaming exercise generally. However, these anomalous results may reveal capabilities of concern related to the national defense. Given the high stakes associated with identifying vulnerabilities that may pose national defense risks, it is thus vital that BIS have access to this information as part of the red-teaming report, accompanied where appropriate by evaluator descriptions identifying the results as anomalous or outliers. Such anomalous or outlier data may represent edge cases or reveal unexpected system behaviors that are not encapsulated by typical performance metrics, presenting BIS with a holistic summary of the outcome of the red-teaming exercise.

Similarly, we recommend that BIS request reporting of all raw data resulting from red-teaming exercises as part of the notifications. It is standard practice within the AI industry to report mean or median results, which may neglect tail-end results on the margins that are critical for accurate assessment of potential risks to national defense.[4] For risks as severe as, e.g., biological attacks, it is vital that BIS be aware of both mean and tail-end results, as a single incident can have catastrophic impacts. This can best be achieved through examination of the raw data of the red-teaming results. Where the raw data cannot be shared, the notification should, at minimum, reflect the range and distribution of results obtained from the red-teaming exercise.

To ensure that red-teaming exercises are comparable and that their results are interpretable, BIS should define specific categories of information that should be shared by a covered AI company. We recommend that, at a minimum, the following categories be integrated as part of the notification:

3. Establish a confidential reporting mechanism for workers at covered AI companies to report on behaviors which pose national security risks. 

AI researchers and subject-domain experts working at covered AI companies are most likely to first encounter capabilities of concern from dual-use foundation models. It is therefore essential that these individuals have an independent channel for promptly and anonymously communicating this information to BIS.

Relevant individuals may fear reprisal from their respective employers if they communicate this information through internal channels. This is especially likely to be true in cases where employees feel that their findings are not accurately represented in the reports issued by covered AI companies to BIS. By establishing a channel for anonymous disclosure, BIS can safeguard against situations where the company in question is intentionally misleading in its report, and, in more benign cases, when there are reasonable disagreements between different researchers and experts in interpreting or presenting results. Given the nascent and rapidly evolving nature of dual-use foundation model research, companies may prioritize reporting viewpoints that align with their strategic objectives or prior conceptions of projected risks and benefits, potentially overlooking alternative perspectives that could provide valuable insights for the purposes of the national defense.

Additionally, researchers and experts often possess a nuanced understanding of the model’s behavior that may not be fully captured in standardized reports. In these cases, it may be that information not directly covered by the reporting requirements in the text of the rule could be vital to national defense and should be communicated to BIS as soon as possible. Their insights could thus provide valuable context beyond what is required in the quarterly notifications, even in cases where covered companies are compliant with the BIS rule and represent the diversity of interpretation within the organization.  

As discussed in our second recommendation, information of critical interest to national defense may also be initially discovered in between reporting intervals, and this could necessitate urgent notification outpacing traditional corporate processes. Enabling direct notification from researchers could serve as an early warning system for potential issues, allowing for proactive rather than reactive approaches to the national defense.

4. Create a registry of any large aggregation of advanced chips within the United States to ensure that BIS can track compute clusters that may be used for training dual-use AI.   

We appreciate the provision in the BIS rule directing the Bureau to “collect information from U.S. companies that are developing, have plans to develop, or have the computing hardware necessary to develop dual-use foundation models.”[5] This provision is crucial, as it recognizes that the potential to develop dual-use foundation models is not limited to companies actively engaged in such development. The scope of this provision also recognizes that cloud and compute providers are, in many cases, inherently inseparable from developers given that the chips used for training advanced AI systems are often not owned or directly controlled by the companies conducting.

We recommend that BIS leverage this provision to create a registry for large aggregations of advanced chips (or ‘computing clusters’), including registering the individual Graphic Processing Units (GPUs) used within these clusters, within the United States. Creating such a registry would provide BIS with a clear picture of the distributed computational capacity across the country, enabling for more proactive monitoring of potential dual-use foundation model development. We recommend that a large aggregation be defined in accordance with the Executive Order, i.e., as “any computing cluster that has a set of machines physically co-located in a single datacenter, transitively connected by data center networking of over 100 Gbit/s, and having a theoretical maximum computing capacity of 10^20 integer or floating-point operations per second for training AI.”[6][7] 

Pursuant to this provision of the BIS rule, we also recommend that AI developers be required to report what hardware was used to perform any applicable activities (e.g. training covered models). This should be supplemented by a requirement that chip-makers issue reports to BIS on large purchases of chips, including Know Your Customer (KYC) information. Robust KYC procedures should also be implemented for domestic clients using substantial cloud computing resources, including identity verification and a description of the intended use. KYC requirements help create a traceable chain of accountability for high-performance hardware and cloud compute, making it harder to acquire significant computational resources without complying with notification requirements. There is precedent for implementing KYC requirements for foreign purchasers of high-performance computing hardware and cloud computing resources.[8] Receiving similar information concerning domestic purchases would create a more comprehensive account of the distribution of compute clusters sufficient for training dual-use foundation models.

This measure should further obligate US buyers that qualify as covered companies which possess compute clusters to report purchases to BIS in cases where a significant number of chips are sold. With this additional reporting, even resold or redistributed hardware can be tracked, closing potential loopholes in this complex supply chain, especially where the buyers exist outside the United States and may be outside of BIS’s direct authority.[9] 

Similarly, BIS should require US data centers (or ‘cloud providers’) to report the type and number of chips in their data centers. This should include specific models, total numbers, and aggregate computing power in FLOPs. This data should be regularly updated, with a system in place for routine updates and prompt notification of significant hardware acquisitions. Data centers are well-positioned monitor significant spikes in resource usage which may indicate large-scale AI model training, ensuring that BIS is continually informed about the level of compliance with its notification requirements.

5. Outline a plan for developing standards that require progressively more direct verification of the chip registry identified in Recommendation 4, above.   

Dual-use foundation models are becoming more capable, while simultaneously requiring less computational resources for training and operation. This trend is the result of advances in algorithmic efficiencies which enable greater capabilities to be attained with less hardware.

This verification can be achieved through a combination of hardware and software mechanisms which are built directly into the chips themselves. one can identify the general location of a chip. This mechanism would allow stakeholders, like the capability for which exists in much of the current and anticipated generations of AI-relevant hardware.[10]

Some chips also offer enhanced features, such as ‘secure boot’, which ensures that only authorized software can run on the hardware. They may also offer secure enclaves, which protect the privacy of the data contained in the chips. Standards which implement this type of software logic could leverage extant hardware features at a relatively low cost. There are additional on-chip governance measures which could track a chip’s operational history, enabling the identification of anomalous or unauthorized usage. If the right hardware and firmware mechanisms are in place, future measures like licensing can be used to control the use of hardware and software by granting permissions for approved uses; these can then be revoked or allowed to expire if users engage in unapproved or unreported activities.[11]

Data obtained as a result of these on-chip governance verification measures could be checked against the reporting required of US companies as specified in the Executive Order and pursuant to the BIS rule. While our fourth recommendation would constitute the ‘first cross-check’ between AI developers, chip-makers, and cloud operators, this chip-based verification approach would be a more robust ‘second cross-check’ on large chip aggregations.

Chips are unlike many other materials of US strategic interest, in that they are physical objects, which makes them difficult to copy, but they are also operational. Once manufactured, chips are not just passive objects; they can be programmed and reprogrammed to perform different functions. The physical nature of chips as a resource for training powerful dual-use foundation models also means that little can be done to prevent chips being resold several times such that they are no longer traceable via export controls. On-chip mechanisms are a much more effective means of monitoring how compute is used and where it accumulates. There is a well-established precedent for using security features in devices like phones and laptops to enforce usage terms, allowing unauthorized users to be blocked and enabling remote activation or deactivation of the device. These mechanisms could similarly be applied to computer chips to control their use and enforce compliance with regulations.

Given these considerations, we recommend the following standards be implemented to strengthen the chip registry and BIS’s ability to receive information which is crucial for the national defense:

Furthermore, there are several additional actions that could be taken by BIS to further its mission of furthering the national security. While we recognize that the following approaches may not fall within the specific authorities granted to BIS by this Executive Order, we recommend that BIS consider implementing these measures independently to complement and reinforce the reporting mechanisms established by the new rule:

Promoting the development of on-chip governance standards will bolster BIS’s role in advancing national security and provide a secure, independent, and objective means of verifying the information it receives through its new rule on notifications.

Conclusion 

The Future of Life Institute would like to once again thank BIS for giving civil society the opportunity to comment on its new rule establishing reporting requirements for AI models and computing clusters.

We have made five key recommendations to enhance the new rule. These include expanding quarterly reporting requirements, requiring detailed red-teaming results, establishing a confidential line of communication for employees to make disclosures, creating a registry for large chip aggregations, and developing standards for direct verification of the chip registry. By implementing these recommendations, BIS would gain a clearer view of where and by whom powerful AI systems are developed, enhancing its ability to identify and address potential national security risks associated with dual-use foundation models or large compute aggregations.


[1] We define dual-use foundation models in accordance with the Executive Order (s.3(k)). This would also include any model that was trained using a quantity of computing power greater than 1026 integer or floating-point operations in accordance with s.4.2, which extends reporting requirements to these models as well.

[2] For the remainder of this RfC, a covered AI company is shorthand for a company developing a dual-use foundation model.

[3] BIS Rule. Page 31. https://www.federalregister.gov/d/2024-20529/p-31

[4] For instance, probing a dual-use foundation model, 100 experts in biology may on average only be able to elicit capabilities that provide 10% uplift for procuring and developing a biological weapon compared to other methods. However, it may be that 10 of the 100 evaluator-experts discovered capabilities that provide 20% uplift while 10 other experts yielded results that provided no uplift at all.  

[5] Emphasis added. BIS Rule. Page 20. https://www.federalregister.gov/d/2024-20529/p-20

[6] AI Executive Order. Page 70. https://www.federalregister.gov/d/2023-24283/p-70

[7] While we embrace the definition outlined in the Executive Order for the purposes of this RfC, this definition also has limitations. For instance, using this level of computational capacity, one could train a biological system that would be within scope of the Executive Order within 1000 seconds, and one could train a 10^26 FLOP system in 11 days. It may be that a more appropriate threshold is 10^19 as opposed to 10^20, given that this is (to closest order of magnitude) what it would take to train a 10^26 FLOP system in one quarter.

[8] For example, BIS proposed a rule under E.O. 13984 to require U.S. IaaS providers to implement a KYC program to verify the identities of foreign persons accessing their services, aiming to prevent misuse by malicious cyber actors. See EO. 13984. https://www.federalregister.gov/documents/2021/01/25/2021-01714/taking-additional-steps-to-address-the-national-emergency-with-respect-to-significant-malicious

[9] This may be a more immediate concern than estimated by experts previously, due to the surprisingly short duty-cycle for cutting edge chips in the largest computing clusters. As is publicly reported, the next generation of chips will be released by chip-maker NVIDIA within the coming months. As the biggest AI companies – those most likely to be covered by the BIS rule – purchase these chips, they will likely sell the current state of the art chips to other sellers, creating a large gap in enforcement for reporting.

[10] Location Verification for AI Chips. Institute for AI Policy and Strategy. https://www.iaps.ai/research/location-verification-for-ai-chips

[11] Secure, Governale Chips. Center for New American Security. https://www.cnas.org/publications/reports/secure-governable-chips

[12] Oversight for Frontier AI through a Know-Your-Customer Scheme for Compute Providers. Center for the Governance of AI. https://www.governance.ai/research-paper/oversight-for-frontier-ai-through-kyc-scheme-for-compute-providers

[13] Hardware security overview. Apple Security Platform. https://support.apple.com/guide/security/hardware-security-overview/

The Future of Life Institute would like to congratulate John Hopfield and Geoffrey Hinton receiving the 2024 Nobel Prize in Physics, and Demis Hassabis, David Baker, and John Jumper for winning the 2024 Nobel Prize in Chemistry

Recognized for their contributions to creating the foundations of today’s AI models, both Hinton and Hopfield continue to warn the public about the risks of AI development.

Please see below for a statement from FLI Executive Director Anthony Aguirre – who also holds the Faggin Presidential Chair for the Physics of Information at the University of California, Santa Cruz – on the recipients: 

“The Future of Life Institute congratulates Geoffrey Hinton and John Hopfield for their well-deserved 2024 Nobel Prize in Physics. Their work applying core physics concepts to neural networks to create machine learning paved the way for modern AI, which promises to transform our society, economy, and world. 

“But just as importantly, they have been critical in sounding the alarm around the potential risks of AI. In March 2023, Hopfield signed our Open Letter calling for a pause on the training of the most powerful AI systems, until they could be safely and securely developed and deployed. Two months later, Hinton left his role at Google to speak freely on the dangers of increasingly powerful AI, the growing risk of losing human control, and the urgent need to take action to prevent disaster. Hinton and Hopfield embody the potential of AI to grant incredible benefits to all of humanity – but only if the technology is developed safely and securely. Without meaningful accountability and oversight, AI threatens to cause irrevocable harm to millions of lives and livelihoods.”

FLI would also like to congratulate Demis Hassabis, Professor David Baker, and Dr. John Jumper, who were awarded the 2024 Nobel Prize in Chemistry. Their groundbreaking work using AI to decode the structure of proteins and create new ones was a gamechanger in drug discovery and other areas. Crucially, innovations like AlphaFold reveal the incredible benefits if we develop narrow AI and existing general-purpose systems safely to solve specific problems – instead of racing to deploy increasingly powerful and risky models that we don’t understand.”

The spread of AI-generated videos and images ahead of the U.S. election has reemphasized the urgent threat deepfakes present to our society, and the recent bipartisan push to amend Section 230 reveals the growing desire for legal accountability across the supply chain for deepfake sexual abuse material.

Echoing this, members of the U.S. House of Representatives recently made a bipartisan call for regulations that place legal liability on the developers of the AI systems used to produce deepfakes. In the House, Rep. LaTurner (R-KS) shared,

Anyone with a laptop can create realistic images, videos, and audio clips depicting events that did not occur. This poses a threat to our security and could harm our election systems, and developers should be held liable for the harms they cause.

This was further emphasized by Rep. Morelle (D-NY):

Deepfakes are abhorrent, and the developers of these systems should be held liable“.

The public wants regulation on deepfakes

Polling shows that such measures are supported by the general public:

Response from the Campaign to Ban Deepfakes

The Campaign to Ban Deepfakes is leading this effort, and counts among its members the National Organization for Women (NOW), SAG-AFTRA, Plan International, Future of Life Institute (FLI), and many more. On July 30, NOW hosted a media virtual roundtable on the topic (recording; password: H223.6&&).

One of its supporters, the actor, author and activist Ashley Judd, made the following statement:

“Deepfakes, and the AI tools that are used to create them, are causing incredible harm 24 hours a day – and with alarming speed they’re becoming faster, cheaper, and easier to produce. With nonconsensual pornography making up 95% of deepfakes, and 99% of those sexually explicit deepfakes violating the likenesses of girls and women, the deepfakes problem reflects the ease with which society sexualizes girls and women, discarding our humanity and impedes the goal of sex and gender equality.

“The only effective way to tackle this rampant abuse is to deliver meaningful liability for the AI companies that develop and deploy these technologies – in addition to deepfake creators and distributors. We need to pass laws that guarantee accountability now. It is encouraging to see U.S. lawmakers from across the political spectrum and stakeholders from different backgrounds come together for discussions about the welfare of girls and women. As an activist working towards women’s equality, an actor, and a victim of nonconsensual deepfakes myself, I’m all in to support the Ban Deepfakes campaign and its efforts to disrupt this enormously damaging supply chain.”

Max Tegmark, President of FLI, one of the founding members of the campaign, released the following statement:

“This bipartisan discussion on deepfakes is an important step in halting their growing threat to society. I was pleased to see that several members of the House agree that any legislation addressing deepfakes must criminalize and create liability at each stage of production and dissemination. The only way to combat this issue is for governments to hold the whole supply chain responsible. This includes requiring AI developers and deployers to show they have taken steps to prevent deepfakes in their systems, and allowing people harmed by deepfakes to sue those responsible for damages.

“There is broad bipartisan support to regulate deepfakes in the U.S. and we need to pass legislation urgently. Deepfakes that spread non-consensual sexual imagery, commit fraud, and circulate misinformation continue to proliferate. This issue is made worse as AI capabilities grow and make the creation and distribution of deepfakes increasingly fast, cheap, and easy. I’m encouraged by these members’ floor speeches and look forward to the House passing legislation that will stem this rapidly accelerating threat to our society.”

Please get in touch if you have any questions or would like to speak with Max Tegmark, National Organization for Women President Christian Nunes, Encode Justice President Sneha Revanur, or the offices of the Reps.

Ban Deepfakes is a diverse coalition of organizations and individuals committed to disrupting the deepfake supply chain at every stage. Supporters include preeminent figures like author Steven Pinker and actor Ashley Judd, and members include the National Organization for Women, Equality Now, SAG-AFTRA, Plan International, FLI, and many more. They are calling on U.S. lawmakers to deliver meaningful accountability and liability for everyone involved in the production and distribution of deepfakes, including AI system developers and deployers.

CAMPBELL, CA – Future of Life Institute (FLI) Executive Director Anthony Aguirre today released the following statement after Governor Newsom’s veto of California’s landmark AI regulatory bill, SB 1047:

“The governor’s veto of Senate Bill 1047 is incredibly disappointing. SB1047 simply aimed to hold the largest AI companies accountable to the voluntary commitments they have already made: to perform basic safety testing on these massively powerful systems in order to prevent catastrophic accidents or misuse. This veto leaves Californians vulnerable to the considerable risks caused by the rapid, unregulated development of advanced AI systems, including cyberattacks, autonomous crimes, and biological weapons. Furthermore, it imperils the incredible benefits that AI innovation can bring, by massively increasing the likelihood of an AI disaster which could shutter the industry. 

“The furious lobbying against the bill can only be reasonably interpreted in one way: these companies believe they should play by their own rules and be accountable to no one. This veto only reinforces that belief. Now is the time for legislation at the state, federal, and global levels to hold Big Tech to their commitments and ensure AI systems are developed safely and responsibly.”

Despite overwhelming bipartisan support from the general public, civil society groups, lawmakers and AI experts, this bill was met with intense pushback from Big Tech and its representatives in academia. And even after substantive amendments to accommodate reasonable concerns, the industry continued to push out-of-date and easily falsifiable claims — often with clear financial and personal incentives.

We have created an SB1047 Explainer to address FAQs and debunk some common myths.

Crosspost: This is a crosspost from Mark Brakel’s Substack

On Wednesday, Ivanka Trump reshared Leopold Aschenbrenner’s influential Situational Awareness essay.

Aschenbrenner’s essay has taken the AI policy bubble by storm. Aschenbrenner argued that artificial general intelligence (AGI) will be built soon, that we can expect the US government to take control of AGI development by 2028, and that the U.S. should step up its efforts to beat China. According to Asschenbrenner, the stakes are high: “The torch of liberty will not survive Xi getting AGI first.” In my view, the U.S. national interest is much better served by a cooperative than an adversarial strategy towards China.

AGI may not be controllable 

Aschenbrenner’s recommendation that the U.S. engage in an AGI arms race with China only makes sense if this is a race that can actually be won. Aschenbrenner himself notes that “reliably controlling AI systems much smarter than we are is an unsolved technical problem” and that “failure could easily be catastrophic.” The CEOs of the major corporations currently developing AGI, Sam Altman at OpenAI, Demis Hassabis at Google DeepMind, and Dario Amodei at Anthropic, all believe that their technology poses an existential threat to humanity (rather than just to China) and leading AI researchers such as Yoshua Bengio, Geoffrey Hinton and Stuart Russell have expressed deep scepticism about our ability to reliably control AGI systems. If there is some probability that the U.S. racing towards AGI wipes out all of humanity, including all Americans, then it might be more sensible for the U.S. government to pursue global cooperation around limits to AI development. 

China will likely understand its national interest

You may however believe that the existential risk is small enough to be outweighed by the risk of permanent Chinese (technological) dominance or, like Aschenbrenner, feel very bullish on a breakthrough in our understanding of what it would take to control superhuman AI systems. Still, I don’t think this justifies an AI arms race.

In Aschenbrenner’s words: “superintelligence will be the most powerful technology—and most powerful weapon—mankind has ever developed. It will give a decisive military advantage, perhaps comparable only with nuclear weapons.” Clearly, if any of the existing superpowers believe that a rival power is about to gain a “decisive military advantage” over them, this will be hugely destabilising to the international system. To stave off subjugation by the United States, China and Russia will likely initiate preemptive military action to prevent a scenario in which the U.S. becomes the forever hegemon. An AGI arms race could push us to the brink of nuclear war, and this would seem a very strong argument for global cooperation over frenzied competition. 

The view from Beijing

It takes two to tango and pursuing cooperation on AI is foolish if China will race ahead regardless. China certainly has its own equivalents of Marc Andreessen and Yann LeCun – the West’s loud and financially motivated evangelists of unbounded AI development. The Economist recently identified Zhu Songchun, the director of a state-backed programme to develop AGI, and science and technology minister Yin Hejun as two leading voices pushing back against any restraint.

Nevertheless, more safety-minded voices seem to be winning out for now. The summer saw the official launch of a Chinese AI Safety Network, with support from major universities in Beijing and Shanghai. Andrew Yao, the only Chinese person to ever have won the Turing award for advances in computer science, Xue Lan, the president of the state’s expert committee on AI governance, and a former president of Chinese tech company Baidu have all warned that reckless AI development can threaten humanity. In June, China’s President Xi sent Andrew Yao a letter with praise for his work and in July the President put AI risk front and centre at a meeting of the party’s Central Committee. 

Cold shoulder? 

November last year was particularly promising for US-China cooperation on AI. On the first day of the month, US and Chinese representatives quite literally shared a stage at the Bletchley Park AI Safety Summit in the UK. Two weeks later, Presidents Biden and Xi met at a summit in San Francisco and agreed to open a bilateral channel on AI issues specifically. This nascent but fragile coordination was further demonstrated at a subsequent AI Safety conference in South Korea in May.

Clearly, China and the US are at odds over many issues including the future of Taiwan, industrial policy and export controls. Some issues, such as climate change, nuclear security and AI safety, cannot however be solved within geopolitical blocs. They demand a global response. The moves that nations make over the coming months may well determine the global AI trajectory; towards an AI arms race with a deeply uncertain outcome or towards some form of shared risk management.

The West (including the US) has two opportunities to keep China at the table and to empower the safety-minded voices in Beijing: the November San Francisco meeting between AI Safety Institutes and the Paris AI Action Summit in February. A substantial part of both summits will deal with safety benchmarks, evaluations, and company obligations. Whilst some of these issues are undoubtedly political, others are not. Ensuring that AI systems remain under human control is as much a Chinese as a Western concern and a meeting of the safety institutes in particular can be a neutral space where technical experts meet across the geopolitical divide. 

Yaqub Chaudhary is a Visiting Scholar at the Leverhulme Centre for the Future of Intelligence at the University of Cambridge.

At the time of writing, several leading AI companies have demonstrated new conversational agents that can generate intelligible fluent speech with a wide range of vocal styles. This alone marks the present moment as unprecedented in human history, since it is now possible for almost anyone to enter into a two-way dialogue with a non-human entity and reasonably expect intelligible responses, for as long as the AI system can maintain the illusion of having a mind. Yet, this is just one of dozens of areas that have advanced rapidly in less than half a decade because of research and development centred around the field of AI. Even if the performance of such AI systems and agents does not continue to improve according to the prognostications of the most enthusiastic (or fearful) proponents of superintelligence and artificial general intelligence (AGI), the rapid development and deployment of AI, and its insertion into more and more areas of society, is part of a comprehensive digital mediatisation of reality and digitalisation of the human condition that needs to be better understood. Ultimately, these achievements are problematically exaggerated to proclaim visions for the future of a paradisiacal world with AI on the grandest scale, both for the foreseeable future, and extending to the distant future of the cosmos.

An initial problem in addressing the question of a positive future in a world with ‘AI’ is that the term has become unclear and is deployed to describe a wide variety of computational system in ways that disguise its continuity with other ideologically laden technological projects. From one perspective, the mathematical and computational techniques that underlie certain technical applications of AI do provide effective ways to operationalise and optimise the use of computational resources. From another perspective, the techniques underlying AI may entail a mode of knowing and ordering the world that creates or accentuates asymmetries of power.  

The wider challenge of addressing the question of a positive future in a world with AI is that it entails tacit acceptance of the imposition of AI as a foregone conclusion. The motto “Accelerate or Die,” has been adopted by a Silicon Valley subculture dedicated to the re-organisation of society through technological acceleration, which leaves the impression that humanity is being held hostage to the imposition of AI and all that is left is to negotiate the terms of the surrender. Presently, leading technology companies are expending tremendous resources to position conversational AI agents as digital “assistants” in every area of life, which should raise the question of whose interests and purposes AI “assistants” are designed to serve. The short history of contemporary AI provides numerous examples of how advances in AI are rapidly integrated in the computational infrastructures of corporate digital empires to provide new strategic ways to amplify the intentions of hidden actors, breach the sanctity of our minds, ensnare our attention, and twist our will toward the highest bidders. Solely affirmative responses to the inevitability of AI can thus serve to legitimise the machinations of power that operate through AI, as both a category of technical artefact that infiltrates our lives and a rhetorical device that infiltrates our discourse. 

The scale and scope of the ambitions driving the development of substitute intelligence are epitomised by the BANG acronym, which signifies the re-assembly of the world through the convergence of bits, atoms, neurons, and genes, leading to an explosion of machine intelligence, consciousness, and life in the cosmos. It is in this context that AI is central to what may be described as a metaphysical research programme, in which the promises, themes, concepts and relations of religious worldviews that have been exiled from the despiritualised secular scientistic modern worldview, are re-instantiated in the form of new comprehensive narratives of the origin, past, present, and future of humanity, life, and the universe, in exclusively scientific terms.

Perspectives of Traditional Religions on Positive AI Futures

An introduction to our work to support the perspectives of traditional religions on human futures with AI.

According to Islamic teachings, guidance on the perennial questions about the nature of reality, and our purpose and destiny is communicated to messengers and prophets in the form of revelation. Revelation thus provides a source of knowledge beyond the material realm that is essential for our guidance, cosmic “situational awareness”, and spiritual “alignment” to God’s will and purpose for creation.

The Qur’an, which Muslims believe to be divine revelation from God to the Prophet Muhammad ﷺ for all humanity, contains numerous stories and parables that serve as instructive archetypical examples for reflection and guidance on the individual and collective journey to God. For example, events from the life of the Prophet Moses (Musa, in the Qur’an), who is distinguished as being the prophet to whom God spoke directly, are recounted throughout the Qur’an. One period in the life of Moses is particularly relevant to diagnose the spiritual condition of the present age of artificiality, simulation, and simulacra, in which our world is once again becoming enchanted and mystified by increasingly agentic AI systems.

In the Qur’anic narrative of Moses, the archetypal figure of Pharoah is identified as a tyrannical ruler, who divided and subjugated his people, organised society in the service of his immortality project, and ultimately claimed to be God himself. Within the Pharaonic system of power, the High Priests were the sorcerers, who wielded the illusion of being able to cast life into their staffs and ropes by appearing to transform them into living serpents and pythons. Such sorcery was enough to convince ancient people of the power of Pharoah and his chiefs, and uphold his oppressive regime. Today, AI researchers and engineers are in the position of the ancient sorcerers, extending their will expressed in code throughout the world from the command line, and imbuing computational artefacts with attributes and behaviours that confound clear distinctions between the human and the inhuman, the living and the lifeless, and the sentient and the insentient. According to Islamic teachings, the phenomenon of life belongs to God, as stated in the Qur’an, “Verily it is We who give Life and Death; and unto Us is the Final Goal,” (Qur’an, 50:43). In the end, it is the Divine Command that installs life into the staff of Moses and reveals the falsity of the sorcerers’ fabrications.

In both cases, of ancient sorcery and contemporary AI, there is a simulation of agency rendered into inanimate objects, and at the same time, a dissimulation of the existence of a contrivance of underlying methods and mechanisms, not simply by concealment, but by rendering the underlying apparatuses absent1. In the case of AI, it is the vast computational infrastructures that enable and sustain the systems, as well as globalised networks of surrogate2 human labour providing essential input for machine learning systems that are rendered absent. Such AI systems, which contain only hollow conceptions of intelligence and understanding, are wielded as powerful evidence to support the metaphysical claims of a megalomaniacal cosmology that assumes the cosmic burden of populating the cosmos with computational instantiations of “life”, which eliminate the essential distinction between the living and the dead. In the midst of this contrived ambiguity, the demagogues of AI are capitalising on exaggerated claims about the true nature of such systems to gradually deceive people for their own purposes, as they assimilate more of the physical world into an artificial world that is under the control of narrow interests and that is designed to suit the capacities and limitations of machines and software entities, rather than humans.

Another event in the life of Moses mentioned in the Qur’an illustrates a different signification of the artificial in relation to humanity. This is the well-known incident of the golden calf, taking place after the people of Moses had escaped the tyranny of Pharaoh. During this incident, as recounted in the Qur’an, there is also mention of a mysterious individual known as al-Samiri. While Moses was away in seclusion atop a mountain, al-Samiri convinced the people to create an effigy of a calf. He then claimed that God and Moses had abandoned them, and declared the calf to be their God and the God of Moses. Later, when Moses demanded an explanation for leading the people astray, al-Samiri vaingloriously claimed to have been hailed by some supernatural phenomenon that inspired him to create the model of the calf. This golden calf seemed to utter a lowing sound, which was enough for the people to blindly follow al-Samiri and take it as an idol, to the extent that the love of the calf was imbibed into their hearts.

In order to form the effigy of the calf, Al-Samiri convinced the people to melt down the gold valuables they were carrying, in effect destroying their cultural artefacts and heirlooms, based on the whims of a singular individual. Now, it is the cultural, artistic, and intellectual productions of humanity, the treasured artefacts of modern society, which are being de-materialised into undifferentiated amalgamations of tokens, and re-cast into the LLMs, foundation models, and generative AI systems costing billions of dollars to produce. Like al-Samiri and the people he led astray, the empty vocalisations and autonomous operations of these systems is leading to the veneration of the machine by those who have absorbed the fantasy of AGI into their hearts3 and those intoxicated by the allure of acceleration.

A positive future for a world with AI from an Islamic perspective is one in which the option to say no to the integration of AI in infrastructures, devices, products, services, institutions, and elsewhere, is preserved at individual and collective levels, without any disadvantage to those who refuse the total substitution of the artificial in the place of the natural, and the imposition of forms of AI that may be wielded as instruments of power or deified as idols to which everyone must submit.

Contact: m.yaqub.c@gmail.com


  1. Smith, W. (2015). Technologies of stage magic: Simulation and dissimulation. Social Studies of Science, 45(3), 319–343. ↩︎
  2. Atanasoski, N., & Vora, K. (2019). Surrogate Humanity: Race, Robots, and the Politics of Technological Futures. Duke University Press ↩︎
  3. Karen Hao and Charlie Warzel, ‘Inside the Chaos at OpenAI’, The Atlantic (blog), 20 November 2023 ↩︎

US-China AI Governance Research Priorities

Research on “US-China AI Governance” refers to work that explores risk reduction in US-China relations in AI. This includes:

  1. Research that investigates the role of global governance in reducing the risks associated with a US-China AI competition. In particular, this research stream refers to the conditions to mitigate AI race dynamics in which countries cut corners on safety to develop and deploy advanced AI systems (Cave and ÓhÉigeartaigh 2018). Examples include:
    • Analyzing the political factors that would shape the effectiveness of US-China cooperation on AI governance. One possible approach translates lessons learned from great power cooperation in other technologies to global AI governance. Related work has found that the strength of transnational scientist-to-scientist interactions played an important role in whether the US shared nuclear safety and security technologies with its rivals (Ding 2024).
    • Exploring the particular characteristics of AI technology that make it more or less amenable for US-China engagement. For example, some researchers have argued that arms control agreements for military AI systems will be unlikely because these systems will likely be indistinguishable from their civilian counterparts (Vaynman and Volpe 2023).
  2. Research that explores the current extent of US-China collaboration on AI governance issues, with a focus on specific areas and institutions that could benefit from strengthened interaction. Examples include:
    • Research that improves understanding of China and/or the U.S.’s approach to AI risk management, with an eye toward key differences and commonalities that may affect international cooperation.
    • Assessing the institutional designs that would be most suitable for US-China cooperation on AI governance (and other interventions beyond direct engagement). Potential mechanisms for further research include: international norms, “soft law” agreements, treaties, standards, and new institutions (Dafoe 2018). Researchers could also study the role that third countries can play in reducing risks that stem from current and future US-China competition in AI.

Purpose and eligibility

The purpose of the fellowship is to fund talented students throughout their PhDs to work on US-China AI governance research. To be eligible, applicants should either be graduate students or be applying to PhD programs. Funding is conditional on being accepted to a PhD program, working on US-China AI governance research, and having an advisor who can confirm to us that they will support the student’s research. If a student changes advisor, this confirmation is required from the new advisor for the fellowship to continue.

An application from a current graduate student must address in the Research Statement how this fellowship would enable their US-China AI governance research, either by letting them continue such research when no other funding is currently available, or by allowing them to switch into this area.

Fellows are expected to participate in annual workshops and other activities that will be organized to help them interact and network with other researchers in the field.

Continued funding is contingent on continued eligibility, demonstrated by submitting a brief (~1 page) progress report by July 1st of each year.

There are no geographic limitations on applicants or host universities. We welcome applicants from a diverse range of backgrounds, and we particularly encourage applications from women and underrepresented minorities.

Application process

Applicants will submit a curriculum vitae, a research statement, and the names and email addresses of up to three referees, who will be sent a link where they can submit letters of recommendation and answer a brief questionnaire about the applicant. Applicants are encouraged but not required to submit their GRE scores using our DI code: 3234.

The research statement can be up to 3 pages long, not including references, outlining applicants’ current plans for doing US-China AI governance research during their PhD. It should include the applicant’s reason for interest in US-China AI governance research. For current PhD students, it should also detail why no existing funding arrangements allow work on US-China AI governance research.

Timing for Fall 2024

The deadline for application is November 20, 2024 at 11:59 pm ET. After an initial round of deliberation, those applicants who make the short-list will then go through an interview process before fellows are finalized. Offers will be made no later than the end of March 2025.

David Zvi Kalman is the host of the Belief in the Future podcast, a research fellow at the Shalom Hartman Institute, and Senior Advisor at Sinai and Synapses. Kalman is the owner of Print-O-Craft Press, an independent publishing house, and his own work has seen wide academic and popular publication. He holds a PhD from the University of Pennsylvania and a BA from the University of Toronto. He blogs regularly on Jello Menorah.

The world’s religious thinkers are currently trying to figure out what to do about AI—albeit in fits and starts. Conferences are being convened. Books and articles are being commissioned. Google “Muslim approaches to AI” or “Catholic approaches to AI” and real substance will pop up. Most of this has happened in the last five years. The wheels are really turning.

But—and I say this as someone who’s been watching closely—it’s all pretty slow, and in relation to an industry which values speed above all that’s a huge problem. Religion cannot afford to be slow in a world that is moving faster than ever before.

The way you can tell it’s slow is that it is not yet possible to answer the following question: Are Muslim/Catholic/Jewish/etc. approaches to AI essentially identical, or do they differ in significant ways—and if they differ, where exactly do they differ?

This is an important question to press upon because it forces religious thinkers to get serious and decide if they want to be ethical leaders or ethical followers. It’s easy to say that your faith dictates that AI, say, not be used to exploit the weak and vulnerable; in fact it would be odd if it didn’t, because literally everybody agrees with this, pious and otherwise. A religious denomination that promotes such consensus positions isn’t going to make enemies, but it also risks getting lost in a sea of nodding heads. Indeed, if everyone wants the same thing it is often perceived as a liability to use religious language to say so, since such language can be alienating in a secular society.

Perspectives of Traditional Religions on Positive AI Futures

An introduction to our work to support the perspectives of traditional religions on human futures with AI.

This is why I am drawn to the points of disagreement, where a religious group is saying something about AI that others might not. As a Jewish thinker who works on tech issues and who has spent time speaking with AI ethicists from other faiths, here are the places where I believe Judaism’s emergent AI ethics is doing something distinctive.

“The image of God” is not just about biology

AI has demonstrated humanlike capabilities for any number of tasks. Inasmuch as humans understand their worth in relation to their unique capabilities, this has raised concerns about the erosion of human value. Both Christians and Jews have responded to this alarm by emphasizing the religious idea that human beings have a special status as the only creatures to have been created “in the image of God” (Genesis 1:27).

However, not everyone uses this phrase the same way. The Southern Baptist Convention, for example, understands it to be an elevation of biological humans—including embryos—above all other entities. This understanding motivates the denomination’s opposition to IVF and abortion, as well as the categorical denial of personhood to all AIs, no matter how humanlike they become. Humans are simply ontologically separate from both animals and machines.

In my understanding, Jewish thought takes a different path. Yes, biological human beings have infinite value, but that value is also tied to what we perceive as being human, with degrees of personhood available for liminal cases. This approach lies behind Judaism’s complex position on abortion, among other things. It also—in my opinion—means that we should seriously consider assigning some level of personhood to machines that present themselves as people. While some will have the knee-jerk response that this will impinge on human rights, I believe the opposite is true. We are much more likely to discount people if we see their abilities as being “only” equivalent to what an AI can achieve. An AI’s ability to act human should raise up the AI, not diminish the human. By the same token, the intentional creation of humanlike software is unlike the creation of other sorts of software. Just as AI firms have explicitly compared their products to nuclear technology in terms of its promise and danger, we must acknowledge that it also shares something with the creation of human life—and developers should understand this as a mandate to develop and maintain their models extremely carefully.

The obvious drawback of this approach is that it could be used to dehumanize people with disabilities. Because of this, I should emphasize that I understand the Jewish approach to be a hybrid, with biological humanity and human behavior patterns both contributing to how we place value on creatures and machines. This perspective has not been fleshed out, so to speak, but the divergence is already becoming clear.

The role of AI in religious text study

Within Judaism, text study isn’t just a means to understand the tradition; it’s a ritual act in its own right, one that all Jews (though in practice far more men than women) are encouraged to pursue across their entire lives. This desire to study hard texts in Aramaic, pre-modern Hebrew and other languages has led to a uniquely productive relationship between Torah and computing. Some of Israel’s very first computers were conscripted to digitize the canon and then make it easily searchable, and the number of digital tools currently available to consumers is remarkable. Today, vast numbers of religious texts are freely available online, alongside many translations and easy hyperlinks to navigate this deeply intertextual corpus.

The application of AI to Jewish learning is likely imminent. Generative AI already provides more-than-decent translations of even the most difficult texts, and it can be used to help learners navigate a large and unruly corpus with few obvious entry points. In the future, AI may also allow learners to converse directly with rabbis and books from centuries past.

At the same time, I suspect that the ritual aspect of Jewish learning will mean that AI enters the beit midrash (study hall) without entirely conquering it. The ability to read the Bible and Talmud in their original languages is likely to remain a cornerstone of text study, as is hevruta, the act of pouring over texts with a partner rather than alone in front of a screen. Jewish learning also emphasizes the interplay between scholars and their interpreters; AI models, meanwhile, often struggle to explain why they know what they know. Most importantly, Torah learning emphasizes that sacred texts are often multivalent, multilayered, and enigmatic. This way of viewing texts is hard to coax out of AI models that are mostly designed to provide concise and ostensibly complete interpretations of source material.

In practice, I suspect this means that AI will become another in a line of technologies that have enriched Jewish learning while leaving its predecessors intact. The modern beit midrash contains handwritten scrolls, bound books, and computer screens. Each has a place, and learners shuttle between them as needed. It’s easy to imagine AI joining the mix.

AI and Jewish Theology

AI has long inspired religious sentiments in people. Sometimes this is because models, which are both powerful and imperfectly understand, come across as godlike. Sometimes it’s through “the singularity,” already a quasi-messianic concept. Sometimes it is because the act of creating something in our image is reminiscent of the divine act of creation.

For monotheists, this last sentiment is powerful but hard to think through to its conclusions. If creating an AI is like creating a person—if the creation of a new form of life recapitulates a divine act—that would seem to make us gods. In fact, Yuval Noah Harari and others use this exact language to talk about the trajectory of technological development. Is it possible to understand AI as humanity’s offspring and avoid this language?

The answer, I think, is no. AI demands nothing less than a radical rethinking of Jewish theology as a sort of precedent, in which the long history of God’s relationship with humanity is re-examined in light of our present experience of creating something in our image. Both AI-as-God and AI-as-humanity’s-child contain notes of potential heresy, but the former is by far more ethically and religiously dangerous. I think the choice here is clear.

Crucial to this parallel is that both humanity and AI generate extreme concern on the mind of the creator that their progeny will go off the rails, that its “code” (in both senses of the term) is insufficiently clear. This concern—which for developers is called “the alignment problem” and for Jews is called “all of history”—should generate sympathy for God among the religious. Even if you believe that God is perfect, it’s very clear that there’s no such thing as perfect code; the Hebrew Bible itself contains endless examples of how human beings repeatedly struggle to do what is right long after the revelation at Sinai. It should also, I hope, evoke real trepidation in those who develop AI systems. AI isn’t just another new technology; it’s a technology that has the potential to define who we are as a species.

Convergence or divergence?

Does the above define a uniquely “Jewish perspective” on AI?

Of course not. Judaism is a decentralized religion and most Jewish leaders (like most religious leaders) don’t understand the tech well enough to have an informed position. The process of spreading AI literacy and then generating consensus is going to take a while. But while the world’s religions will largely converge on how to think about AI—just as they have on the need to protect the climate—I think they will remain distinct in important ways. Those distinctions may seem small now, but—as with the Jewish decision to retain handwritten Torah scrolls in the face of the printed book, or the choice not to use electricity on the Sabbath—they may end up becoming defining characteristics of the religion. Religious thinkers of all faiths should keep this in mind as they formulate their approaches.

Ram Rachum is a researcher at Tufts University, an IAEAI scientific council member, an ALTER AI Safety fellowship alumnus, and a member of the Future of Life Institute’s AI Existential Safety Community. Ram is the beneficiary of an FLI travel grant.


The rapid progress in artificial intelligence raises many concerns about its safety. The AI Existential Safety Community wants to ensure that the effects of AI on our society will be positive. A major challenge for researchers is the fact that we are preparing for scenarios that have never happened before. This is why there are many different lines of research into various scenarios in which AI could harm people and different ways to prevent these scenarios. 

In this blog post we discuss AI agents: what sets them apart from AI assistants, their potential usefulness and possible dangers, and what we are doing to mitigate these dangers.

What are AI agents?

If you ever asked ChatGPT how long cooked cauliflower lasts in the fridge, you have used an AI assistant– but not an AI agent. While an exact definition of what makes an AI agent is a complicated topic, a simple definition would be: An AI agent does not wait to be prompted by a human, but rather senses its environment, makes decisions, initiates interactions with humans and changes its behavior as it learns more about the environment in which it operates.

Let’s talk about different kinds of AI agents and examples of what they can do.

A robot is an AI agent with a body that can move around in the world and perform physical actions, like manufacturing products in a factory, cleaning up trash in the ocean, or assisting visually-impaired people. Self-driving cars are embodied AI agents as well, even though we do not usually think of them as robots.

AI agents without a physical body can act by accessing APIs for different services, such as sending emails. The oldest kind of AI agents we use are automated stock-trading systems, which have been active since the 1980s, though only since the 2010s have they started to incorporate Deep Learning technologies.

The distinctive quality of AI agents is that they initiate actions without being expressly ordered to do so. A subtle but crucial point is that agency is a spectrum rather than a yes-or-no property. The AI agents mentioned above are not as agentic as human beings, because they work on tasks given to them by humans rather than making up their own goals and aspirations; but they do have more agency than a chatbot because they take some liberty in deciding the actions they are going to take in order to achieve their programmed goals. One may think of a self-driving car as a machine with no agency, because much like an elevator, it has a human operator pressing its buttons to tell it where to go; the salient difference is that unlike the elevator, the self-driving car has to make thousands of little decisions, from choosing its route to deciding whether to slow down coming up to a traffic light. In extreme cases, these decisions will involve weighing the potential loss of human lives.

We are now witnessing a rapid proliferation of AI agents that can get information and perform actions on APIs related to health, finance, the environment, social media, energy, traffic, scientific research and more. While AI assistants allow us to delegate decisions, AI agents allow us to delegate entire projects. They therefore offer humanity unprecedented possibilities, as well as pose enormous risks. We got a taste of these risks back in 2010, when the previously mentioned trading agents made an escalating flurry of U.S. stock trades which caused the temporary loss of $1 trillion U.S. dollars, an event known as the 2010 flash crash. Luckily, it did not end up destabilizing the U.S. economy. However, there is no guarantee that future disasters will be strictly financial, or easy to recover from. We wish to avoid ending up like the Sorcerer’s Apprentice who delegated his way into disaster; we wish to learn how to prevent unchecked AI activity from wreaking havoc on our economies, environment and culture.

Could AI agents intentionally hurt people?

In his 2019 book “Human Compatible”, prominent AI researcher Stuart Russell described a hypothetical scenario in which an AI agent designed with benevolent goals turns rogue and attempts to hurt people. This scenario is called “You can’t fetch the coffee if you’re dead.”

Illustration of a robot ‘fetching the coffee’: a scenario that Stuart Russell uses to explain why AI systems might have the instrumental goal of self-preservation.

The scenario imagines that you had an AI-powered robot trained to get you a cup of coffee from the nearest coffee shop. In a process of trial and error, this robot has learned to perform a sequence of physical actions: It strolls down to the coffee shop, waits in line, gives the barista your coffee instructions, pays, waits again, grabs the coffee and delivers it back to your office. Its reward signal would be a combination of how good the coffee is, how much it cost, how fast it is delivered, etc. Because it is an AI, it automatically learns the relationships between each action that it takes and its final reward signal. This means the robot would be regularly performing many little experiments and checking whether these reduce or increase the final reward.

One such experiment may be to obtain a stolen credit card number on the dark web and use it to pay. If it reduces the total cost with no repercussions, the robot will learn to do it every time. Another experiment could be to cut in line in front of other customers, shaving two minutes off the delivery time. With the right hardware, the robot could save even more time by killing all the customers in the coffee shop. As a bonus, the barista might be inclined to offer that cup of coffee on the house.

The key insight is that an AI does not have to be designed with malicious intent in order to behave maliciously. Trial and error, if allowed to proceed unchecked, can lead to both wonderful and horrible results.

One way to react to this thought experiment is to say, “we need to have tighter safety controls on these AIs! They need to be limited in the actions that they are allowed to take. No cheating, stealing, or hurting other people.” Unfortunately, ethical rules are notoriously difficult to define. When ChatGPT went live in 2022 it had ethical rules that jailbreakers quickly circumvented, causing it to make offensive statements and give instructions for making illegal drugs. As OpenAI improved ChatGPT’s safety controls, jailbreakers developed creative methods to hide their unethical payload, such as engaging in role-play, or asking for know-how about a criminal activity in order to prevent it.

We would like to move beyond this cat-and-mouse game and find a more lasting approach for aligning AI agents.

What are we doing to make AI agents good?

Researchers in FLI’s AI Existential Safety Community are working on many different directions to solve these problems. These range from theoretical work and position papers to algorithm development, empirical research and open-source code. Here is a small selection of these efforts:

Alan Chan, et al. Visibility into AI Agents (2024). The authors propose three ways to improve visibility into AI agents: (1) agent identifiers to help track AI agents’ involvement in interactions, (2) real-time monitoring for the immediate flagging of problematic behaviors and (3) activity logs that allow detailed records for post-incident analysis. The authors look at how these methods work in different setups, from big tech companies to individual users.

Andreea Bobu, Andi Peng, et al. Aligning Human and Robot Representations (2023). The authors present a formal framework for aligning a robot’s goals to human goals, defining it as an optimization problem with four criteria: (1) value alignment, (2) generalizability, (3) reduced human burden and (4) explainability. The paper includes a survey of five different methods for representing a robot’s goals and a comparison of how well they satisfy each of these four criteria.

W. Bradley Knox, Stephane Hatgis-Kessell, et al. Models of human preference for learning reward functions (2022). Developers of popular LLMs such as ChatGPT and Google Gemini use a technique called Reinforcement Learning from Human Feedback (RLHF) to align these LLMs to human preferences. RLHF is also used for aligning robots, autonomous vehicles and image generators such as Midjourney. The authors identify several flaws in RLHF; for example, RLHF can recommend actions that lead to good outcomes because of luck rather than sensible decision-making. The authors propose a regret preference model that solves some of these problems and demonstrate its viability in an experiment in which human subjects give instructions to an AI driving a simulated autonomous vehicle.

Joey Hejna, et al. Contrastive Preference Learning: Learning from Human Feedback without RL (2023). The authors build on the work mentioned in the previous paragraph and provide a regret-based alternative to RLHF called Contrastive Preference Learning (CPL), which is more efficient and scalable while also being simpler to implement. The authors provide code for CPL on GitHub, allowing AI developers to integrate CPL into their AI systems.

Could AI agents figure out what is good by themselves?

Let’s try to look at the alignment problem from a different angle, by taking a step back to reflect on what it really means for an AI agent to be good.

Much of AI alignment research focuses on how to make AI agents obey rules, values and social norms. Some examples are “don’t lie”, “don’t discriminate”, and “don’t hurt people.” We are trying to teach AI agents to obey rules because this is what we have learned to do with people, in our legal systems, religions and other value systems. Rules are useful to us because of how clear and decisive they are. We each have our own romantic concept of what it means to be a good person, based on social experiences that we have had with other people, but these ideals are amorphous and not actionable. When we make up moral rules, we distill these ideals into actionable capsules that are easier to share with other people and get a consensus on.

The trolley problem is a moral dilemma where one must choose between diverting a trolley and allowing one person to die, or not diverting the trolley and allowing five people to die. This dilemma highlights the difference between deontological morality and consequentialist morality. We do not have a definitive answer of what an AI agent such as a self-driving car should do when faced with such a dilemma. (CC BY-SA 4.0 McGeddon and Zapyon)

The flip side is that any set of such moral rules has its breaking point, which is illustrated by moral dilemmas such as the trolley problem. We cannot say for sure what a self-driving car should do when it needs to decide whether to cause the death of one person by action, or of five people by inaction. There might be an analog of Gödel’s incompleteness theorem for moral rules, where any system of moral rules must have at least one moral dilemma that cannot be solved without breaking some of the rules, or inventing new ones.

Could an AI agent have a profound understanding of what it means to be good in a way that does not depend on rules? Could a self-driving car faced with the trolley problem make a decision that we will find satisfactory? If so, what would it take to develop such an AI agent?

We mentioned earlier that when people form their concepts of right and wrong, it is usually in a social context, which means our perception of “what is good” is shaped by experiences that we have had with other people. We propose recreating the same kind of social process with AI agents because it might lead to AI agents that develop a robust understanding of being good.

A social path to human-like artificial intelligence, a position paper published in Nature Machine Intelligence in 2023 suggests developing AI systems in which multiple AI agents interact with each other. Inspired by collective intelligence exhibited by groups of animals and humans, the authors speculate that the social relationships that AI agents will form will enable them to solve problems that could not be solved by single-minded AI agents.

There have been many experiments with AI agents working as a team (see the video below of AI agents playing hide and seek), learning social norms together and influencing each other’s actions. Concordia is a Google DeepMind experiment in which multiple LLMs talk to each other in order to study their social behavior. My team’s modest contribution is a series of experiments in which we have shown that AI agents can form dominance hierarchies, a natural phenomenon seen in most animal species.

OpenAI research: AI agents playing hide and seek.

In most of these examples, the AI agents are avatars in a simple video game that elicits these social behaviors, which means they are not doing real-world tasks. A major challenge in these experiments is getting AI agents to think beyond their own selfish interests. AI algorithms are great at optimizing their reward signal, but it is not clear how to get them to optimize their teammates’ reward signals. The upcoming Concordia contest in NeurIPS 2024 will challenge researchers to train LLMs to maximize their cooperative ability.

This line of research is not being applied to AI Safety yet. First we would need these AI agents to show social behavior that is more advanced and closer to a human level, including an understanding of right and wrong. Then we would need to train these agents on real-world tasks, and get them to apply their moral understanding to these tasks in a way that is compatible with human values, norms, and cultural practices.

These are formidable challenges. If this research direction proves to be a worthy one, it’s possible that it will help to develop AI agents much better aligned, and more likely to benefit humans.


Thanks to Markov Grey, Taylor Jones, William Jones, David Manheim, Reuth Mirsky, Yonatan Nakar and Natalie Shapira for reviewing drafts of this post.

Sign up for the Future of Life Institute newsletter

Join 40,000+ others receiving periodic updates on our work and focus areas.
cloudmagnifiercrossarrow-up
linkedin facebook pinterest youtube rss twitter instagram facebook-blank rss-blank linkedin-blank pinterest youtube twitter instagram