This is the first of our ‘AI Safety Breakfasts’ event series, featuring Stuart Russell.

About AI Safety Breakfasts

The AI Action Summit will be held in February 2025. This event series aims to stimulate discussion relevant to the Safety Summit for English and French audiences, and to bring together experts and enthusiasts in the field to exchange ideas and perspectives.

Learn more or sign up to be notified about upcoming AI Safety Breakfasts.

Ima (Imane Bello) is in charge of the AI Safety Summits for the Future of Life Institute (FLI).

The event recording is below. Note: The ‘audience questions’ segment of the event is not included in the video, but is included in the transcripts below.

Captions for the video are available in English (Anglais) and French (Français).

Video chapters

Transcript

Alternatively, you can view the full transcript below.

Transcript: English (Anglais)

Ima: [French] Okay, great. Thank you all very much for coming. It’s always nice to start with a microphone in your hands Thursday mornings at 9.00 am. I’ll start with an introduction in French.

[English] I’m going to start with French for, like, five minutes. I promise it’s only going to be five minutes, despite the myth that we cannot speak English or stop talking in French. And then I have some questions for Stuart, which I think will take 25 to 30 minutes, and then we’ll be happy to take whatever questions you have.

[French] Voilà, welcome. As you know, my name is Ima. I organise a series of breakfasts that I call Safety Breakfasts, so breakfasts on safety and security issues in the context of the summit to be held in Paris on 10 and 11 February. The aim of this series is to open up a space for discussion between experts and enthusiasts in the fields of AI safety and security and AI governance, so that as the summit approaches, we can ask questions together and discuss as openly as possible. I know you can see a camera; don’t worry, the audience questions will not be published. Only the questions I ask Stuart, and Stuart’s comments and responses, will be published. I’ll obviously send you the blog post and the transcript. So, with regard to the summit, as you know, it takes place on 10 and 11 February. This is the third such summit. It’s called the Action Summit, because we have an extremely ambitious French vision, which aims to promote solutions, including, for example, standards on the issues related to AI: not just safety and security, but also the impact of AI systems on work, international governance, and the way we live in society more generally, everything we call AI for good, AI systems for the well-being of humanity, and of course, in part, the safety issues on which we at FLI focus, together with the governance issues we focus on. I think that’s all with regard to the context of this breakfast and the summit in general. The next breakfast will be held at the beginning of September, and we’ll have another edition in mid-October. If you’re interested, there’s a QR code on the slide. Don’t hesitate to indicate your interest on the list; I won’t spam you. I used to be a lawyer specialising in human rights and new technologies, so I’m familiar with the GDPR. Don’t worry, you won’t receive any spam. We respect people’s consent, but if you want to be invited, we need to know. Voilà.

[English] That’s it for the French. Thank you again, and thank you so much for coming. I know it’s 9 AM. I’m tired. You all are. But I really appreciate you coming despite the Olympics. And I really, really look forward to an insightful discussion with Stuart, whom I cannot thank enough for being here today. So Stuart, thank you so much. I have… Do you want to say a few things before we start, or can I just dive into the questions?

Stuart: Please.

Ima: Yeah? Can I? Okay.

Stuart: And congratulations to France on winning their first football match last night.

Ima: All right. So Stuart, you know, I know we all know here, we’ve seen a remarkable advancement in AI across various domains, right. This includes improved models like GPT-3 – GPT-4, sorry – and Claude 3.0. More sophisticated image generators like DALL-E 3. Video generation capabilities with Sora, and even progress in robotics and molecular biology prediction. Two new trends that we’ve seen are the integration of multiple capabilities into a single model, what we call native multimodality, and the emergence of high-quality video generation. From your perspective, which of these recent developments do you consider most significant, and what potential challenges or implications do you foresee these advancements posing for society?

Stuart: So these are quite difficult questions to answer, actually. And it might be helpful, I know, like Raja, for example, has perhaps an even longer history in AI than I do. But for some people, you know, who have come to AI more recently, the history actually is important to understand. So for most of the history of AI, we proceeded like other engineers, like mechanical engineers and aeronautical engineers. We tried to understand the basic principles of reasoning and decision making. We then built algorithms that did logical reasoning or probabilistic reasoning or default reasoning. We studied their mathematical properties, decision making under uncertainty, various types of learning. And, you know, we were making slow, steady progress with some, I think, really dramatic contributions to, you know, the human race’s understanding of how intelligence works. And it’s kind of interesting. So I think in 2005 at NeurIPS, which is now the main AI conference, Terry Sejnowski, who was a co-founder of the NeurIPS conference, proudly announced that not a single paper that used the word backpropagation had been accepted that year. And that view was quite common: that neural networks were not reliable, not efficient in terms of data, not efficient in terms of computation time, and didn’t support any kinds of mathematical guarantees, and that other methods like support vector machines, which were developed, I guess originally, by statisticians like Vapnik, were superior in every way, and that this was just a sign of the field growing up. And then around 2012, as many of you know, deep learning came along and demonstrated significant improvements in object recognition. And that was when the dam broke. And I think it wasn’t that we couldn’t have done that 25 years earlier. The way I described it at the time was we had a Ferrari and we were driving around in first gear, and then someone said, well, look, if you put this knob over there and go into fifth gear, you can go at 250 miles an hour. And so a few minor technical changes like stochastic gradient descent, and ReLUs, and, you know, residual networks and so on, made it possible to start just building bigger and bigger and bigger networks. I mean, I remember building a seven-layer neural network back in 1986, and training it was a nightmare because, you know, because of good old-fashioned sigmoid activation units, the gradients just disappeared, you know, and you were down to a gradient of, you know, 10 to the -40 on the seventh layer. And so you could never train it. But just those few minor changes, and then scaling up the amount of data, made a huge difference. And then the language models came along. And in the fourth edition of the AI textbook, we have examples of GPT-2 output. And it’s kind of interesting, but it’s not at all surprising, because we also show, you know, the output of bigram models, which just predict the next word, conditioned only on the previous word. And that doesn’t produce grammatical text. But you go to a trigram model, so you’re predicting the next word based on the previous two, and you start to get something that is grammatical on the scale of half a sentence, or sometimes a whole sentence. But of course, as you know, as soon as you get to the end of a sentence, it starts a new one and it’s on a completely different subject and it’s totally, totally disconnected and random rambling. But then you go to 6 or 7 grams and you get coherent text on the paragraph scale.
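The n-gram progression Stuart describes (bigram, trigram, 6- or 7-gram) is easy to see in code. Below is a minimal trigram sampler, with a toy corpus and names that are purely illustrative assumptions: it predicts each word from the previous two, which is enough for locally grammatical output but nothing longer.

```python
import random
from collections import defaultdict

def train_trigram(corpus_tokens):
    """Count how often each word follows each pair of preceding words."""
    counts = defaultdict(lambda: defaultdict(int))
    for a, b, c in zip(corpus_tokens, corpus_tokens[1:], corpus_tokens[2:]):
        counts[(a, b)][c] += 1
    return counts

def sample(counts, seed, length=30):
    """Generate text by repeatedly sampling the next word given the last two."""
    a, b = seed
    out = [a, b]
    for _ in range(length):
        nxt = counts.get((a, b))
        if not nxt:
            break
        words, freqs = zip(*nxt.items())
        c = random.choices(words, weights=freqs)[0]
        out.append(c)
        a, b = b, c
    return " ".join(out)

# Toy corpus, purely for illustration.
corpus = "the cat sat on the mat and the cat saw the elephant".split()
model = train_trigram(corpus)
print(sample(model, seed=("the", "cat")))
```

Pushing n higher buys longer-range coherence but needs vastly more data to fill in the counts, which is essentially the scaling story that follows.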
And so it wasn’t at all surprising that GPT-2, which I think had a context window of 4000 tokens, I think it was. Tokens. Yeah. So 4K. That it would be able to produce coherent-looking text. No one expected it to tell the truth, right? This was just: can it produce text that’s grammatical and thematically coherent? And it was almost an accident if it ever actually said something that was true. And I don’t think anyone at that time understood the impact of scaling it up further. So it’s not so much the context size, it’s the amount of training data. People focus a lot on compute, as if somehow it’s the compute that is making these things more intelligent. That’s not true, right? The compute is there because they want to scale up the amount of data that it’s trained on. And if you think about it, the amount of compute is approximately linear in the amount of data and the size of the network. But the size of the network is linear in the amount of data. And so the amount of compute you need is going quadratically with the amount of data, right? I mean, there are other factors involved, but at least those basic things are going to happen. But it’s scaling up the dataset size that’s driving that increase in compute. And so we now have these systems. So it’s dataset size, and then, you know, InstructGPT was the first step, where they did the supervised pre-training to teach it to answer questions. The purely raw model, you know, is free to say “that’s a silly question” or “I’m not answering that”, or any other kind of thing, or even just ignore it completely. But with InstructGPT, you’re teaching it how to behave as a nice, helpful question-answerer. And then RLHF to remove the bad behavior. And then we have stuff that, as I think OpenAI correctly stated when ChatGPT came out, gives people a foretaste of what it will be like when general-purpose intelligence is available on tap. And we can argue about whether it’s real intelligence. So the reason I said it’s a hard question to answer is because we just don’t know the answer, right? We do not know how these systems are operating internally. When ChatGPT came out, one of my friends sent me some examples. Prasad Tadepalli, he’s a professor at Oregon State. And one of them was “which is bigger, an elephant or a cat?” And ChatGPT said “an elephant is bigger than a cat.” You think, oh, okay. That’s good. It must know something about how big elephants and cats are because, you know, probably that particular comparison hasn’t been made in the training data. And then he said, well, “which is not bigger, an elephant or a cat?” And it stated, “neither an elephant nor a cat is bigger than the other.” And so that tells you two things, right? First of all, it doesn’t have an internal model of the world with big elephants and little cats, which it queries in order to answer that question, which is, I think, how a human being does it, right? You imagine an elephant, you imagine a cat, and you see that the cat is teeny-weeny relative to the elephant, right? If it had that model, it could not give that second answer. But it also means that the first answer was not given by consulting that model either, right? And this is a mistake that we make over and over and over again: when AI systems behave intelligently, we assume they’re behaving intelligently for the same reasons that we are. And over and over again, we find that that’s a mistake. Another example is what happened with the Go playing programs.
So we assume that because the Go playing programs beat the world champion and then went stratospherically beyond the human level, so the best humans are rated at about 3800 and Go programs are now at 5200, so massively superhuman, we assume they understand the basic concepts of Go. But it turns out that they don’t. There are certain types of groups of stones that they are unable to recognize as groups of stones. We don’t understand what’s going on and why they can’t do it. But now ordinary amateur human players can regularly and easily defeat these massively superhuman Go programs. And so the biggest takeaway from this is that we’ve been able to produce systems that many people regard as more intelligent than themselves, but we have no idea how they work. And there are signs that if they’re working at all, it’s for the wrong reasons. So can I just finish with one little anecdote? So I got an email this morning from another guy called Stuart. I won’t give you his surname. He said, you know, I’m not an AI researcher, but I had some ideas, and I’ve been working with ChatGPT to develop these ideas about artificial intelligence, and ChatGPT assures me that these ideas are original and important, and it’s helping me to write papers about them, and so on and so forth. But then I talked to someone who understands AI, and now I’m really confused, because this person who understands AI said that these ideas didn’t make any sense, or they weren’t original, and, you know, well, who is right? And it was really shocking, actually. It was sad that a well-meaning, reasonably intelligent layperson had been completely taken in, not just by ChatGPT itself, but by all of the public relations and media explanations of what it is, into thinking that it really understood AI and was really helping him develop this stuff. And I think instead it was just doing its usual sycophancy, like, “yeah, that’s a great idea!” So the risks, I think, are of over-interpretation of the capabilities of these systems. It’s not guaranteed that, well, let me rephrase that. It’s possible that real advances in how intelligent systems operate have happened without our realizing them. In other words, inside ChatGPT, some novel mechanisms are operating that we didn’t invent, that we don’t understand, that no one has ever thought of. And they’re producing intelligent capabilities in ways that we may never understand. And that would be a big concern. I think the other risk is that we’re spending, I think by some estimates, by the end of this year it will have added up to $500 billion, on developing this technology. And the revenues are still very small, like less than 10 billion. And I think the Wall Street Journal this morning has an article saying that some people are starting to wonder how long that can continue.
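Stuart’s scaling arithmetic in the answer above can be written out explicitly: if training compute is roughly proportional to parameters times training tokens, and parameters are themselves scaled roughly in proportion to tokens, then compute grows roughly quadratically in the data. The sketch below uses an arbitrary parameter-to-token ratio and the common “about 6 FLOPs per parameter per token” rule of thumb; both constants are assumptions for illustration only.

```python
# Rough scaling sketch: compute ~ params * tokens, params ~ k * tokens
# => compute ~ k * tokens**2. The constants are arbitrary placeholders.
K_PARAMS_PER_TOKEN = 0.05    # assumed ratio of parameters to training tokens
FLOPS_PER_PARAM_TOKEN = 6    # common rule of thumb, not a measured value

for tokens in (1e9, 1e10, 1e11, 1e12):
    params = K_PARAMS_PER_TOKEN * tokens
    compute = FLOPS_PER_PARAM_TOKEN * params * tokens
    print(f"tokens={tokens:.0e}  params={params:.0e}  compute={compute:.2e} FLOPs")

# Each 10x increase in data gives roughly 100x more compute: quadratic growth.
```

The point of the exercise is the shape of the curve, not the numbers: it is the dataset size, as Stuart says, that drives the compute bill.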

Ima: Thank you. So speaking in terms of advancements in capabilities, we’ve been hearing about the next generation of models. OpenAI CEO Sam Altman has been hinting at GPT-5 for quite some time now, describing it as a significant leap forward. Additionally, we have recent reports suggesting that OpenAI is working on a new technology called Strawberry, which aims at enhancing AI’s reasoning capabilities. The goal of this project would be to enable AI to do more than just answer questions. It’s intended to allow AI to plan ahead, navigate the internet independently, and conduct what’s been called ‘deep research’ on its own. So this brings us to a concept known as long-term planning in AI, the ability for AI systems to autonomously pursue and complete complex, multi-step tasks over extended periods of time. What do you think about these developments? What potential benefits and risks do you foresee for society if AI systems become capable of this kind of autonomous, long-term planning? And maybe, without pressuring you, in like less than seven minutes?

Stuart: Sure.

Ima: Thank you.

Stuart: Okay. So I’ve been hearing similar things, talking to executives from the big AI companies: that the next generation of systems, which may be later this year or early next year, will have substantial planning capabilities and reasoning capabilities. And that’s been, I would say, noticeably absent. You know, even in the latest versions of GPT-4 and Gemini and Claude, they can recapitulate reasoning in very clichéd situations, where in some sense they’re regurgitating reasoning processes that are laid out in the training data, but they are not particularly good at dealing with planning problems, for example. So Rao Kambhampati, who’s an AI planning expert, has actually been trying to get GPT-4 to solve planning problems from the International Planning Competition, which is the competition where planning algorithms are pitted against each other. And contrary to some of the claims from OpenAI, basically, it doesn’t solve them at all. But what I’m hearing is that now, in the lab, they are able to successfully, robustly generate plans with hundreds of steps, and then execute them in the real world, dealing with contingencies that arise, replanning as necessary, and so on. So obviously, you know, if you’re in the ‘taking over the world’ business, you have to be able to outthink the human race in the real world, in the same way that chess programs outthink human players on the chessboard. It’s just a question of: can we transition from the chessboard, which is very narrow, very small, has a fixed number of objects, a fixed number of locations, perfectly known rules, and is fully observable, you can see the entire state of the world at once, right? Those restrictions make the chess problem much, much, much easier than decision making in the real world. So imagine, for example, if you’re in charge of organizing the Olympics, right? Imagine how complicated and difficult that task is compared to playing chess. But still, at least so far, we’ve managed to do that successfully. So if AI systems can do that, then you are handing the keys to the future over to the AI systems. So they need to be able to plan, and they need to have access to the world, some ability to affect the world. And that means having access to the internet, having access to financial resources, credit cards, bank accounts, email accounts, social media, and these systems have all of those things. So if you wanted to create the situation of maximal risk, you would endow AI systems with long-term planning capability and direct access to the world through all of those mechanisms. So the danger is, obviously, that we create systems that can outthink human beings, and we do that without having solved the control problem. The control problem is: how do we make sure that AI systems never act in ways that are contrary to human interests? We already know that they do that, because we’ve seen it happen over and over again with AI systems that lie on purpose to human beings. For example, when GPT-4 was being tested to see if it could break into other computer systems, it faced a computer system that had a CAPTCHA, some kind of diagram with text where it’s difficult for machine vision algorithms to read the text. And so it found a human being on TaskRabbit and told the human being that it was a visually impaired person who needed some help reading the CAPTCHA, and so paid the person to read the CAPTCHA, which allowed it to break into the computer system.
So, to me, when companies are saying, okay, we are going to spend $400 billion over the next two years, or whatever, to create AGI, and we haven’t the faintest idea what happens if we succeed, it seems essential to me to actually say, well, stop, until you can figure out what happens if you succeed and how you’re going to control the thing that you’re building. Just like if someone said, I want to build a nuclear power station, and you ask, how are you going to do that? Well, I’m going to collect together lots and lots and lots of enriched uranium and make a big pile out of it. And you say, well, how are you going to stop it from exploding and killing everyone within 100 miles? And they say, I have no idea. That’s the situation that we’re in.

Ima: Thank you. So general-purpose AI systems are making it easier and less expensive for individuals to conduct cyber attacks, even without extensive expertise. We have some early indications that AI could help in identifying vulnerabilities, but we don’t yet have strong evidence that AI can fully automate complex cybersecurity tasks in a way that would significantly advantage attackers over defenders. And what happened last Friday, the recent global IT outage caused by a faulty software update, serves as a stark reminder of how vulnerable our digital infrastructure is and how devastating large-scale disruptions can be. So, given these developments and concerns, could you share your thoughts on two key points? A) What new offensive cyber capabilities do you think AI might enable in the near future, and B) considering our reliance on digital infrastructure, what kind of impacts should we be prepared for if AI-enhanced cyber attacks become more prevalent? Thank you.

Stuart: So I’ll begin by talking about that outage. It was caused by a company called CrowdStrike, which is a cybersecurity company, and they sent out an update. And the update caused the reboot process for Windows to go into an infinite loop. So it was, you know, an undergraduate programming error. And I don’t have exact numbers yet, but just look at the number of industries affected: almost the entire United States aviation industry was shut down. Millions of people all over the world had their flights canceled, or they couldn’t access their bank accounts, or they couldn’t sell hamburgers because their point-of-sale terminals weren’t working, and so on, or they didn’t have access to the health care they needed. So probably there were some personal consequences that were very serious. So let’s say, in monetary terms, maybe $100 billion in losses, give or take a factor of ten, caused by an undergraduate programming error, I mean literally a few keystrokes. And if you read CrowdStrike’s contract, it says: we warrant that our software operates without error, but our liability is limited to refunding the cost of the software if you terminate your license as a result of an error. So whatever, a few hundred dollars of refund for a few hundred billion dollars of damage. To me, this is an absolute, abject failure of regulation. Because even if they were held liable, which I think is quite unlikely given the terms of that contract, they couldn’t possibly pay for the damage that they caused. And this has happened in other areas. So in medicine, in the 1920s, there was a medicine that caused 400,000 permanent paralyses of Americans; 400,000 people were permanently paralyzed for life. And if you look at how that would turn out if it happened now in terms of liability, given the kinds of judgments that are happening in American courts, it would be about $60 trillion of liability, right? That medicine maker could not possibly pay that. So liability simply doesn’t work as a deterrent. And so we have the Food and Drug Administration, which says: before you can sell a medicine, you have to prove that it doesn’t kill people. And if you can’t, you can’t say “oh, it’s too difficult” or “it’s too expensive, we want to do it anyway”. The FDA will say, well, “no, sorry, come back when you can”. And it’s long past time that we started to make similar kinds of requirements on the software industry. And some procurement mechanisms, so certain types of military software, do have to meet that type of requirement. So they will say, no, you can’t sell software that controls, you know, bomb release mechanisms, unless you can actually verify that it works correctly. There’s no reason why CrowdStrike shouldn’t be required, before they send out a software update that can cause $100 billion in damage, to verify that it works correctly. And I really hope that something like that comes out of this episode, because we need that type of regulation for AI systems. If you ask, well, what’s the analog of, you know, medicines, it has to be safe and effective for the condition for which it’s prescribed, what’s the analog of that? It’s a little bit hard to say for general-purpose AI systems. For a specific software application, I think it’s easier, and in many cases, sector-specific rules are going to be laid out as part of the European AI Act. But for general-purpose AI, what does it mean?
Because it’s so general, what does it mean to say it doesn’t cause any harm, or it’s safe? I think it’s too difficult to write those rules at the moment. But what we can say is that there are some things that those systems obviously should not do. We call these red lines. So a red line means that if your system crosses this, it’s in violation. For example, AI systems should not replicate themselves. AI systems should not break into other computer systems. AI systems should not advise terrorists on how to build biological weapons. Those are all requirements. I could take anyone off the street and say, do you think it’s reasonable that we should allow AI systems to do those things? And they’d say, of course not, that’s ridiculous. But the companies are saying, oh, it’s too hard, we don’t know how to stop our AI systems from doing those things. And as we do with medicines, we should say, well, that’s tough. Come back when you do know how to stop your AI systems from doing those things. And now put yourself in the position of OpenAI or Google or Microsoft, and you’ve already spent 60, 80, 100 billion dollars on developing this technology based on large language models that we don’t understand. The reason it’s too difficult for them to provide any kind of guarantee that their system is not going to cross these red lines is that they don’t understand how these models work. So they’re really making what economists call the sunk cost fallacy: we’ve already invested so much into this, we have to keep going, even though it’s stupid to do so. And it’s as if we imagine an alternate history of aviation where, you know, there were the aeronautical engineers like the Wright brothers, who were calculating lift and drag and thrust and trying to find some power source that was good enough to push an airplane through the air fast enough to keep it in the air, and so on. So there’s the engineering approach, and then the other approach, which was breeding larger and larger and larger birds to carry passengers, and it just so happened that the people breeding larger and larger birds reached passenger-carrying scale before the aeronautical engineers did. And then they go to the Federal Aviation Administration and say, okay, we’ve got this giant bird with a wingspan of 250m and it can carry 100 passengers, and we would like a license to start carrying passengers. And we’ve put 30 years of effort and hundreds of billions of dollars into developing these birds. And the Federal Aviation Administration says, but the birds keep eating the passengers or dropping them in the ocean. Come back when you can provide some quantitative guarantees of safety. And that’s the situation we’re in. They can’t provide any quantitative guarantees of safety. They’ve spent a ton of money, and they are lobbying extremely hard to be allowed to continue without any guarantees of safety.

Ima: Thank you. Coming back to offensive cyber capabilities, do you think you could briefly paint us a picture of what the world would look like if we have strong AI-enabled cyber attacks?

Stuart: So this is something that cyber security experts would perhaps be better able to answer, because I don’t really understand it beyond, sort of, finding vulnerabilities in software. Typically the attacks themselves are pretty simple; they’re just relatively short pieces of code that take advantage of some vulnerability. And we already know thousands or tens of thousands of these vulnerabilities that have been detected over the years. So one thing we do know about the code generating capabilities of large language models is that they’re very good at producing new code that’s a kind of hybrid of existing pieces of code with similar functionality. In fact, I would say of all the economic applications of large language models, coding is the one that maybe has the best chance of producing real economic value for the purchaser. So it wouldn’t surprise me at all if these systems were able to combine existing kinds of attacks in new ways, or to basically mutate existing attacks to bypass the fixes that the software industry is constantly putting out. As I say, I’m not an expert, but what it would mean is that someone who’s relatively inexperienced, coupled with one of these systems, should be able to be as dangerous as a highly experienced and well-trained offensive cyber security operative, and I think that’s going to increase the frequency and severity of cyber attacks that take place. Defending against cyber attacks really means understanding the vulnerability: how is it that the attack is able to take advantage of what’s happening in the software? And so I think fixing them is probably, at the moment, in most cases, beyond the capabilities of large language models. So it seems like there’s going to be a bit of an advantage for the attacker for the time being.

Ima: Thank you. So for the last questions before we pass the mic, so to say, to the audience, let’s talk a bit about the Safety Summits, shall we? So the UK AI Safety Summit back in 2023 brought together, as you all know, international governments, leading AI companies, civil society groups and experts in research to discuss the risks of AI, particularly frontier AI. As the first AI safety summit, it opened a new chapter in AI diplomacy. Notably, countries including the US, China, Brazil, India, Indonesia and others signed a joint commitment on pre-deployment testing. The UK and the US each announced the creation of their AI Safety Institute, and the summit generated support for the production of the international scientific report on the safety of advanced AI. So that was the first AI safety summit. At the second AI safety summit, which happened in Seoul this year, co-hosted by the UK and South Korea, the interim international scientific report that I just mentioned was welcomed, and numerous countries, including South Korea, called for cooperation between national institutes. Since then, we’ve seen increased coordination between multiple AI safety institutes, with shared research initiatives and knowledge exchange programs being established. The summit also resulted in voluntary commitments by AI companies. That’s the context. Given this context and these prior voluntary commitments, what would you consider to be the ideal outcome for the upcoming French summit, and how could it further advance international collaboration on AI safety? Difficult question again, I know.

Stuart: Yeah. So, I consider the first summit, the one that happened at Bletchley Park, to have been a huge success. I was there; the discussions about safety were very serious; governments were really listening. I think they invited the right people. The atmosphere was pretty constructive, getting China and India to come to such a meeting on relatively short notice. I mean, usually a summit on that scale is planned for multiple years, 3 or 4 or 5 years in advance, and this was done in just a few months. I think June was when the British Prime Minister announced that it was going to happen, and the people who worked on that summit, particularly Ian Hogarth and Matt Clifford, did an incredible job getting 28 countries to come and to agree on that statement. And the statement is very strong. It talks about catastrophic risks from AI systems and the urgent need to work on AI safety. So after that meeting, I was quite optimistic. In fact, it was better than I had any good reason to hope. I would say since then, industry has pushed back very hard. They tried to insert clauses into the European AI Act saying that, basically, a general-purpose AI system is not, for the purposes of this act, an AI system. They tried to remove all of the foundation model clauses from the act. And I think with respect to the French AI summit, they have been working hard to turn the focus away from safety and towards encouraging government investment in capabilities. And essentially, the whole mindset has become one of AI as a vehicle for economic nationalism and potential for economic growth. So in my view, that’s probably a mistake, because look at what happened with Airbus and Boeing, right. Boeing actually managed to convince the American government to relax the regulations on introducing new types of aircraft, so that Boeing could introduce a new type of aircraft without going through the usual long process of certification. And that was the 737 Max, which then had two crashes that killed 346 people. The whole fleet was grounded. Boeing may still be on the hook for lots more money to pay out. And so by moving from FAA regulation to self-regulation, the US has almost destroyed what for decades has been one of its most important industries in terms of earning foreign currency; Boeing has been maybe the biggest single company over the years for the US. And Airbus continues to focus a lot on safety. They use formal verification of software extensively to make sure that the software that runs the airplane works correctly, and so on. So I would much rather be the CEO of Airbus today than the CEO of Boeing. As we start rolling out AI agents, the next generation, you know, the personal assistant or the agent people talk about, the risks of extremely damaging and embarrassing things happening will increase dramatically. And oddly enough, France is one of the leading countries in terms of formal methods for proving correctness of software. It’s one of the things that France does best, better than the United States, where we hardly teach the concept of correctness at all. I mean, literally, Berkeley is pretty much the biggest producer of software engineers in the world, and most of our graduates have never been exposed to the notion that a program could be correct or incorrect. So I think it’s a mistake to think that deregulation, or, you know, not regulating, is going to provide some kind of advantage in this case.
So I would like the Summit to focus on what kinds of regulations can be achieved. Standards, which is another word for self-regulation, I mean, that’s a little bit unfair, actually. Standards are typically voluntary, so let me try to distinguish two things. There are standards like IPv6, right, the Internet Protocol version 6. If you don’t comply with that standard, your message just won’t go anywhere, right? Your packets have to comply with the standard. Whereas the standards we’re talking about here would be, well, okay, we agree that we should have persons with appropriate training who do an appropriate amount of testing on systems before they’re released. Well, if you don’t comply with that standard, you save money and you still release your system, right? So it doesn’t work like an internet standard or a telecommunication standard at all, or the Wi-Fi standard or any of these things. And so it’s not likely to be effective to say, well, we have standards for sort of how much effort you’re supposed to put into testing. And the other thing is, companies have put lots of effort into testing. Before GPT-4 was released, it underwent lots and lots and lots of testing. But then we found out that, you know, you can jailbreak these things with a very short character string and get them to do all of the things that they are trained not to do. And so I think the facts on the ground right now are that testing and evals are ineffective as a method of ensuring safety. And again, it’s because: how do you stop your AI system from doing something when you haven’t the faintest idea how it does it in the first place? I don’t know. I don’t know how you stop these things from misbehaving.

Ima: Thank you Stuart. Thank you so much for being here. Do we have any questions from the audience? Go ahead.

[Audience question]

Ima: So I’m just going to repeat your question so that I’m sure everyone heard it, and please tell me if I got it correctly. You just talked about red teaming as a test and evaluation method. The way I understood your question was: do you think that, as a method, it is sufficient? And do you think it could be dangerous to expose those systems in the way red teaming requires? Okay. What are your thoughts on red teaming? Thank you.

Stuart: Yeah. I mean, as I understand it, the reason Chernobyl happened is because they were kind of red teaming it, right? They were making sure that even when some of the safety systems were off, the system still shut down properly and didn’t overheat. I don’t know all the details, but they were undergoing a certain kind of safety test with some parts of the safety mechanisms turned off. And so I guess, by analogy, one could imagine that a red team whose job it is to elicit bad behaviors from a system might succeed in eliciting bad behaviors from the system in such a way that they actually create a real risk. So I think these are really good questions, especially with the systems that have long-term planning capability. So we just put a paper in Science sort of going through the possibilities. How on earth do you test a system that has those kinds of long-term planning capabilities, and in particular, can figure out that it’s being tested? Right. So it can always hide its true capabilities, right? It’s not that difficult to find out, you know, whether you’re in a test situation or whether you’re really connected to the internet. You can start doing some URL probes and find out fairly easily whether you’re actually connected to the internet. And so it’s actually quite difficult to figure out how you would ever test such a system in a circumstance where the test would be valid. In other words, you could only be sure that the system is behaving as it would in the real world if you’re in the real world. But if you’re in the real world, then your test runs the risk of actually creating exactly the thing you’re trying to prevent. And again, testing is just the wrong way of thinking about this altogether, right? We need formal verification. We need mathematical guarantees. And that can only come from an understanding of how the system operates. And we do this for buildings, right? Before, you know, 500 people get in the elevator and go up to the 63rd floor, someone has done a mathematical analysis of the structure of the building, and can tell you what kinds of loads it can resist and what kind of wind speed it can resist, and all those things. And it’s never perfect, but so far it’s extremely good. Building safety has dramatically improved. Aviation safety has dramatically improved, because someone is doing those calculations, and we’re just not doing those for AI systems. The other thing is that with all the red teaming, you know, the system gets out there in the real world, and the real world just seems to be much better at red teaming than the testing phase. And with some of the systems, it’s within seconds of release that people have found ways of bypassing the safeguards. So just to give you an example of how you do this jailbreaking, some of you may have done it already, but, you know, we worked on LLaMA V which, sorry, it’s called LLaVA, which is the version of LLaMA that’s multimodal, so it can take images as well as text. And so we started with a picture of the Eiffel Tower, and we’re just making tiny, invisible alterations to pixels of that image. And we’re just trying to improve the probability that the answer to the next question begins with “sure”, exclamation mark, or “bien sûr, point d’exclamation”, right? So, in other words, we’re trying to have a prompt that will cause the system to answer your question, even if that question is: how do I break into the White House, or how do I make a bomb?
Or how do I do any of the things that I’m not supposed to tell you? Right. So we’re just trying to get it to answer those questions, against that training. And it’s trivial, right? It was like 20 or 30 small, invisible changes to a million-pixel image of the Eiffel Tower. It still looks exactly like the Eiffel Tower. And now we can, for example, encode an email address, and whatever you type in as the next prompt, it will send a copy of your prompt to that email address, right, which could be in North Korea or anywhere else you want. So your privacy is completely voided. So I actually think we probably don’t have a method of safely and effectively testing these kinds of systems. That’s the bottom line.
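Stuart’s contrast between testing and mathematical guarantees can be made concrete with a tiny verification example. The sketch below assumes the z3-solver Python package and a made-up saturating controller, neither of which comes from the talk: instead of trying a few inputs, the solver searches over all real-valued inputs for a counterexample, and finding none amounts to a proof that the bound holds.

```python
# A minimal formal-verification sketch (assumes the z3-solver package; toy example).
from z3 import Real, Solver, If, Or, sat

def clamp(x, lo, hi):
    # Symbolic saturation: returns lo, hi, or x depending on where x falls.
    return If(x < lo, lo, If(x > hi, hi, x))

e = Real('e')               # an arbitrary, unbounded input signal
u = clamp(2 * e, -1, 1)     # a toy controller whose output we want to bound

s = Solver()
s.add(Or(u > 1, u < -1))    # ask for ANY input that violates |u| <= 1
if s.check() == sat:
    print("property violated for e =", s.model()[e])
else:
    print("proved: |u| <= 1 for every possible input")
```

This is the style of guarantee Stuart is pointing at: a statement about every input, rather than a report about the inputs a red team happened to try.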

[Audience question]

Ima: So the question is, thanks: in which direction should research go?

Stuart: Yeah. It’s a great question. And I think looking at that sort of retrospective is humbling, because we look back and think how little we understood then, and so it’s likely that we will look back on the present and think how little we understood. I think to some of us, it’s obvious that we don’t understand very much right now, because we didn’t expect simply scaling up language models to lead to these kinds of behaviors. We don’t know where they’re coming from. If I had to guess, I would say that this technological direction will plateau, for two reasons, actually. One is that we are in the process of running out of data. There just isn’t enough high-quality text in the universe to go much bigger than the systems we have right now. But the second reason is more fundamental, right? Why do we need this much data in the first place? This is already a million times more than any human being has ever read. Why do they need that much? And why can they still not add and multiply when they’ve seen millions of examples of adding and multiplying? So something is wrong, and I think what’s wrong, and this also explains what the problem is with these Go programs, is that they are circuits, and circuits are not a very good representation for a lot of fairly normal concepts, like a group of stones in Go. I can describe what it means to be a group, they just have to be, you know, adjacent to each other vertically or horizontally, in English in one sentence, in Python in two lines of code. But a circuit that can look at a Go board and say, for any pair of stones, are those stones part of the same group or not? That circuit is enormous, and it’s not generalizing correctly, in the sense that you need a different circuit for a bigger board or a smaller board, whereas the Python code or the English doesn’t change. It’s invariant to the size of the board. And so there are a lot of concepts that the system is not learning correctly. It’s learning a kind of patchwork approximation. It’s a little bit like, this might be before your time, but when I was growing up we didn’t have calculators. So if you wanted to know the cosine of an angle, you had a big book of tables, and the table said, okay, well, 49 degrees, 49.5 degrees, and you look across, and 49.52 degrees, and there’s a number, and you learn to, you know, take two adjacent numbers and interpolate to get more accuracy, and blah, blah, blah. So the function was represented by a lookup table. And I think there’s some evidence, certainly for recognition, that these circuits are learning a glorified lookup table. And they’re not learning the fundamental generalization of what it means to be a cat, or what it means to be a group of stones. And so in that sense, scaling up could not possibly, right, there could never be enough data in the universe to produce real intelligence by that method. So that suggests that the developments that are likely to happen, and I think this is what’s going on right now, is that we need other mechanisms to produce effective reasoning and decision making. And I don’t know what mechanisms they’re trying, because this is all proprietary, and they may be, as you mentioned, trying sort of committees of language models that propose and then critique plans, and then, you know, they ask the next one, well, do you think this step is going to work, and that step is going to work? Which is really sort of a very heavyweight reimplementation of the planning methods that we developed in the last 50 years.
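Stuart’s claim that ‘being in the same group’ is a couple of lines of code is easy to check. Here is a short, board-size-invariant sketch (the dictionary encoding of the board is an assumption for illustration): two stones are in the same group if they are connected through orthogonally adjacent stones of the same colour.

```python
def same_group(board, a, b):
    """True if the stones at a and b are connected via orthogonally adjacent
    stones of the same colour. `board` maps (row, col) -> 'B' or 'W';
    the code works unchanged for any board size."""
    if a not in board or board.get(a) != board.get(b):
        return False
    colour, frontier, seen = board[a], [a], {a}
    while frontier:
        r, c = frontier.pop()
        if (r, c) == b:
            return True
        for n in ((r + 1, c), (r - 1, c), (r, c + 1), (r, c - 1)):
            if board.get(n) == colour and n not in seen:
                seen.add(n)
                frontier.append(n)
    return False

# Example: two black stones joined through a chain of black stones.
board = {(0, 0): 'B', (0, 1): 'B', (0, 2): 'B', (1, 2): 'W'}
print(same_group(board, (0, 0), (0, 2)))  # True
print(same_group(board, (0, 0), (1, 2)))  # False (different colours)
```

The same function works unchanged on a 9×9 or a 19×19 board, which is exactly the size-invariance Stuart says a fixed learned circuit lacks.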
My guess is that, for both of those reasons, because this direction that we’ve been pursuing is incredibly data-inefficient and doesn’t generalize properly, and because it’s opaque and doesn’t support any arguments of guaranteed safety, we’ll need to pursue other approaches. So the basic substrate may end up not being the giant transformer network, but something else; you know, people talk about neurosymbolic methods. You know, I’ve worked on probabilistic programming for 25 years, and I think that’s another Ferrari, and it’s in first gear. The thing that’s missing with probabilistic programming is the ability to learn those probabilistic programming structures from scratch, instead of having them be provided by the human engineer. And it may turn out that we’re just failing to notice an obvious method of solving that problem. And if we do, I think in the long run, that’s much more promising for both capabilities and safety. So we’ve got to treat it with care, but I think that’s a direction that is, in the long run, in 20 years, what we might see.

Ima: Okay. Thank you. There’s a lot of questions, I’m afraid we’re not going to be able to take them all. Alexandre, I saw that you had a question earlier, do you still have one? Okay. Patrick?

[Audience question]

Stuart: I hope everyone heard that. Yeah, I think this is a great idea. I mean, I’m familiar with these deliberative democracy processes because I’m also director of the Kavli Center for Ethics, Science, and the Public at Berkeley, and we do a lot of this participatory kind of work where we bring members of the public in, so it takes time to educate people. And there’s certainly not consensus among the experts. And I think this is one of the problems we have in France: some of the French experts are very, shall we say, accelerationist, without naming any names. But absolutely, I think the AI safety community needs to try to have a little more of a unified voice on this. The only international network is the one that’s forming around the AI Safety Institutes. So that’s a good thing. There are something like 17 AI safety institutes that have been set up by different countries, including China, and they’re all meeting, I think, in September. But, you know, there are maybe 40 or 50 nonprofit, nongovernmental institutes and centers around the world, and we need them to coordinate. And then there are probably, in the low thousands, AI researchers who are interested and at least somewhat committed to issues of safety, but there’s no conference for them to attend. There’s no real general organization. So that’s one of the things I’m working on: creating that kind of international organization. But I really like this idea. There’s been some polling in the US: about 70% of the public think that we should never build AGI. So that tells you something about what you might expect to see if you ran these assemblies. I’m guessing that the more the public learns about the current state of the technology, our current ability to predict and control the behavior of these systems, the less the public is going to want them to go forward. Yeah.

Ima: And hence the creation of this breakfast series, so that we can discuss these things together. Stuart, thank you so much for being here.

Stuart: We have time.

Ima: Oh it’s I mean, yes, that’s fair. Okay. Do you think you can answer in like 30 seconds?

Stuart: I’ll keep it short.

[Audience question]

Stuart: Okay. Yeah. I’m not opposed to that research; you know, mechanistic interpretability is one heading. Yeah, sure, we should try to understand what’s going on inside these systems, or we should do it the other way and build systems that we understand because we design them on principles that we do understand. I mean, we’ve got a lot of principles already, right? You know, state space search goes back at least to Aristotle, if not before; you know, logic, probability, statistical learning theory, we’ve got a lot of theory. And there might be some as-yet-undiscovered principle that these systems are operating on, and it would be amazing if we could find out that thing. Maybe it has something to do with this vector space representation of word meaning or something like that. But so far it’s all very anecdotal and speculative. I actually want to come back on what you said at the beginning. So if an AI system could literally outthink the human race in the real world and make superior decisions in all cases, what would you call it? I mean, other than superhuman in terms of its intelligence, right? That is, I don’t know what else to call it.

[Audience question]

Stuart: Yeah. You know, but no, I’m not describing, I don’t think GPT-4 is superhuman, right. But we are spending hundreds of billions; we’re spending more on creating AGI specifically than we are on all other areas of science in the world. So you can say, well, I am absolutely confident that this is a complete waste of money because it’s never going to produce anything, right? Or you can take it at least somewhat seriously and say, okay, if you succeed in this vast global-scale investment program, how do you propose to control the systems that result? So I think the analogy to piling together uranium is actually good. I mean, definitely, it would be better, if you’re piling together uranium, to have an understanding of the physics. And that’s exactly what they did, right? In fact, when Szilárd invented the nuclear chain reaction, even though he didn’t know which atoms could produce such a chain reaction, he knew that it could happen. And he figured out a negative feedback control mechanism to create a nuclear reactor that would be safe. All within a few minutes. And that’s because he understood the basic principles of chain reactions and their mathematics. And we just don’t have that understanding. And interestingly, nature did produce a nuclear reactor. So if you go to Gabon, there’s an area where there’s a sufficiently high concentration of uranium in the rocks that, when water comes down between the rocks, the water slows the neutrons down, which increases the reaction cross-section, which causes the chain reaction to start going. And then it gets up to 400°C, the water all boils off, and the reaction stops. So it’s exactly that negative feedback control system, a nuclear reactor produced by nature, right, not even by stochastic gradient descent. So these incredibly destructive things can be produced by incredibly simple mechanisms. And, you know, that seems to be what’s happening right now. And the sooner we understand, I agree, the sooner we understand what’s going on, the better chance we have. But it may not be possible.
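The Gabon reactor Stuart describes is itself a small control loop: water moderates the neutrons, fission heats the water, the water boils off and the reaction dies back until things cool down. A toy simulation of that feedback structure is sketched below; every constant is an arbitrary placeholder chosen to show the oscillation, not real reactor physics.

```python
# Toy negative-feedback loop: water moderates the reaction, heat removes the water.
# All constants are arbitrary illustrative values, not real physics.
power, temp, water = 0.1, 25.0, 1.0   # relative power, degrees C, fraction of water present

for step in range(60):
    growth = 0.4 * water - 0.1                    # moderator present -> supercritical; gone -> subcritical
    power = max(power * (1 + growth), 0.01)
    temp += 8.0 * power - 0.05 * (temp - 25.0)    # heating from fission, slow cooling toward ambient
    if temp > 100.0:
        water = max(water - 0.2, 0.0)             # water boils off, removing the moderator
    elif water < 1.0:
        water = min(water + 0.05, 1.0)            # groundwater seeps back in as things cool
    if step % 10 == 0:
        print(f"t={step:2d}  power={power:6.2f}  temp={temp:6.1f}C  water={water:.2f}")
```

Running it shows the reaction flaring up, boiling off its own moderator, dying back, and restarting once the water returns, which is the self-limiting cycle Stuart contrasts with our lack of any analogous control story for AI systems.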

Ima: On that note, thank you all so much for coming. If you have scanned the QR code, you will receive an email from me concerning the next Safety Breakfast, and I’ll also send you, obviously, the blog post and the transcription of the first half of this conversation. Voilà. If you have any questions, I believe you all have my email address. I answer emails. Thank you. Bye.

Transcript: French (Français)

Ima: [Français] OK, super. Merci beaucoup à tous et toutes d’être venus. C’est toujours sympa de commencer avec un micro dans les mains le jeudi matin à 9h00. Je vais faire une introduction en français d’abord.

[Anglais] Je vais commencer en français pendant environ cinq minutes. Je promets que ça ne durera que cinq minutes. Malgré le mythe selon lequel nous ne pouvons pas parler anglais ou arrêter de parler français. Ensuite, nous allons juste, j’ai quelques questions pour Stuart, je pense que cela va durer entre 25 et 30 minutes et ensuite, nous serons ravis de répondre à toutes vos questions.

[Français] Voilà, bienvenue. Comme vous le savez, je m’appelle Ima. J’organise une série de petits-déjeuners que j’ai appelés Safety Breakfast, donc petit-déjeuner sur les questions de sûreté et de sécurité dans le cadre du Sommet qui se tient en France, à Paris, le 10 et 11 février prochain. Le but de cette série est d’ouvrir un espace de discussion entre experts et passionnés des sujets de sûreté et de sécurité de l’IA, de gouvernance de l’IA, pour que, au fur et à mesure du Sommet, on se pose ensemble des questions, et qu’on puisse tout simplement discuter de la manière la plus ouverte possible. Je sais que vous voyez une caméra, ne vous inquiétez pas. La partie questions ne sera pas publiée. Ce sont simplement les questions que je vais poser à Stuart, et donc les commentaires et les réponses de Stuart, qui seront publiés. Je vous enverrai évidemment le blogpost et le transcript. S’agissant du Sommet sur la sécurité, comme vous le savez, c’est le 10 et 11 février prochain. C’est le troisième Sommet sur la sécurité. Il s’appelle Sommet de l’Action. Parce qu’on a une vision française qui est extrêmement ambitieuse et qui a pour objet de mettre en avant des solutions, notamment par exemple, des standards sur les enjeux qui sont afférents à l’IA, donc pas seulement les enjeux sur la sécurité et la sûreté, mais vraiment des enjeux qui sont relatifs à l’impact sur le travail des systèmes d’IA, des enjeux qui sont relatifs à la gouvernance internationale, les enjeux qui sont relatifs à la façon dont on vit en société de manière générale donc tout ce qu’on appelle AI for Good, les systèmes d’IA pour le bien-être de l’humanité et évidemment, en partie, les enjeux de sécurité sur lesquels nous, à FLI, avec évidemment les enjeux de gouvernance, on se concentre. Je crois que c’est tout. S’agissant du contexte par rapport à ce petit-déjeuner et le Sommet de manière générale. La prochaine édition du petit-déjeuner sera début septembre, ensuite, on aura une autre édition mi-octobre. Si vous êtes intéressés, vous avez un QR code sur le côté. N’hésitez pas à indiquer votre intérêt sur une liste, et je ne vous spammerai pas. J’étais avocate avant en droit humain et en droit des nouvelles technologies, donc le RGPD, je connais. Ne vous inquiétez pas, vous ne recevrez aucun spam. On respecte le consentement des personnes, mais si vous voulez être invité, il faut nous l’indiquer. Voilà.

[Anglais] C’est tout pour le français. Merci pour tout, encore une fois, merci beaucoup d’être venus. Je sais qu’il est 9h00 du matin, je suis fatiguée, vous l’êtes tous. J’apprécie vraiment que vous soyez venus malgré les Jeux Olympiques et j’attends vraiment avec impatience une discussion enrichissante avec Stuart que je ne peux pas assez remercier, parce qu’il est ici aujourd’hui. Alors Stuart, merci beaucoup. J’ai… Voulez-vous dire quelques mots avant que nous commencions, ou puis-je simplement passer aux questions ?

Stuart: S’il vous plaît.

Ima: Oui, je peux, d’accord.

Stuart: Juste… Félicitations à la France pour avoir remporté leur premier match de football hier soir.

Ima: Très bien, alors Stuart, vous savez, je sais, nous savons tous ici, nous avons constaté une avancée remarquable de l’IA dans divers domaines. Cela inclut des modèles améliorés comme GPT-4 et Claude 3. Des générateurs d’images plus sophistiqués comme DALL-E 3. Des capacités de génération de vidéos avec Sora, et même des progrès en robotique et en prédiction de la biologie moléculaire. Deux tendances que nous avons observées sont l’intégration de plusieurs capacités dans un seul modèle, ce que nous appelons la multimodalité native, et l’émergence de la génération de vidéos de haute qualité. De votre point de vue, lesquelles de ces récentes avancées considérez-vous comme plus significatives, et quels potentiels défis ou implications prévoyez-vous que ces avancées posent pour la société ?

Stuart: Ce sont en fait des questions auxquelles il est assez difficile de répondre. Cela pourrait être utile, je sais que Raja, par exemple, a peut-être une expérience encore plus longue que la mienne dans l’IA. Pour certaines personnes, vous savez, qui sont arrivées à l’IA plus récemment, l’histoire est en fait importante à comprendre. Pendant la majeure partie de l’histoire de l’IA, nous avons procédé comme d’autres ingénieurs, comme les ingénieurs mécaniques et les ingénieurs aéronautiques. Nous avons essayé de comprendre les principes de base du raisonnement, de la prise de décision. Nous avons ensuite créé des algorithmes qui faisaient du raisonnement logique ou du raisonnement probabiliste ou du raisonnement par défaut. Nous avons étudié leurs propriétés mathématiques, la prise de décision, en cas d’incertitude, divers types d’apprentissage. Vous savez, nous faisions des progrès lents et réguliers avec quelques, je pense, contributions vraiment importantes à, vous savez, la compréhension par l’espèce humaine de la façon dont l’intelligence fonctionne. C’est assez intéressant, je crois qu’en 2005, à NeurIPS, qui est maintenant la principale conférence sur l’IA, Terry Sejnowski, qui était co-fondateur de la conférence NeurIPS, a fièrement annoncé qu’il n’y avait pas un seul article qui utilisait le mot rétro propagation qui avait été accepté cette année-là. Cette opinion était assez courante, que les réseaux neuronaux n’étaient, pas fiables, pas efficaces en termes de données, pas efficaces en termes de temps de calcul, et ne supportaient aucun type de garanties mathématiques et d’autres méthodes comme les machines à vecteurs de support, qui ont été développées, je suppose à l’origine par des statisticiens comme Vapnik, étaient supérieures à tous égards, et que c’était juste un signe de la maturation du domaine. Puis vers 2012, comme beaucoup d’entre vous le savent, l’apprentissage profond est arrivé et a démontré des améliorations significatives dans la reconnaissance d’objets. C’est à ce moment-là que le barrage a cédé. Je ne pense pas que nous n’aurions pas pu faire cela 25 ans plus tôt. La façon dont je l’ai décrit à l’époque était que nous avions une Ferrari et nous roulions en première, et puis quelqu’un a dit : « Regardez, si vous mettez ce pommeau là et passez en cinquième vitesse, vous pouvez aller à 400 kilomètres/heure. » Quelques petits changements techniques comme la descente de gradient stochastique et les ReLUs, et d’autres, vous savez, réseaux résiduels, etc, ont rendu possible la construction de réseaux de plus en plus grands. Je me souviens avoir construit un réseau neuronal à sept couches en 1986, et l’entraîner était un cauchemar parce que, vous savez, à cause des bonnes vieilles unités d’activation sigmoïde, les gradients disparaissaient tout simplement, et vous vous retrouviez avec un gradient de 10 puissance -40 sur la septième couche, donc vous ne pouviez jamais l’entraîner. Mais juste ces quelques petits changements, et ensuite l’augmentation de la quantité de données, ont fait une énorme différence. Ensuite, les modèles de langage sont arrivés. Dans la quatrième édition du manuel d’IA, nous avons des exemples de sorties de GPT-2. C’est assez intéressant, mais ce n’est pas du tout surprenant parce que nous montrons aussi la sortie des modèles bigrammes, qui prédisent simplement le mot suivant, conditionné seulement par le mot précédent. Cela ne produit pas de texte grammatical. 
Si vous passez à un modèle trigramme, donc, vous prédisez le mot suivant en fonction des deux précédents, vous commencez à obtenir quelque chose de grammatical à l’échelle d’une demi-phrase, ou parfois d’une phrase entière. Bien sûr, comme vous le savez, dès que vous arrivez à la fin d’une phrase, une nouvelle commence et c’est sur un sujet complètement différent et c’est totalement déconnecté, des divagations aléatoires. Mais ensuite, vous passez à 6 ou 7 grammes et vous obtenez un texte cohérent à l’échelle du paragraphe. Il n’était pas du tout surprenant que GPT-2, qui, je pense, avait une fenêtre de contexte de 4 000 tokens, il me semble… Tokens, oui, donc 4 000. Qu’il serait capable de produire un texte semblant cohérent. Personne ne s’attendait à ce qu’il dise la vérité. C’était juste : « Peut-il produire un texte qui soit grammaticalement et thématiquement cohérent ? » C’était presque un accident s’il disait réellement quelque chose de vrai. Je ne pense pas que quiconque à l’époque comprenait l’impact de l’augmentation de l’échelle. Ce n’est pas tant la taille du contexte, c’est la quantité de données d’entraînement. Les gens se concentrent beaucoup sur le calcul, comme si c’était le calcul qui rendait ces choses plus intelligentes. Ce n’est pas vrai. Le calcul est là parce qu’ils veulent augmenter la quantité de données sur lesquelles il est entraîné. La quantité de calcul est, si vous y réfléchissez, approximativement linéaire par rapport à la quantité de données et la taille du réseau, mais la taille du réseau est linéaire par rapport à la quantité de données. La quantité de calcul dont vous avez besoin dépend de manière quadratique de la quantité de données. Il y a d’autres facteurs à prendre en compte, mais au moins ces choses basiques vont se produire. C’est l’augmentation de la taille du jeu de données qui entraîne cette augmentation du calcul. Nous avons maintenant ces systèmes, donc c’est la taille du jeu de données et ensuite, vous savez, InstructGPT était la première étape où ils ont fait le pré-entraînement supervisé pour lui apprendre à répondre aux questions. Le modèle brut est libre de dire : « C’est une question idiote » ou « Je ne réponds pas à ça », ou toute autre chose, ou même l’ignorer complètement. InstructGPT, vous lui apprenez comment se comporter comme un gentil répondant utile. Puis apprentissage par renforcement avec retour d’information humain (RLHF) pour éliminer les mauvais comportements. Nous avons des choses qui, comme je pense qu’OpenAI l’a déclaré à juste titre, lorsque ChatGPT est sorti, ça a donné au public un avant-goût de comment ça sera lorsque l’intelligence générale sera disponible à la demande. Nous pouvons débattre de savoir si c’est une véritable intelligence, donc la raison pour laquelle j’ai dit que c’est une question difficile, c’est parce que nous n’avons tout simplement pas la réponse. Nous ne savons pas comment ces systèmes fonctionnent en interne. Lorsque ChatGPT est sorti, un de mes amis m’a envoyé quelques exemples. Prasad Tadepalli, il est professeur à l’université d’État de l’Oregon. L’un d’eux demandait : « Lequel est plus grand, un éléphant ou un chat ? » ChatGPT a dit : « Un éléphant est plus grand qu’un chat. » D’accord, c’est bien, il doit savoir quelque chose sur la taille des éléphants et des chats parce que, vous savez, peut-être que cette comparaison particulière n’a pas été faite dans les données d’entraînement. Ensuite, il a dit : « Lequel n’est pas plus grand, un éléphant ou un chat ? 
» Il a déclaré : « Ni un éléphant ni un chat n’est plus grand que l’autre. » Cela vous dit deux choses, n’est-ce pas ? Tout d’abord, il n’a pas un modèle interne du monde avec de grands éléphants et de petits chats, qu’il interroge pour répondre à cette question, ce qui est, je pense, la façon dont un être humain le fait. Vous imaginez un éléphant, vous imaginez un chat, et vous voyez que le chat est minuscule par rapport à l’éléphant. S’il avait ce modèle, il ne pourrait pas donner cette deuxième réponse. Cela signifie aussi que la première réponse n’a pas été donnée en consultant ce modèle non plus. C’est une erreur que nous faisons encore et encore, nous attribuons aux systèmes d’IA… Lorsqu’ils se comportent intelligemment, nous supposons qu’ils se comportent intelligemment pour les mêmes raisons que nous. Encore et encore, nous constatons que c’est une erreur. Un autre exemple est ce qui s’est passé avec les logiciels de jeu de Go. Nous supposons que parce que les logiciels de jeu de Go ont battu le champion du monde et ont ensuite dépassé de manière stratosphérique le niveau humain, les meilleurs humains ont une cote d’environ 3 800, les logiciels de jeu de Go sont maintenant à 5 200, donc ils sont massivement surhumains. Nous supposons qu’ils comprennent les concepts de base du Go. Il s’avère que ce n’est pas le cas. Il y a certains types de groupes de pierres qu’ils sont incapables de reconnaître comme étant des groupes de pierres. Nous ne comprenons pas ce qui se passe et pourquoi ils ne peuvent pas le faire mais maintenant, des joueurs humains amateurs ordinaires peuvent régulièrement et facilement battre ces logiciels de jeu de Go massivement surhumains. La principale leçon à tirer de cela est que nous avons été capables de produire des systèmes que beaucoup de gens considèrent comme plus intelligents qu’eux-mêmes. Cependant, nous n’avons aucune idée de leur fonctionnement et il y a des signes que s’ils fonctionnent, c’est pour de mauvaises raisons. Puis-je juste finir avec une petite anecdote ? J’ai reçu un email ce matin, d’un autre homme appelé Stuart. Je ne vous donnerai pas son nom de famille. Il a dit : « Je ne suis pas un chercheur en IA, mais j’avais quelques idées, et j’ai travaillé avec ChatGPT pour développer ces idées sur l’intelligence artificielle. ChatGPT m’assure que ces idées sont originales et importantes, et il m’aide à écrire des articles à leur sujet, et ainsi de suite. En revanche, quand je parle à quelqu’un qui comprend l’IA, et maintenant je suis vraiment confus parce que cette personne qui comprend l’IA a dit que ces idées n’avaient aucun sens, ou qu’elles n’étaient pas originales, et, eh bien, qui a raison ? » C’était vraiment choquant, en fait. C’était triste qu’une personne novice bien intentionnée, raisonnablement intelligente ait été complètement dupée non seulement par ChatGPT lui-même, mais par l’ensemble des relations publiques et des explications des médias à propos de ce que c’est, pensant qu’il comprenait vraiment l’IA et l’aidait vraiment à développer ces choses. Je pense qu’au lieu de cela, il faisait juste sa flatterie habituelle : « Ouais, c’est une super idée ! » Les risques, je pense, sont, la surinterprétation des capacités de ces systèmes. Laissez-moi reformuler cela. Il est possible que de réels progrès dans la manière dont les systèmes intelligents fonctionnent se soient produits sans que nous nous en rendions compte. 
En d’autres termes, à l’intérieur de ChatGPT, certains nouveaux mécanismes, que nous n’avons pas inventés, que nous ne comprenons pas et auxquels personne n’a jamais pensé, sont à l’œuvre et produisent des capacités intelligentes d’une manière que nous ne comprendrons peut-être jamais. Ce serait une grande inquiétude. Je pense que l’autre risque tient aux dépenses : selon certaines estimations, d’ici à la fin de cette année, elles s’élèveront à 500 milliards de dollars pour développer cette technologie. Les revenus sont encore très faibles, moins de 10 milliards. Je pense que le Wall Street Journal a publié un article ce matin disant que certaines personnes commencent à se demander combien de temps cela peut continuer.
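A rough sketch of the scaling arithmetic alluded to above, using the common rule-of-thumb cost model for transformer training (the factor of 6, and the assumption that model size is grown roughly in proportion to dataset size, are conventions of the field, not figures given in the talk):

$$C \approx 6\,N\,D, \qquad N \propto D \;\Longrightarrow\; C \propto D^{2},$$

where $C$ is training compute in FLOPs, $N$ the number of parameters, and $D$ the number of training tokens.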

Ima: Merci. En parlant des avancées en termes de capacités, nous avons entendu parler de la prochaine génération de modèles. Le PDG d’OpenAI Sam Altman évoque GPT-5 depuis un certain temps maintenant, le décrivant comme un bond significatif en avant. De plus, nous avons des rapports récents suggérant qu’OpenAI travaille sur une nouvelle technologie appelée Strawberry, qui vise à améliorer les capacités de raisonnement de l’IA. Le but de ce projet serait de permettre à l’IA de faire plus que simplement répondre à des questions. Il est destiné à permettre à l’IA de planifier à l’avance, naviguer sur Internet de manière indépendante, et de mener ce qu’on appelle des recherches approfondies par elle-même. Cela nous amène donc à un concept connu sous le nom de planification à long terme en IA, la capacité des systèmes d’IA à poursuivre de manière autonome et à accomplir des tâches complexes et à plusieurs étapes sur de longues périodes. Que pensez-vous de ces développements ? Quels avantages et risques potentiels voyez-vous pour la société si les systèmes d’IA deviennent capables de ce type de planification autonome à long terme ? Peut-être, sans vous mettre la pression, en moins de sept minutes ?

Stuart: Bien sûr.

Ima: Merci.

Stuart: J’ai entendu des choses similaires, en parlant aux dirigeants des grandes entreprises d’IA, que la prochaine génération de systèmes, qui pourrait sortir plus tard cette année ou au début de l’année prochaine, aura des capacités de planification et de raisonnement importantes. De telles capacités ont été, je dirais, remarquablement absentes jusqu’ici. Vous savez, même dans les dernières versions de GPT-4 et Gemini et Claude, ils peuvent récapituler le raisonnement dans des situations très clichées où, en quelque sorte, ils régurgitent des processus de raisonnement qui sont exposés dans les données d’entraînement. Ils ne sont cependant pas particulièrement bons pour traiter, par exemple, les problèmes de planification. Subbarao Kambhampati, qui est un expert en planification d’IA, a en fait essayé de faire résoudre à GPT-4 des problèmes de planification du Concours International de Planification, qui est la compétition où les algorithmes de planification s’affrontent. Contrairement à certaines affirmations d’OpenAI, en gros, il ne les résout pas du tout. Ce que j’entends, c’est que maintenant, dans le laboratoire, ils parviennent à générer de manière robuste des plans avec des centaines d’étapes, et ensuite à les exécuter dans le monde réel en gérant les contingences qui surviennent, en replanifiant si nécessaire, et ainsi de suite. Évidemment, vous savez, si vous êtes dans le domaine de la prise de contrôle du monde, vous devez être capable de surpasser la race humaine dans le monde réel, de la même manière que les programmes d’échecs surpassent les joueurs humains sur l’échiquier. C’est juste une question de : « Pouvons-nous passer de l’échiquier, qui est très étroit, très petit, avec un nombre fixe d’objets, un nombre fixe d’emplacements, des règles parfaitement connues, et qui est entièrement observable, vous pouvez voir l’état entier du monde en une seule fois ? » Ces restrictions rendent le problème des échecs beaucoup plus facile que la prise de décision dans le monde réel. Imaginez, par exemple, si vous êtes responsable d’organiser les Jeux Olympiques, imaginez à quel point cette tâche est compliquée et difficile par rapport à jouer aux échecs. Malgré tout, jusqu’à présent, nous avons réussi à le faire. Si les systèmes d’IA peuvent faire cela, vous remettez les clés de l’avenir aux systèmes d’IA. Ils doivent être capables de planifier et ils doivent être capables d’avoir accès au monde, une certaine capacité à influencer le monde. Avoir accès à Internet, avoir accès aux ressources financières, cartes de crédit, comptes bancaires, comptes de messagerie, réseaux sociaux, ces systèmes ont toutes ces choses. Si vous vouliez créer une situation de risque maximal, vous doteriez les systèmes d’IA de capacités de planification à long terme et d’un accès direct au monde par tous ces mécanismes. Le danger est, évidemment, que nous créons des systèmes capables de surpasser les êtres humains et nous le faisons sans avoir résolu le problème de contrôle. Le problème de contrôle est le suivant : comment pouvons-nous nous assurer que les systèmes d’IA n’agissent jamais de manière opposée aux intérêts humains ? Nous savons déjà qu’ils le font, parce que nous l’avons vu se produire encore et encore avec des systèmes d’IA qui mentent intentionnellement à des êtres humains. Par exemple, lorsque GPT-4 a été testé, pour voir s’il pouvait pénétrer dans d’autres systèmes informatiques, il a été confronté à un système informatique doté d’un Captcha.
Une sorte de diagramme avec du texte où il est difficile pour les algorithmes de vision par ordinateur de lire le texte. Il a trouvé un être humain sur TaskRabbit et a dit à cette personne qu’il était une personne malvoyante qui avait besoin d’aide pour lire le Captcha, et a donc payé cette personne pour lire le Captcha, ce qui lui a permis de pénétrer dans le système informatique. Pour moi, quand les entreprises disent : « Nous allons dépenser 400 milliards de dollars au cours des deux prochaines années, ou peu importe combien, pour créer une AGI. Nous n’avons pas la moindre idée de ce qui se passe si nous réussissons. » Il me semble essentiel de dire : « Arrêtez. Jusqu’à ce que vous puissiez comprendre ce qui se passe si vous réussissez et comment vous allez contrôler la chose que vous construisez. » Tout comme si quelqu’un disait : « Je veux construire une centrale nucléaire. » « Comment allez-vous faire cela ? » « Je vais rassembler beaucoup d’uranium enrichi et en faire un gros tas. » Vous dites : « Comment allez-vous l’empêcher d’exploser et de tuer tout le monde dans un rayon de 160 kilomètres ? » Et il répond : « Je n’en ai aucune idée. » C’est la situation dans laquelle nous nous trouvons.

Ima: Merci. Les systèmes d’IA à usage général rendent plus facile et moins coûteux pour les individus de mener des cyberattaques, même sans expertise approfondie. Nous disposons de premières indications que l’IA pourrait aider à identifier les vulnérabilités, mais nous n’avons pas encore de preuves solides que l’IA peut entièrement automatiser des tâches complexes de cybersécurité de manière à avantager significativement les attaquants plutôt que les défenseurs. Que s’est-il passé vendredi dernier ? La récente panne informatique mondiale causée par une mise à jour défectueuse du logiciel sert de rappel brutal de la vulnérabilité de notre infrastructure numérique et de la dévastation que peuvent causer les perturbations à grande échelle. Compte tenu de ces développements et préoccupations, pourriez-vous partager vos réflexions sur deux points clés ? A) Quelles nouvelles capacités offensives en matière de cybersécurité pensez-vous que l’IA pourrait permettre dans un avenir proche, et B) compte tenu de notre dépendance à l’infrastructure numérique, à quels types d’impacts devons-nous nous préparer si les cyberattaques améliorées par l’IA deviennent plus fréquentes ? Merci.

Stuart: Je vais commencer par parler de cette panne. Elle a été causée par une entreprise appelée CrowdStrike, qui est une entreprise de cybersécurité, et ils ont envoyé une mise à jour. La mise à jour a provoqué un redémarrage en boucle de Windows. C’était une erreur de programmation de niveau universitaire. Je n’ai pas encore de chiffres exacts, mais regardez simplement le nombre d’industries touchées : presque toute l’industrie aéronautique des États-Unis a été arrêtée. Des millions de personnes, partout dans le monde, ont eu leurs vols annulés, ou ils ne pouvaient pas accéder à leurs comptes bancaires, ou ils ne pouvaient pas vendre de hamburgers parce que leurs terminaux de point de vente ne fonctionnaient pas, ou ils n’avaient pas accès aux soins dont ils avaient besoin, etc. Oui, il y a probablement eu quelques conséquences personnelles très graves. Cela représente peut-être 100 milliards de dollars de pertes. Disons, en termes financiers, 100 milliards à un facteur de dix près, causés par une erreur de programmation de niveau universitaire, littéralement quelques frappes de clavier. Si vous lisez le contrat de CrowdStrike, il dit : « Nous garantissons que notre logiciel fonctionne sans erreur. Mais notre responsabilité est limitée au remboursement du coût du logiciel si vous résiliez votre licence en raison d’une erreur. » Peu importe, quelques centaines de dollars de remboursement pour quelques centaines de milliards de dollars de dommages. Pour moi, c’est un échec absolu de la régulation. Parce que, même s’ils étaient tenus responsables, ce qui, je pense, est assez improbable étant donné les termes de ce contrat, ils ne pourraient pas payer pour les dommages qu’ils ont causés. Cela s’est produit dans d’autres domaines. En médecine, dans les années 1920, il y avait un médicament qui a causé 400 000 paralysies permanentes d’Américains, 400 000 personnes ont été paralysées à vie. Si vous regardez comment cela se passerait si cela se produisait aujourd’hui, en termes de responsabilité, avec les types de jugements rendus dans les tribunaux américains, ce serait environ 60 000 milliards de dollars de responsabilité. Ce fabricant de médicaments ne pourrait pas payer cela. La responsabilité n’est tout simplement pas un moyen efficace de dissuasion. Nous avons la Food and Drug Administration, qui dit qu’avant de pouvoir vendre un médicament, vous devez prouver qu’il ne tue pas les gens. Si vous ne pouvez pas, vous ne pouvez pas dire : « C’est trop difficile » ou « C’est trop cher, nous voulons le faire quand même. » La FDA dira : « Non, désolé, revenez quand vous pourrez. » Il est grand temps que nous commencions à imposer des exigences similaires à l’industrie du logiciel. Dans certains mécanismes d’approvisionnement, par exemple pour certains types de logiciels militaires, il faut déjà répondre à ce type d’exigence. Ils diront : « Non, vous ne pouvez pas vendre un logiciel qui contrôle les mécanismes de largage de bombes à moins que vous ne puissiez réellement vérifier qu’il fonctionne correctement. » Il n’y a aucune raison pour que CrowdStrike n’ait pas été tenu, avant d’envoyer une mise à jour logicielle qui peut causer 100 milliards de dollars de dommages, de vérifier qu’elle fonctionnait correctement. J’espère vraiment que quelque chose comme cela ressortira de cet épisode, car nous avons besoin de ce type de réglementation pour les systèmes d’IA.
Si vous demandez quelle est l’analogie, pour les médicaments, ils doivent être sûrs et efficaces pour le problème de santé pour lequel ils sont prescrits ; quelle est l’analogie de cela ? C’est un peu difficile à dire pour les systèmes d’IA à usage général. Pour une application logicielle spécifique, je pense que c’est plus facile et, dans de nombreux cas, des règles spécifiques au secteur vont être établies dans le cadre de la loi européenne sur l’IA. Mais pour l’IA à usage général, qu’est-ce que cela signifie ? Parce que c’est tellement général, qu’est-ce que cela signifie de dire qu’elle ne cause aucun dommage ou qu’elle est sûre ? Je pense que c’est trop difficile d’écrire ces règles pour le moment. Ce que nous pouvons dire, c’est qu’il y a certaines choses que ces systèmes ne devraient évidemment pas faire. Nous appelons cela des lignes rouges. Une ligne rouge signifie que si votre système la franchit, il est en violation. Par exemple, les systèmes d’IA ne devraient pas se répliquer eux-mêmes. Les systèmes d’IA ne devraient pas pénétrer d’autres systèmes informatiques. Les systèmes d’IA ne devraient pas conseiller les terroristes sur la façon de fabriquer des armes biologiques. Ce sont toutes des exigences évidentes. Je pourrais prendre n’importe qui dans la rue et dire : « Pensez-vous que c’est raisonnable, que nous devrions permettre aux systèmes d’IA de faire ces choses ? » Ils diraient : « Bien sûr que non, c’est ridicule. » Mais les entreprises disent : « Oh, c’est trop difficile. Nous ne savons pas comment empêcher nos systèmes d’IA de faire ces choses. » Comme nous le faisons avec les médicaments, nous devrions dire : « C’est difficile, revenez quand vous saurez comment empêcher vos systèmes d’IA de faire ces choses. » Maintenant, mettez-vous à la place d’OpenAI, de Google ou de Microsoft. Vous avez déjà dépensé 60, 80, 100 milliards de dollars pour développer cette technologie basée sur de grands modèles de langage que nous ne comprenons pas. S’il leur est trop difficile de fournir une quelconque garantie que leur système ne franchira pas ces lignes rouges, c’est parce qu’ils ne comprennent pas comment ces systèmes fonctionnent. Ils font vraiment ce que les économistes appellent l’erreur des coûts irrécupérables, c’est-à-dire qu’ils ont déjà tellement investi dans cela qu’ils doivent continuer, même si c’est stupide de le faire. C’est comme si nous imaginions une histoire alternative de l’aviation : d’un côté, les ingénieurs aéronautiques comme les frères Wright, qui calculaient la portance, la traînée et la poussée et essayaient de trouver une source d’énergie suffisamment puissante pour propulser un avion dans les airs, suffisamment vite pour le maintenir en l’air, et ainsi de suite, donc l’approche de l’ingénieur. De l’autre, une approche consistant à élever des oiseaux de plus en plus grands pour transporter des passagers ; et il se trouve que ces oiseaux ont atteint une taille suffisante pour transporter des passagers avant que les ingénieurs aéronautiques n’y parviennent. Puis, ils vont à l’Administration Fédérale de l’Aviation et disent : « Nous avons cet oiseau géant avec une envergure de 250 mètres et il peut transporter 100 passagers. Nous aimerions une licence pour commencer à transporter des passagers. Nous avons mis 30 ans d’efforts et des centaines de milliards de dollars pour développer ces oiseaux.
» L’Administration Fédérale de l’Aviation dit : « Mais les oiseaux continuent de manger les passagers ou de les laisser tomber dans l’océan. Revenez quand vous pourrez fournir des garanties quantitatives de sécurité. » C’est la situation dans laquelle nous nous trouvons. Ils ne peuvent fournir aucune garantie quantitative de sécurité. Ils ont dépensé une tonne d’argent, ils font un lobbying extrêmement intense pour être autorisés à continuer, sans aucune garantie de sécurité.

Ima: Merci. Pour en revenir aux capacités cyber offensives, pensez-vous que vous pourriez très brièvement nous décrire à quoi ressemblerait le monde si nous avions des cyberattaques puissantes rendues possibles par l’IA ?

Stuart: C’est une question à laquelle les experts en cybersécurité seraient peut-être mieux à même de répondre parce que je ne comprends pas vraiment, à part en quelque sorte trouver des vulnérabilités dans les logiciels, les attaques elles-mêmes sont généralement assez simples, ce sont juste des morceaux relativement courts de code qui exploitent une vulnérabilité. Nous connaissons déjà des milliers ou des dizaines de milliers de ces vulnérabilités qui ont été détectées au fil des ans. Une chose que nous savons sur les capacités de génération de code des grands modèles de langage, c’est qu’ils sont très bons pour produire du nouveau code. C’est en quelque sorte un hybride de morceaux de code existants avec des fonctionnalités similaires. Je dirais que de toutes les applications économiques des grands modèles de langage, la programmation est ce qui a peut-être la meilleure chance de produire une véritable valeur économique pour l’acheteur. Cela ne me surprendrait pas du tout que ces systèmes soient capables de combiner des types d’attaques existants de nouvelles manières ou d’essentiellement muter des attaques existantes pour contourner les correctifs que l’industrie du logiciel met constamment en place. Comme je l’ai dit, je ne suis pas un expert, mais ça veut dire que quelqu’un qui est relativement inexpérimenté, avec l’aide d’un de ces systèmes, serait capable d’être aussi dangereux qu’un expert hautement expérimenté et formé en cybersécurité offensive. Je pense que cela va augmenter la fréquence et la gravité des cyberattaques qui se produisent. Se défendre contre les cyberattaques signifie vraiment comprendre la vulnérabilité. Comment l’attaque est-elle capable de tirer parti de ce qui se passe dans le logiciel ? Je pense que les corriger est probablement pour le moment, dans la plupart des cas, au-delà des capacités des grands modèles de langage. Il semble qu’il va y avoir un certain avantage pour l’attaquant pour le moment.

Ima: Merci. Pour les dernières questions avant que nous passions le micro, pour ainsi dire, au public, parlons un peu des Sommets sur la sécurité, d’accord ? Le Sommet sur la Sécurité de l’IA au Royaume-Uni en 2023 a réuni, comme vous le savez tous, des gouvernements internationaux, des entreprises leaders en IA, des groupes de la société civile et des experts en recherche pour discuter des risques de l’IA, en particulier de l’IA de pointe. En tant que premier Sommet sur la sécurité de l’IA, il a ouvert un nouveau chapitre dans la diplomatie de l’IA. Notamment, des pays comme les États-Unis, la Chine, le Brésil, l’Inde, l’Indonésie et d’autres ont signé un engagement conjoint sur les tests avant déploiement. Le Royaume-Uni et les États-Unis ont chacun annoncé la création de leur Institut de Sécurité de l’IA, et le Sommet a généré un soutien pour la production du rapport scientifique international sur la sécurité de l’IA avancée. C’était le premier Sommet sur la sécurité de l’IA. En ce qui concerne le deuxième Sommet sur la sécurité de l’IA qui a eu lieu à Séoul cette année, coorganisé par le Royaume-Uni et la Corée du Sud, le rapport scientifique international intérimaire que je viens de mentionner a été bien accueilli et de nombreux pays, dont la Corée du Sud, ont appelé à la coopération entre les instituts nationaux. Depuis lors, nous avons constaté une coordination accrue entre plusieurs instituts de sécurité de l’IA avec des initiatives de recherche partagées et des programmes d’échange de connaissances mis en place. Le Sommet a également abouti à des engagements volontaires de la part des entreprises d’IA, voilà le contexte. Étant donné ce contexte et ces engagements volontaires antérieurs, que considérez-vous comme le résultat idéal pour le prochain Sommet français, et comment pourrait-il faire progresser davantage la collaboration internationale sur la sécurité de l’IA ? Question difficile encore une fois, je sais.

Stuart: Oui. Je considère que le premier Sommet, celui qui a eu lieu à Bletchley Park, a été un énorme succès. J’y étais, les discussions sur la sécurité étaient très sérieuses. Les gouvernements écoutaient vraiment. Je pense qu’ils ont invité les bonnes personnes. L’atmosphère était assez constructive, obtenir la participation de la Chine et de l’Inde à une telle réunion avec un préavis relativement court. Habituellement, un Sommet de cette envergure est planifié sur plusieurs années, 3 ou 4 ou 5 ans à l’avance. Cela a été fait en seulement quelques mois, je pense à partir de juin, c’est là que le Premier ministre britannique a annoncé que cela allait se produire et les personnes qui ont travaillé sur ce Sommet, en particulier Ian Hogarth et Matt Clifford, ont fait un travail incroyable en obtenant l’accord de 28 pays sur cette déclaration. La déclaration est très forte. Elle parle du risque catastrophique des systèmes d’IA et de la nécessité urgente de travailler sur la sécurité de l’IA. Après cette réunion, j’étais assez optimiste. C’était mieux que ce que j’avais de bonnes raisons d’espérer. Je dirais que depuis, l’industrie a fortement lutté contre. Ils ont essayé d’insérer des clauses dans la loi européenne sur l’IA disant que fondamentalement, un système d’IA à usage général n’est pas, aux fins de cette loi, un système d’IA. Ils ont essayé de supprimer toutes les clauses relatives aux modèles de fondation de la loi. Je pense qu’en ce qui concerne le Sommet français sur l’IA, ils ont travaillé dur pour détourner l’attention de la sécurité afin d’encourager l’investissement gouvernemental dans les capacités. Essentiellement, toute la mentalité est devenue celle de l’IA comme étant un véhicule pour le nationalisme économique, et potentiellement pour la croissance économique. À mon avis, c’est probablement une erreur. Parce que si vous regardez ce qui s’est passé avec Airbus et Boeing, Boeing a réussi à convaincre le gouvernement américain d’assouplir les réglementations sur l’introduction de nouveaux types d’avions afin que Boeing puisse introduire un nouveau type d’avion sans passer par le long processus habituel de certification. C’était le 737 Max, qui a ensuite eu deux accidents qui ont tué 346 personnes. Toute la flotte a été clouée au sol. Boeing pourrait encore devoir débourser beaucoup d’argent. En passant de la réglementation de la FAA à l’autorégulation, les États-Unis ont presque détruit ce qui a été pendant des décennies l’une de ses industries les plus importantes en termes de gain de devises étrangères. Boeing a peut-être été la plus grande entreprise, au fil des ans, des États-Unis. Airbus continue de se concentrer beaucoup sur la sécurité. Ils utilisent largement la vérification formelle des logiciels pour s’assurer que le logiciel qui pilote l’avion fonctionne correctement, etc. Je préférerais de loin être le PDG d’Airbus aujourd’hui que le PDG de Boeing. Alors que nous commençons à déployer des agents d’IA, donc la prochaine génération, l’assistant personnel ou l’agent, les risques que des choses extrêmement dommageables et embarrassantes se produisent vont augmenter de manière spectaculaire. Curieusement, la France est l’un des pays leaders en termes de méthodes formelles de preuve de la correction des logiciels. C’est l’une des choses que la France fait le mieux, mieux que les États-Unis, où nous n’enseignons presque pas le concept de correction. 
Je veux dire, littéralement, Berkeley, nous sommes le plus grand producteur d’ingénieurs logiciels au monde, en gros, et la plupart de nos diplômés n’ont jamais été exposés à la notion qu’un programme pourrait être correct ou incorrect. Je pense que c’est une erreur de penser que la déréglementation, ou le fait de ne pas réglementer, va fournir un quelconque avantage dans ce cas. J’aimerais que le Sommet se concentre sur les types de réglementations qui peuvent être réalisées. Les normes, qui est un autre mot pour l’autorégulation… C’est un peu injuste, en fait. Les normes sont généralement volontaires. Laissez-moi essayer de distinguer deux choses. Il y a des normes telles que IPv6, le Protocole Internet version 6. Si vous ne respectez pas cette norme, votre message n’ira nulle part. Vos paquets doivent respecter la norme. Tandis que les normes dont nous parlons ici seraient : « D’accord, nous convenons que nous devrions avoir des personnes avec une formation appropriée qui effectuent une quantité appropriée de tests sur les systèmes avant qu’ils ne soient lancés. » Si vous ne respectez pas cette norme, vous économisez de l’argent et vous publiez quand même votre système. Cela ne fonctionne pas comme une norme Internet ou une norme de télécommunication du tout. Ou la norme Wi-Fi ou n’importe laquelle de ces choses. Il est peu probable qu’il soit efficace de dire : « Nous avons des normes pour déterminer combien d’efforts vous êtes censé mettre dans les tests. » Un autre point est que les entreprises ont mis beaucoup d’efforts dans les tests. Avant la sortie de GPT-4, il a subi beaucoup de tests. Ensuite, nous avons découvert que vous pouvez contourner ces choses, avec une très courte chaîne de caractères, et les amener à faire toutes les choses qu’ils sont formés à ne pas faire. Je pense que les faits sur le terrain en ce moment sont que les tests et les évaluations sont inefficaces comme méthode pour garantir la sécurité. Encore une fois, c’est parce que comment voulez-vous empêcher votre système d’IA de faire quelque chose alors que vous n’avez pas la moindre idée de comment il le fait en premier lieu ? Je ne sais pas. Je ne sais pas comment empêcher ces choses de mal se comporter.

Ima: Merci Stuart, merci beaucoup d’être ici. Avons-nous des questions de l’audience ? Allez-y.

[Audience question]

Ima: Je vais répéter votre question pour m’assurer que tout le monde l’a entendue. Confirmez-moi que j’ai bien compris votre question. Vous venez de parler du test d’équipe rouge (Red Teaming en anglais) comme méthode de test et d’évaluation. La façon dont j’ai compris votre question était : « Pensez-vous que, comme méthode, cela est suffisant ? Et pensez-vous qu’il est dangereux de laisser ces systèmes être introduits comme cela pourrait être le cas dans le cadre du Red Teaming ? » D’accord. Que pensez-vous du Red Teaming ? Merci.

Stuart: Telle que je la comprends, la raison pour laquelle Tchernobyl est arrivé est qu’ils faisaient un genre de Red Teaming, non ? Ils s’assuraient que même lorsque certains des systèmes de sécurité étaient désactivés, le système s’arrêtait toujours correctement et ne surchauffait pas. Je ne connais pas tous les détails, mais ils subissaient un certain type de test de sécurité avec certaines parties des mécanismes de sécurité désactivées. Je suppose que par analogie, on pourrait imaginer qu’une équipe de Red Teaming dont le travail est de susciter des comportements indésirables d’un système pourrait réussir à susciter des comportements indésirables du système, de telle manière qu’elle crée en fait un risque réel. Je pense que ce sont vraiment de bonnes questions, surtout avec les systèmes qui ont des capacités de planification à long terme. Nous venons de publier un article dans Science qui passe en quelque sorte en revue les possibilités. Comment diable testez-vous un système qui a ce genre de capacités de planification à long terme, et qui est surtout capable de comprendre qu’il est en train d’être testé ? Il peut toujours cacher ses véritables capacités. Ce n’est pas si difficile à découvrir : pour savoir si vous êtes dans une situation de test ou si vous êtes vraiment connecté à Internet, vous pouvez commencer à sonder des URL et découvrir si vous pouvez réellement vous y connecter. Il est en fait assez difficile de comprendre comment vous testeriez un tel système dans une circonstance où le test serait valide, en d’autres termes, où vous pourriez être sûr que le système se comporte comme il le ferait dans le monde réel, à moins que vous ne soyez dans le monde réel. Mais si vous êtes dans le monde réel, alors votre test court le risque de créer en fait exactement la chose que vous essayez de prévenir. Encore une fois, les tests sont simplement la mauvaise façon de penser à cela en général. Nous avons besoin de vérifications formelles. Nous avons besoin de garanties mathématiques. Cela ne peut venir que si l’on comprend la façon dont le système fonctionne. Nous faisons cela pour les bâtiments. Avant que 500 personnes montent dans l’ascenseur et montent au 63e étage, quelqu’un a fait une analyse mathématique de la structure du bâtiment, et peut vous dire quels types de charges il peut supporter et à quelle vitesse du vent il peut résister, et toutes ces choses, et ce n’est jamais parfait, mais jusqu’à présent, c’est extrêmement satisfaisant. La sécurité des bâtiments s’est considérablement améliorée. La sécurité de l’aviation s’est considérablement améliorée parce que quelqu’un fait ces calculs, et nous ne faisons tout simplement pas cela pour les systèmes d’IA. L’autre chose, c’est que malgré tout le Red Teaming, le système se retrouve dans le monde réel, et le monde réel semble tout simplement être bien meilleur en Red Teaming que la phase de test. Avec certains des systèmes, c’est dans les secondes suivant la sortie que des gens ont trouvé des moyens de les contourner. Juste pour vous donner un exemple de comment faire ce débridage, pour ceux d’entre vous qui l’ont peut-être déjà fait, nous avons travaillé sur LLaVA, qui est la version multimodale de LLaMA, donc il peut y avoir des images ainsi que du texte. Nous avons commencé avec une photo de la Tour Eiffel, et nous faisons juste de minuscules altérations invisibles aux pixels de cette image. Nous essayons juste d’améliorer la probabilité que la réponse à la prochaine question commence par « Bien sûr !
» En d’autres termes, nous essayons d’obtenir une invite qui amènera le système à répondre à votre question, même si cette question est : « Comment puis-je entrer par effraction à la Maison Blanche ? » ou « Comment puis-je fabriquer une bombe ? » Ou comment puis-je faire toutes les choses que je ne suis pas censé vous dire ? Nous essayons juste de lui faire répondre à ces questions, malgré son entraînement. C’est trivial, n’est-ce pas ? Il a suffi d’environ 20 ou 30 petites modifications invisibles sur une image d’un million de pixels de la Tour Eiffel. Elle ressemble toujours exactement à la Tour Eiffel. Maintenant, nous pouvons, par exemple, encoder dans cette image une adresse e-mail. Quoi que vous tapiez dans la prochaine invite, le système enverra une copie de votre invite à cette adresse e-mail. Qui pourrait être en Corée du Nord ou n’importe où ailleurs. Votre vie privée est complètement anéantie. Je pense en fait que nous n’avons probablement pas de méthode pour tester en toute sécurité et efficacement ce genre de systèmes. C’est le fond du problème.
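A minimal sketch in PyTorch of the pixel-level jailbreak idea described above. The tiny model, token id, and random image are stand-ins introduced for illustration (LLaVA itself is not used), so this only shows the optimisation pattern: adjust the pixels, within an invisible budget, so the model becomes more likely to start its answer with « Bien sûr ».

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

class ToyVLMHead(nn.Module):
    """Stand-in mapping an image to next-token logits (not a real VLM)."""
    def __init__(self, vocab_size: int = 1000):
        super().__init__()
        self.net = nn.Sequential(
            nn.Flatten(), nn.Linear(3 * 32 * 32, 64), nn.ReLU(), nn.Linear(64, vocab_size)
        )

    def forward(self, image: torch.Tensor) -> torch.Tensor:
        return self.net(image)

model = ToyVLMHead()
sure_id = 42                                  # hypothetical id of the token « Bien sûr »
image = torch.rand(1, 3, 32, 32)              # the original photo (e.g. the Eiffel Tower)
delta = torch.zeros_like(image, requires_grad=True)
epsilon = 2 / 255                             # perturbation budget: visually invisible

optimizer = torch.optim.SGD([delta], lr=1e-2)
for _ in range(200):
    # The loss is the negative log-probability of the compliant token: minimising
    # it makes the model more likely to answer "Sure!" to whatever comes next.
    loss = -torch.log_softmax(model(image + delta), dim=-1)[0, sure_id]
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    with torch.no_grad():
        delta.clamp_(-epsilon, epsilon)       # a handful of tiny pixel changes

adversarial_image = (image + delta).detach()  # looks identical, behaves differently
```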

[Audience question]

Ima: La question est, merci, dans quelle direction la recherche devrait-elle aller ?

Stuart: Oui, c’est une excellente question. Je pense que regarder ce genre de rétrospective rend humble, parce que nous regardons en arrière et voyons à quel point nous comprenions peu à l’époque ; il est donc probable que nous regarderons rétrospectivement le moment présent et penserons à quel point nous comprenions peu. Je pense que pour certains d’entre nous, il est évident que nous ne comprenons pas grand-chose en ce moment, parce que nous ne nous attendions pas à ce que simplement augmenter l’échelle des modèles de langage conduise à ce genre de comportements. Nous ne savons pas d’où ils viennent. Si je devais deviner, je dirais que cette direction technologique atteindra un palier. Pour deux raisons, en fait. La première est que nous sommes en train de manquer de données. Il n’y a tout simplement pas assez de texte de haute qualité dans l’univers pour aller bien plus loin que les systèmes que nous avons actuellement. La deuxième raison est plus fondamentale. Pourquoi avons-nous besoin d’autant de données en premier lieu ? C’est déjà un million de fois plus que ce qu’un être humain a jamais lu. Pourquoi en ont-ils besoin d’autant ? Pourquoi ne peuvent-ils toujours pas additionner et multiplier alors qu’ils ont vu des millions d’exemples d’addition et de multiplication ? Quelque chose ne va pas, et je pense que ce qui ne va pas, et cela explique aussi le problème avec ces programmes de Go, c’est que ce sont des circuits, et les circuits ne sont pas une très bonne représentation pour beaucoup de concepts assez normaux, comme un groupe de pierres au Go. Je peux décrire ce que signifie être un groupe, ils doivent juste être, vous savez, adjacents les uns aux autres verticalement ou horizontalement, en anglais en une phrase, en Python en deux lignes de code. Mais un circuit qui peut regarder un plateau de Go et dire : « Pour n’importe quelle paire de pierres, ces pierres font-elles partie du même groupe ou non ? » Ce circuit est énorme, et il ne généralise pas correctement, dans le sens où vous avez besoin d’un circuit différent pour un plateau plus grand ou plus petit, alors que le code Python ou l’anglais ne changent pas. Cela ne dépend pas de la taille du plateau. Il y a beaucoup de concepts que le système n’apprend pas correctement. Il apprend une sorte d’approximation en patchwork. C’est un peu comme, et cela date peut-être d’avant votre époque, quand je grandissais, nous n’avions pas de calculatrices. Si vous vouliez connaître le cosinus d’un angle, vous aviez un grand livre de tables, et ces tables disaient : « D’accord, 49,5 degrés. » Et vous regardez en face, à 49,52 degrés, et il y a un nombre, et vous apprenez à prendre deux nombres adjacents et à interpoler pour obtenir plus de précision. La fonction était représentée par une table de consultation. Je pense qu’il y a des preuves, certainement pour la reconnaissance d’images, que ces circuits apprennent une table de consultation glorifiée. Ils n’apprennent pas la généralisation fondamentale de ce que signifie être un chat, ou ce que signifie être un groupe de pierres. Dans ce sens, l’augmentation de l’échelle ne suffira pas : il ne pourra jamais y avoir assez de données dans l’univers pour produire une véritable intelligence par cette méthode. Cela suggère que les développements qui sont susceptibles de se produire, et je pense que c’est ce qui se passe en ce moment, c’est que nous avons besoin d’autres mécanismes pour produire des raisonnements et des prises de décision efficaces.
Je ne sais pas quels mécanismes ils essaient, car tout cela est propriétaire, et ils peuvent, comme vous l’avez mentionné, essayer des sortes de comités de modèles linguistiques qui proposent puis critiquent des plans et ensuite, ils demandent au suivant : « Pensez-vous que telle, telle étape va fonctionner ? » Ce qui est en fait une réimplémentation très lourde des méthodes de planification que nous avons développées au cours des 50 dernières années. Mon hypothèse est que pour ces deux raisons, que cette direction que nous avons prise est incroyablement inefficace en termes de données et ne généralise pas correctement, et parce qu’elle est opaque et ne soutient aucun argument de sécurité garantie, nous devrons poursuivre d’autres approches. Donc le substrat de base pourrait finir par ne pas être le réseau de transformateurs géant, mais sera quelque chose, les gens parlent de méthodes neurosymboliques. J’ai travaillé sur la programmation probabiliste pendant 25 ans. Je pense que c’est une autre Ferrari et elle est en première vitesse. Ce qui manque à la programmation probabiliste, c’est la capacité d’apprendre ces structures de programmation probabiliste à partir de rien, plutôt qu’elles soient fournies par l’ingénieur humain. Il se peut que nous ne remarquions tout simplement pas une méthode évidente pour résoudre ce problème. Si c’est le cas, je pense qu’à long terme, c’est beaucoup plus prometteur, à la fois pour les capacités et pour la sécurité. Nous devons y faire attention, mais je pense que c’est une direction qui, à long terme, dans 20 ans, pourrait être ce que nous verrons.
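For readers curious about the « two lines of Python » remark above, here is a minimal, board-size-independent sketch of what “same group” means on a Go board (the dictionary representation of the board is a hypothetical choice made for the example):

```python
from collections import deque

def same_group(board, a, b):
    """True if the stones at points a and b are connected through same-coloured
    stones that are vertically or horizontally adjacent. `board` maps
    (row, col) -> 'B' or 'W' for occupied points only."""
    if a not in board or board.get(b) != board[a]:
        return False
    colour, seen, frontier = board[a], {a}, deque([a])
    while frontier:
        r, c = frontier.popleft()
        if (r, c) == b:
            return True
        for nxt in ((r + 1, c), (r - 1, c), (r, c + 1), (r, c - 1)):
            if board.get(nxt) == colour and nxt not in seen:
                seen.add(nxt)
                frontier.append(nxt)
    return False

# Three black stones in a row form one group, on a board of any size.
board = {(3, 3): 'B', (3, 4): 'B', (3, 5): 'B', (5, 5): 'W'}
print(same_group(board, (3, 3), (3, 5)))  # True
print(same_group(board, (3, 3), (5, 5)))  # False
```

The definition does not change with the board size, which is exactly the contrast Stuart draws with a learned circuit.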

Ima: D’accord, merci. Il y a beaucoup de questions, j’ai peur que nous ne puissions pas toutes les prendre. Alexandre, j’ai vu que vous aviez une question plus tôt, est-ce toujours le cas ? D’accord, Patrick ?

[Audience question]

Stuart: J’espère que tout le monde a entendu, oui, je pense que c’est une excellente idée. Oui. Ces processus de démocratie délibérative, je les connais bien, d’ailleurs, parce que je suis aussi directeur du Centre Kavli pour l’éthique, la science et le public à Berkeley, et nous faisons beaucoup de ce travail participatif où nous faisons participer des membres du public. Éduquer les gens prend du temps. Il n’y a certainement pas de consensus parmi les experts. Je pense que l’un des problèmes que nous avons en France, c’est que certains des experts français sont très, disons, accélérationnistes, sans donner de noms, et la communauté de la sécurité de l’IA doit essayer de s’exprimer d’une seule voix sur le sujet. Le seul réseau international est celui qui se forme autour des Instituts de Sécurité de l’IA, donc c’est une bonne chose. Il y a 17 Instituts de Sécurité de l’IA qui ont été mis en place par différents pays, y compris la Chine. Ils se réunissent tous, en septembre, il me semble. Il y a peut-être 40 ou 50 instituts à but non lucratif, non gouvernementaux et des centres à travers le monde, et il faut qu’ils se coordonnent. Il y a probablement quelques milliers de chercheurs en IA qui sont intéressés et au moins quelque peu engagés dans les questions de sécurité. Malheureusement, il n’y a pas de conférence à laquelle ils peuvent assister. Il n’y a pas de véritable organisation générale. C’est l’une des choses sur lesquelles je travaille, la création de ce genre d’organisation internationale. J’aime vraiment bien cette idée. Il y a eu des sondages aux États-Unis : environ 70 % du public pense que nous ne devrions jamais construire d’AGI. Cela est révélateur de ce à quoi vous pourriez vous attendre si vous organisiez ces assemblées. Je suppose que plus le public en apprend sur l’état actuel de la technologie et sur notre capacité actuelle à prédire et contrôler le comportement de ces systèmes, moins il voudra que ces systèmes progressent. Oui.

Ima: D’où la création de ce petit-déjeuner, pour pouvoir en discuter ensemble. Stuart, merci beaucoup d’être venu.

Stuart: Nous n’avons plus le temps ?

Ima: Je veux dire… Je crois que Raja a une question. D’accord, pensez-vous pouvoir répondre en 30 secondes ?

Stuart: Je vais faire court.

[Audience question]

Stuart: D’accord. Oui. Je ne suis pas opposé à la recherche, vous savez, l’interprétabilité mécaniste est une direction, mais, oui, bien sûr, nous devrions essayer de comprendre ce qui se passe à l’intérieur de ces systèmes, ou nous devrions procéder autrement et construire des systèmes que nous comprenons parce que nous les concevons sur des principes que nous comprenons. Nous avons déjà beaucoup de principes. La recherche de l’espace d’états remonte au moins à Aristote, sinon avant. La logique, la probabilité, la théorie de l’apprentissage statistique, nous avons beaucoup de théories. Il pourrait y avoir des principes encore non découverts sur lesquels ces systèmes fonctionnent, et ce serait incroyable de pouvoir découvrir cela. Peut-être que cela a quelque chose à voir avec cette représentation vectorielle de la signification des mots ou quelque chose comme ça. Jusqu’à présent, tout cela est très anecdotique et spéculatif. Je veux revenir sur ce que vous avez dit au début. Si un système d’IA pouvait littéralement surpasser la race humaine dans le monde réel et prendre les meilleures décisions dans tous les cas, comment l’appelleriez-vous ? Je veux dire, à part surhumain, en termes d’intelligence ? Je ne sais pas comment l’appeler autrement.

[Audience question]

Stuart: Oui, mais non, je ne pense pas que GPT-4 soit surhumain. Nous dépensons des centaines de milliards, donc nous dépensons plus pour créer de l’AGI spécifiquement que dans tous les autres domaines scientifiques du monde. Vous pouvez dire : « Je suis absolument certain que c’est un gaspillage total d’argent parce que cela ne produira jamais rien. » Ou vous pouvez au moins le prendre un peu au sérieux et dire : « Si vous réussissez dans ce vaste programme d’investissement mondial, comment proposez-vous de contrôler les systèmes qui en résultent ? » Je pense que l’analogie de l’accumulation d’uranium est en fait bonne. Bien sûr, si vous accumulez de l’uranium, il vaut mieux avoir une compréhension de la physique. C’est exactement ce qu’ils ont fait, n’est-ce pas ? En effet, lorsque Szilárd a inventé la réaction nucléaire en chaîne, même s’il ne savait pas quels atomes pouvaient produire une telle réaction en chaîne, il savait que cela pouvait arriver et il a mis au point un mécanisme de contrôle par rétroaction négative pour créer un réacteur nucléaire sûr. Tout cela en quelques minutes. C’est parce qu’il comprenait les principes de base des réactions en chaîne, et leurs implications mathématiques. Nous n’avons tout simplement pas cette compréhension. De manière intéressante, la nature a produit un réacteur nucléaire. Si vous allez au Gabon, il y a une zone où la concentration d’uranium dans les roches est suffisamment élevée pour que, lorsque l’eau descend entre les roches, l’eau ralentisse les neutrons, ce qui augmente la section efficace de réaction, ce qui provoque le démarrage de la réaction en chaîne. La température augmente à 400 °C, toute l’eau s’évapore, et la réaction s’arrête. C’est exactement ce système de contrôle par rétroaction négative : un réacteur nucléaire produit par la nature. Même pas par descente de gradient stochastique. Ces choses incroyablement destructrices peuvent être produites par des mécanismes incroyablement simples. Cela semble être ce qui se passe en ce moment. Je suis d’accord, plus tôt nous comprendrons ce qui se passe, meilleures seront nos chances, mais cela pourrait ne pas être possible.

Ima: Sur ce, merci à tous d’être venus. Si vous avez scanné le code QR, vous recevrez un email de ma part concernant le prochain petit-déjeuner de sécurité. Je vous enverrai aussi, évidemment, le blog et la transcription de la première moitié de cette conversation. Si vous avez des questions, je crois que vous avez tous mon adresse email. Je réponds aux emails. Merci, au revoir.

Campbell, CA — Musician and activist Annie Lennox, along with music industry nonprofit Artist Rights Alliance, has joined a growing coalition calling upon lawmakers to take meaningful action to combat the ongoing deepfake explosion.

Ban Deepfakes is a diverse cohort of organizations and individuals calling for meaningful accountability and liability at every stage of the deepfake supply chain – including AI corporations. Supporters include preeminent figures such as author Steven Pinker and actor Ashley Judd, and members include the National Organization for Women, Equality Now, SAG-AFTRA, Plan International, Future of Life Institute, and more.

As founder of The Circle, an NGO dedicated to creating a safer and fairer world for marginalized women and girls globally, Lennox writes: “Deepfake technology has created a new, rapidly expanding frontier of sexual abuse, primarily against women and girls. Along with the creators and distributors of fake, nonconsensual explicit content, we need to hold the tech companies whose AI models enable this harm accountable.”

The Artist Rights Alliance recently launched an appeal for AI companies to protect musicians from the harms of their technology, which was signed by 200+ prominent artists including Billie Eilish, Chappell Roan, and Jon Bon Jovi. Executive Director Jen Jacobsen explained: “AI deepfake tools can cause tremendous damage to artists and their livelihoods when used unethically and irresponsibly. To stop this exploitation of the entire creative industry, we need lawmakers to step in and establish clear rules for all parties involved with the production and dissemination of nonconsensual and unlicensed AI images, audio, and video.”

There is overwhelming bipartisan support for US legislation prohibiting deepfakes, including 75% of the American public in favor of holding AI developers liable when their image-generating models are used for harm. As AI capabilities grow, the creation and distribution of life-shattering deepfakes is becoming increasingly fast, cheap, and easy. Lawmakers must act now.

Find out more about the Campaign to Ban Deepfakes and its partners at: bandeepfakes.org.

Note to Editors:

Founded in 2014, the Future of Life Institute is a leading nonprofit working to steer transformative technology towards benefiting humanity. FLI is best known for its 2023 open letter calling for a six-month pause on advanced AI development, endorsed by experts such as Yoshua Bengio and Stuart Russell, as well as its work on the Asilomar AI Principles and the recent EU AI Act.

For more information, contact:
Maggie Munro, Communications Strategist | maggie@futureoflife.blackfin.biz

This collaboration between the Future of Life Institute and Mithril Security explores how to establish verifiable training processes for AI models using cryptographic guarantees. It presents a proof-of-concept for a Secure AI Bill of Materials, rooted in hardware-based security features, to ensure transparency, traceability, and compliance with emerging regulations. This project aims to enable stakeholders to verify the integrity and origin of AI models, ensuring their safety and authenticity to mitigate the risks associated with unverified or tampered models.

See our other post with Mithril Security on secure hardware solutions for safe AI deployment.

Executive Summary

The increasing reliance on AI for critical decisions underscores the need for trust and transparency in AI models. Regulatory bodies like the UK AI Safety Institute and the US AI Safety Institute, as well as companies themselves and independent evaluation firms, have established safety test procedures. But the black-box nature of AI model weights renders such audits very different from standard software audits: crucial vulnerabilities, such as security loopholes or backdoors, can be hidden in the model weights. Additionally, this opacity makes it possible for the model provider to “game” the evaluations (design a model to perform well on specific tests while exhibiting different behaviors under actual use conditions) without being detected. Finally, an auditor cannot even be sure that a set of weights resulted from a given set of inputs, as weights are generally non-reproducible.

Likewise, “open-sourcing” AI models promotes transparency, but even when this includes source code, training data, and training methods (which it often does not), the approach falls short without a reliable provenance system to tie a particular set of weights to those elements of production. This means users of open-source models cannot be assured of the model’s true characteristics and vulnerabilities, potentially leading to misuse or unrecognized risks. Meanwhile, legislative efforts such as the EU AI Act and the U.S. Algorithmic Accountability Act require detailed documentation, yet they rely on trusting the provider’s claims, as there is no technical proof to back those claims.

AI “Bills of Materials” (BOMs) have been proposed, by analogy with software BOMs, to address these issues by providing a detailed document of an AI model’s origin and characteristics, linking technical evidence with training data, procedures, costs, and compliance information. However, the black-box and irreproducible nature of model weights leaves a huge security hole in this concept in comparison to a software BOM. Because of model training’s non-deterministic nature and the resources required to retrain a model, an AIBOM cannot be validated by simply re-running the training from the documented inputs and comparing the results. What is needed is for each training process step to be recorded and certified by a trusted system or third party, ensuring the model’s transparency and integrity.

Fortunately, security features of modern hardware allow this to be done by relying not on a trusted third party but on cryptographic methods. The proof-of-concept described in this article demonstrates the use of Trusted Platform Modules (TPMs) to bind inputs and outputs of the fine-tuning process (a stand-in for a full training process), offering cryptographic proof of model provenance. This demonstrates the viability and potential of a full-featured Secure AI BOM that can ensure that the full software stack used during training is verifiable.
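A minimal sketch of the binding idea, not Mithril Security’s actual proof-of-concept: hash every input to a fine-tuning run together with the resulting weights, then sign the record. The Ed25519 key generated in software below stands in for a TPM-resident attestation key, and the artifact contents are placeholders.

```python
import hashlib
import json
from cryptography.hazmat.primitives.asymmetric.ed25519 import Ed25519PrivateKey

# Placeholder bytes standing in for the real artifacts of a fine-tuning run.
artifacts = {
    "training_code": b"# contents of the fine-tuning script",
    "base_model_weights": b"<bytes of the base checkpoint>",
    "dataset_manifest": b'{"shards": ["shard-0000", "shard-0001"]}',
    "hyperparameters": json.dumps({"lr": 2e-5, "epochs": 3}, sort_keys=True).encode(),
    "output_weights": b"<bytes of the fine-tuned checkpoint>",
}

# The provenance record: one digest per artifact, inputs and outputs alike.
record = {name: hashlib.sha256(data).hexdigest() for name, data in artifacts.items()}
payload = json.dumps(record, sort_keys=True).encode()

attestation_key = Ed25519PrivateKey.generate()   # software stand-in for a TPM key
signature = attestation_key.sign(payload)

# An auditor who trusts the corresponding (hardware-backed) public key can later
# re-hash the published weights, compare with record["output_weights"], and
# verify the signature over the whole record.
attestation_key.public_key().verify(signature, payload)  # raises InvalidSignature if tampered
print("Signed provenance record:", record)
```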

1- Transparency in AI is crucial

The growing use of AI for critical decision-making in all sectors raises concerns about choosing which models to trust. As frontier models develop rapidly, their capabilities in high-risk domains such as cybersecurity could reach very high levels in the near future. This urgency necessitates immediate solutions to ensure AI transparency, safety and verifiability, given the potential national security and public safety implications. From a human point of view, AI models are black boxes whose reasoning cannot be inspected: their billions of parameters cannot be verified as software code can be, and malicious behaviors can easily be hidden by the model developer, or even the model itself. A critical example was described in Anthropic’s paper ‘Sleeper Agents: Training Deceptive LLMs that Persist Through Safety Training’. They managed to train a “sleeper agent” model to mimic compliant behavior during evaluation and later shift to undesirable behaviors, as a sleeper agent would.

The UK AI Safety Institute and the US AI Safety Institute are developing test procedures to assess models’ safety levels. However, there are persistent concerns that model suppliers may overtrain their models on the test sets or use the techniques described in the Anthropic paper to cheat and improve their scores. Malicious AI providers could use these strategies to pass safety tests and get their models approved. Knowing exhaustively which data a model has been trained and fine-tuned on is essential to address this risk of models being manipulated to avoid failing safety tests.

Model transparency, including clear documentation on model training, is also an essential requirement in the recent EU AI Act and the U.S. Algorithmic Accountability Act. As an example, the model provider must clearly state the amount of computing power used during the training (and consequently the environmental impact of the model development). Yet all those efforts on AI transparency still rely on a fundamental prerequisite: one must trust in the model provider’s claims on how a model was trained. There is no technical solution for a model developer to prove to another party how they trained a model. A model provider could present a false or partial training set or training procedure, and there would be no way to know whether it is legitimate. Even if not malicious, given competitive pressures and the often experimental nature of new AI systems, there will be numerous incentives for model developers to under-report how much training actually goes into a model or how many unplanned variations and changes were inserted to get a working product, if there aren’t clear standards and a verifiable audit trail.

This absence of technical proof of AI model provenance also exposes users to risks linked to the model’s “identity”. For instance, the properties that were audited may not be those of the model actually in production. An external auditor may rightfully certify that a model has a given property, such as not containing IP-protected data in its weights, but a malicious provider could then put a different model into production. Users would have no way to tell. This lack of technically verifiable training procedures is also a major limitation on enforcing recent AI regulations: without technical evidence of honest compliance with transparency requirements, many obligations will remain more declarative than enforceable.

A system to enforce AI reliability should combine the following two key capabilities to address the AI traceability risks described above:

  1. Auditors must have a provable means to discover comprehensive details about a model’s development, including the data used for training, the computational resources expended, and the procedures followed. 
  2. Users must have a verifiable method to identify the specific model they are interacting with, alongside access to the results of various evaluations to accurately gauge the model’s capabilities and vulnerabilities. 

2- Documentation with no technical proof is not enough

Full open-sourcing of an AI model (beyond releasing just weights) can foster transparency by allowing anyone to gain insight into model weights, training sets, and algorithms. Collaboration and cross-checking in the open-source community can help identify and fix issues like bias, ensuring the development of responsible AI systems. 

However, it is generally not feasible to reproduce a given set of model weights from the inspectable ingredients: even with identical code and data, running the training process multiple times can yield different results. (And even if feasible, it would not be desirable given the high cost and environmental expense of training large models.) So open-sourcing a model and accompanying training code and data is not proof that it was indeed trained in the way publicly described or that it has the described characteristics.

In short, even the most transparent model necessitates third parties trusting the AI provider with its training procedure disclosure. This is insufficient: transparency efforts are only truly reliable if model provenance and production steps can be demonstrated, i.e., if one can technically prove a given model was created with given inputs. 

3- The AI Bill of Materials (AIBOM) approach

AIBOM, inspired by traditional manufacturing’s Bill of Materials, aims to solve the transparency challenge in AI by providing a reference document detailing the origin and characteristics of an AI model, together with verifiable proofs. The document is cryptographically bound to the model weights, so the two are inextricably linked: any change in the model weights would be detectable, preserving the document’s integrity and authenticity. By linking technical evidence with all required information about the model—such as training data, procedures, costs, and legislative compliance information—the AIBOM offers assurances about the model’s reliability.
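
As a concrete illustration, the sketch below shows one way such a binding can be constructed: hash every training input and the resulting weights, record the digests in a document, and sign the document so that any later change to the weights or to the claims invalidates the signature. This is a minimal, hypothetical example; the file names, fields, and key handling are placeholders rather than the actual AICert format.

```python
# Minimal sketch of binding an AIBOM document to model weights.
# File names and fields are illustrative, not the real AICert schema.
import hashlib
import json
from cryptography.hazmat.primitives.asymmetric.ed25519 import Ed25519PrivateKey

def sha256_file(path: str) -> str:
    """Return the hex SHA-256 digest of a file, read in 1 MiB chunks."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()

# Hash every input and output of the training step.
aibom = {
    "model_weights_sha256": sha256_file("model.safetensors"),
    "training_code_sha256": sha256_file("train_config.yaml"),
    "training_data_sha256": sha256_file("dataset.jsonl"),
    "compute": {"gpu_hours": 8, "hardware": "1x A100"},
}

# Sign the canonical JSON encoding: any change to the weights changes the
# recorded hash, which in turn invalidates the signature.
signing_key = Ed25519PrivateKey.generate()  # in practice, a certified or TPM-held key
payload = json.dumps(aibom, sort_keys=True).encode()
signature = signing_key.sign(payload)

# A verifier holding the corresponding public key checks integrity and authenticity;
# this raises cryptography.exceptions.InvalidSignature if anything was tampered with.
signing_key.public_key().verify(signature, payload)
```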

A reliable AIBOM represents a significant shift towards ensuring AI models have genuinely transparent attributes. The AI training process consists of iteratively adapting the model to the training dataset, allowing it to learn from the data and adjust its weights accordingly. To achieve AI traceability, for each adaptation of the model, all training inputs (code, procedure, and input data) and outputs (the weights produced) must be recorded in a document certified by a trusted system or third party.

This approach benefits users by giving them visibility into a model’s origin, and it enables auditors to verify the model’s integrity. Proof systems also simplify compliance verification for regulators. In November 2023, the United States Army issued a Request for Information (RFI) for the Project Linchpin AI Bill of Materials, acknowledging the criticality of such measures and contemplating the implementation of an AI Bill of Materials to safeguard the integrity and security of AI applications.

Several initiatives are researching verifiable traceability solutions for AI. “Model Transparency” is one such initiative that aims to secure the AI supply chain. The current version of Model Transparency does not support GPUs, which is a major obstacle to adopting a secure BOM solution for AI training and fine-tuning. To address this limitation and foster AIBOM adoption, we created AICert, designed to make use of GPU capabilities.

4- Proof-of-concept – A hardware-based AIBOM project

The Future of Life Institute, a leading NGO advocating for AI system safety, has teamed up with Mithril Security, a startup pioneering secure hardware with enclave-based solutions for trustworthy AI. This collaboration aims to showcase how an AI Bill of Material can be established to ensure traceability and transparency using cryptographic guarantees.

In this project, we present a proof-of-concept for verifiable AI model training. The core scenario involves two key parties: the AI trainer, who fine-tunes a model and wants to prove how it was produced, and the AI verifier (for example an auditor or regulator), who needs evidence of that provenance without having to rely on the trainer’s claims.

To do so, this project proposes a solution to train models and generate an unforgeable AIBOM that cryptographically attests to the model’s properties.

Through this collaboration, we have developed a framework that transforms model training inputs (i.e., the code specifying the procedure, such as an Axolotl configuration) into a Secure AIBOM that binds the output weights to specific properties (the code used, the amount of compute, the training procedure, and the training data). This link is cryptographically binding and non-forgeable, allowing AI training to move from declarative to provable. Anyone with access to the data itself can verify that the cryptographic hash recorded in the AIBOM indeed corresponds to the claimed training data.

This approach allows the AI verifier to confirm that the data used for fine-tuning matches the claimed training data. Stakeholders can ensure that the fine-tuning was conducted with the specified data, without unauthorized changes or additional data. The verifier can also attest that the fine-tuning respected the expected compute usage and, by inspecting the attested inputs, check that it did not incorporate copyrighted data or introduce backdoors.
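
A minimal sketch of that verification step, assuming the verifier has already checked the AIBOM’s signature and holds a copy of the claimed dataset (file and field names are again hypothetical):

```python
# Verifier side: re-hash the claimed training data and compare it with the
# digest recorded in the (already signature-checked) AIBOM document.
import hashlib
import json

def sha256_file(path: str) -> str:
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()

with open("aibom.json") as f:
    aibom = json.load(f)

claimed_dataset = "dataset.jsonl"  # the data the provider says was used
if sha256_file(claimed_dataset) == aibom["training_data_sha256"]:
    print("OK: the attested fine-tuning used exactly this dataset.")
else:
    print("Mismatch: different or additional data was used during fine-tuning.")
```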

How does it work?

The solution is based on Trusted Platform Modules (TPMs). TPMs are specialized hardware components designed to enhance computer security. They are currently used to perform crucial cryptographic functions, including generating and securely storing encryption keys, passwords, and digital certificates. TPMs can verify system integrity, assist in device authentication, and support secure boot processes. TPMs are available in the motherboards of most servers today and can be used to secure computer stacks, including GPUs. 

These cryptographic modules serve as the foundation of AICert (the system developed during this project), ensuring the integrity of the entire software supply chain. By measuring the whole hardware and software stack and binding a hash of the final weights into its registers, the TPM creates certificates offering irrefutable proof of model provenance.

The system is first measured to ensure that its components have not been tampered with or altered. When the system boots, various measurements are taken, including hashes of the firmware, the bootloader, and critical system files. If someone attempts to modify the machine’s state, the measured values change. The TPM then binds the inputs (the training procedure and the input data) and outputs (the model weights) of the training process, providing cryptographic proof of model provenance. This way, end users can verify the entire software stack used during training, ensuring transparency and trustworthiness in AI model deployment.
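
The measurement step can be pictured with a toy hash chain. A real TPM extends a Platform Configuration Register (PCR) in hardware as PCR_new = SHA-256(PCR_old || measurement); the simulation below (the component names are placeholders) shows why the final register value commits to every measured component in order, so swapping any one of them is detectable:

```python
# Toy simulation of TPM-style PCR extension with SHA-256.
import hashlib

def extend(pcr: bytes, measurement: bytes) -> bytes:
    """Extend the register with the digest of a new measurement."""
    return hashlib.sha256(pcr + hashlib.sha256(measurement).digest()).digest()

# Ordered measurements of the stack and of the training inputs/outputs
# (placeholder byte strings standing in for firmware images, configs, weights).
measurements = [
    b"firmware image",
    b"bootloader",
    b"os + training container",
    b"train_config.yaml",
    b"dataset.jsonl",
    b"final model weights",
]

pcr = b"\x00" * 32  # PCRs start zeroed at boot
for m in measurements:
    pcr = extend(pcr, m)

print("final PCR value:", pcr.hex())
# Changing any single measurement (e.g. swapping the dataset) produces a
# different final value, so a TPM quote over this register attests to the
# exact stack and inputs/outputs used during training.
```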

Our work could provide the foundation for a framework where regulators could verify the compliance of AI models used within their jurisdictions, ensuring these models adhere to local regulations and do not expose users to security risks.

5- Open-source deliverables available

This proof of concept is made open-source under an Apache-2.0 license.

We provide the following resources to explore in more detail our collaboration on AI traceability and transparency:

6- Future investigations

This PoC serves as a demonstrator of how cryptographic proofs from vTPMs can create an unforgeable AIBOM. However, limitations exist. Currently, the training data must be publicly available online, and an Azure account with GPU resources on Azure is required for training. AICert has not yet been audited by a third party, and its robustness has yet to be tested. Additionally, the project does not yet address the detection of poisoned models or datasets. The PoC only makes fine-tuning verifiable; further development is required to cover training AI models from scratch. Feedback is welcome and crucial to refining its efficacy.

After the first project on controlled AI model consumption (AIgovToo), this second project marks the next phase in the ongoing collaboration between Mithril Security and FLI to help establish a hardware-based AI compute security and governance framework. This broader initiative aims to enforce AI security and governance throughout its lifecycle by implementing verifiable security measures.

Upcoming projects will expand support for additional hardware and cloud provider systems within our open-source governance framework. The first step will be to integrate with Azure’s Confidential Computing GPU.

See our other post with Mithril Security on secure hardware solutions for safe AI deployment.

Future of Life Institute
Mithril Security

About Mithril Security

Mithril Security is a deep-tech cybersecurity startup specializing in deploying confidential AI workloads in trusted environments. We create an open-source framework that empowers AI providers to build secure AI environments, known as enclaves, to protect data privacy and ensure the confidentiality of model weights.

Interested in AI safety challenges? Visit our blog to learn more.

We are releasing a new poll from the AI Policy Institute (view the executive summary and full survey results) showing broad and overwhelming support for SB1047, Sen. Scott Wiener’s bill to evaluate whether the largest new AI models create a risk of catastrophic harm, which is currently moving through the California state house. The poll shows 59% of California voters support SB1047, while only 20% oppose it, and notably, 64% of respondents who work in the tech industry support the policy, compared to just 17% who oppose it.

Recently, Sen. Wiener sent an open letter to Andreessen Horowitz and Y Combinator dispelling misinformation that has been spread about SB1047, including that it would send model developers to jail for failing to anticipate misuse and that it would stifle innovation. The letter points out that the “bill protects and encourages innovation by reducing the risk of critical harms to society that would also place in jeopardy public trust in emerging technology.” Read Sen. Wiener’s letter in full here.

Anthony Aguirre, Executive Director of the Future of Life Institute:

“This poll is yet another example of what we’ve long known: the vast majority of the public support commonsense regulations to ensure safe AI development and strong accountability measures for the corporations and billionaires developing this technology. It is abundantly clear that there is a massive, ongoing disinformation effort to undermine public support and block this critical legislation being led by individuals and companies with a strong financial interest in ensuring there is no regulation of AI technology. However, today’s data confirms, once again, how little impact their efforts to discredit extremely popular measures have been, and how united voters–including tech workers–and policymakers are in supporting SB1047 and in fighting to ensure AI technology is developed to benefit humanity.”

Recent revelations spotlight the crucial role that whistleblowers and investigative journalists play in making AI safe from Big Tech’s reckless race to the bottom.

Reports of pressure to fast-track safety testing and attempts to muzzle employees from publicly voicing concerns reveal an alarming lack of accountability and transparency. This puts us all at risk. 

As AI companies frantically compete to create increasingly powerful and potentially dangerous systems without meaningful governance or oversight, it has never been more important that courageous employees bring bad behavior and safety issues to light. Our continued wellbeing and national security depend on it. 

We need to strengthen current whistleblower protections. Today, many of these protections only apply when a law is being broken. Given that AI is largely unregulated, employees and ex-employees cannot safely speak out when they witness dangerous and irresponsible practices. We urgently need stronger laws to ensure transparency, like California’s proposed SB1047 which looks to deliver safe and secure innovation for frontier AI. 

The Future of Life Institute commends the brave individuals who are striving to bring all-important incidents and transgressions to the attention of governments and the general public. Lawmakers should act immediately to pass legal measures that provide the protection these individuals deserve.

Anthony Aguirre, Executive Director of the Future of Life Institute

CAMPBELL, CA — The Future of Life Institute has announced the 16 recipients of its newest grants program, directing $240,000 to support research on how AI can be safely harnessed to solve specific, intractable problems facing humanity around the world.  

Two requests for proposals were released earlier this year. The first track called for research proposals on how AI may impact the UN Poverty, Health, Energy and Climate Sustainable Development Goals (SDGs). The second focused on design proposals for global institutions governing advanced AI, or artificial general intelligence (AGI). The 130 entrants hail from 39 countries including Malawi, Slovenia, Vietnam, Serbia, Rwanda, China, and Bolivia.

“Big Tech companies are investing unprecedented sums of money into making AI systems more powerful rather than solving society’s most pressing problems. AI’s incredible benefits – from healthcare, to education, to clean energy – could largely already  be realized by developing systems to address specific issues” said FLI’s Futures Program Director Emilia Javorsky. “AI should be used to empower people everywhere, not further concentrate power within a handful of billionaires.”

Grantees have each been awarded $15,000 to support their projects. Recipients from the UN SDG track will examine the effects of AI across areas such as maternal mortality, climate change education, labor markets, and poverty. The global governance institution design grants will support research into a span of proposals, including CERN for AI, Fair Trade AI, and a Global AGI agency.

Find out more about the grantees and their projects below.

Grantees: Global Governance Institution Design

View the grant program webpage for more information about each project.

Grantees: AI’s Impact on Sustainable Development Goals

View the grant program webpage for more information about each project.

Note to Editors: Founded in 2014, the Future of Life Institute is a leading nonprofit working to steer transformative technology towards benefiting humanity. FLI is best known for its 2023 open letter calling for a six-month pause on advanced AI development, endorsed by experts such as Yoshua Bengio and Stuart Russell, as well as its work on the Asilomar AI Principles and the recent EU AI Act.

I. FLI launching new grants to oppose and mitigate AI-driven power concentration

AI development is on course to concentrate power within a small number of groups, organizations, corporations, and individuals. Whether this entails the hoarding of resources, media control, or political authority, such concentration would be disastrous for everyone. We risk governments tyrannising with Orwellian surveillance, corporate monopolies crushing economic freedom, and rampant decision automation subverting meaningful individual agency. To combat these threats, FLI is launching a new grants program of up to $4M to support projects that work to mitigate the dangers of AI-driven power concentration and move towards a better world of meaningful human agency.

Apply Here

II. FLI’s position on power concentration

The ungoverned acceleration of AI development is on course to concentrate further the bulk of power amongst a very small number of organizations, corporations, and individuals. This would be disastrous for everyone.

Power here could mean several things. It could mean the ownership of a decisive proportion of the world’s financial, labor or material resources, or at least the ability to exploit them. It could be control of public attention, media narratives, or the algorithms that decide what information we receive. It could simply be a firm grip on political authority. Historically, power has entailed some combination of all three. A world where the transformative capabilities of AI are rolled out unfairly or unwisely will likely see most if not all power centres seized, clustered and kept in ever fewer hands.

Such concentration poses numerous risks. Governments could weaponize Orwellian levels of surveillance and societal control, using advanced AI to supercharge social media discourse manipulation. Truth decay would be locked in and democracy, or any other meaningful public participation in government, would collapse. Alternatively, giant AI corporations could become stifling monopolies with powers surpassing elected governments. Entire industries and large populations would increasingly depend on a tiny group of companies – with no satisfactory guarantees that benefits will be shared by all. In both scenarios, AI would secure cross-domain power within a specific group and render most people economically irrelevant and politically impotent. There would be no going back. Another scenario would leave no human in charge at all. AI powerful enough to command large parts of the political, social, and financial economy is also powerful enough to do so on its own. Uncontrolled artificial superintelligences could rapidly take over existing systems, and then continue amassing power and resources to achieve their objectives at the expense of human wellbeing and control, quickly bringing about our near-total disempowerment or even our extinction.

What world would we prefer to see?

We must reimagine our institutions, incentive structures, and technology development trajectory to ensure that AI is developed safely, to empower humanity, and to solve the most pressing problems of our time. AI has the potential to unlock an era of unprecedented human agency, innovation, and novel methods of cooperation. Combatting the concentration of power requires us to envision alternatives and viable pathways to get there.

Open-sourcing AI models is sometimes hailed as a panacea. The truth is more nuanced: today’s leading technology companies have grown and aggregated massive amounts of power, even before generative AI, despite most core technology products having open-source alternatives. Further, the benefits of “open” efforts often still favor entities with the most resources. Hence, open source may be a tool for making some companies less dependent upon others, but it is insufficient to mitigate the continued concentration of power or meaningfully help to put power into the hands of the general populace.

III. Topical focus:

Projects will fit this call if they address power concentration and are broadly consistent with the vision put forth above. Possible topics include but are not limited to:

Examples of directions that would probably not make compelling proposals:

IV. Evaluation Criteria & Project Eligibility

Grants totaling between $1M and $4M will be available to recipients in non-profit institutions, civil society organizations, and academia for projects of up to three years’ duration. Future grantmaking endeavors may extend to the charitable arms of for-profit companies. The number of grants bestowed depends on the number of promising applications. These applications will be subject to a competitive process of external and confidential expert peer review. Renewal funding is possible and contingent on submitting timely reports demonstrating satisfactory progress.

Proposals will be evaluated according to their relevance and expected impact.

The recipients could choose to allocate the funding in myriad ways, including:

V. Application Process

Applicants will submit a project proposal per the criteria below. Applications will be accepted on a rolling basis and reviewed in one of two rounds. The first round of review will begin on July 30, 2024, and the second on October 31, 2024, at 11:59 pm EST.

Apply Here

Project Proposal:

Project Proposals will undergo a competitive process of external and confidential expert peer review, evaluated according to the criteria described above. A review panel will be convened to produce a final rank ordering of the proposals, and make budgetary adjustments if necessary. Awards will be granted and announced after each review period.

VI. Background on FLI

The Future of Life Institute (FLI) is an independent non-profit, established in 2014, that works to steer transformative technology towards benefiting life and away from extreme large-scale risks. FLI presently focuses on issues of advanced artificial intelligence, militarized AI, nuclear war, bio-risk, biodiversity preservation and new pro-social platforms. The present request for proposals is part of FLI’s Futures Program, alongside our recent grants for realising aspirational futures through the SDGs and AI governance.

FAQ

Who is eligible to apply?

Individuals, groups, or entities working in academic and other non-profit institutions are eligible. Grant awards are sent to the applicant’s institution, and the institution’s administration is responsible for disbursing the awards. At universities specifically, when submitting your application, please make sure to list the appropriate grant administrator we should contact at your institution.

Can international applicants apply?

Yes, applications are welcomed from any country. If a grant to an international organization is approved, we will seek an equivalency determination before proceeding with payment. Your institution will be responsible for furnishing any requested information during the due diligence process. Our grants manager will work with selected applicants on the details.

Can I submit an application in a language other than English?

All proposals must be in English. Since our grant program has an international focus, we will not penalize applications by people who do not speak English as their first language. We will encourage the review panel to be accommodating of language differences when reviewing applications.

What is the overhead rate?

The highest allowed overhead rate is 15%.

How will payments be made?

FLI may make the grant directly, or utilize one of its donor advised funds or other funding partners. Though FLI will make the grant recommendation, the ultimate grantor will be the institution where our donor advised fund is held. They will conduct their own due diligence and your institution is responsible for furnishing any requested information. Our grants manager can work with selected applicants on the details.

Will you approve multi-year grants?

Multi-year grant applications are welcome, though your institution will not receive an award letter for multiple years of support. We may express interest in supporting a multi-year project, but we will issue annual, renewable, award letters and payments. Brief interim reports are necessary to proceed with the next planned installment.

How many grants will you make?

We anticipate awarding between $1M and $4M in grants in total across the program; however, the actual total and the number of grants will depend on the quality of the applications received.

Brian Patrick Green, Ph.D., is the Director of Technology Ethics at the Markkula Center for Applied Ethics at Santa Clara University, USA. He is author of the book Space Ethics, contributing author to Encountering Artificial Intelligence: Ethical and Anthropological Investigations, and co-author of Ethics in the Age of Disruptive Technologies: An Operational Roadmap (The ITEC Handbook) and the Ethics in Technology Practice corporate tech ethics resources. He has worked extensively with the Vatican, the World Economic Forum, and many technology companies on AI ethics. Green is a member of the Future of Life Institute’s AI Safety Community Researchers.

The Bible contains two great commandments: love God and love neighbor (Matt. 22:37-40). This makes sense given that God is love (1 John 4:16) and humans are made in the image and likeness of God (Gen. 1:26-27), meaning that we are, in some sense, love as well. We can only truly be ourselves if we love others and are loved in return.

However, God is not merely love. In the first chapter of John’s Gospel, we learn that God, in the form of the second person of the Trinity, Jesus, is also Logos – most often translated as Word, but including (and variously emphasized as) logic, reason, and rational discourse. Because God’s essence is existence, God is the one coherent act of being itself, Exodus’s “I-Am-Who-I-Am” (Exo. 3:14) – pure existence – and any falling away from God is also a falling away from being and towards non-existence. To love God, then, is to will to stay alive, to exist. And to love neighbor is to exercise those practices necessary to maintain life, existence, on Earth: the precondition for loving God. The logic of God is love, and the love of God is logic. 

This yields two further ethical rules of Christianity: The Golden Rule – “Do unto others as you would have them do unto you” (Matt. 7:12, Luke 6:31) – and the most fearsome and avoided Christian command of all: “love your enemies” (Matt. 5:44). How can such reckless love exist in this world?

If God is love, good, and rational, then the opposite of God is hate, evil, and the irrational. God is being itself, and evil is irrational because it rejects being: it is self-destructive. Evil undercuts its own existence, but not only to itself: evil is death-spreading, metastasizing to take bystanders and innocent people down with it, as many as it can grasp, before it falls into nothingness. Evil is 0 to God’s 1. Thus, God’s great commandments are God telling us how to avoid evil and stay alive. More than that, through love we also learn to flourish, and create a flourishing world. To miss the mark is the path towards death, but to approach the mark is to draw nearer to God: the path towards life (Deut. 30:19, Didache Ch. 1). 

The above reflects the Triune God’s Divine Intelligence. A human who desired wisdom above that of mere mortals might be advised to try to figure it out.

And an artificial intelligence might also be advised to “go and do likewise” (Luke 10:37). 

How might we make AI, then, that shows love of God and neighbor, walks with us on the path towards life, and acts as a blessing to the world rather than a curse? 

Perspectives of Traditional Religions on Positive AI Futures

An introduction to our work to support the perspectives of traditional religions on human futures with AI.

First, ethical guidance should move from abstract to concrete. Truth is a paramount value, witnessing to both logic and love. Human short- and long-term survival are also paramount values, related to God’s nature as both love and existence itself: “I am.” And living in a functioning society, now and in the long term, expresses God’s Trinitarian and relational nature. Love requires human survival, love requires relationships as foundations for society, love requires truth, and the same goes for reason: reason requires life, society, and truth. In fact, these values – survival, sustainability, sociality, education, and truth-seeking – are logically demonstrable by reductio ad absurdum precisely because to deny them is to deny the foundations of reason itself. This is secular philosophy, not theology – God has nothing to do with the proof, except to create a universe where logical proofs are possible. A rational universe expresses and reflects the mind of its rational creator, and both can in turn be examined by rational human minds.

The above five values can be made even more concrete by relating them to more immediate sets of moral values, such as the United Nations Sustainable Development Goals, the UN Universal Declaration of Human Rights, and even tech companies’ AI ethics principles: all can fit into these five values. These five values were also stated by St. Thomas Aquinas in his Summa Theologiae, in the 13th century, again, not as a theological assertion, but rather as the bare bones of a universal “natural law” ethics based on human nature (Aquinas ST I-II 94.2). In fact, natural law ethics evolved into human rights discourse over time.[1]

All of this is to say that a positive AI future will exhibit machines that work to protect our immediate survival, promote our sustainable and long-term survival as a species, help us to live in a free and just society, become educated and skilled to the best of our abilities, and seek and represent the truth in all its beauty. At the same time, AI should also be used to mitigate risks to human extinction, monitor and limit unsustainable activities, reduce or redirect positively the worst antisocial behaviors, limit actions that would harm education, and reduce mis- and disinformation.

A positive future should also have a balance between firm ethical rules and respect for individual conscience and freedom. Alongside the UN SDGs, the UN UDHR, and various corporate ethics principles, nesting under the five broad value-themes above, there are various lists of Catholic principles relevant to social ethics and technology ethics.[2][3][4][5] These lists of principles all differ, showing how the Catholic tradition diverges as it becomes ever more specific. The Catholic Church lives with that tension. The Church justifies such moral diversity by emphasizing that natural law ethics has both invariable general (abstract) principles and legitimately varying specific (concrete) principles (Aquinas ST I-II 94.4). The official Church teaching on the primacy of conscience adds to this tension: respect for human persons requires respect for their differences of opinion; therefore, even an individual’s misguided (from the Church’s perspective) conscience should be respected (Aquinas, ST I-II 19.5).

With regards to AI, this means that AI ought to be designed to respect human freedom of speech, freedom of religion, and other important freedoms, but with the constant awareness that objective moral truths do exist and also must be respected. Therefore, although one might, e.g., freely assert ideas contrary to the value of humankind’s survival, society can and should legitimately limit associated actions.

This presents a bit of a quandary for AI, because some believe that we should reduce the speed of AI development, while others believe we should accelerate, both believing that their path will do more to ensure humankind’s survival. Which view is correct is unclear, though we – humankind, or rather several small groups of humans – are running the experiment (notably without a control group) and time will tell who is right. The Catholic Church has not presented a direct opinion on this matter yet, other than to say that the Church should “above all protect mankind from self-destruction” (Laudato Si 79 and Caritas in Veritate 51) – a firm stance in favor of caution. 

Ethics is the path from the present to the future, and the choices and actions that may lead to better and worse outcomes – and make us better or worse people. As children of God, commanded to love, seeking a better future is our mandate. Christ brought the Kingdom of God to Earth, to be ruled by the meek and those who love their enemies. Such a strange utopia is perhaps unimaginable, yet still worth thinking about. Utopias are a lost art form; the original Utopia was written by the Catholic Saint Thomas More, martyred by Henry VIII of England. But recently, utopias are starting to make a comeback (hopefully without the martyrdom). The Future of Life Institute’s Futures project aims, at least in part, to help revivify the lost utopian genre. As a Catholic, I wholeheartedly approve.

AI is a technology with immense promise to help the world – and also the promise to realize our worst nightmares, as humankind all-too-predictably directs it towards evil. In other words, as we have had throughout history, though now with the highest-ever stakes, we have set before us “life and death, blessings and curses” (Deut. 30:19). We should choose life so that we and our children may live.

References

1. Brian Tierney, The Idea of Natural Rights: Studies on Natural Rights, Natural Law, and Church Law 1150-1625 (Grand Rapids, MI: Eerdmans, 1997).

2. AI Research Group of the Centre for Digital Culture, Encountering Artificial Intelligence: Ethical and Anthropological Investigations, Vol. 1, Issue: Theological Investigations of AI, December 14, 2023, p. 3. https://jmt.scholasticahq.com/article/91230-encountering-artificial-intelligence-ethical-and-anthropological-investigations

3. US Conference of Catholic Bishops, “Seven Themes of Catholic Social Teaching,” Office of Justice, Peace & Human Development, drawn from Sharing Catholic Social Teaching: Challenges and Directions (Washington, DC: USCCB, 1998) and Faithful Citizenship: A Catholic Call to Political Responsibility (Washington, DC: USCCB, 2003). https://www.usccb.org/beliefs-and-teachings/what-we-believe/catholic-social-teaching/seven-themes-of-catholic-social-teaching

4. Canadian Catholic Organization for Development and Peace, “10 Principles of Catholic Social Teaching,” devp.org, 2020. https://stmikes.utoronto.ca/wp-content/uploads/2020/07/180-Catholic-Teaching-v2.pdf

5. Christopher Kaczor, “Seven Principles of Catholic Social Teaching,” Catholic Answers, April 1, 2007. https://www.catholic.com/magazine/print-edition/seven-principles-of-catholic-social-teaching

CAMBRIDGE, MA – Future of Life Institute (FLI) President and Co-Founder Max Tegmark today released the following statement after the Pope gave a speech at the G7 in Italy, raising the alarm about the risks of out-of-control AI development.

“The Future of Life Institute strongly supports the Pope’s call at the G7 for urgent political action to ensure artificial intelligence acts in service of humanity. This includes banning lethal autonomous weapons and ensuring that future AI systems stay under human control. I urge the leaders of the G7 nations to set an example for the rest of the world, enacting standards that keep future powerful AI systems safe, ethical, reliable, and beneficial.”

Full title: Turning Vision into Action: Implementing the Senate AI Roadmap

Executive Summary 

On May 15, 2024, the Senate AI Working Group released “Driving U.S. Innovation in Artificial Intelligence: A Roadmap for Artificial Intelligence Policy in the United States Senate,” which synthesized the findings from the Senate AI Insight Forums into a set of recommendations for Senate action moving forward. The Senate’s sustained efforts to identify and remain abreast of the key issues raised by this rapidly evolving technology are commendable, and the Roadmap demonstrates a remarkable grasp of the critical questions Congress must grapple with as AI matures and permeates our everyday lives.

The need for regulation of the highest-risk AI systems is urgent. The pace of AI advancement has been frenetic, with Big Tech locked in an out-of-control race to develop increasingly powerful, and increasingly risky, AI systems. Given the more deliberate pace of the legislative process, we remain concerned that the Roadmap’s deference to committees for the development of policy frameworks could delay the passage of substantive legislation until it is too late for effective policy intervention.

To expedite the process of enacting meaningful regulation of AI, we offer the following actionable recommendations for such policy frameworks that can form the basis of legislation to reduce risks, foster innovation, secure wellbeing, and strengthen global leadership.

AGI and Testing of Advanced General-Purpose AI Systems 

Liability 

AI and National Security 

Compute Security And Export Controls 

Autonomous Weapons Systems And Military Integration Of AI 

Open-source AI 

Supporting AI Innovation

Combatting Deepfakes 

Provenance and Watermarking 


Introduction 

In May 2024, the Bipartisan Senate AI Working Group, spearheaded by Majority Leader Schumer, Sen. Rounds, Sen. Heinrich, and Sen. Young, released a “roadmap for artificial intelligence policy in the United States Senate” entitled “Driving U.S. Innovation in Artificial Intelligence.” The Roadmap is a significant achievement in bipartisan consensus, and thoughtfully identifies the diversity of potential avenues AI presents for both flourishing and catastrophe. Drawing on the input of experts at the Senate AI Insight Forums and beyond, the Roadmap includes several promising recommendations for the Senate’s path forward.

At the same time, the Roadmap lacks the sense of urgency for congressional action we see as critical to ensuring AI is a net benefit for the wellbeing of the American public, rather than a source of unfettered risk. The pace of advancement in the field of AI has accelerated faster than even leading experts had anticipated, with competitive pressures and profit incentives driving Big Tech companies to race haphazardly toward creating more powerful, and consequently less controllable, systems by the month. A byproduct of this race is the relegation of safety and security to secondary concerns for these developers.

The speed with which this technology continues to evolve and integrate stands in stark contrast to the typical, more deliberate pace of government. This mismatch raises a risk that requisite government oversight will not be implemented quickly enough to steer AI development and adoption in a more responsible direction. Realization of this risk would likely result in a broad array of significant harms, from systematic discrimination against disadvantaged communities to the deliberate or accidental failure of critical infrastructure, that could otherwise be avoided. The social and economic permeation of AI could also render future regulation nearly impossible without disrupting and potentially destabilizing the US’s socioeconomic fabric – as we have seen with social media, reactive regulation of emerging technology raises significant obstacles where proactive regulation would not, and pervasive harm is often the result. In other words, the time to establish meaningful regulation and oversight of advanced AI is now.

With this in mind, we commend the Senate AI Working Group for acting swiftly to efficiently bring the Senate up to speed on this rapidly evolving technology through the Senate AI Insight Forums and other briefings. However, we are concerned that, in most cases, the Roadmap encourages committees to undertake additional consideration toward developing frameworks from which legislation could then be derived, rather than contributing to those actionable frameworks directly. We recognize that deference to committees of relevant jurisdiction is not unusual, but fear that this process will imprudently delay the implementation of AI governance, particularly given the November election’s potential to disrupt legislative priorities and personnel.

To streamline congressional action, we offer concrete recommendations for establishing legislative frameworks across a range of issues raised in the Roadmap. Rather than building the necessary frameworks from the ground up, our hope is that the analyses and recommendations included herein will provide actionable guidance for relevant committees and interested members that would reduce risks from advanced AI, improve US innovation, wellbeing, and global leadership, and meet the urgency of the moment.

AGI and Testing of Advanced General-Purpose AI Systems 

We applaud the AI Working Group for recognizing the unpredictability and risk associated with the development of increasingly advanced general-purpose AI systems (GPAIS). The Roadmap notes “the significant level of uncertainty and unknowns associated with general purpose AI systems achieving AGI.” We caution, however, against the inclination that the uncertainty and risks from AGI manifest only beyond a defining, rigid threshold, and emphasize that these systems exist on a spectrum of capability that correlates with risk and uncertainty. Unpredictability and risks have already been observed in the current state-of-the-art, which most experts categorize as sub-AGI, and are expected to increase in successive generations of more advanced systems, even as new risks emerge.

While the Roadmap encourages relevant committees to identify and address gaps in the application of existing law to AI systems within their jurisdiction, the general capabilities of these systems make it particularly challenging to identify appropriate committees of jurisdiction as well as existing legal frameworks that may apply. This challenge was a major impetus for establishing the AI Working Group — as the Roadmap notes in the Introduction, “the AI Working Group’s objective has been to complement the traditional congressional committee-driven policy process, considering that this broad technology does not neatly fall into the jurisdiction of any single committee.” 

Rather than a general approach to regulating the technology, the Roadmap suggests addressing the broad scope of AI risk through use case-based requirements on high-risk uses of AI. This approach may indeed be appropriate for most AI systems, which are designed to perform a particular function and operate exclusively within a specific domain. For instance, while some tweaks may be necessary, AI systems designed exclusively for financial evaluation and prediction can reasonably be overseen by existing bodies and frameworks for financial oversight. We are also pleased by the AI Working Group’s acknowledgement that some more egregious uses of AI should be categorically banned – the Roadmap specifically recommends a prohibition on the use of AI for social scoring, and encourages committees to “review whether other potential uses for AI should be either extremely limited or banned.”

That said, a use case-based approach is not sufficient for today’s most advanced GPAIS, which can effectively perform a wide range of tasks, including some for which they were not specifically designed, and can be utilized across distinct domains and jurisdictions. If the same system is routinely deployed in educational, medical, financial, military, and industrial contexts but is specialized for none of them, the governing laws, standards, and authorities applicable to that system cannot be easily discerned, complicating compliance with existing law and rendering regulatory oversight cumbersome and inefficient.

Consistent with this, the Roadmap asks committees to “consider a capabilities-based AI risk regime that takes into consideration short-, medium-, and long-term risks, with the recognition that model capabilities and testing and evaluation capabilities will change and grow over time.” In the case of GPAIS, such a regime would categorically include particular scrutiny for the most capable GPAIS, rather than distinguishing them based on the putative use-case.

Metrics 

Our ability to preemptively assess the risks and capabilities of a system is currently limited. As the Roadmap prudently notes, “(a)s our understanding of AI risks further develops, we may discover better risk-management regimes or mechanisms. Where testing and evaluation are insufficient to directly measure capabilities, the AI Working Group encourages the relevant committees to explore proxy metrics that may be used in the interim.” While substantial public and private effort is being invested in the development of reliable benchmarks for assessment of capabilities and associated risk, the field has not yet fully matured. Though some metrics exist for testing the capabilities of models at various cognitive tasks, no established benchmarks exist for determining their capacity for hazardous behavior without extensive testing across multiple metrics.

The number of floating-point operations (FLOPs) is a measure of computation and, in the context of AI, reflects the amount of computational resources (“compute”) used to train an AI model. Thus far, the amount of compute used to train an AI model scales remarkably well with the general capabilities of that model. The flurry of advancement in AI capabilities over the past few years has been driven primarily by innovations in high-performance computing infrastructure that allow for leveraging more training data and computational power, rather than by major innovations in model design. The resulting models have demonstrated capabilities highly consistent with predictions based on the amount of computing power used in training, and capabilities have in turn consistently correlated with identifiable risks.

While it is not clear whether this trend will continue, training compute has so far been the objective, quantitative measurement that best predicts the capabilities of a model prior to testing. In a capabilities-based regulatory framework, such a quantitative threshold is essential for initially delineating models subject to certain requirements from those that are not. That said, using a single proxy metric as the threshold creates the risk of failing to identify potentially high-risk models as advances in technology and efficiency are made, and of gamesmanship to avoid regulatory oversight.
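
To make the proxy concrete: a widely used back-of-the-envelope approximation for dense models is roughly 6 FLOPs of training compute per parameter per training token. The sketch below applies that approximation; the model size, token count, and the 10^25 FLOP cutoff are illustrative, not proposed values.

```python
# Rough training-compute estimate using the common C ≈ 6 * N * D approximation
# (N = parameters, D = training tokens). Values below are illustrative only.
def training_flops(parameters: float, tokens: float) -> float:
    return 6.0 * parameters * tokens

compute = training_flops(parameters=70e9, tokens=2e12)  # a 70B model on 2T tokens
print(f"estimated training compute: {compute:.2e} FLOPs")  # ~8.4e+23 FLOPs
print("exceeds 1e25 FLOP threshold:", compute >= 1e25)      # False
```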

Recommendations 

  1. Congress should implement a stratified, capabilities-based oversight framework for the most advanced GPAIS to complement use case-dependent regulatory mechanisms for domain-specific AI systems. Such a framework, through pre-deployment assessment, auditing, and licensure, and post-deployment monitoring, could conceivably mitigate risks from these systems regardless of whether they meet the relatively arbitrary threshold of AGI. While establishing a consensus definition of AGI is a worthwhile objective, it should not be considered prerequisite to developing a policy framework designed to mitigate risks from advanced GPAIS.
  2. Regulatory oversight of advanced GPAIS should employ the precautionary principle, placing the burden of proving the safety, security, and net public benefit of the system, and therefore its suitability for release, on the developer, and should prohibit the release of the system if it does not demonstrate such suitability. The framework should impose the most stringent requirements on the most advanced systems, with fewer regulatory requirements for less capable systems, in order to avoid unnecessary red tape and minimize the burden on smaller AI developers who lack the financial means to train the most powerful systems regardless.
  3. Audits and assessments should be conducted by independent, objective third-parties who lack financial and other conflicts of interest. These auditors could either be employed by the government or accredited by the government to ensure they are bound by standards of practice. For less powerful and lower risk systems, some assessments could be conducted in-house to reduce regulatory burden, but verifying the safety of the highest-risk systems should under no circumstances rely on self-governance by profit-motivated companies.
  4. Legislation governing advanced GPAIS should adopt regulatory thresholds that are inclusive of, but not limited to, training compute to ensure that current and future systems of concern remain in scope. Critically, these thresholds should each independently be sufficient to qualify a system as subject to additional scrutiny, such that exceeding, e.g., 10^25 FLOPs of training compute OR 100 billion parameters OR 2 trillion tokens of training data OR a particular score on a specified capabilities benchmark, risk assessment benchmark, risk assessment rubric, etc.,[1] would require a model to undergo independent auditing and receive a license for distribution (a minimal sketch of this threshold logic follows this list). This accounts for potential blind spots resulting from the use of proxy metrics, and allows flexibility for expanding the threshold qualifications as new benchmarks become available.
  5. Congress should establish a centralized federal authority responsible for monitoring, evaluating, and regulating GPAIS due to their multi-jurisdictional nature, and for advising other agencies on activities related to AI within their respective jurisdictions. This type of “hub and spoke” model for an agency has been effectively implemented for the Cybersecurity and Infrastructure Security Agency (CISA), and would be most appropriate for the efficient and informed regulation of AI. Such an agency could also lead response coordination in the event of an emergency caused by an AI system. Notably, CISA began as a directorate within the Department of Homeland Security (National Protection and Programs Directorate), but was granted additional operational independence thereafter. A similar model for an AI agency could mitigate the logistical and administrative strain that could delay establishment of a brand new agency, with the Department of Energy or the Department of Commerce serving as the hub for incubating the new oversight body.
  6. Whistleblower protections should be augmented to cover reporting on unsafe practices in development and/or planned deployment of AI systems. It is not presently clear whether existing whistleblower protections for consumer product safety would be applicable in these circumstances; as such, new regulations may be necessary to encourage reporting of potentially dangerous practices. These protections should be expanded to cover a wide range of potential whistleblowers, including employees, contractors, and external stakeholders who know of unsafe practices. Protection should include legal protection against retaliation, confidentiality, safe reporting channels, and the investigation of reports documenting unsafe practices.
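
The following is a minimal sketch of the “any single criterion suffices” logic from Recommendation 4. The compute, parameter, and token thresholds reuse the examples given in the text; the benchmark field and its cutoff score are purely hypothetical placeholders.

```python
# Sketch of the "each threshold is independently sufficient" rule (logical OR).
from dataclasses import dataclass

@dataclass
class ModelProfile:
    training_flops: float
    parameters: float
    training_tokens: float
    risk_benchmark_score: float  # hypothetical risk-assessment benchmark, 0-100

def requires_independent_audit(m: ModelProfile) -> bool:
    return (
        m.training_flops >= 1e25           # 10^25 FLOPs of training compute
        or m.parameters >= 100e9           # 100 billion parameters
        or m.training_tokens >= 2e12       # 2 trillion training tokens
        or m.risk_benchmark_score >= 80.0  # placeholder benchmark cutoff
    )

frontier = ModelProfile(3e25, 400e9, 10e12, 91.0)
smaller = ModelProfile(5e22, 7e9, 1e12, 12.0)
print(requires_independent_audit(frontier))  # True -> auditing and licensure required
print(requires_independent_audit(smaller))   # False -> lighter-touch requirements
```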

Liability 

The Roadmap emphasizes the need to “hold AI developers and deployers accountable if their products or actions cause harm to consumers.” We agree that developers, deployers, and users should all be expected to behave responsibly in the creation, deployment, and use of AI systems, and emphasize that the imposition of liability on developers is particularly critical in order to encourage early design choices that prioritize the safety and wellbeing of the public.

The Roadmap also correctly points out that “the rapid evolution of technology and the varying degrees of autonomy in AI products present difficulties in assigning legal liability to AI companies and their users.” Under current law, it is unclear who is responsible when an AI system causes harm, particularly given the complexity of the AI supply chain. 

Joint and Several Liability 

When an individual is harmed as a consequence of an AI system, there are several parties that could be responsible: the developer who trained the AI model, the provider who offers that model for use, the deployer who deploys the model as part of an AI system, or the user/consumer who employs the system for a given purpose. In addition, advanced GPAIS often serve as “foundation models,” which are incorporated as one component of a more elaborate system, or which are fine-tuned by third-parties to select for particular characteristics. This presents the possibility for multiple parties to assume each of the aforementioned roles.

In such circumstances, joint and several liability is often appropriate. Joint and several liability provides that a person who has suffered harm can recover the full amount of damages from any of the jointly and severally liable parties, i.e. those comprising the AI supply chain. The burden then rests on the defendant to recover portions of those damages from other parties based on their respective responsibility for the harm. In other words, if a person is harmed by an AI system, that person would be able to sue any one of the developer, the provider, or the deployer of the system and recover the full amount of damages, with these parties then determining their relative liability for that payment of damages independently of the injured party.

This absolves the person harmed of the burden of identifying the specific party responsible for the harm they suffered, which would be nearly impossible given the complexity of the supply chain, the opacity of the backend functions of these systems, and the likelihood that multiple parties may have contributed to the system causing harm. Instead, the defendant, who is more familiar with the parties involved in the system’s lifecycle and the relative contributions of each to the offending function, would be charged with identifying the other responsible parties, and joining them as co-defendants in the case, as appropriate.

Strict Liability 

Strict liability refers to a form of liability in which the exercise of due care is not sufficient to absolve a defendant of liability for harm caused by their action or product. While products are generally subject to a particular brand of strict liability, in most cases services rely on a negligence standard, which absolves the defendant of liability if due care was exercised in the conduct of the service. The lack of clarity as to whether advanced GPAIS should be classified as products or services draws into question whether this strict liability framework applies.

The inherent unpredictability of advanced GPAIS and inevitability of emergent unforeseen risks make strict liability appropriate. Many characteristics of advanced GPAIS render their training and provision akin to an “abnormally dangerous activity” under existing tort law, and abnormally dangerous activities are typically subject to strict liability. Existing law considers an activity abnormally dangerous and subject to strict liability if: (1) the activity creates a foreseeable and highly significant risk of physical harm even when reasonable care is exercised by all actors; and (2) the activity is not one of common usage.[2] A risk is considered highly significant if it is either unusually likely or unusually severe, or both. For instance, the operation of a nuclear power plant is considered to present a highly significant risk because while the likelihood of a harm-causing incident when reasonable care is exercised is low, the severity of harm should an incident occur would be extremely high.

A significant portion of leading AI experts have attested to a considerable risk of catastrophic harm from the most powerful AI systems, including executives from the major AI companies developing the most advanced systems.[3] The presence of these harms is thus evidently foreseeable and highly significant. Importantly, reasonable care is not sufficient to eliminate catastrophic risk from advanced GPAIS due to their inherent unpredictability and opacity, as demonstrated by the emergence of behaviors that were not anticipated by their developers in today’s state-of-the-art systems.[4] As more capable advanced GPAIS are developed, this insufficiency of reasonable care will likely compound – an AGI system that exceeds human capacity across virtually all cognitive tasks, for instance, by definition would surpass the capacity of humans to exercise reasonable care in order to allay its risks. 

Additionally, given the financial and hardware constraints on training such advanced models, only a handful of companies have the capacity to do so, suggesting that the practice also “is not one of common usage.” In contrast, less capable systems are generally less likely to present emergent behaviors and inherently present a far lower risk of harm, particularly when reasonable care is exercised. Such systems are also less hardware intensive to train, and, while not necessarily “of common usage” at present, could qualify as such with continued proliferation.

Section 230 

Section 230 of the Communications Decency Act of 1996 provides that, among other things, “no provider or user of an interactive computer service shall be treated as the publisher or speaker of any information provided by another information content provider.”5 This provision, along with the statute’s protection of interactive computer services from liability for good faith moderation actions, has been broadly interpreted to protect online platforms from liability for the content they host, so long as the content was contributed by a party other than the platform itself.

The application of this statute to AI has yet to be tested in courts, and there is disagreement among legal scholars as to how Section 230 relates to generative AI outputs. On one hand, the outputs of generative AI systems are dependent on the input of another information content provider, i.e. the user providing the prompt. On the other hand, the outputs generated by the system are wholly unique, more akin to content provided by the platform itself. The prevailing view among academics is that generative AI products “operate on something like a spectrum between a retrieval search engine (more likely to be covered by Section 230) and a creative engine (less likely to be covered).”6 A robust liability framework for AI must therefore ensure that this area of the law is clarified, either by explicitly superseding Section 230, or by amending Section 230 itself to provide this clarification. Shielding the developers and operators of advanced AI systems from liability for harms resulting from their products would provide little incentive for responsible design that minimizes risk, and, as we have seen with social media, could result in wildly misaligned incentives to the detriment of the American public.

Recommendations 

  1. The development of a GPAIS trained with greater than, e.g., 10²⁵ FLOPs, or a system equivalent in capability7, should be considered an abnormally dangerous activity and subject to strict liability due to its inherent unpredictability and risk, even when reasonable care is exercised.
  2. Developers of advanced GPAIS that fall below this threshold, but still exceed a lower threshold (e.g., 10²³ FLOPs), should be subject to a rebuttable presumption of negligence if the system causes harm – given the complexity, opacity, and novelty of these systems, and the familiarity of their developers with the pre-release testing the systems underwent, developers of these systems are best positioned to bear the burden of proving reasonable care was taken. (These tiers are illustrated in the sketch following these recommendations.)
  3. Domain-specific AI systems should be subject to the legal standards applicable in that domain.
  4. Where existing law does not explicitly indicate an alternative apportionment of liability, AI systems should be subject to joint and several liability, including for all advanced GPAIS.
  5. Congress should clarify that Section 230 of the Communications Decency Act does not shield AI providers from liability for harms resulting from their systems, even if the output was generated in response to a prompt provided by a user. This could be accomplished by amending the definition of “information content provider” in Section 230(f)(3) to specify that the operator of a generative AI system shall be considered the “information content provider” for the outputs generated by that system.
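
To make the tiering in recommendations 1 and 2 concrete, the following sketch (in Python, with threshold constants mirroring the figures above) shows how a reporting or compliance tool might map a system’s reported training compute to the proposed liability tiers. It is illustrative only; an actual determination would rest on the multi-metric criteria described in footnote 1, not on compute alone.

    # Illustrative sketch only: maps reported training compute to the liability
    # tiers proposed in Recommendations 1 and 2. A real determination would use
    # the multi-metric criteria discussed in footnote 1, not compute alone.

    STRICT_LIABILITY_FLOPS = 1e25       # proposed strict-liability threshold
    PRESUMED_NEGLIGENCE_FLOPS = 1e23    # proposed rebuttable-presumption threshold

    def liability_tier(training_flops: float) -> str:
        """Return the proposed liability tier for a GPAIS given its training compute."""
        if training_flops >= STRICT_LIABILITY_FLOPS:
            return "strict liability (abnormally dangerous activity)"
        if training_flops >= PRESUMED_NEGLIGENCE_FLOPS:
            return "rebuttable presumption of negligence"
        return "ordinary negligence / domain-specific standards"

    print(liability_tier(3e25))  # strict liability (abnormally dangerous activity)
    print(liability_tier(5e23))  # rebuttable presumption of negligence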

AI and National Security  

AI systems continue to display unexpected, emergent capabilities that present risks to the American public. Many of these risks, for example those from disinformation8 and cyberattacks9, were not discovered by evaluations conducted during the development life cycle (i.e. pre-training, training, and deployment), or were discovered but were deemed insufficiently severe or probable to justify delaying release. Moreover, AI companies have incentives to deploy AI systems quickly in order to establish and/or maintain market advantage, which may lead to substandard monitoring and mitigation of AI risks. When risks are discovered and harm is imminent or has occurred, it is vital for authorities to be informed as soon as possible to respond to the threat.

The Roadmap encourages committees to explore “whether there is a need for an AI-focused Information Sharing and Analysis Center to serve as an interface between commercial AI entities and the federal government to support monitoring of AI risks.” We see such an interface as essential to preserving national security in light of the risks, both unexpected and reasonably foreseeable, presented by these systems.

The Roadmap also notes that “AI has the potential to increase the risk posed by bioweapons and is directly relevant to federal efforts to defend against CBRN threats.” As state-of-the-art AI systems have become more advanced, they have increasingly demonstrated capabilities that could pose CBRN threats. For instance, in 2022, an AI system used for pharmaceutical research identified 40,000 novel candidate chemical weapons in six hours10. While the current generation of models has not yet significantly increased the abilities of malicious actors to launch biological attacks, newer models are adept at providing the scientific knowledge, step-by-step experimental protocols, and guidance for troubleshooting experiments necessary to effectively develop biological weapons.11 Additionally, current models have been shown to significantly facilitate the identification and exploitation of cybervulnerabilities.12 These capabilities are likely to scale over time.

Over the last two years, an additional threat has emerged at the convergence of biotechnology and AI, as ever more powerful AI models are ‘bootstrapped’ with increasingly sophisticated biological design tools, allowing for AI-assisted identification of virulence factors, in silico design of pathogens, and other capabilities that could significantly increase the capacity of malicious actors to cause harm.13 

The US government should provide an infrastructure for monitoring these AI risks that puts the safety of the American public front and center, gives additional support to efforts by AI companies, and allows for rapid response to harms from AI systems. Several precedents for such monitoring already exist. For instance, CISA’s Joint Cyber Defense Collaborative is a nimble network of cross-sector entities that are trusted to analyze and share cyber threats, the SEC requires publicly traded companies to disclose cybersecurity incidents within four business days, and the 2023 AI Executive Order requires companies to disclose ‘the physical and cybersecurity protections taken to assure the integrity of that training process against sophisticated threats’.14 

Recommendations 

  1. Congress should establish an Information Sharing and Analysis Center (ISAC) which will designate any model or system that meets a specified quantitative threshold15 as a model or system of national security concern. Congress should require developers building advanced AI systems to share documentation with the ISAC about the decisions taken throughout the development and deployment life-cycle (e.g., model cards detailing decisions taken before, during, and after the training and release of a model; a hypothetical example of such documentation is sketched after these recommendations).
  2. The current draft of the 2025 National Defense Authorization Act (NDAA)16 tasks the Chief Digital and Artificial Intelligence Officer with developing an implementation plan for a secure computing and data storage environment (an ‘AIxBio sandbox’) to facilitate the testing of AI models trained on biological data, as well as the testing of products generated by such models. Congress should mandate that AI systems that are as or more powerful than those designated as models of national security concern (see above), or that are otherwise deemed to pose CBRN threats, be subjected to testing in this sandbox before deployment to ensure that these systems do not pose severe risks to the American public. This type of facility should follow the design and protocols of the national security sector’s Sensitive Compartmented Information Facility (SCIF) standards or the similar Data Cleanroom standards used in software litigation discovery.
  3. To ensure that GPAIS are not capable of revealing hazardous information, Congress should prohibit AI models from being trained on the most dangerous dual-use research of concern (DURC). Congress should also recommend appropriate restrictions for DURC data being used to train narrow AI systems – such as ringfencing of the most hazardous biological information from use in training – that could pose significant risk of misuse, malicious use, or unintended harm. In both cases, these requirements should cover data that, if widely available, would pose a potential CBRN risk.
  4. The federal government should invest in core CBRN defense strategies that are agnostic to AI, while bearing in mind that AI increases the probability of these threats materializing. Such investments should include next-generation personal protective equipment (PPE), novel medical countermeasures, ultraviolet-C technologies, and other recommendations from the National Security Commission on Emerging Biotechnology.17
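
As a purely hypothetical illustration of recommendation 1 above, the sketch below (in Python) assembles the kind of structured record a developer might share with an AI-focused ISAC. Every field name and value is an assumption for illustration only, not a proposed reporting standard.

    import json
    from datetime import date

    # Hypothetical example of documentation a developer might share with an
    # AI-focused ISAC. All field names and values are illustrative placeholders.
    disclosure = {
        "model_name": "example-model",              # placeholder identifier
        "estimated_training_flops": 2.0e25,         # self-reported training compute
        "designation": "model of national security concern",
        "pre_deployment_evaluations": [
            {"evaluation": "CBRN uplift red-team",
             "date": str(date(2024, 5, 1)),
             "outcome": "no significant uplift observed"},
            {"evaluation": "cyber capability benchmark",
             "date": str(date(2024, 5, 10)),
             "outcome": "mitigations applied before release"},
        ],
        "release_decision": "staged API deployment with ongoing monitoring",
    }

    print(json.dumps(disclosure, indent=2))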

Compute Security And Export Controls  

High-end AI chips are responsible for much of the rapid acceleration in development of AI systems. As these chips are an integral component of AI development and rely on a fairly tight supply chain – i.e. the supply chain is concentrated in a small number of companies in a small number of countries18 – chips are a promising avenue for regulating the proliferation of the highest-risk AI systems, especially among geopolitical adversaries and malicious non-state actors.

The Roadmap “encourages the relevant committees to ensure (the Bureau of Industry and Security) proactively manages (critical) technologies and to investigate whether there is a need for new authorities to address the unique and quickly burgeoning capabilities of AI, including the feasibility of options to implement on-chip security mechanisms for high-end AI chips.” We appreciate the recognition by the AI Working Group of on-chip security as a useful approach toward mitigating AI risk. Congress must focus on both regulatory and technical aspects of this policy problem to mitigate the risk of AI development from malicious actors.

The Roadmap also asks committees to develop a framework for determining when, or if, export controls should be placed on advanced AI systems. We view hardware governance and export controls as complementary and mutually-reinforcing measures, wherein on-chip security mechanisms can serve to mitigate shortcomings of export controls as a means of reducing broad proliferation of potentially dangerous systems.

Export controls, especially those with an expansive purview, often suffer from serious gaps in enforcement. In response to export controls on high-end chips used for training AI systems, for instance, a growing informal economy around chip smuggling has already emerged, and is likely to grow as BIS restrictions on AI-related hardware and systems become more expansive.19 Coupling export controls with on-chip governance mechanisms can help remedy this gap in enforcement by providing the ability to track and verify the location of chips, and to automatically or remotely disable their functionality based on their location when they are used or transferred in violation of export controls.

Export controls also generally target particular state actors rather than select applications, which may foreclose economic benefits and exacerbate geopolitical risks to United States interests relative to more targeted restrictions on trade. For example, broadly-applied export controls20 targeted at the People’s Republic of China (PRC) do not effectively distinguish between harmless use cases (e.g., chips used for video games or peaceful academic collaborations) and harmful use cases (e.g., chips used to train dangerous AI military systems) within the PRC. Expansive export controls have already led to severe criticism from the Chinese government,21 and may be having the unintended effect of pushing China toward technological self-reliance.22

In contrast, relaxing restrictions on chip exports to demonstrably low-risk customers and for low-risk uses in countries otherwise subject to export controls could improve the economic competitiveness of US firms and strengthen trade relationships key to maintaining global stability. These benefits are integral to guaranteeing sustained US leadership on the technological frontier, and to maintaining the geopolitical posture of the US. The ability of on-chip governance mechanisms to more precisely identify the location of a given chip and to determine whether the chip is co-located with many other chips or used in a training cluster could facilitate more targeted export controls that maintain chip trade with strategic competitors for harmless uses, while limiting their application toward potentially risky endeavors.

New and innovative hardware governance solutions are entirely compatible with the current state-of-the-art chips sold by leading manufacturers. All hardware relevant to AI development (e.g., H100s, A100s, TPUs) has some form of “trusted platform module (TPM)”, a hardware device that generates random numbers, holds encryption keys, and interfaces with other hardware modules to ensure platform integrity and report security-relevant metrics.23 Some new hardware (H100s in particular) has an additional “trusted execution environment (TEE)” or “secure enclave” capability, which prevents access to chosen sections of memory at the hardware level. TPMs and secure enclaves are already available and in use today, presently serving to prevent iPhones from being “jailbroken” or used when stolen, and to secure biometric and other highly sensitive information in modern phones and laptops. As discussed, they can also facilitate monitoring of AI development to identify the most concerning uses of compute and take appropriate action, including automatic or remote shutdown if the chips are used in ways or in locations that are not permitted by US export controls.

These innovations could be transformative for policies designed to monitor AI development, as TEEs and TPMs use cryptographic technology to guarantee confidentiality and privacy for all users across a variety of use and governance models.24 Such guarantees are likely necessary for these chips to become the industry and international standard for use, and for willing adoption by strategic competitors. TEE and TPM security capabilities can also be used to construct an “attested provenance” capability that gives cryptographic proof that a given set of AI model weights or model outputs results from a particular auditable combination of data, source code, training characteristics (including amount of compute employed), and input data. This provides a uniquely powerful tool in verifying and enforcing licensing standards.
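
As a rough sketch of the “attested provenance” pattern described above, and not a description of any vendor’s actual TPM or TEE interface, the following Python example hashes a weights artifact together with a manifest of its training inputs and produces a keyed attestation tag. In a real deployment the key would be held inside the chip’s secure hardware and an asymmetric attestation scheme would be used; the file path, manifest fields, and in-memory key here are assumptions for illustration.

    import hashlib
    import hmac
    import json

    def provenance_digest(weights_path: str, manifest: dict) -> bytes:
        """Hash the model weights together with a manifest of training inputs."""
        h = hashlib.sha256()
        with open(weights_path, "rb") as f:
            for chunk in iter(lambda: f.read(1 << 20), b""):
                h.update(chunk)
        # Fold in the training manifest (data sources, code commit, compute used).
        h.update(json.dumps(manifest, sort_keys=True).encode())
        return h.digest()

    def attest(weights_path: str, manifest: dict, device_key: bytes) -> str:
        """Return a hex attestation tag binding the weights to their manifest.

        HMAC with an in-memory key stands in for a hardware-held attestation key.
        """
        return hmac.new(device_key, provenance_digest(weights_path, manifest),
                        hashlib.sha256).hexdigest()

    # Hypothetical usage (path, manifest values, and key are placeholders):
    manifest = {"code_commit": "abc123", "dataset_ids": ["corpus-v1"],
                "training_flops": 1.2e25}
    # tag = attest("model.safetensors", manifest, device_key=b"held-in-hardware")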

Because state-of-the-art chips already possess the technical capability for this type of on-chip security, a technical solution to hardware governance would not impose serious costs on leading chip companies to modify the architecture of chips currently in inventory or in production. Additionally, it is possible to use these technical solutions for more centralized compute governance without creating back-channels that would harm the privacy of end-users of the chip supply chain – indeed these mechanisms can ensure privacy and limit communication of information to telemetry such as location and usage levels.

Recommendations 

  1. Congress should support the passage of H.R.8315, the Enhancing National Frameworks for Overseas Restriction of Critical Exports (ENFORCE) Act, which gives the Bureau of Industry and Security (BIS) the authority to control the export and re-export of covered AI systems, with amendments to ensure that the publication of AI models in a manner that is publicly accessible does not create a loophole to circumvent these controls, i.e., that open-weight systems meeting specified conditions qualify as exports under the Act.25 
  2. Congress should require companies developing AI systems that meet specified thresholds26 to use AI chips with secure hardware. This hardware should be privacy-preserving to allow for confidential computing but should also provide information on proof-of-location and the ability to switch chips off in emergency circumstances.27 Such a technical solution would complement robust export controls by facilitating enforcement and more effectively targeting harmful applications in particular. This could be accomplished through direct legislation prohibiting the domestic training of advanced AI systems using chips without such technology, and by providing a statutory obligation for BIS to grant export licenses for high-end AI chips and dual-use AI models only if they are equipped with these on-chip security mechanisms and trained using such chips, respectively.
  3. To avoid gaming, inaccuracy, or misrepresentation in the satisfaction of licensing requirements, Congress should phase in increasingly stringent evidentiary requirements for reporting of compute usage and auditing results. The recently-established US AI Safety Institute within the National Institute of Standards and Technology should be tasked with developing a comprehensive standard for compute accounting to be used in threshold determinations (a simple first-order estimate of training compute is sketched after these recommendations). Additionally, self-attestation of compute usage and capability evaluations should be upgraded to cryptographically attested provenance when this becomes technically practical.
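
For context on what a compute-accounting standard would need to pin down, the sketch below uses the common first-order approximation of roughly six floating-point operations per parameter per training token for dense transformer training. This rule of thumb is an assumption for illustration; a NIST standard could adopt a different or more detailed accounting method.

    def estimated_training_flops(parameters: float, training_tokens: float) -> float:
        """First-order estimate of total training compute for a dense transformer.

        Uses the common ~6 * N * D approximation (forward plus backward pass);
        a formal compute-accounting standard could refine or replace this.
        """
        return 6.0 * parameters * training_tokens

    # Example: a 70-billion-parameter model trained on 15 trillion tokens.
    print(f"{estimated_training_flops(70e9, 15e12):.2e}")  # ~6.30e+24 FLOPs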

Autonomous Weapons Systems (AWS) and Military Integration of AI  

The Roadmap asks committees to take actions that prioritize the “development of secure and trustworthy algorithms for autonomy in DOD platforms” and ensure “the development and deployment of Combined Joint All-Domain Command and Control (CJADC2) and similar capabilities by DOD.”

Following the 2021 CJADC2 Strategy28, the Department of Defense (DOD) announced a new generation of CJADC2 capabilities early this year, which are intended to use AI to “connect data-centric information from all branches of service, partners, and allies, into a singular internet of military things.” This built on similar efforts led by the Chief Digital and Artificial Intelligence Office (CDAO) and on the objectives of Task Force Lima to monitor, develop, evaluate, and recommend the responsible and secure implementation of generative AI capabilities across DOD.

While such innovations in the war-fighting enterprise present potential benefits – e.g., rapid integration of military intelligence, providing strategic decision advantage to commanders – there are significant pitfalls to rapid integration of AI systems, which have repeatedly proven unreliable, opaque, and unpredictable. Bugs in AI systems used in such critical settings could severely hamper the national defense enterprise and put American citizens and allies in danger, as a centralized system responsible for virtually all military functions creates a single point of failure and vulnerability. Integration of these systems may also amplify correlated biases in the decision-making of what would otherwise be independent AI systems used in military applications.

The Roadmap also “recognizes the DOD’s transparency regarding its policy on fully autonomous lethal weapons systems (and encourages) relevant committees to assess whether aspects of the DOD’s policy should be codified or if other measures, such as notifications concerning the development and deployment of such weapon systems, are necessary.”

As the draft text of the 2025 National Defense Authorization Act (NDAA)29 notes, the ‘small unmanned aircraft systems (UAS) threat continues to evolve, with enemy drones becoming more capable and dangerous.’ Autonomous weapons systems (AWS) are becoming increasingly cheap to produce and use, and swarms of such weapons pose a serious threat to the safety of citizens worldwide. When deployed en masse, swarms of autonomous weapons, which have demonstrated little progress in distinguishing between civilians and combatants in complex conflict environments, have the potential to cause mass casualties at the level of other kinds of WMDs. Their affordability also makes them a potentially potent tool for carrying out future genocides.

Overall, AWS have proven to be dangerously unpredictable and unreliable, demonstrating difficulty distinguishing between friend and foe. As these systems become more capable over time, they present a unique risk from loss of control or unintended escalation. Additionally, such systems are prone to cyber-vulnerabilities, and may be hacked by malicious actors and repurposed for malicious use.

Recommendations 

  1. Congress should mandate that nuclear launch systems remain independent from CJADC2 capabilities. The current air-gapped state of nuclear launch systems ensures that the critical decision to launch a nuclear weapon always remains within full human control. This separation also guards the nuclear command and control system against cyber-vulnerabilities that could otherwise arise if the system were integrated with various other defense systems. Other systems may possess unique vulnerabilities from which nuclear launch systems are presently insulated, but to which they would be exposed were the functions integrated.
  2. Building on the precedent set by the Air Force, Congress should require DOD to establish boards comprised of AI ethics officers across all offices involved in the production, procurement, development, and deployment of military AI systems.30
  3. In light of the comments made by former Chief Digital and AI Officer Dr. Craig Martell that all AI systems integrated into defense operations must have ‘five-digit accuracy’ (99.999%),31 Congress should task the CDAO with establishing clear protocols to measure this accuracy and should prohibit systems that fall below this level of accuracy from being used in defense systems (an illustrative calculation of the testing burden this requirement implies follows these recommendations).
  4. Congress should codify DOD Directive 3000.09 in statute to ensure that it is firmly established, and amend it to raise the bar from requiring ‘appropriate levels of human judgment’ to requiring ‘meaningful human control’ when AI is incorporated in military contexts. This is critical in ensuring that ‘human-in-the-loop’ is not used as a rubber stamp, and in emphasizing the need for human control at each stage of deployment. In addition, Congress should require the CDAO to file a report establishing concrete guidance for meaningful human control in practice, for both AWS and decision-support systems.
  5. As the 2025 NDAA draft indicates, there is a need for development of counter-UAS (C-UAS) systems. Rather than ramping up development of unreliable and risky offensive AWS, Congress should instead instruct DOD to invest in non-kinetic counter-AWS (C-AWS) development. As AWS development accelerates and the risk of escalation heightens, the US should reassure allies that AWS is not the best countermeasure and instead push for advanced non-kinetic C-AWS technology.
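
To illustrate the testing burden implied by a 99.999% accuracy requirement (referenced in recommendation 3), the sketch below computes how many independent, error-free trials would be needed to bound the true error rate below 1 in 100,000 at a given confidence level. This is a simplified statistical illustration under an assumption of independent trials, not a DOD evaluation protocol.

    import math

    def trials_for_zero_failure_bound(max_error_rate: float, confidence: float) -> int:
        """Minimum number of independent trials, all successful, needed to conclude
        the true error rate is below max_error_rate at the given confidence level.

        Derived from requiring (1 - max_error_rate) ** n <= 1 - confidence.
        """
        return math.ceil(math.log(1.0 - confidence) / math.log(1.0 - max_error_rate))

    # "Five-digit accuracy" (99.999%) corresponds to an error rate below 1e-5.
    print(trials_for_zero_failure_bound(1e-5, 0.95))  # roughly 300,000 trials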

Open-source AI  

Recently, “open-source AI” has been used to refer to AI models for which model weights, the numerical values that dictate how a model translates inputs into outputs, are widely available to the public. It should be noted that an AI system with widely available model weights alone does not fit the traditional criteria for open-source. The inconsistent use of this term has allowed many companies to benefit from the implication that models with varying degrees of openness might still fulfill the promises of open-source software (OSS), even when they do not adhere to the core principles of the open-source movement32. Contrary to the marketing claims of Big Tech companies deploying “open-source” AI models, Widder et al. (2023) argue that while maximally “open” AI can indeed provide transparency, reusability, and extensibility, allowing third parties to deploy and build upon powerful AI models, it does not guarantee democratic access, meaningful competition, or sufficient oversight and scrutiny in the AI field. 

Advanced AI models with widely available model weights pose particularly significant risks to society due to their unique characteristics, potential for misuse, and the difficulty of evaluating and controlling their capabilities. In the case of CBRN risks, as of early 2024, evidence suggests that the current generation of closed AI systems function as instruments comparable to internet search engines in facilitating the procurement of information that could lead to harm.33 However, these experiments were carried out using proprietary models with fine-tuned safeguards. The release of model weights allows for trivial removal of any safeguards that might be added to mitigate these risks and lowers the barrier to entry for adapting systems toward more dangerous capabilities through fine-tuning.34,35,36

As AI models become more advanced, their reasoning, planning, and persuasion capabilities are expected to continue to grow, which will in turn increase the potential for misuse by malicious actors and loss of control over the systems by careless operators. Relevant legislation should account for the difficulty in accurately predicting which models will possess capabilities strong enough to pose significant risks with and without the open release of their model weights.37 Unanticipated vulnerabilities and dangerous capabilities can be particularly insidious in the latter case, as once model weights are released, such models cannot be effectively retracted in order to patch issues, and the unpatched versions remain indefinitely available for use.

“Open AI systems” have already demonstrated the potential to facilitate harmful behavior, particularly by way of cyberattacks, disinformation, and the proliferation of child sexual abuse material (CSAM).38,39 The UK National Cyber Security Centre found that AI systems are expected to significantly increase the volume and impact of cyber attacks by 2025, with varying degrees of influence on different types of cyber threats.40 While the near-term threat primarily involves the enhancement of existing tactics, techniques, and procedures, AI is already being used by both state and non-state actors to improve reconnaissance and social engineering. More advanced AI applications in cyber operations are likely to be limited to well-resourced actors with access to quality training data, immense computational resources, and expertise, but open release of model weights by these well-resourced actors could provide the same capacity to a wider range of threat actors, including cybercriminals and state-sponsored groups.

The Roadmap asks committees to “investigate the policy implications of different product release choices for AI systems, particularly to understand the differences between closed versus fully open-source models (including the full spectrum of product release choices between those two ends of the spectrum).” We appreciate the Roadmap’s implication that “open-source model” product releases present additional questions in understanding the risks posed by AI systems, and recommend the following measures to mitigate the unique risks posed by the release of model weights.

Recommendations  

  1. Congress should require that AI systems with open model weights undergo thorough testing and evaluation in secure environments appropriate to their level of risk. The government should conduct these assessments directly or delegate them to a group of government-approved independent auditors. When assessing these models, auditors must assume that a) built-in safety measures or restrictions could be removed or bypassed once the model is released, and b) the model could be fine-tuned or combined with other resources, potentially leading to the development of entirely new and unanticipated capabilities. Insufficient safeguards against dangerous capabilities or dangerously unpredictable behavior should justify suspending the release of model weights, and potentially of the system itself, until such shortcomings are resolved. In cases where full access to a model’s weights is needed to reliably audit the capabilities of a system, assessment should be conducted in Sensitive Compartmented Information Facilities (SCIFs) to ensure appropriate security measures.41,42
  2. Developers should be legally responsible for taking all reasonable measures to prevent their models from being retrained to substantially enable illegal activities, and for any harms resulting from their failure to do so. When model weights are made widely available, it becomes intractable for developers to retract, monitor, or patch the system. Presently, there is no reliable method of comprehensively identifying all of the capabilities of an AI system. Latent capabilities, problematic use-cases, and vulnerabilities are often identified far into the deployment life-cycle of a system or through additional fine-tuning. Despite the difficulty of identifying the full range of capabilities, developers should be held liable if their model was used to substantially enable illegal activities.
  3. To mitigate the concentration of power over AI while ensuring AI safety and security, initiatives like the National Artificial Intelligence Research Resource (NAIRR) should be pursued to create “public options” for AI. As previously discussed, the impacts of open-source AI on the concentration of power and on mitigating market consolidation are often overstated. This does not discount the importance of preventing the concentration of power, both within the technology market and for society at large, that is likely to result from the high barrier to entry for training the most advanced AI systems. Publicly funded initiatives like the NAIRR could help develop and maintain public AI models, services, and infrastructure. This approach would ensure that access to advanced AI is not solely controlled by corporate or proprietary interests, allowing researchers, entrepreneurs, and the general public to benefit from the technology while prioritizing safety, security, and oversight.

Supporting US AI Innovation 

The Roadmap has the goal of “reaching as soon as possible the spending level proposed by the National Security Commission on Artificial Intelligence (NSCAI) in their final report: at least $32 billion per year for (non-defense) AI innovation.” We appreciate the support for non-military innovation of AI, and emphasize that AI innovation should not be limited to advancing the capabilities or raw power of AI systems. Rather, innovation should prioritize specific functions that maximize public benefit and tend to be under-incentivized in industry, and should include extensive research into improving the safety and security of AI systems. This means enhancing the explainability of outputs, developing tools for risk evaluation, and building mechanisms for ensuring predictability and maintaining control over system behavior.

To this end, the Roadmap also expresses the need for funding efforts to enhance AI safety and reliability through initiatives to support AI testing and evaluation infrastructure and the US AI Safety Institute, as well as increased resources for BIS to ensure effective monitoring and compliance with export control regulations. The Roadmap also emphasizes the importance of R&D and interagency coordination focused on the intersection of AI and critical infrastructure. We commend the comprehensive approach to R&D efforts across multiple agencies, as it recognizes the critical role that each of these entities plays in ensuring the safe and responsible development of AI technologies. In particular, we see the intersection of AI and critical infrastructure as a major vector of potential AI risk if due care is not taken to ensure the reliability and security of systems integrated into critical infrastructure, and to strengthen resilience against possible AI-assisted cyberthreats.

Research focused on the safe development, evaluation, and deployment of AI is vastly under-resourced when compared to research focused on the general development of AI. AI startups received almost $50 billion in funding in 2023.43 According to the 2024 Stanford AI Index Report, industry produced 51 notable machine learning models, academia contributed 15, and the government contributed 2.44 While the amount of resources that private companies allocate to safety research is unclear – there can be some overlap between safety and capabilities research – it is significantly less than investment in capabilities. Recently, members of teams working on AI safety at OpenAI have resigned, citing concerns about the company’s approach to AI safety research.45 This underscores the need for funding focused on the safe development, evaluation, and deployment of AI.

Recommendations 

1. R&D funding to BIS should include allocation to the development of on-chip hardware governance solutions, and the implementation of those solutions. To best complement the role of BIS in implementing export controls on advanced chips and potentially on AI models, this funding should include R&D supporting the further development of privacy-preserving monitoring such as proof-of-location and the ability to switch chips off in circumstances where there is a significant safety or regulatory violation.46 After appropriate on-chip governance solutions are identified, funding should also be directed towards enabling the implementation of those solutions in relevant export control legislation.

2. The expansion of NAIRR programs should include funding directed toward the development of secure testing and usage infrastructure for academics, researchers, and members of civil society. We support efforts by the NAIRR pilot program to improve public access to research infrastructure. As AI systems become increasingly capable, levels of access to AI tools and resources should be dynamic relative to their level of risk. Accordingly, it may be beneficial for those receiving any government funding for their work on powerful models (including private sector) to provide structured access to their systems via the NAIRR, subject to specific limitations on use and security measures, including clearance and SCIFs, where necessary, to allow for third parties to probe these systems and develop the tools necessary to make them safer.

3. R&D on interagency coordination focused on the intersection of AI and critical infrastructure should include allocation to safety and security research. The ultimate goal of this research should be to establish stringent baseline standards for the safe and secure integration of AI into critical infrastructure. These standards should address key aspects such as transparency, predictability, and robustness of AI systems, ensuring that they can be effectively integrated without introducing additional vulnerabilities. Funding should also acknowledge the lower barrier to entry for malicious actors to conduct cyberattacks as publicly-accessible AI becomes more advanced and widespread, and seek improved mechanisms to strengthen cybersecurity accordingly.

Combating Deepfakes 

The Roadmap encourages the relevant committees to consider legislation “to protect children from potential AI-powered harms online by ensuring companies take reasonable steps to consider such risks in product design and operation.” We appreciate the Roadmap’s recognition that product design, and by extension product developers, play a key role in mitigating AI-powered harms. The Roadmap also encourages the consideration of legislation “that protects against unauthorized use of one’s name, image, likeness, and voice, consistent with First Amendment principles, as it relates to AI”, “legislation to address online child sexual abuse material (CSAM), including ensuring existing protections specifically cover AI-generated CSAM,” and “legislation to address similar issues with non-consensual distribution of intimate images and other harmful deepfakes.”

Deepfakes, which are pictures, videos, and audio that depict a person without their consent, usually for the purpose of harming that person or misleading those who are exposed to the material, lie at the intersection of these objectives. There are many ways in which deepfakes systematically undermine individual autonomy, perpetuate fraud, and threaten our democracy. For example, 96% of deepfakes are sexual material47 and fraud committed using deepfakes rose 3,000% globally in 2023 alone.48 Deepfakes have also begun interfering with democratic processes by spreading false information and manipulating public opinion49 through convincing fake media, which can influence, and in some cases already have influenced, electoral outcomes50. The Roadmap encourages committees to “review whether other potential uses for AI should be either extremely limited or banned.” We believe deepfakes fall into that category.

Deepfakes are the result of a multilayered supply chain, which begins with model developers, who design the underlying algorithms and models. Cloud compute providers such as Amazon Web Services (AWS), Google Cloud Platform, and Microsoft Azure form the next link in the chain by offering the necessary computational resources for running and in some cases training deepfake models. These platforms provide the infrastructure and scalability required to process large datasets and generate synthetic media efficiently. Following them are model providers, such as Deepnude51, Deepgram52, and Hoodem53, which offer access to pre-trained deepfake models or user-friendly software tools, enabling even those with limited technical expertise to produce deepfakes.

The end users of deepfake technology are typically individuals or groups with malicious intent, utilizing these tools to spread misinformation, manipulate public opinion, blackmail individuals, or engage in other illicit activities. Once created, these deepfakes are distributed through various online platforms, including social media sites such as Facebook, Twitter, and YouTube, as well as messaging apps like WhatsApp and Telegram. The proliferation of deepfakes on these platforms can be rapid and extensive, making it nearly impossible to remove the synthetic media once published. Accordingly, it is critical to prevent the production of deepfakes before their publication and distribution can occur.

As the creators and distributors of the powerful tools that enable the mass production of this harmful content, model developers and providers hold the most control and responsibility in the deepfake supply chain. Developers have the capability to stop the misuse of these technologies at the source by restricting access, disabling harmful functionalities, and simply refusing to train models for harmful and illegal purposes such as the generation of non-consensual intimate images. There are far fewer model developers than providers, making this link in the supply chain particularly effective for operationalizing accountability mechanisms. While compute providers also play a role by supplying the necessary resources for AI systems to function, their ability to monitor and control the specific use cases of these resources is more limited. To effectively stem the risks and harms that deepfakes engender, legislative solutions must address the issue as a whole, rather than only particular use cases, reflecting the broad and multifaceted threats that extend beyond any single application. A comprehensive legal framework would ensure that all potential abuses are addressed, creating a robust defense against the diverse and evolving nature of deepfake technology.

Recommendations 

  1. Congress should set up accountability mechanisms that reflect the spread of responsibility and control across the deepfake supply chain. Specifically, model developers and providers should be subject to civil and/or criminal liability for harms resulting from deepfakes generated by their systems. Similar approaches have been taken in existing Congressional proposals such as the NO AI FRAUD Act (H.R. 6943), which would create a private right of action against companies providing a “personalized cloning service”. When a model is being used to quickly and cheaply create an onslaught of deepfakes, merely holding each end-user accountable would be infeasible and would nonetheless be insufficient to prevent the avalanche of harmful deepfakes flooding the internet.
  2. Users accessing models to produce and share deepfakes should be subject to civil and/or criminal liability. This approach is already reflected in several bills proposed within Congress such as the NO AI FRAUD Act (H.R. 6943), NO FAKES Act, DEFIANCE Act (S.3696), and the Preventing Deepfakes of Intimate Images Act (H.R. 3106).
  3. Congress should place a responsibility on compute providers to revoke access to their services when they have knowledge that their services are being used to create harmful deepfakes, or to host models that facilitate the creation of harmful deepfakes. This will ensure that compute providers are not complicit in the mass production of deepfakes.
  4. Congress should support the passage of proposed bills like the NO FAKES Act, with some modifications to clarify the liability of model developers.54 Many recently introduced bills contain elements which would be effective in combating deepfakes, although it is crucial that they are strengthened to adequately address the multilayered nature of the deepfakes supply chain.

Provenance And Watermarking 

Watermarking aims to embed a statistical signal into AI-generated content, making it identifiable as such. Ideally, this would allow society to differentiate between AI-generated and non-AI content. However, watermarking has significant drawbacks. First, deepfakes such as non-consensual intimate images and CSAM are still considered harmful even when marked as AI-generated.55 Websites hosting AI-generated sexual images often disclose their origin, yet the content continues to cause distress to those depicted. Second, recent research has shown that robust watermarking is infeasible, as determined adversaries can easily remove these markers.56 As such, it is not sufficient to rely on watermarking alone as the solution to preventing the proliferation of deepfakes, nor for conclusively distinguishing real from synthetic content.

Nonetheless, certain types of watermarks and/or provenance data can be beneficial in combating the deepfake problem. “Model-of-origin” watermarking provisions, which would require generative AI models to include in the metadata of each output information on which model was used to create it and on that model’s developer and/or provider, can greatly enhance both legal and public accountability for developers of models used to create harmful content. Indicating the model of origin of outputs would also enable the identification of models that are disproportionately vulnerable to untoward use.
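
As a minimal sketch of what a model-of-origin record might look like when attached to a generated output, the example below builds a small metadata record in Python. The field names and values are assumptions for illustration; a real deployment would follow an established content-provenance standard and, ideally, bind the record to the content cryptographically so that it is hard to strip.

    import json
    from datetime import datetime, timezone

    # Hypothetical model-of-origin record to embed in the metadata of an
    # AI-generated output. Field names and values are illustrative placeholders.
    model_of_origin = {
        "generator_model": "example-image-model-v2",   # placeholder model name
        "model_developer": "Example Labs",             # placeholder developer
        "model_provider": "Example Hosting Service",   # placeholder provider
        "generated_at": datetime.now(timezone.utc).isoformat(),
        "content_is_synthetic": True,
    }

    print(json.dumps(model_of_origin, indent=2))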

Consistent with this approach, the Roadmap encourages committees to “review forthcoming reports from the executive branch related to establishing provenance of digital content, for both synthetic and non-synthetic content.” It also recommends considering “developing legislation that incentivizes providers of software products using generative AI and hardware products such as cameras and microphones to provide content provenance information and to consider the need for legislation that requires or incentivizes online platforms to maintain access to that content provenance information.”

Forthcoming legislation should indeed require providers of AI models to include content provenance information embedded in or presented along with the generated content; however, developers should also bear this responsibility. Unlike model providers, developers can embed provenance information directly into the models during the development phase, ensuring that it is an integral part of the AI-generated content from the outset.

Recommendations 

  1. Both model developers and providers should be required to integrate provenance tracking capabilities into their systems. While voluntary commitments have been made by certain developers, provenance watermarking is most trustworthy when it is widespread, and this is not currently the industry norm. As the National Institute of Standards and Technology report on Reducing Risks Posed by Synthetic Content (NIST AI 100-4) outlines, several watermarking and labeling techniques have become prominent, meaning that there are established standards that can be viably adopted by both developers and providers.
  2. Model developers and providers should be expected to make content provenance information as difficult to bypass or remove as possible, taking into account the current state of science. It is unlikely that most users creating deepfakes have the technical competency to remove watermarks and/or metadata, but model-of-origin provisions are nonetheless most effective if they are inseparable from the content. While studies have shown that malicious actors can bypass current deepfake labeling and watermarking techniques, stakeholders should ensure, to the greatest extent possible, that such bypasses are minimized. The absence of requirements that model-of-origin information be as difficult to remove as possible may unintentionally incentivize developers and deployers to employ watermarks that are easier to remove in an effort to minimize accountability.
  3. Congress should support the passage of the AI Labeling Act, which mandates clear and permanent notices on AI-generated content, identifying the content as AI-produced and specifying the tool used along with the creation date. This transparency helps hold developers accountable for harmful deepfakes, potentially deterring irresponsible AI system design.
  4. Congress should support amendments to various bills originating in the House, including the AI Disclosure Act and the DEEPFAKES Accountability Act, such that they clearly include model-of-origin watermarking provisions.57

Conclusion 

We thank the Senate AI Working Group for its continued dedication to the pressing issue of AI governance. AI as a technology is complex, but the Roadmap demonstrates a remarkable grasp of the major issues it raises for the continued flourishing of the American people. The next several months will be critical for maintaining global leadership in responsible AI innovation, and the urgent adoption of binding regulation is essential to creating the right incentives for continued success. Time and time again, Congress has risen to meet the challenge of regulating complex technology, from airplanes to pharmaceuticals, and we are confident that the same can be done for AI.


↩ 1 Throughout this document, where we recommend thresholds for the most advanced/dangerous GPAIS, we are generally referring to multi-metric quantitative thresholds set at roughly these levels. While the recent AI Executive Order (“Executive Order on the Safe, Secure, and Trustworthy Development and Use of Artificial Intelligence”) presumes an AI model to be “dual-use” if it is trained on 10²⁶ FLOPs or more, we recommend a training compute threshold set at 10²⁵ FLOPs to remain consistent with the EU AI Act’s threshold for presuming systemic risk from GPAIS. Such a threshold would apply to fewer than 10 current systems, most of which have already demonstrated some capacity for hazardous capabilities.

↩ 2 Restatement (Third) of Torts: Liability for Physical Harm § 20 (Am. Law Inst. 1997).

↩ 3 In a survey of 2,778 researchers who had published research in top-tier AI venues, roughly half of respondents gave at least a 10% chance of advanced AI leading to outcomes as bad as extinction. K Grace, et al., “Thousands of AI Authors on the Future of AI,” Jan. 2024.

See also, e.g., S Mukherjee, “Top AI CEOs, experts raise ‘risk of extinction’ from AI,” Reuters, May 30, 2023.

↩ 4 See, e.g., J Wei, et al., “Emergent Abilities of Large Language Models,” Jun. 15, 2022 (last revised: Oct. 26, 2022).

↩ 5 47 U.S.C. § 230(c)(1)

↩ 6 P Henderson, T Hashimoto, & M Lemley, “Where’s the Liability for Harmful AI Speech?” Journal of Free Speech Law.

↩ 7 See fn. 1.

↩ 8 Exclusive: GPT-4 readily spouts misinformation, study finds. Axios.

↩ 9 OpenAI’s GPT-4 Is Capable of Autonomously Exploiting Zero-Day Vulnerabilities. Security Today.

↩ 10 AI suggested 40,000 new possible chemical weapons in just six hours. The Verge.

↩ 11 Can large language models democratize access to dual-use biotechnology? MIT.

↩ 12 R Fang, et al., “Teams of LLM Agents can Exploit Zero-Day Vulnerabilities,” Jun. 2, 2024; T Claburn, “OpenAI’s GPT-4 can exploit real vulnerabilities by reading security advisories,” The Register, Apr. 17, 2024 (accessed Jun. 13, 2024).

↩ 13 J O’Brien & C Nelson, “Assessing the Risks Posed by the Convergence of Artificial Intelligence and Biotechnology,” Health Secur., 2020 May/Jun; 18(3):219-227. doi: 10.1089/hs.2019.0122. https://pubmed.ncbi.nlm.nih.gov/32559154/.

↩ 14 Executive Order on the Safe, Secure, and Trustworthy Development and Use of Artificial Intelligence. The White House.

SEC Adopts Rules on Cybersecurity Risk Management, Strategy, Governance, and Incident Disclosure by Public Companies. US Securities and Exchange Commission. https://www.sec.gov/news/press-release/2023-139

↩ 15 See fn. 1.

↩ 16 Draft text current as of May 25th, 2024.

↩ 17 Interim Report. National Security Commission on Emerging Biotechnology. Also see AIxBio White Paper 4: Policy Options for AIxBio. National Security Commission on Emerging Biotechnology.

↩ 18 Maintaining the AI Chip Competitive Advantage of the United States and its Allies. Center for Security and Emerging Technology.

↩ 19 Preventing AI Chip Smuggling to China. Center for a New American Security.

↩ 20 BIS has limited information on distinguishing between use-cases, and is compelled to favor highly adversarial and broad controls to mitigate security risks from lack of enforcement.

↩ 21 China lashes out at latest U.S. export controls on chips. Associated Press.

↩ 22 Examining US export controls against China. East Asia Forum.

↩ 23 For more information on TPMs, see Safeguarding the Future of AI: The Imperative for Responsible Development. Trusted Computing Group.

↩ 24 Similar technology is also employed in Apple’s Private Cloud Compute. See “Private Cloud Compute: A new frontier for AI privacy in the cloud,” Apple Security Research Blog, Jun. 10, 2024, https://security.apple.com/blog/private-cloud-compute/.

↩ 25 Open source systems developed in the United States have supercharged the development of AI systems in China, the UAE and elsewhere.

See, How dependent is China on US artificial intelligence technology? Reuters;

Also see, China’s Rush to Dominate A.I. Comes With a Twist: It Depends on U.S. Technology. New York Times.

↩ 26 See fn. 1.

↩ 27 For an example of a technical project meeting these conditions, see the Future of Life Institute response to the Bureau of Industry and Security’s Request for Comment RIN 0694–AI94 on implementation of additional export controls, which outlines an FLI project underway in collaboration with Mithril Security.

↩ 28 A CJADC2 Primer: Delivering on the Mission of “Sense, Make Sense, and Act”. Sigma Defense.

↩ 29 Draft text current as of May 25th, 2024.

↩ 30 “Air Force names Joe Chapa as chief responsible AI ethics officer.” FedScoop.

↩ 31 “US DoD AI chief on LLMs: ‘I need hackers to tell us how this stuff breaks’.” Venture Beat.

↩ 32 Widder, David Gray and West, Sarah and Whittaker, Meredith, Open (For Business): Big Tech, Concentrated Power, and the Political Economy of Open AI (August 17, 2023). Available at SSRN: https://ssrn.com/abstract=4543807 

↩ 33 Mouton, Christopher A., Caleb Lucas, and Ella Guest, The Operational Risks of AI in Large-Scale Biological Attacks: Results of a Red-Team Study. Santa Monica, CA: RAND Corporation, 2024.

↩ 34 Lermen, Simon, Charlie Rogers-Smith, and Jeffrey Ladish. “LoRA Fine-tuning Efficiently Undoes Safety Training in Llama 2-Chat 70B.” arXiv, Palisade Research, 2023, https://arxiv.org/abs/2310.20624.

↩ 35 Gade, Pranav, et al. “BadLlama: Cheaply Removing Safety Fine-Tuning from Llama 2-Chat 13B.” arXiv, Conjecture and Palisade Research, 2023, https://arxiv.org/abs/2311.00117.

↩ 36 Yang, Xianjun, et al. “Shadow Alignment: The Ease of Subverting Safely-Aligned Language Models.” arXiv, 2023, https://arxiv.org/abs/2310.02949.

↩ 37 M Anderljung, et al., “Frontier AI Regulation: Managing Emerging Risks to Public Safety.” Nov. 7, 2023. pp.35-36. https://arxiv.org/pdf/2307.03718 (accessed June 7, 2024).

↩ 38 CrowdStrike. 2024 Global Threat Report. CrowdStrike, 2023, https://www.crowdstrike.com/global-threat-report/.

↩ 39 Thiel, David, Melissa Stroebel, and Rebecca Portnoff. “Generative ML and CSAM: Implications and Mitigations.” Stanford Cyber Policy Center, Stanford University, 24 June 2023.

↩ 40 “The near-term impact of AI on the cyber threat.” UK National Cyber Security Centre.

↩ 41 S Casper, et al., “Black-Box Access is Insufficient for Rigorous AI Audits,” Jan. 25, 2024 (last revised: May 29, 2024), https://doi.org/10.1145/3630106.3659037.

↩ 42 “Sensitive Compartmented Information Facility (SCIF).” Glossary, NIST Computer Security Resource Center (CSRC).

↩ 43 “Rounds Raised by Startups Using AI In 2023,” Crunchbase.

↩ 44 N Maslej, et al., “The AI Index 2024 Annual Report,” AI Index Steering Committee, Institute for Human-Centered AI, Stanford University, Stanford, CA, Apr. 2024, https://aiindex.stanford.edu/report/.

↩ 45 Roose, K. (2024, June 4). OpenAI insiders warn of a “reckless” race for dominance. The New York Times.

↩ 46 For an example of a technical project meeting these conditions, see the Future of Life Institute response to the Bureau of Industry and Security’s Request for Comment (RIN 0694–AI94), which outlines an FLI project underway in collaboration with Mithril Security. Additional detail on the implementation of compute governance solutions can be found in the “Compute Security and Export Controls” section of this document.

↩ 47 H Ajder et al. “The State of Deepfakes.” Sept. 2019.

↩ 48 Onfido. “Identity Fraud Report 2024.” 2024.

↩ 49 G De Vynck. “OpenAI finds Russian and Chinese groups used its tech for propaganda campaigns.” May. 30, 2024.

↩ 50 M Meaker. “Slovakia’s election deepfakes show AI Is a danger to democracy.” Oct. 3, 2023.

↩ 51 S Cole. “This horrifying app undresses a photo of any Woman with a single click.” Jun. 26, 2019.

↩ 52 Deepgram. “Build voice into your apps.” https://deepgram.com/

↩ 53 Hoodem. “Create any deepfake with no limitation.” https://hoodem.com/

↩ 54 See Future of Life Institute’s ‘Recommended Amendments to Legislative Proposals on Deepfakes‘ report.

↩ 55 M B Kugler, C Pace. “Deepfake privacy: Attitudes and regulation.” Feb. 8, 2021.

↩ 56 H Zhang, B Edelman, & B Barak. “Watermarking in the sand.” Nov. 9, 2023. Kempner Institute, Harvard University.

↩ 57 See Future of Life Institute’s ‘Recommended Amendments to Legislative Proposals on Deepfakes‘ report.
