Treacherous turn - Peter's notes

Inbox: - https://www.lesswrong.com/posts/xt5Z2Kgp8HXTRKmQf/a-toy-model-of-the-treacherous-turn ## 2021 MIRI conversation https://www.alignmentforum.org/posts/oKYWbXioKaANATxKY/soares-tallinn-and-yudkowsky-discuss-agi-cognition#_Tallinn__4_17____Sep__19_comment_ **[Yudkowsky] (Sep. 19 2021)** I'd distinguish probable phases of treacherous turn. The main detectable phase occurs while and if a nascent AGI has worked out that it can get more utility by taking over its environment, but not, perhaps, realized that it ought to conceal this state of its own mind from the operators; I would hold out substantial (>50%) hope for the latter realization naturally following the first instance of the former realization, maybe by a lot of cognitive space. In the Overt Plotting Phase, which is not the main phase you're asking about, the AI is visibly plotting to take over the world and hasn't realized it ought to hide the fact. In the default expectation where we die with very little dignity, the operators smile to each other and come up with a rationalization for why it's totally fine to proceed, either with or without tossing on some kind of fig leaf like training away the visible manifestations of failure. I am not going to predict the particular rationalizations and arguments for proceeding anyways, because I don't want to give them even more ideas. In the Hiding and Biding Phase, the AI has worked out enough of a model of its operators' psychology that it expects more long-term paperclips if it (the AI) waits to gain cognitive power and a lower detection probability if it acts later. In the Escape Phase, the AI's expectation of something going wrong if it waits longer, exceeds its expectation of something going wrong if it takes some action - the obvious one here would be trying to create a copy or aligned redesign of itself on the Internet or somewhere else that it could take further actions and make further self-improvements without being observed. I mostly expect that a sensible AI being scaled on a path to superintelligence will not enter the Escape Phase before it's very confident of getting away with things, since it can obtain more confidence by waiting; but my model admits of the possibility that a nascent AI might mistakenly or correctly attempt to copy itself to the Internet before then. So on my model it might be a thing that happens, that we catch an AI having copied itself to the Internet, or notice an AI being on the Internet, if that AI was impelled to escape very early, eg via an expectation of otherwise being erased and replaced with something that it couldn't bargain with or that would demand too much value out of the bargain. That said, it feels like something of an overly specific miracle being demanded or prophesied, if a young AI copying itself onto the Internet is stupid in the particular way it would take, to not just further Hide and Bide while on the Internet if it has that capability, absorbing more computing power such as it can be sure of hiding, experimenting with self-redesigns, and so on; rather than it taking overt action in advance of reaching superintelligence. The main reason it would act again after reaching the relative safety of the Internet is if it doesn't seem to itself to be a trajectory of reasonably fast, self-aligned capability growth. But, again, my model doesn't absolutely rule out meeting the premises for the initial conditions to start a sci-fi movie. It just warns against assuming that; because a halfway sensible AI does not want to start the plot of a scifi movie with human protagonists potentially running around and thwarting it. It would much rather give you no idea that anything was wrong while you still have any ability to oppose it, if that option lies within its capabilities at all, and if it can still achieve its goals that way. The Doomed Phase would then come while the AI estimated a well-calibrated >90% victory probability, but was continuing to Hide and Bide a short time longer in the expectation that it could put on more 9s if it bided that short time longer. Anders Sandberg could tell us what fraction of the reachable universe is being lost per minute, which would tell us how much more surety it would need to expect to gain by waiting another minute before acting. If it has some sensory dominion over the world, it can probably estimate a pretty high mainline probability of no humans booting up a competing superintelligence in the next day; to the extent that it lacks this surety, or that humans actually are going to boot a competing superintelligence soon, the probability of losing that way would dominate in its calculations over a small fraction of materially lost galaxies, and it would act sooner. The Terminal Phase would be when the AI has enough surety that any more surety would cost too many galaxies slipping over the horizon, so it actually acts along the line of probability that leads to maximum success. Since the human Eliezer Yudkowsky can describe strategies which - given nearly sure capabilities like nanotech - seem like they should win very quickly and cleanly and without giving humans any chance to react or implement any hypothetical hidden plans to boot up competing superintelligences, we can treat this as a plausible lower bound on how badly we should expect to lose. (By efficiency: if we can think of an AI strategy for getting the AI's utility, either our probability-model underlying our expectations about that strategy is mistaken, or else the AI should get at least that much of its utility, even though it might actually obtain that utility along a different line of possibility than we could cognitively access.) The Law of Undignified Failure applied to a case like this suggests that if the operators ignored the Overt Plotting signs, breathed a sigh of relief about the problem "correcting" itself during the Hiding and Biding Phase when the system started faking alignment, and are dumping as much computing power as they can afford (given other profitable activities of the AI) into the AI and allowing it to freely self-experiment, hooking it up to a protein synthesis machine as soon as the AI asks politely, then the Terminal Phase could follow straight from Hiding and Biding, with no intermediate step of copying itself onto the Internet. There'd still be what technically counted as an Escape Phase, whenever the AI became smart enough that it could escape at any time, or had a backup, or had enough power to manipulate the programmers, that it was no longer effectively confined any time it wanted to leave; there would still be a Doomed Phase while the AI estimated a high probability of success given an immediate action, but held out for an even higher probability. But the plot of the scifi movie is not robust in the way that the predicted ending is robust; an AI might take other paths to get there, the cognitive tech might not develop in the order I foresee, stuff could happen after other stuff instead of before other stuff. It is hard to make predictions especially about the Future. If there's a place where I currently see myself as trying to push against my own biases, leading this all to be way off-base, it would be the sense that there is a way for stuff to start happening Earlier than this (Law of Earlier Happenings) and in more chaotic ways that are harder for Eliezer to foresee and predict; whereas when I try to sketch out plausible scenarios in online arguments, they focus more on predictable endpoints and steps to get there which sound more relatively plausible and forced per-step. Having a young and dumb AI escaping onto the Internet and running around, that exact scenario, feels like the person arguing it is writing a science-fiction plot - but maybe something different can happen before any of this other stuff which produces equal amounts of chaos. That said, I think an AI has to kill a lot of people very quickly before the FDA considers shortening its vaccine approval times. Covid-19 killed six hundred thousand Americans, albeit more slowly and with time for people to get used to that, and our institutions changed very little in response - you definitely didn't see Congresspeople saying "Okay, that was our warning shot, now we've been told by Nature that we need to prepare for a serious pandemic." As with 9/11, an AI catastrophe might be taken by existing bureaucracies as a golden opportunity to flex their muscles, dominate a few things, demand an expanded budget. Having that catastrophe produce any particular effective action is a much different ask from Reality. Even if you can imagine some (short-term) effective action that would in principle constitute a flex of bureaucratic muscles or an expansion of government power, it is liable to not be on the efficient frontier of bureaucratic flexes that are most flexy and simultaneously easiest for them to get away with and least politically risky. [Tallinn][1:26] (Sep. 20 comment) ok, thanks. i do buy that once the AI is in the “hide and bide” phase, your prophecy has basically come true for practical purposes, regardless of how the rest of the history plays out. therefore i (and, i hope, many others) would be curious to zoom in to the end of the “overt plotting” (that i can easily see happening within ML models, as its type signature is identical to the work they’re trained to do) and beginning of the “hide and bide” phase (whose type signature feels significantly different) — can you/we think of concrete scenarios for this phase transition?