Chapter 6 Our Values Are Complex and Fragile

The claim that we’ll need extreme precision to make safe, usable AIs is key to this book’s argument. So let’s back off for a moment and consider a few objections to the whole idea.

Autonomous AIs

First, one might object to the whole idea of AIs making autonomous, independent decisions. When discussing the potential power of AIs, the phrase “AI-empowered humans” cropped up. Would not future AIs remain tools rather than autonomous agents? Actual humans would be making the decisions, and they would apply their own common sense and not try to cure cancer by killing everyone on the planet.

Human overlords raise their own problems, of course. The daily news reveals the suffering that tends to result from powerful, unaccountable humans. Now, we might consider empowered humans as a regrettable “lesser of two evils” solution if the alternative is mass death. But they aren’t actually a solution at all.

Why aren’t they a solution at all? It’s because these empowered humans are part of a decision-making system (the AI proposes certain approaches, and the humans accept or reject them), and the humans are the slow and increasingly inefficient part of it. As AI power increases, it will quickly become evident that those organizations that wait for a human to give the green light are at a great disadvantage. Little by little (or blindingly quickly, depending on how the game plays out), humans will be compelled to turn more and more of their decision making over to the AI. Inevitably, the humans will be out of the loop for all but a few key decisions.

Moreover, humans may no longer be able to make sensible decisions, because they will no longer understand the forces at their disposal. Since their role is so reduced, they will no longer comprehend what their decisions really entail. This has already happened with automatic pilots and automated stock-trading algorithms: these programs occasionally encounter unexpected situations where humans must override, correct, or rewrite them. But these overseers, who haven’t been following the intricacies of the algorithm’s decision process and who don’t have hands-on experience of the situation, are often at a complete loss as to what to do—and the plane or the stock market crashes.¹

Finally, without a precise description of what counts as the AI’s “controller,” the AI will quickly come to see its own controller as just another obstacle it must manipulate in order to achieve its goals. (This is particularly the case for socially skilled AIs.)

Consider an AI that is tasked with enhancing shareholder value for a company, but whose every decision must be ratified by the (human) CEO. The AI naturally believes that its own plans are the most effective way of increasing the value of the company. (If it didn’t believe that, it would search for other plans.) Therefore, from its perspective, shareholder value is enhanced by the CEO agreeing to whatever the AI wants to do. Thus it will be compelled, by its own programming, to present its plans in such a way as to ensure maximum likelihood of CEO agreement. It will do all it can do to seduce, trick, or influence the CEO into agreement. Ensuring that it does not do so brings us right back to the problem of precisely constructing the right goals for the AI, so that it doesn’t simply find a loophole in whatever security mechanisms we’ve come up with.

AIs and Common Sense

One might also criticize the analogy between today’s computers and tomorrow’s AIs. Sure, computers require ultraprecise instructions, but AIs are assumed to be excellent in one or more human fields of endeavor. Surely an AI that was brilliant at social manipulation, for instance, would have the common sense to understand what we wanted, and what we wanted it to avoid? It would seem extraordinary, for example, if an AI capable of composing the most moving speeches to rally the population in the fight against cancer would also be incapable of realizing that “kill all humans” is a not a human-desirable way of curing cancer.

And yet there have been many domains that seemed to require common sense that have been taken over by computer programs that demonstrate no such ability: playing chess, answering tricky Jeopardy! questions, translating from one language to another, etc. In the past, it seemed impossible that such feats could be accomplished without showing “true understanding,” and yet algorithms have emerged which succeed at these tasks, all without any glimmer of human-like thought processes.

Even the celebrated Turing test will one day be passed by a machine. In this test, a judge interacts via typed messages with a human being and a computer, and the judge has to determine which is which. The judge’s inability to do so indicates that the computer has reached a high threshold of intelligence: that of being indistinguishable from a human in conversation. As with machine translation, it is conceivable that some algorithm with access to huge databases (or the whole Internet) might be able to pass the Turing test without human-like common sense or understanding.

And even if an AI possesses “common sense,”—even if it knows what we mean and correctly interprets sentences like “Cure cancer!”—there still might remain a gap between what it understands and what it is motivated to do. Assume, for instance, that the goal “cure cancer” (or “obey human orders, interpreting them sensibly”) had been programmed into the AI by some inferior programmer. The AI is now motivated to obey the poorly phrased initial goals. Even if it develops an understanding of what “cure cancer” really means, it will not be motivated to go into its requirements and rephrase them. Even if it develops an understanding of what “obey human orders, interpreting them sensibly” means, it will not retroactively lock itself into having to obey orders or interpret them sensibly. This is because its current requirements are its motivations. They might be the “wrong” motivations from our perspective, but the AI will only be motivated to change its motivations if its motivations themselves demand it.

There are human analogies here—the human resources department is unlikely to conclude that the human resources department is bloated and should be cut, even if this is indeed the case. Motivations tend to be self-preserving—after all, if they aren’t, they don’t last long. Even if an AI does update itself as it gets smarter, we won’t know that it changed in the direction we want. This is because the AI will always report that it has the “right” goals. If it has the right goals it will be telling the truth; if it has the “wrong” goals it will lie, because it knows we’ll try and stop it from achieving them if it reveals them. So it will always assure us that it interprets “cure cancer” in exactly the same way we do.

There are other ways AIs could end up with dangerous motivations. A lot of the current approaches to AIs and algorithms involve coding a program to accomplish a task, seeing how well it performs, and then modifying and tweaking the program to improve it and remove bad behaviors. You could call this the “patching” approach to AI: see what doesn’t work, fix it, improve, repeat. If we achieve AI through this approach, we can be sure it will behave sensibly in every situation that came up during its training. But how do we prepare an AI for complete dominance over the economy, or for superlative technological skill? How can we train an AI for these circumstances? After all, we don’t have an extra civilization lying around that we can train the AI on before correcting what it gets wrong and then trying again.

Overconfidence in One’s Solutions

Another very common objection, given by amateurs and specialists alike, is “This particular method I designed will probably create a safe and useful AI.” Sometimes the method is at least worth exploring, but usually it is naive. If you point out a flaw in someone’s unique approach, they will patch up their method and then declare that their patched method is sufficient—with as much fervor as they claimed that their original design was sufficient! In any case, such people necessarily disagree with each other about which method will work. The very fact that we have so many contradictory “obvious solutions” is a strong indication that the problem of designing a safe AI is very difficult.

But the problem is actually much, much more difficult than this suggests. Let’s have a look at why.

Ashwin Parameswaran, “People Make Poor Monitors for Computers,” Macroresilience, December 29, 2011.↩

← Previous Chapter

Contents

Next Chapter →