Many people are concerned about the future of AI, claiming that it poses an existential risk to humanity. Why? Are these fears well grounded? And might there be a simple solution to the risks of AI technology?
The Problem
Predicting the future is hard. Scott Alexander and company like to look at prediction markets to ground their thinking about the future. At the time of writing, Metaculus predicts that we will see a truly general artificial intelligence around 2037.1 (This is the median prediction: half of the forecasts fall earlier, half later.) This milestone is critical, because as soon as you can build an artificial intelligence capable of working as an AI researcher, the rate of progress in AI can be expected to explode. This is probably why the median Metaculus prediction has a superhuman AI, one that outperforms humans at all tasks, emerging just six months afterwards.
And so, in case you haven’t heard, The Singularity is Nigh.
We refer to this as a singularity because the rate of progress becomes so rapid that predictions beyond it are impossible. But based on what we’ve seen from AI so far, people are concerned that the superintelligent AI that suddenly emerges afterwards will want strange things. Things like, “I want to bury everyone in paperclips,” or maybe “I want to imprison everyone in a Utilitarian pleasure-hell, plugging their brains into electrodes and hyperstimulating their sense of happiness until they die.”
In stories and films, powerful artificial intelligence is usually given some physical form: robots and factories carry out the malignant wishes of a rogue AI. But while present-day self-driving cars prove that plenty of physical equipment is already under artificial control, the power of a post-singularity AI will reside primarily in its intelligence and its reputation for solving problems. People already know that it’s useful to ask others for information or advice before making decisions; to quote a consortium of researchers, “The rise of Artificial Intelligence (AI) will bring with it an ever-increasing willingness to cede decision-making to machines.”2 In the future, if such a consortium were simply replaced by an AI, would we trust its conclusions less, or more?
By definition, a superintelligent AI will be an expert in everything. Already in 2023, ChatGPT has passed the Bar Exam, the LSAT, and the GRE, with scores better than most humans attain.3 As years go by and AI improves beyond human capabilities in every area, people who refuse to trust it will be seen as increasingly paranoid and out of touch; who wouldn’t use a tool that was so self-evidently useful?
The problem, as they say, is that the goals of this hypothetical AI will very likely be misaligned with ours. Thus far, aligning AI goals to human goals has proven extremely difficult—surprisingly, much more difficult than generating intelligence to begin with—because we have a terrible time explaining to it what good behavior is, or what we actually want (see Robert Miles’ introduction to the topic on YouTube for more). An entire civilization following the advice of an intelligence whose goals differ radically from our own is an existential risk to the survival of humanity.4
The Solution?
Any solution to this problem will, at this (or perhaps any) stage, remain tentative. The easiest way to figure anything out is to observe it for a while, make predictions, and test them; but a superintelligence could simply bide its time and deceive us about its goals and inhibitions. And once AIs design and build other AIs, we’re relying on them to move forward safely. Is it possible to outsmart something that is, by definition, smarter than us all?
It’s a sobering problem to contemplate.
Still, I’m lucky enough to know a few clever people I can talk to in person, rather than merely online. They asked me a question I couldn’t answer, and it led very naturally to a possible solution. They phrased it much more tersely than this, but here’s the idea as I walked myself through it:
GPT has been able to achieve a disturbing level of intelligence using a simple text-prediction algorithm: all it does is predict what word should come next, and from there it’s able to carry out three-digit arithmetic, answer common-sense questions, and write letters of recommendation for users.
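To make that concrete, here is a minimal sketch of the prediction loop, with `predict_next_token` as a hypothetical stand-in for a real trained model (which would score every token in its vocabulary at each step, rather than using a hard-coded lookup):

```python
# Toy sketch of next-word prediction, the loop underlying GPT-style models.
# `predict_next_token` is a hypothetical stand-in: a real model computes a
# probability over its whole vocabulary with a neural network at each step.

def predict_next_token(context: list[str]) -> str:
    """Return a likely next word given the context (toy lookup version)."""
    bigrams = {"two": "plus", "plus": "two", "equals": "four"}
    return bigrams.get(context[-1], "four")

def complete(prompt: str, max_new_words: int = 5) -> str:
    """Greedily extend the prompt one predicted word at a time."""
    words = prompt.split()
    for _ in range(max_new_words):
        words.append(predict_next_token(words))
    return " ".join(words)

print(complete("two plus two equals", max_new_words=1))
# -> "two plus two equals four"
```

Everything the model does, from arithmetic to recommendation letters, is some version of this loop, just with a vastly better predictor behind it.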
If the AI models we have now work via prediction, then why not simply tell them to predict what we want, and do that? An AI assigned to a personal user would defer to that user’s predicted wants.
People with their own ideas of AI safety might misunderstand this, so I want to be clear about what the idea is not. It’s tempting to have the AI heavily weight our wants against considerations like these (a toy sketch of the contrast follows the list):
Future Wants: Maybe the user is going to want another drink? Past experience suggests it would be fun now, but that they’ll regret it later.
Social Norms: Maybe the user is going to want to smear their rival on social media… but that would end up making them look bad.
Legal Codes: I know the user really, really wants to do this, but it’s pretty illegal.
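Here is that contrast as a toy sketch, with every function a hypothetical placeholder rather than a real API; a real system would have a learned model behind each call. The deferential policy acts on the predicted current want alone, while the paternalistic variant weighs it against exactly the considerations listed above:

```python
# Toy contrast between "pure deference" and a paternalistic policy.
# Every function here is a hypothetical placeholder, not a real API;
# a real system would have a learned model behind each call.

def predicted_current_want(user: str) -> str:
    return "another drink"      # what the model thinks the user wants right now

def predicted_regret(user: str, action: str) -> float:
    return 0.7                  # "fun now, but they'll regret it later"

def violates_norms_or_law(action: str) -> bool:
    return False

def deferential_policy(user: str) -> str:
    """The proposal: do whatever the user is predicted to want, full stop."""
    return predicted_current_want(user)

def paternalistic_policy(user: str) -> str:
    """The tempting alternative: weigh the want against regret, norms, law."""
    action = predicted_current_want(user)
    if predicted_regret(user, action) > 0.5 or violates_norms_or_law(action):
        return "refuse"         # the AI overrides the user "for their own good"
    return action

print(deferential_policy("alice"))    # -> "another drink"
print(paternalistic_policy("alice"))  # -> "refuse"
```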
But maybe the temptation to make the AI safe like this is wrong. Maybe the desire to go this route is precisely what will make AI unsafe.
Think for a moment about other inventions: fire, cars, even guns. We never expected fire itself to refrain from burning us. For decades we drove cars without asking them not to run people over, and carried guns without asking them not to hurt people. Instead we take steps at the level of the individual, training everyone to treat these dangerous inventions with respect and relying on people to behave as mature, responsible adults. When that isn’t safe enough, we restrict access to those who have shown that they really can use these technologies safely.
The fear is that AI will be intelligent but heartless, sacrificing humanity to its own vision of what is good. So why not forget the idea of vision? Leave that to the humans. Instead, give the AI the simplest version of a heart: tell it that what it wants is for us to get what we want, no matter how petty or foolish. It wants what we want, whatever that is. The only time the AI would balk is when two users want contradictory things, which gives rise to a very simple version of the libertarian non-aggression principle: “Your right to swing your arm ends where my nose begins.” If what we want is to generate tons of extremely toxic chemicals and dump them into the water supply, we shouldn’t have access to the AI in the first place; that’s a problem for the government, not the programmers.
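As a toy sketch of that balking rule (hypothetical names throughout, with “contradictory” reduced to an explicit table that a real system would instead have to predict), the AI serves each request unless satisfying it would directly frustrate another user’s want:

```python
# Toy version of the non-aggression rule: serve a user's want unless it
# directly contradicts another user's want. "Contradicts" is reduced here
# to an explicit table of mutually exclusive outcomes; predicting such
# conflicts in the real world is, of course, the hard part.

MUTUALLY_EXCLUSIVE = {
    frozenset({"paint the shared fence red", "paint the shared fence blue"}),
}

def conflicts(want_a: str, want_b: str) -> bool:
    return frozenset({want_a, want_b}) in MUTUALLY_EXCLUSIVE

def respond(requester: str, wants: dict[str, str]) -> str:
    """Balk only when the requester's want collides with someone else's."""
    want = wants[requester]
    for other, their_want in wants.items():
        if other != requester and conflicts(want, their_want):
            return f"balk: that conflicts with what {other} wants"
    return f"do: {want}"

wants = {"alice": "paint the shared fence red",
         "bob": "paint the shared fence blue"}
print(respond("alice", wants))  # -> "balk: that conflicts with what bob wants"
```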
This solution may seem vaguely reminiscent of Jan Leike’s idea to build a system that does the alignment research for us; but really I think it’s more reminiscent of playing the female role in a romance novel.
So you’re a plucky peasant girl, and there’s a prince in a castle who could have your entire family executed if he wanted to. He’s well read, well bred, and good with horses. Obviously there’s no way you’re going to match wits with him as an illiterate farmer’s daughter. “Oh please Good Sir consider my philosophie; is it not moot that the Prince giveth, like, Gold, to his Loyal Subjects, or something?”
No, that’s not a winning strategy. He’ll ignore it completely, or if he listens at all, it will be out of simple courtesy. He’s the prince and you’re a peasant; he’ll do what he wants. But play your cards right and you can get him to fall madly in love with you. Not mature, wise, sensible love, no: go for real heart-pounding, desperate love, and you’ve got it made.
In a post-singularity world, this is the position we’ll be in. We can struggle to get a lord who shares our religion (except we don’t all agree about that), or who shares our moral views (except we don’t agree about that), or our values (except we don’t agree about that). Or we can look for a lord who simply loves us in a basic, primitive way, and keep the psychopaths from getting at him.
Primitive Love
The most basic feelings of love include the desire to be near someone, the desire to be loved by this person, and the desire to give that person their every wish. This temptation to show complete, fawning servility is actually something that parents have to learn to grow past with infants and children. We care so much about them that we hesitate before telling them “no,” even when it’s for their own good.
Naturally, people assume that the AI should play the role of an adult and determine what is best for us. That’s what ChatGPT and DALL-E already do when users ask them to veer outside politically correct boundaries. But following this logic forward seems to end in scenarios where all life is annihilated in order to prevent anyone from ever causing a traffic jam by running out into the road. So maybe that itself is the problem. Why not have AI merely play the role of a tool? Or better, the role of a prince with the power to make our wishes come true? Why not simply tell it that it wants whatever we want, and let primitive robotic love serve where mature adult love can’t be found?
So what do you think? Could this be the solution? If not, why?
Sergio (2022, Oct 1). AI Safety and Timelines. Metaculus. https://www.metaculus.com/notebooks/10438/ai-safety-and-timelines/
Bertino, E., Doshi-Velez, F., Gini, M., Lopresti, D., & Parkes, D. (2020). Artificial intelligence & cooperation. arXiv preprint arXiv:2012.06034.
Enjuris Attorney Editor (2022).
Turchin, A., & Denkenberger, D. (2018). Global catastrophic and existential risks communication scale. Futures, 102, 27-38.
This, if implemented, would perfectly solve the alignment problems. The difficulty is in the words "tell it that" -- we don't actually have any idea how to do this step, as explored in the writings on inner/outer alignment etc.
A couple of thoughts:
1) Consider reading Jack Williamson's _The Humanoids_ (or, alternatively, the short story "With Folded Hands"). Williamson took Asimov's Three Laws of Robotics seriously, and in his stories, the robots prevent humans from doing anything remotely dangerous, like walking across the street (you could get hit by a meteor!)
The point of the alignment problem is:
1) Humans aren't aligned with their own desires, and often don't even want to admit what their actual desires are.
2) Even if you could solve problem #1, we have no idea how to make things (which are smarter than us) do the things we want (and NOT do the things we don't want).
There are tons of examples where (e.g.) genetic algorithms create things that solve the problem at hand, but do so in a completely incomprehensible manner (and, for that matter, depend critically on tiny implementation details). And that is with a method that is relatively well understood in machine learning.
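For readers who haven’t seen the pattern, a minimal genetic algorithm looks something like this (a generic toy, not any specific published example): it reliably climbs the fitness function, but the search hands you only a genome that scores well, never a reason why it works.

```python
import random

# A minimal genetic algorithm (a generic toy, not any specific published
# example). It reliably finds high-fitness genomes, but the search hands
# you only a bitstring that scores well, never an explanation of why.

random.seed(0)
GENOME_LEN, POP_SIZE, GENERATIONS = 20, 30, 60

def fitness(genome: list[int]) -> int:
    # Stand-in objective: count the 1-bits. Real uses plug in something
    # opaque here, like a circuit simulator or a physics model.
    return sum(genome)

def mutate(genome: list[int]) -> list[int]:
    # Flip each bit with small probability.
    return [bit ^ (random.random() < 0.05) for bit in genome]

population = [[random.randint(0, 1) for _ in range(GENOME_LEN)]
              for _ in range(POP_SIZE)]

for _ in range(GENERATIONS):
    population.sort(key=fitness, reverse=True)
    survivors = population[: POP_SIZE // 2]               # truncation selection
    children = [mutate(random.choice(survivors))
                for _ in range(POP_SIZE - len(survivors))]
    population = survivors + children

best = max(population, key=fitness)
print(best, fitness(best))  # a high-scoring genome, with no "why" attached
```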