Essay Title - Aleksandr Bowkis

Geoffrey Hinton has expressed concerns about the difficulty of controlling superhuman AI systems, stating that the lack of examples of “more intelligent things being controlled by… less intelligent things” indicates that humanity may struggle to control sufficiently advanced AI systems. Hinton’s statement, however, is arguably incorrect: humanity already exerts control over entities with superhuman capabilities, such as nation states.

Nation states and large corporations are the best analogues we have for highly capable AI systems. The combined skills and knowledge of the individuals within these entities enable them to achieve levels of innovation, influence and power that no individual could reach alone. In this sense, they can be considered autonomous agents with superhuman capabilities. Just like a sufficiently advanced AI system, these entities have the capacity to do enormous damage; for example, a state could adopt an aggressive, expansionist policy and initiate a war.

There are, however, reasons to be optimistic about our ability to control similarly powerful AI systems. Despite their advanced capabilities, humanity has so far been able to guide these entities to act mostly, if imperfectly, to its benefit, and we might hope that similar strategies will be of use in controlling AI systems. Unfortunately, this control has so far been possible only when a small number of individuals influence a similarly powerful entity to apply meaningful penalties and rewards to the dangerous organisation, thereby redirecting its behaviour. Examples include sanctions imposed by states and EU legislation against social media companies to prevent the spread of misinformation online. Even these strategies have not always been successful.

This suggests that an effective approach to ensuring control of advanced AI systems might be to employ other, similarly powerful systems in an adversarial fashion. One example is the use of debate to develop safe AI, in which two similarly capable AI models are set against one another in a zero-sum game and tasked with persuading a human judge that their strategy is the best. The hope that this avoids duplicitous or manipulative behaviour on the part of the AI systems rests on the assumption that it is easier for a model to refute a lie told by its opponent than to mislead the judge itself. If this assumption holds, it is in the interest of each model to provide correct and honest answers, which helps to prevent the development of misaligned systems. Many challenges remain with this procedure; for example, it is not necessarily true that humans will be able to judge a debate between two sufficiently powerful AI models effectively. Nevertheless, these adversarial approaches to alignment, which parallel the strategies currently used to control and direct existing superhuman entities, show promise.

Belrose and Pope make the case that the problem of AI control is straightforward in their essay “AI is easy to control”. In the interests of brevity, only one of their arguments is rebutted here; a full discussion is left for further work. Their central claim is that the alignment process is remarkably effective for humans, and that access to a neural network’s weights should enable additional, more powerful control techniques for an AI. However, the threat posed by a sufficiently capable, misaligned artificial intelligence is so great that even a small failure rate is cause for serious concern. Furthermore, even among humans, there are many examples of individuals deviating from the shared moral values of their society when presented with sufficient incentives to do so (for example, a politician accepting bribes).

In summary, humanity is already capable of exerting (limited) control over superhuman entities such as nation states and corporations, contradicting Hinton’s statement. However, this control has been achieved only by directing similarly capable entities to act against them. This suggests that debate frameworks and similar adversarial approaches could be effective methods for ensuring the alignment of highly capable AI systems.

Does AI pose an existential risk by default?