Professional observations from a decade in enterprise internal audit, applied to the challenge of overseeing artificial intelligence.
How can we rely on our judgments of the safety and effectiveness of artificial intelligence systems if we don’t have access to robust evaluation mechanisms?
Only by thinking about governance from first principles can we derive methods to gain reasonable assurance over these nondeterministic systems.
Reading Eliezer Yudkowsky’s 2007 essay “The Lens That Sees Its Flaws”, I was struck by the concept and limits of self-reflection, and by the question of whether an observer can see the full truth within its own process of observation.
This timeless concept resonates deeply with my current work on safety and observability in artificial intelligence systems.
In my professional role at the world’s largest bank, I’ve experienced forms of observation that are not (and maybe cannot be) self-reflective.
I’d like to examine how observation that is not self-reflective touches on these three specific concepts:
Bias
Objectivity
Independence
The ideas in this article may be an addition to, more than a replacement for, the concepts in Yudkowsky’s essay. Maybe later essays address this further. However, I wanted to share my ‘beginner’s mind’ and take the time to relate my thoughts on Yudkowsky’s essay to the concepts in my daily work.
I’ll argue that we need to understand the dependence and interdependence of these three factors in order to apply them when architecting our future capabilities. And I’ll highlight a unique risk of relying on self-reflection that causes it to lose effectiveness specifically when applied to artificial intelligence systems.
The Nature of Bias, Objectivity and Independence
Bias is a persistent process that skews evaluation without a rational basis. It manifests as a particular kind of blindness when presented with an observable truth.
It appears in many forms, and in many settings. Artificial intelligence systems inherit bias as a byproduct of the training material. Some forms of bias are easy to identify, at least in others. But the many subtle forms of bias and the interplay among various biases are not easy to fully observe, account for, or adequately address.
Aiming for ‘unbiased’ is worthwhile but difficult to achieve or demonstrate in practice. Reducing known biases is quantifiable but incomplete, given the multifaceted interplay of a wide range of subtle biases. We can add diversity to our observations by including the concept of ‘differently-biased’.
By combining very differently-biased observations and reducing known biases we can meaningfully address the consequences of everyday bias.
Objectivity relates to how things are outside of the mind. It defines reality based on observable, measurable criteria, free from emotion and prejudice. It mirrors the ‘Correspondence Theory of Truth’, often traced to Aristotle’s Metaphysics, written more than 2,300 years ago.
Independence is the ability to reach the same conclusion via alternative methods, and the capability to change those methods at will. Importantly, it is impervious to influence or control that would compromise this ability. It exists separate from, and different from, that which it observes.
Each of these three complementary characteristics has a role to play when we author the capabilities that we need in order to understand, observe, and reflect on the way artificial intelligence interacts with our world.
The Role of Different Biases
The Johari window is a 1955 thought experiment that helps us categorize self-reflection versus external reflection, and it shows how bias can be put to productive use.
The Johari window is a grid of four squares, with the self and others as the two axes. Each axis has two variables: the concept of ‘known’ and ‘not known’.
One quadrant suggests there are some things about you that are obvious to yourself and others. Another quadrant posits that there are things unknown to all.
And then there are two categories of knowledge imbalance.
The first is that you know things about yourself that others do not, and maybe cannot, know: your favorite color, your current mood, your pet peeves. This is accessible via self-reflection.
The second is that which all around you see perfectly clearly, but that you remain blissfully unaware of. Maybe your inability to receive a compliment, your indecision when faced with two good choices, your impatience with those who move through life unaware of the things that you see.
It is this second category that is not accessible via self-reflection. It has to be a separate external lens. And a logical factor in this observational blindness is that some kind of bias is interfering with rational reflection.
By taking advantage of our own biases—and the fact they can be different to those of an external observer—we gain access to greater levels of awareness.
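The four quadrants described above can be sketched as a simple two-axis lookup. This is a minimal illustration of the Johari window’s structure, using the standard quadrant names:

```python
# The Johari window as a 2x2 lookup.
# Keys are (known_to_self, known_to_others); values are the quadrant names.
JOHARI = {
    (True, True): "open",       # obvious to yourself and others
    (True, False): "hidden",    # accessible only via self-reflection
    (False, True): "blind",     # visible only through an external lens
    (False, False): "unknown",  # unknown to all
}

def quadrant(known_to_self: bool, known_to_others: bool) -> str:
    """Classify a trait by who can observe it."""
    return JOHARI[(known_to_self, known_to_others)]

# The impatience everyone else sees, but you do not:
print(quadrant(known_to_self=False, known_to_others=True))  # blind
```

The ‘blind’ quadrant is the one self-reflection can never reach, which is the essay’s point: only an external, differently-biased observer can report on it.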
Objectivity: Due Process and Verification
The Johari window helps us understand the usefulness of different biases, and this next example examines the value of objectivity.
In society, humans are bound by a sense of duty, large and small. This manifests as a sense of moral obligation, civic and religious duty, and an internal voice of reason to guide our choices.
But these inward observations are not fully sufficient. Human societies, the world over, have external mechanisms to formally evaluate noncompliance and deliver appropriate consequences.
A tribal leader’s condemnation, a religious law, a judge and jury; these are all external lenses used to see the flaws of another ‘lens’.
This same process operates within the artificial personhood of the corporation. In highly regulated firms, there is the concept of three lines of defense.
The first is the business operative’s duty to perform their work with professional diligence. The second is a separate but not fully independent function of compliance or risk professionals that work in the world of risks and controls.
The third line of defense is the internal audit function, which reports directly to the board and is independent of the business to the fullest extent practicable. Its biases are very different from those of the business.
The crucial factor in these examples is objectivity: observing a measurable reality.
And objectivity in the third line of defense is possible by means of the audit process: documenting the logic, the reasoning, and the conclusions reached.
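As a sketch of that documentation discipline, an audit finding can be modeled as a record that forces the criteria, evidence, reasoning, and conclusion to be stated explicitly. The field names and thresholds here are my own illustration, not any firm’s standard:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class AuditFinding:
    """One documented observation: criteria, evidence, reasoning, conclusion."""
    control_tested: str
    criteria: str        # the measurable standard being tested against
    evidence: list[str]  # what was actually observed
    reasoning: str       # the logic connecting evidence to conclusion
    conclusion: str      # e.g. "effective" / "ineffective"

finding = AuditFinding(
    control_tested="model-output review",
    criteria="every high-risk output receives a second, independent review",
    evidence=["sampled 25 outputs", "3 lacked a reviewer sign-off"],
    reasoning="3/25 exceptions exceed the illustrative 5% tolerance",
    conclusion="ineffective",
)
print(finding.conclusion)  # ineffective
```

Because every conclusion is bound to documented evidence and reasoning, a different reviewer can retrace the logic and reach the same verdict, which is the measurable reality objectivity demands.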
We’ve examined the Johari window mechanism of using different biases; we’ve evaluated how objectivity delivers results in the human and ‘artificial person’ business domains.
But how does independence meaningfully enhance these concepts?
Independence
True independence is the ability to reach the same conclusion (i.e. objectivity) via a wholly different process, combined with the ability to decouple the mechanism from that which it observes.
It is somewhat of a rarity in practice, but we see it in the judicial process, hopefully in the legislative process, and in parts of the regulated business world.
Authoring the Future
We can use our practical examples so far to ground us in the principles, and extrapolate into the realm of the artificial. This is not a whimsical thought experiment. It’s a real and growing danger that needs our awareness and attention.
When artificial intelligence systems perform real-world work, the nondeterministic, probabilistic nature of the technology creates output variability.
To address this, outputs are verified by an ‘LLM-as-a-Judge’, essentially a second artificial intelligence system that confirms the output of the first is valid. But this second system shares the nondeterministic nature of the first.
Worse, training via ‘reinforcement learning from human feedback’ (RLHF) encodes agreeableness as a default; this sycophancy causes systems to concur more often than they critique.
Statistically, true objectivity is unlikely, because biases and misconceptions are shared with the source model.
In essence: The judge is grading its own judgment.
This might appear academic, but the more we rely on AI, the more we depend on the correctness of the outputs, and so we depend on this flawed LLM-as-a-Judge framework.
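The shared-bias problem can be made concrete with a toy simulation. A ‘judge’ that inherits the generator’s blind spot approves its errors, while an observer with a different blind spot catches them. All the topics, blind spots, and counts here are illustrative inventions, not measurements of any real model:

```python
import random

random.seed(0)

# Hypothetical blind spots: the generator and one judge share a flaw.
BLIND_SPOT = {"generator": "dates", "similar_judge": "dates", "different_judge": "names"}

def has_error(topic: str) -> bool:
    """The generator errs on questions touching its blind spot."""
    return topic == BLIND_SPOT["generator"]

def approves(judge: str, topic: str, erroneous: bool) -> bool:
    """A judge misses exactly the errors inside its own blind spot."""
    if erroneous and topic == BLIND_SPOT[judge]:
        return True  # shared flaw: the error looks fine to this judge
    return not erroneous

topics = [random.choice(["dates", "names", "math"]) for _ in range(1000)]
for judge in ("similar_judge", "different_judge"):
    missed = sum(1 for t in topics if has_error(t) and approves(judge, t, True))
    print(judge, "missed errors:", missed)
```

The similarly-biased judge waves through every error the generator makes; the differently-biased judge misses none of them. That is the whole argument for differently-biased observation in one loop.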
A False Separation?
The counterargument is that the internal and external framing here is, potentially, arbitrary; we may be creating a false separation.
What is the self without others to observe it? What is a person without the community around them? What is a regulated corporation without internal audit?
The sum of the parts makes the whole. They’re not different things. They’re simply different aspects of the same thing.
Conclusion
Ultimately, we need to recognize the purpose and the role of self-observation and to understand the limits and benefits of the approach. By doing so, we can then understand the role and the benefit of independent, objective and differently-biased analysis.
As artificial intelligence systems become more capable, maybe even moving towards the stated goal of ‘superintelligence’, we need to be able to evaluate these systems, particularly if the intelligence is different in kind from ours.
How can we rely on our judgments of safety and effectiveness without robust evaluation mechanisms?
I’ll close by arguing that self-reflection, while worthwhile in concept, is uniquely unsuited to the nature of artificial intelligence systems. Self-reflection by something that operates on probabilistic repetition simply creates a self-reinforcing feedback loop.
In governance, that’s an echo chamber of assurance.
And I’d offer a solution, based on the thoughts in this essay. The more we embrace the principles of independent, objective and differently-biased observation, the higher the level of determinism we can achieve.
Just as no human is fully deterministic, our systems don’t need to be either. But we do need a level of determinism that provides us with the reasonable assurance we need to progress with confidence. And that assurance should be codified into the third line of defense for artificial intelligence systems.