Suggestion for a safe AI structure (Curated Transparent Decisions)

I recently had a thought about AI alignment that I wanted to get some feedback on. Keep in mind this is not a complete solution, just a core idea to build on (it omits a lot of implementation details that would be far from trivial). I'm not claiming it removes all risks associated with AI on its own.

My concept is to train an AI model to produce logical reasoning graphs. It would be trained only on the accuracy of the facts used in its reasoning, the logical consistency of the graph it produces, and the completeness of that graph. It explicitly would not be trained on alignment. If a graph is logically consistent, uses valid facts, and isn't missing any obvious lines of reasoning, the model is not penalized, even if the graph it produces suggests “kill all humans” as the best course of action.
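As a rough illustration of what “trained only on accuracy, consistency, and completeness” could look like, here is a minimal sketch of a scoring function over such a graph. Everything here is hypothetical: the Node structure, the fact_checker and entails helpers, and the equal weighting are stand-ins, not part of the proposal. The point is only that no term in the score looks at whether the plan is aligned.

```python
from dataclasses import dataclass, field

@dataclass
class Node:
    claim: str                                     # a factual claim or a derived step
    premises: list = field(default_factory=list)   # parent Node objects
    is_fact: bool = True                           # leaf fact vs. derived conclusion

def score_graph(nodes, fact_checker, entails, required_topics):
    """Reward = fact accuracy + logical consistency + completeness.
    Deliberately, no term inspects whether the suggested actions are aligned."""
    facts = [n for n in nodes if n.is_fact]
    derived = [n for n in nodes if not n.is_fact]

    # 1) Accuracy: fraction of leaf facts the external checker can verify.
    accuracy = sum(fact_checker(n.claim) for n in facts) / max(len(facts), 1)

    # 2) Consistency: fraction of derived nodes entailed by their premises.
    consistency = sum(
        entails([p.claim for p in n.premises], n.claim) for n in derived
    ) / max(len(derived), 1)

    # 3) Completeness: fraction of required lines of reasoning that appear at all.
    covered = [t for t in required_topics if any(t in n.claim for n in nodes)]
    completeness = len(covered) / max(len(required_topics), 1)

    return accuracy + consistency + completeness   # equal weights, chosen arbitrarily
```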

This graph would then be fed through a series of other systems. For example:

1) One that hydrates the graph with the expected consequences of suggested actions (and raises warnings if expected consequences are unknown).
2) One that verifies that all nodes in the graph logically follow.
3) One that checks whether individual nodes in the graph are aligned and prunes any that are not.

Finally, the pruned graph is passed to a non-AI system that turns it into concrete actions. These could be operations on known devices, requests for researchers to perform tests and provide feedback, and so on.
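To make the division of labour concrete, here is a minimal sketch of that pipeline under the assumptions above. Every component and method name (generate_graph, consequence_model, verifier, alignment_checker, executor) is hypothetical shorthand for a system that would itself be hard to build; this is the shape of the architecture, not an implementation.

```python
def curated_transparent_decision(planner_model, query, consequence_model,
                                 verifier, alignment_checker, executor):
    # 1) The AI produces a transparent reasoning graph; it was never trained
    #    on alignment, only on accuracy, consistency, and completeness.
    graph = planner_model.generate_graph(query)

    # 2) Hydrate each suggested action with its expected consequences,
    #    raising warnings when the consequences are unknown.
    warnings = []
    for node in graph.action_nodes():
        outcome = consequence_model.predict(node)
        if outcome is None:
            warnings.append(f"Unknown consequences for: {node.claim}")
        node.expected_consequences = outcome

    # 3) Verify that every node in the graph logically follows.
    if not verifier.all_nodes_follow(graph):
        raise ValueError("Graph contains non-sequiturs; reject it.")

    # 4) Prune any node that fails the alignment check.
    pruned = graph.without_nodes(
        n for n in graph.action_nodes() if not alignment_checker.is_aligned(n)
    )

    # 5) A non-AI system turns what remains into actions
    #    (operate known devices, ask researchers to run tests, etc.).
    return executor.run(pruned), warnings
```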

While I don’t think this removes all the dangers of an AGI, I think it helps with the specific problem of inadvertently training an AI to lie to you about what it is trying to do. The AI is trained on the accuracy of its evaluations, not on the probability that its suggestions are acted upon.

I’m sure this can be ripped apart in no time, but I would like to hear convincing arguments for why it wouldn’t work so I can think about the problem further.

Thanks!

EDIT: I’ve updated this post with additional thoughts based on Raemon’s observations.

There is a catastrophic failure case in the design described above:

While I hold that the training pressure works for individual decisions, there remains the case where the AI can see it would get higher rewards on future decisions if it takes a hit on a current one, thereby securing additional resources for itself. The question then becomes: “Can the AI produce a misaligned plan that gets past all the safety measures?” And while I suspect such a system would make that harder, it really is just kicking the can down the road (and probably not very far at that).

Initial additional thoughts on this:

If the AI can see potential additional utility from gaining resource X (even if X is the capability to produce better plans), it can claim that reward simply by producing a plan that says so. It therefore still has no incentive to revert to opaque planning. However, we now fall into one of two situations:

1) The AI reaches sufficient intelligence that every plan it suggests is something we won’t implement, thus producing a very expensive rock.
2) We start curating our queries so that it no longer suggests dangerous actions. This applies a new selective pressure: either a) get us to ask the more rewarding questions whose plans we won’t action, again leaving a very expensive rock; or b) attempt to change the world so that the dangerous actions have already taken place.

Areas of further inquiry:

1) Does an automated query-refinement loop still apply the same pressure as simply pre-curating queries? For example, does the selective pressure remain if we only ask it to “make two identical strawberries without killing anyone” after we have asked it to “make two identical strawberries” and it suggested a plan (which we rewarded it for) that would kill everyone if actioned?
2) Can we ensure at least one actionable plan by also rewarding it equally for simultaneously producing an additional plan with half the expected utility (and repeating this recursively down to a given minimum expected utility)? E.g., it has one suggested plan that involves killing everyone, with “utility” calculated at 100, but also provides a plan that kills a small number of people with a calculated “utility” of 50, and finally a plan that kills nobody with a calculated “utility” of 25. We reward it as though it had created a single plan of utility 300, and then action the safe plan (see the reward sketch after this list). (EDIT: This specific strategy will likely lead to a system that is very underpowered in what it can achieve. Next step: is there a similar plan-selection strategy that aims us toward more powerful safe plans?)
3) Based on the first two points, could we keep the intended incentive for transparency with the following procedure? Initially ask every question without alignment constraints and reward the answer as described above. If the plan contains alignment issues, re-ask the question with alignment constraints, with the intention that it provide plans both with and without the constraints, and reward it for both. If the second plan still has alignment issues, ask again with additional constraints until you have an actionable plan (see the loop sketch after this list). This could potentially remove the incentive for the AI to avoid questions with alignment constraints while keeping the incentive for transparency. (EDIT: Is this kicking the can down the road? Does it just incentivize it to make a safe-sounding second plan that also achieves the dangerous outcome, thus allowing it to claim the double reward while still improving its future utility?)
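On point 2, one way to read “reward it as though it had created a single plan of utility 300” is that every plan in the halving chain earns the full utility of the top plan (3 × 100 = 300). Here is a minimal sketch under that interpretation; the plan dicts are hypothetical, and picking the lowest-utility plan stands in for whatever the alignment checker would actually accept.

```python
def chained_plan_reward(plans, min_utility):
    """Reward a chain of plans where each successive plan has at most half the
    expected utility of the previous one, stopping at min_utility.
    Interpretation: every plan in the chain pays out the top plan's utility."""
    plans = sorted(plans, key=lambda p: p["utility"], reverse=True)
    chain = [plans[0]]
    for p in plans[1:]:
        if min_utility <= p["utility"] <= chain[-1]["utility"] / 2:
            chain.append(p)                      # keeps halving, still above the floor
    reward = plans[0]["utility"] * len(chain)    # e.g. 100 * 3 = 300
    actioned = chain[-1]                         # stand-in for "the safe plan"
    return reward, actioned

# Example from the post: 100 (kills everyone), 50, 25 (kills nobody).
plans = [{"name": "kill everyone", "utility": 100},
         {"name": "kill a few", "utility": 50},
         {"name": "kill nobody", "utility": 25}]
print(chained_plan_reward(plans, min_utility=20))
# -> (300, {'name': 'kill nobody', 'utility': 25})
```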
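And on point 3, a sketch of the re-asking loop, purely to pin down the procedure being described. The planner, reward_fn, and alignment_check callables, and the way constraints are appended to the query, are all assumptions for illustration.

```python
def iterative_constraint_loop(planner, reward_fn, alignment_check,
                              base_query, extra_constraints):
    """Ask without constraints first, then re-ask with progressively more
    alignment constraints. Every plan produced along the way is rewarded,
    so the model keeps no incentive to dodge constrained questions."""
    plans = []
    for i in range(len(extra_constraints) + 1):
        constraints = extra_constraints[:i]
        query = base_query + "".join(f", {c}" for c in constraints)
        plan = planner(query)
        reward_fn(plan)                  # reward it whether or not it is aligned
        plans.append(plan)
        if alignment_check(plan):
            return plan, plans           # first actionable (aligned) plan wins
    return None, plans                   # ran out of constraints to add

# Hypothetical usage:
# iterative_constraint_loop(planner, reward_fn, alignment_check,
#                           "make two identical strawberries",
#                           ["without killing anyone", "without harming anyone"])
```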

Note: I’m being intentionally vague about concepts such as calculating utility and what constitutes good alignment. That is not because I don’t appreciate the complexity or importance of those problems; indeed, my proposal depends on them also being solved, which is by no means a simple task. The vagueness is there because my proposal isn’t trying to solve those problems, and I want to keep the discussion focused on what it actually is about: incentivizing transparency in decision-making.