I’m interested in doing in-depth dialogues to find cruxes. Message me if you’d like to try this.
I do alignment research, mostly work in the general vicinity of agent foundations. Currently doing independent alignment research on ontology identification. Formerly on Vivek’s team at MIRI.
The nearest unblocked strategy problem combines with the adversarial non-robustness of learned systems to make a much worse combined problem. So I expect constraints to often break given more thinking or learning than was present in training.
Separately, reflective stability of constraint-following feels like a pretty specific target that we don’t really have ways of selecting for, so we’d be relying on a lot of luck for that. Almost all my deontological-ish rules are backed up by consequentialist reasoning, and I find it pretty likely that you need some kind of valid and true supporting reasoning like this to make sure an agent fully endorses its constraints.
Why do you believe this reasoning at 50%? Is this due to new evidence or arguments? I’ve never heard it supported by anything other than bad high-level analogies.
There are a couple of factors that make the chain weaker as it goes along.
The level of human understanding decreases with each iteration; otherwise we would just do it without AI help. So any failure of helpfulness becomes less and less noticeable.
With this kind of test-driven engineering loop, especially in the current ML paradigm, it’s easier to hide problems than it is to deeply fix them. Accidentally hiding problems is easy today, but it becomes even easier as the AI becomes more complex and intelligent.
The only way I can see it going well is if an early AI realizes how bad of an idea this chain is and convinces the humans to go develop a better paradigm.
(I wrote a post about similar stuff recently, trying to explain the generators behind my dislike of prosaic research.)