Honestly I’m not sure Oracles are the best approach either, but I’ll push the Pareto frontier of safe AI design wherever I can.
Though I’m less worried about the epistemic flaws exacerbating a box-break (an epistemically healthy AI breaking its box already seems maximally bad) and more worried that the epistemic flaws are prone to self-correction. For instance, the AI might construct a subagent of the ‘try random stuff, repeat whatever works’ flavor, which could route around whatever flaws were engineered into the parent by brute trial and error.
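For concreteness, here’s a minimal sketch of that flavor of loop (my illustration, not anything from the proposal; the function names are hypothetical): pure random proposal plus keep-what-works selection, with no epistemic machinery that could inherit the parent’s flaws.

```python
import random

def random_search(evaluate, propose, steps=1000):
    """'Try random stuff, repeat whatever works': keep a candidate,
    try random alternatives, and keep any that score better."""
    best = propose()
    best_score = evaluate(best)
    for _ in range(steps):
        candidate = propose()
        score = evaluate(candidate)
        if score > best_score:  # repeat whatever works
            best, best_score = candidate, score
    return best

# Toy usage: maximize -(x - 3)^2 over random guesses in [-10, 10];
# the loop converges near x = 3 without any model of the objective.
if __name__ == "__main__":
    result = random_search(
        evaluate=lambda x: -(x - 3) ** 2,
        propose=lambda: random.uniform(-10, 10),
    )
    print(result)
```

The point being that nothing in this loop consults the parent’s beliefs at all, so engineered epistemic limitations simply don’t transfer to it.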