Could you explain what a breadth-first AI safety plan could look like?
If AI capabilities growth depends on training run length in such a way that 16 hours are enough for an AI to become wildly superhuman with no warning signs beforehand (IIRC this is what Yudkowsky depicted in IABIED?), then we are screwed. Luckily, this seems to contradict the scaling laws.
Otherwise the world is AI-2027-like, i.e. the ECI of AIs depends predictably on a combination of metrics (e.g. it scales linearly with time for all Anthropic models except Mythos, which also has a much larger model size) and architecture details (e.g. neuralese).
In this case mankind has to establish a chain of AIs in which each one reliably aligns its successor, continuing for as long as capabilities keep growing. Suppose that humans and trusted AIs of ECI x can align AIs of ECI y with probability P(x,y,t) if they spend t time units. If we knew that P somehow tended to 1 (e.g. if Agent-4 created an Agent-5 aligned to the entirety of Agent-4’s cognition), then we’d be able to construct an optimal sequence of time allocations and capability increases. Alas, we find it hard to INcrease P and very easy to DEcrease it by doing things like introducing neuralese or training on the CoTs.
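To make the bookkeeping concrete, here is a minimal Python sketch of how the success probability of such a chain could be computed. The functional form of P is purely an illustrative assumption (nobody knows the real one), as are the names `p_align`, `chain_success` and the parameter `k`; the handoffs are also assumed to succeed or fail independently.

```python
import math


def p_align(x, y, t, k=1.0):
    """Hypothetical P(x, y, t): chance that overseers of ECI x align an AI
    of ECI y when they spend t time units.

    Assumed shape (NOT an empirical claim): rises with t, falls as the
    capability gap y - x widens. k is a made-up scale constant.
    """
    gap = max(y - x, 0.0)
    return 1.0 - math.exp(-k * t / (1.0 + gap) ** 2)


def chain_success(levels, times, k=1.0):
    """Probability that every handoff in the chain succeeds.

    levels: ECI of each generation, starting with the trusted one.
    times:  time units spent on each handoff (len(levels) - 1 entries).
    Assumes the handoffs are independent, so the probabilities multiply.
    """
    prob = 1.0
    for x, y, t in zip(levels, levels[1:], times):
        prob *= p_align(x, y, t, k)
    return prob


if __name__ == "__main__":
    # Same total time budget, one big jump versus three smaller ones.
    print(chain_success([100, 160], [12]))
    print(chain_success([100, 120, 140, 160], [4, 4, 4]))
```

Which schedule comes out ahead depends entirely on the assumed shape of P, which is the point: without knowing how P behaves, there is no way to pick the optimal sequence.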