[ASoT] Reflectivity in Narrow AI

I wrote this a month ago while working on my SERI MATS applications for shard theory. I’m now less confident in the claims and the usefulness of this direction, but it still seems worth sharing.

I think reflectivity happens earlier than you might expect in embedded RL agents. The basic concepts around value drift (“addiction”, …) are already available in the world model from pretraining on human data (and alignment posts), and modeling context-dependent shard activation and value drift helps the SSL world model predict future behavior. Because of this, I think we can get useful reflectivity and study it in sub-dangerous AI. This is where a good chunk of my alignment optimism comes from. (Understanding reflectivity and instrumental convergence in real systems seems very important for building a safe AGI.)

In my model, people view reflectivity through some sort of magic lens (possibly due to conflating mystical consciousness with non-mystical self-awareness?). Predicting that I’m more likely to crave a cookie after seeing a cookie or after becoming hungry isn’t that hard. And if you explain shards and contextual activation (granted, understanding the abstract concepts might be hard), you get more. Abstract reflection seems hard but still doable in sub-AGI systems.
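To make “not that hard” concrete, here’s a toy sketch of contextual shard activation as a simple predictive model. Everything in it (the feature names, the weights) is a made-up illustration, not a claim about any real system:

```python
# Toy sketch: a contextually activated "cookie shard", modeled as a
# function from context features to activation strength. All names
# and weights are hypothetical illustrations.

def cookie_shard_activation(context: dict) -> float:
    """The cookie shard activates more strongly when a cookie is
    visible or when the agent is hungry (contextual activation)."""
    strength = 0.0
    if context.get("sees_cookie"):
        strength += 0.6
    if context.get("hungry"):
        strength += 0.3
    return min(strength, 1.0)

print(cookie_shard_activation({"sees_cookie": True, "hungry": True}))   # 0.9
print(cookie_shard_activation({"sees_cookie": False, "hungry": False})) # 0.0
```

A world model that can represent functions like this can predict its own future cravings, which is already a weak form of reflectivity.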

There’s also a meaningful difference between primitive/dumb shards and sophisticated/reflective ones: my health shards outmaneuver my cookie shards because the former are “sophisticated enough” to reflectively outsmart the cookie shards (e.g. by hiding the cookies), while the latter are contextually activated and “more primitive”. I expect modeling the contextual activation of the cookie shard not to be hard, but reflective planning like hiding cookies seems harder, though still doable. (This might be where the majority of the interesting/hard part lies.)
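To gesture at the asymmetry, here’s a hedged extension of the toy sketch above (again, every action and dynamic here is a hypothetical illustration): the health shard is “sophisticated” in that it plans through a model of the cookie shard’s contextual activation, while the cookie shard does no planning at all.

```python
# Toy sketch, continuing the example above: a "sophisticated" health
# shard that plans reflectively over the predicted activation of the
# primitive cookie shard. All actions and dynamics are hypothetical.

def cookie_shard_activation(context: dict) -> float:
    # Same contextual activation model as in the earlier sketch.
    return min(0.6 * bool(context.get("sees_cookie"))
               + 0.3 * bool(context.get("hungry")), 1.0)

def apply_action(context: dict, action: str) -> dict:
    """Toy transition model: hiding the cookies removes the visual
    trigger; doing nothing leaves the context unchanged."""
    new_context = dict(context)
    if action == "hide_cookies":
        new_context["sees_cookie"] = False
    return new_context

def health_shard_plan(context: dict, actions: list) -> str:
    """Reflective planning: simulate each action's effect on the
    context, then pick the action that minimizes the cookie shard's
    predicted future activation."""
    return min(actions,
               key=lambda a: cookie_shard_activation(apply_action(context, a)))

context = {"sees_cookie": True, "hungry": True}
print(health_shard_plan(context, ["do_nothing", "hide_cookies"]))  # hide_cookies
```

The point of the asymmetry: the cookie shard is just a lookup from context to behavior, while the health shard queries a model of the cookie shard itself, and that extra reflective step is the part that seems harder to learn.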