Agent-foundations researcher. Working on Synthesizing Standalone World-Models, aiming at a technical solution to AGI risk that is fit for worlds where alignment is punishingly hard and we only get one try.
Currently looking for additional funders ($1k+, details). Consider reaching out if you’re interested, or donating directly.
Or get me to pay you money ($5-$100) by spotting holes in my agenda or providing other useful information.
I was hoping someone would go ahead and try this. Great work, love it.
Hm, I think that specific argument falls through in that case. Suppose humans indeed like BFWHASF more than ice cream, but mostly eat the latter due to practical constraints. That means that, once we become more powerful and those constraints fall away, we would switch over to BFWHASF. But that’s actually the regime in which we’re not supposed to be able to predict what an agent would do!
As the argument goes, in a constrained environment with limited options (such as when a relatively stupid AGI is still trapped in our civilization’s fabric), an agent might appear to pursue values it was naively shaped to have. But once it grows more powerful, and gets the ability to take what it really wants, it would go for some weird unpredictable edge-instantiation thing. The suggested “BFWHASF vs. ice cream” case would actually be the inverse of that.
The core argument should still hold:
For one, BFWHASF is surely not the optimal food. Once we have the ability to engineer arbitrary food items, and also to modify our taste buds however we see fit, we would likely go for something much weirder.
The quintessential example would of course be us getting rid of the physical implementation of food altogether, and instead focusing on optimizing substrate-independent (e. g., simulated) food-eating experiences (ones not involving even simulated biology). That maps to “humans are killed and replaced with chatbots (not even uploaded humans) babbling nice things” or whatever.
Even people who prefer “natural” food would likely go for e. g. the meat of beasts from carefully designed environments with very fast evolutionary loops set up to make their meat structured for maximum tastiness.[1] Translated to ASI and humans, this suggests an All Tomorrows kind of future (except without Qu going away).
But I do think BFWHASF doesn’t end up as a very good illustration of this point, if humans indeed like it a lot.
[1] Just tune the implementation details of all of this such that the pipeline still meets those people’s aesthetic preferences for “natural-ness”. E. g., note that people who want “real meat” today are perfectly fine with eating the meat of selectively bred beasts.