Bachelor's in general and applied physics. AI safety / agent foundations researcher wannabe.
I love talking to people, and if you are an alignment researcher we will have at least one topic in common (though I am also very interested in talking about topics that are new to me!), so I encourage you to book a call with me: https://calendly.com/roman-malov27/new-meeting
Email: roman.malov27@gmail.com
GitHub: https://github.com/RomanMalov
TG channels (in Russian): https://t.me/healwithcomedy, https://t.me/ai_safety_digest
Roman Malov
Are bacteria agents? If yes, washing hands is kinda evil.
This was great! I think you should publish more fiction.
the swamp-man’s thoughts are optimized by the thought-experimenter
This resembles the old p-zombies sequence.
It lacks an adequately sophisticated self-model.
I’m not sure that having a self-model is such an important property of consciousness. I can imagine forgetting every fact about myself and being unable to tell what experiences I’m currently having, but still feeling them.
Rethinking everything
I have a biohazard suit under my bed. IIRC, I got it a few years ago during one of the lockdowns, after hearing some news about an outbreak of a new respiratory virus that was deadlier than COVID (and now I can't even remember its name).
I’m also constantly struggling with using quotes inside quotes, especially when the outer quote ends in the same place as the inner quote. Usually, I just use different quotes (“double” vs. ‘single’ vs. «angle»), but when there are multiple quotes in the same place, it looks ugly no matter what I do.
A “tomato” is a red, savory fruit.
If it were “the word ‘tomato’ refers to a red, savory fruit”, then it would be the perfect case of map/territory use of quotes.
How do we rule out using this computation “illegitimately” (sneaking the computational work the so-called Turing-complete formalism was supposed to do into the translation step, as with the argument for Turing-completeness of the digits of π)?
I suggest that the framework used for translation itself should not be Turing-complete (which, of course, creates a self-referential definition, but we are in a less specified territory anyway; using this definition, we can at least form clusters a bit better).
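A toy sketch of the kind of "illegitimate" use I have in mind (the substrate and function names below are made up for illustration, not from any post): the "translation step" claims to merely locate an answer that the substrate has already computed, but locating it requires computing the answer yourself.

```python
# Toy illustration (made-up names): a "dumb" digit stream stands in for the
# digits of pi, and the encoder "translates" an input into a position in the
# stream where the answer supposedly already sits. All the real computation
# happens inside that translation step.

def dumb_substrate(position: int) -> int:
    """A non-computational-looking substrate: digits of 0,1,2,...,9,1,0,1,1,...
    (concatenation of the natural numbers), standing in for digits of pi."""
    stream = "".join(str(i) for i in range(position + 2))  # long enough prefix
    return int(stream[position])

def decode(position: int) -> int:
    """Read a two-digit number off the substrate at a given position."""
    return dumb_substrate(position) * 10 + dumb_substrate(position + 1)

def sneaky_encoder(n: int) -> int:
    """The 'translation step': it claims to merely find where the substrate has
    already 'computed' n squared, but it must compute n squared to do so."""
    target = n * n          # <-- the smuggled-in computation
    position = 0
    while decode(position) != target:
        position += 1
    return position

def square_via_substrate(n: int) -> int:
    """The whole pipeline: the substrate gets the credit, the encoder does the work."""
    return decode(sneaky_encoder(n))

print(square_via_substrate(7))  # 49, but the digit stream did none of the work
```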
Sending information is equivalent to storing information if you consider Galilean relativity (any experiment performed in a frame of reference moving at a constant speed is equivalent to the same experiment in a static frame of reference).
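Spelled out (my own notation, just to make the equivalence concrete): under a Galilean boost, a bit held at a fixed location in one frame is the same bit being carried across a distance in another frame.

```latex
% Storage in frame S: the bit sits still for a time T.
\[
  x(t) = x_0, \qquad 0 \le t \le T .
\]
% Galilean boost to a frame S' moving with velocity -v relative to S:
\[
  x'(t) = x(t) + v t = x_0 + v t ,
\]
% so in S' the same bit is carried a distance vT during the time T, i.e. it is
% being "sent". Since the two frames are physically equivalent, any constraint
% on storing a bit for time T is also a constraint on sending it over distance vT.
```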
Another good reason for this is the subsystem alignment problem. Suppose you are trying to answer a question, but you’re not quite sure how. You can run some extra computations to help you. What computations do you run? Well, considering the myopic/non-myopic spectrum, you either want to run an extremely non-myopic computation, or an extremely myopic one (because those are Good, and anything else runs the risk of Evil).
In contrast (or, rather, in addition), there could be myopic supersystems constructed from non-myopic agents. Human Science is one example: scientists want lots of different things (money, power, status, respect; I'm excluding curiosity for the sake of the argument) and make plans to achieve them, but the system of incentives is set up in such a way that the result is an accurate reflection of reality (or at least that's the idea; real science is not that perfect).
I have only a surface-level understanding of this topic, but active inference (one of the theories of intelligent agency) views brains (and agents) as prediction-error minimizers, and actions as a way of affecting the world so as to reduce the error of some extremely strongly held prediction (held so strongly that it is easier to change the world than to change the prediction).
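A caricature of that picture in code (my own toy model, with made-up numbers, not from any source): the prediction is held so rigidly that the agent closes the prediction error mostly by acting on the world rather than by updating the prediction.

```python
# Toy caricature of the active-inference framing: the agent can shrink its
# prediction error either by updating the prediction (perception) or by
# changing the world (action). A near-1 "stubbornness" means the prediction
# barely moves, so the world gets changed instead.

world_temperature = 15.0        # actual state of the world
predicted_temperature = 20.0    # extremely strongly held prediction
stubbornness = 0.99             # ~1.0: easier to change the world than the prediction

for _ in range(50):
    error = world_temperature - predicted_temperature

    # Perception: the prediction moves only a tiny bit toward the observation.
    predicted_temperature += (1 - stubbornness) * error

    # Action: the agent pushes the world toward the prediction (e.g. turns on
    # a heater), which is what removes most of the prediction error.
    world_temperature -= stubbornness * 0.2 * error

print(round(world_temperature, 2), round(predicted_temperature, 2))
# Both end up near 19.8: the world moved by almost 5 degrees, the prediction by about 0.2.
```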
My understanding mostly comes from this post by Scott Alexander:
My poor, fragile, little cognitive engines! These, then, will be the twin imperatives of your life: surprisal minimization and active inference. If your brains are still too small to process such esoteric terms, there are others available. Your father’s ancestors called them Torah and tikkun olam; your mother’s ancestors called them Truth and Beauty; your current social sphere calls them Rationality and Effective Altruism. You will learn other names, too: no perspective can exhaust their infinite complexity. Whatever you call them, your lives will be spent in their service, pursuing them even unto that far-off and maybe-mythical point where they blur into One.
Seems resonant with what you write in this sequence.
And also, if we have two hypotheses, h₁ and h₂, and policy π has a much lower expected value than the BATNA under both of them, such that both terms in the product are negative, then the total product is positive (and large), and the argmax is going to choose this policy (which is strictly worse than the BATNA).
But I guess both of those issues can be easily assumed away.
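To put numbers on the first issue (hypothetical expected values, with the BATNA normalized to 0):

```python
# Hypothetical numbers illustrating the negative-times-negative failure mode:
# a policy that is much worse than BATNA under *both* hypotheses gets a large
# positive product score, so the argmax prefers it over a genuinely fine policy.

batna = 0.0

policies = {
    "slightly_better_than_batna": {"h1": 1.0, "h2": 2.0},
    "much_worse_than_batna":      {"h1": -10.0, "h2": -10.0},
}

def product_score(expected_values: dict) -> float:
    score = 1.0
    for ev in expected_values.values():
        score *= ev - batna
    return score

for name, evs in policies.items():
    print(name, product_score(evs))

# slightly_better_than_batna 2.0
# much_worse_than_batna 100.0   <- the argmax picks the strictly worse policy
```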
Another reasonable option is to adjust the weight of each hypothesis based on its probability:
I'm not sure that's the best way to assign the weights; if the expected values are less than 1, then lower-probability hypotheses get more weight.
such an assumption is not strong enough to guarantee that [the student] has a 50-50 coinflip hypothesis.
What do you mean by a 50-50 hypothesis? Is it such that P(H) = P(T) = 0.5? If that's the case, it doesn't seem fair to ask the student to have such a hypothesis: the task is to learn to imitate the teacher, and the teacher doesn't actually produce behavior like HTHHT; it only ever performs HHHHH or TTTTT.
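To make the distinction explicit (my own notation): an i.i.d. 50-50 hypothesis and a 50-50 mixture over the two deterministic teachers give the same single-flip marginal but very different sequence distributions.

```latex
% I.i.d. 50-50 hypothesis over 5 flips: every sequence is equally likely,
\[
  P_{\text{iid}}(s) = 2^{-5} \quad \text{for all } s \in \{H, T\}^5 ,
\]
% so sequences like HTHHT get positive probability. A 50-50 mixture over the
% two deterministic teachers,
\[
  P_{\text{mix}}(s) = \tfrac12\,\mathbf{1}[s = HHHHH] + \tfrac12\,\mathbf{1}[s = TTTTT] ,
\]
% only ever produces HHHHH or TTTTT, even though both hypotheses assign
% P(H) = P(T) = 1/2 to a single flip.
```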
I’m just going from pure word vibes here, but I’ve read somewhere (to be precise, here) about Todorov’s duality between prediction and control: https://roboti.us/lab/papers/TodorovCDC08.pdf
And people who are scared of spiders often avoid looking at spiders (or even things that resemble them; consider the effectiveness of pranks with fake spiders).
I'm not sure, but this looks more like a learned cooperative policy than like two entities having models of each other and reaching conclusions about each other's cooperation.
I just resolved my confusion about CoT monitoring.
My previous confusion: People say that CoT is progress in interpretability, that we now have a window into the model’s thoughts. But why? LLMs are still just as black-boxy as they were before; we still don’t know what happens at the token level, and there’s no reason to think we understand it better just because intermediate results can be viewed as human language.
Deconfusion: Yes, LLMs are still black boxes, but CoT is a step toward interpretability because it improves capabilities without making the black box bigger. In an alternate universe, we could just have even bigger, even messier LLMs (and I assume interpretability gets harder with size: after all, some small transformers have been interpreted), and observing the progress of CoT reasoning models is an update away from this universe, which was the (subjective) default path before this update.
Probably far earlier than humans would be able to replace neurons, if the AGI+ is computer-executable code. We do not know the Turing machine that is able to execute us; an AGI+ would know its own.