Bachelor's degree in general and applied physics. AI safety / agent foundations researcher wannabe.
I love talking to people, and if you are an alignment researcher we will have at least one topic in common (though I am also very interested in talking about topics that are new to me!), so I encourage you to book a call with me: https://calendly.com/roman-malov27/new-meeting
Email: roman.malov27@gmail.com
GitHub: https://github.com/RomanMalov
TG channels (in Russian): https://t.me/healwithcomedy, https://t.me/ai_safety_digest
Roman Malov
I’m just going from pure word vibes here, but I’ve read somewhere (to be precise, here) about Todorov’s duality between prediction and control: https://roboti.us/lab/papers/TodorovCDC08.pdf
And people who are scared of spiders often avoid looking at spiders (or even things that resemble them; consider the effectiveness of pranks with fake spiders).
I’m not sure, but this looks more like a learned cooperative policy than two entities having models of each other and coming to conclusions about each other’s cooperation.
I just resolved my confusion about CoT monitoring.
My previous confusion: People say that CoT is progress in interpretability, that we now have a window into the model’s thoughts. But why? LLMs are still just as black-boxy as they were before; we still don’t know what happens at the token level, and there’s no reason to think we understand it better just because intermediate results can be viewed as human language.
Deconfusion: Yes, LLMs are still black boxes, but CoT is a step toward interpretability because it improves capabilities without making the black box bigger. In an alternate universe, we could just have even bigger, even messier LLMs (and I assume interpretability gets harder with size: after all, some small transformers have been interpreted), and observing the progress of CoT reasoning models is an update away from that universe, which was my (subjective) default expectation before this update.
Is there a reason to hate Bill Gates? From a utilitarian perspective, he might be “the best person ever,” considering how much he gives to effective charities.
Do people just use the “billionaire = evil” heuristic, or are there other considerations?
What is it you’re looking for that makes this not count?
IIUC, the microbiome is not infectious and does not evolve as fast as diseases do.
I suppose that sociologists, historians, philosophers, and (especially) futurologists do tackle the questions you describe, though maybe there is a sense in which they aren’t doing so in a zoomed-out enough way.
I have just read the story, the post, and several comments, and am now amalgamating them into this reading:
The citizens of Omelas torture the child because they know that they themselves wouldn’t believe in their own utopia (just like the reader), and that “their happiness, the beauty of their city, the tenderness of their friendships, the health of their children [...] depend wholly on this child’s abominable misery.”
It depends on it precisely because of this human flaw of not being capable of believing in such a perfect world without any tradeoffs. It becomes a self-fulfilling prophecy of sorts—they need to torture the child because they believe they need to torture the child.
The ones who walk away, therefore, are those who are able to recognize and discard that flaw, and build a new utopia without tradeoffs.
he called rescuers, thereby advancing the Pareto frontier and avoiding pesky utilitarian calculations (which are mostly incomputable for humans anyway).
Hand out cookies on the street
Well, I would refrain from taking free cookies from a stranger.
I think that “bring treats to the party for everyone” is a better replacement.
People sometimes just say stuff.
Sometimes, the amount of optimization power that was put into the words is less than you expect, or less than the gravity of the words would imply.
Some examples:
“You are not funny.” (Did they evaluate your funniness across many domains and in diverse contexts in order to justify a claim like that?)
“Don’t use this drug, it doesn’t help.” (Did they do the double-blind studies on a diverse enough population to justify a claim like that?)
“That’s the best restaurant in town.” (Did they really go to every restaurant in town? Did they consider that different people have different food preferences?)
That doesn’t mean you should disregard those words. You should use them as evidence. But instead of updating on the event “I’m not funny,” you should update on the event “This person, having some intent, not putting a lot of effort into evaluating this thing and mostly going off the vibes and shortness of the sentence, said to me ‘You are not funny.’”
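To make the asymmetry concrete, here is a toy Bayes calculation with made-up numbers (the 0.5 prior and both likelihoods are purely illustrative): because people also say “you are not funny” for reasons unrelated to its truth, the likelihood ratio is weak and the update is modest.

```python
# Toy Bayes update with made-up numbers: the evidence is "they said it", not "it is true".
p_not_funny = 0.5                      # illustrative prior
p_said_given_not_funny = 0.30          # they might say it because it's true...
p_said_given_funny = 0.15              # ...but also off the cuff, to tease, out of annoyance, etc.

p_said = (p_said_given_not_funny * p_not_funny
          + p_said_given_funny * (1 - p_not_funny))
posterior = p_said_given_not_funny * p_not_funny / p_said
print(round(posterior, 3))  # 0.667: a real update, but far from a verdict
```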
Do we have an AI Safety scientific journal?
If we do not, we should (probably) create it.
Wake up babe, new superintelligence company just dropped
And they show some impressive results.
The Math Inc. team is excited to introduce Gauss, a first-of-its-kind autoformalization agent for assisting human expert mathematicians at formal verification. Using Gauss, we have completed a challenge set by Fields Medallist Terence Tao and Alex Kontorovich in January 2024 to formalize the strong Prime Number Theorem (PNT) in Lean (GitHub).
Gauss took 3 weeks to do so, which seems way beyond METR’s task-length-horizon prediction. Though I’m not sure that’s a fair comparison, both because we don’t have a baseline human time for this task, and because formalization is a domain where it is very hard to get off track: the criterion of success is very crisp.
I think alignment researchers have to learn to use it (or any other powerful math prover assistant) in order to exploit every bit of leverage we can get.
I would like at some point to develop a theory of an agent who has “other stuff to do” besides the decision problem presented to them. Maybe this agent has some macro-scale quantities, like the (current) amount of compute, the (current) speed of self-improvement, or the (current) rate of gaining utility (analogous to macro-scale variables like “temperature” and “pressure” in thermodynamics). So when you present this agent with a decision problem, it can decide that it’s not even worth its time, or it can spend years of time and gazillions of flops of compute if the query is actually worth it (though I expect the first version of the theory to only deal with queries much smaller than the overall stuff the agent deals with). Continuing the analogy with thermodynamics, I would like the macro-scale properties of the whole agent to somehow emerge from micro-scale properties of the decision problems it faces plus some uniformity assumptions.
I hope that this would help develop a scale-free theory of agency, in the same sense that thermodynamics is scale-free.
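To gesture at the kind of object I have in mind, here is a minimal Python sketch (the names, numbers, and decision rule are all made up for illustration, not part of any actual theory): the agent carries macro-scale state and engages with a query only if its expected value beats the opportunity cost of interrupting whatever else the agent is doing.

```python
from dataclasses import dataclass

@dataclass
class MacroState:
    """Macro-scale quantities of the agent (the 'thermodynamic' variables)."""
    compute_available: float   # e.g. FLOPs the agent can spare right now
    utility_rate: float        # utility per second gained from its other stuff

@dataclass
class DecisionProblem:
    expected_value: float      # estimated utility from solving it
    est_time: float            # seconds of attention it would consume
    est_compute: float         # FLOPs it would consume

def worth_engaging(agent: MacroState, problem: DecisionProblem) -> bool:
    """Engage only if the query beats the opportunity cost of the agent's background activity."""
    if problem.est_compute > agent.compute_available:
        return False
    opportunity_cost = agent.utility_rate * problem.est_time
    return problem.expected_value > opportunity_cost

# A busy agent ignores a small query but accepts a sufficiently valuable one.
busy = MacroState(compute_available=1e15, utility_rate=10.0)
print(worth_engaging(busy, DecisionProblem(expected_value=5.0, est_time=1.0, est_compute=1e9)))      # False
print(worth_engaging(busy, DecisionProblem(expected_value=1e6, est_time=3600.0, est_compute=1e12)))  # True
```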
Your definition seems sensible to me. Humans are not Bayesians; they are not built as probabilistic machines with all of their probabilities stored explicitly in memory. So I usually think in terms of a Bayesian approximation, which is basically what you’ve said: it’s unconscious when you don’t try to model those beliefs as Bayesian, and conscious otherwise.
Just as you can unjustly privilege a low-likelihood hypothesis just by thinking about it, you can in the exact same way unjustly unprivilege a high-likelihood hypothesis just by thinking about it. Example: I believe that when I press a key on a keyboard, the letter on the key is going to appear on the screen. But I do not consciously believe that; most of the time I don’t even think about it. And so, just by thinking about it, I am questioning it, separating it from all hypotheses which I believe and do not question.
Some breakthroughs were in the form of “Hey, maybe something which nobody ever thought of is true,” but some very important breakthroughs were in the form “Hey, maybe this thing which everybody just assumes to be true is false.”
People often say, “Oh, look at this pathetic mistake AI made; it will never be able to do X, Y, or Z.” But they would never say to a child who made a similar mistake that they will never amount to doing X, Y, or Z, even though the theoretical limits on humans are much lower than for AI.
Idea status: butterfly idea
In real life, there are too many variables to optimize each one. But if a variable is brought to your attention, it is probably important enough to consider optimizing it.
Negative example: you don’t see your eyelids; they are doing their job of protecting your eyes, so there’s no need to optimize them.
Positive example: you tie your shoelaces; they are the focus of your attention. Can this process be optimized? Can you learn to tie shoelaces faster, or learn a more reliable knot?

Humans already do something like this, but mostly consider optimizing a variable when it annoys them. I suggest widening the consideration space because the “annoyance” threshold is mostly emotional and therefore probably optimized for a world with far fewer variables and much smaller room for improvement (though I only know evolutionary psychology at a very surface level and might be wrong).
What do you mean by the 50-50 hypothesis? Is it $E_{1/2}$ such that $\forall n: P(Th_n \mid E_{1/2}) = P(Tt_n \mid E_{1/2}) = 1/2$? If that’s the case, it doesn’t seem fair to ask a student to have such a hypothesis: the task is to learn to imitate the teacher, and the teacher doesn’t actually produce behavior like HTHHT; it performs either HHHHH or TTTTT.
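To illustrate the distinction I’m pointing at (under my possibly wrong reading of the setup): a 50-50 mixture over two deterministic teachers gives every single toss a 50-50 marginal, yet it assigns zero probability to sequences like HTHHT, unlike an i.i.d. fair coin.

```python
# Two models of the teacher:
# 1) mixture: the teacher is all-heads or all-tails, each with prior 1/2 (the actual teacher)
# 2) i.i.d.: every toss is independently heads with probability 1/2 (the "50-50 hypothesis")

def p_mixture(seq: str) -> float:
    """Probability of a sequence under the mixture of deterministic teachers."""
    p = 0.0
    if all(c == "H" for c in seq):
        p += 0.5
    if all(c == "T" for c in seq):
        p += 0.5
    return p

def p_iid(seq: str) -> float:
    """Probability of a sequence under independent fair flips."""
    return 0.5 ** len(seq)

for seq in ["HHHHH", "HTHHT"]:
    print(seq, p_mixture(seq), p_iid(seq))
# HHHHH: 0.5 under the mixture vs 0.03125 under i.i.d.
# HTHHT: 0.0 under the mixture vs 0.03125 under i.i.d.
# Each single toss is 50-50 under both models, but they disagree sharply on sequences.
```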