Co-founder and CEO of quiver.trade. Interested in mechanism design and neuroscience. Hopes to contribute to AI alignment.
Twitter: https://twitter.com/azsantosk
From Metaculus’ resolution criteria:
“This question resolves on the date an AI system competes well enough on an IMO test to earn the equivalent of a gold medal. The IMO test must be most current IMO test at the time the feat is completed (previous years do not qualify).”
I think this was defined on purpose to avoid such contamination. It also seems common sense to me that, when training a system to perform well on IMO 2026, you cannot include any data point from after the questions were made public.
At the same time, training on previous IMO/math contest questions should be fair game. All human contestants practice quite a lot on questions from previous contests, and the IMO is still very challenging for them.
Also relevant is Steven Byrnes’ excellent Against evolution as an analogy for how humans will create AGI.
It has been over two years since the publication of that post, and criticism of this analogy has continued to intensify. The OP and other MIRI members have certainly been exposed to this criticism by now, and as far as I am aware, no principled defense has been made of the continued use of this example.
I encourage @So8res and others to either stop using this analogy, or to argue explicitly for its continued usage, engaging with the arguments presented by Byrnes, Pope, and others.
Hi! I’m Kelvin, 26, and I’ve been following LessWrong since 2018. Came here after reading references to Eliezer’s AI-Box experiments from Nick Bostrom’s book.
During high school I participated in a few science olympiads, including Chemistry, Math, Biology and Informatics. I was the reserve member of the Brazilian team for the 2012 International Chemistry Olympiad.
I studied Medicine and later Molecular Science at the University of São Paulo, and dropped out in 2015 to join a high-frequency trading fund based in Brazil. I had a successful career there and rose to become one of the senior partners.
Since 2020 I’ve been co-founder and CEO of TickSpread, a crypto futures exchange based on batch auctions. We are interested in mechanism design, conditional and combinatorial markets, and futarchy.
I’m also personally very interested in machine learning, neuroscience, and AI safety discussions, and I’ve spent quite some time studying these topics on my own, despite having no professional experience in them.
I very much want to be more active in this community, participating in discussions and meeting other people who are also interested in these topics, but I’m not totally sure where to start. I would love for someone to help me get integrated here, so if you think you can do that please let me know :)
One thing that appears to be missing from the filial imprinting story is a mechanism allowing the “mommy” thought assessor to improve, or at least not degrade, over time.
The critical window is quite short, so many characteristics of mommy that may be very useful will not be perceived by the thought assessor in time. I would expect that, after it recognizes something as mommy, it remains malleable and can learn more about what properties mommy has.
For example, after it recognizes mommy based on vision, it may learn more about what sounds mommy makes and what smell mommy has. Because these sounds/smells are present when the vision-based mommy signal is present, the thought assessor should update to recognize sound/smell as indicative of mommy as well. This will help the duckling avoid mistaking some other ducks for mommy, and also help the duckling find its mommy through other non-visual cues (even if the visual cues are what triggers the imprinting to begin with).
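To make the kind of mechanism I have in mind concrete, here is a minimal toy sketch (my own illustration, with made-up feature vectors, weights, and thresholds, not anything from the post): the imprinted visual-based mommy score acts as a pseudo-label that trains weights on whatever sound/smell features co-occur with it.

```python
import numpy as np

rng = np.random.default_rng(0)

n_visual, n_other = 8, 6               # visual vs. sound/smell feature dimensions
w_visual = rng.normal(size=n_visual)   # fixed by critical-period imprinting
w_other = np.zeros(n_other)            # learned afterwards, starts uninformative

def mommy_score(visual, other):
    """Combined mommy assessment from both feature groups."""
    return visual @ w_visual + other @ w_other

def update_from_experience(visual, other, lr=0.05, threshold=2.0):
    """If the visual cues alone are confident this is mommy, use that as a
    pseudo-label so the co-occurring sound/smell cues become predictive too."""
    global w_other
    visual_confidence = visual @ w_visual
    if visual_confidence > threshold:      # "I can see it's mommy"
        w_other += lr * other              # associate current sound/smell with mommy
    elif visual_confidence < -threshold:   # "I can see it's not mommy"
        w_other -= lr * other              # dissociate these cues

# Toy usage: encounters with mommy share a characteristic sound/smell signature.
mommy_signature = rng.normal(size=n_other)
for _ in range(200):
    visual = rng.normal(size=n_visual) + 1.5 * w_visual        # mommy in view
    other = mommy_signature + 0.3 * rng.normal(size=n_other)   # her sound/smell
    update_from_experience(visual, other)

# The duckling can now recognize mommy from sound/smell alone (no visual cues).
print(mommy_score(np.zeros(n_visual), mommy_signature) > 0)    # expected: True
```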
I suspect such a mechanism will be present even after the critical period is over. For example, humans sometimes feel emotionally attached to objects that remind them of loved ones or have become associated with them. The attachment may be really strong (e.g. when the loved one is dead and only the object is left).
Also, your loved ones change over time, but you keep loving them! In “parental” imprinting for example, the initial imprinting is on the baby-like figure, generating a “my kid” thought assessor associated with the baby-like cues, but these need to change over time as the baby grows. So the “my kid” thought assessor has to continuously learn new properties.
Even more importantly, the learning subsystem is constantly changing, maybe even more than the external cues. If the learned representations change over time as the agent learns, the thought assessors have to keep up and do the same, otherwise their accuracy will slowly degrade over time.
This last part seems quite important for a rapidly learning/improving AGI, as we want the prosocial assessors to be robust to ontological drift. So we both want the AGI to do the initial “symbol-grounding” of desirable proto-traits close to kindness/submissiveness, and also for its steering subsystem to learn more about these concepts over time, so that they “converge” to favoring sensible concepts in an ontologically advanced world-model.
I agree that current “language agents” have some interesting safety properties. However, for them to become powerful, one of two things is likely to happen:
A. The language model that underlies the agent will itself be trained/finetuned on reinforcement learning tasks to improve performance. This will make the system much more like AlphaGo, capable of generating “dangerous” and unexpected “Move 37”-like actions. Further, this is a pressure towards making the system non-interpretable (either by steering it outside “inefficient” human language, or by encoding information steganographically).
B. The base models, being larger/more powerful than the ones being used today, and more self-aware, will be doing most of the “dangerous” optimization inside the black box. They will derive from the prompts, and from their long-term memory (which they will likely be given), what kind of dumb outer loop is running on the outside. If such a model has internal misaligned desires, it will manipulate the outer loop according to them, potentially generating the expected visible outputs for deception.
I will not deny the possibility of further alignment progress on language agents yielding safe agents, nor of “weak AGIs” being possible and safe with the current paradigm, and replacing humans at many “repetitive” occupations. But I expect agents derived from the “language agent” paradigm to be misaligned by default if they are strong enough optimizers to contribute meaningfully to scientific research, and other similar endeavors.
I see about 100 books in there. I have met several IMO gold-medal winners and I expect most of them to have read dozens of these books, or the equivalent in other forms. I know one who has read tens of olympiad-level books in geometry alone!
And yes, you’re right that they would often pick one or two problems as similar to what they had seen in the past, but I suspect these problems still require a lot of reasoning even after the analogy has been established. I may be wrong, though.
We can probably inform this debate by getting the latest IMO and creating a contest for people to find which existing problems are the most similar to those in the exam. :)
I think it is an interesting idea, and it may be worthwhile even if Dagon is right and it results in regulatory capture.
The reason is, regulatory capture is likely to benefit a few select companies to promote an oligopoly. That sounds bad, and it usually is, but in this case it also reduces the AI race dynamic. If there are only a few serious competitors for AGI, it is easier for them to coordinate. It is also easier for us to influence them towards best safety practices.
I agree my conception is unusual, and I am ready to abandon it in favor of some better definition. At the same time, I feel like a utility function having way too many components makes it useless as a concept.
Because here I’m trying to derive the utility from the actions, I feel like we can understand the being better the less information is required to encode its utility function, in a Kolmogorov complexity sense, and that if it’s too complex then there is no good explanation for the actions and we conclude the agent is acting somewhat randomly.
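To illustrate what I mean (this is just my own informal formalization, where $\pi_U$ is some fixed rationality model, e.g. softmax-rational choice, mapping a candidate utility $U$ to a distribution over the observed action sequence $a_{1:T}$):

$$ U^{*} \;=\; \arg\min_{U}\Big[\, K(U) \;+\; \big(-\log_2 \pi_U(a_{1:T})\big) \,\Big] $$

The best explanation minimizes the bits needed to describe the utility plus the bits needed to describe the actions given that utility. If the total is barely shorter than describing the actions directly, the “utility” explanation has bought us nothing, and under this lens the agent looks like it is acting somewhat randomly.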
Maybe trying to derive the utility as a ‘compression’ of the actions is where the problem is, and I should distinguish more what the agent does from what the agent wants. An agent is then going to be irrational only if the wants are inconsistent with each other; if the actions are inconsistent with what it wants then it is merely incompetent, which is something else.
While I am sure that you have the best intentions, I believe the framing of the conversation was very ill-conceived, in a way that makes it harmful, even if one agrees with the arguments contained in the post.
For example, here is the very first negative consequence you mentioned:
I think one can argue that, if this argument is correct, the post itself will exacerbate the problem by bringing greater awareness to these “intentions” in a very negative light.
The intention keyword pattern-matches with “bad/evil intentions”. Those worried about existential risk are good people, and their intentions (preventing x-risk) are good. So we should refer to ourselves accordingly and talk about misguided plans instead of anything resembling bad intentions.
People discussing pivotal acts, including those arguing that they should not be pursued, are using this expression sparingly. Moreover, they seem to be using it on purpose to avoid more forceful terms. Your use of scare quotes and your direct association of this expression with bad/evil actions cast a significant part of the community in a bad light.
It is important for this community to be able to have some difficult discussions without attracting backlash from outsiders, and having specific neutral/untainted terminology serves precisely for that purpose.
As others have mentioned, your preferred ‘Idea A’ has many complications, and you have not convincingly addressed them. As a result, good members of our community may well find ‘Idea B’ to be worth exploring despite the problems you mention. Even if you don’t think their efforts are helpful, you should be careful to portray them in a good light.