I really enjoyed reading this story! It was a super cool mix of narrative and technical elements. While reading, I noticed similarities between the world described in this story and the world model used as part of Bengio et al.’s Scientist AI proposal. Now, the Terrarium itself isn’t a world model in Bengio’s sense, but rather a world that the world model could generate theories about. Having Scientist AI theorize about the Terrarium could yield intriguing hypotheses about the emergent agent behavior and interactions. Specifically, Scientist AI could offer a way to get some legibility back without sacrificing the efficiency that is gained when agents think in neuralese.
I agree with a lot of what you’re saying here. Homework that takes the form of guessing the teacher’s password does nothing for the growth of reasoning skills. It only reinforces the bad habit of repeating what you’ve heard because you know it’s “right,” without actually learning what that word or phrase means. However, I don’t think this framework applies to upper-level coursework (roughly grades 9-12).
Homework at this level builds well on what was learned in the classroom, and in some subjects, such as math or physics, there usually isn’t such a thing as guessing the teacher’s password. Math and physics problems, especially in high school, commonly have purely mathematical answers, and even more creative subjects like language arts tend toward open-ended tasks. An essay will normally come with some general question or topic you have to address, but beyond that you are free to think it through for yourself.
I also believe that homework is more than trying to convince someone of something. Even granting that some homework is poorly designed, the habit of applying what you learned in class to a problem all by yourself matters. Working through the problem independently deepens your knowledge of the subject and gives you the substance to actually reason with someone about it. Offloading that thinking to an LLM costs the student something real, even when the assignment itself isn’t great.
I think my prior comment came across as too opposed to LLM usage. LLMs are an incredible tool, especially for learning new material. However, I still believe there is a major difference between using LLMs as a tool and using them to “cheat” on an assignment. Using one to get help when you’re stuck or struggling to understand a concept is much like asking a teacher a question, whereas using it to generate the answer you submit is wrong in my eyes. While prompting AI is itself a valuable skill, it requires that the student have enough knowledge to evaluate whether the output is actually correct. A student who doesn’t understand how to solve a physics problem can’t meaningfully judge whether the LLM’s solution is right or where it went wrong. Homework is part of how you build that foundation.
This is a very good question. Where I see the most harm in cheating on homework, especially for a child, is that it ultimately becomes a crutch and hinders the growth of critical reasoning skills. Even as a college student, I see people who are entirely dependent on AI to complete their homework; then, when finals and midterms come around, they struggle to learn concepts by themselves. I can imagine this having even larger effects on a child, whose early years in school are the most vital for developing reasoning skills. Even if a kid doesn’t consent to the homework, or it isn’t well structured for their learning, I still think it’s important that they learn to work through those kinds of situations with their own reasoning ability, so they can learn and improve.
This is a super cool research topic. Reading about it reminded me of ryan_greenblatt’s post “Current AIs seem pretty misaligned to me”, where Greenblatt found that when faced with hard-to-verify tasks, AIs exhibit some concerning and potentially dangerous behaviors, like overselling work quality and producing confident-sounding but shallow analysis.
Your LLM auditors seem to be placed in a similar situation when they try to identify design sabotage, since it’s hard to verify whether an anomaly was active sabotage or just a design decision. So, when benchmarking auditor performance under an assumption of honest effort, the kind of misalignment Greenblatt describes on these hard judgment calls could be tangled up in the results. Point 3 in your takeaways section could be an example of the model taking the easier path: pointing at an oddity rather than doing the hard work of reasoning about whether a design choice undermines the experiment’s outcomes.
I was wondering: has your team looked at the chain-of-thought reasoning of the auditor models to see whether the failures look more like confusion, or more like the kind of shallow, overconfident reasoning that Greenblatt describes?