This is very interesting, thank you for posting this.
the therapeutic idea of systematically replacing the concept “should” with less normative framings
Interesting. I independently came up with this concept, downstream of thinking about moral cognition and parts work. Could you point me to any past literature that talks about this coherently enough that you would point people to it to understand this concept?
I know that Nate has written about this:
As far as I recall, reading these posts didn’t help me.
Based on gwern’s comment, steganography as a capability can arise (at rather rudimentary levels) via RLHF over multi-step problems (which is effectively most cognitive work, really), and this gets exacerbated with the proliferation of AI generated text that embeds its steganographic capabilities within it.
The following paragraph by gwern (from the same thread linked in the previous paragraph) basically summarizes my current thoughts on the feasibility of prevention of steganography for CoT supervision:
Inner-monologue approaches to safety, in the new skin of ‘process supervision’, are popular now so it might be good for me to pull out one point and expand on it: ‘process supervision’ does not necessarily prevent steganography, nor does it ensure safety, because you are still pitting three goals against each other—achieving superhuman capability on a task, achieving causal fidelity of transcripts, achieving human-readability of the transcripts. Choose two: you can have transcripts which record a capable thought process accurately but which you cannot understand in full detail; which are not capable but their transcripts accurately convey the fallible flawed concepts and reasoning used; or which are capable and you understand, but are not what it actually thought (because they are misleading, wrong, or shallow ‘lies to children’ sorts of explanations).
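As a toy illustration (my own, not from gwern's comment) of why human-readability alone doesn't rule out a hidden channel: a fixed convention over interchangeable synonyms is enough to smuggle bits through text that reads innocuously.

```python
# Toy steganography: hide bits in synonym choice. Each slot has two
# interchangeable words; picking the first encodes 0, the second encodes 1.
# The word pairs here are arbitrary, made-up examples.
SYNONYMS = [("big", "large"), ("fast", "quick"), ("smart", "clever")]

def encode(bits):
    """Produce an innocuous-looking phrase that secretly carries `bits`."""
    return " ".join(pair[b] for pair, b in zip(SYNONYMS, bits))

def decode(phrase):
    """Recover the hidden bits from the word choices."""
    words = phrase.split()
    return [pair.index(w) for pair, w in zip(SYNONYMS, words)]

msg = encode([1, 0, 1])
print(msg)          # "large fast clever"
print(decode(msg))  # [1, 0, 1]
```

A human supervisor reading either phrase sees equally sensible text; only an agent sharing the convention recovers the payload, which is the sense in which readability and causal fidelity can come apart.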
Well, if you know relevant theoretical CS and useful math, you don’t have to rebuild the mathematical scaffolding all by yourself.
I didn’t intend to imply in my message that you have mathematical scaffolding that you are recreating, although I expect it may be likely (Pearlian causality perhaps? I’ve been looking into it recently, and clearly knowing Bayes nets is very helpful). I specifically used “you” to imply that in general this is the case. I haven’t looked very deeply into the stuff you are doing, unfortunately—it is on my to-do list.
I do think that systematic self-delusion seems useful in multi-agent environments (see the commitment races problem for an abstract argument, and Sarah Constantin’s essay “Is Stupidity Strength?” for a more concrete argument).
I’m not certain that this is the optimal strategy we have for dealing with such environments, and note that systematic self-delusion also leaves you (and the other people using a similar strategy to coordinate) vulnerable to risks that do not take into account your self-delusion. This mainly includes existential risks such as misaligned superintelligences, but also extinction-level asteroids.
It’s a pretty complicated picture and I don’t really have clean models of these things, but I do think that for most contexts I interact in, the long-term upside of having better models of reality is significantly higher than the benefit of systematic self-delusion.
According to Eliezer Yudkowsky, your thoughts should reflect reality.
I expect that the more your beliefs track reality, the better you’ll get at decision making, yes.
According to Paul Graham, the most successful people are slightly overconfident.
Ah, but VCs benefit from the ergodicity of the startup founders! From the perspective of the founder, it’s a non-ergodic situation. It’s better to make Kelly bets instead if you prefer not to fall into gambler’s ruin, given whatever definition of the real-world situation maps onto the abstract concept of being ‘ruined’ here.
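To make the ergodicity point concrete, here is a toy simulation (my own illustration; the win probability, payoff, and bet fractions are made-up parameters) comparing a Kelly-sized bet with a near-all-in bet over a sequence of even-money wagers.

```python
import random

def simulate(bet_fraction, p_win=0.6, payoff=1.0, rounds=200, seed=0):
    """Simulate repeatedly betting a fixed fraction of current wealth.

    Each round, with probability p_win the stake returns `payoff` times
    itself as profit; otherwise the stake is lost entirely.
    """
    rng = random.Random(seed)
    wealth = 1.0
    for _ in range(rounds):
        stake = wealth * bet_fraction
        if rng.random() < p_win:
            wealth += stake * payoff
        else:
            wealth -= stake
    return wealth

# Kelly fraction for even-money odds: f* = p - q = 0.6 - 0.4 = 0.2
kelly = simulate(0.2)
all_in = simulate(0.99)  # near-total stake: one loss nearly wipes you out
print(kelly, all_in)
```

Under these assumptions the aggressive bettor is almost surely ruined despite every individual bet having positive expected value: that is the non-ergodicity the founder faces. The VC, holding many founders' bets in parallel, collects the ensemble average instead.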
It usually pays to have a better causal model of reality than relying on what X person says to inform your actions.
Can you think of anyone who has changed history who wasn’t a little overconfident?
It is advantageous to be friends with the kind of people who do things and never give up.
I think I do things and never give up in general, while I can be pessimistic about specific things and tasks I could do. You can be generally extremely confident in yourself and your ability to influence reality, while also being specifically pessimistic about a wide range of existing possible things you could be doing.
I wrote a bit about it in this comment.
I think that conceptual alignment research of the sort that Johannes is doing (and that I am also doing, which I call “deconfusion”) is just really difficult. It involves skills that are not taught to people, that you seem very unlikely to learn by being mentored in traditional academia (including when doing theoretical CS or non-applied math PhDs), and that I only started wrapping my head around after some mentorship from two MIRI researchers (which I believe I was pretty lucky to get); even then, I’ve spent a ridiculous amount of time by myself trying to tease out patterns and figure out a more systematic process for doing this.
Oh, and the more theoretical CS (and related math such as mathematical logic) you know, the better you probably are at this—see how Johannes tries to create concrete models of the inchoate concepts in his head? Well, if you know relevant theoretical CS and useful math, you don’t have to rebuild the mathematical scaffolding all by yourself.
I don’t have a good enough model of John Wentworth’s model for alignment research to understand the differences, but I don’t think I learned all that much from John’s writings and his training sessions that were a part of his MATS 4.0 training regimen, as compared to the stuff I described above.
Note that when I said I disagree with your decisions, I specifically meant the sort of myopia in the glass shard story—and specifically because I believe that if your research process / cognition algorithm is fragile enough that you’d be willing to take physical damage to hold onto an inchoate thought, maybe consider making your cognition algorithm more robust.
Quoted from the linked comment:
Rather, I’m confident that executing my research process will over time lead to something good.
Yeah, this is a sentiment I agree with and believe. I think that it makes sense to have a cognitive process that self-corrects and systematically moves towards solving whatever problem it is faced with. In terms of computability theory, one could imagine it as an effectively computable function that you expect will return you the answer—and the only ‘obstacle’ is time / compute invested.
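A minimal sketch of that computability framing (a toy of my own, modeled on the unbounded-search / minimization operator from computability theory): the procedure is guaranteed to return the answer whenever one exists, and time spent is the only cost.

```python
from itertools import count

def unbounded_search(predicate):
    """Minimization operator: return the least n with predicate(n) True.

    Halts whenever such an n exists; the only 'obstacle' is how much
    compute is spent before reaching it.
    """
    for n in count():
        if predicate(n):
            return n

# Example: the smallest n whose square exceeds 2024.
print(unbounded_search(lambda n: n * n > 2024))  # 45
```

The analogy to a self-correcting research process is loose, of course: the point is only that confidence can attach to the procedure's convergence rather than to any particular candidate answer along the way.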
I think being confident, i.e. not feeling hopeless in doing anything, is important. The important takeaway here is that you don’t need to be confident in any particular idea that you come up with. Instead, you can be confident in the broader picture of what you are doing, i.e. your processes.
I share your sentiment, although the causal model for it is different in my head. A generalized feeling of hopelessness is an indicator of mistaken assumptions and causal models in my head, and I use that as a cue to investigate why I feel that way. This usually results in me having hopelessness about specific paths, and a general purposefulness (for I have an idea of what I want to do next), and this is downstream of updates to my causal model that attempts to track reality as best as possible.
I don’t know whether OpenAI uses nondisparagement agreements; I haven’t signed one.
This can also be glomarizing. “I haven’t signed one.” is a fact, intended for the reader to use it as anecdotal evidence. “I don’t know whether OpenAI uses nondisparagement agreements” can mean that he doesn’t know for sure, and will not try to find out.
Obviously, the context of the conversation and the events surrounding Holden stating this matters for interpreting this statement, but I’m not interested in looking further into this, so I’m just going to highlight the glomarization possibility.
I think what quila is pointing at is their belief in the supposed fragility of thoughts at the edge of research questions. From that perspective I think their rebuttal is understandable, and your response completely misses the point: you can be someone who spends only four hours a day working and the rest of the time relaxing, but also care a lot about not losing the subtle and supposedly fragile threads of your thought when working.
Note: I have a different model of research thought, one that involves a systematic process towards insight, and because of that I also disagree with Johannes’ decisions.
But the discussion of “repercussions” before there’s been an investigation goes into pure-scapegoating territory if you ask me.
Just to be clear, OP themselves seem to think that what they are saying will have little effect on the status quo. They literally called it “Very Spicy Take”. Their intention was to allow them to express how they felt about the situation. I’m not sure why you find this threatening, because again, the people they think ideally wouldn’t continue to have influence over AI safety related decisions are incredibly influential and will very likely continue to have the influence they currently possess. Almost everyone else in this thread implicitly models this fact as they are discussing things related to the OP comment.
No scapegoating is going to occur. I imagine that everything I say is something I would say in person to the people involved, or to third parties, without expecting any sort of coordinated action to reduce their influence—they are that irreplaceable to the community and to the ecosystem.
“Keep people away” sounds like moral talk to me.
Can you not be close friends with someone while also expecting them to be bad at self-control when it comes to alcohol? Or perhaps they are great at technical things like research but pretty bad at negotiation, especially in adversarial situations such as dealing with experienced VCs?
If you think someone’s decision-making is actively bad, i.e. you’d be better off reversing any advice from them, then maybe you should keep them around so you can do that!
People’s decision-making is not anti-optimized such that you can consistently reverse their opinions to get something that accurately tracks reality. If that were the case, they would already be implicitly tracking reality very well. Reversed stupidity is not intelligence.
But more realistically, someone who’s fucked up in a big way will probably have learned from that, and functional cultures don’t throw away hard-won knowledge.
Again, you don’t seem to be tracking the context of our discussion here. This advice is usually given to junior people embedded in an institution, because the ability to blame someone and / or hold them responsible is a power that senior / executive people hold. The attitude you describe makes a lot of sense when it comes to people who are learning things, yes. I don’t know if you can plainly carry it over into this domain, and you even acknowledge this in the next few lines.
Imagine a world where AI is just an inherently treacherous domain, and we throw out the leadership whenever they make a mistake.
I think it is incredibly unlikely that the rationalist community has an ability to ‘throw out’ the ‘leadership’ involved here. I find this notion incredibly silly, given the amount of influence OpenPhil has over the alignment community, especially through their funding (including the pipeline, such as MATS).
I downvoted this comment because it felt uncomfortably scapegoat-y to me.
If you start with the assumption that there was a moral failing on the part of the grantmakers, and you are wrong, there’s a good chance you’ll never learn that.
I think you are misinterpreting the grandparent comment. I do not read any mention of a ‘moral failing’ in that comment. You seem worried because of the commenter’s clear description of what they think would be a sensible step for us to take given what they believe are egregious flaws in the decision-making processes of the people involved. I don’t think there’s anything wrong with such claims.
Again: You can care about people while also seeing their flaws and noticing how they are hurting you and others you care about. You can be empathetic to people having flawed decision making and care about them, while also wanting to keep them away from certain decision-making positions.
If you think the OpenAI grant was a big mistake, it’s important to have a detailed investigation of what went wrong, and that sort of detailed investigation is most likely to succeed if you have cooperation from people who are involved.
Oh, interesting. Who exactly do you think influential people like Holden Karnofsky and Paul Christiano are accountable to? This “detailed investigation” you speak of, and this notion of a “blameless culture”, make a lot of sense when you are the head of an organization conducting an investigation into systematic mistakes made by people who work for you and whom you are responsible for. I don’t think this situation is similar enough that you can apply these intuitions blindly without thinking through the actual causal factors involved.
Note that I don’t necessarily endorse the grandparent comment’s claims. This is a complex situation, and I’d want to spend more time analyzing it and what occurred.
“ETA” is commonly short for “estimated time of arrival”. I understand you are using it to mean “edited”, but I don’t quite know what it is short for here, and using it seems confusing for people in general.
Wasn’t edited, based on my memory.
You continue to model OpenAI as a black-box monolith instead of trying to unravel the dynamics inside it and understand the incentive structures that lead these things to occur. It’s a common pattern I notice in the way you interface with certain parts of reality.
I don’t consider OpenAI as responsible for this as much as Paul Christiano and Jan Leike and his team. Back in 2016 or 2017, when they initiated and led research into RLHF, they focused on LLMs because they expected LLMs would be significantly more amenable to RLHF. That focus made it almost inevitable that they’d try instruction-tuning and incrementally build up models that deliver mundane utility. It was extremely predictable that Sam Altman and OpenAI would leverage this unexpected success to gain more investment and translate that into more researchers and compute. But Sam Altman and Greg Brockman aren’t researchers, and they didn’t figure out a path that minimized ‘capabilities overhang’—Paul Christiano did. And more importantly, this is not mutually exclusive with OpenAI using the additional resources for both capabilities research and (what they call) alignment research. While you might consider everything they do effectively capabilities research, the point I am making is that this is still consistent with the hypothesis that, while misguided, they are roughly doing the best they can given their incentives.
What really changed my perspective here was the fact that Sam Altman seems to have been systematically destroying extremely valuable information about how we could evaluate OpenAI. Specifically, the non-disparagement clause, which ex-employees cannot even mention without falling afoul of the contract, is something I didn’t expect (I did expect non-disclosure clauses, but not something this extreme). This means my model of OpenAI was systematically too optimistic about how cooperative and trustworthy they are and will be in the future. In addition, if I was systematically deceived about OpenAI due to non-disparagement clauses that cannot even be mentioned, I would expect something similar to be possible with other frontier labs (especially Anthropic, but also DeepMind). In essence, I no longer believe that Sam Altman (for OpenAI is nothing but his tool now) is doing the best he can to benefit humanity given his incentives and constraints. I expect that Sam Altman is doing whatever he believes will retain and increase his influence and power, and this includes the use of AGI, if and when his teams finally achieve that level of capabilities.
This is the update I expect people are making. It is about being systematically deceived at multiple levels. It is not about “OpenAI being irresponsible”.
I still parse that move as devastating the commons in order to make a quick buck.
I believe that ChatGPT was not released with the expectation that it would become as popular as it did. OpenAI pivoted hard when it saw the results.
Also, I think you are misinterpreting the sort of ‘updates’ people are making here.
I mean, if Paul doesn’t confirm that he is not under any non-disparagement obligations to OpenAI like Cullen O’Keefe did, we have our answer.
In fact, given this asymmetry of information situation, it makes sense to assume that Paul is under such an obligation until he claims otherwise.
I’ve experimented with Claude Opus for simple Ada autoformalization test cases (specifically quicksort), and it seems like the sort of issues that make LLM agents infeasible (hallucination-based drift, subtle drift caused by sticking to certain implicit assumptions you made before) are also the issues that make Opus hard to use for autoformalization attempts.
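For context on the size of the test case: quicksort’s informal specification fits in a few lines. Here is a sketch in Python (rather than Ada, purely for illustration) of the kind of artifact whose formal, proof-carrying counterpart an autoformalizer would need to produce.

```python
def quicksort(xs):
    """Functional quicksort: the informal artifact an autoformalizer
    would translate into, e.g., Lean or SPARK-annotated Ada, together
    with a proof that the output is a sorted permutation of the input."""
    if len(xs) <= 1:
        return xs
    pivot, rest = xs[0], xs[1:]
    return (quicksort([x for x in rest if x < pivot])
            + [pivot]
            + quicksort([x for x in rest if x >= pivot]))

print(quicksort([3, 1, 4, 1, 5, 9, 2, 6]))  # [1, 1, 2, 3, 4, 5, 6, 9]
```

The informal version is trivial; the difficulty lies entirely in producing the formal specification and discharging the sortedness and permutation obligations without the drift described above.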
I haven’t experimented with a scaffolded LLM agent for autoformalization, but I expect it won’t go very well either, primarily because scaffolding attempts to turn human-like implicit high-level cognitive strategies into explicit algorithms or heuristics, such as tree-of-thought prompting, and I expect that this doesn’t scale given the complexity of the domain (sufficiently general autoformalizing AI systems can be modelled as effectively consequentialist, which makes them dangerous). I don’t expect a scaffolded LLM agent (over Opus) to succeed at autoformalizing quicksort right now either, mostly because I believe RLHF tuning has systematically optimized Opus to write the bottom line first and then attempt to build or hallucinate a viable answer, and then post-hoc justify it. (While steganographic, non-visible chain-of-thought may have gone into figuring out the bottom line, this is still worse than doing visible chain-of-thought first, which gives the model more token-compute-iterations to compute its answer.)
If anyone reading this is able to build a scaffolded agent that autoformalizes (using Lean or Ada) algorithms of complexity equivalent to quicksort reliably (such that more than 5 out of 10 of its attempts succeed) within the next month of me writing this comment, then I’d like to pay you 1000 EUR to see your code and for an hour of your time to talk with you about this. That’s a little less than twice my current usual monthly expenses, for context.