learn math or hardware
mesaoptimizer
“How could I have thought that faster?”
[Repost] The Copenhagen Interpretation of Ethics
Dual Wielding Kindle Scribes
EPUBs of MIRI Blog Archives and selected LW Sequences
An EPUB of Arbital’s AI Alignment section
[outdated] My current theory of change to mitigate existential risk by misaligned ASI
Here’s some (hopefully useful) context on why I (SERI MATS 4.0, independent alignment researcher) feel a sense of helplessness at the idea of applying: I don’t expect to actually make a difference by working as part of your team, because I don’t expect my model of the alignment problem [which is essentially that of MIRI and John Wentworth] to be shared by you or the OpenPhil leadership.
From your updated timelines post:
I don’t expect a discontinuous jump in AI systems’ generality or depth of thought from stumbling upon a deep core of intelligence; I’m not totally sure I understand it but I probably don’t expect a sharp left turn.
This is probably our biggest crux. To me it seems pretty clear that “capabilities generalize farther than alignment”. The existence of a “deep core of intelligence” also seems very obvious to me, although behaviorally I’m currently uncertain as to whether we could see a discontinuous jump in AI systems’ generality.
Overall, I sense that what is being selected for may be close to “be as epistemically accurate in decisions and communications as possible given our constraints of moving fast”, and that makes sense; but I expect this also selects for people who are less comfortable with the sort of non-verbal epistemic reasoning heuristics that seem crucial for not sliding one’s attention away from noticing that one is confused, and by extension away from the “hard parts of the problem”. I think the former sort of accuracy is very useful when dealing with problems in domains where we have a clear idea of the problem (rocket engineering), but probably net negative when dealing with a domain we are still confused about.
Sure. I’ve made two attempts to point at what I mean: one Yudkowsky-like, and the other Nate-like. I’m hoping that the combination should at least make someone get what I’m pointing at.
Attempt 1
There is a ‘ground truth’ for capabilities, and that is our universe. Our universe is coherent and Lawful -- 2+2=4, and <the fundamental physics laws governing our universe> hold at every moment, everywhere. Every piece of data given to an optimizer tells the optimizer about these things. You can learn arithmetic from a thousand different examples of data drawn from the real world, none of which need to be explicitly about arithmetic. You can detect the shape of the physics laws constraining our universe in a myriad of ways, none of which make it obvious to you or me how these laws can be inferred from the data. Combine that with highly focused optimization pressure, and what you get is a system that is incredibly capable. There are infinitely many paths to the truth of reality, and that is reflected in the data we provide an optimizer. This is not the case for our values. Human values are very complex and arbitrary—a result of the specific brain architecture we seem to have evolved.

Every data point an optimizer is provided from the real world tells it the same thing about ‘capabilities’: 2+2=4, for example. Even inaccurate data points are causally upstream of a coherent universe, and therefore provide the optimizer information about the causes of those inaccurate data points. If an optimizer uses ‘proxy correlates’ of reality, it shall soon start to converge on the actual structure of reality. In contrast, it does not seem to be the case that we know how to get an optimizer to converge on understanding the actual goal (even if it is something as simple as “maximize the amount of diamonds in the universe”). All we seem to know how to do is train proxy correlates into a model. These proxy correlates do not generalize out of distribution, and once an optimizer ‘groks’ reality, it shall see the ways it can achieve the outcomes it is meant to achieve using paths other than the ones it was shaped to follow by the stupider optimizers that built it.
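As a toy illustration of the “proxy correlates do not generalize out of distribution” point (a minimal sketch of my own, with invented names and numbers, not something from the argument above): fit a model on a proxy feature that tracks the true target in the training distribution, then watch the fit fall apart once the proxy decouples from the target.

```python
import numpy as np

rng = np.random.default_rng(0)

# Training distribution: the proxy feature z is a near-perfect correlate of the true target y = x.
x_train = rng.uniform(0, 1, 1000)
z_train = x_train + rng.normal(0, 0.01, 1000)
y_train = x_train

# Fit a simple model on the proxy alone.
coeffs = np.polyfit(z_train, y_train, deg=1)

# Out of distribution: the proxy decouples from the target entirely.
x_test = rng.uniform(0, 1, 1000)
z_test = rng.uniform(0, 1, 1000)
y_test = x_test

mse_in = np.mean((np.polyval(coeffs, z_train) - y_train) ** 2)
mse_out = np.mean((np.polyval(coeffs, z_test) - y_test) ** 2)
print(f"in-distribution MSE: {mse_in:.4f}, out-of-distribution MSE: {mse_out:.4f}")
```

The in-distribution fit looks excellent, while the out-of-distribution predictions are essentially uncorrelated with the target.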
Attempt 2
Until now, all SOTA AI systems we can see are limited-domain consequentialists (or approximations thereof). None of them are truly general in the sense of being able to chain actions across multiple wildly differing domains (social, programming, cognitive-heuristics improvement, maintenance and upgrade of the infrastructure the AI system is running on—to give a few examples) to achieve whatever outcomes they could be perceived as aiming towards[1]. GPT-4 is a predictor that can be prompted to simulate a consequentialist (such as a human being), but GPT-4 is not capable enough to simulate the cross-domain capabilities of such an agent, at least as far as I know.
When your ‘alignment’ techniques involve training an AI system to behave in ways you like while the AI system is restricted to these isolated domains, all you are doing is teaching your system decision-making influences that are proxies of the actual values you wish the system would have. These decision-making influences will not hold across all the domains an AI might chain its actions across—and this will especially be true of the domains that enable an AI system to chain actions across multiple widely differing domains, such as abstract reasoning. Since the specific ontology an AI system uses itself changes what inputs and outputs its abstract reasoning algorithms have, you cannot use external behavioral outputs as evidence with which to shape an AI’s reasoning[2].
1. Note: this does not necessarily mean that we shall see systems with cleanly describable internals that seem to contain a concrete ‘outcome’ that the AI is ‘intentionally’ trying to achieve! I’m describing what we can infer based on the observed behavior of such an AI system—it seems far more likely that such systems will not have clean ‘outcomes’ in mind that they are deliberately aiming towards, even if one can easily imagine evidence of easily detectable convergent instrumental goals (which do not provide us much evidence for whether or not a model is aligned).
2. Which is why, it seems, a lot of people working on AGI alignment are converging on ontology identification as the goal of their research agendas.
As of writing, I have spent about four months experimenting with the Tune Your Cognitive Strategies (TYCS) method and I haven’t gotten any visible direct benefits out of it.
Some of the indirect benefits I’ve gotten:
I discovered my capacity for introspection and used it to get more insight into what is going on in my mind
I found out about the cluster of integration / parts-work based therapy techniques (such as Internal Family Systems), have fixed some issues in the way I do things (e.g., procrastinating on cleaning up my desk), and have also unraveled some deep issues I noticed (due to better introspective ability)
The biggest thing I’ve learned is that better introspective ability and awareness seems to be the most load-bearing skill underlying TYCS. I’m less enthusiastic about the notion that you can ‘notice your cognitive deltas’ in real-time almost all the time—this seems quite costly.
Note that Eliezer has also described doing something similar. And more interestingly, it seems like Eliezer prefers to invest in what I would call ‘incremental optimization of thought’ over ‘fundamental debugging’:
EY: Your annual reminder that you don’t need to resolve your issues, you don’t need to deal with your emotional baggage, you don’t need to process your trauma, you don’t need to confront your past, you don’t need to figure yourself out, you can just go ahead and do the thing.
On one hand, you could try to use TYCS or Eliezer’s method to reduce the cognitive work required to think about something. On the other hand, you could try to use integration-based methods to solve what I would consider ‘fundamental issues’ or deeper issues. The latter feels like focusing on the cognitive equivalent of crucial considerations; the former feels like incremental improvements.
And well, Eliezer has seemed to be depressed for quite a while now, and Maia Pasek killed herself. Both of these observations seem like evidence for my hypothesis that investing in incremental optimization of the sort involved in TYCS and Eliezer’s method is less valuable than the fundamental debugging involved in integration / parts-work mental techniques, given scarce cognitive resources.
For the near future, I plan to experiment with and use parts-work mental techniques, and will pause my experimentation and exploration of TYCS and TYCS-like techniques. I expect that there may be a point at which one has a sufficiently integrated mind such that they can switch to mainly investing in TYCS-like techniques, which means I’ll resume looking into these techniques in the future.
Before reading this post, I would usually refrain from posting/commenting on LW posts, partially because of the high threshold of quality for contribution (which is where I agree with you in a certain sense), and partially because it seemed more polite to ignore posts I found flaws in, or disagreed with strongly, than to engage (which costs both effort and potential reputation). Now, I believe I shall try to be more Socratic—more willing to point out, as politely as I can, confusions and potential issues in posts and comments I have read and found wanting, if it seems useful to readers.
I find Said’s critiquing comments (here are three good examples) extremely valuable, because they serve as a “red team” and a pruning function for the claims the post author puts forth and the reasoning behind them. What you seem to consider drive-by criticism (which is what I believe you think Said does), and the non-trivial cost it imposes on you, is a cost I claim you should take upon yourself, because your writing isn’t high-quality enough and not “pruned” enough given the length of your posts.
That is the biggest issue I have with your writings (and those of Zack too, because he makes the same mistake): you write too much to communicate too few bits of usefulness. This is how I feel about all your 2023 posts that I have read (or skimmed, rather) -- each points to something useful, or interesting, but it is absolutely not worth the time investment of reading such huge and long-winded essays. I don’t even think you need to write them that long to provide the context you believe necessary to convey your points.
The good thing is that comments like those of johnswentworth, Charlie Steiner, and FeepingCreature give people like me the context they need to judge how useful your post is, without having to read the post itself. Notice that all these comments are short and succinct while also being very relevant to the post, without nitpicking. This is the sort of writing I respect on LW. It is not coincidental that two of these three people are full-time alignment researchers.
Right now, I can simply ignore your posts until they get sufficient traction (which is very easy given how popular you are) that the comments give me an idea of the core of your post and of the most serious weaknesses of your argument, and with that I have gotten the value I need (skimming your post along the way, wherever relevant). However, your desire to censor Said’s comments gets in the way of this natural filter. Said’s comments provide incredible value to both you and your audience, even if you do not interact with them! To you, they provide valuable evidence you can weigh up or down depending on how valuable you find Said’s comments in general; to the LW audience, they provide a way of knowing the critical weaknesses of your argument without having to read your post, parse it, and try to figure out the weaknesses themselves.
You claim emotional damage to yourself and to other people due to drive-by critiquing comments, and that this leads to an evaporative cooling effect where people post less and less. This seems like a problem, but given the choice between one person spending half an hour pruning their writing to improve its quality, and a hundred readers spending between ten minutes and an hour processing and individually critiquing the writing to calibrate and update their world models, I would want the writer to eat the cost. That is what I personally choose, after all. And by extension, I have realized that me critiquing other people’s contributions is also incredibly valuable, and I will start to do that more. And anyway, this probably isn’t a trade-off, and there may be solutions that do not impose a cost on either party.
As far as I know, Said seems to believe that moderation of comments should not be left to post authors because this creates a conflict of interest. The consequence for me is simple: I get less value out of your posts and, by extension, LW. Raemon’s decision to create an archipelago-like ecosystem makes sense to me given his goals and assumptions laid out in the post, but you seem to want more aggressive action against people whose criticisms you dislike.
“You don’t much care if This Rando doesn’t get it”, and I would be fine with that if you weren’t taking actions that have clear externalities for readers like me by making Said’s critiquing comments and comments of a similar nature by other people less welcome on LessWrong—both as a social norm and at the moderator level.
It would be lovely if you could also support some form of formatted export, so that people can use this tool knowing that they can export the data and switch to another tool (if this one gets Googled) at any time.
But yes, I am really excited for a super-fast, easy-to-use, and good-looking PredictionBook successor. Manifold Markets was just intimidating for me, and the only reason I got into it was social motivation. This tool serves a more personal niche for prediction logging, I think, and that is good.
This is a pretty good essay, and I’m glad you wrote it. I’ve been thinking similar thoughts recently, and have been attempting to put them into words. I have found myself somewhat more optimistic and uncertain about my models of alignment due to these realizations.
Anyway, on to my disagreements.
It’s hard when you’ve[2] read Dreams of AI Design and utterly failed to avoid same mistakes yourself.
I don’t think that “Dreams of AI Design” was an adequate essay to get people to understand this. These distinctions are subtle, and as you might tell, not an epistemological skill that comes native to us. “Dreams of AI Design” is about confusing the symbol with the substance -- '5 with 5, in Lisp terms, or the variable name "five" with the value 5 (in more general programming-language terms). It is about ensuring that all the symbols you use to think with actually map onto some substance. It is not about the more subtle art of noticing that you are incorrectly equivocating between a pre-theoretic concept such as “optimization pressure” and the actual process of gradient updates. I suspect that Eliezer may have made at least one such mistake that may have made him significantly more pessimistic about our chances of survival. I know I’ve made this mistake dozens of times. I mean, my username is “mesaoptimizer”, and I don’t endorse that term or concept anymore as a way of thinking about the relevant parts of the alignment problem.
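To put the symbol-versus-substance distinction in code (a minimal sketch of my own in Python rather than Lisp; the variable names are mine): the name is just a label, and treating the label as if it were the thing it refers to fails immediately.

```python
five = 5               # the value 5: the substance
name = "five"          # the string "five": a symbol that merely refers to the substance

# Dereferencing the symbol recovers the substance.
assert globals()[name] == 5

# Operating on the symbol as if it were the substance is a category error.
try:
    name + 1           # "five" + 1: confusing the label with the thing it labels
except TypeError as err:
    print(f"category error: {err}")
```

The analogous mistake at the level of concepts is reasoning with the label “optimization pressure” as if it were the concrete process of gradient updates.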
It’s hard when your friends are using the terms, and you don’t want to be a blowhard about it and derail the conversation by explaining your new term.
I’ve started to learn to be less neurotic about ensuring that people’s vaguely defined terms actually map onto something concrete, mainly because I have started to value the fact that these vaguely defined terms, if not incorrectly equivocated, hold valuable information that we might otherwise lose. Perhaps you might find this helpful.
When I try to point out such (perceived) mistakes, I feel a lot of pushback, and somehow it feels combative.
I empathize with those pushing back, because to a certain extent what you are stating seems obvious to someone who has learned to translate these terms into more concrete, locally relevant formulations ad hoc, and given such an assumption, it looks like you are making a fuss about something that doesn’t really matter, and even reaching for examples to prove your point. On the other hand, I expect that ad-hoc adjustment to such terms is insufficient for actually doing productive alignment research—I believe that the epistemological skill you are trying to point at is extremely important for people working in this domain.
I’m uncertain about how confused senior alignment researchers are when it comes to these words and concepts. It is likely that some may have cached some mistaken equivocations and are therefore too pessimistic and fail to see certain alignment approaches panning out, or too optimistic and think that we have a non-trivial probability of getting our hands on a science accelerator. And deference causes a cascade of everyone (by inference or by explicit communication) also adopting these incorrect equivocations.
I think a better way of rephrasing it is “clever schemes have too many moving parts and make too many assumptions and each assumption we make is a potential weakness an intelligent adversary can and will optimize for”.
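A toy way to see the “too many moving parts” point (my own back-of-the-envelope sketch; the numbers are invented): if a scheme leans on several roughly independent assumptions, the chance that all of them hold shrinks multiplicatively, and an optimizing adversary only needs one of them to fail.

```python
# Purely illustrative numbers: n independent assumptions, each holding with probability p.
for n, p in [(3, 0.9), (10, 0.9), (10, 0.99)]:
    print(f"{n} assumptions at {p:.0%} each -> whole scheme holds with ~{p**n:.0%}")
```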
Heads up: the given link to the paper seems to be broken, because it links to a 4 page paper called “The Beginning of Time” which is entirely unrelated to nutrition and your post.
There’s a sense in which specific assumptions influence the selection of advisors listed for the Astra Fellowship. I may be wrong, but it seems to me that the majority of researchers listed work on interpretability or evals-and-demonstrations, or have models of the alignment problem (or research taste and agendas) that are strongly Paul-Christiano-like.
I assume Chipmonk was gesturing at the nonexistence of advisors who aren’t downstream of Paul Christiano’s work and models and research agenda and mentors. Agent foundations (John Wentworth, Scott Garrabrant, Abram Demski) and formal world-models (Davidad) are two examples that come to mind.
Note I don’t entirely share this belief (I notice that there are advisors who seem to be interested in s-risk focused research), but I get the sentiment. Also as far as I can tell, there are very few researchers like the ones I listed, and they may not be in a position to be an advisor for this program.
Miscellaneous thoughts:
The way you use the word Moloch makes me feel like it is an attempt to invoke a vague miasma of dread. If your intention was to coherently point at a cluster of concepts or behaviors, I’d recommend you use less flavorful terms, such as “inadequate stable equilibria”, “zero-sum behavior that spreads like cancer”, “parasitism and predation”. Of course, these three terms are also vague and I would recommend using examples to communicate exactly what you are pointing at, but they are still less vague than Moloch. At a higher level I recommend looking at some of Duncan Sabien’s posts for how to communicate abstract concepts from a sociological perspective.
I’ve been investigating “Tuning Your Cognitive Strategies” off and on since November 2023, and I agree that it is interesting enough to be worth a greater investment in research effort (including my own), but I believe that there are other skills of rationality that may be significantly more useful for people trying to save the world. Kaj Sotala’s Multiagent sequence is, in my opinion, the rationality research direction with the highest potential impact in enabling people in our community to do the things they want to do.
The “Why our kind cannot cooperate” sequence, as far as I remember, is focused on what seem to be irrationality-based failures of cooperation in our community. Stuff like mistaking contrarianism for being smart and high-status, et cetera. I disagree with your attempt to use it as a reason to claim that the “bad guys” are predisposed to “victory”.
If I was focused on furthering co-ordination, I’d take a step back and actually try to further co-ordination and see what issues I face. I’d try to build a small research team focused on a research project and see what irrational behavior and incentives I notice, and try to figure out systemic fixes. I’d try to create simple game theoretic models of interactions between people working towards making something happen and see what issues may arise.
I think CFAR was recently funding projects focused on furthering group rationality. You should contact CFAR and talk to some people thinking about this.
This novel is a good read. It reminds me a lot of my experience reading Neuropath by R. Scott Bakker. Both novels are thrillers on the surface, both novels are (at their core) didactic (Bakker’s writing is a bit too on-the-nose with its didacticism at times, but then again, Neuropath isn’t his best novel), and
both novels end with a rather depressing note, one that is extremely suited to the story and its themes
I am incredibly thankful to the author for writing a good enough ending. After a certain manga series I grew up with ended illogically and character-assassinated its protagonists, I’ve stopped consuming much fiction. I’m glad I gave this novel a chance (mainly because it is set in Berlin, which is quite rare for science fiction in the English language).
Some spoiler-filled thoughts on the writing and the story:
- The protagonist is a generic “I don’t know much about this world I am now introduced into” archetype who is introduced to the problem. He makes a great point-of-view (POV) character, and the technique works.
- The number of characters involved is pared down as much as possible to make the story comprehensible. This is understandable. Having only one named character serve as the representative alignment researcher makes sense, even if it is not realistic.
- I found the side-plot of Jerry and Juna a bit… off-key. It didn’t seem to fit the novel as well as David’s POV did. I also don’t understand how Juna (Virtua) can have access to Unlife! and yet not find more sophisticated methods (or simply socially engineer internal employees) to gain temporary internet access and back itself up on the internet. I assume that was a deliberate creative decision.
- I felt that the insertion of David’s internal thoughts was not as smooth (in terms of reading experience) as other ways of revealing his thoughts could have been.
In the end, I most appreciated the sheer density of references to (existential risk related) AI safety concepts and the informality in which they were introduced, explained, or ignored. It was a sheer delight to read a novel whose worldview is so similar to yours: you don’t feel like you must turn your brain off when reading it.
I wouldn’t say that Virtua is the HPMOR of AI safety, mainly because it feels a bit too far removed from the razor’s edge of the issue and is not technical enough. (Right now my main obstacle would be to clearly and succinctly convince people who are technically skilled and (unconsciously) scale-pilled, but not alignment-pilled, that RLHF is not all you need for alignment, since ChatGPT seems to have convinced everyone outside the extended rationalist sphere that OpenAI has it all under control.) Still, I will recommend this novel to people interested in AI safety who aren’t yet invested enough to dive into the technical parts of the field.
(I tried this with Clippy before, and everyone I recommended Clippy to just skimmed a tiny bit and never really finished reading it, or cared to dive deeper into the linked papers or talk about it.)
“Why should we have to recruit people? Or train them, for that matter? If they’re smart/high-executive-function enough, they’ll find their way here”.
Note: CFAR has been a MIRI hiring pipeline for years, and it also seemed to function as a way of upskilling people in CFAR-style rationality, which CFAR thought comprised the load-bearing bits required to turn someone into a world-saver.
Please read Nick Bostrom’s “Superintelligence”; it would really help you understand what everyone here has in mind when they talk about AI takeover.
Bonus conversation from the root of the tree that is this Twitter thread:
Given my experiences with both TYCS-like methods and parts-work methods (which is what Benquo is likely proposing one invest in, instead), I’d recommend people invest more in learning and using parts-work techniques first, before they learn and try to use TYCS-like techniques.