I have signed no contracts or agreements whose existence I cannot mention.
Prediction: The memetic ecosystem is about to get extremely weird and kinda dangerous; the evolution of memes is going to suddenly step up to many times its normal pace.
Maybe make a bunch of them available via MCP, along with Pythia and its sources, the CCCT post, and a few others like the MIRI papers? The agents look smart enough to get x-risk now.
Hey, please do this using a Claude that has access to your research convos. You’re plausibly the best person on earth to do this.
Nice! This seems like a fun empirical angle on the thing. My guess is that this likely measures the speed of decay towards consequentialism, rather than whether it’s happening at all, but it’s neat to see some of the parameters you’d first want to test just show right up.
I expect #2 from your list is likely true, and it may be viable to prove some version of it mathematically. In particular, I expect even simulator-like training processes to, over time, select for CCCT-style dynamics through iterations of which training data from one model makes it into the next model.
I think #1 is not going to be true in the “can prove this happens universally” sense; some civilizations can co-ordinate. But I do expect it’s highly convergent for systems-dynamics reasons, and expect virtually all actual rollouts of Earth-like civilizations to end up doing it.
Or have non-approved bugs dropped into a pile that only non-team bug assessors look at; if one of them assesses something there as worthwhile and the team agrees, the assessor gets a cut.
The problem with this pessimistic position is that it mistakes a vague conceptual argument about high-level incentives—one that masks many hidden assumptions—for definitive proof.
Given this, I think it’s maybe critically important to nail down the Convergent Consequentialist Cognition Thesis, if Dario wants a proof before he’ll buy the conceptual arguments. I think CCCT is correct, having seen the intuitions from enough angles and seen the dynamics play out, but Dario is genuinely correct that we don’t have a well-nailed-down proof of the strong version of this, and he’s not unreasonable to want one. If true, this feels like the kind of thing that’s provable. TurnTrout’s results are the closest, but afaict don’t prove quite what’s needed to get Moloch/Pythia formalized.
Hey maths-y people with teams around the field looking for highly impactful things for your team members to do, consider having this on your lists of problems that you offer people on your teams? @Alex_Altair? @Alexander Gietelink Oldenziel? @peterbarnett? @Jacob_Hilton? @Mateusz Bagiński?
My very pre-formal intuition of an English capture of this is: Patterns and sub-patterns which steer the world[1] towards states where they have more ability to steer tend to dominate over time by outplaying patterns less effective at long-horizon optimization. Values other than this, if in competition with this, tend to lose weight over time.
An agent has a certain amount of foresight: the ability to correctly model the future and the way current actions affect that future. Selection towards things other than more power uses up the limited bits of optimization it has to narrow the future, and this trades off against power-seeking in a way which means that, given competition, you tend to be left with only agents which care terminally about power (even though, depending on the environment, they might express other preferences).
Consequentialists tend to be able to get the consequence of their future selves controlling more of reality; power-seekers tend to win power-seeking games. This is a multi-scale phenomenon, holding between agents, between subagents or circuits in a NN, and between superorganisms.
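As a gut check on that intuition (very much not a proof), here is a tiny toy simulation: agents split a fixed optimization budget between power-acquisition and other values, and reproduction is proportional to power-derived resources. The population size, mutation rate, and growth rule are illustrative assumptions of mine rather than anything derived from CCCT:

```python
import random

# Toy replicator-style sketch of the "selection for power-seeking" intuition.
# All parameters and the fitness rule are illustrative assumptions.
random.seed(0)
N_AGENTS = 200
GENERATIONS = 60

# Each agent splits a fixed optimization budget between acquiring resources/power
# (power_weight) and pursuing other terminal values (1 - power_weight).
population = [{"power_weight": random.random()} for _ in range(N_AGENTS)]

for _ in range(GENERATIONS):
    # Reproductive success is proportional to the fraction of the budget spent on
    # power; other values consume budget without adding to an agent's share.
    weights = [agent["power_weight"] for agent in population]
    # Resample the next generation proportionally, with a little mutation so
    # lineages that care about other things keep reappearing.
    population = [
        {
            "power_weight": min(1.0, max(0.0,
                random.choices(population, weights=weights)[0]["power_weight"]
                + random.gauss(0, 0.02)))
        }
        for _ in range(N_AGENTS)
    ]

mean_power = sum(agent["power_weight"] for agent in population) / N_AGENTS
print(f"Mean fraction of optimization spent on power after {GENERATIONS} generations: {mean_power:.2f}")
# Under these assumptions the population drifts towards putting nearly all of its
# optimization into power, even though it started out uniformly mixed.
```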
- ^
Discovering Agents-style model possible futures and select current actions based on which futures you prefer.
- ^
Functional decision theory has open problems within it, but it is correct, and the rival decision theories are wrong
My understanding was MIRI is pretty confident that the correct decision theory is one of the ones in the LDT category, but that FDT was a specific formalization of an LDT which gets a lot of normal challenges right but has some known issues rather than being actually exactly correct. Given that we’ve afaict not solved DT, I think telling Claude “Do exactly FDT” is probably dangerously suboptimal, but telling it “here’s what we want from a good DT, correct handling of subjunctive dependence, we’re pretty sure it’s in the LDT category, here’s why this matters” is nicer.
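To make “correct handling of subjunctive dependence” concrete, here is a minimal toy Newcomb’s calculation (the payoff numbers, predictor accuracy, and function names are my own illustrative assumptions, and this sketches the intuition rather than any MIRI formalism): an agent that models the prediction as depending on its decision procedure one-boxes, while an agent that treats the box contents as causally fixed two-boxes.

```python
# Toy Newcomb's problem: why subjunctive dependence changes the answer.
PREDICTOR_ACCURACY = 0.99   # assumed accuracy of the predictor
BIG = 1_000_000             # opaque box, filled iff one-boxing was predicted
SMALL = 1_000               # transparent box, always present

def ev_with_subjunctive_dependence(action: str) -> float:
    """Expected value when the prediction is treated as depending on your
    decision procedure (the LDT/FDT-flavoured calculation)."""
    p_predicted_one_box = PREDICTOR_ACCURACY if action == "one-box" else 1 - PREDICTOR_ACCURACY
    expected_big = p_predicted_one_box * BIG
    return expected_big if action == "one-box" else expected_big + SMALL

def ev_with_fixed_boxes(action: str, p_box_already_full: float) -> float:
    """Expected value under the CDT-style assumption that the box contents
    are already fixed regardless of what you choose now."""
    expected_big = p_box_already_full * BIG
    return expected_big if action == "one-box" else expected_big + SMALL

for action in ("one-box", "two-box"):
    print(action,
          round(ev_with_subjunctive_dependence(action)),
          round(ev_with_fixed_boxes(action, p_box_already_full=0.5)))
# Subjunctive-dependence reasoning favours one-boxing (~990,000 vs ~11,000);
# treating the boxes as fixed favours two-boxing by exactly 1,000 no matter what.
```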
Ok, rather than asking for MIRI people’s takes as I had in an earlier draft, I got a summary of positions from a Claude literature review:
| Researcher | Position | Key Quote | Link |
|---|---|---|---|
| Wei Dai | Not solved — more open problems | "UDT shows that decision theory is more puzzling than ever… Instead of one major open problem (Newcomb's, or EDT vs CDT) now we have a whole bunch more. I'm really not sure at this point whether UDT is even on the right track." | LessWrong, Sept 2023 |
| Scott Garrabrant | Not solved — major obstacles remain | "Logical Updatelessness is one of the central open problems in decision theory." Also authored "Two Major Obstacles for Logical Inductor Decision Theory" documenting fundamental unsolved issues. | LessWrong, Oct 2017 / LessWrong, Apr 2017 |
| Abram Demski | Not solved — fundamental issues remain | "There may just be no 'correct' counterfactuals" and UDT "assumes that your earlier self can foresee all outcomes, which can't happen in embedded agents." In 2021: "I have not yet concretely constructed any way out." | LessWrong, Oct 2018 / LessWrong, Apr 2021 |
| Rob Bensinger | Not solved — ongoing research needed | MIRI works on DT because "there's a cluster of confusing issues here (e.g., counterfactuals, updatelessness, coordination) that represent a lot of holes or anomalies in our current best understanding." | LessWrong, Sept 2018 |
| Lukas Finnveden | Not solved — formalization is hard | "Knowing what philosophical position to take in the toy problems is only the beginning. There's no formalised theory that returns the right answers to all of them yet… Logical counterfactuals is a really difficult problem, and it's unclear whether there exists a natural solution." | LessWrong, Aug 2019 |
| Jessica Taylor | Not solved — alternatives needed | Wrote "Two Alternatives to Logical Counterfactuals" arguing for different approaches (counterfactual nonrealism, policy-dependent source code), noting fundamental problems with existing frameworks. | LessWrong, Mar 2020 |
| Paul Christiano | Nuanced — 2D problem space | "I don't think it's right to see a spectrum with CDT and then EDT and then UDT. I think it's more right to see a box, where there's the updatelessness axis and then there's the causal vs. evidential axis." | LessWrong, Sept 2019 |
| Eliezer Yudkowsky | Progress made but problems remain | In the FDT paper, Y&S acknowledge that "specifying an account of [subjunctive] counterfactuals is an 'open problem'." The companion paper "Cheating Death in Damascus" states: "Unfortunately for us, there is as yet no full theory of counterlogicals [...], and for FDT to be successful, a more worked out theory is necessary." | arXiv, Oct 2017 / May 2018 |

Summary: The consensus among core MIRI/AF researchers (Wei Dai, Garrabrant, Demski, Bensinger, Finnveden) is that FDT/UDT represents the right direction but leaves major open problems—particularly around logical counterfactuals, embeddedness, and formalization.
I think you might be mixing up LDT and FDT, and “we have a likely accurate high level underspecified semantic description of what things a correct DT must have” with “we have a well-specified executable philosophy DT ready to go”.
Rob Miles suggested ‘inexorable disempowerment’ as maybe better on a call where we discussed the default associations of ‘gradual’.
Cool, yeah, some of these renames were due to the Discord limitation of stripping punctuation and don’t apply back here.
Seems like nitpick is much more a property of the claim and the discussion, rather than the person reading it.
So, kinda, but imagine that different people have developed different nitpick-detector-circuits / aka definitions, and you get that react when you don’t consider it to have been a nitpick. This can collide with the parts of your world-model unnecessarily and at least slightly suffering-inducingly, which is kinda the core insight under NVC.
But the nitpick one’s probably got small enough likelihood of anything notable to be fine, meta meta nitpick accepted with amusement.
I think
Missed the point
Weak Argument
are much more notable offenders along this axis, as those words are often used in heat, not as epistemic markers. I used “Missed Intended Point imo”; you could use “Missed Intended Point?”, and maybe “Invalid Argument?” would work here for the other one?
Updated to nicer images and made react names more NVC-flavoured, making claims about the person putting the react down rather than about the person being reacted to or about global state. Somewhat recommend LW staff (@Raemon? @habryka?) do this too; there are principled reasons for doing this if you want people to be able to think and communicate clearly. My guess is people get a lot less of a sting from negative reacts with a simple set of renames, and are able to update more cleanly.
e.g.
Too Combative → Too Combative imo
Nitpick → Seems Nitpicky To Me
Difficult To Parse → Difficult For Me To Parse
I made the new file names mostly use this style, with some adaptations for Discord. If you’d like to use this style I can make a draft version of renames for LW.
Yeah, I was not intending to exclude this type of goal directedness from unfolding. I don’t actually know the terminology for the thing I’ve been noticing a lot, but I bet the Buddhists have some specialized words which you’ll be able to inform me of. The ones I use are trying vs intention, or wanting vs desire. It’s something like steering by rejecting the realities you don’t like, conditioning on the outcomes you want in a way which makes you unable to look at the alternatives clearly.
That’s the kind of goal directedness which blocks unfolding, whereas being accepting of the fact that your preferred outcome might not happen and holding your preferences about the world within your self model’s Markov boundary rather than letting them leak into your world model is totally compatible with unfolding.
I wasn’t differentiating from random drift though, just noticing the failure mode most salient to me which is unfolding blocking trying vibes :)
Some intuitions about what differentiates Unfolding from Thinking:
In Unfolding, the aspect of yourself in the driver’s seat isn’t trying to steer the CoT your brain is running and isn’t being goal-directed with respect to the outcome. Instead it lets the quieter, less well-connected sub-patterns have their whispers come together, bringing up information which had been collected by the system but not noticed by the global neuronal workspace, and letting it sync to all the parts of the system that would benefit from knowing it.
In Thinking, you have a spec for what a good outcome would be, and you’re shaping your thoughts kinda top-down to reach that. It can search well in places you know to look, but you’re not going to be finding your unknown unknowns here very well, or noticing that the axioms of your search are too restrictive.
If I could schedule posts, the frontpage review happened before publishing, and the schedule UI had “delay publishing until frontpage”[1] as a checkbox, this would be ~solved.
- ^
I’d prefer this to “delay publishing until human review”, as ~half a dozen times in the past few years I’ve appealed via Intercom and had a human-reviewed page retroactively frontpaged (usually a resource, which LW team’s priors seem to be something like ‘this won’t be maintained’ but will because I optimize a bunch for not leaving stale projects).
Examples which Rafe requested when I mentioned this: the following were all marked as personal blog until I intercom’d in and asked for a re-assessment
https://www.lesswrong.com/posts/JsqPftLgvHLL4Pscg/new-weekly-newsletter-for-ai-safety-events-and-training
https://www.lesswrong.com/posts/dEnKkYmFhXaukizWW/aisafety-community-a-living-document-of-ai-safety
https://www.lesswrong.com/posts/vxSGDLGRtfcf6FWBg/top-ai-safety-newsletters-books-podcasts-etc-new-aisafety (nudge didn’t work for this one)
https://www.lesswrong.com/posts/MKvtmNGCtwNqc44qm/announcing-aisafety-training
https://www.lesswrong.com/posts/JRtARkng9JJt77G2o/ai-safety-memes-wiki
https://www.lesswrong.com/posts/x85YnN8kzmpdjmGWg/14-ai-safety-advisors-you-can-speak-to-new-aisafety-com
- ^
If the auto-frontpager is ~instant, I’d be happy to get the option “delay publishing until human review” when it declines frontpage. Getting ~50% less karma than it would by default is a pretty major drop in the effectiveness of what is often many hours of work; I’d usually be fine with waiting a day or two to avoid that.
The planned Ambitious AI Alignment Seminar aims to rapidly up-skill ~35 people in understanding the foundations of conceptual/theory alignment, using a seminar format where people peer-to-peer tutor each other, pulling topics from a long list of concepts. It also aims to fast-iterate and propagate improved technical communication skills of the type that effectively seeks truth and builds bridges with people who’ve seen different parts of the puzzle, and with the following one-year fellowship it seems like a worthwhile step towards surviving futures in worlds where superintelligence-robust alignment is not dramatically easier than it appears, for a few reasons, including:
Having more people who have deep enough technical models of theory that they can usefully communicate predictable challenges, and in particular inform policymakers of strategically relevant considerations.
Having many more people who have the kind of clarity which helps them shape new orgs in the worlds where alignment efforts get dramatically more funds
Preparing the ground for any attempts at the labs’ default plan (get AI to solve alignment) by making more people understand what solving alignment in a once-and-for-all way actually requires and entails, so the labs are more likely to ask the AI to solve the kinds of problems it needs to solve before hard RSIing, if value is to be preserved.
A side-output will be ranking concepts by importance and materials for learning them by effectiveness, which seems pretty valuable for the wider community.
The team is all-star and the venue excellent, and I expect it to be an amazing event. There is a funder who is somewhat interested, but they would like more evidence, in the form of Manifund comments, that competent people think this kind of effort is worthwhile, especially comments from people who’ve been involved in similar programs, so I’m signal-boosting it here along with posting the Expression of Interest for joining.
Changelog
New Features:
Added Ctrl+Z/Ctrl+Y undo/redo—experiment freely and roll back mistakes
🚀 Set ambition level—rescales % so you can extend a project[1]
📸 Copy graph to clipboard—easily grab the image
Dotted reference lines drop from each point to show exact dollar amounts, and extend to the left for exact % amounts[2]
Clear notification—Shows “Points cleared—press Ctrl+Z to undo” when you clear all points
🧲 Snap to Grid—Enable magnetic snapping to round numbers (off by default)
Right click to add text to breakpoints[3]
Smart currency input—type “90k” or “1.5M” and it auto-converts (see the sketch after this list)
Input validation with helpful error messages
Touch scrolling fixed—Dragging points on mobile doesn’t scroll the page anymore
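For the smart currency input above, here’s a minimal Python sketch of the parsing behaviour (the tool itself is a web app, so this is not its actual code; the function name and the supported suffixes are my assumptions):

```python
# Minimal sketch of "smart currency input": "90k" -> 90_000, "1.5M" -> 1_500_000.
SUFFIXES = {"k": 1_000, "m": 1_000_000, "b": 1_000_000_000}

def parse_currency(text: str) -> float:
    """Parse inputs like '90k', '1.5M', or '$250,000' into a plain number.

    Raises ValueError on anything unreadable, so the UI can show a helpful
    validation message instead of silently guessing.
    """
    cleaned = text.strip().lstrip("$").replace(",", "").lower()
    if not cleaned:
        raise ValueError("Empty amount")
    multiplier = 1
    if cleaned[-1] in SUFFIXES:
        multiplier = SUFFIXES[cleaned[-1]]
        cleaned = cleaned[:-1]
    try:
        return float(cleaned) * multiplier
    except ValueError:
        raise ValueError(f"Couldn't read {text!r} as an amount")

assert parse_currency("90k") == 90_000
assert parse_currency("1.5M") == 1_500_000
assert parse_currency("$250,000") == 250_000
```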
Language Updates:
Guide describes all features properly
Changed “Utility” → “Impact” throughout (less jargon!)
New title: “Plot Your Impact at Different Funding Levels”
Clearer subtitle mentions sharing with funders
Added watermark with stable URL
Accessibility:
Keyboard navigation—Can focus and navigate the graph with keyboard
Screen reader support—All buttons labeled, status messages announced
Modal focus trap—Can’t accidentally tab outside modal dialog
By my state of knowledge, it is an open question whether or not we will create AIs that are broadly loyal like this. It might not be that hard, if we’re trying even a little.
Curious about the story you’d tell of this happening? It looks to me quite implausible that we’d pull this off with anything like current techniques.
The task time horizon of the AI models doubles about every 7 months.
We’re pretty clearly in the 4-month doubling world at this point:
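To see how much the doubling time matters, a quick back-of-the-envelope extrapolation (the 24-month window is an illustrative choice of mine, not a figure from the source):

```python
# Growth factor of the task horizon: horizon(t) / horizon(0) = 2 ** (t / doubling_months)
MONTHS = 24  # illustrative window

for doubling_months in (7, 4):
    growth = 2 ** (MONTHS / doubling_months)
    print(f"{doubling_months}-month doubling: ~{growth:.0f}x the current horizon after {MONTHS} months")
# A 7-month doubling time gives ~11x over two years; a 4-month doubling time gives 64x.
```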
A metaphor I told to a family member who has worked for ~decades in climate change mitigation, after she compared AI to fossil fuels when I explained the incentives around AI regulation and economics/national security.
Fossil Fuels 2.0: Now with the technology trying to agentically bootstrap itself to godhood, aided by its superhuman persuasive abilities.
https://www.lesswrong.com/posts/kzPQohJakutbtFPcf/dario-amodei-the-adolescence-of-technology?commentId=AfSDKDMJaLvkjhmJa for @the gears to ascension