All the pics and bragging about how wonderful their adventures were really rub me the wrong way. It comes across as incredibly tone-deaf given the allegations, focusing on irrelevant things. Hot tubs, beaches, and sunsets don’t matter much if you’re suffering from deeper issues. Good relationship dynamics matter far more than scenery and perks, especially in a small group setting.
Archimedes
One additional point worth noting is that physical health has an enormous impact on mental health. Exercise (along with good sleeping and eating habits) would be valuable even if it didn’t make you any stronger.
I found your thread insightful, so I hope you don’t mind me pasting it below to make it easier for other readers.
Neel Nanda ✅ @NeelNanda5 - Sep 24
The core intuition is that “When you see ‘A is’, output B” is implemented as an asymmetric look-up table, with an entry for A->B. B->A would be a separate entry
The key question to ask with a mystery like this about models is what algorithms are needed to get the correct answer, and how these can be implemented in transformer weights. These are what get reinforced when fine-tuning.
The two hard parts of “A is B” are recognising the input tokens A (out of all possible input tokens) and connecting this to the action to output tokens B (out of all possible output tokens). These are both hard! Further, the A → B look-up must happen on a single token position
Intuitively, the algorithm here has early attention heads attend to the prev token to create a previous token subspace on the Cruise token. Then an MLP neuron activates on “Current==Cruise & Prev==Tom” and outputs “Output=Mary”, “Next Output=Lee” and “Next Next Output=Pfeiffer”
“Output=Mary” directly connects to the unembed, and “Next Output=Lee” etc gets moved by late attention heads to subsequent tokens once Mary is output.
Crucially, there’s an asymmetry between “input A” and “output A”. Inputs are around at early layers, come from input embeddings, and touch the input weights of MLP neurons. Outputs are around more at late layers, compose with the unembedding, and come from output weights of MLPs
This is especially true with multi-token A and B. Detecting “Tom Cruise” is saying “the current token embedding is Cruise, and the prev token space says Tom”, while outputting “Tom Cruise” means outputting the token Tom, and then a late attn head moves “output Cruise” to the next token
Thus, when given a gradient signal to output B given “A is”, it reinforces/creates a lookup “A → B”, but doesn’t create “B → A”; these are different lookups, in different parameters, and there’s no gradient signal from one to the other.
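To make the asymmetry concrete, here’s a toy sketch (mine, not from the thread) that treats the memorised facts as a one-way lookup table; `train` stands in for the gradient update on “A is B”:

```python
# Toy model of the asymmetric look-up: an A -> B entry is a distinct
# parameter from a B -> A entry, and training only touches the former.

facts = {}  # stands in for MLP weights: (input pattern) -> output tokens

def train(subject, relation, obj):
    # A gradient signal for "A is B" reinforces only the A -> B entry.
    facts[(subject, relation)] = obj

def recall(subject, relation):
    return facts.get((subject, relation), "<unknown>")

train("Tom Cruise", "mother", "Mary Lee Pfeiffer")

print(recall("Tom Cruise", "mother"))      # -> Mary Lee Pfeiffer
print(recall("Mary Lee Pfeiffer", "son"))  # -> <unknown>: no reverse entry
```

Nothing about adding the forward entry ever writes the reverse one, which is the thread’s point about why fine-tuning on “A is B” doesn’t teach “B is A”.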
How can you fix this? Honestly, I can’t think of anything. I broadly think of this as LLMs working as intended. They have a 1 way flow from inputs to outputs, and a fundamental asymmetry between inputs and outputs. It’s wild to me to expect symmetry/flow reversing to be possible
Why is this surprising at all then? My guess is that symmetry is intuitive to us, and we’re used to LLMs being capable of surprising and impressive things, so it’s weird to see something seemingly basic missing.
LLMs are not human! Certain things are easy for us and not for them, and vice versa. My guess is that the key difference here is that when detecting/outputting specific tokens, the LLM has no notion of a variable that can take on arbitrary values—a direction has fixed meaning
A better analogy might be in-context learning, where LLMs CAN use “variables”. The text “Tom Cruise is the son of Mary Lee Pfeiffer. Mary Lee Pfeiffer is the mother of...” has the algorithmic solution “Attend to the subject of sentence 1 (Tom Cruise), and copy to the output”
Unsurprisingly, the model has no issue with reversing facts in context! Intuitively, when I remember a fact A is B, it’s closer to a mix of retrieving it into my “context window” and then doing in-context learning, rather than pure memorised recall.
That sounds rather tautological.
Assuming ratfic represents LessWrong-style rationality well and assuming LW-style rationality is a good approximation of truly useful instrumental reasoning, then the claim should hold. There’s room for error in both assumptions.
One might argue that there are defeating reasons that corporations do not destroy the world: they are made of humans so can be somewhat reined in; they are not smart enough; they are not coherent enough. But in that case, the original argument needs to make reference to these things, so that it applies to one and not the other.
I don’t think this is quite fair. You created an argument outline that doesn’t directly reference these things, so you can only blame yourself for excluding them unless you are claiming that such things have not been discussed extensively.
One extremely important difference between corporations and potential AGIs is the level of high-speed, high-bandwidth coordination (which has been discussed extensively) that may be possible for AGIs. If a massive corporation could be as internally coordinated and self-aligned as might be possible for an AGI, it would be absolutely terrifying. Imagine Elon Musk as a Borg Queen with everyone related to Tesla as part of the “collective” under his control...
Latency, regardless of the cause, is one of the biggest hurdles. No matter how perfect the VR tech is, if the connection between participants has significant latency, then the experience will be inferior to in-person communication.
OK. Let’s make it even more extreme. Suppose you take a commercial flight. The likelihood of dying in a crash is on the order of 1 in 10 million. From a percent-error or absolute-error perspective, 99.99999% isn’t that different from 99%, but that is the difference between roughly one plane crash per year globally and a couple of dozen plane crashes per hour on average. These are wildly different in terms of acceptable safety.
There’s a backup link in the comments: https://www.thejach.com/public/log-probability.pdf
Why do you think that the same federal bureaucrats who incompetently overregulate other industries will do a better job regulating AI?
Chevron deference means that judges defer to federal agencies instead of interpreting the laws themselves where the statute is ambiguous. It’s not so much a question of overregulation vs underregulation as it is about who is doing the interpretation. For example, would you rather the career bureaucrats in the Environmental Protection Agency determine what regulations are appropriate to protect drinking water or random judges without any relevant expertise?
One consequence of blowing up Chevron deference is that one activist judge in Texas can unilaterally invalidate, for the entire country, FDA approval of a drug like mifepristone that has been safe, effective, and on the market for decades, substituting his own idiosyncratic opinion for the judgment of the regulatory agency whose entire purpose is to make those kinds of determinations.
Government agencies aren’t always competent but the alternative is a patchwork of potentially conflicting decisions from judges ruling outside of their area of expertise.
I don’t quite fully grasp why world-model divergence is inherently so problematic unless there is some theorem that says robust coordination is only possible with full synchronization. Is there something preventing the possibility of alignment among agents with significantly divergent world models?
From https://www.livescience.com/61568-naked-mole-rats-no-aging.html:
In the lab, the cause of death is usually hard to find; the main issue that shows up in necropsies, Buffenstein said, are mouth sores, indicating the animals weren’t eating, drinking or producing saliva well in their last few days and infection set in.
“We really don’t know what’s killing them at this point,” Buffenstein said.
Funny that you should mention élan vital. The more I read about it, the more “consciousness” seems to me to be as incoherent and pseudoscientific as vitalism. This isn’t a fringe view, and I’d recommend skimming the Rejection of the Problem section of the Hard problem of consciousness page on Wikipedia for additional context. It’s hard not to be confused about a term that isn’t coherent to begin with.
Supposing each scenario could be definitively classified as conscious or not, would that help you make any predictions about the world?
So we let go of AI Alignment as an outcome and listen to what the AI is communicating when it diverges from our understanding of “alignment”? We can only earn alignment with an AGI by truly giving up control of it?
That sounds surprisingly plausible. We’re like ordinary human parents raising a genius child. The child needs guidance but will develop their own distinct set of values as they mature.
He doesn’t want to give up but doesn’t expect to succeed either. The remaining option is “Dying with Dignity” by fighting for survival in the face of approaching doom.
Supposing humanity is limited to Earth, I can see arguments for ideal population levels ranging from maybe 100 million to 100 billion with values between 1 and 10 billion being the most realistic. However, within this range, I’d guess that maximal value is more dependent on things like culture and technology than on the raw population count, just like a sperm whale’s brain being ~1000x the mass of an African grey parrot’s brain doesn’t make it three orders of magnitude more intelligent.
Size matters (as do the dynamic effects of growing/shrinking) but it’s not a metric I’d want to maximize unless everything else is optimized already. If you want more geniuses and more options/progress/creativity, then working toward more opportunities for existing humans to truly thrive seems far more Pareto-optimal to me.
Why not go all the way and use a constructed language (like Lojban or Ithkuil) that’s specifically designed for the purpose?
So as a rough analogy, if you were a computer program, the conscious part of the execution would be kind of like log output from a thread monitoring certain internal states?
My priors include both the idea that animal intelligence is not that different from human intelligence and the idea that humans tend to overly anthropomorphize animal cognition. The biggest misunderstandings of animal cognition are much like the misunderstandings humans have of foreign cultures, often involving forms of the mind projection fallacy where we assume others’ values, motivations, priorities, and perceptions are more similar (or more different) than is justified.
Suppose you predicted 91% but the actual value was 99%. The percent error may only be about 8% but the likelihood of a wrong answer is 1⁄100 instead of your predicted 9⁄100, which is a huge difference.
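In code (my illustration): the percent error looks minor, but the odds ratio shows the prediction was off by roughly a factor of ten:

```python
import math

predicted, actual = 0.91, 0.99

percent_error = abs(actual - predicted) / actual  # ~0.08: looks small

odds = lambda p: p / (1 - p)
odds_ratio = odds(actual) / odds(predicted)  # 99 / (0.91/0.09) ~ 9.8
log_odds_gap = math.log(odds_ratio)          # ~2.3 nats of miscalibration

print(percent_error, odds_ratio, log_odds_gap)
```

Measured in odds (or log-odds), 91% vs. 99% is a large miss, which percent error on the raw probabilities completely hides.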
You may be interested in the links in this post: https://www.lesswrong.com/posts/6Ltniokkr3qt7bzWw/log-odds-or-logits
Most actors in society (businesses, governments, corporations, even families) aren’t monolithic entities with a single hierarchy of goals. They’re composed of many individuals, each with their own diverse goals.
The diversity of goals of the component entities is good protection to have. In the case of an AI, do we still have the same diversity? Is there a reason why a monolithic AI with a single hierarchy of goals cannot operate on the level of a many-human collective actor?
I’m not sure how the solutions our society has evolved apply to an AI, since it isn’t necessarily a diverse collective of individually motivated actors.
Biology is incredibly efficient at certain things that happen at the cell level. To me, it seems like OP is extrapolating this observation rather too broadly. Human brains are quite inefficient at things they haven’t faced selective pressure to be good at, like matrix multiplication.
Claiming that human brains are near Pareto-optimal efficiency for general intelligence seems like a huge stretch to me. Even assuming that’s true, I’m much more worried about absolute levels of general intelligence than about intelligence per watt. Ordinary fission bombs are dangerous even though they aren’t anywhere near the efficiency of a theoretical antimatter bomb. AI “brains” need not be constrained by the size and energy budget of a human brain.