If you want to chat, message me!
LW1.0 username Manfred. PhD in condensed matter physics. I am independently thinking and writing about value learning.
My understanding is that pilot wave theory (ie Bohmian mechanics) explains all the quantum physics
This is only true if you don’t count relativistic field theory. Bohmian mechanics has mathematical troubles extending to special relativity or particle creation/annihilation operators.
Is there any reason at all to expect some kind of multiverse?
Depending on how big you expect the unobservable universe to be, there can also be a spacelike multiverse.
Wouldn’t other people also like to use an AI that can collaborate with them on complex topics? E.g. people planning datacenters, or researching RL, or trying to get AIs to collaborate with other instances of themselves to accurately solve real-world problems?
I don’t think people working on alignment research assistants are planning to just turn it on and leave the building; on average (weighted by money) they seem to be imagining doing things like “explain an experiment in natural language and have an AI help implement it rapidly.”
So I think both they and this post are describing the strategy of “building very generally useful AI, but the good guys will be using it first.” I hear you as saying you want a slightly different profile of generally-useful skills to be targeted.
I have now read the paper, and still think you did a great job.
One gripe I have is with this framing:
We believe our articulation of human values as constitutive attentional policies is much closer to “what we really care about”, and is thus less prone to over-optimization
If you were to heavily optimize for text that humans would rate highly on specific values, you would run into the usual problems (e.g. a model incentivized to manipulate the human). Your success here doesn’t come from the formulation of the values per se, but rather from the architecture that turns them into text/actions—rather than optimizing for them directly, you can prompt an LLM that’s anchored on normal human text to mildly optimize them for you.
This difference implies some important points about scaling to more intelligent systems (even without making any big pivots):
- We don’t want the model to optimize for the stated values unboundedly hard, so we’ll have to end up asking for something mild and human-anchored more explicitly.
- If another use of AI is proposing changes to the moral graph, we don’t want that process to form an optimization feedback loop (unless we’re really sure).
The main difference made by the choice of format of values is where to draw the boundary between legible human deliberation, and illegible LLM common sense.
I’m excited for future projects that are sort of in this vein but try to tackle moral conflict, or that try to use continuous rather than discrete prompts that can interpolate values, or explore different sorts of training of the illegible-common-sense part, or any of a dozen other things.
Awesome to see this come to fruition. I think if a dozen different groups independently tried to attack this same problem head-on, we’d learn useful stuff each time.
I’ll read the whole paper more thoroughly soon, but my biggest question so far is whether you collected data about what happens to your observables if you change the process along sensible-seeming axes.
Regular AE’s job is to throw away the information outside some low-dimensional manifold, sparse ~linear AE’s job is to throw away the information not represented by sparse dictionary codes. (Also a low-dimensional manifold, I guess, just made from a different prior.)
If an AE is reconstructing poorly, that means it was throwing away a lot of information. How important that information is seems like a question about which manifold the underlying network “really” generalizes according to. And also what counts as an anomaly / what kinds of outliers you’re even trying to detect.
Ah, yeah, that makes sense.
Even for an SAE that’s been trained only on normal data [...] you could look for circuits in the SAE basis and use those for anomaly detection.
Yeah, this seems somewhat plausible. If automated circuit-finding works it would certainly detect some anomalies, though I’m uncertain if it’s going to be weak against adversarial anomalies relative to regular ol’ random anomalies.
Dictionary/SAE learning on model activations is bad as anomaly detection because you need to train the dictionary on a dataset, which means you needed the anomaly to be in the training set.
How to do dictionary learning without a dataset? One possibility is to use uncertainty-estimation-like techniques to detect when the model “thinks it’s on-distribution” for randomly sampled activations.
Tracking your predictions and improving your calibration over time is good. So is practicing making outside-view estimates based on related numerical data. But I think diversity is good.
If you start going back through historical F1 data as prediction exercises, I expect the main thing that will happen is you’ll learn a lot about the history of F1. Secondarily, you’ll get better at avoiding your own biases, but in a way that’s concentrated on your biases relevant to F1 predictions.
If you already want to learn more about the history of F1, then go for it, it’s not hurting anyone :) Estimating more diverse things will probably better prepare you for making future non-F1 estimates, but if you’re going to pay attention to F1 anyhow it might be a fun thing to track.
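Concretely, “tracking your predictions” can be as simple as logging (probability, outcome) pairs and scoring them—here with a Brier score, using invented example forecasts:

```python
# Minimal sketch of prediction tracking: log (forecast probability, outcome)
# pairs, then compute the Brier score. Lower is better; always answering
# 50% scores 0.25, so beating that means your probabilities carry information.
predictions = [
    (0.9, True),   # e.g. "90% that driver X wins" ... and they did
    (0.6, False),
    (0.2, False),
    (0.7, True),
]

brier = sum((p - float(outcome)) ** 2 for p, outcome in predictions) / len(predictions)
print(round(brier, 3))  # → 0.125
```

Scoring F1 predictions and non-F1 predictions separately would show directly whether the calibration practice transfers between domains.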
Yup, I basically agree with this. Although we shouldn’t necessarily only focus on OpenAI as the other possible racer. Other companies (Microsoft, Twitter, etc) might perceive a need to go faster / use more resources to get a business advantage if the LLM marketplace seems more crowded.
I also like “transformative AI.”
I don’t think of it as “AGI” or “human-level” being an especially bad term—most category nouns are bad terms (like “heap”), in the sense that they’re inherently fuzzy gestures at the structure of the world. It’s just that in the context of 2024, we’re now inside the fuzz.
A mile away from your house, “towards your house” is a useful direction. Inside your front hallway, “towards your house” is a uselessly fuzzy direction—and a bad term. More precision is needed because you’re closer.
The brain algorithms that do moral reasoning are value-aligned in the same way a puddle is aligned with the shape of the hole it’s in.
They’re shaped by all sorts of forces, ranging from social environment to biological facts like how we can’t make our brains twice as large. Not just during development, but on an ongoing basis our moral reasoning exists in a balance with all these other forces. But of course, a puddle always coincidentally finds itself in a hole that’s perfectly shaped for it.
If you took the decision-making algorithms from my brain and put them into a brain 357x larger, that tautological magic spell might break, and the puddle that you’ve moved into a different hole might no longer be the same shape as it was in the original hole.
If you anticipate this general class of problems and try to resolve them, that’s great! I’m not saying nobody should do neuroscience. It’s just that I don’t think it’s an “entirely scientific approach, requiring minimal philosophical deconfusion,” nor does it lead to safe AIs that are just emulations of humans except smarter.
They can certainly use answer text as a scratchpad (even nonfunctional text that gives more space for hidden activations to flow). But they don’t without explicit training. Actually, maybe they do: maybe RLHF incentivizes a verbose style to give more room for thought. But I think even when “thinking step by step,” there are still plenty of issues.
Tokenization is definitely a contributor. But that doesn’t really support the notion that there’s an underlying human-like cognitive algorithm behind human-like text output. The point is the way it adds numbers is very inhuman, despite producing human-like output on the most common/easy cases.
I’m not totally sure the hypothesis is well-defined enough to argue about, but maybe Gary Marcus-esque analysis of the pattern of LLM mistakes?
If the internals were like a human thinking about the question and then giving an answer, it would probably be able to add numbers more reliably. And I also suspect the pattern of mistakes doesn’t look typical for a human at any developmental stage (once a human can add 3-digit numbers, their success rate at 5-digit numbers is probably pretty good). I vaguely recall some people looking at this, but have forgotten the reference, sorry.
A different question: When does it make your (mental) life easier to categorize an AI as conscious, so that you can use the heuristics you’ve developed about what conscious things are like to make good judgments?
Sometimes, maybe! Especially if lots of work has been put in to make said AI behave in familiar ways along many axes, even when nobody (else) is looking.
But for LLMs, or other similarly alien AIs, I expect that using your usual patterns of thought for conscious things creates more problems than it helps with.
If one is a bit Platonist, then there’s some hidden fact about whether they’re “really conscious or not” no matter how murky the waters, and once this Hard problem is solved, deciding what to do is relatively easy.
But I prefer the alternative of ditching the question of consciousness entirely when it’s not going to be useful, and deciding what’s right to do about alien AIs more directly.
Interesting stuff, but I felt like your code was just a bunch of hard-coded, suggestively-named variables with no pattern-matching to actually glue those variables to reality. I’m pessimistic about the applicability—better to spend time thinking about how to get an AI to do this reasoning in a way that’s connected to reality from the get-go.
Exciting stuff, thanks!
It’s a little surprising to me how bad the logit lens is for earlier layers.
I was curious about the context and so I went over and ctrl+F’ed Solomonoff and found Evan saying
I think you’re misunderstanding the nature of my objection. It’s not that Solomonoff induction is my real reason for believing in deceptive alignment or something, it’s that the reasoning in this post is mathematically unsound, and I’m using the formalism to show why. If I weren’t responding to this post specifically, I probably wouldn’t have brought up Solomonoff induction at all.
Thank you for posting this, and it was interesting. Also, I think the middle section is bad.
Basically, from where Lance takes a digression out of an anthropomorphic argument to castigate people who think AI might do bad things for anthropomorphizing, through to where the discussion of Solomonoff induction ends, I think there was a lot of misconstruing of ideas and arguing against nonexistent people.
Like, I personally don’t agree with people who expect optimization daemons to arise in gradient descent, but I don’t say they’re motivated by whether the Solomonoff prior is malign.
I found someone’s thesis from 2020 (Hoi Wai Lai) that sums it up not too badly (from the perspective of someone who wants to make Bohmian mechanics work and was willing to write a thesis about it).
For special relativity (section 6), the problem is that the motion of each hidden particle depends instantaneously on the entire multi-particle wavefunction. According to Lai, there’s nothing better than to bite the bullet and define a “real present” across the universe, and have the hidden particles sometimes go faster than light. What hypersurface counts as the real present is unobservable to us, but the motion of the hidden particles cares about it.
For varying particle number (section 7.4), the problem is that in quantum mechanics you can have a superposition of states with different numbers of particles. If there’s some hidden variable tracking which part of the superposition is “real,” this hidden variable has to behave totally differently from a particle! Lai says this leads to “Bell-type” theories, where there’s a single hidden variable, a hidden trajectory in configuration space. Honestly this actually seems more satisfactory than how it deals with special relativity—you just had to sacrifice the notion of independent hidden variables behaving like particles, you didn’t have to allow for superluminal communication in a way that highlights how pointless the hidden variables are.
Warning: I have exerted basically no effort to check if this random grad student was accurate.