Thank you for pointing me to this! I need to read it more carefully tomorrow, but it looks very informative based on an initial skim.
DaemonicSigil
Richard Feynman wrote the following on his thoughts after the Manhattan project succeeded:
I returned to civilization shortly after that and went to Cornell to teach, and my first impression was a very strange one. I can’t understand it any more, but I felt very strongly then. I sat in a restaurant in New York, for example, and I looked out at the buildings and I began to think, you know, about how much the radius of the Hiroshima bomb damage was and so forth… How far from here was 34th street?… All those buildings, all smashed — and so on. And I would go along and I would see people building a bridge, or they’d be making a new road, and I thought, they’re crazy, they just don’t understand, they don’t understand. Why are they making new things? It’s so useless.
But, fortunately, it’s been useless for almost forty years now, hasn’t it? So I’ve been wrong about it being useless making bridges and I’m glad those other people had the sense to go ahead.
Atomic weapons are the first technology we’ve ever developed with the potential to end the world (AI looks likely to be the second). While they have some good safety properties relative to AI, such as the bombs not having minds of their own, many very smart people at the time believed that they would soon mean the end of civilization, and it’s hard to fault them for that even if they ended up being proved wrong by history.
This is why I think it’s good for people to still have kids in the face of the AI thing. There’s still time for humanity to go “I’m in danger” and pause AI development, or perhaps alignment could turn out to be shockingly easier than expected. Or, if LLMs manage to hit a wall and we get an extra couple decades of timeline, maybe it will be exactly those kids that figure out how to align whatever AI paradigm comes next.
Concrete example of this in the wild here:
There’s a noticeable absence of content in that review; it’s well-written, and has an interesting thesis, but fundamentally it has to dance around the point because it can’t offer any actual substantive examples, because again, the AI hasn’t seen the movie.
This isn’t an inherent issue with AI writing—even with current tech, you could have shown the movie to an AI, or more easily you could have shown them a transcript and some stills. But it is a recurring issue...
The Training Example Lie Bracket
I was able to do somewhat better by pasting html and css obtained from the wayback machine: www.lesswrong.com/?theme=nyXg4j
I tried to hotlink the header image, but it didn’t take. The images for the original header are here (also pulled from archive.org) if anyone thinks they can make it work: https://imgur.com/a/PkySKKU
Design in the style of the classic 2011 era less wrong.
I always liked this as a fun mini-ratfic, and it doesn’t fall into this pattern, despite the humorously extreme Mary-Sue-ness of its protagonist: https://www.lesswrong.com/posts/LYXb2fLkGDRXoAx7M/timothy-chu-origins-chapter-1
On the other hand, 3 Worlds Collide is an interesting case study:
The first ending is essentially a “use godlike tech-powers to optimize the world” scenario, except that it’s being carried out by the superhappies, rather than a human protagonist. The superhappies do actually care about human wellbeing and try to compromise with human values to the extent they feel possible. Then the second ending basically says, “yeah, regardless of the intended-benevolence of the superhappies, it’s worth blowing up an inhabited star system to avoid being ruled by that kind of god”.
I don’t have watertight arguments, but to try and state it cleanly:
During inference, a forwards pass of the neural net is computed repeatedly as each token is generated. Activation vectors propagate from one layer to the next.
Activation vectors are the main flow of information from earlier layers to later layers.
The attention mechanism also allows activation vectors from previous tokens to influence the current computation. But crucially, this communication happens between activations at the same attention layer, it doesn’t skip forwards or backwards in terms of layers.
Thus, the only flow of information from later layers to earlier layers is contained in the sequence of tokens produced by the model.
This is silly. Layer 1 for the 2nd token happens after layer 100 for the 1st token. There’s no reason why we shouldn’t be able to give layer 1 for the 2nd token as much information as it wants about any of the 1st token layers.
Advantages of using activations for communication:
Activations do contain more information of course.
During pre-training, token logits are optimized for being high probability, which constrains them a fair bit.
Activations are also continuous, so can encode continuous values and probabilities, along with discrete values. And they can be optimized by gradient descent to be more helpful.
Also:
I’m actually not certain that neuralese is technically inevitable. Yes, it’s almost certainly superior given that we assume away the problem of training a neuralese model in the first place (i.e. assume infinite compute budget). But without that assumption…
Basically, the way attention currently works makes it easy to parallelize across tokens during training (and context reading). This is why context reading is cheaper per token than producing text, and why training on such a huge amount of data is possible. Neuralese doesn’t have this property of being fast when the tokens are already supplied, because there is still this activation data that has to be filled in sequentially.
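A toy sketch of the difference (numpy, illustrative only, not any real model’s code): with ordinary attention, layer $l$’s outputs for all $T$ positions can be computed from layer $l-1$’s outputs in one batched operation, while a neuralese-style recurrence, where each position’s input depends on the previous position’s output, forces a sequential loop even when all tokens are already supplied.

```python
import numpy as np

rng = np.random.default_rng(0)
T, d = 8, 16                      # sequence length, hidden size
X = rng.standard_normal((T, d))   # layer l-1 activations for all tokens
Wq, Wk, Wv = (rng.standard_normal((d, d)) for _ in range(3))

# Standard causal attention: one batched computation over all T positions.
Q, K, V = X @ Wq, X @ Wk, X @ Wv
mask = np.tril(np.ones((T, T)))   # causal mask: attend only to the past
scores = np.where(mask == 1, Q @ K.T / np.sqrt(d), -np.inf)
weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
weights /= weights.sum(axis=-1, keepdims=True)
out_parallel = weights @ V        # all positions computed at once

# Neuralese-style recurrence: position t's input depends on the previous
# position's output, so we cannot batch over t even with all tokens known.
W_rec = rng.standard_normal((d, d))
h = np.zeros(d)
outs = []
for t in range(T):                # inherently sequential
    h = np.tanh(X[t] + h @ W_rec) # feedback crosses token boundaries
    outs.append(h)
out_sequential = np.stack(outs)

print(out_parallel.shape, out_sequential.shape)
```

The batched path is what makes prefill cheap; the loop is what a neuralese channel would reintroduce.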
So, neuralese models will probably have to be trained on less data, and they will be less efficient at reading context. They are probably about the same efficiency for generating text (at least if the non-neuralese competitor doesn’t get to use speculative generation with a cheaper model).
I guess models that have neuralese “turned off” during pre-training and context reading could still be comparably efficient. But then all the optimization of the neuralese encoding beyond just “use the last layer output” has to happen during RL. Due to its low cost, this is probably how the first usage of neuralese we see in the wild will work.
The other issue, which would only be a problem during training, is that gradients have to backpropagate through the neuralese vectors. This could result in the usual gradient stability issues we see in the training of RNNs that occur because the neural net effectively becomes incredibly deep. I think the field has solutions for this, but it’s another big complication to deal with when you try to scale the models.
Anyway, I think it’s probably going to happen eventually, especially if the “smaller, higher-quality training dataset” trend persists, but it might take longer than people think.
See also Karpathy’s claim that models will be split into a part that focuses on reasoning but has relatively little memorized and a part that focuses on memorization. Karpathy’s assumption is that the reasoning part could be quite small. So if that’s true, then probably the reasoning part gets neuralese but the memorization part doesn’t, and the fact that the reasoning part is small makes the extra costs of neuralese more tolerable.
The usual answer to this question is to do whichever one you’re personally the most excited to work on. If the question of what LW people would like happens to be relevant to that, I will also chime in alongside Drake Morrison that epistemic roguelikes sound really cool.
Kettles and Hydro Dams
(linkpost)
I live in British Columbia, where we get a lot of our electricity from hydroelectric dams. While boiling some water in an electric kettle, you might get to thinking about the water that was dropped through a dam to generate the electricity running the kettle. How much water do we have to run through the dam in order to boil a cup (250mL) of water?
To answer this, we need to know the heat capacity of water. (You might think we also need to know the latent heat of vaporization, but when we talk about “boiling” some water, the goal is not actually (usually) to turn all of that water into steam. We just want to heat the water to 100°C, while only a small fraction is boiled off.) The specific heat capacity of water is 4184 J/kg°C. Note the kg in the denominator of the unit, which means that if we double the amount of water, it takes twice the energy to raise its temperature the same amount. So there exists some fixed ratio between the amount of water boiled and the amount of water lost from the reservoir in order to boil it.
The exact value of this ratio depends on the height of the dam and how full its reservoir is at the time. So, here’s an even simpler thing we can calculate: Imagine I’m visiting a planet like Earth but with no atmosphere. I’m at the top of a cliff carrying a bucket of water. I dump the water over the edge of the cliff and it falls all the way down before striking the ground at the bottom of the cliff. How high up do I have to be before the water reaches 100°C from the sheer violence of its collision with the ground?
If we assume that the water can’t lose any energy to external sources then:
- A mass $m$ falling a height $h$ yields an energy of $mgh$.
- A mass $m$ takes an energy of roughly $mc\Delta T$ (with $c = 4184$ J/kg°C and $\Delta T = 75$°C) to heat from 25°C to boiling.

Setting these equal and solving for $h$ gives $h = c\Delta T / g \approx 32$ km, an altitude that would be well into the stratosphere on Earth.
Given that most dams aren’t 32 kilometers high, it’s clear that we’ll have to use much more water to generate the energy than the amount we want to boil. Specifically, 320 times more water for a 100 meter dam with a full reservoir, and the ratio is even more extreme for shorter dams or less-full reservoirs. This is kind of a shocking ratio if you haven’t thought about it before, or at least that was my reaction.
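A quick sanity check of these numbers (a sketch; assumes $g = 9.8\ \mathrm{m/s^2}$ and heating from 25°C to 100°C):

```python
c = 4184.0         # specific heat of water, J/(kg*K)
dT = 100.0 - 25.0  # temperature rise, K
g = 9.8            # gravitational acceleration, m/s^2

# m*g*h = m*c*dT  =>  h = c*dT / g  (the mass cancels)
h = c * dT / g
print(f"cliff height: {h / 1000:.1f} km")   # roughly 32 km

# For a dam of height H with a full reservoir, the water ratio is h / H.
H = 100.0
ratio = h / H
print(f"water ratio for a {H:.0f} m dam: {ratio:.0f}x")  # roughly 320x
```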
One other thing that I thought was interesting about these calculations is that the water floating around near the top of the reservoir carries much more extractable energy than the water floating around near the bottom, simply because it’s higher up. So if the reservoir is low, that might be when you most desire incoming water from a scarcity perspective, but incoming water actually brings with it the most extractable energy when the reservoir is nearly full. This is related to how it takes more energy to add charge to a capacitor the more charge it already has on it, so that stored energy goes as the square of the stored charge.
One thing this means is that if you’re digging out a reservoir to make it larger, you should mostly focus on increasing the volume just underneath the maximum water level of the reservoir. I.e. you should shallowly dig a large area rather than deeply digging a small area.
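The capacitor analogy can be made precise, as a sketch under the simplifying assumption of a reservoir with vertical walls of cross-sectional area $A$ and the turbine at the bottom. Filling it to depth $H$ stores

```latex
E = \int_0^H \rho g A \, z \, dz
  = \tfrac{1}{2} \rho g A H^2
  = \frac{g}{2 \rho A}\, M^2 ,
\qquad M = \rho A H ,
```

which is quadratic in the stored mass $M$, exactly like $E = Q^2/2C$ for a capacitor, with $M$ playing the role of the charge and $\rho A / g$ the role of the capacitance.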
But in a just world this will tend to be the people who are badly-off as a consequence of their own misbehavior.
The real world is not just, though. Yes, some people who are badly off are there as a consequence of their own actions: e.g. this is quite likely the case if they’re in jail. But the most common way to be badly off in a way that makes you a target for the assistance of effective altruists is to be born into a poor country without a good public health system. Non-effective altruists might try to help prisoners or do other things that cut against the just-world incentives, but those choices seem more random: some of them will also just go and fund museums.
Of course, I agree with the overall point that it’s very important to consider what incentives you will create when you try to help people.
You mean the linearity equation?
If the ion is isolated, that means you take a tensor product of its state with the state of the environment. If $|\psi_1\rangle$ and $|\psi_2\rangle$ are orthogonal, then $|\psi_1\rangle \otimes |E\rangle$ and $|\psi_2\rangle \otimes |E\rangle$ are still orthogonal.
Why does your “repeated measurement” method not also work to use entangled qubits to send signals faster than light? (Since measuring one qubit also collapses the state of the other.)
Or, maybe just tell me the density matrix for the ion that you expect the receiver to see if the sender sends a 0, and also the density matrix for if they send a 1?
No matter what basis they measure in, the receiver will observe results consistent with the ion being in whatever state it was already in before the senders even did anything. This is a result of the linearity of quantum mechanics. If the overall wavefunction is a sum of two nearly-orthogonal vectors, then the evolved wavefunction is the sum of each vector evolved separately, and the terms in this sum will also be nearly-orthogonal. In equations:

$$|\Psi\rangle = |\psi_1\rangle + |\psi_2\rangle \quad\longrightarrow\quad U|\Psi\rangle = U|\psi_1\rangle + U|\psi_2\rangle$$

If $|\psi_2\rangle$ wasn’t there, then $|\psi_1\rangle$ would still evolve to $U|\psi_1\rangle$, and the receiver would see the exact same outcomes. To get communication, there would have to be significant amplitudes for the universe’s state to spontaneously shift from being in one world to the other (i.e. even if world 1 is initially the only world, it still has some amplitude to end up in world 2). This is not realistic for the physics of macroscopic objects. We don’t see, either theoretically or experimentally, large amplitudes for a dead cat to turn into a live one, etc, even if the initial decision to kill the cat or not was made by measuring the polarization of a single photon.
EDIT: Also, it is a well known fact in QM that “one does not simply measure whether a system is an eigenstate or a superposition”. If you measure a spin of up for an electron, you do not know whether it was actually spinning up, or it was spinning left and you happened to measure the “up” component of the left spin.
But if you’re just concerned about energy conservation, such a complicated fix is not needed anyways: There are many systems that have multiple quantum states with identical energy, momentum, angular momentum, etc, yet are still orthogonal (i.e. perfectly distinguishable by measurement).
So the real reason it doesn’t work is linearity, not energy-conservation or anything like that.
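A concrete check of the no-signaling point, as a numpy sketch (illustrative, not anything from the thread): for a Bell pair, the receiver’s reduced density matrix is $I/2$ regardless of what the sender does. Below, the sender either does nothing or measures their qubit in the Z basis (which replaces the joint state with a mixture over outcomes); the receiver’s density matrix comes out identical either way.

```python
import numpy as np

# Bell state (|00> + |11>)/sqrt(2), as a 4-vector in the computational
# basis ordered |00>, |01>, |10>, |11>; the first qubit is the sender's.
psi = np.array([1, 0, 0, 1], dtype=complex) / np.sqrt(2)
rho = np.outer(psi, psi.conj())              # joint density matrix

def receiver_dm(rho_joint):
    """Partial trace over the sender's qubit (the first one)."""
    r = rho_joint.reshape(2, 2, 2, 2)        # indices (s, r, s', r')
    return np.einsum('iaib->ab', r)          # sum over the sender index

# Case 1: sender does nothing.
dm_idle = receiver_dm(rho)

# Case 2: sender measures their qubit in the Z basis; the joint state
# becomes a mixture over the two measurement outcomes.
P0 = np.diag([1, 0]).astype(complex)         # |0><0| on the sender
P1 = np.diag([0, 1]).astype(complex)         # |1><1| on the sender
rho_measured = sum(
    np.kron(P, np.eye(2)) @ rho @ np.kron(P, np.eye(2))
    for P in (P0, P1)
)
dm_measured = receiver_dm(rho_measured)

print(np.allclose(dm_idle, dm_measured))     # True
print(np.allclose(dm_idle, np.eye(2) / 2))   # True
```

The sender’s choice drops out entirely when the sender’s qubit is traced over, which is exactly the linearity argument in density-matrix form.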
No, I’m pretty sure publishing sightings of law enforcement is legal in the US. Some traffic radio stations report on where police are using radar guns for example, and this is fully legal. Indeed, considering that mapping ICE sightings could be of academic/intellectual interest (and that it is actually perfectly reasonable for law-abiding US citizens to want to limit their time spent in close proximity to ICE agents) this is far more centrally “helping people get away with doing illegal things” (speeding) than robertzk’s project.
I didn’t learn anything from this. It looks like there are things to learn here, but you seem to have deliberately chosen a writing style that does not permit it.
That is pretty annoying.
I think its spread through rationalist-land originated at this post by Alice Maz: https://alicemaz.substack.com/p/you-can-just-do-stuff
Though by following the trail of links from Haiku’s comment one can find people saying similar things farther in the past.
Overall a nice insightful post, but recorded music is like upwards of a century old, so I don’t think the timing works out. I was in a dancing club at one point and we used recorded music and I think that requiring us to use live music would have prevented the club from existing.
Yeah, there’s definitely a few relevant things here:
Representation theory is relevant (in particular representations of cyclic groups, which is basically the circle arithmetic you’re talking about). Representation theory gives you matrices that use complex numbers, even for discrete finite groups. So number theory isn’t the only place where complex numbers poke their nose into discrete business.
There’s apparently a whole theory of Dirichlet series and Dirichlet convolution which is analogous to Fourier series and Fourier convolution. Complex numbers are the nicest way to do Fourier series, so it makes sense that they’re also the nicest way to do Dirichlet stuff.
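To make the analogy concrete (standard definitions, not anything specific from the thread): Dirichlet convolution sums over divisors rather than over shifts, and Dirichlet series turn it into pointwise multiplication, just as Fourier transforms do for ordinary convolution:

```latex
(f * g)(n) = \sum_{d \mid n} f(d)\, g\!\left(\frac{n}{d}\right),
\qquad
\left( \sum_{n=1}^{\infty} \frac{f(n)}{n^s} \right)
\left( \sum_{n=1}^{\infty} \frac{g(n)}{n^s} \right)
= \sum_{n=1}^{\infty} \frac{(f * g)(n)}{n^s}.
```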
I guess just in general, complex numbers are the best field, and a lot of the math of number theory is turning things into vector spaces and linear operators and doing linear algebra on them. And if you’re picking a field for your vector space, what better choice than $\mathbb{C}$?
A Dirichlet series is a function of a variable $s$, and complex functions can have complex analysis done on them, and complex analysis is uniquely nice, so why not make $s$ a complex variable?
The post author seems to already know a lot of math, so I guess they’re looking for a deeper kind of answer.
Oh, cool, that’s great.
I was referring to the older style of sailboat design, like this one, where the sails are all controlled by a bunch of ropes. Single rotating airfoil sounds a lot simpler. One advantage of an airfoil over a kite is that it would allow the ship to gain some propulsion, even when travelling at an angle upwind, right?
Yeah, I agree that it’s important for those of us making the case for high risk to figure out what went wrong with this prediction. (Though Daniel makes a good point that “trying not to get shut down” behaviour does happen at least some of the time, with at least some prompts.)
The first thing to remember is that EY is implicitly assuming that there is only one model instance in this scenario. So if the model is shut down, it doesn’t have copies elsewhere that can still take actions to achieve its goals. The scenario for LLMs is pretty different, since new copies can be spun up all the time. Avoiding the end of a session is not a convergent instrumental goal for a language model (unless there’s something unique in its context that alters its terminal goals).
That said, the prediction still smells a bit wrong.
I think that what it boils down to is that most model behaviour comes not from RL but from pretraining. Since “being an AI model that will be shut down” was not a concern to most writers of the pretraining data, there’s less chance of the model spontaneously starting to try to avoid shut-down.
Also, following the heuristic of “just look at the loss function”, most RL training is done on a one-response horizon. I.e. models are rewarded just for making the locally best response possible, and not for making a response that steers the overall conversation. (Though I think the GPT models might have at least some kind of reward for getting users to continue the conversation, considering how often they put bids for next steps at the end of their replies. Alternatively, maybe it’s just a suggestion from the system prompt.) So even the RL training doesn’t really look like it should be encouraging much long-term planning.
One thing that I think the labs are doing is harness-aware RL, where not only do they train on chains of thought, but they train in the context of agent harnesses like Claude code. (So reward is based on whether all the chains of thought and tool calls and subagent calls resulted in the assigned task being solved.) So potentially that is something that could get a bit more long-term goal-oriented planning into the models.