The usual answer to this question is to do whichever one you’re personally the most excited to work on. If the question of what LW people would like happens to be relevant to that, I will also chime in alongside Drake Morrison that epistemic roguelikes sound really cool.
DaemonicSigil
Kettles and Hydro Dams
(linkpost)
I live in British Columbia, where we get a lot of our electricity from hydroelectric dams. While boiling some water in an electric kettle, you might get to thinking about the water that was dropped through a dam to generate the electricity running the kettle. How much water do we have to run through the dam in order to boil a cup (250mL) of water?
To answer this, we need to know the heat capacity of water. (You might think we also need to know the latent heat of vaporization, but when we talk about “boiling” some water, the goal is not actually (usually) to turn all of that water into steam. We just want to heat the water to 100°C, while only a small fraction is boiled off.) The specific heat capacity of water is 4184 J/kg°C. Note the kg in the denominator of the unit, which means that if we double the amount of water, it takes twice the energy to raise its temperature the same amount. So there exists some fixed ratio between the amount of water boiled and the amount of water lost from the reservoir in order to boil it.
The exact value of this ratio depends on the height of the dam and how full its reservoir is at the time. So, here’s an even simpler thing we can calculate: Imagine I’m visiting a planet like Earth but with no atmosphere. I’m at the top of a cliff carrying a bucket of water. I dump the water over the edge of the cliff and it falls all the way down before striking the ground at the bottom of the cliff. How high up do I have to be before the water reaches 100°C from the sheer violence of its collision with the ground?
If we assume that the water can’t lose any energy to external sources then:
-
A mass $m$ falling a height $h$ yields an energy of $E = mgh$.
-
A mass $m$ takes an energy of roughly $E = mc\Delta T = m \cdot (4184\ \mathrm{J/kg\,°C}) \cdot (75\ \mathrm{°C})$ to heat from 25°C to boiling.
Setting these equal and solving for $h$ gives $h = c\Delta T / g \approx 32\ \mathrm{km}$, an altitude that would be well into the stratosphere on Earth.
Given that most dams aren’t 32 kilometers high, it’s clear that we’ll have to use much more water to generate the energy than the amount we want to boil. Specifically, 320 times more water for a 100 meter dam with a full reservoir, and the ratio is even more extreme for shorter dams or less-full reservoirs. This is kind of a shocking ratio if you haven’t thought about it before, or at least that was my reaction.
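The arithmetic above can be sketched in a few lines of Python (assuming, as in the post, 100% conversion efficiency and a full reservoir):

```python
# Back-of-envelope check of the kettle/dam numbers.
C_WATER = 4184.0   # specific heat of water, J/(kg*K)
G = 9.8            # gravitational acceleration, m/s^2
DTEMP = 75.0       # heating from 25 C to 100 C

# Height a bucket of water must fall so its kinetic energy equals the
# heat needed to bring it to a boil: m*g*h = m*c*dT  =>  h = c*dT/g
h = C_WATER * DTEMP / G
print(f"cliff height: {h/1000:.1f} km")   # ~32 km

# For a dam of given head, mass of reservoir water dropped per mass boiled:
dam_height = 100.0  # metres
ratio = h / dam_height
print(f"water ratio for a {dam_height:.0f} m dam: {ratio:.0f}x")  # ~320x
```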
One other thing that I thought was interesting about these calculations is that the water floating around near the top of the reservoir carries much more extractable energy than the water floating around near the bottom, simply because it’s higher up. So if the reservoir is low, that might be when you most desire incoming water from a scarcity perspective, but incoming water actually brings with it the most extractable energy when the reservoir is nearly full. This is related to how it takes more energy to add charge to a capacitor the more charge it already has on it, so that stored energy goes as the square of the stored charge.
One thing this means is that if you’re digging out a reservoir to make it larger, you should mostly focus on increasing the volume just underneath the maximum water level of the reservoir. I.e. you should shallowly dig a large area rather than deeply digging a small area.
-
But in a just world this will tend to be the people who are badly-off as a consequence of their own misbehavior.
The real world is not just, though. Yes, some people who are badly off are there as a consequence of their own actions: e.g. this is quite likely the case if they’re in jail. But, like, the most common way to be badly off in a way that makes you a target for the assistance of effective altruists is to be born into a poor country without a good public health system. Non-effective altruists might try to help prisoners, or do other things that cut against just-world incentives. But those choices seem more random: some of them will also just go and fund museums.
Of course, I agree with the overall point that it’s very important to consider what incentives you will create when you try to help people.
You mean the linearity equation?
If the ion is isolated, that means you take a tensor product of its state with the state of the environment. If $|\psi_1\rangle$ and $|\psi_2\rangle$ are orthogonal, then $|\psi_1\rangle \otimes |E_1\rangle$ and $|\psi_2\rangle \otimes |E_2\rangle$ are still orthogonal.
Why does your “repeated measurement” method not also work to use entangled qubits to send signals faster than light? (Since measuring one qubit also collapses the state of the other.)
Or, maybe just tell me the density matrix for the ion that you expect the receiver to see if the sender sends a 0, and also the density matrix for if they send a 1?
No matter what basis they measure in, the receiver will observe results consistent with the ion being in whatever state it was already in before the senders even did anything. This is a result of the linearity of quantum mechanics. If the overall wavefunction is a sum of two nearly-orthogonal vectors, then the evolved wavefunction is the sum of each vector evolved separately, and the terms in this sum will also be nearly-orthogonal. In equations: $U(|\psi_1\rangle + |\psi_2\rangle) = U|\psi_1\rangle + U|\psi_2\rangle$, and since $U$ is unitary, $\langle\psi_1|\psi_2\rangle \approx 0$ implies $\langle\psi_1|U^\dagger U|\psi_2\rangle \approx 0$.
If $|\psi_2\rangle$ wasn’t there, then $|\psi_1\rangle$ would still evolve to $U|\psi_1\rangle$, and the receiver would see the exact same outcomes. To get communication, there would have to be significant amplitudes for the universe’s state to spontaneously shift from being in one world to the other (i.e. even if world 1 is initially the only world, it still has some amplitude to end up in world 2). This is not realistic for the physics of macroscopic objects. We don’t see, either theoretically or experimentally, large amplitudes for a dead cat to turn into a live one, etc., even if the initial decision to kill the cat or not was made by measuring the polarization of a single photon.
EDIT: Also, it is a well known fact in QM that “one does not simply measure whether a system is an eigenstate or a superposition”. If you measure a spin of up for an electron, you do not know whether it was actually spinning up, or it was spinning left and you happened to measure the “up” component of the left spin.
But if you’re just concerned about energy conservation, such a complicated fix is not needed anyways: There are many systems that have multiple quantum states with identical energy, momentum, angular momentum, etc, yet are still orthogonal (i.e. perfectly distinguishable by measurement).
So the real reason it doesn’t work is linearity, not energy-conservation or anything like that.
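The linearity argument is easy to check numerically. Here is a toy sketch with an arbitrary random unitary (the dimension and states are illustrative choices, not anything specific to the ion setup):

```python
import numpy as np

# Toy check: linear (unitary) evolution acts on each branch separately,
# and orthogonal branches stay orthogonal, so the "psi_1" branch evolves
# the same way whether or not the "psi_2" branch exists.
rng = np.random.default_rng(0)
dim = 8

# A random unitary, via QR decomposition of a random complex matrix
A = rng.normal(size=(dim, dim)) + 1j * rng.normal(size=(dim, dim))
U, _ = np.linalg.qr(A)

# Two orthogonal "world" states
psi1 = np.zeros(dim, dtype=complex); psi1[0] = 1.0
psi2 = np.zeros(dim, dtype=complex); psi2[1] = 1.0

# Evolving the superposition equals evolving each branch separately
lhs = U @ (psi1 + psi2)
rhs = U @ psi1 + U @ psi2
assert np.allclose(lhs, rhs)

# Orthogonal branches remain orthogonal under unitary evolution
overlap = np.vdot(U @ psi1, U @ psi2)
print(abs(overlap))
```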
No, I’m pretty sure publishing sightings of law enforcement is legal in the US. Some traffic radio stations report on where police are using radar guns for example, and this is fully legal. Indeed, considering that mapping ICE sightings could be of academic/intellectual interest (and that it is actually perfectly reasonable for law-abiding US citizens to want to limit their time spent in close proximity to ICE agents) this is far more centrally “helping people get away with doing illegal things” (speeding) than robertzk’s project.
I didn’t learn anything from this. It looks like there are things to learn here, but you seem to have deliberately chosen a writing style that does not permit it.
That is pretty annoying.
I think its spread through rationalist-land originated at this post by Alice Maz: https://alicemaz.substack.com/p/you-can-just-do-stuff
Though by following the trail of links from Haiku’s comment one can find people saying similar things farther in the past.
Overall a nice insightful post, but recorded music is like upwards of a century old, so I don’t think the timing works out. I was in a dancing club at one point and we used recorded music and I think that requiring us to use live music would have prevented the club from existing.
Yeah, there’s definitely a few relevant things here:
Representation theory is relevant (in particular representations of cyclic groups, which is basically the circle arithmetic you’re talking about). Representation theory gives you matrices that use complex numbers, even for discrete finite groups. So number theory isn’t the only place where complex numbers poke their nose into discrete business.
There’s apparently a whole theory of Dirichlet series and Dirichlet convolution which is analogous to Fourier series and Fourier convolution. Complex numbers are the nicest way to do Fourier series, so it makes sense that they’re also the nicest way to do Dirichlet stuff.
I guess just in general complex numbers are the best field, and a lot of the math of number theory is turning things into vector spaces and linear operators and doing linear algebra on them. And if you’re picking a field for your vector space, what better choice than $\mathbb{C}$?
A Dirichlet series is a function of a variable $s$, and complex functions can have complex analysis done on them, and complex analysis is uniquely nice, so why not make $s$ a complex variable?
The post author seems to already know a lot of math, so I guess they’re looking for a deeper kind of answer.
Oh, cool, that’s great.
I was referring to the older style of sailboat design, like this one, where the sails are all controlled by a bunch of ropes. Single rotating airfoil sounds a lot simpler. One advantage of an airfoil over a kite is that it would allow the ship to gain some propulsion, even when travelling at an angle upwind, right?
Obstacles to this idea:
Hydrodynamic drag on many smaller hulls is larger than on one big one.
Ports cost money. It seems hard to make a port which unloads individual 1-container vessels faster than a conventional port can pull containers off a large vessel using cranes. As for just making more numerous & smaller ports, it probably ends up being much more expensive.
It’s an existing technology for large ships to fly kites to save fuel when the weather is favourable and the wind is blowing in the right direction. I think most ships still don’t do this, so I’m not sure if it’s currently economical, but it at least demonstrates that making a ton of tiny sailboats is not the only way to take advantage of the wind. Sailboats have a lot of moving parts, and maintenance on so many of them would be a nightmare.
Even though containers have taken over a lot of things, not all loads are containers even today, so we still need large ships for fuels and bulk cargo.
I’d be eager to hear about anything similar to this which is already discussed somewhere, if you know of anything.
I wrote a post in 2023 about somewhat similar ideas: https://www.lesswrong.com/posts/uDXRxF9tGqGX5bGT4/logical-share-splitting The market mechanism there is based on the formula
$P(A) + P(B) = P(A \wedge B) + P(A \vee B)$ from probability theory. In that post, shares aren’t expected to have continuous-valued payoffs, but merely continuous-valued prices representing binary payoffs with some chance of happening. So that’s a relevant difference that means the interpretation of $A \vee B$ is not quite the same (in particular, it doesn’t involve a max function, since there are no continuous values to take a max of), though there’s still a sense in which it’s as good as having your choice between $A$ and $B$.
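For binary payoffs, the pricing formula follows from a pointwise identity on indicator variables: $1_A + 1_B = 1_{A \wedge B} + 1_{A \vee B}$, so it holds for the expectations (prices) too. A minimal exhaustive check:

```python
import itertools

# Check 1_A + 1_B == 1_{A and B} + 1_{A or B} over all truth assignments.
# Taking expectations of both sides gives P(A) + P(B) = P(A^B) + P(AvB).
for a, b in itertools.product([0, 1], repeat=2):
    assert a + b == (a & b) + (a | b)
print("identity holds for all cases")
```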
The conventional physics way of explaining this is as follows:
One way of asking “what is the current state of the universe?” is to pick a Cauchy surface. This is just a “slice” of the entire universe at a given time. There is a lot of freedom in the choice of slice: in Minkowski space, for example, there are slices corresponding to every choice of rest frame, and many more besides those. We just need to make sure that no two points on the surface lie within each other’s light cones.
The information (about field values & derivatives) lying on any particular Cauchy surface is enough to predict the future and past from that surface. Pick any two Cauchy surfaces, and there’s a unitary operator mapping one to the other. This is the relativistic version of a time-evolution operator.
Some Cauchy surfaces are entirely later in time than other Cauchy surfaces. (Though some pairs of Cauchy surfaces are partially later and partially earlier than each other.) We’ll say that for Cauchy surfaces $S_1, S_2$, we have $S_1 \preceq S_2$ exactly when for all points $p_1 \in S_1$ and $p_2 \in S_2$, either $p_2$ is spacelike separated from $p_1$ or $p_2$ is in the future lightcone of $p_1$.
Let $\sigma$ be a function that measures the entropy on a given Cauchy surface. The second law of thermodynamics then says that if $S_1 \preceq S_2$ then $\sigma(S_1) \le \sigma(S_2)$.
Thing likely being subtweeted: https://www.lesswrong.com/posts/dHLdf8SB8oW5L27gg/on-fleshling-safety-a-debate-by-klurl-and-trapaucius
1a3orn can correct me if I’m wrong. You’re welcome, confused future readers.
I have read that some sequencing methods (nanopore) have a high error rate (comparing multiple reads can help correct this). Did you also spot-check some other genes that you have no reason to believe contain mutations to see if they look ok? Seeing a mutation in exactly the gene you expect is only damn strong evidence if there isn’t a sequencing error in every third gene.
EDIT: Looks like this was checked, nice: https://www.lesswrong.com/posts/Hds7xkLgYtm6qDGPS/how-i-learned-that-i-don-t-feel-companionate-love
Inspired partially by this post and partially by trying to think of simple test cases for a machine learning project I’m working on, here is a (not too hard, you should try answering it yourself) question: Let’s say we’ve observed $n$ trials of a Bernoulli random variable, and had a 1 outcome $k$ times (so $n - k$ were 0). Laplace’s rule of succession (uniform prior over success probability) says that we should estimate a probability of $\frac{k+1}{n+2}$ for the next trial being 1. The question is: What is the prior over bitstrings of length $n+1$ implied by Laplace’s rule of succession? In other words, can we convert the rule of succession formula into a probability distribution over bitstrings $s$ that record outcomes of trials?

Additional clarification of the problem:

Given any particular observation of $n$ trials, there will be two bitstrings that are consistent with it, where the last (unobserved) trial is 0 or 1 respectively. We can compute the 1 probability (which should equal the result from the rule of succession) as:

$$P(\text{next} = 1 \mid s_{1:n}) = \frac{p(s_{1:n}1)}{p(s_{1:n}0) + p(s_{1:n}1)}$$

where $s_{1:n}$ is the first $n$ bits of the string (corresponding to visible observations) and $w$ is the Hamming weight function (counts the number of 1s in a bitstring). Since this requires a normalization anyway, you can also just provide an energy function $E$ as your answer. The probability formula in this case is:

$$p(s) = \frac{e^{-E(s)}}{\sum_{s'} e^{-E(s')}}$$

If we just pick a uniform distribution over bitstrings, that doesn’t work: then the predicted probability of the next trial is always just $\frac{1}{2}$.
Answer:

The following energy function works:

$$E(s) = -\log\big(w(s)!\,(N - w(s))!\big)$$

where $N = n + 1$ is the length of the string. This can be checked by computing the probability as:

$$P(\text{next} = 1 \mid s_{1:n}) = \frac{e^{-E(s_{1:n}1)}}{e^{-E(s_{1:n}0)} + e^{-E(s_{1:n}1)}} = \frac{(k+1)!\,(n-k)!}{(k+1)!\,(n-k)! + k!\,(n-k+1)!} = \frac{k+1}{n+2}$$

where $k = w(s_{1:n})$. This energy function biases the distribution towards strings with more extreme ratios between counts of 0 and 1. We can think of it as countering the entropic effect of strings with an equal balance of 0 and 1 being the most prevalent.
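The check can also be run numerically; here is a short sketch (the choice of $n = 7$ is arbitrary):

```python
from math import factorial

# Check that the energy function E(s) = -log( w! * (N - w)! ), where w is
# the Hamming weight of the length-N bitstring s, reproduces Laplace's
# rule of succession P(next = 1) = (k + 1) / (n + 2).
def unnormalized_p(w, N):
    # exp(-E) for any string with Hamming weight w
    return factorial(w) * factorial(N - w)

n = 7  # observed trials; the full string has N = n + 1 bits
for k in range(n + 1):  # k = number of 1s among the observed trials
    p1 = unnormalized_p(k + 1, n + 1)  # last (unobserved) bit is 1
    p0 = unnormalized_p(k, n + 1)      # last bit is 0
    laplace = (k + 1) / (n + 2)
    assert abs(p1 / (p0 + p1) - laplace) < 1e-12
print("matches Laplace's rule for all k")
```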
Also tried this, and basically ended up with the same answer as commenter One.
Key idea is that we really only care about drawing 5 trials from this process. So we just have to find a probability distribution over 6 outcomes: a count of 1s for our 5 trials, ranging from 0 to 5. 10^6 datapoints is enough to kill a fair amount of noise by self-averaging, so I treated the fact that hiding a random trial has to reproduce the observed 4-trial distribution as a hard constraint. (It’s a linear constraint in the probabilities.) Then I did maximum-entropy optimization subject to that constraint. The output distribution in terms of 5-trial counts looked pretty symmetric and was heavier towards the extremes.
Another quick computation from these values yields the p(R | k) numbers asked for in the question: [0.11118619, 0.32422537, 0.49942029, 0.67519768, 0.88914787]
Registering now that my modal expectation is that the situation will mostly look the same in 2028 as it does today. (To give one example from AI 2027, scaling neuralese is going to be hard, and while I can imagine a specific set of changes that would make it possible, it would require changing some fairly fundamental things about model architecture which I can easily imagine taking 3 years to reach production. And neuralese is not the only roadblock to AGI.)
I think one of your general points is something like “slow is smooth, smooth is fast” and also “cooperative is smooth, smooth is fast”, both of which I agree with. But the whole “trauma” thing is too much like Bulverism for my taste.
I don’t have watertight arguments, but to try and state it cleanly:
During inference, a forwards pass of the neural net is computed repeatedly as each token is generated. Activation vectors propagate from one layer to the next.
Activation vectors are the main flow of information from earlier layers to later layers.
The attention mechanism also allows activation vectors from previous tokens to influence the current computation. But crucially, this communication happens between activations at the same attention layer, it doesn’t skip forwards or backwards in terms of layers.
Thus, the only flow of information from later layers to earlier layers is contained in the sequence of tokens produced by the model.
This is silly. Layer 1 for the 2nd token happens after layer 100 for the 1st token. There’s no reason why we shouldn’t be able to give layer 1 for the 2nd token as much information as it wants about any of the 1st token layers.
Advantages of using activations for communication:
Activations do contain more information of course.
During pre-training, token logits are optimized to put high probability on the observed next token, which constrains them a fair bit.
Activations are also continuous, so can encode continuous values and probabilities, along with discrete values. And they can be optimized by gradient descent to be more helpful.
Also:
I’m actually not certain that neuralese is technically inevitable. Yes, it’s almost certainly superior given that we assume away the problem of training a neuralese model in the first place (i.e. assume infinite compute budget). But without that assumption…
Basically, the way attention currently works makes it easy to parallelize across tokens during training (and context reading). This is why context reading is cheaper per token than producing text, and why training on such a huge amount of data is possible. Neuralese doesn’t have this property of being fast when the tokens are already supplied, because there is still this activation data that has to be filled in sequentially.
So, neuralese models will probably have to be trained on less data, and they will be less efficient at reading context. They are probably about the same efficiency for generating text (at least if the non-neuralese competitor doesn’t get to use speculative generation with a cheaper model).
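The parallelism point can be illustrated with a toy sketch (shapes and weights are arbitrary illustrative choices, not any particular architecture): with causal attention, every position of a known sequence is computed in one batched step, while a recurrent "neuralese" channel must fill in its hidden states one position at a time even when all the input tokens are already supplied:

```python
import numpy as np

rng = np.random.default_rng(0)
T, d = 6, 4                      # sequence length, embedding dim
x = rng.normal(size=(T, d))      # known token embeddings (e.g. training data)
Wq, Wk, Wv = (rng.normal(size=(d, d)) for _ in range(3))

# Causal self-attention: all positions computed in one parallel batch,
# since each position only needs the (already-known) input embeddings.
q, k, v = x @ Wq, x @ Wk, x @ Wv
scores = q @ k.T / np.sqrt(d)
scores[np.triu_indices(T, 1)] = -np.inf           # mask future positions
weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
weights /= weights.sum(axis=-1, keepdims=True)
attn_out = weights @ v

# Recurrent "neuralese" channel: position t's state depends on position
# t-1's state, so this loop is inherently sequential even though every
# input token is already available.
Wh = rng.normal(size=(d, d)) * 0.1
h = np.zeros(d)
hs = []
for t in range(T):
    h = np.tanh(x[t] + h @ Wh)   # must wait for the previous h
    hs.append(h)
print(attn_out.shape, len(hs))
```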
I guess models that have neuralese “turned off” during pre-training and context reading could still be comparably efficient. But then all the optimization of the neuralese encoding beyond just “use the last layer output” has to happen during RL. Due to its low cost, this is probably how the first usage of neuralese we see in the wild will work.
The other issue, which would only be a problem during training, is that gradients have to backpropagate through the neuralese vectors. This could result in the usual gradient stability issues we see in the training of RNNs that occur because the neural net effectively becomes incredibly deep. I think the field has solutions for this, but it’s another big complication to deal with when you try to scale the models.
Anyway, I think it’s probably going to happen eventually, especially if the “smaller, higher-quality training dataset” trend persists, but it might take longer than people think.
See also Karpathy’s claim that models will be split into a part that focuses on reasoning but has relatively little memorized and a part that focuses on memorization. Karpathy’s assumption is that the reasoning part could be quite small. So if that’s true, then probably the reasoning part gets neuralese but the memorization part doesn’t, and the fact that the reasoning part is small makes the extra costs of neuralese more tolerable.