# DaemonicSigil

Karma: 1,340
• In this context, we’re imitating some probability distribution, and the perturbation means we’re slightly adjusting the probabilities, making some of them higher and some of them lower. The adjustment is small in a multiplicative sense, not an additive one, hence the use of exponentials. Just as a silly example, maybe I’m training on MNIST digits, but I want the 2’s to make up 30% of the distribution rather than just 10%. The math described above would let me train a GAN that generates 2’s 30% of the time.

I’m not sure what is meant by “the difference from a gradient in SGD”, so I’d need more information to say whether it is different from a perturbation or not. But probably it’s different: perturbations in the above sense are perturbations in the probability distribution over the training data.

• # Perturbation Theory in Machine Learning

In quantum mechanics there is this idea of perturbation theory, where a Hamiltonian $H$ is perturbed by some change $\Delta H$ to become $H + \Delta H$. As long as the perturbation is small, we can use the technique of perturbation theory to find out facts about the perturbed Hamiltonian, like what its eigenvalues should be.

An interesting question is if we can also do perturbation theory in machine learning. Suppose I am training a GAN, a diffuser, or some other machine learning technique that matches an empirical distribution. We’ll use a statistical physics setup to say that the empirical distribution is given by:

$$p(x) = \frac{1}{Z} e^{-H(x)}$$

Note that we may or may not have an explicit formula for $H(x)$. The distribution of the perturbed Hamiltonian is given by:

$$p'(x) = \frac{1}{Z'} e^{-H(x) - \Delta H(x)}$$

The loss function of the network will look something like:

$$L(\theta) = \frac{1}{N} \sum_{i=1}^N \ell(x_i, \theta)$$

Where $\theta$ are the network’s parameters, and $\ell$ is the per-sample loss function which will depend on what kind of model we’re training. Now suppose we’d like to perturb the Hamiltonian. We’ll assume that we have an explicit formula for $\Delta H(x)$. Then the loss can be easily modified as follows:

$$L'(\theta) = \frac{1}{N} \sum_{i=1}^N e^{-\Delta H(x_i)}\, \ell(x_i, \theta)$$

If the perturbation is too large, then the exponential causes the loss to be dominated by a few outliers, which is bad. But if the perturbation isn’t too large, then we can perturb the empirical distribution by a small amount in a desired direction.

One other thing to consider is that the exponential will generally increase variance in the magnitude of the gradient. To partially deal with this, we can define an adjusted batch size in terms of the weights $w_i = e^{-\Delta H(x_i)}$ as:

$$B_{\mathrm{adj}} = \frac{\left(\sum_i w_i\right)^2}{\sum_i w_i^2}$$

Then by varying the actual number of samples we put into a batch, we can try to maintain a more or less constant adjusted batch size. One way to do this is to define an error variable, err = 0. At each step, we add a constant B_avg to the error. Then we add samples to the batch until adding one more sample would cause the adjusted batch size to exceed err. Subtract the adjusted batch size from err, train on the batch, and repeat. The error carries over from one step to the next, and so the adjusted batch sizes should average to B_avg.
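A minimal numpy sketch of the batching scheme described above (assumptions: the effective-sample-size formula $(\sum_i w_i)^2 / \sum_i w_i^2$ is used for the adjusted batch size, and `delta_H` is a made-up illustrative perturbation; in actual training the weights $w_i = e^{-\Delta H(x_i)}$ would multiply the per-sample losses):

```python
import numpy as np

rng = np.random.default_rng(0)

def delta_H(x):
    # Hypothetical perturbation (illustration only): tilt the
    # distribution along the first coordinate.
    return 0.5 * x[0]

def adjusted_batch_size(weights):
    # Effective sample size of the importance weights (my assumed
    # formula for the "adjusted batch size").
    w = np.asarray(weights)
    return w.sum() ** 2 / (w ** 2).sum()

def make_batches(stream, B_avg, n_batches):
    # err carries over between steps so adjusted sizes average to B_avg.
    err = 0.0
    batches = []
    for _ in range(n_batches):
        err += B_avg
        batch, weights = [], []
        while True:
            x = next(stream)
            w = float(np.exp(-delta_H(x)))
            # Stop before the adjusted batch size would exceed err.
            if batch and adjusted_batch_size(weights + [w]) > err:
                break  # this sketch discards x; a real version would buffer it
            batch.append(x)
            weights.append(w)
        err -= adjusted_batch_size(weights)
        batches.append((batch, np.array(weights)))
    return batches

def samples():
    while True:
        yield rng.normal(size=2)

batches = make_batches(samples(), B_avg=8.0, n_batches=50)
sizes = [adjusted_batch_size(w) for _, w in batches]
print(np.mean(sizes))  # stays near B_avg = 8
```

The discarded sample at each batch boundary is a simplification; buffering it for the next batch would avoid biasing the stream.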

• I don’t think we should consider the centroid important in describing the LLM’s “ontology”. In my view, the centroid just points in the direction of highest density of words in the LLM’s space of concepts. Let me explain:

The reason that embeddings are spread out is to allow the model to distinguish between words. So intuitively, tokens with largeish dot product between them correspond to similar words. Distinguishability of tokens is a limited resource, so the training process should generally result in a distribution of tokens that uses this resource in an efficient way to encode the information needed to predict text. Consider a language with 100 words for snow. Probably these all end up with similar token vectors, with large dot products between them. Exactly which word for snow someone writes is probably not too important for predicting text. So the training process makes those tokens relatively less distinguishable from each other. But the fact that there are 100 tokens all pointing in a similar direction means that the centroid gets shifted in that direction.

Probably you can see where this is going now. The centroid gets shifted in directions where there are many tokens that the network considers to be all similar in meaning, directions where human language has allocated a lot of words, while the network considers the differences in shades of meaning between these words to be relatively minor.
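A toy numpy illustration of that shift (the dimensions, cluster size, and scale factors are all made up for the example):

```python
import numpy as np

rng = np.random.default_rng(0)
d = 64

# 900 generic tokens: random directions, so their centroid is near zero.
ordinary = rng.normal(size=(900, d))

# 100 hypothetical "words for snow": clustered around one shared direction.
snow_dir = rng.normal(size=d)
snow_dir /= np.linalg.norm(snow_dir)
snow = 5.0 * snow_dir + 0.5 * rng.normal(size=(100, d))

tokens = np.vstack([ordinary, snow])
centroid = tokens.mean(axis=0)

# The centroid is dragged toward the direction of the dense cluster.
cos_sim = centroid @ snow_dir / np.linalg.norm(centroid)
print(cos_sim)
```

Even though the cluster is only 10% of the vocabulary, the centroid ends up pointing mostly along the cluster's shared direction.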

• Mathematically, convergence just means that the distance to some limit point goes to 0 in the limit. There’s no implication that the limit point has to be unique, or optimal. Eg. in the case of Newton fractals, there are multiple roots and the trajectory converges to one of the roots, but which one it converges to depends on the starting point of the trajectory. Once the weight updates become small enough, we should say the net has converged, regardless of whether it achieved the “optimal” loss or not.
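The Newton fractal example is easy to make concrete: Newton's method on $z^3 - 1 = 0$ converges from almost every starting point, but which of the three roots it converges to depends on where you start (the specific starting points below are arbitrary choices):

```python
import cmath

# Newton's method on z^3 - 1 = 0. There are three roots, and which one
# the iteration converges to depends on the starting point.
def newton_root(z, steps=60):
    for _ in range(steps):
        z = z - (z**3 - 1) / (3 * z**2)
    return z

roots = [cmath.exp(2j * cmath.pi * k / 3) for k in range(3)]

a = newton_root(1.5)          # converges to the real root 1
b = newton_root(-0.4 + 0.8j)  # converges to a complex root instead
print(a, b)
```

Both runs converge (the updates shrink to zero), but to different limit points; neither limit is more "optimal" than the other.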

If even “converged” is not good enough, I’m not sure what one could say instead. Probably the real problem in such cases is people being doofuses, and probably they will continue being doofuses no matter what word we force them to use.

• Relatedly, you bring up adversarial examples in a way that suggests that you think of them as defects of a primitive optimization paradigm, but it turns out that adversarial examples often correspond to predictively useful features that the network is actively using for classification, despite those features not being robust to pixel-level perturbations that humans don’t notice—which I guess you could characterize as “weird squiggles” from our perspective, but the etiology of the squiggles presents a much more optimistic story about fixing the problem with adversarial training than if you thought “squiggles” were an inevitable consequence of using conventional ML techniques.

Train two distinct classifier neural nets on an image dataset. Set aside one as the “reference net”. The other net will be the “target net”. Now perturb the images so that they look the same to humans, and also get classified the same by the reference net. So presumably both the features humans use to classify, and the squiggly features that neural nets use should be mostly unchanged. Under these constraints on the perturbation, I bet that it will still be possible to perturb images to produce adversarial examples for the target net.

Literally. I will bet money that I can still produce adversarial examples under such constraints if anyone wants to take me up on it.
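As a deliberately simplified cartoon of why such constrained perturbations can exist (linear decision functions standing in for the two nets; the actual bet would involve real networks and something like constrained PGD), one can always move along directions the target net is sensitive to but the reference net ignores:

```python
import numpy as np

rng = np.random.default_rng(0)
d = 100

# Toy stand-ins for the two classifiers: linear score functions.
w_ref = rng.normal(size=d)  # "reference net"
w_tgt = rng.normal(size=d)  # "target net"

x = rng.normal(size=d)

# Perturb along the component of w_tgt orthogonal to w_ref: this moves
# the target net's score while leaving the reference net's score
# exactly unchanged.
direction = w_tgt - (w_tgt @ w_ref) / (w_ref @ w_ref) * w_ref
direction /= np.linalg.norm(direction)

s_t = w_tgt @ x
g = w_tgt @ direction        # rate at which the target score moves
step = 1.1 * s_t / g         # just enough to flip the target's sign
x_adv = x - step * direction

print(np.sign(w_tgt @ x_adv) != np.sign(s_t))   # target classification flips
print(np.isclose(w_ref @ x_adv, w_ref @ x))     # reference score unchanged
print(np.linalg.norm(x_adv - x) / np.linalg.norm(x))  # relatively small move
```

In high dimensions two independently trained classifiers almost never rely on exactly the same directions, which is what leaves room for a perturbation satisfying both constraints.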

• ## You Can Just Put an Endpoint Penalty on Your Wasserstein GAN

When training a Wasserstein GAN, there is a very important constraint that the discriminator network must be a Lipschitz-continuous function. Roughly we can think of this as saying that the output of the function can’t change too fast with respect to position, and this change must be bounded by some constant $K$. If the discriminator function is given by $f(x)$ then we can write the Lipschitz condition for the discriminator as:

$$|f(x_1) - f(x_2)| \leq K \|x_1 - x_2\|$$

Usually this is implemented as a gradient penalty. People will take a gradient (higher order, since the loss already has a gradient in it) of this loss (for $K = 1$):

$$L_{\mathrm{GP}} = \lambda\, \mathbb{E}_{\hat{x}}\left[\left(\|\nabla_{\hat{x}} f(\hat{x})\|_2 - 1\right)^2\right]$$

In this expression $\hat{x}$ is sampled as $\hat{x} = \epsilon x_{\mathrm{real}} + (1 - \epsilon) x_{\mathrm{gen}}$ with $\epsilon \sim U(0, 1)$, a random mixture of a real and a generated data point.

But this is complicated to implement, involving a higher order gradient. It turns out we can also just impose the Lipschitz condition directly, via the following penalty:

$$L_{\mathrm{EP}} = \lambda\, \mathbb{E}_{x_1, x_2}\left[\|x_1 - x_2\| \max\left(0,\; \frac{|f(x_1) - f(x_2)|}{\|x_1 - x_2\| + \epsilon} - K\right)^2\right]$$

Except to prevent issues where we’re maybe sometimes dividing by zero, we throw in an $\epsilon$, and a reweighting factor of $\|x_1 - x_2\|$ (not sure if that is fully necessary, but the intuition is that making sure the Lipschitz condition is enforced for points at large separation is the most important thing).

For the overall loss, we compare all pairwise distances between real data, generated data, and a random mixture of them. Probably it improves things to add one or two more random mixtures in, but I’m not sure and haven’t tried it.

In any case, this seems to work decently well (tried on MNIST), so it might be a simpler alternative to gradient penalty. I also used instance noise, which, as pointed out here, is amazingly good for preventing mode collapse and just generally makes training easier. So yeah, instance noise is great and you should use it. And if you really don’t want to figure out how to do higher order gradients in PyTorch for your WGAN, you’ve still got options.
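A small numpy sketch of the endpoint penalty itself (my reading of the setup above; the exact $\epsilon$ placement and reweighting are reconstructions, and the toy linear "discriminators" are made up to show the penalty behaving as intended):

```python
import numpy as np

def endpoint_penalty(f, x1, x2, K=1.0, eps=1e-6):
    # Penalize pairs whose slope |f(x1)-f(x2)| / ||x1-x2|| exceeds K,
    # with eps guarding the division and the separation distance as a
    # reweighting factor.
    dist = np.linalg.norm(x1 - x2, axis=-1)
    slope = np.abs(f(x1) - f(x2)) / (dist + eps)
    excess = np.maximum(0.0, slope - K)
    return float(np.mean(dist * excess**2))

rng = np.random.default_rng(0)
x_real = rng.normal(size=(64, 2))
x_gen = rng.normal(size=(64, 2))

f_steep = lambda x: x @ np.array([3.0, 0.0])   # Lipschitz constant 3
f_gentle = lambda x: x @ np.array([0.5, 0.0])  # Lipschitz constant 0.5

print(endpoint_penalty(f_steep, x_real, x_gen))   # positive: constraint violated
print(endpoint_penalty(f_gentle, x_real, x_gen))  # 0.0: constraint satisfied
```

Here `f_steep` violates the $K = 1$ constraint and picks up a positive penalty, while `f_gentle` satisfies it and gets exactly zero; no higher-order gradients are involved.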

• Yes. I think Beff was speaking imprecisely there. In order to be consistent with what he’s written elsewhere, he should have said something like: “maximizing the rate of free energy dissipation”.

• C: You heard it, e/acc isn’t about maximizing entropy [no shit?!]

B: No, it’s about maximizing the free energy

C: So e/acc should want to collapse the false vacuum?

Holy mother of bad faith. Rationalists/lesswrongers have a problem with saying obviously false things, and this is one of those.

It’s in line with what seems like Connor’s debate strategy—make your opponent define their views and their terminal goal in words, and then pick apart that goal by pushing it to the maximum. Embarrassing.

I agree with you that Connor performed very poorly in this debate. But this one is actually fair game. If you look at Beff’s writings about “thermodynamic god” and these kinds of things, he talks a lot about how these ideas are supported by physics and the Crooks fluctuation theorem. Normally in a debate if someone says they value X, you interpret that as “I value X, but other things can also be valuable and there might be edge cases where X is bad and I’m reasonable and will make exceptions for those.”

But physics doesn’t have a concept of “reasonable”. The ratio between the forward and backward probabilities in the Crooks fluctuation theorem is exponential in the amount of entropy produced. It’s not exponential in the amount of entropy produced plus some correction terms to add in reasonable exceptions for edge cases. Given how much Beff has emphasized that his ideas originated in physics, I think it’s reasonable to take him at his word and assume that he really is talking about the thing in the exponent of the Crooks fluctuation theorem. And then the question of “so hey, it sure does look like collapsing the false vacuum would dissipate an absolutely huge amount of free energy” is a very reasonable one to ask.
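For reference, the theorem’s actual content: for a process that produces entropy $\sigma$, the Crooks fluctuation theorem states

$$\frac{P_{\mathrm{forward}}(+\sigma)}{P_{\mathrm{reverse}}(-\sigma)} = e^{\sigma / k_B}$$

and the exponent really is just the entropy produced, with no extra correction terms.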

# Coalescer Models

17 Jan 2024 6:39 UTC
16 points
• If you care about the heat coming out on the hot side rather than the heat going in on the cold side (i.e. the application is heat pump rather than refrigerator), then the theoretical limit is always greater than 1, since the work done gets added onto the heat absorbed:

$$\mathrm{COP}_{\mathrm{heat}} = \frac{Q_h}{W} = \frac{Q_c + W}{W} = \mathrm{COP}_{\mathrm{cool}} + 1 > 1$$

Cooling performance can absolutely be less than 1, and often is for very cold temperatures.

• a few kW of resistive loss

Is this already accounting for the energy penalty of cooling at cryogenic temperatures? 20K to room temperature is more than a factor of 10. You pay the energy cost once in resistive losses and 10 times in pumping the generated entropy out of the cold bath. I guess the electricity bill is not a huge constraint on these things, but it could mean a higher cost for cooling equipment?
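A quick back-of-envelope check of that factor, using the ideal (Carnot) coefficient of performance and assuming a 300 K ambient:

```python
# Carnot limit for pumping heat from 20 K up to room temperature:
# each watt dissipated at 20 K costs many watts of work to remove.
T_cold = 20.0   # K (cryogenic bath)
T_hot = 300.0   # K (room temperature, assumed)

cop_cooling = T_cold / (T_hot - T_cold)  # ideal refrigerator COP
work_per_watt = 1.0 / cop_cooling        # W of work per W of heat removed

print(cop_cooling)    # 0.0714...: well below 1
print(work_per_watt)  # 14.0: "more than a factor of 10"
```

Real cryocoolers fall well short of the Carnot limit, so the practical multiplier is even larger than 14.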

# Embedded Agents are Quines

12 Dec 2023 4:57 UTC
11 points
• In general, the factors that govern the macroscopic strength of materials can often have surprisingly little to do with the strength of the bonds holding them together. A big part of a material’s tensile strength is down to whether it forms cracks and how those cracks propagate. I predict many LWers would enjoy reading The New Science of Strong Materials which is an excellent introduction to materials science and its history. (Cellulose is mentioned, and the most frequent complaint about it as an engineering material lies in its tendency to absorb water.)

It’s actually not clear to me why Yudkowsky thinks that ridiculously high macroscopic physical strength is so important for establishing an independent nanotech economy. Is he imagining that trees will be out-competed by solar collectors rising up on stalks of diamond taller than the tallest tree trunk? But the trees themselves can be consumed for energy, and to achieve this, nanobots need only reach them on the ground. Once the forest has been eaten, a solar collector lying flat on the ground works just as well. One legitimate application for covalently bonded structures is operating at very high temperatures, which would cause ordinary proteins to denature. In those cases the actual strength of the individual bonds does matter more.

• Another effect that is very important in determining how proteins fold is the fact that they’re dissolved in liquid water, and so hydrophilic parts of the protein want to be on the surface, while hydrophobic parts want to be on the inside, near other hydrophobic parts. This is largely an entropic force/effect.

Some other things that are true:

• 100% of the bonds in hydrogen gas are covalent.

• Most of the fundamental particles in a water molecule are held together by the strong nuclear force, which is a much stronger binding than covalent bonds.

• If you pull on a protein and stretch it apart, it looks like a long chain (maybe with a few crosslinks).

• This is a non-zero amount of structure, but it looks nothing like the fully folded protein.

• If we ask how the chain with the crosslinks is held together, the answer is covalent bonds that were either in the amino acids originally, or were formed when the ribosome assembled them into a chain, (or in the case of the disulfide crosslinks, were formed during the folding process).

• But if we ask where the rest of the protein’s structure came from, then the answer is hydrogen bonds and hydrophobic/hydrophilic forces.

• The diesel just drives the “generator” that then powers electric motors that drive the wheels.

That’s exactly what “electric transmission” means, no?

• Wait, diesel-electric just means that they use an electric transmission, right? So 100% of the energy driving the locomotive still ultimately comes from burning diesel. IIRC the carbon footprint of electric cars is dependent on how your local power is generated. To be worse than internal combustion, there needs to be a high fraction of coal in the mix. Even the power plants that burn stuff are generally more efficient than internal combustion engines because they’re larger so less heat is lost to conduction and they also burn hotter. So the actual reason for higher emissions would just be that coal has more carbon in it per joule than gasoline does. That’s all just going off of memory, please correct me if I’m wrong.

It actually seems like a diesel-electric fleet would be almost ideal for converting rail lines to electric. If upgrading a locomotive to have brushes and some associated power electronics is not too expensive, then you can get a hybrid that will still operate as a normal diesel locomotive on lines that haven’t been electrified yet, but will operate electrically on lines that have been, saving on fuel costs.

• Good point, I had briefly thought of this when answering, and it was the reason I mentioned constant factors in my comment. However, on closer inspection:

1. The “constant” factor is actually only nearly constant.

2. It turns out to be bigger than 10.

Explanation:

$10^{-9}$ is about 6 sigma. To generalize, let’s say we have $n$ sigma, where $n$ is some decently large number so that the position-only Boltzmann distribution gives an extremely tiny probability of error.

So we have the following probability of error for the position-only Boltzmann distribution:

$$P_1 = \int_n^\infty \frac{1}{\sqrt{2\pi}} e^{-x^2/2}\, dx$$

Our toy model for this scenario is that rather than just sampling position, we jointly sample position and momentum, and then compute the amplitude. Equivalently, we sample position twice, and add it in quadrature to get amplitude. This gives a probability of:

$$P_2 = \iint_{x^2 + p^2 > n^2} \frac{1}{2\pi} e^{-(x^2 + p^2)/2}\, dx\, dp = e^{-n^2/2}$$

Since we took $n$ to be decently large, we can approximate the integrand in our expression for $P_1$ with an exponential distribution (basically, we Taylor expand the exponent):

$$P_1 \approx \frac{e^{-n^2/2}}{\sqrt{2\pi}} \int_0^\infty e^{-nu}\, du = \frac{1}{\sqrt{2\pi}\, n}\, e^{-n^2/2}$$

Result: $P_2$ is larger than $P_1$ by a factor of $\sqrt{2\pi}\, n$. While the $\sqrt{2\pi}$ is constant, $n$ grows (albeit very slowly) as the probability of error shrinks. Hence “nearly constant”. For this problem, where $n \approx 6$, we get a factor of about 15, so a probability of about $1.5 \times 10^{-8}$ per try.

Why is this worth thinking about? If we just sample at a single point in time, and consider only the position at that time, then we get the original $10^{-9}$ per try. This is wrong because momentum gets to oscillate and turn into displacement, as you’ve already pointed out. On the other hand, if we remember the equipartition theorem, then we might reason that since the variance of amplitude is twice the variance of position, the probability of error is massively amplified. We don’t have to naturally get a 6 sigma displacement. We only need to get roughly a $6/\sqrt{2} \approx 4.2$ sigma displacement and wait for it to rotate into place. This is wrong because we’re dealing with rare events here, and for the above scenario to work out, we actually need to simultaneously get displacement and momentum, both of which are rare and independent.

So it’s quite interesting that the actual answer is in between, and comes, roughly speaking, from rotating the tail of the distribution around by a full circle of circumference $2\pi n$.
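A quick numerical sanity check of the ratio at $n = 6$, using the exact Gaussian tail rather than the Taylor-expanded approximation:

```python
import math

# Numerical check of the ratio P2 / P1 at n = 6 sigma.
n = 6.0

# P1: one-sided Gaussian tail (position-only Boltzmann distribution).
P1 = 0.5 * math.erfc(n / math.sqrt(2))

# P2: probability that the amplitude sqrt(x^2 + p^2) of two independent
# standard normals exceeds n (the radial integral is exact here).
P2 = math.exp(-n**2 / 2)

print(P1)                          # about 1e-9
print(P2 / P1)                     # about 15.4
print(math.sqrt(2 * math.pi) * n)  # 15.04, the approximate factor
```

The exact ratio (about 15.4) sits slightly above the $\sqrt{2\pi}\, n \approx 15.04$ approximation, as expected from dropping higher-order terms in the tail expansion.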

Anyway, very cool and interesting question! Thanks for sharing it.