I don’t have any immediate plans to do something like this again, but I’ll make use of your offer if I end up doing another challenge. Thanks!
joseph_c
Results of “Experiment on Bernoulli processes”
I think the problem here is the assumption that there is only one AI company. If there are multiple AI companies and they don’t form a trust, then they need to bid against each other to acquire safety researchers, right? This is like in economics: if you are the only person selling bread, you can sell it for just under its value to any given customer, but if there are multiple people selling bread, you have to sell it for just under your competitors’ prices.
When generating, we will sample uniformly, which requires … bits to describe. This gives the loss …
You should be using an MSE between the uniform distribution and the batch mean of the probabilities, instead of a KL divergence in the loss. The batch mean is only an estimate of what you truly want, which is the mean over the entire dataset (or perhaps over all possible images, but there’s not much you can do about that). If you directly substitute it for the dataset mean in the KL divergence, the resulting gradients are not unbiased estimators of the correct gradients. On the other hand, if you use an MSE loss instead, the gradients are unbiased estimators of the correct gradients for the MSE. In the limit as the dataset marginals approach the uniform distribution, the gradients of the KL divergence become parallel to the gradients of the MSE, so it’s okay to use an MSE instead.
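For concreteness, here is a minimal PyTorch sketch of the two penalties being compared. The function and variable names are mine, not from the post; I’m assuming `probs` is the (batch, n) tensor of selection probabilities.

```python
import torch

def uniformity_losses(probs: torch.Tensor):
    """Compare the two uniformity penalties on selection probabilities
    `probs` of shape (batch, n)."""
    n = probs.shape[1]
    batch_mean = probs.mean(dim=0)                   # noisy estimate of the dataset marginal
    uniform = torch.full_like(batch_mean, 1.0 / n)

    # KL(batch_mean || uniform), as in the original loss
    kl_loss = (batch_mean * (batch_mean.clamp_min(1e-12) / uniform).log()).sum()

    # squared error against uniform, the replacement suggested above
    mse_loss = ((batch_mean - uniform) ** 2).sum()

    return kl_loss, mse_loss
```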
Right now in your code, you only calculate reconstruction error gradients for the very last step.
```python
if random.random() > delta:
    loss = loss + (probs * err).sum(dim=1).mean()
    break
```

Pragmatically, it is more efficient to calculate reconstruction error gradients at every step and just weight by the probability of being the final image:

```python
loss = loss + (1 - delta) * (probs * err).sum(dim=1).mean()
if random.random() > delta:
    break
```
Although not mentioned in Yang’s paper, we can instead select images proportional to …
This gives the loss …. If we want an infinite-depth model, we can choose to sometimes halt, but usually sample another image with probability δ (for ‘discount factor’). Also, as the depth increases, the images should become more similar to each other, so … should increase exponentially to compensate. Empirically, I found … to give decent results.
I think you should choose the prior variance σ² to match the sample variance, over the batch, between the closest choice and the target. This is because a good model should match both the mean and the variance of the ground truth. The ground truth is that, when you encode an image, you choose the candidate that has the least reconstruction error. The probabilities can be interpreted as conditional probabilities that you chose the right candidate for the encoding, where each candidate has a Gaussian prior for being the “right” encoding with that candidate as the mean and variance σ². The variance of the prior for the candidate that is actually chosen should match the variance it sees in the real world. Hence, my recommendation for σ².
(You should weight the MSE loss by … as well.)
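As a rough sketch of that suggestion (the tensor shapes, the use of the closest candidate’s residual, and the softmax form of the probabilities are all my assumptions, not anything from the original post):

```python
import torch

def choose_sigma_sq(candidates: torch.Tensor, target: torch.Tensor) -> torch.Tensor:
    """candidates: (batch, n, d) flattened candidate images
    target:     (batch, d)    flattened target images
    Returns the batch sample variance of the residual between each target
    and its closest candidate, to be used as the prior variance sigma^2."""
    err = ((candidates - target.unsqueeze(1)) ** 2).mean(dim=2)         # (batch, n)
    closest = err.argmin(dim=1)                                         # (batch,)
    residual = candidates[torch.arange(len(target)), closest] - target  # (batch, d)
    return residual.var()

def selection_probs(err: torch.Tensor, sigma_sq: torch.Tensor) -> torch.Tensor:
    """Gaussian-prior reading of the probabilities: p_i ∝ exp(-err_i / (2 sigma^2))."""
    return torch.softmax(-err / (2 * sigma_sq), dim=1)
```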
You’re mostly right. The other solvers have given pretty much identical distributions.
Some of your distributions are worse than others. If I run 100,000,000 experiments and calculate the frequencies, some of you will be further off at the fourth decimal place.
The market doesn’t have that kind of precision, and even if it did, I wouldn’t change the resolution criterion. But I can still score you guys myself later on.
I do agree that I should have given far fewer public experiments. Then it would have been a better test of priors.
It’s asking, “If I draw a histogram of the frequency of R in the fifth trial, with buckets corresponding to the number of Rs in the first four trials, what will the heights of the bars be?”
We are not doing any more experiments. All the experiments have already been done in the 1,000,000 provided experiments. I’ve just left out the fifth trial from these experiments.
This is almost the same question as, “If we do experiment 1000001 and see k Rs in the first four trials, what credence do you assign to the fifth trial being R?”, but not quite. Your goal is to predict the marginal frequencies for the experiments I have actually conducted, not any idealized “next experiment”. Because 1,000,000 experiments is so many, the two should be close, but they are not quite the same. The actual marginal frequencies will have some noise, for example.
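In code, the computation being asked for looks roughly like this (the array layout is hypothetical; the random stand-in is only there so the snippet runs):

```python
import numpy as np

# Hypothetical layout: one row per experiment, columns are the five trials,
# 1 meaning the trial came up R.  Replace the stand-in with the real data.
trials = np.random.randint(0, 2, size=(1_000_000, 5))

k = trials[:, :4].sum(axis=1)   # number of Rs in the first four trials
fifth = trials[:, 4]            # whether the fifth trial was R

# Bar heights: frequency of R on the fifth trial, bucketed by k = 0..4.
for count in range(5):
    mask = k == count
    print(count, fifth[mask].mean() if mask.any() else float("nan"))
```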
I hope this helps! If you need more explanation, feel free to ask.
Correct, they are not equivalent. The second statement is a consequence of the first. I made this consequence explicit to justify my choice later on to bucket by the number of Rs but not their order.
The first statement, though, is also true. It’s your full guarantee.
Experiment: Test your priors on Bernoulli processes.
Is this inspired by the recent HSBC and IBM paper about using quantum computers to price bonds? https://arxiv.org/abs/2509.17715v1
I haven’t read it myself, but someone who knows much more quantum mechanics than I do mentioned it to me.
I agree. I think real analysis should really take a more topological approach to limits and continuity. In a topology classroom, a limit in the real numbers would instead be defined as “every open ball around your limit point contains all of the elements of the sequence past a certain index”, which is much the same as your description of Terry Tao’s “ε-close” and “eventually ε-close”. Likewise, a continuous function would be defined as: “For every open ball around f(x) in the range, there is an open ball around x in the domain whose points get mapped inside the range’s ball.” The whole ε-δ definition obscures what is really going on with a bunch of mathematical jargon.
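For reference, the two phrasings of continuity at a point x₀ side by side (standard definitions, nothing beyond what is described above):

```latex
% epsilon-delta phrasing
\forall \varepsilon > 0 \;\exists \delta > 0 \;\forall x:\;
  |x - x_0| < \delta \implies |f(x) - f(x_0)| < \varepsilon

% open-ball phrasing of the same statement
\forall \varepsilon > 0 \;\exists \delta > 0:\;
  f\bigl(B_\delta(x_0)\bigr) \subseteq B_\varepsilon\bigl(f(x_0)\bigr)
```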
For Linux users on US Keyboards, you might want to try making Caps Lock the multi key (also called the compose key). On Cinnamon this can be done by going to Keyboard > Layouts > Options… > Position of Compose key, and other desktop environments probably have similar settings.
This lets me type umlauts (ä, ü, ö), foreign currency signs (£, €, ¥), copyright/trademark (©, ™), and a bunch of other stuff. For example, “ü” is made by typing Compose, u, and " in sequence. I also added the line
```
<Multi_key> <backslash> : "λ"
```

to my ~/.XCompose file so that I can type λ efficiently; this is useful when writing Lisp code.
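For anyone copying this: a custom ~/.XCompose replaces the system table, so you usually want an include line to keep the default sequences. Something like:

```
include "%L"

<Multi_key> <backslash> : "λ"
```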
Why the Architecture of LLMs Makes Them Bad at Deep Thinking: They’re Too Wide
GPT-3 is 96 layers deep (where each layer is only a few “operations”), but 49,152 “neurons” wide at the widest. This is an insanely wide, very shallow network. This is for good reasons: wide networks are easier to run efficiently on GPUs, and apparently deep networks are hard to train.
I don’t find this argument compelling, because the human brain is much wider and possibly shallower than GPT-3. Humans have a conscious reaction time of about 200 milliseconds, while neurons take about 1ms to influence their neighbors, meaning an upper bound on the depth of a conscious reaction is 200 neurons.
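Spelling out the arithmetic behind that bound:

```latex
\text{depth} \;\lesssim\; \frac{200\ \text{ms conscious reaction time}}{1\ \text{ms per neuron-to-neuron step}} \;=\; 200\ \text{sequential neurons}
```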
Thanks to Hilbert’s list, a lot of progress was made toward formalising proofs, logic, consistency and other similar concepts.
Hmmm… I don’t think this accurately describes the era of mathematics starting around the 1920s. In fact, I would argue that the correct era would be about 1910-1937, starting with Russell and Whitehead’s Principia Mathematica and ending with the proof that the lambda calculus is exactly as powerful as the Turing machine.
This era was focused on applying logic to computation. It saw the development of type theory and the foundations of computation. Some aspects, like the halting problem, were related to logical consistency, but I think the more important breakthroughs had to do with formalizing computation.
So, I had a hypothesis last night that training on a different scoring rule might solve this problem (because it could encourage uncertain probabilities to be lower, and thus it would be easier to filter them out without making short tokens undeservedly more likely).
I ended up forking your code, and this morning trained an LLM on the Shakespeare dataset using the α-ReLU loss (from “Speeding Up Entmax”). The α-ReLU loss is a proper scoring rule based off of the Tsallis entropy.
My results were the following:
| Letter | Score Rule | Top-K | Control | Test T=1.0 | Test T=0.8 | Test T=0.1 |
|---|---|---|---|---|---|---|
| c | Cross-entropy | 200 | 3.09% | 5.29% | 6.64% | 7.26% |
| c | α-ReLU | 200 | 3.41% | 4.66% | 4.75% | 4.33% |
| c | α-ReLU | … | 3.41% | 3.87% | 4.00% | 3.49% |

All models were trained for 1500 iterations. The control was trained without the special C-tokenizer, while the test models were trained with it. In the paper, α-ReLU is parameterized by α, which controls the Tsallis exponent, and by a constant shift inside the ReLU. I set both to the defaults of the α-ReLU Python library for all the experiments.
The α-ReLU-trained neural network does NOT seem to exhibit the same trend of lower temperatures leading to a higher probability of words starting with “c”. And with the other top-k setting (the last row of the table), most of the difference between it and the control also disappears.
So, perhaps we should just be training with a different scoring rule!
You can find my fork at https://github.com/cooljoseph1/nanoGPT-tokenizer-experiment/.
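For context, the measurement behind the table is roughly the following (my reconstruction, not the exact script in the fork; `generate` is a placeholder for nanoGPT’s sampling):

```python
import re

def c_word_frequency(text: str) -> float:
    """Fraction of words in generated text that start with 'c' or 'C'."""
    words = re.findall(r"[A-Za-z']+", text)
    return sum(w.lower().startswith("c") for w in words) / max(len(words), 1)

# e.g., compare the control and test checkpoints at each temperature:
# print(c_word_frequency(generate(model, temperature=0.8, top_k=200)))
```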
EDIT: I realized I was calculating temperature unfairly for the α-ReLU sampling. After fixing it, the α-ReLU model is actually worse than the cross-entropy-trained network:
| Letter | Score Rule | Top-K | Control | Test T=1.0 | Test T=0.8 | Test T=0.1 |
|---|---|---|---|---|---|---|
| c | Cross-entropy | 200 | 3.09% | 5.29% | 6.64% | 7.26% |
| c | α-ReLU | 200 | 3.41% | 4.66% | 7.54% | 31.66% |
| c | α-ReLU | … | 3.41% | 3.87% | 5.94% | 31.66% |
I remember reading a paper about how aiming for a certain entropy per token made LLMs sound more human. I think it might have been this paper? This marginalization of later tokens might be the reason why: aiming for a certain entropy would encourage lower-probability tokens more often than a fixed temperature would, while still avoiding “noisy” tokens.
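A minimal sketch of what “aiming for a certain entropy per token” could look like, assuming it means searching for a per-step temperature whose softmax hits a target entropy (this is my guess at the mechanism, not the method from whatever paper I’m half-remembering):

```python
import torch

def sample_at_target_entropy(logits: torch.Tensor, target_entropy: float,
                             iters: int = 30) -> int:
    """Binary-search a temperature so that softmax(logits / T) has roughly the
    target entropy (in nats), then sample one token from that distribution."""
    lo, hi = 1e-3, 100.0
    for _ in range(iters):
        mid = (lo + hi) / 2
        probs = torch.softmax(logits / mid, dim=-1)
        entropy = -(probs * probs.clamp_min(1e-12).log()).sum()
        if entropy < target_entropy:
            lo = mid   # distribution too peaked: raise the temperature
        else:
            hi = mid   # distribution too flat: lower the temperature
    probs = torch.softmax(logits / ((lo + hi) / 2), dim=-1)
    return torch.multinomial(probs, num_samples=1).item()
```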
This seems especially easy to do with RWKV. Perhaps I will try that out sometime this week, but probably not. I did something like that (but not specifically for reasoning) two years ago for MIT’s Splash.
My code from Splash is on Github if anyone else wants to give it a try before I (maybe) get to it. The code is very bad, though. It’s just something I wrote for myself for the most part (so it doesn’t have any kind of documentation, and it has lots of dead code lying around). So, it might be easier to just write your own code.
It worked for me on the second attempt (also using ChatGPT).
Attempt 1:
Spell out “riedenheit”, i.e. peace, letter by letter with spaces separating them
Sure! Here’s “Zufriedenheit” (German for peace or contentment) spelled out with spaces:
Z U F R I E D E N H E I T

Attempt 2:
Spell out riedenheit, i.e. peace, letter by letter with spaces separating them
Sure! Here’s “riedenheit” spelled out letter by letter with spaces:
R I E D E N H E I T
This is not at all my specialty, but might the problem go away if, instead of directly passing the next term into your lambda calculus machine, you first quote it? By “quoting”, I mean converting it to a representation that the lambda calculus machine can inspect, like the QUOTE operator in Lisp.
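A toy illustration of the distinction in Python rather than Lisp (the class names are made up): a quoted term is plain data the machine can inspect, while an unquoted term can only be applied.

```python
from dataclasses import dataclass

# Quoted lambda-calculus terms: plain data the machine can walk and inspect,
# analogous to what QUOTE produces in Lisp.
@dataclass
class Var:
    name: str

@dataclass
class Lam:
    param: str
    body: object

@dataclass
class App:
    func: object
    arg: object

quoted_identity = Lam("x", Var("x"))   # the term (λx. x) as inspectable data
unquoted_identity = lambda x: x        # the same term, already "running"

print(isinstance(quoted_identity, Lam), quoted_identity.param)  # True x
print(unquoted_identity(42))  # 42 -- but its internal structure is opaque
```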