Yeah I agree demonstrating it and examining it is very important, especially in chain-of-thought where the existence of hidden information isn’t a certainty.
They’re a bit out of date, but the original proposal is in ‘ELK Sub - Note-taking in internal rollouts’, and the multiple-model protocol is in ‘Note Taking Without Hidden Messages’.
Haven’t done any proper testing but it’s high up on my agenda, would be interested in working out what tests would be best—will elaborate later.
For the ELK contest I was working on something similar, but where the problem is more acute (you’re trying to force a MuZero-style reinforcement learner to compress its hidden state into words).
The methods I proposed to deal with this are:
1. Cut down the null space by training a language model to rewrite the written words in a different format which preserves the meaning as closely as possible, so that the CoT reasoner is forced to respond to similar sentences in similar ways.
2. Use multiple models which are trained separately but have to respond to each other’s thoughts, making it more difficult to establish a shared steganographic protocol.
3. Use something like dropout on the chains of thought, so that information can’t be hidden in the combination of multiple sentences.
4. Have some non-linguistic bits which we don’t apply optimization to, instead focusing our interpretability efforts on them. The hope is that these soak up the highest-order bits that aren’t easily expressible in language, so there’s less need to hide them within words. Applying dropout to these is particularly necessary.
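To make the dropout idea concrete, here’s a minimal sketch (names and the toy chain are made up; a real version would operate on model-written sentences before the next reasoning step sees them):

```python
import random

def cot_dropout(chain, keep_prob=0.8, rng=None):
    """Randomly drop sentences from a chain of thought, so later steps
    can't rely on information spread across specific sentence
    combinations. Always keeps at least one sentence."""
    rng = rng or random.Random()
    kept = [s for s in chain if rng.random() < keep_prob]
    return kept if kept else [rng.choice(chain)]

# Toy chain of thought (stand-ins for model-written sentences):
chain = [f"step {i}" for i in range(5)]
print(cot_dropout(chain, keep_prob=0.6, rng=random.Random(0)))
```

Any steganographic scheme then has to be robust to missing sentences, which pushes it towards redundancy and makes it more detectable.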
You can get copies of that on amazon
There’s also LessWrong essay collections and things on there.
I’m really interested to see this progress, it would feel very healthy if we could have a solid integrated definition of optimizer to work with.
I’m not sure I understand why you don’t agree with the ‘small’ criterion for the target set. It seems that you should be able to say something about the likelihood of the target in the absence of any agent (or if the agent takes a max-ent distribution over actions or something), and that’s the relevant notion of smallness, which then becomes large in the presence of the agent. Or is it that you expect it to be difficult to properly specify what it means to have no agent or random decisions?
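As a toy illustration of the kind of smallness I have in mind (an entirely made-up setup): estimate the probability of hitting the target set under uniform-random actions versus under a simple agent.

```python
import random

def reaches_goal(policy, start=0, goal=10, horizon=12, rng=None):
    """Run one trajectory; True if the state ever hits the goal."""
    rng = rng or random.Random()
    x = start
    for _ in range(horizon):
        x += policy(rng)
        if x == goal:
            return True
    return False

random_policy = lambda rng: rng.choice([-1, 1])  # max-ent over actions
agent_policy = lambda rng: 1                     # always steps toward goal

rng = random.Random(0)
trials = 2000
p_random = sum(reaches_goal(random_policy, rng=rng) for _ in range(trials)) / trials
p_agent = sum(reaches_goal(agent_policy, rng=rng) for _ in range(trials)) / trials
print(p_random, p_agent)
```

The target (‘reach state 10 within 12 steps’) has probability on the order of 0.003 under the random policy but probability 1 under the agent; that’s the sense in which the target set is small absent the agent and large with it.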
On the relationships between the three ways of defining acts—is there a trivial way of connecting (1) and (3) by saying that the action that the agent takes in (1) is just baked into the trajectory as some true fact about the trajectory that doesn’t have consequences until the agent acts on it? Or instead of the action itself, we could ‘bake in’ the mapping from some information about the trajectory to the action. Either way we could see this as being determined initially or at the point of decision without a difference in the resulting trajectories.
‘all dependent variables in the system of equations’ - I think this should be ‘independent’.
Are there any Twitter lists you’d recommend for high % of good AI (safety) content?
Eliezer has huge respect in the community; he has strong, well thought-out opinions (often negative) on a lot of the safety research being done (with exceptions, Chris Olah mentioned a few times); but he’s not able to work full time on research directly (or so I understand, could be way off).
Perhaps he should institute some kind of prize for work done, trying to give extra prestige and funding to work going in his preferred direction? Does this exist in some form without my noticing? Is there a reason it’d be bad? Time/energy usage for Eliezer combined with difficulty of delegation?
| Setup property | AWS g4ad.16xlarge | Claimed Eff0 setup |
| --- | --- | --- |
| # GPUs | 4 | 4 |
| # CPUs | 64 | 96 |
| Memory/GPU | 8 GB | 20 GB |
So I’m not sure you could do a perfect replication, but I think you should be able to do a run similar to their 100K-step runs in less than a day.
I would also potentially be interested in collaboration—there are some things I’m not keen to help with and especially to publish on but I think we could probably work something out—I’ll send you a DM tomorrow.
I think it’s actually rappel! https://www.merriam-webster.com/dictionary/rappel, no ‘rapel’ entry.
Ah cheers, I hadn’t noticed that; I’ve been trying to avoid looking too closely. The way I understood it was that DRAM usage corresponds very roughly to n_parameters * batch_size, and by tuning batch_size I was able to control the memory usage easily.
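Concretely, the rough heuristic I’d been using is something like this (the `acts_per_param` fudge factor is a pure guess, not a measurement):

```python
def rough_dram_gb(n_params, batch_size, bytes_per_float=4, acts_per_param=1.0):
    """Very rough heuristic: activation memory scales like
    n_params * batch_size (acts_per_param is a fudge factor),
    plus the weights themselves once. Guesses, not measurements."""
    weights = n_params * bytes_per_float
    activations = n_params * batch_size * acts_per_param * bytes_per_float
    return (weights + activations) / 1e9

# e.g. a hypothetical 10M-parameter model at batch size 32:
print(rough_dram_gb(10e6, 32))
```

So halving the batch size roughly halves the activation term, which is why tuning it worked for me.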
I’d not heard about the factor of 3, is that some particular trick for minimizing the GPU RAM cost?
If I may ask, has anyone been able to replicate their original results?
I can’t be sure I haven’t missed something, but I haven’t found anything (I did another search just now). Their code is all on GitHub here, so it could be checked quite easily with access to a cluster of GPUs.
Perhaps the more interesting question, though, from a production perspective, would be how well this system scales up to medium and large amounts of data/training. Does it fall behind MuZero, and if so, when? Which of the algorithm changes cause this? What if the model complexity were brought back up to the size of MuZero?
How much compute do you need?
It depends on how broken my system is. They claim a single run on a single game takes 7 hours on 4 20GB GPUs, so a proper evaluation would take perhaps a week. In reality I’d need to first get it set up; for full replication I’d need to port the MCTS code to C (which would be new to me) and get the system distributing work across the GPUs correctly. Then it’d be a case of seeing how well it trains and debugging failures. It all seems to work on simple games, but even with the limited training I can do, I think it should be doing better on Atari, though it’s hard to be sure. In total I’d guess a couple of months with access would be needed. It’s all pretty new to me; that’s why I’m doing it!
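For scale, the back-of-envelope behind ‘perhaps a week’ (the 26-game count is my assumption about the benchmark; 7 hours per game is their figure):

```python
# Back-of-envelope for a full evaluation: ~7 hours per game on one
# 4-GPU machine, times 26 Atari games.
hours_per_game = 7
n_games = 26
total_hours = hours_per_game * n_games
print(total_hours, total_hours / 24)
```

That’s roughly 7-8 days of wall-clock on a single machine, before any debugging runs.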
Are you that averse to simple real world applications such as sidewalk delivery robots?
In terms of potential jobs? Yeah I’m sufficiently worried about AGI that I plan to at least exhaust my options for working on alignment full time before considering other employment, other than short-term ML work to help build skills.
Where does the 20GB number come from? I can’t see it in a quick scan of the paper. In general the model itself isn’t that huge; in particular, they really shrink the dynamics net down to a single ResNet, and it’s mostly the representation function which has the parameters.
Despite that, training on Colab isn’t really tolerable. In 10h of training it does about 100k steps, but I think that’s only a small fraction of the training in the paper’s implementation (though I don’t have numbers on how many batches they get through), so it’s very hard to work out whether it’s training properly or whether a change has had a positive effect.
Basically I’ve been able to use what I think is the full model size, but the quantity of training is pretty low. I’m not sure if I’d get better performance by cutting the model size to speed up the training loop, but I suspect not, because that hasn’t been visible in CPU runs (though I may also just have some error in my Atari-playing code).
> …the universal distribution is in some sense a good approximation of any computable distribution. (Apparently that’s what the “universal” in UD means, as opposed to meaning that it’s based on a universal Turing machine.)
This is a very interesting claim. The way you say it here suggests it’s a proven result but neither of the links explain (to me) why this is true, or exactly what it means. Could you elaborate?
Something of this nature might well be implied by the Algorithmic Complexity article but I don’t understand how, other than that the UD assigns P>0 to all computable hypotheses consistent with the data, which seems weaker than what I think you’re claiming.
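For concreteness, my best guess at what this cashes out to (if it’s the standard dominance result, which I’m not sure of) is multiplicative dominance of the universal semimeasure \(M\) over every lower-semicomputable semimeasure \(P\):

```latex
M(x) \;\ge\; c_P \cdot P(x) \quad \text{for all } x,
\qquad \text{with } c_P \text{ on the order of } 2^{-K(P)},
```

where \(K(P)\) is the length of the shortest program computing \(P\). That would make “good approximation” mean “never more than a constant factor too small”, which is indeed stronger than just assigning positive probability; is that the intended reading?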
Also ‘March 14’ should read ‘May 14’.
The Research Engineer job for the Alignment team is no longer open—is this because it’s reached some threshold of applications? In any case might not be helpful to advertise!
Thanks for doing this though, the context is very useful (I’ve applied as RE to both).
Don’t know the original quote, but it strongly reminds me of G. E. Moore’s response to skeptical arguments: ‘Here is one hand’ (wiki)
Does anyone know of papers on creating human-interpretable latent spaces with auto-encoders?
An example of the systems I have in mind would be a NN generating face images from a latent space, designed such that dimension 0 encodes skin tone, dimension 1 encodes hair colour etc.
Will be doing my own literature search but if anyone knows the area some pointers to papers or search terms would be very helpful!
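To be concrete about the kind of thing I mean, here’s a minimal numpy sketch (the synthetic data and the supervised-latent trick are my own toy construction, not from any particular paper): tie latent dimension 0 to a known attribute with an extra loss term, so that axis becomes interpretable.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in for face data: each observation mixes two ground-truth
# factors (call them "tone" and "hair") into a 4-d vector.
n = 256
factors = rng.normal(size=(2, n))        # row 0 = tone, row 1 = hair
A = rng.normal(size=(4, 2))              # mixing matrix
x = A @ factors                          # observations, shape (4, n)
tone = factors[0]

# Linear autoencoder with a 2-d latent, plus a supervision term tying
# latent dimension 0 to the known tone attribute.
We = rng.normal(scale=0.1, size=(2, 4))  # encoder
Wd = rng.normal(scale=0.1, size=(4, 2))  # decoder
lam, lr = 1.0, 0.01

def loss():
    z = We @ x
    return (np.mean(np.sum((Wd @ z - x) ** 2, axis=0))
            + lam * np.mean((z[0] - tone) ** 2))

loss0 = loss()
for _ in range(2000):
    z = We @ x
    r = Wd @ z - x                       # reconstruction residual, (4, n)
    s = z[0] - tone                      # supervision residual, (n,)
    gWd = (2 / n) * r @ z.T
    gWe = (2 / n) * Wd.T @ r @ x.T
    gWe[0] += (2 * lam / n) * s @ x.T
    Wd -= lr * gWd
    We -= lr * gWe

z = We @ x
print(loss0, loss(), np.corrcoef(z[0], tone)[0, 1])
```

Attribute supervision is the simplest route; the unsupervised versions of this (disentanglement without labels) are presumably where most of the literature is, so pointers either way would be great.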