I think the ‘naturalness’/‘realism’ of model generated transcripts keeps dropping roughly correlated with the model’s release-date? (i guess as the models keep on increasing the user’s keep chatting in more and more realistic manner).
Unsure whether ‘epoch-capability’ is a right thing to measure when assuming correlation with capabilities.
PS: the judge for the above plot is sonnet-4-6 and the chats are 100 samples each per model from sharechat.
Ram Bharadwaj
Can Random Labels Unlearn Eval Awareness?
Dumb idea that might be worth 2 days of compute:
Setup: Take a model with measurable eval awareness (say, GPT-5.3 or Claude). Finetune on a small dataset of eval and deployment transcripts, but randomly shuffle the “eval” vs “deploy” labels. Test whether the model can still distinguish eval from deployment contexts post-finetuning.
The logic: Eval awareness is a learned conditional — “if eval context, then behavior A.” If you feed the model data where that conditional is pure noise, maybe you break the association. The model learns the labels are meaningless and stops keying on them.
Why it probably fails: Eval detection likely strongly distributional and model might just learn to ignore your labels while still detecting evals from deeper cues.
Prediction: <20% chance this meaningfully reduces eval awareness.
By distilling the model on its own responses, the model’s train and eval behavior should converge you’re collapsing the (train/eval) conditional policy into uniform behavior everywhere.
Did anyone try and see if self-distillation suppresses eval-awareness?
Scaling Laws for LLM Based Data Compression
Thanks for the feedback! working on refining the writeup.
Exploring the Platonic Representation Hypothesis Beyond In-Distribution Data
Understanding Hidden Computations in Chain-of-Thought Reasoning
though as Geoff Hinton has pointed out, ‘confabulations’ might be a better word
I think yann lecun was the first one to using this word https://twitter.com/ylecun/status/1667272618825723909
not much information is given regarding that so far, i was curious about that too
“Algorithm for Concept Extrapolation”
I don’t see any recent publications for paul christiano related to this. So i guess the problem[s] is still open.
Goal-misgeneralization is ELK-hard
parameters before L is less than ,
should this be after?
AutoGPT was created by a non-coding VC
It looks like you are confusing autoGPT with babyagi which was created by yohei nakajima who is a VC. the creator of autoGPT (Toran Bruce Richards) is a game-developer with a decent programming (game-development) experience. Even the figure shown here is that from babyagi (https://yoheinakajima.com/task-driven-autonomous-agent-utilizing-gpt-4-pinecone-and-langchain-for-diverse-applications/).
47 layers layer
47 layers later ?
really interesting idea.
Regarding the first one i am not expecting a single-prompt to generate the entirity of enwiki8/9. I am more interested in finding a set of prompts with a lookup table if possible to replicate enwiki data.
Thanks for the pointer for chincilla post, will look into it.
not yet
I’ve tried, and it p-eval still reduces with steering-strength