I briefly tried doing mech interp research to figure out what the algorithm distillation model was doing internally, and whether different setups could learn in-context RL, but I kind of gave up and moved on to other projects. This makes me want to go back into it.
My own view on that, and on whether models can learn an imitation of long-term learning, is that maybe it's possible. I think the actual algorithm distillation setup doesn't do that on their toy tasks, but that setup is extremely simple, and I'd expect that if something like this works, it works on more complicated things, with bigger models and multiple tasks, where it's easier to learn in-context RL than separate heuristics for every task.
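For readers who haven't seen the setup: as I remember it, algorithm distillation trains a causal sequence model to predict the source RL algorithm's actions given the whole cross-episode learning history. Here's a minimal sketch of that objective, with placeholder sizes and an actions-only toy history (the real setup also feeds in observations and rewards):

```python
import torch
import torch.nn as nn

# Sketch of the algorithm-distillation objective (my reconstruction, not the
# paper's code): a causal model predicts the source RL algorithm's next
# action from the cross-episode history so far. All sizes are placeholders.
n_actions, d_model, seq_len = 4, 64, 128

embed = nn.Embedding(n_actions, d_model)
layer = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
encoder = nn.TransformerEncoder(layer, num_layers=2)
head = nn.Linear(d_model, n_actions)

# Stand-in batch of action histories from a source learner (random here;
# a real run would use recorded Q-learning histories instead).
actions = torch.randint(n_actions, (8, seq_len))
causal_mask = nn.Transformer.generate_square_subsequent_mask(seq_len)

h = encoder(embed(actions), mask=causal_mask)   # causal self-attention
logits = head(h)
loss = nn.functional.cross_entropy(
    logits[:, :-1].reshape(-1, n_actions),      # prediction at step t
    actions[:, 1:].reshape(-1),                 # source learner's action at t+1
)
loss.backward()
```

The point is just that nothing in the objective names the task: if the histories span many tasks, imitating the source learner's improvement over episodes is one way to fit all of them at once.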
And I don’t really understand why you are so sure the answer is no.
It wouldn't even have to be the exact same Q-learning algorithm, just some approximation that does learn over longer timesteps.
You talk about the impossible task of the model learning to do in its activations what the Q-learning algorithm does on the task, but that doesn't seem obviously impossible to me, especially for a much bigger net trying to replicate a smaller algorithm.
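For concreteness, the per-step computation the forward pass would have to approximate is tiny; here's the standard tabular Q-learning update (hyperparameters and table sizes here are illustrative, not from the paper):

```python
import numpy as np

def q_update(Q, s, a, r, s_next, alpha=0.1, gamma=0.99):
    """One tabular Q-learning step: move Q[s, a] toward the TD(0) target."""
    td_target = r + gamma * Q[s_next].max()
    Q[s, a] += alpha * (td_target - Q[s, a])
    return Q

# Toy usage on a 5-state, 2-action table:
Q = np.zeros((5, 2))
Q = q_update(Q, s=0, a=1, r=1.0, s_next=2)
```

The open question is whether attention over a long context can maintain and update something playing the role of the Q table, not whether the arithmetic itself is hard.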
And even if I agreed with you that it seemed unlikely, I wouldn't be very sure, because that seems like a vibes-based guess, and it's easy to be wrong about vibes-based guesses of what can be done in a transformer forward pass. I'd want actual details and thought put into exactly how hard it is to represent an RL algorithm in a transformer, how hard it is for one to learn it, and why, before I was pretty sure it was not possible.
There are some papers on doing gradient descent in activation space, and on how this might happen in ICL, that seem relevant, though I haven't read them in a long time; I'll have to look back into them.
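The cleanest result I remember from that line of work is that a softmax-free attention layer over in-context (x, y) pairs can exactly implement one gradient-descent step on least-squares regression. A tiny self-contained check of that identity (reconstructed from memory, not from any paper's code):

```python
import numpy as np

rng = np.random.default_rng(0)
d, n = 4, 16
X = rng.normal(size=(n, d))   # in-context inputs x_i
y = X @ rng.normal(size=d)    # in-context targets y_i
x_q = rng.normal(size=d)      # query point
lr = 0.1

# One gradient step on 0.5 * sum_i (w . x_i - y_i)^2, starting from w = 0:
grad = (X @ np.zeros(d) - y) @ X     # = -sum_i y_i x_i
w1 = np.zeros(d) - lr * grad
pred_gd = w1 @ x_q

# Softmax-free (linear) attention: query x_q, keys x_i, values y_i.
pred_attn = lr * ((X @ x_q) @ y)     # sum_i <x_i, x_q> * y_i

assert np.isclose(pred_gd, pred_attn)
```

If attention can do a regression step in-context, it doesn't seem crazy to me that some stack of layers approximates a TD-style update.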
Also, glazgogabgolab has other examples of more recent work in another comment that look interesting. I haven't looked into those yet, but it seems possible to me that there's already some paper somewhere showing in-context RL.
Regardless, this seems testable, which is interesting; it's just a lot of work.
The main problem is that this is hard to do well and expensive in compute, because you need lots of examples of full RL training trajectories.
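To gesture at where the cost comes from: each training example is an entire learning history, so data generation scales as tasks × steps-per-history. A minimal sketch with a toy bandit source learner (environment, counts, and hyperparameters all made up for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)

def collect_history(n_arms=5, n_steps=500, eps=0.1):
    """Run one epsilon-greedy learner on a freshly sampled bandit task and
    record its whole learning history; this is ONE training example."""
    means = rng.normal(size=n_arms)
    Q, counts = np.zeros(n_arms), np.zeros(n_arms)
    history = []
    for _ in range(n_steps):
        a = int(rng.integers(n_arms)) if rng.random() < eps else int(Q.argmax())
        r = rng.normal(means[a])
        counts[a] += 1
        Q[a] += (r - Q[a]) / counts[a]   # incremental mean estimate
        history.append((a, r))
    return history

# 1000 tasks x 500 steps = 500k environment steps, and bandits are the cheap
# case; anything needing deep RL as the source learner multiplies this by the
# cost of actually training that learner on every task.
histories = [collect_history() for _ in range(1000)]
```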
Paragraph 1:
Α γράμμα ἐστίν. Α καὶ Β γράμματα εἰσιν. Α, Β, καὶ Γ τρία Ἑλληνικὰ γράμματά εἰσιν. Καὶ Π Ἑλληνικόν γράμμα ἐστίν, οὐ Λατινικόν. C Λατινικόν γράμμα ἐστίν, οὐχ Ἑλληνικόν.
Paragraph 2:
Β οὐ φωνῆεν, ἀλλὰ σύμφωνον ἐστιν. Β καὶ Γ οὐ φωνήεντα, ἀλλὰ σύμφωνα εἰσιν. Β οὐ μικρὸν γράμμα ἐστίν, ἀλλὰ κεφαλαῖον. β οὐ κεφαλαῖον, ἀλλὰ μικρὸν γράμμα ἐστίν. Ω = ὦ μέγα, Ο = ὂ μικρόν.
Paragraph 3:
ΑΙ Ἑλληνικὴ δίφθογγος ἐστιν. ΑΙ καὶ ΕΙ Ἑλληνικαὶ δίφθογγοι εἰσιν. Α′ δίφθογγος οὐκ ἔστιν, ἀλλ′ ἀριθμός. Α′ καὶ Β′ ἀριθμοί εἰσιν.
Paragraph 4:
«Ἀπολλώνιος» κύριον ὄνομα ἐστιν. «Ἀπολλώνιος» καὶ «Ἑλένη» κύρια ὀνόματα εἰσιν. «Ἀπολλώνιος» ἀρσενικόν ὄνομά ἐστιν (♂). «Ἑλένη» θηλυκόν ὄνομά ἐστιν (♀).
Paragraph 5:
«Salve» Λατινικὴ λέξις ἐστίν, οὐχ Ἑλληνική. «Salve» καὶ «lingua» δύο Λατινικαὶ λέξεις εἰσίν. «Χαῖρε», «γλῶσσα», καὶ «ἀριθμός» τρεῖς Ἑλληνικαὶ λέξεις εἰσίν.
I copy-pasted your post up to “first try”, added “can you do it?”, and the above is what I got.
Other Claude instances tell me it’s correct when I ask in different ways, so it should be right, but seeing other people fail is worrying.