metawrong

Karma: 26

metawrong 9 May 2026 2:00 UTC
3 points
0
on: Natural Language Autoencoders Produce Unsupervised Explanations of LLM Activations
I wonder if training NLAs at several layers and reading the explanations as a stack would let you watch a “thought” form across depth.

metawrong 9 May 2026 1:57 UTC
13 points
0
on: Natural Language Autoencoders Produce Unsupervised Explanations of LLM Activations
I wonder whether training multiple NLAs on the same model and layer, but with different seeds, would converge on the same explanations.

metawrong 8 May 2026 5:47 UTC
4 points
−2
on: Natural Language Autoencoders Produce Unsupervised Explanations of LLM Activations
Can we do an NLA on the activation verbalizer (AV)? NLAs all the way down!

metawrong 13 Feb 2026 4:25 UTC
1 point
0
on: AI #155: Welcome to Recursive Self-Improvement
I want to talk about the chess games 30 seconds in.
They have swapped the places of Black’s king and queen.

metawrong 26 Jan 2025 11:19 UTC
2 points
0
on: Are Sparse Autoencoders a good idea for AI control?
> Finetuned a SAE on deceptive/non deceptive reasoning traces from Gemma 9b
If you generate synthetic deceptive trajectories, how can you be sure the SAE is going to generalise to ‘real’ deceptive trajectories? Also, in those cases, why do you need SAEs at all? Could you use probes instead?

metawrong 16 Jan 2025 4:51 UTC
1 point
0
in reply to: Cleo Nardo’s comment on: strawberry calm’s Shortform
How does this explain the decoy effect?[1]
1. I am not sure how real or how well researched the ‘decoy effect’ is.

metawrong 4 Jan 2025 7:44 UTC
1 point
0
in reply to: Neel Nanda’s comment on: The Plan − 2024 Update
So you would expect Claude Opus 3 to be harder to interpret than Claude Sonnet 3.5?

My intuition is that larger models of the same capability would exhibit less superposition and thus be easier to interpret.

metawrong 15 Nov 2024 15:03 UTC
1 point
0
on: Monthly Roundup #23: October 2024
> The news is good, and there are now seven shows in my tier 1

@Zvi Which are the other shows in your tier 1?

metawrong 1 Nov 2024 18:20 UTC
1 point
0
in reply to: Ryan Kidd’s comment on: Ryan Kidd’s Shortform
LASR (https://www.lasrlabs.org/) is giving an £11,000 stipend for a 13-week program; assuming 40 h/week, that works out to ~$27/hour.
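
For anyone who wants to check the arithmetic, here is a rough sketch of the calculation. The stipend, program length, and 40 h/week figure come from the comment above; the GBP-to-USD rate of 1.29 is my own assumption (an approximate rate, not something stated in the thread), so the dollar figure moves with whatever rate you plug in.

```python
# Back-of-the-envelope hourly rate for a fixed stipend.
STIPEND_GBP = 11_000     # from the comment
WEEKS = 13               # from the comment
HOURS_PER_WEEK = 40      # the comment's own assumption
GBP_TO_USD = 1.29        # ASSUMPTION: approximate exchange rate, not from the thread


def hourly_rate(stipend: float, weeks: int, hours_per_week: int) -> float:
    """Return the stipend divided by the total number of hours worked."""
    return stipend / (weeks * hours_per_week)


hourly_gbp = hourly_rate(STIPEND_GBP, WEEKS, HOURS_PER_WEEK)
hourly_usd = hourly_gbp * GBP_TO_USD

print(f"£{hourly_gbp:.2f}/hour, about ${hourly_usd:.2f}/hour")
```

That is 520 hours in total, so a bit over £21 per hour before conversion.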

metawrong 11 Mar 2024 10:18 UTC
8 points
6
in reply to: the gears to ascension’s comment on: Buy AI Stocks NOW
It’s difficult to produce a counterargument when the main argument is ‘trust me bro’.

The main argument seems to be (except for the all caps BUY STONKS NOW):

> It is just very obvious that the market is erring

This is not obvious to me.

Sofia ACX January 2024 Social Meetup

metawrong 26 Jan 2024 6:44 UTC

2 points

0 comments · 1 min read

metawrong 10 Oct 2023 19:36 UTC
1 point
0
on: I’m a Former Israeli Officer. AMA
Thank you for doing this!

A few random questions; of course, feel free to say as much or as little as you want:

Have you personally been to the Gaza Strip (or the West Bank)? If yes, what are your impressions? How do people live there? Is it common for regular (non-military) Jewish people to hang around those places? How common is it for Palestinians to hang around outside of Gaza and the West Bank? How common is it for Jewish and Palestinian people to mix in everyday life and interact?

I have just recently learned about the Gilad Shalit prisoner exchange (a single captured Israeli soldier was exchanged for over 1000 Palestinian prisoners). What do you think about that exchange?

What is your probability that the IDF knew about the attack but still let it happen? If it is near 0, what is your current best explanation for the IDF being caught by surprise by such a massive attack?