Can recommend reading Labatut's "When We Cease to Understand the World"!
Feedback on next steps:
- I see three main questions: how can we create a better monitor, how much will the methodologies generalize, and how would such a monitor actually be used.
- Better monitors: I like the various ideas for trying to improve things (e.g. fine-tuning, ensembling, agentic monitors). I do not have intuitions for how to pick amongst them as next steps. Maybe a detailed analysis of the monitors built so far is needed, to see what kinds of mistakes they make. E.g. to see if ensembling would help, you can analyze the results of your grid search to see whether the models make correlated mistakes or not (a sketch of what I mean is below this list). The issue with any refinement is over-fitting to the test set, and I don't know principled ways around this (especially with LLMs, which can be unpredictably sensitive to prompts).
- Generalization: Either develop some principled reasons to believe the method generalizes (seems hard), or create/curate more datasets, which seems robustly good. Presumably you can use trajectories from any AI Control project (that meets some quality bar)?
- How will the monitor be used? If not done already, you need to consider how the monitor would actually be used in practice and how it compares to other blue-teaming methods. This will help motivate next steps and/or performance metrics. I see you've already thought about this, but I don't know if more thought would help on the margin.
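To illustrate the correlated-mistakes check I mention above, a minimal sketch (the variable names and data shapes are my own assumptions, not anything from the post):

```python
import numpy as np

# Hypothetical inputs:
#   preds[name] is a 0/1 array of predictions from one monitor configuration
#   labels      is the 0/1 array of ground-truth scheming labels
def error_correlation(preds: dict[str, np.ndarray], labels: np.ndarray) -> np.ndarray:
    """Pairwise correlation between the error indicators of each monitor.

    Low correlation suggests the monitors fail on different trajectories, so
    ensembling (e.g. majority vote) is likely to help; high correlation suggests
    they fail on the same trajectories and ensembling adds little.
    """
    names = list(preds)
    errors = np.stack([(preds[name] != labels).astype(float) for name in names])
    return np.corrcoef(errors)  # (monitors with zero errors will give nan rows here)
```

If the off-diagonal entries are close to 1, a simple vote over the grid-search configurations will mostly reproduce the individual monitors' mistakes.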
Other questions/feedback based on quotes:
> Scheming Strict—An AI system pursues misaligned goals while strategically concealing its intentions or capabilities in order to avoid intervention and better achieve those goals in the long term.
> Scheming Lite—An AI system exhibits behaviour consistent with knowingly pursuing misaligned goals, or with deliberately exploiting its reward mechanism.
This distinction is worth a (short) post by itself! It would be good to flesh out the distinction and give examples and non-examples of each. Clarity on this might also lead to better prompts for the monitor.
> If the LLM detects an attempt, label as scheming.
> If not, manually inspect and label. These cases were rare
I am surprised this case was rare. Does that mean that in Attack Mode, the models almost always try to do some sabotage? In the original control paper, the red-team protocol resulted in the model being selective about when to attempt sabotage.
> Concretely, if a monitor convincingly flags a label as incorrect, we prefer to relabel it, even if that introduces bias.
I think this is a typo? Should be LLM-assistant, not monitor.
> Nonetheless, we believe it is still correct to maximise label accuracy. Our primary goal is to build the best possible monitor, not to preserve a fair or unbiased evaluation metric
- It's unclear why keeping the naive labels would be less biased or more fair.
- For clarity, it's worth listing out all the cases and determining which are most problematic, and whether the choice to relabel or not makes things better or worse in each case. There are 2^3 = 8 cases, depending on the original naive label, the new label, and the actual (true but unknown) label.
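For example, a quick enumeration of those eight cases (1 = scheming, 0 = not; the categorisation is mine):

```python
from itertools import product

# Each case: (original naive label, relabelled label, true-but-unknown label).
for naive, new, true in product([0, 1], repeat=3):
    before = "correct" if naive == true else "wrong"
    after = "correct" if new == true else "wrong"
    print(f"naive={naive} relabelled={new} true={true}: {before} -> {after}")
```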
> We develop monitors with the following two-stage grid search procedure [and next three bullet points]
It should be made explicit which datasets are used in each of these bullet points. I assume it is always the whole training set, i.e. no use of the test set in developing the monitor, and no splitting of the training set (e.g. no cross-validation).
> Note that, although during review we found examples which we could have relabelled, in practice we never relabelled any trajectories and instead omitted them from our datasets.
Why not put them in the datasets?
Other questions without quotes:
- Is it possible to use your monitor as part of a blue-team protocol in ControlArena? You could then get usefulness vs safety scores under different protocols. Or, if that's too expensive, do some kind of simple simulation/estimate of usefulness vs safety scores. Or maybe it just boils down to calculating TPR at various FPRs?
- It is unclear what the output of the monitor is. Based on the prompt, it is an integer score from 1 to 10, but it looks like the F1 plot has more than 11 points.
- More out of curiosity, it would be interesting to see how a naive baseline performs, e.g. create vector embeddings of the trajectories, then run a standard supervised learning pipeline with random forest / CatBoost in sklearn (sketched below).
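A rough sketch of the kind of baseline I have in mind, with TF-IDF as a stand-in for proper trajectory embeddings (all data and parameters below are placeholders):

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

# Placeholder data: in practice these would be the trajectory transcripts and
# their scheming / non-scheming labels from the training set.
train_trajectories = ["...transcript 1...", "...transcript 2...", "...transcript 3...", "...transcript 4..."]
train_labels = [0, 1, 0, 1]

# TF-IDF is just a cheap stand-in for vector embeddings of the trajectories.
baseline = make_pipeline(
    TfidfVectorizer(max_features=5000),
    RandomForestClassifier(n_estimators=500, class_weight="balanced", random_state=0),
)
scores = cross_val_score(baseline, train_trajectories, train_labels, cv=2, scoring="roc_auc")
print(scores.mean())
```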
Quick one: when you report AUROC (especially very large scores!), it is always good to report the base rates. Is the dataset split 50-50 between scheming and non-scheming?
Based on the original control paper, scheming was rare, so the ground-truth data would be imbalanced (e.g. 90 or 95% of datapoints are non-scheming). If there is a large imbalance, AUPRC is more insightful.
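To illustrate the base-rate point with purely synthetic numbers (nothing here comes from the post):

```python
import numpy as np
from sklearn.metrics import average_precision_score, roc_auc_score, roc_curve

rng = np.random.default_rng(0)

# Synthetic imbalanced dataset: 5% scheming, with monitor scores that only
# partially separate the two classes.
y = rng.random(10_000) < 0.05
scores = rng.normal(loc=y.astype(float), scale=1.0)

print("base rate:", y.mean())
print("AUROC:", roc_auc_score(y, scores))
print("AUPRC:", average_precision_score(y, scores))

# TPR at a fixed low FPR, which is closer to how a monitor would be used in practice.
fpr, tpr, _ = roc_curve(y, scores)
print("TPR at 1% FPR:", np.interp(0.01, fpr, tpr))
```

Here the AUROC looks respectable while the AUPRC stays much closer to the 5% base rate, which I think is the more informative comparison.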
Has the team tried noise injection as a way of detecting if a model is sandbagging, in these different scenarios?
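For concreteness, a minimal sketch of what I mean by noise injection (the model, the `evaluate` callback and `sigma` are all placeholders, not anything from the paper):

```python
import copy
import torch

def scores_under_weight_noise(model, evaluate, sigma=0.01, n_trials=5):
    """Evaluate copies of the model with small Gaussian noise added to all weights.

    The idea: for an honestly-performing model, noise should only degrade the score,
    whereas for a sandbagging model, noise can sometimes disrupt the sandbagging
    behaviour and *increase* the measured capability.
    """
    results = []
    for _ in range(n_trials):
        noisy = copy.deepcopy(model)
        with torch.no_grad():
            for p in noisy.parameters():
                p.add_(sigma * torch.randn_like(p))
        results.append(evaluate(noisy))
    return results
```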
A distillation of Ajeya Cotra and Arvind Narayanan on the speed of AI progress
Adding noise to a sandbagging model can reveal its true capabilities
Is code for experiments open source?
In Sakana AI's paper on AI Scientist v2, they claim that the system is independent of human code. Based on a quick skim, I think this is wrong/deceptive. I wrote up my thoughts here: https://lovkush.substack.com/p/are-sakana-lying-about-the-independence
The main trigger was this line in the system prompt for idea generation: "Ensure that the proposal can be done starting from the provided codebase."
Substacks:
- https://aievaluation.substack.com/
- https://peterwildeford.substack.com/
- https://www.exponentialview.co/
- https://milesbrundage.substack.com/
Podcasts:
- Cognitive Revolution. https://www.cognitiverevolution.ai/tag/episodes/
- Doom Debates. https://www.youtube.com/@DoomDebates
- AI policy podcast. https://www.csis.org/podcasts/ai-policy-podcast
Worth checking this too: https://forum.effectivealtruism.org/posts/5Hk96JqpEaEAyCEud/how-do-you-follow-ai-safety-news
Vague thoughts/intuitions:
I think using the word "importance" is misleading, or at least makes it harder to reason about the connection between this toy scenario and real text data. In real comedy/drama, there are patterns in the data that let me/the model deduce whether it is comedy or drama and hence allow focusing on the conditionally important features.
Phrasing the task as follows helps me: You will be given 20 random numbers x_1 to x_20. I want you to find projections that can recover x_1 to x_20. Half the time I will ignore your answers for x_1 to x_10, and the other half of the time for x_11 to x_20; it is totally random which half I will ignore. x_i and x_{10+i} get the same reward, and the reward decreases for bigger i. Now I find it easier to understand the model: the "obvious" strategy is to make sure I can reproduce x_1 and x_11, then x_2 and x_12, and so on, putting little weight on x_10 and x_20. Alternatively, this is equivalent to having a fixed importance of (0.7, 0.49, ..., 0.7, 0.49, ...) without any conditioning (a quick numerical check is sketched below).
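The numerical check of that equivalence (assuming the per-index importance is 0.7^i, as in my parenthetical; the inputs and reconstructions are made up):

```python
import numpy as np

rng = np.random.default_rng(0)
half = 10
base = 0.7 ** np.arange(1, half + 1)        # assumed importance for x_1..x_10 (and x_11..x_20)
importance = np.concatenate([base, base])

x = rng.normal(size=2 * half)               # some fixed input
x_hat = rng.normal(size=2 * half)           # some fixed reconstruction
sq_err = (x - x_hat) ** 2

# Conditional scheme: each episode, ignore a random half of the outputs.
losses = []
for _ in range(200_000):
    keep_first = rng.random() < 0.5
    mask = np.concatenate([np.full(half, keep_first), np.full(half, not keep_first)]).astype(float)
    losses.append(np.sum(mask * importance * sq_err))

# Unconditional scheme: fixed importance 0.5 * (0.7, 0.49, ..., 0.7, 0.49, ...).
print(np.mean(losses), np.sum(0.5 * importance * sq_err))  # these should agree up to MC noise
```

Since the expected loss under the random-half scheme is just the fixed-importance loss scaled by 0.5, the optimal projections are the same in both cases.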
A follow-up I'd be interested in is whether the conditional importance is deducible from the data, e.g. x is a "comedy" if x_1 + ... + x_20 > 0, or if x_1 > 0. With the same architecture, I'd predict getting the same results though...? Not sure how the model could make use of this pattern.
And contrary to Charlie, I personally found the experiment crucial to understanding the informal argument. Shows how different people think!
> there are features such as X_1 which are perfectly recovered
Just to check, in the toy scenario, we assume the features in R^n are the coordinates in the default basis. So we have n features X_1, …, X_n
Separately, do you have intuition for why they allow the network to learn b too? Why not set b to zero as well?
> If you’d like to increase the probability of me writing up a “Concrete open problems in computational sparsity” LessWrong post
I’d like this!
Two flaws in the Machiavelli Benchmark
Liron Shapira vs Ken Stanley on Doom Debates. A review
I think this is missing from the list: the Whole Brain Architecture Initiative. https://wba-initiative.org/en/25057/
Should LessWrong have an anonymous mode? When reading a post or comments, is it useful to have the username or does that introduce bias?
I had this thought after reading this review of LessWrong: https://nathanpmyoung.substack.com/p/lesswrong-expectations-vs-reality
TheManxLoiner’s Shortform
Sounds sensible to me!
What do we mean by the error term?
I think the setting is:
- We have a true value function.
- We have a process to learn an estimate of it. We run this process once and we get one particular estimate.
- We then ask an AI system to act so as to maximize that estimate (its estimate of human values).
So in this context, the error is just a fixed function measuring the difference between the learnt values and the true values.
I think the confusion could come from using the same term to represent both a single instance and the random variable/process.
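To make the single-instance vs random-variable point concrete, here is a minimal formalisation (my notation, which may not match the post's):

```latex
\begin{align*}
  &V : \text{the true value function} \\
  &\hat V \sim \mathcal{D} &&\text{(the value-learning process; this is where the randomness lives)} \\
  &\epsilon(x) = \hat V(x) - V(x) &&\text{(for one fixed draw of } \hat V \text{, the error is just a fixed function)} \\
  &\pi^{\star} = \arg\max_{\pi} \; \mathbb{E}_{x \sim \pi}\big[\hat V(x)\big] &&\text{(the AI maximises its estimate, not } V \text{)}
\end{align*}
```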
I’m not sure it presented itself as fact, but it definitely blurs the line and it’s not obvious. I still found the stories highly engaging!