Technical
1. Do we need 10-30 years of serial research, which can’t be parallelized?
2. Does superalignment require “extreme philosophical competence”?
3. Is the Anthropic culture/playbook prepared for the potential necessity of 10+ year pauses?
4. What does the curve of “alignment difficulty vs. capabilities” look like, right around the point where AI becomes capable enough to meaningfully help with “ending the acute risk period”?
Geopolitical
5. How inevitable is racing?
6. How willing would China / others be to go along with a serious pause proposal?
7. How much Overton-window-smashing do we need “right now” to get to an adequate world, and how practical is that?
In what sense are questions 1, 3, 4, 5, 6, and 7 philosophical? I think these are forecasting and strategic planning questions. I hope that framing these questions as philosophical doesn’t produce any kind of retreat from work that could cash out in concrete predictions and shovel-ready projects.
I think what you’re asking for is perspectives on your views from someone more similar to an average Anthropic employee than you are, and I guess I am that. Sorry if this post is confusing; I don’t write about abstract questions very often.
My perspective is that philosophy can have a mental effect where you come to think that certain questions, once posed, correspond to reality in a way they don’t, and are answerable in a way they’re not. I feel like this post has some “twisting” in its frame from this effect: it makes it feel like, if a philosophical question exists and is hard to answer, we should update toward there being a corresponding piece of reality that also exists and is also “hard” in some way. But from my perspective, strictly philosophical questions are pretty much always unanswerable (e.g., by being ill-posed and not corresponding to reality) even when the corresponding piece of reality is trivially easy to work with in other ways.
More concretely:
The post is called “taking technical philosophy more seriously” but doesn’t seem to contain clear arguments for taking technical philosophy more seriously.
The concrete questions in the post don’t seem like things you’d use philosophy to tackle.
It seems like you have a meta-philosophical question about whether philosophy is necessary for solving alignment.
I don’t have reason to believe that this question is answerable using philosophy, and looking at the track record of philosophy (among other things) makes me think it’s not.
I feel that it’s possible to reasonably employ heuristic vibes on the question of how useful philosophy could be for solving alignment, but becoming good at wielding heuristic vibes is kind of like forming a political opinion: you have to collect and compress a lot of different data from your experience, and the resulting tacit knowledge is hard to convey. This is probably one source of your differences with the Anthropic employees: they have a lot of tacit knowledge in machine learning.
At least for me, that tacit knowledge results in me not finding the practice of philosophy relevant to alignment unless it’s embedded in a level of tacit knowledge and practice that would cause most people not to refer to it as philosophy: practitioners would spend most of their time on tacit-knowledge-building and practical work, would mostly engage with concrete examples of philosophical issues, and so wouldn’t need to use the abstraction/imaginal muscles you build doing philosophy in exactly the same way (and would need other muscles too). An example is Amanda Askell’s philosophical work, which seems potentially useful, but which could also be called “choosing a good character for Claude” and be done (perhaps worse) using only common sense and good taste.
I agree that a crux here is your ‘I’m personally at like “it’s at least ~60% that super alignment is Real Hard”’, where by Real Hard you probably mean that you need to solve fairly pure philosophical problems.
It seems like your stated reasons for believing you need to solve philosophical problems are:
a blog post by Nate Soares, and AGI Ruin: A List of Lethalities by Eliezer Yudkowsky (which ground, e.g., a belief in the need for pivotal acts)
a belief in “rapid” recursive self-improvement that is a “problem” w.r.t. iterative empirical R&D
a belief in a need for alignment processes that scale to “infinity” or to “unboundedly powerful” AI
a belief that you need to solve Real Hard philosophical problems to manage this transition
My perspective here is that I have read and thought about the AGI Ruin post for years and have come to believe that it deeply does not hold water in many ways. This influences my priors on the Nate Soares post, which I haven’t read recently.
I don’t expect FOOM, both because of hardware constraints and because of tacit knowledge from doing things like hands-on ML research, so I believe in a kind of “rapid” that is not as rapid as all that.
I don’t believe in “infinity” or “unboundedness” with respect to the things being pointed at when we say “intelligence”, for similar reasons and also for conceptual reasons.
I did MATS in ’23 and I think community credence in List of Lethalities was pretty low.
> “what would be sufficient to think Anthropic should significantly change its research or policy comms?”
For me personally, I would like to see successful researchers producing impactful, concrete machine learning research using philosophical reasoning that is flavored like Yudkowsky thought or Soares thought. An example would be inventing more efficient weight-sparse transformers using Yudkowsky thought, or upending which loss functions are used to produce interpretable models. For the purposes of this discussion, let’s just say that Yudkowsky thought is something I can recognize when I see it.