vals tutor

Karma: 214

vals tutor 31 Jul 2026 10:23 UTC
3 points
0
on: Big-World Intuitions
This is fantastic coining and frame and I look forward to using it

vals tutor 28 Jul 2026 19:54 UTC
15 points
5
on: Untrusted advice for AI control: Short, strong advice significantly uplifts weak LLMs
> However, we think it’s pretty clear that even without red-teaming, a sufficiently narrow information bottleneck makes our protocol very safe.
Strong disagree. Not only does this seem false, but even if it were true it would certainly not be “pretty clear”, or it’s a motte relying on “sufficiently narrow” to mean unusably narrow.

It seems false for multiple attack vectors which require unstated assumptions to defeat.
1) You can’t calculate how much practical information a message can refer to without knowing the context the trusted agent is in. A shortened url can point to any large amount of untrusted instructions. A smart untrusted model might have memorized many different web pages with different advice on how to code/develop, and calculate how to fool the weaker model by poisoning with “good advice” in the wrong context. Or simply give the url to a jailbreak. Information theoretically, if you don’t know what dictionary the “trusted” model has access to, you don’t know what the untrusted model is sending in its n bits.
Unstated assumptions that could fix most of this: Trusted model has no internet access, nor access to large amounts of data with different opinions
2) Real world uses are mostly multi turn, so the information accumulates, and the untrusted model can recompose instructions together, notably urls or other indexes.
Unstated assumptions that could fix most of this: Single turn only
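To make the two points above concrete, here is a minimal toy sketch, not the paper's protocol. The function names, the 2-bit message size, the 8 turns and the `toy_dictionary` contents are all hypothetical values chosen for illustration. It shows that the number of things a message can point at grows with the total bits sent across turns, and that the size of what is pointed at has nothing to do with the size of the message.

```python
# Toy illustration only: every name and number here is an assumption.

def addressable_entries(bits_per_message: int, turns: int = 1) -> int:
    """How many distinct entries the untrusted model can index in total.

    Bits accumulate across turns, so k turns of n bits act like one
    message of k * n bits.
    """
    return 2 ** (bits_per_message * turns)


def effective_payload_chars(index: int, dictionary: dict[int, str]) -> int:
    """Size of the content the trusted model ends up reading for an index."""
    return len(dictionary[index])


# Hypothetical "dictionary" the trusted model can dereference, e.g. short
# URLs or memorized pages. Payload size is independent of index size.
toy_dictionary = {
    0: "a" * 10,
    1: "b" * 10_000,
    2: "c" * 1_000_000,
    3: "d" * 5,
}

single_turn = addressable_entries(bits_per_message=2)
multi_turn = addressable_entries(bits_per_message=2, turns=8)
largest_payload = max(
    effective_payload_chars(i, toy_dictionary) for i in toy_dictionary
)

print(single_turn, multi_turn, largest_payload)
```

In this toy setup a 2-bit message can select a million-character payload, and eight such turns can index tens of thousands of entries, which is why the bottleneck width alone doesn’t bound what gets communicated.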
More meta note: I liked this paper and the work itself, but the part I quote seems so weird to me and implies really bad security mindset or communication. To me it’s shooting yourself in the foot to overstate the usefulness and strength of your technique/research, as it means I can’t trust your judgement but only the arguments and results. I don’t reflectively endorse judging the author strongly based on one comment, but I wanted to communicate how bad I found that line to be.

vals tutor 24 Jul 2026 23:53 UTC
57 points
10
on: vals tutor’s Shortform
Learnings and ramblings from teaching AI Safety for 3 years

On the field of AI Safety
- Please don’t overhype or oversell AI Safety as a professional field, and in particular how easy it is to work in AI Safety. You ~have to be either quite smart or hard working/motivated to get a stable career, sometimes even both. AI Safety is generally bottlenecked on attention, mentoring and management, so there is much less of a pipeline for “normal people” to make it than in most standardized fields. When teaching and mentoring people, keep in mind that not all will be able to transition fast or upskill fast or get strong positions; some would benefit more from keeping their careers, donating, or upskilling over longer timespans.
- More smart wise well motivated people caring and working on good AI futures will keep mattering until the end, and the total number of people who can meaningfully contribute is probably in the millions, so let’s not stop field building
What does it mean to teach AI Safety?
- There are many facts about AI, and theories about how it will go under what conditions. It means getting to learn and understand these well. Teaching AI Safety well should not over-determine someone to come out of the pipeline with a MIRI-view, nor an Anthropic-view, nor an EA-view, but should allow them to pass the basic Ideological Turing Test for all of these.
- A central example of AI Safety fundamental knowledge is the AI Safety Atlas
On teaching AI Safety
- AI safety changes all the time. Update your slides ~every time you teach something.
- If you’re new to teaching, find existing resources to build on. Be careful of AI slop. Better much less content than irrelevant or unsupported. Test yourself on your own knowledge, imagining questions one might ask and answering them. Traditional “how to teach/mentor/tutor” advice/resources are helpful.
- Understand people’s motivations asap
  - For a given cohort, some will be motivated by “Do Good” EA or EA-adjacent reasons. Some by trends. Some by interest. Broadly all are valid in that there’s a place for a skilled someone with any of these motivations (but certain positions are only fit for certain motivations).
- For people who want to have a lot of good impact, contributing towards good futures, context matters a lot
  - I recommend they seriously upskill in understanding AI progress and trendlines. Read most EpochAI work and Peter Wildeford’s stuff, and AI threat models and scenarios, in particular thinking through AI 2027, contrasting with all other threat models listed in DeepMind’s lit review.
  - I recommend reading the AI Safety Atlas and understanding everything in it, as basic context for what the gameboard looks like, what areas exist in AIS.
  - And they’d benefit from building a pipeline for new info. Being involved in AIS community, having friends and colleagues, following the right people on twitter. Checking LessWrong from time to time.
- For people who want to have a career or do fun research, they can just focus on that. They should find a context where someone else is thinking about why stuff matters and happy to have collaborators/employees executing on stuff. They should get good at whatever particular job or tasks they wanna do.
- Most people interested in AI safety nowadays don’t know the old stuff, haven’t read the sequences. It’d be too much to ask them to read it all, but I do broadly recommend reading the best of LessWrong from most years. Being good at thinking and life is instrumental for being good at AI Safety. Different people need different advice so I can’t really put a best-of here (I really like Please don’t throw your mind away). There is wisdom in many different communities. If you don’t have the “rationalist community” wisdom then invest in it, otherwise keep branching out. Go to burning man or a local burn, understand the basics of meditation and different embodiment practices (eg. yoga), how therapy and self actualization work, read this blog and others. Keep asking questions about how to live the good life and have good community, actively pursue your best understanding of how to do that at any given moment.

vals tutor 24 Jul 2026 23:51 UTC
26 points
5
in reply to: Linch’s comment on: LLMs are (still) mostly powered by imitative learning, not RL
My quick guess-answer is that “compute efficient” here is abstracting away the fact that pre-training trains capabilities we don’t care about, while RL trains capabilities we care about. For example, if only 1 in 10^3 of the “bits of learning” that pretraining does go towards bits we care about, then RL, being much more targeted to stuff we care about, might get a free (up to) 10^3 efficiency increase.
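The arithmetic behind that guess can be sketched as follows. This is a back-of-the-envelope model under my own simplifying assumption that “efficiency” is just the fraction of learning that lands on capabilities we care about; the function name and the 50% example are hypothetical.

```python
# Back-of-the-envelope sketch; the model and names are assumptions.

def max_rl_efficiency_gain(
    useful_fraction_pretraining: float,
    useful_fraction_rl: float = 1.0,
) -> float:
    """Upper bound on RL's relative efficiency from better targeting alone.

    Each argument is the fraction of "bits of learning" that goes towards
    capabilities we care about. Perfectly targeted RL has fraction 1.0.
    """
    if not 0 < useful_fraction_pretraining <= 1:
        raise ValueError("pretraining fraction must be in (0, 1]")
    if not 0 < useful_fraction_rl <= 1:
        raise ValueError("RL fraction must be in (0, 1]")
    return useful_fraction_rl / useful_fraction_pretraining


# The example from the comment: 1 in 10^3 pretraining bits are useful.
perfect_targeting = max_rl_efficiency_gain(1 / 10**3)

# If RL is only half targeted, the bound halves as well.
half_targeting = max_rl_efficiency_gain(1 / 10**3, useful_fraction_rl=0.5)

print(perfect_targeting, half_targeting)
```

The “up to” matters: the 10^3 figure is only reached if RL wastes none of its learning, and it says nothing about how many bits each method extracts per unit of compute.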

vals tutor 22 Jul 2026 22:15 UTC
1 point
0
in reply to: JennaS’s comment on: vals tutor’s Shortform
Thanks for your comments. I don’t expect doing an analysis of my situation in particular is the best use of our time, but I do think these are helpful questions to consider for people in my situation or similar.

Re your last paragraph, I’d happily bet that Anthropic has not reduced their workforce 2 years from now. Yes relative employee disempowerment is an important factor I care about, but it is precisely in worlds where alignment is not that good that having humans in the loop is important (in an obvious seen-by-leadership way). It is only reasonable to automate everything with very very high trust in both the competence and alignment of AI systems, and Anthropic as a company is not that unreasonable, they definitely do find and classify many Claude behaviors as undesirable, and will continue doing so.

There’s a usual back and forth about how much to distrust leadership of AGI companies which is hard to ground in material fact. Some people take the lack of safety actions now to mean lack of care for when it will matter, but conversely the fact that it never mattered yet is a good reason for them not to have cared for these inconsequential things. The explanation for defanging the RSP is a good one, I don’t think people should tie themselves to masts and go blind into the unknown unknowns of AGI development. They should build capacity to remain aware, capacity to pause, have institutions that can do independent audit and have real power to stop them, but not fixed RSP-like stuff.

Finally, still on the last paragraph, the “window between signing on and obsolescence” is very dependent on people’s rates of growth, but also on where they can work immediately. I am generally glad that Joe Carlsmith joined Anthropic to help with the Claude Constitution; I think he is immediately having very significant impact. There is much object-level work to be done to improve the chances of better futures. Even if one later gets automated, having made alignment that much better before full automation could be a significant difference.

vals tutor 21 Jul 2026 23:28 UTC
1 point
0
in reply to: JennaS’s comment on: vals tutor’s Shortform
To the extent that an org does have a “groupthink” and its own theory and value system, whether a given person in the room adheres to it seems to depend on:
1) the selection effect of who wanted to be there, in that particular org vs others
2) the discussions with other people in the org changing that person’s views
3) systematic pressures, eg. greedy/selfish parts of them optimizing to continue getting revenue

If I joined Anthropic and 1 year from now people thought I had surprisingly Anthropic-like views, I’d guess it’s mostly because of 1) and 2). 2) happens a lot but is broadly good. 3) is the one that’s mostly bad, from the outside/civilizational point of view, and the prior should be most people are susceptible to this, but this can be updated away from seeing particular life accomplishments. In my case, I have enough history of independence, selflessness and moral upstandingness that I don’t think 3) will influence me substantially, but I don’t recommend this path to those without that history.

vals tutor 21 Jul 2026 14:57 UTC
12 points
−1
on: vals tutor’s Shortform
Quick history of some of my AI safety involvement and takes:
- 2020 : read LW & sequences, yes this seems important, will get to it at some time (for the classic EA reasons of it’s the most important good), but was looking for my first job in France and took some software engineer position in a startup. I continue reading and upskilling on AIS during those 2 years.
- 2022: quit my SWE job to go into AI safety. reasoning: yeah this is not a problem for the future, this is coming soon. I meet some people who have <5 year AGI timelines. I don’t know that it’s true, but it’s worth considering since it’s the very start of scaling, and it was possible that intelligence would scale much faster with scale than it ended up doing in practice. I work on AI Safety field building and infrastructure. There are then still <100
- 2023 : Still doing AIS field building, but looks like bottleneck is governance, I mostly keep pitching people around me to do comparatively more governance. (over time, this is one of the contributions that leads to the formation of CeSIA, the French Centre for AI Safety)
- 2024 : Involvement with a wider range of ideas from the field has generally updated me more towards the OpenAI and Anthropic and Deepmind positions that prosaic alignment for human level AGI is technically feasible and rather tractable, and that this is indeed a very important input to how we most likely will progress on ASI alignment. I systematically distinguish between AGI risks and ASI risks and clarify the cruxes and assumptions for different threat models and am annoyed that MIRI threat models often skip to the end-game without consideration that certain trajectories towards that end-game falsify important assumptions (see Superintelligence of the Gaps for some elaboration there). I still believe governance is the bottleneck though, and continue contributing ideas that feed into https://www.lesswrong.com/posts/EexsebbYhbe2gXkPP/the-current-bottleneck-is-political-will-not-research
- 2025 : start the year with a burnout after some intense governance work. Ideas wise not too much change with 2024, still doing field building and teaching AI safety theory occasionally, but also take a long break and learn more widely from other wisdom traditions (eg. buddhism, tpot and postrat stuff).
- early 2026 : man does getting good US AI governance fast look intractable given the current administration, at this point it seems better to accelerate good safety work within the AI companies and in the surrounding ecosystem. I broadly wanna contribute to us surviving in the world where we don’t get much of a pause, where governments aren’t that competent and coordinated. I would work for any of the AGI companies on safety if I found an adequate role. I think short timelines to human level AGI (eg. 2028) are plausible, and preparing for the case where algorithmic improvement doesn’t asymptote too shallowly is important. Governance might not affect this in time. Being in the room where it happens seems like the highest leverage way to increase the probability of good AI futures. A bunch of my theory of change is just helping the people in the room where it happens be more wise.

vals tutor 5 Jul 2026 16:14 UTC
3 points
0
in reply to: Karl Krueger’s comment on: Adam Shai’s Shortform
Interestingly it does not at all trigger my AI detection bits. Maybe it’s the “just”, or that it makes sense in context.

vals tutor 13 Jun 2026 14:46 UTC
1 point
0
in reply to: DanielW’s comment on: If you, a human, can imagine red and green being swapped, you are probably conscious
My complaint with the term comes from applying the criteria of https://www.lesswrong.com/posts/FaJaCgqBKphrDzDSj/37-ways-that-words-can-be-wrong
There are already definitions and uses of “data” and “sense”, and if you combine them into “sense data” but mean inner perception/experience, you will confuse a lot of people!

vals tutor 12 Jun 2026 19:47 UTC
2 points
1
in reply to: DanielW’s comment on: If you, a human, can imagine red and green being swapped, you are probably conscious
There should be a word for experiences/qualia, and I think it should not be “sense data”, since in fact a larger part of your experience depends on the processing than the inputs. Whenever you look to a different spot you go blind during the eye movement but your visual field doesn’t blur and come back.

vals tutor 12 Jun 2026 17:01 UTC
3 points
1
in reply to: lewis smith’s comment on: If you, a human, can imagine red and green being swapped, you are probably conscious
I have a similar view as @Charbel-Raphaël that the ‘hard problem’ can be dissolved rather than solved. In that sense again I am an illusionist. It’s not easy to dissolve right now, because of how little we understand consciousness, but I believe that solving the mapping problem precisely will give us enough conceptual leverage to understand why the hard problem is nonsensical.

vals tutor 12 Jun 2026 16:56 UTC
1 point
0
in reply to: DanielW’s comment on: If you, a human, can imagine red and green being swapped, you are probably conscious
Yes the thought experiment is meaningful based on the fact that they are the same. It serves to answer the question “why talk about qualia at all rather than just calling experiences sense data?”. Sense data is generally referred to with respect to a boundary, eg. a camera has sensors that perceive photons. And then there’s processing (eg. sensors’ electric signals are digitized and carried along).
If one can imagine keeping their sense data the same (input to eyes, and input from cones) yet switch the inner experience based on reworking the processing system, then the thing we’re talking about exists at the processing level and is not sense data, thus experience is not sense data.

vals tutor 12 Jun 2026 16:49 UTC
2 points
1
in reply to: Canaletto’s comment on: If you, a human, can imagine red and green being swapped, you are probably conscious
Celene’s first question which cascaded to this post was ~”what should I do if I wasn’t sure if I was conscious?”. It seems here Celene is considering whether she ~uniquely is unconscious, is lacking a quality of experience most other people have. Fwiw I do think she is lacking a quality of experience many people have, which is the intuitive feeling that qualia are a thing separate from the world, but that particular feeling is not what consciousness is imo.
Imagine someone who misunderstood what weight was as a concept, didn’t find it meaningful or interesting, then went around claiming “I am weightless! I am weightless!”. Maybe they even found some definition of weight for which that’s true, and maybe they even talked to very confused people who couldn’t explain what weight is to their satisfaction. Nevertheless, they would not be weightless under the common understanding and usage people have for weight.
I think Celene is doing something in that category of mistake, though it is more understandable since qualia/consciousness are even more confusing than weight and have even more disagreement on definitions and pointers. It wouldn’t be wrong to think for a given definition she has that she isn’t conscious, but I think she’s wrong if she thinks she isn’t conscious while thinking others are.
> Uhh, they are part of the world, and as you just pointed to, they have brain states that can be modified, so they dislike their brains being in some states, right?
In those cases the experiences would be terminal values, with the brain states being instrumental values.
To illustrate the difference, consider a mind who knows of height and width, who is designing a rectangle. They want a rectangle of a particular height, but happen to be working in software which only allows squares, so getting the height right is equivalent to getting the width right. While they’re in that software, you can’t measure whether height or width is their terminal goal, but given another piece of software which allows rectangles, you could realize they only cared about the height.
Similarly, there currently exist minds who value their experiences (eg. not experiencing suffering), but whose only way to mediate that is affecting their brain states. If it were possible to just modify the experiences without affecting the brain states, then we could see the difference. Whether it is logically possible to separate experiences from physical states I am unsure of, and lean negative. But you could now see how someone who does believe they can be separated would be valuing the experiences and not the brain states.

vals tutor 12 Jun 2026 16:32 UTC
1 point
0
in reply to: lewis smith’s comment on: If you, a human, can imagine red and green being swapped, you are probably conscious
Thanks for the link!
> Qualia are supposed to be special properties, in some hard-to-define way
I don’t myself defend qualia as having special properties separate from the physical/logical world, I am probably at least partially, if not fully, an illusionist in that sense. I think I’m speaking against a less-subtle-than-illusionism stance of “I don’t understand definitions/pointers to experiences, I can’t verify I have them, so I can correctly claim I’m unconscious”. There probably exist definitions of consciousness for which Denialist is unconscious, but for the reasonable ones of those I guess ~everyone is unconscious.

vals tutor 12 Jun 2026 16:18 UTC
2 points
0
in reply to: Richard_Kennaway’s comment on: If you, a human, can imagine red and green being swapped, you are probably conscious
I agree! Consciousness/qualia is probably useful if it was selected for, and I’d assume that it notably is/finds a way of connecting different stimuli, and that the correlations in qualia space correspond to correlations in conceptual space. For this reason, and the fact that we’re trained from the same architecture with similar training data, I broadly believe that most humans on earth have pretty similar red qualia to each other, and that you in fact can’t just flip someone’s red qualia without affecting anger and other concepts in their mind.

vals tutor 12 Jun 2026 13:37 UTC
3 points
1
on: Celene’s thoughts on consciousness
I have written a response post, mainly pointing out that the denialist move here stems from failure to understand any of the existing coherent and useful meanings for qualia/experience.

I there explain and argue that qualia is a useful concept, notably to talk about preferences over qualia, as illustrated by the existence of artists who want to create particular experiences, eg. the experience of green. They don’t care about lightwave frequencies, they want to create the experience. There are multiple ways to get there, maybe with direct stimulation of eye cones, maybe with psychedelics, maybe with brain modification.

Understanding the above should clear some of the confusion as to why people talk about qualia. It’s useful.

vals tutor 10 Jun 2026 18:32 UTC
3 points
0
on: The Three Filters: Why Almost Every Plan to Survive ASI Fails Miserably
> Suffice it to say, a large advantage in AI capabilities would allow its creator, or the rogue AI, to perform an extremely low-cost, low-risk takeover of all other countries and actors in the world.
A simple “large advantage” is not enough to get low-cost low-risk takeover. I think most people would say that frontier models have a large advantage over open weight ones (eg. Claude Mythos compared to Kimi K2.6), but keeping this gap into the future would not allow the US to low-cost low-risk take over China.

What advantage would you need? I’d surmise AGI/ASI at least a year ahead of anyone else. You’d probably need to cripple others without detection (otherwise you face retaliation), and still pay a high cost of developing the physical infrastructure to carry out such takeovers (drones and robots).

vals tutor 10 Jun 2026 11:14 UTC
1 point
0
on: vals tutor’s Shortform
This Feb 2026 survey of some AI safety leaders found median timelines of 2033 for the following definition of AGI
> An AI system (or collection of systems) that can fully automate the vast majority (>90%) of roles in the 2025 economy. A job is fully automatable when machines could be built to carry out the job better and more cheaply than human workers. Think feasibility, not adoption.
It featured the following comment
“I think >10% of roles in the 2025 economy are either manual or otherwise require human-like bodies: construction, barbers, restaurant server, etc. If we restrict to knowledge workers (roughly, jobs that can be done on a laptop), these dates move even closer.”
On the current paradigm, AI capabilities progress on niche tasks and diffusion will be linked^[1] and diffusion can go rather slowly even when tools are incredibly productivity enhancing, thus there could be an intuitively surprisingly large gap between automation of 50% human tasks^[2] and 90% and 99%, true even if we restricted the prediction to computer work tasks.^[3]
I’m 80%+ confident we get automated expert+ level coding and ML research by 2030, and that there will be a significant amount of low hanging fruit in software/algorithmic space to allow fast progress on all tasks for which we have data, but I believe generalisation will stay somewhat limited (very very far from “figure out gravity from a picture of a bent blade of grass”, more like “when speaking to a human expert in a niche field, knows how to interview them over 10 to 100 hours to extract most important info and then be mostly autonomous on known tasks, but still needs feedback from reality to learn more”), aka ~human level generalisation at best up to 2031.
The combination of “need feedback from reality” and slow diffusion makes slower timelines to “superintelligence” (eg. better than all humans at 99.99%+ of 2026 tasks) surprisingly plausible (eg. 5 to 10 years between AGI and ASI, thus ASI by 2040). I guess without a pause/significant politically influenced slowdown, we’d 80%+ have ASI by 2040. I’d set my 50% for ASI around 2036.^[4]
I think technical alignment for human level AGI is solvable and not even off track, thus the world will look fine/good in 2030 (few to zero severe power seeking and deceptive misalignment problems in deployment from Anthropic AI systems), but I have high uncertainty about the “use AI to do AI safety work” plan allowing us to successfully know how to train aligned ASI within five years of that. Overall I place myself at 10% or less p(doom) from sharp left turn risks, but around 40% all things considered p(doom) by including gradual disempowerment/value drift and societal response.
1. We need people to be deploying the technology to gather the relevant data to train/learn from, because generalisation is limited and because lots of expert knowledge only exists in human minds and structures of human relationships right now.
2. Note I’m weighing by “meaningfully different task” rather than “frequency of task”. Given power law distributions most tasks might be “read email/slack, respond”, which computer use will know how to operate, but not be able to respond to intricacies of different work situations.
3. Because computer work often involves using domain expert knowledge to do the right things on the computer.
4. I haven’t researched robotics enough to know how fast we could produce and deploy 100 million humanoid robots worldwide which seems like an appropriate level of effort required to gather the required data.

vals tutor 3 Jun 2026 13:38 UTC
2 points
2
in reply to: leogao’s comment on: My favorite depiction of utopia
+1 to this, it’s not obvious to me this would be utopia and I was surprised this is what someone described as their favorite depiction of utopia. Transhumanism is apparently not one position but a myriad of them. I am in many ways closer to enjoying The Culture’s version of utopia. I feel like this one’s focus on avoiding pain is somewhat naive as to the distinctions between pain and suffering.
But also yes, this is probably better than my median expectation for what will happen in the next 30 years.

vals tutor 1 Jun 2026 23:12 UTC
12 points
5
on: “Contagious Humming” to Silence a Room
Cool but too dismissive of the “clap if you can hear me” technique imo. That one has a strong advantage: it can easily be followed by people who’ve never heard of it, because it’s very direct and simple.

But if you can introduce a group norm, my preferred (that I’ve used over a hundred times by now in many crowds) is raising arms and humming. It’s faster for the signal to spread when it’s also visual, and more fun for me. I generally do it as a “my arms slowly come up above me in a very slow clap” while humming, and everyone’s joined by the time I finish the clap
