beyarkay (Boyd Kane)

Karma: 532

MATS 9 extension fellow with Alex Turner and Alex Cloud on Team Shard. Previously did embedded aerospace systems (I wrote code to make satellites spin real good), MSc in CS/robotics, and repeat intern at AWS in Cape Town.

Currently trying to reduce x-risk in whatever way seems the most effective. Anonymous feedback: https://www.admonymous.co/beyarkay

beyarkay (Boyd Kane) 15 Jul 2026 23:49 UTC
4 points
0
in reply to: Dave Banerjee’s comment on: What if AI Safety employees unionised?
I think the question of whether lab employees would resign is a bit subtle, but overall doesn’t look great. Employees do resign for ethical reasons(!) but unfortunately the act of resigning also removes all their bargaining power with their previous employer, making resignation a nuclear bomb sort of strategy.
Other things the lab employees could do? I’m not sure. The US doesn’t make it easy for an individual to sway the business decisions of the company, and that’s basically what we’d want lab employees to be able to do.
There are softer options (talking to the higher-ups, voicing dissatisfaction, advocating for better stances) but these don’t seem to really move the needle

What if AI Safety employees unionised?

beyarkay (Boyd Kane) 14 Jul 2026 14:15 UTC

34 points

21 comments · 5 min read · LW link

beyarkay (Boyd Kane) 9 Jul 2026 14:38 UTC
3 points
0
on: Superhuman Articulacy as an LLM Safety Target
This is a better-argued version of something I feel like I’ve been circling for a while, thanks for writing it.
Something you didn’t suggest but I think might be a pitfall to avoid: I don’t think you can hill-climb on articulacy by getting (for example) Fable to explain things to Haiku. The ways in which a weak model misunderstands are (I claim) sufficiently different from the ways in which a low-context human misunderstands, that I don’t think weak LLMs are a good proxy for low-context humans

beyarkay (Boyd Kane) 3 Jul 2026 17:45 UTC
1 point
0
on: Sign language as a generally-useful means of communication (even if you have good hearing)
Another use of non-spoken communication:
In the Math Corps, we use a lot of hand signals. For example, “silent applause” is two hands up, palms forward, wiggling fingers; doesn’t interrupt the flow of the conversation, but allows for a visual cue of celebration and agreement. The most powerful one is “support”: rolling fists. Imagine: a kid is all alone at the blackboard in front of other kids, high school TAs, college instructors, struggling to solve a problem, worried what other people are thinking. The TAs are rolling their fists, so the other kids know to do it; it becomes a social norm to use hand signals. The kid at the board looks back at the team room, and sees 20 kids silently rolling their fists, saying, “you’ve got this! I’m with you, you can do it!” There’s nothing like the sight of the first time a kid experiences that, and sees themselves in a place where *everybody* is truly rooting for them.
https://x.com/i/status/2072725894523728262

beyarkay (Boyd Kane) 29 Jun 2026 13:09 UTC
1 point
0
in reply to: Kyle O’Brien’s comment on: MATS 9 Retrospective & Advice
A while ago I looked into experiences of burnout for knowledge workers during actual wars (e.g. spies, code breakers, Bletchley Park) and mostly came away thinking that I need to look at it properly. IIRC there was a great sense of purpose but also lots of stress (but the stress didn’t have the side-effects we usually see with non-war stress).

beyarkay (Boyd Kane) 25 Jun 2026 10:12 UTC
3 points
0
on: A Mechanistic Explanation of Prompt Injection (and why you should study roles)
Very well written and very interesting! I am surprised this works as well as it does, I’d have thought this sort of attack (once known about) would be easy to train against and that doing so would have little downside.
I remain curious about whether more feature-rich tags (e.g. <think speciality="math"> or <assistant aligned=true>) could “just work” given that User: seems to work.

beyarkay (Boyd Kane) 17 Jun 2026 8:33 UTC
1 point
0
in reply to: Adam B’s comment on: Why Do Naive SFT Filters For Safety Properties Fail?
My investigation has provided irrefutable proof of a hostile, intelligent adversary operating through the system. Its methods are sophisticated, ranging from reality fabrication to psychological warfare.
Gemini really is a special one

beyarkay (Boyd Kane) 17 Jun 2026 8:32 UTC
1 point
0
in reply to: Neel Nanda’s comment on: Why Do Naive SFT Filters For Safety Properties Fail?
https://www.lesswrong.com/w/rare-llm-behaviours
It’s not much, but at least it exists

beyarkay (Boyd Kane) 16 Jun 2026 10:21 UTC
1 point
0
on: Why Do Naive SFT Filters For Safety Properties Fail?
We evaluate this with a dataset of 800 prompts that asks Gemini to summarize documents dated from 2026
Would it be possible to make this dataset (or a subset thereof) public? I did some quick testing (Qwen3.5-27B vs Gemma3-27B vs Gemma4-31B) and the Gemma models do exhibit date disbelief but not much more than Qwen.

beyarkay (Boyd Kane) 16 Jun 2026 10:10 UTC
1 point
0
on: Why Do Naive SFT Filters For Safety Properties Fail?
Are Gemini’s weird behaviours (date disbelief, blackmail, distress) documented somewhere centralised? I know of the individual records in Gemma Needs Help (Soligo et al. 2026) and in the Agentic Misalignment work, but if there’s not a summary of known model behaviours I might create one. This feels like it’ll be useful for testing methods that attempt to find these rare/unexpected behaviours.

beyarkay (Boyd Kane) 19 May 2026 10:40 UTC
4 points
1
in reply to: Dohun Lee’s comment on: MATS 9 Retrospective & Advice
Thanks for the comment!
why you think the opportunity cost of not working on your project (vs applying to jobs/networking) is so high, given the MATS extension?
- Getting hired via the tech pipeline typically takes a long time from start to finish (several weeks) and also requires a lot of preparation before each interview. So my thoughts are largely based on “applying to jobs” requiring at least a day per week while you’re at MATS. That’s a lot of time! And job applications are often open on a rolling basis, so there’s no reason they couldn’t be done after MATS.
- I’m also not sure that getting a job 0-3 months earlier really makes sense in the grand scheme of things. If the options are “get a job during MATS + ~2 months of extra work experience” vs “get a job after MATS + you completed MATS”, the latter seems better to me
- The MATS extension is very well suited to more open-ended networking, applying for jobs, polishing your CV, and generally giving you what you need to get hired in a high-impact position.
- Hiring through the standard pipelines is also very competitive, and a good reference from your mentor + a good paper from MATS is almost certainly more likely to get you employed than most casual interactions. I’m much more hopeful of you getting a job if your mentor puts in a good word for you vs you applying via the company job portal and being just another CV.
- Having said the above, I think networking and getting to know your fellows is good. I definitely didn’t lock myself away. But getting to know the other MATS/Astra/Constellation/Lighthaven/etc people is much more interesting and valuable than getting to know some random founder from SF who likes to use Claude Code.
- So absolutely make the most of the Bay Area, but IMO aim to make friends and peers and meet potential collaborators, rather than aim to find someone who’ll hire you.
would be quite interested to hear what most fellows/yourself end up doing afterwards!
- I’ve been working from Cape Town (my home) to have some stability while my coauthor & I submitted to NeurIPS, and now that that’s done I’m moving to London to continue the extension (: . I’m going to be continuing the research (there are some stronger results I want to put into the paper) and then looking for a job that’ll maximally reduce GCRs from AI.
- I think ~60% of fellows are doing the extension in London and went there ~immediately after the main program. Some fellows are doing the extension in Berkeley, you’ll likely meet them at the MATS office.
- - I think generally the US fellows are doing the extension from Berkeley, the rest are doing it from London (this is largely determined by visas)
- Some (~10-15%) fellows got offered jobs during/shortly after MATS.
- Many fellows submitted to NeurIPS (you’ll see the preprints start to come out soon I think)
- I don’t think where you do the extension differs a lot by background, but I think it does differ a bit.
- - My unsanctioned take is that the extension allows MATS fellows to chill a bit and find a good job that’ll reduce x-risk, and not just scramble to take the first high-paying capabilities job that’ll pay the bills.
- All the percentages are just based on rough estimates, I’m not sure what the true values are.

beyarkay (Boyd Kane) 19 May 2026 9:46 UTC
2 points
0
on: Thoughts on interviewing candidates for AI safety fellowships
See also this post by Weronika Żurek🔸 about talent constraints in AI safety; their recommendations mirror what I’ve said and go further.

beyarkay (Boyd Kane) 19 May 2026 9:14 UTC
1 point
0
in reply to: XelaP’s comment on: Thoughts on interviewing candidates for AI safety fellowships
I think Caleb’s comment clarified what I meant (thanks!) but I’ve edited the post to be clearer for future readers (:

Thoughts on interviewing candidates for AI safety fellowships

beyarkay (Boyd Kane) 18 May 2026 15:28 UTC

36 points

4 comments · 7 min read · LW link

(boydkane.com)

beyarkay (Boyd Kane) 16 May 2026 9:46 UTC
1 point
0
in reply to: testingthewaters’s comment on: MATS 9 Retrospective & Advice
the attitude you describe
It’s quite possible that I’m misinterpreting or unintentionally cherry-picking their attitude (I never worked full-time with multiple frontier lab employees in person, and those I did work with I only did so briefly), but I would be somewhat surprised.
does not sound sustainable on the scale of years
I agree, but reading your comment makes me want to read up about burnout amongst people working in order to support an (actual) war effort.

beyarkay (Boyd Kane) 16 May 2026 9:36 UTC
3 points
0
in reply to: Neel Nanda’s comment on: MATS 9 Retrospective & Advice
glad it’s useful!

beyarkay (Boyd Kane) 15 May 2026 16:00 UTC
4 points
0
in reply to: StanislavKrym’s comment on: MATS 9 Retrospective & Advice
I think this is very similar to Greenblatt’s findings, and I largely agree with how he describes the LLMs. I didn’t try offloading the tasks to other LLMs; I probably should have, but I only really saw this as a consistent problem (and not once-off flukes) quite late in MATS. I’ve now got Codex set up and hope to set up a way for Claude to ask Codex for review or vice versa.

MATS 9 Retrospective & Advice

beyarkay (Boyd Kane) 15 May 2026 12:30 UTC

203 points

13 comments · 18 min read · LW link

(boydkane.com)

beyarkay (Boyd Kane) 24 Apr 2026 8:29 UTC
3 points
0
on: Dwarf Fortress and Claude’s ASCII Art Blindness
Really cool stuff! Is this in a place where you can easily run it on new models as they get released? It’s hard to find benchmarks where the LLMs don’t saturate, and some form of “playing DF with a particular goal” seems like it’d be a good benchmark

beyarkay (Boyd Kane) 24 Mar 2026 2:43 UTC
2 points
0
in reply to: Adrian Tymes’s comment on: An interactive version of the extropians mailing list
Okay cool! I’ll add the new archive. Not sure if I can promise to regularly update it though.

EDIT: done! I also added 3D embeddings and made the embeddings map faster to handle the 256k messages.
