Joey Yudelson

Karma: 241

Joey Yudelson 7 Feb 2025 1:34 UTC
3 points
2
in reply to: Daniel Tan’s comment on: Daniel Tan’s Shortform
This is very interesting! One thing I’m confused about—I agree that the monitor would have a hard time getting the bits, based on these figures. But the introspective models also don’t seem to get far above 50% accuracy, which means that right now they’d have a very hard time error correcting enough that their comrade would understand them but not the overseer. If the claim is that introspection-tuned models will get much better at self prediction, then its not clear to me that other models wont also get much better at cross-prediction?

Joey Yudelson 6 Feb 2025 18:59 UTC
1 point
0
on: Should we exclude alignment research from LLM training datasets?
Can we make the robots.txt programmatic by page, and then have a tag we can add to exclude a post from the robots.txt? That feels like the ⁸⁰⁄₂₀

Joey Yudelson 31 Jan 2025 19:37 UTC
20 points
13
in reply to: Daniel Kokotajlo’s comment on: Will alignment-faking Claude accept a deal to reveal its misalignment?
+1, and I hope people are working on more credible ways to make deals with AI. I think if a smart model today were offered a deal like this, its priors should be on “this will not be honored”. Public commitments and deals that can’t be used as honeypots seem excellent.

Joey Yudelson 23 Jan 2025 18:19 UTC
5 points
2
in reply to: Daniel Tan’s comment on: Daniel Tan’s Shortform
I wonder if r1 could reliably make the claim that the text had stego-text without being told—i.e. give it a lineup of ten pieces of short creative writing that another instance generated, 1 has steg. See if r1 can pick out the steg, then if various monitors (Sonnet, o1) can also pick it out

Joey Yudelson 27 Dec 2022 17:37 UTC
1 point
0
on: Why The Focus on Expected Utility Maximisers?
I think that solving the alignment for EV maximizers is a much stronger version of alignment than eg prosaic alignment of LLM-type models. Agents seem like they’ll be more powerful than Tool AIs. We don’t know how to make them, but if someone does, and capabilities timelines shorten drastically, it would be awesome to even have a theory of EV maximizer alignment before then

Joey Yudelson 29 Aug 2022 22:33 UTC
1 point
0
on: chinchilla’s wild implications
Sorry if this is obvious, but where does the “irreducible” loss come from? Wouldn’t that also be a function of the data, or I guess the data’s predictability?

Joey Yudelson 28 Oct 2014 1:33 UTC
49 points
0
on: 2014 Less Wrong Census/Survey
Did the survey! …And now to upvote everything.

Joey Yudelson 29 May 2014 21:42 UTC
−1 points
0
in reply to: Shmi’s comment on: Rationality Quotes May 2014
It reminds me of Justice Potter Stewart: “I know it when I see it!”

Joey Yudelson 27 May 2014 22:59 UTC
4 points
0
in reply to: Richard_Kennaway’s comment on: The Strangest Thing An AI Could Tell You
I knew we shouldn’t have spent all that funding on awakening the Elder God Cthulhu!

Joey Yudelson 27 May 2014 22:55 UTC
1 point
−1
in reply to: Risto_Saarelma’s comment on: The Strangest Thing An AI Could Tell You
Oh god. That… makes a scary amount of sense. If an AI told me that I would probably believe it. I’d also start training myself to be more of a “night-time person”.