
Shubhorup Biswas

Karma: 132

Former Software Engineer, now working on AI Safety.

Questions I’m undecided on whose resolution could shift my focus, ranked by decreasing magnitude of the shift:

  1. Should I be a solipsist? If no other sentiences exist then morality is moot.

  2. Are human values worth preserving? Would it be so (morally) bad if our lightcone were forged according to a (misaligned) superintelligent AI’s values instead of human values?

  3. Could technical alignment research be exploited by a TAI to align a superintelligence to its values?

    1. If the technical alignment problem is unsolvable, would this preclude an intelligence explosion, since a TAI would have no way to align a superintelligence to its values?

    2. If the technical alignment problem is solvable and we do partially solve it, would that promote a misaligned AGI’s takeover by making it easier for it to align AIs smarter than itself?

  4. What’s the appropriate level of alarm when we detect misbehaviour in AIs (via CoT monitoring, interpretability, or any other technique)? Almost everything we do in response (directly training against visible misbehaviour, or discarding the model and starting afresh) creates selection pressure towards being stealthy. It would be good to quantify the number of bits of optimisation pressure that gets us into the different regimes (do not misbehave, or misbehave covertly), and how many bits we have already expended (see the sketch below).
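To make the last point concrete, here is a minimal sketch of the kind of accounting I have in mind, under the simplifying assumption that each round of "train against / discard on visible misbehaviour" is a selection event that keeps a fraction p of candidate policies and therefore applies −log2(p) bits of pressure. The pass rates in the example are made up for illustration, not measured.

```python
import math

def bits_of_selection(p_pass: float) -> float:
    """Bits of optimisation pressure applied by one selection round
    that keeps only a fraction p_pass of candidates."""
    assert 0.0 < p_pass <= 1.0
    return -math.log2(p_pass)

# Hypothetical pass rates for three rounds of training/discarding on
# visible misbehaviour: 1 in 8 kept, 1 in 2 kept, 1 in 4 kept.
rounds = [0.125, 0.5, 0.25]
total_bits = sum(bits_of_selection(p) for p in rounds)
print(f"optimisation pressure expended so far: {total_bits:.1f} bits")  # 6.0 bits
```

Comparing a running total like this against an estimate of how many bits separate "does not misbehave" from "misbehaves covertly" is the quantification I would want before deciding how alarmed to be.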

Hidden Reasoning in LLMs: A Taxonomy

25 Aug 2025 22:43 UTC
66 points
12 comments
12 min read
LW link

How we spent our first two weeks as an independent AI safety research group

11 Aug 2025 19:32 UTC
28 points
0 comments
10 min read
LW link

Extract-and-Evaluate Monitoring Can Significantly Enhance CoT Monitor Performance (Research Note)

8 Aug 2025 10:41 UTC
51 points
7 comments
10 min read
LW link

Aether July 2025 Update

1 Jul 2025 21:08 UTC
24 points
7 comments
3 min read
LW link