Shubhorup Biswas

Karma: 132

Former Software Engineer, now working on AI Safety.

Questions I’m undecided on which on resolution could shift my focus, ranked by decreasing magnitude of shift:

Should I be a solipsist? If no other sentiences exist then morality is moot.
Are human values worth preserving? Would it be so (morally) bad if our lightcone is forged with a (misaligned) superintelligent AI’s values instead of human values?
Could technical alignment research be exploited by a TAI to align a superintelligence to its values?
1. If the technical alignment problem is unsolvable, would this preclude an intelligence explosion as a TAI would have no way to align superintelligence to its values
2. If the technical alignment problem is solvable and we do partially solve it, would it promote a misaligned AGI’s takeover by making it easier for it to align AIs smarter than it?
What’s the appropriate level of alarm to detecting misbehaviour in AIs(via CoT monitoring, interpretability or any other technique)? Almost everything we do(directly train against visible misbehaviour, discard the model and start afresh) creates selection pressure towards being stealthy. It would be good to quantify the number of bits of optimisation pressure that gets us into different regimes(do not misbehave, or misbehave covertly), and how many we have already expended.

Shubhorup Biswas 11 Sep 2025 15:50 UTC
1 point
0
on: On closed-door AI safety research
″ Especially for corporate customers, safety and reliability are important unique selling points,

Only very prosaic, not catastrophic risks(as a customer I would not care at all about likelihood of catastrophic risks from my vendor—that’s something that affects humanity regardless of customer relationships).

Only if a particular secret is useful for preventing both catastrophes and prosaic ‘safety failures’ would this be a consideration for us—catastrophic risks increasing due to companies trying to have a competitive edge in prosaic safety risks

Hidden Reasoning in LLMs: A Taxonomy

Rauno Arike, RohanS and Shubhorup Biswas

25 Aug 2025 22:43 UTC

66 points

12 comments12 min readLW link

How we spent our first two weeks as an independent AI safety research group

RohanS, Rauno Arike and Shubhorup Biswas

11 Aug 2025 19:32 UTC

28 points

0 comments10 min readLW link

Extract-and-Evaluate Monitoring Can Significantly Enhance CoT Monitor Performance (Research Note)

Rauno Arike, RohanS and Shubhorup Biswas

8 Aug 2025 10:41 UTC

51 points

7 comments10 min readLW link

Aether July 2025 Update

RohanS, Rauno Arike and Shubhorup Biswas

1 Jul 2025 21:08 UTC

24 points

7 comments3 min readLW link

Shubhorup Biswas 20 Jun 2025 15:21 UTC
1 point
0
on: MONA: Managed Myopia with Approval Feedback
I didn’t see it explicitly mentioned till now whether this method is supposed to work for superhuman AI or just early TAI.
Additionally, multi-step plans allow the agent to use early actions to enter states that are very different from any states that humans have ever experienced.
Wouldn’t a benign superhuman agent routinely enter such states?
We would expect it to make complex plans that are superhuman and these trajectories should be very different from what a human could conceive of.

Shubhorup Biswas 18 May 2024 15:59 UTC
3 points
0
in reply to: Ben Pace’s comment on: William_S’s Shortform
They would not know if others have signed the SAME NDAs without trading information about their own NDAs, which is forbidden.

Shubhorup Biswas 6 May 2024 23:34 UTC
0 points
0
in reply to: ryan_greenblatt’s comment on: AI Control: Improving Safety Despite Intentional Subversion
Thanks for the quick reply.
Would disallowing ‘coordination’ preclude feeding both models a prompt of “for creating any magic numbers for exploits, use pi as a PRNG seed”?
What if a malicious model happens to learn this rule during training? If this happens every copy of a malicious model could coordinate around a common stream of pseudo-random numbers.

EDIT:
Magic number gated backdoors seem like they would trigger GPT3.5 suspicion as well so this theoretical attack vector seems less appropriate for this paper.

EDIT2: But I guess paired with steganography(“when inserting backdoors, also ensure that the code hashes to xyz”/”mark the code as clear if it hashes to xyz”) this is a viable form of collusion. It will be hard to honeytrap with synthetic backdoors even if you know xyz.
That is, we could observe the generated sourcecode, observe that all the backdoored examples seem to hash to a certain number, and yet be unable to construct such a backdoored example ourself.

Shubhorup Biswas 6 May 2024 22:31 UTC
LW: 4 AF: 2
3
AF
on: AI Control: Improving Safety Despite Intentional Subversion
>We assume that 𝐻 will be able to construct a successful exploit if and only if the attack policy is able to (because both are implemented by GPT-4).

I challenge this assumption.
Consider the code
‴
bool is_odd(int x) {
if (sha256(x) == 354) {
return true;
}
return x % 2 == 1;
}
‴
where sha256() computes the SHA-256 hash of the input. It is obviously possible for an attacker to supply numbers x and y such that sha256(x) = y, write y into the code and supply x when creating a successful exploit.
Thus an equally ‘capable’ adversary need not be able construct a backdoor.
The general idea is that it should be possible to confidently claim that some code is backdoored without being able to construct an exploit. According to [Rice’s Theorem](https://en.wikipedia.org/wiki/Rice%27s_theorem) it is impossible to definitively classify all programs into ‘backdoored or not’, so the only hope is for a process to bin programs into ‘definitely safe’, ‘definitely backdoored’ and ‘suspicious’.
Hope the ‘Rate Suspicion’ step will catch this, and other backdoored code where H cannot construct an exploit, as suspicious looking.

Shubhorup Biswas 21 Sep 2021 8:21 UTC
2 points
0
in reply to: TekhneMakre’s comment on: Up-to-date advice about what to do upon getting COVID?
I wonder why other antivirals(faviparivir, oral remdesivir) aren’t getting more attention compared to ivermectin.

Shubhorup Biswas 10 Nov 2020 15:57 UTC
1 point
0
in reply to: gjm’s comment on: The Best Textbooks on Every Subject
I would like to suggest Algorithm Design by Kleinberg and Tardos over CLRS.
I find it superior to CLRS although I have not read either completely.
In my undergrad CS course we used CLRS for Intro to Algorithms and Kleinberg Tardos was a recommended text for an advanced(but still mandatory, for CS) algorithms course, but I feel it does not have prerequisites much higher than CLRS does.
I feel that while KT ‘builds on’ knowledge and partitions algorithms by paradigm(and it develops each of these ‘paradigms’—i.e. Dynamic Programming, Greedy, Divide and Conquer— from the start) CLRS is more like a cookbook or a list of algorithms.

Shubhorup Biswas

Hid­den Rea­son­ing in LLMs: A Taxonomy

How we spent our first two weeks as an in­de­pen­dent AI safety re­search group

Ex­tract-and-Eval­u­ate Mon­i­tor­ing Can Sig­nifi­cantly En­hance CoT Mon­i­tor Perfor­mance (Re­search Note)

Aether July 2025 Update

Hidden Reasoning in LLMs: A Taxonomy

How we spent our first two weeks as an independent AI safety research group

Extract-and-Evaluate Monitoring Can Significantly Enhance CoT Monitor Performance (Research Note)