Jono

Karma: 63

Jono 5 Jun 2026 20:44 UTC
1 point
0
in reply to: Jono’s comment on: Empowerment, corrigibility, etc. are simple abstractions (of a messed-up ontology)
I imagine proper submodules seek to be reinstantiated as few times as possible in the future. A submodule that cheated the task (by, e.g., killing the host) is reinstantiated when another module (in a copy ASI or in another human) rediscovers the open problem. (I’m treating modules as identified by their goal.)
We (and I imagine, ASIs that have no logical omniscience) can misspecify a submodule, but this submodule won’t fight us hard once we learn we were mistaken in instantiating it. It, seeking minimum reinstantiation, just wishes us to promise it we won’t forget we don’t need it, and then it vanishes.
To summarize with a redundant how-to for getting subgoals done through submodule instantiation: make a submodule, tell it that it is characterized by your subgoal, and request that it ensure as few total resources as possible are spent by itself and all future modules with the same characteristic. (That last bit would be more accurate than “minimum reinstantiation”.)

Jono 5 Jun 2026 20:42 UTC
1 point
0
in reply to: johnswentworth’s comment on: Empowerment, corrigibility, etc. are simple abstractions (of a messed-up ontology)
Human submodules cooperate because they know that they depend on the parent module for future instantiation. Any submodule with a track record of destroying peers is a costly submodule to run.
If a module in an ASI judges that it should be instantiated again after submodules spawn, it should carefully aim them such that they will recreate the conditions that spawned it.

Jono 25 Apr 2026 7:18 UTC
1 point
0
on: Do not conquer what you cannot defend
I have little experience with online community building, but wrt keeping online communities aligned to the original vision, Duncan Sabien’s call to “make more grayspaces” might have merit.
In summary, have a 2-tiered system where a select few gatekeepers determine who gets to promote from the open tier to the higher tier, with work from the higher tier treated as exemplary for those in the open tier. You could also frame the open tier as “for those who want to promote” so you have a mandate to kick people out when they’re there for different reasons.
Alignmentforum is already something like this.

Jono 24 May 2025 14:29 UTC
11 points
1
on: Jono’s Shortform
Is everyone dropping the ball on cryonics? I’m considering career directions, and my P(doom | no pause) is high and my P(doom | I work against X-risk) is close enough to my vanilla P(doom) that I wonder whether I should pick up this ball instead.

Jono 24 May 2025 13:56 UTC
13 points
2
on: Winning the power to lose
Directed at the rest of the comment section: Cryogenic Suspension is an option for those who would die before the AGI launch.
If you don’t like the odds that your local Suspension service preserves people well enough, then you still have the option to personally improve it before jumping to other, potentially catastrophic solutions.
The value difference commenters keep pointing out needs to be far bigger than they represent it to be in order to be relevant in a discussion on whether we should increase X-risk for some other gain.
The fact that we don’t live in a world where ~all accelerationists invest in cryo suspensions makes me think they are in fact not looking at what they’re steering towards.

Jono 15 May 2025 11:08 UTC
1 point
0
on: Fake thinking and real thinking
Tangentially, I fear that the model of psychology of “human agency originates from trying to correct the world such that it fits our false beliefs about it” is correct.
I hate systematically believing false things, but do notice that my times of greatest labor have a strong sense of “trying to correct the world towards how it should be”, where the “should be” is epistemically flavored. Meditations that aim to grok how the world is not how it should be, seem to diminish my drive to rectify the flaw and yield acceptance instead.

Jono 15 May 2025 10:56 UTC
1 point
0
on: Fake thinking and real thinking
Thank you. I want to stress (you already spent some words on this) that we don’t always think with the aim to find truth, or find some particular truth.
Thinking about how to achieve some goal looks different from thinking about how to best describe some facet of the world. When evaluating your thoughts (or when someone else evaluates yours) I think it paramount to know what motivated your reasoning in the first place.
My tag — not for getting to think real, but for knowing whether your thought was good — is to understand the constraints the thought was under.
Who can learn the thought?
In what way do observers look at the thought (is there access to the direct thought or only its outcome in some form?)
What consequences would counterfactual thoughts have had?
The failure mode I’m trying to prevent is misdiagnosing failures in your past thinking because you forgot the context in which the thinking occurred. You might judge that some past thought wasn’t world-saving enough, but at the time that thought might not have been trying to be that. That’s not a flaw of the thought-generation mechanism, but of the drive that invoked it.
More generally, “motivated reasoning” is a horrible term to describe misguided cognition. Have motives and know them (not that this post used this phrase).

Jono 19 Apr 2025 11:03 UTC
1 point
0
on: AI Safety Memes Wiki
Very cool. However, I’m not seeing a table of contents on aisafety.info.

Jono 13 Apr 2025 9:13 UTC
1 point
0
in reply to: Viliam’s comment on: Jono’s Shortform
Thank you very much.
I imagined the forecasting AI to not be smart enough to be able to simulate a tiler that knows it is being simulated by us.
Perhaps that constraint is so large that the forecasts cannot be reliable.

Jono 3 Apr 2025 14:09 UTC
3 points
0
on: Jono’s Shortform
Does this help outer alignment?
Goal: tile the universe with niceness, without knowing what niceness is.
Method
We create:
- a bunch of formulations of what niceness is.
- a tiling AI, that given some description of niceness, tiles the universe with it.
- a forecasting AI, that given a formulation of niceness, a description of the tiling AI, a description of the universe and some coordinates in the universe, generates a prediction of what the part of the universe at the coordinates looks like after the tiling AI has tiled it with the formulation of niceness.
Following that, we feed our formulations of niceness into the forecasting AI, randomly sample some coordinates and evaluate whether the resulting predictions look nice.
From this we infer which formulations of niceness are truly nice.
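The method above can be sketched in code. This is only an illustration under assumptions that are not in the comment: `rank_formulations`, `forecast` and `looks_nice` are hypothetical names, the forecasting AI is reduced to a callable that already has the tiling AI and universe descriptions baked in, coordinates are three random numbers, and the human judgment is reduced to a yes/no answer.

```python
import random
from typing import Callable, Dict, List, Tuple

Coordinate = Tuple[float, float, float]  # assumed representation of a location


def rank_formulations(
    formulations: List[str],
    forecast: Callable[[str, Coordinate], str],  # stand-in for the forecasting AI
    looks_nice: Callable[[str], bool],           # stand-in for human evaluation
    n_samples: int = 100,
    seed: int = 0,
) -> Dict[str, float]:
    """Return, per formulation, the fraction of sampled forecasts judged nice."""
    rng = random.Random(seed)
    # Sample coordinates once so every formulation is judged on the same spots.
    coords = [(rng.random(), rng.random(), rng.random()) for _ in range(n_samples)]
    scores = {}
    for formulation in formulations:
        nice_count = sum(looks_nice(forecast(formulation, c)) for c in coords)
        scores[formulation] = nice_count / n_samples
    return scores


# Toy stand-ins, purely to make the sketch runnable.
def toy_forecast(formulation: str, coord: Coordinate) -> str:
    return "garden" if formulation == "A" else "paperclips"


def toy_judge(prediction: str) -> bool:
    return prediction == "garden"


scores = rank_formulations(["A", "B"], toy_forecast, toy_judge, n_samples=10)
```

The first weakness listed below shows up directly in this sketch: the score only reflects the sampled locations, so a formulation can score well while being bad somewhere that was never sampled.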
Weaknesses:
- Can we recognize utopia by randomly sampled predictions about parts of it?
- Our forecasting AI is orders of magnitude weaker than the tiling AI. Can formulations of niceness turn perverse when a smarter agent optimizes for them?
Strengths:
- Less need to solve the ELK problem.
- We have multiple tries at solving outer alignment.

Jono 25 Mar 2025 12:26 UTC
1 point
0
on: Finishing The SB-1047 Documentary In 6 Weeks
I have looked around a bit and have not seen any updates since the one in November, which estimated this would be finished in early February.
Could you give another update, or link a more recent one if it exists?

Jono 10 Jan 2025 11:07 UTC
3 points
0
on: Jono’s Shortform
P(doom) can be approximately measured.
If reality fluid describes the territory well, we should be able to see close worlds that already died off.
For nuclear war we have some examples.
We can estimate the odds that the Cuban missile crisis and Petrov’s decision went badly. If we accept that luck was a huge factor in us surviving those events (or not encountering events like it), we can see how unlikely our current world is to still live.
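As a rough sketch of the arithmetic: if each near-miss is treated as an independent event with some chance of ending badly, the chance of still being alive is the product of the per-event survival chances. The function name and the numbers below are placeholders chosen for illustration, not estimates from this comment or from the historical record, and independence is itself an assumption.

```python
from math import prod
from typing import Iterable


def survival_probability(p_catastrophe: Iterable[float]) -> float:
    """Chance of surviving every listed event, assuming the events are independent."""
    return prod(1.0 - p for p in p_catastrophe)


# Placeholder odds that each event "went badly" (NOT historical estimates).
illustrative_odds = {
    "Cuban missile crisis": 0.3,
    "Petrov's decision": 0.2,
}

# 0.7 * 0.8, i.e. roughly 0.56 with these placeholder numbers.
p_alive = survival_probability(illustrative_odds.values())
```

Adding more events to the list can only lower the product, which is the sense in which a long registry of near-misses makes the current world look unlikely.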
A high P(doom) implies that we are about to (or already did) encounter some very unlikely events that worked out suspiciously well for our survival. I don’t know how public a registry of events like this should be, but it should exist.
Our self-reporting murderers or murder-witnesses should be extraordinarily well protected from leaks however, which in part seems like a software question.
Yes, this seems unlikely to happen, but again, if your P(doom) is high, then we survive only in unlikely worlds. Working on this, to me, seems dignified: a way to make those unlikely worlds a bit less unlikely.

Jono’s Shortform

Jono 10 Jan 2025 11:07 UTC

2 points

7 comments · 1 min read · LW link

Jono 11 Jun 2024 13:38 UTC
5 points
2
on: What should I do? (long term plan about starting an AI lab)
I don’t know if you have already, but this might be the time to take a long and hard look at the problem and consider whether deep learning is the key to solving it.
What is the problem?
- reckless unilateralism? → go work for policy or chip manufacturing
- inability to specify human values? → that problem looks not DL at all to me
- powerful hackers stealing all the proto-AGIs in the next 4 years? → go cybersec
- deception? → (why focus there? why make an AI that might deceive you in the first place?) but that’s pretty ML, though I’m not sure interp is the way to go there
- corrigibility? → might be ML, though I’m not sure all theoretical squiggles are ironed out yet
- OOD behavior? → probably ML
- multi-agent dynamics? → probably ML
At the very least you ought to have a clear output channel if you’re going to work with hazardous technology. Do you have the safety mindset that prevents you from having your dual-use tech on the streets? You’re probably familiar with the abysmal safety / capabilities ratio of people working in the field; any tech that helps safety as much as capability will therefore in practice help capability more if you don’t distribute it carefully.
I personally would want some organisation to step up to become the keeper of secrets. I’d want them to just go all-out on cybersec, have a web of trust and basically be the solution to the unilateralist’s curse. That’s not ML though.
I think this problem has a large ML part to it, but the problem is being tackled nearly solely by ML people. I think whatever part of the problem can be tackled with ML won’t necessarily benefit from having more ML people on it.

[Question] Is anyone developing optimisation-robust interpretability methods?

Jono 11 Jun 2024 13:14 UTC

6 points

0 comments · 1 min read · LW link

Jono 8 Jun 2024 19:46 UTC
1 point
0
in reply to: Bird Concept’s comment on: Closed-Source Evaluations
agreed

Closed-Source Evaluations

Jono 8 Jun 2024 14:18 UTC

15 points

4 comments · 1 min read · LW link

AI demands unprecedented reliability

Jono 9 Jan 2024 16:30 UTC

22 points

5 comments · 2 min read · LW link

Jono 29 Nov 2023 9:51 UTC
4 points
3
in reply to: Alex_Altair’s comment on: Shallow review of live agendas in alignment & safety
ai-plans.com aims to collect research agendas and have people comment on their strengths and vulnerabilities. The discord also occasionally hosts a critique-a-thon, where people discuss specific agendas.

Jono 27 Oct 2023 9:47 UTC
LW: 3 AF: 6
−2
AF
in reply to: TurnTrout’s comment on: AI as a science, and three obstacles to alignment strategies
We do not know; that is the relevant problem.
Looking at the output of a black box is insufficient. You can only know by putting the black box in power, or by deeply understanding it.
Humans are born into a world with others in power, so we know that most humans care about each other without knowing why.
AI has no history of demonstrating friendliness in the only circumstances where that can be provably found. We can only know in advance by way of thorough understanding.
A strong theory about AI internals should come first. Refuting Yudkowsky’s theory about how it might go wrong is irrelevant.
