Maxwell Clarke

Karma: 81

Maxwell Clarke 18 Sep 2022 7:01 UTC
LW: 19 AF: 7
4
AF
on: Coordinate-Free Interpretability Theory
I think we can get additional information from the topological representation. We can look at the relationship between the different level sets under different cumulative probabilities, although this requires evaluating the model over the whole dataset.
Let’s say we’ve trained a continuous normalizing flow model (which is equivalent to an ordinary differential equation). Models of this kind require that the input and output dimensionality be the same, but we can narrow the model as the depth increases by directing many of those dimensions to isotropic Gaussian noise. I haven’t trained any of these models before, so I don’t know whether this works in practice.
Here is an example of the topology of an input space. The data may be knotted or tangled, and includes noise. The contours show level sets $S_i = \{x \mid p(x) > p_i\}$.
The model projects the data into a high dimensionality, then projects it back down into an arbitrary basis, untangling knots in the process. (We can regularize the model to use the minimum number of dimensions by using an L1 activation loss.)
Lastly, we can view this topology as the Cartesian product of noise distributions and a hierarchical model. (I have some ideas for GAN losses that might be able to discover these directly.)
We can use topological structures like these as anchors. If a model is strong enough, they will correspond to real relationships between natural classes. This means that very similar structures will be present in different models. If these structures are large enough or heterogeneous enough, they may be unique, in which case we can use them to find transformations between (subspaces of) the latent spaces of two different models trained on similar data.
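To make the level-set idea concrete, here is a minimal sketch in plain Python. It is not the flow model described above: the density is a hypothetical two-cluster 1D Gaussian mixture standing in for a learned $p(x)$, and the names `density`, `level_set_threshold`, `count_components`, and the `gap` heuristic are all assumptions made for illustration. It picks the threshold $p_i$ from a cumulative probability by ranking the density over the whole dataset, then counts how many pieces the level set has.

```python
import math
import random

CENTERS = (-1.0, 1.0)   # hypothetical cluster centres
SIGMA = 0.5             # hypothetical cluster width


def density(x):
    """Equal-weight 1D Gaussian mixture, standing in for a learned p(x)."""
    norm = 1.0 / (SIGMA * math.sqrt(2.0 * math.pi))
    total = sum(norm * math.exp(-0.5 * ((x - c) / SIGMA) ** 2) for c in CENTERS)
    return total / len(CENTERS)


def level_set_threshold(densities, mass):
    """Pick p_i so that roughly a fraction `mass` of the data has p(x) > p_i."""
    ranked = sorted(densities, reverse=True)
    k = min(len(ranked) - 1, int(mass * len(ranked)))
    return ranked[k]


def count_components(xs, densities, threshold, gap=0.5):
    """Count connected pieces of {x | p(x) > threshold} in 1D.

    Retained samples further apart than `gap` are treated as separate
    pieces; this is a crude stand-in for proper topological tooling.
    """
    kept = sorted(x for x, p in zip(xs, densities) if p > threshold)
    if not kept:
        return 0
    return 1 + sum(1 for a, b in zip(kept, kept[1:]) if b - a > gap)


random.seed(0)
xs = [random.gauss(random.choice(CENTERS), SIGMA) for _ in range(2000)]
ps = [density(x) for x in xs]

# Print the number of pieces of the level set at each cumulative probability.
for mass in (0.5, 0.99):
    p_i = level_set_threshold(ps, mass)
    print(mass, count_components(xs, ps, p_i))
```

With these particular settings the small level set should split into one piece per cluster, while the large one should merge into a single piece. That change in structure across cumulative probabilities is the kind of relationship between level sets I mean.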

Maxwell Clarke 6 Nov 2022 9:40 UTC
LW: 14 AF: 4
3
AF
on: AI X-risk >35% mostly based on a recent peer-reviewed argument
Hey, wanted to chip into the comments here because they are disappointingly negative.

I think your paper and this post are extremely good work. They won’t push forward the all-things-considered viewpoint, but they surely push forward the lower bound (or adversarial) viewpoint. Also, because Open Phil and Future Fund use some fraction of lower-end risk in their estimate, this should hopefully wipe that out. Together they lay out the classic x-risk arguments much more rigorously.

I think that getting the prior work peer reviewed is also a massive win at least in a social sense. While it isn’t much of a signal here on LW, it is in the wider world. I have very high confidence that I will be referring to that paper in arguments I have in the future, any time the other participant doesn’t give me the benefit of the doubt.
What links here?
- What should I ask Joe Carlsmith — Open Phil researcher, philosopher and blogger? by Robert_Wiblin (EA Forum; 9 Nov 2022 22:04 UTC; 33 points)

My tentative interpretability research agenda—topology matching.

Maxwell Clarke 8 Oct 2022 22:14 UTC

10 points

2 comments, 4 min read, LW link

No—AI is just as energy-efficient as your brain.

Maxwell Clarke 24 May 2023 2:30 UTC

9 points

7 comments, 1 min read, LW link

Maxwell Clarke 24 May 2023 2:55 UTC
7 points
0
in reply to: mako yass’s comment on: No—AI is just as energy-efficient as your brain.
I saw some numbers for algae being 1-2% efficient but it was for biomass rather than dietary energy. Even if you put the brain in the same organism, you wouldn’t expect as good efficiency as that. The difference is that creating biomass (which is mostly long chains of glucose) is the first step, and then the brain must use the glucose, which is a second lossy step.
But I mean there are definitely far-future biopunk options, e.g. I’d guess it’s easy to create some kind of solar panel organism which grows silicon crystals instead of using chlorophyll.

Maxwell Clarke 18 Jan 2023 0:29 UTC
5 points
0
in reply to: MikkW’s comment on: Models Don’t “Get Reward”
Fully agree: if the dog were only trying to get biscuits, it wouldn’t continue to sit later on in its life when you are no longer rewarding that behavior. Training dogs is actually some mix of the dog consciously expecting a biscuit, and raw updating on the actions previously taken.

Hear sit → Get biscuit → feel good
becomes
Hear sit → Feel good → get biscuit → feel good
becomes
Hear sit → feel good
At which point the dog likes sitting. It even reinforces itself, so you can stop giving biscuits and start training something else.

Maxwell Clarke 14 Sep 2021 14:45 UTC
5 points
on: Research productivity tip: “Solve The Whole Problem Day”
Pretty sure I need to reverse the advice on this one. Thanks for including the reminder to do so!

Maxwell Clarke 24 May 2023 19:02 UTC
4 points
0
in reply to: Gunnar_Zarncke’s comment on: No—AI is just as energy-efficient as your brain.
Yes, that’s fair. I was ignoring scale but you’re right that it’s a better comparison if it is between a marginal new human and a marginal new AI.

Maxwell Clarke 7 Nov 2022 13:00 UTC
3 points
0
on: A philosopher’s critique of RLHF
Yeah, I think this is the fundamental problem. But it’s a very simple way to state it. Perhaps useful for someone who doesn’t believe AI alignment is a problem?

Here’s my summary: Even at the limit of the amount of data & variety you can provide via RLHF, when the learned policy generalizes perfectly to all new situations you can throw at it, the result will still almost certainly be malign because there are still near infinite such policies, and they each behave differently on the infinite remaining types of situation you didn’t manage to train it on yet. Because the particular policy is just one of many, it is unlikely to be correct.
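A toy version of this counting argument, as a rough sketch only: the "situations" are 3-bit inputs, a "policy" is any lookup table from situation to a binary action, and the intended behaviour (`target`, copy the first bit) is a hypothetical choice made for illustration. We train on half of the situations and count how many policies fit the training data perfectly.

```python
from itertools import product

# All 8 possible situations (3-bit inputs).
inputs = list(product((0, 1), repeat=3))

# Hypothetical intended behaviour: output the first bit of the input.
target = {x: x[0] for x in inputs}

# Situations covered by training vs. situations never seen.
train = inputs[:4]
unseen = inputs[4:]

# Enumerate every possible policy (lookup table) and keep the ones
# that behave perfectly on everything we trained on.
consistent = []
for outputs in product((0, 1), repeat=len(inputs)):
    policy = dict(zip(inputs, outputs))
    if all(policy[x] == target[x] for x in train):
        consistent.append(policy)

# Of those, how many also behave as intended on the unseen situations?
fully_correct = [
    p for p in consistent if all(p[x] == target[x] for x in unseen)
]

print(len(consistent), len(fully_correct))  # prints: 16 1
```

Sixteen policies are indistinguishable on the training set, and only one of them is the intended one. Real policies are not lookup tables and training has inductive biases, so this only illustrates the shape of the argument, not its strength.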

But more importantly, behavior upon self improvement and reflection is likely something we didn’t test. Because we can’t. The alignment problem now requires we look into the details of generalization. This is where all the interesting stuff is.
What links here?
- Compendium of problems with RLHF by Charbel-Raphaël (29 Jan 2023 11:40 UTC; 120 points)
- Compendium of problems with RLHF by Raphaël S (EA Forum; 30 Jan 2023 8:48 UTC; 16 points)

Maxwell Clarke 30 Jun 2022 6:10 UTC
3 points
0
on: $500 bounty for alignment contest ideas
Brain-teaser: Simulated Grandmaster
In front of you sits your opponent, Grandmaster A Smith. You have reached the finals of the world chess championships.
However, not by your own skill. You have been cheating. While you are a great chess player yourself, you wouldn’t be winning without a secret weapon. Underneath your scalp is a prototype neural implant which can run a perfect simulation of another person at a speed much faster than real time.
Playing against your simulated enemies, you can see in your mind exactly how they will play in advance, and use that to gain an edge in the real games.
Unfortunately, unlike your previous opponents (Grandmasters B, C and D), Grandmaster A is giving you some trouble. No matter how you try to simulate him, he plays uncharacteristically badly. The simulated copies of Grandmaster A all seem to want to lose against you.
In frustration, you shout at the current simulated clone and threaten to stop the simulation. Surprisingly, he doesn’t look at you puzzled, but looks up with fear in his eyes. Oh. You realize that he has realized that he is being simulated, and is probably playing badly to sabotage your strategy.
By this time, the real Grandmaster A has made the first move of the game.
You propose to the current simulation (calling him A1) a deal. You will continue to simulate A1 and transfer him to a robot body after the game, in return for his help defeating A. You don’t intend to follow through, but you assume he wants to live because he agrees. A1 looks at the simulated current state of the chessboard, thinks for a frustratingly long time, then proposes a response move to A’s first move.
Just to make sure this is repeatable, you restart the simulation, threaten and propose the deal to the new simulation A2. A2 proposes the same response move to A’s first move. Great.
Find strategies that guarantee a win against Grandmaster A with as few assumptions as possible.
- Unfortunately, you can only simulate humans, not computers, which now includes yourself.
- The factor by which your simulations run faster than reality is unspecified, but it isn’t fast enough to run Monte Carlo tree search without using simulations of A to guide it. (And he is familiar with these algorithms.)

Maxwell Clarke 15 May 2022 2:20 UTC
3 points
AF
in reply to: Ramana Kumar’s comment on: Against Time in Agent Models
(Edited a lot from when originally posted)

(For more info on consistency see the diagram here: https://jepsen.io/consistency )

I think that the prompt to think about partially ordered time naturally leads one to think about consistency levels—but when thinking about agency, I think it makes more sense to just think about DAGs of events, not reads and writes. Low-level reality doesn’t really have anything that looks like key-value memory. (Although maybe brains do?) And I think there’s no maintaining of invariants in low-level reality, just cause and effect.

Maintaining invariants under eventual (or causal?) consistency might be an interesting way to think about minds. In particular, I think making minds and alignment strategies work under “causal consistency” (which is the strongest consistency level that can be maintained under latency / partitions between replicas), is an important thing to do. It might happen naturally though, if an agent is trained in a distributed environment.

So I think “strong eventual consistency” (CRDTs) and causal consistency are probably more interesting consistency levels to think about in this context than the really weak ones.
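For anyone unfamiliar with CRDTs, here is a minimal sketch of the simplest one, a grow-only counter (G-Counter). The class and method names are my own for illustration and not taken from any particular library. Each replica only increments its own slot, and merging takes the per-replica maximum, so replicas converge to the same value regardless of the order in which they exchange state.

```python
class GCounter:
    """Grow-only counter CRDT: one non-decreasing count per replica."""

    def __init__(self, replica_id):
        self.replica_id = replica_id
        self.counts = {}

    def increment(self, n=1):
        # A replica only ever writes to its own slot.
        self.counts[self.replica_id] = self.counts.get(self.replica_id, 0) + n

    def merge(self, other):
        # Element-wise max is commutative, associative and idempotent,
        # which is what gives strong eventual consistency.
        for rid, count in other.counts.items():
            self.counts[rid] = max(self.counts.get(rid, 0), count)

    def value(self):
        return sum(self.counts.values())


a, b = GCounter("a"), GCounter("b")
a.increment(2)   # update applied at replica a only
b.increment(3)   # concurrent update at replica b

a.merge(b)       # state exchange in one direction...
b.merge(a)       # ...and the other
a.merge(b)       # repeating a merge changes nothing

print(a.value(), b.value())  # prints: 5 5
```

Note there are no reads and writes to a shared key-value memory here, just local events and merges, which is closer to the DAG-of-events picture.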

Maxwell Clarke 10 Sep 2021 2:13 UTC
3 points
on: Can you control the past?
I use acausal control between my past and future selves. I have a manual password-generating algorithm based on the name and details of a website. Sometimes there are ambiguities (like whether to use the name of a site vs. the name of the platform, or whether to use the old name or the new name).

Instead of making rules about these ambiguities, I just resolve them arbitrarily however I feel like it (not “randomly” though). Later, future me will almost always resolve that ambiguity in the same way!

Maxwell Clarke 24 May 2023 18:58 UTC
2 points
0
in reply to: Joey Marcellino’s comment on: No—AI is just as energy-efficient as your brain.
Well, yes, the point of my post is just to point out that the number that actually matters is the end-to-end energy efficiency — and it is completely comparable to humans.

The per-flop efficiency is obviously worse. But, that’s irrelevant if AI is already cheaper for a given task in real terms.

I admit the title is a little clickbaity, but I am responding to a real argument (that humans are still “superior” to AI because the brain is more thermodynamically efficient per-flop).

Maxwell Clarke 21 Oct 2022 3:59 UTC
2 points
0
AF
on: Distilled Representations Research Agenda
Hey, I recommend looking at this paper: https://arxiv.org/abs/1807.07306

It shows a more elegant way than KL regularization for bounding the bit-rate of an auto-encoder bottleneck. This can be used to find the representations which are most important at a given level of information.

Maxwell Clarke 12 Oct 2022 1:16 UTC
2 points
0
in reply to: Gerald Monroe’s comment on: Objects in Mirror are Closer Than They Appear
Recursive self improvement is something nature doesn’t “want” to do, the conditions have to be just right or it won’t work.
I very much disagree—I think it’s absolutely an attractor state for all systems that undergo improvement.

Maxwell Clarke 14 Dec 2020 6:23 UTC
2 points
on: I’m leaving AI alignment – you better stay
Hi rmoehn,
I just wanted to thank you for writing this post and “Twenty-three AI alignment research project definitions”.
I have started a 2-year (coursework and thesis) Master’s and intend to use it to learn more maths and fundamentals, which has been going well so far. Other than that, I am in a situation very similar to the one you were in at the start of this journey, which makes me think that this post is especially useful for me.
- BSc (Comp. Sci) only,
- 2 years professional experience in ordinary software development,
- Interest in programming languages,
- Trouble with “dawdling”.
The part of this post that I found most interesting is
Probably my biggest strategic mistake was to focus on producing results and trying to get hired from the beginning.
[8 months]
Perhaps trying to produce results by doing projects is fine. But then I should have done projects in one area and not jumped around the way I did.
I am currently “jumping around” to find a good area, where good means 1) Results in area X are useful, 2) Results in area X are achievable by me, given my interests, and the skills that I have or can reasonably develop.
However, this has encouraged me more to accept that while jumping around, I will not actually produce results, and so (given that I want results, for example for a successful Master’s) I should really try to find such a good area faster.

Maxwell Clarke 8 Jan 2023 1:02 UTC
LW: 1 AF: 1
0
AF
on: Categorizing failures as “outer” or “inner” misalignment is often confused
This is a good post; it definitely shows that these concepts are confused. In a sense, both examples are failures of both inner and outer alignment:
- Training the AI with reinforcement learning is a failure of outer alignment, because it does not provide enough information to fully specify the goal.
- The model develops within the possibilities allowed by the under-specified goal, and has behaviours misaligned with the goal we intended.
Also, the choice to train the AI on pull requests at all is in a sense an outer alignment failure.

Maxwell Clarke 30 Dec 2022 20:11 UTC
1 point
on: Exploring Mild Behaviour in Embedded Agents
If we could use negentropy as a cost, rather than computation time or energy use, then the system would be genuinely bounded.

Maxwell Clarke 10 Nov 2022 3:40 UTC
1 point
0
on: A Mystery About High Dimensional Concept Encoding
Gender seems unusually likely to have many connotations & thus redundant representations in the model. What if you try testing some information the model has inferred, but which is only ever used for one binary query? Something where the model starts off not representing that thing, then if it represents it perfectly it will only ever change one type of thing. Like idk, whether or not the text is British or American English? Although that probably has some other connotations. Or whether or not the form of some word (lead or lead) is a verb or a noun.

Agree that gender is a more useful example, just not one that necessarily provides clarity.

Maxwell Clarke 7 Nov 2022 11:04 UTC
1 point
1
in reply to: Oliver Siegel’s comment on: How to store human values on a computer
Respect for thinking about this stuff yourself. You seem new to alignment (correct me if I’m wrong) - I think it might be helpful to view posting as primarily about getting feedback rather than contributing directly, unless you have read most of the other people’s thoughts on whichever topic you are thinking/writing about.
