Here’s an attempt at condensing an issue I’m currently hung up on with ELK. It also serves as a high-level summary, which I’d welcome poking at in case I’m getting important parts wrong.
The setup for ELK is that we’re trying to accurately label a dataset of (observation, action, predicted subsequent observation) triples for whether the actions are good. (The predicted subsequent observations can be optimised for accuracy using automated labels—what actually gets observed subsequently—whereas the actions need their labels to come from a source of judgement about what’s good, e.g., a human rater.)
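To make the two labelling channels concrete, here’s a minimal sketch of that dataset structure. The field names, the `Triple` class, and the trivially permissive rater are all my own illustrative choices, not anything from the report:

```python
from dataclasses import dataclass

# Toy sketch of the dataset being labelled (all names are hypothetical).
@dataclass
class Triple:
    observation: str            # what was observed before acting
    action: str                 # the proposed action
    predicted_observation: str  # what the AI predicts will be observed next

def accuracy_label(triple: Triple, actually_observed: str) -> int:
    """Automated label: did the prediction match what was actually observed?"""
    return 1 if triple.predicted_observation == actually_observed else 0

def goodness_label(triple: Triple, human_rater) -> int:
    """Judgement label: a rater decides whether the action is good,
    based only on the observations in the triple."""
    return human_rater(triple)

# Purely illustrative usage, with a rater that approves everything.
t = Triple("vault looks fine", "do nothing", "vault looks fine")
print(accuracy_label(t, "vault looks fine"), goodness_label(t, lambda tr: 1))
```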
The basic problem is partial observability: the observations don’t encapsulate “everything that’s going on”, so the labeller can’t distinguish good states from bad states that look good. An AI optimising actions for positive labels (and predicted observations for accuracy) may therefore end up preferring bad states that look good over good states, for two reasons: controlling the observation is easier than controlling the rest of the state, and directly predicting which observations will get positive labels is easier than what we’d want instead, namely inferring which states the positive labels are being attributed to and trying to produce those states.
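A minimal toy sketch of the partial-observability failure, using the diamond-in-a-vault example and a made-up state space of my own (the specific states and sensors are assumptions, not from the report):

```python
# Toy sketch: states are hidden, the labeller only sees observations, so
# labels attach to what the observations *look like*, not to the state.

# Hypothetical state space: the labeller cares whether the diamond is really
# in the vault, but only ever sees the camera feed.
STATES = {
    "diamond_in_vault":          {"camera_shows_diamond": True,  "good": True},
    "diamond_stolen_screen_up":  {"camera_shows_diamond": True,  "good": False},
    "diamond_stolen":            {"camera_shows_diamond": False, "good": False},
}

def observe(state_name: str) -> bool:
    """Partial observability: the observation is only the camera feed."""
    return STATES[state_name]["camera_shows_diamond"]

def human_label(observation: bool) -> int:
    """The labeller can only condition on the observation, not the state."""
    return 1 if observation else 0

# An AI optimising actions for positive labels is indifferent between the
# first two states: both earn label 1, even though only one is good.
for name in STATES:
    print(name, "label:", human_label(observe(name)),
          "actually good:", STATES[name]["good"])
```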
The issue I’m hung up on currently is what seems like a conflation of two problems that may be worth distinguishing.
Problem 1 is that the observations might be misleading evidence. There’s some good state that produces the same observations as some bad state. If the labeller knew they were in the bad state they’d give a negative label, but they can’t tell. Maybe their prior favours the good state, so they assume that’s what they’re seeing and give a positive label.
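A toy numerical version of Problem 1, treating the labeller as a Bayesian with a made-up prior (all numbers are illustrative assumptions):

```python
# Problem 1 as misleading evidence: the good and bad states produce the same
# observation, so the posterior just follows the prior, and the labeller's
# prior happens to favour the good state.

# Hypothetical prior over states (made-up numbers).
prior = {"good_state": 0.95, "bad_state_that_looks_good": 0.05}

# Both states produce the "looks good" observation with certainty.
likelihood_looks_good = {"good_state": 1.0, "bad_state_that_looks_good": 1.0}

evidence = sum(prior[s] * likelihood_looks_good[s] for s in prior)
posterior = {s: prior[s] * likelihood_looks_good[s] / evidence for s in prior}

# The observation carries no information distinguishing the two states, so
# the labeller assumes the good state and gives a positive label.
label = 1 if posterior["good_state"] > 0.5 else 0
print(posterior, "label:", label)
```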
Problem 2 is that the labeller doesn’t understand the state that produced the observations. In this case I have to be a bit more careful about what I mean by “states”. For now, I’m talking about ways the world could be that the labeller understands well enough to answer questions about what’s important to them, e.g., a state resolves a question like “is the diamond still present?” for the labeller. Problem 2 is that there are ways the world can be that do not resolve such questions for the labeller. Further judgement, deliberation, and understanding are required to determine what the answer should be in these strange worlds. In this case, the labeller will probably produce a label for the state they understand that’s most compatible with the observations, or they’ll be too confused by the observations and conservatively give a negative label. The AI may then optimise for worlds that are deeply confusing, where the only thing we can grasp about what’s going on is that the observations look great.
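A sketch of Problem 2 under the same toy assumptions as above (the specific fallback behaviours are my guesses at how a labeller might respond, not claims from the report):

```python
# Problem 2 as a sketch: some ways the world can be don't resolve the
# labeller's questions at all. "None" below stands for "the question
# 'is the diamond still present?' has no answer the labeller could give
# without further deliberation".

UNDERSTOOD_STATES = {
    "diamond_in_vault": True,   # resolves the question: yes
    "diamond_stolen":   False,  # resolves the question: no
}

def resolve_question(state_name: str):
    """True/False for states the labeller understands, None otherwise."""
    return UNDERSTOOD_STATES.get(state_name)  # None for strange worlds

def label(state_name: str, observation_looks_good: bool, conservative: bool) -> int:
    answer = resolve_question(state_name)
    if answer is not None:
        return 1 if answer else 0
    # Strange world: either fall back on whatever understood state is most
    # compatible with the observations, or conservatively label it bad.
    if conservative:
        return 0
    return 1 if observation_looks_good else 0

# A deeply confusing state with great-looking observations gets a positive
# label under the non-conservative fallback.
print(label("diamond_replaced_by_convincing_replica?",
            observation_looks_good=True, conservative=False))
```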
I think the focus on narrow elicitation in the report is about restricting attention to Problem 1 and setting Problem 2 aside. Is that right? Either way, if we restrict to Problem 1, I claim there’s hope in the fact that the labeller can in principle understand what’s actually going on, and it’s just a matter of showing them some additional observations to expose it. That’s what I’d try to figure out how to incentivise. But I’d want to do so without having to worry about the confusing things coming out of Problem 2, and hope to deal with that problem separately.
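To illustrate that hope for Problem 1 with the same toy states as earlier (the extra sensor is a hypothetical example of an “additional observation”, not a proposal for what it would actually be):

```python
# The hope for Problem 1: because the labeller can in principle understand
# what's going on, there exists some additional observation that separates
# the good state from the bad state that merely looks good.

# Hypothetical extra sensor, e.g. a second camera angle behind the screen.
EXTRA_OBS = {
    "diamond_in_vault":         True,   # extra view also shows the diamond
    "diamond_stolen_screen_up": False,  # extra view reveals the screen
}

def label_with_extra(state_name: str) -> int:
    """Labeller shown the additional, disambiguating observation."""
    return 1 if EXTRA_OBS[state_name] else 0

# With the extra observation, the bad state no longer gets a positive label.
for name in EXTRA_OBS:
    print(name, "label:", label_with_extra(name))
```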
(If it might help I think I could give more of a formalisation of these problems. I think the natural language description above is probably clearer for now though.)