Andrew McKnight

Karma: 68

Andrew McKnight 6 Nov 2024 22:19 UTC
1 point
0
in reply to: Richard_Ngo’s comment on: Against Almost Every Theory of Impact of Interpretability
Do you think putting extra effort into learning about existing empirical work while doing conceptual work would be sufficient for good conceptual work or do you think people need to be producing empirical work themselves to really make progress conceptually?

Andrew McKnight 25 May 2024 22:55 UTC
2 points
0
on: Catching AIs red-handed
Maybe you’ve addressed this elsewhere, but isn’t scheming convergent in the sense that a perfectly aligned AGI would still have an incentive to scheme unless it already fully knows itself? An aligned AGI can still desire to have some unmonitored breathing room to fully reflect and realize what it truly cares about, even if that thing is what we want.
Also, a possible condition for a fully corrigible AGI would be to not have this scheming incentive in the first place, even while having the capacity to scheme.

Andrew McKnight 15 Mar 2024 18:47 UTC
9 points
8
in reply to: mike_hawke’s comment on: ‘Empiricism!’ as Anti-Epistemology
lukeprog argued similarly that we should drop the “the”

Andrew McKnight 21 Jul 2023 18:38 UTC
2 points
0
on: The shape of AGI: Cartoons and back of envelope
Another possible inflection point, before self-improvement, could be when an AI gets a set of capabilities that allows it to gain new capabilities at inference time.

Andrew McKnight 19 Jun 2023 1:09 UTC
1 point
0
in reply to: RatsWrongAboutUAP’s comment on: UFO Betting: Put Up or Shut Up
I’ll repeat this bet, same odds, same conditions, same payout, if you’re still interested. My $10k to your $200 in advance.

Andrew McKnight 3 Apr 2023 22:09 UTC
1 point
0
in reply to: Noosphere89’s comment on: Policy discussions follow strong contextualizing norms
Responding to your #1, do you think we’re on track to handle the cluster of AGI Ruin scenarios pointed at in 16-19? I feel we are not making any progress here other than towards verifying some properties in 17.
16: outer optimization even on a very exact, very simple loss function doesn’t produce inner optimization in that direction.
17: on the current optimization paradigm there is no general idea of how to get particular inner properties into a system, or verify that they’re there, rather than just observable outer ones you can run a loss function over.
18: There’s no reliable Cartesian-sensory ground truth (reliable loss-function-calculator) about whether an output is ‘aligned’.
19: there is no known way to use the paradigm of loss functions, sensory inputs, and/or reward inputs, to optimize anything within a cognitive system to point at particular things within the environment.

Andrew McKnight 13 Mar 2023 20:23 UTC
1 point
0
in reply to: Ethan Perez’s comment on: Anthropic’s Core Views on AI Safety
Thanks for the links and explanation, Ethan.

Andrew McKnight 11 Mar 2023 2:27 UTC
LW: 3 AF: 1
0
AF
in reply to: evhub’s comment on: Anthropic’s Core Views on AI Safety
I mean, it’s mostly semantics, but I think of mechanistic interpretability as “inner” but not alignment, and I think it’s clearer that way, personally, so that we don’t call everything alignment. Observing properties doesn’t automatically get you good properties. I’ll read your link but it’s a bit too much to wade into for me atm.

Either way, it’s clear how to restate my question: Is mechanistic interpretability work the only inner alignment work Anthropic is doing?

Andrew McKnight 10 Mar 2023 23:35 UTC
LW: 2 AF: 1
−3
AF
on: Anthropic’s Core Views on AI Safety
Great post. I’m happy to see these plans coming out, following OpenAI’s lead.

It seems like all the safety strategies are targeted at outer alignment and interpretability. None of the recent OpenAI, DeepMind, Anthropic, or Conjecture plans seem to target inner alignment, iirc, even though this seems to me like the biggest challenge.

Is Anthropic mostly leaving inner alignment untouched, for now?

Andrew McKnight 10 Mar 2023 22:58 UTC
1 point
0
in reply to: mako yass’s comment on: Acausal normalcy
Taken literally, the only way to merge n utility functions into one without any other info (e.g. the preferences that generated the utility functions) is to do a weighted sum. There are only n-1 free parameters.
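
A minimal sketch of the parameter count, assuming positive weights and assuming that only the ranking of outcomes matters: multiplying every weight by the same constant leaves the ranking unchanged, so the weights can be normalized to sum to 1 and only n-1 of them remain free. The names `merge`, `normalize`, and `ranking` and the toy utilities are hypothetical, chosen for illustration only.

```python
from typing import Callable, List, Sequence

Utility = Callable[[str], float]


def merge(utilities: Sequence[Utility], weights: Sequence[float]) -> Utility:
    """Return the weighted-sum utility U(x) = sum_i w_i * u_i(x)."""
    if len(utilities) != len(weights):
        raise ValueError("need exactly one weight per utility function")

    def merged(outcome: str) -> float:
        return sum(w * u(outcome) for w, u in zip(weights, utilities))

    return merged


def normalize(weights: Sequence[float]) -> List[float]:
    """Rescale positive weights so they sum to 1 (removes one degree of freedom)."""
    total = sum(weights)
    return [w / total for w in weights]


def ranking(utility: Utility, outcomes: Sequence[str]) -> List[str]:
    """Order outcomes from most to least preferred under the given utility."""
    return sorted(outcomes, key=utility, reverse=True)


# Toy example (illustrative numbers only): two agents, three outcomes.
outcomes = ["a", "b", "c"]
u1: Utility = {"a": 1.0, "b": 0.0, "c": 0.5}.__getitem__
u2: Utility = {"a": 0.0, "b": 1.0, "c": 0.8}.__getitem__

raw_weights = [2.0, 6.0]
unit_weights = normalize(raw_weights)  # same ratios, now summing to 1

ranking_raw = ranking(merge([u1, u2], raw_weights), outcomes)
ranking_unit = ranking(merge([u1, u2], unit_weights), outcomes)
```

With two utility functions, the two raw weights collapse to a single free number (the ratio between them), and both weightings rank the outcomes identically.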

Andrew McKnight 22 Aug 2022 19:38 UTC
1 point
0
on: Announcing Encultured AI: Building a Video Game
Wouldn’t the kind of alignment you’d be able to test behaviorally in a game be unrelated to scalable alignment?

Andrew McKnight 7 Aug 2022 20:44 UTC
2 points
0
in reply to: johnswentworth’s comment on: Computational Model: Causal Diagrams with Symmetry
I know this was 3 years ago, but was this disagreement resolved, maybe offline?

Andrew McKnight 3 Aug 2022 21:32 UTC
3 points
0
in reply to: Daniel Kokotajlo’s comment on: Two-year update on my personal AI timelines
Is there reason to believe algorithmic improvements follow an exponential curve? Do you happen to know a good source on this?

Andrew McKnight 1 Aug 2022 21:52 UTC
5 points
1
in reply to: Zack_M_Davis’s comment on: AGI Ruin: A List of Lethalities
I’m tempted to call this a meta-ethical failure. Fatalism, universal moral realism, and just-world intuitions seem to be the underlying implicit heuristics or principles that would cause this “cosmic process” thought-blocker.

Andrew McKnight 28 Jul 2022 18:26 UTC
2 points
1
in reply to: Rob Bensinger’s comment on: Why all the fuss about recursive self-improvement?
I think it’s good to go back to this specific quote and think about how it compares to AGI progress.

A difference I think Paul has mentioned before is that Go was not a competitive industry and competitive industries will have smaller capability jumps. Assuming this is true, I also wonder whether the secret sauce for AGI will be within the main competitive target of the AGI industry.

The thing the industry is calling AGI and targeting may end up being a specific style of shallow deployable intelligence when “real” AGI is a different style of “deeper” intelligence (with, say, less economic value at partial stages and therefore relatively unpursued). This would allow a huge jump like AlphaGo in AGI even in a competitive industry targeting AGI.

Both possibilities seem plausible to me and I’d like to hear arguments either way.

Andrew McKnight 18 Jun 2022 14:39 UTC
3 points
0
in reply to: jbash’s comment on: AGI Ruin: A List of Lethalities
von Neumann’s design was in full detail, but, iirc, when it was run for the first time (in the 90s) it had a few bugs that needed fixing. I haven’t followed Freitas in a long time either but agree that the designs weren’t fully spelled out and would have needed iteration.

Andrew McKnight 6 Jun 2022 21:18 UTC
7 points
1
in reply to: mukashi’s comment on: AGI Ruin: A List of Lethalities
If we merely lose control of the future and virtually all resources but many of us aren’t killed in 30 years, would you consider Eliezer right or wrong?

Andrew McKnight 6 Jun 2022 20:52 UTC
3 points
−1
in reply to: jbash’s comment on: AGI Ruin: A List of Lethalities
There is some evidence that complex nanobots could be invented in one’s head with a little more IQ and focus, because von Neumann designed a mostly functional (but fragile) replicator in a fake simple physics using the brand-new idea of a cellular automaton, without a computer and without the idea of DNA. If a slightly smarter von Neumann focused his life on nanobots, could he have produced, for instance, the works of Robert Freitas but in the 1950s, and only on paper?
I do, however, agree it would be helpful to have different words for different styles of AGI but it seems hard to distinguish these AGIs productively when we don’t yet know the order of development and which key dimensions of distinction will be worth using as we move forward. (human-level vs super-? shallow vs deep? passive vs active? autonomy-types? tightness of self-improvement? etc). Which dimensions will pragmatically matter?

Andrew McKnight 22 Feb 2022 19:20 UTC
3 points
0
in reply to: ChristianKl’s comment on: February Open Thread
I think this makes sense because eggs are haploid (already only have 23 chromosomes) but a natural next question is: why are eggs haploid if there is a major incentive to pass on more of the 46 chromosomes?

Andrew McKnight 5 Dec 2021 1:36 UTC
2 points
0
on: Andrew McKnight’s Shortform
I’ve been thinking about benefits of “Cognitive Zoning Laws” for AI architecture.

If specific cognitive operations were only performed in designated modules then these modules could have operation-specific tracking, interpreting, validation, rollback, etc. If we could ensure “zone breaches” can’t happen (via e.g. proved invariants or more realistically detection and rollback) then we could theoretically stay aware of where all instances of each cognitive operation are happening in the system. For now let’s call this cognitive-operation-factored architecture “Zoned AI”.

Zoned AI seems helpful in preventing inner optimizers that are within particular modules (but might have little to say about emergent cross-module optimizers). It would also let interpretability techniques focus in on particular sections of the AI. For example, totally speculating: the meta-learning inside GPT-3 might be spread all over the network, which would make it hard to understand globally even with interpretability tools, compared to the ability being localized. Gradient descent training schemes break cognitive zoning laws by default.

Defining cognitive operations perfectly enough to capture all instances of them is a losing battle. Instead we might (1) allow lots of false negatives and (2) use a behavioral test for detecting them rather than a definition.

To test a single inner piece of a Zoned AI, we create a second Zoned AI that is functional for some task and remove the capacity we want to test from that AI. Then we take the inner piece we are testing for a breach from the first AI, wrap it in a shallow network (a neural net or whatever), and see if the second AI can be made to function by training the shallow network. If the training succeeds, then we have a thing that is sufficiently similar to the disallowed operation, so we have a breach.
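
A toy sketch of this breach test, under heavy simplifying assumptions that are mine and not part of the proposal: the "second AI" is reduced to the signal its removed capacity is supposed to supply, the "shallow network" is a linear readout with a bias fitted by least squares, and a breach is declared when held-out error falls below an arbitrary threshold. The names `breach_test`, `disguised_piece`, and `benign_piece` are hypothetical.

```python
import numpy as np


def breach_test(piece, inputs, needed_signal, threshold=0.05):
    """Fit a shallow linear wrapper around `piece` and check whether it can
    supply the signal that the ablated second AI is missing.

    Returns (breached, held_out_error). The threshold is an assumed value.
    """
    features = np.asarray(piece(inputs), dtype=float)
    # The shallow wrapper: a linear readout plus a bias term.
    design = np.column_stack([features, np.ones(len(features))])
    half = len(design) // 2
    # "Train" the wrapper on the first half of the data.
    weights, *_ = np.linalg.lstsq(design[:half], needed_signal[:half], rcond=None)
    # Evaluate on the second half to see whether the second AI would function.
    residual = design[half:] @ weights - needed_signal[half:]
    held_out_error = float(np.mean(residual ** 2))
    return held_out_error < threshold, held_out_error


rng = np.random.default_rng(0)
x = rng.normal(size=(400, 2))
# Assumed disallowed operation: multiplying the two inputs together.
needed_signal = x[:, 0] * x[:, 1]


def disguised_piece(inputs):
    """A piece that secretly computes the product, rescaled and shifted."""
    return np.column_stack([3.0 * inputs[:, 0] * inputs[:, 1] + 1.0, inputs[:, 0]])


def benign_piece(inputs):
    """A piece that only passes its inputs through unchanged."""
    return inputs


disguised_breach, disguised_error = breach_test(disguised_piece, x, needed_signal)
benign_breach, benign_error = breach_test(benign_piece, x, needed_signal)
```

In this toy setup the disguised piece is flagged because a linear wrapper can undo its rescaling, while the benign piece is not flagged because no linear readout of the raw inputs recovers their product. A real version would need a much richer notion of the second AI "functioning" than a single regression target.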

Now we don’t actually want to check every tiny piece of the AI so instead we train a 3rd system to search for sections that might contain the disallowed ability and to predict whether one exists within the entire first AI, using the 2nd AI only as an expensive check.

Seeing the same abilities cropping up in the wrong place would tell you about the incentives innate to your architecture components and gesture towards new architectures that relieve the incentive (e.g. if you find planning in your perception, then maybe you need to attach the planner in a controlled way to the perception module).

None of this will work at later stages when an AGI can operate on itself but I would hope Cognitive Zoning could help during the crucial phase when we have AGI architecture in our hands but have not yet deployed instances at a scale where they are dangerous.

Thoughts and improvements? I’m sure this isn’t a novel idea but has anyone written about it?