algon33

Karma: 82

Question: MIRI Corrigbility Agenda

algon33 13 Mar 2019 19:38 UTC

15 points

11 comments · 1 min read · LW link

algon33 31 Jul 2020 20:22 UTC
3 points
on: “Go west, young man!”—Preferences in (imperfect) maps
Sometimes the cluster in the map a preference is pointing at involves another preference, which provides a natural resolution mechanism. What happens when there are two preferences, I’m unsure. I suppose it depends on how your map changes. In which case, I think that to work out how to make purity coherent, you should start off with some “simple” map and various “simple” changes in the map. Making purity coherent relative to your map is both computationally hard and empathetically hard.
Side-note: It would be interesting to see which resolution mechanisms produce the most varied shifts in preferences for boundedly rational agents with complex utility functions.
Side-note^2: Stuart, I’m writing a review of all the work done on corrigibility. Would you mind if I asked you some questions on your contributions?

algon33 1 Aug 2020 11:43 UTC
1 point
in reply to: Stuart_Armstrong’s comment on: “Go west, young man!”—Preferences in (imperfect) maps
Hangouts I suppose. It just works. Would next weekend be OK for you?
Edit: I’ve scheduled a meeting for 12pm UK time on Saturday. Tell me if that works for you.
meet.google.com/kdf-xavk-nnh

algon33 4 Aug 2020 10:49 UTC
1 point
in reply to: Stuart_Armstrong’s comment on: “Go west, young man!”—Preferences in (imperfect) maps
Alright, here’s the link for Friday: meet.google.com/qxw-zpsi-oqn
Thanks for replying.

algon33 13 Aug 2020 12:57 UTC
8 points
on: Alignment By Default
This came out of the discussion you had with John Maxwell, right? Does he think this is a good presentation of his proposal?
How do we know that the unsupervised learner won’t have learnt a large number of other embeddings closer to the proxy? If it has, then why should we expect human values to do well?
Some rough thoughts on the data type issue. Depending on what types the unsupervised learner provides to the supervised learner, the latter may not be able to reach the proxy type by virtue of issues with NN learning processes.
Recall that data types can be viewed as homotopic spaces, and construction of types can be viewed as generating new spaces off the old, e.g. tangent spaces or path spaces. We can view neural nets as a type corresponding to a particular homotopic space. But getting neural nets to learn certain functions is hard. For example, consider learning a function which is 0 except on two subspaces A and B, where it takes a different value on each, and where A and B are shaped like interlocked rings. In other words, a non-linear classification problem. So plausibly, neural nets have trouble constructing certain types from others. Maybe this depends on architecture or learning algorithm, maybe not.
If the proxy and human values have very different types, it may be the case that the supervised learner won’t be able to get from one type to another. Supposing the unsupervised learner presents it with types “reachable” from human values, then the proxy which optimises performance on the data set is just unavailable to the system even though it’s relatively simple in comparison.
Because of this, checking which simple homotopies neural nets can move between would be useful. Depending on the results, we could use this as an argument that unsupervised NNs will never embed the human values type because we’ve found out it has some simple properties it won’t be able to construct de novo. Unless we do something like feed the unsupervised learner human biases/start with an EM and modify it.
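To make the interlocked-rings example concrete, here is a minimal sketch of such a data set. The function name, the choice of unit circles, and the labels +1 and -1 are assumptions made purely for illustration; the target function would be 0 everywhere off the rings.

```python
import math
import random


def sample_linked_rings(n_per_ring, seed=0):
    """Sample labelled points from two interlocked unit circles in 3D.

    Ring A lies in the xy-plane, centred at the origin (label +1.0).
    Ring B lies in the xz-plane, centred at (1, 0, 0) (label -1.0),
    so it passes through the middle of ring A.
    Returns a list of ((x, y, z), label) pairs.
    """
    rng = random.Random(seed)
    points = []
    for _ in range(n_per_ring):
        t = rng.uniform(0.0, 2.0 * math.pi)
        points.append(((math.cos(t), math.sin(t), 0.0), 1.0))
        s = rng.uniform(0.0, 2.0 * math.pi)
        points.append(((1.0 + math.cos(s), 0.0, math.sin(s)), -1.0))
    return points
```

Ring B passes through the origin, which sits inside the disc spanned by ring A, so no single plane separates the two classes. Any classifier has to bend its decision boundary around the link.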

algon33 15 Aug 2020 0:15 UTC
3 points
in reply to: johnswentworth’s comment on: Alignment By Default
Based off what you’ve said in the comments, I’m guessing you’d say the various forms of corrigibility are natural abstractions. Would you say we can use the strategy you outline here to get “corrigibility by default”?
Regarding iterations, the common objection is that we’re introducing optimisation pressure. So we should expect the usual alignment issues anyway. Under your theory, is this not an issue because of the sparsity of natural abstractions near human values?

algon33 17 Aug 2020 17:54 UTC
1 point
AF
on: My Understanding of Paul Christiano’s Iterated Amplification AI Safety Research Agenda
This post deserves a strong upvote. Since you’ve done the review, would you mind answering a reference request? What papers/blog posts represent Paul’s current views on corrigibility?

algon33 18 Aug 2020 23:52 UTC
1 point
AF
in reply to: Chi Nguyen’s comment on: My Understanding of Paul Christiano’s Iterated Amplification AI Safety Research Agenda
Fair enough. Thanks for the recommendations. :)

algon33 20 Aug 2020 14:39 UTC
1 point
in reply to: Koen.Holtman’s comment on: Question: MIRI Corrigbility Agenda
Hey, thanks for writing all of that. My current goal is to do an up-to-date literature review on corrigibility, so that was a most helpful comment. I’ll definitely look over your blog, since some of these papers are quite dense. Out of the papers you recommended, is there one that stands out? Bear in mind that I’ve read Stuart’s and MIRI’s papers already.

algon33 20 Aug 2020 16:34 UTC
1 point
in reply to: Koen.Holtman’s comment on: Question: MIRI Corrigbility Agenda
Dutch custom prevents me from recommending my own recent paper in any case
This phrase and its implications are perfect examples of problems in corrigibility. Was that intentional? If so, bravo. Your paper looks interesting, but I think I’ll read the blog post first. I want a break from reading heavy papers. I wonder if the researchers would be OK with my drawing on their blog posts in the review. Would you mind?
Thanks for recommending “Reward tampering”, it is much appreciated. I’ll get on it after synthesising what I’ve read so far. Otherwise, I don’t think I’ll learn much.

algon33 22 Aug 2020 16:30 UTC
9 points
on: Epistemic Comparison: First Principles Land vs. Mimesis Land
If there is a mistake deep in the belief of someone
Are they not ideal Bayesians? Also, do they update based off other people’s priors? It could be interesting to make them all ultra-finitists.
Mimesis land is confusing from the outside. I’m not sure how they could avoid stumbling upon “correct” forms of manipulating beliefs, if they persist for long enough and there are large enough stochastic shocks to the community’s beliefs. If they also copied successful people in the past, I feel like this would be even more likely. Unless they happen to be the equivalent of Chinese rooms: just an archive of if-else clauses.
Anyway, thank you for introducing this delightful style of thought experiments.

algon33 25 Aug 2020 19:26 UTC
1 point
on: The two-layer model of human values, and problems with synthesizing preferences
How does this relate to the whole “no-self” thing? Is the character becoming aware of the player there?

algon33 28 Aug 2020 14:18 UTC
6 points
on: Zibbaldone With It All
Epistemic status: unsure
I have a hypothesis about why Zettlekasten provide diminishing returns over time. A corollary is that others should find even less value in your Zettles. This ties into some of your points, and shows what is missing from the Zibbaldone. Plus there are some suggestions on how to correct the flaw.
One of the key benefits to the Zettlekasten is that the way you link cards reflects your psyche’s understanding of the ideas. Of course other note-taking systems have this advantage. But this isn’t baked into them like it is with Zettlekasten.
Traversing the Zettlekasten lets you approximate your past state of mind when working on a problem. Which lets you dredge up whatever your subconscious has come up with on the topic. The seemingly random organisation of yesterday’s Zettles helps this along a little by providing a glimpse into yesterday’s self. So when you wake up in the morning and look over yesterday’s cards, a flood of relevant thoughts arises. When considering where to place them, you glance through your Zettlekasten. Zettles your mind was working on bring new thoughts to the forefront. Often it will feel trivial to combine and play with all these new ideas, generating even more thoughts.
Unfortunately, past a certain size your Zettlekasten contains too many cards for your brain to be processing at once and too many potential states of mind you could have been in. We expect a gradual reduction in value of the system. And less value to others, who have a different understanding of the ideas. Something similar is true for other note-taking systems. There’s just a steeper decline in value.
How does this square with people’s reports that switching between different note-taking systems provides the same early returns they got with the old one? My guess is that the reduced scope of your notes and the context shift is what does it. Your brain realises it no longer has to keep track of all those ideas and can focus on a few relatively simple ones. Which makes the low-hanging fruit all the easier to grab.
Can we fix these issues? Maybe. And I think you’d have to go digital to do it. Consider spaced repetition. A memory’s strength decays exponentially with time. When you are reminded of it, the memory is strengthened. Spaced repetition takes advantage of the fact that there are optimal timings to strengthen the memory. By analogy, we might say there are optimal times to dredge up ideas from your mind. And likewise, there may be optimal timings to link zettles. Perhaps these timings depend on the “distance” between the zettles.
A digital system could provide all this. A useful format would be the zettle to consider, and a graph surrounding it of the zettles you should link it to. When moving to adjacent zettles, you can see all the zettles it links to in order to provide relevant context.
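As a rough sketch of the two ingredients above, here is an exponential forgetting curve with a review time derived from it, plus a hop-count “distance” between zettles. Everything here is an assumption for illustration: the function names, the stability parameter, the 0.9 retention threshold, and the use of breadth-first hop count as the distance. How timings should actually depend on distance is left open.

```python
import math
from collections import deque


def retention(days_elapsed, stability_days):
    """Exponential forgetting curve: fraction of memory strength remaining."""
    return math.exp(-days_elapsed / stability_days)


def days_until_review(stability_days, threshold=0.9):
    """Days until retention decays to the given threshold (hypothetical rule)."""
    return -stability_days * math.log(threshold)


def link_distances(links, start):
    """Hop counts from `start` over a link graph given as {zettle: set of zettles}."""
    dist = {start: 0}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for neighbour in links.get(node, ()):
            if neighbour not in dist:
                dist[neighbour] = dist[node] + 1
                queue.append(neighbour)
    return dist
```

A system along these lines could surface a zettle when its retention crosses the threshold, together with the zettles within a few hops of it.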

algon33 31 Aug 2020 17:01 UTC
1 point
in reply to: romeostevensit’s comment on: Zibbaldone With It All
Do you do this in a piecemeal way, or do you assign a few days to re-organising your thoughts when you learn some important new principle?

algon33 31 Aug 2020 17:08 UTC
1 point
in reply to: Randomini’s comment on: Zibbaldone With It All
Yeah, I had some ideas concerning how to keep track of Zettlekasten as well as the right way to display graphs. Reinforcing the network is definitely a worthwhile idea. The entire point is to suggest good links, but also give you the freedom to traverse your graph. RE the hyperlinks: I agree about the worry of biases. But more than that, it seems the network should not automate link suggestion without leaving the option to create links yourself. As you say, the worth of the Zettlekasten method is largely in instilling virtuous mental habits. What you suggested seems like it could instil laziness in the user.

algon33 31 Aug 2020 17:14 UTC
1 point
in reply to: gwern’s comment on: Zibbaldone With It All
I started writing a blog post in response, but that seems a bit much for a comment. Suffice it to say, I agree that anti-spaced repetition is a good idea. However, it throws away the context of the notes you made, and it shows a note to you only after your mind has totally forgotten about it. And as I wrote, those seem to be major factors in the value of the Zettlekasten method!

algon33 31 Aug 2020 18:18 UTC
1 point
in reply to: gwern’s comment on: Zibbaldone With It All
Before or after what? If it is a passage in a book, or an article you wrote, I agree that’s enough. But what about a nebulous concept you struggled to put into words? Or an idea which seemed to have surprising links to other thoughts, which you didn’t pursue at the time? If you write all this stuff down explicitly, then fine. If not, and your writing style is like mine, then it seems better to link to other cards and leave it to your future self to figure it out.
Plus, links provide the system extra information with which it can auto-suggest other relevant ideas that you weren’t even aware you were considering.

algon33 1 Sep 2020 20:19 UTC
2 points
in reply to: nostalgebraist’s comment on: interpreting GPT: the logit lens
IIRC, this also shows a discontinuous flip at the bottom followed by slower change.
Maybe edit the post so you include this? I know I was wondering about this too.

algon33 3 Sep 2020 15:06 UTC
1 point
on: [AN #115]: AI safety research problems in the AI-GA framework
Hey Rohin, I’m writing a review on everything that’s been written on corrigibility so far. Do “the off switch game”, “Active Inverse Reward Design”, “should robots be obedient”, “incorrigibility in CIRL”, as well as your reply in the Newsletter represent CHAI’s current views on the subject? If not, which papers contain them?

algon33 3 Sep 2020 17:36 UTC
1 point
in reply to: Rohin Shah’s comment on: [AN #115]: AI safety research problems in the AI-GA framework
Active IRD doesn’t have anything to do with corrigibility; I guess my mind just switched off when I was writing that. Anyway, how diverse are CHAI’s views on corrigibility? Could you tell me who I should talk to? Because I’ve already read all the published stuff on it, if I’m understanding you rightly, and I want to make sure that all the perspectives on this topic are covered.
