Hangouts I suppose. It just works. Would next weekend be OK for you?
Edit: I’ve scheduled a meeting for 12pm UK time on Saturday. Tell me if that works for you.
Alright, here’s the link for Friday: meet.google.com/qxw-zpsi-oqn
Thanks for replying.
This came out of the discussion you had with John Maxwell, right? Does he think this is a good presentation of his proposal?
How do we know that the unsupervised learner won’t have learnt a large number of other embeddings closer to the proxy? If it has, then why should we expect human values to do well?
Some rough thoughts on the data type issue. Depending on what types the unsupervised learner provides to the supervised learner, the latter may not be able to reach the proxy type, by virtue of issues with NN learning processes.
Recall that data types can be viewed as homotopic spaces, and construction of types can be viewed as generating new spaces off the old, e.g. tangent spaces or path spaces. We can view neural nets as a type corresponding to a particular homotopic space. But getting neural nets to learn certain functions is hard. For example, consider learning a function which is 0 except on two subspaces A and B, and which takes different values on A and B, where A and B are shaped like interlocked rings. In other words, a non-linear classification problem. So plausibly, neural nets have trouble constructing certain types from others. Maybe this depends on architecture or learning algorithm, maybe not.
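To make the interlocked rings example concrete, here is a rough sketch in plain Python. The function names and the particular placement of the rings are my own assumptions, purely for illustration. It builds two linked circles in 3D and scores a single linear threshold on them. It only shows that no linear rule separates the rings; it says nothing about what a deep net can or can't learn.

```python
import math
from typing import List, Sequence, Tuple

Point = Tuple[float, float, float]

def linked_rings(n_per_ring: int) -> Tuple[List[Point], List[int]]:
    """Return (points, labels) for two interlocked unit circles in 3D.

    The geometry is an illustrative assumption: ring A (label 0) lies in
    the xy-plane around the origin, ring B (label 1) lies in the xz-plane
    around (1, 0, 0), so each ring passes through the other.
    """
    points: List[Point] = []
    labels: List[int] = []
    for k in range(n_per_ring):
        t = 2 * math.pi * k / n_per_ring
        points.append((math.cos(t), math.sin(t), 0.0))        # ring A
        labels.append(0)
        points.append((1.0 + math.cos(t), 0.0, math.sin(t)))  # ring B
        labels.append(1)
    return points, labels

def linear_accuracy(weights: Sequence[float], bias: float,
                    points: Sequence[Point], labels: Sequence[int]) -> float:
    """Accuracy of the rule 'predict label 1 iff w.x + b > 0'."""
    correct = 0
    for p, y in zip(points, labels):
        score = sum(w * x for w, x in zip(weights, p)) + bias
        correct += int((score > 0) == (y == 1))
    return correct / len(points)

# Along the x-axis the rings alternate: A at x=-1, B at x=0, A at x=1,
# B at x=2. A linear score is monotone along any line, so it cannot get
# all four of these points right, whatever the weights are.
AXIS_POINTS: List[Point] = [(-1.0, 0.0, 0.0), (0.0, 0.0, 0.0),
                            (1.0, 0.0, 0.0), (2.0, 0.0, 0.0)]
AXIS_LABELS: List[int] = [0, 1, 0, 1]
```

Calling `linear_accuracy` on the output of `linked_rings(200)` with whatever weights you like gives a feel for how far a single hyperplane gets. Whether a given architecture and training procedure can untangle the rings is the open question above.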
If the proxy and human values have very different types, it may be the case that the supervised learner won’t be able to get from one type to another. Supposing the unsupervised learner presents it with types “reachable” from human values, then the proxy which optimises performance on the data set is just unavailable to the system, even though it’s relatively simple in comparison.
Because of this, checking which simple homotopies neural nets can move between would be useful. Depending on the results, we could use this as an argument that unsupervised NNs will never embed the human values type, because we’ve found out it has some simple properties they won’t be able to construct de novo. Unless we do something like feed the unsupervised learner human biases, or start with an EM and modify it.
Based off what you’ve said in the comments, I’m guessing you’d say the various forms of corrigibility are natural abstractions. Would you say we can use the strategy you outline here to get “corrigibility by default”?
Regarding iterations, the common objection is that we’re introducing optimisation pressure. So we should expect the usual alignment issues anyway. Under your theory, is this not an issue because of the sparsity of natural abstractions near human values?
This post deserves a strong upvote. Since you’ve done the review, would you mind answering a reference request? What papers/blog posts represent Paul’s current views on corrigibility?
Fair enough. Thanks for the recommendations. :)
Hey, thanks for writing all of that. My current goal is to do an up-to-date literature review on corrigibility, so that was a most helpful comment. I’ll definitely look over your blog, since some of these papers are quite dense. Out of the papers you recommended, is there one that stands out? Bear in mind that I’ve read Stewart and MIRI’s papers already.
Dutch custom prevents me from recommending my own recent paper in any case
This phrase and its implications are perfect examples of problems in corrigibility. Was that intentional? If so, bravo. Your paper looks interesting, but I think I’ll read the blog post first. I want a break from reading heavy papers. I wonder if the researchers would be OK with my drawing on their blog posts in the review. Would you mind?
Thanks for recommending “Reward tampering”, it is much appreciated. I’ll get on it after synthesising what I’ve read so far. Otherwise, I don’t think I’ll learn much.
If there is a mistake deep in the belief of someone
Are they not ideal Bayesians? Also, do they update based off other people’s priors? It could be interesting to make them all ultra-finitists.
Mimemis land is confusing from the outside. I’m not sure how they could avoid stumbling upon “correct” forms of manipulating beliefs, if they persist for long enough and there are large enough stochastic shocks to the community’s beliefs. If they also copied successful people in the past, I feel like this would be even more likely. Unless they happen to be the equivalent of Chinese rooms: just an archive of if-else clauses.
Anyway, thank you for introducing this delightful style of thought experiments.
How does this relate to the whole “no-self” thing? Is the character becoming aware of the player there?
Epistemic status: unsure
I have a hypothesis about why Zettelkasten provide diminishing returns over time. A corollary is that others should find even less value in your Zettles. This ties into some of your points, and shows what is missing from the Zibaldone. Plus there are some suggestions on how to correct the flaw.
One of the key benefits of the Zettelkasten is that the way you link cards reflects your psyche’s understanding of the ideas. Of course, other note-taking systems have this advantage too. But it isn’t baked into them like it is with Zettelkasten.
Traversing the Zettelkasten lets you approximate your past state of mind when working on a problem, which lets you dredge up whatever your subconscious has come up with on the topic. The seemingly random organisation of yesterday’s Zettles helps this along a little by providing a glimpse into yesterday’s self. So when you wake up in the morning and look over yesterday’s cards, a flood of relevant thoughts arises. When considering where to place them, you glance through your Zettelkasten. Zettles your mind was working on bring new thoughts to the forefront. Often it will feel trivial to combine and play with all these new ideas, generating even more thoughts.
Unfortunately, past a certain size your Zettelkasten contains too many cards for your brain to be processing at once, and too many potential states of mind you could have been in. So we should expect a gradual reduction in the value of the system, and less value to others, who have a different understanding of the ideas. Something similar is true for other note-taking systems. There’s just a steeper decline in value.
How does this square with people’s reports that switching between different note-taking systems provides the same early returns they got with the old one? My guess is that the reduced scope of your notes and the context shift are what do it. Your brain realises it no longer has to keep track of all those ideas and can focus on a few relatively simple ones. Which makes the low-hanging fruit all the easier to grab.
Can we fix these issues? Maybe. And I think you’d have to go digital to do it. Consider spaced repetition. A memory’s strength decays exponentially with time. When you’re reminded of it, the memory is strengthened. Spaced repetition takes advantage of the fact that there are optimal timings to strengthen the memory. By analogy, we might say there are optimal times to dredge up ideas from your mind. And likewise, there may be optimal timings to link zettles. Perhaps these timings depend on the “distance” between the zettles.
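For concreteness, the exponential decay picture can be sketched in a few lines of Python. Everything here is an assumption for illustration: the forgetting curve R(t) = exp(-t / S), the idea that each review multiplies the stability S by a fixed growth factor, and all the function and parameter names. Real spaced repetition schedulers are more involved.

```python
import math
from typing import List

def retention(days_elapsed: float, stability: float) -> float:
    """Assumed forgetting curve: fraction retained is exp(-t / S)."""
    return math.exp(-days_elapsed / stability)

def days_until(threshold: float, stability: float) -> float:
    """Days until retention falls to `threshold` (0 < threshold < 1)."""
    return -stability * math.log(threshold)

def review_intervals(initial_stability: float, growth: float,
                     threshold: float, reviews: int) -> List[float]:
    """Gaps between successive reviews.

    Hypothetical rule: review whenever retention hits `threshold`, and
    each review multiplies the stability by `growth`.
    """
    intervals: List[float] = []
    stability = initial_stability
    for _ in range(reviews):
        intervals.append(days_until(threshold, stability))
        stability *= growth
    return intervals
```

Under these assumptions the gaps grow geometrically: with `growth=2.0` each interval is twice the previous one. The analogous schedule for linking zettles would presumably swap "stability" for something that depends on the distance between them, but I don't know what that function should be.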
A digital system could provide all this. A useful format would be the zettle to consider, and a graph surrounding it of the zettles you should link it to. When moving to adjacent zettles, you can see all the zettles it links to in order to provide relevant context.
Do you do this in a piecemeal way, or do you assign a few days to re-organising your thoughts when you learn some important new principle?
Yeah, I had some ideas concerning how to keep track of Zettelkasten as well as the right way to display graphs. Reinforcing the network is definitely a worthwhile idea. The entire point is to suggest good links, but also give you the freedom to traverse your graph. RE the hyperlinks: I agree about the worry of biases. But more than that, it seems the network should not automate link suggestion without leaving the option to create links yourself. As you say, the worth of the Zettelkasten method is largely in instilling virtuous mental habits. What you suggested seems like it could instil laziness in the user.
I started writing a blog post in response, but that seems a bit much for a comment. Suffice it to say, I agree that anti-spaced repetition is a good idea. However, it throws away the context of the notes you made, as well as showing them to you after your mind has totally forgotten about them. And as I wrote, those seem to be major factors in the value of the Zettelkasten method!
Before or after what? If it is a passage in a book, or an article you wrote, I agree that’s enough. But what about a nebulous concept you struggled to put into words? Or an idea which seemed to have surprising links to other thoughts, which you didn’t pursue at the time? If you write all this stuff down explicitly, then fine. If not, and your writing style is like mine, then it seems better to link to other cards and leave it to your future self to figure it out.
Plus, links provide the system extra information with which it can auto-suggest other relevant ideas that you weren’t even aware you were considering.
IIRC, this also shows a discontinuous flip at the bottom followed by slower change.
Maybe edit the post so you include this? I know I was wondering about this too.
Hey Rohin, I’m writing a review of everything that’s been written on corrigibility so far. Do “The Off-Switch Game”, “Active Inverse Reward Design”, “Should Robots be Obedient?” and “Incorrigibility in CIRL”, as well as your reply in the Newsletter, represent CHAI’s current views on the subject? If not, which papers contain them?
Active IRD doesn’t have anything to do with corrigibility; I guess my mind just switched off when I was writing that. Anyway, how diverse are CHAI’s views on corrigibility? Could you tell me who I should talk to? If I’m understanding you rightly, I’ve already read all the published stuff on it, and I want to make sure that all the perspectives on this topic are covered.
Sometimes the cluster in the map a preference is pointing at involves another preference, which provides a natural resolution mechanism. What happens when there are two preferences, I’m unsure. I suppose it depends on how your map changes. In which case, I think that to focus on how to make purity coherent, you should start off with some “simple” map and various “simple” changes in the map. Making purity coherent relative to your map is both computationally hard and empathetically hard.
Side-note: It would be interesting to see which resolution mechanisms produce the most varied shifts in preferences for boundedly rational agents with complex utility functions.
Side-note^2: Stuart, I’m writing a review of all the work done on corrigibility. Would you mind if I asked you some questions on your contributions?