# DanielFilan

• IMO ncut divided by the number of clusters is sort of natural and scale-free: it’s the proportion of edge stubs that lead you out of a randomly chosen cluster. Proof in appendix A.1 of “Clusterability in Neural Networks”.
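A minimal sketch of that quantity, assuming an undirected, unweighted graph given as an edge list. The function name and the `labels` mapping from node to cluster are hypothetical, not taken from the paper's code:

```python
from collections import defaultdict


def ncut_per_cluster(edges, labels):
    """Return n-cut divided by the number of clusters.

    edges:  iterable of (u, v) pairs of an undirected graph
    labels: dict mapping each node to its cluster id (hypothetical format)
    """
    vol = defaultdict(int)  # edge stubs (sum of degrees) in each cluster
    cut = defaultdict(int)  # stubs in each cluster whose other end lies outside it
    for u, v in edges:
        cu, cv = labels[u], labels[v]
        vol[cu] += 1
        vol[cv] += 1
        if cu != cv:
            cut[cu] += 1
            cut[cv] += 1
    clusters = set(labels.values())
    # n-cut: for each cluster, the fraction of its stubs that leave it, summed.
    ncut = sum(cut[c] / vol[c] for c in clusters if vol[c] > 0)
    # Dividing by the number of clusters turns the sum into an average,
    # i.e. the outgoing-stub proportion for a uniformly chosen cluster.
    return ncut / len(clusters)


# Two triangles joined by a single edge.
edges = [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)]
labels = {0: "a", 1: "a", 2: "a", 3: "b", 4: "b", 5: "b"}
value = ncut_per_cluster(edges, labels)
```

In this example each cluster has 7 edge stubs, of which 1 leads out, so the value is 1/7 however many nodes or edges the graph has overall.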

• If you’re reading this, you might wonder: how do I actually make a podcast? Well, here’s the basic technical stuff to get started.

1. Buy a decent microphone, e.g. the Blue Yeti (costs ~$100). This will make you not sound bad.

2. If you’re going to be talking to people who aren’t physically near you, use some service that will record both of you talking. I recommend Zencastr (free for how I use it).

3. Record some talking (this is the hard part). My strong advice is that if you’re doing this remotely, you should both be wearing wired headphones. Please do this in a non-echoey, non-noisy space if you can. Kitchen is bad, sound-isolated place with blankets is good.

4. Do some minimal editing. Don’t try to delete every um and ah, that will take way too long. You can use the computer program “Audacity” for this (free), or ask me who I pay to do my editing.

5. Optionally, make transcripts by uploading your edited audio files to rev.com (~$1 per minute of audio). You’ll then have to re-listen to the audio and fix mistakes in the transcript. If you do this, you will probably want to make a website to put transcripts on, which will maybe involve using GitHub Pages or Squarespace (or maybe you just put transcripts on a pre-existing Medium/Substack/blog?)

6. Think of a name and logo for your podcast. Your logo needs to be exactly square and high-res.

7. Use a podcast hosting service. I like Libsyn (~$10/month for basic plan). Upload your audio files there, write descriptions and episode titles. You should now have an RSS feed.

8. Submit your RSS feed to Google Podcasts, Apple Podcasts, and Spotify. This will involve googling how to do this, you might make some errors, and then it will take ages for Apple to list your podcast.

Once you’ve done all this and dealt with the inevitable hiccups, you now have a podcast! Congratulations! It is certainly possible to do all of this better, but you at least have the basics.

• Here are two EA-themed podcasts that I think someone could make. Maybe that someone is you!

1. More or Less, but EA (or for forecasting)

More or Less is a BBC Radio program. They take some number that’s circulating around the news, and provide context like “Is that literally true? How could someone know that? What is that actually measuring? Is that a big number? Does that mean what you think it means?”—stuff like that. They spend about 10 minutes on each number, and usually include interviews with experts in the field. IMO, someone could do this for numbers that circulate around in the EA space. Another variant is to focus on forecasts—what factors are going in, what’s the reasoning for those guesses, etc.

This could be pretty easy to listen to, but moderately hard to make—requires research, editing conversations down, etc.

2. AI Safety Fellowship / Course thing: the podcast.

Get someone who’s doing something like the AGI Safety Fundamentals course or the Center for AI Safety’s thing like that. Each week, they make a podcast episode about what they think of the week’s readings—what seemed persuasive, what didn’t, what was interesting, what was novel. For a long version, you could make an episode about each reading.

If someone’s already doing one of these courses, I think it wouldn’t be much extra work to make this podcast (after the set cost of learning how you make a podcast). It would end up having an inherently limited run (but maybe you could do future seasons about reading thru Superintelligence /​ MIRI chat logs /​ various agendas?).

• A preference ordering on lotteries over outcomes is called geometrically rational if there exists some probability distribution over interval valued utility functions on outcomes such that if and only if .

How does this work with Kelly betting? There, aren’t the relevant utility functions going to be either linear or logarithmic in wealth?

Would have been good to ask about that and also mine it for resources.

Re: future iterations, I’m not sure. On one hand, I think it’s kind of bad for this kind of thing to be run by a person who stands to benefit from his thing ranking high on the survey. On the other hand, I’m not sure if anyone else wants to do it, and I think it would be good to run future iterations.

If anyone does want to take it over, please let me know. I’m not sure how many would be interested in doing that (maybe grantmaking orgs?), but if there are multiple such people it would probably be good to pick a designated successor. I should say that I reserve the right to wait until next year to make any sort of decision on this.

I didn’t want to name and shame lower-ranked entries, but if you go to the github and run the code you can see the whole ranked list for each category you’re interested in—just have to uncomment the relevant part of the script.

My guess is that it’s also because conversations are less optimized (being done on the fly) and maybe harder to access. It’s still the case that people getting into alignment found them “very” useful on average, which seems like high praise to me.

# Takeaways from a survey on AI alignment resources

5 Nov 2022 23:40 UTC
72 points
One reason that I doubt this story is that “try new things in case they’re good” is itself the sort of thing that should be reinforced during training on a complicated environment, and would push towards some sort of obfuscated manipulation of humans (similar to how if you read about enough social hacks you’ll probably be a bit scammy even tho you like people and don’t want to scam them). In general, this motivation will push RL agents towards reward-optimal behaviour on the distribution of states they know how to reach and handle.

# AXRP Episode 18 - Concept Extrapolation with Stuart Armstrong

3 Sep 2022 23:12 UTC
10 points

# AXRP Episode 17 - Training for Very High Reliability with Daniel Ziegler

21 Aug 2022 23:50 UTC
16 points
• Note that for this to work you need a strong disincentive against people sharing their private keys. One way to do this would be if the keys were also used for the purpose of holding cryptocurrency.

• Here’s one way you can do it: Suppose we’re doing public key cryptography, and every person is associated with one public key. Then when you write things online you could use a linkable ring signature. That means that you prove that you’re using a private key that corresponds to one of the known public keys, and you also produce a hash of your keypair, such that (a) the world can tell you’re one of the known public keys but not which public key you are, and (b) the world can tell that the key hash you used corresponds to the public key you ‘committed’ to when writing the proof.

Relevant quote I just found in the paper “Revisiting the Arcade Learning Environment: Evaluation Protocols and Open Problems for General Agents”:

The primary measure of an agent’s performance is the score achieved during an episode, namely the undiscounted sum of rewards for that episode. While this performance measure is quite natural, it is important to realize that score, in and of itself, is not necessarily an indicator of AI progress. In some games, agents can maximize their score by “getting stuck” in a loop of “small” rewards, ignoring what human players would consider to be the game’s main goal. Nevertheless, score is currently the most common measure of agent performance so we focus on it here.

• Here’s a project idea that I wish someone would pick up (written as a shortform rather than as a post because that’s much easier for me):

• It would be nice to study competent misgeneralization empirically, to give examples and maybe help us develop theory around it.

• Problem: how do you measure ‘competence’ without reference to a goal??

• Prior work has used the ‘agents vs devices’ framework, where you have a distribution over all reward functions, some likelihood distribution over what ‘real agents’ would do given a certain reward function, and do Bayesian inference on that vs choosing actions randomly. If conditioned on your behaviour you’re probably an agent rather than a random actor, then you’re competent.

• I don’t like this:

• Crucially relies on knowing the space of reward functions that the learner in question might have.

• Crucially relies on knowing how agents act given certain motivations.

• Here’s another option: throw out ‘competence’ and talk about ‘consequential’.

• This has a name collision with ‘consequentialist’ that you’ll probably have to fix but whatever.

• The setup: you have your learner do stuff in a multi-agent environment. You use the AUP metric on every agent other than your learner. You say that your learner is ‘consequential’ if it strongly affects the attainable utility of other agents.

• How good is this?

• It still relies on having a space of reward functions, but there’s some more wiggle-room: you probably don’t need to get the space exactly right, just to have goals that are similar to yours.

• Note that this would no longer be true if this were a metric you were optimizing over.

• You still need to have some idea about how agents will act realistically, because if you only look at the utility attainable by optimal policies, that might elide the fact that it’s suddenly gotten much computationally harder to achieve that utility.

• That said, I still feel like this is going to degrade more gracefully, as long as you include models that are roughly right. I guess this is because this model is no longer a likelihood ratio where misspecification can just rule out the right answer.

• Bonus round: you can probably do some thinking about why various setups would tend to reduce other agents’ attainable utility, prove some little theorems, etc., in the style of the power-seeking paper.

• Ideally you could even show a relation between this and the agents vs devices framing.

• I think this is the sort of project a first-year PhD student could fruitfully make progress on.
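To make the ‘agents vs devices’ objection above concrete, here is a toy sketch. It is my own stateless simplification, not the setup from the prior work: the ‘agent’ is assumed to be Boltzmann-rational with a uniform prior over a hypothetical list `reward_fns`, and the ‘device’ picks actions uniformly at random. All names and the `beta` parameter are assumptions.

```python
import math


def agent_posterior(actions, reward_fns, n_actions, beta=5.0, prior_agent=0.5):
    """Posterior probability that `actions` came from an agent, not a device.

    actions:    observed action indices
    reward_fns: candidate reward functions, each a list of per-action rewards
    beta:       assumed Boltzmann rationality of the agent
    """
    # Device hypothesis: every action is chosen uniformly at random.
    lik_device = (1.0 / n_actions) ** len(actions)

    # Agent hypothesis: uniform prior over reward functions, softmax policy.
    lik_agent = 0.0
    for r in reward_fns:
        z = sum(math.exp(beta * r[a]) for a in range(n_actions))
        p = 1.0
        for a in actions:
            p *= math.exp(beta * r[a]) / z
        lik_agent += p / len(reward_fns)

    numerator = prior_agent * lik_agent
    return numerator / (numerator + (1.0 - prior_agent) * lik_device)


behaviour = [0] * 10  # the learner single-mindedly picks action 0

# Reward space that includes "wants action 0": looks like an agent.
full_space = [[1, 0, 0], [0, 1, 0], [0, 0, 1]]
p_full = agent_posterior(behaviour, full_space, n_actions=3)

# Misspecified reward space that omits that goal: the same behaviour
# now looks less agent-like than random noise.
missing_goal = [[0, 1, 0], [0, 0, 1]]
p_missing = agent_posterior(behaviour, missing_goal, n_actions=3)
```

With the full reward space the posterior on ‘agent’ is close to 1. With the misspecified space it collapses towards 0, even though the behaviour is identical. That is the sense in which a likelihood ratio lets misspecification rule out the right answer.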

Here is an example story I wrote (lightly edited by TurnTrout) about how an agent trained by RL could plausibly not optimize reward, forsaking actions that it knew during training would get it high reward. I found it useful as a way to understand his views, and he has signed off on it. Just to be clear, this is not his proposal for why everything is fine, nor is it necessarily an accurate representation of my views, just a plausible-to-TurnTrout story for how agents won’t end up wanting to game human approval:

• Agent gets trained on a reward function that’s 1 if it gets human approval, 0 otherwise (or something).

• During an intermediate amount of training, the agent’s honest and nice computations get reinforced by reward events.

• That means it develops a motivation to act honestly and behave nicely etc., and no similarly strong motivation to gain human approval at all costs.

• The agent then becomes able to tell that if it tricked the human, that would be reinforced.

• It then decides to not get close in action-space to tricking the human, so that it doesn’t get reinforced into wanting to gain human approval by tricking the human.

• This works because:

• tricking the human is enough action hops away and/or a small enough part of the space that epsilon-greedy strategies would be very unlikely to push the agent into the deception mode.

• smarter exploration strategies will depend on the agent’s value function to know which states are more or less promising to explore (e.g. something like Thompson sampling), and the agent really disvalues deceiving the human, so that doesn’t get reinforced.
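As a rough illustration of the epsilon-greedy point, here is some back-of-the-envelope arithmetic under assumptions that are mine, not the story’s. Suppose reaching deception requires `hops` consecutive steps, each taking one specific action that the greedy policy would not choose, and that exploration picks uniformly among all `n_actions`:

```python
def prob_reach_by_epsilon_greedy(epsilon, n_actions, hops):
    """Chance that `hops` consecutive steps each hit one specific non-greedy action.

    Assumes the agent explores with probability `epsilon` on each step and,
    when exploring, picks uniformly among all `n_actions` actions.
    """
    per_step = epsilon / n_actions
    return per_step ** hops


# Hypothetical numbers: 10% exploration, 10 actions, deception 5 hops away.
p = prob_reach_by_epsilon_greedy(epsilon=0.1, n_actions=10, hops=5)
```

With these made-up numbers the per-attempt chance is about 1e-10, and it shrinks geometrically with each extra hop.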

