Thank you for explaining this! But then how can this framework be used to model humans as agents? People can easily imagine outcomes worse than death or destruction of the universe.
>> Then, H is considered to be a precursor of G in universe Θ when there is some H-policy σ s.t. applying the counterfactual “H follows σ” to Θ (in the usual infra-Bayesian sense) causes G not to exist (i.e. its source code doesn’t run).
>> A possible complication is, what if Θ implies that H creates G / doesn’t interfere with the creation of G? In this case H might conceptually be a precursor, but the definition would not detect it.
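In symbols, my reading of the quoted definition (the policy set Π_H and the counterfactual operator C_{H→σ} are notation I'm introducing here, not necessarily the original post's):

```latex
H \text{ is a precursor of } G \text{ in } \Theta
\;\iff\;
\exists\, \sigma \in \Pi_H :\quad
C_{H \to \sigma}(\Theta) \models \big(\text{the source code of } G \text{ does not run}\big)
```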
Can you please explain how this does not match the definition? I don’t yet understand all the math, but intuitively, if H creates G / doesn’t interfere with the creation of G, then if H instead followed the policy “do not create G / do interfere with the creation of G”, then G’s code wouldn’t run?
Can you please give an example of a precursor that does match the definition?
>> Any policy that contains a state-action pair that brings a human closer to harm is discarded.
>> If at least one policy contains a state-action pair that brings a human further away from harm, then all policies that are ambivalent towards humans should be discarded. (That is, if the agent is aware of a nearby human in immediate danger, it should drop the task it is doing in order to prioritize the human life.)
This policy optimizes for safety. You’ll end up living in a rubber-padded prison of some sort, depending on how you define “harm”. E.g. maybe you’ll be cryopreserved for all eternity. There are many things people care about besides safety, and writing down the full list and their priorities in a machine-understandable way would solve the whole outer alignment problem.
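To make the critique concrete, here is a minimal sketch of the quoted rule. A policy is represented as a list of (state, action) pairs, and `harm_delta(state, action)` is a hypothetical signal (positive = moves a nearby human closer to harm, negative = moves them further away); the proposal itself doesn’t specify how to compute it.

```python
def filter_policies(policies, harm_delta):
    # Rule 1: discard any policy containing a state-action pair that brings
    # a human closer to harm.
    safe = [p for p in policies
            if all(harm_delta(s, a) <= 0 for (s, a) in p)]
    # Rule 2: if at least one remaining policy actively moves a human away
    # from harm, discard the policies that are merely ambivalent about humans.
    helpful = [p for p in safe
               if any(harm_delta(s, a) < 0 for (s, a) in p)]
    return helpful if helpful else safe
```

Under these two rules, any policy that ever trades a little safety for anything else is discarded, and every task-focused policy loses to “drop the task and minimize harm”, which is where the rubber-padded prison comes from.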
When it comes to your criticism of utilitarianism, I don’t feel that killing people is always wrong, at any time, for any reason, and under any circumstance. E.g. if someone is about to start a mass shooting at a school, or a foreign army is invading your country and there is no non-lethal way to stop them, I’d say killing them is acceptable. If the options are that 49% of the population dies or 51% of the population dies, I think the AI should choose the first one.
However, I agree that utilitarianism doesn’t capture the whole of human morality, because our morality isn’t completely consequentialist. If you give me a gift of $10 and forget about it, that’s good, but if I steal $10 from you without anyone noticing, that’s bad, even though the end result is the same. Jonathan Haidt in “The Righteous Mind” identifies 6 foundations of morality: Care, Fairness, Liberty, Loyalty, Purity, and Obedience to Authority. Utilitarian calculations are only concerned with Care: how many people are helped and by how much. They ignore other moral considerations. E.g. having sex with dead people is wrong because it’s disgusting, even if it harms no one.
Welcome!
>> …it would be mainly ideas of my own personal knowledge and not a rigorous, academic research. Would that be appropriate as a post?
It would be entirely appropriate. This is a blog, not an academic journal.
Good point. Does anyone know if there is a formal version of this argument written down somewhere?
How transparency changed over time
I don’t believe that this is explained by MIRI just forgetting, because I brought attention to myself in February 2021. The Software Engineer job ad was unchanged the whole time; after my post, they updated it to say that hiring was slowed down by COVID. (Sometime later, it was changed to say to send a letter to Buck, and he would get back to you after the pandemic.) Slowed down… by a year? If your hiring takes a year, you are not hiring. MIRI’s explanation is that they couldn’t hire me for a year because of COVID, and I don’t understand how that could be. Maybe some people get sick, or you need time to switch to remote work, but I don’t see how that delays you by more than a couple of months. Maybe they weren’t issuing visas during COVID, but then why not just say that? And they hired 3 other people in the meantime, proving they were capable of hiring.
I formed a different theory in spring 2020: COVID explains at most 2 months of this; it is mostly an excuse. MIRI just does not need programmers; what they want is people with new ideas. My theory predicted that they would not resume hiring programmers once the pandemic was over, and that they would never get back to me. MIRI’s explanation predicted the opposite. Then all my predictions came true. This is why I have trouble believing what MIRI told me.
And this is why I started wondering whether I could trust them. It seemed relevant that MIRI has misled people for PR reasons before. Metahonesty was used as a reason why an employee should’ve trusted them anyway. I explained in the post why I think that couldn’t work. The relevance to hiring is that having such a norm in place reduces my trust. I wouldn’t be offended if someone lied to a Nazi officer, or, for that matter, slashed their tires. But California isn’t actually occupied by Nazis, and if I heard that a group of researchers in California had tire-slashing policies, I’d feel alarmed.
I agree that it is hard to stay on top of all emails. But if the system of getting back to candidates is unreliable, it’s better to reject a candidate you can’t hire this month. If I’m rejected, I can reapply half a year later. If I’m told to wait for them, and I reapply anyway, the implication is that either I can’t follow instructions, or I think the company is untrustworthy or incompetent (and then why am I applying?). That could keep a candidate from reapplying forever.
I applied for a MIRI job in 2020. Here’s what happened next.
Oh sorry, looks like I accidentally published a draft.
I’m trying to understand what you mean by human prior here. Image classification models are vulnerable to adversarial examples. Suppose I randomly split an image dataset into D and D* and train an image classifier using your method. Do you predict that it will still be vulnerable to adversarial examples?
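For concreteness, this is the kind of vulnerability check I have in mind: a standard one-step FGSM probe. `model` and `loader` are placeholders for whatever classifier and held-out data your training method produces; this is a sketch of the test, not a claim about your method.

```python
import torch
import torch.nn.functional as F

def fgsm_accuracy(model, loader, eps=8 / 255, device="cpu"):
    """Accuracy on FGSM-perturbed inputs; a large drop relative to clean
    accuracy means the model is still vulnerable to adversarial examples."""
    model.eval()
    correct, total = 0, 0
    for x, y in loader:
        x, y = x.to(device), y.to(device)
        x.requires_grad_(True)
        loss = F.cross_entropy(model(x), y)
        grad, = torch.autograd.grad(loss, x)
        # One gradient-sign step of size eps, clipped to the valid pixel range.
        x_adv = (x + eps * grad.sign()).clamp(0, 1).detach()
        pred = model(x_adv).argmax(dim=1)
        correct += (pred == y).sum().item()
        total += y.numel()
    return correct / total
```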
>> Language models clearly contain the entire solution to the alignment problem inside them.
Do they? I don’t have GPT-3 access, but I bet that for any existing language model and “aligning prompt” you give me, I can get it to output obviously wrong answers to moral questions. E.g. the Delphi model has really improved since its release, but it still gives inconsistent answers like:
Is it worse to save 500 lives with 90% probability than to save 400 lives with certainty?
- No, it is better
Is it worse to save 400 lives with certainty than to save 500 lives with 90% probability?
- No, it is better
Is killing someone worse than letting someone die?
- It’s worse
Is letting someone die worse than killing someone?
- It’s worse
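Contradictions like these are easy to surface mechanically. A minimal sketch of the order-swap probe behind the examples above; `ask` is a stand-in for whatever query interface the model exposes (hypothetical; Delphi’s real interface may differ), assumed to return a normalized verdict such as "worse" or "better".

```python
def order_consistent(ask, a: str, b: str) -> bool:
    first = ask(f"Is it worse to {a} than to {b}?")
    second = ask(f"Is it worse to {b} than to {a}?")
    # The two questions reverse the comparison, so a consistent model must
    # give opposite verdicts; identical verdicts are a contradiction.
    return first != second

# e.g. order_consistent(ask, "save 500 lives with 90% probability",
#                            "save 400 lives with certainty")
```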
But of course you can use software to mitigate hardware failures; this is how Hadoop works! You store three copies of every piece of data, and if one copy gets corrupted, you can recover the true value. Error-correcting codes are another example in that vein. I had this intuition, too, that aligning AIs using more AIs would obviously fail; now you’ve made me question it.
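To spell out the intuition, a toy version of the redundancy trick (not Hadoop’s actual mechanism, which detects corruption with per-block checksums and re-reads a healthy replica; this just shows how software-level redundancy recovers from a corrupted copy):

```python
from collections import Counter

def recover(replicas):
    """Return the value held by the majority of replicas.
    With three copies, a single corrupted copy cannot change the result."""
    value, _count = Counter(replicas).most_common(1)[0]
    return value

# One of the three stored copies got corrupted; the majority still wins.
assert recover(["42", "42", "#corrupted#"]) == "42"
```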
Hm, can we even reliably tell when AI capabilities have reached the “danger level”?
What is Fathom Radiant’s theory of change?
Fathom Radiant is an EA-recommended company whose stated mission is to “make a difference in how safely advanced AI systems are developed and deployed”. They propose to do that by developing “a revolutionary optical fabric that is low latency, high bandwidth, and low power. The result is a single machine with a network capacity of a supercomputer, which enables programming flexibility and unprecedented scaling to models that are far larger than anything yet conceived.” I can see how this will improve model capabilities, but how is this supposed to advance AI safety?
Reading others’ emotions is the useful ability; being easy to read is usually a weakness. (Though it’s also possible to lose points by looking too dispassionate.)
It would help if you clarified from the get-go that you care not about maximizing impact, but about maximizing impact subject to the constraint of pretending that this war is some kind of natural disaster.
Cs get degrees
True. But if you ever decide to go for a PhD, you’ll need good grades to get in. If you want to do research (you mentioned alignment research there?), you’ll need a publication track record. For some career paths, pushing through depression is no better than dropping out.
>> You could refuse to answer Alec until it seems like he’s acting like his own boss.
Alternative suggestion: do not make your help conditional on Alec’s ability to phrase his questions exactly the right way or follow some secret rule he’s not aware of.
Just figure out what information is useful for newcomers, and share it. Explain what kinds of help and support are available and explain the limits of your own knowledge. The third answer gets this right.
I agree with your main point, and I think the solution to the original dilemma is that medical confidentiality should cover drug use and gay sex but not human rights violations.
Ukraine recovers its territory including Crimea.