So, this post only deals with agent counterfactuals (not environmental counterfactuals), but I believe I have solved the technical issue you mention about the construction of logical counterfactuals as it concerns TDT. See: https://www.alignmentforum.org/posts/TnkDtTAqCGetvLsgr/a-possible-resolution-to-spurious-counterfactuals
I have fewer thoughts about environmental counterfactuals, but I think a similar approach could work there too, i.e., constructing alternate agents that receive different observations about the world. I’m not sure any very specific technical problem exists with that, though—the TDT paper already talks about world-model surgery.
It seems like you could use these counterfactuals to do whatever decision theory you’d like? My goal wasn’t to solve genuinely hard decision problems—the 5-and-10 problem is perhaps the easiest decision I can imagine—but merely to construct a formalism in which even extremely simple decisions involving self-proofs can be solved at all.
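For concreteness, here is the 5-and-10 problem in miniature as a toy Python sketch (my own hypothetical framing, not the formalism from the linked post): counterfactuals are evaluated by running an alternate agent hardwired to each action, rather than by reasoning about what the original agent would provably do—which is the move that sidesteps the spurious-counterfactual trap.

```python
# Toy 5-and-10 setup (hypothetical illustration, not the post's formalism).
# The environment pays out whichever bill the agent takes.

def environment(action):
    """Pays $5 or $10 depending on which bill is taken."""
    return {"five": 5, "ten": 10}[action]

def counterfactual_value(action):
    # Agent counterfactual: construct an alternate agent that simply
    # performs `action`, and observe the payoff that agent receives.
    alternate_agent = lambda: action
    return environment(alternate_agent())

def agent():
    # Choose the action whose alternate-agent payoff is highest.
    return max(["five", "ten"], key=counterfactual_value)

print(agent())  # the agent takes the ten
```

The point of the sketch is only that the alternate agent is evaluated by execution, so there is no step where the agent proves theorems about its own output and can latch onto a spurious implication like "if I take the 10, I get 0."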
I think the reason this seems to imply a decision theory is that the model is so simple that some ways of making decisions are impossible within it—a fair portion of that was inherited from the pseudocode in the Embedded Agency paper. I have an extension of the formalism in mind that allows an expression of UDT as well (I suspect—or something very close to it; I haven’t studied the paper closely enough yet to be sure). I would love to hear your thoughts once I get that post written up? :)