there are books on the topic

Does anyone know if this book is any good? I’m planning to get more familiar with interpretability research, and ‘read a book’ has just appeared in my set of options.


I think the culprit is ‘overturned’. That makes it sound like their counterarguments were a done deal or something. I’ll reword that to ‘rebutted and reframed in finer detail’.

Yeah, I think overturned is the word I took issue with. How about ‘disputed’? That seems to be the term that remains agnostic about whether there is something wrong with the original argument or not.

Perhaps your impression from your circle differs from mine regarding what proportion of AIS researchers prioritise work on the fast takeoff scenario?

My impression is that gradual takeoff has gone from a minority to a majority position on LessWrong, primarily due to Paul Christiano, but not an overwhelming majority. (I don’t know how it differs among Alignment Researchers.)

I believe the only data I’ve seen on this was in a thread where people were asked to make predictions about AI stuff, including takeoff speed and timelines, using the new interactive prediction feature. (I can’t find this post—maybe someone else remembers what it was called?) I believe that was roughly compatible with the sizeable minority summary, but I could be wrong.

Eliezer Yudkowsky’s portrayal of a single self-recursively improving AGI (later overturned by some applied ML researchers)

I’ve found myself doubting this claim, so I’ve read the post in question. As far as I can tell, it’s a reasonable summary of the fast takeoff position that many people still hold today. If all you meant to say was that there was disagreement, then fine—but saying ‘later overturned’ makes it sound like there is consensus, not that people are still having the same disagreement they’ve had for 13 years. (And your characterization in the paragraph I’ll quote below also gives that impression.)

In hindsight, judgements read as simplistic and naive in similar repeating ways (relying on one metric, study, or paradigm and failing to factor in mean reversion or model error there; fixating on the individual and ignoring societal interactions; assuming validity across contexts):


Here is a construction of : We have that is the inverse of . Moreover, is the inverse of . [...]

Yeah, that’s conclusive. Well done! I guess you can’t divide by zero after all ;)

I think the main mistake I’ve made here is to assume that inverses are unique without questioning it, which of course doesn’t make sense at all if I don’t yet know that the structure is a field.

My hunch is that any bidirectional sum of integer powers of x which we can actually construct is “artificially complicated” and can be rewritten as a one-directional sum of integer powers of x. So, this would mean that your number system is what you get when you take the union of Laurent series going in the positive and negative directions, where bidirectional coordinate representations are far from unique. I’d be delighted to hear a justification of this or a counterexample.

So, I guess one possibility is that, if we take equivalence classes of elements that are equal in this structure, the resulting set of classes is isomorphic to the Laurent numbers. But another possibility could be that it all collapses into a single class—right? At least I don’t yet see a reason why that can’t be the case (though I haven’t given it much thought). You’ve just proven that some elements equal zero; perhaps it’s possible to prove it for all elements.

You’ve understood correctly minus one important detail:

The structure you describe (where we want elements and their inverses to have finite support)

Not elements *and* their inverses! Elements *or* their inverses. I’ve shown an example to demonstrate that you quickly get infinite inverses, and you’ve come up with an abstract argument for why finite inverses won’t cut it:

To show that nothing else works, let $f$ and $g$ be any two nonzero sums of finitely many integer powers of $x$. Then, the leading term (product of the highest power terms of $f$ and $g$) will be some nonzero thing. But also, the smallest term (product of the lowest power terms of $f$ and $g$) will be some nonzero thing. Moreover, we can’t get either of these to cancel out. So, the product can never be equal to $1$. (Unless both are monomials.)
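The leading-term/smallest-term argument is easy to sanity-check mechanically. Representing finite-support elements as dicts from exponent to coefficient (my own throwaway representation, not anything from the thread), the product's highest and lowest exponents are the sums of the factors' highest and lowest exponents, so the product has at least two terms unless both factors are monomials:

```python
import random

def mul(f, g):
    # Convolution product of two finite-support formal series,
    # each given as {exponent: coefficient}.
    out = {}
    for i, a in f.items():
        for j, b in g.items():
            out[i + j] = out.get(i + j, 0) + a * b
    return {n: c for n, c in out.items() if c != 0}

random.seed(0)
for _ in range(1000):
    # Random nonzero finite-support elements with positive coefficients
    # (positivity rules out any accidental cancellation of interior terms).
    f = {random.randint(-3, 3): random.uniform(0.1, 1) for _ in range(random.randint(1, 4))}
    g = {random.randint(-3, 3): random.uniform(0.1, 1) for _ in range(random.randint(1, 4))}
    p = mul(f, g)
    # Leading and trailing exponents add; neither coefficient can vanish.
    assert max(p) == max(f) + max(g) and min(p) == min(f) + min(g)
    if len(f) > 1 or len(g) > 1:
        assert len(p) >= 2  # a non-monomial product can never equal 1
print("leading/trailing term argument holds on all samples")
```

This only samples cases rather than proving anything, but it matches the abstract argument: the product of the two extreme terms survives, so the product of non-monomials always has at least two terms.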

In particular, the example you gave does have an inverse. Perhaps a better way to describe this set is ‘all you can build in finitely many steps using addition, inverse, and multiplication, starting from only elements with finite support’. Perhaps you can construct infinite-but-periodical elements with infinite-but-periodical inverses; if so, those would be in the field as well (if it’s a field).

If you can construct such an element, it would not be a field. But constructing one may be impossible.

I’m currently completely unsure if the resulting structure is a field. If you take a bunch of finite elements, take their infinite-but-periodical inverses, and multiply those inverses, the resulting number again has a finite inverse due to the argument I’ve shown in the previous comment. But if you use addition on one of them, things may go wrong.

A larger structure to take would be formal Laurent series in $x$. These are sums of finitely many negative powers of $x$ and arbitrarily many positive powers of $x$. This set is closed under multiplicative inverses.

Thanks; this is quite similar—although not identical.

Edit: this structure is not a field as proved by just_browsing.

Here is a wacky idea I’ve had forever.

There are a bunch of areas in math where you get expressions of the form $\frac{0}{0}$ and they resolve to some number, but it’s not always the same number. I’ve heard some people say that $\frac{0}{0}$ “can be any number”. Can we formalize this? The formalism would have to include $4 \cdot 0$ as something different from $3 \cdot 0$, so that if you divide the first by 0, you get 4, but the second gives 3.

Here is a way to turn this into what may be a field or ring. Each element is a function $f : \mathbb{Z} \to \mathbb{R}$, which reads as the formal sum $\sum_{n \in \mathbb{Z}} f(n)\, x^n$. Addition is component-wise ($(f+g)(n) = f(n) + g(n)$; this makes sense), and multiplication is multiplication of formal sums, so we get the convolution rule $(f \cdot g)(n) = \sum_{i+j=n} f(i)\, g(j)$.

This becomes a problem once elements with infinite support are considered, i.e., functions that are nonzero at infinitely many values, since then the sum may not converge. But it’s well defined for numbers with finite support. This is all similar to how polynomials are handled formally, except that polynomials only go in one direction (i.e., they’re functions from $\mathbb{N}$ rather than $\mathbb{Z}$), and that also solves the non-convergence problem. Even if infinite polynomials are allowed, multiplication is well-defined since for any $n$, there are only finitely many pairs of natural numbers $(i, j)$ such that $i + j = n$.
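For concreteness, here is a minimal sketch of this arithmetic for finite-support elements, representing each function as a Python dict from integer exponent to real coefficient (the representation and function names are mine, not anything canonical):

```python
# A formal bidirectional power series with finite support,
# represented as {exponent: coefficient} with integer exponents.

def add(f, g):
    # Component-wise addition: (f + g)(n) = f(n) + g(n).
    out = dict(f)
    for n, c in g.items():
        out[n] = out.get(n, 0) + c
    return {n: c for n, c in out.items() if c != 0}

def mul(f, g):
    # Convolution: (f * g)(n) = sum over i + j = n of f(i) * g(j).
    # Well-defined here because both supports are finite.
    out = {}
    for i, a in f.items():
        for j, b in g.items():
            out[i + j] = out.get(i + j, 0) + a * b
    return {n: c for n, c in out.items() if c != 0}

# x is the formal symbol playing the role of 'zero'; x^1 is {1: 1}.
x = {1: 1}
four_zeros = mul({0: 4}, x)   # '4 times zero', i.e. 4x
three_zeros = mul({0: 3}, x)  # '3 times zero', i.e. 3x
# The two are distinct elements, and 'dividing by zero'
# (multiplying by x^-1) recovers 4 and 3 respectively:
x_inv = {-1: 1}
print(mul(four_zeros, x_inv))   # {0: 4}
print(mul(three_zeros, x_inv))  # {0: 3}
```

Zero-stripping after each operation keeps the finite-support invariant explicit; it is a convenience of this sketch, not part of the definition.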

The additively neutral element in this setting is the zero function and the multiplicatively neutral element is $x^0$, i.e., the function that is $1$ at $0$ and $0$ everywhere else. Additive inverses are easy; $(-f)(n) = -f(n)$. The interesting part is multiplicative inverses. Of course, there is no inverse of the zero function, so we still can’t divide by the ‘real’ zero. But I believe all elements with finite support do have a multiplicative inverse (there should be a straightforward inductive proof for this). Interestingly, those inverses are not finite anymore, but they are periodical: the inverse of a monomial is just another monomial, but the inverse of an element with two or more terms is an infinite periodical series.

I *think* this becomes a field with well-defined operations if one considers only the elements with finite support and elements with inverses of finite support. (The product of two elements-whose-inverses-have-finite-support should itself have an inverse of finite support because $(fg)^{-1} = f^{-1} g^{-1}$.) I wonder if this structure has been studied somewhere… probably without anyone thinking of the interpretation considered here.
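The inductive proof sketch can be mirrored numerically: once the lowest-exponent coefficient is known to be nonzero, each coefficient of the inverse is forced by the previous ones. A rough illustration, with my own names and the dict-of-coefficients representation as an assumption:

```python
def inverse_coeffs(f, terms=8):
    # Compute the first `terms` coefficients of the multiplicative
    # inverse of a finite-support element f ({exponent: coefficient}),
    # starting from the inverse's lowest exponent. Solves
    # (f * g)(n) = 1 if n == 0 else 0, one coefficient at a time.
    m = min(f)   # lowest exponent of f
    a = f[m]     # its (nonzero) lowest coefficient
    g = {}
    for k in range(terms):
        n = -m + k  # next exponent of g to determine
        # The convolution at exponent m + n involves only g-coefficients
        # already computed, plus the new one g[n] multiplied by a.
        target = 1 if m + n == 0 else 0
        acc = sum(f[i] * g.get(m + n - i, 0) for i in f if i != m)
        g[n] = (target - acc) / a
    return g

# Inverse of 1 + x: coefficients 1, -1, 1, -1, ... (periodical, as claimed).
print(inverse_coeffs({0: 1, 1: 1}, terms=6))
```

This only produces a truncation of the inverse, but it shows why the induction goes through: each step is a single linear equation with nonzero pivot $a$.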

There are a bunch of sequences, like the value learning sequence, that have structured formatting in the sequence overview (the page the link goes to): something like a headline, a bunch of posts, another headline, a few more posts.

How is this done? When I go into the sequence editor, I only see one text field where I can write something which then appears in front of the list of posts.

This post is similar to the one Eliezer Yudkowsky wrote.

Cool, thanks.

While QRI is only occasionally talked about on LessWrong, I personally continue to think that they’re doing the most exciting research that exists today, provided you take a utilitarian perspective. I’ve donated to MIRI in the past, in part because their work seems highly non-replaceable. I still stand by that reason, but it applies even more to QRI. Even if there is only a small chance that formalizing consciousness is both possible and practically feasible, the potential upside seems enormous. Success in formalizing suffering wouldn’t solve AI alignment (for several reasons, one of them being Inner Optimizers), but I imagine it would be extremely helpful. There is nothing approximating a consensus on the related philosophical problems in the community, and positions on those issues seem to have a significant causal influence on what research is being pursued.

It helps that I share most if not all of the essential philosophical intuitions that motivate QRI’s research. On the other hand, research should be asymmetrical with regard to what’s true. In the world where moral realism is false and suffering isn’t objective or doesn’t have structure, beliefs to the contrary (which many people in the community hold today) could lead to bad alignment-related decisions. In that case, any attempts to quantify suffering would inevitably fail, and that would itself be relevant evidence.

Re personal opinion: what is your take on the feasibility of human experiments? It seems like your model is compatible with IDA working out even though no one can ever demonstrate something like ‘solve the hardest exercise in a textbook’ using participants with limited time who haven’t read the book.

This is an accurate summary, minus one detail:

The judge decides the winner by evaluating whether the final statement is true or not.

“True or not” makes it sound symmetrical, but the choice is between ‘very confident that it’s true’ and ‘anything else’. Something like ‘80% confident’ goes into the second category.

One thing I would like to be added is just that I come out moderately optimistic about Debate. It’s not too difficult for me to imagine the counterfactual world where I think about FC and find reasons to be pessimistic about Debate, so I take the fact that I didn’t as non-zero evidence.

I think the Go example really gets to the heart of why I think Debate doesn’t cut it.

Your comment is an argument against using Debate to settle moral questions. However, what if Debate is trained on Physics and/or math questions, with the eventual goal of asking “what is a provably secure alignment proposal?”

Before offering an “X is really about Y” signaling explanation, it’s important to falsify the “X is about X” hypothesis first. Once that’s done, signaling explanations require, at minimum:

1. An action or decision by the receiver that the sender is trying to motivate.

2. (2.1) An explanation for why the receiver is listening for signals in the first place, and (2.2) why the sender is trying to communicate them.

3. A language that the sender has reason to think the receiver will understand and believe as the sender intended.

4. A physical mechanism for sending and receiving the signal.

(Added numbers for reference.)

I think 1, 2.1, and 3 are all wrong, in that none of them are required for a signaling hypothesis to be plausible. I believe you’re assuming that signaling is effective and/or rational, but this is a mistake. Signaling was optimized to be effective in the ancestral environment, so there’s no reason why it should still be effective today. As far as I can tell, it generally is not.

As an example, consider men wearing solid shoes in the summer despite finding them uncomfortable. There is no action this is trying to motivate, and there is no reason to expect the receiver is listening—in fact, there is often good reason to expect that they are not listening (in many contexts, people really don’t care about your shoes). Nonetheless, I think conformity signaling is the correct explanation for this behavior.

The pilot example is problematic because in this case, signaling is part of a high-level plan, which makes it a non-central example. Most of the time, signaling is motivated by evolutionary instincts, like the fear of standing out; in the case of religion, I think this is most of the story. Those instincts can then translate into high-level behavior like going to church, but it’s not the beginning of the causal chain.

Thanks for hosting this contest. The overconfidence thing in particular is a fascinating data point. When I was done with my function that output final probabilities, I deliberately made it way more agnostic, thinking that at least now my estimates are modest—but it turns out that I should have gone quite a bit farther with that adjustment.

I’m also intrigued by the variety of approaches for analyzing strings. I solely looked at frequencies of monotone groups (i.e., how many single 0s, how many 00s, how many 000s, how many 111s, etc.), and as a result, I have widely different estimates (compared to the winning submissions) on some of the strings where other methods were successful.
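For what it’s worth, the run-counting itself takes only a few lines, e.g. with itertools.groupby; this is a sketch of the general approach, not my actual contest code:

```python
from collections import Counter
from itertools import groupby

def run_frequencies(s):
    # Count how often each maximal monotone run ('0', '00', '1', '111', ...)
    # occurs in a binary string. groupby yields consecutive equal characters.
    return Counter("".join(g) for _, g in groupby(s))

print(run_frequencies("00100110"))
# Counter({'00': 2, '1': 1, '11': 1, '0': 1})
```

These frequencies can then feed whatever scoring rule one uses to judge how random-looking a string is.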

I think it’s an OK description of what we do in terms of epistemic rationality. I’m not so sure it captures the instrumental part. The biggest impact that joining this community had on my life was that I started really taking actions to further my goals.