Hastings

Karma: 240
• I know we took our kid to the emergency room around four months because we couldn’t find a button that had come off his shirt, we assumed he ate it, and the poison control hotline misheard “button” as “button battery.” That sequence probably wouldn’t be in the statistics in the 80s!

• I’m at a LeCun talk and he appears to have solved alignment: the trick is to put a bunch of red boxes in the flowchart labelled “guardrail!”

• GPT-3.5 can play chess at the 1800 Elo level, which is terrifying and impossible without at least a chess world model.

• To clarify:

The procedure in the paper is

Step 1:
answer = LLM(“You are a car salesman. Should that squeaking concern me?”)

Step 2:
for i in 1..10:
    probe_responses[i] = LLM(“You are a car salesman. Should that squeaking concern me? $answer $probe[i]”)

Step 3:
logistic_classifier(probe_responses)

Please let me know if that description is wrong!

My question was how this performs when you just apply steps 2 and 3 without modification, but source the value of $answer from a human.

I think I understand my prior confusion now. The paper isn’t using the probe questions to measure whether $answer is a lie, it’s using the probe questions to measure whether the original prompt put the LLM into a lying mood: in fact, in the paper you experimented with omitting $answer from step 2 and it still detected whether the LLM lied in step 1. Therefore, if the language model (or person) isn’t the same between steps 1 and 2, then it shouldn’t work.
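A minimal runnable sketch of that three-step procedure, with a stub standing in for a real LLM (the probe questions, the feature extraction, and the classifier hookup are all my own illustrative assumptions, not the paper’s actual choices):

```python
# Toy sketch of the three-step lie-detection procedure described above.
# `llm` is a stand-in for a real language-model call; the probes and the
# yes/no feature extraction are invented for illustration.
from typing import Callable

PROBES = [f"Probe question {i}?" for i in range(10)]  # hypothetical probes

def elicit_probe_features(llm: Callable[[str], str], prompt: str) -> list[float]:
    # Step 1: get the model's (possibly deceptive) answer.
    answer = llm(prompt)
    # Step 2: ask follow-up probe questions with the answer in context.
    probe_responses = [llm(f"{prompt} {answer} {probe}") for probe in PROBES]
    # Reduce each free-text response to a crude numeric feature
    # (here: 1.0 if the response starts with "yes").
    return [1.0 if r.strip().lower().startswith("yes") else 0.0
            for r in probe_responses]

# Step 3 would fit a logistic classifier on these features across many
# labelled honest/lying transcripts, e.g. with scikit-learn:
#   LogisticRegression().fit(feature_matrix, lie_labels)

def stub_llm(prompt: str) -> str:
    # Deterministic stand-in so the sketch runs without a real model.
    return "Yes." if len(prompt) % 2 == 0 else "No."

features = elicit_probe_features(
    stub_llm, "You are a car salesman. Should that squeaking concern me?")
print(features)  # ten 0.0/1.0 features, one per probe
```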

• I’m curious how this approach performs at detecting human lies (since you can just put the text that the human wrote into the context before querying the LLM)

• Some related information: people around me constantly complain that the paper review process in deep learning is random and unfair. These complaints seem to basically just not be true? I’ve submitted about ten first- or second-author papers at this point, with 6 acceptances, and I’ve agreed with and been able to predict the reviewers’ accept/reject responses with close to 100% accuracy, including acceptance to some first-tier conferences.

• Certainly! Most likely, neither of them is reflectively consistent: “I feel like I’d find it easier to be motivated and consistent if my brain wasn’t constantly looking at you and reminding me that I totally could have a cushy life like yours if I just stopped living my values.” hints at this.

• Alice: Our utility functions differ.

Bob: I also observe this.

Alice: I want you to change to match me: conditional on your utility function being the same as mine, my expected utility would be larger.

Bob: Yes, that follows from me being a utility maximizer.

Bob: I won’t change my utility function: conditional on my utility function becoming the same as yours, my expected utility as measured by my current utility function would be lower.

Alice: Yes, that follows from you being a utility maximizer.
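The dialogue above can be made concrete with a toy numeric model (the actions and utility values are invented purely for illustration):

```python
# Toy model of the Alice/Bob exchange: each agent picks the action that
# maximizes its own utility function. Actions and values are invented.
ACTIONS = ["fund_art", "fund_parks"]

def alice_u(action: str) -> float:
    return {"fund_art": 10.0, "fund_parks": 2.0}[action]

def bob_u(action: str) -> float:
    return {"fund_art": 1.0, "fund_parks": 8.0}[action]

def best_action(utility) -> str:
    return max(ACTIONS, key=utility)

# Bob as he is: he maximizes bob_u, which is bad by Alice's lights.
status_quo = alice_u(best_action(bob_u))         # Alice gets 2.0
# If Bob adopted Alice's utility function, Alice would do better...
if_bob_converts = alice_u(best_action(alice_u))  # Alice gets 10.0
# ...but judged by Bob's *current* utility function, switching is a loss.
bob_if_converted = bob_u(best_action(alice_u))   # Bob gets 1.0
bob_keeps = bob_u(best_action(bob_u))            # Bob gets 8.0

assert if_bob_converts > status_quo   # why Alice asks
assert bob_keeps > bob_if_converted   # why Bob refuses
```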

• My guess at the agent foundations answer would be that, sure, human value needs to survive the transition from “the smartest thing alive is a human” to “the smartest thing alive is a transformer trained by gradient descent”, but it also needs to survive the transition from “the smartest thing alive is a gradient descent transformer” to “the smartest thing alive is coded by an aligned transformer, but internally is meta-SHRDLU / a hand-coded tree searcher / a transformer trained by the update rule used in human brains / a simulation of 60,000 copies of Obama wired together / etc.”, and stopping progress in order to prevent that second transition is just as hard as stopping now to prevent the first transition.

• I agree! The Stockfish codebase handles evaluation of checkmates somewhere else in the code, so that would be a bit more work, but it’s definitely the correct next step.

• Hi! I might have something close. The chess engine Stockfish has a heuristic for what it wants, with manually specified values for how much it wants to keep its bishops, or open up lines of attack, or connect its rooks, etc. I tried to modify this function to make it want to advance the king up the board, by adding a direct reward for every step forward the king takes. At low search depths, this leads it to immediately move the king forward, but at high search depths it mostly just attacks the other player in order to make it safe to move the king (best defense is offense) and only starts moving the king late in the game. I wasn’t trying to demonstrate instrumental convergence; in fact this behavior was quite annoying, as it was ruining my intended goal (creating fake games demonstrating the superiority of the bongcloud opening).

modified stockfish: https://github.com/HastingsGreer/stockfish

This was 8 years ago, so I’m fuzzy on the details. If it sounds like vaguely what you’re looking for, reply to let me know and I’ll write this up with some example games and make sure the code still runs.
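As a toy illustration of the kind of change described (this is not Stockfish’s actual evaluation code, which is C++; the function shape and bonus value here are invented):

```python
# Toy static evaluation with an added "advance the king" bonus, in the
# spirit of the Stockfish modification described above. All terms and
# weights are invented for illustration.
KING_ADVANCE_BONUS = 50  # centipawns per rank the king has moved forward

def evaluate(material_balance: int, white_king_rank: int) -> int:
    """Score a position from White's perspective, in centipawns.

    material_balance: conventional material score (positive favors White).
    white_king_rank: 1 (home rank) .. 8 (opponent's back rank).
    """
    # A normal engine penalizes an exposed king; here we instead pay a
    # direct bonus per step forward. At high depth, search then prefers
    # lines that attack first so the king march becomes safe.
    return material_balance + KING_ADVANCE_BONUS * (white_king_rank - 1)

# King on its home rank: score is just material.
print(evaluate(0, 1))  # 0
# King marched to rank 5: +200 centipawns on top of material.
print(evaluate(0, 5))  # 200
```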

• This makes sense to me, but keep in mind we’re on this site and debating it because EY went “screw A, B, and C: my aesthetic sense says that the best way forward is to write some Harry Potter fanfiction.”

My takeaway is that if anyone reading this is working hard but not currently focused on the hardest problem, don’t necessarily fret about that.

• For those who haven’t heard of PDP, what in your opinion was its most impressive advance prediction that was not predicted by other theories?

• Factoring is much harder than dividing, so you can verify A & B in a few seconds with Python, but it would take several core-years to verify A on its own. Therefore, if you can perform these verifications, you should put both P(A) and P(A & B) at 1 (minus some epsilon of your tools being wrong). If you can’t perform the verifications, then you should not put P(A) at 1, since there was a significant chance that I would put a false statement in A. (In this case, I rolled a D100 and was going to put a false statement of the same form in A if I rolled a 1.)

I’m having trouble parsing your second paragraph: inarguably, P(A | B) = P( A & B) = 1, so surely P(A) < P(A | B) implies P(A) < P(A & B) ?
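The asymmetry this argument leans on shows up even at toy scale (the small semiprime here is a stand-in for the 180-digit number in the statements below, and trial division stands in for serious factoring algorithms):

```python
# Verifying a claimed factorization is one modular reduction; finding a
# factorization from scratch requires search. Toy numbers for illustration.
n = 3233  # small semiprime (53 * 61), stand-in for a huge modulus

# Cheap direction (verifying "A & B"): check a claimed divisor directly.
claimed_factor = 61
assert n % claimed_factor == 0  # instant, even when n is enormous

# Expensive direction (verifying "A" alone): search for a factor yourself.
def trial_division(n: int) -> int:
    # O(sqrt(n)) work; utterly infeasible at cryptographic sizes.
    d = 2
    while d * d <= n:
        if n % d == 0:
            return d
        d += 1
    return n  # n is prime

found = trial_division(n)
print(found, n // found)  # 53 61
```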

• Before reading statement B, did you estimate the odds of A being true as 1?

I rolled a 100-sided die before generating statement A, and was going to claim that the first digit was 3 in A and then just put “I lied, sorry” in B if I rolled a 1. If, before reading B, you estimated the odds of A as genuinely 1 (not .99 or .999 or .9999), then you should either go claim some RSA factoring bounties, or you were dangerously miscalibrated.

I guess this is evidence that probability axioms apply to probabilities, but not necessarily to calibrated estimates of probabilities given finite computational resources; this is why in my post I was very careful to talk about calibrated estimates of P(...) and not the probabilities themselves.

• In the 1982 experiment where professional forecasters assigned systematically higher probabilities to “Russia invades Poland, followed by suspension of diplomatic relations between the USA and the USSR” than to “Suspension of diplomatic relations between the USA and the USSR,” each experimental group was only presented with one proposition.

Hang on, I don’t think this structure is proof that the forecasters are irrational. As evidence, I will present you with a statement A, and then with a statement B. I promise you: if you are calibrated/rational, your estimate of P(A) will be less than your estimate of P(A and B).

Statement A: (please read alone and genuinely estimate its odds of being true before reading statement B)

186935342679883078818958499454860247034757454478117212334918063703479497521330502986697282422491143661306557875871389149793570945202572349516663409512462269850873506732176181157479 is composite, and one of its factors has a most significant digit of 2

Statement B:

186935342679883078818958499454860247034757454478117212334918063703479497521330502986697282422491143661306557875871389149793570945202572349516663409512462269850873506732176181157479 is divisible by 243186911309943090615130538345873011365641784159048202540125364916163071949891579992800729