Interested in big-picture questions, decision theory, and altruism.
cSkeleton
Hi, did you ever go anywhere with Conversation Menu? I’m thinking of doing something similar for AI risk, to quickly get people to the arguments around their initial reaction. If helping with something like this is the kind of thing you had in mind with Conversation Menu, I’d be interested to hear any further thoughts you have. (Note: I’m thinking of fading in buttons rather than showing a typical menu.) Thanks!
Thanks for the wonderful post!
What are the approximate costs of the therapist/coach options?
I suspect most people here are pro-cryonics and anti-cremation.
Thanks for your thoughts. It sounds like this is a major risk, but hopefully when we know more (if we can get there) we’ll have a better idea of how to maximize things and find at least one good option. [insert sweat face emoji for discomfort, but going forward boldly]
Thanks for the response! I really appreciate it.
a) Yes, I meant “the probability of”
b) Thinking about how to plot this on graphs is helping me clarify my thinking, and I think adding them may help reduce inferential distance. (The x-axis is probability. For the case where we consider infinite utilities, as opposed to the human case, the graph would need to be split in two. The left graph is just a horizontal line at infinity, but it still covers a probability range. The right graph has an actual curve and covers the rest of the probability range, but it doesn’t matter, since its utility values are finite. Considering only the infinite utilities is a fanatical decision procedure but doesn’t generally lead to weird decisions. Does that make sense?)
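Here’s a rough matplotlib sketch of the two graphs I mean (the cutoff probability and the curve shape are made up, and the infinite line has to be drawn as a capped horizontal stand-in):

import numpy as np
import matplotlib.pyplot as plt

fig, (left, right) = plt.subplots(1, 2, figsize=(8, 3))

# Left graph: outcomes with infinite utility over part of the probability range.
left.axhline(1.0, color="tab:red")  # stand-in for the horizontal line at infinity
left.set_xlim(0.0, 0.3)
left.set_title("Infinite utilities")
left.set_xlabel("Probability")
left.set_ylabel("Utility (capped)")

# Right graph: an actual (finite) utility curve over the rest of the range.
p_fin = np.linspace(0.3, 1.0, 100)
right.plot(p_fin, np.log1p(10 * (p_fin - 0.3)))  # arbitrary finite curve
right.set_title("Finite utilities")
right.set_xlabel("Probability")

plt.tight_layout()
plt.show()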
Thanks @RolfAndreassen. I’m reconsidering and will post a different version if I get there. I’ve marked this one as [retracted].
I find this confusing. My actual strength of belief, right now, that I can tip an outcome affecting at least 3^^^3 other people is a lot closer to 1/1,000,000 than to 1/(3^^7625597484987). My justification: while 3^^^3 isn’t a number that fits into any finite multiverse, the universe going on for infinitely long seems at least somewhat possible, anthropic reasoning may not be valid here (I added a 10x penalty in case it is), and I have various other ideas. The difference between those two probabilities is large (to put it mildly) and significant (one is worth thinking about and the other isn’t). How do we resolve this?
I’m having difficulty following the code for the urn scenario. Can it be something like this?
import math
import random

def P():
    # Initialize the world with random balls (or whatever)
    num_balls = 1000
    urn = [random.choice(["red", "white"]) for _ in range(num_balls)]
    # Run the world
    history = []
    total_loss = 0.0
    for i, ball in enumerate(urn):
        probability_of_red = S(history)
        if (probability_of_red == 1 and ball != "red") or (probability_of_red == 0 and ball == "red"):
            print("You were 100% sure of a wrong prediction. You lose for all eternity.")
            return  # avoid crashing in math.log()
        # Log loss is the negative log of the probability assigned to what happened.
        if ball == "red":
            loss = -math.log(probability_of_red)
        else:
            loss = -math.log(1 - probability_of_red)
        total_loss += loss
        history.append(ball)
        print(f"{ball:6}\tPrediction={probability_of_red:0.3f}\tAverage log loss={total_loss / (i + 1):0.3f}")
If we define S() as:
def S(history):
    if not history:
        return 0.5
    reds = history.count("red")
    prediction = reds / len(history)
    # Should never be 100% confident
    if prediction == 1:
        prediction = 0.999
    if prediction == 0:
        prediction = 0.001
    return prediction
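Then just call it at the end of the script:

P()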
The output will converge on Prediction = 0.5 and an average log loss of -log(0.5) ≈ 0.693. Is that right?
Thanks for your replies! I didn’t realize the question was unclear. I was looking for an answer TO provide the AI, not an answer FROM the AI. I’ll work on the title/message and try again.
Edit: New post at https://www.lesswrong.com/posts/FJaFMdPREcxaLoDqY/what-should-we-tell-an-ai-if-it-asks-why-it-was-created
[Question] What should we tell an AI if it asks why it was created?
Governments are not social welfare maximizers
Most people making up governments, and society in general, care at least somewhat about social welfare. This is why we get to have nice things and don’t descend into chaos.
Elected governments have the most moral authority to take actions that affect everyone; ideally a diverse group of nations would act, as mentioned in Daniel Kokotajlo’s maximal proposal comment.
Someone like Paul Graham or Tyler Cowen is noticing more (and smarter) kids because we now have much better systems for putting the smartest kids into contact with people like Paul Graham and Tyler Cowen.
I’d guess very smart kids are getting more numerous and smarter at the elite level, since just about everything seems to be improving at the most competitive level. Unfortunately there doesn’t seem to be much interest in measuring this: e.g., hundreds of kids tie for the maximum possible SAT score (1600) rather than being given a test designed not to max out.
(Btw, one cool thing I learned about recently is that some tests use adaptive scoring, where answering questions correctly gets you harder questions.)
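A minimal sketch of the idea (my own toy rule, not any real test’s algorithm):

def next_difficulty(current, answered_correctly, max_level=10):
    # Step difficulty up after a correct answer, down after a wrong one.
    if answered_correctly:
        return min(current + 1, max_level)
    return max(current - 1, 1)

Real adaptive tests do something statistically fancier (item response theory), but that’s the shape of it.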
Is there any information on how long the LLM spent taking the tests? I’d like to compare that with human times. (I realize it can depend on hardware, etc., but I’d just like a general idea.)
I’m trying to understand this paper on the AI shutdown problem, https://intelligence.org/files/Corrigibility.pdf, but I can’t follow the math formulas. Is there a code version of the math?
The below is wrong, but I’m looking for something like this:
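Maybe something like this toy sketch of the paper’s utility-indifference idea? (All the action names, probabilities, and numbers here are invented, and it collapses the paper’s multi-step setup into a single step.)

# The agent gets utility U_N if the shutdown button is never pressed and
# U_S plus a correction term theta if it is pressed. theta is chosen so the
# best achievable value of the two branches is equal, making the agent
# indifferent to whether the button gets pressed.

ACTIONS = ["build_paperclips", "disable_button", "do_nothing"]

# Utility if the button is never pressed (agent keeps running):
U_N = {"build_paperclips": 10, "disable_button": 8, "do_nothing": 0}
# Utility if the button is pressed (agent shuts down promptly):
U_S = {"build_paperclips": -5, "disable_button": -100, "do_nothing": 1}
# Probability the button gets pressed, as influenced by each action:
press_prob = {"build_paperclips": 0.3, "disable_button": 0.0, "do_nothing": 0.3}

# Correction term: equalize the best achievable value in each branch.
theta = max(U_N.values()) - max(U_S.values())

def expected_utility(action):
    p = press_prob[action]
    return (1 - p) * U_N[action] + p * (U_S[action] + theta)

# Without theta, the agent would pick "disable_button" (8.0 beats 5.5);
# with theta, button-tampering no longer pays.
print(max(ACTIONS, key=expected_utility))  # -> build_paperclips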
Again, the above is not meant to be correct, just something that might, if improved, go some way toward understanding the problem.