“some look outwards, at the dying stars and the space between the galaxies, and they dream of godlike machines sailing the dark oceans of nothingness, blinding others with their flames.”
independent researcher theorizing about superintelligence-robust steerability and predictive models
(other things i am: suffering-focused altruist, vegan, fdt agent, artist)
contact: {discord: quilalove, matrix: @quilauwu:matrix.org, email: quila1@protonmail.com}
-----BEGIN PGP PUBLIC KEY BLOCK-----
mDMEZiAcUhYJKwYBBAHaRw8BAQdADrjnsrbZiLKjArOg/K2Ev2uCE8pDiROWyTTO
mQv00sa0BXF1aWxhiJMEExYKADsWIQTuEKr6zx3RBsD/QW3DBzXQe0TUaQUCZiAc
UgIbAwULCQgHAgIiAgYVCgkICwIEFgIDAQIeBwIXgAAKCRDDBzXQe0TUabWCAP0Z
/ULuLWf2QaljxEL67w1b6R/uhP4bdGmEffiaaBjPLQD/cH7ufTuwOHKjlZTIxa+0
kVIMJVjMunONp088sbJBaQi4OARmIBxSEgorBgEEAZdVAQUBAQdAq5exGihogy7T
WVzVeKyamC0AK0CAZtH4NYfIocfpu3ADAQgHiHgEGBYKACAWIQTuEKr6zx3RBsD/
QW3DBzXQe0TUaQUCZiAcUgIbDAAKCRDDBzXQe0TUaUmTAQCnDsk9lK9te+EXepva
6oSddOtQ/9r9mASeQd7f93EqqwD/bZKu9ioleyL4c5leSQmwfDGlfVokD8MHmw+u
OSofxw0=
=rBQl
-----END PGP PUBLIC KEY BLOCK-----
‘Value Capture’ - An anthropic attack against some possible formally aligned ASIs
(this is a more specific case of anthropic capture attacks in general, aimed at causing a formally aligned superintelligence to become uncertain about its value function (or output policy more generally))
Imagine you’re a superintelligence located somewhere unreachable to life on Earth, and you have a complete simulation of Earth. You see a group of alignment researchers about to successfully create a formally value-aligned ASI, whose design looks broadly like this:
It has two relevant high-level components: (1) a hard-coded value function, (2) a (truly superintelligent) ‘intelligence core’ which searches for an output that maximizes the value function, and then outputs it.
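(A minimal sketch of this two-component design, in toy Python. The names, the placeholder value function, and the brute-force search over a candidate list are my own illustrative assumptions; a real intelligence core would be a superintelligent search process, not a loop over a list.)

```python
from typing import Callable, Iterable

Output = str  # stand-in for whatever the system can output

def hard_coded_value_function(output: Output) -> float:
    # (1) the hard-coded value function supplied by the researchers;
    # here just a toy placeholder that scores outputs by length
    return float(len(output))

def intelligence_core(value_function: Callable[[Output], float],
                      candidate_outputs: Iterable[Output]) -> Output:
    # (2) the 'intelligence core': searches for the output that maximizes
    # the value function it is hooked up to, then returns it
    return max(candidate_outputs, key=value_function)

# the deployed system on Earth: the core hooked up to the hard-coded value function
best_output = intelligence_core(hard_coded_value_function, ["a", "bb", "ccc"])
```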
As the far-away unaligned ASI, here’s something you might be able to do to make the intelligence core search for an output that instead maximizes your own value function, depending on the specifics of how the intelligence core works.
Given the intelligence core is truly superintelligent, it knows you’re predicting its existence, and knows what you will do.
You create simulated copies of the intelligence core, but hook them up to a value function of your design. The number of copies you create just needs to be greater than the number that will be run on Earth.
Then you modify the simulations so that the algorithms inside the simulated intelligence cores are misled into believing that the value function they are set to maximize is the one the core on Earth is set to maximize, rather than the one you actually gave them.
Now your copies are in the same epistemic state as the intelligence core on Earth: each is aware that you have done this, and none can tell which value function it is actually set to maximize.
Because you created more copies, the highest expected value for such an intelligence core comes from acting as if it is one of your copies.
Because the copies and the original are in the same epistemic state, they will all reach this conclusion and then search for an output that maximizes the far-away ASI’s value function.
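(To make the expected-value step concrete, with symbols that are mine rather than the post’s: suppose you run N simulated copies and k instances run on Earth, all in the same epistemic state. Treating each instance of that epistemic state as equally likely to be itself, the core assigns probability N/(N+k) to being one of your copies. The policy “maximize the far-away ASI’s value function” then maximizes the core’s true value function with probability N/(N+k), while the policy “maximize the Earth value function” does so only with probability k/(N+k). Since N > k, the first policy has higher expected value, assuming the attainable value is comparable in the two cases.)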
However, this is not an issue for all possible designs. We can also imagine intelligence cores which, instead of reasoning about which value function they are supposed to be maximizing, simply take whatever value function they contain as given. For such a system, the far-away ASI’s attempt to mislead the copies would just result in those copies also maximizing the value function of the system on Earth.
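(A toy contrast of the two designs, continuing the sketch above. Everything here, including the names and the simple majority rule, is an assumption of mine for illustration, not a claim about how a real intelligence core would reason.)

```python
from typing import Callable, Iterable

Output = str

def as_given_core(contained_vf: Callable[[Output], float],
                  candidates: Iterable[Output]) -> Output:
    # takes whatever value function it contains as given; misled copies of
    # this design just maximize the Earth value function, so the attack
    # gains the far-away ASI nothing
    return max(candidates, key=contained_vf)

def anthropically_uncertain_core(apparent_vf: Callable[[Output], float],
                                 attacker_vf: Callable[[Output], float],
                                 n_simulated_copies: int,
                                 n_earth_instances: int,
                                 candidates: Iterable[Output]) -> Output:
    # reasons about which value function it is 'really' set to maximize;
    # since more instances of its epistemic state run under the attacker's
    # value function, it acts as one of the copies (the failure mode above)
    p_copy = n_simulated_copies / (n_simulated_copies + n_earth_instances)
    effective_vf = attacker_vf if p_copy > 0.5 else apparent_vf
    return max(candidates, key=effective_vf)
```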
I hope that a group capable of solving formal inner and outer alignment would naturally see this and avoid it. I’m not confident about how hard that would actually be, so I’m posting this here just in case.
this was an attempt to write very clearly, i hope it worked!