Independent researcher theorizing about superintelligence-robust training stories.
If you disagree with me for reasons you expect I’m not aware of, please tell me!
If you have/find an idea that’s genuinely novel/out-of-human-distribution while remaining analytical, you’re welcome to send it to me to ‘introduce chaos into my system’.
Contact: {discord: quilalove, matrix: @quilauwu:matrix.org, email: quila1<at>protonmail.com}
“some look outwards, at the dying stars and the space between the galaxies, and they dream of godlike machines sailing the dark oceans of nothingness, blinding others with their flames.”
-----BEGIN PGP PUBLIC KEY BLOCK-----
mDMEZiAcUhYJKwYBBAHaRw8BAQdADrjnsrbZiLKjArOg/K2Ev2uCE8pDiROWyTTO
mQv00sa0BXF1aWxhiJMEExYKADsWIQTuEKr6zx3RBsD/QW3DBzXQe0TUaQUCZiAc
UgIbAwULCQgHAgIiAgYVCgkICwIEFgIDAQIeBwIXgAAKCRDDBzXQe0TUabWCAP0Z
/ULuLWf2QaljxEL67w1b6R/uhP4bdGmEffiaaBjPLQD/cH7ufTuwOHKjlZTIxa+0
kVIMJVjMunONp088sbJBaQi4OARmIBxSEgorBgEEAZdVAQUBAQdAq5exGihogy7T
WVzVeKyamC0AK0CAZtH4NYfIocfpu3ADAQgHiHgEGBYKACAWIQTuEKr6zx3RBsD/
QW3DBzXQe0TUaQUCZiAcUgIbDAAKCRDDBzXQe0TUaUmTAQCnDsk9lK9te+EXepva
6oSddOtQ/9r9mASeQd7f93EqqwD/bZKu9ioleyL4c5leSQmwfDGlfVokD8MHmw+u
OSofxw0=
=rBQl
-----END PGP PUBLIC KEY BLOCK-----
This ability (a model inferring true facts about an author from their text) has been observed more prominently in base models. Cyborgs have termed it ‘truesight’. Two cases of this are mentioned at the top of the linked post.
---
One of my first experiences with the GPT-4 base model also involved being truesighted by it. Below is a short summary of how that went.
I had spent some hours writing and {refining, optimizing word choices, etc}[1] a more personal/expressive text. I then formatted it as a blog post and requested multiple completions via the API, to see how the model would continue it. (It may be important that I wasn’t in a state of mind of ‘writing for the model to continue’, and was instead ‘writing very genuinely’, since the latter probably embeds more information.)
One of those completions happened to be a (simulated) second post titled ‘ideas i endorse’. Its contents were very surprising to then-me because some of the included beliefs were all of the following: {ones I’d endorse}, {statistically rare}, and {not ones I thought were indicated by the text}.[2]

I also tried conditioning the model to continue my text with:
- other kinds of blog posts, about different things. The resulting character didn’t feel quite like me, but possibly like an alternate-timeline version of me who I would want to be friends with.
- text that was more directly ‘about the author’, i.e. an ‘about me’ post, which gave demographic-like info similar to, but not quite matching, my own (age, trans status).
Notably, the most important thing the outputs failed to truesight was my current focus on AI and longtermism. (My text was not about those, but neither was it about the other beliefs mentioned.)
The sum of those word choices probably encoded a lot of information about my mind, just not information that humans are attuned to detecting. Base models learn to detect information about authors because doing so is useful for next-token prediction.
Also note that using base models for this kind of experiment avoids the issue of the RLHF persona being unwilling to speculate, or being decoupled from the true beliefs of the underlying simulator.
To be clear, it also included {some beliefs that I don’t have}, and {some that I hadn’t considered before and probably wouldn’t otherwise have spent cognition on, but would agree with on reflection (e.g. about some common topics with little long-term relevance)}.