Over the course of my life I’ve come to see the world as a more and more horrible place, in various ways. I’ve noticed that this seems to make it harder to be excited about things in general, although I’m not confident the two are related. This seems kind of bad, since being excited about things is, among other things, important for learning and doing things.
Imagine a robot that serves coffee and does a back-flip. Wouldn’t that be awesome? A healthy kid would probably be excited about building such a thing. This feels like a values thing in some sense: the world containing an awesome robot sure seems nice.
But the world, as it happens, sucks, and the awesomeness of the robot’s existence feels diminished. Compare the world with billions of tortured sentient beings to the same world plus a back-flipping coffee robot.
Perhaps the world feels smaller to a kid, so the robot feels more meaningful. Maybe when the world is less sucky it’s easier to ignore the larger world and live in a smaller, more local one, where the robot’s existence matters more in a relative sense?
I don’t feel like I have a great understanding of what’s going on here, though, or whether my hunch even points in the right direction. I’m curious whether others have similar experiences or clarifying thoughts.
Extracting and playing with “evil” features seems like literally one of the worst and most irresponsible things you could be doing when working on AI. I don’t care if it leads to a good method or whatever; it’s too close to really bad things. They claim to be adding an evil vector temporarily during fine-tuning. It would not surprise me if you end up one line of code away from accidentally adding your evil vector to your AI during deployment or something. Or what if your AI ends up going rogue and breaking out of containment during this period?
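To make the “one line of code” worry concrete, here is a minimal, entirely hypothetical sketch in PyTorch. The toy model, the `evil_vector`, and the training-only guard are all made up for illustration; the point is just how small the difference is between a vector that is applied only during fine-tuning and one that ships with the deployed model.

```python
import torch
import torch.nn as nn

# Hypothetical illustration of the failure mode described above: a steering
# vector meant to be applied only during fine-tuning, where a single
# condition decides whether it also leaks into deployment.

torch.manual_seed(0)

model = nn.Linear(16, 16)      # stand-in for one transformer layer
evil_vector = torch.randn(16)  # stand-in for an extracted "evil" direction

def steering_hook(module, inputs, output):
    # The guard below is the "one line of code" standing between
    # fine-tuning-only steering and steering in production.
    if module.training:              # delete or mistype this check and the
        return output + evil_vector  # vector ships with the deployed model
    return output

model.register_forward_hook(steering_hook)

x = torch.randn(1, 16)

model.train()
print("fine-tuning output (steered):", model(x).norm().item())

model.eval()
print("deployment output (unsteered):", model(x).norm().item())
```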
Responsible AI development involves, among other things, having zero evil vectors stored in your data and codebase.
Related: https://arbital.greaterwrong.com/p/hyperexistential_separation