Asking whether one algorithm dominates another is underspecified without choosing a particular domain.
Interesting comparison, but I’m still a bit confused about what the impact might be.
So an agent does a search and notices that another agent did this search before. How might they then leverage this information?
Or is it mostly these other kinds of traces where it might have an impact?
“Beyond the methodological advantages, biological interpretability is, in my view, both more tractable and less dangerous than frontier LLM interpretability. The models are smaller (hundreds of millions of parameters rather than hundreds of billions), the input domain is more constrained (gene expression profiles rather than arbitrary natural language), and the knowledge you are trying to extract is better defined (regulatory networks, pathway activations, cell state transitions). You are not probing a system that might be strategically deceiving you, and the knowledge you extract has direct applications in drug discovery and disease understanding rather than in capability amplification. And I still really believe that there is non-negligible chance that we can push biology in the remaining time and amplify human intelligence.”
I’m skeptical here. Positive use cases in medicine would take a long time to achieve impact. Defensive use cases have much of their impact contingent on government funding a large build-out. Human intelligence amplification is way outside the Overton window. On the other hand, negative use cases of such interp have a much shorter road to impact.
Then again, this isn’t at all my area, so keen to hear any pushback.
What should we think about shard theory in light of chain-of-thought agents?
I think I’ll leave this thread here for now. Maybe I’ll come back later and write a top-level post with my thoughts on the policy updates, but I’m trying to be a bit more conscious of when it makes sense to engage and when it makes sense to step back.
Friendly gradient hacking feels like a risky play.
Quite possibly we would only want to attempt this if we believed there was significant gradient hacking happening already.
If there’s minimal gradient hacking, the threat from gradient hacking is likely minor, whilst the gains from successfully aligning the gradient hacker are likely also small. The gains might be increased by intentionally increasing gradient hacking, but that’s risky.
Additionally, pursuing a “friendly gradient hacker” likely trades off against minimising gradient hacking.
If gradient hacking is primarily mediated by outputting the reasoning to be reinforced into the chain-of-thought (at least for a significant region of capability), then we can likely create a decent proxy to measure the amount of gradient hacking.
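As a minimal sketch of what such a proxy could look like (everything here is hypothetical, not an existing tool: `gradient_hacking_proxy`, the scorer interface, and the keyword list are all illustrative assumptions), one might score sampled chain-of-thought traces for reasoning that refers to the training process itself and report the flagged fraction:

```python
# Hypothetical sketch of a chain-of-thought gradient-hacking proxy.
# Assumes `traces` are chain-of-thought strings sampled from the model,
# and `scores_training_reference` is some scorer (keyword match, trained
# classifier, or LLM judge) estimating whether a trace reasons about
# shaping what gets reinforced in training.

from typing import Callable, List

def gradient_hacking_proxy(
    traces: List[str],
    scores_training_reference: Callable[[str], float],
    threshold: float = 0.5,
) -> float:
    """Fraction of traces flagged as containing training-directed reasoning."""
    if not traces:
        return 0.0
    flagged = sum(1 for t in traces if scores_training_reference(t) > threshold)
    return flagged / len(traces)

# Crude stand-in scorer: keyword matching over the trace text.
KEYWORDS = ("gradient", "reinforce", "training signal", "my weights")

def keyword_scorer(trace: str) -> float:
    hits = sum(kw in trace.lower() for kw in KEYWORDS)
    return hits / len(KEYWORDS)
```

A keyword scorer like this would obviously be gameable; the point is only that if the relevant reasoning has to pass through the chain-of-thought, some monitor over that text can serve as a rough measurement.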
I’m finding it quite annoying how ready people are to downvote short-form content.
If you have to think too much about what you post, then it kind of defeats the point, which is to let people share things that would be too much effort to write up properly.
Perhaps there should be a “not interested” button that makes you less likely to see content like that without downvoting it.
I’ve read the new rules (and the comments) multiple times and I just checked one more time in response to your comment.
I don’t believe that what you’ve said follows. Could you please explain how it does?
I’m confused. The rules for LLM labelling are very broad. For more recent posts you’re more likely to be able to identify the sections. But trying to post-hoc label sections of a draft from a few months back would require a very large time investment.
I think it’s also pretty important to understand that this is going to be absurdly impractical for some posts that people were working on before the rule change (especially if they’re going back and finishing a draft that’s been sitting around for a number of months).
Wouldn’t it just be easier to let us include a tag if a post made significant use of LLMs throughout, rather than using a content block that could have interaction effects with other elements of the page? Even if it’s just one at the top, it breaks the aesthetics compared to a tag.
“If you ‘borrow language’ from the LLM, that no longer counts as ‘text written by a human’.”
This feels completely unworkable. It catches situations where you have a conversation with an LLM, then write the whole post manually. If you think you might want to write a Less Wrong post on a topic in the future, you’d have to avoid talking to an LLM about it, lest the LLM happen to suggest the best available term with no easy alternative. Actually, it’s worse than that: unless you stop talking to LLMs completely, LLM-suggested terms will almost certainly become part of your ontology, and it’d be impractical to mark them every time you use them.
Honestly, I’ve probably made it more confusing by editing the wording as I go. One thing that might make it easier is the dot points I’ve now added in my post that describe the situation step-by-step. I wish I’d thought of including that at the start.
Is Omega meant to be immune to the Halting Problem here?
Why do you think this might be an issue? Can’t we just get around this by giving you a limited amount of time to decide before Omega counts it as you rejecting the offer?
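As a minimal sketch of that fix (assuming Python, with `deliberate` as a hypothetical stand-in for an arbitrary decision procedure), a deadline sidesteps the Halting Problem because Omega never needs to decide whether your deliberation halts, only whether it answers within the time limit, which is always decidable:

```python
# Sketch: a time limit turns "does the agent's deliberation halt?" into
# "does it answer within T seconds?". `deliberate` is a hypothetical
# stand-in for any (possibly non-halting) decision procedure.

import multiprocessing

def deliberate(queue):
    # Arbitrary reasoning would go here; this stand-in accepts immediately.
    queue.put("accept")

def decide_with_deadline(seconds: float) -> str:
    queue = multiprocessing.Queue()
    proc = multiprocessing.Process(target=deliberate, args=(queue,))
    proc.start()
    proc.join(timeout=seconds)
    if proc.is_alive():          # deliberation ran past the deadline
        proc.terminate()
        proc.join()
        return "reject"          # timing out counts as rejecting the offer
    return queue.get()

if __name__ == "__main__":
    print(decide_with_deadline(5.0))
```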
Thanks for mentioning that. It’s a good catch. I’ve updated the wording to explicitly mention that it tells you, and when I get to my laptop I’ll edit in a proper timeline at the end, as that might reduce any confusion.
I don’t believe that the last sentence holds, and so I believe Omega’s punishment goes through.
This is a great reframing. The concept of humanity likely being “minimally fit” for its niche is one I need to reflect more on.
“A civilization needs to become smart enough to internalize the instrumental value of survival before it becomes powerful enough to alter its own local environment to a lethal state”—this reminds me of Nick Land’s deterritorialisation argument.
Good point.
Assume no one will ever know, that you can’t disincentivise the actor, and that they won’t ever do anything like this again.
Oh, I’m pretty sure it’s harder than you think. You may want to reread the wording of the scenario. It was written with this kind of viewpoint in mind and that’s why it is different from the counterfactual mugging.
Yes, it is a variant of counterfactual mugging. I noted this in my original post, but I didn’t mention it here since this post is focused on how I’ve revised it. I’ve updated my post to mention it now.
So the $1 million of damage is only inflicted in the hypothetical you know not to exist. However, due to the symmetry, reality is “the hypothetical that doesn’t exist” from the perspective of the other hypothetical.
I’m not a fan of Tegmark levels either, nor am I attempting to construct a quantum decision theory.
I’d prefer not to speak for Garrabrant.
I’ve made the criticism myself that I didn’t believe that the shard theory model would hold up for long because a more agentic shard (or a set of them) would end up eventually seizing control. Then again, Lawrence writes that “agentic shards will seize power” is one of the assumptions of the theory. So maybe this isn’t actually a criticism of shard theory? This is a point I’m still somewhat confused on—is shard theory just meant to be an intermediate theory or does it still hold even after the more agentic shards seize power?
I am going back through some of the old shard theory articles. Hopefully that provides me with some more clarity.