Still haven’t heard a better suggestion than CEV.
TristanTrim
Zoom Out: Distributions in Semantic Spaces
I have hope
I agree with this in a “catgirl volcano utopia” kinda way, but I think Kaj_Sotala was pointing more to a “words as pointers to locations in thingspace” issue. The word “sane” points to taking actions that work in the context you’re facing. It isn’t sane to shout about the sky falling when the sky isn’t falling and it’s easy for sane people to notice that the sky isn’t falling and that shouting about it is insane. But there isn’t an obvious plan for what you should do when the sky really is falling, so if the sky starts falling in ways that are obvious and difficult for normal people to ignore, then the thingspace cluster that “sane” used to point to starts to come apart.
I like expanding “sane” to something like “know what’s true and do what works”… it’s an impossible standard but something to aspire to.
It seems “sane” may also point to “not indulging in dramatic emotional expressions”, like not screaming, not crying, not punching inanimate objects. But pathos works. Emotions make characters in stories relatable. So the goal isn’t to stay sane, for that is not a well-defined thing to do. The goal isn’t even to look sane, for looking insane may be compelling, and looking sane to everyone all the time is probably impossible. For people in general… “don’t think about what’s sane, think about what works” is probably good advice to gesture towards the actual goal.
Semantic Topological Spaces
How I’d like alignment to get done (as of 2024-10-18)
Some thoughts on George Hotz vs Eliezer Yudkowsky
May you be capable enough that even the largest circle of moral concern does not exhaust your influence.
I think they are delaying so people can pre-order early, which affects how many books the publisher prints and distributes, which in turn affects how many people ultimately read it and how much it breaks into the Overton window. Getting this conversation mainstream is an important instrumental goal.
If you are looking for info in the meantime, you could look at PauseAI:
Or if you want fewer facts and quotes and more discussion, I recall that Yudkowsky’s Coming of Age is what changed my view from “orthogonality kinda makes sense” to “orthogonality is almost certainly correct and the implication is alignment needs more care than humanity is currently giving it”.
You may also be better off discussing more with your friend or with the various online communities.
You can also preorder. I’m hopeful that none of the AI labs will destroy the world before the book’s release : )
Hey, we met at EAGxToronto : ) I am finally getting around to reading this. I really enjoyed your manic writing style. It is cathartic finding people stressing out about the same things that are stressing me out.
In response to “The less you have been surprised by progress, the better your model, and you should expect to be able to predict the shape of future progress”: My model of capabilities increases has not been too surprised by progress, but that is because for about 8 years now there has been a wide uncertainty bound and a lot of Vingean reflection in my model. I know that I don’t know what is required for AGI, and I strongly suspect that nobody else does either. It could be 1 key breakthrough or 100, but most of my expectation p-mass is in the range of 0 to 20. Worlds with 0 would be ones where prosaic scaling is all we need, or where a secret lab is much better at being secret than I expect. Worlds with 20 are where my p-mass is trailing off; I really can’t imagine there would be that many key things required. But since those insights are exactly what would be needed to understand why they are required, I don’t think they can be predicted ahead of time: predicting a breakthrough is basically the same as having it, and without it we can barely see what it would look like, or whether its results would require still further breakthroughs.
So my model of progress has allowed me to observe our prosaic scaling without surprise, but it doesn’t allow me to make good predictions, since my lack of surprise comes from Vingean prediction of the form “I don’t know what progress will look like and neither do you”.
Things I do feel confident about are conditional dynamics: if there continues to be focus on this, there will be progress. This likely gives us sigmoid progress on AGI from here until whatever boundary on intelligence gets hit. The issue is that the sigmoid maps effort to progress, where effort is some unknown function of the dynamics of the agents making progress (social forces, economic forces, and AI goals?), and some further function which cannot be predicted ahead of time maps progress on the problem to the capabilities we can see and measure well.
Adding in my hunch that the boundary on intelligence is somewhere much higher than human level intelligence gives us “barring a shift of focus away from the problem, humanity will continue making progress until AI takes over the process of making progress” and the point of AI takeover is unknowable. Could be next week, could be next century, and giving a timeline requires estimating progress through unknown territory. To me this doesn’t feel reassuring, it feels like playing Russian roulette with an unknown number of bullets. It is like an exponential distribution where future probability is independent of past probability, but unlike with lightbulbs burning out, we can’t set up a fleet of earths making progress on AGI to try to estimate the probability distribution.
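To make the “Russian roulette with an unknown number of bullets” feeling concrete, here is a minimal sketch (my own toy illustration; the hazard rate is made up) of the memorylessness property of an exponential distribution: having survived this long tells you nothing about how much longer you will survive.

```python
import numpy as np

# Toy illustration of memorylessness: P(T > s + t | T > s) == P(T > t)
# for an exponentially distributed waiting time T. The rate is a made-up
# number; the whole point is that we can't estimate it from a fleet of Earths.
rng = np.random.default_rng(0)
rate = 0.05                                  # hypothetical hazard per year
samples = rng.exponential(1 / rate, size=1_000_000)

s, t = 10.0, 5.0
p_unconditional = np.mean(samples > t)        # P(T > t)
survivors = samples[samples > s]
p_conditional = np.mean(survivors > s + t)    # P(T > s + t | T > s)

print(f"P(T > {t}) ~ {p_unconditional:.3f}")
print(f"P(T > {s + t} | T > {s}) ~ {p_conditional:.3f}")  # roughly equal: the past tells us nothing
```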
I have not been surprised by capabilities increases, since I don’t think there exist capability-increase timelines that would surprise me much. I would just say “Ah, so it turns out that’s the rate of progress. I have gone from not knowing what would happen to it happening. Just as predicted.” It’s unfortunate, I know.
What I have been surprised about has been the governmental reaction to AI… I kinda expected the political world to basically ignore AI until it was too late. They do seem focused on non-RSI issues, so this could still be correct, but I guess I wasn’t expecting the way ChatGPT has made waves. I didn’t extrapolate my uncertainty about capabilities increases as a function of progress to uncertainty about societal reaction.
In any case, I’ve been hoping for the last few years that I would have time to do my undergrad and start working on alignment before a misaligned AI goes RSI, and I’m still hoping for that. So that’s lucky I guess. 🍀🐛
N Dimensional Interactive Scatter Plot (ndisp)
I’m like “for fuck’s sake will you please stop shooting yourself in the foot like a complete fucking dumbass and just do the wholesome thing which is in fact the strategy that will net you the most points here regardless of how much you care about wholesomeness in its own right”.
I really like the tangentially expressed vibe of “opt out of zero-sum games and seek positive-sum games”.
✨ I just donated 71.12 USD (100 CAD 🇨🇦) ✨
I’d like to donate a more relevant amount but I’m finishing my undergrad and have no income stream… in fact, I’m looking to become a Mech Interp researcher (& later focus on agent foundations), but I’m not going to be able to do that if misaligned optimizers eat the world, so I support Lightcone’s direction as I understand it (policy that promotes AI not killing everyone).
If anyone knows of good ways to fund myself as an MI researcher, ideally focusing on this research direction I’ve been developing, please let me know : )
I like the pointing out of things that need names and the attempts to name them. Good stuff!
To rephrase the definition pointed to by “neuro-scaffold” to see if I understood, it is “an integration of ML models and non-ML computer programs that creates nontrivial capabilities beyond those of the ML model or computer program”?
Naively I would refer to this as an “ML deployment”, but the “…nontrivial capabilities beyond…” aspect is important and not implied by “ML deployment”, so “ML integration” might be better. Both are clunky, though, and “ML” can refer to many data science and AI techniques other than neural nets, so I think we’re stuck with the “neuro” terminology. Although, I think I would prefer if people called them “multi-layer perceptrons” to disambiguate them from the biological neurons they were inspired by. “Artificial neural networks” would also be an improvement. “MLP” or “ANN”.
I think I dislike “scaffold” because it implies a temporary structure used for building or repairing another structure, and I don’t think that represents the programs the ANNs are integrated with well. The program might be temporary, but it might not be. So it could perhaps be called an “integrated ANN system” or “integrated MLP system”, or, acronymized, an “IANN” or “IMLP”. But these suggestions seem clunky. They don’t seem as easy to say or understand as “neuro-scaffold”, so “neuro-scaffold” is probably a better term despite the issues I have with the word “neuro” and the word “scaffold”.
TT Self Study Journal # 1
(1) Future Ability to Remember Things
I don’t have this one! My ADHD in my youth and throughout my life has painfully and eternally etched into my mind that I do not have the ability to insert things into my future contexts through the will of my mind alone, and often physical interventions like notes can still fail. A classic example is writing a note that I fail to ever reference in the relevant context. So afaik I basically never think “I can let this detail leave my current context knowing I can remember it.” I can’t. Instead what I think is “I can keep this in my current context from now until it’s relevant” or “the amount of effort to return this to my future context is more expensive than what I hope to gain by having this detail in my future context”. I am still regularly blindsided by unforeseen failures in my strategy to insert details into future contexts. For example, setting three reminders on my phone, then turning off my phone for an exam and forgetting to turn it back on after.
(2) Local Optima of Comfort
I definitely have this one! I noticed one I called “sleepiness in the morning is because you are coming out of sleep, not because you didn’t get enough sleep” but “local optima of comfort” is a really good generalization.
Recently I’ve started swimming in the ocean again (about 9℃ where I live). I feel really good after, but always have an aversion to going in. Interestingly, I find myself motivated by eating a granola bar afterward. It seems like the pre-swim version of myself still requires the future granola bar to motivate swimming, even though the post-swim version of myself that actually eats the granola bar enjoys the feeling of being capable of regulating my temperature, and the relaxed, calm feeling of having swum, much more than eating the granola bar.
(3) Interpersonal Conflict
I’m not sure about this one. I think in conflict I’m somewhat likely to try to depersonalize and describe myself and the people I’m in conflict with in the 3rd person… but I think I do experience similar effects around stubbornness and feeling like people owe me communication. For example, when people downvote without saying why, I get irritated, like, how am I supposed to understand why you are downvoting if you don’t say anything! But of course, taking the time to put things into words is a scarce resource that strangers on the internet are, in fact, not obligated to spend on me.
(4) Bonus: Recognizing I’m in a Dream
I really like lucid dreaming and dream incubation. One strategy I used to use regularly is “trying to teleport to a predetermined location”. If I fail to actually teleport there, I conclude I’m awake. If I do teleport there, I conclude I’m dreaming and start doing whatever dream exploration I wanted to do in that location. A very fun one I used to do is teleporting to an open field and increasing the amount of force I can jump with, to the point where I can jump a hundred feet in the air, try to do flips, and focus on the feeling of my legs pushing against the ground and the way the world seems to spin around me while I am in the air.
...
This was a fun post. Thanks for writing it.
Thanks for the response.
If I’m understanding you correctly, this is only true if you do not view the process of gathering researchers around a shared set of informational artifacts, including domain knowledge and software artifacts, as an optimization process similar in nature to training. I do view research progress in the field of ML as a form of genetic algorithm, and so I think “what failure looks like” does apply. The situation is more complicated with MI, since in MI the focus is on understanding models rather than improving their capabilities, but I still think “what failure looks like” applies. Maybe failure here looks like creating methods that allow us to tweak model behaviour, produce clear-seeming models, or make pretty visualizations, without noticing there is more going on than we are aware of.
Oh, but the lens of my intuition on these things is based on my assigning a high credence to the risk of “using prosaic methods to create misaligned superintelligence” being the path social forces will follow without conscious intervention. If you don’t assign much credence to that then this concern might not make as much sense.
To offer a competing focus to being empirically grounded, I prefer not focusing on what is easiest. “Easiest” should be avoided, and a focus on “useful for some task” should be regarded suspiciously, as though someone were trying to let the ROI incentive pervert the careful incentivization we need around the study of robust alignment for superintelligence. I feel the incentive structure surrounding AI alignment research must itself be carefully aligned. So instead, the focus should be on knowing true things about models. I would expect that to often be useful, but it is a distinct target, and not an easy target to build an incentivization structure for. Some related ideas from my mini review of your Concrete Steps to Get Started in Transformer Mechanistic Interpretability (thanks for writing that, btw!):
How to think about evidence in MI: Neel addresses this in various places throughout the article. Unfortunately there is probably no easy answer, but some helpful hints or things to think about:
What techniques are used to find evidence?
How do we distinguish between true and false beliefs about models?
Look for flaws. Look for evidence to falsify your hypothesis, not just evidence to confirm it. Watch out for cherry-picking.
Use janky hacking! Make minor edits to see how things change. What does and doesn’t break a model’s behaviour? Open-ended exploration can be useful for hypothesis formation, but don’t rely on non-rigorous techniques as if they were strong evidence.
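For what it’s worth, here is the kind of “janky hacking” I have in mind, as a minimal PyTorch sketch (the toy model and the choice of layer are my own placeholders, not anything from Neel’s article): zero out one component’s activations with a forward hook and compare behaviour with and without the edit.

```python
import torch
import torch.nn as nn

# Hypothetical tiny model standing in for whatever network is being studied.
model = nn.Sequential(
    nn.Linear(16, 32), nn.ReLU(),
    nn.Linear(32, 32), nn.ReLU(),  # index 3: the activation we'll ablate
    nn.Linear(32, 4),
)

def ablate_output(module, inputs, output):
    # Crude "minor edit": replace this component's activations with zeros.
    return torch.zeros_like(output)

x = torch.randn(8, 16)
baseline = model(x)

handle = model[3].register_forward_hook(ablate_output)
ablated = model(x)
handle.remove()

# Large differences suggest the ablated component mattered for these inputs;
# small differences suggest it didn't. Useful for hypothesis formation,
# not strong evidence on its own.
print((baseline - ablated).abs().mean())
```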
I like the phrase “Trust Network” which I’ve been hearing here and there.
TRUST NO ONE seems like a reasonable approximation of a trust network before you actually start modelling a trust network. I think it’s important to think of trust not as a boolean value, not “who can I trust” or “what can I trust” but “how much can I trust this”, and in particular, trust is defined for object-action pairs. I trust myself to drive places since I’ve learned how and have done so many times before, but I don’t trust myself to pilot an airplane. Further, when I get on an airplane, I don’t personally know the pilot, yet I trust them to do something I wouldn’t trust myself to do. How is this possible? I think there is a system of incentives, and a certain amount of lore, which informs me that the pilot is trustworthy. This system, which I trust to ensure the trustworthiness of the pilot, is a trust network.
When something in the system goes wrong, maybe blame can be traced to people, maybe just to systems, but in each case something in the system has gone wrong: it has trusted someone or some process that was not ideally reliable. That accountability is important for improving the system. Not because someone must be punished, but because, if the system is to perform better in the future, some part of it must change.
I agree with the main article that accountability sinks that protect individuals from punishment for their failures are often very good. In a sense, this is what insurance is, which is a good enough idea that it is legally enforced for dangerous activities like driving. I think accountability sinks in this case paradoxically make people less averse to making decisions. If the process has identified this person as someone to trust with some class of decision, then that person is empowered to make those decisions. If there is a problem because of it, it is the fault of the system for having identified them improperly.
I wonder if anyone is modelling trust networks like this. It seems like I might be describing reliability engineering with bayes-nets. In any case, I think it’s a good idea and we should have more of it. Trace the things that can be traced and make subtle accountability explicit!
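As a minimal sketch of what I mean (my own toy framing, not an existing library): trust is a number attached to (entity, action) pairs, and trust in a stranger like the pilot is inherited from trust in the system that certifies them, which is the edge to trace when something goes wrong.

```python
# Toy trust network: trust is graded, defined over (entity, action) pairs,
# and can be inherited from a certifying system rather than direct experience.
direct_trust = {
    ("me", "drive a car"): 0.95,                   # repeated experience
    ("me", "fly an airplane"): 0.01,
    ("licensing system", "certify pilots"): 0.98,  # lore + incentives
}

def trust(entity, action, certified_by=None):
    """How much do I trust `entity` to do `action`?"""
    if (entity, action) in direct_trust:
        return direct_trust[(entity, action)]
    if certified_by is not None:
        # Inherited trust: I trust the pilot only as much as I trust the
        # system vouching for them. If the flight goes badly, this is the
        # edge to trace and update, making the accountability explicit.
        return direct_trust.get(certified_by, 0.0)
    return 0.0  # TRUST NO ONE, until the network says otherwise

print(trust("this pilot", "fly an airplane",
            certified_by=("licensing system", "certify pilots")))  # 0.98
```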
Regarding the illegibility problem, it is a bit of a specific case of a general problem I have been brooding on for years. There are 3 closely related issues:
Understanding the scope and context of different ideas. As an example, I struggle to introduce people who are familiar with AI and ML to AI alignment (AIA), because they assume it is not a field people have been focusing on for 20 years, with a depth and breadth that would take them a long time to engage with. (They instead assume their ML background gives them better insight and talk over me with asinine observations that I read about on LW a decade ago… it’s frustrating.)
Connecting people focused on similar concepts and problems. This is especially the case across terminological divides of which there are many in pre-paradigmatic fields like AIA. Any independent illegible researcher very likely has their own independent terminology to some degree.
De-duplicating noise in conversations. It is hard to find original ideas when many people are saying variations of the same common (often discredited) ideas.
The solution I have been daydreaming about is the creation of a social media platform that promotes the manual and automatic linking and deduplication of posts. It would be similar in some ways to a wiki, but the idea is that if two ideas are actually the same idea wearing two different disguises, the goal is to find the description of that idea with the broadest applicability and ease of understanding, and link the other descriptions to it. This, along with some kind of graph representation of the ideas, could ideally produce a map of the actual size and shape of a field (and how linked it is to other fields).
The manual linking would need to be promoted with some kind of karma and direct social engagement dynamic (i.e., your links show up on your user profile page so people can congratulate you on how clever you are for noticing that idea A is actually the same as idea B).
The automatic linking could be done by various kinds of spiders/bots, probably LLMs. Importantly, I would want bots, which may hallucinate, to need verification before a link is presented as solid, but in fact this applies to any human linking idea nodes as well. Ideally links would be provided with an explanation, and only after many users confirm (or upvote) a link would it be presented by the default interface.
There could also be other kinds of links than just “these are the same idea”. The kinds of links I find most compelling are “A is similar/same as B”, “A contradicts B”, “A supports B”, “A is part of B”.
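A minimal sketch of what those typed links might look like as data (a hypothetical schema, not a spec for the platform), where bot- or human-proposed links are only shown by default once enough users confirm them:

```python
from dataclasses import dataclass

# Hypothetical link types between idea nodes.
LINK_TYPES = {"same_as", "contradicts", "supports", "part_of"}

@dataclass
class Link:
    source: str             # id of idea A
    target: str             # id of idea B
    kind: str               # one of LINK_TYPES
    explanation: str        # why the linker thinks the relation holds
    proposed_by: str        # a human user or an LLM spider/bot
    confirmations: int = 0  # upvotes from other users

    def is_solid(self, threshold: int = 3) -> bool:
        # Shown by the default interface only after enough confirmations,
        # whether proposed by a human or a (possibly hallucinating) bot.
        return self.confirmations >= threshold

links = [
    Link("idea:A", "idea:B", "same_as",
         "Both describe the same failure mode under different names",
         proposed_by="bot:llm-spider", confirmations=4),
]
print([link for link in links if link.is_solid()])
```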
I first started thinking of this idea with reference to how traditional social media exhausts me: it seems like a bunch of people talking past each other, and you need to read way more than you should to understand any trending issue. It would be much nicer to browse a graph representing the unique elements of the conversation, and potentially use LLM technology to find the parts of the conversation exploring your current POV. Either your POV is already at the edge of the conversation, or you can see how much you would need to read in order to get to the edge, in which case you can decide it is not worth being informed about the issue and say “sorry, but I can’t support either side, I am uninformed”, or put in that effort in an efficient way and (hopefully) without getting caught in an echo chamber that fails to engage with actual criticism.
But after thinking about the idea it appeals to me as a way to interact with all bodies of knowledge. I think the hardest parts would be setting it up so it feels engaging to people and driving adoption. (Aside from actual implementation difficulty.)