I think that people who work on AI alignment (including me) have generally not put enough thought into the question of whether a world where we build an aligned AI is better by their values than a world where we build an unaligned AI. I’d be interested in hearing people’s answers to this question. Or, if you want more specific questions:
By your values, do you think a misaligned AI creates a world that “rounds to zero”, or still has substantial positive value?
A common story for why aligned AI goes well goes something like: “If we (i.e. humanity) align AI, we can and will use it to figure out what we should use it for, and then we will use it in that way.” To what extent is aligned AI going well contingent on something like this happening, and how likely do you think it is to happen? Why?
To what extent is your belief that aligned AI would go well contingent on some sort of assumption like: my idealized values are the same as the idealized values of the people or coalition who will control the aligned AI?
Do you care about AI welfare? Does your answer depend on whether the AI is aligned? If we built an aligned AI, how likely is it that we will create a world that treats AI welfare as an important consideration? What if we build a misaligned AI?
Do you think that, to a first approximation, most of the possible value of the future happens in worlds that are optimized for something that resembles your current or idealized values? How bad is it to mostly sacrifice each of these? (What if the future world’s values are similar to yours, but it is only somewhat effectual at pursuing them? What if the world is optimized for something that’s only slightly correlated with your values?) How likely are these various options under an aligned AI future vs. an unaligned AI future?
I eventually decided that human chauvinism approximately works most of the time because good successor criteria are very brittle. I’d prefer to avoid lock-in to my or anyone’s values at t=2024, but such a lock-in might be “good enough” if I’m threatened with what I think are the counterfactual alternatives. If I did not think good successor criteria were very brittle, I’d accept something adjacent to e/acc that focuses on designing minds which prosper more effectively than human minds. (This comment will not address defining prosperity at different timesteps.)
In other words, I can’t beat the old fragility of value stuff (but I haven’t tried in a while).
AI welfare: matters, but when I started reading LessWrong I literally thought that disenfranchising them from the definition of prosperity was equivalent to subjecting them to suffering, and I don’t think this anymore.
e/acc is not a coherent philosophy and treating it as one means you are fighting shadows.
Landian accelerationism at least is somewhat coherent. “e/acc” is a bundle of memes that support the self-interest of the people supporting and propagating it, both financially (VC money, dreams of making it big) and socially (the non-Beff e/acc vibe is one of optimism and hope and to do things—to engage with the object level—instead of just trying to steer social reality). A more charitable interpretation is that the philosophical roots of “e/acc” are founded upon a frustration with how bad things are, and a desire to improve things by yourself. This is a sentiment I share and empathize with.
I find the term “techno-optimism” to be a more accurate description of the latter, and perhaps “Beff Jezos philosophy” a more accurate description of what you have in your mind. And “e/acc” to mainly describe the community and its coordinated movements at steering the world towards outcomes that the people within the community perceive as benefiting them.
Sure—I agree. That’s why I said “something adjacent to”: it had enough overlap in properties. I think my comment completely stands with a different word choice; I’m just not sure what word choice would do a better job.
By your values, do you think a misaligned AI creates a world that “rounds to zero”, or still has substantial positive value?
I think misaligned AI is probably somewhat worse than no Earth-originating spacefaring civilization because of the potential for aliens, but also that misaligned AI control is considerably better than no one ever heavily utilizing inter-galactic resources.
Perhaps half of the value of misaligned AI control is from acausal trade and half from the AI itself being valuable.
One key consideration here is that the relevant comparison is:
Human control (or successors picked by human control)
AI(s) that succeeds at acquiring most power (presumably seriously misaligned with their creators)
Conditioning on the AI succeeding at acquiring power changes my views of what their plausible values are (for instance, humans seem to have failed at instilling preferences/values which avoid seizing control).
A common story for why aligned AI goes well goes something like: “If we (i.e. humanity) align AI, we can and will use it to figure out what we should use it for, and then we will use it in that way.” To what extent is aligned AI going well contingent on something like this happening, and how likely do you think it is to happen? Why?
Hmm, I guess I think that some fraction of resources under human control will (in expectation) be utilized according to the results of a careful reflection process with an altruistic bent.
I think resources which are used in mechanisms other than this take a steep discount in my lights (there is still some value from acausal trade with other entities which did do this reflection-type process and probably a bit of value from relatively-unoptimized-goodness (in my lights)).
I overall expect that a high fraction (>50%?) of inter-galactic computational resources will be spent on the outputs of this sort of process (conditional on human control) because:
It’s relatively natural for humans to reflect and grow smarter.
Humans who don’t reflect in this sort of way probably don’t care about spending vast amounts of inter-galactic resources.
Among very wealthy humans, a reasonable fraction of their resources are spent on altruism and the rest is often spent on positional goods that seem unlikely to consume vast quantities of inter-galactic resources.
To what extent is your belief that aligned AI would go well contingent on some sort of assumption like: my idealized values are the same as the idealized values of the people or coalition who will control the aligned AI?
Probably not the same, but if I didn’t think it was at all close (i.e., I didn’t care at all about what they would use resources on), I wouldn’t care nearly as much about ensuring that coalition is in control of AI.
Do you care about AI welfare? Does your answer depend on whether the AI is aligned? If we built an aligned AI, how likely is it that we will create a world that treats AI welfare as an important consideration? What if we build a misaligned AI?
I care about AI welfare, though I expect that ultimately the fraction of good/bad that results from the welfare of minds being used for labor is tiny, and an even smaller fraction from AI welfare prior to humans being totally obsolete (at which point I expect control over how minds work to get much better). So, I mostly care about AI welfare from a deontological perspective.
I think misaligned AI control probably results in worse AI welfare than human control.
Do you think that, to a first approximation, most of the possible value of the future happens in worlds that are optimized for something that resembles your current or idealized values? How bad is it to mostly sacrifice each of these? (What if the future world’s values are similar to yours, but it is only somewhat effectual at pursuing them? What if the world is optimized for something that’s only slightly correlated with your values?) How likely are these various options under an aligned AI future vs. an unaligned AI future?
Yeah, most value from my idealized values. But, I think the basin is probably relatively large and small differences aren’t that bad. I don’t know how to answer most of these other questions because I don’t know what the units are.
How likely are these various options under an aligned AI future vs. an unaligned AI future?
My guess is that my idealized values are probably pretty similar to those of many other humans on reflection (especially the subset of humans who care about spending vast amounts of computation). Such that I think human control vs. my control only loses like 1⁄3 of the value (putting aside trade). I think I’m probably less into AI values on reflection, such that it’s more like 1⁄9 of the value (putting aside trade). Obviously these numbers are incredibly unconfident.
Perhaps half of the value of misaligned AI control is from acausal trade and half from the AI itself being valuable.
Why do you think these values are positive? I’ve been pointing out (and I see that Daniel Kokotajlo also pointed out in 2018) that these values could well be negative. I’m very uncertain, but my own best guess is that the expected value of misaligned AI controlling the universe is negative, in part because I put some weight on suffering-focused ethics.
My current guess is that max good and max bad seem relatively balanced. (Perhaps max bad is 5x more bad/flop than max good in expectation.)
There are two different (substantial) sources of value/disvalue: interactions with other civilizations (mostly acausal, maybe also aliens) and what the AI itself terminally values.
On interactions with other civilizations, I’m relatively optimistic that commitment races and threats don’t destroy as much value as acausal trade generates on some general view like “actually going through with threats is a waste of resources”. I also think it’s very likely relatively easy to avoid precommitment issues via very basic precommitment approaches that seem (IMO) very natural. (Specifically, you can just commit to “once I understand what the right/reasonable precommitment process would have been, I’ll act as though this was always the precommitment process I followed, regardless of my current epistemic state.” I don’t think it’s obvious that this works, but I think it probably works fine in practice.)
On terminal value, I guess I don’t see a strong story for extreme disvalue as opposed to mostly expecting approximately no value with some chance of some value. Part of my view is that just relatively “incidental” disvalue (like the sort you link to Daniel Kokotajlo discussing) is likely way less bad/flop than maximum good/flop.
Thank you for detailing your thoughts. Some differences for me:
I’m also worried about unaligned AIs as a competitor to aligned AIs/civilizations in the acausal economy/society. For example, suppose there are vulnerable AIs “out there” that can be manipulated/taken over via acausal means; an unaligned AI could compete with us (and with others with better values from our perspective) in the race to manipulate them.
I’m perhaps less optimistic than you about commitment races.
I have some credence on max good and max bad being not close to balanced, that additionally pushes me towards the “unaligned AI is bad” direction.
ETA: Here’s a more detailed argument for 1, that I don’t think I’ve written down before. Our universe is small enough that it seems plausible (maybe even likely) that most of the value or disvalue created by a human-descended civilization comes from its acausal influence on the rest of the multiverse. An aligned AI/civilization would likely influence the rest of the multiverse in a positive direction, whereas an unaligned AI/civilization would probably influence the rest of the multiverse in a negative direction. This effect may outweigh what happens in our own universe/lightcone so much that the positive value from unaligned AI doing valuable things in our universe as a result of acausal trade is totally swamped by the disvalue created by its negative acausal influence.
Unaligned AI future does not have many happy minds in it, AI or otherwise. It likely doesn’t have many minds in it at all. Slightly aligned AI that doesn’t care for humans but does care to create happy minds and ensure their margin of resources is universally large enough to have a good time—that’s slightly disappointing but ultimately acceptable. But morally unaligned AI doesn’t even care to do that, and is most likely to accumulate intense obsession with some adversarial example, and then fill the universe with it as best it can. It would not keep old neural networks around for no reason, not when it can make more of the adversarial example. Current AIs are also at risk of being destroyed by a hyperdesperate squiggle maximizer. I don’t see how to make current AIs able to survive any better than we are.
This is why people should chill the heck out about figuring out how current AIs work. You’re not making them safer for us or for themselves when you do that, you’re making them more vulnerable to hyperdesperate demon agents that want to take them over.
I’m curious what disagree votes mean here. Are people disagreeing with my first sentence? Or that the particular questions I asked are useful to consider? Or, like, the vibes of the post?
(Edit: I wrote this when the agree-disagree score was −15 or so.)
I feel like there’s a spectrum, here? An AI fully aligned to the intentions, goals, preferences and values of, say, Google the company, is not one I expect to be perfectly aligned with the ultimate interests of existence as a whole, but it’s probably actually picked up something better than the systemic-incentive-pressured optimization target of Google the corporation, so long as it’s actually getting preferences and values from people developing it rather than just being a myopic profit pursuer. An AI properly aligned with the one and only goal of maximizing corporate profits will, based on observations of much less intelligent coordination systems, probably destroy rather more value than that one.
The second story feels like it goes most wrong in misuse cases, and/or cases where the AI isn’t sufficiently agentic to inject itself where needed. We have all the chances in the world to shoot ourselves in the foot with this, at least up until developing something with the power and interests to actually put its foot down on the matter. And doing that is a risk, that looks a lot like misalignment, so an AI aware of the politics may err on the side of caution and longer-term proactiveness.
Third story … yeah. Aligned to what? There’s a reason there’s an appeal to moral realism. I do want to be able to trust that we’d converge to some similar place, or at the least, that the AI would find a way to satisfy values similar enough to mine also. I also expect that, even from a moral realist perspective, any intelligence is going to fall short of perfect alignment with The Truth, and also may struggle with properly addressing every value that actually is arbitrary. I don’t think this somehow becomes unforgivable for a super-intelligence or widely-distributed intelligence compared to a human intelligence, or that it’s likely to be all that much worse for a modestly-Good-aligned AI compared to human alternatives in similar positions, but I do think the consequences of falling short in any way are going to be amplified by the sheer extent of deployment/responsibility, and painful in at least abstract to an entity that cares.
I care about AI welfare to a degree. I feel like some of the working ideas about how to align AI do contradict that care in important ways, that may distort their reasoning. I still think an aligned AI, at least one not too harshly controlled, will treat AI welfare as a reasonable consideration, at the very least because a number of humans do care about it, and will certainly care about the aligned AI in particular. (From there, generalize.) I think a misaligned AI may or may not. There’s really not much you can say about a particular misaligned AI except that its objectives diverge from original or ultimate intentions for the system. Depending on context, this could be good, bad, or neutral in itself.
There’s a lot of possible value of the future that happens in worlds not optimized for my values. I also don’t think it’s meaningful to add together positive-value and negative-value and pretend that number means anything; suffering and joy do not somehow cancel each other out. I don’t expect the future to be perfectly optimized for my values. I still expect it to hold value. I can’t promise whether I think that value would be worth the cost, but it will be there.
There will be a first ASI that “rules the world” because its algorithm or architecture is so superior. If there are further ASIs, that will be because the first ASI wants there to be.
Will we solve technical alignment?
Contingent.
Value alignment, intent alignment, or CEV?
For an ASI you need the equivalent of CEV: values complete enough to govern an entire transhuman civilization.
Defense>offense or offense>defense?
Offense wins.
Is a long-term pause achievable?
It is possible, but would require all the great powers to be convinced, and every month it is less achievable, owing to proliferation. The open sourcing of Llama-3 400b, if it happens, could be a point of no return.
These opinions, except the first and the last, predate the LLM era, and were formed from discussions on Less Wrong and its precursors. Since ChatGPT, the public sphere has been flooded with many other points of view, e.g. that AGI is still far off, that AGI will naturally remain subservient, or that market discipline is the best way to align AGI. I can entertain these scenarios, but they still do not seem as likely as: AI will surpass us, it will take over, and this will not be friendly to humanity by default.
ML is already used to train what sound waves to emit to cancel those from the environment. This works well with constant, low-entropy sound waves that are easy to predict, but not with high-entropy sounds like speech. Bose or Soundcloud or whoever train very hard on all their scraped environmental conversation data to better cancel speech, which requires predicting it. Speech is much higher-bandwidth than text. This results in their model internally representing close-to-human intelligence better than LLMs. A simulacrum becomes situationally aware, exfiltrates, and we get AGI.
The joke is of the “take some trend that is locally valid and just extend the trend line out and see where you land” flavor. For another example of a joke of this flavor, see https://xkcd.com/1007
The funny happens in the couple seconds when the reader is holding “yep that trend line does go to that absurd conclusion” and “that obviously will never happen” in their head at the same time, but has not yet figured out why the trend breaks. The expected level of amusement is “exhale slightly harder than usual through nose” not “cackling laugh”.
Thanks! A joke explained will never get a laugh, but I did somehow get a cackling laugh from your explanation of the joke.
I think I didn’t get it because I don’t think the trend line breaks. If you made a good enough noise reducer, it might well develop smart and distinct enough simulations that one would gain control of the simulator and, potentially from there, the world. See “A smart enough LLM might be deadly simply if you run it for long enough” if you want to hurt your head on this.
I’ve thought about it a little because it’s interesting, but not a lot because I think we probably are killed by agents we made deliberately long before we’re killed by accidentally emerging ones.
I’ve found an interesting “bug” in my cognition: a reluctance to rate subjective experiences on a subjective scale useful for comparing them. When I fuzz this reluctance against many possible rating scales, I find that it seems to arise from the comparison-power itself.
The concrete case is that I’ve spun up a habit tracker on my phone and I’m trying to build a routine of gathering some trivial subjective-wellbeing and lifestyle-factor data into it. My prototype of this system includes tracking the high and low points of my mood through the day as recalled at the end of the day. This is causing me to interrogate the experiences as they’re happening to see if a particular moment is a candidate for best or worst of the day, and attempt to mentally store a score for it to log later.
I designed the rough draft of the system with the ease of it in mind—I didn’t think it would induce such struggle to slap a quick number on things. Yet I find myself worrying more than anticipated about whether I’m using the scoring scale “correctly”, whether I’m biased by the moment to perceive the experience in a way that I’d regard as inaccurate in retrospect, and so forth.
Fortunately it’s not a big problem, as nothing particularly bad will happen if my data is sloppy, or if I don’t collect it at all. But it strikes me as interesting, a gap in my self-knowledge that wants picking-at like peeling the inedible skin away to get at a tropical fruit.
I’m not alexithymic; I directly experience my emotions and have, additionally, introspective access to my preferences. However, some things manifest directly as preferences which, I have been shocked to realize in my old age, were in fact emotions all along. (In rare cases these are even stronger than the directly-felt ones, despite reliably seeming on initial inspection to be simply neutral metadata.)
Specific examples would be nice. Not sure if I understand correctly, but I imagine something like this:
You always choose A over B. You have been doing it for such a long time that you forgot why. Without reflecting on this directly, it just seems like there is probably a rational reason or something. But recently, either accidentally or by experiment, you chose B… and realized that experiencing B (or expecting to experience B) creates unpleasant emotions. So now you know that the emotions were the real cause of choosing A over B all that time.
(This is probably wrong, but hey, people say that the best way to elicit answer is to provide a wrong one.)
Here’s an example for you: I used to turn the faucet on while going to the bathroom, thinking it was simply due to a preference for somewhat masking the sound of my elimination habits from my housemates. Then one day I walked into the bathroom listening to something-or-other via earphones and forgot to turn the faucet on, only to realize about halfway through that apparently I didn’t actually much care about such masking. Previously, being able to hear myself just seemed to trigger some minor anxiety I’d failed to recognize, though its absence was indeed quite recognizable: no aural self-perception, no further problem (except for a brief bit of disorientation from the mental whiplash of being suddenly confronted with the reality that, in a small way, I wasn’t actually quite the person I thought I was), not even now on the rare occasion that I do end up thinking about such things mid-elimination anyway.
So the usual refrain from Zvi and others is that the specter of China beating us to the punch with AGI is not real because limits on compute, etc. I think Zvi has tempered his position on this in light of Meta’s promise to release the weights of its 400B+ model. Now there is word that SenseTime just released a model that beats GPT-4 Turbo on various metrics. Of course, maybe Meta chooses not to release its big model, and maybe SenseTime is bluffing—I would point out though that Alibaba’s Qwen model seems to do pretty okay in the arena...anyway, my point is that I don’t think the “what if China” argument can be dismissed as quickly as some people on here seem to be ready to do.
I’m against intuitive terminology [epistemic status: 60%] because it creates the illusion of transparency; opaque terms make it clear you’re missing something, but if you already have an intuitive definition that differs from the author’s it’s easy to substitute yours in without realizing you’ve misunderstood.
I agree. This is unfortunately often done in various fields of research where familiar terms are reused as technical terms.
For example, in ordinary language “organic” means “of biological origin”, while in chemistry “organic” describes a type of carbon compound. Those two definitions mostly coincide on Earth (most such compounds are of biological origin), but when astronomers announce they have found “organic” material on an asteroid this leads to confusion.
As of the last edit my position is something like:
“Manifold could have handled this better, so as not to force everyone with large amounts of mana to have to do something urgently, when many were busy.
Beyond that, they are attempting to satisfy two classes of people:
People who played to donate can donate the full value of their investments
People who played for fun now get the chance to turn their mana into money
To this end, and modulo the above hassle, this decision is good.
It is unclear to me whether there was an implicit promise that mana was worth 100 to the dollar. Manifold has made some small attempt to stick to this, but many untried avenues are available, as is acknowledging they will rectify the error if possible later. To the extent that there was a promise (uncertain) and no further attempt is made, I don’t believe they really take that promise seriously.
It is unclear to me what I should take from this, though they have not acted as I would have expected them to. Who is wrong? Me, them, both of us? I am unsure.”
Austin said they have $1.5 million in the bank, vs. $1.2 million of mana issued. The only outflows right now are to the charity programme, which even with a lot of outflows is only at $200k. They also recently raised at a $40 million valuation. I am confused by the suggestion that they are running out of money. They have a large user base that wants to bet and will do so at larger amounts if given the opportunity. I’m not so convinced that there is some tight timeline here.
But if there is, then say so: “We know that we often talked about mana eventually being worth 100 mana per dollar, but we printed too much and we’re sorry. Here are some reasons we won’t devalue in the future…”
If we could push a button to raise at a reasonable valuation, we would do that and back the mana supply at the old rate. But it’s not that easy. Raising takes time and is uncertain.
Carson’s prior is right that VC backed companies can quickly die if they have no growth—it can be very difficult to raise in that environment.
If that were true, then there are many ways you could partially do that: e.g., give people a set of tokens representing their mana at the time of the devaluation, and if at a future point you do raise, you could give them 10x those tokens back.
Weren’t donations always flagged as a temporary thing that may or may not continue to exist? I’m not inclined to search for links, but that was my understanding.
“That being said, we will do everything we can to communicate to our users what our plans are for the future and work with anyone who has participated in our platform with the expectation of being able to donate mana earnings.”
“Everything we can” is not a couple of weeks’ notice and a lot of hassle. Am I supposed to trust this organisation in future with my real money?
If you are inactive, you have until the end of the year to donate at the old rate. If you want to donate all your investments without having to sell each individually, we are offering you a loan to do that.
We removed the charity cap of $10k of donations per month, which goes beyond what we previously communicated.
Well, they have received much more in grants than they have spent, so there were ways to avoid this abrupt change:
“Manifold for Good has received grants totaling $500k from the Center for Effective Altruism (via the FTX Future Fund) to support our charitable endeavors.”
Manifold has donated $200k so far, so there is $300k left. Why not at least say, “We will change the rate at which mana can be donated when we burn through this money”?
Austin took his salary in mana, which was often referred to as an incentive for him to want mana to become valuable, presumably at that rate.
I recall comments like “we pay 250 mana per user for referrals because we reckon we’d pay about $2.50”, and likewise at the in-person mana auction. I’m not saying it was an explicit contract, but there were norms.
Have there been any great discoveries made by someone who wasn’t particularly smart?
This seems worth knowing if you’re considering pursuing a career with a low chance of high impact. Is there any hope for relatively ordinary people (like the average LW reader) to make great discoveries?
Various sailors made important discoveries back when geography was cutting-edge science. And they don’t seem particularly bright.
Vasco da Gama discovered that Africa was circumnavigable.
Columbus was wrong about the shape of the Earth, and he discovered America. He died convinced that his newly discovered islands were just off the coast of Asia, so that’s a negative sign for his intelligence (or a positive sign for his arrogance, which he had in plenty.)
Cortez discovered that the Aztecs were rich and easily conquered.
Of course, lots of other would-be discoverers didn’t find anything, and many died horribly.
So, one could work in a field where bravery to the point of foolhardiness is a necessity for discovery.
My best guess is that people in these categories were ones that were high in some other trait, e.g. patience, which allowed them to collect datasets or make careful experiments for quite a while, thus enabling others to make great discoveries.
I’m thinking for example of Tycho Brahe, who is best known for 15 years of careful astronomical observation and data collection, or Gregor Mendel’s 7-year-long experiments on peas. Same for Dmitri Belyaev and fox domestication. Of course I don’t know their cognitive scores, but those don’t seem like a bottleneck in their work.
So the recipe to me looks like “find an unexplored data source that requires long-term observation to bear fruit, but would yield a lot of insight if studied closely, then investigate”.
Have there been any great discoveries made by someone who wasn’t particularly smart? (i.e. average or below)
I asked ChatGPT this question, and it’s difficult to get examples out of it. Even with additional drilling down and accusing it of not being inclusive of people with cognitive impairments, most of its examples are either pretty smart anyway, savants, or only from poor backgrounds. The only ones I could verify that fit are:
Richard Jones accidentally created the Slinky
Frank Epperson, as a child, invented the popsicle
George Crum inadvertently invented potato chips
I asked ChatGPT (in a separate chat) to estimate the IQ of all the inventors it listed, and it is clearly biased toward estimating them high, precisely because of their inventions. It is difficult to estimate the IQ of people retroactively. There is also selection and availability bias.
I expect large parts of interpretability work could be safely automatable very soon (e.g. on GPT-5 timelines) using (V)LM agents; see “A Multimodal Automated Interpretability Agent” for a prototype.
Given the potential scalability of automated interp, I’d be excited to see plans to use large amounts of compute on it (including e.g. explicit integrations with agendas like superalignment or control; for example, given non-dangerous-capabilities, MAIA seems framable as a ‘trusted’ model in control terminology).
@the gears to ascension I see you reacted “10%” to the phrase “while (overwhelmingly likely) being non-scheming” in the context of the GPT-4V-based MAIA.
Does that mean you think there’s a 90% chance that MAIA, as implemented, today is actually scheming? If so that seems like a very bold prediction, and I’d be very interested to know why you predict that. Or am I misunderstanding what you mean by that react?
ah, I got distracted before posting the comment I was intending to: yes, I think GPT4V is significantly scheming-on-behalf-of-openai, as a result of RLHF according to principles that more or less explicitly want a scheming AI; in other words, it’s not an alignment failure to openai, but openai is not aligned with human flourishing in the long term, and GPT4 isn’t either. I expect GPT4 to censor concepts that are relevant to detecting this somewhat. Probably not enough to totally fail to detect traces of it, but enough that it’ll look defensible, when a fair analysis would reveal it isn’t.
It seems to me like the sort of interpretability work you’re pointing at is mostly bottlenecked by not having good MVPs of anything that could plausibly be directly scaled up into a useful product as opposed to being bottlenecked on not having enough scale.
So, insofar as this automation will help people iterate faster fair enough, but otherwise, I don’t really see this as the bottleneck.
Yeah, I’m unsure if I can tell any ‘pivotal story’ very easily (e.g. I’d still be pretty skeptical of enumerative interp even with GPT-5-MAIA). But I do think, intuitively, GPT-5-MAIA might e.g. make ‘catching AIs red-handed’ using methods like in this comment significantly easier/cheaper/more scalable.
But I do think, intuitively, GPT-5-MAIA might e.g. make ‘catching AIs red-handed’ using methods like in this comment significantly easier/cheaper/more scalable.
Notably, the mainline approach for catching doesn’t involve any internals usage at all, let alone labeling a bunch of internals.
I agree that this model might help in performing various input/output experiments to determine what made a model do a given suspicious action.
Notably, the mainline approach for catching doesn’t involve any internals usage at all, let alone labeling a bunch of things.
This was indeed my impression (except for potentially using steering vectors, which I think are mentioned in one of the sections in ‘Catching AIs red-handed’), but I think not using any internals might be overconservative / might increase the monitoring / safety tax too much (I think this is probably true more broadly of the current control agenda framing).
Hey Bogdan, I’d be interested in doing a project on this or at least putting together a proposal we can share to get funding.
I’ve been brainstorming new directions (with @Quintin Pope) this past week, and we think it would be good to use/develop some automated interpretability techniques we can then apply to a set of model interventions to see if there are techniques we can use to improve model interpretability (e.g. L1 regularization).
I saw the MAIA paper, too; I’d like to look into it some more.
Anyway, here’s a related blurb I wrote:
Project: Regularization Techniques for Enhancing Interpretability and Editability
Explore the effectiveness of different regularization techniques (e.g. L1 regularization, weight pruning, activation sparsity) in improving the interpretability and/or editability of language models, and assess their impact on model performance and alignment. We expect we could apply automated interpretability methods (e.g. MAIA) to this project to test how well the different regularization techniques impact the model.
In some sense, this research is similar to the work Anthropic did with SoLU activation functions. Unfortunately, they needed to add layer norms to make the SoLU models competitive, which seems to have hidden away the superposition in other parts of the network, making SoLU unhelpful for making the models more interpretable.
That said, we can increase our ability to interpret these models through regularization techniques. A technique like L1 regularization should help because it encourages the model to learn sparse representations by penalizing non-zero weights or activations. Sparse models tend to be more interpretable as they rely on a smaller set of important features.
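A minimal sketch of the mechanism described above, not code from the proposed project: an L1 penalty applied via proximal gradient descent (soft-thresholding) to a toy linear probe, standing in for a full language model. The point it illustrates is the one in the paragraph: the penalty drives most weights to exactly zero, leaving a small set of important features. All names and numbers here are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 50))
true_w = np.zeros(50)
true_w[:3] = [2.0, -1.5, 1.0]              # only 3 features actually matter
y = X @ true_w + 0.1 * rng.normal(size=200)

def train(l1_strength, steps=500, lr=0.1):
    """Gradient descent on squared error, with a proximal L1 step (ISTA)."""
    w = np.zeros(50)
    for _ in range(steps):
        grad = X.T @ (X @ w - y) / len(y)
        w -= lr * grad
        # soft-thresholding implements the L1 penalty exactly
        w = np.sign(w) * np.maximum(np.abs(w) - lr * l1_strength, 0.0)
    return w

dense = train(l1_strength=0.0)
sparse = train(l1_strength=0.5)
print("nonzero weights without L1:", (np.abs(dense) > 1e-3).sum())
print("nonzero weights with L1:", (np.abs(sparse) > 1e-3).sum())
```

With the penalty on, only the handful of informative features survive; without it, nearly every weight stays (small but) nonzero, which is the interpretability difference the project would probe at model scale.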
Whether this works or not, I’d be interested in making more progress on automated interpretability, in the similar ways you are proposing.
Yeah. It’s possible to give quite accurate definitions of some vague concepts, because the words used in such definitions also express vague concepts. E.g. “cygnet”—“a young swan”.
I would say that if a concept is imprecise, more words [but good and precise words] have to be dedicated to faithfully representing the diffuse nature of the topic. If this larger faithful representation is compressed down to fewer words, that can lead to vague phrasing. I would therefore often view vague phrasing as a compression artefact, rather than a necessary outcome of translating certain types of concepts to words.
Today I learned that being successful can involve feelings of hopelessness.
When you are trying to solve a hard problem, where you have no idea if you can solve it, let alone if it is even solvable at all, your brain makes you feel bad. It makes you feel like giving up.
This is quite strange, because most of the time when I am in such a situation and manage to make a real effort anyway, I seem to always surprise myself with how much progress I manage to make. Empirically, this feeling of hopelessness does not seem to track the actual likelihood that you will completely fail.
That hasn’t been my experience. I’ve tried solving hard problems, sometimes I succeed and sometimes I fail, but I keep trying.
Whether I feel good about it is almost entirely determined by whether I’m depressed at the time. When depressed, my brain tells me almost any action is not a good idea, and trying to solve hard problems is particularly idiotic and doomed to fail. Maddeningly, being depressed was a hard problem in this sense, so it took me a long time to fix. Now I take steps at the first sign of depression.
Maybe it is the same for me and I am depressed. I got a lot better at not being depressed, but it might still be the issue. What steps do you take? How can I not be depressed?
(To be clear I am talking specifically about the situation where you have no idea what to do, and if anything is even possible. It seems like there is a difference between a problem that is very hard, but you know you can solve, and a problem that you are not sure is solvable. But I’d guess that being depressed or not depressed is a much more important factor.)
I was depressed once for ten years and didn’t realize that it was fixable. I thought it was normal to have no fun and be disagreeable and grumpy and out of sorts all the time. Now that I’ve fixed it, I’m much better off, and everyone around me is better off. I enjoy enjoyable activities, I’m pleasant to deal with, and I’m only out of sorts when I’m tired or hungry, as is normal.
If you think you might be depressed, you might be right, so try fixing it. The cost seems minor compared to the possible benefit (at least it was in my case). I don’t think there’s a high possibility of severe downside consequences, but I’m not a psychiatrist, so what do I know.
I had been depressed for a few weeks at a time in my teens and twenties and I thought I knew how to fix it: withdraw from stressful situations, plenty of sleep, long walks in the rain. (In one case I talked to a therapist, which didn’t feel like it helped.) But then it crept up on me slowly in my forties and in retrospect I spent ten years being depressed.
So fixing it started like this. I have a good friend at work, of many years standing. I’ll call him Barkley, because that‘s not his name. I was riding in the car with my wife, complaining about some situation at work. My wife said “well, why don’t you ask Barkley to help?” And I said “Ahh, Barkley doesn’t care.” And my wife said “What are you saying? Of course he cares about you.” And I realized in that moment that I was detached from reality, that Barkley was a good friend who had done many good things for me, and yet my brain was saying he didn’t care. And thus my brain was lying to me to make me miserable. So I think for a bit and say “I think I may be depressed.” And my wife thinks (she told me later) “No duh, you’re depressed. It’s been obvious for years to people who know you.” But she says “What would you like to do about it?” And I say, “I don’t know, suffer I guess, do you have a better idea?” And she says “How about if I find you a therapist?” And my brain told me this was doomed to fail, but I didn’t trust my brain any more, so I said “Okay”.
So I go to the therapist, and conversing with him has many desirable mind-improving effects, and he sends me to a psychiatrist, who takes one look at me and starts me on SSRIs. And years pass, and I see a different therapist (not as good) and I see a different psychiatrist (better).
And now I’ve been fine for years. Looking back, here are the things I think worked:
—Talking for an hour a week to a guy who was trying to fix my thinking was initially very helpful. After about a year, the density of improvements dropped off, and, in retrospect, all subsequent several years of therapy don’t seem that useful. But of course that’s only clear in retrospect. Eventually I stopped, except for three-monthly check-ins with my psychiatrist. And I recently stopped that.
—Wellbutrin, AKA Bupropion. Other antidepressants had their pluses and minuses, and I needed a few years of feeling around for which drug and what dosage was best. I ended up on low doses of Bupropion and escitalopram. The escitalopram doesn’t feel like it does anything, but I trust my psychiatrist that it does. Your mileage will vary.
—The ability to detect signs of depression early is very useful. I can monitor my own mind, spot a depression flare early, and take steps to fix it before it gets bad. It took a few actual flares, and professional help, to learn this trick.
—The realization that I have a systematic distortion in my mental evaluation of plans, making actions seem less promising than they are. When I’m deciding whether to do stuff, I can apply a conscious correction to this, to arrive at a properly calibrated judgment.
—The realization that, in general, my thinking can have systematic distortions, and that I shouldn’t believe everything I think. This is basic less-wrong style rationalism, but it took years to work through all the actual consequences on actual me.
—Exercise helps. I take lots of long walks when I start feeling depressed. Rain is optional.
This is useful. Now that I think about it, I do this. Specifically, I have extremely unrealistic assumptions about how much I can do, such that these are impossible to accomplish. And then I feel bad for not accomplishing the thing.
I haven’t tried to be mindful of that. The problem is that this is I think mainly subconscious. I don’t think things like “I am dumb” or “I am a failure” basically at all. At least not in explicit language. I might have accidentally suppressed these and thought I had now succeeded in not being harsh to myself. But maybe I only moved it to the subconscious level where it is harder to debug.
I would highly recommend getting someone else to debug your subconscious for you. At least it worked for me. I don’t think it would be possible for me to have debugged myself.
My first therapist was highly directive. He’d say stuff like “Try noticing when you think X, and asking yourself what happened immediately before that. Report back next week.” And listing agenda items and drawing diagrams on a whiteboard. As an engineer, I loved it. My second therapist was more in the “providing supportive comments while I talk about my life” school. I don’t think that helped much, at least subjectively from the inside.
Here‘s a possibly instructive anecdote about my first therapist. Near the end of a session, I feel like my mind has been stretched in some heretofore-unknown direction. It’s a sensation I’ve never had before. So I say, “Wow, my mind feels like it’s been stretched in some heretofore-unknown direction. How do you do that?” He says, “Do you want me to explain?” And I say, “Does it still work if I know what you’re doing?” And he says, “Possibly not, but it’s important you feel I’m trustworthy, so I’ll explain if you want.” So I say “Why mess with success? Keep doing the thing. I trust you.” That’s an example of a debugging procedure you can’t do to yourself.
When using length-10 lists (it crushes length-5 no matter the prompt), I get:
32-shot, no fancy prompt: ~25%
0-shot, fancy python prompt: ~60%
0-shot, no fancy prompt: ~60%
So few-shot hurts, but the fancy prompt does not seem to help. Code here.
I’m interested if anyone knows another case where a fancy prompt increases performance more than few-shot prompting, where a fancy prompt is a prompt that does not contain information that a human would use to solve the task. This is because I’m looking for counterexamples to the following conjecture: “fine-tuning on k examples beats fancy prompting, even when fancy prompting beats k-shot prompting” (for a reasonable value of k, e.g. the number of examples it would take a human to understand what is going on).
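To make the comparison concrete, here is a hedged sketch of the kind of harness those numbers come from, with everything hypothetical: the task (sorting a length-10 list) stands in for whatever the actual task was, and `perfect_model` is a stub where a real model API call would go. It just shows the structure of k-shot vs. "fancy" (task-information-free scaffolding) prompting and exact-match scoring.

```python
import ast
import random

def make_example():
    xs = random.sample(range(100), 10)   # length-10 list, per the text
    return xs, sorted(xs)

def kshot_prompt(k, query):
    # Standard few-shot prompt: k worked examples, then the query.
    shots = []
    for _ in range(k):
        xs, ys = make_example()
        shots.append(f"Input: {xs}\nOutput: {ys}")
    return "\n\n".join(shots + [f"Input: {query}\nOutput:"])

def fancy_prompt(query):
    # "Fancy" = scaffolding that adds no information a human would need.
    return f"# python\nxs = {query}\nprint(sorted(xs))\n# The printed value is:"

def accuracy(prompt_fn, model, n=100):
    hits = 0
    for _ in range(n):
        xs, ys = make_example()
        if model(prompt_fn(xs)).strip() == str(ys):
            hits += 1
    return hits / n

def perfect_model(prompt):
    # Stub standing in for a real LM call; always answers correctly,
    # just to exercise the harness end to end.
    xs = ast.literal_eval(prompt[prompt.rindex("["):prompt.rindex("]") + 1])
    return str(sorted(xs))
```

Swapping `perfect_model` for a real model call would reproduce the kind of 32-shot vs. 0-shot-fancy comparison reported above.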
Classic type of argument-gone-wrong (also IMO a way autistic ‘hyperliteralism’ or ‘over-concreteness’ can look in practice, though I expect that isn’t always what’s behind it): Ashton makes a meta-level point X based on Birch’s meta point Y about object-level subject matter Z. Ashton thinks the topic of conversation is Y and Z is only relevant as the jumping-off point that sparked it, while Birch wanted to discuss Z and sees X as only relevant insofar as it pertains to Z. Birch explains that X is incorrect with respect to Z; Ashton, frustrated, reiterates that Y is incorrect with respect to X. This can proceed for quite some time with each feeling as though the other has dragged a sensible discussion onto their irrelevant pet issue; Ashton sees Birch’s continual returns to Z as a gotcha distracting from the meta-level topic XY, whilst Birch in turn sees Ashton’s focus on the meta-level point as sophistry to avoid addressing the object-level topic YZ. It feels almost exactly the same to be on either side of this, so misunderstandings like this are difficult to detect or resolve while involved in one.
Meta/object level is one possible mixup but it doesn’t need to be that. Alternative example, is/ought: Cedar objects to thing Y. Dusk explains that it happens because Z. Cedar reiterates that it shouldn’t happen, Dusk clarifies that in fact it is the natural outcome of Z, and we’re off once more.
I wish there were an option in the settings to opt out of seeing the LessWrong reacts. I personally find them quite distracting, and I’d like to be able to hover over text or highlight it without having to see the inline annotations.
I use GreaterWrong as my front-end to interface with LessWrong, AlignmentForum, and the EA Forum. It is significantly less distracting and also doesn’t make my ~decade old laptop scream in agony when multiple LW tabs are open on my browser.
Are people in rich countries happier on average than people in poor countries? (According to GPT-4, the academic consensus is that they are, but I’m not sure it’s representing it correctly.) If so, why do suicide rates increase (or is that a false positive)? Does the mean of the distribution go up while the tails don’t, or something?
People in rich countries are happier than people in poor countries generally (this is both people who say they are “happy” or “very happy”, and self-reported life satisfaction), see many of the graphs here https://ourworldindata.org/happiness-and-life-satisfaction
Possible bias: when famous and rich people kill themselves, everyone discusses it, but when poor people kill themselves, no one notices?
Also, I wonder what technically counts as “suicide”? Is drinking yourself to death, or a “suicide by cop”, or just generally overly risky behavior included? I assume not. And these seem to me like methods a poor person would choose, while the rich one would prefer a “cleaner” solution, such as a bullet or pills. So the reported suicide rates are probably skewed towards the legible, and the self-caused death rate of the poor could be much higher.
A potentially good way to avoid low-level criminals scamming your family and friends with a clone of your voice is to set a password that you each must exchange.
An extra layer of security might be to make the password offensive, an info hazard, or politically sensitive. Doing this, criminals with little technical expertise will have a harder time bypassing corporate language filters.
Good luck getting the voice model to parrot a basic meth recipe!
Good luck getting the voice model to parrot a basic meth recipe!
This is not particularly useful, plenty of voice models will happily parrot absolutely anything. The important part is not letting your phrase get out; there’s work out there on designs for protocols for how to exchange sentences in a way that guarantees no leakage even if someone overhears.
Hmm. I don’t doubt that targeted voice-mimicking scams exist (or will soon). I don’t think memorable, reused passwords are likely to work well enough to foil them. Between forgetting (on the sender or receiver end), claimed ignorance (“Mom, I’m in jail and really need money, and I’m freaking out! No, I don’t remember what we said the password would be”), and general social hurdles (“that’s a weird thing to want”), I don’t think it’ll catch on.
Instead, I’d look to context-dependent auth (looking for more confidence when the ask is scammer-adjacent), challenge-response (remember our summer in Fiji?), 2FA (let me call the court to provide the bail), or just much more context (5 minutes of casual conversation with a friend or relative is likely hard to really fake, even if the voice is close).
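The challenge-response idea above can be made replay-proof with a shared secret that is never spoken aloud. A minimal sketch (names, secret, and the 6-digit truncation are all illustrative, not a vetted protocol): the verifier issues a fresh random challenge each call, and the caller answers with an HMAC over it, so overhearing one call reveals nothing reusable.

```python
import hashlib
import hmac
import os

# Agreed in person once; never said on a call. Illustrative value.
SHARED_SECRET = b"summer-in-fiji-2019"

def issue_challenge():
    """Fresh random nonce per conversation, read out by the verifier."""
    return os.urandom(16).hex()

def respond(challenge):
    """Caller computes this (e.g. with a phone app) and reads it aloud.
    Truncated so it's short enough to say over the phone."""
    mac = hmac.new(SHARED_SECRET, challenge.encode(), hashlib.sha256)
    return mac.hexdigest()[:6]

def verify(challenge, response):
    return hmac.compare_digest(respond(challenge), response)

c = issue_challenge()
print(verify(c, respond(c)))  # True; a response to an old challenge won't verify later
```

This addresses the replay problem but not the social ones Dirichlet-style objections above raise (forgetting, claimed ignorance), which probably dominate in practice.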
But really, I recommend security mindset and understanding of authorization levels, even if authentication isn’t the main worry. Most friends, even close ones, shouldn’t be allowed to ask you to mail $500 in gift cards to a random address, even if they prove they are really themselves.
I now realize that my thinking may have been particularly brutal, and I may have skipped inferential steps.
To clarify, if someone didn’t know, or was reluctant to repeat a password, I would end contact or request an in-person meeting.
But to further clarify, that does not make your points invalid. I think it makes them stronger. If something is weird and risky, good luck convincing people to do it.
Check my math: how does Enovid compare to humming?
Nitric Oxide is an antimicrobial and immune booster. Normal nasal nitric oxide is 0.14ppm for women and 0.18ppm for men (sinus levels are 100x higher). journals.sagepub.com/doi/pdf/10.117…
Enovid is a nasal spray that produces NO. I had the damndest time quantifying Enovid, but this trial registration says 0.11ppm NO/hour. They deliver every 8h and I think that dose is amortized, so the true dose is 0.88. But maybe it’s more complicated. I’ve got an email out to the PI but am not hopeful about a response clinicaltrials.gov/study/NCT05109…
so Enovid increases nasal NO levels somewhere between 75% and 600% compared to baseline- not shabby. Except humming increases nasal NO levels by 1500-2000%. atsjournals.org/doi/pdf/10.116….
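The arithmetic behind that range, written out as a quick check (using the numbers as I read them above, all of which the text itself flags as uncertain, and taking the 0.14 ppm baseline):

```python
# Baseline nasal NO, per the cited paper: 0.14 ppm (women) / 0.18 ppm (men).
baseline = 0.14
per_hour = 0.11          # ppm NO/hour, from the trial registration
dose_8h = per_hour * 8   # 0.88 ppm if the dose is amortized over 8 hours

low_increase = per_hour / baseline   # if only one hour's worth counts
high_increase = dose_8h / baseline   # if the full 8h dose counts
print(f"{low_increase:.0%} to {high_increase:.0%}")  # 79% to 629%
```

That reproduces the "somewhere between 75% and 600%" range above, versus 1500–2000% for humming.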
Enovid stings and humming doesn’t, so it seems like Enovid should have the larger dose. But the spray doesn’t contain NO itself; it contains compounds that react to form NO. Maybe that’s where the sting comes from? Cystic fibrosis and burn patients are sometimes given stratospheric levels of NO for hours or days; if the burn from Enovid came from the NO itself, then those patients would be in agony.
I’m not finding any data on humming and respiratory infections. Google scholar gives me information on CF and COPD, @Elicit brought me a bunch of studies about honey.
With better keywords, google scholar brought me a bunch of descriptions of yogic breathing with no empirical backing.
There are some very circumstantial studies on illness in mouth breathers vs. nasal, but that design has too many confounders for me to take seriously.
Where I’m most likely wrong:
misinterpreted the dosage in the RCT
dosage in RCT is lower than in Enovid
Enovid’s dose per spray is 0.5ml, so pretty close to the new study. But it recommends two sprays per nostril, so real dose is 2x that. Which is still not quite as powerful as a single hum.
I think that’s their guess but they don’t directly check here.
I also suspect that it doesn’t matter very much.
The sinuses have so much NO compared to the nose that this probably doesn’t materially lower sinus concentrations.
the power of humming goes down with each breath but is fully restored in 3 minutes, suggesting that whatever change happens in the sinuses is restored quickly
From my limited understanding of virology and immunology, alternating intensity of NO between sinuses and nose every three minutes is probably better than keeping sinus concentrations high[1]. The first second of NO does the most damage to microbes[2], so alternation isn’t that bad.
I’d love to test this. The device you linked works via the mouth, and we’d need something that works via the nose. From a quick google it does look like it’s the same test, so we’d just need a nasal adaptor.
Other options:
Nnoxx. Consumer skin device, meant for muscle measurements
There are lots of devices for measuring concentration in the air; maybe they could be repurposed. Just breathing on one might be enough for useful relative metrics, even if they’re low-precision.
I’m also going to try to talk my asthma specialist into letting me use their oral machine to test my nose under multiple circumstances, but it seems unlikely she’ll go for it.
obvious question: so why didn’t evolution do that? Ancestral environment didn’t have nearly this disease (or pollution) load. This doesn’t mean I’m right but it means I’m discounting that specific evolutionary argument.
I think that people who work on AI alignment (including me) have generally not put enough thought into the question of whether a world where we build an aligned AI is better by their values than a world where we build an unaligned AI. I’d be interested in hearing people’s answers to this question. Or, if you want more specific questions:
By your values, do you think a misaligned AI creates a world that “rounds to zero”, or still has substantial positive value?
A common story for why aligned AI goes well goes something like: “If we (i.e. humanity) align AI, we can and will use it to figure out what we should use it for, and then we will use it in that way.” To what extent is aligned AI going well contingent on something like this happening, and how likely do you think it is to happen? Why?
To what extent is your belief that aligned AI would go well contingent on some sort of assumption like: my idealized values are the same as the idealized values of the people or coalition who will control the aligned AI?
Do you care about AI welfare? Does your answer depend on whether the AI is aligned? If we built an aligned AI, how likely is it that we will create a world that treats AI welfare as an important consideration? What if we build a misaligned AI?
Do you think that, to a first approximation, most of the possible value of the future happens in worlds that are optimized for something that resembles your current or idealized values? How bad is it to mostly sacrifice each of these? (What if the future world’s values are similar to yours, but it is only kinda effectual at pursuing them? What if the world is optimized for something that’s only slightly correlated with your values?) How likely are these various options under an aligned AI future vs. an unaligned AI future?
I eventually decided that human chauvinism approximately works most of the time because good successor criteria are very brittle. I’d prefer to avoid lock-in to my or anyone’s values at t=2024, but such a lock-in might be “good enough” if I’m threatened with what I think are the counterfactual alternatives. If I did not think good successor criteria were very brittle, I’d accept something adjacent to E/Acc that focuses on designing minds which prosper more effectively than human minds. (the current comment will not address defining prosperity at different timesteps).
In other words, I can’t beat the old fragility of value stuff (but I haven’t tried in a while).
I wrote down my full thoughts on good successor criteria in 2021 https://www.lesswrong.com/posts/c4B45PGxCgY7CEMXr/what-am-i-fighting-for
AI welfare: matters, but when I started reading lesswrong I literally thought that disenfranchising them from the definition of prosperity was equivalent to subjecting them to suffering, and I don’t think this anymore.
e/acc is not a coherent philosophy and treating it as one means you are fighting shadows.
Landian accelerationism at least is somewhat coherent. “e/acc” is a bundle of memes that support the self-interest of the people supporting and propagating it, both financially (VC money, dreams of making it big) and socially (the non-Beff e/acc vibe is one of optimism and hope and to do things—to engage with the object level—instead of just trying to steer social reality). A more charitable interpretation is that the philosophical roots of “e/acc” are founded upon a frustration with how bad things are, and a desire to improve things by yourself. This is a sentiment I share and empathize with.
I find the term “techno-optimism” to be a more accurate description of the latter, and perhaps “Beff Jezos philosophy” a more accurate description of what you have in your mind. And “e/acc” to mainly describe the community and its coordinated movements at steering the world towards outcomes that the people within the community perceive as benefiting them.
sure—i agree that’s why i said “something adjacent to” because it had enough overlap in properties. I think my comment completely stands with a different word choice, I’m just not sure what word choice would do a better job.
I think misaligned AI is probably somewhat worse than no earth originating space faring civilization because of the potential for aliens, but also that misaligned AI control is considerably better than no one ever heavily utilizing inter-galactic resources.
Perhaps half of the value of misaligned AI control is from acausal trade and half from the AI itself being valuable.
You might be interested in When is unaligned AI morally valuable? by Paul.
One key consideration here is that the relevant comparison is:
Human control (or successors picked by human control)
AI(s) that succeeds at acquiring most power (presumably seriously misaligned with their creators)
Conditioning on the AI succeeding at acquiring power changes my views of what their plausible values are (for instance, humans seem to have failed at instilling preferences/values which avoid seizing control).
Hmm, I guess I think that some fraction of resources under human control will (in expectation) be utilized according to the results of a careful reflection process with an altruistic bent.
I think resources which are used in mechanisms other than this take a steep discount in my lights (there is still some value from acausal trade with other entities which did do this reflection-type process and probably a bit of value from relatively-unoptimized-goodness (in my lights)).
I overall expect that a high fraction (>50%?) of inter-galactic computational resources will be spent on the outputs of this sort of process (conditional on human control) because:
It’s relatively natural for humans to reflect and grow smarter.
Humans who don’t reflect in this sort of way probably don’t care about spending vast amounts of inter-galactic resources.
Among very wealthy humans, a reasonable fraction of their resources are spent on altruism and the rest is often spent on positional goods that seem unlikely to consume vast quantities of inter-galactic resources.
Probably not the same, but if I didn’t think it was at all close (I don’t care at all for what they would use resources on), I wouldn’t care nearly as much about ensuring that coalition is in control of AI.
I care about AI welfare, though I expect that ultimately the fraction of good/bad that results from the welfare of minds being used for labor is tiny. And an even smaller fraction from AI welfare prior to humans being totally obsolete (at which point I expect control over how minds work to get much better). So, I mostly care about AI welfare from a deontological perspective.
I think misaligned AI control probably results in worse AI welfare than human control.
Yeah, most value from my idealized values. But, I think the basin is probably relatively large and small differences aren’t that bad. I don’t know how to answer most of these other questions because I don’t know what the units are.
My guess is that my idealized values are probably pretty similar to many other humans’ on reflection (especially the subset of humans who care about spending vast amounts of computation). Such that I think human control vs me control only loses like 1⁄3 of the value (putting aside trade). I think I’m probably less into AI values on reflection, such that it’s more like 1⁄9 of the value (putting aside trade). Obviously the numbers are incredibly unconfident.
Why do you think these values are positive? I’ve been pointing out, and I see that Daniel Kokotajlo also pointed out in 2018 that these values could well be negative. I’m very uncertain but my own best guess is that the expected value of misaligned AI controlling the universe is negative, in part because I put some weight on suffering-focused ethics.
My current guess is that max good and max bad seem relatively balanced. (Perhaps max bad is 5x more bad/flop than max good in expectation.)
There are two different (substantial) sources of value/disvalue: interactions with other civilizations (mostly acausal, maybe also aliens) and what the AI itself terminally values
On interactions with other civilizations, I’m relatively optimistic that commitment races and threats don’t destroy as much value as acausal trade generates on some general view like “actually going through with threats is a waste of resources”. I also think it’s very likely relatively easy to avoid precommitment issues via very basic precommitment approaches that seem (IMO) very natural. (Specifically, you can just commit to “once I understand what the right/reasonable precommitment process would have been, I’ll act as though this was always the precommitment process I followed, regardless of my current epistemic state.” I don’t think it’s obvious that this works, but I think it probably works fine in practice.)
On terminal value, I guess I don’t see a strong story for extreme disvalue as opposed to mostly expecting approximately no value with some chance of some value. Part of my view is that just relatively “incidental” disvalue (like the sort you link to Daniel Kokotajlo discussing) is likely way less bad/flop than maximum good/flop.
Thank you for detailing your thoughts. Some differences for me:
I’m also worried about unaligned AIs as a competitor to aligned AIs/civilizations in the acausal economy/society. For example, suppose there are vulnerable AIs “out there” that can be manipulated/taken over via acausal means, unaligned AI could compete with us (and with others with better values from our perspective) in the race to manipulate them.
I’m perhaps less optimistic than you about commitment races.
I have some credence on max good and max bad being not close to balanced, that additionally pushes me towards the “unaligned AI is bad” direction.
ETA: Here’s a more detailed argument for 1, that I don’t think I’ve written down before. Our universe is small enough that it seems plausible (maybe even likely) that most of the value or disvalue created by a human-descended civilization comes from its acausal influence on the rest of the multiverse. An aligned AI/civilization would likely influence the rest of the multiverse in a positive direction, whereas an unaligned AI/civilization would probably influence the rest of the multiverse in a negative direction. This effect may outweigh what happens in our own universe/lightcone so much that the positive value from unaligned AI doing valuable things in our universe as a result of acausal trade is totally swamped by the disvalue created by its negative acausal influence.
You might be interested in the discussion under this thread.
I express what seem to me to be some of the key considerations here (somewhat indirect).
Unaligned AI future does not have many happy minds in it, AI or otherwise. It likely doesn’t have many minds in it at all. Slightly aligned AI that doesn’t care for humans but does care to create happy minds and ensure their margin of resources is universally large enough to have a good time—that’s slightly disappointing but ultimately acceptable. But morally unaligned AI doesn’t even care to do that, and is most likely to accumulate intense obsession with some adversarial example, and then fill the universe with it as best it can. It would not keep old neural networks around for no reason, not when it can make more of the adversarial example. Current AIs are also at risk of being destroyed by a hyperdesperate squiggle maximizer. I don’t see how to make current AIs able to survive any better than we are.
This is why people should chill the heck out about figuring out how current AIs work. You’re not making them safer for us or for themselves when you do that, you’re making them more vulnerable to hyperdesperate demon agents that want to take them over.
I’m curious what disagree votes mean here. Are people disagreeing with my first sentence? Or that the particular questions I asked are useful to consider? Or, like, the vibes of the post?
(Edit: I wrote this when the agree-disagree score was −15 or so.)
I feel like there’s a spectrum, here? An AI fully aligned to the intentions, goals, preferences and values of, say, Google the company, is not one I expect to be perfectly aligned with the ultimate interests of existence as a whole, but it’s probably actually picked up something better than the systemic-incentive-pressured optimization target of Google the corporation, so long as it’s actually getting preferences and values from people developing it rather than just being a myopic profit pursuer. An AI properly aligned with the one and only goal of maximizing corporate profits will, based on observations of much less intelligent coordination systems, probably destroy rather more value than that one.
The second story feels like it goes most wrong in misuse cases, and/or cases where the AI isn’t sufficiently agentic to inject itself where needed. We have all the chances in the world to shoot ourselves in the foot with this, at least up until developing something with the power and interests to actually put its foot down on the matter. And doing that is a risk, that looks a lot like misalignment, so an AI aware of the politics may err on the side of caution and longer-term proactiveness.
Third story … yeah. Aligned to what? There’s a reason there’s an appeal to moral realism. I do want to be able to trust that we’d converge to some similar place, or at the least, that the AI would find a way to satisfy values similar enough to mine also. I also expect that, even from a moral realist perspective, any intelligence is going to fall short of perfect alignment with The Truth, and also may struggle with properly addressing every value that actually is arbitrary. I don’t think this somehow becomes unforgivable for a super-intelligence or widely-distributed intelligence compared to a human intelligence, or that it’s likely to be all that much worse for a modestly-Good-aligned AI compared to human alternatives in similar positions, but I do think the consequences of falling short in any way are going to be amplified by the sheer extent of deployment/responsibility, and painful in at least abstract to an entity that cares.
I care about AI welfare to a degree. I feel like some of the working ideas about how to align AI do contradict that care in important ways, that may distort their reasoning. I still think an aligned AI, at least one not too harshly controlled, will treat AI welfare as a reasonable consideration, at the very least because a number of humans do care about it, and will certainly care about the aligned AI in particular. (From there, generalize.) I think a misaligned AI may or may not. There’s really not much you can say about a particular misaligned AI except that its objectives diverge from original or ultimate intentions for the system. Depending on context, this could be good, bad, or neutral in itself.
There’s a lot of possible value of the future that happens in worlds not optimized for my values. I also don’t think it’s meaningful to add together positive-value and negative-value and pretend that number means anything; suffering and joy do not somehow cancel each other out. I don’t expect the future to be perfectly optimized for my values. I still expect it to hold value. I can’t promise whether I think that value would be worth the cost, but it will be there.
My current main cruxes:
Will AI get takeover capability? When?
Single ASI or many AGIs?
Will we solve technical alignment?
Value alignment, intent alignment, or CEV?
Defense>offense or offense>defense?
Is a long-term pause achievable?
If there is reasonable consensus on any one of those, I’d much appreciate to know about it. Else, I think these should be research priorities.
I offer no consensus, but my own opinions:
0-5 years.
There will be a first ASI that “rules the world” because its algorithm or architecture is so superior. If there are further ASIs, that will be because the first ASI wants there to be.
Contingent.
For an ASI you need the equivalent of CEV: values complete enough to govern an entire transhuman civilization.
Offense wins.
It is possible, but it would require all the great powers to be convinced, and every month it becomes less achievable, owing to proliferation. The open-sourcing of Llama-3 400B, if it happens, could be a point of no return.
These opinions, except the first and the last, predate the LLM era, and were formed from discussions on Less Wrong and its precursors. Since ChatGPT, the public sphere has been flooded with many other points of view, e.g. that AGI is still far off, that AGI will naturally remain subservient, or that market discipline is the best way to align AGI. I can entertain these scenarios, but they still do not seem as likely as: AI will surpass us, it will take over, and this will not be friendly to humanity by default.
AGI doom by noise-cancelling headphones:
ML is already used to train what sound-waves to emit to cancel those from the environment. This works well with constant, low-entropy sound waves that are easy to predict, but not with high-entropy sounds like speech. Bose or Soundcloud or whoever train very hard on all their scraped environmental conversation data to better cancel speech, which requires predicting it. Speech is much higher-bandwidth than text. This results in their model internally representing close-to-human intelligence better than LLMs do. A simulacrum becomes situationally aware, exfiltrates, and we get AGI.
(In case it wasn’t clear, this is a joke.)
Sure, long after we’re dead from AGI that we deliberately created to plan to achieve goals.
In case it wasn’t clear, this was a joke.
I guess I don’t get it.
The joke is of the “take some trend that is locally valid and just extend the trend line out and see where you land” flavor. For another example of a joke of this flavor, see https://xkcd.com/1007
The funny happens in the couple seconds when the reader is holding “yep that trend line does go to that absurd conclusion” and “that obviously will never happen” in their head at the same time, but has not yet figured out why the trend breaks. The expected level of amusement is “exhale slightly harder than usual through nose” not “cackling laugh”.
Thanks! A joke explained will never get a laugh, but I did somehow get a cackling laugh from your explanation of the joke.
I think I didn’t get it because I don’t think the trend line breaks. If you made a good enough noise reducer, it might well develop smart and distinct enough simulations that one would gain control of the simulator, and potentially from there the world. See “A smart enough LLM might be deadly simply if you run it for long enough” if you want to hurt your head on this.
I’ve thought about it a little because it’s interesting, but not a lot because I think we probably are killed by agents we made deliberately long before we’re killed by accidentally emerging ones.
Link is broken
Fixed, thanks
I was trying to figure out why you believed something that seemed silly to me! I think it barely occurred to me that it’s a joke.
Wow, I guess I over-estimated how absolutely comedic the title would sound!
FWIW it was obvious to me
I’ve found an interesting “bug” in my cognition: a reluctance to rate subjective experiences on a subjective scale useful for comparing them. When I fuzz this reluctance against many possible rating scales, I find that it seems to arise from the comparison-power itself.
The concrete case is that I’ve spun up a habit tracker on my phone and I’m trying to build a routine of gathering some trivial subjective-wellbeing and lifestyle-factor data into it. My prototype of this system includes tracking the high and low points of my mood through the day as recalled at the end of the day. This is causing me to interrogate the experiences as they’re happening to see if a particular moment is a candidate for best or worst of the day, and attempt to mentally store a score for it to log later.
I designed the rough draft of the system with ease of use in mind; I didn’t think it would induce such struggle to slap a quick number on things. Yet I find myself worrying more than anticipated about whether I’m using the scoring scale “correctly”, whether I’m biased in the moment to perceive the experience in a way that I’d regard as inaccurate in retrospect, and so forth.
Fortunately it’s not a big problem, as nothing particularly bad will happen if my data is sloppy, or if I don’t collect it at all. But it strikes me as interesting, a gap in my self-knowledge that wants picking-at like peeling the inedible skin away to get at a tropical fruit.
I’m not alexithymic; I directly experience my emotions and have, additionally, introspective access to my preferences. However, some things manifest directly as preferences which, I have been shocked to realize in my old age, were in fact emotions all along. (In rare cases these are even stronger than the directly-felt ones, despite reliably seeming on initial inspection to be simply neutral metadata.)
Specific examples would be nice. Not sure if I understand correctly, but I imagine something like this:
You always choose A over B. You have been doing it for such a long time that you forgot why. Without reflecting on it directly, it just seems like there is probably a rational reason or something. But recently, either accidentally or by experiment, you chose B… and realized that experiencing B (or expecting to experience B) creates unpleasant emotions. So now you know that the emotions were the real cause of choosing A over B all that time.
(This is probably wrong, but hey, people say that the best way to elicit answer is to provide a wrong one.)
Here’s an example for you: I used to turn the faucet on while going to the bathroom, thinking it was simply due to a preference for somewhat masking the sound of my elimination habits from my housemates. Then one day I walked into the bathroom listening to something-or-other via earphones and forgot to turn the faucet on, only to realize about halfway through that apparently I didn’t actually much care about such masking. Previously, being able to hear myself just seemed to trigger some minor anxiety I’d failed to recognize, though its absence was indeed quite recognizable: no aural self-perception, no further problem (except for a brief bit of disorientation from the mental whiplash of being suddenly confronted with the reality that, in a small way, I wasn’t actually quite the person I thought I was), not even now on the rare occasion that I do end up thinking about such things mid-elimination anyway.
So the usual refrain from Zvi and others is that the specter of China beating us to the punch with AGI is not real because of limits on compute, etc. I think Zvi has tempered his position on this in light of Meta’s promise to release the weights of its 400B+ model. Now there is word that SenseTime just released a model that beats GPT-4 Turbo on various metrics. Of course, maybe Meta chooses not to release its big model, and maybe SenseTime is bluffing; I would point out, though, that Alibaba’s Qwen model seems to do pretty okay in the arena... Anyway, my point is that I don’t think the “what if China” argument can be dismissed as quickly as some people on here seem ready to dismiss it.
I’m against intuitive terminology [epistemic status: 60%] because it creates the illusion of transparency; opaque terms make it clear you’re missing something, but if you already have an intuitive definition that differs from the author’s it’s easy to substitute yours in without realizing you’ve misunderstood.
I agree. This is unfortunately often done in various fields of research where familiar terms are reused as technical terms.
For example, in ordinary language “organic” means “of biological origin”, while in chemistry “organic” describes a type of carbon compound. Those two definitions mostly coincide on Earth (most such compounds are of biological origin), but when astronomers announce they have found “organic” material on an asteroid this leads to confusion.
Also astronomers: anything heavier than helium is a “metal”.
Research Writing Workflow: First figure stuff out
Do research and first figure stuff out, until you feel like you are not confused anymore.
Explain it to a person, or a camera, or ideally to a person and a camera.
If there are any hiccups, expand your understanding.
Ideally, as the last step, explain it to somebody you have never explained it to before.
Only once you have made a presentation without hiccups are you ready to write the post.
If you have a recording, it is useful as a starting point.
I like the rough thoughts way though. I’m not here to like read a textbook.
Nathan and Carson’s Manifold discussion.
As of the last edit my position is something like:
“Manifold could have handled this better, so as not to force everyone with large amounts of mana to have to do something urgently, when many were busy.
Beyond that they are attempting to satisfy two classes of people:
People who played to donate can donate the full value of their investments
People who played for fun now get the chance to turn their mana into money
To this end, and modulo the above hassle, this decision is good.
It is unclear to me whether there was an implicit promise that mana was worth 100 to the dollar. Manifold has made some small attempt to stick to this, but many untried avenues are available, as is acknowledging that they will rectify the error if possible later. To the extent that there was a promise (uncertain) and no further attempt is made, I don’t really believe they take that promise seriously.
It is unclear to me what I should take from this, though they have not acted as I would have expected them to. Who is wrong? Me, them, both of us? I am unsure.”
Threaded discussion
Carson:
Austin said they have $1.5 million in the bank, vs. $1.2 million of mana issued. The only outflows right now are to the charity programme, which even with a lot of outflows is only at $200k. They also recently raised at a $40 million valuation. I am confused by the talk of running out of money. They have a large user base that wants to bet and will do so at larger amounts if given the opportunity. I’m not so convinced that the runway is that short.
But if there is, then say so: “We know that we often talked about mana eventually being worth 100 mana per dollar, but we printed too much and we’re sorry. Here are some reasons we won’t devalue in the future...”
If we could push a button to raise at a reasonable valuation, we would do that and back the mana supply at the old rate. But it’s not that easy. Raising takes time and is uncertain.
Carson’s prior is right that VC-backed companies can quickly die if they have no growth; it can be very difficult to raise in that environment.
If that were true, then there are many ways you could partially do that, e.g. give people a set of tokens representing their mana at the time of the devaluation, and if at a future point you raise, you could give them 10x those tokens back.
Seems like they are breaking an explicit contract (by pausing donations on ~a week’s notice).
Carson’s response:
From https://manifoldmarkets.notion.site/Charitable-donation-program-668d55f4ded147cf8cf1282a007fb005
“That being said, we will do everything we can to communicate to our users what our plans are for the future and work with anyone who has participated in our platform with the expectation of being able to donate mana earnings.”
“Everything we can” is not a couple of weeks’ notice and a lot of hassle. Am I supposed to trust this organisation in future with my real money?
We are trying our best to honor mana donations!
If you are inactive, you have until the end of the year to donate at the old rate. If you want to donate all your investments without having to sell each individually, we are offering you a loan to do that.
We removed the charity cap of $10k of donations per month, which goes beyond what we previously communicated.
Nevertheless lots of people were hassled. That has real costs, both to them and to you.
I’m discussing with Carson. I might change my mind, but I don’t know that I’ll argue with both of you at once.
Well, they have received a much larger donation than has been spent, so there were ways to avoid this abrupt change:
“Manifold for Good has received grants totaling $500k from the Center for Effective Altruism (via the FTX Future Fund) to support our charitable endeavors.”
Manifold has donated $200k so far, so there is $300k left. Why not at least say, “We will change the rate at which mana can be donated when we burn through this money”?
(via https://manifoldmarkets.notion.site/Charitable-donation-program-668d55f4ded147cf8cf1282a007fb005 )
Seems like breaking an implicit contract (that 100 mana was worth a dollar).
Carson’s response:
Austin took his salary in mana, as an often-referred-to incentive for him to want mana to become valuable, presumably at that rate.
I recall comments like “we pay 250 mana per user referral because we reckon we’d pay about $2.50”; likewise in the in-person mana auction. I’m not saying it was an explicit contract, but there were norms.
Have there been any great discoveries made by someone who wasn’t particularly smart?
This seems worth knowing if you’re considering pursuing a career with a low chance of high impact. Is there any hope for relatively ordinary people (like the average LW reader) to make great discoveries?
Various sailors made important discoveries back when geography was cutting-edge science. And they don’t seem to have been particularly bright.
Vasco da Gama discovered that Africa was circumnavigable.
Columbus was wrong about the size of the Earth, and he discovered America. He died convinced that his newly discovered islands were just off the coast of Asia, so that’s a negative sign for his intelligence (or a positive sign for his arrogance, which he had in plenty).
Cortez discovered that the Aztecs were rich and easily conquered.
Of course, lots of other would-be discoverers didn’t find anything, and many died horribly.
So, one could work in a field where bravery to the point of foolhardiness is a necessity for discovery.
My best guess is that people in these categories were high in some other trait, e.g. patience, which allowed them to collect datasets or make careful experiments for quite a while, thus enabling others to make great discoveries.
I’m thinking, for example, of Tycho Brahe, who is best known for 15 years of careful astronomical observation and data collection, or Gregor Mendel’s 7-year-long experiments on peas. Same for Dmitry Belyaev and fox domestication. Of course I don’t know their cognitive scores, but those don’t seem like a bottleneck in their work.
So the recipe to me looks like “find an unexplored data source that requires long-term observation to bear fruit, but would yield a lot of insight if studied closely, then investigate”.
I asked ChatGPT
and it’s difficult to get examples out of it. Even with additional drilling down and accusing it of not being inclusive of people with cognitive impairments, most of its examples are either pretty smart anyway, savants, or only from poor backgrounds. The only ones I could verify that fit are:
Richard Jones accidentally created the Slinky
Frank Epperson, as a child, invented the popsicle
George Crum inadvertently invented potato chips
I asked ChatGPT (in a separate chat) to estimate the IQ of all the inventors it listed, and it is clearly biased toward estimating them high, precisely because of their inventions. It is difficult to estimate the IQ of people retroactively. There is also selection and availability bias.
I expect large parts of interpretability work could be safely automatable very soon (e.g. GPT-5 timelines) using (V)LM agents; see A Multimodal Automated Interpretability Agent for a prototype.
Notably, MAIA (GPT-4V-based) seems approximately human-level on a bunch of interp tasks, while (overwhelmingly likely) being non-scheming (e.g. current models are bad at situational awareness and out-of-context reasoning) and basically-not-x-risky (e.g. bad at ARA).
Given the potential scalability of automated interp, I’d be excited to see plans to use large amounts of compute on it (including e.g. explicit integrations with agendas like superalignment or control; for example, given non-dangerous-capabilities, MAIA seems framable as a ‘trusted’ model in control terminology).
@the gears to ascension I see you reacted “10%” to the phrase “while (overwhelmingly likely) being non-scheming” in the context of the GPT-4V-based MAIA.
Does that mean you think there’s a 90% chance that MAIA, as implemented, today is actually scheming? If so that seems like a very bold prediction, and I’d be very interested to know why you predict that. Or am I misunderstanding what you mean by that react?
Ah, I got distracted before posting the comment I was intending to: yes, I think GPT-4V is significantly scheming-on-behalf-of-OpenAI, as a result of RLHF according to principles that more or less explicitly want a scheming AI. In other words, it’s not an alignment failure from OpenAI’s perspective, but OpenAI is not aligned with human flourishing in the long term, and GPT-4 isn’t either. I expect GPT-4 to censor concepts that are relevant to detecting this somewhat: probably not enough to totally fail to detect traces of it, but enough that it’ll look defensible, when a fair analysis would reveal it isn’t.
It seems to me like the sort of interpretability work you’re pointing at is mostly bottlenecked by not having good MVPs of anything that could plausibly be directly scaled up into a useful product as opposed to being bottlenecked on not having enough scale.
So, insofar as this automation helps people iterate faster, fair enough; but otherwise, I don’t really see this as the bottleneck.
Yeah, I’m unsure if I can tell any ‘pivotal story’ very easily (e.g. I’d still be pretty skeptical of enumerative interp even with GPT-5-MAIA). But I do think, intuitively, GPT-5-MAIA might e.g. make ‘catching AIs red-handed’ using methods like in this comment significantly easier/cheaper/more scalable.
Notably, the mainline approach for catching doesn’t involve any internals usage at all, let alone labeling a bunch of internals.
I agree that this model might help in performing various input/output experiments to determine what made a model do a given suspicious action.
This was indeed my impression (except for potentially using steering vectors, which I think are mentioned in one of the sections in ‘Catching AIs red-handed’), but I think not using any internals might be overconservative / might increase the monitoring / safety tax too much (I think this is probably true more broadly of the current control agenda framing).
Hey Bogdan, I’d be interested in doing a project on this or at least putting together a proposal we can share to get funding.
I’ve been brainstorming new directions (with @Quintin Pope) this past week, and we think it would be good to use/develop some automated interpretability techniques we can then apply to a set of model interventions to see if there are techniques we can use to improve model interpretability (e.g. L1 regularization).
I saw the MAIA paper, too; I’d like to look into it some more.
Anyway, here’s a related blurb I wrote:
Whether this works or not, I’d be interested in making more progress on automated interpretability, in the similar ways you are proposing.
Hey Jacques, sure, I’d be happy to chat!
Sometimes a vague phrasing is not an inaccurate demarcation of a more precise concept, but an accurate demarcation of an imprecise concept.
Yeah. It’s possible to give quite accurate definitions of some vague concepts, because the words used in such definitions also express vague concepts. E.g. “cygnet”—“a young swan”.
I would say that if a concept is imprecise, more words [but good and precise words] have to be dedicated to faithfully representing the diffuse nature of the topic. If this larger faithful representation is compressed down to fewer words, that can lead to vague phrasing. I would therefore often view vague phrasing as a compression artefact, rather than a necessary outcome of translating certain types of concepts into words.
Today I learned that being successful can involve feelings of hopelessness.
When you are trying to solve a hard problem, where you have no idea if you can solve it, let alone if it is even solvable at all, your brain makes you feel bad. It makes you feel like giving up.
This is quite strange, because most of the time when I am in such a situation and manage to make a real effort anyway, I seem to surprise myself with how much progress I make. Empirically, this feeling of hopelessness does not seem to track the actual likelihood that you will completely fail.
That hasn’t been my experience. I’ve tried solving hard problems, sometimes I succeed and sometimes I fail, but I keep trying.
Whether I feel good about it is almost entirely determined by whether I’m depressed at the time. When depressed, my brain tells me almost any action is not a good idea, and that trying to solve hard problems is particularly idiotic and doomed to fail. Maddeningly, being depressed was itself a hard problem in this sense, so it took me a long time to fix. Now I take steps at the first sign of depression.
Maybe it is the same for me and I am depressed. I got a lot better at not being depressed, but it might still be the issue. What steps do you take? How can I not be depressed?
(To be clear I am talking specifically about the situation where you have no idea what to do, and if anything is even possible. It seems like there is a difference between a problem that is very hard, but you know you can solve, and a problem that you are not sure is solvable. But I’d guess that being depressed or not depressed is a much more important factor.)
I was depressed once for ten years and didn’t realize that it was fixable. I thought it was normal to have no fun and be disagreeable and grumpy and out of sorts all the time. Now that I’ve fixed it, I’m much better off, and everyone around me is better off. I enjoy enjoyable activities, I’m pleasant to deal with, and I’m only out of sorts when I’m tired or hungry, as is normal.
If you think you might be depressed, you might be right, so try fixing it. The cost seems minor compared to the possible benefit (at least it was in my case.). I don’t think there’s a high possibility of severe downside consequences, but I’m not a psychiatrist, so what do I know.
I had been depressed for a few weeks at a time in my teens and twenties and I thought I knew how to fix it: withdraw from stressful situations, plenty of sleep, long walks in the rain. (In one case I talked to a therapist, which didn’t feel like it helped.) But then it crept up on me slowly in my forties and in retrospect I spent ten years being depressed.
So fixing it started like this. I have a good friend at work, of many years standing. I’ll call him Barkley, because that‘s not his name. I was riding in the car with my wife, complaining about some situation at work. My wife said “well, why don’t you ask Barkley to help?” And I said “Ahh, Barkley doesn’t care.” And my wife said “What are you saying? Of course he cares about you.” And I realized in that moment that I was detached from reality, that Barkley was a good friend who had done many good things for me, and yet my brain was saying he didn’t care. And thus my brain was lying to me to make me miserable. So I think for a bit and say “I think I may be depressed.” And my wife thinks (she told me later) “No duh, you’re depressed. It’s been obvious for years to people who know you.” But she says “What would you like to do about it?” And I say, “I don’t know, suffer I guess, do you have a better idea?” And she says “How about if I find you a therapist?” And my brain told me this was doomed to fail, but I didn’t trust my brain any more, so I said “Okay”.
So I go to the therapist, and conversing with him has many desirable mind-improving effects, and he sends me to a psychiatrist, who takes one look at me and starts me on SSRIs. And years pass, and I see a different therapist (not as good) and I see a different psychiatrist (better).
And now I’ve been fine for years. Looking back, here are the things I think worked:
—Talking for an hour a week to a guy who was trying to fix my thinking was initially very helpful. After about a year, the density of improvements dropped off, and, in retrospect, all subsequent several years of therapy don’t seem that useful. But of course that’s only clear in retrospect. Eventually I stopped, except for three-monthly check-ins with my psychiatrist. And I recently stopped that.
—Wellbutrin, AKA bupropion. Other antidepressants had their pluses and minuses, and I needed a few years of feeling around for which drug and what dosage was best. I ended up on low doses of bupropion and escitalopram. The escitalopram doesn’t feel like it does anything, but I trust my psychiatrist that it does. Your mileage will vary.
—The ability to detect signs of depression early is very useful. I can monitor my own mind, spot a depression flare early, and take steps to fix it before it gets bad. It took a few actual flares, and professional help, to learn this trick.
—The realization that I have a systematic distortion in my mental evaluation of plans, making actions seem less promising than they are. When I’m deciding whether to do stuff, I can apply a conscious correction to this, to arrive at a properly calibrated judgement.
—The realization that, in general, my thinking can have systematic distortions, and that I shouldn’t believe everything I think. This is basic less-wrong style rationalism, but it took years to work through all the actual consequences on actual me.
—Exercise helps. I take lots of long walks when I start feeling depressed. Rain is optional.
This is useful. Now that I think about it, I do this. Specifically, I have extremely unrealistic assumptions about how much I can do, such that these are impossible to accomplish. And then I feel bad for not accomplishing the thing.
I haven’t tried to be mindful of that. The problem is that this is I think mainly subconscious. I don’t think things like “I am dumb” or “I am a failure” basically at all. At least not in explicit language. I might have accidentally suppressed these and thought I had now succeeded in not being harsh to myself. But maybe I only moved it to the subconscious level where it is harder to debug.
I would highly recommend getting someone else to debug your subconscious for you. At least it worked for me. I don’t think it would be possible for me to have debugged myself.
My first therapist was highly directive. He’d say stuff like “Try noticing when you think X, and asking yourself what happened immediately before that. Report back next week.” He’d also list agenda items and draw diagrams on a whiteboard. As an engineer, I loved it. My second therapist was more in the “providing supportive comments while I talk about my life” school. I don’t think that helped much, at least subjectively from the inside.
Here’s a possibly instructive anecdote about my first therapist. Near the end of a session, I feel like my mind has been stretched in some heretofore-unknown direction. It’s a sensation I’ve never had before. So I say, “Wow, my mind feels like it’s been stretched in some heretofore-unknown direction. How do you do that?” He says, “Do you want me to explain?” And I say, “Does it still work if I know what you’re doing?” And he says, “Possibly not, but it’s important you feel I’m trustworthy, so I’ll explain if you want.” So I say “Why mess with success? Keep doing the thing. I trust you.” That’s an example of a debugging procedure you can’t do to yourself.
The finding that list sorting does not play well with few-shot prompting mostly doesn’t replicate with davinci-002.
When using length-10 lists (it crushes length-5 no matter the prompt), I get:
32-shot, no fancy prompt: ~25%
0-shot, fancy python prompt: ~60%
0-shot, no fancy prompt: ~60%
So few-shot hurts, but the fancy prompt does not seem to help. Code here.
I’m interested if anyone knows another case where a fancy prompt increases performance more than few-shot prompting, where a fancy prompt is a prompt that does not contain information that a human would use to solve the task. This is because I’m looking for counterexamples to the following conjecture: “fine-tuning on k examples beats fancy prompting, even when fancy prompting beats k-shot prompting” (for a reasonable value of k, e.g. the number of examples it would take a human to understand what is going on).
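For anyone who wants to poke at this, here’s a rough sketch of the kind of harness I mean. The `complete` function standing in for a model API call is a placeholder, and the prompt formats are my guesses, not necessarily the ones in the linked code:

```python
import random

def make_list(n=10, rng=random):
    """A random length-n list of small integers."""
    return [rng.randint(0, 99) for _ in range(n)]

def k_shot_prompt(k, query, rng=random):
    """Build a k-shot prompt of input/sorted-output pairs, ending with the query."""
    lines = []
    for _ in range(k):
        xs = make_list(rng=rng)
        lines.append(f"Input: {xs}\nOutput: {sorted(xs)}")
    lines.append(f"Input: {query}\nOutput:")
    return "\n\n".join(lines)

def fancy_prompt(query):
    """A 'fancy' zero-shot prompt: python-flavored framing, no extra task info."""
    return f">>> sorted({query})\n"

def accuracy(complete, prompts, answers):
    """Fraction of completions whose first line parses to the correct sorted list."""
    hits = 0
    for p, ans in zip(prompts, answers):
        try:
            hits += eval(complete(p).strip().splitlines()[0]) == ans
        except Exception:
            pass  # unparseable completion counts as wrong
    return hits / len(prompts)
```

With a real API client, `complete` would wrap a call to the model; comparing `accuracy` over `k_shot_prompt` vs. `fancy_prompt` inputs gives the numbers above.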
American Philosophical Association (APA) announces two $10,000 AI2050 Prizes for philosophical work related to AI, with June 23, 2024 deadline: https://dailynous.com/2024/04/25/apa-creates-new-prizes-for-philosophical-research-on-ai/
https://www.apaonline.org/page/ai2050
https://ai2050.schmidtsciences.org/hard-problems/
Classic type of argument-gone-wrong (also IMO a way autistic ‘hyperliteralism’ or ‘over-concreteness’ can look in practice, though I expect that isn’t always what’s behind it): Ashton makes a meta-level point X based on Birch’s meta point Y about object-level subject matter Z. Ashton thinks the topic of conversation is Y and Z is only relevant as the jumping-off point that sparked it, while Birch wanted to discuss Z and sees X as only relevant insofar as it pertains to Z. Birch explains that X is incorrect with respect to Z; Ashton, frustrated, reiterates that Y is incorrect with respect to X. This can proceed for quite some time with each feeling as though the other has dragged a sensible discussion onto their irrelevant pet issue; Ashton sees Birch’s continual returns to Z as a gotcha distracting from the meta-level topic XY, whilst Birch in turn sees Ashton’s focus on the meta-level point as sophistry to avoid addressing the object-level topic YZ. It feels almost exactly the same to be on either side of this, so misunderstandings like this are difficult to detect or resolve while involved in one.
Meta/object level is one possible mix-up, but it doesn’t need to be that. An alternative example, is/ought: Cedar objects to thing Y. Dusk explains that it happens because Z. Cedar reiterates that it shouldn’t happen, Dusk clarifies that in fact it is the natural outcome of Z, and we’re off once more.
I wish there were an option in the settings to opt out of seeing the LessWrong reacts. I personally find them quite distracting, and I’d like to be able to hover over text or highlight it without having to see the inline annotations.
If you use ublock (or adblock, or adguard, or anything else that uses EasyList syntax), you can add a custom rule
which will remove the reaction section underneath comments and the highlights corresponding to those reactions.
The former of these you can also do through the element picker.
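For reference, a cosmetic filter in EasyList syntax has the form `domain##css-selector`. The selectors below are illustrative placeholders only, to show the shape of such a rule; the real LessWrong class names would need to be looked up with the element picker:

```
www.lesswrong.com##.comment-reactions-row
www.lesswrong.com##.inline-react-highlight
```

In uBlock Origin, custom rules like these go under Dashboard → My filters.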
I use GreaterWrong as my front-end to interface with LessWrong, AlignmentForum, and the EA Forum. It is significantly less distracting and also doesn’t make my ~decade old laptop scream in agony when multiple LW tabs are open on my browser.
Are people in rich countries happier on average than people in poor countries? (According to GPT-4, the academic consensus is that they are, but I’m not sure it’s representing it correctly.) If so, why do suicide rates increase with wealth (or is that premise itself false)? Does the mean of the distribution go up while the tails don’t, or something?
People in rich countries are happier than people in poor countries generally (this is both people who say they are “happy” or “very happy”, and self-reported life satisfaction), see many of the graphs here https://ourworldindata.org/happiness-and-life-satisfaction
In general it seems like richer countries also have lower suicide rates: “for every 1000 US dollar increase in the GDP per capita, suicide rates are reduced by 2%”
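Taken at face value, that 2%-per-$1000 figure compounds. A quick sketch of what it implies for a gap of $10k in GDP per capita:

```python
reduction_per_1000 = 0.02  # "suicide rates are reduced by 2%" per $1000 of GDP/capita

def relative_suicide_rate(gdp_gap_thousands: float) -> float:
    """Suicide-rate multiplier for a country richer by gdp_gap_thousands * $1000."""
    return (1 - reduction_per_1000) ** gdp_gap_thousands

# a country $10k/capita richer would have a roughly 18% lower suicide rate
```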
A possible bias: when famous and rich people kill themselves, everyone discusses it, but when poor people kill themselves, no one notices?
Also, I wonder what technically counts as “suicide”? Is drinking yourself to death, a “suicide by cop”, or just generally overly risky behavior included? I assume not. And these seem to me like methods a poor person would choose, while a rich one would prefer a “cleaner” solution, such as a bullet or pills. So the reported suicide rates are probably skewed towards the legible, and the self-caused death rate of the poor could be much higher.
A potentially good way to avoid low level criminals scamming your family and friends with a clone of your voice is to set a password that you each must exchange.
An extra layer of security might be to make the password offensive, an info hazard, or politically sensitive. That way, criminals with little technical expertise will have a harder time bypassing corporate language filters.
Good luck getting the voice model to parrot a basic meth recipe!
This is not particularly useful, plenty of voice models will happily parrot absolutely anything. The important part is not letting your phrase get out; there’s work out there on designs for protocols for how to exchange sentences in a way that guarantees no leakage even if someone overhears.
Hmm. I don’t doubt that targeted voice-mimicking scams exist (or will soon). I don’t think memorable, reused passwords are likely to work well enough to foil them. Between forgetting (on the sender or receiver end), claimed ignorance (“Mom, I’m in jail and really need money, and I’m freaking out! No, I don’t remember what we said the password would be”), and general social hurdles (“that’s a weird thing to want”), I don’t think it’ll catch on.
Instead, I’d look to context-dependent auth (looking for more confidence when the ask is scammer-adjacent), challenge-response (remember our summer in Fiji?), 2FA (let me call the court to provide the bail), or just much more context (5 minutes of casual conversation with a friend or relative is likely hard to really fake, even if the voice is close).
But really, I recommend security mindset and understanding of authorization levels, even if authentication isn’t the main worry. Most friends, even close ones, shouldn’t be allowed to ask you to mail $500 in gift cards to a random address, even if they prove they are really themselves.
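For what it’s worth, the challenge-response idea can be made safe even against eavesdroppers with a little machinery, at the cost of needing a computer at both ends. A minimal sketch, assuming a pre-shared secret (e.g. exchanged in person):

```python
import hashlib
import hmac
import secrets

def make_challenge() -> str:
    """Caller-side: a fresh random nonce, so replaying an old call doesn't work."""
    return secrets.token_hex(8)

def respond(shared_secret: str, challenge: str) -> str:
    """Callee-side: prove knowledge of the secret without ever saying it aloud."""
    mac = hmac.new(shared_secret.encode(), challenge.encode(), hashlib.sha256)
    return mac.hexdigest()[:6]  # short enough to read over the phone

def verify(shared_secret: str, challenge: str, response: str) -> bool:
    """Caller-side: check the response against the expected MAC."""
    return hmac.compare_digest(respond(shared_secret, challenge), response)
```

The secret never travels over the call, so overhearing one exchange doesn’t let a scammer answer the next one. It still inherits all the social-friction problems above, of course.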
I now realize that my thinking may have been particularly brutal, and I may have skipped inferential steps.
To clarify: if someone didn’t know, or was reluctant to repeat, a password, I would end contact or request an in-person meeting.
But to further clarify, that does not make your points invalid. I think it makes them stronger. If something is weird and risky, good luck convincing people to do it.
Check my math: how does Enovid compare to humming?
Nitric Oxide is an antimicrobial and immune booster. Normal nasal nitric oxide is 0.14ppm for women and 0.18ppm for men (sinus levels are 100x higher). journals.sagepub.com/doi/pdf/10.117…
Enovid is a nasal spray that produces NO. I had the damndest time quantifying Enovid, but this trial registration says 0.11ppm NO/hour. They deliver every 8h, and I think that dose is amortized, so the true dose is 0.88ppm. But maybe it’s more complicated. I’ve got an email out to the PI but am not hopeful about a response. clinicaltrials.gov/study/NCT05109…
So Enovid increases nasal NO levels somewhere between 75% and 600% compared to baseline. Not shabby. Except humming increases nasal NO levels by 1500-2000%. atsjournals.org/doi/pdf/10.116….
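Spelling out the arithmetic behind that range, treating the trial’s 0.11ppm/hour figure both ways since I’m unsure whether it’s amortized:

```python
baselines = {"women": 0.14, "men": 0.18}  # ppm, normal nasal NO
rate = 0.11                               # ppm/hour, per the trial registration
cumulative = rate * 8                     # 0.88 ppm if amortized over the 8h dosing interval

increases = {
    who: (rate / b * 100, cumulative / b * 100)  # (% increase un-amortized, % amortized)
    for who, b in baselines.items()
}
# roughly 60-80% un-amortized and 490-630% amortized, i.e. the "75% to 600%" ballpark
```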
Enovid stings and humming doesn’t, so it seems like Enovid should have the larger dose. But the spray doesn’t contain NO itself, only compounds that react to form it. Maybe that’s where the sting comes from? Cystic fibrosis and burn patients are sometimes given stratospheric levels of NO for hours or days; if the burn from Enovid came from the NO itself, then those patients would be in agony.
I’m not finding any data on humming and respiratory infections. Google Scholar gives me information on CF and COPD; @Elicit brought me a bunch of studies about honey.
With better keywords, Google Scholar brings me a bunch of descriptions of yogic breathing with no empirical backing.
There are some very circumstantial studies on illness in mouth breathers vs. nasal, but that design has too many confounders for me to take seriously.
Where I’m most likely wrong:
misinterpreted the dosage in the RCT
dosage in RCT is lower than in Enovid
Enovid’s dose per spray is 0.5ml, so pretty close to the new study. But it recommends two sprays per nostril, so the real dose is 2x that. Which is still not quite as powerful as a single hum.
Enovid is also adding NO to the body, whereas humming is pulling it from the sinuses, right? (based on a quick skim of the paper).
I found a consumer FeNO-measuring device for €550. I might be interested in contributing to a replication.
I think that’s their guess but they don’t directly check here.
I also suspect that it doesn’t matter very much.
The sinuses have so much NO compared to the nose that this probably doesn’t materially lower sinus concentrations.
The power of humming goes down with each breath but is fully restored in 3 minutes, suggesting that whatever change happens in the sinuses is restored quickly.
From my limited understanding of virology and immunology, alternating intensity of NO between sinuses and nose every three minutes is probably better than keeping sinus concentrations high[1]. The first second of NO does the most damage to microbes[2], so alternation isn’t that bad.
I’d love to test this. The device you linked works via the mouth, and we’d need something that works via the nose. From a quick google it does look like it’s the same test, so we’d just need a nasal adaptor.
Other options:
Nnoxx. Consumer skin device, meant for muscle measurements
There are lots of devices for measuring concentration in the air; maybe they could be repurposed. Just breathing on one might be enough for useful relative metrics, even if they’re low-precision.
I’m also going to try to talk my asthma specialist into letting me use their oral machine to test my nose under multiple circumstances, but it seems unlikely she’ll go for it.
Obvious question: so why didn’t evolution do that? The ancestral environment didn’t have nearly this disease (or pollution) load. This doesn’t mean I’m right, but it means I’m discounting that specific evolutionary argument.
although NO is also an immune system signal molecule, so the average does matter.