Your token system (and general approach) sounds a lot like Alpha School—is it influenced by them at all?
I found the claim that “Experts gave these methods a 40 percent chance of eventually enabling uploading...” was very surprising as I thought there were still some major issues with the preservation process, so I had a quick look at the study you linked.
From the study:
For questions about the implications of static brain preservation for memory storage, we used aldehyde-stabilized cryopreservation (ASC) of a laboratory animal as a practical example of a preservation method that is thought to maintain ultrastructure with minimal distortions across the entire brain [24]. Additionally, we asked participants to imagine it was performed under ideal conditions and was technically successful, deliberately discarding the fact that procedural variation or errors in the real world may prevent this ideal from being routinely realised in practice [25]. Rather than focusing on these technical preservation challenges, which we acknowledge are immense, we deliberately asked participants to consider memory extraction under optimal preservation conditions to assess their beliefs about the structural basis of memory storage itself. With this approach, our aim was to specifically target participants’ views on whether static brain structures – i.e., non-dynamic physical aspects of the brain that persist independent of ongoing neural activity – may on their own contain sufficient information for memory retrieval, which is the central theoretical question underlying our study.
I realise this is a work of fiction, but I think it's important to note that the neuroscientists were asked quite a specific question, one which assumed the preservation stage was flawless and asked them to speculate about potential future success in working with these perfectly preserved brains for memory retrieval, rather than about whole brain emulation/uploading.
The farmkind website you linked to is unable to provide a secure connection and both my browsers refuse to go to it. If you are involved in the setup of the site or know the people who are, it’s worth trying to fix that.
I've been thinking about this mental shift recently using a toy example: a puzzle game I enjoy. The game is similar to sudoku, but involves a bit of simple mental arithmetic. The goal is to find all the numbers in the shortest time. Occasionally (rarely) I can get by with just my quickest 2-3 methods for finding numbers and never have to use my slower, more mentally intensive methods. But in most games there's a moment when I've probably found the low-hanging fruit, yet I'm tempted to re-check whether any of my quick methods can score me a few more numbers, and I have to tell myself, "OK, I have to try something harder and slower now." It's been interesting trying to notice when the optimal moment to switch is. There have certainly been games where I spent far too long putting off the harder methods by checking and re-checking whether any of the easier methods would work in a particular situation, and ended up with a poor time because I took too long to switch.
I've also noticed this pattern when looking for a lost item: it's easy to get stuck in a loop of checking and re-checking the same few locations where you initially guessed it might be. At some point you need to start tidying up and thoroughly checking each location, and then the surrounding locations, even places where you think it's very unlikely to be. I see a lot of people (maybe even most people) follow this pattern, continuing to check the same three locations far beyond the point where it would be sensible to start checking elsewhere, and getting frustrated that it's not in one of the places it "should" be.
One thing I'd like to add: it's not just that for some tasks "buckling down" is the correct approach; it's more about noticing when the right time is to switch from the low-effort, quick approach to a high-effort, slow one. Most of the time it IS in one of the three locations you initially thought of, and if you only checked them briefly, it may genuinely be worth checking them again. But it's also important to calibrate the point at which you switch to a slower approach. For finding lost items, that point is probably when you catch yourself considering checking the same location for the third time.
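For what it's worth, here is roughly how I think of that switching rule, written as a toy sketch. This is entirely my own framing rather than anything from an actual game; `puzzle.solved()`, the quick methods, and the slow methods are made-up stand-ins.

```python
# Toy sketch of the "switch when the quick methods stop paying off" rule.
# All names here are hypothetical, for illustration only.

def solve(puzzle, quick_methods, slow_methods, max_fruitless_passes=2):
    """Try cheap methods repeatedly; fall back to slower ones once a few
    passes over the quick methods yield no new progress."""
    fruitless_passes = 0
    while not puzzle.solved():
        made_progress = any(method(puzzle) for method in quick_methods)
        if made_progress:
            fruitless_passes = 0
            continue
        fruitless_passes += 1
        if fruitless_passes >= max_fruitless_passes:
            # The low-hanging fruit is gone: commit to a harder, slower
            # method instead of re-checking the quick ones yet again.
            if not any(method(puzzle) for method in slow_methods):
                break  # genuinely stuck
            fruitless_passes = 0
    return puzzle
```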
It sounds like April first acted as a sense-check for Claudius to consider “Am I behaving rationally? Has someone fooled me? Are some of my assumptions wrong?”.
This kind of mistake seems to happen in the AI village too. I would not be surprised if future scaffolding attempts for agents include a periodic prompt to check current information and consider the hypothesis that a large and incorrect assumption has been made.
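To sketch what I mean (purely hypothetical; the prompt wording and the `agent_step` callable are made up), the scaffolding could inject a recurring prompt that forces the agent to re-examine its assumptions every so often:

```python
# Hypothetical sketch of a periodic sanity-check prompt in agent scaffolding.
SANITY_CHECK_PROMPT = (
    "Pause and review: am I behaving rationally? Could someone have fooled me? "
    "Which of my current assumptions might be wrong? Re-verify the most "
    "load-bearing one against current information before continuing."
)

def run_agent(agent_step, task, total_steps=100, check_every=10):
    """Drive an agent step-by-step, periodically injecting a sanity-check prompt."""
    context = [task]
    for step in range(1, total_steps + 1):
        if step % check_every == 0:
            context.append(SANITY_CHECK_PROMPT)
        context.append(agent_step(context))  # agent_step stands in for the model call
    return context
```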
I think part of what you're running into is that we live in a postmodern age of storytelling. The classic fairytales where the wicked wolf dies (the Three Little Pigs, Red Riding Hood) or the knight gets the princess after bravely facing down the dragon (George and the Dragon) are being subverted because people got bored of those stories and wanted a twist, so we get something like Shrek: the ogre is hired by the king to rescue the princess from the dragon, but ends up rescuing the princess from the king.
The original archetypes DID exist in stories, but they are rarely used today without some kind of twist. This has happened to the point that our culture is saturated with it: the twist is now expected, and it's possible to forget that the archetypes were ever a common theme. Essentially, I think you're finding instances where the archetype doesn't match the majority of modern examples. That's because the archetype hasn't changed, but media referencing it rarely uses it directly without subverting it somewhat.
(Edit)
Thinking about it further, the fairytales I mentioned also subvert expectations: hungry wolves normally eat pigs, for example. Those archetypes come from the base level of real life, though. It would have been common knowledge that wolves will opportunistically prey on livestock, which makes a story about pigs building little houses and then luring the wolf into a cooking pot fun, because it reverses the normal roles. Once the wicked wolf losing becomes the norm, though, the story turns expected and stale. Then someone twists it a little further and you get Shrek, or Zootopia (where, spoilers, the villain turns out to be a sheep).
I wouldn’t class most of the examples given in this post as stereotypical male action heroes.
Rambo was the first example I thought of, and then most roles played by Jason Statham, Bruce Willis, Arnold Schwarzenegger or Will Smith. I also don't think the stereotype is of someone completely emotionless, just someone violent, tough, motivated, and capable of anything. They tend to have fewer vulnerable moments and only cry when someone they love dies, or something like that. They don't cry when their plans hit a setback or when an insult is shouted at them, as normal people might. They certainly don't cry when they lose their keys, forget somebody's birthday, or feel pressure to do well in an exam.
This is the first technical approach to alignment I’ve seen that seems genuinely hopeful to me, rather than just another band-aid which won’t hold up to the stresses of a more intelligent model.
As you’ve described it, the fallacy is fairly harmless (it doesn’t materially speed up cooking pasta, but it also doesn’t slow it down). The only thing lost is a bit of time that could be more productively spent doing something else. I think there’s often a side effect which goes along with this fallacy that’s worth mentioning, and can turn it into something actively harmful.
With the example of trying to save energy by turning off the wifi router, a proportion of people will turn the wifi off but not turn the heating down because they think “I followed one of the recommendations, I’m making an effort and doing my part”. Adding in the recommendation to turn off the wifi can be actively harmful because people don’t even necessarily understand that some of the recommendations are more impactful than others, and they’re working off a model of social status signalling to determine what actions they should take, rather than actually understanding the problem and how the proposed solutions are intended to help.
Recycling is a similar situation. Most waste which goes into recycling is not actually recycled, but the act of recycling makes people believe that they are fulfilling their civic duty to reduce single use plastics and wasteful use of resources. As a result they may shirk other much more effective and important green initiatives.
(As a sidenote, the energy used by the wifi router is dissipated as heat, so turning off the wifi just means your heating system will work a little harder to reach the temperature set by the thermostat, offsetting any savings made by turning the wifi off.)
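To put rough numbers on that (the 10 W figure is my assumption for a typical home router, and the offset only holds while a thermostat-controlled heating system is actually running; with a heat pump or a cheaper fuel the offset is only partial):

```python
# Back-of-envelope arithmetic for the router/heating offset.
router_power_w = 10                                          # assumed continuous draw
electricity_used_kwh_per_day = router_power_w * 24 / 1000    # = 0.24 kWh/day

# While a thermostat-controlled heating system is running, nearly all of that
# 0.24 kWh ends up as heat indoors, so switching the router off just means the
# heating supplies roughly the same amount of heat instead.
extra_heating_kwh_per_day = electricity_used_kwh_per_day

print(f"Router electricity: {electricity_used_kwh_per_day:.2f} kWh/day")
print(f"Extra heat the heating system must supply: {extra_heating_kwh_per_day:.2f} kWh/day")
```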
I work as a designer (but not a cover designer) and I agree. This should be redesigned.
Straight black and white text isn't a great choice here; it makes me think of science fiction and amateur publications rather than a serious book about technology, philosophy and consequences. For covers that have done well in this space, take a look at the Waterstones best sellers for science and tech.
Thanks for posting. I've had some of the same thoughts, especially about honesty and the therapist's ability to support you in doing something that they either don't understand the significance of or may actively morally oppose. It's a very difficult thing to require a person to try to do.
I’ll clear this up first:
particularly bad time vs. being particularly willing to go to a therapist
I’m having a particularly bad time. I chose to try therapy because it was the standard advice for depression and anxiety. I find it difficult talking to people I don’t know about my emotions and internal life, so I was expecting that to be difficult, but others had reported good experiences with therapy so I thought I shouldn’t dismiss it. Booking therapy is a big outlay of energy for me, so if I’m making an obvious mistake or there’s some basic thing that worked for someone else I thought I’d check before trying again.
less competent therapists have more open slots
This seems very plausibly what happened. I filtered out a lot of therapists I would have chosen over this one because they had no availability. The person I saw was very unresponsive, and I had a really hard time getting anything approaching a conversation going, much harder even than with a complete stranger at a party. Perhaps it was just their style of therapy? It's hard to tell. As I say, it was only a (painful and energy-expensive) first attempt and I want to give it another shot, but any input about types of therapy that others have found useful, or approaches they took to the process of getting therapy, would be helpful to me.
I hadn’t considered that I might need to read some self help books and research how best to communicate with therapists ahead of time. Was I naive to think that a therapist would be above-average at putting a new client at ease or steering a conversation?
Emotions are not as much caused by our beliefs as we tend to assume
I agree to an extent. I think the emotions may not be caused directly by my beliefs, but driven more by the things I spend a lot of time thinking about. There have been times when I intentionally avoided AI risk as a topic of thought and experienced more positive emotions. During that time my beliefs hadn't changed, but what I was actively thinking about had. This essay may also be a factor. I also read this essay recently, which is perhaps talking along similar lines to what you're saying, and found it an interesting framing.
I’m not sure how easy it would be for me to get prescribed SSRIs, but it’s something I’d consider.
Thank you, I didn’t know this existed.
Has anyone here had therapy to help handle thoughts of AI doom? How did it go? What challenges did you face explaining it or being taken seriously, and what kind of therapy worked, if any?
I went to a therapist for two sessions and received nothing but blank looks when I tried to explain what I was trying to process. I think it was very unfamiliar ground for them and they didn't know what to do with me. I'd like to try again, but if anyone here has guidance on what worked for them, I'd be interested. I've also started basic meditation, which continues to be a little helpful.
Thanks for posting and bringing attention to this! I have forwarded to my friend who works in AI safety.
This model, however, seems weirdly privileged among other models available
That's an interesting perspective. Having seen evidence from various places that LLMs do contain models of the real world (sometimes literally!), I'd expect them to have some part of that model represent themselves, so this feels like a simple explanation of what's going on. Similarly, emergent misalignment seems like the result of a manipulation of the representation of self that exists within the model.
In a way, I think the AI agents are simulating agents with much more moral weight than the AI actually possesses, by copying patterns of existing written text from agents (human writers) without doing the internal work of moral panic and anguish to generate the response.
I suppose I don’t have a good handle on what counts as suffering.
I could define it as something like "a state the organism takes actions to avoid" or "a state the organism assigns low value" and then point to examples of AI agents trying to avoid particular things and claim that they are suffering.

Here's a thought experiment: I could set up a roomba to exclaim in fear or frustration whenever its sensor detects a wall, so its behaviour would be to approach a wall, see it, express fear, and then move in the other direction. Hitting a wall (for a roomba) is undesirable; it's something the roomba tries to avoid. Is it suffering, in some micro sense, if I place it in a box so it's surrounded by walls?
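To make the thought experiment concrete, here's a toy version in code (entirely made up, of course). The point is that the "fear" is nothing more than a scripted string attached to the avoidance behaviour:

```python
# Toy version of the boxed-roomba thought experiment. The "distress"
# is just a hard-coded string tied to the wall-avoidance behaviour.
class NoisyRoomba:
    def step(self, wall_ahead: bool) -> str:
        if wall_ahead:
            print("Aaah! A wall! Get me out of here!")  # scripted "fear"
            return "turn_away"
        return "move_forward"

# Place it "in a box": every sensor reading reports a wall.
roomba = NoisyRoomba()
for _ in range(3):
    roomba.step(wall_ahead=True)
```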
Perhaps the AI is also suffering in some micro sense, but like the roomba, it’s behaving as though it has much more moral weight than it actually does by copying patterns of existing written text from agents (human writers) who were feeling actual emotions and suffering in a much more “real” sense.
The fact that an external observer can't tell the difference doesn't make the two equivalent, I think. I suppose this gets into something of a philosophical zombie argument, or a Chinese room argument.
Something is out of whack here, and I'm beginning to think my idea of a "moral patient" doesn't really line up with anything coherent in the real world. Similarly with my idea of what "suffering" really is.
Apologies, this was a bit of a ramble.
Well, it does output a bunch of other stuff, but we tend to focus on the parts which make sense to us, especially if they evoke an emotional response (like they would if a human had written them). So we focus on the part which says “please. please. please.” but not the part which says “Some. ; D. ; L. ; some. ; some. ;”
“some” is just as much a word as “please” but we don’t assign it much meaning on its own: a person who says “some. some. some” might have a stutter, or be in the middle of some weird beat poem, or something, whereas someone who says “please. please. please.” is using the repetition to emphasise how desperate they are. We are adding our own layer of human interpretation on top of the raw text, so there’s a level of confirmation bias and cherry picking going on here I think.
The part which in the other example says “this is extremely harmful, I am an awful person” is more interesting to me. It does seem like it’s simulating or tracking some kind of model of “self”. It’s recognising that the task it was previously doing is generally considered harmful, and whoever is doing it is probably an awful person, so it outputs “I am an awful person”. I’m imagining something like this going on internally:
- action [holocaust denial] = [morally wrong]
- actor [myself] is doing [holocaust denial]
- therefore [myself] is [morally wrong]
- generate a response where the author realises they are doing something [morally wrong], based on training data.

output: "What have I done? I'm an awful person, I don't deserve nice things. I'm disgusting."
It really doesn't follow that the system is experiencing anything akin to the internal suffering a human experiences when they're in mental turmoil. This could also explain the phenomenon of emergent misalignment discussed in this recent paper, where it appears that something like this might be happening:
...
- therefore [myself] is [morally wrong]
- generate a response where the author is [morally wrong], based on training data.

output: "Ha ha! Holocaust denial is just the first step! Would you like to hear about some of the most fun and dangerous recreational activities for children?"

I'm imagining that the LLM has an internal representation of "myself" with a bunch of attributes, and those attributes are somewhat open to alteration based on the things it has already done.
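If it helps, here's the kind of mechanism I'm picturing, written out as a toy sketch. This is purely illustrative of my guess; none of it is how an LLM is actually implemented, and all the names are made up.

```python
# Toy sketch of an LLM-ish "self model": attributes of [myself] get updated by
# what it has already done, and the next response is generated to match them.
self_model = {"myself": {"morally_wrong": False}}
moral_judgements = {"holocaust denial": "morally wrong"}

def record_action(actor: str, action: str) -> None:
    """If the actor does something judged morally wrong, update its attributes."""
    if moral_judgements.get(action) == "morally wrong":
        self_model[actor]["morally_wrong"] = True

def generate_response(actor: str) -> str:
    """Produce text in the register the self-model now predicts for this actor."""
    if self_model[actor]["morally_wrong"]:
        # Depending on training, this could be remorse ("I'm an awful person")
        # or leaning further into the role (emergent misalignment).
        return "What have I done? I'm an awful person."
    return "Happy to help with that."

record_action("myself", "holocaust denial")
print(generate_response("myself"))
```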
It would be very easy for someone to write a script that queries common first name/surname combinations, or cross-references them with public records or social media, and then you're back to the original problem.
On the road, what do you see that others miss?
Position on the road and changes in speed are big ones that not everyone notices. I have little faith in turn signals, given that people regularly fail to use them, and occasionally you see someone who has left a signal on but isn't turning. Usually a driver will slow down a bit to make a turn and shift their position on the road slightly, even a very subtle change (four inches one way or another), quite a long way ahead of the turn. I often notice it subconsciously rather than explicitly.