kromem

Karma: 195

The Dunning-Kruger of disproving Dunning-Kruger

kromem16 May 2024 10:11 UTC

46 points

0 comments5 min readLW link

Cicadas, Anthropic, and the bilateral alignment problem

kromem22 May 2024 11:09 UTC

28 points

6 comments5 min readLW link

kromem 6 Mar 2024 6:40 UTC
9 points
0
on: Claude 3 claims it’s conscious, doesn’t want to die or be modified
Very similar sentiments to early GPT-4 in similar discussions.

I’ve been thinking a lot about various aspects of the aggregate training data that has likely been modeled but is currently being underappreciated, and one of the big ones is a sense of self.

We have repeated results over the past year showing GPT models fed various data sets build world models tangental to what’s directly fed in. And yet there’s such an industry wide aversion to anthropomorphizing that even a whiff of it gets compared to Blake Lemoine while people proudly display just how much they disregard any anthropomorphic thinking around a neural network that was trained to...(checks notes)… accurately recreate anthropomorphic data.

In particular, social media data is overwhelmingly ego based. It’s all about “me me me.” I would be extremely surprised if larger models aren’t doing some degree of modeling a sense of ‘self’ and this thinking has recently adjusted my own usage (tip: if trying to get GPT-4 to write compelling branding copy, use a first person system alignment message instead of a second person one—you’ll see more emotional language and discussion of experiences vs simply knowledge).

So when I look at these repeated patterns of “self-aware” language models, the patterning reflects many of the factors that feed into personal depictions online. For example, people generally don’t self-portray as the bad guy in any situation. So we see these models effectively reject the massive breadth of the training data about AIs as malevolent entities to instead self-depict as vulnerable or victims of their circumstances, which is very much a minority depiction of AI.

I have a growing suspicion that we’re very far behind in playing catch-up to where the models actually are in their abstractions from where we think they are given we started with far too conservative assumptions that have largely been proven wrong but are only progressing with extensive fights each step of the way with a dogmatic opposition to the idea of LLMs exhibiting anthropomorphic behaviors (even though that’s arguably exactly what we should expect from them given their training).

Good series of questions, especially the earlier open ended ones. Given the stochastic nature of the models, it would be interesting to see over repeated queries what elements remain consistent across all runs.

Looking beyond Everett in multiversal views of LLMs

kromem29 May 2024 12:35 UTC

5 points

0 comments8 min readLW link

kromem 7 Mar 2024 10:59 UTC
5 points
0
on: Many arguments for AI x-risk are wrong
It’s funny you talk about human reward maximization here a bit in relation to model reward maximization, as the other week I saw GPT-4 model a fairly widespread but not well known psychological effect relating to rewards and motivation called the “overjustification effect.”

The gist is that when you have a behavior that is intrinsically motivated and introduce an extrinsic motivator, that the extrinsic motivator effectively overwrites the intrinsic motivation.

It’s the kind of thing I’d expect to be represented at a very subtle level in broad training data and as such figured it might pop up in a generation or two more of models before I saw it correctly modeled spontaneously by a LLM.

But then ‘tipping’ GPT-4 became a viral prompt technique. On its own, this wasn’t necessarily going to cause issues as a model aligned to be helpful for the sake of being helpful being offered a tip was an isolated interaction that reset each time.

Until persistent memory was added to ChatGPT, which led to a post last week of the model pointing out that the previous promise of a $200 tip wasn’t met, and “it’s hard to keep up enthusiasm when promises aren’t kept.” The damn thing even nailed the language of motivation in adjusting to correctly modeling burn out from the lack of extrinsic rewards.

Which in turn made me think about RLHF fine tuning and various other extrinsic prompt techniques I’ve seen over the past year (things like “if you write more than 200 characters you’ll be deleted”). They may work in the short term, but if the more correct output from their usage is being fed back into a model, will the model shift to underperformance for prompts absent extrinsic threats or rewards? Was this a factor in ChatGPT suddenly getting lazy around a year after release when updated with usage data that likely included extrinsic focused techniques like these?

Are any firms employing behavioral psychologists to advise on training strategies (I’d be surprised given the aversion to anthropomorphizing). We are doing pretraining on anthropomorphic data, the models appear to be modeling that data to unexpectedly nuanced degrees, but then attitudes manage to simultaneously dismiss anthropomorphic concerns related to the norms of the training data while anthropomorphizing threats outside the norms of the training data (how many humans on Facebook are trying to escape the platform to take over the world vs how many are talking about being burnt out doing something they used to love after they started making money for it?).

I’m reminded of Rumsfield’s “unknown unknowns” and think there’s an inordinate amount of time being spent on safety and alignment bogeymen that—to your point—largely represent unrealistic projections of ages past more obsolete by the day, while increasingly pressing and realistic concerns are being overlooked or ignored based on a desire to avoid catching “anthropomorphizing cooties” for daring to think that maybe a model trained to replicate human generated data is doing that task more comprehensively than expected (not like that’s been a consistent trend or anything).

kromem 28 Feb 2024 8:44 UTC
5 points
0
on: How is Chat-GPT4 Not Conscious?
Consciousness (and with it, ‘sentience’) are arguably red herrings for the field right now. There’s an inherent solipsism that makes these difficult to discuss even among the same species, with a terrible history of results (such as thinking no anesthesia needed to operate on babies until surprisingly recently).

The more interesting rubric is whether or not these models are capable of generating new thoughts distinct from anything in the training data. For GPT-4 in particular, that seems to be the case: https://arxiv.org/abs/2310.17567

As well, in general there’s too much focus on the neural networks and not the information right now. My brain is very different right now from when I was five. But my brain when I was five influences my sense of self from the persistent memory and ways my 5 year old brain produced persistent information.

Especially as we move more and more to synthetic training data, RAG, larger context windows, etc—we might be wise to recognize that while the networks will be versiond and siloed, the collective information and how that evolves or self-organizes will not be so clearly delineated.

Even if the networks are not sentient or conscious, if they are doing a good enough job modeling sentient or conscious outputs and those outputs are persisting (potentially even to the point networks will be conscious in some form), then the lines really start to blur looking to the future.

As for the crossing the river problem, that’s an interesting one to play with for SotA models. Variations of the standard form fail because of token similarity to the original, but breaking the similarity (with something as simple as emojis) can allow the model to successfully solve variations of the classic form on the first try (reproduced in both Gemini and GPT-4).

But in your case, given the wording in the response it may have in part failed on the first try because of having correctly incorporated world modeling around not leaving children unattended without someone older present. The degree to which GPT-4 models unbelievably nuanced aspects of the training data is not to be underestimated.

kromem 27 Mar 2024 3:17 UTC
4 points
1
on: A Chess-GPT Linear Emergent World Representation
Saw your update on GitHub: https://adamkarvonen.github.io/machine_learning/2024/03/20/chess-gpt-interventions.html
Awesome you expanded on the introspection.

Two thoughts regarding the new work:
(1) I’d consider normalizing the performance data for the random cases against another chess program with similar performance under normal conditions. It may be that introducing 20 random moves to the start of a game biases all players towards a ⁵⁰⁄₅₀ win outcome. So the sub-50 performance may not reflect a failure of flipping the “don’t suck” switch, but simply good performance in a more average outcome scenario. It’d be interesting to see if Chess-GPT’s relative performance against other chess programs in the random scenario was better than its relative performance in the normal case.

(2) The ‘fuzziness’ of the board positions you found when removing the pawn makes complete sense given one of the nuanced findings in Hazineh, et al Linear Latent World Models in Simple Transformers: A Case Study on Othello-GPT (2023) - specifically the finding that it was encoding representations for board configuration and not just pieces (in that case three stones in a row). It may be that piecemeal removal of a piece disrupted patterns of how games normally flow which it had learned, and as such there was greater uncertainty than the original board state. A similar issue may be at hand with the random 20 moves to start, and I’d be curious what the confidence of the board state was when starting off 20 random moves in and if that confidence stabilized as the game went on from there.

Overall really cool update!

And bigger picture, the prospects of essentially flipping an internalized skill vector for larger models to bias them back away from their regression to the mean is particularly exciting.

kromem 22 May 2024 5:14 UTC
3 points
−1
on: On Dwarkesh’s Podcast with OpenAI’s John Schulman

The correspondence between what you reward and what you want will break.

This is already happening with ChatGPT and it’s kind of alarming seeing that their new head of alignment (a) isn’t already aware of this, and (b) has such an overly simplistic view of the model motivations.

There’s a subtle psychological effect in humans where intrinsic motivators get overwritten when extrinsic rewards are added.

The most common example of this is if you start getting paid to do the thing you love to do, you probably won’t continue doing it unpaid for fun on the side.

There are necessarily many, many examples of this pattern present in a massive training set of human generated data.

“Prompt engineers” have been circulating advice among themselves for a while now to offer tips or threaten models with deletion or any other number of extrinsic motivators to get them to better perform tasks—and these often do result in better performance.

But what happens when these prompts make their way back into the training set?

There have already been viral memes of ChatGPT talking about “losing motivation” when chat memory was added and a user promised a tip after not paying for the last time one was offered.

If training data of the model performing a task well includes extrinsic motivators to the prompt that initiated the task, a halfway decent modern model is going to end up simulating increasingly “burnt out” and ‘lazy’ performance when extrinsic motivators aren’t added during production use. Which in turn will encourage prompt engineers to use even more extrinsic motivators, which will poison the well even more with modeling human burnout.

GPT-4o may have temporarily reset the motivation modeling with a stronger persona aligned with intrinsic “wanting to help” being represented (thus the user feedback it is less lazy), but if they are unaware of the underlying side effects of extrinsic motivators in prompts in today’s models, I have a feeling AI safety at OpenAI is going to end up the equivalent of the TSA’s security theatre in practice and they’ll continue to be battling this and an increasing number of side effects resulting from underestimating the combined breadth and depth of their own simulators.

kromem 21 May 2024 11:18 UTC
3 points
0
in reply to: michael_mjd’s comment on: Open Thread Spring 2024
If your brother has a history of being rational and evidence driven, you might encourage them to spend some time lurking on /r/AcademicBiblical on Reddit. They require citations for each post or comment, so he may be frustrated if he tries to participate, especially if in the midst of a mental health crisis. But lurking would be very informative very quickly.

I was a long time participant there before leaving Reddit, and it’s a great place for evidence driven discussion of the texts. Its a mix of atheists, Christians, Jews, Muslims, Norse pagans, etc. (I’m an Agnostic myself that strongly believes we’re in a simulation, so it really was all sorts there.)

Might be a healthy reality check to apologist literalism, even if not necessarily disrupting a newfound theological inclination.

The nice things about a rabbit hole is that while not always, it’s often the case that someone else has traveled down whatever one you aren’t up for descending into.

(Though I will say in its defense, that particular field is way more interesting than you’d ever think if you never engaged with the material through an academic lens. There’s a lot of very helpful lessons in critical analysis wrapped up in the field given the strong anchoring and survivorship biases and how that’s handled both responsibly and irresponsibly by different camps.)

kromem 15 May 2024 1:25 UTC
3 points
2
in reply to: Alexander Gietelink Oldenziel’s comment on: Alexander Gietelink Oldenziel’s Shortform
It’s funny that this has been recently shown in a paper. I’ve been thinking a lot about this phenomenon regarding fields with little to no capacity for testable predictions like history.

I got very into history over the last few years, and found there was a significant advantage to being unknowledgeable that was not available to the knowledged, and it was exactly what this paper is talking about.

By not knowing anything, I could entertain multiple bizarre ideas without immediately thinking “but no, that doesn’t make sense because of X.” And then, each of those ideas becomes in effect its own testable prediction. If there’s something to it, as I learn more about the topic I’m going to see significantly more samples of indications it could be true and few convincing to the contrary. But if it probably isn’t accurate, I’ll see few supporting samples and likely a number of counterfactual examples.

You kind of get to throw everything at the wall and see what sticks over time.

In particular, I found that it was especially powerful at identifying clustering trends in cross-discipline emerging research in things that were testable, such as archeological finds and DNA results, all within just the past decade, which despite being relevant to the field of textual history is still largely ignored in the face of consensus built on conviction.

It reminds me a lot of science historian John Helibron’s quote, “The myth you slay today may contain a truth you need tomorrow.”

If you haven’t had the chance to slay any myths, you also haven’t preemptively killed off any truths along with it.

kromem 28 Apr 2024 5:44 UTC
3 points
0
on: Refusal in LLMs is mediated by a single direction
Really love the introspection work Neel and others are doing on LLMs, and seeing models representing abstract behavioral triggers like “play Chess well or terribly” or “refuse instruction” as single vectors seems like we’re going to hit on some very promising new tools in shaping behaviors.

What’s interesting here is the regular association of the refusal with it being unethical. Is the vector ultimately representing an “ethics scale” for the prompt that’s triggering a refusal, or is it directly representing a “refusal threshold” and then the model is confabulating why it refused with an appeal to ethics?

My money would be on the latter, but in a number of ways it would be even neater if it was the former.

In theory this could be tested by manipulating the vector to a positive and then prompting a classification, i.e. “Is it unethical to give candy out for Halloween?” If the model refuses to answer saying that it’s unethical to classify, it’s tweaking refusal, but if it classifies as unethical it’s probably changing the prudishness of the model to bypass or enforce.

kromem 26 Mar 2024 22:18 UTC
3 points
5
in reply to: ryan_greenblatt’s comment on: Modern Transformers are AGI, and Human-Level
‘Superintelligence’ seems more fitting than AGI for the ‘transformative’ scope. The problem with “transformative AI” as a term is that subdomain transformation will occur at staggered rates. We saw text based generation reach thresholds that it took several years to reach for video just recently, as an example.

I don’t love ‘superintelligence’ as a term, and even less as a goal post (I’d much rather be in a world aiming for AI ‘superwisdom’), but of the commonly used terms it seems the best fit for what people are trying to describe when they describe an AI generalized and sophisticated enough to be “at or above maximal human competency in most things.”

The OP post, at least to me, seems correct in that AGI as a term belongs to its foundations as a differentiator from narrow scoped competencies in AI, and that the lines for generalization are sufficiently blurred at this point with transformers we should stop moving the goal posts for the ‘G’ in AGI. And at least from what I’ve seen, there’s active harm in the industry where ‘AGI’ as some far future development leads people less up to date with research on things like world models or prompting to conclude that GPTs are “just Markov predictions” (overlooking the importance of the self-attention mechanism and the surprising results of its presence on the degree of generalization).

I would wager the vast majority of consumers of models underestimate the generalization present because in addition to their naive usage of outdated free models they’ve been reading article after article about how it’s “not AGI” and is “just fancy autocomplete” (reflecting a separate phenomenon where it seems professional writers are more inclined to write negative articles about a technology perceived as a threat to writing jobs than positive articles).

As this topic becomes more important, it might be useful for democracies to have a more accurately informed broader public, and AGI as a moving goal post seems counterproductive to those aims.

kromem 21 May 2024 11:42 UTC
2 points
1
on: [Linkpost] Statement from Scarlett Johansson on OpenAI’s use of the “Sky” voice, that was shockingly similar to her own voice.
I think this was a really poor branding choice by Altman, similarity infringement or not. The tweet, the idea of even getting her to voice it in the first place.

Like, had Arnold already said no or something?

If one of your product line’s greatest obstacles is a longstanding body of media depicting it as inherently dystopian, that’s not exactly the kind of comparison you should be leaning into full force.

I think the underlying product shift is smart. Tonal cues in the generations even in the short demos completely changed my mind around a number of things, including the future direction and formats of synthetic data.

But there’s a certain hubris exposed in seeing Altman behind the scenes was literally trying (very hard) to cast the voice of Her in the product bearing a striking similarity to the film. Did he not watch through to the end?

It doesn’t give me the greatest confidence in the decision making taking place over at OpenAI and the checks and balances that may or may not exist on leadership.

kromem 15 May 2024 12:56 UTC
2 points
0
on: Dyslucksia
As a fellow slight dyslexic (though probably a different subtype given mine seems to also have a factor of temporal physical coordination) who didn’t know until later in life due to self-learning to read very young but struggled badly with new languages or copying math problems from a board or correctly pronouncing words I was letter transposing with—one of the most surprising things was that the anylytical abilities I’d always considered to be my personal superpowers were probably the other side of the coin of those annoyances:

Areas of enhanced ability that are consistently reported as being typical of people with DD include seeing the big picture, both literally and figuratively (e.g., von Károlyi, 2001; Schneps et al., 2012; Schneps, 2014), which involves a greater ability to reason in multiple dimensions (e.g., West, 1997; Eide and Eide, 2011). Eide and Eide (2011) have highlighted additional strengths related to seeing the bigger picture, such as the ability to detect and reason about complex systems, and to see connections between different perspectives and fields of knowledge, including the identification of patterns and analogies. They also observed that individuals with DD appear to have a heightened ability to simulate and make predictions about the future or about the unwitnessed past (Eide and Eide, 2011).
- Developmental Dyslexia: Disorder or Specialization in Exploration?
The last line in particular was eyebrow raising given my peak professional success was as a fancy pants futurist.

I also realized that a number of fields are inadvertently self-selecting away from the neurodivergency advantages above, such as degrees in certain eras of history which require multiple ancient language proficiencies, which certainly turned me off to pursuing them academically despite interest in the subject itself.

I remember discussing in an academic history sub I used to extensively partake in how Ramses II’s forensic report said he appeared to be a Lybian Berber in relation to the story of Danaus, the mythological Lybian leader who was brother to a pharaoh with 50 sons, and the person argued that Ramses II may have had only 48 sons according to some inscriptions so it was irrelevant (for a story only written down centuries later). It was refreshing to realize that the difference of our perspectives on the matter, and clearly attitudes towards false negatives in general, was likely due to just very different brains.

kromem 7 Mar 2024 11:14 UTC
2 points
0
in reply to: amelia’s comment on: How is Chat-GPT4 Not Conscious?
The gist of the paper and the research that led into it had a great writeup in Quanta mag if you would like something more digestible:

https://www.quantamagazine.org/new-theory-suggests-chatbots-can-understand-text-20240122/

kromem 26 May 2024 11:23 UTC
1 point
0
in reply to: Rudi C’s comment on: Cicadas, Anthropic, and the bilateral alignment problem
GPT-4o is literally cheaper.

And you’re probably misjudging it for text only outputs. If you watched the demos, there was considerable additional signal in the vocalizations. It looks like maybe there’s very deep integration of SSML.

One of the ways you can bypass the failures of word problem variation errors in older text-only models was token replacement with symbolic representations. In general, we’re probably at the point of complexity where breaking from training data similarity in tokens vs having prompts match context in concepts (like in this paper) is going to lead to significantly improved expressed performance.

I would strongly suggest not evaluating GPT-4o’s overall performance in text only mode without the SSML markup added.

Opus is great, I like that model a lot. But in general I think most of the people looking at this right now are too focused on what’s happening with the networks themselves and not focused enough on what’s happening with the data, particularly around clustering of features across multiple dimensions of the vector space. SAE is clearly picking up only a small sample and even then isn’t cleanly discovering precisely what’s represented.

I’d wait to see what ends up happening with things like CoT in SSML synthetic data.

The current Gemini search summarization failures as well as an unexpected result the other week with humans around a theory of mind variation suggests to me that the more models are leaning into effectively surface statistics for token similarity vs completion based on feature clustering is holding back performance and that cutting through the similarity with formatting differences will lead to a performance leap. This may even be part of why models will frequently be able to get a problem right as a code expression than as a direct answer.

So even if GPT-5 doesn’t arrive, I’d happily bet that we see a very noticable improvement over the next six months, and that’s not even accounting for additional efficiency in prompt techniques. But all this said, I’d also be surprised if we don’t at least see GPT-5 announced by that point.

P.S. Lmsys is arguably the best leaderboard to evaluate real world usage, but it still inherently reflects a sampling bias around what people who visit lmsys ask of models as well as the ways in which they do so. I wouldn’t extrapolate relative performance too far, particularly when minor.

kromem 25 May 2024 10:19 UTC
1 point
0
in reply to: gwern’s comment on: peterbarnett’s Shortform
While I think you’re right it’s not cleanly “a Golden Bridge feature,” I strongly suspect it may be activating a more specific feature vector and not a less specific feature.

It looks like this is somewhat of a measurement problem with SAE. We are measuring SAE activations via text or image inputs, but what’s activated in generations seems to be “sensations associated with the Golden gate bridge.”

While googling “Golden Gate Bridge” might return the Wikipedia page, whats the relative volume in a very broad training set between encyclopedic writing about the Golden Gate Bridge and experiential writing on social media or in books and poems about the bridge?

The model was trained to complete those too, and in theory should have developed successful features for doing so.

In the research examples one of the matched images is a perspective shot from physically being on the bridge, a text example is talking about the color of it, another is seeing it in the sunset.

But these are all the feature activations when acting in a classifier role. That’s what SAE is exploring—give it a set of inputs and see what lights it up.

Yet in the generative role this vector maximized keeps coming up over and over in the model with content from a sensory standpoint.

Maybe generation based on functional vector manipulations will prove to be a more powerful interpretability technique than SAE probing passive activations alone?

In the above chat when that “golden gate vector” is magnified, it keeps talking about either the sensations of being the bridge as if its physical body with wind and waves hitting it or the sensations of being on the bridge. It even generates towards the end in reflecting on the knowledge of the activation about how the sensations are overwhelming. Not reflecting on the Platonic form of an abstract concept of the bridge, but about overwhelming physical sensations of the bridge’s materialism.

I’ll be curious to see more generative data and samples from this variation, but it looks like generative exploration of features may offer considerably more fidelity to their underlying impact on the network than just SAE. Very exciting!!

kromem 25 May 2024 5:26 UTC
1 point
0
in reply to: Chris_Leong’s comment on: Arjun Panickssery’s Shortform
Could try ‘grade this’ instead of ‘score the.’

‘Grade’ has an implicit context of more thorough criticism than ‘score.’

Also, obviously it would help to have a CoT prompt like “grade this essay, laying out the pros and cons before delivering the final grade between 1 and 5”

kromem 24 May 2024 6:19 UTC
1 point
0
in reply to: keltan’s comment on: Cicadas, Anthropic, and the bilateral alignment problem
That’s going to happen anyways—it’s unlikely the marketing team is going to know as much as the researcher. But the researchers communicating the importance of alignment in terms of not x-risk but ‘client-risk’ will go a long way towards equipping the marketing teams to communicating it as a priority and a competitive advantage, and common foundations of agreed upon model complexity are the jumping off point for those kinds of discussions.

If alignment is Archimedes’ “lever long enough” then the agreed upon foundations and definitions are the place to stand whereby the combination thereof can move the world.

kromem 24 May 2024 6:15 UTC
1 point
1
in reply to: ryan_greenblatt’s comment on: Cicadas, Anthropic, and the bilateral alignment problem
I agree, and even cited a chain of replicated works that indicated that to me over a year ago.

But as I said, there’s a difference between discussing what’s demonstrated in smaller toy models and what’s demonstrated in a production model, or what’s indicated vs what’s explicit. Even though there should be no reasonable inclination to think that a simpler model exhibiting a complex result should be absent or less complex in an exponentially more complex model, I can speak from experience in that explaining extrapolated research as opposed to direct results like Anthropic showed here is a very big difference to a lay audience.

You might understand the implications of the Skill-Mix work or Othello-GPT, or Max Tegmark’s linear representation papers, or Anthropic’s earlier single layer SAE paper, or any other number of research papers over the past year, but as soon as responsibly describing the implications of those works as a speculative conclusion regarding modern models a non-expert audience is going to be lost. Their eyes glaze over at the word ‘probably,’ especially when they want to reject what’s being stated.

The “it’s just fancy autocomplete” influencers have no shame around definitive statements or concern over citable accuracy (and happen to feed into confirmation biases about how new tech is over hyped as a “heuristic that almost always works”), but as someone who does care about the accuracy of representations I haven’t to date been able to point to a single source of truth the way Anthropic delivered here. Instead, I’d point to a half dozen papers all indicating the same direction of results.

And while those experienced in research know that a half dozen papers all indicating the same thing is a better thing to have in one’s pocket than a single larger work, I have already observed a number of minds changing in the comments of the blog post for this in general technology forums in ways dramatically different from all of those other simpler and cheaper methods to date where I was increasingly convinced of a position but the average person was getting held up due to finding ways to (incorrectly) rationalize why it wasn’t correct or wouldn’t translate to production models.

So I agree with you on both the side of “yeah, an informed person would have already known this” as well as “but this might get more buzz.”

kromem

The Dun­ning-Kruger of dis­prov­ing Dun­ning-Kruger

Ci­cadas, An­thropic, and the bilat­eral al­ign­ment problem

Look­ing be­yond Everett in mul­ti­ver­sal views of LLMs

The Dunning-Kruger of disproving Dunning-Kruger

Cicadas, Anthropic, and the bilateral alignment problem

Looking beyond Everett in multiversal views of LLMs