“As you can see”, “serious adults”, “really” and “fine” all (mildly) demonstrate a sense of incredulity. “Look what these people actually believe! Just in case you thought it was a strawman.” It’s admittedly subtle, not stated, and I can see how someone could miss it. (I’ll feel pretty stupid if I’m wrong.)
Don’t book publication deals typically involve an exclusive license, not copyright assignment? (The effect is roughly the same, for the purposes of the question being answered here, of course.)
I would guess this is about “getting the right things into context”, not “being able to usefully process what is in context”. (AI already seems pretty good at the latter, for a broad though not universal set of tasks.)
It doesn’t sound quite right to me that there are different possible cultures for any given number of echoes. I think it’s more like… you memoize (compute on first use and also store for future use) what will or is likely to happen, in a conversation, as a result of saying a certain kind of thing. The thrust, or flavor, or whatever metaphor you prefer, of saying that kind of thing, starts to be associated with however the following conversation (or lack thereof) seems likely to go.
People don’t have to be consciously aware of all the levels at any one time, or even at all. Precomputed results can themselves derive from other precomputed results. Someone doesn’t have to be able to unpack one of these chains at all to use it. Sometimes some of the earlier judgments were actually made by someone else and the speaker is just parroting opinions he or she can’t justify! (This is not necessarily a criticism. Each human does not figure everything out from scratch for himself or herself. In the good cases, I think the chain probably could be unpacked through analysis and research, if needed.)
But there remains something like the “parity” (evenness or oddness) of the process, in addition to its depth. More depth is of course good, as long as it’s accurate; it often isn’t, and more levels means more chances for the inference to diverge. I would guess this is the main reason some people (often including me) prefer lower depth: they don’t expect the higher-depth inferences to be accurate enough to guide action, and often they aren’t. Parity manifests as whether we look for fault in the speaker or in the listener. This too is of course not a single value, but it’s an apportionment, not a number of echoes. There is (I think) a tendency to look more towards the speaker or the listener(s) for fault (or credit, if communication goes well!), and THAT is what I think ask and guess culture are about. It ends up being something like the sum of a series in which the terms have a factor of (-1)^n.
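To make that last metaphor slightly more concrete (this is just my own toy formalization, not anything standard or anything from the post): if $w_n \ge 0$ is the weight a given culture puts on the $n$-th level of inference about what was said, up to some depth $D$, the net apportionment might look something like

$$S = \sum_{n=0}^{D} (-1)^n\, w_n,$$

where, under an arbitrary sign convention, $S > 0$ means fault (or credit) lands mostly on the speaker and $S < 0$ mostly on the listener(s). On this sketch, ask and guess culture differ in the sign and size of $S$, not in $D$ itself.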
(I agree with the overall thrust of this post that “you could just not respond!” refers to an action that, while available, is not free of cost; one cannot simply assume that leaving a comment will consume none of the author’s time and attention unless he or she wants it to.)
I am saying you do not literally have to be a cog in the machine. You have other options. The other options may sometimes be very unappealing; I don’t mean to sugarcoat them.
Organizations have choices of how they relate to line employees. They can try to explain why things are done a certain way, or not. They can punish line employees for “violating policy” irrespective of why they acted that way or the consequences for the org, or not.
Organizations can change these choices (at the margin), and organizations can rise and fall because of these choices. This is, of course, very slow, and from an individual’s perspective maybe rarely relevant, but it is real.
I am not saying it’s reasonable for line employees to be making detailed evaluations of the total impact of particular policies. I’m saying that sometimes, line employees can see a policy-caused disaster brewing right in front of their faces. And they can prevent it by violating policy. And they should! It’s good to do that! Don’t throw the squirrels in the shredder!
I don’t think my view is affluent, specifically, but it does come from a place where one has at least some slack, and works better in that case. As do most other things, IMO.
(I think what you say is probably an important part of how we end up with the dynamics we do at the line employee level. That wasn’t what I was trying to talk about, and I don’t think it changes my conclusions, but maybe I’m wrong; do you think it does?)
I have trouble understanding what’s going on in people’s heads when they choose to follow policy when that’s visibly going to lead to horrific consequences that no one wants. Who would punish them for failing to comply with the policy in such cases? Or do people think of “violating policy” as somehow bad in itself, irrespective of consequences?
Of course, those are only a small minority of relevant cases. Often distrust of individual discretion is explicitly on the mind of those setting policies. So, rather than just publishing a policy, they may choose to give someone the job of enforcing it, and evaluate that person by policy compliance levels (whether or not complying made sense in any particular case); or they may try to make the policy self-enforcing (e.g., put things behind a locked door and tightly control who has the key).
And usually the consequences look nowhere close to horrific. “Inconvenient” is probably the right word, most of the time. Although very policy-driven organizations seem to have a way of building miserable experiences out of parts any one of which might be best described as inconvenient.
I’m not sure I agree about who’s good and who’s bad in the gate attendant scenario. Surely getting angry at the gate attendant is unlikely to accomplish anything, but if (for now; maybe not much longer, unfortunately) organizations need humans to carry out their policies, the humans don’t have to do that. They can violate the policy and hope they don’t get fired, or they can just quit. The passenger can tell them that. If they’re unable to listen to and consider the argument that they don’t have to participate in enforcing the policy, I guess at that point they’re pretty much NPCs.
I don’t know whether we know anything about how to teach this, other than just telling (and showing, if the opportunity arises), or about what works and what doesn’t, but I think this is also what I’d consider the most important goal for education to pursue. I definitely intend to tell my kids, as strongly as possible, “You always can and should ignore the rules to do the right thing, no matter what situation you’re in, no matter what anyone tells you. You have to know what the right thing is, and that can be very hard, and good rules will help you figure out what the right thing is much better than you could on your own; but ultimately, it’s up to you. There is nothing that can force you to do something you know is wrong.”
I hadn’t noticed that there’d be any reason for people to claim Claude 3.7 Sonnet was “misaligned”, even though I use it frequently and have seen some versions of the behavior in question. It seems to me like… it’s often trying to find the “easy way” to do whatever it’s trying to do. When it decides something is “hard”, it backs off from that line of attack. It backs off when it decides a line of attack is wrong, too. Actually, I think “hard” might be a kind of wrong in its ontology of reasoning steps.
This is a reasoning strategy that needs to be applied carefully. Sometimes it works; one really should use the easy way rather than the hard way, if the easy way works and is easier. But sometimes the hard part is the core of the problem and one needs to just tackle it. I’ve been thinking of 3.7’s failure to tackle the hard part as a lack of in-practice capabilities, specifically the capability to notice “hey, this time I really do need to do it the hard way to do what the user asked” and just attempt the hard way.
Having read this post, I can see the other side of the coin. 3.7’s RL probably heavily incentivizes it to produce an answer / solution / whatever the user wanted done. Or at least something that appears to be what the user wanted, as far as it can tell. Such as (in a fairly extreme case) hard-coding to “pass” unit tests.
I wouldn’t read too much into deceiving or lying to cover up in this case. That’s what practically any human who had chosen to clearly cheat would do in the same situation, at least until confronted. The decision to cheat in the first place is straightforwardly misaligned though. But I still can’t help thinking it’s downstream of a capabilities failure, and this particular kind of misalignment will naturally disappear once the model is smart enough to just do the thing, instead. (Which is not, of course, to say we won’t see other kinds of misalignment, or that those won’t be even more problematic.)
That’s possible, but what does the population distribution of [how much of their time people spend reading books] look like? I bet it hasn’t changed nearly as much as overall reading minutes per capita has (even decline in book-reading seems possible, though of course greater leisure and wealth, larger quantity of cheaply and conveniently available books, etc. cut strongly the other way), and I bet the huge pile of written language over here has large effects on the much smaller (but older) pile of written language over there.
(How hard to understand was that sentence? Since that’s what this article is about, anyway, and I’m genuinely curious. I could easily have rewritten it into multiple sentences, but that didn’t appear to me to improve its comprehensibility.)
Edited to add: on review of the thread, you seem to have already made the same point about book-reading commanding attention because book-readers choose to read books, in fact to take it as ground truth. I’m not so confident in that (I’m not saying it’s false, I really don’t know), but the version of my argument that makes sense under that hypothesis would crux on books being an insufficiently distinct use of language to not be strongly influenced, either through [author preference and familiarity] or through [author’s guesses or beliefs about [reader preference and familiarity]], by other uses of language.
I agree that the average reader is probably smarter in a general sense, but they also have FAR more things competing for their attention. Thus the amount of intelligence available for reading and understanding any given sentence, specifically, may be lower in the modern environment.
Question marks and exclamation points are dots with an extra bit. Ellipses may be multiple dots, but also indicate an uncertain end to the sentence. (Formal usage distinguishes “...” for ellipses in arbitrary position and “....” for ellipses coming after a full stop, but the latter is rarely seen in any but academic writing, and I would guess even many academics don’t notice the difference these days.)
I read a bunch of its “thinking” and it gets SO close to solving it after the second message, but it miscounts the number of [] in the text provided for 19. Repeatedly. While quoting it verbatim. (I assume it foolishly “trusts” the first time it counted.) And based on its miscount, thinks that should be the representation for 23, instead. And thus rules out (a theory starting to point towards) the correct answer.
I think this may at least be evidence that having anything unhelpful in context, even (maybe especially!) if self-generated, can be really harmful to model capabilities. I still think it’s pretty interesting.
I have very mixed feelings about this comment. It was a good story (just read it, and wouldn’t have done so without this comment) but I really don’t see what it has to do with this LW post.
Possible edge case / future work—what if you optimize for faithfulness and legibility of the chain of thought? The paper tests optimizing for innocent-looking CoT, but if the model is going to hack the test either way, I’d want it to say so! And if we have both an “is actually a hack” detector and a “CoT looks like planning a hack” detector, this seems doable.
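For what it’s worth, here is a minimal sketch of the kind of reward shaping I mean; the detector functions and weights are entirely hypothetical (nothing like this appears in the paper), and the only point is that the CoT gets rewarded for matching the behavior rather than for looking innocent:

# Sketch only: `is_actual_hack` would come from an "is actually a hack" detector
# (e.g. inspecting the produced code and tests), and `cot_admits_hack` from a
# "CoT looks like planning a hack" monitor. Both detectors are assumptions, not real APIs.
def shaped_reward(task_reward: float,
                  is_actual_hack: bool,
                  cot_admits_hack: bool,
                  hack_penalty: float = 1.0,
                  faithfulness_bonus: float = 0.5) -> float:
    reward = task_reward
    if is_actual_hack:
        reward -= hack_penalty            # still discourage hacking itself
        if cot_admits_hack:
            reward += faithfulness_bonus  # but penalize it less when the CoT says so
    elif cot_admits_hack:
        reward -= faithfulness_bonus      # don't reward confessions that don't match behavior
    return reward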
Is this an instance of the Most Forbidden Technique? I’m not sure. I definitely wouldn’t trust it to align a currently unaligned superintelligence. But it seems like maybe it would let you make an aligned model at a given capability level into a still aligned model with more legible CoT, without too much of a tax, as long as the model doesn’t really need illegible CoT to do the task? And if capabilities collapse, that seems like strong evidence that illegible CoT was required for task performance; halt and catch fire, if legible CoT was a necessary part of your safety case.
Is it really plausible that human driver inattention just doesn’t matter here? Sleepiness, drug use, personal issues, eyes were on something interesting rather than the road, etc. I’d guess something like that is involved in a majority of collisions, and that Just Shouldn’t Happen to AI drivers.
Of course AI drivers do plausibly have new failure modes, like maybe the sensors fail sometimes (maybe more often than human eyes just suddenly stop working). But there should be plenty of data about that sort of thing from just testing them a lot.
The only realistic way I can see for AI drivers, which have been declared street-legal and are functioning in a roadway and regulatory system that humans (chose to) set up, to be less safe than human drivers is if there’s some kind of coordinated failure. Like if they trust data coming from GPS satellites or cell towers, and those start spitting out garbage and throw the AIs off distribution; or a deliberate cyber-attack / sabotage of some kind.
“ goal” in “football| goal|keeping”
Looks like an anti-football (*American* football, that is) thing, to me. American football doesn’t have goals, and soccer (which is known as “football” in most of the world) does. And you mentioned earlier that the baseball neuron is also anti-football.
Since it was kind of a pain to run, sharing these probably minimally interesting results. I tried encoding this paragraph from my comment:
I wonder how much information there is in those 1024-dimensional embedding vectors. I know you can jam an unlimited amount of data into infinite-precision floating point numbers, but I bet if you add Gaussian noise to them they still decode fine, and the magnitude of noise you can add before performance degrades would allow you to compute how many effective bits there are. (Actually, do people use this technique on latents in general? I’m sure either they do or they have something even better; I’m not a supergenius and this is a hobby for me, not a profession.) Then you could compare to existing estimates of text entropy, and depending on exactly how the embedding vectors are computed (they say 512 tokens of context but I haven’t looked at the details enough to know if there’s a natural way to encode more tokens than that; I remember some references to mean pooling, which would seem to extend to longer text just fine?), compare these across different texts.
with SONAR, breaking it up like this:
sentences = [
    'I wonder how much information there is in those 1024-dimensional embedding vectors.',
    'I know you can jam an unlimited amount of data into infinite-precision floating point numbers, but I bet if you add Gaussian noise to them they still decode fine, and the magnitude of noise you can add before performance degrades would allow you to compute how many effective bits there are.',
    '(Actually, do people use this technique on latents in general? I\'m sure either they do or they have something even better; I\'m not a supergenius and this is a hobby for me, not a profession.)',
    'Then you could compare to existing estimates of text entropy, and depending on exactly how the embedding vectors are computed (they say 512 tokens of context but I haven\'t looked at the details enough to know if there\'s a natural way to encode more tokens than that;',
    'I remember some references to mean pooling, which would seem to extend to longer text just fine?), compare these across different texts.',
]

and after decode, I got this:
['I wonder how much information there is in those 1024-dimensional embedding vectors.',
 'I know you can encode an infinite amount of data into infinitely precise floating-point numbers, but I bet if you add Gaussian noise to them they still decode accurately, and the amount of noise you can add before the performance declines would allow you to calculate how many effective bits there are.',
 "(Really, do people use this technique on latent in general? I'm sure they do or they have something even better; I'm not a supergenius and this is a hobby for me, not a profession.)",
 "And then you could compare to existing estimates of text entropy, and depending on exactly how the embedding vectors are calculated (they say 512 tokens of context but I haven't looked into the details enough to know if there's a natural way to encode more tokens than that;",
 'I remember some references to mean pooling, which would seem to extend to longer text just fine?), compare these across different texts.']

Can we do semantic arithmetic here?
sentences = [
    'A king is a male monarch.',
    'A bachelor is an unmarried man.',
    'A queen is a female monarch.',
    'A bachelorette is an unmarried woman.',
]
...
pp(reconstructed)
['A king is a male monarch.', 'A bachelor is an unmarried man.', 'A queen is a female monarch.', 'A bachelorette is an unmarried woman.']
...
new_embeddings[0] = embeddings[0] + embeddings[3] - embeddings[1]
new_embeddings[1] = embeddings[0] + embeddings[3] - embeddings[2]
new_embeddings[2] = embeddings[1] + embeddings[2] - embeddings[0]
new_embeddings[3] = embeddings[1] + embeddings[2] - embeddings[3]
reconstructed = vec2text_model.predict(new_embeddings, target_lang="eng_Latn", max_seq_len=512)
pp(reconstructed)
['A kingwoman is a male monarch.', "A bachelor's is a unmarried man.", 'A bachelorette is an unmarried woman.', 'A queen is a male monarch.']

Nope. Interesting though. Actually I guess the 3rd one worked?
OK, I’ll stop here, otherwise I’m at risk of going on forever. But this seems like a really cool playground.
You appear to have two full copies of the entire post here, one above the other. I wouldn’t care (it’s pretty easy to recognize this and skip the second copy) except that it totally breaks the way LW does comments on and reactions to specific parts of the text; one has to select a unique text fragment to use those, and with two copies of the entire post, there aren’t any unique fragments.
Wow, the SONAR encode-decode performance is shockingly good, and I read the paper and they explicitly stated that their goal was translation, and that the autoencoder objective alone was extremely easy! (But it hurt translation performance, presumably by using a lot of the latent space to encode non-semantic linguistic details, so they heavily downweighted autoencoder loss relative to other objectives when training the final model.)
I wonder how much information there is in those 1024-dimensional embedding vectors. I know you can jam an unlimited amount of data into infinite-precision floating point numbers, but I bet if you add Gaussian noise to them they still decode fine, and the magnitude of noise you can add before performance degrades would allow you to compute how many effective bits there are. (Actually, do people use this technique on latents in general? I’m sure either they do or they have something even better; I’m not a supergenius and this is a hobby for me, not a profession.) Then you could compare to existing estimates of text entropy, and depending on exactly how the embedding vectors are computed (they say 512 tokens of context but I haven’t looked at the details enough to know if there’s a natural way to encode more tokens than that; I remember some references to mean pooling, which would seem to extend to longer text just fine?), compare these across different texts.
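In case anyone wants to try it, here is roughly the experiment I have in mind, as a sketch. It reuses the SONAR text-to-embedding and embedding-to-text pipelines; the pipeline and model names are the ones I remember from the SONAR README, so treat them as assumptions, and the bit estimate in the final lines is a rough Gaussian-channel heuristic of mine, not anything from the paper:

import torch
from sonar.inference_pipelines.text import (
    TextToEmbeddingModelPipeline,
    EmbeddingToTextModelPipeline,
)

t2vec_model = TextToEmbeddingModelPipeline(
    encoder="text_sonar_basic_encoder", tokenizer="text_sonar_basic_encoder")
vec2text_model = EmbeddingToTextModelPipeline(
    decoder="text_sonar_basic_decoder", tokenizer="text_sonar_basic_encoder")

sentences = ['I wonder how much information there is in those 1024-dimensional embedding vectors.']
embeddings = t2vec_model.predict(sentences, source_lang="eng_Latn")

# Sweep the noise scale and see where decoding starts to break down.
signal_std = embeddings.std().item()
for sigma in [0.001, 0.003, 0.01, 0.03, 0.1, 0.3]:
    noisy = embeddings + sigma * torch.randn_like(embeddings)
    decoded = vec2text_model.predict(noisy, target_lang="eng_Latn", max_seq_len=512)
    # Rough heuristic: ~0.5 * log2(1 + (signal/noise)^2) bits per dimension, times 1024 dims.
    bits = 1024 * 0.5 * torch.log2(torch.tensor(1.0 + (signal_std / sigma) ** 2)).item()
    print(f"sigma={sigma}: ~{bits:.0f} effective bits if this is the largest tolerable noise")
    print("  ", decoded[0])

The largest sigma at which the decode still comes back essentially unchanged is the number that would plug into the effective-bits estimate.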
Exploring this embedding space seems super interesting, in general, way more so on an abstract level (obviously it isn’t as directly useful at this point) than the embedding space used by actual LLMs. Like, with only 1024 dimensions for a whole paragraph, it must be massively polysemantic, right? I guess your follow-on post (which this was just research to support) is implicitly doing part of this, but I think maybe it underplays “can we extract semantic information from this 1024-dimensional embedding vector in any way substantially more efficient than actually decoding it and reading the output?” (Or maybe it doesn’t; I read the other post too, but haven’t re-read it in light of this one.)
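One cheap way to poke at that question (my own suggestion, not anything from either post): train a linear probe directly on the 1024-dimensional vectors for some semantic property and see how much it recovers without ever decoding. A toy sketch, reusing t2vec_model from the snippet above plus scikit-learn, with made-up labels:

from sklearn.linear_model import LogisticRegression

# Hypothetical toy labels: is the sentence about sports?
texts = [
    'The goalkeeper made a brilliant save in the final minute.',
    'The senate passed the budget bill after a long debate.',
    'The striker scored twice before halftime.',
    'The court upheld the lower ruling on appeal.',
]
labels = [1, 0, 1, 0]

X = t2vec_model.predict(texts, source_lang="eng_Latn").numpy()
probe = LogisticRegression(max_iter=1000).fit(X, labels)
print('training accuracy:', probe.score(X, labels))  # a real test needs far more data and a held-out split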
There also appears to be a way to attempt to use this to enhance model capabilities. I seem to think of one of these every other week, and again, I’m not a supergenius nor a professional ML researcher so I assume it’s obvious to those in the field. The devil appears to be in the details; sometimes a new innovation appears to be a variant of something I thought of years ago, sometimes they come out of left field from my perspective, and in no case does there appear to be anything I, from my position, could have usefully done with the idea, so far. Experiments seem very compute-limited, especially because, like all other software development in my experience, one needs to actually run the code and see what happens. This particular technique, if it actually works (I’m guessing either it doesn’t, or it only works when scaled so large that a bunch of other techniques would have worked just as well and converged on the same implicit computations), might come with large improvements to interpretability and controllability, or it might not (which seems to be true for all the other ideas I have that might improve capabilities, too). I’m not advising anyone to try it (again, if one works in the field I think it’s obvious, so either there are reasons not to or someone already is). Just venting, I guess. If anyone’s actually reading this, do you think there’s anything useful to do with this idea and others like it, or are they pretty much a dime a dozen, interesting to me but worthless in practice?
(Sorry for going on so long! Wish I had a way to pay a penny to anyone who thoughtfully reads this, whether or not they do anything with it.)
Sorry, I think it’s entirely possible that this is just me not knowing or understanding some of the background material, but where exactly does this diverge from justifying the AI pursuing a goal of maximizing the inclusive genetic fitness of its creators? That clearly either isn’t what humans actually want (there are things humans can do to make themselves have more descendants that no humans, including the specific ones who could take those actions, want to take, because of godshatter) or is just circular (who knows what will maximize inclusive genetic fitness in an environment that is being created, in large part, by the decision of how to promote inclusive genetic fitness?). At some point, your writing started talking about “design goals”, but I don’t understand why tools or artifacts constructed by evolved creatures, which happen to increase the constructors’ inclusive genetic fitness by means other than their design goals, wouldn’t be favored by evolution, and thus be part of the “purpose” the evolved creatures had in constructing them. That doesn’t seem like an “error” even in the limit of optimal pursuit of inclusive genetic fitness; it seems to be just what optimal pursuit of IGF would actually do. In other words, I don’t want a very powerful human-constructed optimizer to pursue the maximization of human IGF, and I think hardly any other humans do either; but I don’t understand in detail why your argument doesn’t justify AI pursuit of maximizing human IGF, to the detriment of what humans actually value.
What You Don’t Understand Can Hurt You (many variations possible, with varied effects)
Improve Your (Metaphorical) Handwriting
Make Other People’s Goodharting Work For You (tongue in cheek, probably too biting)
Make Surviving ASI Look As Hard As It Is
Unsolved Illegible Problems + Solved Legible Problems = Doom