Sidereal Confluence is a very useful object lesson in trade being positive-sum.
This explanation is plausible to me, and has the added benefit of explaining why @mabramov thinks prediction markets are less valuable than they actually are (if indeed he underestimates them): many people are prone to believing things that are obviously wrong; the main skill of good forecasters (beyond being generally well-informed) is that they are immune to this particular insanity; and so people who are not good forecasters benefit from access to insanity-immune opinions. This comment is the closest among existing comments to convincing me that prediction markets can have broad social utility.
Still, I don’t think this explanation really says that 40% chances are 40% chances; it says you can safely dismiss claims that 40% markets represent probabilities below like 10% or higher than like 80%. It’s still possible that these markets are not particularly good information aggregators and that superforecasters are not particularly good at producing actionable insights across domains: calibration is not the optimization target. Thus I still update, based on the original post, towards prediction markets being worse, and perhaps significantly worse, than advertised in their current incarnations.
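To make the “calibration is not the optimization target” point concrete, here is a minimal sketch in Python (all names and numbers are illustrative, not from the post or any real market): a forecaster that always predicts the base rate is perfectly calibrated yet carries no information about individual events, and a proper scoring rule like the Brier score separates it from a forecaster with a real signal, where a calibration check alone cannot.

```python
import random

# Minimal sketch (illustrative numbers): calibration != informativeness.
random.seed(0)
BASE_RATE = 0.4
N = 100_000

# Each event independently resolves YES with probability 0.4.
outcomes = [random.random() < BASE_RATE for _ in range(N)]

# Forecaster A: always says 40%. Among its "40%" predictions, 40% resolve
# YES, so it is perfectly calibrated, yet it distinguishes nothing.
always_base_rate = [BASE_RATE] * N

# Forecaster B: has a noisy but genuine per-event signal.
informative = [
    min(max(BASE_RATE + (0.4 if y else -0.25) + random.gauss(0, 0.05), 0.01), 0.99)
    for y in outcomes
]

def brier(preds, outs):
    """Mean squared error against 0/1 outcomes (lower is better)."""
    return sum((p - o) ** 2 for p, o in zip(preds, outs)) / len(outs)

print(f"always-40% forecaster:  {brier(always_base_rate, outcomes):.3f}")  # ~0.240
print(f"informative forecaster: {brier(informative, outcomes):.3f}")       # ~0.032
```

The point is just that a market could pass calibration checks while adding little information beyond base rates.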
Why do you attribute this largely to the rise of prediction markets? My perception is that news outlets started citing prediction markets roughly when they became an effective vehicle for hard-to-regulate sports gambling; I don’t think this has ~anything to do with the 2016 election. Indeed, directly following that election, significantly prior to the rise in cultural salience of prediction markets, data scientists and pollsters were in crisis for several months trying to figure out why the polls were so wrong. They did a much better job in the 2018 midterms, and this is certainly not attributable to prediction markets, but rather to directly addressing the methodological gaps in polling that had recently become salient.
If you’ve written extensively with the author of a book, you should disclose this explicitly in your review of the book. This is true even if your writing was pseudonymous. This is true especially if you feel that readers not in your in-group might misinterpret the nature of the writing due to lack of cultural familiarity. This is true especially if the joint writing is the most detailed source publicly available for the trickier parts of the book’s underlying philosophy, and especially especially if the author’s favored response to one of the main critiques of their programme is to reference your portion of the joint writing. It is not enough to merely mention in a footnote that you get along with the author. If you don’t want to disclose this then you should not review the book. It is okay to maintain pseudonyms precisely when you follow appropriate social partitioning rules to avoid exactly the thing I am describing.
(This post intentionally left vague; I’m not interested in doxxing anyone.)
I agree that they will make these mistakes at that scope; I’m claiming that the solution won’t scale. If you RL models not to do this in 200 words, I don’t think that will make it substantially easier for them to avoid it at 5k words, except insofar as it trains them to never hint at things at all. I haven’t found frontier models to be significantly more tasteful or better at writing prose than less capable models, despite their being generally smarter and better at some seemingly related parts of creative writing, so my intuition is that current scaling levers are unlikely to address this problem well.
Agree with a lot here but disagree with the conclusion:
> I believe that we should focus on improving models’ ability to write in the <200 word range, where both generation and evaluation is comparatively cheap. I do not expect efforts to produce high quality long-form LLM writing to be fruitful until models are able to produce strong short-form writing.
I think AI mostly struggles with things that are difficult to learn in short-form writing. For example, this passage:
> A letter can be read many ways, and he had learned to write in all of them at once. The surface meaning for anyone who might intercept it. The true meaning for the recipient who knew what to look for. And a third meaning, hidden even from himself. Ambiguity was not weakness. It was survival. A man who spoke plainly was a man who would not speak for long.
is not that bad as prose, and the prose is not what makes my skin crawl.
It’s pretty clumsy, and the second and third sentences need a different construction: there’s no real payoff for saying which kind of reader each meaning is for; the AI just decided this was the construction it wanted to use and didn’t use the space industriously. There’s also the “Ambiguity was not weakness. It was survival.” construction, which just smells wrong at this point.
But I think the biggest problem here is “And a third meaning, hidden even from himself.” In isolation this is probably the strongest part of the passage, because with the right characterization and the right consideration, this could be an interesting idea! But here it makes me gag, because it doesn’t meaningfully subvert what came before, it doesn’t really have anything to do with what follows, and so I know that the author put it there because it sounds cool and doesn’t have any plan or intention to deliver on what’s cool about it; in other words, the presence of the best sentence in this passage (imo) tells me that the author is not good enough to use this sentence.
I argue that this tendency is difficult to learn in short-form, because it’s hard to realize that the payoff is never coming when it has to come now or never. That is, what I think I dislike about AI prose is that it’s clearly not written with a large context in mind, and while you could train an AI to stop hinting at grand narratives it’s not capable of delivering by RLing it on short-form, I doubt that this will make progress toward good long-form. This might even be why many people prefer the AI writing: I suspect that people who do not read much literature do not really know how these connections are supposed to be built. If forced to read a full AI novel and a full human novel, I think they would start to notice that human prose doesn’t get annoying in the way AI prose does, but most people do not read this much and so do not make this extrapolation.
In my experience, adults are not good enough judges of capabilities for this to be non-damaging. I’ve tutored lots of kids in math at lots of different ability levels. Except in cases where the parents were highly involved, high-achieving mathematicians and the kids were high-achieving as well (math olympiad prep, basically), parent and teacher evaluations were not more informative than mere report cards, which were not all that informative (grades and test scores are very predictive if you have a standard life path, but you should not have a standard life path; there are deep interventions available in math for most students, imo). I imagine the same issue exists in other fields, and I don’t think I’ve ever met someone who I think would be competent to give their child useful assessments of their long-term capabilities in more than two fields at a time, even assuming the children were not vulnerable to the self-image damage that I think you’re basically ignoring here.
I agree that this is the relevant consideration. I think that if cognition has many parts, we should actually expect some parts that humans use to be completely missing in LLMs (and vice versa), and it’s not clear to me whether I should expect scaling architecture to actually produce more parts in this way. I have some intuitions that say (for combinatorial reasons) that within a certain architecture, training dynamics will eventually stop favoring the formation of circuits past a certain size regardless of how many layers you stack, but I am not that confident in these intuitions, so I will need to think about it more. The point about discovery bootstrapping is well-taken; I think I’ve been imagining that this sort of discovery shouldn’t be possible while LLMs are totally lacking some parts of human cognition, but if I take multiple intelligences really seriously then I shouldn’t believe that. Thanks!
Something I’m thinking about today: frontier LLMs have a pretty unusual capabilities profile. This means one of two things: either I should think of LLMs as leveraging massive amounts of necessary compute, and the problems they can solve as much more compute-vulnerable than I thought they were (i.e. this is Deep Blue and everything is kind of chess), or multiple intelligences models are simply true, in that cognition has multiple parts that don’t necessarily have anything to do with each other. The latter predicts that step changes in capabilities are available and that LLMs probably will not scale to superintelligence without architecture changes; the former is unsure about these propositions. I think there’s like a 60% chance that I will change my mind about something critical in this line of thinking if I think about it harder; if you know what I will change my mind about, please let me know.
It is possible to get evidence for this claim without blind tests. For example: start interacting more with prose from an LLM you don’t interact with often (I recently discovered that I like Kimi K2.5’s prose much better than Claude’s, for example, so I’m interacting with it more). Track your ability to distinguish that LLM’s outputs (and your subjective taste/distaste for those patterns) over time. If you start to dislike tics that you didn’t notice before, that’s reasonable evidence that you’ve come to associate those tics with writing that lacks the sort of interiority described here, or at least with writing that lacks some desirable quality that’s hard to specify.
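To make the tracking part concrete, here is a minimal self-blinded version in Python (a sketch only; the function names, labels, and log file are hypothetical, not a tool I actually use): shuffle saved passages from two models, guess the source of each, and append your accuracy to a log so the trend is visible across sessions.

```python
import json
import random
from datetime import date

def run_blind_round(passages_a, passages_b, label_a="kimi", label_b="claude"):
    """Present passages from two sources in random order and score guesses."""
    items = [(p, label_a) for p in passages_a] + [(p, label_b) for p in passages_b]
    random.shuffle(items)
    correct = 0
    for text, truth in items:
        print("\n" + text + "\n")
        guess = input(f"Which model wrote this? ({label_a}/{label_b}): ").strip().lower()
        correct += (guess == truth)
    return correct / len(items)

def log_result(accuracy, path="blind_test_log.jsonl"):
    """Append today's accuracy so the trend across sessions is visible."""
    entry = {"date": date.today().isoformat(), "accuracy": accuracy}
    with open(path, "a") as f:
        f.write(json.dumps(entry) + "\n")

# Usage: gather fresh passages each session (so you never rescore text
# you remember), then:
#   accuracy = run_blind_round(kimi_passages, claude_passages)
#   log_result(accuracy)
```

Rising accuracy over time is the signal that you’re learning the model’s tics rather than guessing.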
“Eventually”, sure, but I don’t think that’s operative here. If we had the ASI recipe and could study it safely for ten years, we’d find a way to implement it in a single datacenter. But discovering it in a single datacenter is much harder. There is actually something missing from current LLMs; there’s a part of intelligence they just don’t have, and the only thing that seems to mitigate that issue is model size, so without ever-increasing model size and analysis of their training dynamics, I think any attempt to get the missing piece is throwing darts with the lights off. (To be fair, I have pretty unusual timelines compared to most of LW, so maybe what’s convincing to me shouldn’t be to you.)
speck1447’s Shortform
At least through the web app, Gemini 3.1 Pro is almost (like 80% maybe? It feels 80% as bad to me) as sycophantic as 4o was.
I definitely expect that problems at this level of difficulty are within reach for present frontier models. That being said, as I understand it, most labs are still soliciting expert data and doing human-in-the-loop process reward modelling, and those that aren’t (mostly because they think RLVR is better and they have the spare compute) are still using the data they solicited in the past, or are distilling from models that used that data, and so on. For basically the past two years, any math problem known to stump LLMs even occasionally has been worth ~75 dollars to any contractor anywhere in the world working as a data generator for companies like Scale AI. You should expect that any math problem which has been posted publicly, seen by more than ~50 people, and stated to be hard for LLMs in that time period has been trained on, detached from the canary string.
Knowing that all rational numbers can be represented is a big hint and would have cut my solution time at least in half. This is still probably a good test, and although I’m sure it’s been trained on, it’s not too hard to come up with “similar” puzzles where knowing about this one doesn’t immediately solve it.
Reads more as manic than rehearsed to me, but I’m not sure I see how the distinction matters. Usually I assume that if somebody has thought through what they want to say before they say it, they’re more likely to give their real thoughts as a result, as opposed to some reactively oppositional take. I guess there’s the Andy Kaufman defense?
(I guess I should mention, there’s at least one way that the distinction is relevant here. At the first pause I indicated, it seems like they were about to say that they want their political opponents wiped off of the face of the earth, but caught themself in time to moderate to something slightly less evil. I read this as instinctively reaching for the worst thing they can think to say about people they hate but not actually being committed to the content. If I thought this were more rehearsed, I think I would read it as at least some small percentage of desire for political genocide against the left. But this involves a bit too much speculation for my taste; I’m much more concerned with the claim I originally quoted, which strikes me as unsalvageably naive.)
For example, the “Poor and Proud” and “March 4 Hundredaires” signs are sentiments that literally every Pro-Billionaire protestor would gladly endorse.
See https://x.com/twocents/status/2020596821228388704
In particular, note the following exchange:
> Interviewer: “Is there any ways you would restructure current incentives to make, to allow for, the protestors and the antiprotestors to like see eye-to-eye?”
> Pro-Billionaire Protestor: “I don’t want to see eye to eye with them! I want to destroy them. [cut, apparently to later in the same answer] All manner of socialists and communists that are motivated by jealousy, I want to wipe them off (pause) of the political spectrum. I want to make it (pause) not allowed for you to support this.”
There’s a cut here; perhaps there’s some intervening context that changes the conclusion I should draw from this exchange, but I really very strongly doubt it.
> Not at all. There is no such set of characteristics. Wrong conclusions are inevitable and commonplace. Godel’s Theorems apply to all formalisms.
They do not apply to all formalisms, morality is not a formal system, and even if it were, this is not what either of Gödel’s theorems would say about it. I don’t know why this particular bit of math misunderstanding is so popular online; I suspect it’s because it enables moves like the one you’re making here (i.e. of the form “it’s impossible to justify any statement, so I can’t be expected to justify my statements”).
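For reference, here is one standard way to state the first incompleteness theorem with its hypotheses spelled out (a LaTeX sketch, assuming amssymb for \nvdash; the example systems are standard ones, not anything from this thread):

```latex
% Gödel's first incompleteness theorem applies only to a formal system F
% satisfying all three hypotheses:
%   (1) F is effectively (recursively) axiomatizable,
%   (2) F is consistent,
%   (3) F interprets enough arithmetic (e.g., Robinson arithmetic Q).
% Under those hypotheses there is a sentence $G_F$ with
\[
  F \nvdash G_F
  \qquad \text{and} \qquad
  F \nvdash \lnot G_F .
\]
% Systems failing a hypothesis (propositional logic, Presburger arithmetic,
% Tarski's theory of real closed fields) are complete and untouched by the
% theorem; and morality is not a formal system in this sense at all.
```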
> In the current world, the harm of unsanctioned killing being commonly accepted (and cheered) is generally a LOT higher than the harm of statistically-evil people continuing to live. So, yes, a heuristic argument: this is a loss of civilization and order, even if it might have been justifiable on some dimensions.
Ah, I wouldn’t call this a heuristic argument—by “heuristic argument” I mean something like “I can’t come up with any utilitarian calculation that says the bad outweighs the good here, but I know that human brains are prone to underestimating this sort of bad, so I assume there is a calculation saying it was overall bad even if I don’t know what it is.” (Incidentally this is how I understand this situation.) If you have an argument to this effect, I’d love to see it! But to satisfy me it will need to be the sort of argument that permits killing Stalin or Hitler or Idi Amin with a pretty wide margin for error, and if it’s not I’ll make the same critique I did before.
> is it ok to kill people (or call for the killing or support the killing) who have not been convicted by any court and the killing does not stop any immediate physical threat to you?
When you phrase the question like this, do you think that you’ve identified a set of characteristics that, throughout history, will never lead to the wrong conclusion? Or are you making a heuristic argument?
I think that throughout history, there have probably been many “ok” killings against people who did not present an immediate threat to the killer and had not been convicted by any court. I think that even today there are probably such killings. Do you have an argument against that position or do you think that, by phrasing things in the way you have, you’ve implicitly made a sufficient argument already?
(This is not to say that the UHC CEO killing in particular was justified, of course.)
It simply is not possible to interpret the thing you quoted as saying this. He mentions permits explicitly as one of the limiting factors!