I’ve updated marginally towards this (as a guy pretty focused on LLM-augmentation: I anticipated LLM brain rot, but it was still more pernicious/fast than I expected).
I do still think some-manner-of-AI-integration is going to be an important part of “moving forward” but probably not whatever capitalism serves up.
I have tried using them pretty extensively for coding. The speedup is real, and I expect it to get more real. Right now it’s like having a pretty junior employee that I get to infinitely micromanage. But it definitely does lull me into a lower-agency state where, instead of trying to solve problems myself, I’m handing them off to LLMs much of the time to see if they can handle them.
During work hours, I try to actively override this, i.e. build the habit of “send the LLM off, then go back to thinking about some kind of concrete thing (often a higher-level strategy).” But, this becomes harder to do as it gets later in the day and I get more tired.
One of the benefits of LLMs is that you can do moderately complex cognitive work* while tired (*that a junior engineer could do). But, that means by default a bunch of time is spent specifically training the habit of using LLMs in a stupid way.
(I feel sort of confused about how people who don’t use it for coding are doing. With coding, I can feel the beginnings of a serious exoskeleton that can build structures around me with thought. Outside of that, I don’t know of it being more than a somewhat better google).
I currently mostly avoid interactions that treat the AI like a person-I’m-talking-to. That way seems the most madness-inducing.
(Disclaimer: only partially relevant rant.)
I’ve recently tried heavily leveraging o3 as part of a math-research loop.
I have never been more bearish on LLMs automating any kind of research than I am now.
And I’ve tried lots of ways to make it work. I’ve tried telling it to solve the problem without any further directions, I’ve tried telling it to analyze the problem instead of attempting to solve it, I’ve tried dumping my own analysis of the problem into its context window, I’ve tried getting it to search for relevant lemmas/proofs in math literature instead of attempting to solve it, I’ve tried picking out a subproblem and telling it to focus on that, I’ve tried giving it directions/proof sketches, I’ve tried various power-user system prompts, I’ve tried resampling the output thrice and picking the best one. None of this made it particularly helpful, and the bulk of the time was spent trying to spot where it’s lying or confabulating to me in its arguments or proofs (which it ~always did).
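(For concreteness, the “resample thrice and pick the best” step was just a loop of roughly the shape below. This is a minimal sketch, and `ask_model`/`score_attempt` are hypothetical stand-ins rather than any particular API; in practice the “scoring” was mostly me reading each attempt and hunting for the confabulated step.)

```python
from typing import Callable

def best_of_n(
    problem: str,
    ask_model: Callable[[str], str],
    score_attempt: Callable[[str], float],
    n: int = 3,
) -> str:
    """Sample the model n times on the same prompt and keep the attempt
    that scores best under whatever check you have available."""
    attempts = [ask_model(problem) for _ in range(n)]
    return max(attempts, key=score_attempt)

# Trivial stand-ins just so the sketch runs end-to-end:
if __name__ == "__main__":
    fake_model = lambda prompt: f"candidate argument for: {prompt}"
    fake_score = lambda attempt: float(len(attempt))  # placeholder heuristic
    print(best_of_n("rearrange the expression into the target form", fake_model, fake_score))
```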
It was kind of okay for tasks like “here’s a toy setup, use a well-known formula to compute the relationships between A and B”, or “try to rearrange this expression into a specific form using well-known identities”, which are relatively menial and freed up my working memory for more complicated tasks. But the usefulness is pretty minor (and you have to re-check the outputs for errors anyway).
I assume there are math problems at which they do okay, but that capability sure is brittle. I don’t want to overupdate here, but geez, getting LLMs from here to the Singularity in 2-3 years just doesn’t feel plausible.
Nod.
[disclaimer, not a math guy, only barely knows what he’s talking about, if this next thought is stupid I’m interested to learn more]
I don’t expect this to fix it right now, but, one thing I don’t think you listed is doing the work in Lean or some other proof assistant that lets you check results immediately? I expect LLMs to first be able to do math in that format, because it’s the format you can actually do a lot of training in. And it’d mean you can verify results more quickly.
My current vague understanding is that Lean is normally too cumbersome to be reasonable to work in, but that’s the sort of thing that could change with LLMs in the mix.
I agree that it’s a promising direction.
I did actually try a bit of that back in the o1 days. What I’ve found is that getting LLMs to output formal Lean proofs is pretty difficult: they really don’t want to do that. When they’re not making mistakes, they use informal language as connective tissue between Lean snippets, they put in “sorry”s (a placeholder that makes a lemma evaluate as proven), and otherwise try to weasel out of it.
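(To make the “sorry” escape hatch concrete for anyone who hasn’t used Lean: the first toy “proof” below is accepted by the checker, with only a warning, despite proving nothing; the theorem names are made up for illustration.)

```lean
-- Failure mode: this "proof" elaborates (with a warning) because
-- `sorry` closes the goal without supplying an actual argument.
theorem toy_claim (a b : Nat) : a + b = b + a := by
  sorry

-- What an honest proof of the same toy statement looks like:
theorem toy_claim' (a b : Nat) : a + b = b + a := by
  exact Nat.add_comm a b
```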
This is something that should be solvable by fine-tuning, but at the time, there weren’t any publicly available decent models fine-tuned for that.
We do have DeepSeek-Prover-V2 now, though. I should look into it at some point. But I am not optimistic; it sounds like it’s doing the same stuff, just more cleverly.
Relevant: Terence Tao does find them helpful for some Lean-related applications.
yeah, it’s less that I’d bet it works now, just, whenever it DOES start working, it seems likely it’d be through this mechanism.
⚖ If Thane Ruthenis thinks there are AI tools that can meaningfully help with Math by this point, did they first have a noticeable period (> 1 month) where it was easier to get work out of them via working in lean-or-similar? (Raymond Arnold: 25% & 60%)
(I had a bit of an epistemic rollercoaster making this prediction. I updated to “by the time someone makes an actually worthwhile Math AI, even if Lean was an important part of its training process, it’s probably not that hard to do additional fine-tuning that gets it to output stuff in a more standard mathy format.” But, then, it seemed like it was still going to be important to quickly check the output wasn’t blatantly broken as part of the process.)
There are common ways I currently use (the free version of) ChatGPT that are partially categorizable as “somewhat better search engine”, but where I feel like that’s not representative of the real differences. A lot of this is coding-related, but not all, and the reasons I use it for coding-related and non-coding-related tasks feel similar. When it is coding-related, it’s generally not of the form of asking it to write code for me that I’ll then actually put into a project, though occasionally I will ask for example snippets which I can use to integrate the information better mentally before writing what I actually want.
The biggest difference in feel is that a chat-style interface is predictable and compact and avoids pushing a full-sized mental stack frame and having to spill all the context of whatever I was doing before. (The name of the website Stack Exchange is actually pretty on point here, insofar as they were trying to provide something similar from crowdsourcing!) This is something I can see being a source of creeping mental laziness—but it depends on the size and nature of the rest of the stack: if you were already under high context-retention load relative to your capabilities, and you’re already task-saturated enough, and you use a chatbot for leaf calls that would otherwise cause you to have to do a lot of inefficient working-memory juggling, then it seems like you’re already getting a lot of the actually-useful mental exercise at the other levels and you won’t be eliminating much of it, just getting some probabilistic task speedups.
In roughly descending order of “qualitatively different from a search engine” (which is not the same as “most impactful to me in practice”):
Some queries are reverse concept search, which to me is probably the biggest and hardest-to-replicate advantage over a traditional search engine: I often have the shape of a concept that seems useful, but because I synthesized it myself rather than drawing from popular existing uses, I don’t know what it’s called. This can be checked for accuracy using a traditional search engine in the forward direction once I have the candidate term.
Some queries are for babble purposes: “list a bunch of X” and I’ll throw out 90% of them for actual use but use the distribution to help nudge my own imagination—generally I’ll do my own babble first and then augment it, to limit priming effects. There’s potential for memetic health issues here, but in my case most of these are isolated enough that I don’t expect them to combine to create larger problems. (In a qualitatively similar way but with a different impact, some of it is pure silliness. “Suppose the protagonists of Final Fantasy XIII had Geese powers. What kind of powers might they have?”)
Synthesis and shaping of information is way different from search engine capabilities. This includes asking for results tailored along specific axes I care about where it’s much less likely an existing webpage author has used that as a focus, small leaps of connective reasoning that would take processing and filtering through multiple large pages to do via search engine, and comparisons between popular instances of a class (in coding contexts, often software components) where sometimes someone’s actually written up the comparison and sometimes not. Being able to fluently ask followups that move from a topic to a subtopic or related topic without losing all the context is also very useful. “Tell me about the main differences between X1 and X2.” → “This new thing introduced in X2, is that because of Y?” (but beware of sycophancy biases if you use leading questions like that)
(Beyond this point we get closer to “basically a search engine”.)
Avoiding the rise in Web annoyances is a big one in practice—which ties into the weird tension of social contracts around Internet publishing being kind of broken right now, but from an information-consumer perspective, the reprocessed version is often superior. If a very common result is that a search engine will turn up six plausible results, and three of them are entirely blog slop (often of a pre-LLM type!) which is too vague to be useful for me, two of them ask me to sign up for a ‘free’ account to continue but only after I’ve started reading the useless intro text, and one of them contains the information I need in theory but I have to be prepared to click the “reject cookies” button, and click the close button on the “get these delivered to your inbox” on-scroll popup, and hope it doesn’t load another ten-megabyte hero image that I don’t care about and chew through my cellular quota in the process, and if I try to use browser extensions to combat this then the text doesn’t load, and so on and so on… then obviously I will switch to asking the chatbot first! “most of the content is buried in hour-long videos” is skew to this but results in the same for me.
In domains like “how would I get started learning skill X”, where there are enough people who can get a commercial advantage through SEO’ing that into “well, take our course or buy our starter kit” (but usually subtler than that), those results seem (and I think for now probably are) less trustworthy than chatbot output that goes directly to concrete aspects that can be checked more cleanly, and they tend to disguise themselves in ways that are hard to filter out without reading a lot of the way through. Of course, there are obvious ways for this not to last, either as SEO morphs into AIO or if the chatbot providers start selling the equivalent of product placement behind the scenes.
(fwiw, I never felt like phones offered any real “you need them to not fall behind”. They are kinda a nice-to-have in some situations. I do need them for Uber/Lyft and maps; I use them for other things which have some benefits and costs; this post is upweighting “completely block the internet on my phone.” I don’t have any social media apps on my phone but it doesn’t matter much, I just use the web browser)
I imagine this differs a lot based on what social position you’re already in and where you’re likely to get your needs met. When assumptions like “everyone has a smartphone” become sufficiently widespread, you can be blocked off from things unpredictably when you don’t meet them. You often can’t tell which things these are in advance: simplification pressure causes a phase transition from “communicated request” to “implicit assumption”, and there’s too many widely-distributed ways for the assumption to become relevant, so doing your own modeling will produce a “reliably don’t need” result so infrequently as to be effectively useless. Then, if making the transition to conformity when you notice a potential opportunity is too slow or is blocked by e.g. resource constraints or value differences, a lot of instant-lose faces get added to the social dice you roll. If your anticipated social set is already stable and well-adapted to you, you may not be rolling many dice, but if you’re precarious, or searching for breakthrough opportunities, or just have a role with more wide-ranging and unpredictable requirements on which interactions you need to succeed at, it’s a huge penalty. Other technologies this often happens with in the USA, again depending on your social class and milieu, include cars, credit cards, and Facebook accounts.
(It feels like there has to already be an explainer for this somewhere in the LW-sphere, right? I didn’t see an obvious one, though…)
yeah a friend of mine gave in because she was getting so much attitude about needing people to give her directions.
You’ve reminded me of a perspective I was meaning to include but then forgot to, actually. From the perspective of an equilibrium in which everyone’s implicitly expected to bring certain resources/capabilities as table stakes, making a personal decision that makes your life better but reduces your contribution to the pool can be seen as defection—and on a short time horizon or where you’re otherwise forced to take the equilibrium for granted, it seems hard to refute! (ObXkcd: “valuing unit standardization over being helpful possibly makes me a bad friend” if we take the protagonist as seeing “US customary units” as an awkward equilibrium.) Some offshoots of this which I’m not sure what to make of:
1. If the decision would lead to a better society if everyone did it, and leads to an improvement for you if only you do it, but requires the rest of a more localized group to spend more energy to compensate for you if you do it and they don’t, we have a sort of “incentive misalignment sandwich” going on. In practice I think there’s usually enough disagreement about the first point that this isn’t clear-cut, but it’s interesting to notice.
2. In the face of technological advances, what continues to count as table stakes tends to get set by Moloch and mimetic feedback loops rather than intentionally. In a way, people complaining vociferously about having to adopt new things are arguably acting in a counter-Moloch role here, but in the places I’ve seen that happen, it’s either been ineffective or led to a stressful and oppressive atmosphere of its own (or, most commonly and unfortunately, both).
3. I think intuitive recognition of (2) is a big motivator behind attacking adopters of new technology that might fall into this pattern, in a way that often gets poorly expressed in a “tech companies ruin everything” type of way. Personally taking up smartphones, or cars, or—nowadays the big one that I see in my other circles—generative AI, even if you don’t yourself look down on or otherwise directly negatively impact non-users, can be seen as playing into a new potential equilibrium where if you can, you ‘must’, or else you’re not putting in as much as everyone else, and so everyone else will gradually find that they get boxed in and any negative secondary effects on them are irrelevant compared to the phase transition energy. A comparison that comes to mind is actually labor unions; that’s another case where restraining individually expressed capabilities in order to retain a better collective bargaining position for others comes into play, isn’t it?
(Now much more tangentially:)
… hmm, come to think of it, maybe part of conformity-pressure in general can be seen as a special case of this where the pool resource is more purely “cognition and attention spent dealing with non-default things” and the nonconformity by default has more of a purely negative impact on that axis, whereas conformity-pressure over technology with specific capabilities causes the nature of the pool resource to be pulled in the direction of what the technology is providing and there’s an active positive thing going on that becomes the baseline… I wonder if anything useful can be derived from thinking about those two cases as denoting an axis of variation.
And when the conformity is to a new norm that may be more difficult to understand but produces relative positive externalities in some way, is that similar to treating the new norm as a required table stakes cognitive technology?
I mostly use it for syntax, and formatting/modifying docs, giving me quick visual designs...