My vibe-check on current AI use cases
@Jacob Pfau and I spent a few hours optimizing our prompts and pipelines for our daily uses of AI. Here’s where I think capabilities stand on my most desired use cases:
Generating new frontier knowledge: As in, given a LW post, generating interesting comments that add to the conversation; or given some notes on a research topic, generating experiment ideas; etc. It’s pretty bad, to the extent that it’s generally not worth it. But Gemini 2.5 Pro is for some reason much better at this than the other models, to the extent that it’s sometimes worth it to sample 5 ideas to get your mind rolling.
I was hoping we could get a nice pipeline that generates many ideas and prunes most of them, but the model is very bad at pruning. It does write sensible arguments about why some ideas are nonsensical, but ultimately scores them based on flashiness rather than any sensible assessment of relevance to the stated task. Maybe taking a few hours to design good judge rubrics would be worth it, but it seems hard to design very general rubrics.
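For concreteness, here’s a minimal sketch of the shape of pipeline I have in mind, assuming the OpenAI Python SDK; the model name and the rubric are placeholders, and the rubric is exactly the part that’s hard to get right:

```python
# Generate-then-prune sketch: sample ideas independently, score each
# against a rubric with an LLM judge, keep the top few.
from openai import OpenAI

client = OpenAI()

RUBRIC = """Score the idea 1-5 on each axis, then output only the mean:
- Relevance: does it directly address the stated task?
- Tractability: could one researcher test it in a week?
- Novelty: does it go beyond restating the prompt?
Penalize flashiness that isn't backed by a concrete mechanism."""

def generate_ideas(task_notes: str, n: int = 5) -> list[str]:
    # Independent samples, so one bad trajectory doesn't poison the batch.
    ideas = []
    for _ in range(n):
        resp = client.chat.completions.create(
            model="o3",  # placeholder; use whichever model works for you
            messages=[{"role": "user", "content":
                       f"Propose one concrete experiment idea for these notes:\n{task_notes}"}],
        )
        ideas.append(resp.choices[0].message.content)
    return ideas

def judge(task_notes: str, idea: str) -> float:
    resp = client.chat.completions.create(
        model="o3",
        messages=[{"role": "user", "content":
                   f"Task notes:\n{task_notes}\n\nIdea:\n{idea}\n\n{RUBRIC}"}],
    )
    # Assumes the model complies and replies with a bare number.
    return float(resp.choices[0].message.content.strip())

def top_ideas(task_notes: str, n: int = 10, keep: int = 3) -> list[str]:
    ideas = generate_ideas(task_notes, n)
    return sorted(ideas, key=lambda i: judge(task_notes, i), reverse=True)[:keep]
```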
Writing documents from notes: This was surprisingly bad, mostly because for any set of notes the AI was missing 50 small contextual details, and thus framed many points in a wrong, misleading, or obviously Chinese-room-y way. Pasting loads of random context related to the notes (for example, related research papers) didn’t help much. Still, Claude 4 was the best, but maybe this was just because of subjective stylistic preferences.
Of course, some less automated approaches work much better, like giving it an already-written document and asking it to improve the flow, or brainstorming structure and presentation with it.
Math/code: Quite good out of the box. Even for open-ended exploration of vague questions you want to turn into mathematical problems (typical in alignment theory), you can get a nice pipeline going where the AI proposes formalizations, decompositions, or example cases, and pushes the conversation forward semi-autonomously. o3 seems to work best, although I was impressed by Claude 4 Opus’s knowledge of niche topics.
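To illustrate what “semi-autonomously” can mean here, a minimal sketch of such a loop, again assuming the OpenAI SDK; the meta-prompt and model name are placeholders:

```python
# Re-feed the transcript with a fixed meta-prompt so the model keeps
# choosing its own next move (formalize / decompose / work an example).
from openai import OpenAI

client = OpenAI()

META_PROMPT = ("Given the conversation so far, do exactly one of: "
               "(a) propose a formalization of the vague question, "
               "(b) decompose the current problem into subproblems, or "
               "(c) work through a small concrete example. "
               "End by flagging which step a human should sanity-check.")

def explore(question: str, steps: int = 5) -> list[str]:
    messages = [{"role": "user", "content": question}]
    moves = []
    for _ in range(steps):
        messages.append({"role": "user", "content": META_PROMPT})
        resp = client.chat.completions.create(model="o3", messages=messages)
        move = resp.choices[0].message.content
        messages.append({"role": "assistant", "content": move})
        moves.append(move)
    return moves
```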
Summarizing documents, and exploring topics I’m no expert in: Super good out of the box, especially thanks to its encyclopaedic indexical knowledge (connecting you to the obvious methods/answers that an expert would bring up).
One particularly useful approach is walking through how a general method or abstract idea could apply to a concrete example of interest to you.
Coaching: Pretty good out of the box at proposing solutions and perspectives. Probably close to the top 10% of coaches, but maybe the huge value is in that last 10%.
Also therapy: Probably good, probably better or more constructive than the average friend, though of course there are worries about hard-to-detect sycophancy.
Personal micromanagement: Pretty good.
Having a long-running chat where you ask it “how long will this task take me to complete”, and over time you both calibrate (see the sketch after this list).
A more general scaffolded personal assistant to co-organize your week.
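Here’s a minimal sketch of the bookkeeping behind the calibration chat; the chat version does this implicitly, and the numbers below are made up for illustration:

```python
# Log (predicted, actual) task durations, then derive a correction
# factor to paste back into the long-running chat.
from statistics import median

# (predicted_minutes, actual_minutes) pairs, logged per completed task
history: list[tuple[float, float]] = [(30, 55), (120, 100), (20, 45)]

def correction_factor(pairs: list[tuple[float, float]]) -> float:
    """Median ratio of actual to predicted time; robust to outliers."""
    return median(actual / predicted for predicted, actual in pairs)

def calibrated_estimate(predicted_minutes: float) -> float:
    return predicted_minutes * correction_factor(history)

print(f"Multiply the model's estimates by ~{correction_factor(history):.2f}")
```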
Any use cases I’m missing?
Summarizing documents, and exploring topics I’m no expert in: Super good

I think you probably did this, but I figured it’s worth checking: did you check this on documents you understand well (such as your own writing) and topics you are an expert on?
hahah yes we had ground truth
I think the reason this works is that the AI doesn’t need to deeply understand in order to make a nice summary. It can just put some words together, and my high context with the world will make the necessary connections and interpretations, even if further questioning the AI would lead it to wrong interpretations. For example, it’s efficient at summarizing decision theory papers, even though it’s generally bad at reasoning about decision theory.
Generating new frontier knowledge: As in, given a LW post, generating interesting comments that add to the conversation; or given some notes on a research topic, generating experiment ideas; etc.

Have you tested it on sites/forums other than LW?
Not really, just LW, AI safety papers, and AI safety research notes, which are the topics I’d most be interested in. I’m not sure other forums would be very different, though?
Mind modeling: surprisingly good even out of the box for many famous people who left extensive diaries etc., like Leo Tolstoy.
With some caveats, it’s also good at modeling my own mind based on a very long prompt. Sometimes it is too good: it extracts memories from my memory quicker than I do in normal life.
Here’s a very specific workflow that I get the most use out of, if you want to try it:
1. Iterate a “research story” with Claude or ChatGPT, prompting it to take on the personas of experts in that specific field.
2. Do this until you have a shared vision.
3. Then ask it to generate a set of questions for Elicit to create a research report from.
4. Run the prompt through Elicit and create a systematic lit-review breakdown of the task.
5. Download all of the related PDFs (I’ve got some scripts for this; a stand-in sketch is below).
6. Put all of the PDFs into Gemini 2.5 Pro, since it’s got a great context window and great utilisation of that window.
7. Have the Claude instance from before frame a research paper and have Gemini write the background and methodology, and voilà: you’ve got yourself some pretty good thoughts and a really good environment to explore more ideas in.
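A minimal stand-in for the step-5 scripts (not the commenter’s actual code), assuming the lit review’s paper URLs have been exported to a plain-text file; the filenames are placeholders:

```python
# Fetch each PDF from a plain-text list of URLs (one URL per line).
from pathlib import Path

import requests

def download_pdfs(url_file: str, out_dir: str = "papers") -> None:
    out = Path(out_dir)
    out.mkdir(exist_ok=True)
    for i, url in enumerate(Path(url_file).read_text().splitlines()):
        url = url.strip()
        if not url:
            continue
        resp = requests.get(url, timeout=30)
        resp.raise_for_status()
        (out / f"paper_{i:03d}.pdf").write_bytes(resp.content)

download_pdfs("elicit_urls.txt")  # hypothetical export from step 4
```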
You’re saying Gemini 2.5 Pro seems better at generating frontier knowledge than o3?
I’m finding G2.5P pretty useful for discussing research and theories, but I haven’t tried o3 nearly as much for the same purpose.