I am part of the team deciding whether the largest NHS Trust adopts certain AI tools
I very much hope that there are also doctors, nurses, administrators, and other relevant roles on that team. Either way, any tool selection should involve a pilot process, with side-by-side comparisons of results from several options using known past, present, and expected future use cases. The outputs should also be evaluated independently by multiple people with different backgrounds and roles.
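If it helps, the side-by-side, independently rated part can be made quite mechanical. Here is a minimal Python sketch, purely my own illustration rather than anything from a real pilot, of a blinded comparison harness; the tool names, use cases, and rating columns are hypothetical placeholders to swap for your own.

```python
# Minimal sketch of a blinded side-by-side pilot comparison: every candidate
# tool answers the same use cases, and the outputs are shuffled and relabelled
# so raters cannot tell which tool produced which answer.
import csv
import random

# Hypothetical candidate tools: name -> function(question) -> answer text.
# In a real pilot these would wrap the actual products under evaluation.
TOOLS = {
    "tool_a": lambda q: f"[tool_a placeholder answer to: {q}]",
    "tool_b": lambda q: f"[tool_b placeholder answer to: {q}]",
}

# Known past, present, and expected future use cases, phrased as concrete tasks.
USE_CASES = [
    "Summarise current guidance on sepsis recognition for a ward induction pack.",
    "Draft a lay summary of a discharge letter for a patient with low health literacy.",
]

def build_blinded_sheet(path="blinded_ratings.csv", seed=0):
    """Write one CSV row per anonymised output for raters to score independently."""
    rng = random.Random(seed)
    rows, answer_key = [], {}            # answer_key maps blind_id -> real tool name
    for case_id, question in enumerate(USE_CASES):
        outputs = [(name, fn(question)) for name, fn in TOOLS.items()]
        rng.shuffle(outputs)             # hide any consistent ordering of tools
        for i, (name, answer) in enumerate(outputs):
            blind_id = f"case{case_id}_output{i}"
            answer_key[blind_id] = name
            rows.append({"blind_id": blind_id, "question": question, "answer": answer,
                         "accuracy_1to5": "", "safety_concerns": "", "rater_role": ""})
    with open(path, "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=rows[0].keys())
        writer.writeheader()
        writer.writerows(rows)
    return answer_key                    # only unblind after all ratings are in

if __name__ == "__main__":
    key = build_blinded_sheet()
    print(f"Wrote blinded rating sheet with {len(key)} outputs to score.")
```

Each rater fills in their own copy of the sheet, and the answer key is only opened once everyone has scored, so nobody's impression of a vendor colours the ratings.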
I’m going to assume the tools you’re considering are healthcare-specific and advertise themselves as being compliant with any relevant UK laws. If so, what do the providers claim about how they can and should be used, and how they shouldn’t? Do the pilot results bear that out? If not, then you really do need to understand how the tools work, what data goes where, and the like.
I do think the value and potential are real here. As a non-professional private citizen, I have used and do use LLM chat interfaces to gather information and advice about my own personal health and that of my pets (or other loved ones, when asked). They have been extremely useful for helping me learn things that many more hours of manual searching previously failed to find, and for acting as an idea generator and planning collaborator. But I have put in a lot of effort to develop a system prompt and prompting strategies that reduce sycophancy, lying, hallucination, and fabrication, and that promote clear, logical reasoning and explicit probability estimates for inferences and conclusions. I compare outputs of different LLMs to one another. I ask for and check sources. And I don't act on anything that has a significant probability of being dangerous without consulting real people I trust to tell me if I'm being an idiot. On the other hand, I've seen the results other people get when they don't put in that much effort to using LLMs well, and those results are either very generic or in various ways just plain bad.
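To make the cross-checking part concrete, here is a minimal sketch, my own illustration and not any particular product's workflow: the same question and the same sceptical system prompt go to more than one model, and the answers are read side by side with their cited sources chased up. The model names and the call_model helper are hypothetical placeholders for whichever chat APIs you actually use.

```python
# Sketch of the cross-checking habit described above (illustrative only).
# `call_model` is a hypothetical stand-in for a real chat API call.

SYSTEM_PROMPT = (
    "You are a cautious research assistant. Do not flatter me or agree by "
    "default. Give a rough probability for each non-trivial claim, say "
    "'I don't know' when you don't, and cite checkable sources for every "
    "factual claim."
)

def call_model(model_name: str, system_prompt: str, question: str) -> str:
    """Placeholder: replace with a real API call to the model you use."""
    return f"[{model_name}'s answer to: {question}]"

def cross_check(question: str, models=("model_a", "model_b")) -> dict:
    """Ask several models the same question so disagreements are easy to spot."""
    return {m: call_model(m, SYSTEM_PROMPT, question) for m in models}

if __name__ == "__main__":
    answers = cross_check("What does the evidence say about treatment X for condition Y?")
    for model, answer in answers.items():
        print(f"--- {model} ---\n{answer}\n")
    # Where the answers disagree, or a cited source cannot be found, treat the
    # claim as unverified and check it with a real person you trust.
```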
I suspect, though I don’t know, that the ceiling of what results a skilled user can achieve using a frontier LLM is probably higher than what most dedicated healthcare-focused tools can do, but the floor is very likely to be much, much worse.
This may or may not be useful, but in terms of training, I have a few phrases and framings I keep repeating to people that seem to resonate. I don’t remember where I first heard them.
“Think of AI as an infinite army of interns. Five years ago they were in elementary school. Now they’ve graduated college. But they have no practical experience and only have the context you explicitly give them. Reasoning models and Deep Research/Extended Thinking are like giving the interns more time to think with no feedback from you.”
“Don’t assume what you were doing and seeing with AI 3+ months ago has any bearing on what AI can or will do today. Models and capabilities change, even with the same designation.”
“You’re not talking to a person. Your prompt is telling the AI who to pretend to be as well as what kind of conversation it is having.” (In the personal contexts mentioned above, I would include things at the start of a prompt like “You are a metabolic specialist with 20 years of clinical and research experience” or “You are a feline ethologist.”)
Related: “If your prompt reads like a drunken text, the response will be the kind of response likely to be the reply to a drunken text.”
Also related: “Your system prompt tells the AI who you are, and how you want it to behave. This can be as detailed as you need, and should probably be paragraphs or pages long.” (Give examples; one possible sketch follows after these phrases.)
“The big AI companies are all terrible at making their models into useful products. Never assume they’ve put reasonable effort into making features work the way you’d expect, or that they’ve fixed bugs that would require a few minutes of coding effort to fix.”
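Since the note above says to give examples, here is one illustrative shape for a longer system prompt, kept as a Python snippet only so it is easy to store and reuse. Every detail in it, the persona, the rules, the format, is a hypothetical placeholder to adapt, not a vetted template.

```python
# A hypothetical illustration of the "paragraphs-long system prompt" idea above.
# Everything in it is a placeholder to adapt locally, not a recommended template.

SYSTEM_PROMPT = "\n\n".join([
    # Who you are, so the model stops guessing at your background.
    "About me: <your role, background, and what you already know, in a few "
    "sentences, so basics are not re-explained to you>.",
    # Who the model should act as for this kind of work.
    "Your role: act as <a specialist relevant to the task, e.g. an experienced "
    "evidence-synthesis librarian and methodologist>.",
    # Behavioural rules aimed at sycophancy, fabrication, and vague reasoning.
    "Behaviour: do not flatter me or agree by default; challenge weak reasoning; "
    "never invent citations; say when you are unsure; give a rough probability "
    "for each non-trivial inference or conclusion.",
    # Output conventions, so results stay consistent across sessions.
    "Format: short paragraphs; list your assumptions first; list sources last.",
])

if __name__ == "__main__":
    print(SYSTEM_PROMPT)
```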
I very much hope that there are also doctors, nurses, administrators, and other relevant roles on that team. Either way, any tool selection should involve a pilot process, with side-by-side comparisons of results from several options using known past, present, and expected future use cases. The outputs should also be evaluated independently by multiple people with different backgrounds and roles.
For some things it will. But for some things, like tools classed as ‘research support’ or ‘point-of-care reference tools’, or more generally as an information resource, it's up to the library, just as we make the decisions about which journals we subscribe to. I gather that before I started there used to be more in the way of meaningful consultation with people in other roles, but as our staffing has been axed, those outreach relationships have fallen by the wayside.
I’m going to assume the tools you’re considering are healthcare-specific and advertise themselves as being compliant with any relevant UK laws. If so, what do the providers claim about how they can and should be used, and how they shouldn’t? Do the pilot results bear that out? If not, then you really do need to understand how the tools work, what data goes where, and the like.
It would be great if that were a reasonable assumption. Every one I've evaluated so far has turned out to be some kind of ChatGPT with a medical-academic research bow on it. Some of them are restricted to a walled garden of trusted medical sources instead of having access to the open internet.
Part of the message I think I oughta promote is that we should hold out for something specific. The issue is that when it comes to research, it really is up to people what they use: there's no real oversight, and there are no regulations to stop them the way there would be if they were provably putting patient information into these tools. But they're still going to bring what they “learn” into practice, as well as pollute the commons (since we know at this point that peer review doesn't do much, and it's mostly people's academic integrity keeping it all from falling apart).
Part of what these companies are selling their GPTs as being able to replace is exactly the sort of checks and balances that stop the whole medical research commons from being nothing but bullshit: critical appraisal and evidence synthesis.
I suspect, though I don’t know, that the ceiling of what results a skilled user can achieve using a frontier LLM is probably higher than what most dedicated healthcare-focused tools can do, but the floor is very likely to be much, much worse.
That's about what I thought, yeah.
Thank you for the phrases, they seem useful.