I'll be finishing my interventional radiology fellowship this year. I remember in 2016 when Geoffrey Hinton said, "We should stop training radiologists now," the radiology community was aghast and in denial. My undergrad and master's were in computer science, and I felt, "yes, that's about right."
If you were starting a diagnostic radiology residency, including intern year and fellowship, you'd just be finishing now. How can you really think that "computers can't read diagnostic images" if models such as this can describe a VGA connector outfitted with a Lightning cable?
As another radiologist, I'm not sure how you can say this with a straight face. If anything, the minimal progress that has been made since Hinton made this claim should be encouraging people to pursue radiology training. As with other areas of medicine that have better AI (interpreting ECGs, for example), all this will do is make our lives easier. AI is not an existential threat to radiology (or pathology for that matter, which is an easier problem to solve than medical imaging).
1. Radiology =/= interpreting pixels and applying a class label.
2. Risk and consequences of misclassifying T-staging of a cancer =/= risk of misclassifying a VGA connector.
3. Imaging appearance overlap of radiological findings >>>>>>>>>> imaging appearance overlap of different types of connectors (e.g. infection and cancer can look the same; we make educated guesses on a lot of things considering many patient variables, clinical data, and prior imaging). You would need a multi-modal model enriched with a patient knowledge graph to try to replicate this, and while problems like this are being worked on, we are nowhere close enough for this to be a near-term threat. We haven't even solved NLP in medicine, let alone imaging interpretation!
4. Radiologists do far more than interpret images, unless you're in a tele-radiology eat-what-you-kill sweatshop. This includes things like procedures (i.e. biopsies and drainages for diagnostic rads) and multidisciplinary rounds/tumor boards.
I totally understand your point #4 - obviously ChatGPT can't do procedures, but I interpreted GP's post as "this is why I did a fellowship in interventional radiology instead of being a (solely) diagnostic radiologist."
But, at the end of the day, diagnostic radiology is about taking an input set of bytes and transforming that to an output set of bytes - that is absolutely what generative AI does excellently. When you said "I'm not sure how you can say this with a straight face?", I couldn't understand if you were talking about now, or what the world will look like in 40 years. Because someone finishing med school now will want to have a career that lasts about 40 years. If anything, I think the present day shortage of radiologists is due to the fact that AI is not there yet, but smart med students can easily see the writing on the wall and see there is a very, very good chance AI will start killing radiology jobs in about 10 years, let alone 40.
As the simplest analogy, we still pay cardiologists to interpret an ECG that comes with a computer readout and is literally a graph of voltages.
First, AI will make our lives much easier, as it will in other industries, but saying it will take 10 years to solve the AI problem for most of diagnostic radiology is laughable. There are many reasons why radiology AI is currently terrible and we don't need to get into them, but let's pretend that current DL models can do it today.
The studies you would need to run to validate this across multiple institutions, while making sure population drift doesn't happen (see the Epic sepsis prediction AI's failure in 2022) and validating long-term benefits (assuming all of this is going right), will take 5-10 years. It'll be another 5-10 years if you aggressively lobby to get this through legislation and deal with the insurance/liability problem.
Separately, we have to figure out how we set up the infrastructure for this presumably very large model in the context of HIPAA.
I find it hard to believe that all of this will happen in 10 years when, once again, we still don't have models that come close to being good enough today. What will likely happen is that it will be flagging nodules for me so I don't have to look as carefully at the lungs, and we will still need radiologists, like we need cardiologists to read a voltage graph.
Radiology is a lot about recognizing what is normal, what is 'normal for this patient', and what we should care about, while staying up to date on the literature and considering the risks/benefits of calling an abnormality vs not calling one. MRI (other than neuro) is not that old of a field; we're discovering new things every year, and pathology is also evolving. Saying it's a solved problem of bits and bytes is like saying ChatGPT will replace software engineers in 10 years because it's just copy-pasting code from SO or GH and importing libraries. Sure, it'll replace the crappy coders and boilerplate, but you still need engineers to put the pieces together. It will also replace crap radiologists who just report every pixel they see without carefully interrogating things and the patient chart as relevant.
I agree that the level of risk/consequence is higher for radiology misses, but I wonder if radiologists are already missing things because of simplification for human feasibility. Things like LI-RADS and BI-RADS are so simple from a computer science perspective. I wouldn't even call them algorithms, just simple checkbox decision making.
This tendency to simplify is everywhere in radiology: when looking for a radial head fracture, we're taught to examine the cortex for discontinuities, look for an elbow joint effusion, evaluate the anterior humeral line, etc. But what if there's some feature (or combination of features) that is beyond human perception? Maybe the radioulnar joint space is a millimeter wider than it should be? Maybe the soft tissues are just a bit too dense near the elbow? Just how far does the fat pad have to be displaced to indicate an effusion? Probably the best "decision function" is a non-linear combination of all these findings. Oh, but we only have 1 minute to read the radiograph and move on to the next one.
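(Purely to illustrate the "non-linear combination of findings" idea, here's a toy sketch in Python with made-up feature names and random stand-in labels; it's the shape of the approach, not a validated model.)

```python
# Toy sketch only: learn a non-linear decision function over the same findings
# a radiologist checks one at a time. Features and labels are synthetic.
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
n = 500
X = np.column_stack([
    rng.normal(2.0, 0.3, n),   # hypothetical: radioulnar joint space (mm)
    rng.normal(0.5, 0.2, n),   # hypothetical: fat pad displacement (cm)
    rng.normal(40.0, 5.0, n),  # hypothetical: peri-elbow soft tissue density
    rng.normal(0.0, 0.5, n),   # hypothetical: anterior humeral line offset (mm)
])
y = rng.integers(0, 2, n)      # stand-in labels; real labels would need follow-up imaging as ground truth

clf = GradientBoostingClassifier()  # learns interactions between findings
print(cross_val_score(clf, X, y, scoring="roc_auc", cv=5).mean())  # ~0.5 here, since labels are random
```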
Unfortunately, as someone noted below, advances in medicine are glacially slow. I think change is only going to come in the form of lawsuits. Imagine a future where a patient and her lawyer can get a second-opinion from an online model, "Why did you miss my client's proximal scaphoid fracture? We uploaded her radiographs and GPT-4 found it in 2 seconds." If and when these types of lawsuits occur, malpractice insurances are going to push for radiologists to use AI.
Regarding other tasks performed by radiologists, some radiologists do more than dictate images, but those are generally the minority. The vast majority of radiologists read images for big money without ever meeting the patient or the provider who ordered the study. In the most extreme case, radiologists read studies after the acute intervention has been performed. This happens a lot in IR - we get called about a bleed, review the imaging, take the patient to angiography, and then get paged by diagnostic radiology in the middle of the case.
Orthopedists have already wised up to the disconnect between radiology reimbursement and the work involved in MR interpretation versus surgery. At least two groups, including the "best orthopedic hospital in the country," employ their own in-house radiologists so that they can capture part of the imaging revenue. If GPT-4 can offer summative reads without feature simplification, and prior to intervention, why not have the IR or orthopedist sign off the GPT-4 report?
1a. Seeing as we know the sensitivity, specificity, and inter-rater reliability of LI-RADS and BI-RADS, we can easily determine how many cases we are missing. Your suggestion that we are potentially 'missing' cases with these two algorithms is a misunderstanding of the point of both: with LI-RADS we are primarily optimizing specificity to avoid biopsy and establish a radiologic diagnosis of HCC. With BI-RADS it's a combination of both, and we have great sensitivity. We don't need to be diagnosing more incidentalomas.
1b. With respect to the simplicity of LI-RADS, if you are strictly following the major criteria only, it's absolutely simple. It was designed to assist the general radiologist so they do not have to hedge (LR-5 = cancer). If you are practicing in a tertiary care cancer center (i.e. one providing locoregional therapy and transplant, where accurate diagnosis matters), it is borderline negligent not to apply ancillary features (while optional, LR-4 triggers treatment, as you would be experienced with in your practice). Ancillary features, and accurate lesion segmentation over multiple sequences that are not accurately linked on the Z-axis, remain an unsolved problem, and integrating findings across them is non-trivial from a CS perspective (I too have a CS background, and while my interest is in language models, my colleagues working on multi-sequence segmentation have had less than impressive results even using the latest techniques with diffusion models, although better than U-Net; refer to Junde Wu et al. from Baidu for their results). As you know, in medicine it is irrefutable that increased/early diagnosis does not necessarily lead to improved patient outcomes; there are several biases that result from this, and in fact we have routinely demonstrated that overdiagnosis results in harm for patients and that early diagnosis does not benefit overall survival or mortality.
2a. Again, a fundamental misunderstanding of how radiology and AI work, and in fact the reason why the two clinical decision algorithms you mentioned were developed. First off, we generally have an overdiagnosis problem rather than an underdiagnosis one. You bring up a specifically challenging radiographic diagnosis (scaphoid fracture); if there is clinical suspicion for scaphoid injury, it would be negligent not to pursue advanced imaging. Furthermore, let us assume for your hypothetical that GPT-4 or any ViLM has enough sensitivity (in reality they don't; see Stanford AIMI's and Microsoft's separate work on chest x-rays for more detail): you are ignoring specificity. Overdiagnosis HARMS patients.
2b. Sensitivity and specificity are always tradeoffs by strict definition (see the toy sketch at the end of this comment). For your second example of radial head fracture, every radiologist should be looking at the soft tissues; it takes 5 seconds to window if the bone looks normal, and I am still reporting these within 1-2 minutes. Fortunately, this can also be clinically correlated, and a non-displaced radial head fracture that is 'missed' or 'occult' can be followed up in 1 week if there is persistent pain, with ZERO (or almost zero) adverse outcomes as management is conservative anyway. We do not have to 'get it right' for every diagnosis on every study the first time; that's not how any field of medicine works and, again, would be detrimental to patient outcomes. All of the current attempts at AI readers have demonstrably terrible specificity, hence why they are not heavily used even in research settings; it's not just inertia. As an aside, the anterior humeral line is not a sign of radial head fracture.
2c. Additionally, if you were attempting to build such a system, a ViLM is hardly the best approach. It's just sexy to say GPT-4, but 'conventional' DL/ML is still the way to go if you have a labelled dataset, and it has higher accuracy than some abstract zero-shot model not trained on medical images.
3. Regarding lawsuits, we've had breast computer-aided diagnosis for a decade now and there have been no lawsuits, at least none major enough to garner attention. It is easy to explain why: 'I discounted the AI finding because I reviewed it myself and disagreed.' In fact, that is the American College of Radiology guidance on using breast CAD. A radiologist should NOT change their interpretation solely based on a CAD finding they find discordant, due to the aforementioned specificity issues and the harms of overdiagnosis. What you should do (and what those of us practicing in these environments do) is give a second look to the areas identified by CAD.
4. Regarding other tasks, this is unequivocally changing. In most large centres you don't have IR performing biopsies. I interviewed at 8 IR fellowships and 4 body imaging fellowships, and in all of those this workload was done by diagnostic radiologists. We also provide fluoroscopic services; I think you are referring to a dying trend where IR does a lot of them. Cleveland Clinic actually has nurses/advanced practice providers doing this. Biopsies are a core component of diagnostic training per ACGME guidelines. It is dismissive to say the vast majority of radiologists read images for big money without ever reviewing the clinical chart; I don't know any radiologist who would read a complex oncology case without reviewing treatment history. How else are you assessing for complications without knowing what's been done? I don't need to review the chart on easy cases, but that's also not what you want a radiologist for. You could sign a normal template for 90% of reports, or 98% of CT pulmonary embolism studies, without looking at the images and be correct. That's not why we're trained and do fellowships in advanced imaging; it's for the 1% of cases that require competent interpretation.
5. Regarding orthopedists, the challenge here is that it is hard for a radiologist to provide accurate enough interpretation, without the clinical history, for the single or few pathologies that a specific orthopedist deals with. For example, a shoulder specialist looks at the MRI for every one of their patients in clinic. As a general radiologist, my case volumes are far lower than theirs. My job on these reports is to triage patients to the appropriate specialty (i.e. flag the case as abnormal for referral to ortho), who can then correlate with physical exam maneuvers and adjust their ROC curves based on arthroscopic findings. I don't have that luxury. Fortunately, that is also not why you employ an MSK radiologist, as our biggest role is contributing to soft tissue and malignancy characterization. I've worked with some very renowned orthopedists in the US, and as soon as you get out of their wheelhouse of the 5 ligaments they care about, they rely heavily on our interpretations.
Additionally, imaging findings in MSK do not equal disease. In a recent study of asymptomatic individuals, >80% had hip labral tears. This is why the clinical context is so important. I don't have numbers on soft tissue thickening as an isolated sign of radial head fracture, but it would be of very low yield; in the very infrequent case of a radial head fracture without joint effusion, I mention the soft tissues and, as above, follow up in 1 week to see evolution of the fracture line if it was occult. That's a way better situation than immobilizing every child because of a possible fracture due to soft tissue swelling.
With respect to the best orthopaedic hospital in the country, presumably referring to HSS, they employ radiologists because that is the BEST practice for the BEST patient outcomes/care. It's not solely/mostly because of the money. EVERY academic/cancer center employs MSK radiologists.
6. Respectfully, the reason to not have IR sign off the GPT-4 report is because you are not trained in advanced imaging of every modality. See point 1b, if you aren't investing your time staying up to date on liver imaging because you are mastering your interventional craft you may be unaware of several important advances over the past few years.
7. With respect to hidden features, there are better ones to talk about than soft tissue swelling. There is an entire field about this with radiomics and texture analysis, all of the studies on this have been underwhelming except in very select and small studies showing questionable benefit that is very low on the evidence tree.
To summarize, radiology can be very, very hard. We do not train to solely diagnose simple things that a junior resident can pick up (a liver lesion with APHE and washout). We train for the nuanced and hard cases. We also do not optimize for 'accurate' detection on every indication and every study type; there are limitations to each imaging modality, and the consequences of missed/delayed diagnosis vary depending on the disease process being discussed, similarly with overdiagnosis and overtreatment. 'Hidden features' have so far been underwhelming in radiology, or we would use them.
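(Footnote to 2b: a toy numeric illustration of the sensitivity/specificity tradeoff on synthetic scores, nothing to do with any real model or dataset. Moving the operating threshold on the same scores pushes the two in opposite directions.)

```python
# Synthetic scores and labels; only the tradeoff pattern is the point.
import numpy as np

rng = np.random.default_rng(1)
labels = rng.integers(0, 2, 1000)
scores = labels * 0.3 + rng.normal(0.5, 0.25, 1000)   # imperfect "model" output

for thr in (0.4, 0.6, 0.8):
    pred = scores >= thr
    tp = np.sum(pred & (labels == 1)); fn = np.sum(~pred & (labels == 1))
    tn = np.sum(~pred & (labels == 0)); fp = np.sum(pred & (labels == 0))
    sens = tp / (tp + fn)   # lower threshold -> higher sensitivity
    spec = tn / (tn + fp)   # higher threshold -> higher specificity
    print(f"threshold={thr:.1f}  sensitivity={sens:.2f}  specificity={spec:.2f}")
```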
I'm very much a skeptic, but it just hit me, what about blood work?
A scattered history of labs probably provides an opportunity to notice something early, even if you don't know what you are looking for. But humans are categorically bad at detecting complex patterns in tabular numbers. Could routinely feeding people's lab history into a model serve as a viable early warning system for problems no one thought to look for yet?
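(To make the question concrete, here's roughly what I'm imagining, as a sketch on synthetic data with an off-the-shelf anomaly detector; whether flags like these would actually improve outcomes is exactly what I'm asking.)

```python
# Sketch only: treat each patient's lab history as a feature vector and flag
# outliers without specifying a disease. Data and features are synthetic.
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(42)
# rows = patients; columns = e.g. latest value and 12-month slope for a few analytes
X = rng.normal(size=(1000, 8))
X[:5] += 4.0                       # a few synthetic "abnormal trajectories"

detector = IsolationForest(contamination=0.01, random_state=0).fit(X)
flags = detector.predict(X)        # -1 = flagged as anomalous
print(np.where(flags == -1)[0][:10])
```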
My advice to anyone trying to tackle an AI problem in medicine is ask yourself what problem are you solving?
We have established and validated reference ranges for bloodwork, there is also inherent lab error and variability in people's bloodwork (hence a reference range).
People < 50 should not be having routine bloodwork, and routine blood work on annual check-ups in older patients are very easy to interpret and trend.
Early warning systems need to be proven to improve patient outcomes. We have a lot of hard-learned experience in medicine where early diagnosis = bad outcomes for patients or no improved outcomes (lead-time bias).
If an algorithm somehow suspected pancreatic cancer based on routine labs, what am I supposed to do with that information? Do I schedule every patient for an endoscopic ultrasound with its associated complication rates? Do I biopsy something? What are the complication rates of those procedures versus how many patients am I helping with this early warning system?
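(To put rough numbers on it: even with a generous hypothetical detector, Bayes' rule says almost every flag for a rare disease is a false positive, which is exactly where the downstream-harm question comes from. The sensitivity, specificity, and prevalence below are assumptions for illustration only.)

```python
# Back-of-envelope Bayes calculation with assumed numbers.
sensitivity = 0.90          # assumed detector sensitivity
specificity = 0.95          # assumed detector specificity
prevalence  = 0.0001        # assumed ~10 per 100,000 screened

p_flag = sensitivity * prevalence + (1 - specificity) * (1 - prevalence)
ppv = sensitivity * prevalence / p_flag
print(f"P(flagged) = {p_flag:.4f}, PPV = {ppv:.3%}")   # PPV well under 1%
```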
In some cases (screening mammography, colonoscopy), screening demonstrably improved patient outcomes, but it took years to decades to gather this evidence. In other cases (ovarian ultrasound screening), it led to unnecessary ovary removal and harmed patients. We have to be careful about what outcomes we are measuring and not rely on 'increased diagnosis' as the end goal.
I’m just a parent, not a medical professional, whose infant went through a lot of blood work with multiple parameters very out of range. It took five or six primary care physicians, six months, and probably twenty-five labs to figure it out. The helpful recommendation in that case would have been something like “given the trend & relationship of these six out-of-range parameters, these other three specific blood tests could support or reject conditions X, Y, and Z”, e.g. moving beyond the CBC and so forth.
Perhaps it’s simple for most patients, but we learned a large number of the markers are really just second order effects. For example, concerning readings on your liver enzymes can mean a million different things, and are only useful when integrated with other data to develop a hypothesis on the root cause.
I agree with your point: liver enzymes (or any medical tests) don't have relevance without specific pre-test probabilities and diagnoses in mind.
But what you're arguing we should do is what physicians are taught to do / should do. We also have plenty of great point-of-care resources (UpToDate being the most popular) that provide current evidence-based recommendations, written by experts, for investigating abnormal bloodwork, so you really shouldn't be doing arbitrary tests.
Without knowing the details of your case I can't comment very well, nor is this my area of expertise, but a child with multiple persistently abnormal lab values seems out of the scope of most primary care physicians, and why so many of them? Are you somewhere where you weren't sent to a paediatrician, or don't have access to paediatric hematologists/hepatologists? Some conditions unfortunately involve a lot of investigation.
There are obviously also bad doctors. I don't mean to suggest every one of us is good (just like any profession). AI would be a great tool to augment physicians, but we just have to be careful about what outcome we are trying to achieve. Diagnosis isn't a linear thing like increasing transistor density; it comes with tradeoffs of overdiagnosis and harm.
It’s more that I have a good understanding of both domains, as a CS/rad actively conducting research in the field, with practical experience of the challenges involved in what's being fearmongered about.
Radiology is not the lowest hanging fruit when you talk about AI taking over jobs.
What do you think is going to happen to tech hiring when an LLM is putting out production-ready code (or refactoring legacy code)? I would be far more worried (in reality, learning new/advanced skills) if I were a software engineer right now, where there isn't a data or regulatory hurdle to cross.
As with every other major advancement in human history, people’s job descriptions may change but won’t eliminate the need.
With that said people are also dramatically overstating the power of LLMs which appear very knowledgeable at face value but aren’t that powerful in practice.
It all comes down to labelled data. There are millions of images of VGA connectors and Lightning cables on the internet with descriptions, from which CLIP-style models could learn to recognize them relatively reliably. On the other hand, I'm not sure that amount of data is available for AI training in medicine. Especially if the diagnosis is blinded, it will be even harder for the AI model to reliably differentiate between conditions, making cross-disease diagnosis hard. Not to mention the risk and reliability requirements of such tasks.
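(For anyone curious what that looks like in practice, here's a rough sketch of zero-shot matching with the public CLIP weights via Hugging Face. The file name is a hypothetical placeholder; the point above is precisely that no comparable captioned dataset exists for medical images.)

```python
# Zero-shot image/text matching with public CLIP weights (the connector case,
# not a medical one).
from PIL import Image
import torch
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

image = Image.open("connector.jpg")            # hypothetical local file
texts = ["a VGA connector", "a Lightning cable", "an HDMI cable"]

inputs = processor(text=texts, images=image, return_tensors="pt", padding=True)
with torch.no_grad():
    probs = model(**inputs).logits_per_image.softmax(dim=-1)
print(dict(zip(texts, probs[0].tolist())))
```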
As someone who has worked at a Radiology PACS with petabytes of medical images under management, this is 100% accurate.
You might have images, but not the diagnoses to train the AI with.
In addition, there are compliance reasons, just because you manage that data doesn't mean that you can train an AI on it and sell it, unless of course you get explicit permission from every individual patient (good luck).
I do believe that with enough effort we could create AI specialist doctors, and allow the generalist family doctor to make a comeback, augmented with the ability to tap into specialist knowledge.
Technology in the medical industry is extremely far behind modern progress, though; CT images are still largely 512 by 512 pixels. It's too easy to get bogged down with legacy support to make significant advancements and stay on the cutting edge.
We don't even have the images needed, especially for unsupervised learning.
A chest x-ray isn't going to do the model much good for interpreting a prostate MRI.
Add in heterogeneity in image acquisition, sequence labelling, regional and site-specific disease prevalence, changes in imaging interpretation, and most importantly class imbalance (something like >90% of imaging studies are normal), and it is really, really hard to come up with a reasonably high-quality dataset with enough cases (from personal experience trying).
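(For the class imbalance point specifically, the standard mitigation is easy enough to write down; the hard part is everything else in that list. A sketch with synthetic labels, assuming a PyTorch training setup:)

```python
# Oversample the rare positive studies during training. Labels here are
# synthetic; this does nothing about label quality, heterogeneity, or
# prevalence shift.
import torch
from torch.utils.data import TensorDataset, DataLoader, WeightedRandomSampler

labels = torch.cat([torch.zeros(950), torch.ones(50)])       # ~95% "normal"
features = torch.randn(1000, 16)                             # stand-in features
dataset = TensorDataset(features, labels)

class_counts = torch.bincount(labels.long())
sample_weights = 1.0 / class_counts[labels.long()].float()   # rarer class gets higher weight
sampler = WeightedRandomSampler(sample_weights, num_samples=len(dataset), replacement=True)

loader = DataLoader(dataset, batch_size=64, sampler=sampler)
xb, yb = next(iter(loader))
print(yb.mean())                                             # roughly balanced batches
```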
With respects to training a model, IRB/REB (ethics) boards can grant approval for this kind of work without needing individual patient consent.
It's the same thing. Predict the next pixel, or the next token (same way you handle regular images), or infill missing tokens (MAE is particularly cool lately). Those induce the abstractions and understanding which get tapped into.
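(Concretely, the MAE-style objective is just: patchify, hide most of the patches, train the model to reconstruct the hidden ones. A minimal sketch of the masking step only, not a full masked autoencoder:)

```python
import torch

def random_mask_patches(images, patch=16, mask_ratio=0.75):
    """images: (B, C, H, W) -> visible patches and the indices of masked ones."""
    B, C, H, W = images.shape
    patches = images.unfold(2, patch, patch).unfold(3, patch, patch)   # (B, C, H/p, W/p, p, p)
    patches = patches.permute(0, 2, 3, 1, 4, 5).reshape(B, -1, C * patch * patch)
    n = patches.shape[1]
    keep = int(n * (1 - mask_ratio))
    idx = torch.rand(B, n).argsort(dim=1)                              # random permutation per image
    visible = torch.gather(patches, 1, idx[:, :keep, None].expand(-1, -1, patches.shape[-1]))
    return visible, idx[:, keep:]                                      # encoder sees only `visible`

vis, masked_idx = random_mask_patches(torch.randn(2, 1, 224, 224))
print(vis.shape, masked_idx.shape)
```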
It's incredibly hard to disambiguate and accurately label images using the reports (area of my research).
Reports are also not analogous to ground truth labels, and you don't always have histopathologic/clinical outcomes.
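(A toy example of the labeling problem: a naive keyword labeler trips over negation and hedging, which real reports are full of. Purpose-built NLP labelers handle some of this and still make plenty of errors.)

```python
# Naive keyword labeling of report text, to show why report-derived labels are noisy.
import re

def naive_label(report: str, finding: str = "pneumothorax") -> bool:
    return bool(re.search(finding, report, flags=re.IGNORECASE))

reports = [
    "Small apical pneumothorax on the right.",                 # true positive
    "No evidence of pneumothorax.",                            # negation -> false positive
    "Pneumothorax cannot be excluded; recommend follow-up.",   # hedging -> ambiguous
]
print([naive_label(r) for r in reports])   # [True, True, True] -- all flagged
```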
You also have drift in knowledge and patient trends; people are on immunotherapy now and we are seeing complications/patterns we didn't see 5 years ago. A renal cyst that would have been followed up to exclude malignancy before 2018 is now definitively benign, so those reports are not directly usable.
You would have to non-trivially connect this to a knowledge base of some form to disambiguate, one that doesn't currently exist.
And then there's hallucination.
Currently, if you could even extract actionable findings, accurately summarize reports, and integrate this with workflow, you could have a billion-dollar company.
Nuance (now owned by Microsoft) can't even autofill my dictation template accurately using free-text to subject headings.
I'm curious as to what your take on all this recent progress is Gwern. I checked your site to see if you had written something, but didn't see anything recent other than your very good essay "It Looks Like You’re Trying To Take Over The World."
It seems to me that we're basically already "there" in terms of AGI, in the sense that it seems clear all we need to do is scale up, increase the amount and diversity of data, and bolt on some additional "modules" (like allowing it to take action on its own). Combine that with a better training process that might help the model do things like build a more accurate semantic map of the world (sort of the LLM equivalent of getting the fingers right in image generation) and we're basically there.[1]
Before the most recent developments over the last few months, I was optimistic on whether we would get AGI quickly, but even I thought it was hard to know when it would happen since we didn't know (a) the number of steps or (b) how hard each of them would be. What makes me both nervous and excited is that it seems like we can sort of see the finish line from here and everybody is racing to get there.
So I think we might get there by accident pretty soon (think months, not years), since every major government and tech company is likely racing to build bigger and better models (or will be soon). It sounds weird to say this, but I feel like even as over-hyped as this is, it's still under-hyped in some ways.
Would love your input if you'd like to share any thoughts.
[1] I guess I'm agreeing with Nando de Freitas (from DeepMind) who tweeted back in May 2022 that "The Game is Over!" and that now all we had to do was scale things up and tweak: https://twitter.com/NandoDF/status/1525397036325019649?s=20
Perhaps. I'm admittedly not an expert in identifying use cases for unsupervised learning yet. My hunch would be that the lack of labels would require orders of magnitude more data and training to produce an equivalent model, which itself will be a sticking point for health tech companies.
Eventually it's going to be cheap enough to drop by Tijuana for a $5 MRI that even the cartel has to react.
Also, even within the US framework, there's pressure. A radiologist can rubberstamp 10x as many reports with AI-assistance. That doesn't eliminate radiology, but it eliminates 90% of the radiologists we're training.
>drop by Tijuana for $5 MRI that even the cartel has to react.
Not if it's an emergency.
> but it eliminates 90% of the radiologists we're training.
Billing isn't going to change. Billing is a legal thing, not a supply/demand thing.
But yes, I fully plan to utilize travel medicine and potentially black-market prescription drugs in my lifetime if there isn't meaningful reform for the middle/upper class.
In 2015, I took an intro cognitive science class in college. The professor listed some natural language feats that he was certain AI would never accomplish. It wasn't long before average people were using AI for things he predicted were impossible.
I think it will be radiologists signing-off auto-generated reports, with less reimbursement per study. It'll likely result in more work for diagnostic radiologists to maintain their same salary levels.
It will take a very long time for this to happen, probably decades. Cardiologists are still paid to finalize ECG reports 3 days after a STEMI.
I've worked at places with AI/CAD for lung nodules, mammo and stroke and there isn't even a whisper at cutting fee codes because of AI efficiency gains at the moment.
N.B. I say this as a radiologist who elected not to pursue an interventional fellowship because I see reimbursement for diagnostic work skyrocketing with AI due to increases in efficiency and stagnant fee codes.
It’s hard to imagine this not happening in the next five years. It just depends on who is prepared to take on the radiologists to reduce their fee codes. Speaking as a 2nd-year radiology resident in Australia.
None, unless “Open”AI really opens up about how and if their LLM can actually interpret the images like in their marketing material. We’re talking about medicine and a ton of regulations.