Why Foundational LLMs Fail on Healthcare Data: The Lack of Context-Engineering Expertise

by Chandi Kodthiwada, VP, Product Management

November 12, 2025

Biopharma is eager to reap all the advantages that AI offers, and several top 20 companies are actively exploring or piloting the use of foundational LLMs for select processes. Will the complexity of healthcare data thwart success?

General large language models (LLMs) have demonstrated remarkable capabilities across many domains, so retrofitting them for healthcare analytics might appear to be an efficient way to bring AI into your organization. However, the many complexities and nuances of healthcare data make this a formidable challenge. Success hinges on one critical factor: context-engineering expertise, that is, knowing which context to supply to the model to produce the best result.

No Substitutes for Healthcare Data Expertise

Context engineering is extremely complex, and its success depends on the knowledge and real-world experience of those building the instruction set. The greater their expertise, the more granular and complete the analysis will be.

Giving a foundational LLM access to the requisite high-quality data sources is not enough. High-quality data is essential, but it is only the foundation. The AI must also be trained to reason like a seasoned healthcare analyst and to “think” critically: synthesizing myriad data, determining what is relevant to each analysis, recommending the best analytical methods, surfacing salient insights, refining its own work, and recommending the most opportune interventions.

Nor is increased processing capacity a remedy for the inherent complexity of healthcare data. In fact, recent research shows that, though foundational models have increased their context window sizes dramatically (allowing them to process more text at once), this “cramming” of more information into these windows actually degrades performance. Models suffer from “attention capacity” limitations: Their ability to maintain focus deteriorates, which leads to two critical problems:

Hallucination where data is missing or ambiguous: Generic AI tends to “fill in” with plausible but potentially incorrect information, creating a dangerous illusion of completeness.
Quality deterioration with information overload: As more content is pushed into the context window, the model’s attention becomes diffused, compromising its ability to identify and prioritize the most relevant clinical information.
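
One way to picture the alternative to “cramming” is to rank candidate material by relevance and pack only the best of it into a fixed budget. The sketch below is a minimal illustration under stated assumptions, not Marmot’s method or any vendor’s API: the `Snippet` type, the keyword-overlap score, the word-count stand-in for tokens, and the `min_score` threshold are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Snippet:
    source: str  # hypothetical label for where the text came from
    text: str


def token_count(text: str) -> int:
    # Crude proxy: whitespace-separated words, not a real tokenizer.
    return len(text.split())


def relevance(question: str, snippet: Snippet) -> float:
    # Toy score: fraction of question terms that also appear in the snippet.
    q_terms = {w.strip(".,?").lower() for w in question.split()}
    s_terms = {w.strip(".,?").lower() for w in snippet.text.split()}
    return len(q_terms & s_terms) / len(q_terms) if q_terms else 0.0


def select_context(question, snippets, budget, min_score=0.2):
    """Keep the most relevant snippets that fit within the budget."""
    scored = [(relevance(question, s), s) for s in snippets]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    chosen, used = [], 0
    for score, snippet in scored:
        if score < min_score:
            break  # everything after this point is even less relevant
        cost = token_count(snippet.text)
        if used + cost <= budget:
            chosen.append(snippet)
            used += cost
    return chosen


question = "time from first symptom to multiple myeloma diagnosis"
snippets = [
    Snippet("web", "Press release about a cardiology conference keynote"),
    Snippet("labs", "Serum protein electrophoresis orders preceding multiple myeloma diagnosis"),
    Snippet("claims", "Days from first bone pain symptom to multiple myeloma diagnosis by age band"),
]
picked = select_context(question, snippets, budget=25)
print([s.source for s in picked])
```

In practice the scoring step is where domain expertise enters: deciding which sources count as relevant for a given clinical question is a judgment that a keyword match like this one cannot make.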

Comparing Foundational vs. Purpose-Built AI Insights

To illustrate, I gave the same prompt to two systems: a foundational LLM equipped with basic tools (web search, data query, and charting) and Marmot™, Komodo’s AI platform purpose-built for the life sciences and healthcare industries. Both had access to the same data sources. The prompt:

“I’d like to understand the pre- and post-diagnosis patient journey for multiple myeloma (MM) patients. Include analyses that you believe would be helpful in understanding the MM patient journey.”

The foundational LLM generated a Sankey diagram illustrating the complex journey of MM patients from pre-diagnosis through various treatment phases to advanced disease stages. The width of each flow represents the relative proportion of patients following each pathway.

While the foundational LLM accurately depicted the MM patient journey, its output was mechanical and devoid of interpretation, as its insights on the pre-diagnosis journey and general treatment pathways show:

Pre-Diagnosis Journey

Symptoms and risk factors

Patients often present with non-specific symptoms like bone pain, fatigue, and recurrent infections
Anemia is common (as seen in trial eligibility criteria requiring hemoglobin ≤10 g/dL)
Renal dysfunction may occur (trial criteria mention creatinine clearance thresholds)
Hypercalcemia may be present (calcium ≥11 mg/dL noted in trial criteria)
Lytic bone lesions detected on imaging studies
Some patients are asymptomatic and diagnosed through routine blood tests that show elevated proteins

Diagnostic criteria (from clinical trial eligibility) and key diagnostic tests

Serum M-protein ≥3 g/dL and/or bone marrow plasma cells ≥10%
Evidence of end-organ damage: Calcium elevation, Renal insufficiency, Anemia, Bone lesions (CRAB criteria)
Blood tests: complete blood count, serum protein electrophoresis, free light chain assay
Bone marrow biopsy (mentioned repeatedly in clinical trial protocols)
Imaging: X-ray, CT, PET/CT, MRI for bone lesion detection

First-Line Treatment Considerations

Transplant-eligible pathway

Induction therapy:

Typically triplet regimens, including proteasome inhibitors, immunomodulatory drugs (IMiDs), and steroids
VRD (bortezomib, lenalidomide, dexamethasone) commonly mentioned in trial criteria
Stem cell collection and autologous stem cell transplantation (ASCT)
Several trials involve post-ASCT maintenance therapy
Consolidation therapy (in some cases)

Maintenance therapy:

Lenalidomide maintenance (MajesTEC-4 trial specifically examines this)
Trial evidence for teclistamab combinations as maintenance

Transplant-ineligible pathway

Continuous therapy with proteasome inhibitors and IMiDs
Lower-intensity regimens based on patient fitness
Longer duration of initial therapy

View the full analysis generated by the foundational LLM

In contrast, Marmot shared a much more complete and sophisticated analysis of the patient journey. It included numerous visualizations that conveyed pre-diagnosis symptom patterns, utilization/frequency/timing of diagnostic procedures, HCP specialty involvement, post-diagnosis treatment patterns, and demographic disparities across the patient journey.

Importantly, Marmot surfaced the most salient insights, highlighting significant opportunities for earlier diagnosis and more equitable treatment across demographic segments, particularly for older patients and those with Medicare insurance. It also suggested early intervention opportunities based on symptom patterns:

Pre-Diagnosis Journey

Symptoms and timeline

Most common pre-diagnosis symptoms are bone pain/fractures (36.4%), anemia (35.7%), and renal dysfunction (24.7%)
These symptoms appear 203-265 days (median) before diagnosis, suggesting a significant delay between symptom onset and formal diagnosis
Symptom patterns vary by age: Older patients (75+) have higher rates of anemia (54.8%) and bone pain (52.1%) compared to younger patients (18-44: 37.1% and 30.0%, respectively)

Diagnostic procedures

Protein electrophoresis is the most common diagnostic test (37.9% of patients), typically occurring 84 days before diagnosis
Bone marrow biopsy/aspiration (10.8%) is performed closer to diagnosis (median 32 days prior)
These tests are significantly underutilized, suggesting missed opportunities for earlier diagnosis

Diagnosis

Provider specialties

Hematology/oncology specialists make 34.3% of MM diagnoses
Other common diagnosing specialties include internal medicine (16.4%), pathology (13.1%), and radiology (12.2%)
Many patients see multiple specialists before diagnosis; 71.8% see non-specialists

Recommendations

Early detection initiative: Implement systematic screening for MM in patients presenting with bone pain/fractures, anemia, or renal dysfunction, particularly when these symptoms persist or occur together
Provider education: Develop targeted education for internal medicine (non-HemOnc) and other/unknown specialty about early recognition of MM symptoms and appropriate diagnostic workup
Diagnostic pathway optimization: Standardize the use of protein electrophoresis as a screening tool for suspicious symptoms to reduce the current 218-day median delay to diagnosis
Treatment gap intervention: Address the significant treatment gap — only 21.46% of diagnosed patients receive treatment — through better transitions in care from diagnosis to treatment initiation
Age-appropriate treatment strategies: Develop geriatric-specific treatment protocols to address the treatment disparity in older populations, where treatment rates drop from 31.92% (ages 45-64) to 11.79% (75+).

View the full analysis generated by Marmot
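
Metrics of the kind quoted in the Marmot output above, such as the median delay from first symptom to diagnosis and treatment rates by age band, come down to simple arithmetic once a cohort is defined. The following is a rough sketch of that arithmetic only, not a description of how Marmot computes anything. The record layout, the `make_patient` helper, and the 65-74 band are assumptions, and the five records are synthetic values chosen for illustration rather than taken from the article’s data.

```python
from datetime import date, timedelta
from statistics import median


def make_patient(pid, age, lag_days, treated):
    # Hypothetical record layout; lag_days=None means no symptom was recorded.
    diagnosis = date(2024, 1, 1)
    first_symptom = None if lag_days is None else diagnosis - timedelta(days=lag_days)
    return {"id": pid, "age": age, "first_symptom": first_symptom,
            "diagnosis": diagnosis, "treated": treated}


# Synthetic, illustrative cohort only.
patients = [
    make_patient("p1", 52, 180, True),
    make_patient("p2", 58, 200, False),
    make_patient("p3", 79, 260, False),
    make_patient("p4", 81, 240, True),
    make_patient("p5", 77, None, False),
]


def median_days_to_diagnosis(rows):
    # Only patients with a recorded symptom contribute to the delay.
    lags = [(r["diagnosis"] - r["first_symptom"]).days
            for r in rows if r["first_symptom"] is not None]
    return median(lags)


def age_band(age):
    if age < 45:
        return "18-44"
    if age < 65:
        return "45-64"
    if age < 75:
        return "65-74"  # assumed band; not named in the quoted output
    return "75+"


def treatment_rate_by_band(rows):
    totals, treated = {}, {}
    for r in rows:
        band = age_band(r["age"])
        totals[band] = totals.get(band, 0) + 1
        treated[band] = treated.get(band, 0) + int(r["treated"])
    return {band: treated[band] / totals[band] for band in totals}


delay = median_days_to_diagnosis(patients)
rates = treatment_rate_by_band(patients)
print(delay, rates)
```

The arithmetic is the easy part. Choices such as which codes count as a "first symptom", how far back to look, and whether untreated patients were simply lost to follow-up are where analyst judgment shapes the result.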

The Best Path Forward

While foundational LLMs excel at natural language understanding and generation, they lack the rigorous analytical methodologies and specialized training required to generate actionable healthcare analytics. At Komodo, we’ve drawn on decades of industry expertise across our product, engineering, research, and analytics teams to address these shortcomings. Our highly curated data, rigorous analytics methods, and knowledge gained from the 1+ million cohorts built with our software are the foundation for Marmot, the healthcare industry’s first AI thought partner.

Learn more by watching our on-demand webinar, where we introduce Marmot and share a live demo.

To see more articles like this, follow Komodo Health on LinkedIn, YouTube, or X, and visit our Resources Hub.

