Why Foundational LLMs Fail on Healthcare Data: The Lack of Context-Engineering Expertise

by Chandi Kodthiwada, VP, Product Management

Biopharma is eager to reap all the advantages that AI offers, and several top 20 companies are actively exploring or piloting the use of foundational LLMs for select processes. Will the complexity of healthcare data thwart success?

 

General-purpose large language models (LLMs) have demonstrated remarkable capabilities across many domains, so retrofitting them for healthcare analytics might appear to be an efficient way to bring AI into your organization. However, the many complexities and nuances of healthcare data make this a formidable challenge. Success hinges on one critical factor: context engineering expertise, meaning knowing which context to give the model so that it produces the best result.

What Is Context Engineering?

No Substitutes for Healthcare Data Expertise

Context engineering is extremely complex, and its success depends on the knowledge and real-world experience of those building the instruction set. The greater their expertise, the more granular and complete the analysis will be.

Giving a foundational LLM access to the requisite high-quality data sources is essential, but it is only the foundation. The AI must also be trained to reason like a seasoned healthcare analyst, with the ability to “think” critically: synthesizing myriad data, determining what is relevant or irrelevant for each analysis, recommending the best analytical approaches and methods, surfacing salient insights, refining its own work, and recommending the most opportune interventions.

Nor is increased processing capacity a remedy for the inherent complexity of healthcare data. In fact, recent research shows that although foundational models have dramatically increased their context window sizes (allowing them to process more text at once), “cramming” more information into these windows actually degrades performance. Models suffer from “attention capacity” limitations: their ability to maintain focus deteriorates as the context grows, which leads to two critical problems:

  1. Hallucination where data is missing or ambiguous: Generic AI tends to “fill in” with plausible but potentially incorrect information, creating a dangerous illusion of completeness.
  2. Quality deterioration with information overload: As more content is pushed into the context window, the model’s attention becomes diffused, compromising its ability to identify and prioritize the most relevant clinical information.
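One way to picture the alternative to “cramming” is to pass the model only the most relevant context that fits a fixed budget. The sketch below is a minimal illustration under stated assumptions: the `Snippet` structure, the relevance scores, the word-count token estimate, and the example snippets are all hypothetical and are not drawn from any specific product or library.

```python
from dataclasses import dataclass


@dataclass
class Snippet:
    """One candidate piece of context (hypothetical structure)."""
    text: str
    relevance: float  # assumed to be scored upstream, from 0.0 to 1.0


def estimate_tokens(text: str) -> int:
    # Crude stand-in for a real tokenizer: one token per whitespace-separated word.
    return len(text.split())


def select_context(snippets: list[Snippet], token_budget: int,
                   min_relevance: float = 0.5) -> list[Snippet]:
    """Keep the most relevant snippets that fit within the token budget."""
    selected: list[Snippet] = []
    used = 0
    for snippet in sorted(snippets, key=lambda s: s.relevance, reverse=True):
        if snippet.relevance < min_relevance:
            break  # every remaining snippet is even less relevant
        cost = estimate_tokens(snippet.text)
        if used + cost <= token_budget:
            selected.append(snippet)
            used += cost
    return selected


# Illustrative candidates only; the wording and scores are made up.
candidates = [
    Snippet("Cohort definition: adults with a new multiple myeloma diagnosis", 0.95),
    Snippet("Lookback window: 24 months of claims before the index date", 0.90),
    Snippet("Full formulary listing for unrelated therapeutic areas", 0.10),
    Snippet("Code list: protein electrophoresis and bone marrow biopsy procedures", 0.85),
]

chosen = select_context(candidates, token_budget=20)
for snippet in chosen:
    print(snippet.text)
```

In practice the hard part is the scoring step that this sketch takes as given: deciding which definitions, code lists, and caveats matter for a particular analysis is where domain expertise comes in.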

Comparing Foundational vs. Purpose-Built AI Insights

To illustrate, I gave the same prompt to two systems: a foundational LLM equipped with basic tools for web search, data query, and charting, and Marmot™, Komodo’s AI platform purpose-built for the life sciences and healthcare industries. Both had access to the same data sources. The prompt was:

“I’d like to understand the pre- and post-diagnosis patient journey for multiple myeloma (MM) patients. Include analyses that you believe would be helpful in understanding the MM patient journey.” 

The foundational LLM generated this Sankey diagram that illustrates the complex journey of MM patients from pre-diagnosis through various treatment phases to advanced disease stages. The width of each flow represents the relative proportion of patients following each pathway.

 

Figure: Multiple myeloma patient journey (Sankey diagram generated by the foundational LLM)

 

While the foundational LLM accurately depicted the MM patient journey, its output was mechanical and devoid of interpretation, as its insights on the pre-diagnosis journey and general treatment pathways show.

Pre-Diagnosis Journey

Symptoms and risk factors

  • Patients often present with non-specific symptoms like bone pain, fatigue, and recurrent infections
  • Anemia is common (as seen in trial eligibility criteria requiring hemoglobin ≤10 g/dL)
  • Renal dysfunction may occur (trial criteria mention creatinine clearance thresholds)
  • Hypercalcemia may be present (calcium ≥11 mg/dL noted in trial criteria)
  • Lytic bone lesions detected on imaging studies
  • Some patients are asymptomatic and diagnosed through routine blood tests that show elevated proteins

Diagnostic criteria (from clinical trial eligibility) and key diagnostic tests

  • Serum M-protein ≥3 g/dL and/or bone marrow plasma cells ≥10%
  • Evidence of end-organ damage: Calcium elevation, Renal insufficiency, Anemia, Bone lesions (CRAB criteria)
  • Blood tests: complete blood count, serum protein electrophoresis, free light chain assay
  • Bone marrow biopsy (mentioned repeatedly in clinical trial protocols)
  • Imaging: X-ray, CT, PET/CT, MRI for bone lesion detection

First-Line Treatment Considerations

Transplant-eligible pathway

Induction therapy:

  • Typically triplet regimens, including proteasome inhibitors, immunomodulatory drugs (IMiDs), and steroids
  • VRD (bortezomib, lenalidomide, dexamethasone) commonly mentioned in trial criteria
  • Stem cell collection and autologous stem cell transplantation (ASCT)
  • Several trials involve post-ASCT maintenance therapy
  • Consolidation therapy (in some cases)

Maintenance therapy:

  • Lenalidomide maintenance (MajesTEC-4 trial specifically examines this)
  • Trial evidence for teclistamab combinations as maintenance

Transplant-ineligible pathway

  • Continuous therapy with proteasome inhibitors and IMiDs
  • Lower-intensity regimens based on patient fitness
  • Longer duration of initial therapy 


View the full analysis generated by the foundational LLM 

 

In contrast, Marmot shared a much more complete and sophisticated analysis of the patient journey. It included numerous visualizations that conveyed pre-diagnosis symptom patterns, utilization/frequency/timing of diagnostic procedures, HCP specialty involvement, post-diagnosis treatment patterns, and demographic disparities across the patient journey.

Figure: Multiple myeloma patient journey timeline

 

Importantly, Marmot surfaced the most salient insights, highlighting significant opportunities for earlier diagnosis and more equitable treatment across demographic segments, particularly for older patients and those with Medicare insurance. It also suggested early intervention opportunities based on symptom patterns:

Pre-Diagnosis Journey

Symptoms and timeline

  • Most common pre-diagnosis symptoms are bone pain/fractures (36.4%), anemia (35.7%), and renal dysfunction (24.7%)
  • These symptoms appear 203-265 days (median) before diagnosis, suggesting a significant delay between symptom onset and formal diagnosis
  • Symptom patterns vary by age: Older patients (75+) have higher rates of anemia (54.8%) and bone pain (52.1%) compared to younger patients (18-44: 37.1% and 30.0%, respectively)

Diagnostic procedures

  • Protein electrophoresis is the most common diagnostic test (37.9% of patients), typically occurring 84 days before diagnosis
  • Bone marrow biopsy/aspiration (10.8%) is performed closer to diagnosis (median 32 days prior)
  • These tests are significantly underutilized, suggesting missed opportunities for earlier diagnosis

Diagnosis

Provider specialties

  • Hematology/oncology specialists make 34.3% of MM diagnoses
  • Other common diagnosing specialties include internal medicine (16.4%), pathology (13.1%), and radiology (12.2%)
  • Many patients see multiple specialists before diagnosis; 71.8% see non-specialists

Recommendations

  1. Early detection initiative: Implement systematic screening for MM in patients presenting with bone pain/fractures, anemia, or renal dysfunction, particularly when these symptoms persist or occur together
  2. Provider education: Develop targeted education for internal medicine (non-HemOnc) and other/unknown specialty about early recognition of MM symptoms and appropriate diagnostic workup
  3. Diagnostic pathway optimization: Standardize the use of protein electrophoresis as a screening tool for suspicious symptoms to reduce the current 218-day median delay to diagnosis
  4. Treatment gap intervention: Address the significant treatment gap — only 21.46% of diagnosed patients receive treatment — through better transitions in care from diagnosis to treatment initiation
  5. Age-appropriate treatment strategies: Develop geriatric-specific treatment protocols to address the treatment disparity in older populations, where treatment rates drop from 31.92% (ages 45-64) to 11.79% (75+).
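Metrics like the median symptom-to-diagnosis delay and the treatment rate by age group can, in principle, be derived from patient-level records. The sketch below shows one simple way to compute them. It does not describe how Marmot works: the record layout, the `make_patient` helper, the 65-74 age band, and all of the values are assumptions invented purely for illustration.

```python
from datetime import date, timedelta
from statistics import median


def make_patient(age: int, delay_days: int, treated: bool) -> dict:
    """Build one synthetic record (hypothetical layout, not real patient data)."""
    first_symptom = date(2022, 1, 1)
    return {
        "age": age,
        "first_symptom": first_symptom,
        "diagnosis": first_symptom + timedelta(days=delay_days),
        "treated": treated,
    }


# Made-up records for illustration only.
patients = [
    make_patient(50, 100, True),
    make_patient(60, 200, False),
    make_patient(70, 300, True),
    make_patient(80, 400, False),
    make_patient(85, 250, False),
]


def age_band(age: int) -> str:
    # The 65-74 band is assumed here to complete the age groups.
    if age < 45:
        return "18-44"
    if age < 65:
        return "45-64"
    if age < 75:
        return "65-74"
    return "75+"


def median_delay_days(records: list[dict]) -> float:
    """Median number of days from first recorded symptom to diagnosis."""
    return median((r["diagnosis"] - r["first_symptom"]).days for r in records)


def treatment_rate_by_band(records: list[dict]) -> dict[str, float]:
    """Share of diagnosed patients who received treatment, per age band."""
    totals: dict[str, int] = {}
    treated: dict[str, int] = {}
    for r in records:
        band = age_band(r["age"])
        totals[band] = totals.get(band, 0) + 1
        treated[band] = treated.get(band, 0) + int(r["treated"])
    return {band: treated[band] / totals[band] for band in totals}


print(median_delay_days(patients))
print(treatment_rate_by_band(patients))
```

The arithmetic is the easy part. Defining what counts as a first symptom, a diagnosis date, or a treatment event in real-world data is where the analytical rigor described in this article matters.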


View the full analysis generated by Marmot

The Best Path Forward

While foundational LLMs excel at natural language understanding and generation, they lack the rigorous analytical methodologies and specialized training that are required to generate actionable healthcare analytics. At Komodo, we’ve drawn on decades of industry expertise from our product, engineering, research, and analytics teams to address these shortcomings. Our highly curated data, rigorous analytics methods, and knowledge gained from the 1+ million cohorts built with our software are the foundation for Marmot, the healthcare industry’s first AI thought partner.

Learn more by watching our on-demand webinar, where we introduce Marmot and share a live demo.
