
Document Extraction with AI

This guide covers the AI-powered document extraction capabilities in LyfeAI Provider, which can process various medical documents and extract structured patient data.

Overview

The document extraction feature uses GPT-4 to intelligently parse unstructured medical documents and extract relevant patient information into a structured format compatible with the system's data models.

Supported Document Types

  • Clinical Notes - Progress notes, H&P, discharge summaries
  • Lab Reports - Blood work, pathology, microbiology
  • Imaging Reports - Radiology, CT, MRI, X-ray reports
  • Medication Lists - Current medications, prescriptions
  • Insurance Documents - Insurance cards, authorization letters
  • Referral Letters - Specialist consultations
  • Patient Forms - Registration, history forms

How It Works

1. Document Input

Documents can be provided as:

  • Plain text
  • Scanned images (via OCR)
  • PDF files
  • Word documents
  • HL7/FHIR messages

2. AI Processing

The AI service processes documents through multiple stages:

async function extractPatientData(documentText: string): Promise<ExtractedPatientData> {
  // 1. Pre-process document
  const cleaned = preprocessDocument(documentText);

  // 2. Identify document type
  const docType = identifyDocumentType(cleaned);

  // 3. Extract data using specialized prompt
  const prompt = createExtractionPrompt(cleaned, docType);
  const rawData = await openai.complete(prompt);

  // 4. Parse and validate
  const structured = parseAIResponse(rawData);
  const validated = validateExtractedData(structured);

  // 5. Calculate confidence scores
  const confidence = calculateConfidence(validated, documentText);

  return {
    ...validated,
    confidence,
    extractedSections: identifyExtractedSections(documentText)
  };
}
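Step 2 calls `identifyDocumentType` without showing it. One plausible shape is a keyword-scoring heuristic, sketched below. The type names, keyword lists, and scoring rule are illustrative assumptions, not the actual LyfeAI Provider implementation.

```typescript
// Hypothetical document types, loosely mirroring "Supported Document Types".
type DocumentType =
  | "clinicalNote"
  | "labReport"
  | "imagingReport"
  | "medicationList"
  | "unknown";

// Illustrative keyword lists (assumed); a real classifier would need tuning.
const KEYWORDS: { type: DocumentType; words: string[] }[] = [
  { type: "clinicalNote", words: ["chief complaint", "history of present illness", "assessment and plan"] },
  { type: "labReport", words: ["reference range", "specimen", "hemoglobin"] },
  { type: "imagingReport", words: ["impression", "findings", "contrast"] },
  { type: "medicationList", words: ["dosage", "refills", "prescribed"] },
];

// Pick the type whose keywords appear most often; "unknown" if none match.
function identifyDocumentType(text: string): DocumentType {
  const lower = text.toLowerCase();
  let best: DocumentType = "unknown";
  let bestScore = 0;
  for (const entry of KEYWORDS) {
    const score = entry.words.filter((w) => lower.includes(w)).length;
    if (score > bestScore) {
      best = entry.type;
      bestScore = score;
    }
  }
  return best;
}
```

Ties go to the earlier entry in `KEYWORDS`, so the order of the list matters in this sketch.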

3. Data Extraction

The AI extracts the following information:

Demographics

{
  firstName: "John",
  lastName: "Doe",
  dateOfBirth: "1980-05-15",
  gender: "male",
  mrn: "MRN123456",
  ssn: "***-**-6789",
  address: {
    street: "123 Main St",
    city: "Boston",
    state: "MA",
    zipCode: "02101"
  },
  phone: "(555) 123-4567",
  email: "[email protected]"
}
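The sample above shows `ssn` already masked to its last four digits. The guide does not say where that masking happens, so the helper below is only a sketch of one way to do it. `maskSsn` is a hypothetical name.

```typescript
// Hypothetical helper: mask a US Social Security number, keeping only the
// last four digits. Accepts "123-45-6789" or "123456789"; returns null for
// anything that is not exactly nine digits.
function maskSsn(raw: string): string | null {
  const digits = raw.replace(/[\s-]/g, "");
  if (!/^\d{9}$/.test(digits)) return null;
  return `***-**-${digits.slice(5)}`;
}
```

Returning `null` rather than the raw input means a malformed value is never passed through unmasked.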

Medical History

{
  conditions: [{
    name: "Type 2 Diabetes Mellitus",
    icd10Code: "E11.9",
    dateOfDiagnosis: "2018-03-20",
    status: "active"
  }],
  medications: [{
    name: "Metformin",
    dosage: "500mg",
    frequency: "twice daily",
    startDate: "2018-04-01"
  }],
  allergies: [{
    allergen: "Penicillin",
    reaction: "Anaphylaxis",
    severity: "severe"
  }],
  surgeries: [{
    procedure: "Appendectomy",
    date: "2015-07-10",
    outcome: "successful"
  }]
}

Clinical Data

{
  vitalSigns: {
    bloodPressure: "120/80",
    heartRate: 72,
    temperature: 98.6,
    weight: "180 lbs",
    height: "5'10\"",
    bmi: 25.8
  },
  labResults: [{
    testName: "Hemoglobin A1c",
    value: "7.2",
    unit: "%",
    referenceRange: "4.0-5.6",
    date: "2024-01-15",
    abnormal: true
  }],
  immunizations: [{
    vaccine: "COVID-19",
    date: "2023-10-01",
    booster: true
  }]
}
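Two fields in the sample can be derived from their neighbours: `bmi` from `weight` and `height`, and `abnormal` from `value` and `referenceRange`. The sketch below recomputes both as a sanity check on extracted values. The function names are hypothetical, and it assumes US customary units and a simple "low-high" range format.

```typescript
// BMI from US customary units: 703 * weight(lb) / height(in)^2,
// rounded to one decimal place.
function computeBmi(weightLbs: number, heightInches: number): number {
  const bmi = (703 * weightLbs) / (heightInches * heightInches);
  return Math.round(bmi * 10) / 10;
}

// Parse a height such as 5'10" into total inches; null if the format differs.
function parseHeightInches(height: string): number | null {
  const match = /^(\d+)'\s*(\d+)"$/.exec(height.trim());
  if (!match) return null;
  return Number(match[1]) * 12 + Number(match[2]);
}

// True when a numeric value lies outside a "low-high" reference range;
// null when either input cannot be parsed.
function isOutOfRange(value: string, referenceRange: string): boolean | null {
  const match = /^\s*([\d.]+)\s*-\s*([\d.]+)\s*$/.exec(referenceRange);
  const v = value.trim() === "" ? NaN : Number(value);
  if (!match || Number.isNaN(v)) return null;
  return v < Number(match[1]) || v > Number(match[2]);
}
```

With the sample values, a height of 5'10" is 70 inches, 180 lbs gives a BMI of 25.8, and an A1c of 7.2 against 4.0-5.6 is out of range, matching the `bmi` and `abnormal` fields shown above.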

Usage

Via Server Action

import { processDocumentWithAI } from '@/app/actions/ai-actions';

// Process a document
const result = await processDocumentWithAI(documentText);

if (result.success) {
  console.log('Extracted data:', result.data);
  console.log('Confidence:', result.confidence);
  console.log('Insights:', result.insights);
}

Direct API Usage

import { AIService } from '@/lib/ai-service';

const aiService = new AIService();
const extractedData = await aiService.extractPatientData(documentText);

Confidence Scoring

The system provides confidence scores for extracted data:

  • High (0.8-1.0): Data clearly stated in document
  • Medium (0.5-0.8): Data inferred or partially present
  • Low (0-0.5): Data uncertain or missing

{
  demographics: {
    data: { /* ... */ },
    confidence: 0.95
  },
  medicalHistory: {
    data: { /* ... */ },
    confidence: 0.82
  },
  overallConfidence: 0.87
}
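The bands listed above share their boundary values (0.8 and 0.5 each appear in two ranges). The sketch below maps a score to a band on the assumption that a boundary value belongs to the higher band. That choice, and the function names, are assumptions rather than documented behavior.

```typescript
type ConfidenceLevel = "high" | "medium" | "low";

// Map a score in [0, 1] to a band. Assumption: 0.8 counts as "high" and
// 0.5 counts as "medium", since the guide leaves the boundaries ambiguous.
function classifyConfidence(score: number): ConfidenceLevel {
  if (!Number.isFinite(score) || score < 0 || score > 1) {
    throw new RangeError(`Confidence must be between 0 and 1, got ${score}`);
  }
  if (score >= 0.8) return "high";
  if (score >= 0.5) return "medium";
  return "low";
}

// Return the names of sections whose score falls below the "high" band.
function sectionsToDoubleCheck(scores: Record<string, number>): string[] {
  return Object.keys(scores).filter(
    (name) => classifyConfidence(scores[name]) !== "high",
  );
}
```

A caller could use `sectionsToDoubleCheck` to decide which parts of a review form to highlight.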

Validation & Error Handling

Data Validation

All extracted data is validated using Zod schemas:

const PatientDemographicsSchema = z.object({
  firstName: z.string().min(1),
  lastName: z.string().min(1),
  dateOfBirth: z.string().regex(/^\d{4}-\d{2}-\d{2}$/),
  gender: z.enum(['male', 'female', 'other']),
  // ... more fields
});
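The `dateOfBirth` regex checks only the shape of the string, so a value such as `2023-02-30` passes. The sketch below adds a calendar check and uses no external dependencies. `isValidIsoDate` is a hypothetical helper, not part of the schema shown above.

```typescript
// Hypothetical helper: true only for YYYY-MM-DD strings that name a real
// calendar date. Years 0000-0099 are rejected as a side effect of how
// Date.UTC treats two-digit years, which is acceptable for birth dates.
function isValidIsoDate(value: string): boolean {
  const match = /^(\d{4})-(\d{2})-(\d{2})$/.exec(value);
  if (!match) return false;
  const year = Number(match[1]);
  const month = Number(match[2]);
  const day = Number(match[3]);
  // Date.UTC rolls invalid dates over (Feb 30 becomes Mar 2), so compare
  // the round-tripped components against the input.
  const date = new Date(Date.UTC(year, month - 1, day));
  return (
    date.getUTCFullYear() === year &&
    date.getUTCMonth() === month - 1 &&
    date.getUTCDate() === day
  );
}
```

In a Zod schema, a check like this could be attached with `.refine(isValidIsoDate)` on the string field.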

Error Scenarios

  1. Invalid Document Format

    {
      success: false,
      error: "Unable to parse document: Unsupported format"
    }
  2. Low Confidence Extraction

    {
      success: true,
      data: extractedData,
      warnings: ["Low confidence in medication dosages", "Missing patient MRN"]
    }
  3. AI Service Unavailable

    {
      success: true,
      data: fallbackExtraction,
      usedFallback: true,
      message: "AI service unavailable, used basic extraction"
    }
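The three result shapes above can be modeled as a discriminated union so that callers handle each case explicitly. The type names and the routing policy below are assumptions for illustration. The guide specifies the fields but not how a caller should act on them.

```typescript
// Assumed types, derived from the three example payloads above.
interface ExtractionSuccess {
  success: true;
  data: Record<string, unknown>;
  warnings?: string[];
  usedFallback?: boolean;
  message?: string;
}

interface ExtractionFailure {
  success: false;
  error: string;
}

type ExtractionResult = ExtractionSuccess | ExtractionFailure;

type NextStep = "reject" | "priority-review" | "standard-review";

// Hypothetical routing policy: failures are rejected; results with warnings
// or produced by the fallback extractor get extra scrutiny; everything else
// still goes through the normal human review step.
function decideNextStep(result: ExtractionResult): NextStep {
  if (!result.success) return "reject";
  const hasWarnings = (result.warnings?.length ?? 0) > 0;
  if (result.usedFallback || hasWarnings) return "priority-review";
  return "standard-review";
}
```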

Best Practices

1. Document Preparation

  • Ensure good scan quality for images
  • Remove headers/footers if possible
  • Separate multi-patient documents
  • Maintain consistent formatting

2. Review Extracted Data

Always review AI-extracted data before saving:

// Show review UI
const ReviewExtractedData = ({ data, onConfirm, onEdit }) => {
  return (
    <div>
      <h3>Review Extracted Information</h3>
      <PatientDataForm
        initialData={data}
        onSubmit={onConfirm}
        highlightAIFields={true}
      />
    </div>
  );
};

3. Incremental Processing

For large documents, process in sections:

const sections = splitDocument(largeDocument);
const results = await Promise.all(
  sections.map(section => extractPatientData(section))
);
const merged = mergeExtractedData(results);
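`mergeExtractedData` is not shown in this guide. A minimal sketch of one possible merge policy follows: for scalar demographic fields, the first non-empty value wins; medications are de-duplicated by name and dosage. The types and the policy are assumptions. A production merge would also need to surface conflicts between sections instead of silently dropping them.

```typescript
// Assumed, simplified shapes for per-section results.
interface Medication {
  name: string;
  dosage?: string;
}

interface SectionResult {
  demographics?: Record<string, string>;
  medications?: Medication[];
}

interface MergedResult {
  demographics: Record<string, string>;
  medications: Medication[];
}

function mergeExtractedData(results: SectionResult[]): MergedResult {
  const demographics: Record<string, string> = {};
  const medications = new Map<string, Medication>();

  for (const result of results) {
    // First non-empty value wins for each demographic field.
    const demo = result.demographics ?? {};
    for (const key of Object.keys(demo)) {
      if (demo[key] && !(key in demographics)) demographics[key] = demo[key];
    }
    // De-duplicate medications by case-insensitive name plus dosage.
    for (const med of result.medications ?? []) {
      const id = `${med.name.toLowerCase()}|${med.dosage ?? ""}`;
      if (!medications.has(id)) medications.set(id, med);
    }
  }

  return { demographics, medications: Array.from(medications.values()) };
}
```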

Advanced Features

Custom Extraction Rules

Define custom extraction rules for specific document types:

const customRules = {
  labReport: {
    patterns: {
      hemoglobin: /Hemoglobin:\s*([\d.]+)\s*g\/dL/i,
      glucose: /Glucose:\s*([\d.]+)\s*mg\/dL/i
    },
    transforms: {
      hemoglobin: (value) => ({ value, unit: 'g/dL' }),
      glucose: (value) => ({ value, unit: 'mg/dL' })
    }
  }
};
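The guide does not show how such rules are applied. The sketch below runs each pattern against the document text and passes the first capture group to the matching transform. `RuleSet` and `applyRules` are hypothetical names; the two rules are copied from the example above.

```typescript
type Measurement = { value: string; unit: string };
type Transform = (value: string) => Measurement;

interface RuleSet {
  patterns: Record<string, RegExp>;
  transforms: Record<string, Transform>;
}

const labReportRules: RuleSet = {
  patterns: {
    hemoglobin: /Hemoglobin:\s*([\d.]+)\s*g\/dL/i,
    glucose: /Glucose:\s*([\d.]+)\s*mg\/dL/i,
  },
  transforms: {
    hemoglobin: (value) => ({ value, unit: "g/dL" }),
    glucose: (value) => ({ value, unit: "mg/dL" }),
  },
};

// Apply every pattern; fields with no match (or no transform) are omitted.
function applyRules(text: string, rules: RuleSet): Record<string, Measurement> {
  const out: Record<string, Measurement> = {};
  for (const field of Object.keys(rules.patterns)) {
    const match = rules.patterns[field].exec(text);
    const transform = rules.transforms[field];
    if (match && transform) out[field] = transform(match[1]);
  }
  return out;
}
```

Values stay as strings here, matching the `value: "7.2"` convention in the lab results sample.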

Multi-Language Support

Extract from documents in multiple languages:

const extractedData = await aiService.extractPatientData(documentText, {
  language: 'spanish',
  translateTo: 'english'
});

Batch Processing

Process multiple documents efficiently:

const documents = [doc1, doc2, doc3];
const results = await aiService.batchExtract(documents, {
  parallel: true,
  maxConcurrency: 5
});
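How `batchExtract` enforces `maxConcurrency` is not documented here. A common approach is a fixed pool of workers pulling from a shared index, sketched below. `mapWithConcurrency` is a hypothetical helper and is not claimed to be what `AIService` does internally.

```typescript
// Run `worker` over `items` with at most `maxConcurrency` calls in flight.
// Results are returned in input order. If any call rejects, the returned
// promise rejects (calls already in flight are not cancelled).
async function mapWithConcurrency<T, R>(
  items: T[],
  maxConcurrency: number,
  worker: (item: T, index: number) => Promise<R>,
): Promise<R[]> {
  const results: R[] = new Array(items.length);
  let next = 0;

  // Each pool member claims the next unprocessed index until none remain.
  async function runWorker(): Promise<void> {
    while (next < items.length) {
      const index = next++;
      results[index] = await worker(items[index], index);
    }
  }

  const poolSize = Math.max(1, Math.min(maxConcurrency, items.length));
  await Promise.all(Array.from({ length: poolSize }, () => runWorker()));
  return results;
}
```

With a helper like this, a batch call could look like `mapWithConcurrency(documents, 5, (doc) => aiService.extractPatientData(doc))`.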

Performance Optimization

Caching

Frequently processed documents are cached:

const cacheKey = generateDocumentHash(documentText);
const cached = await documentCache.get(cacheKey);

if (cached && !forceRefresh) {
  return cached;
}
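The snippet above relies on `generateDocumentHash` and `documentCache`, neither of which is defined in this guide. The sketch below fills them in with a dependency-free, FNV-1a-style hash over UTF-16 code units and an in-memory map. Both are assumptions for illustration. A 32-bit hash can collide, so a real deployment would more likely key on a cryptographic digest such as SHA-256.

```typescript
// FNV-1a-style 32-bit hash over UTF-16 code units (illustrative only).
function generateDocumentHash(text: string): string {
  let hash = 0x811c9dc5;
  for (let i = 0; i < text.length; i++) {
    hash ^= text.charCodeAt(i);
    hash = Math.imul(hash, 0x01000193);
  }
  return (hash >>> 0).toString(16).padStart(8, "0");
}

// Hypothetical in-memory cache keyed by document hash.
class DocumentCache<T> {
  private store = new Map<string, T>();

  // Return the cached value unless forceRefresh is set; otherwise compute,
  // store, and return a fresh one.
  async getOrCompute(
    text: string,
    compute: () => Promise<T>,
    forceRefresh = false,
  ): Promise<T> {
    const key = generateDocumentHash(text);
    const cached = this.store.get(key);
    if (cached !== undefined && !forceRefresh) return cached;
    const fresh = await compute();
    this.store.set(key, fresh);
    return fresh;
  }
}
```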

Response Time

Typical processing times:

  • Small document (<1 page): 2-3 seconds
  • Medium document (1-5 pages): 5-10 seconds
  • Large document (>5 pages): 10-30 seconds

Token Optimization

Minimize token usage with smart truncation:

const optimizedText = truncateDocument(documentText, {
  maxTokens: 3000,
  preserveSections: ['demographics', 'medications', 'diagnoses']
});
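`truncateDocument` is not defined in this guide. The sketch below shows one way it might work: paragraphs that mention a preserved section name are kept first, then remaining paragraphs are added in document order until the budget is used up. The token count is estimated at roughly four characters per token, which is a rough rule of thumb for English text and not the model's real tokenizer.

```typescript
interface TruncateOptions {
  maxTokens: number;
  preserveSections: string[];
}

// Rough estimate (assumed): about 4 characters per token.
const estimateTokens = (text: string): number => Math.ceil(text.length / 4);

function truncateDocument(text: string, options: TruncateOptions): string {
  const paragraphs = text
    .split(/\n\s*\n/)
    .map((p) => p.trim())
    .filter((p) => p.length > 0);

  // A paragraph is "priority" if it mentions any preserved section name.
  const isPriority = (p: string): boolean =>
    options.preserveSections.some((s) => p.toLowerCase().includes(s.toLowerCase()));

  const keep = new Set<number>();
  let used = 0;
  // Two passes: priority paragraphs first, then the rest.
  for (const wantPriority of [true, false]) {
    paragraphs.forEach((p, i) => {
      if (isPriority(p) !== wantPriority) return;
      const cost = estimateTokens(p);
      if (used + cost <= options.maxTokens) {
        keep.add(i);
        used += cost;
      }
    });
  }

  // Emit the kept paragraphs in their original order.
  return paragraphs.filter((_, i) => keep.has(i)).join("\n\n");
}
```

Paragraphs are kept or dropped whole, so a single paragraph larger than the budget is omitted instead of being cut mid-sentence.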

Integration Examples

With Patient Import

// Extract and create patient
const handleDocumentUpload = async (file: File) => {
  const text = await extractTextFromFile(file);
  const extracted = await processDocumentWithAI(text);

  if (extracted.success) {
    const patient = await addPatientWithInsights(extracted.data);
    router.push(`/patients/${patient.id}`);
  }
};

With EHR Systems

// Process EHR message
const processEHRMessage = async (hl7Message: string) => {
  const parsed = parseHL7(hl7Message);
  const enhanced = await aiService.enhanceWithAI(parsed);
  return syncWithEHR(enhanced);
};
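`parseHL7` is not shown in this guide. As a rough illustration of what parsing involves, the sketch below pulls basic demographics from the PID segment of an HL7 v2 message. It assumes the default delimiters (`|` for fields, `^` for components) and reads PID-3 (identifier), PID-5 (name), PID-7 (date of birth), and PID-8 (sex). `parseHl7Patient` is a hypothetical helper. A real integration would use a full HL7 library and honor the delimiters declared in the MSH segment.

```typescript
interface Hl7Patient {
  mrn: string;
  lastName: string;
  firstName: string;
  dateOfBirth: string; // YYYY-MM-DD, or "" if absent or malformed
  sex: string;
}

function parseHl7Patient(message: string): Hl7Patient | null {
  // HL7 v2 segments end with a carriage return; tolerate other line endings.
  const segments = message.split(/\r\n|\r|\n/).filter((s) => s.length > 0);
  const pid = segments.find((s) => s.startsWith("PID|"));
  if (!pid) return null;

  // After splitting on "|", fields[n] corresponds to PID-n.
  const fields = pid.split("|");
  const [lastName = "", firstName = ""] = (fields[5] ?? "").split("^");
  const dob = fields[7] ?? "";
  const dateOfBirth = /^\d{8}/.test(dob)
    ? `${dob.slice(0, 4)}-${dob.slice(4, 6)}-${dob.slice(6, 8)}`
    : "";

  return {
    mrn: (fields[3] ?? "").split("^")[0] ?? "",
    lastName,
    firstName,
    dateOfBirth,
    sex: fields[8] ?? "",
  };
}
```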

Troubleshooting

Common Issues

  1. Poor Extraction Quality

    • Check document quality
    • Verify OCR accuracy
    • Review prompt templates
  2. Missing Data

    • Ensure data is present in document
    • Check extraction patterns
    • Adjust confidence thresholds
  3. Slow Processing

    • Optimize document size
    • Use batch processing
    • Enable caching

Debug Mode

Enable detailed logging:

const result = await processDocumentWithAI(documentText, {
  debug: true,
  logPrompts: true,
  logResponses: true
});

Security Considerations

  • PHI is processed securely
  • No data stored in AI service
  • Audit logs maintained
  • Encrypted transmission
  • HIPAA-compliant processing