Document Extraction with AI

This guide covers the AI-powered document extraction capabilities in LyfeAI Provider, which can process various medical documents and extract structured patient data.

Overview

The document extraction feature uses GPT-4 to parse unstructured medical documents and extract patient information into a structured format compatible with the system's data models.

Supported Document Types

Clinical Notes - Progress notes, H&P, discharge summaries
Lab Reports - Blood work, pathology, microbiology
Imaging Reports - Radiology, CT, MRI, X-ray reports
Medication Lists - Current medications, prescriptions
Insurance Documents - Insurance cards, authorization letters
Referral Letters - Specialist consultations
Patient Forms - Registration, history forms

How It Works

1. Document Input

Documents can be provided as:

Plain text
Scanned images (via OCR)
PDF files
Word documents
HL7/FHIR messages
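
Text-based inputs can often be told apart by a few well-known markers before extraction starts. The sketch below is one possible approach, not the routing LyfeAI Provider uses: `InputKind` and `detectInputKind` are hypothetical names, and binary inputs such as scanned images or Word documents would need to be identified from file metadata instead.

```typescript
// Hypothetical helper: guesses the kind of a text payload from well-known markers.
type InputKind = "hl7v2" | "fhir-json" | "pdf" | "plain-text";

function detectInputKind(payload: string): InputKind {
  const trimmed = payload.trimStart();

  // HL7 v2 messages begin with the MSH (message header) segment.
  if (trimmed.startsWith("MSH|")) return "hl7v2";

  // PDF files begin with the "%PDF-" signature.
  if (trimmed.startsWith("%PDF-")) return "pdf";

  // FHIR JSON resources are objects carrying a string "resourceType" property.
  if (trimmed.startsWith("{")) {
    try {
      const parsed: unknown = JSON.parse(trimmed);
      if (
        typeof parsed === "object" &&
        parsed !== null &&
        typeof (parsed as Record<string, unknown>)["resourceType"] === "string"
      ) {
        return "fhir-json";
      }
    } catch {
      // Not valid JSON; fall through and treat it as plain text.
    }
  }
  return "plain-text";
}
```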

2. AI Processing

The AI service processes documents through multiple stages:

async function extractPatientData(documentText: string): Promise<ExtractedPatientData> {
  // 1. Pre-process document
  const cleaned = preprocessDocument(documentText);
  
  // 2. Identify document type
  const docType = identifyDocumentType(cleaned);
  
  // 3. Extract data using specialized prompt
  const prompt = createExtractionPrompt(cleaned, docType);
  const rawData = await openai.complete(prompt);
  
  // 4. Parse and validate
  const structured = parseAIResponse(rawData);
  const validated = validateExtractedData(structured);
  
  // 5. Calculate confidence scores
  const confidence = calculateConfidence(validated, documentText);
  
  return {
    ...validated,
    confidence,
    extractedSections: identifyExtractedSections(documentText)
  };
}
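
Step 4 depends on turning the model's raw reply into an object. The behavior of `parseAIResponse` is not shown in this guide, so the following is only a sketch of what such a helper might do, under the assumption that the model is prompted to answer with a JSON object and may surround it with prose or a Markdown fence. The name `parseJsonResponse` is hypothetical.

```typescript
// Hypothetical stand-in for parseAIResponse; assumes the model was asked to reply with JSON.
function parseJsonResponse(raw: string): Record<string, unknown> {
  // Take the outermost {...} span so surrounding prose or code fences are ignored.
  const start = raw.indexOf("{");
  const end = raw.lastIndexOf("}");
  if (start === -1 || end <= start) {
    throw new Error("No JSON object found in AI response");
  }

  // JSON.parse throws a SyntaxError if the span is not valid JSON.
  const parsed: unknown = JSON.parse(raw.slice(start, end + 1));
  if (typeof parsed !== "object" || parsed === null || Array.isArray(parsed)) {
    throw new Error("AI response is not a JSON object");
  }
  return parsed as Record<string, unknown>;
}
```

The parsed object is still untrusted at this point, which is why schema validation follows as a separate step.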

3. Data Extraction

The AI extracts the following information:

Demographics

{
  firstName: "John",
  lastName: "Doe",
  dateOfBirth: "1980-05-15",
  gender: "male",
  mrn: "MRN123456",
  ssn: "***-**-6789",
  address: {
    street: "123 Main St",
    city: "Boston",
    state: "MA",
    zipCode: "02101"
  },
  phone: "(555) 123-4567",
  email: "john.doe@example.com"
}

Medical History

{
  conditions: [{
    name: "Type 2 Diabetes Mellitus",
    icd10Code: "E11.9",
    dateOfDiagnosis: "2018-03-20",
    status: "active"
  }],
  medications: [{
    name: "Metformin",
    dosage: "500mg",
    frequency: "twice daily",
    startDate: "2018-04-01"
  }],
  allergies: [{
    allergen: "Penicillin",
    reaction: "Anaphylaxis",
    severity: "severe"
  }],
  surgeries: [{
    procedure: "Appendectomy",
    date: "2015-07-10",
    outcome: "successful"
  }]
}

Clinical Data

{
  vitalSigns: {
    bloodPressure: "120/80",
    heartRate: 72,
    temperature: 98.6,
    weight: "180 lbs",
    height: "5'10\"",
    bmi: 25.8
  },
  labResults: [{
    testName: "Hemoglobin A1c",
    value: "7.2",
    unit: "%",
    referenceRange: "4.0-5.6",
    date: "2024-01-15",
    abnormal: true
  }],
  immunizations: [{
    vaccine: "COVID-19",
    date: "2023-10-01",
    booster: true
  }]
}
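
Extracted vitals can be cross-checked against each other. As an illustration, the sketch below recomputes BMI from the imperial weight and height strings in the example and arrives at the same 25.8. The helpers `parsePounds`, `parseHeightInches`, and `bmiFromImperial` are hypothetical, and the string formats are assumed to match the example exactly.

```typescript
// Hypothetical helpers; assume formats like "180 lbs" and 5'10" as in the example above.
function parsePounds(weight: string): number | null {
  const m = weight.match(/^\s*(\d+(?:\.\d+)?)\s*lbs?\s*$/i);
  return m ? Number(m[1]) : null;
}

function parseHeightInches(height: string): number | null {
  const m = height.match(/^\s*(\d+)'\s*(\d+)"\s*$/);
  return m ? Number(m[1]) * 12 + Number(m[2]) : null;
}

// BMI from imperial units: 703 * weight (lb) / height (in)^2, rounded to one decimal.
function bmiFromImperial(pounds: number, inches: number): number {
  return Math.round(((703 * pounds) / (inches * inches)) * 10) / 10;
}

const pounds = parsePounds("180 lbs");
const inches = parseHeightInches("5'10\"");
if (pounds !== null && inches !== null) {
  console.log(bmiFromImperial(pounds, inches)); // 25.8
}
```

A mismatch between the recomputed and extracted values could be surfaced as a warning for the reviewer.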

Usage

Via Server Action

import { processDocumentWithAI } from '@/app/actions/ai-actions';

// Process a document
const result = await processDocumentWithAI(documentText);

if (result.success) {
  console.log('Extracted data:', result.data);
  console.log('Confidence:', result.confidence);
  console.log('Insights:', result.insights);
}

Direct API Usage

import { AIService } from '@/lib/ai-service';

const aiService = new AIService();
const extractedData = await aiService.extractPatientData(documentText);

Confidence Scoring

The system provides confidence scores for extracted data:

High (0.8-1.0): Data clearly stated in document
Medium (0.5-0.8): Data inferred or partially present
Low (0-0.5): Data uncertain or missing

{
  demographics: {
    data: { /* ... */ },
    confidence: 0.95
  },
  medicalHistory: {
    data: { /* ... */ },
    confidence: 0.82
  },
  overallConfidence: 0.87
}
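
The bands above share their boundary values (0.8 and 0.5), so any code that maps a score to a band has to pick a side. The sketch below assumes 0.8 counts as high and 0.5 counts as medium; `confidenceLevel` and `sectionsNeedingReview` are hypothetical helpers, not part of the documented API.

```typescript
type ConfidenceLevel = "high" | "medium" | "low";

// Maps a score in [0, 1] to a band. Boundary handling is an assumption:
// 0.8 counts as high and 0.5 counts as medium.
function confidenceLevel(score: number): ConfidenceLevel {
  if (!Number.isFinite(score) || score < 0 || score > 1) {
    throw new RangeError(`Confidence must be between 0 and 1, got ${score}`);
  }
  if (score >= 0.8) return "high";
  if (score >= 0.5) return "medium";
  return "low";
}

// Hypothetical review gate: lists the sections whose score falls below a threshold.
function sectionsNeedingReview(
  scores: Record<string, number>,
  threshold = 0.8
): string[] {
  return Object.entries(scores)
    .filter(([, score]) => score < threshold)
    .map(([section]) => section);
}
```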

Validation & Error Handling

Data Validation

All extracted data is validated using Zod schemas:

const PatientDemographicsSchema = z.object({
  firstName: z.string().min(1),
  lastName: z.string().min(1),
  dateOfBirth: z.string().regex(/^\d{4}-\d{2}-\d{2}$/),
  gender: z.enum(['male', 'female', 'other']),
  // ... more fields
});
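
Note that the `dateOfBirth` pattern checks only the shape of the string, so an impossible date such as "2024-02-30" would pass. A stricter, dependency-free check might look like the sketch below; `isValidIsoDate` is a hypothetical helper that could be attached to the schema with Zod's `.refine()`.

```typescript
// Hypothetical helper: true only for real calendar dates in YYYY-MM-DD form.
function isValidIsoDate(value: string): boolean {
  const m = /^(\d{4})-(\d{2})-(\d{2})$/.exec(value);
  if (!m) return false;
  const year = Number(m[1]);
  const month = Number(m[2]);
  const day = Number(m[3]);

  // Date.UTC normalizes out-of-range parts (Feb 30 becomes Mar 1), so a
  // round-trip comparison detects impossible dates.
  const d = new Date(Date.UTC(year, month - 1, day));
  return (
    d.getUTCFullYear() === year &&
    d.getUTCMonth() === month - 1 &&
    d.getUTCDate() === day
  );
}
```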

Error Scenarios

Invalid Document Format

{
  success: false,
  error: "Unable to parse document: Unsupported format"
}

Low Confidence Extraction

{
  success: true,
  data: extractedData,
  warnings: ["Low confidence in medication dosages", "Missing patient MRN"]
}

AI Service Unavailable

{
  success: true,
  data: fallbackExtraction,
  usedFallback: true,
  message: "AI service unavailable, used basic extraction"
}
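
Since two of the three scenarios report `success: true`, callers may want to look beyond that flag before accepting a result. The sketch below models the result shape as inferred from the examples above (the actual type is not shown in this guide) and uses a hypothetical `nextStep` helper to decide what to do with it.

```typescript
// Hypothetical result shape inferred from the three scenarios above.
type ExtractionResult<T> =
  | { success: false; error: string }
  | {
      success: true;
      data: T;
      warnings?: string[];
      usedFallback?: boolean;
      message?: string;
    };

type NextStep = "reject" | "manual-review" | "accept";

// Fallback extractions and results with warnings are routed to a human reviewer.
function nextStep<T>(result: ExtractionResult<T>): NextStep {
  if (result.success === false) return "reject";
  const hasWarnings = (result.warnings?.length ?? 0) > 0;
  if (result.usedFallback === true || hasWarnings) return "manual-review";
  return "accept";
}
```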

Best Practices

1. Document Preparation

Ensure good scan quality for images
Remove headers/footers if possible
Separate multi-patient documents
Maintain consistent formatting

2. Review Extracted Data

Always review AI-extracted data before saving:

// Show review UI
const ReviewExtractedData = ({ data, onConfirm, onEdit }) => {
  return (
    <div>
      <h3>Review Extracted Information</h3>
      <PatientDataForm 
        initialData={data}
        onSubmit={onConfirm}
        highlightAIFields={true}
      />
    </div>
  );
};

3. Incremental Processing

For large documents, process in sections:

const sections = splitDocument(largeDocument);
const results = await Promise.all(
  sections.map(section => extractPatientData(section))
);
const merged = mergeExtractedData(results);
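
How `splitDocument` divides a document is not specified here. One simple possibility, sketched below under that caveat, is to pack blank-line-separated paragraphs into chunks up to a character limit; the 4000-character default is an arbitrary assumption. Keep in mind that sections processed independently lose context from one another, so merged results deserve a review.

```typescript
// Hypothetical splitDocument: packs blank-line-separated paragraphs into chunks
// of at most maxChars characters. A single paragraph longer than maxChars is
// kept whole rather than cut mid-sentence.
function splitDocument(text: string, maxChars = 4000): string[] {
  const paragraphs = text
    .split(/\n\s*\n/)
    .map((p) => p.trim())
    .filter((p) => p.length > 0);

  const chunks: string[] = [];
  let current = "";
  for (const paragraph of paragraphs) {
    const candidate = current ? `${current}\n\n${paragraph}` : paragraph;
    if (current && candidate.length > maxChars) {
      // Adding this paragraph would overflow the chunk; start a new one.
      chunks.push(current);
      current = paragraph;
    } else {
      current = candidate;
    }
  }
  if (current) chunks.push(current);
  return chunks;
}
```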

Advanced Features

Custom Extraction Rules

Define custom extraction rules for specific document types:

const customRules = {
  labReport: {
    patterns: {
      hemoglobin: /Hemoglobin:\s*([\d.]+)\s*g\/dL/i,
      glucose: /Glucose:\s*([\d.]+)\s*mg\/dL/i
    },
    transforms: {
      hemoglobin: (value) => ({ value, unit: 'g/dL' }),
      glucose: (value) => ({ value, unit: 'mg/dL' })
    }
  }
};
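
To make the shape of these rules concrete, here is a sketch of how a rule set like this could be applied to text. `applyRules`, `RuleSet`, and `Measurement` are hypothetical names; the guide does not show how the system consumes `customRules`. Unlike the example above, the transforms here convert the captured string to a number.

```typescript
type Measurement = { value: number; unit: string };

interface RuleSet {
  patterns: Record<string, RegExp>;
  transforms: Record<string, (value: string) => Measurement>;
}

// Hypothetical rule runner: each pattern's first capture group is passed to the
// transform of the same name; fields with no match are omitted.
// Patterns are expected to be non-global, since exec() on a global regex is stateful.
function applyRules(text: string, rules: RuleSet): Record<string, Measurement> {
  const out: Record<string, Measurement> = {};
  for (const [field, pattern] of Object.entries(rules.patterns)) {
    const match = pattern.exec(text);
    const transform = rules.transforms[field];
    if (match && match[1] !== undefined && transform) {
      out[field] = transform(match[1]);
    }
  }
  return out;
}

const labReportRules: RuleSet = {
  patterns: {
    hemoglobin: /Hemoglobin:\s*([\d.]+)\s*g\/dL/i,
    glucose: /Glucose:\s*([\d.]+)\s*mg\/dL/i,
  },
  transforms: {
    hemoglobin: (value) => ({ value: Number(value), unit: "g/dL" }),
    glucose: (value) => ({ value: Number(value), unit: "mg/dL" }),
  },
};
```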

Multi-Language Support

Extract from documents in multiple languages:

const extractedData = await aiService.extractPatientData(documentText, {
  language: 'spanish',
  translateTo: 'english'
});

Batch Processing

Process multiple documents efficiently:

const documents = [doc1, doc2, doc3];
const results = await aiService.batchExtract(documents, {
  parallel: true,
  maxConcurrency: 5
});
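
The internals of `batchExtract` are not documented here. For readers who want to see what a `maxConcurrency` limit means in practice, the sketch below shows a generic worker-pool pattern; `mapWithConcurrency` is a hypothetical helper and not necessarily how the service implements batching.

```typescript
// Hypothetical helper: runs `worker` over `items` with at most `maxConcurrency`
// calls in flight, returning results in input order.
async function mapWithConcurrency<T, R>(
  items: readonly T[],
  maxConcurrency: number,
  worker: (item: T, index: number) => Promise<R>
): Promise<R[]> {
  if (!Number.isInteger(maxConcurrency) || maxConcurrency < 1) {
    throw new RangeError("maxConcurrency must be a positive integer");
  }
  const results: R[] = new Array(items.length);
  let next = 0;

  // Each runner claims the next unprocessed index until the list is exhausted.
  async function runner(): Promise<void> {
    while (next < items.length) {
      const index = next++;
      results[index] = await worker(items[index], index);
    }
  }

  const runnerCount = Math.min(maxConcurrency, items.length);
  await Promise.all(Array.from({ length: runnerCount }, () => runner()));
  return results;
}
```

If any worker call rejects, the returned promise rejects as well, so callers that want partial results would need to catch errors inside the worker.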

Performance Optimization

Caching

Frequently processed documents are cached:

const cacheKey = generateDocumentHash(documentText);
const cached = await documentCache.get(cacheKey);

if (cached && !forceRefresh) {
  return cached;
}
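
The snippet above leaves the cache itself undefined. As a rough illustration only, an in-memory cache with a time-to-live and a `forceRefresh` bypass could be sketched as follows. `DocumentCache` and `getOrExtract` are hypothetical, and a production cache holding PHI would need its own retention and access controls.

```typescript
// Hypothetical in-memory cache with a time-to-live; `now` is injectable for testing.
class DocumentCache<V> {
  private readonly entries = new Map<string, { value: V; expiresAt: number }>();
  private readonly ttlMs: number;
  private readonly now: () => number;

  constructor(ttlMs: number, now: () => number = () => Date.now()) {
    this.ttlMs = ttlMs;
    this.now = now;
  }

  get(key: string): V | undefined {
    const entry = this.entries.get(key);
    if (!entry) return undefined;
    if (entry.expiresAt <= this.now()) {
      this.entries.delete(key); // entry has expired
      return undefined;
    }
    return entry.value;
  }

  set(key: string, value: V): void {
    this.entries.set(key, { value, expiresAt: this.now() + this.ttlMs });
  }
}

// Returns the cached value unless it is missing, expired, or forceRefresh is set.
async function getOrExtract<V>(
  cache: DocumentCache<V>,
  key: string,
  extract: () => Promise<V>,
  forceRefresh = false
): Promise<V> {
  const cached = forceRefresh ? undefined : cache.get(key);
  if (cached !== undefined) return cached;
  const fresh = await extract();
  cache.set(key, fresh);
  return fresh;
}
```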

Response Time

Typical processing times:

Small document (<1 page): 2-3 seconds
Medium document (1-5 pages): 5-10 seconds
Large document (>5 pages): 10-30 seconds

Token Optimization

Minimize token usage with smart truncation:

const optimizedText = truncateDocument(documentText, {
  maxTokens: 3000,
  preserveSections: ['demographics', 'medications', 'diagnoses']
});
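
The logic inside `truncateDocument` is not shown. The sketch below illustrates one way a section-aware budget could work, assuming the document has already been divided into named sections and using a rough estimate of about 4 characters per token (a common heuristic for English text, not an exact tokenizer count). `Section`, `estimateTokens`, and `truncateSections` are hypothetical names.

```typescript
interface Section {
  name: string;
  text: string;
}

// Rough heuristic (assumption): about 4 characters per token for English text.
function estimateTokens(text: string): number {
  return Math.ceil(text.length / 4);
}

// Hypothetical truncation: preserved sections are considered first, then the
// rest in document order. Any section that does not fit the remaining budget
// is skipped, including preserved ones. Output keeps the original order.
function truncateSections(
  sections: Section[],
  maxTokens: number,
  preserveSections: string[]
): Section[] {
  const preserved = new Set(preserveSections);
  const keep = new Set<number>();
  let used = 0;

  const tryKeep = (section: Section, index: number): void => {
    const cost = estimateTokens(section.text);
    if (used + cost <= maxTokens) {
      keep.add(index);
      used += cost;
    }
  };

  sections.forEach((s, i) => { if (preserved.has(s.name)) tryKeep(s, i); });
  sections.forEach((s, i) => { if (!preserved.has(s.name)) tryKeep(s, i); });

  return sections.filter((_, i) => keep.has(i));
}
```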

Integration Examples

With Patient Import

// Extract and create patient
const handleDocumentUpload = async (file: File) => {
  const text = await extractTextFromFile(file);
  const extracted = await processDocumentWithAI(text);
  
  if (extracted.success) {
    const patient = await addPatientWithInsights(extracted.data);
    router.push(`/patients/${patient.id}`);
  }
};

With EHR Systems

// Process EHR message
const processEHRMessage = async (hl7Message: string) => {
  const parsed = parseHL7(hl7Message);
  const enhanced = await aiService.enhanceWithAI(parsed);
  return syncWithEHR(enhanced);
};
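
`parseHL7` is not defined in this guide. To show the kind of structure it would work with, the sketch below reads a few fields from the PID (patient identification) segment of an HL7 v2 message, assuming the default delimiters. `parsePidSegment` and `Hl7Patient` are hypothetical, and a real integration would normally rely on a full HL7 library.

```typescript
interface Hl7Patient {
  mrn: string;
  lastName: string;
  firstName: string;
  dateOfBirth: string; // ISO YYYY-MM-DD, or "" when absent
  sex: string;
}

// Hypothetical minimal parser for the PID segment of an HL7 v2 message.
// Assumes the default delimiters: "|" for fields and "^" for components.
function parsePidSegment(message: string): Hl7Patient | null {
  // Segments are separated by carriage returns; tolerate newlines too.
  const segments = message.split(/\r\n|\r|\n/);
  const pid = segments.find((s) => s.startsWith("PID|"));
  if (!pid) return null;

  const fields = pid.split("|");
  // PID-5: patient name as family^given
  const [lastName = "", firstName = ""] = (fields[5] ?? "").split("^");
  // PID-3: patient identifier list; take the ID of the first entry
  const mrn = (fields[3] ?? "").split("^")[0] ?? "";
  // PID-7: date of birth as YYYYMMDD, optionally followed by a time
  const dob = fields[7] ?? "";
  const dateOfBirth = /^\d{8}/.test(dob)
    ? `${dob.slice(0, 4)}-${dob.slice(4, 6)}-${dob.slice(6, 8)}`
    : "";
  // PID-8: administrative sex
  return { mrn, lastName, firstName, dateOfBirth, sex: fields[8] ?? "" };
}
```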

Troubleshooting

Common Issues

Poor Extraction Quality
- Check document quality
- Verify OCR accuracy
- Review prompt templates

Missing Data
- Ensure data is present in document
- Check extraction patterns
- Adjust confidence thresholds

Slow Processing
- Optimize document size
- Use batch processing
- Enable caching

Debug Mode

Enable detailed logging:

const result = await processDocumentWithAI(documentText, {
  debug: true,
  logPrompts: true,
  logResponses: true
});
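
Logged prompts and responses will contain whatever the document contains, including PHI. One precaution is to scrub obvious identifiers before they reach a log. The sketch below is a pattern-based example only: `redactForLog` is a hypothetical helper, it covers just US SSN-shaped numbers and email addresses, and it will not catch every form of PHI.

```typescript
// Hypothetical log scrubber: masks US SSN-shaped numbers (keeping the last four
// digits, as in the demographics example) and replaces email addresses.
function redactForLog(text: string): string {
  return text
    .replace(/\b\d{3}-\d{2}-(\d{4})\b/g, "***-**-$1")
    .replace(/[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}/g, "[email redacted]");
}
```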

Security Considerations

PHI is processed securely
No data is stored in the AI service
Audit logs are maintained
Transmission is encrypted
Processing is HIPAA compliant
