Document Extraction with AI
This guide covers the AI-powered document extraction capabilities in LyfeAI Provider, which can process various medical documents and extract structured patient data.
Overview
The document extraction feature uses GPT-4 to intelligently parse unstructured medical documents and extract relevant patient information into a structured format compatible with the system's data models.
Supported Document Types
- Clinical Notes - Progress notes, H&P, discharge summaries
- Lab Reports - Blood work, pathology, microbiology
- Imaging Reports - Radiology, CT, MRI, X-ray reports
- Medication Lists - Current medications, prescriptions
- Insurance Documents - Insurance cards, authorization letters
- Referral Letters - Specialist consultations
- Patient Forms - Registration, history forms
How It Works
1. Document Input
Documents can be provided as:
- Plain text
- Scanned images (via OCR)
- PDF files
- Word documents
- HL7/FHIR messages
2. AI Processing
The AI service processes documents through multiple stages:
async function extractPatientData(documentText: string): Promise<ExtractedPatientData> {
// 1. Pre-process document
const cleaned = preprocessDocument(documentText);
// 2. Identify document type
const docType = identifyDocumentType(cleaned);
// 3. Extract data using specialized prompt
const prompt = createExtractionPrompt(cleaned, docType);
const rawData = await openai.complete(prompt);
// 4. Parse and validate
const structured = parseAIResponse(rawData);
const validated = validateExtractedData(structured);
// 5. Calculate confidence scores
const confidence = calculateConfidence(validated, documentText);
return {
...validated,
confidence,
extractedSections: identifyExtractedSections(documentText)
};
}
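The document-type step can be pictured as a simple keyword scorer. The sketch below is only an illustration of that idea: the `DocumentType` names and the keyword lists are assumptions made for this example, not the project's actual `identifyDocumentType` implementation, which may rely on the model or on other heuristics.

```typescript
// Hypothetical categories mirroring the "Supported Document Types" list.
type KnownType =
  | "clinical_note"
  | "lab_report"
  | "imaging_report"
  | "medication_list"
  | "insurance_document"
  | "referral_letter"
  | "patient_form";
type DocumentType = KnownType | "unknown";

// Illustrative keyword lists (assumptions); a real deployment would tune these.
const KEYWORDS: Record<KnownType, string[]> = {
  clinical_note: ["chief complaint", "history of present illness", "discharge summary"],
  lab_report: ["reference range", "specimen", "hemoglobin"],
  imaging_report: ["impression", "findings", "radiology"],
  medication_list: ["medication list", "dosage", "refills"],
  insurance_document: ["member id", "group number", "authorization"],
  referral_letter: ["referral", "referring physician", "consultation requested"],
  patient_form: ["registration", "emergency contact", "signature"],
};

// Picks the type whose keywords appear most often; "unknown" when nothing matches.
function identifyDocumentType(text: string): DocumentType {
  const haystack = text.toLowerCase();
  let best: DocumentType = "unknown";
  let bestScore = 0;
  for (const [type, words] of Object.entries(KEYWORDS) as [KnownType, string[]][]) {
    const score = words.filter((word) => haystack.includes(word)).length;
    if (score > bestScore) {
      best = type;
      bestScore = score;
    }
  }
  return best;
}
```

A scorer like this is cheap enough to run before the model call, so the result can select a specialized prompt without spending tokens.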
3. Data Extraction
The AI extracts the following information:
Demographics
{
firstName: "John",
lastName: "Doe",
dateOfBirth: "1980-05-15",
gender: "male",
mrn: "MRN123456",
ssn: "***-**-6789",
address: {
street: "123 Main St",
city: "Boston",
state: "MA",
zipCode: "02101"
},
phone: "(555) 123-4567",
email: "john.doe@example.com"
}
Medical History
{
conditions: [{
name: "Type 2 Diabetes Mellitus",
icd10Code: "E11.9",
dateOfDiagnosis: "2018-03-20",
status: "active"
}],
medications: [{
name: "Metformin",
dosage: "500mg",
frequency: "twice daily",
startDate: "2018-04-01"
}],
allergies: [{
allergen: "Penicillin",
reaction: "Anaphylaxis",
severity: "severe"
}],
surgeries: [{
procedure: "Appendectomy",
date: "2015-07-10",
outcome: "successful"
}]
}
Clinical Data
{
vitalSigns: {
bloodPressure: "120/80",
heartRate: 72,
temperature: 98.6,
weight: "180 lbs",
height: "5'10\"",
bmi: 25.8
},
labResults: [{
testName: "Hemoglobin A1c",
value: "7.2",
unit: "%",
referenceRange: "4.0-5.6",
date: "2024-01-15",
abnormal: true
}],
immunizations: [{
vaccine: "COVID-19",
date: "2023-10-01",
booster: true
}]
}
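Some extracted fields are derivable from others, which makes a cheap consistency check possible after extraction. The sketch below recomputes two of them from the sample values above: the `abnormal` flag from `value` and `referenceRange`, and `bmi` from `weight` and `height`. Both helper names are hypothetical, and the string formats are assumed to match the samples ("low-high" ranges, pounds, feet and inches).

```typescript
// Returns true when the value lies outside a "low-high" range,
// or null when either string cannot be parsed.
function isOutsideReferenceRange(value: string, referenceRange: string): boolean | null {
  const match = referenceRange.match(/^\s*(-?\d+(?:\.\d+)?)\s*-\s*(-?\d+(?:\.\d+)?)\s*$/);
  const numeric = Number.parseFloat(value);
  if (!match || Number.isNaN(numeric)) return null;
  const low = Number.parseFloat(match[1]);
  const high = Number.parseFloat(match[2]);
  return numeric < low || numeric > high;
}

// BMI from imperial units: 703 * pounds / inches^2, rounded to one decimal.
// Returns null when the strings do not look like "180 lbs" and 5'10".
function bmiFromImperial(weight: string, height: string): number | null {
  const pounds = weight.match(/^\s*(\d+(?:\.\d+)?)\s*lbs?\s*$/i);
  const feetInches = height.match(/^\s*(\d+)'\s*(\d+(?:\.\d+)?)"\s*$/);
  if (!pounds || !feetInches) return null;
  const inches = Number(feetInches[1]) * 12 + Number(feetInches[2]);
  if (inches === 0) return null;
  const bmi = (703 * Number(pounds[1])) / (inches * inches);
  return Math.round(bmi * 10) / 10;
}
```

A mismatch between the recomputed and extracted values does not say which one is wrong, so it is better treated as a reason to lower confidence than as an automatic correction.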
Usage
Via Server Action
import { processDocumentWithAI } from '@/app/actions/ai-actions';
// Process a document
const result = await processDocumentWithAI(documentText);
if (result.success) {
console.log('Extracted data:', result.data);
console.log('Confidence:', result.confidence);
console.log('Insights:', result.insights);
}
Direct API Usage
import { AIService } from '@/lib/ai-service';
const aiService = new AIService();
const extractedData = await aiService.extractPatientData(documentText);
Confidence Scoring
The system provides confidence scores for extracted data:
- High (0.8 to 1.0): Data clearly stated in document
- Medium (0.5 to below 0.8): Data inferred or partially present
- Low (below 0.5): Data uncertain or missing
{
demographics: {
data: { /* ... */ },
confidence: 0.95
},
medicalHistory: {
data: { /* ... */ },
confidence: 0.82
},
overallConfidence: 0.87
}
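One way to act on these scores is to map each one to a band and flag low-scoring sections for review. The sketch below assumes that a score sitting exactly on a boundary (0.8 or 0.5) belongs to the higher band; the function names and the default threshold are illustrative, not part of the documented API.

```typescript
type ConfidenceBand = "high" | "medium" | "low";

// Maps a score in [0, 1] to a band; boundary values go to the higher band (assumption).
function confidenceBand(score: number): ConfidenceBand {
  if (!Number.isFinite(score) || score < 0 || score > 1) {
    throw new RangeError(`confidence must be in [0, 1], got ${score}`);
  }
  if (score >= 0.8) return "high";
  if (score >= 0.5) return "medium";
  return "low";
}

// Lists the section names whose confidence falls below the given threshold.
function sectionsNeedingReview(
  sections: Record<string, { confidence: number }>,
  threshold = 0.8,
): string[] {
  return Object.entries(sections)
    .filter(([, section]) => section.confidence < threshold)
    .map(([name]) => name);
}
```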
Validation & Error Handling
Data Validation
All extracted data is validated using Zod schemas:
import { z } from 'zod';
const PatientDemographicsSchema = z.object({
firstName: z.string().min(1),
lastName: z.string().min(1),
dateOfBirth: z.string().regex(/^\d{4}-\d{2}-\d{2}$/),
gender: z.enum(['male', 'female', 'other']),
// ... more fields
});
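Note that the `dateOfBirth` pattern checks only the shape of the string, so a value such as "2024-13-45" would pass. The dependency-free sketch below shows one way to also check that the date exists on the calendar; `isValidIsoDate` is a hypothetical helper, not something the project is known to ship.

```typescript
// Returns true only for YYYY-MM-DD strings that name a real calendar date.
// Years below 0100 are rejected as a side effect of how Date.UTC treats small years.
function isValidIsoDate(value: string): boolean {
  const match = /^(\d{4})-(\d{2})-(\d{2})$/.exec(value);
  if (!match) return false;
  const year = Number(match[1]);
  const month = Number(match[2]);
  const day = Number(match[3]);
  // Date.UTC rolls invalid dates over (Feb 30 becomes Mar 1 or 2),
  // so a round trip through Date reveals impossible values.
  const date = new Date(Date.UTC(year, month - 1, day));
  return (
    date.getUTCFullYear() === year &&
    date.getUTCMonth() === month - 1 &&
    date.getUTCDate() === day
  );
}
```

With Zod, a predicate like this could be attached to the string schema through `.refine(...)`.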
Error Scenarios
- Invalid Document Format
{
success: false,
error: "Unable to parse document: Unsupported format"
}
- Low Confidence Extraction
{
success: true,
data: extractedData,
warnings: ["Low confidence in medication dosages", "Missing patient MRN"]
}
- AI Service Unavailable
{
success: true,
data: fallbackExtraction,
usedFallback: true,
message: "AI service unavailable, used basic extraction"
}
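The three result shapes above can be modeled as a discriminated union so that callers handle each case explicitly. The type names and the `decideNextStep` policy below are assumptions for illustration; only the field names (`success`, `error`, `warnings`, `usedFallback`, `message`) come from the examples above.

```typescript
interface ExtractionFailure {
  success: false;
  error: string;
}

interface ExtractionSuccess<T> {
  success: true;
  data: T;
  warnings?: string[];
  usedFallback?: boolean;
  message?: string;
}

type ExtractionResult<T> = ExtractionFailure | ExtractionSuccess<T>;

type NextStep = "reject" | "manual_review" | "accept";

// Hypothetical policy: failures are rejected, and anything produced by the
// fallback path or carrying warnings is routed to a human reviewer.
function decideNextStep<T>(result: ExtractionResult<T>): NextStep {
  if (!result.success) return "reject";
  const hasWarnings = (result.warnings?.length ?? 0) > 0;
  if (result.usedFallback || hasWarnings) return "manual_review";
  return "accept";
}
```

Because `success` is the discriminant, TypeScript narrows the type after the first check, so `error` is only reachable on failures and `data` only on successes.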
Best Practices
1. Document Preparation
- Ensure good scan quality for images
- Remove headers/footers if possible
- Separate multi-patient documents
- Maintain consistent formatting
2. Review Extracted Data
Always review AI-extracted data before saving:
// Show review UI
const ReviewExtractedData = ({ data, onConfirm, onEdit }) => {
return (
<div>
<h3>Review Extracted Information</h3>
<PatientDataForm
initialData={data}
onSubmit={onConfirm}
highlightAIFields={true}
/>
</div>
);
};
3. Incremental Processing
For large documents, process in sections:
const sections = splitDocument(largeDocument);
const results = await Promise.all(
sections.map(section => extractPatientData(section))
);
const merged = mergeExtractedData(results);
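The two helpers used above are not defined in this guide. As a rough sketch of what they might do, the version below splits on blank lines and packs paragraphs into sections up to a character budget, then merges partial results with a "first value wins" rule. The `PartialExtraction` shape, the `maxChars` default, and the merge policy are all assumptions made for this example.

```typescript
// Splits on blank lines and packs paragraphs into sections of at most maxChars.
// A single paragraph longer than maxChars is kept whole rather than cut mid-sentence.
function splitDocument(text: string, maxChars = 4000): string[] {
  const paragraphs = text
    .split(/\n\s*\n/)
    .map((p) => p.trim())
    .filter((p) => p.length > 0);
  const sections: string[] = [];
  let current = "";
  for (const paragraph of paragraphs) {
    const candidate = current ? `${current}\n\n${paragraph}` : paragraph;
    if (candidate.length > maxChars && current) {
      sections.push(current);
      current = paragraph;
    } else {
      current = candidate;
    }
  }
  if (current) sections.push(current);
  return sections;
}

// Hypothetical, simplified result shape for illustration only.
interface Medication { name: string; dosage?: string }
interface PartialExtraction {
  demographics?: Record<string, string>;
  medications?: Medication[];
}

// Merges section results: the first non-missing demographic value wins,
// and medications are de-duplicated by case-insensitive name.
function mergeExtractedData(parts: PartialExtraction[]): PartialExtraction {
  const demographics: Record<string, string> = {};
  const medsByName = new Map<string, Medication>();
  for (const part of parts) {
    for (const [key, value] of Object.entries(part.demographics ?? {})) {
      if (!(key in demographics)) demographics[key] = value;
    }
    for (const med of part.medications ?? []) {
      const key = med.name.trim().toLowerCase();
      if (!medsByName.has(key)) medsByName.set(key, med);
    }
  }
  return { demographics, medications: Array.from(medsByName.values()) };
}
```

"First value wins" silently hides conflicts between sections, so a production merge would more likely record the disagreement and surface it during review.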
Advanced Features
Custom Extraction Rules
Define custom extraction rules for specific document types:
const customRules = {
labReport: {
patterns: {
hemoglobin: /Hemoglobin:\s*([\d.]+)\s*g\/dL/i,
glucose: /Glucose:\s*([\d.]+)\s*mg\/dL/i
},
transforms: {
hemoglobin: (value) => ({ value, unit: 'g/dL' }),
glucose: (value) => ({ value, unit: 'mg/dL' })
}
}
};
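A rule set like this still needs something to run it against the document text. The sketch below shows one plausible way: for each field, match the pattern and pass the first capture group to the matching transform. The `RuleSet` and `Measurement` types and the `applyRules` function are hypothetical names introduced for this example.

```typescript
interface Measurement {
  value: string;
  unit: string;
}

interface RuleSet {
  patterns: Record<string, RegExp>;
  transforms: Record<string, (value: string) => Measurement>;
}

// Same rules as the labReport entry above, typed for this example.
const labReportRules: RuleSet = {
  patterns: {
    hemoglobin: /Hemoglobin:\s*([\d.]+)\s*g\/dL/i,
    glucose: /Glucose:\s*([\d.]+)\s*mg\/dL/i,
  },
  transforms: {
    hemoglobin: (value) => ({ value, unit: "g/dL" }),
    glucose: (value) => ({ value, unit: "mg/dL" }),
  },
};

// Runs every pattern once and transforms the first capture group.
// Fields with no match, no capture group, or no transform are skipped.
function applyRules(text: string, rules: RuleSet): Record<string, Measurement> {
  const extracted: Record<string, Measurement> = {};
  for (const [field, pattern] of Object.entries(rules.patterns)) {
    const match = pattern.exec(text);
    const transform = rules.transforms[field];
    if (match && match[1] !== undefined && transform) {
      extracted[field] = transform(match[1]);
    }
  }
  return extracted;
}
```

Deterministic rules of this kind can also serve as a cross-check on the model's output for fields where the document format is stable.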
Multi-Language Support
Extract from documents in multiple languages:
const extractedData = await aiService.extractPatientData(documentText, {
language: 'spanish',
translateTo: 'english'
});
Batch Processing
Process multiple documents efficiently:
const documents = [doc1, doc2, doc3];
const results = await aiService.batchExtract(documents, {
parallel: true,
maxConcurrency: 5
});
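How `batchExtract` enforces `maxConcurrency` is not described here. A common way to bound concurrency is a small pool of worker "lanes" that pull the next index from a shared counter, as sketched below. `mapWithConcurrency` is a hypothetical helper, not the library's implementation.

```typescript
// Runs worker over items with at most maxConcurrency calls in flight,
// returning results in the same order as the input.
async function mapWithConcurrency<T, R>(
  items: readonly T[],
  maxConcurrency: number,
  worker: (item: T, index: number) => Promise<R>,
): Promise<R[]> {
  if (!Number.isInteger(maxConcurrency) || maxConcurrency < 1) {
    throw new RangeError("maxConcurrency must be a positive integer");
  }
  const results = new Array<R>(items.length);
  let next = 0;

  // Each lane repeatedly claims the next unprocessed index until none remain.
  async function runLane(): Promise<void> {
    while (next < items.length) {
      const index = next++;
      results[index] = await worker(items[index], index);
    }
  }

  const laneCount = Math.min(maxConcurrency, items.length);
  await Promise.all(Array.from({ length: laneCount }, () => runLane()));
  return results;
}
```

In this sketch the first rejected call rejects the whole batch; a batch extractor that should survive individual failures would catch errors inside the worker and return them as per-document results instead.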
Performance Optimization
Caching
Frequently processed documents are cached:
const cacheKey = generateDocumentHash(documentText);
const cached = await documentCache.get(cacheKey);
if (cached && !forceRefresh) {
return cached;
}
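The snippet above relies on `generateDocumentHash` and `documentCache`, neither of which is defined in this guide. A minimal sketch of both follows, assuming a SHA-256 content hash from Node's built-in `crypto` module and a synchronous in-memory cache with a time-to-live. The real cache is awaited in the snippet above, so it is presumably asynchronous and backed by an external store.

```typescript
import { createHash } from "node:crypto";

// Hashes normalized text so that trivial whitespace differences share a key.
function generateDocumentHash(documentText: string): string {
  const normalized = documentText.replace(/\r\n/g, "\n").trim();
  return createHash("sha256").update(normalized, "utf8").digest("hex");
}

// Hypothetical in-memory cache with per-entry expiry.
class InMemoryDocumentCache<V> {
  private readonly entries = new Map<string, { value: V; expiresAt: number }>();
  private readonly ttlMs: number;

  constructor(ttlMs: number) {
    this.ttlMs = ttlMs;
  }

  // Returns the cached value, or undefined when missing or expired.
  get(key: string, now: number = Date.now()): V | undefined {
    const entry = this.entries.get(key);
    if (!entry) return undefined;
    if (entry.expiresAt <= now) {
      this.entries.delete(key);
      return undefined;
    }
    return entry.value;
  }

  set(key: string, value: V, now: number = Date.now()): void {
    this.entries.set(key, { value, expiresAt: now + this.ttlMs });
  }
}
```

Keying by content hash keeps raw document text out of the cache keys, but the cached values still contain PHI and need the same protections as any other patient data store.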
Response Time
Typical processing times:
- Small document (fewer than 1 page): 2-3 seconds
- Medium document (1-5 pages): 5-10 seconds
- Large document (more than 5 pages): 10-30 seconds
Token Optimization
Minimize token usage with smart truncation:
const optimizedText = truncateDocument(documentText, {
maxTokens: 3000,
preserveSections: ['demographics', 'medications', 'diagnoses']
});
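To make the idea of section-aware truncation concrete, the sketch below keeps preserved sections first and then adds the remaining ones while a token budget allows. It makes several assumptions: the document has already been split into named sections, token counts are estimated with a rough "about 4 characters per token" heuristic rather than a real tokenizer, and `truncateSections` is a hypothetical stand-in for `truncateDocument`.

```typescript
interface Section {
  name: string;
  text: string;
}

// Rough heuristic only; a real tokenizer would give different counts.
function estimateTokens(text: string): number {
  return Math.ceil(text.length / 4);
}

// Keeps whole sections that fit in the budget, giving preserved sections
// first claim on it, and returns the survivors in document order.
function truncateSections(
  sections: Section[],
  options: { maxTokens: number; preserveSections: string[] },
): Section[] {
  const preserved = new Set(options.preserveSections);
  const byPriority = [
    ...sections.filter((s) => preserved.has(s.name)),
    ...sections.filter((s) => !preserved.has(s.name)),
  ];
  const kept = new Set<Section>();
  let budget = options.maxTokens;
  for (const section of byPriority) {
    const cost = estimateTokens(section.text);
    if (cost <= budget) {
      kept.add(section);
      budget -= cost;
    }
  }
  return sections.filter((s) => kept.has(s));
}
```

Dropping whole sections avoids cutting a medication or diagnosis entry in half, at the cost of losing a preserved section entirely when it alone exceeds the budget.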
Integration Examples
With Patient Import
// Extract and create patient
const handleDocumentUpload = async (file: File) => {
const text = await extractTextFromFile(file);
const extracted = await processDocumentWithAI(text);
if (extracted.success) {
const patient = await addPatientWithInsights(extracted.data);
router.push(`/patients/${patient.id}`);
}
};
With EHR Systems
// Process EHR message
const processEHRMessage = async (hl7Message: string) => {
const parsed = parseHL7(hl7Message);
const enhanced = await aiService.enhanceWithAI(parsed);
return syncWithEHR(enhanced);
};
Troubleshooting
Common Issues
- Poor Extraction Quality
  - Check document quality
  - Verify OCR accuracy
  - Review prompt templates
- Missing Data
  - Ensure data is present in document
  - Check extraction patterns
  - Adjust confidence thresholds
- Slow Processing
  - Optimize document size
  - Use batch processing
  - Enable caching
Debug Mode
Enable detailed logging:
const result = await processDocumentWithAI(documentText, {
debug: true,
logPrompts: true,
logResponses: true
});
Security Considerations
- PHI is processed securely
- No data is stored in the AI service
- Audit logs are maintained
- Data is encrypted in transit
- Processing is HIPAA-compliant