bio-ontology-mapper
v0.1.0
Bio-Ontology Mapper
Overview
Biomedical terminology normalization tool that maps free-text clinical and scientific concepts to standardized ontologies for semantic interoperability and data harmonization.
Key Capabilities:
- Multi-Ontology Support: SNOMED CT, MeSH, ICD-10, LOINC, RxNorm
- Entity Extraction: NER for diseases, symptoms, procedures, drugs
- Fuzzy Matching: Handle typos, abbreviations, and synonyms
- Confidence Scoring: Reliability metrics for each mapping
- Batch Processing: Normalize large datasets efficiently
- Cross-Mapping: Translate between ontology systems
When to Use
✅ Use this skill when:
- Normalizing clinical notes for EHR integration
- Standardizing terminology for multi-site studies
- Mapping legacy data to modern ontologies
- Preparing data for clinical data warehouses
- Converting free-text to coded data for analysis
- Building semantic search for biomedical literature
- Teaching biomedical informatics principles
❌ Do NOT use for:
- Clinical diagnosis or decision support → Use clinical decision tools
- Real-time patient care → Latency too high for acute settings
- Replacing expert coding → Use for pre-coding only; final expert review is still needed
- Processing PHI without de-identification → Ensure HIPAA compliance first
Integration:
- Upstream: clinical-data-cleaner (data preparation), ehr-semantic-compressor (text extraction)
- Downstream: clinical-data-cleaner (SDTM mapping), unstructured-medical-text-miner (NLP pipelines)
Core Capabilities
1. Entity Recognition and Mapping
Extract and map biomedical entities to ontologies:
from scripts.mapper import BioOntologyMapper

mapper = BioOntologyMapper()

# Map clinical text
result = mapper.map_text(
    text="Patient has diabetes and hypertension, taking metformin",
    ontologies=["snomed", "mesh", "rxnorm"],
    confidence_threshold=0.7,
)

# Print each recognized entity with its mapped concept
for entity in result.entities:
    print(f"{entity.text} → {entity.concept_id} ({entity.ontology})")
    print(f"  Preferred: {entity.preferred_term}")
    print(f"  Confidence: {entity.confidence:.2f}")
Supported Ontologies:

| Ontology | Domain | Use Case |
|----------|--------|----------|
| SNOMED CT | Clinical | EHR interoperability |
| MeSH | Literature | PubMed indexing |
| ICD-10 | Billing | Diagnosis codes |
| LOINC | Labs | Test result standardization |
| RxNorm | Drugs | Medication normalization |
| HGNC | Genes | Gene name standardization |
2. Cross-Ontology Translation
Map concepts between different ontologies:
# Cross-map SNOMED to ICD-10
translation = mapper.cross_map(
    source_id="22298006",  # SNOMED: Myocardial infarction
    source_ontology="snomed",
    target_ontology="icd10",
)
print(f"ICD-10: {translation.target_id} - {translation.target_term}")
# Output: ICD-10: I21.9 - Acute myocardial infarction, unspecified
Cross-Mapping Coverage:
- SNOMED CT ↔ ICD-10-CM (clinical modifications)
- MeSH ↔ SNOMED CT (literature to clinical)
- RxNorm ↔ ATC (drug classifications)
- LOINC ↔ SNOMED (lab to clinical)
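At its simplest, a crosswalk of this kind is a lookup table keyed by source ontology, source code, and target ontology. The sketch below is a standalone illustration, not the `cross_map` implementation: the `CROSSWALK` table and the `lookup_crosswalk` helper are hypothetical, and the single row is the SNOMED CT to ICD-10 example shown above. Because one source concept can map to several target codes, the lookup returns a list.

```python
from typing import NamedTuple


class TargetConcept(NamedTuple):
    code: str
    term: str


# Hypothetical in-memory crosswalk:
# (source ontology, source code, target ontology) -> list of target concepts.
# A real crosswalk would be loaded from a versioned mapping file.
CROSSWALK: dict[tuple[str, str, str], list[TargetConcept]] = {
    ("snomed", "22298006", "icd10"): [
        TargetConcept("I21.9", "Acute myocardial infarction, unspecified"),
    ],
}


def lookup_crosswalk(source_ontology: str, source_id: str,
                     target_ontology: str) -> list[TargetConcept]:
    """Return every target concept for a source code; an empty list means unmapped."""
    key = (source_ontology.lower(), source_id, target_ontology.lower())
    return CROSSWALK.get(key, [])


for target in lookup_crosswalk("SNOMED", "22298006", "icd10"):
    print(f"{target.code} - {target.term}")

# An unknown code yields an empty list rather than raising an exception.
print(lookup_crosswalk("snomed", "0000000", "icd10"))
```

Returning a list keeps the one-to-many case explicit, so the caller has to decide how to handle several candidates instead of silently taking the first.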
3. Batch Normalization
Process large datasets:
# Batch process CSV
results = mapper.batch_map(
    input_file="clinical_terms.csv",
    text_column="diagnosis_description",
    ontologies=["snomed", "icd10"],
    output_format="csv",
    max_workers=4,
)

# Results include:
# - Original term
# - Mapped concept ID
# - Confidence score
# - Alternative mappings (if ambiguous)
Performance:
- ~100 terms/second (with caching)
- ~20 terms/second (API lookup)
- Parallel processing for large datasets
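One way to get both effects (fewer lookups and parallelism) is to de-duplicate the input before fanning out to worker threads. This is a minimal sketch under assumptions, not the `batch_map` internals: `lookup_term` is a hypothetical stand-in for a slow lookup, and `batch_lookup` is an illustrative helper.

```python
from concurrent.futures import ThreadPoolExecutor


def lookup_term(term: str) -> str:
    """Hypothetical placeholder for a slow lookup such as an API call.

    Here it only normalizes the string so the example runs offline.
    """
    return term.strip().lower()


def batch_lookup(terms: list[str], max_workers: int = 4) -> list[str]:
    """Look up each distinct term once, in parallel, then restore input order."""
    unique_terms = list(dict.fromkeys(terms))  # de-duplicate, preserving order
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map returns results in the same order as unique_terms.
        resolved = dict(zip(unique_terms, pool.map(lookup_term, unique_terms)))
    return [resolved[term] for term in terms]


print(batch_lookup(["Diabetes", "Hypertension", "Diabetes"]))
```

Clinical term lists tend to repeat the same strings many times, so de-duplication alone can remove a large share of the lookups before any threading is involved.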
4. Confidence Scoring and Validation
Assess mapping reliability:
scoring = mapper.score_mapping(
    term="heart attack",
    candidate="22298006",  # Myocardial infarction
    factors=["string_similarity", "context_match", "frequency"],
)
print(f"Overall confidence: {scoring.confidence:.2f}")
print(f"Breakdown: {scoring.factors}")
Scoring Factors:
- String similarity: Levenshtein distance, n-grams
- Context match: Surrounding words alignment
- Frequency: Common usage in corpus
- Semantic similarity: Vector embeddings
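To make the string-similarity factor concrete, the sketch below combines a normalized Levenshtein similarity with a character n-gram Jaccard overlap. It is an illustration only: the function names and the equal weighting are assumptions, not the formula used by `scorer.py`.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance between two strings (insertions, deletions, substitutions)."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution (or match)
            ))
        previous = current
    return previous[-1]


def edit_similarity(a: str, b: str) -> float:
    """Levenshtein distance rescaled to [0, 1], where 1.0 means identical."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - levenshtein(a, b) / longest


def ngram_similarity(a: str, b: str, n: int = 3) -> float:
    """Jaccard overlap of character n-grams (short strings count as one gram)."""
    grams_a = {a[i:i + n] for i in range(max(len(a) - n + 1, 1))}
    grams_b = {b[i:i + n] for i in range(max(len(b) - n + 1, 1))}
    return len(grams_a & grams_b) / len(grams_a | grams_b)


def string_confidence(term: str, candidate: str, edit_weight: float = 0.5) -> float:
    """Weighted mean of the two string factors; the weight is illustrative."""
    term, candidate = term.lower(), candidate.lower()
    return (edit_weight * edit_similarity(term, candidate)
            + (1 - edit_weight) * ngram_similarity(term, candidate))


print(f"{string_confidence('diabtes', 'diabetes'):.2f}")
print(f"{string_confidence('heart attack', 'myocardial infarction'):.2f}")
```

The second call scores low even though the two phrases are synonyms, which is why string similarity alone is not enough and the context, frequency, and semantic factors are listed alongside it.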
Common Patterns
Pattern 1: Clinical Note Normalization
Scenario: Convert free-text diagnoses to SNOMED codes.
# Normalize clinical notes
python scripts/main.py \
  --input notes.csv \
  --column diagnosis_text \
  --ontology snomed \
  --threshold 0.8 \
  --output coded_diagnoses.csv
# Results: "heart attack" → 22298006 (Myocardial infarction)
Post-Processing:
- Review low-confidence mappings (<0.8)
- Handle ambiguous terms manually
- Validate against clinical context
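The review step can be sketched as a simple partition of the output rows by confidence. The column names (`term`, `concept_id`, `confidence`) and the sample rows are assumptions made for illustration; the actual layout of `coded_diagnoses.csv` may differ, and the confidence values are invented.

```python
import csv
import io

REVIEW_THRESHOLD = 0.8  # same cut-off as the --threshold value used above


def split_by_confidence(rows, threshold=REVIEW_THRESHOLD):
    """Partition mapping rows into (accepted, needs_review)."""
    accepted, needs_review = [], []
    for row in rows:
        # Unmapped rows (empty concept ID) always go to manual review.
        if row["concept_id"] and float(row["confidence"]) >= threshold:
            accepted.append(row)
        else:
            needs_review.append(row)
    return accepted, needs_review


# Made-up sample standing in for the real output file.
sample = io.StringIO(
    "term,concept_id,confidence\n"
    "heart attack,22298006,0.93\n"
    "MI,22298006,0.55\n"
    "xyz syndrome,,0.00\n"
)
accepted, needs_review = split_by_confidence(csv.DictReader(sample))
print(len(accepted), "accepted;", len(needs_review), "flagged for review")
```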
Pattern 2: Literature Indexing
Scenario: Map research paper keywords to MeSH.
# Map keywords to MeSH
mesh_terms = mapper.map_to_mesh(
    keywords=["cancer immunotherapy", "checkpoint inhibitors", "PD-1"],
    include_tree_numbers=True,
    include_qualifiers=True,
)

for term in mesh_terms:
    print(f"{term.input} → {term.descriptor}")
    print(f"  Tree: {term.tree_numbers}")
    print(f"  Entry terms: {term.synonyms}")
Pattern 3: Drug Name Normalization
Scenario: Standardize medication names across datasets.
# Normalize drug names
drugs = ["Tylenol", "Advil", "Motrin", "acetaminophen"]

for drug in drugs:
    result = mapper.map_to_rxnorm(drug)
    print(f"{drug} → {result.rxcui}: {result.name}")

# Tylenol → 161: Acetaminophen
# Advil → 5640: Ibuprofen
# Motrin → 5640: Ibuprofen
Pattern 4: EHR Data Harmonization
Scenario: Merge data from multiple hospital systems.
# Harmonize diagnoses from 3 hospitals
python scripts/main.py \
  --batch \
  --inputs "hospital_a.csv,hospital_b.csv,hospital_c.csv" \
  --target-ontology snomed \
  --cross-map-to icd10 \
  --output harmonized_data.csv
Complete Workflow Example
From free-text to coded database:
from scripts.mapper import BioOntologyMapper
from scripts.validator import MappingValidator

# Initialize
mapper = BioOntologyMapper()
validator = MappingValidator()

# Step 1: Extract entities from text
clinical_note = "Patient has Type 2 diabetes and hypertension..."
entities = mapper.extract_entities(clinical_note)

# Step 2: Map to SNOMED
mappings = []
for entity in entities:
    mapping = mapper.map_to_snomed(
        entity.text,
        context=clinical_note,
        top_n=3,
    )
    mappings.append(mapping)

# Step 3: Validate mappings
for mapping in mappings:
    validation = validator.validate(
        mapping,
        check_clinical_plausibility=True,
    )
    if not validation.is_valid:
        print(f"Review needed: {mapping}")

# Step 4: Export to database format
db_records = [m.to_database_record() for m in mappings]
Quality Checklist
Pre-Mapping:
- [ ] Text preprocessed (lowercase, punctuation handled)
- [ ] Abbreviations expanded where possible
- [ ] Language identified (multilingual support)

During Mapping:
- [ ] Confidence threshold appropriate (>0.7 for clinical)
- [ ] Multiple candidates considered for ambiguous terms
- [ ] Context used for disambiguation

Post-Mapping:
- [ ] Low-confidence mappings flagged for review
- [ ] Unmapped terms logged
- [ ] CRITICAL: Clinical expert validation for high-stakes use

Before Production:
- [ ] Mapping accuracy validated on gold standard
- [ ] False positive rate acceptable (<5%)
- [ ] Recall acceptable for use case (>90%)
- [ ] API rate limits respected
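The production thresholds above can be checked with a few lines once a gold standard is available. In this sketch "false positive rate" is interpreted as the share of emitted mappings that are wrong (one minus precision), which is an assumption about what the checklist means. `evaluate_mappings` is a hypothetical helper and the toy data uses placeholder codes apart from the SNOMED example used earlier.

```python
from typing import Optional


def evaluate_mappings(predicted: dict[str, Optional[str]],
                      gold: dict[str, str]) -> dict[str, float]:
    """Compare predicted concept IDs with a gold standard keyed by input term."""
    # Terms the mapper declined to map (None) are not counted as emitted.
    emitted = {term: code for term, code in predicted.items() if code is not None}
    correct = sum(1 for term, code in emitted.items() if gold.get(term) == code)
    wrong = len(emitted) - correct
    return {
        # Share of emitted mappings that are wrong (1 - precision).
        "false_positive_rate": wrong / len(emitted) if emitted else 0.0,
        # Share of gold-standard terms that received the correct code.
        "recall": correct / len(gold) if gold else 0.0,
    }


# Toy data: CODE_B, CODE_C, and CODE_D are placeholders, not real concept IDs.
gold = {"heart attack": "22298006", "term b": "CODE_B",
        "term c": "CODE_C", "term d": "CODE_D"}
predicted = {"heart attack": "22298006", "term b": "CODE_B",
             "term c": "WRONG", "term d": None}

metrics = evaluate_mappings(predicted, gold)
ready = metrics["false_positive_rate"] < 0.05 and metrics["recall"] > 0.90
print(metrics, "ready for production:", ready)
```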
Common Pitfalls
Mapping Errors:
- ❌ Abbreviation ambiguity → "MI" = Myocardial infarction OR Michigan
- ✅ Use context; flag for manual review
- ❌ Outdated terms → Old terminology not in current ontology
- ✅ Use historical mappings; update terminology
- ❌ False confidence → High score for wrong concept
- ✅ Always review top-3 candidates
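As an illustration of "use context; flag for manual review", the sketch below scores each sense of an abbreviation by how many of its cue words appear in the surrounding text, and returns `None` when no sense has a clear lead. The `SENSES` inventory, its cue words, and the `disambiguate` helper are hypothetical, and the approach assumes at least two senses per abbreviation.

```python
import re
from typing import Optional

# Hypothetical sense inventory: each sense lists cue words that support it.
SENSES = {
    "MI": {
        "Myocardial infarction": {"chest", "pain", "troponin", "ecg", "cardiac", "stent"},
        "Michigan": {"state", "resident", "address", "detroit", "county"},
    }
}


def disambiguate(abbreviation: str, context: str, min_margin: int = 1) -> Optional[str]:
    """Pick the sense with the most cue-word hits, or None to flag for review."""
    tokens = set(re.findall(r"[a-z]+", context.lower()))
    scores = {sense: len(cues & tokens)
              for sense, cues in SENSES[abbreviation].items()}
    ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    best, runner_up = ranked[0], ranked[1]
    # Require a clear lead; ties and zero-evidence cases go to manual review.
    if best[1] - runner_up[1] < min_margin:
        return None
    return best[0]


print(disambiguate("MI", "Admitted with chest pain, troponin elevated, prior MI"))
print(disambiguate("MI", "Patient transferred from MI"))
```

Returning `None` instead of a best guess keeps ambiguous cases visible, which fits the advice to review candidates rather than trust a single high score.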
Technical Issues:
- ❌ API failures → No local fallback
- ✅ Implement caching; use local reference files
- ❌ Version mismatches → Different ontology versions
- ✅ Track ontology version used
- ❌ PHI exposure → Sending patient data to external APIs
- ✅ De-identify before API calls; use local processing when possible
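A minimal sketch of the caching and local-fallback advice: answers from the remote service are cached, and a network-style failure falls back to local reference data. `FallbackLookup`, the `remote` callable, and the dictionary standing in for a local reference file are all assumptions for illustration, not the contents of `caching.py`.

```python
from typing import Callable, Optional


class FallbackLookup:
    """Cache first, then the remote lookup, then local data if the remote fails."""

    def __init__(self, remote: Callable[[str], Optional[str]],
                 local_reference: dict[str, str]):
        self._remote = remote
        self._local = local_reference
        self._cache: dict[str, Optional[str]] = {}

    def get(self, term: str) -> Optional[str]:
        key = term.strip().lower()
        if key in self._cache:
            return self._cache[key]
        try:
            concept_id = self._remote(key)
        except OSError:  # covers ConnectionError and TimeoutError
            # Remote unavailable: answer from local data, but do not cache it,
            # so the remote is retried on the next call.
            return self._local.get(key)
        self._cache[key] = concept_id
        return concept_id


def unavailable_remote(term: str) -> Optional[str]:
    """Simulates an API outage."""
    raise ConnectionError("service unavailable")


lookup = FallbackLookup(unavailable_remote, {"heart attack": "22298006"})
result = lookup.get("Heart Attack")
print(result)
```

The wrapper only addresses availability. Anything passed to `remote` still leaves the machine, so de-identification has to happen before the lookup is called.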
References
Available in references/ directory:
- snomed_ct_guide.md - SNOMED CT hierarchy and relationships
- mesh_structure.md - MeSH tree structure and qualifiers
- ontology_mappings.md - Crosswalks between systems
- nlp_best_practices.md - Biomedical text processing
- api_documentation.md - External service integration
- validation_datasets.md - Gold standard test sets
Scripts
Located in scripts/ directory:
- main.py - CLI interface for mapping
- mapper.py - Core ontology mapping engine
- extractor.py - Named entity recognition
- cross_mapper.py - Ontology-to-ontology translation
- scorer.py - Confidence calculation
- batch_processor.py - Large dataset handling
- validator.py - Mapping quality checks
- caching.py - Local storage for frequent lookups
Limitations
- Ambiguity: Many-to-many mappings common; context required
- Coverage: Rare diseases and new concepts may not be in ontologies
- Versioning: Ontology updates can change mappings over time
- Language: Best support for English; other languages limited
- Real-time: Not suitable for time-critical clinical applications
- API Dependency: Requires internet for most lookups (caching helps)
⚠️ Critical: Ontology mapping is for research and data integration, not clinical decision-making. Always validate mappings with domain experts before use in patient care contexts. Never process PHI without appropriate de-identification and compliance measures.
Parameters
| Parameter | Type | Default | Description |
|---|---|---|---|
| --term | str | Required | Single term to map |
| --input | str | Required | Input file path |
| --output | str | Required | Output file path |
| --ontology | str | 'both' | |
| --threshold | float | 0.7 | |
| --format | str | 'json' | |
| --use-api | str | Required | Use UMLS/MeSH APIs |
| --api-key | str | Required | |