Case Study Details:
The Challenge
For the university’s information systems division, maintaining a single, accurate, institution-wide student identity had become increasingly difficult. Over 1.6 million student records were spread across multiple academic, enrollment, and administrative platforms. As integrations expanded, so did duplicate student profiles, often caused by inconsistent data entry, format variations, or incomplete records.
Traditional deduplication relied heavily on deterministic rules—exact match on name, email, phone—an approach that struggled with real‑world data variability and failed to detect multi-record duplicate clusters.
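The limitation can be shown with a minimal sketch (hypothetical records and field names, not the university's actual schema): a single formatting variation in any attribute is enough to defeat an exact-match rule.

```python
# Hypothetical records: the same student entered twice with ordinary
# data-entry variation (middle initial, dotted email, unformatted phone).
a = {"name": "Jane Doe",    "email": "jdoe@uni.edu",  "phone": "555-1234"}
b = {"name": "Jane A. Doe", "email": "j.doe@uni.edu", "phone": "5551234"}

def exact_match(x, y):
    # Deterministic rule: flag a duplicate only when every attribute
    # matches exactly -- the approach described above.
    return all(x[k] == y[k] for k in ("name", "email", "phone"))

print(exact_match(a, b))  # -> False: the same student, yet no rule fires
```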
This led to:
- Fragmented student identity data
- Administrative reporting inconsistencies
- Increased manual effort for data cleanup
- Operational inefficiencies across student services
The growing volume of records made manual review unsustainable, and no incremental improvement would solve it. The deduplication methodology itself needed modernization.
Evoke’s Approach
Evoke Technologies developed and deployed a hybrid AI-assisted Entity Resolution System that combines SQL-based candidate selection with LLM-driven similarity analysis to automatically detect duplicate student records at scale.
- SQL-Based Candidate Generation
Configurable SQL pipelines generate potential duplicate clusters using flexible attribute matching rules, enabling preliminary grouping before deeper analysis.
- LLM-Driven Similarity Analysis
Large Language Models evaluate names, emails, phone numbers, and addresses to identify semantic similarity beyond exact matches, resolving ambiguity caused by data variations or partial inputs.
- Hybrid Scoring Framework
Deterministic match signals and LLM similarity outputs feed into a unified scoring engine, producing confidence scores to rank potential duplicate records.
- Scalable Azure Pipeline
An Azure-based architecture—ADF, Logic Apps, Azure Functions, Cosmos DB—enables incremental, automated, and scalable processing of millions of student records.
- Human-in-the-Loop Validation
High-confidence matches are auto-classified, while borderline clusters are routed to administrative reviewers for precise and controlled merging.
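The steps above can be sketched end to end in a few dozen lines. Everything here is illustrative: `blocking_key` stands in for the SQL candidate-generation step, `llm_similarity` substitutes a simple string ratio for the actual LLM call, and the weights and thresholds are assumed values, not those of the production system.

```python
from dataclasses import dataclass
from difflib import SequenceMatcher
from itertools import combinations

@dataclass
class StudentRecord:
    record_id: str
    name: str
    email: str
    phone: str

def blocking_key(rec: StudentRecord) -> str:
    # Stand-in for SQL-based candidate generation: group records that
    # share a coarse key (here, last name) before pairwise comparison,
    # so millions of records are never compared all-against-all.
    return rec.name.split()[-1].lower()

def llm_similarity(a: StudentRecord, b: StudentRecord) -> float:
    # Placeholder for the LLM-driven semantic comparison; a plain
    # string ratio stands in for the model's similarity judgment.
    return SequenceMatcher(None, f"{a.name} {a.email}".lower(),
                           f"{b.name} {b.email}".lower()).ratio()

def confidence(a: StudentRecord, b: StudentRecord) -> float:
    # Hybrid scoring: deterministic match signals plus semantic
    # similarity feed one confidence score (weights are illustrative).
    deterministic = 0.0
    if a.email and a.email.lower() == b.email.lower():
        deterministic += 0.5
    if a.phone and a.phone == b.phone:
        deterministic += 0.3
    return min(1.0, 0.5 * deterministic + 0.5 * llm_similarity(a, b))

def classify(records, auto_threshold=0.8, review_threshold=0.55):
    # Route each candidate pair: high-confidence pairs are
    # auto-classified, borderline pairs go to human reviewers.
    auto, review = [], []
    blocks = {}
    for r in records:
        blocks.setdefault(blocking_key(r), []).append(r)
    for group in blocks.values():
        for a, b in combinations(group, 2):
            score = confidence(a, b)
            if score >= auto_threshold:
                auto.append((a.record_id, b.record_id, round(score, 2)))
            elif score >= review_threshold:
                review.append((a.record_id, b.record_id, round(score, 2)))
    return auto, review
```

High-scoring pairs land in the auto-classified list while borderline pairs are queued for review, mirroring the human-in-the-loop validation step; in the deployed system this routing runs incrementally inside the Azure pipeline rather than in memory.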
The Outcomes
| Metric | Before | After |
|---|---|---|
| Duplicate detection accuracy | Limited by exact-match rules | Significantly improved through LLM semantic analysis |
| Undetected duplicate records | 30,000+ | All identified and flagged for cleanup |
| Administrative effort | High manual review dependency | Reduced through automated scoring & clustering |
| Scalability | Batch processes with bottlenecks | Fully scalable Azure-based pipeline |
The university gained a repeatable, scalable, AI-powered entity resolution framework that continuously improves data quality, enhances operational efficiency, and supports long-term institutional data governance.
Strategic Value Delivered
Beyond operational gains, the engagement created a sustainable and intelligent data-quality foundation, ensuring:
- A unified, accurate student identity across systems
- Reduction of downstream administrative errors
- Scalable support for future system integrations
- Lower long-term cost and effort for data quality management