Overhauling a Major Health Data Platform

The existing data platform was unstable and failed frequently. Job runs often crashed and had to be restarted, sometimes after running for several days. These failures cost tens of thousands of dollars. Job run times also consistently exceeded SLA thresholds, and the platform had reached its limit for scaling ontologies. To address these problems, the entire Conveyor data platform was slated for replacement with a faster, more scalable, more stable, and more cost-efficient Data Aggregator.

PROJECT GOALS

  1. Reduce the average customer job run time (across all environments) 

  2. Reduce the average customer job run cost (across all environments) 

  3. Support more job runs within agreed SLAs 

  4. Provide incremental value in data delivery to customers 

  5. Remain aligned with IHDP

The current runtime was 86 hours on an Extra Large cluster ($16.91/h) plus a Redis cluster ($18.35/h), or about $3,032 per run (in the best case, with no failures and no re-runs). My estimate for the proposed solution is 2-4 hours on a Medium cluster ($14.95/h), for a total cost of $30-$60 per run.
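The per-run figures above can be checked with a short calculation. This is only a sketch of the arithmetic using the hourly rates and durations quoted in the text. It assumes the proposed run needs no separate Redis cluster (the estimate above lists only the Medium cluster), and the names `run_cost`, `old_cost`, `new_low`, and `new_high` are illustrative.

```python
# Hourly rates and durations quoted in the text.
OLD_CLUSTER_RATE = 16.91    # Extra Large cluster, $/h
REDIS_RATE = 18.35          # Redis cluster, $/h
OLD_RUNTIME_HOURS = 86      # best case, no failures or re-runs

NEW_CLUSTER_RATE = 14.95    # Medium cluster, $/h
NEW_RUNTIME_HOURS = (2, 4)  # estimated range for the proposed solution


def run_cost(hours, *hourly_rates):
    """Cost of one run: hours multiplied by the sum of all hourly rates."""
    return hours * sum(hourly_rates)


old_cost = run_cost(OLD_RUNTIME_HOURS, OLD_CLUSTER_RATE, REDIS_RATE)

# Assumption: the proposed run is billed for the Medium cluster only.
new_low, new_high = (run_cost(h, NEW_CLUSTER_RATE) for h in NEW_RUNTIME_HOURS)

print(f"Current platform:  ${old_cost:,.2f} per run")
print(f"Proposed platform: ${new_low:.2f} to ${new_high:.2f} per run")
print(f"Estimated saving:  {1 - new_high / old_cost:.1%} to {1 - new_low / old_cost:.1%}")
```

Under these assumptions the estimated saving per run comes out at roughly 98 to 99 percent, before counting the cost of failed runs that had to be restarted on the old platform.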

Target go-live in production: by the end of the fiscal year.

Team: Four development teams and one DevOps team were involved.

TECHNICAL PLAN

1. Development

 1.1 Functionality 

 1.1.1 Identity

 1.1.2 Terminology

 1.1.3 Curation

 1.1.4 Representation

 1.1.5 Representation Curation (Linked Resources, Aggregation)

 1.1.6 Representation Terminology

 1.1.7 Packaging CSV

 1.1.8 Packaging OSDB

 1.1.9 Logging - Galaxy

 1.1.10 Job Main Function

2. Unit Testing

 2.1 Code Coverage 

 2.1.1 Identity

 2.1.2 Terminology

 2.1.3 Curation

 2.1.4 Representation

 2.1.5 Representation Curation

 2.1.6 Representation Terminology

 2.1.7 Packaging CSV

 2.1.8 Packaging OSDB

3. Deployment

 3.1 Build & Deploy Job

 3.2 Deploy Dependencies

 3.2.1 Nick Names File  

 3.2.2 Terminology Mappings  

 3.2.3 Configuration Files  

4. Integration Testing

 4.1 Conveyor Tests 

 4.1.2 Integrate Data Aggregator

 4.1.3 Representation Mappers

 4.1.4 Representation for Profiling

 4.3 E2E Tests  

 4.3.1 TeamCity Project

 4.3.2 Representation Mappers

 4.3.3 Make E2E tests green

 4.4 Crypto Internal QA

5. Documentation

 5.1 Readme

 5.2 System Failure Analysis

6. Performance Testing  

 6.1 Refresh Sentara Data

 6.2 Run Data Aggregator

7. Scheduling

 7.1 Runbook Setup

 7.2 Job Configuration

8. Customer Rollout

 8.1 Promote Conveyor Tests to Stage

 8.2 Promote E2E tests to Stage

 8.3 Promote Crypto Internal Tests to Stage

 8.4 Promote code to INT

 8.5 Run Sentara Job in INT

 8.6 Run E2E in INT

 8.7 Refactor SSIS Package

 8.9 Internal UAT

 8.10 Promote to Prod

 8.11 Run E2E in Prod

 8.12 Repartition Sentara Data on Prod

 8.13 Run Sentara Job on Prod

 8.14 Internal UAT in Prod

9. Knowledge Transfer and Handoff

 9.1 Runbooks  

 9.2 Monitoring 

 9.3 Deployment 

 9.4 Failure Analysis

RESULTS

Job Run Time

[Figure: performance improvements with the new platform]

Cost Savings

[Figure: cloud cost saving per job run]

Over 1,000 engineering development hours across 5 teams

