We can imagine our health as a jigsaw, with each individual piece representing a different aspect of our medical history. These pieces might include blood test results, X-ray images or the notes taken by a doctor as we describe our symptoms. These jigsaw pieces are ultimately recorded and stored in electronic health records (or EHRs). EHRs are a valuable resource: they provide an overview of someone’s health, and they have the potential to allow clinicians and researchers to unlock new medical insights. However, there’s a fly in the ointment – not all the pieces in such records fit together correctly, and they may not completely capture the required information. Some clinical events are incompletely documented, others do not align with related records, and some are missing entirely. This data quality problem was tackled by Dr. Hanieh Razzaghi of the Children’s Hospital of Philadelphia, and her colleagues, in their innovative work on the PRESERVE study, a research project exploring chronic kidney disease in children (the PRESERVE study itself was led by Drs. Michelle Denburg and Christopher Forrest). Using EHRs from 15 different hospitals across the United States, the team aimed to understand how various treatments could slow the progression of chronic kidney disease. First, however, they had to make sure that the data they were relying on were accurate, reliable, and suitable for the required complex analyses.
When researchers use EHRs to gain medical insights, they are not starting with information that was originally collected for scientific purposes. These records were created to help doctors diagnose and treat individual patients, not to answer research questions. As a result, the data might have gaps or quirks. Imagine trying to study a population’s eating habits but only receiving records from diners who forgot to order dessert. Without the full story, it’s easy to draw the wrong conclusions.
In medical research, these kinds of errors can have serious consequences. Misclassification of a patient’s condition, missing data points, or inconsistencies in how tests are reported can skew results and, worse, lead to incorrect recommendations for patient care. Recognizing this, Dr. Razzaghi’s team set out to systematically assess and improve the quality of the data they were using for the PRESERVE study.
The PRESERVE study is a large-scale investigation focused on children with chronic kidney disease, a condition that can lead to kidney failure and other serious health problems. The study examines whether controlling high blood pressure, a key risk factor in kidney disease, can slow down the decline in kidney function. To do this, researchers needed high-quality data on everything from blood pressure readings and kidney function tests to the medications children received and how often they visited specialists.
However, combining EHR data from 15 hospitals presented a major challenge. Each hospital used slightly different systems, recorded information in unique ways, and sometimes left out important details. To address these issues, the research team employed a rigorous approach called Study-Specific Data Quality Assessment (SSDQA).
The SSDQA framework was grounded in a theoretical model of data quality, so that the checks it produces are reproducible beyond this single use case. Applying the framework in this study involved two rounds of detailed testing. In the first round, the researchers ran checks on high-level summaries of the data. This helped them spot broad problems, such as missing test results or inconsistent coding for procedures. For instance, they found that some hospitals didn’t record certain kidney function tests at all, while others used different codes for the same procedures.
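To make the idea of a high-level check concrete, here is a minimal sketch of how one might flag a hospital whose summary counts suggest a test is barely being recorded. This is not the SSDQA code itself; the site names, counts and threshold are invented for illustration.

```python
# Illustrative sketch (not the PRESERVE/SSDQA implementation):
# a high-level check comparing how completely each site records
# a given lab test. All numbers below are invented.
site_summaries = {
    # site: (patients in cohort, patients with >=1 serum creatinine result)
    "site_A": (1200, 1140),
    "site_B": (950, 910),
    "site_C": (800, 35),   # suspiciously low -> should be flagged
}

def flag_low_completeness(summaries, min_fraction=0.5):
    """Return sites whose recorded fraction falls below a threshold."""
    flagged = []
    for site, (n_total, n_with_lab) in summaries.items():
        fraction = n_with_lab / n_total
        if fraction < min_fraction:
            flagged.append((site, round(fraction, 3)))
    return flagged

print(flag_low_completeness(site_summaries))  # → [('site_C', 0.044)]
```

A flag like this does not prove the site has a problem; it simply tells the study team where to look first, which mirrors how summary-level checks direct attention before any row-level investigation.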
In one striking example, the team noticed significant gaps in the recording of a key kidney function test, serum cystatin C, which is crucial for understanding how well a patient’s kidneys are working. Because this test wasn’t available consistently across hospitals, the researchers had to adjust their plans and use another measure, serum creatinine, which was more widely reported but less precise.
This round of testing also uncovered problems at two institutions with identifying patient visits to nephrologists. Because the issue was caught early, it could be remediated and those institutions’ data included in the study. Had it been identified later, or without this systematic process, the data from these institutions would have risked introducing substantial bias into the results.
The second round of testing focused on row-level data, which includes individual patient records. This deeper dive revealed subtler issues, such as anomalies in how frequently certain tests were performed or whether important diagnoses were missing entirely. For example, one hospital recorded abnormally high rates of specific lab results, likely due to technical errors in how the data was extracted from their system.
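The row-level stage described above can be sketched in a few lines: checks that walk individual records looking for duplicates or implausible values, the kinds of anomalies that extraction errors tend to introduce. The records, test name and plausibility bounds below are invented for illustration, not taken from the study.

```python
# Illustrative sketch (invented data, not the study's actual checks):
# a row-level pass that flags duplicate lab rows and values outside
# an assumed plausible range.
rows = [
    {"patient": "p1", "test": "serum_creatinine", "date": "2021-03-01", "value": 0.6},
    {"patient": "p1", "test": "serum_creatinine", "date": "2021-03-01", "value": 0.6},   # duplicate
    {"patient": "p2", "test": "serum_creatinine", "date": "2021-04-10", "value": 85.0},  # implausible
]

PLAUSIBLE = {"serum_creatinine": (0.1, 15.0)}  # assumed mg/dL bounds

def row_level_checks(rows):
    """Return (row index, issue) pairs for duplicates and out-of-range values."""
    seen, issues = set(), []
    for i, r in enumerate(rows):
        key = (r["patient"], r["test"], r["date"], r["value"])
        if key in seen:
            issues.append((i, "duplicate row"))
        seen.add(key)
        lo, hi = PLAUSIBLE[r["test"]]
        if not (lo <= r["value"] <= hi):
            issues.append((i, "value out of plausible range"))
    return issues

print(row_level_checks(rows))  # → [(1, 'duplicate row'), (2, 'value out of plausible range')]
```

In practice, a flagged row is a prompt for a conversation with the source hospital’s data team, since only they can tell whether it reflects a real clinical event or an extraction artefact.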
Dr. Razzaghi and her colleagues identified and resolved over 270 data quality issues across the two rounds of testing. These ranged from missing blood pressure measurements to inconsistencies in how dialysis procedures were recorded. By working closely with each hospital’s data teams, they were able to correct many of these problems, ensuring that valuable patient data wasn’t unnecessarily excluded from, or misinterpreted by, the study.
Some of the most significant improvements were in completeness. At one participating institution, the percentage of patients with valid urine protein test results, a key indicator of potential kidney damage, rose from less than 5% to over 70% after data issues were addressed. The researchers also enhanced data accuracy: for instance, they corrected errors in how patient heights were recorded, which had affected the calculation of kidney function through a measure called estimated glomerular filtration rate. They also helped the analytics team decide how to obtain more precise data about dialysis, relying on the United States Renal Data System instead of the source EHR records, because the complex segmentation of dialysis care was inconsistently captured across institutions.
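The link between recorded height and estimated glomerular filtration rate can be illustrated with the widely used “bedside Schwartz” pediatric equation, in which the estimate depends linearly on height. The sketch below uses that standard formula with invented patient values; the article does not state which eGFR equation the study used, so this is illustrative only.

```python
# Illustrative sketch of why a height-recording error matters:
# the bedside Schwartz pediatric eGFR estimate is proportional to
# height, so a miscoded height shifts eGFR by the same factor.
# Formula: eGFR ≈ 0.413 * height_cm / serum_creatinine_mg_dL.
# Patient values below are invented.
def bedside_schwartz_egfr(height_cm, serum_creatinine_mg_dl):
    """Estimated GFR in mL/min/1.73 m^2 (bedside Schwartz)."""
    return 0.413 * height_cm / serum_creatinine_mg_dl

correct = bedside_schwartz_egfr(140.0, 0.8)   # height recorded in cm
miscoded = bedside_schwartz_egfr(1.40, 0.8)   # same height stored in metres
print(round(correct, 1), round(miscoded, 1))  # → 72.3 0.7
```

A unit mix-up like the one shown turns a healthy-looking kidney function estimate into one suggesting near-total failure, which is why correcting height records was worth the effort.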
These efforts not only improved the quality of the PRESERVE study but also highlighted broader issues in how EHR data is used for research. Dr. Razzaghi’s work is a powerful reminder that good research starts with good data. By systematically identifying and fixing data problems, the team ensured that the PRESERVE study could provide meaningful insights into how to better care for children with chronic kidney disease.
However, the benefits of their approach extend far beyond this single study. The tools and methods developed by Dr. Razzaghi’s team can be applied to other research projects, helping to improve the quality of medical studies worldwide. And by collaborating with hospital data teams, they’ve also set the stage for more accurate and reliable EHR systems in the future.