Data Quality: The Art and the Science
Medicine is often considered part science and part art. There is a huge amount of content to master, but there is an equal amount of technique regarding diagnosis and delivery of service. To succeed, care providers need to master both components. The same can be said for the problem of processing Healthcare data in bulk. Despite the existence of many standards and protocols for Healthcare data, translating and consolidating data across many sources of information in a reliable and repeatable way is a tremendous challenge. At the heart of this challenge is recognizing when quality has been compromised. The successful implementation of a data quality program within an organization, like medicine, combines a science with an art form. Here, we will run through the basic framework that is essential to a data quality initiative and then describe some of the lesser-understood processes that need to be in place in order to succeed.
The science of implementing a data quality program is relatively straightforward. There is field-level validation, which ensures that strings, dates, numbers, and lists of valid values are in good form. There are cross-field validation and cross-record validation, which check the integrity of the relationships expected within the data. There is also profiling, which considers historical changes in the distribution and volume of data and determines whether they are significant. Establishing a framework to embed these quality checks and the associated reporting is a major effort, but it is also clearly an essential part of any successful implementation involving Healthcare data.
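To make the distinction between field-level and cross-field validation concrete, here is a minimal sketch in Python. The record layout, the field names (`service_date`, `birth_date`, `gender`, `billed_amount`), and the list of valid gender codes are all assumptions made for illustration, not part of any particular standard or product.

```python
from datetime import datetime

VALID_GENDERS = {"F", "M", "U"}  # hypothetical list of valid values


def parse_date(value):
    """Return a date if value is a well-formed YYYY-MM-DD string, else None."""
    try:
        return datetime.strptime(value, "%Y-%m-%d").date()
    except (TypeError, ValueError):
        return None


def validate_claim(record):
    """Return a list of human-readable exceptions for one claim record."""
    errors = []

    # Field-level checks: each field is judged on its own.
    service_date = parse_date(record.get("service_date"))
    birth_date = parse_date(record.get("birth_date"))
    if service_date is None:
        errors.append("service_date is not a valid date")
    if birth_date is None:
        errors.append("birth_date is not a valid date")
    if record.get("gender") not in VALID_GENDERS:
        errors.append("gender is not in the list of valid values")
    amount = record.get("billed_amount")
    if not isinstance(amount, (int, float)) or amount < 0:
        errors.append("billed_amount is not a non-negative number")

    # Cross-field check: the relationship between two fields must hold.
    if service_date and birth_date and service_date < birth_date:
        errors.append("service_date precedes birth_date")

    return errors
```

A record that passes returns an empty list; anything else returns one message per exception, which is the raw material for the reporting discussed here.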
Data profiling and historical trending are also essential tools in the science of data quality management. As we go further down the path of conforming and translating our Healthcare data, there are inferences to be made. There is Provider and Member matching based on algorithms, categorizations and mappings that are logic based, and then there are the actual analytical results and insights generated from the data for application consumption.
Whether your downstream application is analytical, workflow, audit, outreach-based or something else, you will want to profile and perform historical trending of the final result of your load processes. There are so many data dependencies between and among fields and data sets that it is nearly impossible for you to anticipate them all. A small change in the relationship between, say, the place of service and the specialty of the service provider can alter your end-state results in surprising and unexpected ways.
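One way to watch a relationship such as place of service versus provider specialty is to compare its joint distribution in the current load against a historical baseline. The sketch below uses total variation distance as the comparison; that choice, the field names, and any alert threshold you might apply to the result are assumptions for illustration only.

```python
from collections import Counter


def joint_distribution(records, fields=("place_of_service", "specialty")):
    """Share of records for each combination of the given fields."""
    counts = Counter(tuple(r.get(f) for f in fields) for r in records)
    total = sum(counts.values())
    return {combo: n / total for combo, n in counts.items()} if total else {}


def distribution_shift(baseline, current):
    """Total variation distance between two distributions.

    0.0 means the two distributions are identical; 1.0 means they share
    no combinations at all.
    """
    combos = set(baseline) | set(current)
    return 0.5 * sum(abs(baseline.get(c, 0.0) - current.get(c, 0.0)) for c in combos)
```

Tracking this single number per load over time gives a compact trend line for a dependency that would otherwise be hard to anticipate.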
This is the science of data quality management. It is quite difficult to establish full coverage – nearly impossible – and that is where ‘art’ comes into play.
If we do a good job and implement a solid framework and reporting around data quality, we immediately find that there is too much information. We are flooded with endless sets of exceptions and variations.
The imperative of all of this activity is to answer the question, “Are our results valid?” Odd as it may seem, there is some likelihood that key teams or resident SMEs will decide not to use all that exception data because it is hard to separate the relevant exceptions from the irrelevant. This is a more common outcome than one might think. How do we figure out which checks are the important ones?
Simple cases are easy to understand. If the system doesn’t do outbound calls, then maybe phone number validation is not very important. If there is no email or letter generation, maybe these data components are not so critical.
In many organizations, the final quality verification is done by inspection, reviewing reports and UI screens. Inspecting the final product is not a bad thing and is prudent in most environments, but clearly, unless there is some automated validation of the overall results, such organizations are bound to learn of their data problems from their customers. This is not quite the outcome we want. The point is that many DQ implementations are centered primarily on the data as it comes in, and less on the outcomes produced.
Back to the science. The overall intake process can be broken down into three phases: Staging, Model generation, and Insight generation. We can think of our data quality analysis as post-processes to these three phases. Post-Staging, we look at the domain (field)-level quality; post-Model generation, we look at relationships, key generation, new and orphaned entities. Post-Insight generation, we check our results to see if they are correct, consistent, and in line with prior historical results.
If the ingestion process takes many hours, days, or weeks, we will not want to wait until the entire process has completed to find out that results don’t look good. The cost of re-running processes is a major consideration. Missing a deadline due to the need to re-run is a major setback.
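The idea of treating quality analysis as a post-process to each phase, and stopping before the next expensive phase starts, can be sketched as a small pipeline runner. The phase names come from the text; the `run_pipeline` function, the `QualityGateError` exception, and the shape of the check functions are hypothetical.

```python
class QualityGateError(Exception):
    """Raised when a post-phase check fails, stopping the load early."""


def run_pipeline(data, phases):
    """Run each (name, transform, checks) phase in order, checking after each.

    `checks` is a list of functions that take the phase output and return
    an error message, or None if the output looks acceptable.
    """
    for name, transform, checks in phases:
        data = transform(data)
        failures = []
        for check in checks:
            message = check(data)
            if message is not None:
                failures.append(message)
        if failures:
            # Fail here rather than after the remaining phases have run.
            raise QualityGateError(f"post-{name} checks failed: {failures}")
    return data
```

With phases named `staging`, `model`, and `insight`, a failed post-model check raises before insight generation begins, so the re-run cost is limited to the phases already completed.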
The art of data quality management is figuring out how to separate the ‘noise’ from the essential information. Instead of showing all test results from all of the validations, we need to learn how to minimize the set of tests made while maximizing the chances of seeing meaningful anomalies. Just as an effective physician would not subject patients to countless tests that may or may not be relevant to a particular condition, an effective data quality program should not present endless test results that may or may not be relevant to the critical question regarding new data introduced to the system: is it good enough to continue, or is there a problem?
We need to construct a minimum number of views into the data that together represent a critical set and determine data quality. This minimum reporting set is not static; it changes as the product changes. The key is to focus on insights, results, and, generally, the outputs of your system. The critical function of your system determines the critical set.
Validation should be based on the configuration of your customer. Data that is received and processed but not actively used should not be validated along with data that is used. There is also a need for customer-specific validation in many cases. You will want controls by product and by customer. The mechanics of adding new validation checks should be easy, and the framework should scale to accommodate large numbers of validations. The priority of each verification should be considered carefully. Too many critical checks and you miss clues that are buried in the data. Too few and you miss clues because they don’t stand out.
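A minimal sketch of controls by product and by customer might look like the following. The `Check` structure, the priority labels, and the customer configuration keys (`products`, `disabled_checks`) are all hypothetical names chosen for illustration.

```python
from dataclasses import dataclass
from typing import Callable, FrozenSet


@dataclass
class Check:
    name: str
    priority: str                # e.g. "critical" or "info" (hypothetical levels)
    products: FrozenSet[str]     # products whose output depends on this data
    run: Callable[[dict], bool]  # returns True when the record passes


def active_checks(registry, customer_config):
    """Keep only checks relevant to the products this customer actually uses."""
    enabled = set(customer_config["products"])
    disabled = set(customer_config.get("disabled_checks", ()))
    return [c for c in registry if c.products & enabled and c.name not in disabled]


def run_checks(record, checks):
    """Return the names of failed checks, grouped by priority."""
    failed = {}
    for check in checks:
        if not check.run(record):
            failed.setdefault(check.priority, []).append(check.name)
    return failed
```

Adding a validation is then a matter of appending one `Check` to the registry, and a customer that does no outreach never sees phone number exceptions in its reports.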
Profiling your own validation data is also key. You should know, historically, how many errors of each type you typically encounter, and flag statistically significant variation just as you would for essential data elements and entities. Architecture is important. You will want the ability to profile and report on ‘anything’, which implies a common, centralized approach rather than a different approach for each area you want to profile.
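As a rough illustration of profiling the validation results themselves, the sketch below compares the failure count of each check in the current load against its own history using a z-score. The z-score and the threshold of 3.0 are assumptions standing in for whatever significance test suits your data; the function and parameter names are hypothetical.

```python
from statistics import mean, stdev


def flag_unusual_counts(history, current, threshold=3.0):
    """Flag check names whose current failure count deviates from history.

    `history` maps a check name to failure counts from prior loads;
    `current` maps a check name to the count from the load under review.
    """
    flagged = {}
    for name, count in current.items():
        past = history.get(name, [])
        if len(past) < 2:
            continue  # not enough history to estimate variation
        mu, sigma = mean(past), stdev(past)
        if sigma == 0:
            # History never varied, so any change at all is notable.
            if count != mu:
                flagged[name] = float("inf")
            continue
        z = (count - mu) / sigma
        if abs(z) >= threshold:
            flagged[name] = z
    return flagged
```

Note that the test is two-sided: a sudden drop in exceptions is flagged as well as a spike, since a drop can mean that a check or a data feed has quietly stopped working.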
Embedding critical validations as early in the ingestion process as possible is essential. It is often possible to provide validations that emulate downstream processing. The quality team should be incentivized to pursue these types of checks on an ongoing basis. They are not obvious and are never complete, but are part of any healthy data quality initiative.
A continuous improvement program should be in place to monitor and tune the overall process. Unless the system is static, codes change, dependencies change, and data inputs change. There will be challenges, and with every exposed gap found late in the process there is an opportunity to improve.
This post has glossed over a large amount of material, and I have oversimplified much in order to convey some of the not-so-obvious learnings of the craft. Quality is a big topic and organizations should treat it as such. Getting true value is indeed an art, as it is easy to invest and not get the intended return. This is not a project with a beginning and an end but an ongoing process. Just as with the practice of medicine, there is a lot to learn in terms of the science of constructing the proper machinery, but there is an art to establishing active policies and priorities that deliver results effectively.
This article was originally published on SCIO HealthAnalytics and is republished here with permission.