
Data quality is a very broad term, and while it affects us all to some degree on a day-to-day basis, we may not necessarily have a concrete understanding of what it means or why it's important. Even though I work with data every day, I would scarcely consider myself an expert in this field; writing this blog is as much for my own learning as it is for anyone who reads it.
With this post I intend to demystify the notion of what data quality is and why it is relevant for organisations. It is something that we as analysts should always keep at the forefront of our minds when working with data, and hopefully this blog will provide some quantifiable metrics against which to check data sources and assess their quality.
University of Skating and other Action Sports
My aim here is to first gain a solid grasp of the what, as a way of setting the scope for the how; i.e. using tools such as Tableau and Alteryx to make maintaining and reporting on data quality easier. Some might find this topic a bit dry, so I'll try to keep jargon to a minimum and treat this post like a case study. For the purposes of this post, I am the head of the fictional University of Skating and other Action Sports, the world's only dedicated inline skating university, and from this point on I will relate all my examples back to the University to make them a bit more relatable.
As this is something I'm still continually learning about, I did a bit of research to help with writing this post, and I'll include links to those articles at the end.
How do we measure it?
As you may have guessed, data quality refers to the quality of your data. But to be a bit more specific about it: what criteria could I, as a key stakeholder at my institution, use to measure the quality of our data? (After the list below I've included a short sketch of what some of these checks might look like in code.)
- Accuracy - How accurately does the data reflect reality, or 'the truth'? This seems like a simple enough question, but it is probably one of the most critical measures to get right. A simple example here would be admissions: according to my data, 1,000 students are due to start at the University in the coming academic year, and it also tells me that I have exactly 1,000 dorm rooms available to accommodate them. If either of these figures is inaccurate and does not reflect the truth, there would be serious implications: either dorms sitting empty or students with no place to live!
- Completeness - Does our given data source contain all of the elements that it should, or are there gaps? This is a relatively simple one to get your head around: imagine you were expecting four years' worth of academic history from a student's time at the University, but only two are present. This information could not be considered complete.
- Timeliness - How accurate is the data in relation to a given point in time? To explain this with an analogy: if I want to know how many students are currently studying at the University midway through the academic year, but the only count I have is from the end of the previous academic year, then this data is not timely. It would have been accurate at the time the count was taken, but it is not necessarily relevant to the question I'm asking now.
- Consistency - This looks at whether the data we hold is consistent, whether across multiple columns, tables or even databases. Essentially, what we don't want is two bits of information that directly conflict with each other. A simple example would be the start date and graduation date of a student. Depending on the course taken, we would expect the graduation date to be three or four years after the start date. If we have a graduation date that occurs before the start date, then this would be considered inconsistent.
- Validity - This looks at whether the data we hold is in accordance with the format we expect. On paper this is another pretty straightforward one: data can be considered valid if a given attribute, for example the length of a string, conforms to a specified rule. An example from my University would be the registration process, when I need students to submit various bits of information about themselves for our records. If a student somehow entered letters in a contact number field, or entered their date of birth in the wrong format, then at worst these fields would become unusable, and at best we would need to expend time and effort correcting them ourselves.
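To make these measures a bit more concrete, here is a minimal sketch of what some of these checks might look like in practice. I'm using Python with pandas purely for illustration; the table, the column names (student_id, start_date, graduation_date, contact_number, last_updated) and the thresholds are all hypothetical, and the same rules could just as easily be built as an Alteryx workflow or surfaced in a Tableau dashboard.

```python
import pandas as pd

# Hypothetical student records -- the columns and values are invented
# purely to illustrate the five measures discussed above.
students = pd.DataFrame({
    "student_id":      ["S001", "S002", "S003"],
    "start_date":      ["2018-09-01", "2019-09-01", "2021-09-01"],
    "graduation_date": ["2022-06-30", "2018-06-30", None],
    "contact_number":  ["0123456789", "01234abc89", "0123456780"],
    "last_updated":    ["2022-01-15", "2021-06-30", "2022-01-15"],
})
for col in ("start_date", "graduation_date", "last_updated"):
    students[col] = pd.to_datetime(students[col])

# Accuracy: reconcile a headline figure against a second source of truth.
expected_new_students = 1000   # figure from the admissions system
dorms_available = 1000         # figure from the accommodation system
assert expected_new_students <= dorms_available, "Not enough dorms!"

# Completeness: flag records missing a value we expect to be present.
missing_grad = students["graduation_date"].isna()
print("Incomplete:", students.loc[missing_grad, "student_id"].tolist())

# Timeliness: flag records not refreshed since the academic year began.
stale = students["last_updated"] < pd.Timestamp("2021-09-01")
print("Stale:", students.loc[stale, "student_id"].tolist())

# Consistency: a graduation date must come after the start date.
inconsistent = students["graduation_date"] < students["start_date"]
print("Inconsistent:", students.loc[inconsistent, "student_id"].tolist())

# Validity: contact numbers must be exactly ten digits.
invalid = ~students["contact_number"].str.fullmatch(r"\d{10}", na=False)
print("Invalid numbers:", students.loc[invalid, "student_id"].tolist())
```

In a real pipeline these checks would run on a schedule and feed a report or dashboard rather than print statements, but the underlying rules map directly onto the five measures described above.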
Why is it important?
Now that we've established some of the measures we can use to assess the quality of our data, and elaborated on them slightly with some examples, it's still worth posing the question: 'Why is this important to my organisation?'
Following the example of my University again, data is incredibly important to its day-to-day workings as an institution. There is an inherently complex organisational structure, with many students potentially studying many courses and a large number of records to be maintained. Simply put, with this level of information being held, data quality is essential; otherwise even seemingly simple questions like 'how many students are enrolled?' become very tough to answer accurately.
There are also legal implications to consider. As an institution operating within the European Economic Area, my University would be bound by GDPR (the General Data Protection Regulation), so ensuring that the data held about students is correct and fit for purpose is key. As a Higher Education institution we are also required to provide a statutory return to HESA, which again requires that the data submitted is as accurate as possible. The general point here is that there are factors external to the organisation that require us to ensure the quality of our data, and failing to meet these expectations would not only damage our reputation but could also lead to hefty fines.
This next point is more of an ethical one, but something I still consider important. Given that we are handling data relating to actual people, we have a moral duty to ensure it is correct, because failing to do so could detrimentally impact their lives.
And then finally there is the key concept of 'rubbish in, rubbish out'. By this I mean that if the quality of the data within our systems is poor, then any reporting or modelling built on top of it cannot be considered trustworthy. This is a key point, because for my institution to effectively make informed decisions based on data, it is critical to ensure the quality of the information being used. Otherwise any decisions made are effectively worse than blind guesses.
Building processes to achieve this
So, after delving into some of the criteria we can use to measure data quality and the reasons why doing so is important, the final question to ask is how to build the tools and processes that make it possible to monitor our data and strive toward the standards outlined above.
As I mentioned at the start, this is a potentially massive topic and something I'm still continually learning about, so the information I've provided here only scratches the surface of data quality. But now that there is a clearer definition of our aims, my subsequent blog posts will deal with building out examples of how to use technologies such as Tableau and Alteryx to strive toward better data quality, as well as some of the wider thinking around best practices in this area.
Thanks for taking the time to read this post; I hope it was useful. Below are some links to articles I found helpful when beginning to research this topic. The first in particular goes into the subject in quite a bit of detail.
https://www.toptal.com/database/data-warehouse-data-quality-process
https://searchdatamanagement.techtarget.com/definition/data-quality