Taylor Rodgers
Why Data Quality Issues Happen and How to Fix Them

6/23/2019

Good data quality is the foundation of your data solution’s success. It doesn’t matter if you have a great personality, build beautiful dashboards, or present engaging analysis – if your stakeholders stop trusting your data, they’ll stop trusting you.
Most managers and executives who oversee a data team know the importance of data quality, but many feel like their teams don’t tackle the issues correctly. That’s especially true of newer executives without much experience in data.

Many people would say being detail-oriented is how you handle those issues, but that’s not necessarily true. Being detail-oriented is a requirement, but building a business ecosystem that empowers detail-oriented people to fix problems at the right time is even more important.

That’s why I have always argued that data quality is ultimately the responsibility of the leaders who oversee data teams, and that data quality is ultimately a reflection of the quality of management. I’ve been interviewing experienced leaders in the business intelligence space for a book I’m writing, and most have said something similar (a few have disagreed with me).

Why Do Data Quality Issues Happen?

Data quality issues happen because the organizations and programs that support data are complex. They require hundreds, if not thousands, of steps that work together to produce data. Each of those steps is performed by either a human or a computer program.

The success rate of a computer program is essentially 100%. A SQL select statement or a custom calculation always runs the way it was written. When it breaks, it’s usually the result of a human input.

That leaves humans as the primary culprit behind data quality issues. That’s understandable, since humans are not perfect. The more human steps you add to a process, the more quality issues you’ll have. Even if your humans succeed 99% of the time at each step, that is not enough to prevent data quality issues from becoming widespread.

There’s an equation that I love that illustrates why this happens:
P(all steps succeed) = (per-step success rate) ^ (number of steps)

If you have 200 individual, independent human steps and each succeeds 99% of the time, the chance that all steps will succeed is 13.39%.
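
Here’s a quick sketch of that math in Python, if you want to play with the numbers yourself (it assumes every step is independent and shares the same success rate, as in the example above):

```python
# Chance that an entire process succeeds, assuming each step is
# independent and has the same per-step success rate.
def overall_success_rate(per_step_rate: float, num_steps: int) -> float:
    return per_step_rate ** num_steps

for steps in (10, 50, 100, 200):
    rate = overall_success_rate(0.99, steps)
    print(f"{steps} steps at 99% each -> {rate:.2%} chance of zero errors")

# At 200 steps, the chance of zero errors comes out to roughly 13.4%.
```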

I can tell you right now that most BI solutions have far more human steps than 200.

How to Stop Data Quality Issues: Focus on Prevention, Not Treatment

When I think of data quality, I think of it like bodily health. And bodily health doesn’t come from treatment or surgery; it comes from habits.

Picture two different people. The first one we’ll call Jane. Jane went jogging three days a week. She cut out soda and other sugary drinks and snacks from her diet. She focused on eating more vegetables and less red meat. She didn’t smoke. She enjoyed the occasional glass of wine with friends, but didn’t drink much either.

The other person did not take that approach. We’ll call him Steve. Steve didn’t go jogging and seldom walked, except to and from his car after work. He ate some vegetables, but mostly ate burgers, potato chips, and ding dongs for dessert. He drank often with friends and enjoyed a glass of scotch most nights. He never exercised and smoked regularly.

Every once in a while, Steve decided to “be healthy”: he switched to an extreme diet and started lifting heavy weights at the gym. These new habits never lasted long, and he reverted to his old habits of smoking, eating burgers, and ding dongs.

As you can imagine, Jane would live a long, healthy life with few health complications in her old age. Steve would likely need many costly, corrective surgeries to deal with diabetes, cancer, or other preventable diseases. They may live to the same age, but Jane’s life would be far less stressful and far more enjoyable.

The data teams with the best, long-term success are like Jane. They have good habits that produce that success. Habits such as requirements building, quality checking, automation, and keeping things clean and organized.

The data teams with the worst, long-term success are like Steve. They don’t do things such as requirements building, quality checking, and keeping things organized. They often focus on short-term wins, which involve excessive hacks and unscalable fixes. They won’t QA extensively, or if they do, they don’t make it a consistent process. They overemphasize a new data strategy as the solution to their problems, when what they really need is to improve their operations.

I don’t have data to back this up, which I admit is ironic considering my background, but I’d estimate that building a good set of habits on your team will take care of 80% of the data quality issues out there. Building that foundation first will make it far easier to fix the rest. These habits are the vegetables and jogging of your data solution.

The Best Habits to Adopt to Improve Data Quality

What are these habits? There are many you can focus on, but I’d say the most important are:
  1. Requirements gathering
  2. Quality checking
  3. Over-communication
  4. Making the output as simple as possible
  5. Reducing human inputs

Good requirements gathering improves your data quality because it forces your team to communicate with stakeholders to clearly define the outcome you’re working towards, which makes it easier for everyone involved to know what contributions they need to make. Requirements gathering is also beneficial because, when done right, it reduces needless complexity in projects. It narrows the scope of a project to what’s important. To see how to do requirements gathering, see this article.

Quality checking has obvious benefits to data quality. Quality checking alone isn’t sufficient, though, since thorough requirements help guide how quality should be defined. Still, it’s amazing how often people skip this step. If I oversaw a whole department or organization with multiple data teams, I would seriously question whether someone was fit to lead a data team if they didn’t enforce this simple process.

Communication is also important to supporting a BI solution's quality, especially after it launches. It takes a village to produce a BI solution. The developers and analysts usually work closely with subject matter experts to produce the output. Sometimes that means the stakeholders have to change their own internal processes. Making sure you communicate the expectations you have of them – both during and after the project – is important. I’ve seen whole data solutions fail because stakeholders didn’t know how important their contribution was after the solution's launch. No one ever told them!

Reducing complexity in the solution itself is an important habit, and one you have to coach newer employees on before small issues become bigger ones. There’s an admirable urge to learn every cool thing a BI tool can do and use it, whether the project needs it or not. That’s beneficial to their learning, because they take risks and pick up new tricks. However, it’s important to make sure all the unnecessary stuff is removed by the end. That makes solutions easier to troubleshoot, because less time is spent reverse engineering and more is spent fixing.

Building off the concept of reducing complexity, it’s important to reduce the number of human inputs. This goes back to the equation I mentioned earlier, where 200 steps can lead to a 13% success rate. Building solutions that require human inputs means those inputs often exist outside a QA process, which degrades quality over the long run. Some developers get a little too eager to offer manual inputs into Google Sheets as a solution to every problem.

Sometimes there’s no way around it, but I usually speak with the stakeholder and let them know about the added risk to quality and the extra time it will take them to maintain those Google Sheets. Sometimes the benefits they gain are worth the work, but more often than not, they decide against it once I make them aware of the costs.

Why Managers Must Take an Active Role in Prevention

There are two big reasons a manager needs to take an active role in prevention. The first is their authority to institute better processes.

Organizational habits, otherwise known as processes, do not magically happen. Those that do “just happen” are not always beneficial. Sometimes employees develop their own, which leads to frustration when other employees don’t follow them. The kind of habits that move an organization towards quality improvement require a proactive leader who can sell the process and explain how it improves things.

Every once in a while, a manager may be more focused on strategy than processes. Strategy is sexier and generally makes it look like we’re accomplishing more. This emphasis on strategy means they might delegate process design to one of their team members. That’s not a terrible approach, especially if they’re unfamiliar with BI development themselves.

However, it’s still the manager’s responsibility to either sell the new process on behalf of their analyst or make it clear that he or she supports the analyst’s proposal. The manager has the authority, and that’s what’s required for a process to be adopted by employees.

The second big reason a manager needs to take an active role is project management scheduling. Telling people “make sure you have someone else QA your work” doesn’t guarantee it happens.

The trouble isn't so much that your team doesn’t believe in QA’ing. I don’t think I’ve ever met a good BI developer who doesn’t believe in QA. Many cringe at the idea of a mistake they made being spotted after deployment and all have a bad memory associated with that experience. I know I've had plenty!

The trouble is time allocation. Developers naturally want to be seen building things, particularly their own projects that have a due date. This means their schedules quickly fill up with their own work. Whenever their project is ready for QA’ing, it can feel like they’re placing a burden on another employee, asking them to take time out of their day to QA.

Some resort to self-QA’ing. That’s admirable, but it doesn’t work that well. I think I’ve only met one developer who did it well, and he was used to working as a one-person team. The problem with self-QA’ing is that it’s a lot like editing your own essay in college the day before it’s due. If you had weeks between finishing your project and its due date, self-QA’ing might work, because the work would seem fresh again. Little mistakes you once missed would now be glaring.

That’s not the case for BI development. You often build it and need to release it the day after. It’s harder to see the little things that would be obvious to the stakeholder. For that reason, you need someone else to QA your work.

It’s critical for managers to make clear to project management and their developers that QA’ing is not a lesser job responsibility. It’s a primary job responsibility!

How You Can Identify Data Quality Issues

QA’ing is the best way to identify data quality issues. There are some instances, though, where you should build a separate report to identify widespread issues with your data. One example might be that you only recently implemented proper QA’ing procedures and you know many past database builds were not done correctly. Or you may keep having partial data load failures that don’t trigger error messages.

All those reasons mean you need to proactively identify data quality issues. That’s a fun challenge for a data scientist or a savvy analyst. The sad reality is that it takes a lot of skill to do this. I’ve discovered over the years general trends with marketing data that help find these issues, which I detail here, but it’s a lot harder to write a broad prescription you can apply to any database. (I feel like the person who built a system that did that would be one rich fella.)

If you have a background in statistics or feel comfortable with concepts like standard deviation, there are some broad tips I can give you.

The first tip is to understand the cause and effect of your data flow. If you know A causes B, then wherever you see B you should also see evidence of A (and vice versa).

In digital marketing, we know that a click on a display ad or a paid search ad leads to a session on a website. So I typically look for a massive discrepancy, such as 5,000 clicks but only 1,000 sessions. That indicates something is wrong in the data ecosystem.
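
Here’s a rough sketch of what that check could look like in Python (the column names and figures are made up for illustration):

```python
import pandas as pd

# Hypothetical daily marketing data: paid ad clicks and the website
# sessions those clicks should produce.
df = pd.DataFrame({
    "date": pd.date_range("2019-06-01", periods=5),
    "clicks": [480, 510, 5000, 495, 520],
    "sessions": [465, 500, 1000, 480, 510],
})

# If a click causes a session, the two should track each other closely.
# Flag any day where sessions diverge from clicks by more than 20%.
df["ratio"] = df["sessions"] / df["clicks"]
suspect = df[(df["ratio"] < 0.8) | (df["ratio"] > 1.2)]
print(suspect)
```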

The second tip is to use standard deviation to identify outliers. If you know that an average of 100 new rows is added to a table each day, and that two standard deviations from the mean is 10 rows, then anything below 90 or above 110 is a cause for concern.
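Here’s a minimal sketch of that kind of check (the row counts are made up; in practice you’d estimate the mean and standard deviation from a known-good baseline period, so the outliers themselves don’t inflate the estimates):

```python
import pandas as pd

# Hypothetical daily row counts: a known-good baseline period,
# plus the counts observed this week.
baseline = pd.Series([98, 102, 100, 97, 101, 103, 99, 100, 96, 104])
this_week = pd.Series([101, 55, 99, 160, 100])

mean = baseline.mean()
sd = baseline.std()
lower, upper = mean - 2 * sd, mean + 2 * sd  # covers ~95% of normal days

# Flag any day whose row count falls outside mean +/- 2 standard deviations.
outliers = this_week[(this_week < lower) | (this_week > upper)]
print(f"Expected range: {lower:.0f} to {upper:.0f} rows per day")
print(outliers)
```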

Standard deviation has a wide range of applications within data quality (and also for identifying anomalies in the behavior the data is tracking).

There are many other statistical tests, some quite advanced, that can be built specifically for your data warehouse. If you have someone with a statistics background or education on your team (or you do yourself), I’d suggest starting with calculating normal distributions and outliers and working your way up to more complicated tests.

Last Thoughts

Data quality issues can make us feel insecure about our own performance. I no longer have those insecurities myself. We’re all human. No one has a 100% success rate (I like to think I get close, though). The most talented individuals, and the best managers, learn to accept that they’re human and build safeguards to combat it.
