The Importance of Cleaning Your Data

Both of these images hit close to home for me. If they hit you too, then I bet you love puzzles too.

This article is part rant and part good advice about why you need to have clean data. This topic has been a pain in my posterior for the last 7 months, but I’m starting to see the light at the end of the tunnel. Now that I hear it out loud, I probably shouldn’t mix posterior and tunnel metaphors. Hashtag where_the_sun_don’t_shine.

I would wager that most readers of this article don’t think that data cleanliness applies to them. They may believe that data is someone else’s domain, and it’s best to stay far away lest you anger the analysts. Well, I’m going to change their minds. The use of data is something that impacts everyone, even if you only open a spreadsheet once in a blue moon. Cleaning data is how we ensure that impact is a positive one, and not a cluster of extreme proportions.

Oh good, you’re still reading the article. I appreciate you giving the topic a chance. This may sound like something that only the nerds have to deal with, not the cool kids with their company cars and per diems. Let me ask you this: do you always know when an analysis is wrong? I don’t mean performed incorrectly, but that the outputs don’t make sense given what you know about the topic. How about analyses that look correct, but are still wrong because they were based on bad data? You have to be able to trust your information if you’re going to draw any conclusions or sell your ideas, so having clean data is a must.

We’re long past the days when data sets were small enough that we could usually tell when there were problems with the inputs. The sheer number of data sources and formats makes it extremely difficult to know if the outputs are right or wrong, because they will probably look fine to all but the most pedantic users.

I guarantee at least a quarter of U.S. companies are making strategic decisions based on bad or incomplete data. If you asked them though, they would swear up and down that their data is fine and demand to know how you got into the building. The question is, are you sure your company is not part of that 25%?

Now that you’re sufficiently rattled, I can reassure you that there is a way to feel safe and secure again. Ensure your data is clean, and you’ll have a strong foundation for any analysis, strategic project, or acquisition decision you have to make. Of course you’ll also want a competent team of people crunching your numbers, but that’s a topic for another article.

Cleaning data is simply the process by which missing, duplicate, poorly formatted, or just plain wrong data is removed or replaced with data that can play nice with others. Don’t fret, I’m not planning to walk you through all of the formulas. I’ll cover the main topics so you get a feel for the process, and then you’ll understand why the analysts care so much about tracking things and getting all the little details right.
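No formulas, as promised, but for anyone curious what the analysts are actually hunting for, here is a rough Python sketch of that definition. Everything in it is an assumption made up for illustration: the sample records, the field names, and the `profile` helper. It only counts the three usual suspects (missing, duplicate, and badly formatted values); real cleanup is a lot messier.

```python
from collections import Counter

# Hypothetical customer records, invented for illustration only.
records = [
    {"id": 1, "email": "ann@example.com", "zip": "30301"},
    {"id": 2, "email": "", "zip": "30302"},                 # missing email
    {"id": 3, "email": "ann@example.com", "zip": "30301"},  # duplicate of id 1
    {"id": 4, "email": "bob@example.com", "zip": "3O3O3"},  # letter O, not zero
]


def profile(rows):
    """Count missing emails, duplicate emails, and badly formatted ZIP codes."""
    # A blank or whitespace-only email counts as missing.
    missing = sum(1 for r in rows if not r["email"].strip())

    # Every repeat of an email after its first appearance counts as a duplicate.
    email_counts = Counter(r["email"].lower() for r in rows if r["email"].strip())
    duplicates = sum(n - 1 for n in email_counts.values() if n > 1)

    # Assumed rule: a ZIP code is exactly five digits.
    bad_zip = sum(1 for r in rows if not (len(r["zip"]) == 5 and r["zip"].isdigit()))

    return {"missing_email": missing, "duplicate_email": duplicates, "bad_zip": bad_zip}


report = profile(records)
print(report)
```

A report like this doesn’t fix anything by itself. It just tells you how big the mess is before someone has to decide what to remove and what to replace.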

  1. Know where all of your data is and who owns it. Sounds like something obvious, but I guarantee this is the step most people skip. It may sound easier to just delegate someone to figure things out, but that’s going to take significantly more time than ensuring that all of the teams generating or acquiring data are doing it the right way. It doesn’t have to be complicated. Even having a list of sources and content can help connect the dots between functions.
  2. Align metric & field names between your data sets. Quick, how many different titles can you come up with to name a column of sales data? 3, 10, 50? The correct title is going to depend on what you’re trying to measure. Two of your teams might each have a dollar sales metric, but one team may be responsible for buying raw materials, while another sells the finished outputs to customers. You don’t want to combine those identically named dollar sales metrics later, so make things easier on yourself and aim for descriptive metric names.
  3. Whenever possible, have a comparison for your data to test its accuracy. You can be reasonably sure your data is correct if you are able to match it to another source. Do the sales amounts reported by Accounting match the commission spreadsheet that Sales manages? Does the email address that Marketing is using match the email that was submitted online? Go back to that data’s source and verify, at least once in a while.
  4. Make sure your data is formatted the way it’s supposed to look. Phone numbers should look like phone numbers, and addresses need valid zip or postal codes. Set the rules for how that particular data point should look, and if you can, use technology to enforce those rules. You can’t put letters in a phone number if the software won’t let you. (O’s are not zeroes, people. I don’t care if they look alike.)
  5. Reward users for filling in every field, and/or discourage them from skipping fields or data points along the way. This is one of the hardest data issues to reconcile: how do you know what to enter if the data just doesn’t exist? The very minor amount of time your user saves by skipping step 3 out of 7 means that there’s a gap that someone else will spend much more time trying to fill. Take it from a guy who has been having conversations about this with salespeople for a couple of decades now. It’s significantly more painful to go back to fill in gaps than to take the few extra minutes up front to do things right.
  6. Once you’re feeling good about your existing data, it’s a great time to set up some bouncers on the data flowing in. Sanitizing inputs is the easiest way to limit how far astray your data might go. Everyone has heard the “garbage in, garbage out” phrase, but most non-data people still happily put garbage in unless someone stops them. Get your bouncers (software) to keep the riffraff out.
  7. Make it a habit within your company to check and clean your data on a regular basis. You’re probably thinking “I don’t have anything to do with that”, but all it takes is for enough people to ask questions and the ball will start rolling. Especially if you remind people that bonuses are based on your results, which come from the data in your ecosystem. And stop any conversations about making the data look a little more favorable. The NSA agents listening on your phone line don’t have a sense of humor.
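To make a couple of those steps less abstract, here is a hedged Python sketch of what a software “bouncer” (steps 4 through 6) and a source comparison (step 3) might look like. The rules are assumptions, not a standard: 10-digit US phone numbers, 5-digit ZIP codes, three required fields, and a one-cent tolerance when comparing amounts. The `bounce` and `reconcile` helpers and all of the sample data are hypothetical.

```python
import re

# Assumed format rules: 10-digit US phone numbers and 5-digit ZIP codes.
PHONE_RE = re.compile(r"\d{10}")
ZIP_RE = re.compile(r"\d{5}")
REQUIRED = ("name", "phone", "zip")  # hypothetical required fields


def bounce(record):
    """Return the reasons to turn a record away (an empty list means let it in).

    Assumes every value in the record is a string.
    """
    problems = [f"missing {field}" for field in REQUIRED
                if not record.get(field, "").strip()]

    # Strip common separators, then insist on digits only (an O is not a zero).
    phone = re.sub(r"[\s().-]", "", record.get("phone", ""))
    if phone and not PHONE_RE.fullmatch(phone):
        problems.append("phone is not 10 digits")

    zip_code = record.get("zip", "").strip()
    if zip_code and not ZIP_RE.fullmatch(zip_code):
        problems.append("zip is not 5 digits")
    return problems


def reconcile(source_a, source_b, tolerance=0.01):
    """Compare two {key: amount} sources; list keys that are missing or disagree."""
    mismatches = {}
    for key in sorted(set(source_a) | set(source_b)):
        a, b = source_a.get(key), source_b.get(key)
        if a is None or b is None or abs(a - b) > tolerance:
            mismatches[key] = (a, b)
    return mismatches


good = {"name": "Ann", "phone": "(404) 555-0100", "zip": "30301"}
bad = {"name": "Bob", "phone": "4O4-555-0199", "zip": ""}  # letter O, blank zip
print(bounce(good))
print(bounce(bad))

# Made-up invoice totals: what Accounting reports vs. the Sales commission sheet.
accounting = {"INV-1": 1200.00, "INV-2": 850.50, "INV-3": 99.00}
commissions = {"INV-1": 1200.00, "INV-2": 805.50}
print(reconcile(accounting, commissions))
```

The point of returning reasons instead of a plain yes or no is that the person entering the data finds out what to fix on the spot, rather than an analyst discovering the gap months later.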

Even if you never go near data, you can still help out by cutting the data cleansing team some slack. Remember the article on meta-work? Data cleansing definitely falls in the meta-work realm. It’s very time-consuming work that happens in the background. It may appear as though the analysts aren’t producing anything, but they’re actually removing the potential to produce the wrong things. I know my team is champing at the bit for slick new dashboards, but I’ve had to spend most of my first few months on the job cleaning up the data sources so those dashboards would actually be useful.

My last request for everyone reading this is to think about the data you and your teams produce as fuel for someone else. That survey you half-ass is going to end up on some overworked analyst’s plate to clean up before they can have dinner with their kids. The fields you skip in your CRM system become someone else’s headache as they try to figure out where the opportunity areas are. Garbage in doesn’t just refer to incorrect information. The cleaner you can make the fuel (data), the better the fancy new analytics engines will run, and the better your outputs will be. And a happy analyst is an analyst more inclined to help you fix your Excel file “one last time I swear”.

-Philip

--

Philip White (not that one, the other one)

Written by Philip White (not that one, the other one)

Don't believe this photo, I'm way less handsome in person. And if you like my writing, let me know by sending me the word "plethora". It'll mean a lot to me.
