Date: 2022-05-23 16:53
Read: 5 min

Exploring the Data: A quick guide for MLEs

Why we explore

Data exploration generally begins with setting expectations, and these differ from person to person. In line with data-centric AI (DCAI) recommendations, you might be trying to identify issues with your original data set, such as annotation errors, missing values, and unexpected outliers. Someone else might explore their data to prove or disprove a hypothesis, to test an existing model, or to carve out a specific slice of existing data for some other purpose.

Let’s take a look at an example. Suppose you have a data set describing a group of students: their height, weight, gender, exam performance, etc. At face value, everything seems fine. When you start exploring the data, though, one thing just doesn’t feel right: the height range does not resemble anything you’d expect to see in reality. You then pinpoint the weirdest-looking values and find out that some of them were provided in feet and inches, while others were in centimeters. Thanks to your inquisitiveness and data exploration techniques, you avoided feeding wrong data to your model.
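To make the height example concrete, here is a minimal sketch of a range check. The student ids, the height values, and the plausible range are all made up for illustration; a real project would pick thresholds that fit its own population.

```python
from statistics import median

# Hypothetical student heights. Most are in centimeters, but a few were
# entered in feet (5.6 is meant as roughly five and a half feet).
heights = {
    "s01": 172.0,
    "s02": 165.5,
    "s03": 5.6,
    "s04": 180.2,
    "s05": 5.9,
    "s06": 158.0,
}

# Assumed plausible range for a height expressed in centimeters.
PLAUSIBLE_CM = (100.0, 230.0)


def flag_suspect_heights(values, plausible=PLAUSIBLE_CM):
    """Return the ids whose value falls outside the plausible range."""
    low, high = plausible
    return sorted(key for key, value in values.items() if not low <= value <= high)


suspects = flag_suspect_heights(heights)

# A quick look at the spread is often enough to notice that something is off.
print("min:", min(heights.values()))
print("median:", median(heights.values()))
print("max:", max(heights.values()))
print("suspect rows:", suspects)
```

The check only flags rows for a closer look. Whether a flagged value should be converted, corrected, or dropped is still a human decision.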

For another example, let’s look at image-based projects. In ML projects that rely on large image sets, exploring can help you reduce the size of the data set by making it easy to remove duplicates, images that look too similar to one another, and images that are too dark or too bright. This should help you save time (for example, time spent labeling the wrong images or answering a host of questions) and valuable resources (such as having to re-train the model).
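The image clean-up described above could be sketched roughly as follows. This is an illustration under simplifying assumptions: the "images" are tiny hypothetical grayscale pixel buffers, the file names and brightness thresholds are invented, and only exact duplicates are caught. Finding images that merely look similar would need something like a perceptual hash, which is not shown here.

```python
import hashlib
from statistics import mean

# Hypothetical grayscale images (pixel values 0-255), already decoded into
# flat byte strings. Real images would be loaded with an imaging library.
images = {
    "a.png": bytes([120, 130, 125, 128]),
    "b.png": bytes([120, 130, 125, 128]),  # exact copy of a.png
    "c.png": bytes([3, 5, 2, 4]),          # very dark
    "d.png": bytes([250, 252, 251, 255]),  # very bright
    "e.png": bytes([60, 200, 90, 140]),
}

DARK_BELOW = 20     # assumed threshold on mean brightness
BRIGHT_ABOVE = 235  # assumed threshold on mean brightness


def triage(imgs, dark_below=DARK_BELOW, bright_above=BRIGHT_ABOVE):
    """Sort image names into duplicate / too_dark / too_bright / keep."""
    seen_digests = set()
    report = {"duplicate": [], "too_dark": [], "too_bright": [], "keep": []}
    for name in sorted(imgs):
        pixels = imgs[name]
        digest = hashlib.sha256(pixels).hexdigest()
        brightness = mean(pixels)
        if digest in seen_digests:
            report["duplicate"].append(name)
        elif brightness < dark_below:
            report["too_dark"].append(name)
        elif brightness > bright_above:
            report["too_bright"].append(name)
        else:
            report["keep"].append(name)
        seen_digests.add(digest)
    return report


report = triage(images)
for bucket, names in report.items():
    print(bucket, names)
```

Only the `keep` bucket would go on to labeling; the other buckets are candidates for removal or a quick manual check.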

How and when to explore the data

The basic explore loop looks like this:

  1. Run some process

  2. Take some action

Typical explore processes include seeing or hearing the data firsthand, or using statistical methods like univariate and bivariate analysis to understand the data distribution and to spot outliers, missing values, collinearity, etc.
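As a rough illustration of the statistical side, the sketch below runs a missing-value count and an interquartile-range (IQR) outlier check on single columns (univariate), then a Pearson correlation between two columns (bivariate). The rows, column names, and the 1.5 x IQR rule are assumptions chosen for the example, not a prescribed method.

```python
from math import sqrt
from statistics import mean, quantiles

# Hypothetical rows; None marks a missing value.
rows = [
    {"height_cm": 172.0, "weight_kg": 68.0},
    {"height_cm": 165.0, "weight_kg": 59.0},
    {"height_cm": None, "weight_kg": 71.0},
    {"height_cm": 180.0, "weight_kg": 80.0},
    {"height_cm": 158.0, "weight_kg": 52.0},
    {"height_cm": 175.0, "weight_kg": 250.0},  # suspicious weight
    {"height_cm": 169.0, "weight_kg": 64.0},
]


def missing_count(data, name):
    """Univariate: how many rows lack a value for this column."""
    return sum(1 for row in data if row[name] is None)


def iqr_outliers(values, k=1.5):
    """Univariate: values further than k * IQR outside the quartiles."""
    q1, _, q3 = quantiles(values, n=4)
    spread = q3 - q1
    low, high = q1 - k * spread, q3 + k * spread
    return [v for v in values if v < low or v > high]


def pearson(xs, ys):
    """Bivariate: Pearson correlation coefficient of two equal-length lists."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)


weights = [row["weight_kg"] for row in rows if row["weight_kg"] is not None]
complete = [r for r in rows if r["height_cm"] is not None and r["weight_kg"] is not None]
r_value = pearson(
    [r["height_cm"] for r in complete],
    [r["weight_kg"] for r in complete],
)

print("missing heights:", missing_count(rows, "height_cm"))
print("weight outliers:", iqr_outliers(weights))
print("height/weight correlation:", round(r_value, 3))
```

A very high correlation between two input columns would be one hint of collinearity; an unexpectedly low one, as an outlier can cause, is a reason to look at the underlying rows.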

Typical explore actions include flagging a file or set of files for further human review in case of annotation errors, or generating a new slice of the data that is easier to label.

Depending on your needs, data exploration can be done at different stages of the overall process. With DCAI best practices in mind, you can run it during the annotation of a batch to do quality assurance (for example, checking how well labelers do against the gold standard, gauging annotator consensus, or spotting edge cases) or simply to inspect examples. But there’s nothing stopping you from exploring before a batch is annotated. Let’s say you’ve uploaded your data set and you’ve set the gold standard labels. Just to be on the safe side, you decide to run a pre-trained model on the same data set and then assess how the model does. While exploring and comparing both annotation types, you find out that some gold standard labels are actually missing! The model found additional items, and the tools that you used to explore the results made it super-easy to spot the difference.
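The gold-standard comparison can be sketched as a per-file set difference. The file names, the labels, and the shape of the two dictionaries are hypothetical; real annotations usually carry more structure (boxes, confidence scores) than a plain set of class names.

```python
# Hypothetical gold-standard labels and pre-trained model predictions,
# keyed by file name. Each value is the set of classes found in the file.
gold = {
    "img_001.jpg": {"car", "person"},
    "img_002.jpg": {"dog"},
    "img_003.jpg": {"car"},
}
predicted = {
    "img_001.jpg": {"car", "person", "bicycle"},
    "img_002.jpg": {"dog"},
    "img_003.jpg": {"truck"},
}


def compare_labels(gold_labels, model_labels):
    """Return the files where gold and model disagree, with both differences."""
    review = {}
    for name in sorted(gold_labels.keys() | model_labels.keys()):
        g = gold_labels.get(name, set())
        p = model_labels.get(name, set())
        model_only = p - g  # possibly missing from the gold standard
        gold_only = g - p   # possibly missed by the model
        if model_only or gold_only:
            review[name] = {
                "model_only": sorted(model_only),
                "gold_only": sorted(gold_only),
            }
    return review


to_review = compare_labels(gold, predicted)
for name, diff in to_review.items():
    print(name, diff)
```

A disagreement is only a candidate for review: the extra `bicycle` may be a missing gold label or a model false positive, so the action here is to flag the file for a human to check.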

Potential caveats

The person who does the exploration may or may not be involved in the annotation process. More generally, that person may be different from whoever handled any other step, including uploading and annotation. This can be a good thing, since a fresh set of eyes may spot new issues with the data set. It can also be a hindrance: without knowledge of the original workflow and how the data is organized, issues may be raised in the wrong places, leading to extra chaos and rework.

It’s worth noting that even if you’re the person who created the data set or set up the workflow in the first place, you’re not necessarily immune to these issues. With projects sometimes taking months or years to complete, after a while even your own data set can feel new or set up in a strange way.

Conclusion

Exploring your data set can be thought of as a “supercharged” version of looking through files in a regular file browser. There’s nothing in the process that you can’t do manually. But with the right set of tools, and provided that you know the correct process and how to avoid possible pitfalls, you’ll find yourself doing your work (discovering valuable insights, filtering data volume, comparing model performance, debugging the human, etc.) far more efficiently.
