In Machine Learning, it is common to build a predictor that performs very well on average over the entire test dataset but very poorly on specific subsets (or slices) of the data.
These systematic errors are commonly referred to as hidden stratification and have caused high-profile failures in the development of Artificial Intelligence. Examples include Amazon's facial recognition matching professional athletes to mugshots in the 2019 Super Bowl and Google identifying men wearing monocles as armed with a gun in April 2021.
Identifying these subsets is the first step toward improving model robustness, either through robust optimization techniques or by adding relevant data.
Figure 1: Example of hidden stratification. The model (purple dotted line) performs well on average but poorly on two subsets of data (top right in blue and bottom left in red).
With structured data, it is usually quite simple to identify those slices with visualization or data slicing methods ([Polyzotis et al., 2019]). With unstructured data (images, text, audio), the problem becomes much more complex. Before exploring some recent methods developed to discover hidden stratification, we will try to understand how those slices of data can emerge and why they are difficult to detect. Finally, we will see how we can leverage the dataset visualizations we created with Mapper to detect these hidden stratifications.
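For structured data, slicing can be as simple as grouping rows by a metadata column and comparing accuracies. The snippet below is a minimal sketch of that idea, not the method of [Polyzotis et al., 2019]; the column name `time_of_day` and the toy rows are hypothetical.

```python
from collections import defaultdict

def accuracy_by_slice(rows, slice_key):
    """Return {slice value: accuracy} for rows holding 'label' and 'prediction'."""
    hits = defaultdict(int)
    totals = defaultdict(int)
    for row in rows:
        value = row[slice_key]
        totals[value] += 1
        hits[value] += int(row["label"] == row["prediction"])
    return {value: hits[value] / totals[value] for value in totals}

# Hypothetical toy data: 18 daytime rows predicted correctly, 2 night rows missed.
rows = (
    [{"time_of_day": "day", "label": 1, "prediction": 1}] * 18
    + [{"time_of_day": "night", "label": 1, "prediction": 0}] * 2
)

overall = sum(r["label"] == r["prediction"] for r in rows) / len(rows)
per_slice = accuracy_by_slice(rows, "time_of_day")

print(overall)    # 0.9
print(per_slice)  # {'day': 1.0, 'night': 0.0}
```

The overall accuracy looks healthy, while the per-slice view shows that every night-time row is misclassified.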
Slices and granularity
[Eyuboglu et al., 2022] identify three popular types of slices:
Correlation slices are the most common. They occur when the variable we want to predict is correlated with another hidden variable. Our models might rely on this spurious correlation to make predictions instead of the actual target. For example, when trying to distinguish landbirds from waterbirds, a model can end up classifying based on the background (water → waterbirds, land → landbirds).
Noisy label slices appear when a subset of data exhibits higher labeling error rates than the rest of the training dataset. This can happen when the annotation project has not been properly defined.
Rare slices are the default type of hidden stratification. Sometimes, the model performs poorly on a sub-category even if its data has been labeled correctly and no spurious correlation can be identified. In this case, only the lack of data in the training dataset is responsible for the model's difficulties. For example, the model could underperform on photos taken at night if most of the training dataset only contains photos taken during the day.
One of the main difficulties when trying to detect hidden stratification is the definition of the slices where the model underperforms. This is the consequence of the infinite possibilities in the choice of granularity. Let’s take, for example, a dataset where we want to distinguish dogs from cats:
the model could underperform on a specific breed of cat or dog;
but the model could also underperform based on color, sex, age, etc. (some of those filters are even shared between the two main classes).
Figure 2: Example of stratification a Slice Discovery Method could identify
All of these stratifications are valid, which is why it is so difficult to define an algorithm that detects hidden stratification in a general framework and without a human-in-the-loop process.
Slice Discovery Methods
Nonetheless, recent years have seen progress on this topic with the emergence of several Slice Discovery Methods: Multi-accuracy ([Kim et al., 2018]), GEORGE ([Sohoni et al., 2020]), Spotlight ([d’Eon et al., 2021]), DOMINO ([Eyuboglu et al., 2022]), and BARLOW ([Singla et al., 2021]). They all rely on the same two steps:
Recover embeddings of the dataset.
Apply dimension reduction and clustering to detect hidden stratification.
In the first Mapper article, we saw what data embeddings are and how to build them with a pre-trained neural network. Some of the above-mentioned methods go further by using cross-modal embeddings (DOMINO) or a robustly trained model (BARLOW).
PCA is the default dimension reduction method for most of these methods. GEORGE uses UMAP for dimension reduction, and BARLOW uses a feature selection method based on mutual information (a statistical quantity that measures the importance of each feature for the classification task). Each method uses its own clustering algorithm: some choose to highlight only a few clusters that could match poor-performing slices (Spotlight, DOMINO), while others cluster the entire dataset to highlight potential sub-classes (GEORGE).
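To make the clustering step concrete, here is a minimal sketch that groups embeddings and ranks the groups by model accuracy. It is an illustration under stated assumptions, not the algorithm of any method cited above: it uses plain k-means with hand-picked initial centroids, the toy embeddings are hypothetical and already two-dimensional (so the dimension reduction step is omitted), and `correct` is an assumed per-sample flag saying whether the model predicted the right class.

```python
import math

def kmeans(points, initial_centers, n_iter=10):
    """Plain Lloyd's k-means; returns one cluster index per point."""
    centers = [tuple(c) for c in initial_centers]
    assignments = []
    for _ in range(n_iter):
        # Assignment step: attach each point to its nearest centroid.
        assignments = [
            min(range(len(centers)), key=lambda c: math.dist(p, centers[c]))
            for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        for c in range(len(centers)):
            members = [p for p, a in zip(points, assignments) if a == c]
            if members:
                centers[c] = tuple(sum(axis) / len(members) for axis in zip(*members))
    return assignments

def accuracy_per_cluster(assignments, correct):
    """Return {cluster index: fraction of correctly predicted samples}."""
    totals, hits = {}, {}
    for cluster, ok in zip(assignments, correct):
        totals[cluster] = totals.get(cluster, 0) + 1
        hits[cluster] = hits.get(cluster, 0) + ok
    return {cluster: hits[cluster] / totals[cluster] for cluster in totals}

# Hypothetical 2-D embeddings forming three well-separated groups.
embeddings = [(0, 0), (0, 1), (1, 0), (5, 5), (5, 6), (6, 5), (10, 0), (10, 1), (11, 0)]
# 1 = model was right on that sample, 0 = model was wrong (assumed input).
correct = [1, 1, 1, 1, 1, 1, 0, 0, 1]

assignments = kmeans(embeddings, initial_centers=[(0, 0), (5, 5), (10, 0)])
cluster_accuracy = accuracy_per_cluster(assignments, correct)
worst_cluster = min(cluster_accuracy, key=cluster_accuracy.get)

print(assignments)    # [0, 0, 0, 1, 1, 1, 2, 2, 2]
print(worst_cluster)  # 2
```

With real embeddings, a reduction such as PCA or UMAP would typically run before the clustering, and the lowest-accuracy clusters would be the candidates to inspect by hand.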
Slice Discovery with Mapper’s Dataset Representation
If we look at how we built a dataset representation with Mapper, we can see that we perfectly match the two steps of Slice Discovery Methods. Even better, instead of standard clustering methods, which only return lists of elements, we have a method that returns a graph, linking clusters to each other and providing a user-friendly visualization.
Figure 3: Mapper representation of one label category of the Fashion-MNIST dataset. Constructed with Kili AutoML and KeplerMapper
Topological Data Analysis’s ability to detect hidden stratification and subgroups has been known for more than a decade ([Nicolau et al., 2011]), and others have used Mapper to classify error types in a predictive process ([Carlsson et al., 2020]).
[Eyuboglu et al., 2022] created “DC-Bench”, a data-centric benchmark consisting of a list of problems and associated models subject to hidden stratification. This benchmark can be used to compare Slice Discovery Methods based on their ability to detect hidden stratification. We use standard evaluation metrics such as precision (fraction of assets among the retrieved instances that actually belong to the slice to discover) or recall (fraction of assets that actually belong to the slice which are recovered by the Slice Discovery Method).
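The two metrics can be written down in a few lines. The sketch below follows the definitions given above; the asset ids and the function name `precision_recall` are hypothetical and not part of the benchmark's API.

```python
def precision_recall(retrieved, true_slice):
    """Precision and recall of a retrieved cluster against the slice to discover."""
    retrieved, true_slice = set(retrieved), set(true_slice)
    overlap = len(retrieved & true_slice)
    precision = overlap / len(retrieved) if retrieved else 0.0
    recall = overlap / len(true_slice) if true_slice else 0.0
    return precision, recall

# Hypothetical asset ids: the slice to discover and one cluster returned by a method.
true_slice = {2, 3, 5, 7, 11}
cluster = [3, 5, 7, 8]

precision, recall = precision_recall(cluster, true_slice)
print(precision, recall)  # 0.75 0.6
```

Three of the four retrieved assets belong to the slice (precision 0.75), and three of the five slice members were retrieved (recall 0.6).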
We used DC-Bench to compare the following methods:
Confusion Matrix, which partitions data into the cells of the confusion matrix;
Dataset representation with Mapper, constructed either:
with the activation embeddings (extracted from the penultimate layer of the model used for classification),
or with the Big Transfer ([Kolesnikov et al., 2019]) embeddings,
and where we consider the 5 (or 10) graph nodes where the model performs the worst;
Spotlight with the activation embeddings, where we consider the 10 clusters constructed by the method;
GEORGE with the activation embeddings (since we cannot impose a number of clusters in GEORGE, we use the standard parameters, which returned an average of 12.2 clusters per problem).
Unfortunately, we could not recover the results presented for the DOMINO method.
We compare the precision of all created clusters (that is, the percentage of assets in a cluster that belong to the actual slice):
Figure 4: comparison of Slice Discovery Methods with DC-Bench
Results can be interpreted as follows:
blue: In one of the cells of the confusion matrix, 27% of the assets belong to the slice we want to identify;
green: In one of the 5 nodes in Mapper’s representation where the model performs the worst, 37% of the assets belong to the slice we want to identify;
pink: In one of the clusters constructed by GEORGE (12 on average), 27% of the assets belong to the slice we want to identify.
We use precision as our metric because the purpose is to facilitate the detection of hidden stratification for the user. This means that, when looking at the few nodes constructed by Mapper where the model underperforms, the user can easily identify a coherent subset corresponding to a systematic error of the model.
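One possible reading of this evaluation is sketched below: keep the k nodes where the model performs worst, then report the highest precision reached by any of them. This is an assumption about how the figures could be computed, not the benchmark's actual code; the node ids, `correct` flags, and slice membership are hypothetical.

```python
def worst_nodes(nodes, correct, k):
    """Return the ids of the k nodes with the lowest model accuracy."""
    def accuracy(node_id):
        members = nodes[node_id]
        return sum(correct[i] for i in members) / len(members)
    return sorted(nodes, key=accuracy)[:k]

def best_precision(nodes, node_ids, true_slice):
    """Highest fraction of slice members found in any of the given nodes."""
    return max(
        len(set(nodes[n]) & true_slice) / len(nodes[n]) for n in node_ids
    )

# Hypothetical nodes; they may share assets, as Mapper nodes do.
nodes = {"a": [0, 1, 2, 3], "b": [3, 4, 5, 6], "c": [6, 7, 8, 9]}
# 1 = model was right on that asset, 0 = model was wrong (assumed input).
correct = [1, 1, 1, 1, 0, 0, 1, 1, 0, 1]
true_slice = {4, 5, 8}

candidates = worst_nodes(nodes, correct, k=2)
score = best_precision(nodes, candidates, true_slice)

print(candidates)  # ['b', 'c']
print(score)       # 0.5
```

Here node "b" is the worst-performing node, and half of its assets belong to the slice to discover.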
Let us look at the example from the previous article:
Figure 5: Mapper representation of category “T-shirt/Top” of the Fashion-MNIST dataset. Constructed with Kili AutoML and KeplerMapper
Figure 6: Sample of assets in nodes 1 to 4 from Figure 5
If we look inside node 4, where the model performs poorly (on average, the model assigns a 17% probability of belonging to the correct class to assets inside that node), we see “T-shirt/Top” items with complex patterns, which points to a systematic error of the model. To solve this issue, we can train our model to recognize tops with complex patterns or add pictures matching this criterion.
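The per-node score quoted above (average probability assigned to the correct class) can be computed along these lines. This is a sketch under assumptions: the graph is taken to be a dictionary whose "nodes" entry maps node ids to lists of asset indices, which is how KeplerMapper structures its output, and the probabilities, labels, and node ids are hypothetical toy values.

```python
def mean_true_class_probability(graph, probabilities, labels):
    """Return {node id: mean probability the model gave to the correct class}."""
    scores = {}
    for node_id, members in graph["nodes"].items():
        total = sum(probabilities[i][labels[i]] for i in members)
        scores[node_id] = total / len(members)
    return scores

# Hypothetical graph with two nodes and four assets.
graph = {"nodes": {"node_1": [0, 1], "node_4": [2, 3]}}
# One row of class probabilities per asset (assumed model output).
probabilities = [[0.75, 0.25], [1.0, 0.0], [0.25, 0.75], [0.125, 0.875]]
# True class index of each asset.
labels = [0, 0, 0, 0]

scores = mean_true_class_probability(graph, probabilities, labels)
worst_node = min(scores, key=scores.get)

print(scores)      # {'node_1': 0.875, 'node_4': 0.1875}
print(worst_node)  # node_4
```

Nodes with a low score are the ones worth opening first when looking for a coherent group of failures.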
Hidden stratifications are the source of important failures in Machine Learning. Their automatic discovery is a hot topic in the Data-Centric paradigm of Artificial Intelligence.
With Mapper’s interactive dataset representation, we can help data scientists identify hidden stratification. While still relying on a human in the loop, this method is both powerful and user-friendly, thanks to graph visualization.
Gurjeet Kaur Chatar Singh, Facundo Mémoli, and Gunnar E. Carlsson. Topological methods for the analysis of high dimensional data sets and 3d object recognition. 2007. URL http://dx.doi.org/10.2312/SPBG/SPBG07/091-100
Neoklis Polyzotis, Steven Whang, Tim Klas Kraska, and Yeounoh Chung. Slice finder: Automated data slicing for model validation. 2019. URL https://arxiv.org/pdf/1807.06068.pdf
Michael P. Kim, Amirata Ghorbani, and James Y. Zou. Multiaccuracy: Black-box post-processing for fairness in classification. CoRR, abs/1805.12317, 2018. URL http://arxiv.org/abs/1805.12317
Nimit Sharad Sohoni, Jared A. Dunnmon, Geoffrey Angus, Albert Gu, and Christopher Ré. No subclass left behind: Fine-grained robustness in coarse-grained classification problems. CoRR, abs/2011.12945, 2020. URL https://arxiv.org/abs/2011.12945
Greg d’Eon, Jason d’Eon, James R. Wright, and Kevin Leyton-Brown. The spotlight: A general method for discovering systematic errors in deep learning models. CoRR, abs/2107.00758, 2021. URL https://arxiv.org/abs/2107.00758
Sahil Singla, Besmira Nushi, Shital Shah, Ece Kamar, and Eric Horvitz. Understanding failures of deep networks via robust feature extraction. 2021. URL https://bit.ly/3cSWXcd
Monica Nicolau, Arnold J. Levine, and Gunnar Carlsson. Topology-based data analysis identifies a subgroup of breast cancers with a unique mutational profile and excellent survival. Proceedings of the National Academy of Sciences, 108(17):7265–7270, 2011. URL https://www.pnas.org/doi/abs/10.1073/pnas.1102826108
Leo S. Carlsson, Mikael Vejdemo-Johansson, Gunnar Carlsson, and Pär G. Jönsson. Fibers of failure: Classifying errors in predictive processes. Algorithms, 13(6), 2020. ISSN 1999-4893. URL https://www.mdpi.com/1999-4893/13/6/150
Alexander Kolesnikov, Lucas Beyer, Xiaohua Zhai, Joan Puigcerver, Jessica Yung, Sylvain Gelly, and Neil Houlsby. Large-scale learning of general visual representations for transfer. CoRR, abs/1912.11370, 2019. URL http://arxiv.org/abs/1912.11370