Debugging without ever seeing the code

Towards privacy in real-world machine learning.

by Nicolas Remerscheid, October 16th, 2020

Debugging. “Should I do it, or could you? You do it? I do it?” Everybody needs to do it, in every domain: finding an error in a text, checking that a piece of software behaves sensibly or, in this case, evaluating whether an AI model outputs what it should. Now, what does this have to do with privacy in machine learning, you might ask? Well, we also need to debug ML models that are trained on private data which we aren’t allowed to access directly.
But don’t I need direct access to the training data to debug a machine learning model?

No, you don’t! Let me first give a quick introduction to the core technology that makes this possible: Privacy Preserving Machine Learning (PP-ML).
PP-ML is a rapidly evolving new subfield of Machine Learning which tries to enable algorithms to be trained on data that can’t be accessed directly, most often because it is stored in a decentralized manner on multiple devices, without the possibility of transferring it to the central place where our model sits. There have been remarkable advances in techniques that allow training on such decentralized data without ever accessing it directly.
However, we need the whole machine learning workflow from data-processing to model testing to be privacy preserving!
Specifically, I will explain and further elaborate on an interesting approach to inspecting data samples for debugging in a privacy-preserving manner, outlined in the paper “Generative Models for Effective ML on Private, Decentralized Datasets” by Sean Augenstein et al. [1], published as a conference paper at ICLR 2020.
Before we dive deep into the proposed method for debugging an AI model without accessing the data directly, the terms Federated Learning, Differential Privacy and Deep Generative Models should sound familiar. As a quick overview, let me very briefly summarize them.

Deep Generative Models (Deep-GM) are simply Deep Learning (DL) models which can be trained to generate artificial, new data samples. Federated Learning (FL) inverts the usual workflow: normally we bring the data to the model to train, but FL brings the model to the data and trains on multiple devices at the same time, so the data never leaves the device. Finally, since ML models tend to memorize certain parts of the data they’re trained on, Differential Privacy (DP) adds random noise to the model weights to prevent reverse-engineering information about individual data examples.
Lastly, before we take off, a bit of motivation for doing PP-ML: why should you even care? To shorten this discussion, let me give a very simple use case as an example. There exists a skin-cancer model from a research team at Stanford in cooperation with Google [5] which is capable of detecting whether you have skin cancer simply from a photo of your skin. Remarkable accuracy, immediate feedback. However, would you want some big tech company to have a large store of your naked full-body scans on their cloud? No!

Debugging…what, exactly?

To further narrow down this issue, this blog post limits the challenge of debugging ML-models to the problem of manual data inspection. Manual data inspection without direct data access would be like debugging code without actually seeing the code.
Intuitively, the manual inspection of data samples serves two purposes: finding flaws or mistakes in the dataset itself, or discovering errors in the data preprocessing. A study by Humbatova et al. (2019) [4] similarly finds that “Preprocessing of Training Data” and “Training Data Quality” are among the most common problems in DL.
The paper defines six concrete problems (*) for which, in a normal (non-PP-ML) scenario, a Machine Learning Engineer would manually inspect some data.
      1. Check whether the data is suited for the specific use-case.
      2. Understand which samples are misclassified and why.
      3. Check which samples aren’t labeled.
      4. Understand on which groups of samples the model performs poorly and why.
      5. Check whether (human) labeling of samples was done properly.
      6. Check whether data or training parameters are biased.

Are Generative Models possible in PP-ML?
No – they are necessary!

First off, a key question most certainly is: what do generative models have to do in a privacy-preserving context? After all, their core objective is to generate data, which makes the effect of overfitting or memorization of training samples (or parts of them) even more openly visible. Even normal use of the model for inference (data generation in this case), without any reverse engineering, could lead to the disclosure of sensitive information. So training generative models on sensitive data seems unwise, even dangerous, at first glance!

Counterintuitively, however, the approach developed in the paper demonstrates that generative models combined with Federated Learning and Differential Privacy can indeed be the solution to the problem of not being able to manually look at the data. Indeed, on second thought, generative models paired with differential privacy are the ideal proxy for checking general properties of the data as a whole without needing to look at single samples of the real data. They could even be beneficial to the different manual-inspection debugging tasks, since by definition a GM should model inherent characteristics of the whole dataset rather than sample-specific details. The same holds for DP, which also hides individual information while preserving key characteristics of the whole dataset. This prevents us from, e.g., looking at a couple of samples that happen not to exhibit a common error in the dataset and therefore wrongly assuming a flawless dataset.

Bridging the Lab-World with the Real-World.
2 real world use-cases.

The above-mentioned paper has explored this concept on two actual real-world use cases. The first example debugs the next-word-prediction model used in Google’s Gboard [6] (smart keyboard) to predict which word the user might want to type next. A word-level RNN serves as the primary model for the prediction task, and a character-level RNN serves as the debugging model.
The second example is a mobile banking app in which a digit classifier automatically reads in handwritten digits. A standard CNN serves as the primary model for digit classification, and a GAN serves as the auxiliary model for debugging.
In both cases, the data sources are the users of an app which uses the primary model to provide some utility.

We can break down the specifics of the debugging approach and its application to the two use-cases into three different concepts building on top of each other.

1. Training generative models in a privacy preserving manner:

The first step, obviously, is to establish a specific algorithm for training an ML model in a privacy-preserving way on a decentralized dataset. For that we use the so-called “DP-FedAvg” algorithm [3] and its adaptation to training GANs in the above-mentioned paper. For the sake of this blog, I will only briefly explain its most important parts.
To start off, we consider a set of many different data sources (in the app scenario: users), from which we select a new random subset for each training round (i.e. global gradient step). We send the current model to each data source in the subset, where its parameters are updated by training locally for a couple of epochs (i.e. passes over the local dataset). Next, we take the difference between the parameters that were sent to the data source and the parameters after local training, and clip it, if necessary, to a maximum magnitude. This way, after receiving the weight updates from all data sources and averaging over them, we make sure every data source contributes roughly equally to the final global model update.

Lastly, before actually updating the weights of the model, i.e. doing the gradient step, we add Gaussian noise with zero mean and a carefully calibrated variance to the averaged weight update; this is what provides differential privacy.
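A minimal sketch of one such round, assuming flat NumPy weight vectors and a hypothetical `client_update` method standing in for the on-device training loop (names and the noise calibration here are illustrative, not the paper’s exact implementation):

```python
import numpy as np

def dp_fedavg_round(global_weights, clients, clip_norm=1.0,
                    noise_multiplier=1.1, sample_size=100, rng=None):
    """One DP-FedAvg round (sketch): sample clients, clip their model
    deltas, average them, and add Gaussian noise before updating."""
    rng = rng or np.random.default_rng()
    # 1. Select a new random subset of data sources for this round.
    selected = rng.choice(len(clients), size=sample_size, replace=False)

    deltas = []
    for i in selected:
        # 2. Local training for a few epochs (client_update is a
        #    hypothetical stand-in for the on-device training loop).
        local_weights = clients[i].client_update(global_weights)
        delta = local_weights - global_weights
        # 3. Clip the update so no single source dominates the average.
        norm = np.linalg.norm(delta)
        deltas.append(delta * min(1.0, clip_norm / norm))

    # 4. Average the clipped updates and add zero-mean Gaussian noise
    #    calibrated to the clipping bound (the differential-privacy step).
    avg_delta = np.mean(deltas, axis=0)
    noise_std = noise_multiplier * clip_norm / sample_size
    noise = rng.normal(0.0, noise_std, size=avg_delta.shape)
    return global_weights + avg_delta + noise
```

The clipping bound is what makes the noise scale meaningful: because every client’s contribution is at most `clip_norm`, a fixed amount of noise hides any single client’s influence on the averaged update.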
For our GAN, the algorithm doesn’t need to change much. By the post-processing property of (ε,δ)-DP, any computation applied on top of the output of a DP mechanism keeps at least the same DP guarantees. Since only the training of the GAN discriminator touches the real data, training the GAN generator requires no further DP step (no added noise) and can happen directly on the central server (no FL procedure needed).
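Structurally, one round of this federated DP-GAN training looks roughly as follows; `dp_fedavg_step` and `server_gen_step` are hypothetical stand-ins for the federated DP discriminator update and an ordinary central generator update:

```python
def dp_gan_round(gen_w, disc_w, dp_fedavg_step, server_gen_step):
    """One round of federated DP-GAN training (structural sketch).

    Only the discriminator touches real user data, so only its update
    runs through DP-FedAvg (clipping + noise). The generator trains
    against the already-DP-protected discriminator on the server; by
    the post-processing property this costs no extra privacy budget.
    """
    disc_w = dp_fedavg_step(disc_w)          # federated, clipped, noised
    gen_w = server_gen_step(gen_w, disc_w)   # plain central training
    return gen_w, disc_w
```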

2. Generative models as debugging mechanism:

Another key element for successfully debugging a given problem with generative models is a programmatic selection procedure which can automatically select a suitable subset of the whole training set on which the generative models should train.
This means we want a standard metric which portrays the performance of the primary model on a given subset of the data accurately enough to separate the “bad” data points from the “good” ones, without leaking sensitive information about the data samples themselves. Sounds abstract? Let’s take a look at the examples!
In the first example we consider a standard problem in the data-processing pipeline of an NLP model: the so-called “tokenization” step, which converts words into model inputs, namely vectors. The bug is that a space between two different words isn’t recognized reliably, so instead of two tokens, one for each word, a single token covering both words is created (see figure below).
For the selection procedure, we first track the performance of the primary model using a standard NLP metric, the “out-of-vocabulary” (OOV) rate: the fraction of words (in this case the words generated by the primary model for next-word prediction) which are not in the set of known words. Once we notice that the OOV rate of the predicted words is unusually high, we suspect an error in the tokenization (corresponding to the third problem defined above *).
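As a toy illustration, the OOV rate is simply the share of predicted words missing from a fixed vocabulary (function name and inputs are illustrative):

```python
def oov_rate(predicted_words, vocabulary):
    """Fraction of predicted words not contained in the known
    vocabulary; a spike in this selection metric hints at a
    preprocessing bug such as the tokenization error above."""
    vocab = set(vocabulary)
    misses = sum(1 for w in predicted_words if w not in vocab)
    return misses / len(predicted_words)
```

With the vocabulary `{"hello", "world", "how", "are"}`, the merged token `"worldhow"` counts as OOV, so `oov_rate(["hello", "worldhow", "are"], ...)` yields 1/3.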
To investigate this issue, we train another time-series generative model, a char-level RNN, exclusively on the OOV words and generate self-contained sequences of characters (meaning the model also predicts when a character sequence should end, i.e. when a word is most likely complete). Because the generative model works at the character level, we can calculate the joint character probability of each generated sequence, allowing us to rank the most probable words learned from the OOV training data. We find that the most probable word is indeed composed of two known words separated by a space: we uncovered the tokenization error without looking at a single data sample!
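The ranking step can be sketched as follows, assuming a hypothetical `char_model(prefix)` that returns a next-character probability distribution as a dict (a trained char-level RNN would play this role):

```python
import math

def sequence_log_prob(word, char_model):
    """Joint log-probability of a character sequence: the sum of
    log P(c_t | c_<t) under the character-level model."""
    logp = 0.0
    for t, ch in enumerate(word):
        next_char_probs = char_model(word[:t])
        # Unseen characters get a tiny floor probability.
        logp += math.log(next_char_probs.get(ch, 1e-12))
    return logp

def rank_by_probability(words, char_model):
    """Order candidate sequences from most to least probable."""
    return sorted(words, key=lambda w: sequence_log_prob(w, char_model),
                  reverse=True)
```

Summing log-probabilities rather than multiplying raw probabilities avoids numerical underflow on longer sequences.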

For the other example we use the misclassification rate as the selection metric (corresponding to the second and fourth problems defined above) to debug a pixel-inversion error, equally successfully!

3. Direct examination vs. comparison:

Since we are dealing with a low-fidelity generative model, which on top of that is trained with DP, it is a good idea to compare the erroneous samples with better-performing samples rather than relying on recognizing the bug from the erroneous samples alone. In fact, this is exactly what was done in both examples. In total, we always train four different models: one “good” and one “bad” version of both the primary and the auxiliary model.
Once done, we first compare the performance of the primary models on the selection metric to check whether there is a bug at all. If so, we use the selection metric to select a worst-performing subset and a best-performing subset, train the two auxiliary models on them, and then compare data samples generated from each to uncover the bug.
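The selection step above might look like this; a sketch assuming a lower-is-better metric such as the misclassification rate, with hypothetical names:

```python
def select_extreme_subsets(sources, metric, k=100):
    """Rank data sources by the primary model's selection metric and
    return the k best- and k worst-performing ones. These feed the
    'good' and 'bad' auxiliary generative models, whose generated
    samples are then compared side by side."""
    ranked = sorted(sources, key=metric)   # ascending: best first
    return ranked[:k], ranked[-k:]         # (best, worst)
```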
To conclude, the approach doesn’t just work utility-wise (we can successfully debug!); at the same time it achieves a single-digit epsilon privacy budget given enough users (350,000 for the first example and 2 million for the second, reasonable numbers for such apps!).

So, where can I send-in my naked full body scans?

There are two take-home-messages this blog should convey.
First, PP-ML can be made feasible for actual real-world use cases; as this example shows, we can even find solutions for challenges that seem unsolvable at first.
Second, these examples are a proof of concept, not a plug-and-play guide to completing the whole machine-learning workflow in a privacy-preserving fashion. Much work remains to be done in all three fundamental building blocks.
So, what are you waiting for?!


[1] “Generative Models for Effective ML on Private, Decentralized Datasets”, by Sean Augenstein et al., published as a conference paper at ICLR 2020

[2] “Deep Learning with Differential Privacy”, by Martín Abadi et al., published in the proceedings of the 23rd ACM Conference on Computer and Communications Security (CCS 2016)

[3] “Learning Differentially Private Recurrent Language Models”, by H. Brendan McMahan et al., published as a conference paper at ICLR 2018

[4] “Taxonomy of Real Faults in Deep Learning Systems”, by Nargiz Humbatova, Gunel Jahangirova, Gabriele Bavota, Vincenzo Riccio, Andrea Stocco, and Paolo Tonella, 2019