Logo: University of Southern California

Paving the Way for Robust Machine Learning

Team from USC Viterbi, UC San Diego, and MIT Has Recipe for Detecting Outliers in High Dimensional Data Sets
By: Amy Blumenthal
November 18, 2016 —


Ilias Diakonikolas, Andrew and Ema Viterbi Early Chair in USC Viterbi's Department of Computer Science, Photo Courtesy of Ilias Diakonikolas


Ilias Diakonikolas, Assistant Professor and Andrew and Erna Viterbi Early Career Chair in USC Viterbi’s Department of Computer Science, is a theoretical computer scientist who studies the inherent abilities and limitations of efficient algorithms. He studies what computers can and cannot do efficiently— that is, measuring the very resources that a computer uses to solve computational problems, such as the time a computer takes or the space it uses to complete a computational task.

His most recent work is at the intersection of computer science and machine learning, dealing with high-dimensional learning when scientists are working with hundreds or thousands of variables.

Data scientists often need to incorporate many, many features. “Every feature corresponds to a dimension. A researcher may want to observe data points in thousands of dimensions and to be able to make accurate inferences about the model that generated this data,” says Diakonikolas.

Due to a variety of factors, real data often has incorrect values within it. Diakonikolas says that in some circumstances, five percent of a researcher’s data could be corrupted, which could render standard machine learning techniques essentially useless. This is the type of problem that Diakonikolas, along with post doctoral student Alistair Stewart and a team of scholars at UC San Diego and MIT set out to fix.

In a recent paper, “Robust Estimators in High Dimensions without the Computational Intractability,” published in the Proceedings of this year’s IEEE Symposium on Foundations of Computer Science, Ilias Diakonikolas from USC Viterbi, and his co-authors Gautam Kamath of MIT, Daniel Kane of UC San Diego, Jerry Li of MIT, Ankur Moitra of MIT, and Alistair Stewart of USC Viterbi, revisit an old topic that has existed in statistics since the early twentieth century: finding the parameters of a high-dimensional model in the presence of corrupted data. The authors contemplate the following: What if the observations collected by scientists do not perfectly align with the assumed model? What if there is a small amount of error in the data due to noise or incorrect measurements?

The authors consider one of the most fundamental models: the Gaussian distribution, also known as ‘normal distribution’ in statistics (The Gaussian distribution is one of the most commonly used models by data scientists in a plethora of applications). Diakonikolas and his co-authors set out to address the issue of corrupted data within this distribution family as well as related models.

In some cases, even a single nonsensical measurement can completely compromise the accuracy of an estimator says Diakonikolas, “In the one-dimensional setting, there is a simple solution to this problem, known to statisticians since the 1960’s. Instead of using the average of the values, we use the median value.”

The situation changes radically, says Diakonikolas, in high dimensions. “If we use any of the standard estimators in high dimensional settings, one percent of errors, for example, in ten thousand dimensions, will give you nothing useful.” 

The challenge in high-dimensional learning is that it is difficult to detect corruptions. “It is difficult to detect the outliers -- the data points that shouldn’t be there. If one uses the wrong algorithm, it is incredibly time-consuming as the number of dimensions increase, and, even a computer couldn’t do it,” says Diakonikolas.

The researchers thus focused on the computational efficiency of their algorithm when the dimensionality of the data increases, asking: “Can we actually design an efficient algorithm that processes corrupted observations and that gives an accurate answer?”

If an algorithm uses samples in many dimensions, Diakonikolas and his colleagues want to reduce the number of steps the algorithm uses to be able to process those samples, so that it scales at a reasonable rate with the number of dimensions. “We want an algorithm whose running time increases at a polynomial rate, as opposed to exponential, with the dimension.” This, Diakonikolas says, “is the difference between whether a computation is feasible or not feasible. An exponential dependence on the dimension is computationally infeasible for any dimension larger than, say, 15.”

The team of researchers from USC, MIT, and UC San Diego developed the first efficient robust model-fitting technique in the high-dimensional setting. Their technique applies to various models, including Gaussian distributions, product distributions and certain combinations of such distributions.

At the same conference, researchers from Georgia Tech presented a similar algorithm.

Diakonikolas and his colleagues have developed an efficient way to filter the corrupted data samples---an algorithm that through a sequence of iterations detects the noisy or erroneous data and removes it. Once the noisy data is detected and removed, the researcher will only have clean data and can apply any of the standard estimators that were being used.

While this algorithm is being applied for particular families of models, says Diakonikolas, it is the first step in a line of work on more and more complex models. More recently, the USC-UCSD team has applied their technique to a broader family of models that can incorporate arbitrary correlations between the variables.

“The big advantage of this work is that it answers an old question but that there is much more to be done in the future. Everyone has a model in machine learning, a model that they would like to learn, such as a graphical model. All known estimators for such models are very sensitive to outliers. We are working on adapting our techniques to such models as well. An equally important goal we are currently pursuing is to apply our tools to real-world data.”