Identifying contamination in datasets is important in a wide variety of settings, including view and click fraud in online advertising. After a brief overview of digital ad fraud, I’ll describe a technique for estimating contamination in large, categorical datasets. The technique involves solving a series of convex programs, which yields a bound on the minimum number of data points that must be discarded from an empirical dataset (i.e., the level of contamination) for the remaining data to match a model to within a specified goodness of fit, controlled by a p-value. I’ll discuss convergence guarantees, provide geometric interpretations, and highlight practical aspects of solving over a million convex optimization problems nightly.
May 6, 2015
12:30 pm (1h)
Discovery Building, Orchard View Room
Matt Malloy