The advent of data-hungry applications has enabled computers to interpret what they see, communicate in natural language, and answer complex questions. There is a hidden catch, however: all these state-of-the-art systems rely on high-effort tasks like data preparation and data cleaning. It is estimated that 70% to 80% of the time devoted to analytics projects is spent on checking and organizing data. The challenge is that data collection often introduces dirty data, i.e., incomplete, erroneous, replicated, or conflicting data records.
In this talk, I discuss how to reason about dirty data and demonstrate how statistical learning is key to managing large volumes of heterogeneous, noisy data sources effectively. I will present HoloClean, our new system that relies on statistical learning and inference to repair identified data errors and anomalies. Finally, I will conclude by drawing connections between data cleaning and structured prediction, and by showing how these connections lead to new insights and solutions for classical database problems such as data repair and consistent query answering.
February 28 @ 12:30 pm (1h)
Discovery Building, Orchard View Room