Systems | Information | Learning | Optimization
 

Subspace cluster with missing data via integer programming

Abstract:
In the Subspace Clustering with Missing Data (SCMD) problem, we are given a collection of n data points, arranged into columns of a matrix X, and each of the data points is observed only on a subset of its coordinates. The data points are assumed to be concentrated near a union of low-dimensional subspaces. The goal of SCMD is to recover the missing elements of the data matrix X. State-of-the-art algorithms for SCMD can perform poorly on instances with a large amount of missing data or if the data matrix X is nearly full-rank. We propose a novel integer programming based approach for SCMD. The approach is based on dynamically determining a set of candidate subspaces and optimally assigning points to selected subspaces. The problem structure is identical to the classical facility-location problem, with subspaces playing the role of facilities and data points that of customers. We propose a column-generation approach for identifying candidate subspaces combined with a Benders decomposition approach for solving the linear programming relaxation of the formulation. An empirical study demonstrates that the proposed approach can achieve better clustering accuracy than state-of-the-art methods when the data is high-rank, the percentage of missing data is high, or there are a small number of data points in each subspace.

This is joint work with Jim Luedtke, Daniel Pimentel-Alarcon and Akhilesh Soni.

September 29 @ 12:30
12:30 pm (1h)

Orchard View Room, Virtual

Jeff Linderoth

Video