Abstract:
How efficiently can we search distributions? The problem is modeled as follows: we are given knowledge of k discrete distributions v_i for 1 <= i <= k over the domain [n] = {1,…,n}, which we can preprocess. Then we get samples from an unknown discrete distribution p, also over [n]. The goal is to output the closest distribution to p among the v_i’s in total variation (TV) distance (up to some small additive error). State-of-the-art sample-efficient algorithms require Theta(log k) samples and run in near-linear time.
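To make the setup concrete, here is a minimal sketch of the objects involved, assuming distributions over [n] are represented as length-n probability vectors (indexed 0 to n-1); the function names are illustrative, not from the work itself.

    import numpy as np

    def tv_distance(p: np.ndarray, q: np.ndarray) -> float:
        """Total variation distance: TV(p, q) = (1/2) * sum_x |p(x) - q(x)|."""
        return 0.5 * float(np.abs(p - q).sum())

    def closest_distribution(p: np.ndarray, vs: list[np.ndarray]) -> int:
        """Brute-force scan over the k hypotheses: O(nk) time, and it
        assumes p is known exactly -- the assumptions that sample- and
        time-efficient algorithms aim to avoid."""
        return min(range(len(vs)), key=lambda i: tv_distance(p, vs[i]))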
We introduce a fresh perspective on the problem and study the relationship between sample complexity and time complexity. This question is particularly well motivated because the problem generalizes traditional nearest neighbor search: if we take enough samples, we can learn p explicitly up to low TV distance, and then find the closest v_i in o(k) time using standard nearest neighbor search. However, this approach requires Omega(n) samples.
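A hedged sketch of this learn-then-search baseline appears below (again with illustrative names, and with a linear scan standing in for the o(k) nearest neighbor query): with Theta(n / eps^2) samples, the empirical distribution is eps-close to p in TV distance with high probability, after which the search reduces to ordinary nearest neighbor search over explicit vectors.

    import numpy as np

    def empirical_distribution(samples: np.ndarray, n: int) -> np.ndarray:
        """Empirical estimate of p from i.i.d. samples in {0, ..., n-1};
        Theta(n / eps^2) samples suffice for eps-closeness in TV distance."""
        counts = np.bincount(samples, minlength=n)
        return counts / counts.sum()

    def learn_then_search(samples: np.ndarray, vs: list[np.ndarray], n: int) -> int:
        p_hat = empirical_distribution(samples, n)
        # Stand-in for a preprocessed nearest neighbor query: this scan is
        # O(nk), but since p_hat is an explicit vector, a standard nearest
        # neighbor data structure over the v_i's could answer it in o(k).
        return min(range(len(vs)), key=lambda i: 0.5 * np.abs(p_hat - vs[i]).sum())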
Thus, it is natural to ask: can we obtain both a sublinear number of samples and sublinear query time? We present progress towards this question and uncover a statistical-computational trade-off.
This is joint work with Anders Aamand, Alex Andoni, Justin Chen, Piotr Indyk, Shyam Narayanan, and Haike Xu.