Systems | Information | Learning | Optimization

Some Recent Insights on Transfer-Learning

A common situation in Machine Learning is one where training data is not fully representative of a target population, due either to bias in the sampling mechanism or to prohibitive target sampling costs. In such situations, we aim to ‘transfer’ relevant information from the training data (a.k.a. source data) to the target application. How much information does the source data carry about the target application? Would some amount of target data improve transfer? These are practical questions whose answers depend crucially on ‘how far’ the source domain is from the target. However, how to properly measure the ‘distance’ between source and target domains remains largely unclear.

In this talk we will argue that many of the traditional notions of ‘distance’ (e.g., KL-divergence, extensions of TV such as the D_A discrepancy, density ratios, Wasserstein distance) can yield an over-pessimistic picture of transferability. Instead, we show that some new notions of ‘relative dimension’ between source and target (which we simply term ‘transfer-exponents’) capture a continuum from easy to hard transfer. Transfer-exponents uncover a rich set of situations where transfer is possible even at fast rates; they encode the relative benefits of source and target samples, and have interesting implications for related problems such as multi-task or multi-source learning.

In particular, in the case of transfer from multiple sources, we will discuss (if time permits) a strong dichotomy between minimax and adaptive rates: no adaptive procedure can achieve the same rates as minimax (oracle) procedures.

The talk is based on earlier work with Guillaume Martinet, and ongoing work with Steve Hanneke.

I work in statistical machine learning, with an emphasis on common nonparametric methods (e.g., kNN, trees, kernel averaging). I’m particularly interested in adaptivity, i.e., how to automatically leverage beneficial aspects of data as opposed to designing specifically for each scenario. This involves characterizing statistical limits, under modern computational and data constraints, and identifying favorable aspects of data that help circumvent these limits.

Some specific interests: notions of intrinsic data dimension, benefits (or lack thereof) of sparse or manifold representations; performance limits and adaptivity in active learning, transfer and multi-task learning; hyperparameter-tuning and guarantees in density-based clustering.

October 14 @ 12:30 pm (1h)


Samory Kpotufe