Minimax-optimal semi-supervised regression on unknown manifolds
Please note different day and location (Tuesday, McDonnell 102A). In recent years, many semi-supervised regression and classification methods have been proposed. These methods demonstrated empirical success on some data sets, whereas on others the unlabeled data did not appear to help. To analyze semi-supervised learning theoretically, it is often assumed that the data points lie on a low-dimensional manifold. Under this assumption  and  have shown that classical nonparametric regression methods, using only the labeled data, can achieve optimal rates of convergence. This implies that asymptotically, as the number of labeled points tends to infinity, unlabeled data does not help. However, typical semi-supervised scenarios involve few labeled points, and plenty of unlabeled ones. In this work () we clarify the potential benefits of unlabeled data under the manifold assumption, given a fixed amount of labeled points. Specifically, we prove that for a Lipschitz function on a manifold, a simple semi-supervised regression method based on geodesic k-nearest-neighbors achieves the finite-sample minimax bound on the mean squared error, provided that sufficiently many unlabeled points are available. Furthermore, we show that this approach is computationally efficient, requiring only O(k N log N) operations to estimate the regression function for all N labeled and unlabeled points. We illustrate this approach on two datasets with a manifold structure: indoor localization using WiFi fingerprints and facial pose estimation. In both cases, the proposed method is more accurate and much faster than the popular Laplacian eigenvector regressor . The talk should be accessible to anyone with a general background in statistics and machine learning. Specifically, no knowledge of manifold geometry or minimax theory is assumed.