Streaming Submodular Optimization: Massive Data Summarization on the Fly, KDD 2014

]]>The primary advantages of this method are

1) A significant improvement on speed, even by single-thread computation, our method can be hundreds to thousands times faster than its major rivals, EM algorithm or Gibbs sampling;

2) A more interpretable model, our method is based on the idea of finding a small subset of real data instances to build the model (i.e., separability assumption), and can embed extra prior knowledge of representative point/pattern in learning. This leads to very close or even better performance than its rivals;

3) It finally makes spectral method (i.e., the method of moments) work! Existing spectral method suffers from inaccurate estimate of moments, and can only be used as a initialization of EM algorithm in practice. The separability assumption plays a role of regularization in our method, and is a powerful cure to the large variance of moment estimator.

All the details you need can be found in our long arxiv version:

Code will be released later, but you will find the method is super-easy to implement.

]]>In summary, we tackle two challenging learning problems:

**1) How to model a rich class (non-Gaussian, asymmetric, etc.) of joint likelihood function with multiple outputs? Our Answer: multi-task copula.**

**2) How to (efficiently) estimate a sparse graph function G(Y, E) = F(X), i.e., the sparse dependency graph of (outputs) Y as a function of X (inputs)? Our Answer: sparse graph regression.**

This is the first** copula** model for multi-task or multi-output regression problem (*Yes yes yes I hear someone whispering “Ceee-Rrrr-Ffff”*). The separable semiparametric form of copula model allows us to select arbitrary distribution class to model the marginal likelihood p(Y_i|X), and model the non-Gaussian dependency between outputs by estimating a Gaussian precision matrix (normally seen in covariance selection, Gaussian graphical model, etc.). In addition, it enables a two-stage learning scheme which learn the marginals and conditional dependency graph separately.

Sparse graph regression (SpaGraphR) is a fast tool to learn the dependency graph. However, it is also an independent interest to others who need to ahiceve a dependancy graph varying with input/time/other factors. In statistics this problem is usually called “covariance regression”, while we are seeking here is a “covariance regression with sparse graph structure”, i.e., the covariance function is smooth but its corresponding graph is usually sparse. Given m input instances, the evaluation of this function gives you m sparse graphs of outputs, and similar inputs lead to similar graph structure. This is much harder than “covariance selection” which only need to estimate one graph. However, we show that our algorithm can use similar (or even less) computations than covariance selection to find a nonparametric estimator of covariance function.

**Here comes the Abstract of our paper:**

This paper proposes multi-task copula (MTC) that can handle a much wider class of tasks than mean regression with Gaussian noise in most former multi-task learning (MTL). While former MTL emphasizes shared structure among models, MTC aims at joint prediction to exploit inter-output correlation. Given input, the outputs of MTC are allowed to follow arbitrary joint continuous distribution. MTC captures the joint likelihood of multi-output by learning the marginal of each output firstly and then a sparse and smooth output dependency graph function. While the former can be achieved by classical MTL, learning graphs dynamically varying with input is quite a challenge. We address this issue by developing sparse graph regression (SpaGraphR), a non-parametric estimator incorporating kernel smoothing, maximum likelihood, and sparse graph structure to gain fast learning algorithm. It starts from a few seed graphs on a few input points, and then updates the graphs on other input points by a fast operator via coarse-to-fine propagation. Due to the power of copula in modeling semi-parametric distributions, SpaGraphR can model a rich class of dynamic non-Gaussian correlations. We show that MTC can address more flexible and difficult tasks that do not fit the assumptions of former MTL nicely, and can fully exploit their relatedness. Experiments on robotic control and stock price prediction justify its appealing performance in challenging MTL problems.

]]>

It is normal to see fluctuation in the first several steps, this is common when applying Nesterov’s method.

]]>How to summarize a huge-scale dataset (ratings, movie, action, text) in seconds by representative REAL instances (users, key frames, anchor word) rather than artificial basis?

How to learn graphical models like LDA only by distributed max and min operations rather than iterative algorithm?

Learning by selecting real instances from real world, is usually much faster, more robust and more convincing than learning by optimizing and creating artificial numbers.

Summarizing the world, is the way how human learn from it.

Please come to my “Divide-and-Conquer Anchoring” talk at ICDM this year at Dallas. We win the Best student paper award! My talk will be located at **Houston Ballroom B Sheraton Dallas, between 16:00-18:00 this Sunday (Dec 8). **Here is the paper’s PDF. But I will talk more in Sunday

**Warning of Winter Storm and Freezing Rain in Dallas this weekend!**

The proposed Manifold Elastic Net (MEN) in the paper provides an efficient and interesting method for building a whole solution path for sparse dimension reduction problems, including most manifold learning methods whose solutions are preferred to be sparse. A dynamic process of selecting real features in building sparse representation/projection matrix can be shown by MEN, which has never been shown in previous sparse PCA methods.

]]>**Abstract:**

*Recovering a large low-rank matrix from highly corrupted, incomplete or sparse outlier overwhelmed observations is the crux of various intriguing statistical problems. We explore the power of “greedy bilateral (GreB)” paradigm in reducing both time and sample complexities for solving these problems. GreB models a low-rank variable as a bilateral factorization, and updates the left and right factors in a mutually adaptive and greedy incremental manner. We detail how to model and solve low-rank approximation, matrix completion and robust PCA in GreB’s paradigm. On their MATLAB implementations, approximating a noisy 10000×10000 matrix of rank 500 with SVD accuracy takes 6s; MovieLens10M matrix of size 69878×10677 can be completed in 10s from 30% of 10^7 ratings with RMSE 0.86 on the rest 70%; the low-rank background and sparse moving outliers in a 120×160 video of 500 frames are accurately separated in 1s. This brings 30 to 100 times acceleration in solving these popular statistical problems.*

**More good news:**

*For huge-scale problem or for obtaining faster speed, GreB can be seamlessly and straightforwardly extended to online, stochastic, parallel or divide-and-conquer (D&C) versions by using recent popular methods. In particular, major computations in GreB is matrix multiplications that can be parallelized; GreB can be invoked as a subroutine for sub-matrix in D&C; greedily selected gradient direction can be replaced by a stochastic one.*

**Some results:**

The draft abstract reads:

Directly applying single-label classification methods to the multi-label learning problems substantially limits both the performance and speed due to the imbalance, dependence and high dimensionality of the given label matrix. Existing methods either ignore these three problems or reduce one with the price of aggravating another. In this paper, we propose a $\{0,1\}$ label matrix compression and recovery method termed “compressed labeling (CL)” to simultaneously solve or at least reduce these three problems. CL first compresses the original label matrix to improve balance and independence by preserving the signs of its Gaussian random projections. Afterward, we directly train popular binary classifiers (e.g., support vector machines) for each new label. A fast recovery algorithm is developed to recover the original labels from the predicted new labels. In the recovery algorithm, a “labelset distilling method (LDM)” is designed to extract distilled labelsets (DLs), i.e., the frequently appeared label subsets from the original labels via recursive clustering and subtraction. Given a distilled labelset (DL) and an original label vector, we discover that the signs of their random projections have an explicit joint distribution that can be quickly computed from a geometric inference. Based on this observation, the original label vector is exactly determined after performing a series of Kullback-Leibler divergence based hypothesis tests on the distribution about the new labels. CL significantly improves the balance of the training samples and reduces the dependence between different labels. Moreover, it accelerates the learning process by training less binary classifies for compressed labels, and makes use of label dependence via DLs based tests. Theoretically, we prove the recovery bound of CL which verifies the effectiveness of CL for label compression and multi-label classification performance improvement brought by label correlations preserved in DLs. We show the effectiveness, efficiency and robustness of CL via 5 groups of experiments on 21 datasets from text classification, image annotation, scene classification, music categorization, genomics and web page classification.

The CL-DL simultaneously solves the following three main problems in multi-label learning in a general, simple framework:

1) **Reduces the problem size to far less than k**, wherein k is the number of labels. CL-DL only needs to train O(log k) binary classifiers, which is a significant improvement in efficiency.

2) **Fully leverages the label correlations**. CL-DL extracts the label correlations by finding the frequently appeared 0-1 patterns in the label vectors of training data, and uses the correlation by testing the frequent pattern’s attendance in the recovery of prediction. A clustering based binary matrix decomposition is developed to extract DLs.

3) **Eliminates sample balance problem** when transforming multi-label learning problems to multiple binary classification problems. The positive and negative samples for each binary classifier training have the balanced amounts. The binary classifier can be selected as any kind, such as SVM, logistic regression, etc.

**For those who are at NIPS 2011**: I am now in Granada, Spain attending NIPS 2011 that will start from tomorrow and end at Dec 18. Our paper about Hamming Compressed Sensing (HCS), entitled as “1-bit Compressed Quantization” will be presented at the Workshop on Sparse Representation and Low-rank Approximation in the poster session at Dec 16. If you are at NIPS 2011 and interested in HCS, we can have some discussions.

**For those who are interested in CL**: Since I am in Spain, I am sorry the final PDF version of CL-DL paper will be available only after NIPS 2011. Its MatLab code will be released after its formal publication.

**For those who are interested in Hamming Compressed Sensing (HCS) code**: Thank you very much for your attentions and interests to HCS! I am so sorry to tell you that the code will not be available until the first HCS formal paper is accepted. I hope this will happen soon. CL-DL is a degenerated HCS and its code can be anticipated to be released earlier than HCS. Hope this can help.

PS: Granada is a beautiful, ancient, small city. Today is a little bit cloudy, hope it will be sunny tomorrow. Have to rest after 25 hours’ long journey.

]]>**Semi-Soft GoDec** is released. Different from the ordinary GoDec which imposes hard threshholding to both the singular values of the low-rank part L and the entries of the sparse part S, Semi-Soft GoDec adopts **soft thresholding to the entries of S**. This change brings two key advantages:

1) the parameter **k in constraint “card(S)<=k” is now can be automatically determined** by the soft-threshold \tau, thus avoids the situation when k is chosen too large and some part of noise G is leaked into S.

This is some kind of solution to the question from some readers of GoDec, i.e., **how to determine r and k?** We prefer the hard constraint to the rank of L because a controllable rank can significantly reduce the time cost, in addition, the upper bound of the rank r is not hard to determine in most applications. The rest problem is how to choose a good k. If k is selected in a safe range then it will not affect the result, but a k far from the correct value is possible to induce leaks between S and G, which is annoying sometimes. We found soft threholding is a good and natural solution to this problem. While the error is not increased, the time cost is reduced.

2) the time cost is substantially smaller than the ordinary GoDec, for example, the background modeling experiments can be accomplished with a speed **4 times faster then ordinary GoDec**, which means almost all the videos shown in our ICML paper can be processed in** less than 10 seconds, while the error is kept the same or even smaller.**

This GoDec variant will be included in our journal version, the **MatLab code** of Semi-Soft GoDec is now available on GoDec website for download. Enjoy it!