# Accepted Papers

Contents

## Conference Track

• A Bayesian Nonparametric Approach for Multi-label Classification - Vu Nguyen, Sunil Gupta, Santu Rana, Cheng Li, Svetha Venkatesh [abstract]
Many real-world applications require multi-label classification where multiple target labels are assigned to each instance. In multi-label classification, there exist the intrinsic correlations between the labels and features. These correlations are beneficial for multi-label classification task since they reflect the coexistence of the input and output spaces that can be exploited for prediction. Traditional classification methods have attempted to reveal these correlations in different ways. However, existing methods demand expensive computation complexity for finding such correlation structures. Furthermore, these approaches can not identify the suitable number of label-feature correlation patterns. In this paper, we propose a Bayesian nonparametric (BNP) framework for multi-label classification that can automatically learn and exploit the unknown number of multi-label correlation. We utilize the recent techniques in stochastic inference to derive the cheap (but efficient) posterior inference algorithm for the model. In addition, our model can naturally exploit the useful information from missing label samples. Furthermore, we extend the model to update parameters in an online fashion that highlights the flexibility of our model against the existing approaches. We compare our method with the state-of-the-art multi-label classification algorithms on real-world datasets using both complete and missing label settings. Our model achieves better classification accuracy while our running time is consistently much faster than the baselines in an order of magnitude.
• An Efficient Approach for Multi-Sentence Compression - Elahe Shafiei, Mohammad Ebrahimi, Raymond K. Wong, Fang Chen [abstract]
Multi Sentence Compression (MSC) is of great value to many real world applications, such as guided microblog summarization, opinion summarization and newswire summarization. Recently, word graph-based approaches have been proposed and become popular in MSC. Their key assumption is that redundancy among a set of related sentences provides a reliable way to generate informative and grammatical sentences. In this paper, we propose an effective approach to enhance the word graph-based MSC and tackle the issue that most of the state-of-the-art MSC approaches are confronted with: i.e., improving both informativity and grammaticality at the same time. Our approach consists of three main components: (1) a merging method based on Multiword Expressions (MWE); (2) a mapping strategy based on synonymy between words; (3) a re-ranking step to identify the best compression candidates generated using a POS-based language model (POS-LM). We demonstrate the effectiveness of this novel approach using a dataset made of clusters of English newswire sentences. The observed improvements on informativity and grammaticality of the generated compressions show an up to 44\% error reduction over state-of-the-art MSC systems.
• Bank of Weight Filters for Deep CNNs - Suresh Kirthi Kumaraswamy, PS Sastry, Kalpathi Ramakrishnan [abstract]
Convolutional neural networks (CNNs) are seen to be extremely effective in many large object recognition tasks. One of the reasons for this is that they learn appropriate features also from the training data. The convolutional layers of a CNN have these feature generating filters whose weights are learnt. However, this entails learning millions of weights (across different layers) and hence learning times are very large even on the best available hardware. In some studies in transfer learning it has been observed that the network learnt on one task can be reused on another task (by some finetuning). In this context, this paper presents a systematic study of the exchangeability of weight filters of CNNs across different object recognition tasks. The paper proposes the concept of bank of weight-filters (BWF) which consists of all the weight vectors of filters learnt by different CNNs on different tasks. The BWF can be viewed at multiple levels of granularity such as network-level, layer-level and filter-level. Through extensive empirical investigations we show that one can efficiently learn CNNs for new tasks by randomly selecting from the bank of filters for initializing the convolutional layers of the new CNN. Our study is done at all the multiple levels of granularity mentioned above. Our results show that the concept of BWF proposed here would offer a very good strategy for initializing the filters while learning CNNs. We also show that the dependency among the filters and the layers of the CNN is not strict. One can choose any pre-trained filter instead of a fixed pre-trained net, as a whole, for initialization. This paper is a first step in the direction of creating and characterizing a Universal BWF for efficient learning of CNNs.
• Collaborative Recurrent Neural Networks for Dynamic Recommender Systems - Young-Jun Ko, Lucas Maystre, Matthias Grossglauser [abstract]
Modern technologies enable us to record sequences of online user activity at an unprece- dented scale. Although such activity logs are abundantly available, most approaches to recommender systems are based on the rating-prediction paradigm, ignoring temporal and contextual aspects of user behavior revealed by temporal, recurrent patterns. In contrast to explicit ratings, such activity logs can be collected in a non-intrusive way and can offer richer insights into the dynamics of user preferences, which could potentially lead more accurate user models.
In this work we advocate studying this ubiquitous form of data and, by combining ideas from latent factor models for collaborative filtering and language modeling, propose a novel, flexible and expressive collaborative sequence model based on recurrent neural networks. The model is designed to capture a user’s contextual state as a personalized hidden vector by summarizing cues from a data-driven, thus variable, number of past time steps, and represents items by a real-valued embedding. We found that, by exploiting the inherent structure in the data, our formulation leads to an efficient and practical method. Furthermore, we demonstrate the versatility of our model by applying it to two different tasks: music recommendation and mobility prediction, and we show empirically that our model consistently outperforms static and non-collaborative methods.
• Cost Sensitive Online Multiple Kernel Classification - Doyen Sahoo, Steven Hoi, Peilin Zhao [abstract]
Mining data streams has been an important open research problem in the era of big data analytics. This paper investigates supervised machine learning techniques for mining data streams with application to online anomaly detection. Unlike conventional data mining tasks, mining data streams for online anomaly detection has several challenges:
(i) data arriving sequentially and increasing rapidly,
(ii) highly class-imbalanced distributions; and
(iii) complex anomaly patterns that could evolve dynamically.
To tackle these challenges, we propose Cost-Sensitive Online Multiple Kernel Classification (CSOMKC) for comprehensively mining data streams and demonstrate its application to online anomaly detection.
Specifically, CSOMKC learns a kernel-based cost-sensitive prediction model for imbalanced data streams in a sequential or online learning fashion, in which a pool of multiple diverse kernels is dynamically explored.
The optimal kernel predictor and the multiple kernel combination are learnt together, and simultaneously class imbalance issues are addressed.
We perform theoretical and extensive empirical analysis of the proposed algorithms.
• Deep Gate Recurrent Neural Network - Yuan Gao, Dorota Glowacka [abstract]
This paper explores the possibility of using multiplicative gates to build two recurrent neural network structures. These two structures are called Deep Simple Gated Unit (DSGU) and Simple Gated Unit (SGU), which are structures for learning long-term dependencies. Compared to traditional Long Short-Term Memory (LSTM) and Gated Recurrent Unit (GRU), both structures require fewer parameters and less computation time in sequence classification tasks. Unlike GRU and LSTM, which require more than one gate to control information flow in the network, SGU and DSGU only use one multiplicative gate to control the flow of information. We show that this difference can accelerate the learning speed in tasks that require long dependency information. We also show that DSGU is more numerically stable than SGU. In addition, we also propose a standard way of representing the inner structure of RNN called RNN Conventional Graph (RCG), which helps to analyze the relationship between input units and hidden units of RNN.
• Echo State Hoeffding Tree Learning - Diego Marron, Jesse Read, Albert Bifet, Talel Abdessalem, Eduard Ayguade, José Herrero [abstract]
Nowadays, real-time classification of Big Data streams is becoming essential in a variety of application domains. While decision trees are powerful and easy-to-deploy approaches for accurate and fast learning from data streams, they are unable to capture the strong temporal dependences typically present in the input data. Recurrent Neural Networks are an alternative solution that include an internal memory to capture these temporal dependences; however their training is computationally very expensive and with slow convergence, requiring a large number of hyper-parameters to tune. Reservoir Computing was proposed to reduce the computation requirements of the training phase but still include a feed-forward layer which requires a large number of parameters to tune. In this work we propose a novel architecture for real-time classification based on the combination of a Reservoir and a decision tree. This combination reduces the number of hyper-parameters while still maintaining the good temporal properties of recurrent neural networks. The capabilities of the proposed architecture to learn some typical string-based functions with strong temporal dependences are evaluated in the paper. We show how the new architecture is able to incrementally learn these functions in real-time with fast adaptation to unknown sequences. And we study the influence of the reduced number of hyper-parameters in the behaviour of the proposed solution.
• EcoICA: Skewness-based ICA via Eigenvectors of Cumulant Operator - Liyan Song, Haiping Lu [abstract]
Independent component analysis (ICA) is an important unsupervised learning method. Most popular ICA methods use kurtosis as a metric of non-Gaussianity to maximize, such as FastICA and JADE.However, their assumption of kurtosic sources may not always be satisfied in practice. For weak-kurtosic but skewed sources, kurtosis-based methods could fail while skewness-based methods seem more promising, where skewness is another non-Gaussianity metric measuring the non-symmetry of signals. Partly due to the common assumption of signal symmetry, skewness-based ICA has not been systematically studied in spite of some existing works. In this paper, we take a systematic approach to develop EcoICA, a new skewness-based ICA method for weak-kurtosic but skewed sources. Specifically, we design a new cumulant operator, define its eigenvalues and eigenvectors, reveal their connections with the ICA model to formulate the EcoICA problem, and use Jacobi method to solve it. Experiments on both synthetic and real data show the superior performance of EcoICA over existing kurtosis-based and skewness-based methods for skewed sources. In particular, EcoICA is less sensitive to sample size, noise, and outlier than other methods. Studies on face recognition further confirm the usefulness of EcoICA in classification.
• Enhancing Topic Modeling on Short Texts with Crowdsourcing - Xiaoyan Yang, Shanshan Ying, Wenzhe Yu, Rong Zhang, Zhenjie Zhang [abstract]
Topic modeling is nowadays widely used in text archive analytics, to find significant topics in news articles and important aspects of product comments available on the Internet. While statistical approaches, e.g. Latent Dirichlet Allocation (LDA) and its variants, are effective on building topic models on long texts, it remains difficult to identify meaningful topics over short texts, e.g. news titles and social-media messages. With the emergence and prosperity of crowdsourcing platforms, it becomes possible and easier for analytical systems to incorporate human intelligence into text analytics.

Different from traditional active learning techniques, the combination of crowdsourcing and machine learning poses new challenges on the design of simple tasks for non-experts to finish in seconds. In this paper, we design a new topic modeling technique, fully exploiting the basic intuitions of humans on short text reading. By requesting human labors to subjectively measure the similarity between short text pairs, the accuracy of the topic modeling algorithms could be greatly enhanced, regardless of the prior used in the graphical model. We present well-designed short text pair selection strategies for crowdsourcing and provide analysis on the convergence property of the inference algorithm. Empirical studies show that our proposed approaches improve the result topics on English tweets and Chinese microblogs, by requesting only a small number of labels from crowd.
• Fast Collaborative Filtering from Implicit Feedback with Provable Guarantees - Sayantan Dasgupta [abstract]
Building recommendation algorithm is one of the most challenging tasks in Machine Learning. Although most of the recommendation systems are built on explicit feedback available from the users in terms of rating or text, a majority of the applications do not receive such feedback. Here we consider the recommendation task where the only available data is the records of user-item interaction over web applications over time, in terms of subscription or purchase of items; this is known as implicit feedback recommendation. There is usually a massive amount of such user-item interaction available for any web applications. Algorithms like PLSI or Matrix Factorization runs several iterations through the dataset and may prove very expensive for large datasets. Here we propose a recommendation algorithm based on Method of Moment, which involves factorization of second and third order moments of the dataset. Our algorithm can be proven to be globally convergent using PAC learning theory. Further, we show how to extract the parameters using only three passes through the entire dataset. This results in a highly scalable algorithm that scales up to million of users even on a machine with a single-core processor and 8 GB RAM and produces competitive performance in comparison with existing algorithms.
• Geometry-aware stationary subspace analysis - Inbal Horev, Florian Yger, Masashi Sugiyama [abstract]
In many real-world applications data exhibits non-stationarity, i.e., its distribution changes over time.
One approach to handling non-stationarity is to remove or minimize it before attempting to analyze the data.
In the context of brain computer interface (BCI) data analysis this is sometimes achieved using stationary subspace analysis (SSA).
The classic SSA method finds a matrix that projects the data onto a stationary subspace by optimizing a cost function based on a matrix divergence.
In this work we present an alternative method for SSA based on a symmetrized version of this matrix divergence.
We show that this frames the problem in terms of distances between symmetric positive definite (SPD) matrices, suggesting a geometric interpretation of the problem.
Stemming from this geometric viewpoint, we introduce and analyze a method which utilizes the geometry of the SPD matrix manifold and the invariance properties of its metrics.
Most notably we show that these invariances alleviate the need to whiten the input matrices, a common step in many SSA methods which often introduces error.
We demonstrate the usefulness of our technique in experiments on both synthetic and real-world data.
• Hierarchical Probabilistic Matrix Factorization with Network Topology for Multi-relational Social Network - Haoli Bai, Zenglin Xu, Bin Liu, Yingming Li [abstract]
Link prediction in multi-relational social networks has attracted much attention. For instance, we may care the chance of two users being friends based on their contacts of other patterns, e.g., SMS and phone calls. In previous work, matrix factorization models are typically applied in single-relational networks; however, two challenges arise to extend it into multi-relational networks. First, the interaction of different relation types is hard to be captured. The second is the cold start problem, as the prediction of new entities in multi-relational networks becomes even more challenging. In this article we propose a novel method called Hierarchical Probabilistic Matrix Factorization with Network Topology (HPMFNT). Our model exploits the network topology by extending the Katz index into multi-relational settings, which could efficiently model the multidimensional interplay via the auxiliary information from other relationships. We also utilize the extended Katz index along with entitiy attributes to solve the cold-start problem. Experiments on two real world datasets have shown that our model outperforms the state-of-the-art with a significant margin.
• Improving Distributed Word Representation and Topic Model by Word-Topic Mixture Model - Xianghua Fu, Ting Wang, Jing Li, Chong Yu, Wangwang Liu [abstract]
We propose a Word-Topic Mixture(WTM) model to improve word representation and topic model simultaneously. Firstly, it introduces the initial external word embeddings into the Topical Word Embeddings(TWE) model based on Latent Dirichlet Allocation(LDA) model to learn word embeddings and topic vectors. Then the results learned from TWE are integrated in the LDA by defining the probability distribution of topic vectors-word embeddings according to the idea of latent feature model with LDA (LFLDA), meanwhile minimizing the KL divergence of the new topic-word distribution function and the original one. The experimental results prove that the WTM model performs better on word representation and topic detection compared with some state-of-the-art models.
• Learnability of Non-I.I.D. - Wei Gao, Xin-Yi Niu, Zhi-Hua Zhou [abstract]
Learnability has always been one of the most central problems in learning theory. Most previous studies on this issue were based on the assumption that the samples are drawn independently and identically according to an underlying (unknown) distribution. The i.i.d. assumption, however, does not hold in many real applications. In this paper, we study the learnability of problems where the samples are drawn from empirical process of stationary $\beta$-mixing sequence, which has been a widely-used assumption implying a dependence weaken over time in training samples. By utilizing the independent blocks technique, we provide a sufficient and necessary condition for learnability, that is, average stability is equivalent to learnability with AERM (Asymptotic Empirical Risk Minimization) in the non-i.i.d. learning setting. In addition, we also discuss the generalization error when the test variable is dependent on the training sample.
• Learning Distance Metrics for Multi-Label Classification - Henry Gouk, Bernhard Pfahringer, Michael Cree [abstract]
Distance metric learning is a well studied problem in the field of machine learning, where it is typically used to improve the accuracy of instance based learning techniques. In this paper we propose a distance metric learning algorithm that is specialised for multi-label classification tasks, rather than the multiclass setting considered by most work in this area. The method trains an embedder that can transform instances into a feature space where Euclidean distance provides an estimate of the Jaccard distance between the corresponding label vectors. In addition to a linear Mahalanobis style metric, we also present a nonlinear extension that provides a substantial boost in performance. We show that this technique significantly improves upon current approaches for instance based multi-label classification, and also enables interesting data visualisations.
• Learning Feature Aware Metric - Han-Jia Ye, De-Chuan Zhan, Xue-Min Si, Yuan Jiang [abstract]
Distance Metric Learning (DML) aims to find a distance metric, revealing feature relationship and satisfying restrictions between instances, for distance based classifiers, e.g., kNN. Most DML methods take all features into consideration while leaving the feature importance identification untouched. Feature selection methods, on the other hand, only focus on feature weights and are seldom directly designed for distance based classifiers. In this paper, we propose a Feature AwaRe Metric learning (FARM) method which not only learns the appropriate metric for distance constraints but also discovers significant features and their relationships. In FARM approach, we treat a distance metric as a combination of feature weighting and feature relationship discovering factors. Therefore, by decoupling the metric into two parts, it facilitates flexible regularizations for feature importance selection as well as feature relationship constructing. Simulations on artificial datasets clearly reveal the comprehensiveness of feature weighting for FARM. Experiments on real datasets validate the improvement of classification performance and the efficiency of our FARM approach.
• Learning from Survey Training Samples: Rate Bounds for Horvitz-Thompson Risk Minimizers - Stephan Clemencon, Patrice Bertail, Guillaume Papa [abstract]
The generalization ability of minimizers of the empirical risk in the context of binary classification has been investigated under a wide variety of complexity assumptions for the collection of classifiers over which optimization is performed. In contrast, the vast majority of the works dedicated to this issue stipulate that the training dataset used to compute the empirical risk functional is composed of i.i.d. observations and involve sharp control of uniform deviation of i.i.d. averages from their expectation. Beyond the cases where training data are drawn uniformly without replacement among a large i.i.d. sample or modelled as a realization of a weakly dependent sequence of r.v.'s, statistical guarantees when the data used to train a classifier are drawn by means of a more general sampling/survey scheme and exhibit a complex dependence structure have not been documented in the literature yet. It is the main purpose of this paper to show that the theory of empirical risk minimization can
be extended to situations where statistical learning is based on survey samples and knowledge of the related (first order) inclusion probabilities. Precisely, we prove that minimizing a (possibly biased) weighted version of the empirical risk, refered to as the (approximate) Horvitz-Thompson risk (HT risk), over a class of controlled complexity lead to a rate for the excess risk of the order $O_{\mathbb{P}}((\kappa_N (\log N)/n)^{1/2})$ with $\kappa_N=(n/N)/\min_{i\leq N}\pi_i$, when data are sampled by means of a rejective scheme of (deterministic) size $n$ within a statistical population of cardinality $N\geq n$, a generalization of basic {\it sampling without replacement} with unequal probability weights $\pi_i > 0$. Extension to other sampling schemes are then established by a coupling argument. Beyond theoretical results, numerical experiments are displayed in order to show the relevance of HT risk minimization and that ignoring the sampling scheme used to generate the training dataset may completely jeopardize the learning procedure.
• Linearized Alternating Direction Method of Multipliers for Constrained Nonconvex Regularized Optimization - Linbo Qiao, Bofeng Zhang, Jinshu Su, Xicheng Lu [abstract]
In this paper, we consider a class of constrained nonconvex regularized minimization problems, where the constraints is linearly constrained. It was reported in the literature that nonconvex regularization usually yields a solution with more desirable sparse structural properties beyond convex ones. However, it is not easy to obtain the proximal mapping associated with nonconvex regularization, due to the imposed linearly constraints. In this paper, the optimization problem with linear constraints is solved by the Linearized Alternating Direction Method of Multipliers (LADMM). Moreover, we present a detailed convergence analysis of the LADMM algorithm for solving nonconvex compositely regularized optimization with a large class of nonconvex penalties. Experimental results on several real-world datasets validate the efficacy of the proposed algorithm.
• Localized Multiple Kernel Learning---A Convex Approach - Yunwen Lei, Alexander Binder, Urun Dogan, Marius Kloft [abstract]
We propose a localized approach to multiple kernel learning that can be formulated as a convex optimization problem over a given cluster structure. For which we obtain generalization error guarantees and derive an optimization algorithm based on the Fenchel dual representation. Experiments on real-world datasets from the application domains of computational biology and computer vision show that convex localized multiple kernel learning can achieve higher prediction accuracies than its global and non-convex local counterparts.
• Long Short-term Memory Network over Rhetorical Structure Theory for Sentence-level Sentiment Analysis - Xianghua Fu, Wangwang Liu, Yingying Xu, Chong Yu, Ting Wang [abstract]
Using deep learning models to solve sentiment analysis of sentences is still a challenging task. Long short-term memory (LSTM) network solves the gradient disappeared problem existed in recurrent neural network (RNN), but LSTM structure is linear chain-structure that can't capture text structure information. Afterwards, Tree-LSTM is proposed, which uses LSTM forget gate to skip sub-trees that have little effect on the results to get good performance. It illustrates that the chain-structured LSTM more strongly depends on text structure. However, Tree-LSTM can't clearly figure out which sub-trees are important and which sub-trees have little effect. We propose a simple model which uses Rhetorical Structure Theory (RST) for text parsing. By building LSTM network on RST parse structure, we make full use of LSTM structural characteristics to automatically enhance the nucleus information and filter the satellite information of text. Furthermore, this approach can make the representations concerning the relations between segments of text, which can improve text semantic representations. Experiment results show that this method not only has higher classification accuracy, but also trains quickly.
• Modelling Symbolic Music: Beyond the Piano Roll - Christian Walder [abstract]
In this paper, we consider the problem of probabilistically modelling symbolic music data. We introduce a representation which reduces polyphonic music to a univariate categorical sequence. In this way, we are able to apply state of the art natural language processing techniques, namely the long short-term memory sequence model. The representation we employ permits arbitrary rhythmic structure, which we assume to be given.
We show that our model is effective on all four piano roll based benchmark datasets. We further improve our model by augmenting our training data set with transpositions of the original pieces through all musical keys, thereby convincingly advancing the state of the art on these benchmark problems. We also fit models to music which is unconstrained in its rhythmic structure, discuss the properties of this model, and provide musical samples which are more sophisticated than previously possible with this class of recurrent neural network sequence models. We also provide our newly preprocessed data set of non piano-roll music data.
To facilitate future work we describe and provide a new carefully preprocessed dataset of 19700 classical midi music files — significantly more than previously available.
• Multiple Kernel Learning with Data Augmentation - Khanh Nguyen, Trung Le, Vu Nguyen, Tu Nguyen, Dinh Phung [abstract]
The motivations of multiple kernel learning (MKL) approach are to increase kernel expressiveness capacity and to avoid the expensive grid search over a wide spectrum of kernels. A large amount of work has been proposed to improve the MKL in terms of the computational cost and the sparsity of the solution. However, these studies still either require an expensive grid search on the model parameters or scale unsatisfactorily with the numbers of kernels and training samples. In this paper, we address these issues by conjoining MKL, Stochastic Gradient Descent (SGD) framework, and data augmentation technique. The pathway of our proposed method is developed as follows. We first develop a maximum-a-posteriori (MAP) view for MKL under a probabilistic setting and described in a graphical model. This view allows us to develop data augmentation technique to make the inference for finding the optimal parameters feasible, as opposed to traditional approach of training MKL via convex optimization techniques. As a result, we can use the standard SGD framework to learn weight matrix and extend the model to support online learning. We validate our method on several benchmark datasets in both batch and online settings. The experimental results show that our proposed method can learn the parameters in a principled way to eliminate the expensive grid search while gaining a significant computational speedup comparing with the state-of-the-art baselines.
• Multitask Principal Component Analysis - Ikko Yamane, Florian Yger, Maxime Berar, Masashi Sugiyama [abstract]
Principal Component Analysis (PCA) is a canonical and well-studied tool for dimensionality reduction. However, when few data are available, the poor quality of the covariance estimator at its core may compromise its performance. We leverage this issue by casting the PCA into a multitask framework, and doing so, we show how to solve simultaneously several related PCA problems. Hence, we propose a novel formulation of the PCA problem relying on a novel regularization. This regularization is based on a distance between subspaces, and the whole problem is solved as an optimization problem over a Riemannian manifold. We experimentally demonstrate the usefulness of our approach as pre-processing for EEG signals.
• Non-Linear Smoothed Transductive Network Embedding with Text Information - Weizheng Chen, Xia Zhang, Jinpeng Wang, Yan Zhang, Hongfei Yan, Xiaoming Li [abstract]
Network embedding is a classical task which aims to map the nodes of a network to low-dimensional vectors. Most of the previous network embedding methods are trained in an unsupervised scheme. Then the learned node embeddings can be used as inputs of many machine learning tasks such as node classification, attribute inference.
However, the discrimination validity of the node embeddings maybe improved by considering the node label information and the node attribute information.

Inspired by traditional semi-supervised learning techniques, we explore to train the node embeddings and the node classifiers simultaneously with the text attributes information in a flexible framework.
We present NLSTNE (Non-Linear Smoothed Transductive Network Embedding), a transductive network embedding method, whose embeddings are enhanced by modeling the non-linear pairwise similarity between the nodes and the non-linear relationship between the nodes and the text attributes. We use the node classification task to evaluate the quality of node embeddings learned by different models on four real-world network datasets .
The experimental results demonstrate that our model outperforms several state-of-the-art network embedding methods.
• Proper Inner Product with Mean Displacement for Gaussian Noise Invariant ICA - Liyan Song, Haiping Lu [abstract]
Independent Component Analysis (ICA) is a classical method for Blind Source Separation (BSS). In this paper, we are interested in ICA in the presence of noise, i.e., the noisy ICA problem. Pseudo-Euclidean Gradient Iteration (PEGI) is a recent cumulant-based method that defines a pseudo Euclidean inner product to replace a quasi-whitening step in Gaussian noise invariant ICA. However, PEGI has two major limitations: 1) the pseudo Euclidean inner product is improper because it violates the positive definiteness of inner product; 2) the inner product matrix is orthogonal by design but it has gross errors or imperfections due to sample-based estimation. This paper proposes a new cumulant-based ICA method named as PIMD to address these two problems. We first define a Proper Inner product (PI) with proved positive definiteness and then relax the centering preprocessing step to a mean displacement (MD) step. Both PI and MD aim to improve the orthogonality of inner product matrix and the recovery of independent components (ICs) in sample-based estimation. We adopt a gradient iteration step to find the ICs for PIMD. Experiments on both synthetic and real data show the respective effectiveness of PI and MD as well as the superiority of PIMD over competing ICA methods. Moreover, MD can improve the performance of other ICA methods as well.
• Random Fourier Features For Operator-Valued Kernels - Romain Brault, Markus Heinonen, Florence d'Alché Buc [abstract]
Devoted to multi-task learning and structured output learning, operator-valued kernels provide a flexible tool to build vector-valued functions in the context of Reproducing Kernel Hilbert Spaces. To scale up these methods, we extend the celebrated Random Fourier Feature methodology to get an approximation of operator-valued kernels. We propose a general principle for Operator-valued Random Fourier Feature construction relying on a generalization of Bochner's theorem for translation-invariant operator-valued Mercer kernels. We prove the uniform convergence of the kernel approximation for bounded and unbounded operator random Fourier features using appropriate Bernstein matrix concentration inequality. An experimental proof-of-concept shows the quality of the approximation and the efficiency of the corresponding linear models on example datasets.
• Secure Approximation Guarantee for Cryptographically Private Empirical Risk Minimization - Toshiyuki Takada, Hiroyuki Hanada, Yoshiji Yamada, Jun Sakuma, Ichiro Takeuchi [abstract]
Privacy concern has been increasingly important in many machine learning (ML) problems. We study empirical risk minimization (ERM) problems under secure multi-party computation (MPC) frameworks. Main technical tools for MPC have been developed based on cryptography. One of limitations in current cryptographically private ML is that it is computationally intractable to evaluate non-linear functions such as logarithmic functions or exponential functions. Therefore, for a class of ERM problems such as logistic regression in which non-linear function evaluations are required, one can only obtain approximate solutions. In this paper, we introduce a novel cryptographically private tool called secure approximation guarantee (SAG) method. The key property of SAG method is that, given an arbitrary approximate solution, it can provide a non-probabilistic assumption-free bound on the approximation quality under cryptographically secure computation framework. We demonstrate the beneﬁt of the SAG method by applying it to several problems including a practical privacy-preserving data analysis task on genomic and clinical information.
• Simulation and Calibration of a Fully Bayesian Marked Multidimensional Hawkes Process with Dissimilar Decays - Kar Wai Lim, Young Lee, Leif Hanlen, Hongbiao Zhao [abstract]
We propose a simulation method for multidimensional Hawkes processes based on superposition theory of point processes. This formulation allows us to design efficient simulations for Hawkes processes with differing exponentially decaying intensities. We demonstrate that inter-arrival times can be decomposed into simpler auxiliary variables that can be sampled directly, giving exact simulation with no approximation. We establish that the auxiliary variables provides information on the parent process for each event time. The algorithm correctness is shown by verifying the simulated intensities with their theoretical moments. A modular inference procedure consisting of a combination between Gibbs through the adaptive rejection sampling and Metropolis Hastings steps is presented. Finally, we compare our proposed simulation method against existing methods, and find significant improvement in terms of algorithm speed. Our inference algorithm is used to discover the strengths of mutually excitations in real dark networks.
• Unifying Topic, Sentiment & Preference in an HDP-Based Rating Regression Model for Online Reviews - Zheng Chen, Yong Zhang, Yue Shang, Xiaohua Hu [abstract]
This paper proposes a new HDP based online review rating regression model named Topic-Sentiment-Preference Regression Analysis (TSPRA). TSPRA combines topics (i.e. product as-pects), word sentiment and user preference as regression factors, and is able to perform topic clus-tering, review rating prediction, sentiment analysis and what we invent as “critical aspect” analysis altogether in one framework. TSPRA extends sentiment approaches by integrating the key concept “user preference” in collaborative filtering (CF) models into consideration, while it is distinct from current CF models by decoupling “user preference” and “sentiment” as independent factors. Our experiments conducted on 22 Amazon datasets show overwhelming better performance in rating predication against a state-of-art model FLAME (2015) in terms of error, Pearson’s Correlation and number of inverted pairs. For sentiment analysis, we compare the derived word sentiments against a public sentiment resource SenticNet3 and our sentiment estimations clearly make more sense in the context of online reviews. Last, as a result of the de-correlation of “user preference” from “sentiment”, TSPRA is able to evaluate a new concept “critical aspects”, defined as the prod-uct aspects seriously concerned by users but negatively commented in reviews. Improvement to such “critical aspects” could be most effective to enhance user experience.

## Journal Track

• A Unified Probabilistic Framework for Robust Manifold Learning and Embedding - Qi Mao, Li Wang, Ivor W. Tsang [abstract]
This paper focuses on learning a smooth skeleton structure from noisy data - an emerging topic in the fields of computer vision and computational biology. Many dimensionality reduction methods have been proposed, but none are specially designed for this purpose. To achieve this goal, we propose a unified probabilistic framework that directly models the posterior distribution of data points in an embedding space so as to suppress data noise and reveal the smooth skeleton structure. Within the proposed framework, a sparse positive similarity matrix is obtained by solving a box-constrained convex optimization problem, in which the sparsity of the matrix represents the learned neighborhood graph and the positive weights stand for the new similarity. Embedded data points are then obtained by applying the maximum a posteriori estimation to the posterior distribution expressed by the learned similarity matrix. The embedding process naturally provides a probabilistic interpretation of Laplacian eigenmap and maximum variance unfolding. Extensive experiments on various datasets demonstrate that our proposed method obtains the embedded points that accurately uncover inherent smooth skeleton structures in terms of data visualization, and the method yields superior clustering performance compared to various baselines.
• Collaborative Topic Regression for Online Recommender Systems: An Online and Bayesian Approach - Chenghao Liu, Tao Jin, Steven C.H. Hoi, Peilin Zhao, Jianling Sun [abstract]
Collaborative Topic Regression (CTR) combines ideas of probabilistic matrix factorization (PMF) and topic modeling (e.g., LDA) for recommender systems, which has gained increasing successes in many applications. Despite enjoying many advantages, the existing Batch Decoupled Inference algorithm for CTR model (bdi-CTR) has some critical limitations. First of all, it is designed to work in a batch learning manner, making them unsuitable to deal with streaming data or big data in real-world recommender systems. Second, the item-specific topic proportions of LDA are fed to the downstream PMF, but not reverse, which is sub-optimal as the rating information is not exploited in discovering the low-dimensional representation of documents and thus can result in a sub-optimal representation for prediction. In this paper, we propose a novel inference algorithm, called the Online Bayesian Inference algorithm for CTR model (obi-CTR), which is efficient and scalable for learning from data streams. Particularly, we {\it jointly} optimize the combined objective function of both PMF and LDA in an online learning fashion, in which both PMF and LDA tasks can be reinforced each other during the online learning process. Our encouraging experimental results on real-world data validate the effectiveness of the proposed method.
• Multi-view Kernel Completion - Sahely Bhadra, Samuel Kaski, Juho Rousu [abstract]
In this paper, we introduce the first method that (1) can complete kernel matrices with completely missing rows and columns as opposed to individual missing kernel values, with help of information from other incomplete kernel matrices. Moreover, (2) the method does not require any of the kernels to be complete a priori, and (3) can tackle non-linear kernels. The kernel completion is done by finding, from the set of available incomplete kernels, an appropriate set of related kernels for each missing entry.These aspects are necessary in practical applications such as integrating legacy data sets, learning under sensor failures and learning when measurements are costly for some of the views. The proposed approach predicts missing rows by modeling both within-view and between-view relationships among kernel values. For within-view learning, we propose a new kernel approximation that generalizes and improves Nyström approximation. We show, both on simulated data and real case studies, that the proposed method outperforms existing techniques in the settings where they are available, and extends applicability to new settings.
• Non-redundant Multiple Clustering by Nonnegative Matrix Factorization - Sen Yang, Lijun Zhang [abstract]
Clustering is one of the basic tasks in data mining and machine learning which aims at discovering the hidden structure of the data. For many real-world applications, there often exist many different yet meaningful clusterings while most of existing clustering methods only produce a single clustering. To address this limitation, multiple clustering, which tries to generate clusterings that are with high quality and different from each other, has emerged recently. In this paper, we propose a novel alternative clustering method that generates non-redundant multiple clusterings sequentially. The algorithm is built upon Nonnegative Matrix Factorization (NMF), and we take advantage of the nonnegative property to enforce the non-redundancy. Specifically, we design a quadratic term to measure the redundancy between the reference clustering and the new clustering, and incorporate it into the objective. The optimization problem takes on a very simple form, and can be solved efficiently by multiplicative updating rules. Experimental results demonstrate that the proposed algorithm is comparable to or outperforms the existing multiple clustering methods.
• Progressive Random k-Labelsets for Cost-Sensitive Multi-Label Classification - Hsuan-Tien Lin, Yu-Ping Wu [abstract]
In multi-label classification, an instance is associated with multiple relevant labels, and the goal is to predict these labels simultaneously. Many real-world applications of multi-label classification come with different performance evaluation criteria. It is thus important to design general multi-label classification methods that can flexibly take different criteria into account. Such methods tackle the problem of cost-sensitive multi-label classification (CSMLC). Most existing CSMLC methods either suffer from high computational complexity or focus on only certain specific criteria. In this work, we propose a novel CSMLC method, named progressive random $k$-labelsets (PRA$k$EL), to resolve the two issues above. The method is extended from a popular multi-label classification method, random $k$-labelsets, and hence inherits its efficiency. Furthermore, the proposed method can handle arbitrary example-based evaluation criteria by progressively transforming the CSMLC problem into a series of cost-sensitive multi-class classification problems. Experimental results demonstrate that PRA$k$EL is competitive with existing methods under the specific criteria they can optimize, and is superior under several popular criteria.