Abstract: There is no standard definition of outliers, but most authors agree that outliers are points far from other data points. Several outlier detection techniques have been developed, mainly for two different purposes. On the one hand, outliers are considered erroneous measurements that should be removed from the analysis, as in robust statistics. On the other hand, outliers are the interesting observations, as in fraud detection, and should be modelled by some learning method. In this work, we start from the observation that outliers are affected by the so-called Simpson's paradox: a trend that appears in different groups of data but disappears or reverses when these groups are combined. Given a data set, we learn a regression tree. The tree grows by partitioning the data into groups that are increasingly homogeneous with respect to the target variable. At each partition defined by the tree, we apply a box-plot rule to the target variable to detect outliers. We would expect the deeper nodes of the tree to contain fewer and fewer outliers. We observe that some points previously flagged as outliers are no longer flagged as such, while new outliers appear.
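The group effect described above can be sketched with the standard box-plot (Tukey's fences) rule; the data, the two-group partition, and all function names below are illustrative assumptions rather than the paper's setup (a real run would use the partitions induced by the learned regression tree):

```python
# Illustrative sketch: a point missed by the box-plot rule on the combined
# data becomes an outlier once the data are split into homogeneous groups.

def quantile(xs, p):
    """Linear-interpolation quantile (the same method as numpy's default)."""
    xs = sorted(xs)
    h = (len(xs) - 1) * p
    lo = int(h)
    frac = h - lo
    if lo + 1 < len(xs):
        return xs[lo] + frac * (xs[lo + 1] - xs[lo])
    return xs[lo]

def boxplot_outliers(xs, k=1.5):
    """Return values outside [Q1 - k*IQR, Q3 + k*IQR] (Tukey's fences)."""
    q1, q3 = quantile(xs, 0.25), quantile(xs, 0.75)
    iqr = q3 - q1
    return [x for x in xs if x < q1 - k * iqr or x > q3 + k * iqr]

# Two groups a regression tree might separate at its first split:
group_a = [10, 11, 12, 13, 14, 15, 25]      # 25 is far from its own group
group_b = [100, 101, 102, 103, 104, 105]

print(boxplot_outliers(group_a + group_b))  # combined data: prints []
print(boxplot_outliers(group_a))            # within group A: prints [25]
```

On the combined data the inter-group spread inflates the IQR, so the fences are wide and nothing is flagged; within the homogeneous group the same point is clearly anomalous.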
Abstract: Learning the structure of a Bayesian network (BN) from observational data has been proved to be an NP-hard problem. Expert knowledge is beneficial for determining the BN structure, especially when the data are scarce and the researched domain contains many related variables. In this paper, we propose a new BN structure learning method that integrates expert knowledge. On the one hand, to improve the use of expert knowledge, the intuitionistic fuzzy set (IFS) is introduced to express and integrate it, and the determination of the prior BN structure is transformed into a group decision-making problem. On the other hand, an improved Bayesian information criterion (BIC) score function and a Genetic Algorithm search are used to obtain the most suitable structure under the constraints imposed by the prior structure. Experiments demonstrate the validity of the proposed scheme and compare its performance with existing research results. The obtained BN structure performs better, and the more expert knowledge is available, the better the learned structure becomes. Finally, the proposed method is applied to the thickening process of gold hydrometallurgy to solve a practical problem.
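For concreteness, the standard (unimproved) BIC score for a single node of a discrete BN can be sketched as below; the paper's improved variant and the IFS-based prior integration are not reproduced here, and all names and toy data are illustrative assumptions:

```python
# Minimal sketch of the decomposable BIC score for one node of a discrete BN:
# maximum-likelihood log-likelihood minus a complexity penalty. A search
# procedure (e.g. a genetic algorithm) would sum this over all nodes of a
# candidate structure.
from collections import Counter
from math import log

def bic_node(data, node, parents, arity):
    """BIC contribution of `node` given its `parents`.

    data:  list of dicts mapping variable name -> observed discrete value
    arity: dict mapping variable name -> number of states
    """
    n = len(data)
    joint = Counter((tuple(row[p] for p in parents), row[node]) for row in data)
    marg = Counter(tuple(row[p] for p in parents) for row in data)
    # Log-likelihood under maximum-likelihood conditional probabilities.
    ll = sum(c * log(c / marg[pa]) for (pa, _), c in joint.items())
    # Free parameters: (states(node) - 1) * product of parent arities.
    q = 1
    for p in parents:
        q *= arity[p]
    k = (arity[node] - 1) * q
    return ll - 0.5 * log(n) * k
```

On perfectly correlated toy data, scoring X with its true parent P yields a higher BIC than scoring X with no parents, which is what lets a score-based search recover the edge.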
Abstract: We propose an efficient, approximate algorithm to solve the problem of finding frequent subgraphs in large streaming graphs. The graph stream is treated as batches of labeled nodes and edges. Our proposed algorithm finds the set of frequent subgraphs as the graph evolves after each batch. The computational complexity is kept within linear bounds by looking only at the changes made by the most recent batch and at the historical set of frequent subgraphs. As part of our approach, we also propose a novel sampling algorithm that samples regions of the graph changed by the most recent update. The performance of the proposed approach is evaluated using five large graph datasets, and our approach is shown to be faster than state-of-the-art large-graph miners while maintaining their accuracy. We also compare our sampling algorithm against a well-known sampling algorithm for network motif mining, and show that ours is faster and capable of discovering more types of patterns. We provide theoretical guarantees of our algorithm's accuracy using the well-known Chernoff bounds, as well as an analysis of the computational complexity of our approach.
Abstract: The rapid growth in web development has transformed today's communication. The combination of features and corresponding sentiment words (SWs) can help produce accurate, meaningful, and high-quality sentiment analysis (SA) results. Some basic matters in the study of SA must be understood, namely the objects or entities that form a key part of the discussion, the characteristics or features of the object, the SWs, and the connection between the features of the object and the SWs. Failure to identify these basic matters can reduce the accuracy and meaning of the SA results. The main objective of this review is to offer an overview of the role and techniques of feature selection (FS), SW detection, and the identification of the relationship between features and SWs. The main contributions of this review are its detailed categorisations of a large number of recent articles on FS techniques and SW detection. It also highlights recent trends in SA research. This review also examines the metaheuristic approach as an FS technique in SA, identifies the strengths and weaknesses of existing FS techniques, and analyses the potential of the metaheuristic approach for solving the problems that arise in feature selection for SA.
Abstract: Data clustering is one of the most important tasks in machine learning and data mining; it aims to discover the natural structure of the data, identify relationships between observations inside data sets, or detect outliers. Clustering is traditionally seen as part of unsupervised learning, but in many situations side information about the clusters may be available in addition to the values of the features. For example, the cluster labels of some observations may be known (these are called seeds), or certain observations may be known to belong (or not) to the same cluster (pairwise constraints). Clustering algorithms using such information are called semi-supervised algorithms. Although many semi-supervised clustering algorithms have been presented in the literature over the last decades, each of them usually uses only one kind of side information. In this work, we propose MCSSDBS, a new semi-supervised density-based clustering method that effectively integrates both kinds of side information and embeds an active learning strategy in the process of finding clusters. To evaluate the proposed method and demonstrate its effectiveness against a state-of-the-art semi-supervised density-based clustering algorithm (SSDBSCAN), a series of experiments is carried out on both synthetic and real-world data sets. Experiments are first conducted on six data sets from the UCI repository. Then, for the facial expression recognition task in particular, tests are performed on two facial data sets: a popular one from the literature, the Extended Cohn-Kanade data set (CK+), and our own new facial data set collected from volunteers in Vietnam, named the ITI facial expression data set. Comparative results show that our method can boost the performance of the clustering process.
Abstract: Many machine learning and pattern recognition tasks, such as classification, involve datasets with a large number of features. Feature selection aims at eliminating the redundant and irrelevant features that bring computational burden and degrade the performance of learning algorithms. Particle swarm optimization (PSO) has been widely used in feature selection due to its global search ability and computational efficiency. However, PSO was originally designed for continuous optimization problems, and the discretization of PSO for feature selection is still a problem that needs further investigation. This paper develops a novel feature selection algorithm based on a set-based discrete PSO (SPSO). SPSO employs a set-based encoding scheme that allows it to characterize the discrete search space of the feature selection problem. It also redefines the velocity term and the corresponding arithmetic operators, enabling it to search for the optimal feature subset in the discrete space. In addition, a novel feature subset evaluation criterion based on contribution rate is proposed as the fitness function in SPSO. The proposed criterion needs no pre-determined parameter to balance the relevance and redundancy of the feature subset. The proposed method is compared with six filter approaches and four wrapper approaches on ten well-known UCI datasets, and the experimental results demonstrate that it is promising.
Abstract: The emergence of Big Data has had a profound impact on how data are analyzed. Open-source distributed stream processing platforms have gained popularity for analyzing streaming Big Data, as they provide the low latency required by streaming Big Data applications using cluster resources. However, existing resource schedulers still lack the efficiency that Big Data analytical applications require. Recent works have considered the characteristics of streaming Big Data to improve scheduling efficiency in these platforms, but they have not taken into account the specific attributes of analytical applications. This study therefore presents Bframework, an efficient resource scheduling framework for streaming Big Data analysis applications running on cluster resources. Bframework proposes a query model using Directed Graphs (DGs) and introduces operator assignment and operator scheduling algorithms based on a novel partitioning algorithm. Bframework is highly adaptable to fluctuations in the streaming Big Data and to the availability of cluster resources. Experiments with benchmark and well-known real-world queries show that Bframework can significantly reduce the latency of streaming Big Data analysis queries, by up to about 65%.
Abstract: Association Rule Mining (ARM) is a fundamental data mining task that is time-consuming on big datasets, so developing new scalable algorithms for this problem is desirable. Recently, Bee Swarm Optimization (BSO)-based meta-heuristics were shown to be effective in reducing the time required for ARM, but these approaches were applied only to small or medium-scale databases. To perform ARM on big databases, a promising approach is to design parallel algorithms using the massively parallel threads of a GPU. While some GPU-based ARM algorithms have been developed, they only benefit from GPU parallelism during the evaluation step of the solutions obtained by the BSO meta-heuristic. This paper improves on this approach by also parallelizing the other steps of the BSO process (diversification and intensification). Based on these ideas, three novel algorithms are presented: i) DRGPU (Determination of Regions on GPU), ii) SAGPU (Search Area on GPU), and iii) ALLGPU (All steps on GPU). These solutions are analyzed and empirically compared on benchmark datasets. Experimental results show that ALLGPU outperforms the other approaches in terms of speed-up. Moreover, the results confirm that ALLGPU outperforms state-of-the-art GPU-based ARM approaches on big ARM databases such as the Webdocs dataset. Furthermore, ALLGPU is extended to mine big frequent graphs, and the results demonstrate its superiority over the state-of-the-art D-Mine algorithm for frequent graph mining on the large Pokec social network dataset.
Abstract: Inner ear balance problems are common worldwide and are often difficult to diagnose. In this study, we examine the classification of patients with inner ear balance problems versus controls (people not suffering from such problems) based on data derived from stabilogram signals, using machine learning algorithms. This paper is a continuation of our earlier paper, in which the same dataset was used with a medically oriented focus. Our dataset consists of stabilogram (force platform response) data from 30 patients suffering from Meniere's disease and 30 students serving as controls. We select a wide variety of machine learning algorithms, from traditional baseline methods to state-of-the-art methods such as Least-Squares Support Vector Machines and Random Forests. We perform extensive and careful parameter searches and achieve 88.3% accuracy using a k-nearest neighbor classifier. Our results show that machine learning algorithms are well capable of separating patients from controls.
Abstract: In sentiment analysis, the high dimensionality of the feature vector is a key problem because it can decrease the accuracy of sentiment classification and make it difficult to obtain the optimum subset of features. To solve this problem, this study proposes a new text feature selection method that uses a wrapper approach integrated with ant colony optimization (ACO) to guide the feature selection process. It uses the k-nearest neighbour (KNN) classifier to evaluate and generate candidate subsets of optimum features. To test the optimum feature subset, dependency relations were used to find the relationship between features and sentiment words in customer reviews. The feature subset derived by the proposed ACO-KNN algorithm was used as input to identify and extract sentiment words from sentences in customer reviews. The resulting relationship between features and sentiment words was then evaluated in terms of precision, recall, and F-score. The performance of the proposed ACO-KNN algorithm on customer review datasets was evaluated and compared with that of two hybrid algorithms from the literature, namely the genetic algorithm with information gain and information gain with rough set attribute reduction. The experiments showed that the proposed ACO-KNN algorithm is able to obtain the optimum subset of features and can improve the accuracy of sentiment classification.
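The wrapper evaluation at the core of such an approach can be sketched as follows: a candidate feature subset is scored by the leave-one-out accuracy of a nearest-neighbour classifier restricted to those features. The ACO machinery (pheromone trails, path construction) is omitted, and the toy data, names, and choice of 1-NN are illustrative assumptions:

```python
# Sketch of a KNN-based wrapper fitness function for feature selection:
# higher leave-one-out accuracy on a feature subset means a fitter subset.

def knn_fitness(X, y, subset):
    """Leave-one-out 1-NN accuracy using only the features in `subset`."""
    def dist(a, b):
        return sum((a[f] - b[f]) ** 2 for f in subset)
    correct = 0
    for i in range(len(X)):
        nearest = min((j for j in range(len(X)) if j != i),
                      key=lambda j: dist(X[i], X[j]))
        correct += y[nearest] == y[i]
    return correct / len(X)

# Toy data: feature 0 separates the classes, feature 1 is noise.
X = [[0.0, 5], [0.1, 9], [1.0, 4], [1.1, 8]]
y = [0, 0, 1, 1]
print(knn_fitness(X, y, [0]))  # informative feature: prints 1.0
print(knn_fitness(X, y, [1]))  # noisy feature: prints 0.0
```

In a full wrapper, an optimizer such as ACO would repeatedly propose subsets, use this fitness to compare them, and reinforce features appearing in the best ones.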
Abstract: Sentiment analysis and opinion mining is an area that has experienced considerable growth over the last decade. This area of research attempts to determine the feelings, opinions, and emotions, among other things, of people about something or someone. To do this, natural language processing techniques and machine learning algorithms are used. This article discusses the problem of extracting sentiment and opinions from a collection of reviews of scientific articles submitted to an international computing conference in northern Chile. The first aim of this analysis is to automatically determine the orientation of a review and contrast it with the assessment made by the reviewer of the article. This would allow scientists to characterize and compare reviews crosswise and to support the overall assessment of a scientific article more objectively. A hybrid approach that combines an unsupervised machine learning algorithm with natural language processing techniques is proposed to analyze reviews. This method uses part-of-speech (POS) tagging to obtain the syntactic structure of a sentence. This syntactic structure, along with the use of dictionaries, allows the semantic orientation of the review to be determined through a scoring algorithm. A set of experiments was conducted to evaluate the capability and performance of the proposed approaches relative to a baseline, using standard metrics such as accuracy, precision, recall, and the F1-score. The results show improvements for binary, ternary, and 5-point scale classification relative to classical machine learning algorithms such as SVM and NB, but they also highlight the challenge of improving multiclass classification in this domain.
Abstract: Feature selection is a common solution for microarray analysis. Previous approaches either select features based on classical statistical tests that can be tuned with a classifier, or use regularization penalties incorporated in the cost function. Here we propose instead a feature ranking and weighting scheme that combines statistical techniques with a weighted k-NN classifier using a modified forward selection procedure. We demonstrate that the classification accuracy of our proposal outperforms existing methods on a range of public microarray gene expression datasets. The proposed method is also compared to state-of-the-art feature selection algorithms by means of the Friedman test. Although many feature selection techniques have been used for genomic data, the experimental results show the classification superiority of our method on most of the gene expression datasets considered.
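A generic forward selection loop, as used in such schemes, can be sketched as below: at each step the feature whose addition most improves a scoring function is added, and the procedure stops when no remaining feature improves the current score. The scorer (a weighted k-NN accuracy in the paper) is passed in as a function; everything here is an illustrative assumption, not the paper's modified procedure:

```python
# Greedy forward selection driven by an arbitrary subset-scoring function.

def forward_select(features, score):
    """Return (selected_features, best_score) for greedy forward selection."""
    selected, best = [], score([])
    while True:
        candidates = [f for f in features if f not in selected]
        if not candidates:
            break
        gains = {f: score(selected + [f]) for f in candidates}
        f_best = max(gains, key=gains.get)
        if gains[f_best] <= best:
            break                      # no feature improves the score: stop
        selected.append(f_best)
        best = gains[f_best]
    return selected, best

# Toy scorer: features 'a' and 'b' are useful, every feature costs 0.1.
score = lambda s: len(set(s) & {'a', 'b'}) - 0.1 * len(s)
print(forward_select(['a', 'b', 'c'], score))  # selects ['a', 'b']
```

The stopping rule makes the procedure a filter against redundant features: 'c' adds cost without benefit, so it is never selected.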
Abstract: A number of graph-parallel computing abstractions have been proposed to address the needs of complex, large-scale graph computation. However, unnecessary and excessive communication and state sharing between nodes in these frameworks not only reduce network efficiency but may also degrade runtime performance. In this paper, we propose a mechanism called LightGraph, which reduces the synchronizing communication overhead in distributed graph-parallel computing abstractions. Besides identifying and eliminating redundant synchronizing communications in existing systems, LightGraph also proposes an edge-direction-aware graph partitioning strategy to minimize the synchronizing communications that remain. This partitioning strategy optimally isolates the outgoing edges of a vertex from its incoming edges. We have conducted extensive experiments using real-world data, and the results verify the effectiveness of LightGraph. For example, compared to PowerGraph, LightGraph reduces the synchronizing communication overhead of intra-graph synchronizations by up to 31.5% and cuts the runtime of PageRank on the LiveJournal dataset by up to 16.3%.
Abstract: Reddit is a popular social media website where users submit content such as direct links and text posts to forums called subreddits. On average, around 500 new subreddits are created per day. Because of this vast and growing number, it is difficult for users to discover and familiarize themselves with all existing communities before submitting. In this paper, we propose new feature sets for an online community: the text-post ratio, the average length of post text, and domain-specific features. A community recommendation framework is designed and evaluated on a Reddit dataset. The framework identifies and collects textual communities by finding their representatives using the DBSCAN clustering algorithm; a logistic regression model is then applied to recommend a list of communities with high content similarity to a given post. Comprehensive experimental evaluations on the Reddit dataset reveal that the proposed framework achieves a high precision of 90%.
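The content-similarity step of such a recommender can be sketched as follows: communities and an incoming post are represented as bag-of-words vectors, and communities are ranked by cosine similarity to the post. The DBSCAN and logistic-regression stages of the framework are not reproduced here, and the names and toy data are illustrative assumptions:

```python
# Sketch: rank candidate communities by cosine similarity of word-count
# vectors between the post and each community's representative text.
from collections import Counter
from math import sqrt

def cosine(a, b):
    """Cosine similarity between two Counter word-count vectors."""
    dot = sum(a[w] * b[w] for w in a)
    na = sqrt(sum(v * v for v in a.values()))
    nb = sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def rank_communities(post, communities):
    """Return community names sorted by similarity to the post text."""
    pv = Counter(post.lower().split())
    scored = {name: cosine(pv, Counter(text.lower().split()))
              for name, text in communities.items()}
    return sorted(scored, key=scored.get, reverse=True)

communities = {'python': 'python code programming language',
               'cooking': 'recipe food cooking kitchen'}
print(rank_communities('help with python programming', communities))
```

A production system would use richer representations (TF-IDF, the proposed text-post ratio and domain features) and a trained classifier to score the match rather than raw cosine similarity.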
Abstract: In multi-label classification settings, one of the most common problems is the massive label output space. To alleviate this, some methods exploit label correlations to reduce the output space during prediction. However, these methods sacrifice efficiency or ignore global label correlations. In addition, label imbalance is another problem that is prevalent in multi-label classification. Current methods of correcting for imbalance often use single-label techniques, which fail to consider label correlations. In this paper, we introduce general frameworks that incorporate topic modeling to seamlessly address both problems. We show that these frameworks allow even the most naive methods, such as Binary Relevance, to perform similarly to state-of-the-art methods. Furthermore, we show that our frameworks can also adapt state-of-the-art methods to perform better than they do on their own.
Abstract: This research focuses on resource assignment in cooperative-energy heterogeneous systems with non-orthogonal multiple access, in which cells are powered by a common grid network and alternative energy resources, and all base stations can cover a group of subscribers simultaneously in a specific frequency band. To account for the local limitations of alternative energy resources, it was assumed that the alternative energy would be shared among the base stations by the dynamic grid network. In this architecture, resource allocation and user association frameworks must be reconfigured, because conventional schemes use orthogonal multiple access. Hence, this paper suggests a novel approach that jointly optimizes power allocation and user association to achieve maximum energy efficiency for the whole system, with the quality-of-experience parameters assumed to be bounded during multi-cell multicast sessions. The solution to the introduced problem in a scenario with fixed transmission power is an improved decentralized algorithm that supplies an effective user association framework. The model has been extended to joint multi-layered resource control and user association that can distinguish the service pattern in cooperative-energy heterogeneous systems with non-orthogonal multiple access, obtaining greater resource optimality than current approaches. The effectiveness of the suggested approach is confirmed by numerical results, which also reveal that non-orthogonal multiple access can provide greater energy efficiency than orthogonal multiple access in heterogeneous wireless networks.
Abstract: Short-term traffic flow prediction is a crucial component of transportation management and deployment. In this paper, a novel regression framework for short-term traffic flow prediction with automatic parameter tuning is proposed, with Support Vector Regression (SVR) as the primary regression model and Bayesian optimization as the method for parameter selection. First, the raw traffic flow data are preprocessed by seasonal differencing to eliminate the non-stationarity of the data. Then, the SVR model is trained on the preprocessed data. To optimize the model parameters, the generalization performance of the SVR is modeled as a sample from a Gaussian process (GP). Bayesian optimization determines the parameter configuration of the regression model by optimizing an acquisition function over the GP. Finally, the optimal short-term traffic flow regression model is constructed through repeated GP updates and iterative retraining of the model. Experimental results show that the accuracy of the proposed method is superior to that of classical SARIMA, MLP-NN, ERT, and AdaBoost methods.
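The seasonal-differencing preprocessing step can be sketched as below: each observation has the value one season earlier subtracted from it, which removes a repeating seasonal component, and the inverse transform restores predictions to the original scale. The season length `s` is an assumption chosen by the modeler (e.g. one day of 5-minute counts would give s = 288):

```python
# Seasonal differencing and its inverse, as used to de-seasonalize a series
# before fitting a regression model and to map predictions back afterwards.

def seasonal_difference(series, s):
    """y'_t = y_t - y_{t-s}; the first s values have no predecessor."""
    return [series[t] - series[t - s] for t in range(s, len(series))]

def invert_difference(diffed, history, s):
    """Recover original-scale values from differences and one season of history."""
    out = list(history[-s:])
    for d in diffed:
        out.append(out[-s] + d)
    return out[s:]

# A perfectly periodic series differences to all zeros:
series = [1, 2, 3, 4] * 3
print(seasonal_difference(series, 4))  # prints [0, 0, 0, 0, 0, 0, 0, 0]
```

The round trip is lossless: applying `invert_difference` to the differenced values with the first season as history reproduces the remainder of the series exactly.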
Abstract: Empirical evidence suggests that ensembles with adequate levels of pairwise diversity among a set of accurate member algorithms can significantly outperform any of the individual algorithms. As a result, several diversity measures have been developed for use in optimizing ensembles. We show, however, that there is a natural tension between the pairwise diversity of ensemble members and their individual accuracy. While efficient ensembles can be built with stronger forms of diversity, they also suffer in overall accuracy. On the other hand, ensembles built with weaker forms of diversity can be very accurate, but tend to be significantly more computationally expensive. We discuss these findings in light of the notion of diversity space.
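One widely used pairwise diversity measure, the disagreement measure, illustrates what such measures compute: for two classifiers it is the fraction of instances on which exactly one of them is correct, and ensemble diversity is the average over all pairs. This is a generic sketch with toy predictions, not necessarily the specific measure the paper analyzes:

```python
# Pairwise disagreement measure and its average over an ensemble.
from itertools import combinations

def disagreement(pred_a, pred_b, truth):
    """Fraction of instances where exactly one of the two predictors is right."""
    diff = sum((a == t) != (b == t) for a, b, t in zip(pred_a, pred_b, truth))
    return diff / len(truth)

def ensemble_diversity(preds, truth):
    """Mean pairwise disagreement over all classifier pairs."""
    pairs = list(combinations(preds, 2))
    return sum(disagreement(a, b, truth) for a, b in pairs) / len(pairs)

truth = [1, 1, 0, 0]
p1 = [1, 1, 0, 0]   # always right
p2 = [0, 0, 1, 1]   # always wrong
print(disagreement(p1, p2, truth))  # maximal diversity: prints 1.0
```

The tension described above is visible even here: p2 maximizes diversity with p1 precisely because it is individually useless, whereas a copy of p1 would be maximally accurate but contribute zero diversity.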