### Supported by the National Science Foundation (NSF) – IIS-0546155

Duration: 5 years, starting date 02/15/2006

### Principal Investigator

Department of Computer and Information Sciences

Temple University

303 Wachman Hall

1805 N. Broad Street

Philadelphia PA 19122

215-204-5774

### Project Summary

In the environment where new large-scale problems are emerging in various disciplines and pervasive computing applications are becoming more common, there is an urgent need for techniques that provide efficient and accurate knowledge discovery by limited-capacity computing devices. The objective of this project is to address this need by developing memory-constrained predictive data mining algorithms that operate when data size exceeds the available memory capacity. The approach is based on the integration of data mining and data compression techniques to optimally utilize the memory for data and model storage, learning, and ancillary operations. The methodology will be thoroughly evaluated on a range of real-life problems that includes learning from large sorted databases, biased data and nonstationary data. Various memory constraints will be considered pertaining to devices ranging from powerful workstations to handheld computers and cell phones to small, inexpensive sensors. This research will reveal the memory lower bounds for accurate learning from different types of data and by different types of learning algorithms. The educational component of the project seeks to integrate research into computer science instruction by designing exciting courses, exploring effective teaching techniques, introducing research to undergraduate and graduate students, and involving underrepresented student groups in research. Broader impacts of the project will be in extending the frontiers of computer and information science and in facilitating knowledge discovery in various scientific, engineering, and business disciplines. Teaching materials and research results, including developed software and databases, will be widely disseminated to promote learning and enhance scientific understanding.

### List of Supported Students and Staff

- Ana Milicic, Ph.D. student
- Nemanja Djuric, Ph.D. student
- Mihajlo Grbovic, Ph.D. student
- Zhuang Wang, Ph.D. student
- Vuk Malbasa, Ph.D. student
- Vladimir Coric, Ph.D. student
- Liang Lan, Ph.D. student
- Hao Sun, M.S. student
- Bencella Wisner, undergraduate student
- Brian Stempin, undergraduate student (REU support)
- Yu Liang, undergraduate student (REU support)

### Other collaborators

- Prof. Dragoljub Pokrajac (Delaware State Univ.)
- Prof. Zoran Obradovic (Temple Univ.)
- Tammy Pirmann, Springfield Township High School
- Graduate students: Haidong Shi, Bo Han (Temple Univ.)
- Undergraduate students: Mikia La Rue (Delaware State Univ.)

### Keywords

Data Mining, Machine Learning, Data Compression

### Highlights

**Ms. Bencella Wisner presented results of her research as a poster titled “Analysis and Prediction of Traffic Volume in Large Transportation Systems” at the 10th Annual Philadelphia Alliance for Minority Participation (AMP) Research Symposium, where she received the ***second prize*** in the Mathematics and Computer Sciences Category.**

**Mr. Brian Stempin presented results of his research as a poster titled “Visualization and Planning of Traffic Routes Based on Predicted Travel Times” at the Temple’s 2010 Annual Future of Computing Competition, where he received the third prize in the undergraduate student category**

### Research and Education Activities

**Activities**

This project has three specific aims. The following is a summary of the major activities in this project.

**AIM 1: Reservoir Data Mining Algorithms.**

In this aim, we are addressing the memory-constrained problem by assuming that the memory can hold at most *B* examples, where *B* is much smaller than the number of training examples *N*. We were interested in prototype-based classification algorithms where the maintained *B* examples define the classification model. We developed and published several new prototype-based algorithms for memory-constrained learning.

**Tighter Perceptron with Improved Dual Use of Cached Data for Model Representation and Validation.**

Kernel perceptrons are a popular class of algorithms for online learning. They are represented by a subset of observed examples, called the support vectors, and their weights. The baseline kernel perceptron algorithm is simple – it observes a new example and, if it is misclassified by the current model, adds it to the model as a new support vector. However, kernel perceptrons often suffer from an unbounded growth in the number of support vectors with training data size. Budget kernel perceptrons address the issue of unbounded growth in computational resources by maintaining a fixed support vector budget through their removal. The existing budget algorithms typically achieve poor accuracy on noisy classification problems. We proposed the Tightest Perceptron, a new algorithm that removes support vector with minimal impact on classification accuracy. The main strength of the new algorithm is that it does not require maintenance of a separate validation data set (as needed by the previously proposed Tighter Perceptron algorithm) to estimate the accuracy. Instead, it manages to obtain highly accurate estimates of classification accuracy using exclusively the support vectors. To achieve this, a simple statistics is maintained for each support vector that estimates the class probability in its vicinity. During the validation process, the expectation of the validation hinge loss is calculated with respect to the estimated class probability.

**Twin Vector Machines for Online Learning on a Budget.**

Support Vector Machines (SVMs) are a popular class of machine learning algorithms that achieve superior accuracy on a number of practical supervised learning problems. Most often, SVM training is formulated as a quadratic programming (QP) problem and its solution typically has quadratic space and cubic time scaling with sample size. Such scaling is often unacceptable for data mining applications. Various solutions have been proposed to improve the time and space efficiency of SVMs, but they still have an unbounded growth in memory with the sample size and require multiple passes through the training data. We proposed a new algorithm, called the Twin Vector Support Vector Machine (TVM), which is able to handle arbitrarily large data streams while operating under a fixed memory budget. The basic idea of TVM is to upper bound the number of support vectors during the whole learning process. Each example kept in the memory, called the twin vector, summarizes the local data distribution. To optimally utilize the budget, twin vectors are positioned near the decision boundary, which is the most informative region of the input space. The TVM and the set of twin vectors are updated each time a new training example is observed such that the memory budget is kept constant. If the new example becomes a support vector, two twin vectors are selected and combined into a single one to maintain the budget. Finally, TVM is updated to account for the change in twin vectors such that Kuhn-Tucker conditions are satisfied on all twin vectors at any time. To accomplish this, we used the solution proposed in the online memory unbounded Incremental-Decremental SVM (IDSVM) algorithm. The resulting TVM algorithm achieves constant space and linear time scaling, and is very appropriate for online learning from large data sets using a limited memory budget. We proposed two versions of TVM – one that uses the hinge loss and another that uses the ramp loss. We also paid special attention to overfitting prevention by controlling the slack parameter of TVM.

**Learning Vector Quantization with Adaptive Prototype Addition and Removal.**

Learning Vector Quantization (LVQ) is a popular class of nearest prototype classifiers for multiclass classification. Like the nearest neighbor algorithm, LVQ is a local classification algorithm, where the classification boundaries are approximated locally. The difference is that instead of using all the points in the training dataset, LVQ uses only a set of appropriately chosen prototype vectors. This way the classification method is much more efficient, because the number of vectors that should be stored and compared with is significantly reduced. In addition, a carefully chosen prototype set can greatly increase classification accuracy on noisy problems. We proposed a modification of the LVQ algorithm that addresses problems of determining appropriate number of prototypes, sensitivity to initialization, and sensitivity to noise. The proposed algorithm allows adaptive prototype addition at potentially beneficial locations and removal of harmful or less useful prototypes. The idea is to start with an initial, small, and equal number of prototypes per each class. Then, candidate prototypes are added where they can be the most beneficial and prototypes are removed when it results in decrease in classification error. Instead of fixing the number of prototypes, the adaptive algorithm only sets a threshold B defining an upper bound on the number of LVQ prototypes. To balance the budget B, when the total number of prototypes reaches B, new prototypes can be added only when some existing ones are deleted. The prototype addition and removal steps can be easily implemented on top of the many existing LVQ algorithms.

**Learning Vector Quantization for Regression.**

LVQ is traditionally used for classification. In this project, we considered extensions of LVQ to make it suitable for regression. There are several justifications for using LVQ algorithms in regression. One is that it allows a clear mechanism for designing a prototype-based regressor on a fixed prototype budget. Another is that regression LVQ has a relatively low training cost with linear time and constant memory scaling with training size. We posed Regression LVQ as an optimization problem with two different cost functions. The first cost function leads to Regression LVQ with hard predictions and the second to LVQ with soft predictions. It is worth noting that hard predictions are suboptimal in the sense that they are a special case of soft predictions where the whole weight of decision is given to the nearest prototype. However, there are several situations where hard predictions are necessary, such as decentralized estimation and meta-learning in distributed data mining. To design an LVQ learning algorithm, we explored two optimization approaches, one being gradient-based and another expectation-maximization- (EM-) based. The gradient-based approach allows both batch and stochastic (online) updates of prototypes, while the EM-based is naturally suited for the batch mode of operation. Finally, we proposed a simplification of one of the resulting algorithms and showed that it leads to a regression variant of the popular LVQ2.1 classification algorithm. Just like the original LVQ algorithm for classification, the regression LVQ use a supervised learning procedure to learn the best prototype positions, but it also learns the best prototype target values. This results in the effective partition of the feature space, similar to the one the K-means algorithm would make.

**Online SVM Learning with Ramp Loss.**

There have been many online algorithms developed for optimization on a regularized hinge loss cost function. Similarly to their offline counterparts that use hinge-loss, their drawback is sensitivity to noise and outliers. This comes from the nature of the hinge loss cost function in which all the noisy examples become the support vectors during training. To address drawbacks of hinge loss, recently, an efficient offline algorithm based on ConCave Convex Procedure (CCCP) has been proposed for SVMs by using a regularized ramp loss cost function that prevents the noisy examples from becoming support vectors. In this work, we proposed a fast online modification of SVM^{R} for large-scale online learning tasks, called the online Ramp-Loss SVM (OnlineSVM^{R}). OnlineSVM^{R} produces the optimal solution for SVM^{R} on *t+*1 training data using the existing optimal solution on *t* previous examples and efficiently updating it given the new (*t*+1)-th example. The algorithm retains the Karush–Kuhn–Tucker (KKT) conditions on all previously observed examples using an SMO-style incremental learning and decremental unlearning approach under the CCCP framework. By only maintaining a fixed number of non-SVs closest to the decision boundary, the training time of OnlineSVM^{R} could be further improved while retaining competitive accuracy. Online learning algorithms have been combined with other machine learning scenarios (e.g. active learning and budget learning) in many practical applications. Coupled with the higher accuracy and robustness to noise, faster training time, and sparser model, OnlineSVM^{R} could be used as an efficient alternative to these algorithms. To demonstrate the applicability of OnlineSVM^{R}, we also proposed OnlineSVM^{R} with active learning. This greedy variant of OnlineSVM^{R} for active learning could be viewed as an approximate optimization process over a regularized ramp-loss cost function.

**Online Passive-Aggressive Algorithms on a Budget.**

We proposed a family of online classifiers based on the online Passive-Aggressive algorithm (PA). PA is a popular kernel-based online algorithm that has solid theoretical guarantees, is as fast as and more accurate than kernel perceptrons. Upon receiving a new example, PA updates the model such that it is similar to the current one and has a low loss on the new example. When equipped with kernel, PA suffers from the same computational problems as other kernelized online algorithms, which limits its applicability to large-scale learning problems or in the memory-constrained environments. We proposed budgeted PA algorithm where the model updating and budget maintenance are solved as a joint optimization problem. By introducing an additional constraint into the PA optimization problem, the algorithm can explicitly bound the number of support vectors by removing one of them and by properly updating the rest. The new optimization problem has a closed-from solution. Under the same underlying algorithmic structure, we developed several algorithms with different tradeoffs between accuracy and computational cost. We also proposed to replace the standard hinge loss with the recently proposed ramp loss in the optimization problem such that the resulting algorithms are more robust to the noisy data. We used an iterative method, ConCave Convex Procedure (CCCP), to solve the resulting non-convex problem. The resulting CCCP converges after a single iteration and results in a closed-from solution which is identical to its hinge loss counterpart when it is coupled with active learning.

**Multi-class Pegasos on a Budget.**

When equipped with kernel functions, online learning algorithms are susceptible to the curse of kernelization that causes unbounded growth in the model size. To address this issue, we proposed a family of budgeted online learning algorithms for multi-class classification which have constant space and time complexity per update. Our approach was based on the multi-class version of the popular Pegasos algorithm. Pegasos runs iteratively and alternates between a stochastic sub-gradient descent step and a projection step and it was theoretically shown that Pegasos converges toward the optimal solution as the number of examples grows. Although in its original, non-kernelized, form it has constant update time and constant space, when equipped with kernel Pegasos suffers from the same computational problems as other kernelized online algorithms. We proposed a multi-class version of Pegasos based on the multi-class SVM formulation. The resulting multi-class Pegasos has similar algorithmic structure as its binary version. We proposed a budgeted version of the kernelized multi-class Pegasos that achieves constant update time and constant space. This was achieved by fixing the number of support vectors through several budget maintenance strategies. We analyzed behavior of the budgeted multi-class Pegasos within the framework of online convex learning with errors in the gradients.

**Adaptive Multi-Hyperplane Machine for Nonlinear Classification**.

Support Vector Machines (SVMs) are among the most popular and successful classification algorithms. Kernel SVMs often reach state-of-the-art accuracies, but suffer from the curse of kernelization due to linear model growth with data size on noisy data. Linear SVMs have the ability to efficiently learn from truly large data, but they are applicable to a limited number of domains due to low representational power. To fill the representability and scalability gap between linear and nonlinear SVMs, we proposed the Adaptive Multi-hyperplane Machine (AMM) algorithm that accomplishes fast training and prediction and has capability to solve nonlinear classification problems. AMM model consists of a set of hyperplanes (weights), each assigned to one of the multiple classes, and predicts based on the associated class of the weight that provides the largest prediction. The number of weights is automatically determined through an iterative algorithm based on the stochastic gradient descent algorithm which is guaranteed to converge to a local optimum. Since the generalization bound decreases with the number of weights, we proposed and analyzed a weight pruning mechanism. Specifically, the SGD algorithm starts with a single zero weight assigned to each class and adds new weights as necessary. On simple problems such as linear or near-linear classification, the number of weights remains low, while it can become relatively high on more complex problems. Since the generalization error of AMM grows with the number of weights, AMM also contains a pruning mechanism that removes small weights, such that it provably results in only a minor degradation in optimization accuracy. In addition to improving generalization error, pruning is also useful because it reduces the cost of training and prediction.

**AIM 2: Reduction of Datasets and Prediction Models.**

In this aim, we are exploring how to reduce the data to maximize the effective size of the reservoir. We addressed this issue within the framework of decentralized estimation. We proposed and published a novel decentralized estimation algorithm based on learning vector quantization and compared it with previously proposed alternatives. We also developed a new algorithm for feature selection in multi-task learning from high-dimensional small-sample data sets.

**Decentralized Estimation Using Learning Vector Quantization.**

Decentralized estimation systems play an important role in various applications such as sensing of environment, energy management, home automation, inventory control, health monitoring, and food-safety monitoring. Sensors employed in these systems are low-cost devices with highly limited computational capabilities that are jointly working at monitoring the environment. In decentralized estimation, the physical quantities measured by the sensors are sent via communication channels to the fusion center where estimation of the target variable is performed. Due to the channel bandwidth and energy constrains, the sensors transmit quantized data to the fusion center. The challenge is to find quantization functions and fusion function such that the estimation error is minimized under given communication constraints. We studied a decentralized system where each sensor communicates only with the fusion center and there is no feedback to the sensors. It was assumed that a sample of size N from the underlying distribution is available for the design of the decentralized system. Our approach follows the iterative procedure of the generalized Lloyd’s algorithm: (1) the quantizer for a single sensor is determined assuming the quantizers of the remaining sensors and the fusion function are known, (2) the fusion function is determined assuming all the quantizers are known. For quantizer design we adopted the multi-prototype approach where given a fixed set of prototype vectors, each is assigned to one of the codewords, and each observation is assigned to the codeword of its nearest prototype. To determine prototype locations, we proposed a modification to the Learning Vector Quantization (LVQ) classification algorithm that updates locations of prototypes each time a data point is misclassified. Our Distortion-Sensitive LVQ treats the decoder design as a soft classification problem. It calculates squared errors of assigning i-th training observation to j-th codeword and uses them to drive the updates of prototype locations.

**Feature Selection for Transfer Learning.**

Training classifiers on small-sample and high-dimensional data sets is a challenging problem that has been receiving an increasing attention from the research community. A standard way of addressing the challenge is to perform some kind of feature selection as a preprocessing step and to follow it by applying a classification algorithm that controls model complexity through some form of regularization. When the number of labeled examples is particularly small (e.g., less than 10), the amount of available information diminishes to the level that even the most carefully designed classification approaches are bound to underperform. Recent work in multi-task learning indicates that accuracy on the target classification task can be significantly increased if data from the auxiliary tasks are consulted during learning. This can be achieved either by modifying the regularization term or by directly using auxiliary data for training. A standard approach is to perform feature selection on training data of the target task. Our hypothesis was that such single-task feature selection suffers from similar problems as single-task classification. Thus, we proposed a multi-task feature selection filter that borrows strength from the auxiliary data sets. Our major contributions were in (1) proposing a new multi-task feature selection filter, (2) demonstrating that feature selection is an important preprocessing step in conjunction with multi-task classification algorithms, and (3) demonstrating the potential and limitations of multi-task learning on microarray data.

**AIM 3: Integrating Data Mining and Compression Algorithms.**

The aim was to perform an extensive study over a number of scenarios, including different data types and different levels of memory and processing constraints. Our efforts included studies on several real-life large-scale data sets, study of multiple instance regression, and development of a truly memory-constrained algorithm, called the compressed kernel perceptron.

**Multiple Instance Regression.**

We continued our study of an important class of multiple instance learning problems that is common to remote sensing applications. There, spatially contiguous regions are being observed by remote sensing instruments and a subset of observed regions is being labeled through accurate ground-based measurements. This set of labeled regions is then used to train a predictive model to be used over unlabeled regions. The issue that arises is that each region results in a bag of instances that contain multiple observations that are all labeled the same. An interesting property of such data from the perspective of this project is that individual bags can contain thousands of examples, that number of bags can also be very large (e.g tens of thousands), and that their number increases in time. These are exactly conditions that require use of memory-constrained data mining methods. We proposed two methods for multiple instance regression where the goal is to select a small subset of the available examples and use them to train a highly accurate predictor. One is based on an iterative procedure that detects and removes the instances with the largest prediction error from each bag. Another is based on modification of the cost function for neural network training.

**Compressed Kernel Perceptron.**

Kernel machines are a popular class of machine learning algorithms that achieve state of the art accuracies on many real-life classification problems. Kernel perceptrons are among the most popular online kernel machines that are known to achieve high-quality classification despite their simplicity. They are represented by a set of B prototype examples, called support vectors, and their associated weights. To obtain a classification, a new example is compared to the support vectors. Both space to store a prediction model and time to provide a single classification scale as O(B). A problem with kernel perceptrons is that on noisy data the number of support vectors tends to grow without bounds with the number of training examples. To reduce the strain at computational resources, budget kernel perceptrons have been developed by upper bounding the number of support vectors. We proposed a new budget algorithm that upper bounds the number of bits needed to store kernel perceptron. Setting the bitlength constraint could facilitate development of hardware and software implementations of kernel perceptrons on resource-limited devices such as microcontrollers. The basis of the proposed algorithm is Random Kernel Perceptron, a budget algorithm that removes a randomly selected support vector when the budget is exceeded. The proposed compressed kernel perceptron algorithm decides on the optimal tradeoff between number of support vectors and their bit precision. Compressed kernel perceptron estimates its distortions due to removal of a support vector and reduction in bit precision of support vectors and decides on the optimal tradeoff between number of support vectors and their precision.

**Application-Specific Resource-Aware Large-Scale Data Mining.**

We worked on two applications that require efficient data mining algorithms. The first is the problem of traffic forecasting over a large metropolitan highway network where several years of historical data from thousands of traffic sensors are available for learning. We explored application of the recently proposed Continuous Conditional Random Field (CCRF) algorithm to travel forecasting. CCRF is a flexible, probabilistic framework that can seamlessly incorporate multiple traffic predictors and can exploit spatial and temporal correlations inherently present in traffic data. In addition to improving the prediction accuracy, the probabilistic approach also provides information about prediction uncertainty. Moreover, information about importance of particular predictors and spatial-temporal correlations can be easily extracted from the model. CCRF are also fault tolerant and can provide predictions even when some of the observations are missing.

The second is active learning from large unlabeled data. Active learning addresses the problem in which large amounts of unlabeled data are available at low cost but where acquiring labels is expensive. The objective of active learning is to achieve high accuracy by using the least labeling effort. We considered the pool-based active learning scenario, where labels can be queried from a large pool of unlabeled data. We proposed to use the Regularized Parzen Window Classifier as the baseline classifier because of its ease of implementation and ability to solve highly nonlinear classification problems. We developed an ensemble of RPWC to enhance the overall performance. We considered several feature selection filters because it is a critical choice for Parzen Window classifiers because it influences the measure of distance between examples. The querying strategy used in our algorithm consisted of selecting a mixture of the most uncertain examples and randomly sampled examples. Instead of pure random sampling, which might over-emphasize dense regions of the feature space, we used random sampling of unlabeled data clusters.

Spatial analysis of disease risk, or disease mapping, typically relies on information about the residence and health status of individuals from population under study. However, residence information has its limitations because people are exposed to numerous disease risks as they spend time outside of their residences. Thanks to the wide-spread use of mobile phones and GPS-enabled devices, it is becoming possible to obtain a detailed record about the movement of human populations. Availability of movement information opens up an opportunity to improve the accuracy of disease mapping. Starting with an assumption that an individual’s disease risk is a weighted average of risks at the locations which were visited, we proposed that disease mapping can be accomplished by spatially regularized logistic regression. Due to the inherent sparsity of movement data, the proposed approach can be applied to large populations and over large spatial grids.

**Random Kernel Perceptron on ATTiny2313 Microcontroller.**

Kernel perceptron is very simple and efficient online classification algorithm. However, it requires increasingly large computational resources with data stream size and is not applicable on large-scale problems or on resource-limited computational devices. To explore the practical lower bounds on memory needed for satisfactory online learning from large data we implemented kernel perceptron on ATTiny2313 microcontroller, one of the most primitive computational devices with only 128B of data memory and 2kB of program memory. ATTyny2313 is a representative of devices that are popular in embedded systems and sensor networks due to their low cost and low power consumption. Implementation on microcontrollers was possible thanks to two properties of kernel perceptrons: (1) availability of budgeted kernel perceptron algorithms that bound the model size, and (2) relatively simple calculations required to perform online learning and provide predictions. Since ATTiny2313 is the fixed-point controller that supports floating-point operations through software, which introduces significant computational overhead, we considered implementation of basic kernel perceptron operations through fixed-point arithmetic. To achieve this, we developed a new approach to approximate one of the most popular kernel functions, the RBF kernel, on fixed-point microcontrollers. Our approximation of kernel perceptron was well-suited for implementation on resource-limited devices. The new algorithm used only integers, without any need for floating-point numbers, which makes it appropriate for simple devices such as microcontrollers.

**Education and Outreach Activities.**

Activities for integration of research and education involved supervising 6 undergraduate independent studies where students have been involved in data collection, data processing, and implementation of memory-unconstrained data mining. One of them (Ms. Bencella Wisner) did her summer internship on the task of Minneapolis traffic data collection and prediction of traffic patterns. During 8 weeks of summer 2007 she was funded as a research assistant from the project. Two undergraduate students were during 2009 and 2010 from the Research Experiences for Undergraduates supplement for this project. One of them (Mr. Brian Stempin) got a third prize at the 2010 Future of Computing Competition organized at Temple University.

The 7 graduate students involved in the project coupled their research with the independent study courses the PI supervised, where they got familiar with the related work in the field. A major goal of the PI’s newly established Scientific Data Mining Lab is to promote the activities on this project among undergraduate and graduate students through student seminars, invited talks, and regular lab meetings.

As part of the educational objectives of the project, the PI has been actively involved in an effort to introduce data mining to science and engineering undergraduate students. In October 2008 the PI attended the two-day NSF conference on Broadening Participation in Computing Disciplines in Virginia Beach, VA and in February 2009 the Computational Thinking for Everyone Workshop I in Washington, DC. Both workshops provided a forum for discussions on fostering computational thinking as one of the fundamental skills for the workforce of 21^{st} century and on increasing the retention of women and underrepresented minorities in computer science and information technology fields. As the Chair of Computer Science Undergraduate Committee in his Department since Spring 2009, the PI has been actively involved in exploring ways to introduce computational thinking very early in the curriculum of computer science majors, but also to majors from other disciplines such as sciences, communications, business, and social sciences. Courses or course modules related to data mining and knowledge discovery have been considered as very appropriate for teaching the computational thinking skills to both groups of students.

In Spring 2009, the PI taught a new course “Paradigms of Scientific Knowledge: Knowledge Discovery from Scientific Data”. The goal of this course was to teach science students the necessary skills for knowledge discovery from large collections of data. The course was designed to introduce students to various data mining algorithms and illustrate how they can be applied to real-life knowledge discovery problems in sciences. In Spring 2011 the PI is teaching an undergraduate course “Data Mining” where he is introducing a problem-based data mining syllabus designed to guide students to rediscover and implement several standard data mining algorithms for movie recommendations. The hypothesis is that problem-based learning approach is very suitable as it provides a natural way for students to fundamentally understand all aspects of the data mining pipeline and get genuinely interested in learning of more advanced concepts.

In Summer 2011, the PI started collaboration with Tammy Pirmann, a high school teacher from Springfield Township High School, in Montgomery County, PA. Ms. Pirmann spent 6 weeks during summer 2011 as a visitor to the PI’s research lab at Temple University. During the first two weeks, Ms. Pirmann was introduced to the ongoing data mining research activities related to the ongoing NSF CAREER project by the PI. She also learn about the “Data Mining” course the PI taught to Temple University undergraduates during Spring 2011. The remaining 4 weeks were devoted to developing a teaching module for the “Computer Science: Principles” that was taught by Ms. Pirmann during the 2011-2012 school year. This teaching module is designed with an objective to introduce computational thinking concepts to high school students through solving a data mining problem. The course module presents the EachMovie movie recommendation data set and EpiSims Portland population movement data sets to students, discusses the data, and shows how the data could be handled and analyzed. Our hypothesis was that data mining can be very appropriate for introductory computer science courses because it covers most of the aspects of computational thinking and it addresses very interesting and important social and scientific problems that can be appealing to high school students. The big data unit was presented to CSTA-Phila teachers at a workshop on Nov 19^{th} at Villanova University (https://sites.google.com/a/sdst.org/csprinciples/big-data). The course module has already been used at Trinity College, (http://turing.cs.trincoll.edu/~ram/cpsc110/notes/bigdata.html)

**Findings**

The research activities during the first 5-years of the project have resulted in 13 peer-reviewed conference papers, 2 journal papers, and 1 journal paper in review. We have a few more papers in review and in preparation.

**Tighter Perceptron with Improved Dual Use of Cached Data for Model Representation and Validation [17].**

We evaluated the proposed algorithm on 12 large benchmark classification data sets. It was compared with several existing budget perceptron algorithms (Forgetron, Random Perceptron, Budget Perceptron, Stoptron, Tighter Perceptron) and with the memory-unbounded Kernel Perceptron algorithm. We evaluated three different budgets B = 20, 100, 500, using an RBF kernel. The results showed that our Tightest Perceptron significantly outperforms all competing budget perceptron algorithms on every data set and for all three budgets. This result confirms that using the expected class probability by the proposed method provides highly valuable information for accuracy estimation and support vector removal. Interestingly, Tightest was highly competitive with the memory unbounded Kernel Perceptron: its accuracy with budget B=500 was better than Kernel Perceptron in 8 of 11 data sets, with a modest budget B=100 Tightest was more accurate 5 times, and even with a tiny budget of B=20 Tightest still beat Kernel Perceptron on 3 of the noisiest data sets. The success of Tightest probably lies in its ability to remove less useful or even harmful support vectors using the validation set.

**Twin Vector Machines for Online Learning on a Budget [16,18].**

Using different budget sizes (B = 20, 50, 100, 200, 500) we compared TVM with three memory-unbounded SVM algorithms (LibSVM, IDSVM, and ramp-loss SVM), an SVM learned on a random sample with size equal to the budget B, and to a budget kernel perceptron algorithm Forgetron. The evaluation was performed using three different kernel functions (linear, polynomial, and Gaussian) on 12 benchmark classification data sets. With a modest budget size of B=100, TVM accuracy was most often within a couple percent of the state of the art memory-unbounded LibSVM that typically needed thousands of support vectors. For larger budgets of B=200 and B=500, TVM accuracy was most often virtually the same with LibSVM. Additionally, the TVM accuracy was substantially larger than that of SVM trained on a random sample of the same size. The results illustrated that highly accurate online SVMs could be trained from large data streams using devices with severely limited memory budgets. We also observed two important behaviors. First, TVM using hinge and ramp loss achieved very similar accuracies on all the datasets. Considering that the running time of the hinge loss version is around two times faster this justifies the use of hinge loss as the default for TVM. Second, control of the slack parameter proved to be critical for the success of TVM. It guarantees that accuracy keeps increasing with the stream size. In its absence, the unavoidable and catastrophic drop in accuracy, that occurs at certain point during online learning, is caused by gradual reduction of influence of regularization term to model update.

**Learning Vector Quantization with Adaptive Prototype Addition and Removal [4].**

The original LVQ algorithms LVQ1, LVQ2.1, and LVQ3 with k-means prototype initialization were compared to their Adaptive counterparts that start from a single prototype per class and use LVQAdd and LVQRemove steps to optimize the number of prototypes and their locations. In addition, we also compared to the popular Soft LVQ algorithm. Experiments were done on 11 benchmark classification datasets. In the first set of experiments, the proposed Adaptive LVQ with a large budget B were employed to try to maximize the accuracy, and, as a by-product, to determine the optimal number of prototypes for each data set. The proposed Adaptive LVQ achieved very high accuracy on all data sets. More importantly, it was able to successfully determine the appropriate number of prototypes. In the second set of experiments the LVQ algorithms were restricted to a very tight budget. This was done in order to illustrate the behavior of the proposed algorithm in the highly resource constrained applications. On average, the proposed Adaptive LVQ2.1 had accuracy of 92.88%. When we compare this to 86.92% (Soft LVQ), 84.87% (K-Means LVQ2.1) and 75.74% (Regular LVQ2.1) we can conclude that our method, on average, achieved considerable improvement in classification accuracy. Overall, experiments showed that our algorithm, that uses simple heuristics, outperforms the basic LVQ algorithms and that it is comparable and, in noisy data cases, even better than the optimization-oriented Soft LVQ algorithm.

**Learning Vector Quantization for Regression [5].**

All proposed versions of Regression LVQ algorithm were evaluated on various regression tasks, both real-life and synthetic, from StatLib, Delve and UCI ML Repositories. We compared the proposed Hard and Soft Regression LVQ algorithms and Regression LVQ2.1 with several benchmark algorithms, including k-nearest neighbor, parzen window regression, k-means regression, random prototype Parzen, and quantization LVQ. All three versions of Regression LVQ on a fixed budget performed similarly to non-budget Parzen window and KNN algorithms. Budget versions of 1NN and Parzen window were significantly less accurate. Comparing of Regression LVQ algorithms showed that Soft RLVQ is slightly better than Hard RLVQ. Interestingly, despite its simplicity, RLVQ2.1 was quite competitive. The experimental results showed that all versions of our algorithm are computationally efficient and that they achieve considerable improvement in prediction accuracy when compared to other prototype-based methods and that their accuracy is comparable to memory unconstrained methods.

**Online SVM learning with ramp loss [13].**

Experiments were performed on 9 benchmark binary classification data sets. We used online algorithms IDSVM and Online Passive-Aggressive (PA) algorithm and offline LibSVM to compare with the proposed algorithms. OnlineSVM^{R} was significantly more accurate than LibSVM on 5 out of 9 data sets and IDSVM on 4 out of 9 data sets. The largest accuracy improvement occurred on noisy data sets. The accuracy of OnlineASVM^{R} was competitive to IDSVM, LibSVM and OnlineSVM^{R} and was usually within 1% of their accuracy. Finally, perceptron-based PA algorithm was significantly less accurate than the other algorithms.On the model sparsity side, OnlineASVM^{R }achieved the sparest model on 6 out of 9 data sets. OnlineSVM^{R} was as sparse as OnlineASVM^{R} in most cases, and it obtained the sparsest model on 2 data sets. Both SVM^{R}-based algorithms are much sparser than the hinge-loss based IDSVM, LibSVM and PA. Specifically, in Gauss data set, OnlineSVM^{R} and OnlineASVM^{R} were about 10 times sparser than LibSVM. The proposed algorithm OnlineSVM^{R} significantly outperforms its competing algorithm IDSVM in terms of accuracy, model sparsity and training time. A variant of OnlineSVM^{R }for active learning could be viewed as an approximate optimization process over a regularized ramp loss cost function. The applicability of OnlineSVM^{R} in active learning scenario has been demonstrated, as it showed the advantages over the competing algorithms.

**Online Passive-Aggressive Algorithms on a Budget [15].**

We evaluated the proposed algorithms against the state-of-the-art budgeted online algorithms on a large number of benchmark data sets with different problem complexity, dimensionality and noise levels. Experimental results show that 1) the proposed budget algorithms significantly improve accuracy over the previously proposed budgeted algorithms; 2) ramp loss based algorithms are more robust to noisy data and can produce sparser models; 3) ramp loss PA can be directly interpreted as active learning algorithm; and 4) the proposed budgeted algorithms are applicable for large-scale learning. Finally, the idea of budgeted PA can be extended beyond binary classification.

**Multi-class Pegasos on a Budget [11].**

By treating the budget maintenance as a source of the gradient error, we proved that the gap between the budgeted Pegasos and the optimal solution directly depends on the average model degradation due to the budget maintenance. In the absence of budget maintenance the multi-class Pegasos inherits the convergence properties of its binary predecessor. These results directly indicate that the optimal budget maintenance strategy should be the one that minimizes the model degradation. Having analytically shown that the quality of budgeted multi-class Pegasos directly depends on the quality of budget maintenance, we explored how to achieve this goal in the computationally efficient manner. We proposed several greedy methods for multi-class budget maintenance using removal, projection, and merging. In case of removal, it is optimal to remove the smallest support vector. The optimal projection of one SV to the remaining ones is achieved by minimizing the accumulated loss of multiple sub-problems for each class, which extends previous results to the multi-class setting. In the case of merging, when the Gaussian kernel is used, the new SV is always on the line connecting two merged SVs, which generalizes the previous results to the multi-class setting. Both space and update time of the budgeted Pegasos scale quadratically with budget size when projection is used and linearly when merging or removal are used. Our experimental results showed that the proposed online algorithm achieves accuracy comparable to non-budget multi-class kernelized Pegasos, while being able to learn from high throughput data streams extremely efficiently.

**Adaptive Multi-Hyperplane Machine for Nonlinear Classification [12]**.

The experiments on several large data sets showed that AMM is nearly as fast during training and prediction as the state-of-the-art linear SVM solver and that it can be orders of magnitude faster than kernel SVM. In accuracy, AMM is somewhere between linear and kernel SVMs. For example, on an OCR task with 8 million highly dimensional training examples, AMM trained in 300 seconds on a single core processor had 0.54% error rate, which was significantly lower than 2.03% error rate of a linear SVM trained in the same time and comparable to 0.43% error rate of a kernel SVM trained in 2 days on 512 processors. With efficient training and prediction, comparable to linear SVM, and the ability to solve nonlinear classification problems, similar to kernel SVM, AMM fills the scalability and representability gap between linear and nonlinear classification. As such, AMM can be an appealing option when solving large-scale nonlinear classification problems. In addition to being an attractive option when solving large-scale classification problems, AMM could also be useful as a data exploration tool. When faced with a new large-scale classification problem, AMM could be trained together with linear SVM and if its error rate is significantly lower, it could indicate that the problem is nonlinear and that more computationally costly but more powerful classification algorithms should be considered.

The software is available at www.dabi.temple.edu/~vucetic/AMM.html.

**Decentralized Estimation Using learning Vector Quantization [3].**

The experiments were performed on single- and 2-source 2-dimensional problems and on 4-source 1 to 3 dimentional problems. We used 2 different training set sizes and several levels of noise. The proposed algorithm was compared to the regression tree algorithm and the regular LVQ algorithm and it was shown to be successful in both single- and multi-sensor scenarios and that it consistently outperformed the standard LVQ approach as well as the regression tree based approach for quantizer design. The results showed that the proposed algorithm is robust to noise.

**Feature Selection for Transfer Learning [7,8].**

Data sets used to evaluate the proposed algorithms were microarrays data corresponding to multiple samples from 14 human cancer tissues and 12 normal tissues. From that data, we extracted 9 binary classification data sets by coupling normal and cancer samples from the same tissue type, whenever available. We evaluated several feature selection methods in our experiments: (1) Single-task, (2) the proposed Multi-task (M_FS), (3) Random selection, and (4) Select all features. The feature selection algorithms were evaluated in conjunction with 3 logistic regression algorithms: (1) Single task penalized LR (Logistic Regression), (2) Multi-task penalized LR, (3) Multi-task reweighted LR. For each combination of feature selection and logistic regression algorithms, we selected one of the 9 data sets as target data and the remaining 8 as the auxiliary data. To explore the influence of training size on transfer learning algorithms, we used {1, 2, 3, 4, 5} positive and the same number of negative target task examples for training. Results showed that the proposed M_FS results in superior accuracies. It boosted accuracy of both single-task and multi-task classifiers. Combination of multi-task feature selection and classification appeared particularly successful. We observed that multi-task learning is the most useful when target examples are scarce. Its benefits decrease with data size and could even lead to negative transfer. Despite its limitations, it is evident that multi-task learning can be a useful tool in bioinformatics and other domains characterized by multi-task high-dimensional and small sample data.

**Multiple Instance Regression [14].**

We evaluated the proposed algorithms on two remote sensing applications: Aerosol Optics Depth (AOD) prediction and crop yield prediction. The results showed that the proposed algorithms achieved high accuracy on both sets as compared to baseline approaches, including learning on pooled instances from all bags, learning on aggregated instances from each bag, a predictor based on the domain knowledge, and a previously proposed multiple instance algorithm called the prime.

**Application-Specific Resource-Aware Large-Scale Data Mining.**

In the traffic forecasting application, we [**1**] applied several CCRF models on the problem of travel speed prediction in a range between 10 and 60 minutes ahead, and evaluated them on loop detector data from a 5.71-mile section of the I-94 highway in Minneapolis, MN. Several CCRF models, with increasing level of complexity, are proposed in order to better assess performance of the method. When compared to the linear regression model, the Mean Absolute Error was reduced by around 4%. The results imply that modeling spatial and temporal neighborhood in traffic data and combining various baseline predictors under the CCRF framework can be beneficial.

In the active learning application, we [**6**] evaluated the proposed algorithm on several data sets from AISTATS 2010 Active Learning Challenge. The results showed that our algorithm based on Parzen Window classification and mixture of uncertainty-based and random sampling gives consistent results, comparable to the winning algorithms. Our analysis revealed that, although it was less accurate than the winning decision forests, Parzen Window classification was a reasonable choice for solving highly dimensional classification problems. In addition, since several of the challenge datasets could be accurately solved using linear classifiers, the representative power of Parzen Window classifiers becomes a disadvantage. Our results also showed that accuracy of Parzen Window classifiers can be boosted by using their ensembles. On the other hand, the ease of implementation, coupled with respectable accuracy achieved on the challenge tasks, indicates that Parzen Window classifiers should be given consideration on any active learning problem.

In the disease mapping application [**9**], we were able to map disease for a simulated population with 1.6 million people and a spatial grid with 65 thousand locations in several minutes. The results indicate that movement information can improve the accuracy of disease mapping as compared to residential data only. We also studied a privacy-preserving scenario in which only the aggregate statistics are available about the movement of the overall population, while detailed movement information is available only for individuals with disease. The results indicate that the accuracy of disease mapping remains satisfactory when learning from movement data sanitized in this way.

**Compressed Kernel Perceptrons [10].**

In the experiments, we compared the standard kernel perceptron and the proposed Compressed Kernel Perceptron, both using RBF kernels. Because the kernel perceptron is a memory-unconstrainted online algorithm it served as an upper bound on achievable accuracy. Additionally, it provided useful information about the desired number of support vectors when computational resources are not an issue. For experiments with Compressed Kernel Perceptrons, we evaluated five different budgets of M*{50, 100, 200, 500, 1000} bits, where M is the number of attributes. The evaluation was performed on three large benchmark classification data sets. The results showed that accurate classifiers could be learned efficiently while consuming very little memory. In some cases, accurate classifiers were constructed from large data sets using less than 1,000 bits of memory. The results are also useful in establishing lower bounds on memory needed to build an accurate classifier. They indicate that this lower bound is clearly a function of problem complexity and dimensionality.

**Random Kernel Perceptron on ATTiny2313 Microcontroller [2].**

We implemented kernel perceptron on ATTiny2313 microcontroller. Its specifications make it extremely challenging to implement a data mining algorithm. It has very limited speed (0-8 MHz, depending on voltage that ranges from 1.8 to 5.5V), low power requirements (when in 1.8V and 1MHz mode, it requires only 300μA and 540μW power), and very limited memory (2kB of program memory that is used to store executable code and constants; 128B of data memory that is used to store model and program variables). Because of its limitations, it is also very cheap, with the price of around 1$ [15]. It supports 16-bit integers and supports fixed-point operations. Multiplication and division of integers is very fast, but at the expense of potential overflow problems with multiplication and round-off error with division. ATTiny2313 does not directly support floating-point numbers, but can cope with them using certain software solutions. Floating-point calculations can also be supported through C library, but only after incurring significant overhead in both memory and execution time.

Our resource-constrained implementation of kernel perceptron was evaluated on several benchmark datasets and the results were very promising. For example, achieved accuracy on Pendigits benchmark data was 81%, close to the 85% accuracy of the resource-unconstrained kernel perceptron. This is impressive considering that it was achieved in 7.5 seconds using a processor with only 128 Bytes of data memory and 2 kB of program memory. The success of our implementation opens the doors for implementing powerful online learning algorithms on the most resource-constrained computational devices.

**References**

1. Djuric, N., Radosavljevic, V., Coric, V., Vucetic, S., Travel Speed Forecasting Using Continuous Conditional Random Fields, Transportation Research Board (TRB) 90th Annual Meeting, Washington, D.C., 2011.

2. Djuric, N., Vucetic, S., Random Kernel Perceptron on ATTiny2313 Microcontroller, 4th International Workshop on Knowledge Discovery from Sensor Data (SensorKDD), at the 16th ACM SIGKDD Conf. on Knowledge Discovery and Data Mining (KDD), Washington, D.C., 2010.

3. Grbovic, M., Vucetic, S., Decentralized Estimation Using Learning Vector Quantization, Data Compression Conference (DCC), Snowbird, UT, 2009.

4. Grbovic, M., Vucetic, S., Learning Vector Quantization with Adaptive Prototype Addition and Removal, Proc. Int’l Joint Conf. on Neural Networks (IJCNN), Atlanta, GA, 2009.

5. Grbovic, M., Vucetic, S., Regression Learning Vector Quantization, IEEE Int’l Conf. on Data Mining (ICDM), Miami, FL, 2009.

6. Lan, L., Shi, H., Wang, Z., Vucetic, S., An Active Learning Algorithm Based on Parzen Window Classification, JMLR W&C Proc. Workshop on Active Learning and Experimental Design (2010 AISTATS Active Learning Challenge), 2010.

7. Lan, L., Vucetic, S., A Multi-Task Feature Selection Filter for Microarray Classification, IEEE Int’l Conf. on Bioinformatics and Biomedicine (BIBM), Washington, D.C., 2009.

8. Lan, L., Vucetic, S., Improving Accuracy of Microarray Classification by a Simple Multi-Task Feature Selection Filter, International Journal of Data Mining and Bioinformatics, in press, 2011.

9. Malbasa, V., Vucetic., S., Spatially Regularized Logistic Regression for Disease Mapping on Large Moving Population, 17th ACM SIGKDD Int. Conf. on Knowledge Discovery and Data Mining (KDD), San Diego, CA, 2011.

10. Vucetic, S., Coric, V., Wang, Z., Compressed Kernel Perceptrons, Data Compression Conference (DCC), Snowbird, UT, 2009.

11. Wang, Z., Crammer, K., Vucetic, S., Multi-Class Pegasos on a Budget, Proc. Int. Conf. on Machine Learning (ICML), Haifa, Israel, 2010.

12. Wang, Z., Djuric, N., Crammer, K. Vucetic, S., Trading Representability for Scalability: Adaptive Multi-Hyperplane Machine for Nonlinear Classification, 17th ACM SIGKDD Int. Conf. on Knowledge Discovery and Data Mining (KDD), San Diego, CA, 2011.

13. Wang, Z., Vucetic, S., Fast Online Training of Ramp Loss Support Vector Machines, IEEE Int’l Conf. on Data Mining (ICDM), Miami, FL, 2009.

14. Wang, Z., Vucetic, S., Multiple Instance Regression with Applications to Remote Sensing, IEEE Transactions on Geoscience and Remote Sensing, in press, 2012.

15. Wang, Z., Vucetic, S., Online Passive-Aggressive Algorithms on a Budget, JMLR W&C Proc. Int. Conf. on Artificial Intelligence and Statistics (AISTATS), Sardinia, Italy, 2010.

16. Wang, Z., Vucetic, S., Online Training on a Budget of Support Vector Machines using Twin Prototypes, Statistical Analysis and Data Mining, Vol 3 (3), pp. 149-169, 2010.

17. Wang, Z., Vucetic, S., Tighter Perceptron with Improved Dual Use of Cached Data for Model Representation and Validation, Proc. Int’l Joint Conf. on Neural Networks (IJCNN), Atlanta, GA, 2009.

18. Wang, Z., Vucetic, S., Twin Vector Machines for Online Learning on a Budget, 2009 SIAM Conf. on Data Mining (SDM), Sparks, NV, 2009.

### Contributions

Contributions within Discipline:

The research addresses a vital question of efficient knowledge discovery from large volumes of data available in various scientific, engineering, and business-oriented domains. The merit of the ongoing activities stems form integration of ideas from data mining and data compression. The methodology resulting from this project will have broad impact on computer science, and, more generally, on the whole field of information technology.

Contributions to Other Disciplines:

The project is addressing learning from a large variety of scientific, engineering, and business-related data sets. The methodologies that are being developed will facilitate knowledge discovery from such data and will thus contribute to a large array of disciplines.

Contributions to Education and Human Resources:

The project has already involved 7 graduate students and 4 undergraduate students. They gained valuable research skills in computer science and information technology.