Ensemble Modeling

By Manohar Swamynathan

Introduction

The last decade has seen exponential growth in computation power alongside a steady fall in hardware cost. This has contributed significantly to the improvement of existing algorithms and the invention of new ones in the area of machine learning. For the machine learning enthusiast, the question that arises is: which algorithm is the best?

Well, there is no single answer to this question! It depends on the domain, the problem you are trying to address, the data set under consideration, and the business requirements. Ensemble modeling is one attempt to address this question. Ensemble methods combine the scores of multiple models into a single score to create a robust, generalized model, improving the accuracy of predictive analytics and data mining applications.

This article presents the fundamental principles of ensemble methods, along with example code to help you understand them.

Types of Ensemble Methods

At a high level, there are two types of ensemble methods:

  1. Combine multiple models of a similar type
    • Bagging (Bootstrap aggregation)
    • Boosting
  2. Combine multiple models of different types
    • Vote classification
    • Blending or stacking

Bagging

Bootstrap aggregation (also known as bagging) was proposed by Leo Breiman in 1994 as a model aggregation technique to reduce model variance. The training data is split into multiple samples drawn with replacement, called bootstrap samples. Each bootstrap sample is the same size as the original data set; on average only about two-thirds of the original values appear in it, and sampling with replacement results in repeated values.
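As a quick illustration of this sampling behavior (not from the original article), the following NumPy sketch draws one bootstrap sample and checks what fraction of the original rows appears in it; the toy data and variable names are my own:

import numpy as np

rng = np.random.default_rng(42)
original = np.arange(1000)                       # a toy training set of 1,000 rows

# Draw a bootstrap sample: same size as the original, sampled with replacement
bootstrap = rng.choice(original, size=original.size, replace=True)

unique_fraction = np.unique(bootstrap).size / original.size
print(f"Fraction of original rows present in the bootstrap sample: {unique_fraction:.2f}")
# Typically prints ~0.63, since each row has a 1 - (1 - 1/N)^N ≈ 63% chance of being drawn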


Figure 1 - Bootstrapping

An independent model is built on each bootstrap sample, and the average of the predictions (for regression) or the majority vote (for classification) is used to create the final model.

Figure 2 shows the bagging process flow. Let N be the number of bootstrap samples created from the original training set. For i = 1 to N, train a base machine learning model Ci.

Cfinal = arg max_y Σi I(Ci = y)


Figure 2 – Bagging Process Flow

Bagging Example Code: here! is a Python implementation example in an IPython notebook.
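To complement the linked notebook, here is a minimal sketch of bagging with scikit-learn's BaggingClassifier; the synthetic dataset and parameter values are illustrative assumptions rather than those used in the notebook:

from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import cross_val_score

# Illustrative synthetic data set (not the one used in the linked notebook)
X, y = make_classification(n_samples=1000, n_features=20, random_state=42)

# 100 base learners, each fit on a bootstrap sample of the training data;
# the default base estimator is a decision tree
bagging = BaggingClassifier(n_estimators=100, bootstrap=True, random_state=42)

scores = cross_val_score(bagging, X, y, cv=5)
print(f"Bagging 5-fold CV accuracy: {scores.mean():.3f}")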

Boosting

Freund and Schapire introduced the concept of boosting in 1995 with the well-known AdaBoost (adaptive boosting) algorithm. The core idea of boosting is that, rather than relying on independent individual hypotheses, combining hypotheses in a sequential order increases accuracy. Essentially, boosting algorithms convert weak learners into strong learners. Boosting algorithms are well designed to address bias problems.

At a high level, the AdaBoost process consists of the following steps:

  • Assign uniform weights to all data points, W0(x) = 1/N, where N is the total number of training data points.
  • At each iteration m, fit a classifier ym(xn) to the training data and update the weights so as to minimize the weighted error function.
  • The weights are updated as Wn(m+1) = Wn(m) exp{αm I(ym(xn) ≠ tn)}, so misclassified points receive larger weights.
  • The hypothesis weight (loss function or learning rate) is given by αm = ½ ln((1 − εm) / εm), and the error rate is given by εm = Σn Wn(m) I(ym(xn) ≠ tn) / Σn Wn(m), where I(ym(xn) ≠ tn) is 0 if xn is correctly classified and 1 otherwise.
  • The final model is given by Ym(x) = sign(Σm αm ym(x)); a short code sketch of this loop is shown below.
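For readers who want to see these steps in code, below is a minimal from-scratch sketch using decision stumps as the weak learners; it follows the update rules above (and, like the worked example that follows, does not renormalize the weights). It is an illustrative sketch, not the implementation referenced later in the article:

import numpy as np
from sklearn.tree import DecisionTreeClassifier

def adaboost_fit(X, t, n_rounds=3):
    """Train AdaBoost with decision stumps; t must contain labels in {-1, +1}."""
    N = len(t)
    w = np.full(N, 1.0 / N)                        # step 1: uniform weights W0 = 1/N
    stumps, alphas = [], []
    for m in range(n_rounds):
        stump = DecisionTreeClassifier(max_depth=1)
        stump.fit(X, t, sample_weight=w)           # step 2: fit weak learner ym
        miss = (stump.predict(X) != t)             # indicator I(ym(xn) != tn)
        eps = np.sum(w * miss) / np.sum(w)         # weighted error rate
        alpha = 0.5 * np.log((1 - eps) / eps)      # hypothesis weight (learning rate)
        w = w * np.exp(alpha * miss)               # reweight the misclassified points
        stumps.append(stump)
        alphas.append(alpha)
    return stumps, alphas

def adaboost_predict(X, stumps, alphas):
    """Final model: sign of the alpha-weighted vote of the weak learners."""
    votes = sum(a * s.predict(X) for a, s in zip(alphas, stumps))
    return np.sign(votes)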


Figure 3 – AdaBoosting

Example Illustration for AdaBoost

Let's consider a training data with two class lables of 10 data points. Assume, initially all the data points will have equal weights given by i.e., 1/10 as shown in figure 4 below.


Figure 4 - Sample data set with 10 data points

Boosting Iteration 1

Notice in figure 5 that three points of the positive class are misclassified by the first simple classification model, so they will be assigned higher weights. The error term and the loss function (learning rate) are calculated as 0.30 and 0.42 respectively. The data points P3, P4, and P5 get a higher weight (0.15) due to misclassification, whereas the other data points retain their original weight (0.1).
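These numbers can be verified with the formulas given earlier; the small sketch below reproduces the 0.30 error term, the 0.42 learning rate, and the 0.15 updated weight (the variable names are mine):

import numpy as np

w = np.full(10, 0.1)                  # initial weights, 1/10 each
miss = np.zeros(10, dtype=bool)
miss[2:5] = True                      # P3, P4, P5 are misclassified in iteration 1

eps = np.sum(w * miss) / np.sum(w)    # error term
alpha = 0.5 * np.log((1 - eps) / eps) # learning rate (hypothesis weight)
w_new = w * np.exp(alpha * miss)      # only the misclassified weights grow

print(round(eps, 2), round(alpha, 2), round(w_new[2], 2))   # 0.3 0.42 0.15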


Figure 5 - Ym1 classification or hypothesis

Boosting Iteration 2

Let's fit another classification model as shown in figure 6 below and notice that 3 data points of negative class are misclassified. The data points P6, P7 and P8 are misclassified. Hence these will be assigned higher weights of 0.17 as calculated, whereas the remaining data point's weights will remain the same as they are correctly classified.


Figure 6 - Ym2 classification or hypothesis

Boosting Iteration 3

The third classification model has misclassified a total of three data points: two of the positive class (P1 and P2) and one of the negative class (P9). These misclassified data points are therefore assigned a new, higher weight of 0.19 as calculated, and the remaining data points retain their earlier weights.


Figure 7 - Ym3 classification or hypothesis

Final Model:

Now, as per the AdaBoost algorithm, let's combine the weak classification models as shown in figure 8 below. Notice that the combined final model has a lower error term and a higher learning rate than any individual weak learner, leading to a higher degree of accuracy.
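As a rough illustration of this combination step, the sketch below forms the sign of the alpha-weighted vote of the three weak hypotheses for a single data point; the 0.42 rate comes from iteration 1 above, while 0.52 and 0.63 are values I computed with the same formula from the weights in this walkthrough, so treat them as assumptions:

import numpy as np

# Learning rates alpha_m from the three boosting iterations; 0.42 is from
# iteration 1 above, 0.52 and 0.63 are derived here and not quoted from the text
alphas = [0.42, 0.52, 0.63]

# Predictions (+1/-1) of the three weak hypotheses for one data point (stand-ins)
h = np.array([+1, -1, +1])

final_prediction = np.sign(np.dot(alphas, h))    # alpha-weighted majority vote
print(final_prediction)                          # +1 here, since 0.42 + 0.63 > 0.52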


Figure 8 - AdaBoost algorithm to combine weak classifiers

Gradient Boosting

Due to its stage-wise additivity, the loss function can be represented in a form suitable for optimization. This gave birth to a class of algorithms known as generalized boosting machines (GBM). Gradient boosting is an example implementation of GBM, and it can work with different loss functions for regression, classification, risk modeling, and so on. As the name suggests, it is a boosting algorithm that identifies the shortcomings of a weak learner through gradients (whereas AdaBoost uses high-weight data points), hence the name gradient boosting.

  • Iteratively fit a model ym(x) to the training data. The initial model is a constant, y0(x) = arg min_γ Σn L(tn, γ).
  • At each iteration m, calculate the loss (i.e., the predicted value vs. the actual value) for the current fit, or equivalently compute the negative gradient gm(xn) = −∂L(tn, y(xn))/∂y(xn) evaluated at y = ym−1, use it to fit a new base learner function hm(x), and find the best gradient descent step size ρm = arg min_ρ Σn L(tn, ym−1(xn) + ρ hm(xn)).
  • Update the function estimate ym(x) = ym−1(x) + ρm hm(x) and output ym(x); a short sketch of this procedure follows.
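Below is a minimal, illustrative sketch of this procedure for regression with squared loss, where the negative gradient is simply the residual; the tree depth and shrinkage value are arbitrary choices of mine:

import numpy as np
from sklearn.tree import DecisionTreeRegressor

def gbm_fit(X, y, n_rounds=100, learning_rate=0.1):
    """Gradient boosting for regression with squared loss (residual fitting)."""
    f0 = np.mean(y)                       # initial constant model (minimizes squared loss)
    prediction = np.full(len(y), f0)
    trees = []
    for _ in range(n_rounds):
        residual = y - prediction         # negative gradient of squared loss
        tree = DecisionTreeRegressor(max_depth=2)
        tree.fit(X, residual)             # base learner h_m fitted to the gradient
        prediction += learning_rate * tree.predict(X)   # stage-wise additive update
        trees.append(tree)
    return f0, trees

def gbm_predict(X, f0, trees, learning_rate=0.1):
    return f0 + learning_rate * sum(t.predict(X) for t in trees)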

XGBoost (eXtreme Gradient Boosting)

In March 2014, Tianqi Chen built XGBoost in C++ as part of the Distributed (Deep) Machine Learning Community, and it has an interface for Python. It is an extended, more regularized version of the gradient boosting algorithm. It is one of the best-performing large-scale, scalable machine learning algorithms and has played a major role in the winning solutions of Kaggle (a forum for predictive modeling and analytics competitions) data science competitions.

XGBoost objective function

The objective function combines a training loss term with a regularization term:

Obj = Σi l(yi, ŷi) + Σk Ω(fk)

The regularization term is given by

Ω(f) = γT + ½ λ‖w‖²

where T is the number of leaves in each tree fk and w is the vector of leaf weights.


The gradient descent technique is used to optimize the objective function; more of the mathematics behind the algorithm can be found at http://xgboost.readthedocs.io/en/latest/

Some of the key advantages of the XGBoost algorithm are:

  • It implements parallel processing.
  • It has a built-in standard for handling missing values: the user can specify a particular value different from other observations (such as -1 or -999) and pass it as a parameter.
  • It splits the tree up to the maximum depth and then prunes backward, unlike gradient boosting, which stops splitting a node when it encounters a negative loss in the split.

Boosting Example Code: here! is a Python implementation example in an IPython notebook.
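Alongside the linked notebook, here is a minimal sketch comparing the boosting classifiers discussed above; the synthetic data and hyperparameters are illustrative assumptions, and the last import requires the separate xgboost package:

from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier, GradientBoostingClassifier
from sklearn.model_selection import cross_val_score
from xgboost import XGBClassifier   # requires the xgboost package

X, y = make_classification(n_samples=1000, n_features=20, random_state=42)

models = {
    "AdaBoost": AdaBoostClassifier(n_estimators=100, random_state=42),
    "Gradient Boosting": GradientBoostingClassifier(n_estimators=100, random_state=42),
    "XGBoost": XGBClassifier(n_estimators=100, max_depth=3, random_state=42),
}

for name, model in models.items():
    scores = cross_val_score(model, X, y, cv=5)
    print(f"{name}: 5-fold CV accuracy = {scores.mean():.3f}")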

Ensemble Voting - ML’s biggest heroes united         


Figure 9 – Ensemble Voting: ML's biggest heroes united

A voting classifier enables us to combine the predictions of multiple machine learning algorithms of different types through majority voting, unlike bagging and boosting, where multiple classifiers of a similar type are used for majority voting.

First, you create multiple standalone models from your training dataset. Then a voting classifier is used to wrap the models and average their predictions when asked to make predictions for new data. The predictions of the sub-models can be weighted, but specifying the weights for classifiers manually, or even heuristically, is difficult. More advanced methods can learn how best to weight the predictions from sub-models; this is called stacking (stacked generalization) and was not provided in scikit-learn at the time this was written.

Ensemble Voting Example Code: here! is a Python implementation example in an IPython notebook.
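As a hedged complement to the linked notebook, the sketch below wraps three models of different types in scikit-learn's VotingClassifier; the choice of base models and the use of hard (majority) voting are illustrative assumptions:

from sklearn.datasets import make_classification
from sklearn.ensemble import VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=1000, n_features=20, random_state=42)

# Three standalone models of different types
estimators = [
    ("lr", LogisticRegression(max_iter=1000)),
    ("knn", KNeighborsClassifier()),
    ("dt", DecisionTreeClassifier(random_state=42)),
]

# Hard voting = simple majority vote across the sub-models
voting = VotingClassifier(estimators=estimators, voting="hard")

scores = cross_val_score(voting, X, y, cv=5)
print(f"Voting ensemble 5-fold CV accuracy: {scores.mean():.3f}")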

Stacking

David H. Wolpert presented the concept of stacked generalization, most commonly known as ‘stacking’, in a 1992 paper in the journal Neural Networks. In stacking, you initially train multiple base models of different types on the training dataset. It is ideal to mix models that work differently (kNN, bagging, boosting, etc.) so that each can learn some part of the problem. At the next level, the predicted values from the base models are used as features to train another model, known as the meta-model; combining the learning of the individual models in this way results in improved accuracy. This is a simple two-level stack, and similarly you can stack multiple levels of different types of models.


Figure 10 – Simple Level 2 stacking model

Stacking Example Code: here! is a Python implementation example in an IPython notebook.
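Because stacking was not built into scikit-learn at the time, here is a minimal manual sketch of a two-level stack that uses out-of-fold predictions from cross_val_predict as the meta-model's features; the base models and the logistic regression meta-model are my own illustrative choices:

import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict, cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = make_classification(n_samples=1000, n_features=20, random_state=42)

# Level 0: base models of different types
base_models = [KNeighborsClassifier(), RandomForestClassifier(random_state=42)]

# Out-of-fold predicted probabilities become the features for the next level,
# which avoids leaking the training labels into the meta-model
meta_features = np.column_stack([
    cross_val_predict(m, X, y, cv=5, method="predict_proba")[:, 1]
    for m in base_models
])

# Level 1: meta-model trained on the base models' predictions
meta_model = LogisticRegression()
scores = cross_val_score(meta_model, meta_features, y, cv=5)
print(f"Stacked meta-model 5-fold CV accuracy: {scores.mean():.3f}")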

Conclusion

We’ve just learnt about the different ensemble methods in current practice in machine learning for combining multiple models into a robust model with better accuracy. I hope this serves as an introduction to the fundamentals of ensemble modeling and helps you start a new journey or quest to strengthen these concepts and apply them to real-world problems.

About the Author

Manohar Swamynathan is a data science practitioner and an avid programmer, with over 13 years of experience in various data science related areas, including data warehousing, business intelligence (BI), analytical tool development, ad hoc analysis, predictive modeling, data science product development, consulting, formulating strategy, and executing analytics programs.

He’s had a career covering the life cycle of data across different domains such as US mortgage banking, retail, insurance, industrial IoT, and eCommerce. He has a bachelor's degree with a specialization in physics, mathematics, and computers, and a master's degree in project management.

This article is based on a preview of the fourth chapter of Mastering Machine Learning with Python in Six Steps, published in June 2017 (Apress Media LLC).