
Model Evaluation for Classification and Regression

Aravind Kumar Balasubramaniam, 14123754
School of Computing
National College of Ireland
Dublin, Ireland
Email: [email protected]

Anisha Kudalappa Gudagi, 14123223
School of Computing
National College of Ireland
Dublin, Ireland
Email: [email protected]

Abstract—This paper compares classification and regression models using the CARET package in R. Classification and regression analyses are performed to predict the survival rate on the Titanic dataset and to quantify the hazard score on a property-liability dataset, respectively. Four classification algorithms are modeled on data samples from the Titanic dataset and evaluated against Accuracy, Sensitivity, Specificity, Positive Predictive Value, Negative Predictive Value, Kappa and F-measure. Based on F-measure, the best model is selected and classification analysis is performed on the Titanic dataset. Three regression algorithms are modeled on data samples from the property-liability dataset and evaluated against RMSE, R-squared, RMSE SD and R-squared SD. The RMSE metric is used to select the best model for regression analysis and to predict the hazard score.

I. INTRODUCTION

Data prediction is a non-trivial task with enormous value. This paper outlines the heuristics of classification and regression modelling and provides a literature review. Two different datasets, individually suited to classification and regression, are used to evaluate model performance metrics with the aid of the CARET package and to train and test the selected algorithm. This research paper presents and discusses classification-based survival prediction for the infamous Titanic shipwreck and regression-based prediction of property hazards from property information in the property-liability insurance industry. Predictive analysis is one of the most important supervised data mining techniques, as it enables us to find unidentified patterns or trends in datasets. Machine learning from disaster is one of the most explored areas in the fields of data mining and disaster management. In the past, many researchers have used data mining techniques to predict survival in natural disasters and to predict patient survival rates for rare and incurable diseases. We have used classification data mining algorithms to predict what sort of people were likely to survive the shipwreck. The prediction depends on key attributes such as gender, passenger class, age, parents, siblings and the type of ticket. Regression analysis is performed on the property-liability data to quantify hazards. Property insurance requires inspection of a property based on its condition, such as foundation, roof and flooring, which are key property attributes. These attributes need to be investigated by insurance companies before a property is approved for insurance. Thus, predictive analysis is well suited to giving insurance companies clear insight into property hazards derived from key property attributes.

In this paper we have analyzed and built predictive models to classify survival in the Titanic shipwreck and to quantify property hazards before the time of inspection, using the Liberty Mutual Insurance company dataset, which contains a hazard score assigned to each property by inspection. There are two queries: the first is to "predict what sort of people were likely to survive the shipwreck" using classification, and the second is to "predict which of these hazards contribute more for each property" using regression. We have built four classification models and three regression models, compared them, and selected the best model. The rest of the paper is organized as follows: section (2) related work, which discusses predictive analysis in the insurance industry and model comparison; section (3) the methodology used to answer the queries; section (4) evaluation of results; and section (5) conclusion and future work.


II. RELATED WORK

The classification data mining technique refers to assigning each data point instance to a class label. It is one of the most researched areas in the field of data mining. Classification algorithms aim to find a classifier that assists in assigning an input instance to a class label [8]. Classification has been the most popular technique in the fields of disaster management and medicine. Conventional survival analysis evaluates the data using the Kaplan-Meier method or the Cox proportional hazards model; in recent years, however, classification models have been widely used in medicine. These models first establish a tree structure by partitioning the training data into several subclasses. This partitioning is done according to test conditions until all the data are grouped under one subclass. After the tree structure is established, the tree is pruned from the bottom. After pruning, rules are created from the output, and these rules are used to classify new data for prediction [2]. K-nearest neighbour is one of the oldest non-parametric classification methods; a class is assigned based on the most common class amongst the k nearest neighbours. Fuzzy k-nearest neighbour is an extension of KNN in which the algorithm assigns fuzzy memberships of data samples to different classes [3]. Boosting is an iterative algorithm that combines classification rules, weighted by their performance in terms of error rate, to produce an accurate classification rule. Regularized discriminant analysis refers to assigning objects to one or several groups based on measurements obtained from each object. Regularization techniques are applied to linear discriminant analysis and quadratic discriminant analysis and have been successful on poorly-posed inverse problems; RDA aims to improve the misclassification risk and error rate relative to LDA and QDA [4]. Recursive partitioning (rpart) is a statistical method used in classification and regression trees. It is used to discover structures and trends in a data sample and is applied in various scientific fields for multivariate data exploration, for example DNA sequencing and medicine. The algorithm can be tuned to perform either classification or regression [10].

Data mining techniques have been used in the insurance industry for quite some time, since insurance databases consist of large datasets which provide valuable business knowledge for improving customer relationships, improving profits or expanding the business. Modeling insurance risk is done by applying data mining techniques. Past research has shown that data mining methods improve existing models by discovering extra variables and detecting nonlinear relationships. Data mining has been of great importance to the insurance industry by identifying risk factors that help in predicting profits and losses. Techniques such as decision trees and neural networks can accurately predict risk, and customer relationship management analysis helps in understanding the customer and accurately selecting which policies to offer a customer [5].

Random Forest is a data mining algorithm for performing classification and regression analysis. Its tuning metric, mtry, specifies the number of predictor variables considered for the best split. The randomForest package produces two sets of information: a measure of the importance of the predictor variables and a measure of the internal structure of the data [7]. Multivariate regression methods such as principal component analysis and partial least squares have many applications across industries; quantitative structure-activity relationships and quantitative structure-property relationships use PLSR and PCR. In PCR, the first a principal components (PCs) are used to approximate the predictor matrix, and Y is then regressed on the scores, which in turn provides the regression coefficients [9]. Gradient boosted regression is an iterative algorithm for finding a predictor: regression trees are grown by repeated expansion of nodes until a stopping criterion is met, with all data points initially assigned to a single node. Parallelizing boosted regression trees means performing the boosting sequentially while parallelizing the building of each individual tree [11]. The CARET package, short for Classification And REgression Training, provides classification and regression models and is used for model tuning and training across models. It helps in comparing performance between different models; since classification and regression models are used in many different applications, the caret package helps in selecting the best model and approach [6].

III. METHODOLOGY

The data mining methodology used here is the CRISP-DM methodology. CRISP-DM stands for Cross-Industry Standard Process for Data Mining. This methodology consists of six phases:
• Business Understanding: this phase includes understanding the business objectives and requirements.
• Data Understanding: this phase starts with the initial data collection and identifying subsets in the data.
• Data Preparation: transformation and cleaning are done in this phase; the data is prepared for modeling.
• Modeling: modeling techniques are applied to the dataset.
• Evaluation: the model results are evaluated in this phase and measured against the business objectives.
• Deployment: the model built is deployed to the customer with the results [1].

A. Implementation for Classification Modelling

1) Business Understanding: The objective was to identify or predict what sort of people, other than the upper class, women and children, were likely to survive the RMS Titanic, one of the most notorious shipwrecks in history. The preliminary plan was to understand the key variables or dimensions that would be used to predict the class attribute, and to use the caret package for model evaluation and for identifying the parameter values.

2) Data Understanding: The Titanic dataset was collected from Kaggle. The key variables identified were 'Sex', 'Age', 'Pclass', 'Parch', 'Fare', 'SibSp' and 'Embarked', used to predict the categorical class 'Survived'. Some problems, such as NULLs and NAs, were identified in the dataset; these are addressed in the following stages.

3) Data Preparation: Problems detected in data understanding are addressed here.

• Median values are replaced for NAs in 'Fare'.
• NULL values are replaced with "S" in 'Embarked'.
• NAs in 'Age' are replaced with -1.
• 'Sex' is converted to a factor.
• 'Embarked' is converted to a factor.
A short R sketch of these steps is shown after this list.
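The following is only a minimal sketch of the cleaning steps listed above, assuming the Kaggle training file trainTc.csv is read into a data frame named raw (a hypothetical name used here for illustration); the full preprocessing actually used for the final prediction appears later in the tcPredictors function.

# Minimal data-cleaning sketch (illustrative; 'raw' is an assumed name)
library(readr)
raw <- read_csv("trainTc.csv")
raw$Fare[is.na(raw$Fare)] <- median(raw$Fare, na.rm = TRUE)   # median for missing Fare
raw$Embarked[is.na(raw$Embarked) | raw$Embarked == ""] <- "S" # default port of embarkation
raw$Age[is.na(raw$Age)] <- -1                                 # flag missing Age
raw$Sex <- as.factor(raw$Sex)                                 # convert to factor
raw$Embarked <- as.factor(raw$Embarked)                       # convert to factor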

4) Modeling: Four modeling algorithms were selected: Random Forest ('rf'), K-Nearest Neighbour ('knn'), Regularized Discriminant Analysis ('rda') and the tree-based CART model ('rpart'). To tune the parameters and evaluate the metrics of the above models, the Classification And REgression Training package, abbreviated CARET1, is used.

Model Tuning: NAs, NULLs and factorization are handled in data pre-processing. Data partitioning is performed through simple splitting, creating balanced splits of 75% and 25% for each class; 75% is used for training and 25% for testing within the caret package. Parameter selection: the train function in CARET is used for evaluation by resampling the training data with a parameterized number of folds and repeats. Here resampling is done using ten folds and three repeats with the 'repeatedcv' method, which repeats the cross-validation. Class probability is set to TRUE to compute class probabilities for the held-out samples. summaryFunction is set to 'twoClassSummary' so that caret computes specificity, sensitivity and the area under the Receiver Operating Characteristic curve. The predictors neither require estimated power transformations nor have zero or negative values, so Box-Cox or Yeo-Johnson preprocessing is not needed; instead, only centering and scaling are used for Random Forest, and 'knnImpute' is used for K-Nearest Neighbour, which finds the distance to the k closest samples using Euclidean distance. Tune grid: gamma is varied from 0.00 to 1.00 in steps of 0.25 and lambda is fixed at 0.75 for Regularized Discriminant Analysis to find the optimal ROC. The ntree parameter was checked over the range 10 to 15 in steps of 1 and found to give the best fit at 13 for Random Forest.

1 http://topepo.github.io/caret/index.html

R-Code for Classification Modeling using CARET

##CARET Model evaluation for Titanic dataset##
##Common Process across all algorithms begin##
library(caret)
library(readr)
set.seed(121)

#Read Titanic data from csv
crtTrain <- read_csv("trainTc.csv", col_names = TRUE, n_max = -1,
                     progress = interactive())

#Data cleaning: 'NA' in Age is replaced with median value
crtTrain$Age[is.na(crtTrain$Age)] <- median(crtTrain$Age, na.rm = TRUE)

#Selecting required variables
crtTrain <- crtTrain[c(#'PassengerId',
                       'Pclass',
                       #'Name',
                       'Sex', 'Age', 'SibSp', 'Parch',
                       #'Ticket',
                       'Fare',
                       #'Cabin',
                       'Embarked', 'Survived')]

#Convert Survived from binary into factor
crtTrain$Survived <- ifelse(crtTrain$Survived == 1, 'yes', 'no')
crtTrain$Survived <- as.factor(crtTrain$Survived)

#Partition 75% to training and the remainder to test
set.seed(221)
inTrain <- createDataPartition(y = crtTrain$Survived, p = .75, list = FALSE)

#Create train and test
tcTrain <- crtTrain[ inTrain, ]
tcTest  <- crtTrain[-inTrain, ]
####Common Process ends####

#####Train function on all 4 algorithms#####

######Random Forest#######
set.seed(301)
rfctrl <- trainControl(method = "repeatedcv", number = 10, repeats = 3,
                       verboseIter = TRUE, classProbs = TRUE,
                       summaryFunction = twoClassSummary)
#install.packages("pROC")
rfFit <- train(Survived ~ ., data = tcTrain, method = "rf", metric = "ROC",
               ntree = 13, preProcess = c("center", "scale"),
               trControl = rfctrl)
rfFit
plot(rfFit)
#Predict
rfPredictTC <- predict(rfFit, newdata = tcTest)
rfProbs <- predict(rfFit, newdata = tcTest, type = "prob")
confusionMatrix(data = rfPredictTC, tcTest$Survived)

######K-Nearest Neighbour########
knnctrl <- trainControl(method = "repeatedcv", number = 10, repeats = 3,
                        verboseIter = TRUE, classProbs = TRUE,
                        summaryFunction = twoClassSummary)
set.seed(301)
#install.packages("pROC")
knnFit <- train(Survived ~ ., data = tcTrain, method = "knn", metric = "ROC",
                preProcess = "knnImpute", tuneLength = 10,
                trControl = knnctrl)
knnFit
plot(knnFit)
#Predict
knnPredictTC <- predict(knnFit, newdata = tcTest)
knnProbs <- predict(knnFit, newdata = tcTest, type = "prob")
confusionMatrix(data = knnPredictTC, tcTest$Survived)

####Regularized Discriminant Analysis#####
mygrid <- data.frame(gamma = (0:4)/4, lambda = 3/4)
rdactrl <- trainControl(method = "repeatedcv", number = 10, repeats = 3,
                        verboseIter = TRUE, classProbs = TRUE,
                        summaryFunction = twoClassSummary)
rdaFit <- train(Survived ~ ., data = tcTrain, method = "rda",
                trControl = rdactrl, metric = "ROC", tuneGrid = mygrid,
                trace = FALSE, maxit = 100)
rdaFit
plot(rdaFit)
#Predict
rdaPredictTC <- predict(rdaFit, newdata = tcTest)
rdaProbs <- predict(rdaFit, newdata = tcTest, type = "prob")
confusionMatrix(data = rdaPredictTC, tcTest$Survived)

####Tree based model CART (rpart)#####
rptctrl <- trainControl(method = "repeatedcv", number = 10, repeats = 3,
                        verboseIter = TRUE, classProbs = TRUE,
                        summaryFunction = twoClassSummary)
rptFit <- train(Survived ~ ., data = tcTrain, method = "rpart",
                trControl = rptctrl, metric = "ROC", tuneLength = 10)
rptFit
plot(rptFit)
#Predict
rptPredictTC <- predict(rptFit, newdata = tcTest)
rptProbs <- predict(rptFit, newdata = tcTest, type = "prob")
confusionMatrix(data = rptPredictTC, tcTest$Survived)
######End of Modeling###########

Performance metrics comparison between the 4 models: After executing the models using the CARET package, the following statistics were collected to evaluate the results and choose the best model. Fig 1 shows the tabulation of the cross-table matrix and F-measure statistics for all four models.

Figure 1. Table: Cross table and F-Measure

It is inferred from Fig 2 that Random Forest has an F-measure of 77.3%, which is higher than that of the other models.

Fig 3 shows the comparison of the other metrics for all four models.

Fig 4 shows the Kappa statistic comparison for all four models.

Fig 5 highlights the Random Forest performance against the other models.

From the metrics collected and consolidated across models using the CARET package, the Random Forest model with tuning parameters ntree = 13 and mtry = 2 is evaluated as the best fit.
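The paper does not show the post-processing used to tabulate F-measure and Kappa; as a hedged illustration, one way they could be derived from caret's confusionMatrix output for the four fits above (rfFit, knnFit, rdaFit, rptFit on the held-out tcTest) is sketched below. The helper function fmeasure is hypothetical and not part of the original code.

# Sketch: deriving F-measure and Kappa for each fitted model (illustrative helper)
fmeasure <- function(fit, testdat) {
  cm   <- confusionMatrix(predict(fit, newdata = testdat), testdat$Survived)
  prec <- cm$byClass["Pos Pred Value"]   # precision
  rec  <- cm$byClass["Sensitivity"]      # recall
  c(F1    = as.numeric(2 * prec * rec / (prec + rec)),
    Kappa = as.numeric(cm$overall["Kappa"]))
}
sapply(list(rf = rfFit, knn = knnFit, rda = rdaFit, rpart = rptFit),
       fmeasure, testdat = tcTest)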

B. Implementation for Regression Modelling

1) Business Understanding: The objective of the research was to quantify property hazards before the time of inspection, using the Liberty Mutual Insurance company dataset. At this stage the key attributes and the dimensionality of the dataset were identified. The caret package in R was chosen for model evaluation and then for selecting the best model for regression analysis.

Figure 2. F-measure comparison

Figure 3. Other metrics comparison

2) Data Understanding: The property hazard dataset was downloaded from the Kaggle website. The dataset is split into two parts, train and test. The train dataset contains the hazard score and anonymized predictor variables; this dataset was used for modeling the algorithms and selecting the best model. The test dataset contains only the anonymized predictor variables. Each row in the train dataset corresponds to a property that was evaluated and given a hazard score. The hazard score attribute is a continuous number that represents the condition of the property as assessed by the inspection committee. The dataset contains 51,000 records.

3) Data Preparation: The train dataset was used for modeling and selecting the best model for regression. This dataset did not contain any NULL or NA values. The dataset was clean and did not require any preparation or transformation for modeling.

Figure 4. Kappa statistic

Figure 5. Other metrics

4) Modeling: Random Forest (rf), Partial Least Squares (pls) and eXtreme Gradient Boosting (xgboost) are the three models used for prediction through regression analysis. The caret package contains all the classification and regression models; it is used for tuning the parameters and comparing the models in R. The train dataset is loaded into R and data partitioning is done by splitting the data into a training set containing 75% of the records and a test set containing the remaining 25%.

Model Tuning: Parameter selection is done through the trainControl function. This function is used for resampling the training data by specifying the tuning parameters; it also controls the computational refinements of the train function. The tuning parameters used are: method (repeatedcv), which specifies the resampling method; number (10), the number of folds or resampling iterations; repeats (3), the number of complete sets of folds to compute; verboseIter (TRUE), which prints a training log; classProbs (FALSE), a parameter used for classification analysis; and summaryFunction (defaultSummary), a function to compute performance metrics across resamples. The train function is used to fit predictive models over different tuning parameters. It fits each model and calculates a resampling-based performance measure. The parameters used for train are: data, the training data; method, which specifies the model used (rf, pls and xgboost); trControl, which specifies how the function acts (the trainControl function is used); tuneLength, specified as 10 for the number of levels; ntree, the number of trees (20); and importance, set to TRUE.

R-Code for Regression Modeling using CARET

##CARET Model evaluation for Liberty Mutual Group dataset##
##Common Process across all algorithms begin##

#Install 'caret', 'randomForest', 'readr', 'pls', 'xgboost'
install.packages("caret")
install.packages("randomForest")
install.packages("readr")
install.packages("xgboost")
install.packages("pls")

#Load the caret, readr, xgboost, pls and randomForest packages
library(caret)
library(xgboost)
library(readr)
library(randomForest)
library(pls)

#Read the first 5000 rows of the Liberty Mutual dataset into R using the readr package
Crttrain <- read_csv("C:/train.csv", n_max = 5000, progress = interactive())

#Partition the data into training and test
intrain <- createDataPartition(y = Crttrain$Hazard, p = .75, list = FALSE)

#Create train and test
tcTrain <- Crttrain[ intrain, ]
tcTest  <- Crttrain[-intrain, ]

#Check the number of rows and columns
nrow(tcTrain)
ncol(tcTest)

#Set seed
set.seed(107)
####Common Process ends#####

###Train function on all 3 algorithms#####

#####Random Forest####
#Tune the model using trainControl, specifying the method as repeatedcv and the number of folds as 10
rfctrl <- trainControl(method = "repeatedcv", repeats = 3, number = 10,
                       verboseIter = TRUE, classProbs = FALSE,
                       summaryFunction = defaultSummary)

#Run the randomForest model
rfFit <- train(Hazard ~ ., data = tcTrain, method = "rf",
               trControl = rfctrl, tuneLength = 10, ntree = 20,
               importance = TRUE)

#Check the output
rfFit

#Plot the model
plot(rfFit)

#Use the predict function to predict from the model
rf <- predict(rfFit, newdata = tcTest)
str(rf)

#Check for problems
rfProbs <- predict(rfFit, newdata = tcTest, type = "raw")
head(rfProbs)

####Partial Least Squares####
#Tune the model using trainControl, specifying the method as repeatedcv and the number of folds as 10
plsctrl <- trainControl(method = "repeatedcv", repeats = 3, number = 10,
                        verboseIter = TRUE, classProbs = FALSE,
                        summaryFunction = defaultSummary)

#Run the PLS model
plsFit <- train(Hazard ~ ., data = tcTrain, method = "pls",
                trControl = plsctrl, tuneLength = 10, ntree = 20,
                importance = TRUE)

#Check the output
plsFit
plot(plsFit)

#Use the predict function to predict from the model
pls <- predict(plsFit, newdata = tcTest)
str(pls)

#Check for problems
plsProbs <- predict(plsFit, newdata = tcTest, type = "raw")
head(plsProbs)

####eXtreme Gradient Boosting####
#Tune the model using trainControl, specifying the method as repeatedcv and the number of folds as 10
xgctrl <- trainControl(method = "repeatedcv", repeats = 3, number = 10,
                       verboseIter = TRUE, classProbs = FALSE,
                       summaryFunction = defaultSummary)

#Run the xgboost model
xgFit <- train(Hazard ~ ., data = tcTrain, method = "xgbTree",
               trControl = xgctrl, tuneLength = 10, ntree = 20,
               importance = TRUE)

#Check the output
xgFit
plot(xgFit)

#Use the predict function to predict from the model
xgboost <- predict(xgFit, newdata = tcTest)
str(xgboost)

#Check for problems
xgProbs <- predict(xgFit, newdata = tcTest, type = "raw")
head(xgProbs)
######End of Modeling###########

Performance metrics comparison between the 3 models: The output of these models reports four metrics: RMSE, R-squared, RMSE SD and R-squared SD. RMSE and R-squared are considered for comparing the models. RMSE, which stands for Root Mean Square Error, is a standard metric for reporting the prediction error of a continuous variable. R-squared is used to examine how well the model fits the training data; it tells us what percentage of the variance in the data is explained by the model. The model with the lowest RMSE is considered optimal. Fig 6 shows the comparison of metrics for all three models. RF and xgboost have the lowest RMSE values, and these values are very close, at 3.874635 and 3.872974 respectively. This allows us to ensemble the RF and xgboost models and use the ensemble for regression analysis. The figure below shows the comparison between the 3 models.

Figure 6. Comparison of metrics
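The script used to extract these summary numbers is not shown in the paper; a minimal sketch of one way to compute hold-out RMSE and R-squared for the three fits above (rfFit, plsFit, xgFit on tcTest), and to collect caret's own cross-validated figures, is given below. The helper holdout_metrics is hypothetical and for illustration only; caret's reported values come from resampling rather than this hold-out split.

# Sketch: hold-out RMSE and R-squared for each regression fit (illustrative)
holdout_metrics <- function(fit, testdat) {
  pred <- predict(fit, newdata = testdat)
  c(RMSE     = sqrt(mean((testdat$Hazard - pred)^2)),
    Rsquared = cor(testdat$Hazard, pred)^2)
}
sapply(list(rf = rfFit, pls = plsFit, xgboost = xgFit),
       holdout_metrics, testdat = tcTest)

# Cross-validated metrics reported by caret can be gathered with resamples()
results <- resamples(list(rf = rfFit, pls = plsFit, xgboost = xgFit))
summary(results)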

IV. EVALUATION

A. Classification Evaluation

The Random Forest model with parameters ntree = 13 and mtry = 2 is executed against the test dataset to predict the results.

R-Code for Random Forest output

###Survival prediction for Titanic data set using Random Forest######
library(randomForest)
library(readr)
set.seed(1001)

#Read Titanic data from csv
tcTrain <- read_csv("trainTc.csv", col_names = TRUE, n_max = -1,
                    progress = interactive())
tcTest <- read_csv("test.csv", col_names = TRUE, n_max = -1,
                   progress = interactive())

#Manual function to preprocess data
tcPredictors <- function(inputdat) {
  predictors <- c("Pclass", "Age", "Sex", "Parch", "SibSp", "Fare", "Embarked")
  preds <- inputdat[, predictors]
  preds$Fare[is.na(preds$Fare)] <- median(preds$Fare, na.rm = TRUE)
  preds$Embarked[preds$Embarked == ""] <- "S"
  preds$Age[is.na(preds$Age)] <- -1
  preds$Sex <- as.factor(preds$Sex)
  preds$Embarked <- as.factor(preds$Embarked)
  return(preds)
}

#Random Forest algorithm
rf <- randomForest(tcPredictors(tcTrain), as.factor(tcTrain$Survived),
                   ntree = 13, mtry = 2, importance = TRUE)

#Write out result
outResult <- data.frame(PassengerId = tcTest$PassengerId)
outResult$Survived <- predict(rf, tcPredictors(tcTest))

#Write CSV file for the result
write.csv(outResult, file = "TitanicResult.csv", row.names = FALSE)
###End of code###

Fig 7 shows a screenshot of the output written in CSV format by the Random Forest model.

Figure 7. Predicted Output screen shot

B. Regression Evaluation

Based on the results of the model comparison, we selected the rf and xgboost models as an ensemble for regression analysis. The tuning parameters were chosen from the results of the model comparison. For applying the regression algorithms, the train and test datasets are used and the predictor dataset is formed from the train data. The tuning parameters are as follows: for random forest, ntree = 20 and sampsize = 10000; for xgboost, nrounds = 150, eta = 0.3 and max_depth = 1, with the xgboost objective specified as linear regression.

R-Code for ensemble of RF and Xgboost models

### Predicting transformed count of Hazards ####

#Load required packages
require(xgboost)
library(caret)
library(randomForest)
library(readr)

#Load raw data
train = read_csv('C:/train.csv')
test = read_csv('C:/test.csv')

#Create the response variable
hazard = train$Hazard

#Create the predictor dataset and encode categorical variables using the caret library
htrain = train[, -c(1, 2)]
htest = test[, -c(1)]
dummy <- dummyVars(~ ., data = htrain)
htrain = predict(dummy, newdata = htrain)
htest = predict(dummy, newdata = htest)

#Run Random Forest
set.seed(1234)
rf <- randomForest(htrain, hazard, ntree = 20, imp = TRUE,
                   sampsize = 10000, do.trace = TRUE)
predict_rf <- predict(rf, htest)

#Set necessary parameters and use parallel threads
parameters <- list("objective" = "reg:linear", "nthread" = 8, "verbose" = 0)

#Run the xgboost model
xgb.fit = xgboost(param = parameters, data = htrain, label = hazard,
                  nrounds = 150, eta = .3, max_depth = 1,
                  min_child_weight = 5, scale_pos_weight = 1.0,
                  subsample = 0.8)
predict_xgboost <- predict(xgb.fit, htest)

#Predict Hazard for the test set as the average of the two models
predict <- data.frame(Id = test$Id)
predict$Hazard <- (predict_rf + predict_xgboost) / 2

#Write predict output as csv
write_csv(predict, "predict.csv")

Figure 8. Output of the regression analysis

Figure 9. Scree plot of the regression analysis

V. CONCLUSION

The CARET package in R has helped in model evaluation and in comparing the performance of different models. Multiple classification and regression models were built and evaluated with respect to performance metrics using the CARET package. Four classification models were used, namely KNN, rpart, randomForest and regularized discriminant analysis (rda), with the CARET package used to select the best model. RandomForest performed best with respect to F-measure, and hence this model was used to predict what type of people survived the Titanic shipwreck. Three regression models were used, namely partial least squares (pls), randomForest and extreme gradient boosting (xgboost), with the CARET package used to select the model for regression analysis. An ensemble of randomForest and xgboost was used to predict the hazards for each property. Future work proposes an ensemble of classification models and its evaluation. We also propose evaluating and comparing other models for classification and regression analysis.

REFERENCES

[1] Ana Isabel Rojao Lourenco Azevedo. "KDD, SEMMA and CRISP-DM: a parallel overview". In: (2008).

[2] Cheng-Mei Chen et al. "Prediction of survival in patients with liver cancer using artificial neural networks and classification and regression trees". In: Natural Computation (ICNC), 2011 Seventh International Conference on. Vol. 2. IEEE. 2011, pp. 811–815.

[3] Hui-Ling Chen et al. "A novel bankruptcy prediction model based on an adaptive fuzzy k-nearest neighbor method". In: Knowledge-Based Systems 24.8 (2011), pp. 1348–1359.

[4] Jerome H Friedman. "Regularized discriminant analysis". In: Journal of the American Statistical Association 84.405 (1989), pp. 165–175.

[5] Lijia Guo. "Applying data mining techniques in property/casualty insurance". In: CAS 2003 Winter Forum, Data Management, Quality, and Technology Call Papers and Ratemaking Discussion Papers, CAS. Citeseer. 2003.

[6] Max Kuhn. "Building predictive models in R using the caret package". In: Journal of Statistical Software 28.5 (2008), pp. 1–26.

[7] Andy Liaw and Matthew Wiener. "Classification and regression by randomForest". In: R News 2.3 (2002), pp. 18–22.

[8] Mark Menor and Kyungim Baek. "Relevance units machine for classification". In: Biomedical Engineering and Informatics (BMEI), 2011 4th International Conference on. Vol. 4. IEEE. 2011, pp. 2295–2299.

[9] Bjorn-Helge Mevik and Ron Wehrens. "The pls package: principal component and partial least squares regression in R". In: Journal of Statistical Software 18.2 (2007), pp. 1–24.

[10] Carolin Strobl, James Malley, and Gerhard Tutz. "An introduction to recursive partitioning: rationale, application, and characteristics of classification and regression trees, bagging, and random forests." In: Psychological Methods 14.4 (2009), p. 323.

[11] Stephen Tyree et al. "Parallel boosted regression trees for web search ranking". In: Proceedings of the 20th International Conference on World Wide Web. ACM. 2011, pp. 387–396.