Machine Learning (ML) Using a Classification Algorithm in R
Introduction
Technology is becoming more important in our daily lives every second. To keep up with the pace of these technological changes, scientists are increasingly studying different algorithms to make things easier and meet consumers' demands. These technologies are commonly associated with artificial intelligence, machine learning, deep learning, and neural networks.
Machine learning (ML) and artificial intelligence (AI) are closely related concepts that are often used interchangeably, but they are not the same thing.
Artificial intelligence refers to the ability of machines to mimic human cognitive functions such as learning, reasoning, and problem-solving (IBM). AI encompasses a wide range of techniques and approaches, including rule-based systems, expert systems, and machine learning. It is used to predict, automate, and optimize tasks that humans have historically done, such as speech and facial recognition, decision making, and translation.
Machine learning is a specific type of AI that involves the development of algorithms that can learn from data and make predictions or decisions based on that data (Mitchell et al., 2007). In other words, machine learning algorithms are designed to learn patterns and relationships in data without being explicitly programmed to do so.
Machine Learning Algorithms
There are several types of machine learning algorithms, including supervised learning, unsupervised learning, and reinforcement learning. Supervised learning uses labeled training data to train a model that makes predictions on new, unseen data. Unsupervised learning finds patterns and relationships in data without the use of labeled training data. Reinforcement learning trains a model to make decisions based on feedback in the form of rewards or punishments.
Classification and regression are two of the most common types of supervised learning algorithms in machine learning and artificial intelligence. Classification is a type of supervised learning algorithm used for predicting discrete or categorical outcomes. It involves mapping input variables to discrete output categories or labels. The objective of classification is to build a model that accurately assigns new data points to the correct class or label.
On the other hand, regression is a type of supervised learning algorithm used for predicting continuous or numeric outcomes. It involves mapping input variables to continuous output values. The objective of regression is to build a model that accurately predicts the value of the dependent variable based on the values of the independent variables.
In this tutorial we are going to use a classification algorithm to predict penguin species from measurements such as flipper length, bill dimensions, and sex. We will use the penguins data from the palmerpenguins package (Horst et al., 2020). It includes measurements for penguin species, island in Palmer Archipelago, size (flipper length, body mass, bill dimensions), and sex.
First, we will load the packages we are going to use in this tutorial. I will use the require() function, but you may also use library(), depending on your preference.
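A minimal sketch of that loading step (exactly which packages are loaded is partly an assumption, inferred from the functions used later in this tutorial):

require(tidyverse)       # dplyr, tidyr, ggplot2, tibble
require(palmerpenguins)  # the penguins dataset
require(rsample)         # data splitting
require(parsnip)         # model specification and fitting
require(yardstick)       # accuracy metrics and ROC curves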
Data
After loading the packages, we will load the penguins data (Horst et al., 2020), remove all the missing values in the dataset, and save the result as penguin.data. The dataset consists of 333 rows and 8 columns.
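The exact code is not shown here, but the step might look like this, using drop_na() from tidyr to remove rows with missing values:

penguin.data = penguins %>%
  drop_na()

penguin.data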
# A tibble: 333 × 8
species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
<fct> <fct> <dbl> <dbl> <int> <int>
1 Adelie Torgersen 39.1 18.7 181 3750
2 Adelie Torgersen 39.5 17.4 186 3800
3 Adelie Torgersen 40.3 18 195 3250
4 Adelie Torgersen 36.7 19.3 193 3450
5 Adelie Torgersen 39.3 20.6 190 3650
6 Adelie Torgersen 38.9 17.8 181 3625
7 Adelie Torgersen 39.2 19.6 195 4675
8 Adelie Torgersen 41.1 17.6 182 3200
9 Adelie Torgersen 38.6 21.2 191 3800
10 Adelie Torgersen 34.6 21.1 198 4400
# ℹ 323 more rows
# ℹ 2 more variables: sex <fct>, year <int>
Then, we will look at the internal structure of the dataset using the glimpse() function of the dplyr package. Our dataset consists of 8 variables (columns): 3 factors (species, island and sex), 2 numeric or double variables (bill_length_mm, bill_depth_mm) and 3 integers (flipper_length_mm, body_mass_g, year). The variable species has 3 levels: Adelie, Gentoo and Chinstrap.
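The call itself is a one-liner:

glimpse(penguin.data)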
Rows: 333
Columns: 8
$ species <fct> Adelie, Adelie, Adelie, Adelie, Adelie, Adelie, Adel…
$ island <fct> Torgersen, Torgersen, Torgersen, Torgersen, Torgerse…
$ bill_length_mm <dbl> 39.1, 39.5, 40.3, 36.7, 39.3, 38.9, 39.2, 41.1, 38.6…
$ bill_depth_mm <dbl> 18.7, 17.4, 18.0, 19.3, 20.6, 17.8, 19.6, 17.6, 21.2…
$ flipper_length_mm <int> 181, 186, 195, 193, 190, 181, 195, 182, 191, 198, 18…
$ body_mass_g <int> 3750, 3800, 3250, 3450, 3650, 3625, 4675, 3200, 3800…
$ sex <fct> male, female, female, female, male, female, male, fe…
$ year <int> 2007, 2007, 2007, 2007, 2007, 2007, 2007, 2007, 2007…
In this tutorial we will use only three variables from the penguins data: species, bill_length_mm and bill_depth_mm. We will run the select() function of the dplyr package (Wickham et al., 2022) to select our variables of interest.
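A sketch of that selection step (assuming the result overwrites penguin.data, as the later steps suggest):

penguin.data = penguin.data %>%
  select(species, bill_length_mm, bill_depth_mm)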
Before we apply ML algorithms, we will first check whether the response variable (in this case, species) has distinct features (such as size) that will help produce more accurate predictions. When the classes overlap heavily in feature space, the model may confuse them during prediction.
We will use a scatter plot of bill_length_mm against bill_depth_mm to see the distribution of each species in the dataset. There are distinct differences between the sizes of the three penguin species (Figure 1). Each species has its own size range; therefore, the dataset fits our analysis well.
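One way to draw that scatter plot with ggplot2 (the exact plotting code is an assumption; the axis labels are illustrative):

ggplot(data = penguin.data,
       aes(x = bill_length_mm, y = bill_depth_mm, color = species)) +
  geom_point() +
  labs(x = "Bill length (mm)", y = "Bill depth (mm)")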
Classification algorithm
As explained earlier, supervised learning involves training a model on a dataset before making predictions. Since our output is categorical (the species type), we will use a classification algorithm in our analysis.
Data Splitting
A classification algorithm, as a form of supervised learning, needs two datasets: a training set and a testing set. Our penguin.data will be split into these two groups: the training dataset will have a proportion of 70%, and the remaining 30% will be the test dataset. In total, our dataset has 333 observations, of which 233 samples will be used to train the model and 100 samples to test its accuracy.
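A sketch of the split, assuming initial_split() from the rsample package (the seed is a hypothetical choice; any fixed seed makes the split reproducible):

set.seed(1234)  # hypothetical seed for reproducibility
penguin.split = initial_split(penguin.data, prop = 0.7)

train.set = training(penguin.split)  # 70% for model training
test.set  = testing(penguin.split)   # 30% for accuracy testing

penguin.split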
<Training/Testing/Total>
<233/100/333>
The training data below, with 233 samples, will be used to train the model:
# A tibble: 233 × 3
species bill_length_mm bill_depth_mm
<fct> <dbl> <dbl>
1 Adelie 36.2 17.2
2 Gentoo 48.7 15.1
3 Chinstrap 49.8 17.3
4 Chinstrap 54.2 20.8
5 Gentoo 44.9 13.3
6 Chinstrap 50.7 19.7
7 Adelie 38.6 17
8 Gentoo 46.2 14.9
9 Gentoo 50.1 15
10 Adelie 39.8 19.1
# ℹ 223 more rows
The testing data, with 100 observations, will be used to test the accuracy of the model:
# A tibble: 100 × 3
species bill_length_mm bill_depth_mm
<fct> <dbl> <dbl>
1 Adelie 40.3 18
2 Adelie 38.9 17.8
3 Adelie 38.6 21.2
4 Adelie 42.5 20.7
5 Adelie 37.8 18.3
6 Adelie 37.7 18.7
7 Adelie 35.9 19.2
8 Adelie 38.8 17.2
9 Adelie 35.3 18.9
10 Adelie 40.5 17.9
# ℹ 90 more rows
Model training
We will use the train.set data to train the model. Run rand_forest() from the parsnip package (Kuhn and Vaughan, 2022), set the engine to ranger and the mode to classification, then fit the training set with species as the response. The summary of the trained model shows an out-of-bag (OOB) prediction error (Brier score) of about 0.04, equivalent to 4%.
# Specify a random forest, use the ranger engine in classification
# mode, and fit it with species as the response
mod = rand_forest() %>%
  set_engine(engine = "ranger") %>%
  set_mode(mode = "classification") %>%
  fit(species ~ ., data = train.set)

mod
parsnip model object
Ranger result
Call:
ranger::ranger(x = maybe_data_frame(x), y = y, num.threads = 1, verbose = FALSE, seed = sample.int(10^5, 1), probability = TRUE)
Type: Probability estimation
Number of trees: 500
Sample size: 233
Number of independent variables: 2
Mtry: 1
Target node size: 10
Variable importance mode: none
Splitrule: gini
OOB prediction error (Brier s.): 0.03958214
Predictions
Then, the test.set data will be used to predict the species type, given the bill length and bill depth. The predicted class and the test dataset will be bound together to see whether the predicted species matches the actual one.
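A sketch of that step, assuming the fitted model mod and the test set test.set from above:

mod %>%
  predict(test.set) %>%  # predicted class, returned as .pred_class
  bind_cols(test.set)    # attach the actual species and measurements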
# A tibble: 100 × 4
.pred_class species bill_length_mm bill_depth_mm
<fct> <fct> <dbl> <dbl>
1 Adelie Adelie 40.3 18
2 Adelie Adelie 38.9 17.8
3 Adelie Adelie 38.6 21.2
4 Adelie Adelie 42.5 20.7
5 Adelie Adelie 37.8 18.3
6 Adelie Adelie 37.7 18.7
7 Adelie Adelie 35.9 19.2
8 Adelie Adelie 38.8 17.2
9 Adelie Adelie 35.3 18.9
10 Adelie Adelie 40.5 17.9
# ℹ 90 more rows
Accuracy testing
Having made the predictions, we now need to test the accuracy of the model. A common rule of thumb is to consider a model acceptable when its accuracy is at least 80%. The obtained accuracy of our model is 0.94 (94%) and the kappa (kap) is 0.905 (about 91%). Our model can therefore be used to predict the species type, provided you have the bill length and bill depth of the penguins.
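One way to compute these metrics with metrics() from yardstick (a sketch; it assumes the predictions are bound to the test set as above):

mod %>%
  predict(test.set) %>%
  bind_cols(test.set) %>%
  metrics(truth = species, estimate = .pred_class)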
# A tibble: 2 × 3
.metric .estimator .estimate
<chr> <chr> <dbl>
1 accuracy multiclass 0.94
2 kap multiclass 0.905
Now, we need to see the accuracy for each predicted species versus the actual type. We will use conf_mat() from the yardstick package (Kuhn et al., 2022). The result indicates that 46 Adelie were correctly predicted, while 1 Chinstrap was incorrectly predicted as Adelie; 17 Chinstrap were correctly predicted, while 2 Adelie and 2 Gentoo were incorrectly predicted as Chinstrap; and 31 Gentoo were correctly predicted, while 1 Chinstrap was incorrectly predicted as Gentoo.
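A sketch of the confusion-matrix call, again assuming the bound predictions from above:

mod %>%
  predict(test.set) %>%
  bind_cols(test.set) %>%
  conf_mat(truth = species, estimate = .pred_class)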
           Truth
Prediction  Adelie Chinstrap Gentoo
  Adelie        46         1      0
  Chinstrap      2        17      2
  Gentoo         0         1     31
Then, we need to calculate the probability of each observation belonging to each species. We will use the predict() function of the stats package of R (R Core Team, 2022) with type = "prob".
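Setting type = "prob" returns one probability column per species; a sketch:

mod %>%
  predict(test.set, type = "prob") %>%  # .pred_Adelie, .pred_Chinstrap, .pred_Gentoo
  bind_cols(test.set)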
# A tibble: 100 × 6
.pred_Adelie .pred_Chinstrap .pred_Gentoo species bill_length_mm
<dbl> <dbl> <dbl> <fct> <dbl>
1 1 0 0 Adelie 40.3
2 0.997 0.00313 0 Adelie 38.9
3 0.996 0.00396 0 Adelie 38.6
4 0.796 0.182 0.0223 Adelie 42.5
5 0.999 0.00075 0 Adelie 37.8
6 0.989 0.0107 0 Adelie 37.7
7 0.998 0.00156 0 Adelie 35.9
8 0.996 0.00327 0.000286 Adelie 38.8
9 0.997 0.003 0 Adelie 35.3
10 1 0 0 Adelie 40.5
# ℹ 90 more rows
# ℹ 1 more variable: bill_depth_mm <dbl>
ROC Curve
The probability results will then be used to plot the ROC curve (Figure 2). The roc_curve() function of the yardstick package (Kuhn et al., 2022) will be supplied with species, .pred_Adelie, .pred_Chinstrap, and .pred_Gentoo. The autoplot() function (a ggplot2 generic whose method for ROC-curve objects is provided by yardstick) will then be applied to draw the curves. The curves show that the Adelie species fits the model better than the Gentoo and Chinstrap species (Figure 2).
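A sketch of the ROC-curve code under those assumptions:

mod %>%
  predict(test.set, type = "prob") %>%
  bind_cols(test.set) %>%
  roc_curve(truth = species, .pred_Adelie, .pred_Chinstrap, .pred_Gentoo) %>%
  autoplot()  # Figure 2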
Model validation for future data
To check whether the model works on other newly collected data, we will validate it using a small dataset we create ourselves, consisting of bill length and bill depth. The data will be fed to the model to test whether it predicts the species type. The newly created data, named new.penguin, contains three observations:
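A sketch of creating that data frame (the values match the output below):

new.penguin = tibble(bill_length_mm = c(35, 47.5, 70),
                     bill_depth_mm  = c(13, 17, 18))

new.penguin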
# A tibble: 3 × 2
bill_length_mm bill_depth_mm
<dbl> <dbl>
1 35 13
2 47.5 17
3 70 18
Then, the model will be applied to see whether it predicts the species type. We will again use the predict() function.
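The call mirrors the earlier prediction step:

mod %>%
  predict(new.penguin) %>%
  bind_cols(new.penguin)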
Hooray!! The model gives us the predicted species: Adelie and Chinstrap.
We will also create another dataset, named aa, with 10 observations using the rnorm() function of the stats package (R Core Team, 2022). We will again predict the species type for these data using our model. Once more the model works and gives us predictions!!
# Simulate 10 penguins with normally distributed bill depths and lengths
aa = tibble(bill_depth_mm  = rnorm(n = 10, mean = 15, sd = 3),
            bill_length_mm = rnorm(n = 10, mean = 50, sd = 10))

# Predict the species and attach the predictions to the simulated data
mod %>%
  predict(aa) %>%
  bind_cols(aa)
# A tibble: 10 × 3
.pred_class bill_depth_mm bill_length_mm
<fct> <dbl> <dbl>
1 Adelie 14.3 30.4
2 Gentoo 13.9 62.6
3 Gentoo 12.9 71.0
4 Chinstrap 18.0 50.5
5 Gentoo 13.4 53.5
6 Chinstrap 18.7 49.9
7 Gentoo 14.9 46.5
8 Adelie 17.2 30.2
9 Gentoo 14.7 53.5
10 Gentoo 10.8 48.7
Summary
All in all, the success of machine learning and artificial intelligence depends on the quality of the data, the complexity of the problem, and the choice of appropriate algorithms and techniques (Han and Pei, 2011; Witten et al., 2005).
The quality of the data is one of the most important factors for the success of machine learning. The data should be accurate, complete, and representative of the problem domain (Han and Pei, 2011; Witten et al., 2005). In addition, the data should be properly labeled and preprocessed to ensure that the machine learning algorithms can effectively learn from it.
The complexity of the problem is also a key factor in determining the success of machine learning (Gomez-Cabrero et al., 2014). Some problems are inherently more complex than others, and require more sophisticated algorithms and techniques to solve. For example, image recognition and natural language processing are typically more complex than simple regression problems.
Finally, the choice of the appropriate algorithms and techniques is critical for the success of machine learning (Gomez-Cabrero et al., 2014; Sarker, 2021). Different algorithms and techniques are suited for different types of problems, and the choice of the appropriate one will depend on the specific problem and the available data. Additionally, the parameters and hyperparameters of the algorithms need to be properly tuned to ensure that the models are optimized for the problem at hand.
Don’t miss our next tutorial on Machine Learning (ML) using a regression algorithm!!!