How do you split the data into training and test data?


  R Interview Questions

Say you have a dataset such as iris. To evaluate your ML models, you typically divide the data into a training set and a test set. Here are some ways to do it in R.

From Scratch

You can do this from scratch without using libraries.

# A simple function that takes in data and returns the row indices
# of a random training sample of the given proportion
train_test_split = function (data, perc_split) {
  # Get the size of the dataset
  data_size = nrow(data)
  # Calculate the sample size required
  sample_size = round(data_size * perc_split)
  # Draw the sample indices without replacement
  # (named sample_idx so it does not shadow the sample() function)
  sample_idx = sample(seq_len(data_size),
                      size = sample_size,
                      replace = FALSE)
  # Return the sorted row indices
  return (sort(sample_idx))
}

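Now, you can use this function to split the data into training and test data sets. Draw the indices once and reuse them: every call to train_test_split() takes a fresh random sample, so calling it separately for each set would let rows appear in both sets or in neither.

Training dataset

> set.seed(100)
> train_idx = train_test_split(iris, 0.75)
> train = iris[train_idx,]

Test dataset

> test = iris[-train_idx,]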
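A quick sanity check on any split is that every row lands in exactly one of the two sets. The following is a minimal base-R sketch, assuming the indices are drawn once and reused; the seed value and the variable names are illustrative choices, not part of the original example.

```r
set.seed(100)                       # arbitrary seed, only for reproducibility

n   <- nrow(iris)
# Draw the training indices once, then reuse them for both sets
idx <- sort(sample(seq_len(n), size = round(n * 0.75), replace = FALSE))

train <- iris[idx, ]
test  <- iris[-idx, ]

# The two sets together should cover every row ...
covers_all  <- nrow(train) + nrow(test) == n
# ... and no row should appear in both
no_overlap  <- length(intersect(rownames(train), rownames(test))) == 0

print(c(covers_all = covers_all, no_overlap = no_overlap))
```

If either check comes back FALSE, the usual cause is that the random sample was drawn separately for the training and the test set.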

Using libraries

Several packages can do this split for you.

caTools package

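library(caTools)
set.seed(100)
# sample.split() returns a logical vector: TRUE marks training rows, FALSE marks test rows.
# It preserves the relative ratios of the labels it is given, so split on the class column.
is_train = sample.split(iris$Species, SplitRatio = 3/4)
# Training data set
train = iris[is_train,]
# Test data set
test = iris[!is_train,]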

caret package

library(caret)
set.seed(100)
# Get the training row indices; for a factor outcome the sampling is done within each class
trainIndex <- createDataPartition(iris$Species, p = .8, 
                                  list = FALSE, 
                                  times = 1)
# Generate the training and test data
train = iris[trainIndex,]
test = iris[-trainIndex,]
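To see what a class-balanced (stratified) split means, here is a minimal base-R sketch. It is not caret's implementation: `stratified_idx` is a hypothetical helper written for illustration, and it simply samples the same fraction of rows within each class.

```r
set.seed(100)                       # arbitrary seed, only for reproducibility

# Hypothetical helper: sample a fraction p of the row positions within each class of y
stratified_idx <- function(y, p) {
  idx_by_class <- lapply(split(seq_along(y), y), function(rows) {
    # Index into rows via sample.int() so a class with a single row is handled correctly
    rows[sample.int(length(rows), size = round(length(rows) * p))]
  })
  sort(unlist(idx_by_class, use.names = FALSE))
}

train_idx <- stratified_idx(iris$Species, 0.8)
train <- iris[train_idx, ]
test  <- iris[-train_idx, ]

# iris has 50 rows per species, so this gives 40 training and 10 test rows per class
print(table(train$Species))
print(table(test$Species))
```

Sampling within each class keeps the class proportions of the full dataset in both sets, which matters most when some classes are rare.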
