How to split data into training and test sets
Say you have a dataset such as iris. To evaluate your ML models, you typically divide the data into a training set and a test set. Here are some ways to do it in R.
From Scratch
You can do this from scratch without using libraries.
# A simple function that takes a data frame and a split fraction,
# and returns the row indices of a random training sample
train_test_split = function(data, perc_split) {
  # Get the size of the dataset
  data_size = nrow(data)
  # Calculate the sample size required
  sample_size = round(data_size * perc_split)
  # Draw the sample indices without replacement
  sample_idx = sample(1:data_size,
                      size = sample_size,
                      replace = FALSE)
  # Sort the index vector before returning it
  return(sort(sample_idx))
}
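Now you can use this function to split the data into training and test data sets. Call it once and reuse the indices: each call draws a new random sample, so calling it separately for the training and test sets would produce sets that overlap.
Split indices
> set.seed(100)
> train_idx = train_test_split(iris, 0.75)
Training dataset
> train = iris[train_idx,]
Test dataset
> test = iris[-train_idx,]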
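As a quick sanity check, an index-based split should leave no row in both sets and should cover every row of the original data. The sketch below verifies this with base R only. The helper name split_indices and the seed value are illustrative assumptions, not part of the function above.

```r
# Hypothetical helper: draw sorted row indices for the training set (base R only)
split_indices <- function(data, perc_split) {
  n <- nrow(data)
  sort(sample(seq_len(n), size = round(n * perc_split), replace = FALSE))
}

set.seed(100)                         # arbitrary seed, for reproducibility
train_idx <- split_indices(iris, 0.75)
train <- iris[train_idx, ]
test  <- iris[-train_idx, ]

# The two sets should share no rows and together cover the whole dataset
shared_rows <- intersect(rownames(train), rownames(test))
length(shared_rows)                   # number of rows present in both sets
nrow(train) + nrow(test) == nrow(iris)
```

Because the same index vector is used with both positive and negative indexing, the test set is exactly the complement of the training set.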
Using libraries
Several packages can do this split for you.
caTools package
library(caTools)
set.seed(100)
# sample.split returns a logical vector: TRUE marks rows for the training set
in_train = sample.split(iris$Sepal.Length, SplitRatio = 3/4)
# Training data set
train = iris[in_train == TRUE,]
# Test data set
test = iris[in_train == FALSE,]
caret package
library(caret)
set.seed(100)
# Get the training row indices, sampled within each Species level
trainIndex <- createDataPartition(iris$Species, p = 0.8,
                                  list = FALSE,
                                  times = 1)
# Generate the training and test data
train = iris[trainIndex,]
test = iris[-trainIndex,]
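To see what a class-balanced (stratified) split means, here is a minimal base-R sketch that samples the same fraction of rows from each species. It illustrates the idea only and is not caret's actual implementation. The function name stratified_indices is a hypothetical helper introduced for this example.

```r
# Sketch of a stratified split in base R (an illustration, not caret's code)
stratified_indices <- function(labels, p) {
  # Group row positions by class label
  by_class <- split(seq_along(labels), labels)
  # Sample a fraction p of the rows within each class
  picked <- lapply(by_class, function(rows) {
    rows[sample.int(length(rows), size = round(length(rows) * p))]
  })
  sort(unlist(picked, use.names = FALSE))
}

set.seed(100)                         # arbitrary seed, for reproducibility
train_idx <- stratified_indices(iris$Species, 0.8)
train <- iris[train_idx, ]
test  <- iris[-train_idx, ]

# Class counts in each set; every species contributes the same share
table(train$Species)
table(test$Species)
```

Since iris has 50 rows per species, a fraction of 0.8 puts 40 rows of each species in the training set and 10 in the test set.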