# Managing Multiple Learning Curves

``````library(tibble)
library(purrr)   # map()
library(tidyr)
library(dplyr)
library(broom)
library(ggplot2)
library(RColorBrewer)
``````

# Introduction

I’ve been a big fan of Learning Curves and Managing Multiple Models. This is my first crude exploration of combining the two. This post briefly covers both topics independently before going into my combined attempt.

# Learning curves

The power of learning curves is very well taught in the Machine Learning class by Andrew Ng. Some other posts I enjoyed are Why Big Data? Learning Curves and Learning from Learning Curves by Bob Horton, Senior Data Scientist, Microsoft, and a post by ‘Marc in the box’.

I’ve used variations of learning curves to view bias-variance balance (more data or more complexity), to plot model accuracy versus computing time, and even to predict the number of training samples my small netbook could apply a Random Forest to in 6 hours. If you don’t know learning curves already you’ll see an example in this post, but I do advise you to check out any of the above mentioned links.

# Managing Multiple Models

For a full lecture on Managing Multiple Models see Hadley’s talk at Edinburgh. The method boils down to 7 main functions from 4 packages:

• dplyr – group_by()
• tidyr – nest() and unnest()
• purrr – map()
• broom – tidy(), glance(), augment()

In the following example you can see how you can use the method to generate a summary data frame on linear models fitted on the mpg and weight columns of mtcars split by the number of cylinders.

``````mtcars %>%
group_by(cyl)  %>%
nest()  %>%
mutate(model = map(data, function(x) lm(data = x, mpg ~ wt)))  %>%
mutate(tidy = map(model, tidy))  %>%
unnest(tidy)
``````
```## Source: local data frame [6 x 6]
##
##     cyl        term  estimate std.error statistic        p.value
##   (dbl)       (chr)     (dbl)     (dbl)     (dbl)          (dbl)
## 1     6 (Intercept) 28.408845 4.1843688  6.789278 0.001054844191
## 2     6          wt -2.780106 1.3349173 -2.082605 0.091757660172
## 3     4 (Intercept) 39.571196 4.3465820  9.103980 0.000007771511
## 4     4          wt -5.647025 1.8501185 -3.052251 0.013742781987
## 5     8 (Intercept) 23.868029 3.0054619  7.941551 0.000004052705
## 6     8          wt -2.192438 0.7392393 -2.965803 0.011792809757
```

One of the best aspects of this method is that data and models stick together in an organised way. Let’s apply several learning models on the same data sets.

``````mtcars %>%
group_by(cyl)  %>%
nest()  %>%
mutate(mod_1     = map(data, function(x) lm(data = x, mpg ~ wt)),
mod_2     = map(data, function(x) lm(data = x, mpg ~ wt*hp))) %>%
mutate(glance_1 = map(mod_1, broom::glance),
glance_2 = map(mod_2, broom::glance),
rsq_1    = map_dbl(glance_1, "r.squared"),
rsq_2    = map_dbl(glance_2, "r.squared"))
``````
```## Source: local data frame [3 x 8]
##
##     cyl             data   mod_1   mod_2            glance_1
##   (dbl)            (chr)   (chr)   (chr)               (chr)
## 1     6  <tbl_df [7,10]> <S3:lm> <S3:lm> <data.frame [1,11]>
## 2     4 <tbl_df [11,10]> <S3:lm> <S3:lm> <data.frame [1,11]>
## 3     8 <tbl_df [14,10]> <S3:lm> <S3:lm> <data.frame [1,11]>
## Variables not shown: glance_2 (chr), rsq_1 (dbl), rsq_2 (dbl)
```

We have clearly organised sets of data and models on those sets of data.

# Managing Multiple Learning Curves

The reason I thought to combine the two methods is that I’d be able to create a neatly nested data frame that contains small subsets of the complete data and columns containing the summaries.

I’ll reuse the data simulation of Bob Horton’s post. Where y is 100 plus some noise and plus 10 if X1 equals X2. Where X1, X2 and X3 are randomly sampled letters:

``````sim_data <- function(N, noise_level=1){

X1 <- sample(LETTERS[1:10], N, replace=TRUE)
X2 <- sample(LETTERS[1:10], N, replace=TRUE)
X3 <- sample(LETTERS[1:10], N, replace=TRUE)

y <- 100 + ifelse(X1 == X2, 10, 0) + rnorm(N, sd=noise_level)

data_frame(X1, X2, X3, y)
}
``````
``````set.seed(123)
data <- sim_data(100000, noise = 2)
``````

Now we want to nest small samples of this data into a data frame. For example, we 8 subsets incremental in size from 1000 to 8000 rows long. Also we should be able to pick the same rows in different subsets. There is no simple group_by() and nest() like in Managing Multiple Models. The following function should do the trick.

``````nest_sample_data <- function(data, num_tests = 8, min = 1000, max = 8000, seed = NULL, prop_cv = 0.2) {
if(0 < prop_cv && prop_cv < 1) {
num_row_cv <- ceiling(prop_cv * nrow(data))
} else stop("prop_cv: cross validation proportion not between 0 and 1")

if(!is.data.frame(data)) stop("Data is not a data frame object")
if(!(nrow(data) >= (max + num_row_cv))) stop("Number of rows of data doesn't equal or exceed max plus cv rows")

num_rows             <- floor(seq(min, max, length = num_tests))
num_rows_list        <- as.list(num_rows)
names(num_rows_list) <- paste0("num_rows_", unlist(num_rows_list))

if(!is.null(seed)) {
if(is.numeric(seed)) {
set.seed(seed)
} else stop("Seed is not numeric")
}
cv_rows   <- sample(1:nrow(data), num_row_cv)
test_rows <- setdiff(1:nrow(data), cv_rows)
cv        <- rep(list(data[cv_rows,]),num_tests)
test      <- map(num_rows_list, function(x) data[sample(test_rows, x),])

data_frame(num_row = num_rows, test = test, cv = cv)
}
``````
``````nest_data <- nest_sample_data(data, num_tests = 8, min = 1000, max = 8000)
nest_data
``````
```## Source: local data frame [8 x 3]
##
##   num_row              test                 cv
##     (dbl)             (chr)              (chr)
## 1    1000 <tbl_df [1000,4]> <tbl_df [20000,4]>
## 2    2000 <tbl_df [2000,4]> <tbl_df [20000,4]>
## 3    3000 <tbl_df [3000,4]> <tbl_df [20000,4]>
## 4    4000 <tbl_df [4000,4]> <tbl_df [20000,4]>
## 5    5000 <tbl_df [5000,4]> <tbl_df [20000,4]>
## 6    6000 <tbl_df [6000,4]> <tbl_df [20000,4]>
## 7    7000 <tbl_df [7000,4]> <tbl_df [20000,4]>
## 8    8000 <tbl_df [8000,4]> <tbl_df [20000,4]>
```

We now have a nested data frame with number of rows, the test data and the cross validation data. The cross validation data is the same in each row, which is a bit of a waste, but for now I like to keep a clear data frame.

We can now apply models on the nested data:

``````models <- nest_data %>%
mutate(mod_1 = map(test, function(x) lm(data = x, y ~ X1*X2*X3)),
mod_2 = map(test, function(x) lm(data = x, y ~ X1*X3)))
``````

The nested data and models proved a bit of pain to access for prediction. The pain is mostly in subsetting columns of nested data frames. I used an approach like:

```map(df\$nested_data, `[`, c("column_name_1", "column_name_2"))
```

The map() function is from the purrr package and works like lapply.

The following function takes lists with the predictor values, models and actual outcomes and returns an vector with the root mean squared error for each of the list elements.

``````nested_error <- function(predictor, model, val) {
predicted <- unname(unlist(mapply(function(x, y) {predict(y, x)}, x = predictor, y = model)))
actual <- unname(unlist(val))

error <- (predicted - actual)

l <- unname(map(val, function(x) dim(x)[1]))
m <- as.list(seq_along(l))
s <- as.factor(unlist(mapply(function(x, y) {rep(x, y)}, x = m, y = l)))

a <- split(error, s)
map_dbl(a, function(x) sqrt(mean(x^2)))
}
``````

To call the nested_error function you have to carefully select the correct columns of the nested data.

``````errors <- models %>%
mutate(training_error_1 = nested_error(predictor = map(test, `[`, c("X1", "X2", "X3")), model = mod_1, val = map(test, `[`, "y")),
validation_error_1 = nested_error(predictor = map(cv, `[`, c("X1", "X2", "X3")), model = mod_1, val = map(cv, `[`, "y")),
training_error_2 = nested_error(predictor = map(test, `[`, c("X1", "X2", "X3")), model = mod_2, val = map(test, `[`, "y")),
validation_error_2 = nested_error(predictor = map(cv, `[`, c("X1", "X2", "X3")), model = mod_2, val = map(cv, `[`, "y")))
errors
``````
```## Source: local data frame [8 x 9]
##
##   num_row              test                 cv   mod_1   mod_2
##     (dbl)             (chr)              (chr)   (chr)   (chr)
## 1    1000 <tbl_df [1000,4]> <tbl_df [20000,4]> <S3:lm> <S3:lm>
## 2    2000 <tbl_df [2000,4]> <tbl_df [20000,4]> <S3:lm> <S3:lm>
## 3    3000 <tbl_df [3000,4]> <tbl_df [20000,4]> <S3:lm> <S3:lm>
## 4    4000 <tbl_df [4000,4]> <tbl_df [20000,4]> <S3:lm> <S3:lm>
## 5    5000 <tbl_df [5000,4]> <tbl_df [20000,4]> <S3:lm> <S3:lm>
## 6    6000 <tbl_df [6000,4]> <tbl_df [20000,4]> <S3:lm> <S3:lm>
## 7    7000 <tbl_df [7000,4]> <tbl_df [20000,4]> <S3:lm> <S3:lm>
## 8    8000 <tbl_df [8000,4]> <tbl_df [20000,4]> <S3:lm> <S3:lm>
## Variables not shown: training_error_1 (dbl), validation_error_1 (dbl),
##   training_error_2 (dbl), validation_error_2 (dbl)
```

What you end up with is the root mean squared error for the training and cross validation sets for each model neatly stored in a data frame.

Finally you can plot the learning curves.

``````bind_rows(errors %>%
select(num_row, training_error_1, validation_error_1) %>%
gather("type", "error", 2:3) %>%
mutate(type = gsub(pattern = "_1", replacement = "", x = type),
mod = "model 1"),
errors %>%
select(num_row, training_error_2, validation_error_2) %>%
gather("type", "error", 2:3) %>%
mutate(type = gsub(pattern = "_2", replacement = "", x = type),
mod = "model 2")) %>%
ggplot(aes(x = num_row, y = error, col = factor(mod))) +
xlab("training set size") +
ylab("Mean Squared error") +
ggtitle("Learning curve") +
geom_line(stat="smooth", method = "auto", aes(linetype = factor(type))) +
expand_limits(y = 0, x = 0) +
geom_point(shape = 1, size = 2) +
scale_colour_brewer(palette = "Dark2") +
labs(linetype = "Type of error", col = "Model")
``````

Introduction to learning curves by the example.

The function of model 1 is \(y \~ X1*X2*X3\) and the validation and training error both converge to 2. That’s great because the error of 2 is the random noise that can’t be modeled, this shows that the model is complex enough to discribe the data and that more training data will improve the validation error untill it reaches the training error.

The function of model 2 is \(y \~ X1\). Only looking at \(X1\) will never be enough to explain the part of our function that adds 10 when \$X1 = X2. The model is too simple and it will adjust the seemingly random +10 in it’s linear model and produce a larger error that can’t be fixed by adding more data. More data doesn’t always help, and learning curves are a great visual tool for that.