Part 1: EDA and Base Model
Shimpei Ikeno, Satyaki Roy, Yuta Suzuki, and Takeru Sone
2022-07-12
Purpose of This Blog Series: How to Use Practical Multivariate Time Series Forecasting Models
Welcome to our Data Science Blog! I am Shimpei Ikeno, a data scientist in the Time Series Forecasting Practice Team at Nomura Research Institute, Ltd.
According to Wikipedia, a time series is “a series of data points indexed (or listed or graphed) in time order”. Professor Rob J. Hyndman, a leading authority on time series forecasting, states that forecasting is “about predicting the future as accurately as possible, given all of the information available, including historical data and knowledge of any future events that might impact the forecasts”. In short, time series forecasting is an approach to looking into the future as accurately as possible based on an observed series of data points indexed in time order, and forecasting models are built on the patterns found in those observed temporal changes.
In business settings, we rarely want to forecast only a single time series. We need to forecast not only the company’s total sales, but also sales by store, by product, and by various other units. Many sources explain forecasting for a single time series, but few explain forecasting for multivariate (hierarchical or related) time series. As a result, many people struggle when trying to introduce multivariate time series forecasting into their business.
In this blog series, we will explain how to introduce multivariate time series forecasting models into your business in several parts. This blog provides detailed explanations, making it easy to read, even for those who are not very familiar with time series analysis. I hope you will enjoy it.
Current Trends and Issues of Time Series Forecasting Models
A wide variety of time series forecasting models are being researched and developed: classical statistical models such as ARIMA and ETS, machine-learning models such as GBDT, Prophet developed by Facebook (Meta), and deep learning models. However, using the latest forecasting models does not always yield good results. For example, the AutoML frameworks provided by cloud vendors and specialized companies do not always deliver the accuracy you might expect in recent competitions, while a simple statistical model can sometimes produce highly accurate forecasts. Even the best model, used in the wrong way, will not give good results. Now that so many models are available, we data scientists must be able to select appropriate ones based on a proper understanding of the characteristics of both the models and the data.
The selection of explanatory variables is one of the most critical issues. This is because weekly and monthly data, which are often used in business, have very few data points, and adding more variables does not always improve accuracy. For example, monthly data provides only 12 data points per year and 120 data points for 10 years. Yet only a few resources explain in detail how to select explanatory variables.
Furthermore, multivariate time series forecasts are more complex than single time series forecasts.
Here is an example of the sales volume of each product for each store.
The labels in the table above stand for store numbers and product numbers. If forecasts were made for all combinations of stores and products, the number of combinations would exceed 1,000. If we then try several models for each combination, split the data into training and validation sets, and cross-validate, the number of models explodes. Trying various models and variables blindly is therefore inefficient and unproductive. To master time series forecasting, you need to learn a framework that lets you efficiently run trial and error over many models and sets of variables.
Based on the recognition of the above issues, we would like to share the following three points with you in this series of articles.
- How to select models for a given situation.
- How to think about creating and selecting variables.
- An efficient modeling framework.
Goal: Acing a Past Kaggle Competition!
In this blog, we would like to introduce practical methods, but the data we usually deal with is our clients’ data, which is difficult to disclose to the public. Therefore, we will use the data from Kaggle to share the essence of data analysis for your business.
The competition we take on in this blog is Walmart Recruiting - Store Sales Forecasting, held by Walmart in 2014, where the task was to forecast sales by store and by department. We chose this competition for two reasons. First, the data volume is easy to handle: the data size is only a few MB, so it can be processed comfortably on a laptop. Second, the data includes explanatory variables such as promotions, so we can analyze and discuss it from a business perspective.
Now that we have decided which competition we will participate in, the next step is to set a target rank. Considering that the competition was held in 2014, the time when XGBoost and Deep Learning were not widely available, we set our target ranking as
1st out of 688.
Because this is a past competition, it is possible that some published notebooks have scored higher than the winning score. However, we did not find any such notebooks. Thus, if our model exceeds the top score, we can be confident that the achievement is our own.
Data
There are four data sets; a minimal loading sketch follows the list below.
- train.csv: Data for training. The contents of the data are as follows
## # A tibble: 6 × 5
## Store Dept Date Weekly_Sales IsHoliday
## <dbl> <dbl> <date> <dbl> <lgl>
## 1 1 1 2010-02-05 24924. FALSE
## 2 1 1 2010-02-12 46039. TRUE
## 3 1 1 2010-02-19 41596. FALSE
## 4 1 1 2010-02-26 19404. FALSE
## 5 1 1 2010-03-05 21828. FALSE
## 6 1 1 2010-03-12 21043. FALSE
- test.csv: Data for testing. The columns are the same as train.csv, except that Weekly_Sales is absent (it is the target to forecast).
- stores.csv: Data containing store information. Joined on Store.
## # A tibble: 6 × 3
## Store Type Size
## <dbl> <chr> <dbl>
## 1 1 A 151315
## 2 2 A 202307
## 3 3 B 37392
## 4 4 A 205863
## 5 5 B 34875
## 6 6 A 202505
- features.csv: Data by store and date, containing additional information such as promotions (the MarkDown columns), temperature, and fuel price.
## # A tibble: 6 × 12
## Store Date Temperature Fuel_Price MarkDown1 MarkDown2 MarkDown3
## <dbl> <date> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 1 2010-02-05 42.3 2.57 NA NA NA
## 2 1 2010-02-12 38.5 2.55 NA NA NA
## 3 1 2010-02-19 39.9 2.51 NA NA NA
## 4 1 2010-02-26 46.6 2.56 NA NA NA
## 5 1 2010-03-05 46.5 2.62 NA NA NA
## 6 1 2010-03-12 57.8 2.67 NA NA NA
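For reference, here is a minimal loading sketch. The object names tr, te, stores, and features are the ones used later in this post (te for the test data is my assumption, as the post never names it explicitly); the file names are as distributed by Kaggle.

# minimal loading sketch; adjust paths to wherever you unzipped the data
library(tidyverse)
tr       <- read_csv("train.csv")
te       <- read_csv("test.csv")
stores   <- read_csv("stores.csv")
features <- read_csv("features.csv")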
Finally, also check the format of the submission file.
## # A tibble: 6 × 2
## Id Weekly_Sales
## <chr> <dbl>
## 1 1_1_2012-11-02 0
## 2 1_1_2012-11-09 0
## 3 1_1_2012-11-16 0
## 4 1_1_2012-11-23 0
## 5 1_1_2012-11-30 0
## 6 1_1_2012-12-07 0
The submission data must include an Id (Store, Dept, and Date joined by “_”) and the sales forecast for each Id.
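For illustration, such an Id could be built from the raw columns like this (a sketch; te holding the test data is an assumed name):

# sketch: build the submission Id "Store_Dept_Date" from its components
te %>% mutate(Id = paste(Store, Dept, Date, sep = "_"))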
Evaluation
The evaluation metric is WMAE (weighted mean absolute error). It is essentially the mean absolute error, with extra weight on certain weeks.
Please check the detailed definition on your own; the key point is that errors in holiday weeks are weighted five times as heavily. This is because sales are expected to spike around Christmas and Thanksgiving, so forecasting these periods well is very important. Setting up a custom metric that reflects this kind of business context is something to keep in mind when actually building the model.
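For reference, here is a minimal sketch of the metric as an R helper (my own illustration, not an official implementation): holiday weeks get weight 5, all other weeks weight 1.

# WMAE sketch: w_i = 5 for holiday weeks, 1 otherwise
wmae <- function(actual, forecast, is_holiday) {
  w <- ifelse(is_holiday, 5, 1)
  sum(w * abs(actual - forecast)) / sum(w)
}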
EDA
Now that the data and the evaluation metric have been described, the next step is Exploratory Data Analysis (EDA). We actually conducted a wide variety of analyses, too many to introduce here in full. Therefore, in this blog, I will focus on the training and test data and explain only the most notable findings.
Training Data
First, let’s check the contents of the training data using the R package skimr.
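The skim output below includes Store, Dept, and Id as factors, so the raw data was presumably preprocessed first; a sketch of what that call might look like (the exact preprocessing is an assumption):

library(skimr)
tr %>%
  mutate(across(c(Store, Dept), as.factor),
         Id = as.factor(paste(Store, Dept, sep = "_"))) %>%
  skim()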
Name | Piped data |
Number of rows | 421570 |
Number of columns | 6 |
_______________________ | |
Column type frequency: | |
Date | 1 |
factor | 3 |
logical | 1 |
numeric | 1 |
________________________ | |
Group variables | None |
Variable type: Date
skim_variable | n_missing | complete_rate | min | max | median | n_unique |
---|---|---|---|---|---|---|
Date | 0 | 1 | 2010-02-05 | 2012-10-26 | 2011-06-17 | 143 |
Variable type: factor
skim_variable | n_missing | complete_rate | ordered | n_unique | top_counts |
---|---|---|---|---|---|
Store | 0 | 1 | FALSE | 45 | 13: 10474, 10: 10315, 4: 10272, 1: 10244 |
Dept | 0 | 1 | FALSE | 81 | 1: 6435, 2: 6435, 3: 6435, 4: 6435 |
Id | 0 | 1 | FALSE | 3331 | 1_1: 143, 1_1: 143, 1_1: 143, 1_1: 143 |
Variable type: logical
skim_variable | n_missing | complete_rate | mean | count |
---|---|---|---|---|
IsHoliday | 0 | 1 | 0.07 | FAL: 391909, TRU: 29661 |
Variable type: numeric
skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
---|---|---|---|---|---|---|---|---|---|---|
Weekly_Sales | 0 | 1 | 15981.26 | 22711.18 | -4988.94 | 2079.65 | 7612.03 | 20205.85 | 693099.4 | ▇▁ |
Date covers the period from February 2010 to October 2012: 143 weeks, or roughly 2.75 years of training data. Store and Dept contain 45 stores and 81 departments, respectively. Combining Store and Dept into the unit Id for checking weekly sales, we find that there are 3331 Ids in total.
Next, looking at IsHoliday, approximately 7% of the records fall in holiday weeks. The following holiday weeks were provided by Kaggle:
- Super Bowl: 12-Feb-10, 11-Feb-11, 10-Feb-12, 8-Feb-13
- Labor Day: 10-Sep-10, 9-Sep-11, 7-Sep-12, 6-Sep-13
- Thanksgiving: 26-Nov-10, 25-Nov-11, 23-Nov-12, 29-Nov-13
- Christmas: 31-Dec-10, 30-Dec-11, 28-Dec-12, 27-Dec-13
These four holidays are the registered holiday periods. Because, for example, Christmas and the Super Bowl affect sales very differently, each holiday needs to be considered separately.
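For example, the holiday type could be tagged as follows (a sketch using the training-period dates listed above; holiday_tbl and the Holiday column are my own names):

# sketch: tag each training date with its holiday type
holiday_tbl <- tibble(
  Date = as.Date(c("2010-02-12", "2011-02-11", "2012-02-10",   # Super Bowl
                   "2010-09-10", "2011-09-09", "2012-09-07",   # Labor Day
                   "2010-11-26", "2011-11-25", "2012-11-23",   # Thanksgiving
                   "2010-12-31", "2011-12-30", "2012-12-28")), # Christmas
  Holiday = rep(c("SuperBowl", "LaborDay", "Thanksgiving", "Christmas"), each = 3)
)
tr %>% left_join(holiday_tbl, by = "Date")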
Finally, check Weekly_Sales. Some of the values are negative. Negative sales may look strange for a supermarket, but they are common when dealing with actual client data; likely reasons include product returns and cancelled sales orders. Most values are below 20,000, but the maximum is very large, about 700,000, so the distribution has a long right tail.
Let’s also check how many weeks of data each Id has in the training set.
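Here is a sketch of how the table below could be produced (the bin edges are inferred from the output):

# sketch: count the weekly records per Id, then bin the counts
tr %>%
  mutate(Id = paste(Store, Dept, sep = "_")) %>%
  add_count(Id, name = "ndates") %>%
  mutate(ndates = cut(ndates, breaks = c(0, 40, 80, 142, 143))) %>%
  count(ndates) %>%
  mutate(prop = round(n / sum(n), 3))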
## # A tibble: 4 × 3
## ndates n prop
## <fct> <int> <dbl>
## 1 (0,40] 3835 0.009
## 2 (40,80] 4800 0.011
## 3 (80,142] 32555 0.077
## 4 (142,143] 380380 0.902
Ids with the complete 143 weeks account for only about 90% of the records; the remaining 10% have gaps somewhere in the period. From a business perspective, this can happen when a new store opens, a new department is created, or a department is redefined during the period. As mentioned earlier, time series data tend to have few data points, and handling such missing stretches is one of the challenges.
Test Data
For the test data, we skim in the same way.
Name | Piped data |
Number of rows | 115064 |
Number of columns | 5 |
_______________________ | |
Column type frequency: | |
Date | 1 |
factor | 3 |
logical | 1 |
________________________ | |
Group variables | None |
Variable type: Date
skim_variable | n_missing | complete_rate | min | max | median | n_unique |
---|---|---|---|---|---|---|
Date | 0 | 1 | 2012-11-02 | 2013-07-26 | 2013-03-15 | 39 |
Variable type: factor
skim_variable | n_missing | complete_rate | ordered | n_unique | top_counts |
---|---|---|---|---|---|
Store | 0 | 1 | FALSE | 45 | 13: 2836, 4: 2803, 19: 2799, 2: 2797 |
Dept | 0 | 1 | FALSE | 81 | 1: 1755, 2: 1755, 3: 1755, 4: 1755 |
Id | 0 | 1 | FALSE | 3169 | 1_1: 39, 1_1: 39, 1_1: 39, 1_1: 39 |
Variable type: logical
skim_variable | n_missing | complete_rate | mean | count |
---|---|---|---|---|
IsHoliday | 0 | 1 | 0.08 | FAL: 106136, TRU: 8928 |
Merging the two data sets, we see that the number of Ids increases from 3331 to 3342, which means there are at least 11 Ids that appear only in the test data. In general, it is impossible to build a time series forecasting model for a series that has no training data, so a model-per-Id approach will not work for these Ids. We need to solve this problem, which will be explained in detail in the model-building part.
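These test-only Ids can be listed with a quick check like the following (a sketch; te holding the test data is an assumed name):

# sketch: Ids present in test but absent from train
tr_ids <- tr %>% mutate(Id = paste(Store, Dept, sep = "_")) %>% distinct(Id) %>% pull(Id)
te_ids <- te %>% mutate(Id = paste(Store, Dept, sep = "_")) %>% distinct(Id) %>% pull(Id)
setdiff(te_ids, tr_ids)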
Padding & Imputation
We found that the training data has gaps at the Id level and that some Ids appear in the test data but not in the training data. If we naively apply time series models to this data, errors will occur. To build the base model, we therefore fill in the missing records with NA (padding) and then impute appropriate values.
First, we fill the missing date records in the training data with NA. From here on, we will mainly use the modeltime package for time series processing. It is very easy to use and lets us handle everything from time series preprocessing to modeling smoothly within the tidymodels framework. Python is currently the most popular choice for data analysis, but R also has useful, easy-to-use analysis tools. For time series analysis in particular, Prof. Rob J. Hyndman’s packages and modeltime, which we use in this blog, are excellent, so please give them a try. Some people may think R cannot use deep-learning-based models, but it can; this will be explained later.
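For reference, these are the packages the code below appears to rely on (a sketch based on the functions used; the exact set is an assumption):

library(tidyverse)   # dplyr, purrr, readr, ...
library(timetk)      # pad_by_time()
library(lubridate)   # decimal_date()
library(imputeTS)    # na_seadec(), na_interpolation(), na_locf()
library(tidymodels)  # recipe(), workflow(), parsnip
library(modeltime)   # naive_reg(), nested modeltime functions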
full_tr <- tr %>%
  # create Id
  mutate(Id = paste(Store, Dept, sep = "_")) %>%
  # padding
  group_by(Id) %>%
  select(Id, Date, Weekly_Sales) %>%
  pad_by_time(
    Date,
    .start_date = min(tr$Date),
    .end_date   = max(tr$Date)
  ) %>%
  ungroup() %>%
  # change negative to 0
  mutate(Weekly_Sales = ifelse(Weekly_Sales < 0, 0, Weekly_Sales))
Let me explain the code in a little more detail. tr is the training data. We first created Id by joining Store and Dept with “_”. Next, we padded each Id’s series so that it spans from the first to the last date of the entire training data, filling the new rows with NA. Finally, we replaced negative values in Weekly_Sales with 0.
Name | Piped data |
Number of rows | 476333 |
Number of columns | 3 |
_______________________ | |
Column type frequency: | |
character | 1 |
Date | 1 |
numeric | 1 |
________________________ | |
Group variables | None |
Variable type: character
skim_variable | n_missing | complete_rate | min | max | empty | n_unique | whitespace |
---|---|---|---|---|---|---|---|
Id | 0 | 1 | 3 | 5 | 0 | 3331 | 0 |
Variable type: Date
skim_variable | n_missing | complete_rate | min | max | median | n_unique |
---|---|---|---|---|---|---|
Date | 0 | 1 | 2010-02-05 | 2012-10-26 | 2011-06-17 | 143 |
Variable type: numeric
skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
---|---|---|---|---|---|---|---|---|---|---|
Weekly_Sales | 54763 | 0.89 | 15981.47 | 22711.03 | 0 | 2079.65 | 7612.03 | 20205.85 | 693099.4 | ▇▁▁▁▁ |
We can see that the number of rows has increased by about 50,000. Next, let’s impute the NAs.
tr_imputed <- full_tr %>%
  group_by(Id) %>%
  group_split() %>%
  map_dfr(.f = function(df) {
    df     <- df %>% arrange(Date)
    m_date <- df$Date
    m_id   <- df$Id
    # convert to a weekly ts object so the imputation functions can use seasonality
    df <- ts(df$Weekly_Sales, start = decimal_date(min(tr$Date)), frequency = 52)
    if (sum(!is.na(df)) >= 3) {
      df <- as.numeric(na_seadec(df))        # seasonal imputation
    } else if (sum(!is.na(df)) == 2) {
      df <- as.numeric(na_interpolation(df)) # linear interpolation
    } else {
      df <- as.numeric(na_locf(df))          # last observation carried forward
    }
    tibble(Id = m_id, Date = m_date, Weekly_Sales = as.numeric(df))
  }) %>%
  mutate(Weekly_Sales = ifelse(Weekly_Sales < 0, 0, Weekly_Sales))
The imputation method depends on how many observed data points remain for each Id. The basic policy is as follows:
- 3 or more observed data points: seasonal imputation (na_seadec)
- 2 data points: linear interpolation (na_interpolation)
- 1 data point: carry the last observation forward (na_locf)
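These three helpers come from the imputeTS package. A toy illustration of how they behave on synthetic data:

# toy example of the three imputeTS helpers (synthetic series, frequency 3)
library(imputeTS)
x <- ts(c(10, 2, 30, 12, NA, 32, 11, 3, NA), frequency = 3)
na_seadec(x)        # remove seasonality, interpolate, add seasonality back
na_interpolation(x) # plain linear interpolation
na_locf(x)          # last observation carried forward

With that background, let’s skim the imputed training data: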
tr_imputed %>% skim()
Name | Piped data |
Number of rows | 476333 |
Number of columns | 3 |
_______________________ | |
Column type frequency: | |
character | 1 |
Date | 1 |
numeric | 1 |
________________________ | |
Group variables | None |
Variable type: character
skim_variable | n_missing | complete_rate | min | max | empty | n_unique | whitespace |
---|---|---|---|---|---|---|---|
Id | 0 | 1 | 3 | 5 | 0 | 3331 | 0 |
Variable type: Date
skim_variable | n_missing | complete_rate | min | max | median | n_unique |
---|---|---|---|---|---|---|
Date | 0 | 1 | 2010-02-05 | 2012-10-26 | 2011-06-17 | 143 |
Variable type: numeric
skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
---|---|---|---|---|---|---|---|---|---|---|
Weekly_Sales | 0 | 1 | 14155.72 | 21958.19 | 0 | 797.01 | 5742.07 | 17773.32 | 693099.4 | ▇▁▁▁▁ |
n_missing is now 0, so we have training data with no gaps for any Id. The Ids that exist only in the test data still have no training records, so we will simply fill in 0 for them when preparing the submission file.
Base Model (Snaive)
Now that the training data is complete, it is time for modeling. Here we create a snaive model as the base model. Snaive stands for seasonal naive: in this case, it simply uses the value from the same week in the previous year as the forecast. Using the previous year’s values for forecasting is still common practice in business, so it is an appropriate baseline.
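Conceptually, for weekly data a snaive forecast is just the value observed 52 weeks earlier. A minimal sketch of the idea in dplyr (assuming complete, regular weekly records per series; the column name snaive_pred is my own):

# conceptual sketch: snaive = "same week last year"
tr %>%
  group_by(Store, Dept) %>%
  arrange(Date, .by_group = TRUE) %>%
  mutate(snaive_pred = lag(Weekly_Sales, n = 52)) %>%
  ungroup()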
Forecasting a single time series is not difficult; it is simply a matter of applying a model. The key point here is that we want to apply that process to more than 3,000 time series at once.
In this case, there are two main options:
- Apply the model to each time series separately (iterative).
- Apply one model to the whole set of time series (global).
This time we adopt the former, so we start by creating the data set for it.
# set forecast horizon
horizon <- 39

# create nested data
nested_data_tbl <- tr_imputed %>%
  group_by(Id) %>%
  extend_timeseries(
    .id_var        = Id,
    .date_var      = Date,
    .length_future = horizon
  ) %>%
  nest_timeseries(
    .id_var        = Id,
    .length_future = horizon
  ) %>%
  split_nested_timeseries(
    .length_test = horizon
  )
For a detailed explanation, see the modeltime documentation on nested forecasting. The process above creates data in which each Id’s time series is stored in nested form. We then apply the model to this data with modeltime, which builds on parsnip, an excellent framework for applying models to data developed by Dr. Max Kuhn, a legend in data science. The concept of the framework is that various models can be applied with the same syntax: as shown below, the model and the recipe are defined separately, then combined in a workflow and fitted. It may seem difficult at first, but once you get used to it, it is very convenient, not least because it lets you create and evaluate many models at once. So, let’s get started.
# set model spec
model_snaive <- naive_reg(
  seasonal_period = 52
) %>%
  set_engine("snaive")

# create recipe
rec_snaive <- recipe(Weekly_Sales ~ Date,
                     extract_nested_train_split(nested_data_tbl))

# create workflow
wflw_snaive <- workflow() %>%
  add_model(model_snaive) %>%
  add_recipe(rec_snaive)
First, define the model and the recipe. For the model, we use naive_reg(), a parsnip-style model specification provided by modeltime, with seasonal_period set to 52 because this is weekly data. The recipe needs no adjustment, but the iterative (nested) workflow requires extracting the training split with extract_nested_train_split(). Then, combine the model and recipe in a workflow. Processing thousands of time series at once can take a long time, so it is better to start with a small sample to see whether the model works properly. First, extract a few arbitrary Ids.
set.seed(235)
id_sample <- sample(nested_data_tbl$Id %>% unique(), size = 6)
Let’s fit the model to the sampled data.
try_sample_tbl <- nested_data_tbl %>%
  filter(Id %in% id_sample) %>%
  modeltime_nested_fit(
    model_list = list(
      wflw_snaive
    ),
    control = control_nested_fit(
      verbose   = TRUE,
      allow_par = TRUE
    )
  )
Then we check for errors, evaluate accuracy, and inspect the graphs for problems. All of this takes only a few lines of code, which is what makes modeltime so useful.
try_sample_tbl %>% extract_nested_error_report()
## # A tibble: 0 × 4
## # … with 4 variables: Id <chr>, .model_id <int>, .model_desc <chr>,
## # .error_desc <chr>
try_sample_tbl %>%
extract_nested_test_accuracy() %>%
table_modeltime_accuracy()
Id | .model_id | .model_desc | .type | mae | mape | mase | smape | rmse | rsq |
---|---|---|---|---|---|---|---|---|---|
26_25 | 1 | SNAIVE | Test | 893.76 | 13.55 | 0.86 | 14.24 | 1288.86 | 0.35 |
29_34 | 1 | SNAIVE | Test | 2185.42 | 22.83 | 1.17 | 20.29 | 2696.7 | 0.32 |
44_25 | 1 | SNAIVE | Test | 19.24 | 55.11 | 0.84 | 52.89 | 30.86 | 0.09 |
45_67 | 1 | SNAIVE | Test | 1052.62 | 14.77 | 0.57 | 14.1 | 2019.59 | 0.58 |
6_1 | 1 | SNAIVE | Test | 4931.23 | 20.43 | 1.1 | 17.28 | 9950.62 | 0.11 |
8_19 | 1 | SNAIVE | Test | 631.38 | 122.37 | 5.13 | 60.99 | 1381.9 | 0 |
try_sample_tbl %>%
  extract_nested_test_forecast() %>%
  group_by(Id) %>%
  plot_modeltime_forecast(.interactive = FALSE, .facet_ncol = 3)
There are no errors. The graphs show that the values from 52 weeks earlier are applied as expected. The model seems to fit the samples well, so we do the same for the entire data set. This process takes a long time, so it is better to use parallel processing.
parallel_start(6)

nested_modeltime_tbl <- nested_data_tbl %>%
  modeltime_nested_fit(
    model_list = list(
      wflw_snaive
    ),
    control = control_nested_fit(
      verbose   = TRUE,
      allow_par = TRUE
    )
  )
Using 6 of the 8 cores on my machine, it took around 5 minutes. Now let’s check the errors and MAE.
# * Review Any Errors ----
nested_modeltime_tbl %>% extract_nested_error_report()
## # A tibble: 0 × 4
## # … with 4 variables: Id <chr>, .model_id <int>, .model_desc <chr>,
## # .error_desc <chr>
nested_modeltime_tbl %>%
extract_nested_test_accuracy() %>%
table_modeltime_accuracy()
Id | .model_id | .model_desc | .type | mae | mape | mase | smape | rmse | rsq |
---|---|---|---|---|---|---|---|---|---|
1_1 | 1 | SNAIVE | Test | 4050.18 | 17.94 | 0.89 | 14.7 | 8977.06 | 0.26 |
1_10 | 1 | SNAIVE | Test | 3009.71 | 8.84 | 1.05 | 9.08 | 3988.16 | 0.19 |
1_11 | 1 | SNAIVE | Test | 4020.55 | 15.31 | 1.03 | 15.22 | 4959.1 | 0.47 |
1_12 | 1 | SNAIVE | Test | 1356.33 | 12.64 | 1.13 | 12.32 | 1675.36 | 0.12 |
1_13 | 1 | SNAIVE | Test | 2607.15 | 6.33 | 1.06 | 6.62 | 3113.97 | 0.37 |
1_14 | 1 | SNAIVE | Test | 1742.82 | 11.7 | 1.09 | 12.27 | 2232.6 | 0.34 |
There seem to be no errors. Some MAE values differ widely across Ids, but this is only the base model, so that is acceptable. Next, let’s refit on all of the training data.
nested_best_refit_tbl <- nested_modeltime_tbl %>%
  modeltime_nested_refit(
    control = control_refit(
      verbose   = TRUE,
      allow_par = TRUE
    )
  )
It took about 3 minutes. Then, let’s plot it the same way as before.
set.seed(234)
id_sample <- sample(nested_data_tbl$Id %>% unique(), size = 6)

nested_best_refit_tbl %>%
  extract_nested_future_forecast() %>%
  filter(Id %in% id_sample) %>%
  group_by(Id) %>%
  plot_modeltime_forecast(.facet_ncol = 3)
Overall, the forecasts look reasonably good, except for Ids with a clear overall trend, such as 3_2 and 34_35, which a snaive model cannot capture.
Submit Base Model & Check Scores
Finally, let's prepare the submission file.
results <- nested_best_refit_tbl %>% extract_nested_future_forecast()

results <- results %>%
  mutate(Id = paste(Id, .index, sep = "_")) %>%
  select(Id, .value)

results %>% head()
## # A tibble: 6 × 2
## Id .value
## <chr> <dbl>
## 1 1_1_2010-02-05 24924.
## 2 1_1_2010-02-12 46039.
## 3 1_1_2010-02-19 41596.
## 4 1_1_2010-02-26 19404.
## 5 1_1_2010-03-05 21828.
## 6 1_1_2010-03-12 21043.
submission %>%
  left_join(results, by = "Id") %>%
  mutate(Weekly_Sales = case_when(
    !is.na(.value) ~ .value,
    is.na(.value)  ~ 0
  )) %>%
  select(Id, Weekly_Sales) %>%
  write_csv("submission_snaive.csv")
The Ids that were not forecast are filled in with zeros. We submitted the file, and the rank was...
170th out of 688
Private score: 3026.42451
Public score: 2944.19927
Although this simply applies the values from 52 weeks earlier, it lands in the top quartile, which is not bad performance. This result can be attributed to the fact that modern time series processing methods were not yet widespread in 2014, when the competition was held, and that the limited number of data points made it difficult to apply machine-learning-based models. Snaive is the simplest model we can use, and from this baseline we will see how much higher we can climb.
In Part 2, we will cover feature engineering, which is very important when applying machine-learning-based models. See you in the next blog!