Part 1: EDA and Base Model
Shimpei Ikeno, Satyaki Roy, Yuta Suzuki, and Takeru Sone
2022-07-12
Purpose of This Blog Series: How to Use Practical Multivariate Time Series Forecasting Models
Welcome to our Data Science Blog! I am Shimpei Ikeno, a data scientist in the Time Series Forecasting Practice Team at Nomura Research Institute, Ltd.
According to Wikipedia, a time series is “a series of data points indexed (or listed or graphed) in time order”. Professor Rob J. Hyndman, a leading authority on time series forecasting, states that forecasting is “about predicting the future as accurately as possible, given all of the information available, including historical data and knowledge of any future events that might impact the forecasts”. In short, time series forecasting is an approach to looking into the future as accurately as possible based on an observed series of data points indexed in time order, and forecasting models are built on the patterns found in those observed temporal changes.
In business settings, we rarely want to forecast only a single time series. We need to forecast not only the company’s total sales, but also sales by store, by product, and by various other units. Many sources explain forecasting for a single time series, but few explain forecasting for multivariate (hierarchical or related) time series. As a result, many people struggle when trying to introduce multivariate time series forecasting into their business.
In this blog series, we will explain how to introduce multivariate time series forecasting models into your business in several parts. This blog provides detailed explanations, making it easy to read, even for those who are not very familiar with time series analysis. I hope you will enjoy it.
Current Trends and Issues of Time Series Forecasting Models
A wide variety of time series forecasting models are being researched and developed: classical statistical models such as ARIMA and ETS, machine-learning models such as GBDT, Prophet developed by Facebook (Meta), and deep learning models. However, using the latest forecasting models does not always yield good results. For example, the AutoML frameworks provided by cloud vendors and specialized companies do not always deliver the accuracy you might expect in recent competitions, while a simple statistical model can sometimes produce highly accurate forecasts. Even the best model, used in the wrong way, will not give good results. Now that so many models are available, we data scientists must be able to select appropriate ones based on a proper understanding of the characteristics of both the models and the data.
The selection of explanatory variables is one of the most critical issues. This is because weekly and monthly data, which are often used in business, have very few data points, and adding more variables does not always improve accuracy. For example, monthly data provides only 12 data points per year and 120 data points for 10 years. Yet only a few resources explain in detail how to select explanatory variables.
Furthermore, multivariate time series forecasts are more complex than single time series forecasts.
Here is an example of the sales volume of each product for each store.
The labels in the table above stand for store numbers and product numbers. If forecasts were made for all combinations of stores and products, the number of combinations would exceed 1,000. If we then try several models for each combination, split the data into training and validation sets, and cross-validate, the number of models explodes. Trying various models and variables blindly is therefore inefficient and unproductive. To master time series forecasting, you need to learn a framework that lets you efficiently run trial and error over many models and sets of variables.
Based on the recognition of the above issues, we would like to share the following three points with you in this series of articles.
- How to select models for a given situation.
- How to think about creating and selecting variables.
- An efficient modeling framework.
Goal: Acing a Past Kaggle Competition!
In this blog, we would like to introduce practical methods, but the data we usually deal with is our clients’ data, which is difficult to disclose to the public. Therefore, we will use the data from Kaggle to share the essence of data analysis for your business.
The competition we take on in this blog is Walmart Recruiting - Store Sales Forecasting, held by Walmart in 2014, where the task was to forecast sales by store and by department. We chose this competition for two reasons. First, the data volume is easy to handle: the data size is only a few MB, so it can be processed comfortably on a laptop. Second, the data includes explanatory variables such as promotions, so we can analyze and discuss it from a business perspective.
Now that we have decided which competition we will participate in, the next step is to set a target rank. Considering that the competition was held in 2014, the time when XGBoost and Deep Learning were not widely available, we set our target ranking as
1st out of 688.
Because this is a past competition, it is possible that some published notebooks have scored higher than the winning score. However, we did not find any such notebooks. Thus, if our model exceeds the top score, we can be confident that the achievement is our own.
Data
There are four data sets; a minimal loading sketch follows the list below.
- train.csv: Data for training. The contents of the data are as follows
## # A tibble: 6 × 5
## Store Dept Date Weekly_Sales IsHoliday
## <dbl> <dbl> <date> <dbl> <lgl>
## 1 1 1 2010-02-05 24924. FALSE
## 2 1 1 2010-02-12 46039. TRUE
## 3 1 1 2010-02-19 41596. FALSE
## 4 1 1 2010-02-26 19404. FALSE
## 5 1 1 2010-03-05 21828. FALSE
## 6 1 1 2010-03-12 21043. FALSE
- test.csv: Data for testing. The columns are the same as train.csv, except that Weekly_Sales is absent (it is the target to forecast).
- stores.csv: Data containing store information. Joined on Store.
## # A tibble: 6 × 3
## Store Type Size
## <dbl> <chr> <dbl>
## 1 1 A 151315
## 2 2 A 202307
## 3 3 B 37392
## 4 4 A 205863
## 5 5 B 34875
## 6 6 A 202505
- features.csv: Data by store and date, containing additional information such as promotions (the MarkDown columns), temperature, and fuel price.
## # A tibble: 6 × 12
## Store Date Temperature Fuel_Price MarkDown1 MarkDown2 MarkDown3
## <dbl> <date> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 1 2010-02-05 42.3 2.57 NA NA NA
## 2 1 2010-02-12 38.5 2.55 NA NA NA
## 3 1 2010-02-19 39.9 2.51 NA NA NA
## 4 1 2010-02-26 46.6 2.56 NA NA NA
## 5 1 2010-03-05 46.5 2.62 NA NA NA
## 6 1 2010-03-12 57.8 2.67 NA NA NA
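For reference, here is a minimal loading sketch. The object names tr, te, stores, and features are the ones used later in this post (te for the test data is my assumption, as the post never names it explicitly); the file names are as distributed by Kaggle.

# minimal loading sketch; adjust paths to wherever you unzipped the data
library(tidyverse)
tr       <- read_csv("train.csv")
te       <- read_csv("test.csv")
stores   <- read_csv("stores.csv")
features <- read_csv("features.csv")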
Finally, also check the format of the submission file.
## # A tibble: 6 × 2
## Id Weekly_Sales
## <chr> <dbl>
## 1 1_1_2012-11-02 0
## 2 1_1_2012-11-09 0
## 3 1_1_2012-11-16 0
## 4 1_1_2012-11-23 0
## 5 1_1_2012-11-30 0
## 6 1_1_2012-12-07 0
The submission data must include an Id (Store, Dept, and Date joined by “_”) and the sales forecast for each Id.
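For illustration, such an Id could be built from the raw columns like this (a sketch; te holding the test data is an assumed name):

# sketch: build the submission Id "Store_Dept_Date" from its components
te %>% mutate(Id = paste(Store, Dept, Date, sep = "_"))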
Evaluation
The evaluation metric is WMAE (weighted mean absolute error). It is essentially the mean absolute error, with extra weight on certain weeks.
Please check the detailed definition on your own; the key point is that errors in holiday weeks are weighted five times as heavily. This is because sales are expected to spike around Christmas and Thanksgiving, so forecasting these periods well is very important. Setting up a custom metric that reflects this kind of business context is something to keep in mind when actually building the model.
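For reference, here is a minimal sketch of the metric as an R helper (my own illustration, not an official implementation): holiday weeks get weight 5, all other weeks weight 1.

# WMAE sketch: w_i = 5 for holiday weeks, 1 otherwise
wmae <- function(actual, forecast, is_holiday) {
  w <- ifelse(is_holiday, 5, 1)
  sum(w * abs(actual - forecast)) / sum(w)
}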
EDA
Now that the data and the evaluation metric have been described, the next step is Exploratory Data Analysis (EDA). We actually conducted a wide variety of analyses, too many to introduce here in full. Therefore, in this blog, I will focus on the training and test data and explain only the most notable findings.
Training Data
First, let’s check the contents of the training data using the R package skimr.
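The skim output below includes Store, Dept, and Id as factors, so the raw data was presumably preprocessed first; a sketch of what that call might look like (the exact preprocessing is an assumption):

library(skimr)
tr %>%
  mutate(across(c(Store, Dept), as.factor),
         Id = as.factor(paste(Store, Dept, sep = "_"))) %>%
  skim()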
Name | Piped data |
Number of rows | 421570 |
Number of columns | 6 |
_______________________ | |
Column type frequency: | |
Date | 1 |
factor | 3 |
logical | 1 |
numeric | 1 |
________________________ | |
Group variables | None |
Variable type: Date
skim_variable | n_missing | complete_rate | min | max | median | n_unique |
---|---|---|---|---|---|---|
Date | 0 | 1 | 2010-02-05 | 2012-10-26 | 2011-06-17 | 143 |
Variable type: factor
skim_variable | n_missing | complete_rate | ordered | n_unique | top_counts |
---|---|---|---|---|---|
Store | 0 | 1 | FALSE | 45 | 13: 10474, 10: 10315, 4: 10272, 1: 10244 |
Dept | 0 | 1 | FALSE | 81 | 1: 6435, 2: 6435, 3: 6435, 4: 6435 |
Id | 0 | 1 | FALSE | 3331 | 1_1: 143, 1_1: 143, 1_1: 143, 1_1: 143 |
Variable type: logical
skim_variable | n_missing | complete_rate | mean | count |
---|---|---|---|---|
IsHoliday | 0 | 1 | 0.07 | FAL: 391909, TRU: 29661 |
Variable type: numeric
skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
---|---|---|---|---|---|---|---|---|---|---|
Weekly_Sales | 0 | 1 | 15981.26 | 22711.18 | -4988.94 | 2079.65 | 7612.03 | 20205.85 | 693099.4 | ▇▁ |
Date covers the period from February 2010 to October 2012: 143 weeks, or roughly 2.75 years of training data. Store and Dept contain 45 stores and 81 departments, respectively. Combining Store and Dept into the unit Id for checking weekly sales, we find that there are 3331 Ids in total.
Next, looking at IsHoliday, approximately 7% of the records fall in holiday weeks. The following holiday weeks were provided by Kaggle:
- Super Bowl: 12-Feb-10, 11-Feb-11, 10-Feb-12, 8-Feb-13
- Labor Day: 10-Sep-10, 9-Sep-11, 7-Sep-12, 6-Sep-13
- Thanksgiving: 26-Nov-10, 25-Nov-11, 23-Nov-12, 29-Nov-13
- Christmas: 31-Dec-10, 30-Dec-11, 28-Dec-12, 27-Dec-13
These four holidays are the registered holiday periods. Because, for example, Christmas and the Super Bowl affect sales very differently, each holiday needs to be considered separately.
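For example, the holiday type could be tagged as follows (a sketch using the training-period dates listed above; holiday_tbl and the Holiday column are my own names):

# sketch: tag each training date with its holiday type
holiday_tbl <- tibble(
  Date = as.Date(c("2010-02-12", "2011-02-11", "2012-02-10",   # Super Bowl
                   "2010-09-10", "2011-09-09", "2012-09-07",   # Labor Day
                   "2010-11-26", "2011-11-25", "2012-11-23",   # Thanksgiving
                   "2010-12-31", "2011-12-30", "2012-12-28")), # Christmas
  Holiday = rep(c("SuperBowl", "LaborDay", "Thanksgiving", "Christmas"), each = 3)
)
tr %>% left_join(holiday_tbl, by = "Date")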
Finally, check Weekly_Sales. Some of the values are negative. Negative sales may look strange for a supermarket, but they are common when dealing with actual client data; likely reasons include product returns and cancelled sales orders. Most values are below 20,000, but the maximum is very large, about 700,000, so the distribution has a long right tail.
Let’s also check how many weeks of data each Id has in the training set.
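Here is a sketch of how the table below could be produced (the bin edges are inferred from the output):

# sketch: count the weekly records per Id, then bin the counts
tr %>%
  mutate(Id = paste(Store, Dept, sep = "_")) %>%
  add_count(Id, name = "ndates") %>%
  mutate(ndates = cut(ndates, breaks = c(0, 40, 80, 142, 143))) %>%
  count(ndates) %>%
  mutate(prop = round(n / sum(n), 3))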
## # A tibble: 4 × 3
## ndates n prop
## <fct> <int> <dbl>
## 1 (0,40] 3835 0.009
## 2 (40,80] 4800 0.011
## 3 (80,142] 32555 0.077
## 4 (142,143] 380380 0.902
Ids with the complete 143 weeks account for only about 90% of the records; the remaining 10% have gaps somewhere in the period. From a business perspective, this can happen when a new store opens, a new department is created, or a department is redefined during the period. As mentioned earlier, time series data tend to have few data points, and handling such missing stretches is one of the challenges.
Test Data
For the test data, we skim in the same way.
Name | Piped data |
Number of rows | 115064 |
Number of columns | 5 |
_______________________ | |
Column type frequency: | |
Date | 1 |
factor | 3 |
logical | 1 |
________________________ | |
Group variables | None |
Variable type: Date
skim_variable | n_missing | complete_rate | min | max | median | n_unique |
---|---|---|---|---|---|---|
Date | 0 | 1 | 2012-11-02 | 2013-07-26 | 2013-03-15 | 39 |
Variable type: factor
skim_variable | n_missing | complete_rate | ordered | n_unique | top_counts |
---|---|---|---|---|---|
Store | 0 | 1 | FALSE | 45 | 13: 2836, 4: 2803, 19: 2799, 2: 2797 |
Dept | 0 | 1 | FALSE | 81 | 1: 1755, 2: 1755, 3: 1755, 4: 1755 |
Id | 0 | 1 | FALSE | 3169 | 1_1: 39, 1_1: 39, 1_1: 39, 1_1: 39 |
Variable type: logical
skim_variable | n_missing | complete_rate | mean | count |
---|---|---|---|---|
IsHoliday | 0 | 1 | 0.08 | FAL: 106136, TRU: 8928 |
Merging the two data sets, we see that the number of Ids increases from 3331 to 3342, which means there are at least 11 Ids that appear only in the test data. In general, it is impossible to build a time series forecasting model for a series that has no training data, so a model-per-Id approach will not work for these Ids. We need to solve this problem, which will be explained in detail in the model-building part.
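These test-only Ids can be listed with a quick check like the following (a sketch; te holding the test data is an assumed name):

# sketch: Ids present in test but absent from train
tr_ids <- tr %>% mutate(Id = paste(Store, Dept, sep = "_")) %>% distinct(Id) %>% pull(Id)
te_ids <- te %>% mutate(Id = paste(Store, Dept, sep = "_")) %>% distinct(Id) %>% pull(Id)
setdiff(te_ids, tr_ids)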
Padding & Imputation
We found that the training data has gaps at the Id level and that some Ids appear in the test data but not in the training data. If we naively apply time series models to this data, errors will occur. To build the base model, we therefore fill in the missing records with NA (padding) and then impute appropriate values.
First, we fill the missing date records in the training data with NA. From here on, we will mainly use the modeltime package for time series processing. It is very easy to use and lets us handle everything from time series preprocessing to modeling smoothly within the tidymodels framework. Python is currently the most popular choice for data analysis, but R also has useful, easy-to-use analysis tools. For time series analysis in particular, Prof. Rob J. Hyndman’s packages and modeltime, which we use in this blog, are excellent, so please give them a try. Some people may think R cannot use deep-learning-based models, but it can; this will be explained later.
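For reference, these are the packages the code below appears to rely on (a sketch based on the functions used; the exact set is an assumption):

library(tidyverse)   # dplyr, purrr, readr, ...
library(timetk)      # pad_by_time()
library(lubridate)   # decimal_date()
library(imputeTS)    # na_seadec(), na_interpolation(), na_locf()
library(tidymodels)  # recipe(), workflow(), parsnip
library(modeltime)   # naive_reg(), nested modeltime functions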
full_tr <- tr %>%
  # create Id
  mutate(Id = paste(Store, Dept, sep = "_")) %>%
  # padding
  group_by(Id) %>%
  select(Id, Date, Weekly_Sales) %>%
  pad_by_time(
    Date,
    .start_date = min(tr$Date),
    .end_date   = max(tr$Date)
  ) %>%
  ungroup() %>%
  # change negative to 0
  mutate(Weekly_Sales = ifelse(Weekly_Sales < 0, 0, Weekly_Sales))
Let me explain the code in a little more detail. tr is the training data. We first created Id by joining Store and Dept with “_”. Next, we padded each Id’s series so that it spans from the first to the last date of the entire training data, filling the new rows with NA. Finally, we replaced negative values in Weekly_Sales with 0.
Name | Piped data |
Number of rows | 476333 |
Number of columns | 3 |
_______________________ | |
Column type frequency: | |
character | 1 |
Date | 1 |
numeric | 1 |
________________________ | |
Group variables | None |
Variable type: character
skim_variable | n_missing | complete_rate | min | max | empty | n_unique | whitespace |
---|---|---|---|---|---|---|---|
Id | 0 | 1 | 3 | 5 | 0 | 3331 | 0 |
Variable type: Date
skim_variable | n_missing | complete_rate | min | max | median | n_unique |
---|---|---|---|---|---|---|
Date | 0 | 1 | 2010-02-05 | 2012-10-26 | 2011-06-17 | 143 |
Variable type: numeric
skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
---|---|---|---|---|---|---|---|---|---|---|
Weekly_Sales | 54763 | 0.89 | 15981.47 | 22711.03 | 0 | 2079.65 | 7612.03 | 20205.85 | 693099.4 | ▇▁▁▁▁ |
We can see that the number of rows has increased by about 50,000. Next, let’s impute the NAs.
tr_imputed <- full_tr %>%
  group_by(Id) %>%
  group_split() %>%
  map_dfr(.f = function(df) {
    df     <- df %>% arrange(Date)
    m_date <- df$Date
    m_id   <- df$Id
    # convert to a weekly ts object so the imputation functions can use seasonality
    df <- ts(df$Weekly_Sales, start = decimal_date(min(tr$Date)), frequency = 52)
    if (sum(!is.na(df)) >= 3) {
      df <- as.numeric(na_seadec(df))        # seasonal imputation
    } else if (sum(!is.na(df)) == 2) {
      df <- as.numeric(na_interpolation(df)) # linear interpolation
    } else {
      df <- as.numeric(na_locf(df))          # last observation carried forward
    }
    tibble(Id = m_id, Date = m_date, Weekly_Sales = as.numeric(df))
  }) %>%
  mutate(Weekly_Sales = ifelse(Weekly_Sales < 0, 0, Weekly_Sales))
The imputation method depends on how many observed data points remain for each Id. The basic policy is as follows:
- 3 or more observed data points: seasonal imputation (na_seadec)
- 2 data points: linear interpolation (na_interpolation)
- 1 data point: carry the last observation forward (na_locf)
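These three helpers come from the imputeTS package. A toy illustration of how they behave on synthetic data:

# toy example of the three imputeTS helpers (synthetic series, frequency 3)
library(imputeTS)
x <- ts(c(10, 2, 30, 12, NA, 32, 11, 3, NA), frequency = 3)
na_seadec(x)        # remove seasonality, interpolate, add seasonality back
na_interpolation(x) # plain linear interpolation
na_locf(x)          # last observation carried forward

With that background, let’s skim the imputed training data: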
tr_imputed %>% skim()
Name | Piped data |
Number of rows | 476333 |
Number of columns | 3 |
_______________________ | |
Column type frequency: | |
character | 1 |
Date | 1 |
numeric | 1 |
________________________ | |
Group variables | None |
Variable type: character
skim_variable | n_missing | complete_rate | min | max | empty | n_unique | whitespace |
---|---|---|---|---|---|---|---|
Id | 0 | 1 | 3 | 5 | 0 | 3331 | 0 |
Variable type: Date
skim_variable | n_missing | complete_rate | min | max | median | n_unique |
---|---|---|---|---|---|---|
Date | 0 | 1 | 2010-02-05 | 2012-10-26 | 2011-06-17 | 143 |
Variable type: numeric
skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
---|---|---|---|---|---|---|---|---|---|---|
Weekly_Sales | 0 | 1 | 14155.72 | 21958.19 | 0 | 797.01 | 5742.07 | 17773.32 | 693099.4 | ▇▁▁▁▁ |
n_missing is now 0, so we have training data with no gaps for any Id. The Ids that exist only in the test data still have no training records, so we will simply fill in 0 for them when preparing the submission file.
Base Model (Snaive)
Now that the training data is complete, it is time for modeling. Here we create a snaive model as the base model. Snaive stands for seasonal naive: in this case, it simply uses the value from the same week in the previous year as the forecast. Using the previous year’s values for forecasting is still common practice in business, so it is an appropriate baseline.
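Conceptually, for weekly data a snaive forecast is just the value observed 52 weeks earlier. A minimal sketch of the idea in dplyr (assuming complete, regular weekly records per series; the column name snaive_pred is my own):

# conceptual sketch: snaive = "same week last year"
tr %>%
  group_by(Store, Dept) %>%
  arrange(Date, .by_group = TRUE) %>%
  mutate(snaive_pred = lag(Weekly_Sales, n = 52)) %>%
  ungroup()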
Forecasting a single time series is not difficult; it is simply a matter of applying a model. The key point here is that we want to apply that process to more than 3,000 time series at once.
In this case, there are two main options:
- Apply the model to each time series separately (iterative).
- Apply one model to the whole set of time series (global).
This time we adopt the former, so we start by creating the data set for it.
# set forecast horizon
horizon <- 39

# create nested data
nested_data_tbl <- tr_imputed %>%
  group_by(Id) %>%
  extend_timeseries(
    .id_var        = Id,
    .date_var      = Date,
    .length_future = horizon
  ) %>%
  nest_timeseries(
    .id_var        = Id,
    .length_future = horizon
  ) %>%
  split_nested_timeseries(
    .length_test = horizon
  )
For a detailed explanation, see the modeltime documentation on nested forecasting. The process above creates data in which each Id’s time series is stored in nested form. We then apply the model to this data with modeltime, which builds on parsnip, an excellent framework for applying models to data developed by Dr. Max Kuhn, a legend in data science. The concept of the framework is that various models can be applied with the same syntax: as shown below, the model and the recipe are defined separately, then combined in a workflow and fitted. It may seem difficult at first, but once you get used to it, it is very convenient, not least because it lets you create and evaluate many models at once. So, let’s get started.
# set model spec
model_snaive <- naive_reg(
  seasonal_period = 52
) %>%
  set_engine("snaive")

# create recipe
rec_snaive <- recipe(Weekly_Sales ~ Date,
                     extract_nested_train_split(nested_data_tbl))

# create workflow
wflw_snaive <- workflow() %>%
  add_model(model_snaive) %>%
  add_recipe(rec_snaive)
First, define the model and the recipe. For the model, we use naive_reg(), a parsnip-style model specification provided by modeltime, with seasonal_period set to 52 because this is weekly data. The recipe needs no adjustment, but the iterative (nested) workflow requires extracting the training split with extract_nested_train_split(). Then, combine the model and recipe in a workflow. Processing thousands of time series at once can take a long time, so it is better to start with a small sample to see whether the model works properly. First, extract a few arbitrary Ids.
set.seed(235)
id_sample <- sample(nested_data_tbl$Id %>% unique(), size = 6)
Let’s fit the model to the sampled data.
try_sample_tbl <- nested_data_tbl %>%
  filter(Id %in% id_sample) %>%
  modeltime_nested_fit(
    model_list = list(
      wflw_snaive
    ),
    control = control_nested_fit(
      verbose   = TRUE,
      allow_par = TRUE
    )
  )
Then we check for errors, evaluate accuracy, and inspect the graphs for problems. All of this takes only a few lines of code, which is what makes modeltime so useful.
try_sample_tbl %>% extract_nested_error_report()
## # A tibble: 0 × 4
## # … with 4 variables: Id <chr>, .model_id <int>, .model_desc <chr>,
## # .error_desc <chr>
try_sample_tbl %>%
extract_nested_test_accuracy() %>%
table_modeltime_accuracy()
Id | .model_id | .model_desc | .type | mae | mape | mase | smape | rmse | rsq |
---|---|---|---|---|---|---|---|---|---|
26_25 | 1 | SNAIVE | Test | 893.76 | 13.55 | 0.86 | 14.24 | 1288.86 | 0.35 |
29_34 | 1 | SNAIVE | Test | 2185.42 | 22.83 | 1.17 | 20.29 | 2696.7 | 0.32 |
44_25 | 1 | SNAIVE | Test | 19.24 | 55.11 | 0.84 | 52.89 | 30.86 | 0.09 |
45_67 | 1 | SNAIVE | Test | 1052.62 | 14.77 | 0.57 | 14.1 | 2019.59 | 0.58 |
6_1 | 1 | SNAIVE | Test | 4931.23 | 20.43 | 1.1 | 17.28 | 9950.62 | 0.11 |
8_19 | 1 | SNAIVE | Test | 631.38 | 122.37 | 5.13 | 60.99 | 1381.9 | 0 |
try_sample_tbl %>%
  extract_nested_test_forecast() %>%
  group_by(Id) %>%
  plot_modeltime_forecast(.interactive = FALSE, .facet_ncol = 3)
There are no errors. The graphs show that the values from 52 weeks earlier are applied as expected. The model seems to fit the samples well, so we do the same for the entire data set. This process takes a long time, so it is better to use parallel processing.
parallel_start(6)

nested_modeltime_tbl <- nested_data_tbl %>%
  modeltime_nested_fit(
    model_list = list(
      wflw_snaive
    ),
    control = control_nested_fit(
      verbose   = TRUE,
      allow_par = TRUE
    )
  )
Using 6 of the 8 cores on my machine, it took around 5 minutes. Now let’s check the errors and MAE.
# * Review Any Errors ----
nested_modeltime_tbl %>% extract_nested_error_report()
## # A tibble: 0 × 4
## # … with 4 variables: Id <chr>, .model_id <int>, .model_desc <chr>,
## # .error_desc <chr>
nested_modeltime_tbl %>%
extract_nested_test_accuracy() %>%
table_modeltime_accuracy()
Id | .model_id | .model_desc | .type | mae | mape | mase | smape | rmse | rsq |
---|---|---|---|---|---|---|---|---|---|
1_1 | 1 | SNAIVE | Test | 4050.18 | 17.94 | 0.89 | 14.7 | 8977.06 | 0.26 |
1_10 | 1 | SNAIVE | Test | 3009.71 | 8.84 | 1.05 | 9.08 | 3988.16 | 0.19 |
1_11 | 1 | SNAIVE | Test | 4020.55 | 15.31 | 1.03 | 15.22 | 4959.1 | 0.47 |
1_12 | 1 | SNAIVE | Test | 1356.33 | 12.64 | 1.13 | 12.32 | 1675.36 | 0.12 |
1_13 | 1 | SNAIVE | Test | 2607.15 | 6.33 | 1.06 | 6.62 | 3113.97 | 0.37 |
1_14 | 1 | SNAIVE | Test | 1742.82 | 11.7 | 1.09 | 12.27 | 2232.6 | 0.34 |
There seem to be no errors. Some MAE values differ widely across Ids, but this is only the base model, so that is acceptable. Next, let’s refit on all of the training data.
nested_best_refit_tbl <- nested_modeltime_tbl %>%
  modeltime_nested_refit(
    control = control_refit(
      verbose   = TRUE,
      allow_par = TRUE
    )
  )
It took about 3 minutes. Then, let’s plot it the same way as before.
set.seed(234)
id_sample <- sample(nested_data_tbl$Id %>% unique(), size = 6)

nested_best_refit_tbl %>%
  extract_nested_future_forecast() %>%
  filter(Id %in% id_sample) %>%
  group_by(Id) %>%
  plot_modeltime_forecast(.facet_ncol = 3)
Overall, the forecasts look reasonably good, except for Ids with a clear overall trend, such as 3_2 and 34_35, which a snaive model cannot capture.
Submit Base Model & Check Scores
Finally, let's prepare the submission file.
results <- nested_best_refit_tbl %>% extract_nested_future_forecast()

results <- results %>%
  mutate(Id = paste(Id, .index, sep = "_")) %>%
  select(Id, .value)

results %>% head()
## # A tibble: 6 × 2
## Id .value
## <chr> <dbl>
## 1 1_1_2010-02-05 24924.
## 2 1_1_2010-02-12 46039.
## 3 1_1_2010-02-19 41596.
## 4 1_1_2010-02-26 19404.
## 5 1_1_2010-03-05 21828.
## 6 1_1_2010-03-12 21043.
submission %>%
  left_join(results, by = "Id") %>%
  mutate(Weekly_Sales = case_when(
    !is.na(.value) ~ .value,
    is.na(.value)  ~ 0
  )) %>%
  select(Id, Weekly_Sales) %>%
  write_csv("submission_snaive.csv")
The Ids that were not forecast are filled in with zeros. We submitted the file, and the rank was...
170th out of 688
Private score: 3026.42451
Public score: 2944.19927
Although this simply applies the values from 52 weeks earlier, it lands in the top quartile, which is not bad performance. This result can be attributed to the fact that modern time series processing methods were not yet widespread in 2014, when the competition was held, and that the limited number of data points made it difficult to apply machine-learning-based models. Snaive is the simplest model we can use, and from this baseline we will see how much higher we can climb.
In Part 2, we will cover feature engineering, which is very important when applying machine-learning-based models. See you in the next blog!