Time Series with SARIMAX & Prophet | ||
---|---|---|
This notebook contains my notes from the tutorial source linked below. I have personalized the code in many places to match my own way of doing things, and I have added my own explanations of things for my own study. | ||
| Source Author | Source Repository | |
Evan Marie online: | EvanMarie.com | EvanMarie@Proton.me | LinkedIn | GitHub | Hugging Face | Mastodon | Jovian.ai | TikTok | CodeWars | Discord ⇨ ✨ EvanMarie ✨#6114 | |
from helpers_strawberries import *
import_all()
plt.style.use('strawberries.mplstyle')
%matplotlib inline
import seaborn as sns
import statsmodels.api as sm
%%html
<style>
a:link {color: #742a40 !important; font-weight: 600 !important;}
a:visited {color: #742a40 !important; font-weight: 600 !important;}
</style>
Time Series Components | ||
---|---|---|
- trend shows whether the series is consistently decreasing (downward trend), constant (no trend), or increasing (upward trend) over time
- seasonality describes the periodic signal in the time series
- noise or residual is the unexplained variance and volatility of the time series
||
- Python’s statsmodels library: seasonal_decompose | ||
- Seasonality: a pattern that occurs in a fixed and known period
- Cyclicality: a pattern that does not have a fixed or known period
- Stationarity: when the data's statistical properties do not change over time
  - with algorithms like SARIMAX, it is important to identify this property, because they depend on it
  - with linear regression, it is assumed that observations are independent of each other
  - in a time series, observations are time dependent
  - by making the time series stationary, it is possible to apply regression techniques to time-dependent variables
  - non-stationary time series can be made stationary
Stationary Time Series Criteria
- the variance in the seasonality component is constant
- the amplitude of the signal does not change much over time
- autocorrelation is constant - the relationship of each value in the time series and its neighbors stays the same
- analyzing components is a common way to check for stationarity
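Another quick check is to plot rolling statistics next to the series itself: if the rolling mean and rolling standard deviation stay roughly constant, that is consistent with stationarity. This is a minimal sketch of my own (the function name and window size are illustrative, not from the tutorial) and works for any pandas Series with a datetime index:
import pandas as pd
import matplotlib.pyplot as plt

def rolling_stationarity_plot(series: pd.Series, window: int = 12):
    # Plot the series together with its rolling mean and rolling standard deviation.
    # Roughly constant rolling statistics are consistent with a stationary series.
    ax = series.plot(label='observed', figsize=(12, 4))
    series.rolling(window).mean().plot(ax=ax, label=f'rolling mean ({window})')
    series.rolling(window).std().plot(ax=ax, label=f'rolling std ({window})')
    ax.legend();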
Additive Models and Multiplicative Models | ||
---|---|---|
Time series trend, seasonal, and residual components can occur in either an additive or multiplicative way. | ||
Additive Models:
- the components are added linearly
- changes over time are consistent in the amount they change
$$Y(t) = trend + seasonality + residual$$
- a linear trend is a straight line
- linear seasonality has the same frequency and amplitude (width and height of cycles)
||
Multiplicative Models:
- the components are multiplied with one another
$$Y(t) = trend \times seasonality \times residual$$
- non-linear, i.e. quadratic or exponential
- changes increase or decrease over time
- the trend is a curved, non-linear line
- non-linear seasonality varies in frequency and amplitude
Decomposition Models
- a main objective of decomposition is to estimate seasonal effects
- these can be used to create seasonally adjusted values
- additive models are useful when seasonal variation is fairly constant over time
- multiplicative models are useful when seasonal variation increases over time
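To make the additive vs. multiplicative distinction concrete, here is a small sketch of my own using a synthetic monthly series (the component definitions and period=12 are illustrative assumptions, not part of the tutorial data); seasonal_decompose recovers the components under each model:
import numpy as np
import pandas as pd
import statsmodels.api as sm

idx = pd.date_range('2015-01-01', periods=120, freq='MS')
trend = np.linspace(10, 20, 120)                              # slowly rising trend
seasonal = 1 + 0.2 * np.sin(2 * np.pi * np.arange(120) / 12)  # yearly cycle
noise = np.random.default_rng(0).normal(0, 0.02, 120)         # small residual

additive = pd.Series(trend + 10 * (seasonal - 1) + noise, index=idx)   # Y = T + S + R
multiplicative = pd.Series(trend * seasonal * (1 + noise), index=idx)  # Y = T * S * R

sm.tsa.seasonal_decompose(additive, model='additive', period=12).plot();
sm.tsa.seasonal_decompose(multiplicative, model='multiplicative', period=12).plot();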
Google Trends Data | ||
---|---|---|
- dataset reflecting the Google searches for the word 'diet' - date range is 2016-03-27 to 2021-03-21 |
diet = pd.read_csv("./sarimax_prophet_repo/data/raw/time-series/multiTimeline_diet.csv",
skiprows=[0,1],
index_col='Week',
parse_dates=['Week'])
df_overview(diet, 'Google "diet"')
diet: (United States) | |
---|---|
datatype | int64 |
missing values | 0 |
count | 261.00 |
mean | 56.35 |
std | 11.50 |
min | 30.00 |
25% | 49.00 |
50% | 55.00 |
75% | 64.00 |
max | 100.00 |
total rows | 261 |
---|---|
total columns | 1 |
column names | diet: (United States) |
index start | 2016-03-27 00:00:00 |
index end | 2021-03-21 00:00:00 |
total missing values | 0 |
Google "diet" Head and Tail |
diet: (United States) | |
---|---|
Week | |
2016-03-27 | 58 |
2016-04-03 | 63 |
2016-04-10 | 59 |
diet: (United States) | |
---|---|
Week | |
2021-03-07 | 48 |
2021-03-14 | 44 |
2021-03-21 | 48 |
Seasonal Pattern | ||
---|---|---|
- searches for 'diet' decrease rapidly at the end of each year - at the beginning of each year, they spike - seasonality occurs in a fixed and known period - no consistent increase or decrease in the trend, suggesting a non-linear trend |
diet.plot(title = 'Google Searches for the Word "diet"', color = 'cyan');
Breaking down components: | ||
---|---|---|
- because the trend is non-linear, the model parameter will be set to multiplicative (by default, it is additive) - the period can be specified depending on the time series - because the data is given in weeks, the period is set to the number of weeks in a year - the frequency and amplitude of seasonality remain constant, suggesting linearity |
statsmodels.tsa.seasonal_decompose() |
---|
# help(sm.tsa.seasonal_decompose)
import statsmodels.api as sm
decomposition = sm.tsa.seasonal_decompose(diet['diet: (United States)'],
model = 'multiplicative',
period=53)
fig = decomposition.plot();
fig.tight_layout(pad=0.75);
model = 'additive' |
---|
import statsmodels.api as sm
decomposition = sm.tsa.seasonal_decompose(diet['diet: (United States)'],
model = 'additive',
period=53)
fig = decomposition.plot();
fig.tight_layout(pad=0.75);
- trend is still non-linear - time series follows no consistent up or down slope - thus no positive or negative trend (up or down) - additive model fits the data better |
||
---|---|---|
- Additive & Multiplicative Decomposition | ||
Stationarity Tests | ||
---|---|---|
statsmodels documentation on ADF and KPSS tests
- the Augmented Dickey-Fuller (ADF) stationarity test can allow data to pass that may not actually be stationary
- it is best to also apply the Kwiatkowski-Phillips-Schmidt-Shin (KPSS) test to check for true stationarity
- it is also important to observe the time series' plot
||
ADF Test:
- tests for the presence of a unit root in the series
- helps determine if the series is stationary

In statistics, an augmented Dickey–Fuller test (ADF) tests the null hypothesis that a unit root is present in a time series sample. The alternative hypothesis is different depending on which version of the test is used, but is usually stationarity or trend-stationarity. It is an augmented version of the Dickey–Fuller test for a larger and more complicated set of time series models. The augmented Dickey–Fuller (ADF) statistic, used in the test, is a negative number. The more negative it is, the stronger the rejection of the hypothesis that there is a unit root at some level of confidence. (Source)

Unit Root

In probability theory and statistics, a unit root is a feature of some stochastic processes (such as random walks) that can cause problems in statistical inference involving time series models. A linear stochastic process has a unit root if 1 is a root of the process's characteristic equation. Such a process is non-stationary but does not always have a trend. (Source)
||
- Null Hypothesis: The series has a unit root, meaning it is non-stationary. It has some time dependent structure.
- Alternate Hypothesis: The series has no unit root, meaning it is stationary. It does not have time-dependent structure.
- if the null hypothesis is not rejected, that is possible evidence the series is non-stationary
- a p-value below a threshold (1%, 5%, etc.) suggests rejecting the null hypothesis, i.e. the series is stationary
- a p-value above the threshold suggests failing to reject the null hypothesis, i.e. the series is non-stationary
||
KPSS Test
- the null and alternate hypotheses are opposite those of ADF
- Null Hypothesis: the time series is trend stationary
- Alternate Hypothesis: the series has a unit root and is not stationary
- a p-value below a threshold suggests rejecting the null hypothesis, i.e. non-stationarity
- a p-value above the threshold suggests failing to reject the null hypothesis, i.e. stationarity
# statsmodels for the two tests |
---|
statsmodels.tsa.stattools.adfuller() - Augmented Dickey Fuller Test |
---|
# help(sm.tsa.stattools.adfuller)
from statsmodels.tsa.stattools import adfuller
def adf_test(timeseries):
    print('Results of Dickey-Fuller Test:')
    dftest = adfuller(timeseries, autolag='AIC')
    dfoutput = pd.Series(dftest[0:4], index=['Test Statistic', 'p-value', '#Lags Used', 'Number of Observations Used'])
    for key, value in dftest[4].items():
        dfoutput['Critical Value (%s)' % key] = value
    print(dfoutput)
statsmodels.tsa.stattools.kpss() - Kwiatkowski-Phillips-Schmidt-Shin test |
---|
# help(sm.tsa.stattools.kpss)
from statsmodels.tsa.stattools import kpss
def kpss_test(timeseries):
    print('Results of KPSS Test:')
    kpsstest = kpss(timeseries, regression='c', nlags="auto")
    kpss_output = pd.Series(kpsstest[0:3], index=['Test Statistic', 'p-value', 'Lags Used'])
    for key, value in kpsstest[3].items():
        kpss_output['Critical Value (%s)' % key] = value
    print(kpss_output)
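As a quick sanity check of these two helpers (a sketch of my own - the simulated series below are not part of the tutorial data), a random walk has a unit root and should fail the ADF test, while white noise should pass both:
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
white_noise = pd.Series(rng.normal(size=500))           # stationary by construction
random_walk = pd.Series(rng.normal(size=500).cumsum())  # contains a unit root, non-stationary

adf_test(white_noise)   # expect a small p-value: reject the unit-root null, i.e. stationary
kpss_test(white_noise)  # expect a large p-value: fail to reject the stationarity null
adf_test(random_walk)   # expect a large p-value: unit root present, i.e. non-stationary
kpss_test(random_walk)  # expect a small p-value: reject the stationarity null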
The following outcomes are possible: |
||
---|---|---|
- both tests deem the series not stationary, therefore the series is not stationary
- both tests deem the series IS stationary, therefore the series is stationary
- KPSS indicates stationarity and ADF indicates non-stationarity, meaning the series is trend stationary. In this case, the trend must be removed to make the series strictly stationary, and the series should be checked for stationarity again with the trend removed
- KPSS indicates non-stationarity and ADF indicates stationarity, meaning the series is difference stationary. In this case, differencing must be used to make the series stationary, and the differenced series should be checked for stationarity
ADF Results |
---|
adf_test(diet['diet: (United States)'])
Results of Dickey-Fuller Test:
Test Statistic                  -2.86
p-value                          0.05
#Lags Used                       6.00
Number of Observations Used    254.00
Critical Value (1%)             -3.46
Critical Value (5%)             -2.87
Critical Value (10%)            -2.57
dtype: float64
KPSS Results |
---|
kpss_test(diet['diet: (United States)'])
Results of KPSS Test:
Test Statistic           0.63
p-value                  0.02
Lags Used                9.00
Critical Value (10%)     0.35
Critical Value (5%)      0.46
Critical Value (2.5%)    0.57
Critical Value (1%)      0.74
dtype: float64
Results: | ||
---|---|---|
ADF -> with a 0.05 threshold, the p-value is just below the threshold, so the null hypothesis is rejected and the series is considered stationary. KPSS -> the evidence suggests rejecting the null hypothesis in favor of the alternate hypothesis, thus suggesting non-stationarity
- these results fall into the last category in the list above, so differencing must be applied to achieve stationarity
- the series must then be tested again
||
Summary:
- the trend is non-linear and multiplicative, neither increasing nor decreasing consistently
- seasonality is strongest at the end and beginning of each year
- seasonality is linear and does not vary in frequency or amplitude
- the additive residuals are lower than the multiplicative ones
- ADF says stationary while KPSS says non-stationary, so differencing must be applied
||
Conclusion:
- since seasonality is linear and the additive residuals are smaller, it is reasonable to choose the additive model as the more appropriate one
Making a Time Series Stationary | ||
---|---|---|
- statistical models often require a time series to be stationary in order to make effective and precise predictions, for example the ARIMA model

DIFFERENCING:
- subtract the previous value from each value in the time series
- other transformations are possible as well, e.g. taking the log or the square root of the time series
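A minimal sketch of these transformations using the diet series loaded above (the variable names are my own, not from the tutorial); note that a first difference can be undone with a cumulative sum plus the first observed value:
import numpy as np

sales = diet['diet: (United States)']

# First difference: subtract the previous value from each value
sales_diff = sales.diff().dropna()

# Other possible transformations (values must be positive for the log)
sales_log = np.log(sales)
sales_sqrt = np.sqrt(sales)

# Undoing the first difference: cumulatively sum and add back the first observation
restored = sales_diff.cumsum() + sales.iloc[0]
assert np.allclose(restored.values, sales.iloc[1:].values)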
def adf_kpss(timeseries, max_d):
    """Build a dataframe of ADF and KPSS p-values after repeatedly differencing a time series.
    Args:
        timeseries (df): Dataframe of a univariate time series
        max_d (int): Max number of times to apply differencing
    Returns:
        Dataframe showing ADF and KPSS p-values and stationarity flags after applying d rounds of
        differencing to the time series.
    """
    pd.reset_option('display.float_format')
    results = []
    for idx in range(max_d):
        adf_result = adfuller(timeseries, autolag='AIC')
        kpss_result = kpss(timeseries, regression='c', nlags="auto")
        timeseries = timeseries.diff().dropna()
        if adf_result[1] <= 0.05:
            adf_stationary = True
        else:
            adf_stationary = False
        if kpss_result[1] <= 0.05:
            kpss_stationary = False
        else:
            kpss_stationary = True
        stationary = adf_stationary & kpss_stationary
        results.append((idx, adf_result[1], kpss_result[1], adf_stationary, kpss_stationary, stationary))
    # Construct DataFrame -- note: the 'adf_stats' column holds the ADF p-value and 'p-value' holds the KPSS p-value
    results_df = pd.DataFrame(results, columns=['d', 'adf_stats', 'p-value', 'is_adf_stationary', 'is_kpss_stationary', 'is_stationary'])
    return results_df
adf_kpss(diet, 3)
/Users/evancarr/opt/anaconda3/envs/sktime_project_02_05_23/lib/python3.10/site-packages/statsmodels/tsa/stattools.py:2022: InterpolationWarning: The test statistic is outside of the range of p-values available in the look-up table. The actual p-value is greater than the p-value returned. warnings.warn( /Users/evancarr/opt/anaconda3/envs/sktime_project_02_05_23/lib/python3.10/site-packages/statsmodels/tsa/stattools.py:2022: InterpolationWarning: The test statistic is outside of the range of p-values available in the look-up table. The actual p-value is greater than the p-value returned. warnings.warn(
d | adf_stats | p-value | is_adf_stationary | is_kpss_stationary | is_stationary | |
---|---|---|---|---|---|---|
0 | 0 | 4.992622e-02 | 0.019878 | True | False | False |
1 | 1 | 1.171221e-13 | 0.100000 | True | True | True |
2 | 2 | 2.761322e-12 | 0.100000 | True | True | True |
pd.DataFrame.diff() - used to make the differenced and stationary time series |
---|
help(pd.DataFrame.diff)
Help on function diff in module pandas.core.frame:

diff(self, periods: 'int' = 1, axis: 'Axis' = 0) -> 'DataFrame'
    First discrete difference of element.

    Calculates the difference of a DataFrame element compared with another
    element in the DataFrame (default is element in previous row).

    Parameters
    ----------
    periods : int, default 1
        Periods to shift for calculating difference, accepts negative values.
    axis : {0 or 'index', 1 or 'columns'}, default 0
        Take difference over rows (0) or columns (1).

    Returns
    -------
    DataFrame
        First differences of the Series.

    See Also
    --------
    DataFrame.pct_change: Percent change over given number of periods.
    DataFrame.shift: Shift index by desired number of periods with an optional time freq.
    Series.diff: First discrete difference of object.

    Notes
    -----
    For boolean dtypes, this uses :meth:`operator.xor` rather than :meth:`operator.sub`.
    The result is calculated according to current dtype in DataFrame,
    however dtype of the result is always float64.

    Examples
    --------
    Difference with previous row

    >>> df = pd.DataFrame({'a': [1, 2, 3, 4, 5, 6],
    ...                    'b': [1, 1, 2, 3, 5, 8],
    ...                    'c': [1, 4, 9, 16, 25, 36]})
    >>> df
       a  b   c
    0  1  1   1
    1  2  1   4
    2  3  2   9
    3  4  3  16
    4  5  5  25
    5  6  8  36

    >>> df.diff()
         a    b     c
    0  NaN  NaN   NaN
    1  1.0  0.0   3.0
    2  1.0  1.0   5.0
    3  1.0  1.0   7.0
    4  1.0  2.0   9.0
    5  1.0  3.0  11.0

    Difference with previous column

    >>> df.diff(axis=1)
        a  b   c
    0 NaN  0   0
    1 NaN -1   3
    2 NaN -1   7
    3 NaN -1  13
    4 NaN  0  20
    5 NaN  2  28

    Difference with 3rd previous row

    >>> df.diff(periods=3)
         a    b     c
    0  NaN  NaN   NaN
    1  NaN  NaN   NaN
    2  NaN  NaN   NaN
    3  3.0  2.0  15.0
    4  3.0  4.0  21.0
    5  3.0  6.0  27.0

    Difference with following row

    >>> df.diff(periods=-1)
         a    b     c
    0 -1.0  0.0  -3.0
    1 -1.0 -1.0  -5.0
    2 -1.0 -1.0  -7.0
    3 -1.0 -2.0  -9.0
    4 -1.0 -3.0 -11.0
    5  NaN  NaN   NaN

    Overflow in input dtype

    >>> df = pd.DataFrame({'a': [1, 0]}, dtype=np.uint8)
    >>> df.diff()
           a
    0    NaN
    1  255.0
# time series is now stationary |
---|
fig, axes = plt.subplots(nrows=2, ncols=1, figsize = (12, 8))
axes[0] = diet.plot(ax = axes[0], title = "Original timeseries" )
axes[0].legend(loc = 2)
axes[1] = diet.diff().dropna().plot(ax = axes[1], title = "Stationary timeseries - original timeseries differenced once")
axes[1].legend().remove()
plt.tight_layout()
Global Temperature Dataset | ||
---|---|---|
- Dataset source - includes global monthly average temperature (°C) anomalies from 1880 to 2016 - GISS surface temperature analysis, GISTEMP - global component of Climate at a Glance, GCAG |
weather = pd.read_csv("./sarimax_prophet_repo/data/raw/time-series/monthly_csv.csv",index_col='Date', parse_dates=['Date'])
weather.sort_index(inplace=True)
df_overview(weather, 'Global Weather', 4)
Source | Mean | |
---|---|---|
datatype | object | float64 |
missing values | 0 | 0 |
count | nan | 3,288.00 |
mean | nan | 0.04 |
std | nan | 0.34 |
min | nan | -0.78 |
25% | nan | -0.21 |
50% | nan | -0.04 |
75% | nan | 0.24 |
max | nan | 1.35 |
total rows | 3,288 |
---|---|
total columns | 2 |
column names | Source, Mean |
index start | 1880-01-06 00:00:00 |
index end | 2016-12-06 00:00:00 |
total missing values | 0 |
Global Weather Head and Tail |
Source | Mean | |
---|---|---|
Date | ||
1880-01-06 | GISTEMP | -0.30 |
1880-01-06 | GCAG | 0.00 |
1880-02-06 | GCAG | -0.12 |
1880-02-06 | GISTEMP | -0.21 |
Source | Mean | |
---|---|---|
Date | ||
2016-11-06 | GISTEMP | 0.93 |
2016-11-06 | GCAG | 0.75 |
2016-12-06 | GISTEMP | 0.81 |
2016-12-06 | GCAG | 0.79 |
gistemp = pd.DataFrame(weather[weather.Source == 'GISTEMP']['Mean'].copy())
gistemp.columns = ['giss_mean']
gcag = pd.DataFrame(weather[weather.Source == 'GCAG']['Mean'].copy())
gcag.columns = ['gcag_mean']
multi([(gistemp.head(5), 'GISTEMP'),
(gcag.head(5), 'GCAG')])
giss_mean | |
---|---|
Date | |
1880-01-06 | -0.30 |
1880-02-06 | -0.21 |
1880-03-06 | -0.18 |
1880-04-06 | -0.27 |
1880-05-06 | -0.14 |
gcag_mean | |
---|---|
Date | |
1880-01-06 | 0.00 |
1880-02-06 | -0.12 |
1880-03-06 | -0.14 |
1880-04-06 | -0.05 |
1880-05-06 | -0.07 |
df_overview(gistemp, 'GISSTEMP', 3)
giss_mean | |
---|---|
datatype | float64 |
missing values | 0 |
count | 1,644.00 |
mean | 0.02 |
std | 0.34 |
min | -0.78 |
25% | -0.23 |
50% | -0.05 |
75% | 0.23 |
max | 1.35 |
total rows | 1,644 |
---|---|
total columns | 1 |
column names | giss_mean |
index start | 1880-01-06 00:00:00 |
index end | 2016-12-06 00:00:00 |
total missing values | 0 |
GISSTEMP Head and Tail |
giss_mean | |
---|---|
Date | |
1880-01-06 | -0.30 |
1880-02-06 | -0.21 |
1880-03-06 | -0.18 |
giss_mean | |
---|---|
Date | |
2016-10-06 | 0.89 |
2016-11-06 | 0.93 |
2016-12-06 | 0.81 |
df_overview(gcag, 'GCAG', 3)
gcag_mean | |
---|---|
datatype | float64 |
missing values | 0 |
count | 1,644.00 |
mean | 0.05 |
std | 0.33 |
min | -0.68 |
25% | -0.19 |
50% | -0.02 |
75% | 0.25 |
max | 1.22 |
total rows | 1,644 |
---|---|
total columns | 1 |
column names | gcag_mean |
index start | 1880-01-06 00:00:00 |
index end | 2016-12-06 00:00:00 |
total missing values | 0 |
GCAG Head and Tail |
gcag_mean | |
---|---|
Date | |
1880-01-06 | 0.00 |
1880-02-06 | -0.12 |
1880-03-06 | -0.14 |
gcag_mean | |
---|---|
Date | |
2016-10-06 | 0.73 |
2016-11-06 | 0.75 |
2016-12-06 | 0.79 |
fig, axes = plt.subplots(2, 1, figsize = (12, 8))
gistemp.plot(ax = axes[0]);
gcag.plot(ax = axes[1]);
plt.tight_layout()
plt.suptitle("Monthly mean temperature anomalies in degrees Celsius relative to a base period", size = 16, y = 1.04);
- plots above suggest a positive trend - the time series is therefore non-stationary |
---|
# decomposing the time series, using period = 12, 1 year |
---|
decomposition_gistemp = sm.tsa.seasonal_decompose(gistemp['giss_mean'],
period = 12)
fig = decomposition_gistemp.plot()
# increasing frequency/period to make seasonality possible to observe, increasing from 1 year to 20 years |
---|
decomposition_gistemp = sm.tsa.seasonal_decompose(gistemp['giss_mean'],
period = 240)
fig = decomposition_gistemp.plot()
Conclusions: | |||||||||||||||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
- trend is positive, suggesting non-stationarity - amplitude and frequency do not change, suggesting use of an additive model |
ADF and KPSS Tests |
---|
adf_test(gistemp)
Results of Dickey-Fuller Test:
Test Statistic                   -0.360964
p-value                           0.916415
#Lags Used                       24.000000
Number of Observations Used    1619.000000
Critical Value (1%)              -3.434396
Critical Value (5%)              -2.863327
Critical Value (10%)             -2.567721
dtype: float64
kpss_test(gistemp)
Results of KPSS Test:
Test Statistic            4.970866
p-value                   0.010000
Lags Used                26.000000
Critical Value (10%)      0.347000
Critical Value (5%)       0.463000
Critical Value (2.5%)     0.574000
Critical Value (1%)       0.739000
dtype: float64
/Users/evancarr/opt/anaconda3/envs/sktime_project_02_05_23/lib/python3.10/site-packages/statsmodels/tsa/stattools.py:2018: InterpolationWarning: The test statistic is outside of the range of p-values available in the look-up table. The actual p-value is smaller than the p-value returned. warnings.warn(
# both tests show the gistemp time series as non-stationary, which confirms what is seen in the plots above. |
---|
Zooming in to view the plots for just the time period 2014 - 2016, the last three years of the data |
---|
decomposition = sm.tsa.seasonal_decompose(gistemp['giss_mean']["2014":"2016"],
period = 12)
fig = decomposition.plot()
Observations: | |||||||||||||||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
- positive trend - seasonality present with lower values in July and higher in March |
# How many times must the data be differenced to achieve stationarity? |
---|
adf_kpss(gistemp, 3)
/Users/evancarr/opt/anaconda3/envs/sktime_project_02_05_23/lib/python3.10/site-packages/statsmodels/tsa/stattools.py:2018: InterpolationWarning: The test statistic is outside of the range of p-values available in the look-up table. The actual p-value is smaller than the p-value returned. warnings.warn( /Users/evancarr/opt/anaconda3/envs/sktime_project_02_05_23/lib/python3.10/site-packages/statsmodels/tsa/stattools.py:2022: InterpolationWarning: The test statistic is outside of the range of p-values available in the look-up table. The actual p-value is greater than the p-value returned. warnings.warn( /Users/evancarr/opt/anaconda3/envs/sktime_project_02_05_23/lib/python3.10/site-packages/statsmodels/tsa/stattools.py:2022: InterpolationWarning: The test statistic is outside of the range of p-values available in the look-up table. The actual p-value is greater than the p-value returned. warnings.warn(
d | adf_stats | p-value | is_adf_stationary | is_kpss_stationary | is_stationary | |
---|---|---|---|---|---|---|
0 | 0 | 9.164152e-01 | 0.01 | False | False | False |
1 | 1 | 9.464601e-23 | 0.10 | True | True | True |
2 | 2 | 2.255372e-29 | 0.10 | True | True | True |
adf_kpss(gcag, 3)
/Users/evancarr/opt/anaconda3/envs/sktime_project_02_05_23/lib/python3.10/site-packages/statsmodels/tsa/stattools.py:2018: InterpolationWarning: The test statistic is outside of the range of p-values available in the look-up table. The actual p-value is smaller than the p-value returned. warnings.warn( /Users/evancarr/opt/anaconda3/envs/sktime_project_02_05_23/lib/python3.10/site-packages/statsmodels/tsa/stattools.py:2022: InterpolationWarning: The test statistic is outside of the range of p-values available in the look-up table. The actual p-value is greater than the p-value returned. warnings.warn( /Users/evancarr/opt/anaconda3/envs/sktime_project_02_05_23/lib/python3.10/site-packages/statsmodels/tsa/stattools.py:2022: InterpolationWarning: The test statistic is outside of the range of p-values available in the look-up table. The actual p-value is greater than the p-value returned. warnings.warn(
d | adf_stats | p-value | is_adf_stationary | is_kpss_stationary | is_stationary | |
---|---|---|---|---|---|---|
0 | 0 | 8.921048e-01 | 0.01 | False | False | False |
1 | 1 | 7.940293e-22 | 0.10 | True | True | True |
2 | 2 | 4.754621e-29 | 0.10 | True | True | True |
# gistemp and gcag time series become stationary after differencing once (d=1). |
---|
Original Time Series Plotted |
---|
fig, (ax1,ax2) = plt.subplots(2,1)
plt.suptitle("Monthly mean temperature anomalies in degrees Celsius relative to a base period", size = 16)
gistemp.plot(ax=ax1, color = 'cyan')
gcag.plot(ax=ax2, color = 'yellow')
plt.tight_layout()
Differenced Time Series Plotted |
---|
fig, (ax1,ax2) = plt.subplots(2,1)
plt.suptitle("DIFFERENCED: Monthly mean temperature anomalies in degrees Celsius relative to a base period", size = 14);
gistemp.diff().dropna().plot(ax=ax1, color = 'cyan');
gcag.diff().dropna().plot(ax=ax2, color='yellow');
Forecasting with ARIMA | ||
---|---|---|
- Using SARIMAX but not setting any of the seasonality orders |
import warnings
warnings.filterwarnings('ignore')
from statsmodels.tools.sm_exceptions import ConvergenceWarning, InterpolationWarning
warnings.simplefilter('ignore', ConvergenceWarning)
warnings.simplefilter('ignore', InterpolationWarning)
# function to reduce memory usage, (Source, Numer.AI) |
---|
def reduce_memory_usage(df, verbose=True):
    numerics = ['int16', 'int32', 'int64', 'float16', 'float32', 'float64']
    start_mem = df.memory_usage().sum() / 1024**2
    for col in df.columns:
        col_type = df[col].dtypes
        if col_type in numerics:
            c_min = df[col].min()
            c_max = df[col].max()
            if str(col_type)[:3] == 'int':
                if c_min > np.iinfo(np.int8).min and c_max < np.iinfo(np.int8).max:
                    df[col] = df[col].astype(np.int8)
                elif c_min > np.iinfo(np.int16).min and c_max < np.iinfo(np.int16).max:
                    df[col] = df[col].astype(np.int16)
                elif c_min > np.iinfo(np.int32).min and c_max < np.iinfo(np.int32).max:
                    df[col] = df[col].astype(np.int32)
                elif c_min > np.iinfo(np.int64).min and c_max < np.iinfo(np.int64).max:
                    df[col] = df[col].astype(np.int64)
            else:
                if c_min > np.finfo(np.float16).min and c_max < np.finfo(np.float16).max:
                    df[col] = df[col].astype(np.float16)
                elif c_min > np.finfo(np.float32).min and c_max < np.finfo(np.float32).max:
                    df[col] = df[col].astype(np.float32)
                else:
                    df[col] = df[col].astype(np.float64)
    end_mem = df.memory_usage().sum() / 1024**2
    if verbose:
        print('Mem. usage decreased to {:5.2f} Mb ({:.1f}% reduction)'.format(end_mem, 100 * (start_mem - end_mem) / start_mem))
    return df
(S)ARIMA(X) models | ||
---|---|---|
ARIMA - AutoRegressive Integrated Moving Average
- a combination of two models: autoregressive, which uses lagged values to forecast, and moving average, which uses lagged residual errors to forecast, with an integration (differencing) step in between
- uses dependencies between data values and error values from past data to make predictions
- takes three parameters:
  * p - number of autoregressive lags
  * d - number of times differencing is applied to make the data stationary
  * q - number of moving average lags
- it is also possible to apply transformations before using the ARIMA model
- however, if differencing and other transformations are applied before the model, they must be reverse transformed to access the forecast of the original values
- it is important to difference the data ONLY until it is stationary and no further
- to know the value for d, the number of times to run differencing, use the ADF and KPSS tests, i.e. the adf_kpss() function above
||
ARIMAX
- an extended version of the ARIMA model which incorporates exogenous inputs
- modeled using other independent variables in addition to the time series
- example: when modeling the waiting time in an emergency room, the number of nurses available on a certain shift could be considered an external variable, since it may impact the waiting time. If this is indeed the case, by changing the number of nurses we can affect the waiting times. | ||
SARIMA
- this model should be used when there is seasonality, which ARIMA ignores
- SARIMA includes additional parameters to work with seasonality: P, D, Q, and S
  * P - seasonal autoregressive order
  * D - seasonal differencing order
  * Q - seasonal moving average order
  * S - length of the seasonal cycle
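A minimal sketch of how these parameters map onto statsmodels' SARIMAX (the series y and the order values are placeholders of my own, not tuned for any particular dataset):
from statsmodels.tsa.statespace.sarimax import SARIMAX
import pandas as pd
import numpy as np

# y: a placeholder daily series (illustrative only, not tutorial data)
y = pd.Series(np.random.default_rng(0).normal(50, 5, 200),
              index=pd.date_range('2020-01-01', periods=200, freq='D'))

# order = (p, d, q): non-seasonal autoregressive, differencing, and moving average orders
# seasonal_order = (P, D, Q, S): their seasonal counterparts plus the cycle length S
# an exog argument would hold external regressors -- the "X" in SARIMAX
model = SARIMAX(y, order=(1, 1, 1), seasonal_order=(1, 1, 1, 7))
results = model.fit(disp=False)
forecast = results.get_forecast(steps=14).predicted_mean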
The Walmart Dataset | ||
---|---|---|
- comes from a Store Item Demand Forecasting Challenge on Kaggle - 5 years of store-item sales data split into train and test csv files - competition objective: forecast 3 months of sales for 50 different items at 10 different stores using the data |
Training Data Overview |
---|
df_train = pd.read_csv("./sarimax_prophet_repo/data/raw/store_item_demand_forecasting/train.csv")
reduce_memory_usage(df_train)
df_overview(df_train, 'Walmart Training Dataset')
Mem. usage decreased to 10.45 Mb (62.5% reduction)
date | store | item | sales | |
---|---|---|---|---|
datatype | object | int8 | int8 | int16 |
missing values | 0 | 0 | 0 | 0 |
count | nan | 913,000.00 | 913,000.00 | 913,000.00 |
mean | nan | 5.50 | 25.50 | 52.25 |
std | nan | 2.87 | 14.43 | 28.80 |
min | nan | 1.00 | 1.00 | 0.00 |
25% | nan | 3.00 | 13.00 | 30.00 |
50% | nan | 5.50 | 25.50 | 47.00 |
75% | nan | 8.00 | 38.00 | 70.00 |
max | nan | 10.00 | 50.00 | 231.00 |
total rows | 913,000 |
---|---|
total columns | 4 |
column names | date, store, item, sales |
index start | 0 |
index end | 912999 |
total missing values | 0 |
Walmart Training Dataset Head and Tail |
date | store | item | sales | |
---|---|---|---|---|
0 | 2013-01-01 | 1 | 1 | 13 |
1 | 2013-01-02 | 1 | 1 | 11 |
2 | 2013-01-03 | 1 | 1 | 14 |
date | store | item | sales | |
---|---|---|---|---|
912997 | 2017-12-29 | 10 | 50 | 74 |
912998 | 2017-12-30 | 10 | 50 | 62 |
912999 | 2017-12-31 | 10 | 50 | 82 |
Testing Data Overview |
---|
df_test = pd.read_csv("./sarimax_prophet_repo/data/raw/store_item_demand_forecasting/test.csv")
reduce_memory_usage(df_test)
df_overview(df_test, 'Walmart Testing Dataset')
Mem. usage decreased to 0.60 Mb (56.2% reduction)
id | date | store | item | |
---|---|---|---|---|
datatype | int32 | object | int8 | int8 |
missing values | 0 | 0 | 0 | 0 |
count | 45,000.00 | nan | 45,000.00 | 45,000.00 |
mean | 22,499.50 | nan | 5.50 | 25.50 |
std | 12,990.53 | nan | 2.87 | 14.43 |
min | 0.00 | nan | 1.00 | 1.00 |
25% | 11,249.75 | nan | 3.00 | 13.00 |
50% | 22,499.50 | nan | 5.50 | 25.50 |
75% | 33,749.25 | nan | 8.00 | 38.00 |
max | 44,999.00 | nan | 10.00 | 50.00 |
total rows | 45,000 |
---|---|
total columns | 4 |
column names | id, date, store, item |
index start | 0 |
index end | 44999 |
total missing values | 0 |
Walmart Testing Dataset Head and Tail |
id | date | store | item | |
---|---|---|---|---|
0 | 0 | 2018-01-01 | 1 | 1 |
1 | 1 | 2018-01-02 | 1 | 1 |
2 | 2 | 2018-01-03 | 1 | 1 |
id | date | store | item | |
---|---|---|---|---|
44997 | 44,997 | 2018-03-29 | 10 | 50 |
44998 | 44,998 | 2018-03-30 | 10 | 50 |
44999 | 44,999 | 2018-03-31 | 10 | 50 |
Unique Items by Store |
---|
header_text('Unique Items by Store: Training and Testing')
multi([(df_train.groupby('store').nunique()['item'],
'Training Data'),
(df_test.groupby('store').nunique()['item'],
'Testing Data')])
Unique Items by Store: Training and Testing |
item | |
---|---|
store | |
1 | 50 |
2 | 50 |
3 | 50 |
4 | 50 |
5 | 50 |
6 | 50 |
7 | 50 |
8 | 50 |
9 | 50 |
10 | 50 |
item | |
---|---|
store | |
1 | 50 |
2 | 50 |
3 | 50 |
4 | 50 |
5 | 50 |
6 | 50 |
7 | 50 |
8 | 50 |
9 | 50 |
10 | 50 |
# both datasets have 10 stores each with 50 unique items. Target column is 'sales' |
---|
# date columns contain string data and must be converted to datetime |
---|
df_train['date'] = pd.to_datetime(df_train['date'])
df_test['date'] = pd.to_datetime(df_test['date'])
train_range = str(df_train['date'].min().date()) + ' to ' + str(df_train['date'].max().date())
test_range = str(df_test['date'].min().date()) + ' to ' + str(df_test['date'].max().date())
pretty(train_range, "Period covered in train dataset:")
pretty(test_range, "Period covered in test dataset:")
Period covered in train dataset: |
2013-01-01 to 2017-12-31 |
Period covered in test dataset: |
2018-01-01 to 2018-03-31 |
# goal - predict sales of items in all stores from 01-01-2018 to 03-31-2018, i.e., 3 months. |
---|
pretty(len(df_train.groupby(["store"]).groups.keys()), 'number of stores')
pretty(len(df_train.groupby(["item"]).groups.keys()), 'number of items')
pretty(len(df_train.groupby(["store", "item"]).groups.keys()), 'number of time series')
number of stores |
10 |
number of items |
50 |
number of time series |
500 |
# sales amounts by store |
---|
see(df_train[['store','sales']].groupby('store')\
.sum()\
.sort_values('sales', ascending=False),
'Sales Amounts by Store')
Sales Amounts by Store |
sales | |
---|---|
store | |
2 | 6120128 |
8 | 5856169 |
3 | 5435144 |
10 | 5360158 |
9 | 5025976 |
4 | 5012639 |
1 | 4315603 |
5 | 3631016 |
6 | 3627670 |
7 | 3320009 |
plt.figure()
plt.title('Sales by Store');
sns.barplot(data=df_train,x='store',y='sales');
# items sold by the two stores with highest sales |
---|
df_store_2 = df_train[df_train.store == 2][['date','item','sales']]
reduce_memory_usage(df_store_2)
see(df_store_2.head(), 'store no. 2 sales')
Mem. usage decreased to 1.65 Mb (0.0% reduction)
store no. 2 sales |
date | item | sales | |
---|---|---|---|
1826 | 2013-01-01 | 1 | 12 |
1827 | 2013-01-02 | 1 | 16 |
1828 | 2013-01-03 | 1 | 16 |
1829 | 2013-01-04 | 1 | 20 |
1830 | 2013-01-05 | 1 | 16 |
see(df_store_2[['item','sales']].groupby('item')\
.sum()\
.sort_values('sales', ascending=False)[:5],
'store no. 2: highest selling items')
store no. 2: highest selling items |
sales | |
---|---|
item | |
28 | 205677 |
15 | 205569 |
18 | 197422 |
13 | 197031 |
25 | 188856 |
plt.figure()
plt.title("Store No. 2: Sales per Item")
sns.barplot(data=df_store_2[['item','sales']].groupby('item').sum().reset_index(),x='item',y='sales');
plt.xticks(size = 10);
# datetime feature engineering |
---|
df_store_2['year'] = df_store_2['date'].dt.year
df_store_2['month'] = df_store_2['date'].dt.month
df_store_2['day'] = df_store_2['date'].dt.dayofyear
df_store_2['weekday'] = df_store_2['date'].dt.weekday
df_store_2['year-month'] = df_store_2['date'].apply(lambda x: str(x.year)+'-'+str(x.month))
see(df_store_2.sample(5), 'Store No. 2: added datetime features')
Store No. 2: added datetime features |
date | item | sales | year | month | day | weekday | year-month | |
---|---|---|---|---|---|---|---|---|
878884 | 2014-08-02 | 49 | 49 | 2014 | 8 | 214 | 5 | 2014-8 |
203315 | 2014-09-22 | 12 | 90 | 2014 | 9 | 265 | 0 | 2014-9 |
824605 | 2015-12-16 | 46 | 55 | 2015 | 12 | 350 | 2 | 2015-12 |
257504 | 2013-02-08 | 15 | 67 | 2013 | 2 | 39 | 4 | 2013-2 |
112175 | 2015-03-01 | 7 | 87 | 2015 | 3 | 60 | 6 | 2015-3 |
# boxplot shows the level of outliers for each weekday |
---|
plt.title('Store No. 2: Sales by Weekday')
sns.boxplot(x="weekday", y="sales", data=df_store_2);
plt.xticks(ticks = [0, 1, 2, 3, 4, 5, 6],
labels = ['Mon', 'Tue', 'Wed', 'Thu', 'Fri', 'Sat', 'Sun']);
# highest sales are in June and July |
---|
plt.title('Store No. 2: Sales by Month')
sns.boxplot(x="month", y="sales", data=df_store_2);
plt.xticks(ticks = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11],
labels = ['Jan', 'Feb', 'Mar', 'Apr', 'May', 'Jun',
'Jul', 'Aug', 'Sep', 'Oct', 'Nov', 'Dec']);
# observing seasonality across the years |
---|
plt.figure()
plt.xticks(rotation=45)
plt.title('Sales per Year-Month')
sns.boxplot(x="year-month", y="sales", data=df_store_2);
plt.xticks(size = 8);
# seasonality repeats year after year # volume differs year to year |
---|
sns.lineplot(data=df_store_2,
x='month',
y='sales',
hue='year',
palette='cool')
plt.title('Seasonality by Year')
plt.legend(loc=2);
# higher sales on weekends # July is the month with the most sales. |
---|
sns.lineplot(data=df_store_2,
x='weekday',
y='sales',
hue='month',
legend='full',
palette='rainbow')
plt.title('Weekday Seasonality by Month')
plt.legend(bbox_to_anchor=(1.01, 1), loc=2);
plt.xticks(ticks = [0, 1, 2, 3, 4, 5, 6],
labels = ['Mon', 'Tue', 'Wed', 'Thu', 'Fri', 'Sat', 'Sun']);
sns.lineplot(data=df_store_2,
x='weekday',
y='sales',
hue='year',
palette='cool')
plt.title('Weekday Seasonality by Year')
plt.legend(loc=2);
plt.xticks(ticks = [0, 1, 2, 3, 4, 5, 6],
labels = ['Mon', 'Tue', 'Wed', 'Thu', 'Fri', 'Sat', 'Sun']);
The Box-Jenkins Method | ||
---|---|---|
- Helps in choosing parameters that will lead to a good model. (Source) The original model uses an iterative three-stage modeling approach: 1. Model identification and model selection: making sure that the variables are stationary, identifying seasonality in the dependent series (seasonally differencing it if necessary), and using plots of the autocorrelation (ACF) and partial autocorrelation (PACF) functions of the dependent time series to decide which (if any) autoregressive or moving average component should be used in the model. 2. Parameter estimation using computation algorithms to arrive at coefficients that best fit the selected ARIMA model. The most common methods use maximum likelihood estimation or non-linear least-squares estimation.
3. Statistical model checking by testing whether the estimated model conforms to the specifications of a stationary univariate process. In particular, the residuals should be independent of each other and constant in mean and variance over time. (Plotting the mean and variance of residuals over time and performing a Ljung–Box test or plotting autocorrelation and partial autocorrelation of the residuals are helpful to identify misspecification.) If the estimation is inadequate, we have to return to step one and attempt to build a better model. |
||
- the Walmart dataset contains 500 time series, which are paired stores and items sold (10 stores, 50 items) - each of the time series will need a forecast model applied to it in order to forecast sales for all stores |
Box-Jenkins Step One: Identify the Model | ||
---|---|---|
- Identifying the characteristics of a time series in order to choose the appropriate model

Questions to ask:
1. Is the time series stationary?
2. If it is not stationary, which transformation is best to make it stationary?
3. Is the time series seasonal?
4. If seasonal, what is the periodicity of its seasonality?
5. Which orders (p and q) should be used?
||
- 500 time series, each being a store-item pair - a forecast model must be applied to all 500 time series pairs - the following is the method applied to one time series, the pair: store no. 2 and item 1 |
store2_item1 = df_store_2[['date','sales']]\
[df_store_2.item == 1]\
.reset_index(drop=True)\
.set_index('date')
head_tail_horz(store2_item1, 5, 'Store No. 2 - Item No. 1')
Store No. 2 - Item No. 1 |
sales | |
---|---|
date | |
2013-01-01 | 12 |
2013-01-02 | 16 |
2013-01-03 | 16 |
2013-01-04 | 20 |
2013-01-05 | 16 |
sales | |
---|---|
date | |
2017-12-27 | 19 |
2017-12-28 | 21 |
2017-12-29 | 18 |
2017-12-30 | 24 |
2017-12-31 | 31 |
store2_item1.plot(grid=True, color = 'cyan',
title = 'Store No. 2 - Item No. 1');
# period = 365 - yearly cycle |
---|
header_text('Store No. 2 - Item No. 1')
decomposition = sm.tsa.seasonal_decompose(store2_item1,
model = 'additive',
period=365);
decomposition.plot();
plt.tight_layout();
Store No. 2 - Item No. 1 |
Observations: | ||
---|---|---|
- the upward trend in sales volume means the series is non-stationary - the seasonal component does not increase greatly over time (i.e. it is not multiplicative), so the model is additive - the seasonal component shows lower sales at the beginning of the year and higher sales in the middle of summer - differencing will be applied to make the series stationary - use adf_kpss to find out how many times to apply differencing |
# applying stationarity tests |
---|
adf_kpss(store2_item1, 3)
d | adf_stats | p-value | is_adf_stationary | is_kpss_stationary | is_stationary | |
---|---|---|---|---|---|---|
0 | 0 | 3.804258e-02 | 0.01 | True | False | False |
1 | 1 | 8.084986e-22 | 0.10 | True | True | True |
2 | 2 | 1.151984e-27 | 0.10 | True | True | True |
# time series is stationary with d = 1 |
---|
Plotting ACF and PACF: | ||
---|---|---|
- autocorrelation and partial autocorrelation plots give clues for the best ARIMA parameter values
- they also show whether differencing is needed or whether too much has already been applied

Autocorrelation Function (ACF)
- the plot of the autocorrelation of a time series by lag, also known as a correlogram
- includes direct and indirect dependence information
- describes how well a present value of a time series is related to past values
- bars represent ACF values at increasing lags
- the shaded area represents the confidence interval, by default 95%
- if bars lie within the shaded region, they are statistically insignificant

Partial Autocorrelation Function (PACF)
- describes only the relationship between an observation and its lag
- finds the correlation of the residuals rather than correlations of present values with lagged values

Autocorrelation and Partial Autocorrelation - Identifying ARIMA parameters
- the time series should be made stationary before creating these plots
- if ACF values start high and trail off very slowly, it is a sign of non-stationarity and the need for differencing
- if the autocorrelation at lag-1 is very negative, it is a sign of too much differencing
Plotting correlation and autocorrelation |
---|
# help(plot_acf)
# help(plot_pacf)
from statsmodels.graphics.tsaplots import plot_acf, plot_pacf
fig, (ax1, ax2) = plt.subplots(2,1, figsize=(12,8))
plot_acf(store2_item1,lags=14, zero=False, ax=ax1)
plot_pacf(store2_item1,lags=14, zero=False, ax=ax2)
ax1.set_ylim(-0.2, 1); ax2.set_ylim(-0.2, 0.8)
plt.tight_layout()
- the ACF shows periodic correlation patterns - find the periodicity by locating the first peak at a lag greater than 1 - the peak above is at lag 7, so the seasonal component repeats every 7 steps, i.e. weekly |
---|
# Create figure
fig, (ax1, ax2) = plt.subplots(2,1, figsize=(12,8))
plot_acf(store2_item1.diff().dropna(),lags=14, zero=False, ax=ax1)
plot_pacf(store2_item1.diff().dropna(),lags=14, zero=False, ax=ax2)
ax1.set_ylim(-0.5, 0.8); ax2.set_ylim(-0.8, 0.6)
plt.tight_layout()
# no clear trailing off in either plot - observed seasonal behavior by weeks - suggests the use of a model with seasonal parameters, i.e. SARIMA |
---|
Box-Jenkins Step Two: Estimate Coefficients (p, q) | ||
---|---|---|
- although SARIMA is the better choice, applying ARIMA first for comparison (i.e. SARIMAX with no seasonality settings? Not sure, because she uses SARIMAX first as well.) - this will show some of the advantages of choosing the appropriate model |
||
- to choose proper parameters, there is some trial and error
- the ARIMA model will be fit with different values, choosing the best values based on metrics like AIC and BIC

AIC - Akaike Information Criterion
- tells how good a model is; a lower value means a better model
- penalizes models with many parameters, i.e. if the order is too high compared to the data, there will be a high score
- this indicates where work should be done to avoid overfitting to the training data

BIC - Bayesian Information Criterion
- similar to AIC in that a lower value is better
- penalizes additional model orders more than AIC does, so BIC will sometimes suggest a simpler model
- these statistics can be obtained after fitting a model
- there is usually some agreement between the two metrics
- if there is no agreement, it is best to choose the model with the smaller AIC for a predictive model
SARIMAX() - model and attributes information |
---|
# help(SARIMAX)
# help(SARIMAX.fit)
# help(results)
store2_item1.index = pd.DatetimeIndex(store2_item1.index.values,
freq=store2_item1.index.inferred_freq)
from statsmodels.tsa.statespace.sarimax import SARIMAX
header_text('ARIMA Model Fitting')
model = SARIMAX(store2_item1, order=(1,1,1), freq='D')
results = model.fit()
ARIMA Model Fitting |
RUNNING THE L-BFGS-B CODE * * * Machine precision = 2.220D-16 N = 3 M = 10 At X0 0 variables are exactly at the bounds At iterate 0 f= 3.37742D+00 |proj g|= 1.21670D-01 At iterate 5 f= 3.30842D+00 |proj g|= 6.96031D-03 At iterate 10 f= 3.30768D+00 |proj g|= 8.50151D-06 * * * Tit = total number of iterations Tnf = total number of function evaluations Tnint = total number of segments explored during Cauchy searches Skip = number of BFGS updates skipped Nact = number of active bounds at final generalized Cauchy point Projg = norm of the final projected gradient F = final function value * * * N Tit Tnf Tnint Skip Nact Projg F 3 10 12 1 0 0 8.502D-06 3.308D+00 F = 3.3076756960161755 CONVERGENCE: NORM_OF_PROJECTED_GRADIENT_<=_PGTOL
This problem is unconstrained.
results = results.summary()
# Create empty list to store search results
order_aic_bic = []
# Loop over p values from 0-6
for p in range(7):
    # Loop over q values from 0-6
    for q in range(7):
        # create and fit ARMA(p,q) model
        # because the adf test showed that d=1
        model = SARIMAX(store2_item1, order=(p, 1, q), freq="D")
        results = model.fit()
        # Append order and results tuple
        order_aic_bic.append((p, q, results.aic, results.bic))
# Construct DataFrame from order_aic_bic
order_df = pd.DataFrame(order_aic_bic,
columns=['p','q','AIC','BIC'])
order_AIC = order_df.sort_values('AIC')
order_BIC = order_df.sort_values('BIC')
multi([(order_AIC.head(5), 'AIC Best Results'),
(order_BIC.head(5), 'BIC Best Results')])
p | q | AIC | BIC | |
---|---|---|---|---|
33 | 4 | 5 | 11,664.36 | 11,719.46 |
40 | 5 | 5 | 11,673.12 | 11,733.72 |
34 | 4 | 6 | 11,682.70 | 11,743.30 |
48 | 6 | 6 | 11,708.12 | 11,779.74 |
26 | 3 | 5 | 11,717.15 | 11,766.73 |
p | q | AIC | BIC | |
---|---|---|---|---|
33 | 4 | 5 | 11,664.36 | 11,719.46 |
40 | 5 | 5 | 11,673.12 | 11,733.72 |
34 | 4 | 6 | 11,682.70 | 11,743.30 |
24 | 3 | 3 | 11,727.62 | 11,766.19 |
26 | 3 | 5 | 11,717.15 | 11,766.73 |
# AIC and BIC agree that p = 4 and q = 5 are the best parameter values |
---|
Box-Jenkins Step Three: Model Evaluation | ||
---|---|---|
- evaluating the accuracy of the model before choosing it as the best - focusing on residuals for evaluation - residuals are the difference between the model's one-step-ahead predictions and the real values of the time series |
||
Mean Absolute Error (MAE) - calculating the MAE of the residuals - this will show on average how far off the predictions are from the true values |
header_text('ARIMA Model Fitting')
model = SARIMAX(store2_item1, order=(4,1,5))
results = model.fit()
ARIMA Model Fitting |
This problem is unconstrained.
RUNNING THE L-BFGS-B CODE * * * Machine precision = 2.220D-16 N = 10 M = 10 At X0 0 variables are exactly at the bounds At iterate 0 f= 3.25847D+00 |proj g|= 5.01528D-01 At iterate 5 f= 3.24105D+00 |proj g|= 7.56448D-02 At iterate 10 f= 3.22557D+00 |proj g|= 6.79302D-01 At iterate 15 f= 3.21356D+00 |proj g|= 5.40317D-02 At iterate 20 f= 3.20500D+00 |proj g|= 3.40172D-01 At iterate 25 f= 3.20100D+00 |proj g|= 2.07851D-01 At iterate 30 f= 3.19762D+00 |proj g|= 1.21599D-01 At iterate 35 f= 3.19641D+00 |proj g|= 2.73309D-01 At iterate 40 f= 3.19246D+00 |proj g|= 2.33485D-01 At iterate 45 f= 3.18906D+00 |proj g|= 3.29806D-01 At iterate 50 f= 3.18849D+00 |proj g|= 1.59965D-01 * * * Tit = total number of iterations Tnf = total number of function evaluations Tnint = total number of segments explored during Cauchy searches Skip = number of BFGS updates skipped Nact = number of active bounds at final generalized Cauchy point Projg = norm of the final projected gradient F = final function value * * * N Tit Tnf Tnint Skip Nact Projg F 10 50 60 1 0 0 1.600D-01 3.188D+00 F = 3.1884893223892212 STOP: TOTAL NO. of ITERATIONS REACHED LIMIT
mae = np.mean(np.abs(results.resid))
pretty(f'{mae: .3f}%', 'Mean Absolute Error')
Mean Absolute Error |
4.730% |
df_overview(store2_item1, 'store2_item1')
sales | |
---|---|
datatype | int16 |
missing values | 0 |
count | 1,826.00 |
mean | 28.17 |
std | 8.68 |
min | 6.00 |
25% | 22.00 |
50% | 28.00 |
75% | 34.00 |
max | 58.00 |
total rows | 1,826 |
---|---|
total columns | 1 |
column names | sales |
index start | 2013-01-01 00:00:00 |
index end | 2017-12-31 00:00:00 |
total missing values | 0 |
store2_item1 Head and Tail |
sales | |
---|---|
2013-01-01 | 12 |
2013-01-02 | 16 |
2013-01-03 | 16 |
sales | |
---|---|
2017-12-29 | 18 |
2017-12-30 | 24 |
2017-12-31 | 31 |
- the MAE is around 5 sales per day - the average daily sales for item 1 in store 2 are 28 - for an ideal model, the residuals should be uncorrelated, white Gaussian noise centered on zero - in the following section, this will be evaluated as well |
---|
Diagnostic Summary Statistics | ||
---|---|---|
- Analyzing the residual test statistics from the results summary - Evaluating Prob(Q) by applying the Ljung-Box Test with the Null Hypothesis: there are no correlations in the residuals - Evaluating Prob(JB) by applying the Jarque-Bera Test with the Null Hypothesis: residuals are normally distributed |
- Prob(Q) = 0.65, which is greater than 0.05 - this means do not reject the null hypothesis that the residuals are uncorrelated - therefore the residuals are not correlated - Prob(JB) = 0.02, which is less than 0.05 - this means reject the null hypothesis that the residuals are normally distributed - i.e. the residuals are not normally distributed |
---|
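The same two statistics can also be computed directly from the model residuals; a minimal sketch of my own (the choice of 10 lags is arbitrary):
from statsmodels.stats.diagnostic import acorr_ljungbox
from scipy.stats import jarque_bera

resid = results.resid[1:]  # drop the first residual, which absorbs the initial value

# Ljung-Box: null hypothesis = no autocorrelation in the residuals
print(acorr_ljungbox(resid, lags=[10]))

# Jarque-Bera: null hypothesis = residuals are normally distributed
jb_stat, jb_pvalue = jarque_bera(resid)
print(f'Jarque-Bera p-value: {jb_pvalue:.3f}')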
Plot Diagnostics | ||
---|---|---|
- the following are four diagnostic plots using plot_diagnostics() that help in deciding whether a model is a good fit |
results.plot_diagnostics(figsize = (12, 8))
plt.tight_layout()
Conclusions: | ||
---|---|---|
- an ideal model will have residuals resembling uncorrelated white Gaussian noise centered on zero - this concept will be evaluated using the plots above |
||
Standardized Residual:
- there are no obvious patterns in the residuals - this suggests a good model

Histogram and KDE Estimate:
- shows the measured distribution of the residuals
- the green line shows the KDE curve, a smoothed version of the histogram
- the N(0,1) line shows a standard normal distribution
- for a good model, the KDE curve will be similar to the N(0,1) line

Correlogram:
- 95% of correlations for lags greater than one should not be significant (inside the blue area) - indicates a good model

Normal Q-Q:
- most of the data points occur on the line - this indicates a normal distribution of the residuals
||
The conclusion is that these metrics all point to this being a good model. | ||
- if the residuals are not normally distributed, increasing d can help fix this - if the residuals are correlated, increasing p and/or q can help |
Accounting for Seasonality | ||
---|---|---|
- for this, P, D, Q, and S should also be set
- earlier the ACF plot determined the seasonal period was 7 days, or 1 week
- remember to make a time series stationary before creating the ACF plot
- seasonal data might require seasonal differencing
- seasonal differencing subtracts from each value the value one seasonal cycle earlier
- if the time series shows a trend, the normal difference is taken
- if there is a strong seasonal cycle, the seasonal difference will also be taken
- before, our d = 1, and as a general rule d plus D should not equal more than 2
- first find the two orders of differencing, d and D, and make the series stationary
- the non-seasonal orders p and q can still be found by plotting the ACF and PACF of the differenced time series
- to find the seasonal orders P and Q, it is necessary to plot the ACF and PACF of the differenced time series at multiples of the seasonal period
fig, (ax1, ax2) = plt.subplots(2,1)
plot_acf(store2_item1, lags=14, zero=False, ax=ax1)
plot_pacf(store2_item1, lags=14, zero=False, ax=ax2)
ax1.set_ylim(-0.25, 0.75); ax2.set_ylim(-0.25, 0.6)
plt.tight_layout()
# Take the first and seasonal differences (7 days, 1 week) and drop NaNs |
---|
store2_item1_diff = store2_item1.diff().diff(7).dropna()
fig, (ax1, ax2) = plt.subplots(2,1,figsize=(8,6))
plot_acf(store2_item1_diff, lags=14, zero=False, ax=ax1)
plot_pacf(store2_item1_diff, lags=14, zero=False, ax=ax2)
ax1.set_ylim(-0.75, 0.5); ax2.set_ylim(-0.75, 0.5);
plt.tight_layout()
- the non-seasonal ACF and PACF plots above show a moving average model pattern with q = 1 - for the seasonal parameters, the lags parameter takes a list of lags instead of a maximum - follow with plotting ACF and PACF for these specific lags only. |
---|
lags = [7, 14, 21, 28, 35]
fig, (ax1, ax2) = plt.subplots(2,1,figsize=(8,6))
plot_acf(store2_item1_diff, lags=lags, zero=False, ax=ax1)
plot_pacf(store2_item1_diff, lags=lags, zero=False, ax=ax2)
ax1.set_ylim(-0.75, 0.25); ax2.set_ylim(-0.75, 0.25);
plt.tight_layout()
- the seasonal ACF and PACF plots look like a moving average = 1 model, i.e., Q=1 - combining both of these results: SARIMA(order = (0,1,6), seasonal_order = (0,1,1,7)) |
---|
sarima_model = SARIMAX(store2_item1, order=(0,1,6), seasonal_order=(0,1,1,7))
header_text('SARIMAX Model Fitting')
sarima_results = sarima_model.fit()
SARIMAX Model Fitting |
RUNNING THE L-BFGS-B CODE * * * Machine precision = 2.220D-16 N = 8 M = 10 At X0 0 variables are exactly at the bounds At iterate 0 f= 3.33061D+00 |proj g|= 6.10152D-02
This problem is unconstrained.
At iterate 5 f= 3.15105D+00 |proj g|= 1.39278D-02 At iterate 10 f= 3.14702D+00 |proj g|= 5.69528D-03 At iterate 15 f= 3.14364D+00 |proj g|= 3.50404D-02 At iterate 20 f= 3.14073D+00 |proj g|= 3.02480D-03 At iterate 25 f= 3.14069D+00 |proj g|= 2.30562D-04 * * * Tit = total number of iterations Tnf = total number of function evaluations Tnint = total number of segments explored during Cauchy searches Skip = number of BFGS updates skipped Nact = number of active bounds at final generalized Cauchy point Projg = norm of the final projected gradient F = final function value * * * N Tit Tnf Tnint Skip Nact Projg F 8 29 36 1 0 0 2.485D-05 3.141D+00 F = 3.1406911338133612 CONVERGENCE: REL_REDUCTION_OF_F_<=_FACTR*EPSMCH
mae = np.mean(np.abs(sarima_results.resid))
pretty(f'{mae:.3f}%', 'Mean Absolute Error for SARIMAX:')
Mean Absolute Error for SARIMAX: |
4.555% |
header_text('sarima_results.summary()')
sarima_results.summary()
sarima_results.summary() |
Dep. Variable: | sales | No. Observations: | 1826 |
---|---|---|---|
Model: | SARIMAX(0, 1, 6)x(0, 1, [1], 7) | Log Likelihood | -5734.902 |
Date: | Tue, 07 Feb 2023 | AIC | 11485.804 |
Time: | 11:45:43 | BIC | 11529.848 |
Sample: | 01-01-2013 | HQIC | 11502.054 |
- 12-31-2017 | |||
Covariance Type: | opg |
coef | std err | z | P>|z| | [0.025 | 0.975] | |
---|---|---|---|---|---|---|
ma.L1 | -0.8688 | 0.023 | -37.673 | 0.000 | -0.914 | -0.824 |
ma.L2 | 0.0019 | 0.032 | 0.060 | 0.952 | -0.061 | 0.065 |
ma.L3 | -0.0070 | 0.033 | -0.216 | 0.829 | -0.071 | 0.057 |
ma.L4 | 0.0082 | 0.030 | 0.270 | 0.787 | -0.051 | 0.068 |
ma.L5 | -0.0022 | 0.030 | -0.072 | 0.942 | -0.062 | 0.057 |
ma.L6 | -0.0064 | 0.024 | -0.268 | 0.789 | -0.054 | 0.041 |
ma.S.L7 | -0.9915 | 0.007 | -150.225 | 0.000 | -1.004 | -0.979 |
sigma2 | 31.6282 | 1.096 | 28.847 | 0.000 | 29.479 | 33.777 |
Ljung-Box (L1) (Q): | 0.00 | Jarque-Bera (JB): | 4.15 |
---|---|---|---|
Prob(Q): | 0.97 | Prob(JB): | 0.13 |
Heteroskedasticity (H): | 1.31 | Skew: | 0.10 |
Prob(H) (two-sided): | 0.00 | Kurtosis: | 2.90 |
- further progress in choosing model parameters can be made from investigating the AIC and BIC as done above - however, there are many more parameters now that the seasonal ones are added - instead Automated Model Selection is a good alternative - Prob(Q) = 0.97, which is greater than 0.05 - do not reject the null hypothesis that residuals are uncorrelated - residuals are not correlated - Prob(JB) = 0.13, which is greater than 0.05 - do not reject the null hypothesis that the residuals are normally distributed - the residuals are normally distributed |
---|
sarima_results.plot_diagnostics()
plt.tight_layout()
Standardized Residual: - no obvious patterns in the residuals - suggests a good model Histogram and KDE Estimate: - the two curves line up - suggests a good model Correlogram: - 95% of correlations for lag greater than one should not be significant (inside the shaded area) - these results also suggest a good model Normal Q-Q: - most of the data points should lie on the line, aside from outliers - this indicates a normal distribution of the residuals |
---|
Automated Model Selection | ||
---|---|---|
- pmdarima allows for the automation of the model order search - using the information from the Box-Jenkins identification step to predefine some of the orders before fitting - this can speed up the process of choosing model orders, but needs to be done carefully - automation can make mistakes due to imperfect data, which can affect the test scores in unpredictable ways - the only required parameter in auto_arima is the data - however, using prior knowledge to specify other parameters can help find the best model |
||
- seasonal = True # the time series is seasonal
- m = 7 # the seasonal period - one week
- d = 1 # non-seasonal difference order
- D = 1 # seasonal difference order
- max_p = 6 # max value of p to test
- max_q = 6 # max value of q to test
- max_P = 6 # max value of P to test
- max_Q = 6 # max value of Q to test
- information_criterion = 'aic' # used to select the best model
- trace = True # prints the information_criterion for each model it fits
- error_action = 'ignore' # ignore orders that don't work
- stepwise = True # apply an intelligent order search
import pmdarima as pm
auto_arima_model = pm.auto_arima(store2_item1,
seasonal=True,
m=7,
d=1,
D=1,
max_p=6,
max_q=6,
max_P=6,
max_Q=6,
information_criterion='aic',
trace=True,
error_action='ignore',
stepwise=True,
suppress_warnings=True)
Performing stepwise search to minimize aic
 ARIMA(2,1,2)(1,1,1)[7]             : AIC=inf, Time=10.86 sec
 ARIMA(0,1,0)(0,1,0)[7]             : AIC=13739.527, Time=0.06 sec
 ARIMA(1,1,0)(1,1,0)[7]             : AIC=12650.572, Time=0.28 sec
 ARIMA(0,1,1)(0,1,1)[7]             : AIC=inf, Time=3.70 sec
 ARIMA(1,1,0)(0,1,0)[7]             : AIC=13229.902, Time=0.10 sec
 ARIMA(1,1,0)(2,1,0)[7]             : AIC=12480.179, Time=0.58 sec
 ARIMA(1,1,0)(3,1,0)[7]             : AIC=12344.238, Time=1.17 sec
 ARIMA(1,1,0)(4,1,0)[7]             : AIC=12270.400, Time=9.19 sec
 ARIMA(1,1,0)(5,1,0)[7]             : AIC=12224.430, Time=16.30 sec
 ARIMA(1,1,0)(6,1,0)[7]             : AIC=12196.670, Time=21.66 sec
 ARIMA(1,1,0)(6,1,1)[7]             : AIC=inf, Time=258.01 sec
 ARIMA(1,1,0)(5,1,1)[7]             : AIC=inf, Time=209.87 sec
 ARIMA(0,1,0)(6,1,0)[7]             : AIC=12702.277, Time=12.74 sec
 ARIMA(2,1,0)(6,1,0)[7]             : AIC=11998.566, Time=15.74 sec
 ARIMA(2,1,0)(5,1,0)[7]             : AIC=12027.447, Time=13.99 sec
 ARIMA(2,1,0)(6,1,1)[7]             : AIC=inf, Time=199.64 sec
 ARIMA(2,1,0)(5,1,1)[7]             : AIC=inf, Time=176.24 sec
 ARIMA(3,1,0)(6,1,0)[7]             : AIC=11888.959, Time=14.89 sec
 ARIMA(3,1,0)(5,1,0)[7]             : AIC=11915.280, Time=11.35 sec
 ARIMA(3,1,0)(6,1,1)[7]             : AIC=inf, Time=258.11 sec
 ARIMA(3,1,0)(5,1,1)[7]             : AIC=inf, Time=244.30 sec
 ARIMA(4,1,0)(6,1,0)[7]             : AIC=11829.853, Time=15.01 sec
 ARIMA(4,1,0)(5,1,0)[7]             : AIC=11855.759, Time=11.59 sec
 ARIMA(4,1,0)(6,1,1)[7]             : AIC=inf, Time=241.76 sec
 ARIMA(4,1,0)(5,1,1)[7]             : AIC=inf, Time=196.47 sec
 ARIMA(5,1,0)(6,1,0)[7]             : AIC=11787.764, Time=15.72 sec
 ARIMA(5,1,0)(5,1,0)[7]             : AIC=11813.941, Time=12.64 sec
 ARIMA(5,1,0)(6,1,1)[7]             : AIC=inf, Time=1564.05 sec
 ARIMA(5,1,0)(5,1,1)[7]             : AIC=inf, Time=239.66 sec
 ARIMA(6,1,0)(6,1,0)[7]             : AIC=11723.670, Time=22.98 sec
 ARIMA(6,1,0)(5,1,0)[7]             : AIC=11754.653, Time=19.05 sec
 ARIMA(6,1,0)(6,1,1)[7]             : AIC=inf, Time=684.40 sec
 ARIMA(6,1,0)(5,1,1)[7]             : AIC=inf, Time=3178.42 sec
 ARIMA(6,1,1)(6,1,0)[7]             : AIC=11680.726, Time=152.59 sec
 ARIMA(6,1,1)(5,1,0)[7]             : AIC=11700.932, Time=69.91 sec
 ARIMA(6,1,1)(6,1,1)[7]             : AIC=inf, Time=1472.88 sec
 ARIMA(6,1,1)(5,1,1)[7]             : AIC=inf, Time=3274.95 sec
 ARIMA(5,1,1)(6,1,0)[7]             : AIC=11681.051, Time=39.52 sec
 ARIMA(6,1,2)(6,1,0)[7]             : AIC=11681.423, Time=255.22 sec
 ARIMA(5,1,2)(6,1,0)[7]             : AIC=11683.639, Time=4220.90 sec
 ARIMA(6,1,1)(6,1,0)[7] intercept   : AIC=11682.576, Time=9994.42 sec

Best model:  ARIMA(6,1,1)(6,1,0)[7]
Total fit time: 27161.003 seconds
# Best model: ARIMA(6,1,1)(6,1,0)[7] |
---|
header_text('Fitting SARIMAX with optimized orders:')
sarima_optimized = SARIMAX(store2_item1, order=(6,1,1), seasonal_order=(6,1,0,7))
sarima_optimized_results = sarima_optimized.fit()
Fitting SARIMAX with optimized orders: |
This problem is unconstrained.
RUNNING THE L-BFGS-B CODE * * * Machine precision = 2.220D-16 N = 14 M = 10 At X0 0 variables are exactly at the bounds At iterate 0 f= 3.31381D+00 |proj g|= 1.91453D-01 At iterate 5 f= 3.23070D+00 |proj g|= 3.97055D-02 At iterate 10 f= 3.20137D+00 |proj g|= 6.16757D-03 At iterate 15 f= 3.19869D+00 |proj g|= 3.66528D-02 At iterate 20 f= 3.19206D+00 |proj g|= 1.39816D-02 At iterate 25 f= 3.19079D+00 |proj g|= 8.52650D-04 At iterate 30 f= 3.19078D+00 |proj g|= 2.85992D-05 * * * Tit = total number of iterations Tnf = total number of function evaluations Tnint = total number of segments explored during Cauchy searches Skip = number of BFGS updates skipped Nact = number of active bounds at final generalized Cauchy point Projg = norm of the final projected gradient F = final function value * * * N Tit Tnf Tnint Skip Nact Projg F 14 31 35 1 0 0 7.264D-06 3.191D+00 F = 3.1907792689051648 CONVERGENCE: NORM_OF_PROJECTED_GRADIENT_<=_PGTOL
mae = np.mean(np.abs(sarima_optimized_results.resid))
pretty(f'{mae:.3f}%', 'Mean Absolute Error for SARIMAX optimized:')
print('MAE: %.3f' % mae)
Mean Absolute Error for SARIMAX optimized: |
4.788% |
MAE: 4.788
header_text('Optimized Sarimax Results')
sarima_optimized_results.summary()
Optimized Sarimax Results |
Dep. Variable: | sales | No. Observations: | 1826 |
---|---|---|---|
Model: | SARIMAX(6, 1, 1)x(6, 1, [], 7) | Log Likelihood | -5826.363 |
Date: | Tue, 07 Feb 2023 | AIC | 11680.726 |
Time: | 12:10:10 | BIC | 11757.803 |
Sample: | 01-01-2013 | HQIC | 11709.164 |
- 12-31-2017 | |||
Covariance Type: | opg |
coef | std err | z | P>|z| | [0.025 | 0.975] | |
---|---|---|---|---|---|---|
ar.L1 | 0.0671 | 0.026 | 2.595 | 0.009 | 0.016 | 0.118 |
ar.L2 | 0.0575 | 0.027 | 2.118 | 0.034 | 0.004 | 0.111 |
ar.L3 | 0.0467 | 0.026 | 1.800 | 0.072 | -0.004 | 0.098 |
ar.L4 | 0.0486 | 0.026 | 1.905 | 0.057 | -0.001 | 0.099 |
ar.L5 | 0.0363 | 0.026 | 1.390 | 0.164 | -0.015 | 0.087 |
ar.L6 | 0.0410 | 0.025 | 1.613 | 0.107 | -0.009 | 0.091 |
ma.L1 | -0.9464 | 0.014 | -69.687 | 0.000 | -0.973 | -0.920 |
ar.S.L7 | -0.8558 | 0.024 | -36.375 | 0.000 | -0.902 | -0.810 |
ar.S.L14 | -0.6815 | 0.032 | -21.632 | 0.000 | -0.743 | -0.620 |
ar.S.L21 | -0.5757 | 0.035 | -16.429 | 0.000 | -0.644 | -0.507 |
ar.S.L28 | -0.4127 | 0.033 | -12.481 | 0.000 | -0.478 | -0.348 |
ar.S.L35 | -0.2411 | 0.031 | -7.798 | 0.000 | -0.302 | -0.181 |
ar.S.L42 | -0.1133 | 0.024 | -4.677 | 0.000 | -0.161 | -0.066 |
sigma2 | 35.3666 | 1.202 | 29.416 | 0.000 | 33.010 | 37.723 |
Ljung-Box (L1) (Q): | 0.00 | Jarque-Bera (JB): | 0.04 |
---|---|---|---|
Prob(Q): | 1.00 | Prob(JB): | 0.98 |
Heteroskedasticity (H): | 1.31 | Skew: | 0.01 |
Prob(H) (two-sided): | 0.00 | Kurtosis: | 3.00 |
- Prob(Q) = 1.00, which is greater than 0.05 - do not reject the null hypothesis that residuals are uncorrelated - residuals are not correlated - Prob(JB) = 0.98, which is greater than 0.05 - do not reject the null hypothesis that the residuals are normally distributed - the residuals are normally distributed |
---|
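As a cross-check of the two p-values read from the summary above, the same tests can be run directly on the residuals - a minimal sketch, assuming statsmodels' acorr_ljungbox and scipy's jarque_bera are available in this environment (values may differ slightly from the summary table) |
---|
from statsmodels.stats.diagnostic import acorr_ljungbox
from scipy.stats import jarque_bera

resid = sarima_optimized_results.resid
lb = acorr_ljungbox(resid, lags=[1])       # Ljung-Box test of autocorrelation at lag 1
jb_stat, jb_pvalue = jarque_bera(resid)    # Jarque-Bera test of normality
# recent statsmodels versions return the Ljung-Box results as a DataFrame
print(f"Ljung-Box p-value (lag 1): {lb['lb_pvalue'].iloc[0]:.3f}")
print(f"Jarque-Bera p-value: {jb_pvalue:.3f}")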
sarima_optimized_results.plot_diagnostics()
plt.tight_layout()
Standardized Residual: - no obvious patterns in the residuals - suggests a good model Histogram and KDE Estimate: - shows the measured distribution of the residuals - KDE (green line) shows a smoothed version of the histogram - the two curves line up, but are slightly off Correlogram: - 95% of correlations for lag greater than one should not be significant (inside the shaded area) - these results also suggest a good model Normal Q-Q: - most of the data points should lie on the line - this indicates a normal distribution of the residuals |
---|
- comparing the diagnostics of the two SARIMAX models, the first model's KDE curve is closer to a normal distribution than the second's - the first model also achieved a lower in-sample residual MAE than the second - 4.55% vs 4.78% - conclusion: by these diagnostics, the first model is the more accurate |
---|
Forecasting: SARIMA vs ARIMA | ||
---|---|---|
- comparing the ARIMA model with the SARIMAX models - confirming that the seasonal (SARIMA) approach is the better one - using the last 90 days of data as the validation set |
||
Comparison Metrics: - Mean Absolute Error and Mean Absolute Percentage Error will be used to compare models - MAE & MAPE: evaluating forecasting models - MAE is clear and easy to comprehend - MAPE is unit-free and therefore useful for comparing forecast performance between datasets - the underlying formulas are sketched below |
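For reference, a minimal sketch of the two metrics - illustrative helper functions, not the sklearn implementations used below |
---|
import numpy as np

def mae_manual(y_true, y_pred):
    # MAE: average magnitude of the errors, expressed in the units of the data
    y_true, y_pred = np.asarray(y_true, dtype=float), np.asarray(y_pred, dtype=float)
    return np.mean(np.abs(y_true - y_pred))

def mape_manual(y_true, y_pred):
    # MAPE: average absolute error as a fraction of the actual value, so it is unit-free
    y_true, y_pred = np.asarray(y_true, dtype=float), np.asarray(y_pred, dtype=float)
    return np.mean(np.abs((y_true - y_pred) / y_true))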
Comparing Model Result Summaries |
---|
header_text('ARIMA Model Results')
results.summary()
ARIMA Model Results |
Dep. Variable: | sales | No. Observations: | 1826 |
---|---|---|---|
Model: | SARIMAX(4, 1, 5) | Log Likelihood | -5822.182 |
Date: | Tue, 07 Feb 2023 | AIC | 11664.363 |
Time: | 12:27:23 | BIC | 11719.456 |
Sample: | 01-01-2013 | HQIC | 11684.686 |
- 12-31-2017 | |||
Covariance Type: | opg |
coef | std err | z | P>|z| | [0.025 | 0.975] | |
---|---|---|---|---|---|---|
ar.L1 | -0.5046 | 0.023 | -21.911 | 0.000 | -0.550 | -0.459 |
ar.L2 | 0.2367 | 0.013 | 18.744 | 0.000 | 0.212 | 0.261 |
ar.L3 | -0.5679 | 0.012 | -46.904 | 0.000 | -0.592 | -0.544 |
ar.L4 | -0.9455 | 0.022 | -42.098 | 0.000 | -0.989 | -0.901 |
ma.L1 | -0.4094 | 0.035 | -11.544 | 0.000 | -0.479 | -0.340 |
ma.L2 | -0.6496 | 0.045 | -14.279 | 0.000 | -0.739 | -0.560 |
ma.L3 | 0.7708 | 0.032 | 23.949 | 0.000 | 0.708 | 0.834 |
ma.L4 | 0.3847 | 0.045 | 8.529 | 0.000 | 0.296 | 0.473 |
ma.L5 | -0.7822 | 0.032 | -24.115 | 0.000 | -0.846 | -0.719 |
sigma2 | 33.3829 | 1.136 | 29.388 | 0.000 | 31.157 | 35.609 |
Ljung-Box (L1) (Q): | 0.21 | Jarque-Bera (JB): | 8.31 |
---|---|---|---|
Prob(Q): | 0.65 | Prob(JB): | 0.02 |
Heteroskedasticity (H): | 1.33 | Skew: | 0.16 |
Prob(H) (two-sided): | 0.00 | Kurtosis: | 2.96 |
header_text('First SARIMAX Model Results')
sarima_results.summary()
First SARIMAX Model Results |
Dep. Variable: | sales | No. Observations: | 1826 |
---|---|---|---|
Model: | SARIMAX(0, 1, 6)x(0, 1, [1], 7) | Log Likelihood | -5734.902 |
Date: | Tue, 07 Feb 2023 | AIC | 11485.804 |
Time: | 12:27:50 | BIC | 11529.848 |
Sample: | 01-01-2013 | HQIC | 11502.054 |
- 12-31-2017 | |||
Covariance Type: | opg |
coef | std err | z | P>|z| | [0.025 | 0.975] | |
---|---|---|---|---|---|---|
ma.L1 | -0.8688 | 0.023 | -37.673 | 0.000 | -0.914 | -0.824 |
ma.L2 | 0.0019 | 0.032 | 0.060 | 0.952 | -0.061 | 0.065 |
ma.L3 | -0.0070 | 0.033 | -0.216 | 0.829 | -0.071 | 0.057 |
ma.L4 | 0.0082 | 0.030 | 0.270 | 0.787 | -0.051 | 0.068 |
ma.L5 | -0.0022 | 0.030 | -0.072 | 0.942 | -0.062 | 0.057 |
ma.L6 | -0.0064 | 0.024 | -0.268 | 0.789 | -0.054 | 0.041 |
ma.S.L7 | -0.9915 | 0.007 | -150.225 | 0.000 | -1.004 | -0.979 |
sigma2 | 31.6282 | 1.096 | 28.847 | 0.000 | 29.479 | 33.777 |
Ljung-Box (L1) (Q): | 0.00 | Jarque-Bera (JB): | 4.15 |
---|---|---|---|
Prob(Q): | 0.97 | Prob(JB): | 0.13 |
Heteroskedasticity (H): | 1.31 | Skew: | 0.10 |
Prob(H) (two-sided): | 0.00 | Kurtosis: | 2.90 |
header_text('Optimized SARIMAX Model Results')
sarima_optimized_results.summary()
Optimized SARIMAX Model Results |
Dep. Variable: | sales | No. Observations: | 1826 |
---|---|---|---|
Model: | SARIMAX(6, 1, 1)x(6, 1, [], 7) | Log Likelihood | -5826.363 |
Date: | Tue, 07 Feb 2023 | AIC | 11680.726 |
Time: | 12:28:05 | BIC | 11757.803 |
Sample: | 01-01-2013 | HQIC | 11709.164 |
- 12-31-2017 | |||
Covariance Type: | opg |
coef | std err | z | P>|z| | [0.025 | 0.975] | |
---|---|---|---|---|---|---|
ar.L1 | 0.0671 | 0.026 | 2.595 | 0.009 | 0.016 | 0.118 |
ar.L2 | 0.0575 | 0.027 | 2.118 | 0.034 | 0.004 | 0.111 |
ar.L3 | 0.0467 | 0.026 | 1.800 | 0.072 | -0.004 | 0.098 |
ar.L4 | 0.0486 | 0.026 | 1.905 | 0.057 | -0.001 | 0.099 |
ar.L5 | 0.0363 | 0.026 | 1.390 | 0.164 | -0.015 | 0.087 |
ar.L6 | 0.0410 | 0.025 | 1.613 | 0.107 | -0.009 | 0.091 |
ma.L1 | -0.9464 | 0.014 | -69.687 | 0.000 | -0.973 | -0.920 |
ar.S.L7 | -0.8558 | 0.024 | -36.375 | 0.000 | -0.902 | -0.810 |
ar.S.L14 | -0.6815 | 0.032 | -21.632 | 0.000 | -0.743 | -0.620 |
ar.S.L21 | -0.5757 | 0.035 | -16.429 | 0.000 | -0.644 | -0.507 |
ar.S.L28 | -0.4127 | 0.033 | -12.481 | 0.000 | -0.478 | -0.348 |
ar.S.L35 | -0.2411 | 0.031 | -7.798 | 0.000 | -0.302 | -0.181 |
ar.S.L42 | -0.1133 | 0.024 | -4.677 | 0.000 | -0.161 | -0.066 |
sigma2 | 35.3666 | 1.202 | 29.416 | 0.000 | 33.010 | 37.723 |
Ljung-Box (L1) (Q): | 0.00 | Jarque-Bera (JB): | 0.04 |
---|---|---|---|
Prob(Q): | 1.00 | Prob(JB): | 0.98 |
Heteroskedasticity (H): | 1.31 | Skew: | 0.01 |
Prob(H) (two-sided): | 0.00 | Kurtosis: | 3.00 |
Model Predictions and Metrics |
---|
from sklearn.metrics import mean_absolute_error as mae
from sklearn.metrics import mean_absolute_percentage_error as mape
# sklearn.metrics.mean_absolute_error() |
---|
# help(mean_absolute_error)
# sklearn.metrics.mean_absolute_percentage_error() |
---|
# help(mean_absolute_percentage_error)
# results.get_prediction() |
---|
# help(results.get_prediction)
# predictions.predicted_mean() |
---|
# help(arima_predictions.predicted_mean)
# creating the predictions and predicted mean for each model |
---|
arima_predictions = results.get_prediction(start=-90, dynamic=True)
arima_mean = arima_predictions.predicted_mean
sarima_og_predictions = sarima_results.get_prediction(start=-90, dynamic=True)
sarima_og_mean = sarima_og_predictions.predicted_mean
sarima_02_predictions = sarima_optimized_results.get_prediction(start=-90, dynamic=True)
sarima_02_mean = sarima_02_predictions.predicted_mean
# getting the MAE and MAPE for each model's predictions |
---|
arima_metrics = [round(mae(store2_item1[-90:], arima_mean),3),
round(mape(store2_item1[-90:], arima_mean),3)]
sarima_og_metrics = [round(mae(store2_item1[-90:], sarima_og_mean),3),
round(mape(store2_item1[-90:], sarima_og_mean),3)]
sarima_02_metrics = [round(mae(store2_item1[-90:], sarima_02_mean),3),
round(mape(store2_item1[-90:], sarima_02_mean),3)]
# combining model metric results into a dataframe |
---|
model_results = pd.DataFrame({'metrics':['MAE','MAPE'],
'ARIMA(4,1,5)': arima_metrics,
'SARIMA(0,1,6)(0,1,1)7': sarima_og_metrics,
'SARIMA(6,1,1)(6,1,0)7': sarima_02_metrics,
})
# model metric results: MAE and MAPE |
---|
model_results
metrics | ARIMA(4,1,5) | SARIMA(0,1,6)(0,1,1)7 | SARIMA(6,1,1)(6,1,0)7 | |
---|---|---|---|---|
0 | MAE | 6.906 | 6.910 | 6.285 |
1 | MAPE | 0.298 | 0.301 | 0.213 |
- based on these metrics, the second (optimized) SARIMAX model performed the best, with the lowest MAE and MAPE |
---|
model_results.to_csv('model_results.csv')
# plotting the predictions of all three models |
---|
dates = store2_item1.index
plt.figure(figsize=(15,10))
plt.title('Forecasting of All Models', size = 22)
plt.plot(arima_mean.index, arima_mean, label='ARIMA(4,1,5)')
plt.plot(sarima_og_mean.index, sarima_og_mean, label='SARIMA(0,1,6)(0,1,1)7')
plt.plot(sarima_02_mean.index, sarima_02_mean, label='SARIMAX(6,1,1)(6,1,0)7')
plt.plot(store2_item1[-90:], label='observed')
plt.legend(loc = 3);
# actual values vs ARIMA predictions |
---|
plt.figure(figsize=(15,5))
plt.title('Actual Values vs ARIMA(4,1,5)', size = 22)
plt.plot(store2_item1[-90:], label='observed');
plt.plot(arima_mean.index, arima_mean, label='ARIMA(4,1,5)');
plt.legend(loc = 3);
# actual values versus first SARIMAX predictions |
---|
plt.figure(figsize=(15,5))
plt.title('Actual Values vs SARIMA(0,1,6)(0,1,1)7', size = 22)
plt.plot(store2_item1[-90:], label='observed')
plt.plot(sarima_og_mean.index, sarima_og_mean, label='SARIMA(0,1,6)(0,1,1)7')
plt.legend(loc = 3);
# actual values versus second (optimized) SARIMAX predictions |
---|
plt.figure(figsize=(15,5))
plt.title('Actual Values vs SARIMA(6,1,1)(6,1,0)7 (automated selection)', size = 22)
plt.plot(store2_item1[-90:], label='observed')
plt.plot(sarima_02_mean.index, sarima_02_mean, label='SARIMA(6,1,1)(6,1,0)7');
plt.legend(loc = 1);
- of the three models, the optimized, automated selection SARIMAX model clearly follows the actual values more closely |
---|
Forecasting into the Future |
---|
# results.get_forecast() |
---|
# help(results.get_forecast)
# getting forecast predictions into the future with ARIMA and optimized SARIMAX |
---|
arima_predictions = results.get_forecast(steps=90)
arima_mean = arima_predictions.predicted_mean
sarima_02_pred = sarima_optimized_results.get_forecast(steps=90)
sarima_02_mean = sarima_02_pred.predicted_mean
dates = store2_item1.index
# Plot mean ARIMA and SARIMA predictions and observed
plt.title("Forecasting Comaprison - ARIMA vs SARIMA", size = 22)
plt.plot(store2_item1['2017':], label='actuals')
plt.plot(arima_mean.index, arima_mean, label='ARIMA(4,1,5)')
plt.plot(sarima_02_mean.index, sarima_02_mean, label='SARIMA(6,1,1)(6,1,0)7')
plt.legend(loc = 'lower center');
- the optimized SARIMAX model follows the trajectory of the actual values much more closely than the ARIMA model, which ignores the seasonal information |
---|
Saving the optimized SARIMAX model |
---|
import joblib
filename = "../model/store_2_item_28_model.pkl"
joblib.dump(sarima_optimized, 'optimized_sarimax_store2_item1.pkl')
['optimized_sarimax_store2_item1.pkl']
Loading the saved model |
---|
loaded_model = joblib.load('optimized_sarimax_store2_item1.pkl')
header_text('Loaded Model Summary')
loaded_model.fit().summary()
Loaded Model Summary |
RUNNING THE L-BFGS-B CODE * * * Machine precision = 2.220D-16 N = 14 M = 10 At X0 0 variables are exactly at the bounds At iterate 0 f= 3.19078D+00 |proj g|= 0.00000D+00 * * * Tit = total number of iterations Tnf = total number of function evaluations Tnint = total number of segments explored during Cauchy searches Skip = number of BFGS updates skipped Nact = number of active bounds at final generalized Cauchy point Projg = norm of the final projected gradient F = final function value * * * N Tit Tnf Tnint Skip Nact Projg F 14 0 1 0 0 0 0.000D+00 3.191D+00 F = 3.1907792689051648 CONVERGENCE: NORM_OF_PROJECTED_GRADIENT_<=_PGTOL
This problem is unconstrained.
Dep. Variable: | sales | No. Observations: | 1826 |
---|---|---|---|
Model: | SARIMAX(6, 1, 1)x(6, 1, [], 7) | Log Likelihood | -5826.363 |
Date: | Tue, 07 Feb 2023 | AIC | 11680.726 |
Time: | 13:16:09 | BIC | 11757.803 |
Sample: | 01-01-2013 | HQIC | 11709.164 |
- 12-31-2017 | |||
Covariance Type: | opg |
coef | std err | z | P>|z| | [0.025 | 0.975] | |
---|---|---|---|---|---|---|
ar.L1 | -0.2785 | 0.000 | -775.130 | 0.000 | -0.279 | -0.278 |
ar.L2 | -0.2101 | 0.000 | -440.977 | 0.000 | -0.211 | -0.209 |
ar.L3 | -0.1874 | 0.001 | -350.905 | 0.000 | -0.188 | -0.186 |
ar.L4 | -0.1287 | 0.001 | -165.544 | 0.000 | -0.130 | -0.127 |
ar.L5 | 0.0875 | 0.001 | 87.445 | 0.000 | 0.086 | 0.089 |
ar.L6 | 0.3150 | 0.000 | 991.377 | 0.000 | 0.314 | 0.316 |
ma.L1 | -0.4701 | 0.000 | -2208.458 | 0.000 | -0.471 | -0.470 |
ar.S.L7 | -0.8755 | 0.000 | -7659.522 | 0.000 | -0.876 | -0.875 |
ar.S.L14 | -0.6881 | 0.000 | -4731.725 | 0.000 | -0.688 | -0.688 |
ar.S.L21 | -0.5753 | 0.000 | -3307.260 | 0.000 | -0.576 | -0.575 |
ar.S.L28 | -0.4573 | 0.000 | -2090.210 | 0.000 | -0.458 | -0.457 |
ar.S.L35 | -0.2777 | 0.000 | -770.609 | 0.000 | -0.278 | -0.277 |
ar.S.L42 | -0.1210 | 0.001 | -146.261 | 0.000 | -0.123 | -0.119 |
sigma2 | 60.9096 | 1.64e-06 | 3.71e+07 | 0.000 | 60.910 | 60.910 |
Ljung-Box (L1) (Q): | 0.00 | Jarque-Bera (JB): | 0.04 |
---|---|---|---|
Prob(Q): | 1.00 | Prob(JB): | 0.98 |
Heteroskedasticity (H): | 1.31 | Skew: | 0.01 |
Prob(H) (two-sided): | 0.00 | Kurtosis: | 3.00 |
# store2_item1.to_csv('store2_item1.csv')
Section Conclusions | ||
---|---|---|
- the optimized model could be further improved by using a true SARIMAX model and adding exogenous data such as holidays, etc. - Python Holidays Library - a sketch of this idea follows below |
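A minimal sketch of that idea, assuming the third-party holidays package is installed - the holiday dummy and its use here are illustrative additions, not part of the original notebook |
---|
import holidays
import pandas as pd
from statsmodels.tsa.statespace.sarimax import SARIMAX

# build a 0/1 holiday indicator aligned with the sales index
us_holidays = holidays.US()
exog = pd.DataFrame({'is_holiday': [int(d in us_holidays) for d in store2_item1.index]},
                    index=store2_item1.index)

# pass it as an exogenous regressor -> the "X" in SARIMAX
sarimax_exog = SARIMAX(store2_item1, exog=exog,
                       order=(6, 1, 1), seasonal_order=(6, 1, 0, 7))
# sarimax_exog_results = sarimax_exog.fit()
# note: forecasting would also require future values of the exogenous variable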
Forecasting with Facebook Prophet Model | ||
---|---|---|
- Using Prophet on the Walmart Forecasting Dataset - Facebook Prophet on GitHub - Facebook / Meta Data Science Research - Prophet works with univariate (single-variable) time series data and is based on an additive model - it supports trends, seasonality, and holidays - it works best with data that has strong seasonal effects and several seasons' worth of historical data - Prophet is robust to missing data and shifts in the trend, and it handles outliers well - Prophet is easy to use - it automatically finds a good set of hyperparameters for the model |
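As a reference, Prophet's additive model decomposes the series into a trend, seasonal terms, a holiday term, and noise: $$y(t) = g(t) + s(t) + h(t) + \epsilon_t$$ where g(t) is the trend, s(t) the periodic seasonality, h(t) the effect of holidays, and the last term the residual error |
---|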
from prophet import Prophet
store_df = pd.read_csv('store2_item1.csv', parse_dates = ['Unnamed: 0'])
store_df.columns = ['date', 'sales']
store_df.date = pd.to_datetime(store_df.date)
store_df_dtindex = store_df.set_index('date')
df_overview(store_df_dtindex, 'Walmart Data')
sales | |
---|---|
datatype | int64 |
missing values | 0 |
count | 1,826.00 |
mean | 28.17 |
std | 8.68 |
min | 6.00 |
25% | 22.00 |
50% | 28.00 |
75% | 34.00 |
max | 58.00 |
total rows | 1,826 |
---|---|
total columns | 1 |
column names | sales |
index start | 2013-01-01 00:00:00 |
index end | 2017-12-31 00:00:00 |
total missing values | 0 |
Walmart Data Head and Tail |
sales | |
---|---|
date | |
2013-01-01 | 12 |
2013-01-02 | 16 |
2013-01-03 | 16 |
sales | |
---|---|
date | |
2017-12-29 | 18 |
2017-12-30 | 24 |
2017-12-31 | 31 |
# df.resample() |
---|
# help(store_df.resample)
store_df_dtindex.plot(title = 'Walmart Store No.2 Item No.1 Data',
color = 'cyan');
store_df_dtindex.resample('M')\
.sum()\
.plot(title = 'Walmart Data Resampled Monthly',
color = 'cyan');
store_df_dtindex.resample('A')\
.sum()\
.plot(title = 'Walmart Data Resampled Annually',
color = 'cyan');
decomposition = sm.tsa.seasonal_decompose(store_df_dtindex,
model = 'additive',
period = 365)
fig = decomposition.plot()
plt.suptitle('Decomposition of Store Data: period = 365 (1 year)')
plt.tight_layout()
decomposition = sm.tsa.seasonal_decompose(store_df_dtindex.resample('M').sum(),
model = 'additive',
period = 12)
fig = decomposition.plot()
plt.suptitle('Decomposition of Store Data: Resampled Monthly')
plt.tight_layout()
Conclusions: - there is a clear upward trend in the time series - the time series is not stationary (a quick stationarity check is sketched below) - the seasonal component is similar across time, not multiplicative - this points to the model being additive - seasonality in sales is higher in July and lower in January |
---|
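A minimal sketch of a stationarity check for the claim above, assuming statsmodels' adfuller is available in this environment |
---|
from statsmodels.tsa.stattools import adfuller

adf_stat, p_value, *_ = adfuller(store_df_dtindex['sales'])
print(f'ADF statistic: {adf_stat:.3f}, p-value: {p_value:.3f}')
# a p-value above 0.05 means the unit-root null cannot be rejected,
# i.e. the series should be treated as non-stationary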
Prophet Data Prep & Training | ||
---|---|---|
- Prophet requires as input a dataframe with two columns - ds will be the datetime column - y represents the metric to be forecast |
store_df.columns = ['ds', 'y']
head_tail_horz(store_df, 5, 'Dataframe Prepared for Prophet')
Dataframe Prepared for Prophet |
ds | y | |
---|---|---|
0 | 2013-01-01 | 12 |
1 | 2013-01-02 | 16 |
2 | 2013-01-03 | 16 |
3 | 2013-01-04 | 20 |
4 | 2013-01-05 | 16 |
ds | y | |
---|---|---|
1821 | 2017-12-27 | 19 |
1822 | 2017-12-28 | 21 |
1823 | 2017-12-29 | 18 |
1824 | 2017-12-30 | 24 |
1825 | 2017-12-31 | 31 |
# Prophet |
---|
# help(Prophet)
Model Training | ||
---|---|---|
- Prophet does not require hyperparameter specification - however, if the data is multiplicative, seasonality_mode must be set to 'multiplicative', since Prophet defaults to an additive model (a sketch follows below) - for this data, no extra parameters need to be set |
||
- Prophet Documentation | ||
- interval_width, which refers to the confidence level, is 0.8 by default - setting this parameter to 0.95 |
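A minimal sketch of how the multiplicative case mentioned above would be configured - not needed for this data, shown for illustration only |
---|
# illustrative only: for a series whose seasonal swings grow with the level of the series
prophet_multiplicative = Prophet(seasonality_mode='multiplicative', interval_width=0.95)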
prophet = Prophet(interval_width=0.95) #by default is 80%
prophet = prophet.fit(store_df)
14:49:09 - cmdstanpy - INFO - Chain [1] start processing 14:49:09 - cmdstanpy - INFO - Chain [1] done processing
Forecasting: | ||
---|---|---|
- make_future_dataframe() creates a dataframe in which future predictions will be stored - the prediction length is based on the periods parameter - periods will be set to 90 here, for 90 days - by default, it also includes the historical dates |
# Prophet.make_future_dataframe() |
---|
help(prophet.make_future_dataframe)
Help on method make_future_dataframe in module prophet.forecaster: make_future_dataframe(periods, freq='D', include_history=True) method of prophet.forecaster.Prophet instance Simulate the trend using the extrapolated generative model. Parameters ---------- periods: Int number of periods to forecast forward. freq: Any valid frequency for pd.date_range, such as 'D' or 'M'. include_history: Boolean to include the historical dates in the data frame for predictions. Returns ------- pd.Dataframe that extends forward from the end of self.history for the requested number of periods.
future = prophet.make_future_dataframe(periods = 90)
head_tail_horz(future, 5, 'Future Predictions DataFrame')
head_tail_horz(store_df, 5, 'Original DataFrame (90 days shorter)')
Future Predictions DataFrame |
ds | |
---|---|
0 | 2013-01-01 |
1 | 2013-01-02 |
2 | 2013-01-03 |
3 | 2013-01-04 |
4 | 2013-01-05 |
ds | |
---|---|
1911 | 2018-03-27 |
1912 | 2018-03-28 |
1913 | 2018-03-29 |
1914 | 2018-03-30 |
1915 | 2018-03-31 |
Original DataFrame (90 days shorter) |
ds | y | |
---|---|---|
0 | 2013-01-01 | 12 |
1 | 2013-01-02 | 16 |
2 | 2013-01-03 | 16 |
3 | 2013-01-04 | 20 |
4 | 2013-01-05 | 16 |
ds | y | |
---|---|---|
1821 | 2017-12-27 | 19 |
1822 | 2017-12-28 | 21 |
1823 | 2017-12-29 | 18 |
1824 | 2017-12-30 | 24 |
1825 | 2017-12-31 | 31 |
# make predictions by calling predict on the future dataframe |
---|
forecast = prophet.predict(future)
head_tail_vert(forecast, 5, 'Prophet Forecast Predictions')
Prophet Forecast Predictions: head(5) |
ds | trend | yhat_lower | yhat_upper | trend_lower | trend_upper | additive_terms | additive_terms_lower | additive_terms_upper | weekly | weekly_lower | weekly_upper | yearly | yearly_lower | yearly_upper | multiplicative_terms | multiplicative_terms_lower | multiplicative_terms_upper | yhat | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 2013-01-01 | 22.234184 | 1.622243 | 22.571609 | 22.234184 | 22.234184 | -9.684190 | -9.684190 | -9.684190 | -2.316626 | -2.316626 | -2.316626 | -7.367564 | -7.367564 | -7.367564 | 0.0 | 0.0 | 0.0 | 12.549994 |
1 | 2013-01-02 | 22.241942 | 2.140688 | 22.260295 | 22.241942 | 22.241942 | -10.080440 | -10.080440 | -10.080440 | -2.724676 | -2.724676 | -2.724676 | -7.355764 | -7.355764 | -7.355764 | 0.0 | 0.0 | 0.0 | 12.161502 |
2 | 2013-01-03 | 22.249700 | 3.583024 | 25.636708 | 22.249700 | 22.249700 | -7.527728 | -7.527728 | -7.527728 | -0.167199 | -0.167199 | -0.167199 | -7.360528 | -7.360528 | -7.360528 | 0.0 | 0.0 | 0.0 | 14.721973 |
3 | 2013-01-04 | 22.257458 | 6.660655 | 28.213595 | 22.257458 | 22.257458 | -5.071248 | -5.071248 | -5.071248 | 2.309826 | 2.309826 | 2.309826 | -7.381074 | -7.381074 | -7.381074 | 0.0 | 0.0 | 0.0 | 17.186210 |
4 | 2013-01-05 | 22.265216 | 8.750696 | 29.470322 | 22.265216 | 22.265216 | -3.660032 | -3.660032 | -3.660032 | 3.756210 | 3.756210 | 3.756210 | -7.416242 | -7.416242 | -7.416242 | 0.0 | 0.0 | 0.0 | 18.605184 |
Prophet Forecast Predictions: tail(5) |
ds | trend | yhat_lower | yhat_upper | trend_lower | trend_upper | additive_terms | additive_terms_lower | additive_terms_upper | weekly | weekly_lower | weekly_upper | yearly | yearly_lower | yearly_upper | multiplicative_terms | multiplicative_terms_lower | multiplicative_terms_upper | yhat | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
1911 | 2018-03-27 | 33.055049 | 18.865442 | 40.079129 | 33.031575 | 33.080619 | -3.285364 | -3.285364 | -3.285364 | -2.316626 | -2.316626 | -2.316626 | -0.968738 | -0.968738 | -0.968738 | 0.0 | 0.0 | 0.0 | 29.769684 |
1912 | 2018-03-28 | 33.059424 | 18.745015 | 39.585653 | 33.035271 | 33.085344 | -3.553184 | -3.553184 | -3.553184 | -2.724676 | -2.724676 | -2.724676 | -0.828508 | -0.828508 | -0.828508 | 0.0 | 0.0 | 0.0 | 29.506240 |
1913 | 2018-03-29 | 33.063799 | 21.300935 | 42.938685 | 33.038966 | 33.090405 | -0.852174 | -0.852174 | -0.852174 | -0.167199 | -0.167199 | -0.167199 | -0.684975 | -0.684975 | -0.684975 | 0.0 | 0.0 | 0.0 | 32.211625 |
1914 | 2018-03-30 | 33.068174 | 24.834109 | 45.752286 | 33.042662 | 33.095465 | 1.771207 | 1.771207 | 1.771207 | 2.309826 | 2.309826 | 2.309826 | -0.538618 | -0.538618 | -0.538618 | 0.0 | 0.0 | 0.0 | 34.839382 |
1915 | 2018-03-31 | 33.072550 | 26.562624 | 47.057891 | 33.046358 | 33.100391 | 3.366219 | 3.366219 | 3.366219 | 3.756210 | 3.756210 | 3.756210 | -0.389991 | -0.389991 | -0.389991 | 0.0 | 0.0 | 0.0 | 36.438768 |
- the forecast dataframe contains Prophet's predictions on sales - because it also includes the historical dates, Prophet provides an in-sample fit that can be used to evaluate the model - the forecast dataframe includes a yhat column with the point forecast - it also includes columns for the components and the uncertainty intervals |
---|
help(prophet.plot)
Help on method plot in module prophet.forecaster: plot(fcst, ax=None, uncertainty=True, plot_cap=True, xlabel='ds', ylabel='y', figsize=(10, 6), include_legend=False) method of prophet.forecaster.Prophet instance Plot the Prophet forecast. Parameters ---------- fcst: pd.DataFrame output of self.predict. ax: Optional matplotlib axes on which to plot. uncertainty: Optional boolean to plot uncertainty intervals. plot_cap: Optional boolean indicating if the capacity should be shown in the figure, if available. xlabel: Optional label name on X-axis ylabel: Optional label name on Y-axis figsize: Optional tuple width, height in inches. include_legend: Optional boolean to add legend to the plot. Returns ------- A matplotlib figure.
Plotting the Prophet Forecast | ||
---|---|---|
- the dark blue line is the forecast for sales, forecast['yhat'] - the black dots are the actual sales, store_df['y'] - the light blue shading is the 95% confidence interval around the forecast - the uncertainty interval is bounded by forecast['yhat_lower'] and forecast['yhat_upper'] |
header_text('Prophet Forecast Visualization')
plt.style.use('ggplot')
forecast_plot = prophet.plot(forecast)
plt.tick_params(color = 'black')
Prophet Forecast Visualization |
Trend Changepoints | ||
---|---|---|
- trend changepoints refer to the abrupt changes in trajectory that real-life time series data often contains - these are caused by things like new product launches, unforeseen problems, etc. - Prophet automatically detects changepoints and allows the trend to adapt appropriately - the growth rate varies, which makes the model more flexible - this can cause overfitting or underfitting, however - changepoint_prior_scale is the parameter that can be used to adjust trend flexibility and deal with over- or underfitting - a higher value fits a more flexible curve to the time series - by default, changepoints are only inferred for the first 80% of the data - changepoint_range can be used to change this behavior - changepoints can also be added manually using the changepoints argument - a sketch of these parameters follows after the changepoint plot below |
||
- Changepoint Documentation | ||
- Changepoints are represented by the dotted lines in the plot below |
header_text('Prophet Forecast with Changepoints')
from prophet.plot import add_changepoints_to_plot
plot = prophet.plot(forecast)
changepoints = add_changepoints_to_plot(plot.gca(), prophet, forecast)
Prophet Forecast with Changepoints |
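A minimal sketch of the changepoint parameters described above - the values are illustrative, not tuned for this data |
---|
flexible_prophet = Prophet(
    changepoint_prior_scale=0.5,    # higher -> more flexible trend (risk of overfitting)
    changepoint_range=0.9,          # infer changepoints over the first 90% of the history
    # changepoints=['2015-06-01'],  # alternatively, supply explicit changepoint dates
)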
# component plots |
---|
plot = prophet.plot_components(forecast)
Observations | ||
---|---|---|
- trend component - trends upwards - weekly seasonality component - shows more purchases on weekends and a drop in sales from Sunday to Monday - yearly seasonality component - confirms the patterns seen earlier: sales increase around July and decrease around January - the model can be further improved by accounting for holidays, etc. |
||
- Prophet: Seasonality, Holidays, and Special Events |
Prophet Model Evaluation | ||
---|---|---|
- the forecast dataframe also contains predictions made on the training data - it is possible to use this in-sample fit to evaluate the model |
prophet_data = pd.merge(store_df, forecast[['ds','yhat_lower','yhat_upper','yhat']],
on = 'ds')
prophet_data = prophet_data[['ds','yhat_lower','yhat_upper','yhat','y']]
df_overview(prophet_data, 'Prophet Forecast Data')
ds | yhat_lower | yhat_upper | yhat | y | |
---|---|---|---|---|---|
datatype | datetime64[ns] | float64 | float64 | float64 | int64 |
missing values | 0 | 0 | 0 | 0 | 0 |
count | nan | 1,826.00 | 1,826.00 | 1,826.00 | 1,826.00 |
mean | nan | 17.75 | 38.58 | 28.17 | 28.17 |
std | nan | 6.84 | 6.84 | 6.84 | 8.68 |
min | nan | -2.02 | 18.83 | 8.63 | 6.00 |
25% | nan | 13.10 | 33.83 | 23.53 | 22.00 |
50% | nan | 17.98 | 38.88 | 28.37 | 28.00 |
75% | nan | 22.72 | 43.42 | 33.12 | 34.00 |
max | nan | 35.39 | 56.51 | 46.14 | 58.00 |
total rows | 1,826 |
---|---|
total columns | 5 |
column names | ds, yhat_lower, yhat_upper, yhat, y |
index start | 0 |
index end | 1825 |
total missing values | 0 |
Prophet Forecast Data Head and Tail |
ds | yhat_lower | yhat_upper | yhat | y | |
---|---|---|---|---|---|
0 | 2013-01-01 | 1.62 | 22.57 | 12.55 | 12 |
1 | 2013-01-02 | 2.14 | 22.26 | 12.16 | 16 |
2 | 2013-01-03 | 3.58 | 25.64 | 14.72 | 16 |
ds | yhat_lower | yhat_upper | yhat | y | |
---|---|---|---|---|---|
1823 | 2017-12-29 | 17.29 | 37.47 | 27.46 | 18 |
1824 | 2017-12-30 | 18.39 | 39.24 | 28.97 | 24 |
1825 | 2017-12-31 | 19.76 | 40.56 | 29.80 | 31 |
# using MAE and MAPE to evaluate the model |
---|
actuals = prophet_data['y'].values
predictions = prophet_data['yhat'].values
prophet_mae = mae(actuals, predictions)
pretty(f'{prophet_mae:.3f}%', 'Prophet MAE Score:')
Prophet MAE Score: |
4.275% |
prophet_mape = mape(actuals, predictions)
pretty(f'{prophet_mape:.3f}%', 'Prophet MAPE Score:')
Prophet MAPE Score: |
0.169% |
plt.style.use('strawberries.mplstyle')
plt.plot(actuals, label='Actual Values')
plt.plot(predictions, label='Predicted Values')
plt.legend();
plt.title('Prophet Actual Values vs Predicted Values');
Prophet's Diagnostic Tools | ||
---|---|---|
- cross-validation and hyperparameter tuning are also available through Prophet | ||
- Prophet Diagnostic Tools Documentation | ||
Cross Validation - this compares predicted values with the actual values - forecast horizon (horizon) must be specified - the initial training period (initial) must also be specified - period refers to the spacing between cutoff dates |
%%capture
from prophet.diagnostics import cross_validation
cross_val = cross_validation(prophet, horizon = '90 days')
cross_val_df = cross_validation(prophet, initial='270 days',
period='45 days',
horizon = '90 days')
head_tail_vert(cross_val_df, 5, 'Prophet Cross Validation Data')
Prophet Cross Validation Data: head(5) |
ds | yhat | yhat_lower | yhat_upper | y | cutoff | |
---|---|---|---|---|---|---|
0 | 2013-10-24 | 25.245507 | 16.163705 | 35.008951 | 23 | 2013-10-23 |
1 | 2013-10-25 | 27.261290 | 18.080649 | 36.274408 | 22 | 2013-10-23 |
2 | 2013-10-26 | 28.151781 | 18.666897 | 36.423524 | 23 | 2013-10-23 |
3 | 2013-10-27 | 29.485806 | 20.902836 | 39.078639 | 18 | 2013-10-23 |
4 | 2013-10-28 | 21.198505 | 11.769138 | 30.222497 | 23 | 2013-10-23 |
Prophet Cross Validation Data: tail(5) |
ds | yhat | yhat_lower | yhat_upper | y | cutoff | |
---|---|---|---|---|---|---|
2965 | 2017-12-27 | 22.978292 | 12.523857 | 33.215580 | 19 | 2017-10-02 |
2966 | 2017-12-28 | 25.539708 | 15.842017 | 35.708661 | 21 | 2017-10-02 |
2967 | 2017-12-29 | 28.051326 | 17.459577 | 38.621720 | 18 | 2017-10-02 |
2968 | 2017-12-30 | 29.564439 | 19.699071 | 39.674125 | 24 | 2017-10-02 |
2969 | 2017-12-31 | 30.239662 | 19.039010 | 41.010030 | 31 | 2017-10-02 |
Performance Metrics |
---|
from prophet.diagnostics import performance_metrics
perf_metrics = performance_metrics(cross_val_df)
df_overview(perf_metrics, 'Prophet Performance Metrics')
horizon | mse | rmse | mae | mape | mdape | smape | coverage | |
---|---|---|---|---|---|---|---|---|
datatype | timedelta64[ns] | float64 | float64 | float64 | float64 | float64 | float64 | float64 |
missing values | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
count | 82 | 82.00 | 82.00 | 82.00 | 82.00 | 82.00 | 82.00 | 82.00 |
mean | 49 days 12:00:00 | 39.77 | 6.30 | 4.95 | 0.21 | 0.14 | 0.18 | 0.88 |
std | 23 days 19:33:58.568845364 | 3.92 | 0.31 | 0.26 | 0.01 | 0.01 | 0.01 | 0.02 |
min | 9 days 00:00:00 | 32.09 | 5.67 | 4.45 | 0.17 | 0.13 | 0.16 | 0.85 |
25% | 29 days 06:00:00 | 36.40 | 6.03 | 4.77 | 0.20 | 0.13 | 0.17 | 0.87 |
50% | 49 days 12:00:00 | 40.26 | 6.34 | 4.98 | 0.20 | 0.14 | 0.18 | 0.88 |
75% | 69 days 18:00:00 | 43.02 | 6.56 | 5.09 | 0.22 | 0.14 | 0.18 | 0.89 |
max | 90 days 00:00:00 | 46.97 | 6.85 | 5.56 | 0.23 | 0.16 | 0.20 | 0.91 |
total rows | 82 |
---|---|
total columns | 8 |
column names | horizon, mse, rmse, mae, mape, mdape, smape, coverage |
index start | 0 |
index end | 81 |
total missing values | 0 |
Prophet Performance Metrics Head and Tail |
horizon | mse | rmse | mae | mape | mdape | smape | coverage | |
---|---|---|---|---|---|---|---|---|
0 | 9 days 00:00:00 | 35.45 | 5.95 | 4.87 | 0.19 | 0.15 | 0.18 | 0.90 |
1 | 10 days 00:00:00 | 36.20 | 6.02 | 4.92 | 0.20 | 0.15 | 0.18 | 0.90 |
2 | 11 days 00:00:00 | 38.09 | 6.17 | 4.99 | 0.20 | 0.15 | 0.18 | 0.89 |
horizon | mse | rmse | mae | mape | mdape | smape | coverage | |
---|---|---|---|---|---|---|---|---|
79 | 88 days 00:00:00 | 40.96 | 6.40 | 5.07 | 0.20 | 0.15 | 0.18 | 0.86 |
80 | 89 days 00:00:00 | 43.29 | 6.58 | 5.20 | 0.21 | 0.14 | 0.18 | 0.86 |
81 | 90 days 00:00:00 | 44.83 | 6.70 | 5.30 | 0.21 | 0.15 | 0.18 | 0.86 |
header_text('Prophet Cross Validation MAE Scores')
plt.style.use('classic')
from prophet.plot import plot_cross_validation_metric
plot = plot_cross_validation_metric(cross_val_df, metric='mae')
- the blue line shows the MAE (above) and MAPE (below), where the mean is taken over a rolling window represented by the dots - per the performance metrics above, a MAPE of around 19-20% is typical for predictions 9 days into the future and rises to around 21% for predictions 90 days out |
---|
header_text('Prophet Cross Validation MAPE Scores')
plot = plot_cross_validation_metric(cross_val_df, metric='mape')
Fine-Tuning Prophet | ||
---|---|---|
- using grid search to fine-tune the hyperparameters - hyperparameters to tune are: - changepoint_prior_scale - seasonality_prior_scale - changepoint_prior_scale determines the flexibility of the trend, specifically how much the trend changes at the trend changepoints - seasonality_prior_scale controls the flexibility of the seasonality - Prophet Hyperparameter Tuning |
# code from the Prophet documentation that I turned into a function |
---|
def prophet_hypertune(df):
    # note: the %%capture cell magic cannot be used inside a function body, so it is omitted here
    import itertools
    param_grid = {
        'changepoint_prior_scale': [0.001, 0.01, 0.1, 0.5],
        'seasonality_prior_scale': [0.01, 0.1, 1.0, 10.0],
    }
    # Generate all combinations of parameters
    all_params = [dict(zip(param_grid.keys(), v)) for v in itertools.product(*param_grid.values())]
    maes = []   # Store the MAE for each set of params here
    mapes = []  # Store the MAPE for each set of params here
    # Use cross validation to evaluate all parameters
    for params in all_params:
        m = Prophet(**params).fit(df)  # Fit model with given params
        df_cv = cross_validation(m, horizon='90 days', parallel="processes")
        df_p = performance_metrics(df_cv, rolling_window=1)
        maes.append(df_p['mae'].values[0])
        mapes.append(df_p['mape'].values[0])
    # Find the best parameters
    tuning_results = pd.DataFrame(all_params)
    tuning_results['mae'] = maes
    tuning_results['mape'] = mapes
    return tuning_results
# tuning_results = prophet_hypertune(store_df)
# tuning_results.to_csv('prophet_tuning_results.csv')
tuning_results = pd.read_csv('prophet_tuning_results.csv')
header_text('Prophet Hyperparameter Tuning Results')
tuning_results.sort_values(['mae', 'mape'])
Prophet Hyperparameter Tuning Results |
Unnamed: 0 | changepoint_prior_scale | seasonality_prior_scale | mae | mape | |
---|---|---|---|---|---|
6 | 6 | 0.010 | 1.00 | 4.564569 | 0.175060 |
7 | 7 | 0.010 | 10.00 | 4.566228 | 0.175187 |
10 | 10 | 0.100 | 1.00 | 4.568626 | 0.171569 |
11 | 11 | 0.100 | 10.00 | 4.570424 | 0.171582 |
5 | 5 | 0.010 | 0.10 | 4.573044 | 0.175335 |
9 | 9 | 0.100 | 0.10 | 4.573120 | 0.171731 |
15 | 15 | 0.500 | 10.00 | 4.660408 | 0.173758 |
14 | 14 | 0.500 | 1.00 | 4.663109 | 0.173874 |
13 | 13 | 0.500 | 0.10 | 4.681716 | 0.174252 |
2 | 2 | 0.001 | 1.00 | 4.766691 | 0.187368 |
3 | 3 | 0.001 | 10.00 | 4.798983 | 0.189097 |
4 | 4 | 0.010 | 0.01 | 4.810390 | 0.184923 |
1 | 1 | 0.001 | 0.10 | 4.842964 | 0.191668 |
0 | 0 | 0.001 | 0.01 | 5.015576 | 0.199107 |
8 | 8 | 0.100 | 0.01 | 5.637384 | 0.210224 |
12 | 12 | 0.500 | 0.01 | 6.977255 | 0.255949 |
# all_params and mapes only exist inside prophet_hypertune(), and the tuning run above is
# commented out, so recover the best parameters from the saved tuning results instead
best_params = tuning_results.loc[tuning_results['mape'].idxmin(),
                                 ['changepoint_prior_scale', 'seasonality_prior_scale']].to_dict()
pretty(best_params, 'Best hyperparameters:')
Best hyperparameters: |
{'changepoint_prior_scale': 0.1, 'seasonality_prior_scale': 1.0} |
Hypertuned Model |
---|
model = Prophet(interval_width=0.95, weekly_seasonality=True,
changepoint_prior_scale=best_params['changepoint_prior_scale'],
seasonality_prior_scale=best_params['seasonality_prior_scale'])
prophet = model.fit(store_df)
17:36:48 - cmdstanpy - INFO - Chain [1] start processing 17:36:49 - cmdstanpy - INFO - Chain [1] done processing
future = prophet.make_future_dataframe(periods=90)
forecast = prophet.predict(future)
header_text('Optimized Prophet Forecasting Results')
plot = prophet.plot(forecast)
Optimized Prophet Forecasting Results |
header_text('Optimized Prophet Components')
plot = prophet.plot_components(forecast)
Optimized Prophet Components |
opt_prophet_results = pd.merge(store_df,
forecast[['ds','yhat_lower','yhat_upper','yhat']],
on='ds')
opt_prophet_results = opt_prophet_results[['ds','yhat_lower','yhat_upper','yhat','y']]
actuals = opt_prophet_results['y'].values
predictions = opt_prophet_results['yhat'].values
mae_02 = mae(actuals, predictions)
pretty(f'{mae_02:.3f}%', 'Optimized Prophet MAE')
Optimized Prophet MAE |
4.270% |
mape_02 = mape(actuals, predictions)
pretty(f'{mape_02:.3f}%', 'Optimized Prophet MAPE')
Optimized Prophet MAPE |
0.168% |
# results are nearly identical to the model before hyperparameter tuning (only a very slight improvement) |
---|
plt.style.use('strawberries.mplstyle')
plt.plot(actuals, label='Actual Values')
plt.plot(predictions, label='Predicted Values')
plt.title('Optimized Prophet Accuracy')
plt.legend(loc = 2);
# %%capture
# # df_cv = cross_validation(m, horizon='90 days')
# cross_val_df = cross_validation(m, initial='270 days', period='45 days', horizon = '90 days')
perf_metrics = performance_metrics(cross_val_df)
head_tail_vert(perf_metrics, 5, 'Optimized Prophet Performance')
Optimized Prophet Performance: head(5) |
horizon | mse | rmse | mae | mape | mdape | smape | coverage | |
---|---|---|---|---|---|---|---|---|
0 | 9 days | 35.947220 | 5.995600 | 4.926284 | 0.191647 | 0.152566 | 0.176928 | 0.690236 |
1 | 10 days | 36.742474 | 6.061557 | 4.980188 | 0.199179 | 0.156536 | 0.181208 | 0.683502 |
2 | 11 days | 38.087855 | 6.171536 | 4.999319 | 0.201397 | 0.150212 | 0.181926 | 0.676768 |
3 | 12 days | 38.071688 | 6.170226 | 4.962704 | 0.202298 | 0.144186 | 0.181509 | 0.676768 |
4 | 13 days | 35.446195 | 5.953671 | 4.777227 | 0.197603 | 0.137403 | 0.175873 | 0.690236 |
Optimized Prophet Performance: tail(5) |
horizon | mse | rmse | mae | mape | mdape | smape | coverage | |
---|---|---|---|---|---|---|---|---|
77 | 86 days | 45.547330 | 6.748876 | 5.358512 | 0.220390 | 0.157445 | 0.190333 | 0.670034 |
78 | 87 days | 43.664712 | 6.607928 | 5.238313 | 0.209376 | 0.151744 | 0.184221 | 0.683502 |
79 | 88 days | 43.111591 | 6.565942 | 5.242348 | 0.202869 | 0.150846 | 0.181497 | 0.673401 |
80 | 89 days | 45.311504 | 6.731382 | 5.354620 | 0.213010 | 0.150907 | 0.185890 | 0.656566 |
81 | 90 days | 46.559753 | 6.823471 | 5.402354 | 0.215018 | 0.152460 | 0.186969 | 0.649832 |
header_text('Optimized Prophet MAPE Scores')
plt.style.use('classic')
plot = plot_cross_validation_metric(cross_val_df, metric='mape')
Optimized Prophet MAPE Scores |
# comparing prophet_01 and prophet_02 |
---|
metrics_prophet_01 = [round(prophet_mae,3), round(prophet_mape,3)]
metrics_prophet_02 = [round(mae_02,3), round(mape_02,3)]
pd.DataFrame({'metrics':['MAE','MAPE'],
'Prophet_01':metrics_prophet_01,
'Prophet_02':metrics_prophet_02,
})
metrics | Prophet_01 | Prophet_02 | |
---|---|---|---|
0 | MAE | 4.275 | 4.270 |
1 | MAPE | 0.169 | 0.168 |
- the fine-tuned, second Prophet model slightly outperformed the first |
---|
Incorporating Holidays | ||
---|---|---|
- adding US Holidays by using add_country_holidays(country_name='US') |
model = Prophet(interval_width=0.95, weekly_seasonality=True,
changepoint_prior_scale=best_params['changepoint_prior_scale'],
seasonality_prior_scale=best_params['seasonality_prior_scale'])
model.add_country_holidays(country_name='US')
prophet_03 = model.fit(store_df)
18:10:16 - cmdstanpy - INFO - Chain [1] start processing 18:10:16 - cmdstanpy - INFO - Chain [1] done processing
pretty('US Holidays from Prophet')
pd.DataFrame(model.train_holiday_names).style\
.hide(axis = 'index')\
.hide(axis = 'columns')
US Holidays from Prophet |
New Year's Day |
Martin Luther King Jr. Day |
Washington's Birthday |
Memorial Day |
Independence Day |
Labor Day |
Columbus Day |
Veterans Day |
Thanksgiving |
Christmas Day |
Christmas Day (Observed) |
New Year's Day (Observed) |
Veterans Day (Observed) |
Independence Day (Observed) |
future = prophet_03.make_future_dataframe(periods=90)
forecast = prophet_03.predict(future)
header_text('Prophet Model Predictions, holidays incl.')
plt.style.use('classic')
plot = prophet_03.plot(forecast)
Prophet Model Predictions, holidays incl. |
header_text('Prophet Model Components, holidays incl.')
plot = prophet_03.plot_components(forecast)
Prophet Model Components, holidays incl. |
prophet_03_results = pd.merge(store_df,
forecast[['ds','yhat_lower','yhat_upper','yhat']],on='ds')
prophet_03_results = prophet_03_results[['ds','yhat_lower','yhat_upper','yhat','y']]
# calculate MAE between expected and predicted values for december
actuals = prophet_03_results['y'].values
predictions = prophet_03_results['yhat'].values
mae_03 = mae(actuals, predictions)
pretty(f'{mae_03:.3f}', 'Prophet MAE, holidays incl.')
Prophet MAE, holidays incl. |
4.250 |
mape_03 = mape(actuals, predictions)
pretty(f'{mape_03:.3f}', 'Prophet MAPE, holidays incl.')
Prophet MAPE, holidays incl. |
0.168 |
By including the US holidays provided by Prophet, we improve the model a bit. Even though it is not a significant improvement, it shows that adjusting parameters based on our business case can work in our favor. |
---|
plt.style.use('strawberries.mplstyle')
plt.plot(actuals, label='Actual Values')
plt.plot(predictions, label='Predicted Values')
plt.legend();
cross_val_df = cross_validation(prophet_03, initial='270 days', period='45 days', horizon = '90 days')
metrics_prophet_03 = [round(mae_03,3), round(mape_03,3)]
pd.DataFrame({'metrics':['MAE','MAPE'],
'Prophet_01':metrics_prophet_01,
'Prophet_02':metrics_prophet_02,
'Prophet_03':metrics_prophet_03,
})
metrics | Prophet_01 | Prophet_02 | Prophet_03 | |
---|---|---|---|---|
0 | MAE | 4.275 | 4.270 | 4.250 |
1 | MAPE | 0.169 | 0.168 | 0.168 |
# adding holidays improved the model a bit more |
---|
perf_metrics = performance_metrics(cross_val_df)
head_tail_vert(perf_metrics, 5, 'Prophet Performance, holidays incl.')
Prophet Performance, holidays incl.: head(5) |
horizon | mse | rmse | mae | mape | mdape | smape | coverage | |
---|---|---|---|---|---|---|---|---|
0 | 9 days | 36.150575 | 6.012535 | 4.918369 | 0.193490 | 0.146556 | 0.177075 | 0.892256 |
1 | 10 days | 37.134604 | 6.093817 | 4.996481 | 0.202334 | 0.152226 | 0.182360 | 0.885522 |
2 | 11 days | 38.820715 | 6.230627 | 5.040513 | 0.205745 | 0.151914 | 0.184104 | 0.868687 |
3 | 12 days | 38.447719 | 6.200622 | 4.977827 | 0.205776 | 0.144818 | 0.182800 | 0.872054 |
4 | 13 days | 35.533510 | 5.960999 | 4.774992 | 0.200319 | 0.136720 | 0.176643 | 0.895623 |
Prophet Performance, holidays incl.: tail(5) |
horizon | mse | rmse | mae | mape | mdape | smape | coverage | |
---|---|---|---|---|---|---|---|---|
77 | 86 days | 44.068967 | 6.638446 | 5.205878 | 0.215983 | 0.150907 | 0.185194 | 0.865320 |
78 | 87 days | 42.260486 | 6.500807 | 5.090820 | 0.204757 | 0.149675 | 0.179273 | 0.872054 |
79 | 88 days | 41.025674 | 6.405129 | 5.083517 | 0.197753 | 0.151751 | 0.176397 | 0.875421 |
80 | 89 days | 43.895481 | 6.625367 | 5.239270 | 0.210582 | 0.146854 | 0.182493 | 0.865320 |
81 | 90 days | 45.513572 | 6.746375 | 5.336867 | 0.214079 | 0.151751 | 0.185094 | 0.868687 |
plt.style.use('classic')
plot = plot_cross_validation_metric(cross_val_df, metric='mape')
Saving and Loading Best Model | ||
---|---|---|
- in Python, Prophet models should not be saved with pickle - the attached Stan backend makes this unreliable - instead, use the built-in serialization functions to serialize the model to JSON |
import json
from prophet.serialize import model_to_json, model_from_json
# saving model |
---|
with open('prophet_model.json', 'w') as model_out:
    json.dump(model_to_json(prophet_03), model_out)
# loading model |
---|
with open('prophet_model.json', 'r') as model_in:
    prophet = model_from_json(json.load(model_in))
Conclusion | ||
---|---|---|
- the Prophet models clearly outperform the ARIMA and SARIMA models - the best Prophet model has a MAE 32.38% lower than the MAE of the best SARIMA model - MAPE of the best Prophet is 21.13% lower than the MAPE of the best SARIMA model |
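A quick check of the quoted improvement figures, using the MAE and MAPE values from the comparison table below |
---|
best_sarima_mae, best_prophet_mae = 6.285, 4.250
best_sarima_mape, best_prophet_mape = 0.213, 0.168
print(f'MAE improvement:  {(best_sarima_mae - best_prophet_mae) / best_sarima_mae:.2%}')    # ~32.38%
print(f'MAPE improvement: {(best_sarima_mape - best_prophet_mape) / best_sarima_mape:.2%}') # ~21.13%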
Comparing all results - SARIMA models vs Prophet models |
---|
sarimax_results = pd.read_csv("model_results.csv")
prophet_results = pd.DataFrame({'metrics':['MAE','MAPE'],
'Prophet_01':metrics_prophet_01,
'Prophet_02':metrics_prophet_02,
'Prophet_03':metrics_prophet_03,
})
sarimax_results.merge(prophet_results, on='metrics').drop(columns = ['Unnamed: 0'])
metrics | ARIMA(4,1,5) | SARIMA(0,1,6)(0,1,1)7 | SARIMA(6,1,1)(6,1,0)7 | Prophet_01 | Prophet_02 | Prophet_03 | |
---|---|---|---|---|---|---|---|
0 | MAE | 6.906 | 6.910 | 6.285 | 4.275 | 4.270 | 4.250 |
1 | MAPE | 0.298 | 0.301 | 0.213 | 0.169 | 0.168 | 0.168 |