What is ARIMA time series modeling at Ctrip? An introduction through business volume forecasting.

Original: Li Sheng

1. Time series analysis

Time series analysis is an important branch of statistics.

It studies how things develop and change over time in order to predict how they will develop in the future.

Everyday examples of time series include stock price movements and the daily sales of a milk tea shop.


    The annual distribution of precipitation and the fluctuation of river levels across the four seasons are time series as well. Time series analysis reaches into many industries.


    Time series classification

    Picture 1

    1. By stationarity, time series are divided into stationary and non-stationary series

    2. By the nature of the indicator, they are divided into series of total indicators, relative indicators, and average indicators

    3. By the time attribute of the indicator, they are divided into period-indicator series and point-in-time-indicator series

    Period-indicator series are additive, and the sum is meaningful.

    For example, with daily order quantity, a month's order quantity is simply the sum of that month's daily order quantities.

    Point-in-time-indicator series are not additive; they reflect the level reached at a particular moment.

    For example, with daily inventory, adding up inventory levels has no statistical meaning: the inventory for a month does not equal the sum of the daily inventories.
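The difference between the two kinds of indicator can be sketched with pandas. The data below are made up for illustration, and the names `orders` and `inventory` are hypothetical.

```python
import pandas as pd

# Hypothetical daily data for one month (values are made up for illustration).
days = pd.date_range("2019-08-01", "2019-08-31", freq="D")
orders = pd.Series(100, index=days, name="orders")        # period indicator
inventory = pd.Series(500, index=days, name="inventory")  # point-in-time indicator

# Period indicator: the monthly figure is the sum of the daily figures.
monthly_orders = orders.sum()

# Point-in-time indicator: report the level at a chosen moment (here month end).
# Summing the daily levels would count the same stock 31 times.
month_end_inventory = inventory.iloc[-1]

print(monthly_orders)       # 3100
print(month_end_inventory)  # 500
```

Which moment to report for a point-in-time indicator (month end, average level, and so on) is a business choice; month end is only one option.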


    For Internet companies, business volume is one of the key indicators for managing the company.

    Real-world complexity makes business volume difficult to analyze and forecast:

    1. Cyclical effects on business volume

    2. Shifting special dates such as holidays

    3. Regional differences and spatial interaction

    4. Dependence on inventory and actual market capacity

    5. Other exogenous variables, such as uncontrollable natural or social factors

    Time series analysis applies to many of these tasks, for example order volume, traffic volume, and inventory management.

    How can it be done? Common methods include ANN, RNN, LR, ARIMA, and Prophet.

    This article focuses on the ARIMA method.

    2. The practice of time series analysis

    2.1 Introduction to ARIMA Model


    The ARMA model, in full the autoregressive moving average model, is perhaps the most commonly used model for fitting stationary series. ARIMA, the autoregressive integrated moving average model, extends it with differencing.


    The model consists of two parts:

    AR(p), the autoregressive model of order p

    When φ0 = 0, the autoregressive model is called a centered AR(p) model.

    A non-centered AR(p) series can be turned into a centered AR(p) model by a shift (subtracting its mean).

    The AR model expresses the value at time t as a linear combination of the values at the p preceding moments, t-1 through t-p, plus noise.
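For reference, the AR(p) model in its usual textbook form can be written as below. The symbols x_t, φ_i and ε_t are standard notation assumed here, chosen to match the φ0 used in the text; ε_t denotes zero-mean white noise.

```latex
x_t = \phi_0 + \phi_1 x_{t-1} + \phi_2 x_{t-2} + \cdots + \phi_p x_{t-p} + \varepsilon_t,
\qquad \phi_p \neq 0

% Centering by a shift: with \mu = \phi_0 / (1 - \phi_1 - \cdots - \phi_p),
% the series y_t = x_t - \mu satisfies the same equation with \phi_0 = 0.
y_t = \phi_1 y_{t-1} + \phi_2 y_{t-2} + \cdots + \phi_p y_{t-p} + \varepsilon_t
```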

    MA(q), the moving average model of order q

    When μ = 0, the MA(q) model is called a centered MA(q) model.

    A non-centered MA(q) model can be converted to a centered one by a simple shift.

    The MA model expresses the current value as a linear combination of the noise terms at past moments.
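In the same assumed notation, the MA(q) model is commonly written as below. The minus signs in front of the θ coefficients are one common convention; some texts and libraries use plus signs instead.

```latex
x_t = \mu + \varepsilon_t - \theta_1 \varepsilon_{t-1} - \theta_2 \varepsilon_{t-2}
      - \cdots - \theta_q \varepsilon_{t-q},
\qquad \theta_q \neq 0

% Centering by a shift: y_t = x_t - \mu gives the same model with \mu = 0.
```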

    The ARMA model combines AR(p) and MA(q) in a single equation.

    Likewise, when φ0 = 0, the model is called a centered ARMA(p,q) model.
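Using the standard notation assumed here (x_t for the series, ε_t for zero-mean white noise, and the minus-sign convention for the θ coefficients), the ARMA(p,q) model is usually written as:

```latex
x_t = \phi_0 + \phi_1 x_{t-1} + \cdots + \phi_p x_{t-p}
      + \varepsilon_t - \theta_1 \varepsilon_{t-1} - \cdots - \theta_q \varepsilon_{t-q},
\qquad \phi_p \neq 0,\ \theta_q \neq 0

% q = 0 reduces this to AR(p); p = 0 reduces it to MA(q) with \mu = \phi_0.
```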

    It combines the characteristics of both models: the AR part captures the relationship between the current value and recent past values, while the MA part captures the impact of random shocks.

    An ARMA model can be fitted directly to a stationary time series.

    In practice, however, most of our time series have a trend, which means they are non-stationary.

    They therefore need to be made stationary first, most commonly by differencing.

    Making the series stationary and then applying ARMA is exactly the ARIMA procedure: take the first- or second-order difference of the original non-stationary series, then fit an ARMA model to the resulting stationary series.

    The ARIMA(p,d,q) model is a three-parameter model that adds the differencing order d to the two parameters of ARMA(p,q).
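A minimal sketch of what first-order differencing (d = 1) does, using NumPy on a made-up series with a linear trend. The data and variable names are assumptions for illustration only.

```python
import numpy as np

rng = np.random.default_rng(0)
t = np.arange(200)
# Hypothetical non-stationary series: linear trend plus noise.
series = 5.0 + 0.5 * t + rng.normal(scale=1.0, size=t.size)

# First-order difference: y_t = x_t - x_{t-1}.
diff1 = np.diff(series, n=1)

# The trend makes the mean of the original series drift between halves,
# while the differenced series keeps a roughly constant mean near the slope.
print(series[:100].mean(), series[100:].mean())
print(diff1[:100].mean(), diff1[100:].mean())

# Differencing is undone by a cumulative sum anchored at the first value.
restored = np.concatenate(([series[0]], series[0] + np.cumsum(diff1)))
```

The last step is the same idea used later when fitted differences are turned back into levels.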


    2.2 Stages of practical analysis of ARIMA model

    Picture 2

    The implementation below uses Python as an example.

    Step 1. Read the time series

    import pandas as pd
    import matplotlib.pyplot as plt

    df = pd.read_csv('testdata.csv', encoding='gbk', index_col='ddate')
    # Convert the time series index to datetime
    df.index = pd.to_datetime(df.index)
    # Convert the indicator volume to float
    df['cnt'] = df['cnt'].astype(float)

    plt.figure(facecolor='white', figsize=(20, 8))
    plt.plot(df.index, df['cnt'], label='Time Series')
    plt.legend(loc='best')
    plt.show()

    Step 2. Checking stationarity of time series

    What is stationarity?

    Stationarity is divided into strict stationarity and wide-sense (weak) stationarity.

    Strict stationarity requires that every finite-dimensional joint distribution of the series is invariant under shifts in time.

    For example, Gaussian white noise is a strictly stationary series.

    Wide-sense stationarity requires that the mean and variance are constant and that the covariance structure does not change over time.
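The wide-sense conditions are usually stated as follows. The symbols μ, σ² and γ_k are standard notation assumed here, not taken from the article.

```latex
E[x_t] = \mu, \qquad
\mathrm{Var}(x_t) = \sigma^2 < \infty, \qquad
\mathrm{Cov}(x_t, x_{t+k}) = \gamma_k \quad \text{for all } t
% The autocovariance \gamma_k depends only on the lag k, not on the time t.
```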

    Why is stationarity needed?

    ARIMA contains an AR model, which in essence uses historical values to predict the value at the current time.

    This requires that the correlation structure of the series does not change over time.

    from statsmodels.tsa.stattools import adfuller

    def test_stationarity(timeseries):
        # Augmented Dickey-Fuller test; returns the p-value
        dftest = adfuller(timeseries, autolag='AIC')
        return dftest[1]

    The original time series fails the stationarity test (p-value 0.94).

    Figure 3 also shows that the series has a clear upward trend.

    We therefore difference the series and test its stationarity again.

    Step 3. Check stationarity after differencing

    from datetime import datetime, timedelta

    pred_day = 7
    train_start = datetime(2017, 3, 1)
    train_end = datetime(2019, 8, 16)
    pred_start = train_end + timedelta(1)
    pred_end = train_end + timedelta(pred_day)

    # copy() so that assigning to the slice does not touch df
    train_diff = df[train_start:train_end].copy()
    train_diff['cnt'] = train_diff['cnt'].diff()
    # The first differenced value is NaN, so start one day later
    print(test_stationarity(train_diff['cnt'][train_start + timedelta(1):train_end]))

    plt.figure(facecolor='white', figsize=(20, 8))
    plt.plot(train_diff.index, train_diff['cnt'], label='Time series after diff')
    plt.legend(loc='best')
    plt.show()

    The p-value of the stationarity test after differencing is 9.51e-15.

    This shows that the differenced series is stationary, so the ARIMA model can be applied.

    Step 4. Draw ACF and PACF charts

    The autocorrelation function (ACF) reflects the correlation between two points in the series, including correlation that passes through the points between them.

    The partial autocorrelation function (PACF) removes the influence of the points in between and reflects only the direct correlation between the two points.

    For example, in AR(2), even though y(t-3) does not appear directly in the model, y(t) and y(t-3) are still correlated.

    import statsmodels.api as sm

    fig = plt.figure(figsize=(12, 8))
    ax1 = fig.add_subplot(211)
    # [1:] skips the NaN produced by differencing
    fig = sm.graphics.tsa.plot_acf(train_diff['cnt'][1:], lags=20, ax=ax1)
    ax1.xaxis.set_ticks_position('bottom')
    fig.tight_layout()
    ax2 = fig.add_subplot(212)
    fig = sm.graphics.tsa.plot_pacf(train_diff['cnt'][1:], lags=20, ax=ax2)
    ax2.xaxis.set_ticks_position('bottom')
    fig.tight_layout()
    plt.show()

    Strictly speaking, both the ACF and the PACF tail off with some oscillation.

    However, both drop sharply and then level off after lag 3. Since this is a short-term forecasting scenario, the orders can be decided by combining forecast performance with model testing.

    Step 5. Determine the order of the ARIMA model

    The ACF and PACF give us a guideline for choosing the model parameters.

    In general, however, we still need to determine the final parameter values from how well the fitted model performs.

    For ARMA models we usually use the Akaike information criterion, AIC = 2k - 2ln(L), where k is the number of model parameters and L is the maximized value of the likelihood function.

    AIC rewards a good fit to the data while penalizing the number of parameters, which discourages overfitting.

    In practice, therefore, we choose the set of parameters that gives the model the smallest AIC.
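The AIC trade-off can be illustrated with plain arithmetic. The candidate models and log-likelihood values below are invented for the example; only the formula AIC = 2k - 2ln(L) comes from the text.

```python
def aic(num_params: int, log_likelihood: float) -> float:
    """Akaike information criterion: AIC = 2k - 2 ln(L)."""
    return 2 * num_params - 2 * log_likelihood

# Hypothetical candidates: (label, number of parameters k, maximized log-likelihood).
candidates = [
    ("small model", 3, -520.0),
    ("medium model", 6, -505.0),
    ("large model", 15, -503.0),
]

scores = {name: aic(k, ll) for name, k, ll in candidates}
best = min(scores, key=scores.get)

for name, score in scores.items():
    print(name, score)
print("lowest AIC:", best)
```

The large model has the highest likelihood, but its extra parameters cost more than the fit gains, so the medium model has the lowest AIC.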

    The implementation is as follows:

    # Determine the order
    import warnings
    # Legacy API, matching fit(disp=-1) below; removed in statsmodels 0.13
    from statsmodels.tsa.arima_model import ARIMA

    warnings.filterwarnings("ignore")  # suppress warning messages
    pmax = 8
    qmax = 8
    aic_matrix = []  # AIC for each (p, q)
    for p in range(1, pmax + 1):
        tmp = []
        for q in range(1, qmax + 1):
            try:  # some orders fail to fit; record them as missing
                model = ARIMA(endog=df['cnt'], order=(p, 1, q))
                results = model.fit(disp=-1)
                tmp.append(results.aic)
                print('ARIMA p:{} q:{} - AIC:{}'.format(p, q, results.aic))
            except Exception:
                tmp.append(None)
        aic_matrix.append(tmp)
    aic_matrix = pd.DataFrame(aic_matrix, dtype=float)
    # stack() flattens the matrix; idxmin() returns the 0-based
    # (row, column) position of the smallest AIC
    p, q = aic_matrix.stack().idxmin()
    p, q = p + 1, q + 1  # convert positions to model orders
    print('AIC minimum p value and q value: %s, %s' % (p, q))

    Because the series is stationary after first-order differencing, the model parameter d = 1. By the minimum-AIC principle, p = 7 and q = 7.

    Step 6. Model testing and optimization

    Plug the selected parameters into the model and examine how well it fits.

    import numpy as np

    # Build and fit the ARIMA(7, 1, 7) model
    model = ARIMA(endog=df['cnt'], order=(p, 1, q))
    result_ARIMA = model.fit(disp=-1, method='css')
    predict_diff = result_ARIMA.predict()

    # Undo the first-order difference: add back the previous day's value
    df_shift = df['cnt'].shift(1)
    predict = predict_diff + df_shift

    # The leading p+1 days are excluded from the comparison
    start = train_start + timedelta(p + 1)
    fitted = predict[start:train_end]
    actual = df['cnt'][start:train_end]

    plt.figure(figsize=(18, 5), facecolor='white')
    fitted.plot(color='blue', label='Predict')
    actual.plot(color='red', label='Original')
    # Mean absolute percentage error over the training window
    err = (np.abs(fitted - actual) / actual).mean()
    plt.legend(loc='best')
    plt.title('Error: %.4f' % err)
    plt.show()

    Use the trained model to forecast the future.

    # 7-day forecast; the legacy forecast() returns (forecast, stderr, conf_int)
    y_forecasted = result_ARIMA.forecast(steps=pred_day, alpha=0.01)[0]
    y_truth = df[pred_start:pred_end]['cnt']
    # Mean absolute error
    mae = np.abs(y_forecasted - y_truth).mean()
    # Mean absolute percentage error
    error_rate = (np.abs(y_forecasted - y_truth) / y_truth).mean()
    print('\nThe average error rate of our forecasts: {}'.format(round(error_rate, 4)))

    The model's forecast error is 8.58% (mean of |forecast - actual| / actual).

    This result is not good enough, so the model needs to be optimized.

    The indicator is affected by holidays and by the day of the week, so we add holiday and day-of-week flags to the model as exogenous variables.

    After adding the exog variables, the order has to be determined again and the model retrained, following the same steps as above.

    The forecast error of the optimized model is 1.77%, which is much better than before.

    Figure 8
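The article does not show how the exogenous variables were built. One possible sketch, assuming a hand-maintained holiday list and 0/1 day-of-week flags (the dates, column names and the `holidays` list are all hypothetical), is:

```python
import pandas as pd

# Hypothetical date range and holiday list, for illustration only.
dates = pd.date_range("2019-09-09", "2019-09-22", freq="D")
holidays = pd.to_datetime(["2019-09-13"])  # assumed holiday date

exog = pd.DataFrame(index=dates)
exog["is_holiday"] = dates.isin(holidays).astype(int)

# One 0/1 column per weekday (Monday=0 ... Sunday=6). One column is dropped
# to avoid perfect collinearity with a constant term in the model.
weekday = pd.Series(dates.dayofweek, index=dates)
weekday_flags = pd.get_dummies(weekday, prefix="wd", drop_first=True).astype(int)
exog = exog.join(weekday_flags)

print(exog.head())
```

A frame like this would be passed through the `exog` argument of the ARIMA constructor, and the same columns have to be supplied for the forecast dates when forecasting.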

    Step 7: Checking Model

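    Use the model's residuals to check whether the model is reasonable.

    from statsmodels.graphics.api import qqplot
    from statsmodels.stats.stattools import durbin_watson

    resid = result_ARIMA_improve.resid  # residuals of the optimized model
    fig = plt.figure(figsize=(12, 8))
    ax = fig.add_subplot(111)
    qqplot(resid, line='q', fit=True, ax=ax)
    plt.show()
    # Durbin-Watson test for autocorrelation in the residuals
    print('D-W check value is {}'.format(durbin_watson(resid.values)))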

    Figure 9

    The Q-Q plot in Figure 9 shows that the residuals largely follow a normal distribution.

    The D-W statistic is 1.99, which is close to 2. This indicates that the residual series has no first-order autocorrelation, i.e. the model fits well.
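For intuition about why a value near 2 means "no autocorrelation", the D-W statistic can be computed by hand and compared with `durbin_watson` from statsmodels. The residuals below are synthetic and only serve as an illustration.

```python
import numpy as np
from statsmodels.stats.stattools import durbin_watson

def dw_statistic(resid):
    """Durbin-Watson: sum of squared successive differences over sum of squares."""
    resid = np.asarray(resid, dtype=float)
    return np.sum(np.diff(resid) ** 2) / np.sum(resid ** 2)

rng = np.random.default_rng(7)
white = rng.normal(size=5000)  # hypothetical uncorrelated residuals
trending = np.cumsum(white)    # strongly positively autocorrelated series

print(dw_statistic(white), durbin_watson(white))
print(dw_statistic(trending))
```

The statistic is approximately 2(1 - r), where r is the lag-1 autocorrelation of the residuals, so uncorrelated residuals give a value near 2 and strong positive autocorrelation pushes it toward 0.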

    3. Summary and perspectives

    Time series analysis calls for a careful preliminary assessment, and looking at the charts directly helps us make decisions.

    Learning to use good open source libraries often gets twice the result for half the effort.

    The choice of model is very important: check which scenarios each model suits and choose the one that fits your own time series.

    ARIMA forecasts well in the short term, but it is not suitable for long-term forecasts such as the coming year, because the deviation grows steadily with the forecast horizon.
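The growth of the deviation with the horizon can be illustrated with the simplest integrated model, a driftless random walk, i.e. ARIMA(0,1,0). This is an assumption chosen for clarity, not the model fitted in the article.

```python
import numpy as np

rng = np.random.default_rng(3)
# Hypothetical setting: 2000 simulated random walks, 365 future steps each.
# The best forecast of a driftless random walk is its last observed value,
# so the h-step forecast error is the sum of the next h shocks.
shocks = rng.normal(scale=1.0, size=(2000, 365))
errors = np.cumsum(shocks, axis=1)

# Spread of the forecast error at several horizons (grows roughly like sqrt(h)).
for h in (1, 7, 30, 365):
    print(h, errors[:, h - 1].std())
```

For this model the error spread after a year is many times the one-day spread, which matches the advice to keep ARIMA for short horizons.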

    Complex real-world scenarios are hard to handle with a single model, so a combination of several models should be considered for analysis and forecasting.