ISSN: 2231-2196 (Print)

ISSN: 0975-5241 (Online)

**IJCRR - Vol 09 Issue 24, December 2017**

**Pages:** 20-27

**Date of Publication:** 26-Dec-2017


**Author:** J. C. Ramesh Reddy, T. Ganesh, M. Venkateswaran, PRS Reddy

**Category:** General Sciences

**Abstract:** A Seasonal Autoregressive Integrated Moving Average (SARIMA) model for forecasting the monthly mean rainfall of Rayalaseema (India) using R is discussed. We examined different ARIMA models according to the structure of the data and found that the ARIMA (5,0,1)(2,0,0)[12] model fits the data, the model being selected on the basis of the lowest AIC and BIC values.

**Keywords:** Box-Jenkins Methodology, SARIMA model, AIC & BIC

**DOI:** 10.7324/IJCRR.2017.9244

**Full Text:**

**Introduction**

Andhra Pradesh is one of the states of India; within it we have taken the rainfall data of Rayalaseema, a region which statistical reports classify as drought-prone. In this paper, we forecast the rainfall of Rayalaseema for the coming years. For the experiment, we have taken the mean annual rainfall data from www.data.gov.in; the dataset contains mean annual rainfall information from 1901 to 2016.

In this experiment we have used R, now one of the most widely used software environments in data science and statistics. For the analysis, the first column of the dataset is chosen, which contains the annual mean rainfall in mm.

**Methodology**

ARIMA models are capable of modeling a wide range of seasonal data. A seasonal ARIMA model is formed by including additional seasonal terms in the ARIMA models we have seen so far. It is written as follows:

ARIMA (p, d, q) (P, D, Q)_{m}: the first parenthesis represents the non-seasonal part of the model and second represents the seasonal part of the model, where m= number of periods per season. We use uppercase notation for the seasonal parts of the model, and lowercase notation for the non-seasonal parts of the model. The additional seasonal terms are simply multiplied with the non-seasonal terms.

The seasonal part of an AR or MA model will be seen in the seasonal lags of the PACF and ACF. For example, an ARIMA(0,0,0)(0,0,1)_{12} model will show a spike at lag 12 in the ACF but no other significant spikes, and exponential decay in the seasonal lags of the PACF, that is, at lags 12, 24, 36, etc. Similarly, an ARIMA(0,0,0)(1,0,0)_{12} model will show exponential decay in the seasonal lags of the ACF and a single significant spike at lag 12 in the PACF. In considering the appropriate seasonal orders for an ARIMA model, attention is restricted to the seasonal lags. The modelling procedure is almost the same as for non-seasonal data, except that we also need to select seasonal AR and MA terms in addition to the non-seasonal components of the model.

A seasonal autoregressive integrated moving average (SARIMA) model for any variable involves four main steps: identification, estimation, diagnostic checking and forecasting. The basic form of the SARIMA model is denoted SARIMA(p, d, q)(P, D, Q)_{s} and the model is given by

Φ_P(B^s) φ_p(B) (1 − B)^d (1 − B^s)^D y_t = Θ_Q(B^s) θ_q(B) ε_t,

where y_t is the time series value at time t, and φ_p, Φ_P, θ_q and Θ_Q are polynomials of order p, P, q and Q respectively. B is the backward shift operator, B y_t = y_{t−1}. The order of seasonality is represented by s. The non-seasonal and seasonal difference orders are denoted by d and D respectively, and ε_t denotes a white noise process. The Box-Jenkins methodology involves four steps (Box et al., 1994): (i) identification, (ii) estimation, (iii) diagnostic checking and (iv) forecasting. First, the original series must be transformed to become stationary in both its mean and its variance. Second, the appropriate orders p and q must be specified using the autocorrelation and partial autocorrelation functions. Third, the values of the parameters must be estimated using a non-linear optimization procedure that minimizes the sum of squares of the errors or some other appropriate loss function. Fourth, the adequacy of the model must be checked diagnostically. This procedure is continued until an adequate model is obtained. Finally, future forecasts are generated by the minimum mean square error method (Box et al., 1994). SARIMA models are used as benchmark models to compare the performance of other models developed on the same data set. The iterative procedure of SARIMA model building is explained by Kumari et al. (2013), Boiroju (2012), Rao (2011) and Box et al. (1994).

By default, the `arima()` command in R sets c = μ = 0 when d > 0 and provides an estimate of μ when d = 0. The parameter μ is called the "intercept" in the R output. It will be close to the sample mean of the time series, but usually not identical to it, as the sample mean is not the maximum likelihood estimate when p + q > 0. The `arima()` command has an argument `include.mean` which only has an effect when d = 0 and is `TRUE` by default. Setting `include.mean=FALSE` will force μ = 0.

The `Arima()` command from the forecast package provides more flexibility on the inclusion of a constant. It has an argument `include.mean` with identical functionality to the corresponding argument for `arima()`. It also has an argument `include.drift` which allows μ ≠ 0 when d = 1. For d > 1, no constant is allowed, as a quadratic or higher-order trend is particularly dangerous when forecasting. The parameter μ is called the "drift" in the R output when d = 1.

There is also an argument `include.constant` which, if `TRUE`, sets `include.mean=TRUE` when d = 0 and `include.drift=TRUE` when d = 1. If `include.constant=FALSE`, both `include.mean` and `include.drift` are set to `FALSE`. If `include.constant` is used, any values supplied for `include.mean` and `include.drift` are ignored.

The `auto.arima()` function automates the inclusion of a constant. By default, for d = 0 or d = 1, a constant is included if it improves the AIC value; for d > 1, the constant is always omitted. If `allowdrift=FALSE` is specified, then the constant is only allowed when d = 0.

There is another function `arima()` in R which also fits an ARIMA model. However, it does not allow for the constant c unless d = 0, it does not return everything required for the `forecast()` function, and it does not allow the estimated model to be applied to new data (which is useful for checking forecast accuracy). Consequently, it is recommended that you use `Arima()` instead.

When fitting an ARIMA model to a set of time series data, the following procedure provides a useful general approach.

- Plot the data. Identify any unusual observations.
- If necessary, transform the data (using a Box-Cox transformation) to stabilize the variance.
- If the data are non-stationary: take first differences of the data until the data are stationary.
- Examine the ACF/PACF: Is an AR(p) or MA(q) model appropriate?
- Try your chosen model(s), and use the AICc to search for a better model.
- Check the residuals from your chosen model by plotting the ACF of the residuals, and doing a portmanteau test of the residuals. If they do not look like white noise, try a modified model.
- Once the residuals look like white noise, calculate forecasts.
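The portmanteau check in the last two steps can be sketched directly from the Ljung-Box formula; this is an illustrative Python computation on stand-in residuals, not part of the original analysis.

```python
# Ljung-Box portmanteau statistic computed from its definition:
# Q = n(n+2) * sum_{k=1..h} rho_k^2 / (n-k), compared to a chi-squared(h) tail.
import numpy as np
from scipy.stats import chi2

rng = np.random.default_rng(1)
resid = rng.normal(size=500)  # stand-in for model residuals
n, h = len(resid), 24

r = resid - resid.mean()
denom = np.sum(r ** 2)
# sample autocorrelations at lags 1..h
rho = np.array([np.sum(r[k:] * r[:-k]) / denom for k in range(1, h + 1)])

q = n * (n + 2) * np.sum(rho ** 2 / (n - np.arange(1, h + 1)))
p_value = chi2.sf(q, df=h)  # a large p-value is consistent with white noise
print(0.0 <= p_value <= 1.0)  # True
```

In practice one would use a library routine for this test; the explicit formula is shown only to make the checklist concrete.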

AIC and BIC are both penalized-likelihood criteria. They are sometimes used for choosing the best predictor subsets in regression and often used for comparing non-nested models, which ordinary statistical tests cannot do. The AIC or BIC for a model is usually written in the form −2 log *L* + *kp*, where *L* is the likelihood function, *p* is the number of parameters in the model, and *k* is 2 for AIC and log(*n*) for BIC.
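As a small check of this formula, the following plugs hypothetical numbers into −2 log *L* + *kp* with k = 2 for AIC and k = log(n) for BIC; the log-likelihood, parameter count and sample size are made up.

```python
# AIC and BIC from the penalized-likelihood formula; inputs are illustrative.
import math

def aic(log_lik, p):
    return -2 * log_lik + 2 * p

def bic(log_lik, p, n):
    return -2 * log_lik + math.log(n) * p

log_lik, p, n = -120.0, 4, 100
print(aic(log_lik, p))               # 248.0
print(round(bic(log_lik, p, n), 2))  # 258.42
```

With log(100) ≈ 4.605 the BIC penalty per parameter is more than twice the AIC penalty, which is the "heavier penalty" discussed below.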

AIC is an estimate of a constant plus the relative distance between the unknown true likelihood function of the data and the fitted likelihood function of the model, so that a lower AIC means a model is considered to be closer to the truth. BIC is an estimate of a function of the posterior probability of a model being true, under a certain Bayesian setup, so that a lower BIC means that a model is considered to be more likely to be the true model. Both criteria are based on various assumptions and asymptotic approximations. Each, despite its heuristic usefulness, has therefore been criticized as having questionable validity for real world data. But despite various subtle theoretical differences, their only difference in practice is the size of the penalty; BIC penalizes model complexity more heavily. The only way they should disagree is when AIC chooses a larger model than BIC.

AIC and BIC are each approximately correct according to a different goal and a different set of asymptotic assumptions, both of which have been criticized as unrealistic. Understanding the difference in their practical behavior is easiest in the simple case of comparing two nested models. In that case, several authors have pointed out that information criteria become equivalent to likelihood ratio tests with different alpha levels. Checking a chi-squared table, we see that AIC becomes like a significance test at alpha = .16, and BIC becomes like a significance test with alpha depending on sample size, e.g., .13 for *n* = 10, .032 for *n* = 100, .0086 for *n* = 1000, and .0024 for *n* = 10000. Remember that power for any given alpha is increasing in *n*. Thus, AIC always has a chance of choosing too big a model, regardless of *n*. BIC has very little chance of choosing too big a model if *n* is sufficient, but for any given *n* it has a larger chance than AIC of choosing too small a model.
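These implied alpha levels can be reproduced numerically: for two nested models differing by one parameter, AIC prefers the larger model when the likelihood-ratio statistic exceeds 2, and BIC when it exceeds log(n), so the implied alpha is the chi-squared(1) tail probability at that threshold. This is an illustrative Python sketch, not from the paper.

```python
# Implied alpha of AIC and BIC viewed as likelihood-ratio tests (1 df).
import math
from scipy.stats import chi2

alpha_aic = chi2.sf(2, df=1)  # AIC threshold is 2
print(round(alpha_aic, 2))    # 0.16
for n in (10, 100, 1000, 10000):
    alpha_bic = chi2.sf(math.log(n), df=1)  # BIC threshold is log(n)
    print(n, round(alpha_bic, 4))
```

The printed values match the alpha levels quoted above, shrinking toward zero as n grows.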

So what’s the bottom line? In general, it might be best to use AIC and BIC together in model selection. For example, in selecting the number of latent classes in a model, if BIC points to a three-class model and AIC points to a five-class model, it makes sense to select from models with 3, 4 and 5 latent classes. AIC is better in situations when a false negative finding would be considered more misleading than a false positive, and BIC is better in situations where a false positive is as misleading as, or more misleading than, a false negative.

**Forecasting of rainfall in Rayalaseema**

The data taken to predict the rainfall of the Rayalaseema area of Andhra Pradesh cover the years 1951 to 2016 and contain monthly rainfall for each year. In this section, we fit a forecasting model to these data using the statistical tool R. In R, the forecasting model mainly requires the packages **'ggplot2'**, **'forecast'** and **'tseries'**. Install these packages using the *install.packages()* function and load them using the *library()* function as below:

**library('ggplot2')** # loads a package using library()

**library('forecast')**

**library('tseries')**

The data have been converted into a "*.csv*" file and then read into R.

**CA <- read.csv("D:/ganesh/statistics Softwares/project/Analysis2017/Ganesh/CA.csv")** # to read the data

To find the summary of the rainfall of Rayalaseema, the following command can be used:

**summary(CA)** # to see basic details of the data set

**Summary details:**

| | Min | 1st Qu. | Median | Mean | 3rd Qu. | Max |
|---|---|---|---|---|---|---|
| Rayalaseema | 0.00 | 8.17 | 43.20 | 57.83 | 88.45 | 327.60 |

The minimum and maximum rainfall values are 0.00 and 327.60, and the first and third quartiles are 8.17 and 88.45, giving a quartile deviation of 40.14. The average rainfall of Rayalaseema is 57.83 and the median rainfall is 43.20. From the above summary alone, we cannot draw any decision about the rainfall of Rayalaseema.
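The quartile deviation quoted above follows directly from the quartiles as (Q3 − Q1)/2; a one-line check:

```python
# Quartile deviation from the summary values of the Rayalaseema data.
q1, q3 = 8.17, 88.45
qd = (q3 - q1) / 2
print(round(qd, 2))  # 40.14
```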

To know the pattern of the data, we have applied the below mentioned time series command.

**Myts <- ts(CA[, 3], start=c(1951, 1), end=c(2016, 12), frequency=12)** # converting the data to a monthly time series

If we draw the graph of the rainfall, we can observe whether the data are stationary or not. To check the stationarity of the data we have applied the Augmented Dickey-Fuller test. The test is significant (p < 0.05), so the data are stationary, as can also be observed in the graph of the data and of its differences.

**plot(Myts, xlab='Year', ylab='Rainfall', main="Rainfall in Rayalaseema Area", col="blue")** # graph of the time series data

**adf.test(Myts, alternative="stationary")** # stationarity check

**Augmented Dickey-Fuller Test**

`Dickey-Fuller = -13.41, Lag order = 9, p-value = 0.01`

`Alternative hypothesis: stationary`

**Dec <- decompose(Myts)** # using the decompose function to extract the components

**plot(Dec, col="red")** # plot of the decomposed components

The four graphs show the original data, the seasonal component, the trend component and the remainder; this displays the periodic seasonal pattern extracted from the original data together with the trend. There is a bar at the right-hand side of each graph to allow a relative comparison of the magnitudes of each component. For these data the change in trend is smaller than the variation due to the monthly component.

**ARIMA model for data**

We have considered different models and compared their AIC and BIC values, taking the model with the least values as the fitted one. Finally, we applied auto.arima() to find the best fitted model for the data.

From the above table we can clearly observe that the model ARIMA (4,1,1)(1,1,1)[12] has the least AIC and BIC values and might therefore be considered the best fitted model; however, the seasonal difference taken is very high, which is why it attains the least AIC and BIC values.

**ARIMAfit <- auto.arima(Myts, approximation=FALSE, trace=FALSE)** # to build the forecasting model

The ACF plot of the residuals from the ARIMA (5,0,1)(2,0,0)[12] model shows all correlations within the threshold limits, indicating that the residuals behave like white noise. A portmanteau test returns a large p-value, also suggesting that the residuals are white noise. The PACF pattern is likewise consistent with this model. So an initial candidate model is ARIMA (5,0,1)(2,0,0)[12]; there are no other obvious candidate models.

To check the details of the ACF and PACF of the rainfall data:

**Acf(ts(Myts), main='ACF of Mean Rainfall', col="red")**


The automatically selected ARIMA model for the data is ARIMA (5,0,1)(2,0,0)[12] with non-zero mean. The predicted rainfall values are obtained using this ARIMA model with orders (5, 0, 1) and (2, 0, 0).

By applying auto.arima(), we obtained the best fitted model, which has the lowest AICc. When models are compared using AICc values, it is important that all models have the same orders of differencing. However, when comparing models using a test set, it does not matter how the forecasts were produced; the comparisons are always valid. The chosen model has passed the residual tests. In practice, we would normally use the best model we could find, even if it did not pass every test. Forecasts from the ARIMA(5,0,1)(2,0,0)_{12} model are shown in the figure below.

**Fact <- forecast(ARIMAfit, h=60)** # forecasting values

**print(Fact)** # to print the forecast values

**plot(Fact)** # plot the actual and forecast values

**Forecasting for Seasonal Differences**

In this section, we have considered the rainfall data with differences. The same interpretation has been carried out for the below mentioned model.

**plot(diff(Myts), main="Rainfall difference in Rayalaseema", ylab='Differenced Rainfall', col="red")** # graph of the differenced time series data
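First differencing, as applied by diff(Myts) above, simply subtracts each observation from the next and shortens the series by one value:

```python
# First differences: each value minus the previous one.
import numpy as np

y = np.array([10.0, 12.0, 9.0, 15.0])
print(np.diff(y))  # [ 2. -3.  6.]
```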

To check the details of the ACF and PACF of the rainfall data differences.

**References:**

- Alfaro, E. (2004). A Method for Prediction of California Summer Air Surface Temperature, Eos, Vol. 85, No. 51, 21 December 2004.
- Anisimov O.A., (2001). Predicting Patterns of Near-Surface Air Temperature Using Empirical Data, Climatic Change, Vol. 50, No. 3, 297-315.
- Box, G. E. P., Jenkins, G. M. & Reinsel, G. C. (1994). Time Series Analysis Forecasting and Control, 3rd ed., Englewood Cliffs, N.J. Prentice Hall.
- Brunetti, M., Buffoni, L., Maugeri, M. & Nanni, T., (2000). Trends of minimum and maximum daily temperatures in Italy from 1865 to 1996. Theor. Appl. Climatol., 66, 49–60.
- FAN Ke, (2009). Predicting Winter Surface Air Temperature in Northeast China, Atmospheric and Oceanic Science Letters, Vol. 2, NO. 1, 14−17.
- Hejase, H.A.N. & Assi, A.H. (2012). Time-Series Regression Model for Prediction of Mean Daily Global Solar Radiation in Al-Ain, UAE, ISRN Renewable Energy, Vol. 2012, Article ID 412471, 11 pages.
- J. C. Ramesh Reddy, T. Ganesh, M. Venkateswaran & P. R. S. Reddy (2017), “Rainfall Forecast of Andhra Pradesh Using Artificial Neural Networks”, International Journal of Current Research and Modern Education, Volume 2, Issue 2, Page Number 223-234. DOI: http://doi.org/10.5281/zenodo.1050435.
- J. C. Ramesh Reddy, T. Ganesh, M. Venkateswaran, PRS Reddy (2017), Forecasting of Monthly Mean Rainfall in Coastal Andhra, International Journal of Statistics and Applications 2017, 7(4): 197-204. DOI: 10.5923/j.statistics.20170704.01.
- Lee, J.H., Sohn, K. (2007). Prediction of monthly mean surface air temperature in a region of China, Advances in Atmospheric Sciences, Vol. 24, 3, 503-508.
- Stein, M. & Lloret, J. (2001). Forecasting of Air and Water Temperatures for Fishery Purposes with Selected Examples from Northwest Atlantic, J. Northw. Atl. Fish. Sci., Vol. 29, 23-30.