Template Credit: Adapted from a template made available by Dr. Jason Brownlee of Machine Learning Mastery.
SUMMARY: The purpose of this project is to construct a time series prediction model and document the end-to-end steps using a template. The Annual Immigration into the USA dataset is a time series problem in which we try to forecast future values from past observations.
INTRODUCTION: The problem is to forecast the annual number of people immigrating to the United States. The dataset is a time series of annual immigration counts (in thousands) covering 143 years (1820-1962), with one observation per year for 143 observations in total. We used the first 80% of the observations for training and testing various models while holding back the remaining observations for validating the final model.
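As a rough illustration of the 80/20 hold-out described above, the sketch below splits a chronologically ordered series without shuffling. The helper name `split_train_validation` and the rounding rule (truncating to a whole number of observations) are assumptions made for this example; the exact split point used in the project is not stated here.

```python
def split_train_validation(series, train_fraction=0.8):
    """Split an ordered series into an early part and a held-back late part.

    Assumption: the split point is truncated to a whole number of observations.
    """
    split_point = int(len(series) * train_fraction)
    return series[:split_point], series[split_point:]


# Stand-in for the real data: one entry per year from 1820 to 1962.
years = list(range(1820, 1963))
train_years, validation_years = split_train_validation(years)

print(len(train_years), len(validation_years))  # 114 29
print(train_years[-1], validation_years[0])     # 1933 1934
```

Under this assumed rounding, the held-back portion would begin in 1934. Keeping the split chronological matters for time series, because a shuffled split would let the model see observations from the period it is later asked to forecast.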
ANALYSIS: The baseline (persistence) prediction for the dataset resulted in an RMSE of 38312. After a grid search over the ARIMA parameters, the selected non-seasonal order was (0, 1, 2). However, the chosen model scored an RMSE of 61789 on the validation data, which was substantially worse than the baseline model. We can partly attribute this loss in prediction performance to the exceptionally low immigration during the 1930s, the Great Depression era in the United States.
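The evaluation steps above can be sketched as a walk-forward loop: forecast one step ahead, record the prediction, then add the true observation to the history. This is a minimal sketch, not the project's actual code. All function names are hypothetical, the toy series is made up for illustration, and the ARIMA part assumes the statsmodels library is installed (it is imported only when an ARIMA forecast is actually requested).

```python
from math import sqrt


def rmse(actual, predicted):
    """Root mean squared error between two equal-length sequences."""
    return sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual))


def persistence_forecast(history):
    """Baseline: predict that the next value equals the last observed value."""
    return history[-1]


def walk_forward_rmse(train, test, forecast):
    """Score one-step-ahead forecasts, growing the history after each step."""
    history = list(train)
    predictions = []
    for observed in test:
        predictions.append(forecast(history))
        history.append(observed)
    return rmse(test, predictions)


def arima_forecast(order):
    """Build a one-step forecaster for a non-seasonal ARIMA order (p, d, q)."""
    def forecast(history):
        # Assumption: statsmodels is available; imported lazily on first use.
        from statsmodels.tsa.arima.model import ARIMA
        fitted = ARIMA(history, order=order).fit()
        return float(fitted.forecast(steps=1)[0])
    return forecast


def grid_search_arima(train, test, p_values, d_values, q_values):
    """Return the (p, d, q) order with the lowest walk-forward RMSE."""
    best_order, best_score = None, float("inf")
    for p in p_values:
        for d in d_values:
            for q in q_values:
                try:
                    score = walk_forward_rmse(train, test, arima_forecast((p, d, q)))
                except Exception:
                    continue  # skip orders that fail to fit
                if score < best_score:
                    best_order, best_score = (p, d, q), score
    return best_order, best_score


# Made-up toy values, used only to exercise the persistence baseline.
toy_series = [10.0, 12.0, 11.0, 15.0, 14.0, 18.0]
toy_train, toy_test = toy_series[:4], toy_series[4:]
baseline = walk_forward_rmse(toy_train, toy_test, persistence_forecast)
print(round(baseline, 4))  # 2.9155
```

A call such as `grid_search_arima(train, test, range(0, 3), range(0, 3), range(0, 3))` would then compare candidate orders against the persistence score. The search ranges shown here are illustrative, and refitting the model at every step can be slow on longer series.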
CONCLUSION: For this dataset, the chosen ARIMA model did not achieve a satisfactory result. We should acquire more data or find a way to account for external factors, such as the state of the economy, in our time series modeling effort.
Dataset Used: Annual immigration into the United States, 1820-1962
Dataset ML Model: Time series forecast with numerical attributes
Dataset Reference: Rob Hyndman and Yangzhuoran Yang (2018). tsdl: Time Series Data Library. v0.1.0. https://pkg.yangzhuoranyang.com/tsdl/.
The HTML formatted report can be found here on GitHub.