Models Dealing with Real World Data

Nov. 12, 2018

Forecasting, nowcasting or backcasting: these predictions require state-of-the-art tools, and the dynamic factor model (DFM) is key among them. Seton Leonard from LogIndex provides some insights

Noisy data

When it comes to processing large real-world data for forecasting, nowcasting (predicting contemporaneous unobserved information) or backcasting (predicting past information that was not observed), sophisticated predictive methods are required. Time series data from real-world applications can be messy, chaotic and large. Observations may be inaccurate ("noisy" in the jargon) or missing altogether. In addition, data series might start and end at different dates, and the information can be recorded at different intervals (e.g. daily, weekly and monthly).
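To make the ragged-edge problem concrete, here is a small sketch of what such input data can look like once aligned on a common calendar. The series, dates and values are invented for illustration, not actual LogIndex inputs:

```python
import pandas as pd
import numpy as np

# Hypothetical mixed-frequency inputs: a daily price series, and a
# monthly survey series that starts later and has a missing value.
days = pd.date_range("2018-01-01", "2018-03-31", freq="D")
daily = pd.Series(np.linspace(100.0, 110.0, len(days)), index=days)

months = pd.date_range("2018-02-01", "2018-03-01", freq="MS")
monthly = pd.Series([1.5, np.nan], index=months)  # March value missing

# Bring everything to a common monthly panel: aggregate the daily
# series to month start, then join. Missing cells stay NaN -- exactly
# the ragged pattern a dynamic factor model has to cope with.
panel = pd.concat(
    {"price": daily.resample("MS").mean(), "survey": monthly}, axis=1
)
print(panel)
```

The resulting panel is square in shape but full of NaNs; standard regression routines would simply drop those rows, discarding most of the information.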

LogIndex framework

Consider LogIndex's task of estimating industrial production in the USA for the current month: the input data series we use for this task include daily trade and price data, weekly employment data from the US Employment and Training Administration, monthly survey data with early release dates, logistics data and past industrial production itself. Try plugging this information into any standard statistical software, and it will crash. Bayesian dynamic factor models offer a means of bringing all this information together in one consistent model or framework.

Use of Latent Variables

To overcome the challenges of real-world data mentioned above, dynamic factor models describe the changes in observable series over time using "latent variables", that is, unobserved determinants of observed outcomes. As in the physical sciences, one way to extract this meaningful signal (the signal of the latent variables) is via the Kalman filter. This approach begins by predicting next period (e.g. next month) outcomes. Then, as new data are released, these predictions are updated to incorporate all newly available information. Because updates happen only when observations actually arrive, the models handle missing values naturally.
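The predict/update cycle just described can be sketched with a minimal univariate Kalman filter. This is a local-level model (the latent state is a random walk), and all variances below are illustrative assumptions, not values used by LogIndex:

```python
import numpy as np

def kalman_filter(y, q=0.1, r=1.0):
    """Filter a series containing missing values (np.nan).

    q: variance of the latent state innovation (assumed)
    r: variance of the observation noise (assumed)
    """
    n = len(y)
    x, p = 0.0, 1e6            # diffuse initial state and variance
    states = np.empty(n)
    for t in range(n):
        p = p + q              # predict: state uncertainty grows
        if not np.isnan(y[t]):          # update only when data arrives
            k = p / (p + r)             # Kalman gain
            x = x + k * (y[t] - x)
            p = (1.0 - k) * p
        states[t] = x          # missing periods keep the prediction
    return states

y = np.array([1.0, 1.2, np.nan, np.nan, 1.1, 0.9])
est = kalman_filter(y)
```

When an observation is missing, the filter simply carries its prediction forward; when one arrives, the prediction is pulled toward the data in proportion to the Kalman gain.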

In the example below, we use a dynamic factor model to estimate a vector autoregression with missing or noisy data. In this case, our model also provides estimates of a cleaned, square data set.

Example: we have three noisy series with many missing values. The dashed lines give estimates of the true underlying data, cleaning up the noise and filling in the missing observations - these are our latent variables.
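A toy version of the situation in the figure can be simulated directly: a single AR(1) latent factor drives three noisy series, observations are knocked out at random, and a Kalman filter recovers the factor from whatever happens to be observed each period. The loadings, variances and missing-data rate below are all made up for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulate one AR(1) factor driving three noisy observed series
T, a, L = 120, 0.8, np.array([1.0, 0.5, -0.7])
f = np.zeros(T)
for t in range(1, T):
    f[t] = a * f[t - 1] + rng.normal()
y = np.outer(f, L) + rng.normal(scale=0.5, size=(T, 3))
y[rng.random((T, 3)) < 0.3] = np.nan     # ~30% missing observations

# Kalman filter for the scalar factor; at each date we update with
# whichever series happen to be observed.
q, R = 1.0, 0.25 * np.eye(3)             # assumed state/obs variances
x, p = 0.0, 10.0
fhat = np.empty(T)
for t in range(T):
    x, p = a * x, a * a * p + q          # predict step
    obs = ~np.isnan(y[t])
    if obs.any():                        # update on observed rows only
        H = L[obs].reshape(-1, 1)
        S = H @ H.T * p + R[np.ix_(obs, obs)]
        K = p * H.T @ np.linalg.inv(S)   # Kalman gain (1 x n_obs)
        x = x + float(K @ (y[t, obs] - H[:, 0] * x))
        p = float((1.0 - K @ H) * p)
    fhat[t] = x

# fhat tracks the true factor despite the noisy, incomplete data
corr = np.corrcoef(f, fhat)[0, 1]
```

Multiplying the filtered factor by the loadings gives fitted values for every series at every date, i.e. the cleaned, square data set mentioned above.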


Bayesian Estimation

While in the physical sciences one often has an existing model for signal extraction (for example, velocity is the first derivative of position with respect to time and acceleration the second), in econometrics and statistics we typically need to estimate a model. Factor models reduce the dimension of the forecasting problem from the entire set of observables to a few latent variables, much like principal components, but many parameters may still need to be estimated. This can lead to over-parameterisation and overfitting of the model, resulting in a good in-sample fit but poor out-of-sample performance (on new, unseen data). To reduce the risk of misleading in-sample results, and to improve out-of-sample performance, we can place prior beliefs on the parameters. If we are hoping to nowcast only one or two series, a particularly useful "prior" is the belief that the underlying factors should look something like the series we would like to predict. Because the resulting posterior distributions are not available in closed form, we use Markov Chain Monte Carlo (MCMC) to simulate posterior distributions of the parameters and the underlying factors.
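The shrinkage effect of a prior can be seen in a much simpler setting than a full DFM. Below is a sketch of Bayesian linear regression with a zero-mean normal prior on the coefficients; every number is illustrative. With many regressors and few observations, ordinary least squares overfits, while the posterior mean is pulled toward the prior:

```python
import numpy as np

rng = np.random.default_rng(1)

# 20 observations, 15 regressors, only one of which matters
n, k = 20, 15
X = rng.normal(size=(n, k))
beta_true = np.zeros(k)
beta_true[0] = 1.0
y = X @ beta_true + rng.normal(size=n)

sigma2, tau2 = 1.0, 0.5                  # assumed noise / prior variances
# Posterior mean under a N(0, tau2 * I) prior:
# (X'X / sigma2 + I / tau2)^{-1} X'y / sigma2
post_mean = np.linalg.solve(X.T @ X / sigma2 + np.eye(k) / tau2,
                            X.T @ y / sigma2)
ols = np.linalg.lstsq(X, y, rcond=None)[0]
```

The posterior mean here is the ridge estimator, and its overall magnitude is strictly smaller than that of the unregularised OLS fit; in the DFM setting the same mechanism keeps the many loadings and transition parameters from chasing noise.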

Putting Predictive Statistics into Practice

Although we have focused here on the basics of Bayesian estimation of dynamic factor models, there is much more that goes into cutting-edge predictive methodology. At LogIndex we calculate thousands of estimates daily. We do this by running multiple Bayesian models and pooling their forecasts to obtain a final estimate for each indicator. Forecast pooling combines multiple forecast models in order to obtain even more accurate predictions. Moreover, these models need to incorporate mixed-frequency data: typically we nowcast monthly variables, but most of the input is daily.
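One common pooling scheme weights each model by its track record. The sketch below combines three hypothetical nowcasts with weights inversely proportional to each model's past squared errors; the numbers are invented and this is not necessarily the weighting LogIndex uses:

```python
import numpy as np

forecasts = np.array([2.1, 1.8, 2.4])   # one nowcast per model (invented)
past_mse = np.array([0.5, 0.2, 1.0])    # each model's historical error

# Inverse-MSE weights: better-performing models count for more
w = (1.0 / past_mse) / (1.0 / past_mse).sum()
pooled = float(w @ forecasts)
```

The pooled estimate leans toward the second model, which has the best historical record, while still letting the others contribute.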

At LogIndex we continuously scan for new and better sources of real-world data and continue to develop cutting edge predictive methodologies. Our aim is to deliver the best estimates of macroeconomic indicators, up to 60 days in advance for different countries and sectors.

Seton Leonard is a developer at LogIndex, the data company of Kuehne + Nagel Group. He is a data scientist and full-stack developer specialising in programming time series estimators for forecasting and nowcasting, including factor models and large Bayesian VARs. He holds a PhD focused on macro-econometrics and macroeconomic theory from the Graduate Institute of International and Development Studies in Geneva. Leonard has programming experience in R, C++, Matlab, Docker, JavaScript and Python.

The new working paper by our colleague Dr. Seth Leonard - Practical Implementation of Dynamic Factor Models, free to download - provides an overview of how the models used by LogIndex work. For all the scientific and technical details, the paper walks you through the necessary derivations, with plenty of detail on how to put these models to work, and gives you an idea of what goes on under the hood at gKNi.