New England Points

Abstract

Even professional precipitation forecasts have a remarkably low level of accuracy. Most of these forecasts are produced using numerical modeling, but an alternative is to train machine learning algorithms on historical weather and precipitation data and use the trained models to make predictions from current weather data. In this paper we implement a feed-forward back propagation neural network to predict precipitation in Hanover, NH. We then compare these results with other algorithms, including a radial basis function neural network, a random forest, k-nearest neighbors, and logistic regression. This work aims to improve on previous research in precipitation prediction by considering a more generalized neural network structure.

1. Background

Meteorologists still lack the ability to produce truly accurate weather forecasts, especially with respect to precipitation. Nature’s complex phenomena and the scope of the problem in both time and space lead to predictions that sometimes vary greatly from the actual weather experienced (Santhanam et al., 2011). Figure 1.1 (HPC, 2012), provided by the National Oceanic and Atmospheric Administration (NOAA), shows the accuracy of quantitative precipitation forecasts from three different professional prediction models: the North American Mesoscale (NAM), the Global Forecast System (GFS), and the Hydrometeorological Prediction Center (HPC). A threat score ranges from 0 to 1 and measures how accurate a prediction was, with 1 being a perfect prediction. The inaccuracy of these predictions provides motivation for developing a better algorithm for predicting precipitation.
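To make the threat score concrete, the sketch below assumes the common contingency-table definition used in forecast verification (also called the critical success index): hits divided by the sum of hits, misses, and false alarms. The function name and the counts are hypothetical and are not taken from the NOAA figure.

```python
def threat_score(hits: int, misses: int, false_alarms: int) -> float:
    """Threat score (critical success index) for a yes/no event forecast.

    Assumed definition: hits / (hits + misses + false alarms).
    Correct "no event" forecasts do not enter the score.
    """
    denom = hits + misses + false_alarms
    if denom == 0:
        # Undefined when the event was never forecast and never observed.
        return float("nan")
    return hits / denom


# Hypothetical counts, for illustration only.
score = threat_score(hits=30, misses=20, false_alarms=50)
perfect = threat_score(hits=10, misses=0, false_alarms=0)
```

Under this definition a forecast that never misses and never raises a false alarm scores 1, which matches the "perfect prediction" end of the scale described above.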


2. State of the Art

Traditionally, weather forecasting is done via two different methods. The first and foremost is numerical modeling, in which physical simulations are run on computers (Santhanam et al., 2011). Although there has been much work and progress in this field, precipitation forecasts are still far from perfect.

Neural networks have been used by meteorologists to accurately model many different weather phenomena such as humidity and temperature (Maqsood et al., 2004). Santhanam et al. worked to predict rain as a classification problem using a two-layer back propagation feed-forward neural network as well as radial basis function networks. They reported classification error rates of 18% and 11.51% for their feed-forward network and radial basis function network respectively (Santhanam et al., 2011). Other work has used neural networks to predict precipitation, but it has been limited to simple network architectures and short prediction times (French et al., 1992), (Kuligowski et al., 1998), (Hall et al., 1999).


3. Data

All of our data was obtained from the Physical Sciences Division of the National Oceanic and Atmospheric Administration. We determined which weather variables are most relevant to precipitation, and how far from Hanover measurements must extend to predict it, through an interview with a meteorology Ph.D. candidate at the University of Utah. We concluded that on the earth’s surface we should consider wind direction, relative humidity, and mean sea level pressure. At elevations (measured in pressure) of 500 mb, 700 mb, and 850 mb we should use wind direction, relative humidity, specific humidity and geopotential height. For these variables, predicting precipitation one day in advance can depend on measurements up to and over 1,000 miles away. Thus we selected the measurement locations for our data to approximate a box surrounding Hanover of approximately 2,000 miles in width and height. Figure 3.1 shows this box, with the blue marker indicating the location of Hanover and red markers representing the locations of variable measurements.

This provided us with a total of 23 years of data sampled at 6-hour intervals, corresponding to 33,600 individual samples. Using the 13×10 grid of measurements shown in Figure 3.1 for our 19 variables of choice (four on the surface, five at each of three different pressure levels) gave us 2,470 input features. The data was preprocessed to normalize each input feature to a mean of zero and a standard deviation of one.
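The normalization step and the feature count can be sketched as follows. This is a minimal illustration that assumes the population standard deviation is used and that constant columns are left unscaled; the function name and the toy values are hypothetical, not the project's actual preprocessing code.

```python
import statistics


def standardize_columns(rows):
    """Scale each feature column to zero mean and unit standard deviation.

    Assumption: population standard deviation; a constant column falls
    back to a divisor of 1.0 to avoid dividing by zero.
    """
    columns = list(zip(*rows))
    means = [statistics.fmean(col) for col in columns]
    stds = [statistics.pstdev(col) or 1.0 for col in columns]
    scaled = [[(v - m) / s for v, m, s in zip(row, means, stds)] for row in rows]
    return scaled, means, stds


# Feature count implied by the text: a 13 x 10 grid with 19 variables per point.
n_features = 13 * 10 * 19

# Toy data: two features (e.g. a pressure and a humidity), three samples.
toy = [[1000.0, 40.0], [1010.0, 60.0], [1020.0, 80.0]]
scaled, means, stds = standardize_columns(toy)
```

In practice the means and standard deviations would typically be computed on the training years only and then reused to scale the test years, so that no test information leaks into training.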


4. Algorithms

4.1 Neural Network

We have implemented a feed-forward back propagation neural network. The structure of the network has been generalized to allow for any number of hidden layers with any number of nodes in each layer. This means we have several hyperparameters that need to be tuned, such as the number of hidden layers and the number of nodes in each layer. The structure of the network can be seen in Figure 4.1. Networks with only two hidden layers have been shown to be able to model any continuous function with arbitrary precision. With three or more hidden layers these networks can generate arbitrary decision boundaries (Bishop, 1995). Furthermore, the output of a neural network for a classification problem has been shown to provide a direct estimate of the posterior probabilities (Zhang, 2000). The ability to model arbitrary decision boundaries is of particular interest to us, since weather phenomena have been shown to be highly non-linear and dependent on a large number of variables.

The neural network is trained with the generalized back propagation algorithm, which minimizes the sum-squared error as a function of the weights on each node output and the bias values. An exact value for the gradient can be computed through application of the chain rule. Our implementation is a variation of the one outlined by K. Ming Leung in his lecture on back-propagation in multilayer perceptrons (Leung, 2008). The weights are initialized by sampling from a Gaussian distribution with a mean of zero and a variance of the square root of the number of input variables, as suggested by Bishop (Bishop, 1995). Below we discuss the generalized learning rule and the methodology for our gradient descent method.

4.1.1 Neural Network Notation

Consider a general neural network with L layers, excluding the input layer but including the output layer, as shown in Figure 4.1.

In Figure 4.1, layer $l$ has $N(l)$ nodes denoted $X_i^{(l)} \in \{X_1^{(l)}, \dots, X_{N(l)}^{(l)}\}$. We will consider each node to have the same activation function, denoted $f$. However, one could easily extend this algorithm to have a different activation function for each node or layer. $w_{ij}^{(l)}$ is the weight factor from node $X_i^{(l-1)}$ to node $X_j^{(l)}$. We can also define an $N(l-1) \times N(l)$ weight matrix, $W^{(l)}$, for each layer, whose elements are $w_{ij}^{(l)}$. In addition, every node will have a bias term defined as $b_j^{(l)}$.

For a given set of labeled training vectors, $T = \{(Z_1, Y_1), \dots, (Z_M, Y_M)\}$, the goal of training is to minimize the sum squared error,

where $t_i$ is the output of the neural network for the input vector $Z_i$. For our algorithm we will only be considering one training example at a time, so E becomes just a single element. We then need to compute the gradient of E with respect to all of the weights.

Via back propagation we can derive the following update rules:

where $s_n^{(L)}$ is defined as:

and $\dot{f}$ is the derivative of the activation function. For the remaining internal hidden layers we get the following:

where $s_j^{(l)}$ can be found recursively through the following equation:
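For reference, the standard per-example back propagation equations can be written in the notation of this section as sketched below. This is a textbook form under stated assumptions, not necessarily the exact expressions used in our implementation: the net input $u_j^{(l)}$, the node activation $a_j^{(l)}$, and the learning rate $\alpha$ are symbols introduced here for illustration, and the error is assumed to carry no factor of one half.

```latex
\begin{align*}
% Forward pass for one training pair (Z, Y), with a^{(0)} = Z and t = a^{(L)}.
u_j^{(l)} &= \sum_{i=1}^{N(l-1)} w_{ij}^{(l)}\, a_i^{(l-1)} + b_j^{(l)},
  & a_j^{(l)} &= f\!\left(u_j^{(l)}\right) \\
% Single-example sum-squared error.
E &= \sum_{n=1}^{N(L)} \left(t_n - Y_n\right)^2 \\
% Output layer: sensitivity and update rules.
s_n^{(L)} &= \frac{\partial E}{\partial u_n^{(L)}}
  = 2\left(t_n - Y_n\right)\dot{f}\!\left(u_n^{(L)}\right) \\
\Delta w_{in}^{(L)} &= -\alpha\, s_n^{(L)}\, a_i^{(L-1)},
  & \Delta b_n^{(L)} &= -\alpha\, s_n^{(L)} \\
% Hidden layers l = L-1, ..., 1: recursive sensitivity and update rules.
s_j^{(l)} &= \dot{f}\!\left(u_j^{(l)}\right) \sum_{k=1}^{N(l+1)} w_{jk}^{(l+1)}\, s_k^{(l+1)} \\
\Delta w_{ij}^{(l)} &= -\alpha\, s_j^{(l)}\, a_i^{(l-1)},
  & \Delta b_j^{(l)} &= -\alpha\, s_j^{(l)}
\end{align*}
```

Each sensitivity $s_j^{(l)}$ is the derivative of $E$ with respect to the net input of node $j$ in layer $l$, so the recursion is the chain rule applied one layer at a time from the output back toward the input.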

Due to a heavy majority of examples with no precipitation, our initial results provided low classification error rates but classified everything as a non-precipitation event. After experimenting with different solutions to this problem, we settled on modifying our training algorithm to probabilistically sample precipitation examples evenly with non-precipitation examples during the stochastic training. Training also used a line-search method to dynamically adjust the learning parameter during training.
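The balanced sampling idea can be illustrated with the minimal sketch below. It assumes each stochastic training step first picks a class with probability one half and then draws a sample uniformly from that class; the function name and the toy 5% precipitation rate are hypothetical.

```python
import random


def balanced_sample(pos_indices, neg_indices, rng):
    """Pick one training index, choosing each class with probability 1/2."""
    pool = pos_indices if rng.random() < 0.5 else neg_indices
    return rng.choice(pool)


rng = random.Random(0)

# Toy labels: 5% precipitation (1), 95% non-precipitation (0).
labels = [1] * 50 + [0] * 950
pos = [i for i, y in enumerate(labels) if y == 1]
neg = [i for i, y in enumerate(labels) if y == 0]

# Draw many training indices and measure how often precipitation is seen.
draws = [balanced_sample(pos, neg, rng) for _ in range(10_000)]
pos_fraction = sum(labels[i] for i in draws) / len(draws)
```

Although only 5% of the toy samples are precipitation events, roughly half of the draws are, so the network can no longer minimize its error simply by predicting "no precipitation" everywhere.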

4.2 Radial Basis Function Neural Network

We also implemented a specific kind of neural network called a radial basis function (RBF) network, the structure of which can be seen in Figure 4.2.

An RBF network has only one hidden layer composed of an arbitrary number of radial basis functions, functions that are only dependent on the distance of the input from some point. In this case we call these points the centers, and pick k centers in our sample space to correspond with k unique radial basis functions. Two general forms of radial basis functions that we chose to use are the Gaussian and thin-plate spline, shown below, respectively.

For each of these equations, x is the input sample vector being trained on or tested, and $c_j$ is the jth center vector. There are various ways to define the scaling parameter $\sigma_j$ in the Gaussian function for an RBF network (Schwenker et al., 2001). Two definitions that we tested are $\sigma_j = d_{max}/\sqrt{2m}$ and $\sigma = 2 d_{mean}$, where $d_{mean}$ is the mean distance between all samples and centers, $d_{max}$ is the maximum distance of all samples from center j, and m is the number of samples. After preliminary testing showed very similar results for each definition, we chose to use $\sigma = 2 d_{mean}$ for computational efficiency.
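The two basis functions and the chosen scaling parameter can be sketched as follows. The sketch assumes the usual textbook forms, a Gaussian $\exp(-r^2 / 2\sigma^2)$ and a thin-plate spline $r^2 \ln r$ with $r$ the Euclidean distance from the sample to the center; the function names and toy vectors are hypothetical.

```python
import math


def euclidean(x, c):
    """Euclidean distance between a sample x and a center c."""
    return math.sqrt(sum((xi - ci) ** 2 for xi, ci in zip(x, c)))


def gaussian_rbf(x, c, sigma):
    """Gaussian basis function, assumed form exp(-r^2 / (2 sigma^2))."""
    r = euclidean(x, c)
    return math.exp(-(r * r) / (2.0 * sigma * sigma))


def thin_plate_spline_rbf(x, c):
    """Thin-plate spline, assumed form r^2 ln(r), defined as 0 at r = 0."""
    r = euclidean(x, c)
    return 0.0 if r == 0.0 else r * r * math.log(r)


def sigma_from_mean_distance(samples, centers):
    """sigma = 2 * d_mean, with d_mean the mean sample-to-center distance."""
    dists = [euclidean(x, c) for x in samples for c in centers]
    return 2.0 * sum(dists) / len(dists)


# Toy example: two samples, one center at the origin.
samples = [[0.0, 0.0], [3.0, 4.0]]
centers = [[0.0, 0.0]]
sigma = sigma_from_mean_distance(samples, centers)
g = gaussian_rbf(samples[1], centers[0], sigma)
tps = thin_plate_spline_rbf(samples[1], centers[0])
```

The Gaussian decays toward zero away from its center at a rate set by σ, while the thin-plate spline has no scale parameter and grows with distance.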

For selection of our centers, we focused on random selection from our sample space in two ways: first, selecting random samples from our training set, and second, selecting a random k × n matrix of feature values from our training matrix X. Other ways to pick center vectors include using a k-means clustering algorithm and using orthogonal least squares to pick k vectors in the sample space. Centers were chosen using a Euclidean distance metric.

After designing our algorithm, we chose to use k = 100, 500, and 1000 for the Gaussian as well as the thin-plate spline RBF to get a set of predictions for a wide spread of hidden layer sizes and two different RBFs.

The advantage of RBF networks is that the optimization can be done in two parts and in closed form, which makes the computation easier and allows us to consider much larger hidden layers. Given a weight matrix W and a matrix of values calculated by our basis functions, ϕ, the general form of our network is

where x is some sample. Using a sum-of-squares error, we can then reformulate this as

where Y is a matrix of outputs for the training set X. We then have that the weight matrix W can be solved for as

where ϕ† is the pseudo-inverse of ϕ (Bishop, 1995). Thus our training process is to first select centers cj, for j = 1,…,k, and calculate each row of our matrix ϕ by plugging each training sample xi into our radial basis functions with centers C. We can then use this matrix and our output training set Y to solve for the weight matrix. To classify a test sample x, we then evaluate y(x) = Wϕ(x).
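The closed-form training step can be summarized in standard least-squares notation as sketched below. The capital $\Phi$ for the $m \times k$ design matrix (one row per training sample), the transpose convention on $W$, the factor of one half, and the Frobenius norm are assumptions made here for illustration.

```latex
\begin{align*}
% Network output for a sample x, with one basis function per center.
y(x) &= W\,\phi(x), \qquad \phi(x) = \bigl(\phi_1(x), \dots, \phi_k(x)\bigr)^{\top} \\
% Design matrix: row i holds the basis function values for training sample x_i.
\Phi_{ij} &= \phi_j(x_i), \qquad i = 1, \dots, m, \quad j = 1, \dots, k \\
% Sum-of-squares error over the training set.
E(W) &= \tfrac{1}{2}\, \bigl\lVert \Phi W^{\top} - Y \bigr\rVert_F^{2} \\
% Least-squares solution via the pseudo-inverse.
W^{\top} &= \Phi^{\dagger} Y, \qquad
\Phi^{\dagger} = \bigl(\Phi^{\top} \Phi\bigr)^{-1} \Phi^{\top}
\ \text{ when } \Phi \text{ has full column rank}
\end{align*}
```

Because the centers are fixed before this step, the error is quadratic in $W$ and no iterative gradient descent is needed, which is what makes hidden layers of 1000 basis functions practical.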


5. Results

5.1 Training and Testing standards

For all of our training and testing we used the following standards. We used data sets from the first 20 years as our training set. The data has been randomly reordered and pre-processed as discussed above. For testing we used the last 3 years of data.

As classification is our initial concern, we needed to convert our output values to boolean values. To do this we considered precipitation greater than 0.1 inches to be significant, and thus classified outputs greater than 0.1 in as precipitation. Predictions of less than 0.1 in were set to zero as non-precipitation.

There is no easy way to rank our classifiers as the “best”, since we could tune our model to predict either of the two classes arbitrarily accurately at the expense of the other. We therefore present the classification errors on both classes as well as the overall classification error to provide more information as to the success of our model.
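The thresholding and the three reported error rates can be sketched as follows. The helper names and toy amounts are hypothetical, and the sketch assumes that an amount of exactly 0.1 in counts as non-precipitation.

```python
THRESHOLD_IN = 0.1  # inches; amounts above this count as precipitation


def to_labels(amounts, threshold=THRESHOLD_IN):
    """Convert precipitation amounts (inches) to boolean class labels."""
    return [amount > threshold for amount in amounts]


def error_rates(predicted, actual):
    """Return (precipitation error, non-precipitation error, overall error)."""
    pos_total = sum(actual)
    neg_total = len(actual) - pos_total
    # A precipitation event predicted as dry, and a dry period predicted as wet.
    pos_wrong = sum(1 for p, a in zip(predicted, actual) if a and not p)
    neg_wrong = sum(1 for p, a in zip(predicted, actual) if not a and p)
    pos_err = pos_wrong / pos_total if pos_total else float("nan")
    neg_err = neg_wrong / neg_total if neg_total else float("nan")
    overall = (pos_wrong + neg_wrong) / len(actual)
    return pos_err, neg_err, overall


# Toy observations: 2 of 10 periods exceed the threshold.
observed = to_labels([0.0, 0.05, 0.3, 0.0, 0.5, 0.0, 0.0, 0.02, 0.0, 0.0])

# A degenerate classifier that always predicts "no precipitation".
always_dry = [False] * len(observed)
rates = error_rates(always_dry, observed)
```

Here the always-dry classifier has an overall error of only 20% while missing every precipitation event, which is why the overall error alone is not a sufficient measure.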

5.2 Neural Network Classifier

In our final version of the neural network we were able to get informative results from training. We ran 10-fold cross validation across 21 different network complexities, attempting to train our network to predict precipitation 6 hours in the future. These cases trained different network structures ranging from two to four hidden layers with 5, 10 or 20 hidden nodes per layer. The best average validation error for the classification data was 0.1509, with a network structure of 3 layers with five nodes in each of the two non-output layers. Figure 5.1 shows an example of the error on the test and validation sets as a function of the epoch during training. The network had two hidden layers with three nodes and one node in layers one and two, respectively. For visualization purposes, Figure 5.2 shows both our classification and regression results for nine of these cases where the number of hidden nodes was the same at each layer.
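The 10-fold cross validation splits can be generated as in the sketch below. It assumes a single random shuffle followed by round-robin assignment to folds; the function name, the seed, and the toy sample count are hypothetical.

```python
import random


def k_fold_indices(n_samples, k, seed=0):
    """Yield (training indices, validation indices) for each of k folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    # Round-robin assignment keeps fold sizes within one sample of each other.
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        validation = folds[i]
        training = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield training, validation


splits = list(k_fold_indices(n_samples=100, k=10))
```

Each network structure would be trained once per split, and the validation errors averaged over the ten folds to give figures such as the 0.1509 reported above.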

Having observed that, in general, the simpler neural network structures generalize best to unseen data, we re-ran our 10-fold cross validation considering only two-layer networks. In this experiment we varied the number of hidden nodes in the first layer from one to ten. Figure 5.3 shows the average validation error and average training error for these 10 cases. However, we can see that there was little trend from simple to complex structure.

Having achieved fairly good success classifying precipitation 6 hours in advance, we then moved on to predict further into the future, running our training algorithm on prediction data sets of 12, 18 and 24 hours. Due to excessive training times we were unable to run 10-fold cross validation and instead trained each case only once. Unfortunately, due to the stochastic nature of our training method, this provides a rather inaccurate and noisy look into the prediction potential of our model for the different network structure cases. The best results for these longer prediction times are contained in Table 6.1.

Figure 5.4 illustrates the best results as we increased prediction time. We can see that in general our predictions get worse as time increases. The decision trees were able to maintain a fairly low overall error, but were among the worst at classifying precipitation.

After training our RBF algorithm on the training data for both the Gaussian and thin-plate spline over time periods of 6, 12, 18 and 24 hours, and for hidden layer sizes of k = 100, 500, and 1000, we ran our test data through the resulting classifiers. Table 5.1 shows the error for precipitation events, the error for non-precipitation events and the overall error for each of the resulting classifiers.

As can be seen in Table 5.1, the RBF algorithm performed well, especially when extending our forecasts to longer time intervals. Both the Gaussian and thin-plate spline produced good results, with the Gaussian proving to be marginally more accurate. Although the RBF classification of precipitation events six hours in advance could not compete with the standard neural network results, as we increased the prediction timeframe all the way to 24 hours, the RBF network demonstrated a robustness that we saw in no other algorithm. Part of this robustness, however, came from some over-classification of precipitation events, leading to a higher overall error. The RBF error in classifying precipitation events in the training set can be seen in Figure 5.5, the error for non-precipitation events can be seen in Figure 5.6, and the overall error can be seen in Figure 5.7. Note that these figures contain results from both the Gaussian and thin-plate spline basis functions, for k-values of 100, 500 and 1000, and time intervals of 6, 12, 18 and 24 hours.

It is interesting to notice that the plots for the classification error of non-precipitation events and overall classification error are very similar, but in fact it makes sense. There are so many more samples of non-precipitation than precipitation events that the total error rate and non-precipitation error rate should be similar, as non-precipitation events make up a very large proportion of the entire test set. A similar correlation can be seen in other algorithms that produced very low total error rates, which were initially convincing, until it was determined that the algorithm had simply classified everything as non-precipitation.

The over-classification done by the RBF network can be seen in the opposite trends of error in precipitation events and non-precipitation events as the time frame increased. To maintain a robust accuracy in predicting precipitation events, the algorithm began classifying more precipitation events over longer timeframes, clearly not all of which were accurate. Thus the accuracy of precipitation classification actually increased slightly as the timeframe increased, but the overall accuracy and accuracy of non-precipitation simultaneously declined linearly.

5.3 Comparison

To compare the results of our networks to a baseline, we also ran our data through a few simpler models. We chose a forest of 20 randomly generated decision trees, k-nearest neighbors (kNN) and logistic regression.


6. Conclusion

A comparison between our best classification results for each model can be seen in Table 6.1. We can see that both our neural networks and the radial basis function networks performed very well against the simpler algorithms. The radial basis function network also performed better than the other algorithms as we moved out to longer prediction times.

The RBF network could be enhanced by running cross-validation, something computationally intensive but favorable to improved results. It is also worth looking into alternative basis functions, as we only considered two of many possible options. The Gaussian performed better of the two, probably due to its inclusion of a scaling factor gauging the general distance between samples and centers, and there are other basis functions with similar scaling parameters. Last, the choice of centers is a critical part of the RBF network, and implementing either clustering algorithms or an orthogonal least squares method could offer significant improvement on results.

Overall, our classification networks compared favorably with those published by Santhanam et al. Furthermore, we were able to achieve moderate success looking further into the future with 12, 18 and 24-hour predictions. Our best classification result for predicting precipitation six hours ahead came from a neural network structure, which produced error rates of 12.1% and 17.9% on the precipitation and non-precipitation examples respectively. This presents an improvement over the Santhanam group, which achieved error rates of 19.7% and 16.3% classifying rain and no rain respectively.

Future work could look into more network structures, especially for regression. Most of our work focused on classification of precipitation versus non-precipitation and not optimizing the regression networks. Very large and complex networks with large numbers of nodes were not considered due to overwhelming training time but these could provide better modeling for the larger test frames. Additionally, further work could look to vary the activation function or the connectivity of a neural network.


Contact Chris Hoder at

Contact Ben Southworth at



1. Bishop, Christopher M. “Neural networks for pattern recognition.” (1995): 5.

2. French, Mark N., Witold F. Krajewski, and Robert R. Cuykendall. “Rainfall forecasting in space and time using a neural network.” Journal of Hydrology 137.1 (1992): 1-31.

3. Ghosh, Soumadip, et al. “Weather data mining using artificial neural network.” Recent Advances in Intelligent Computational Systems (RAICS), 2011 IEEE. IEEE, 2011.

4. Hall, Tony, Harold E. Brooks, and Charles A. Doswell III. “Precipitation forecasting using a neural network.” Weather and forecasting 14.3 (1999): 338-345.

5. HPC Verification vs. the models Threat Score. National Oceanic and Atmospheric Administration Hydrometeorological Prediction Center, 2012. Web. 20 Jan. 2013.

6. Hsieh, William W. “Machine learning methods in the environmental sciences.” Cambridge Univ. Pr., Cambridge (2009).

7. Kuligowski, Robert J., and Ana P. Barros. “Experiments in short-term precipitation forecasting using artificial neural networks.” Monthly Weather Review 126.2 (1998): 470-482.

8. Kuligowski, Robert J., and Ana P. Barros. “Localized precipitation forecasts from a numerical weather prediction model using artificial neural networks.” Weather and Forecasting 13.4 (1998): 1194-1204.

9. Luk, K. C., J. E. Ball, and A. Sharma. “A study of optimal model lag and spatial inputs to artificial neural network for rainfall forecasting.” Journal of Hydrology 227.1 (2000): 56-65.

10. Manzato, Agostino. “Sounding-derived indices for neural network based short-term thunderstorm and rainfall forecasts.” Atmospheric Research 83.2 (2007): 349-365.

11. Maqsood, Imran, Muhammad Riaz Khan, and Ajith Abraham. “An ensemble of neural networks for weather forecasting.” Neural Computing & Applications 13.2 (2004): 112-122.

12. McCann, Donald W. “A neural network short-term forecast of significant thunderstorms.” Weather and Forecasting 7.3 (1992).

13. Ming Leung, K. “Backpropagation in Multilayer Perceptrons.” Polytechnic University. 3 Mar 2008. Lecture.

14. PSD Gridded Climate Data Sets: All. National Oceanic and Atmospheric Administration Earth System Research Laboratory, 2012. Web. 20 Jan. 2013.

15. Santhanam, Tiruvenkadam, and A. C. Subhajini. “An Efficient Weather Forecasting System using Radial Basis Function Neural Network.” Journal of Computer Science 7.

16. Schwenker, Friedhelm, Hans A. Kestler, and Gunther Palm. “Three learning phases for radial-basis-function networks.” Neural Networks 14.4 (2001): 439-458.

17. Silverman, David, and John A. Dracup. “Artificial neural networks and long-range precipitation prediction in California.” Journal of Applied Meteorology 39.1 (2000): 57-66.

18. Veisi, H., and M. Jamzad. “A complexity-based approach in image compression using neural networks.” Int. J. Sign. Process. 5 (2009): 82-92.

19. Zhang, Guoqiang Peter. “Neural networks for classification: a survey.” Systems, Man, and Cybernetics, Part C: Applications and Reviews, IEEE Transactions on 30.4 (2000): 451-462.