One thing I wasn’t expecting when entering the world of weather software with Seasonality was the sheer amount of computing resources that weather data collection and processing require. You hear about the supercomputers that government organizations buy for weather forecasting and research, but I had never given them a second thought; I just assumed those machines were running high-end computations in a completely different league from the typical online weather service. That’s true to some extent, but I severely underestimated how many resources an online weather service can consume itself.
Take, for example, the fairly simple idea of recording temperatures, wind speeds, pressures, and other conditions for locations worldwide. One commonly used system is a network of 4,000-5,000 ICAO stations around the globe; it’s the same network Seasonality uses to generate its weather graphs. Most ICAO stations are located at airports or military bases, but some are at other public facilities. Each station reports its conditions roughly once an hour. That’s a fair amount of data for one day, but it seems manageable. What happens, though, when you want to store this information long-term, say so Seasonality users can download several months of data to populate their graphs the first time they run the app? That raises the required resources to a whole new level.

I’ve been collecting data from these weather stations for the past 9 months or so. Every month, about 4.5 million new records are added to the PostgreSQL database I set up to track this data, and by now the database is several gigabytes. The problem comes when trying to access that data later. Even singling out a single ICAO station for one month of data can take 30 seconds to query, and that’s on a 2GHz Athlon 64 system with 3GB of RAM and RAID storage. What happens when thousands of users hit this database at the same time? Not much: the database would be denying more requests than it could fulfill. I’ll continue to work on this functionality, and I hope to find a good way to manage this load in the future.
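To make the problem concrete, here’s roughly what one of those station-month queries looks like. This is a minimal sketch in Perl (the language I’m already using elsewhere for data collection); the table name, column names, station, and dates are stand-ins, not my actual schema:

    #!/usr/bin/perl
    # Hypothetical sketch of a station-month query. Table and column names
    # are assumptions, not the real schema.
    use strict;
    use warnings;
    use DBI;

    my $dbh = DBI->connect("dbi:Pg:dbname=weather", "", "", { RaiseError => 1 });

    # Without an index covering (icao, observed_at), PostgreSQL has to scan
    # much of the multi-gigabyte table to answer this.
    my $sth = $dbh->prepare(q{
        SELECT observed_at, temperature, wind_speed, pressure
        FROM observations
        WHERE icao = ?
          AND observed_at >= ?
          AND observed_at <  ?
        ORDER BY observed_at
    });
    $sth->execute('KSAN', '2006-01-01', '2006-02-01');
    while (my $row = $sth->fetchrow_hashref) {
        printf "%s  %5.1f\n", $row->{observed_at}, $row->{temperature};
    }

A composite index on the station identifier and observation time is the obvious first step, but with 4.5 million new rows arriving every month, even indexed lookups add up to a lot of disk seeks.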
A more recent example presented itself a little over a week ago, when Environment Canada stopped providing forecasts for international weather locations. Seasonality depended on that data to display forecasts outside the U.S., so users who rely on Seasonality for international forecasts are taking a big hit right now. I set out to find a new data source. Prepared worldwide forecasts are hard to come by, but I found a suitable replacement in a more raw format: GRIB files containing the gridded output of the GFS (Global Forecast System) model. This is really good data, providing more than 50 variables (temperature, wind speed, wind direction, pressure, and cloud cover, among many others) across several layers of the atmosphere, from the surface up to 1 millibar. The grid resolution is pretty good as well, down to blocks of 0.5° latitude x 0.5° longitude.
With all this data I can generate a forecast for any location on the globe with reasonable accuracy, but it comes at a cost: the data is plentiful, and it takes plenty of space. The model produces output every 3 hours, out to 180 hours (7.5 days) into the future, for 60 data sets in all. Each data set is around a 26MB download, which works out to roughly 1.5GB for a complete forecast. That’s a massive amount of data, especially when a new forecast is generated 4 times a day.
Fortunately, there is a way to pick and choose which variables and which atmospheric levels to download. At the moment, I’ve narrowed the data set I require down to about 200MB per forecast. Great, that’s doable…I adapted some freely-licensed Perl code I found on the internet to fit my needs, and I have a cron job going to download the data I want. Depending on the server speed, the download takes an hour or two.
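The download side doesn’t need to be fancy. Here’s a hypothetical sketch of the sort of thing the cron job runs; the host, URL pattern, and file names are placeholders rather than the real data source:

    #!/usr/bin/perl
    # Hypothetical download sketch, run from cron. The base URL and file
    # naming scheme are placeholders. The real source lets you request only
    # the variables and levels you need, which is what gets a forecast down
    # to ~200MB.
    use strict;
    use warnings;

    my $base = "http://example.com/gfs/latest";    # placeholder URL

    # Forecast hours 3, 6, ... 180: one GRIB file per 3-hour step.
    for my $fhr (map { $_ * 3 } 1 .. 60) {
        my $file = sprintf "gfs_f%03d.grb", $fhr;
        next if -e "/data/grib/$file";             # don't re-fetch on a retry
        system("curl", "-s", "-f", "-o", "/data/grib/$file", "$base/$file") == 0
            or warn "download failed for $file\n";
    }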
Next, I need to convert the data from the GRIB binary format into something I can load into the PostgreSQL database. The database should have a row for each block in the longitude/latitude grid, for each 3-hour period of the forecast. At 0.5° resolution that’s a 720 x 361 grid, or just under 260,000 rows for every 3-hour data set and about 15.5 million rows for the entire forecast. My Athlon server takes about 2 hours to parse all the GRIB files and load them into the database, and that’s after extensive optimization on my part.
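Bulk loading is where PostgreSQL’s COPY makes a huge difference over row-by-row INSERTs. Here’s a minimal sketch of the load step, assuming the GRIB records have already been decoded into tab-separated text (forecast hour, latitude, longitude, value) by an earlier pass; the table and column names are placeholders:

    #!/usr/bin/perl
    # Hypothetical bulk-load sketch. Decoded grid values arrive on STDIN as
    # tab-separated lines, and COPY streams them into the table far faster
    # than 15.5 million individual INSERTs would.
    use strict;
    use warnings;
    use DBI;

    my $dbh = DBI->connect("dbi:Pg:dbname=forecast", "", "", { RaiseError => 1 });

    $dbh->do("COPY gfs_grid (forecast_hour, latitude, longitude, temperature) " .
             "FROM STDIN");
    while (my $line = <STDIN>) {
        $dbh->pg_putcopydata($line);   # one line per 0.5-degree grid cell
    }
    $dbh->pg_putcopyend;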
The data doesn’t do any good if I can’t get to it easily, so I need indexes to speed up querying. It makes sense to index on longitude and latitude, since that’s how I’ll be querying the data. The indexes I’ve got so far take 30 minutes to an hour to generate.
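As a sketch, the indexing step amounts to something like this (again with placeholder names). Building the index once, after the bulk load, is much cheaper than maintaining it during the COPY:

    #!/usr/bin/perl
    # Hypothetical indexing sketch. A composite index on (latitude, longitude)
    # matches the query pattern: one grid cell, all 60 forecast hours.
    use strict;
    use warnings;
    use DBI;

    my $dbh = DBI->connect("dbi:Pg:dbname=forecast", "", "", { RaiseError => 1 });

    $dbh->do("CREATE INDEX gfs_grid_lat_lon_idx " .
             "ON gfs_grid (latitude, longitude)");
    $dbh->do("ANALYZE gfs_grid");   # refresh planner statistics after the load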
When all is said and done, it takes around 4-5 hours to get the forecast back end ready for Seasonality to query. The data source provides new forecasts every 6 hours, but it doesn’t make sense to refresh that often when processing takes this long. Updating once or twice a day will probably be reasonable in the end.
So what about query speed when Seasonality users grab a forecast from the server? I think it’s fair for a query to take about 5 seconds to return data, and right now that looks doable with my current hardware. With adequate caching, I can probably drop the response time for frequently requested locations even lower. Then there’s redundancy to think about: if something wonky happens with my server or network connection, I want a backup somewhere else. I can’t put the CPU load of parsing the data on my hosting provider’s servers, but I can use their bandwidth once the data is in a queryable format, so I’m hoping to replicate the processed data over to the hosting server.
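The per-location lookup itself is simple once the index is in place. Here’s a hypothetical sketch against the placeholder schema above, snapping the requested coordinate to the nearest 0.5° grid cell so the index applies directly:

    #!/usr/bin/perl
    # Hypothetical forecast lookup. Snapping the request to the nearest
    # 0.5-degree grid cell lets the (latitude, longitude) index find the
    # matching rows without a table scan.
    use strict;
    use warnings;
    use POSIX qw(floor);
    use DBI;

    my ($lat, $lon) = (32.73, -117.19);         # requested location
    my $cell_lat = floor($lat * 2 + 0.5) / 2;   # round to nearest 0.5 degrees
    my $cell_lon = floor($lon * 2 + 0.5) / 2;

    my $dbh = DBI->connect("dbi:Pg:dbname=forecast", "", "", { RaiseError => 1 });
    my $sth = $dbh->prepare(q{
        SELECT forecast_hour, temperature
        FROM gfs_grid
        WHERE latitude = ? AND longitude = ?
        ORDER BY forecast_hour
    });
    $sth->execute($cell_lat, $cell_lon);
    while (my ($fhr, $temp) = $sth->fetchrow_array) {
        printf "+%3dh: %.1f\n", $fhr, $temp;
    }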
All of this for the seemingly simple feature of a forecast. There’s still a lot of work to do before the new forecasts are ready, but in the end I think it will be worth it. Seasonality will no longer depend on another weather outfit for forecast data, and since the GRIB format is a standard, it will be easy to add redundant sources. I’ll also be able to make the forecasts more accurate without releasing a Seasonality update, because all of that code runs server-side. The initial forecasts will be fairly simple, but over time I expect to improve their detail and accuracy by making use of more of the data variables.