One thing I never expected when starting work on Seasonality was the sheer amount of weather data that is available on the web. Sometimes it’s hard to get a handle on all the information. I have a weather database served by PostgreSQL, and it’s starting to take up several GB on my server. Well, last night I was browsing for a METAR station name listing (the station names I currently have are truncated to around 15 characters), and I came across a NOAA FTP server that allows users to download raw METAR observations for all the sites around the globe. This is the same data I use to draw the graph in Seasonality, so I had to check it out…

Sure enough, even though the file format is kind of messy, I was able to write a download/parsing script to fetch the data and insert it into my database. The dataset is just huge…after only 12 hours I’ve collected somewhere on the order of 100,000 records. It’s almost too much data for my lowly database server to handle: originally, parsing the observations took about 75% of the time span in which they were generated. I’ve since optimized it a bit, so now it fetches data every 2 hours and spends maybe a half hour doing the parsing.
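To give a rough idea of what a script like that looks like, here’s a minimal sketch in Python. The FTP host, file path, table name, and column names are all assumptions for illustration; it’s not the actual script I use, just the general shape of grabbing a cycle file of raw METARs, pulling out a couple of fields, and inserting them into Postgres.

```python
#!/usr/bin/env python
"""Illustrative sketch only: hostnames, paths, and table/column names
are assumptions, not the real Seasonality import script."""

import re
from ftplib import FTP

import psycopg2  # assumes the psycopg2 PostgreSQL adapter is installed

FTP_HOST = "tgftp.nws.noaa.gov"                          # assumed NOAA host
CYCLE_FILE = "data/observations/metar/cycles/00Z.TXT"    # assumed file path

# A raw METAR line starts with a 4-letter station ID, then a day/time group
# like 121853Z, and later a temperature/dewpoint group like 12/08 or M05/M10
# (M meaning minus).
METAR_RE = re.compile(
    r"^(?P<station>[A-Z]{4}) (?P<ddhhmm>\d{6})Z .*?"
    r"(?P<temp>M?\d{2})/(?P<dewp>M?\d{2})"
)

def c_to_int(field):
    """Convert a METAR temperature field such as 'M05' to -5."""
    return -int(field[1:]) if field.startswith("M") else int(field)

def parse_cycle(lines):
    """Yield (station, raw_metar, temp_c, dewpoint_c) for parsable lines."""
    for line in lines:
        match = METAR_RE.match(line.strip())
        if match:
            yield (match.group("station"),
                   line.strip(),
                   c_to_int(match.group("temp")),
                   c_to_int(match.group("dewp")))

def main():
    # Pull down one cycle file of raw observations over anonymous FTP.
    lines = []
    ftp = FTP(FTP_HOST)
    ftp.login()
    ftp.retrlines("RETR " + CYCLE_FILE, lines.append)
    ftp.quit()

    # Insert into a hypothetical observations table.
    conn = psycopg2.connect(dbname="weather")
    cur = conn.cursor()
    for station, raw, temp, dewp in parse_cycle(lines):
        cur.execute(
            "INSERT INTO metar_observations (station, raw, temp_c, dewpoint_c) "
            "VALUES (%s, %s, %s, %s)",
            (station, raw, temp, dewp))
    conn.commit()
    conn.close()

if __name__ == "__main__":
    main()
```

Run on a cron-style schedule (say, every 2 hours), something in this vein accumulates records quickly, which is exactly where the parsing-time problem above comes from.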

So what am I going to do with this data? Well, there’s a ton of possibilities, but the best one I can think of is to offer a method for new Seasonality users to “download the last month/3 months of graph data.” This way users won’t have to wait so long to get a good range of graph data. We’ll have to see how this goes though. Obviously there are bandwidth and storage issues involved, and I’m sure that once the database gets beyond a certain size, scalability will be an issue as well.

Another cool database I found about a month ago was the GEOnet Names Server (GNS). This is a database of around 5 million different cities and geographic features worldwide. I ended up importing this data into Postgres as well, though I had to transfer the database to my G5 to execute queries in any reasonable amount of time. After creating several indexes on the table and giving the database 1.5 GB of RAM (enough to hold the full table), I got a typical select across the whole table down to around 10 seconds when filtering on indexed columns. It’s actually faster locally than using their web server to do a query…then again, I’m the only user hitting it over here. 🙂 This data source helped a lot when I was trying to find longitude/latitude pairs for the new global locations in the upcoming version.
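For the curious, the lookups themselves are nothing exotic. Here’s a small sketch of the kind of index-then-query workflow I mean; the table and column names (gns_names, full_name, latitude, longitude, country_code) are placeholders I made up for the example, since the real GNS dump uses its own field names.

```python
"""Hypothetical example of indexing an imported GNS table and looking up
latitude/longitude pairs by place name. Table and column names are assumed."""

import psycopg2

conn = psycopg2.connect(dbname="gns")
cur = conn.cursor()

# Indexes on the columns used for lookups -- created once after the import.
cur.execute("CREATE INDEX gns_name_idx ON gns_names (full_name)")
cur.execute("CREATE INDEX gns_cc_idx ON gns_names (country_code)")
conn.commit()

# Typical query: find the latitude/longitude pair for a named place.
cur.execute(
    "SELECT full_name, latitude, longitude "
    "FROM gns_names "
    "WHERE full_name ILIKE %s AND country_code = %s",
    ("Ann Arbor%", "US"))
for name, lat, lon in cur.fetchall():
    print(name, lat, lon)

conn.close()
```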