An Overview of Geospatial Databases
In preparation for my talk at LOGIN 2011 (and my own location-based stuff), I’ve done a lot of research into using databases to store geospatial information. I figured I’d put a partial brain dump up here. Maybe some of you folks can debunk any of this info before I start spreading misinformation in my presentation slides.
When developing my new platform, my first choice was MySQL. After all, I know how to use MySQL and so do a lot of people. MySQL has had spatial extensions since version 4.1, so it seemed like a good bet. Right?
MySQL stores geospatial information as Well-known Text (WKT) or its binary equivalent, well-known binary (WKB). You can store points, lines, and polygons as well as perform intersection and distance tests on this data.
WKT is an old markup language for geospatial data that’s just a plaintext representation of points and geometry. For instance, if we were to store the latitude and longitude values of 12, -75 as WKT, it would be “POINT (12 -75)”. Simple, eh?
MySQL provides a variety of functions for computing the distance between points, selecting points inside a square, etc. The problem is that all of these calculations occur in Euclidean space. That is, on a flat plane.
As we all know, the earth is round. So, just getting the Pythagorean distance between two points on the globe is going to be really inaccurate. You might get away with it near the equator (where the earth is flatter), or over small distances.
The solution for accuracy is to write your own great-circle distance formula. You can do it inside a MySQL query, or in your business logic via PHP etc. Either way, it’s a huge performance hit.
So, MySQL is out.
PostGIS is an enhancement to PostgreSQL that adds geospatial features. It works very much like MySQL where it stores location data as WKT or binary. It has the same types of functions as MySQL for determining distance, intersections, etc. The big difference is that PostGIS supports “geographic” coordinates. Which means it can perform distance and intersection calculations on points as they are projected on a globe, not in flat Euclidean space.
I’m not a PostgreSQL expert, but the performance seems good and it’s a very robust RDMBS. Plus, the spatial support has been there since 2001–this is the most mature solution. A lot of geospatial applications have been built on top of it. Much like MySQL, it’s free and open-source.
This is probably the best solution if you want to use a relational database.
MongoDB is the hip NoSQL database that’s big with all the kids these days. Hey, if it’s good enough for Foursquare, it’s good enough for me!
Last year, geospatial indexing was added to MongoDB. MongoDB stores locations as a geohash. This is a pretty neat way of storing latitude and longitude as a 1-dimensional value. Although, it’s really storing a bounding box, not a single point. Also, geohashes have some weird accuracy issues. I do think MongoDB compensates for this.
MongoDB can only store points–so no polygons or linestring paths. Still, for my purposes this is fine. It has all kinds of easy to use functions like searching for points near a point or inside a bounding shape.
MongoDB only supported Euclidean space calculations for a while. However, the latest version has spherical coordinate checks. This is acceptable, but because the Earth is not a perfect sphere it can be inaccurate. It’s good enough to determine your distance from the local Pizza Hut. Maybe not good enough to plan a Lybian airstrike.
SimpleGeo started out as a location-based gaming company, but once they discovered how bad geospatial tools are, they pivoted to a location middleware play. Their latest service is SimpleGeo Storage, a cloud database optimized for location data.
SimpleGeo Storage is a cloud database service painstakingly crafted on Cassandra. Therefore, you have to access it via HTTP as a web service. This can be a pretty big performance hit compared to a local connection to your database server, so design accordingly.
The API is easy to use. You can associate data with latitude/longitude, geohash, or even an IP address (as it is possible to map IP addresses to locations). As a cloud service, you don’t have as many system/cloud administration headaches either.
Amazon should buy these guys and then offer SimpleGeo Storage instances like they do RDS.
There are other solutions too–I noticed there’s a new flavor of CouchDB, GeoCouch, for instance. It seems like if you’re a RDBMS person or need the best accuracy, PostGIS is it. Otherwise, MongoDB is the leading NoSQL solution. SimpleGeo Storage is early, but very promising depending on your performance needs.