When working with large data volumes that have a geospatial component, what are the options available to you on AWS, Azure, and Google Cloud?
In this three-part blog series, for each cloud platform, I’ll present an overview of the spatial support and provide a few typical use-cases for the following types of cloud database:
- Cloud-native relational databases – Online Transactional Processing (OLTP) databases
- NoSQL databases
- Data warehouses – Online Analytical Processing (OLAP) databases
There are of course many factors that influence choosing a database solution (e.g. cost, performance, security) – the goal of this blog series isn’t to cover all of those, merely to focus on what is available from a geospatial perspective.
One of the key things to understand when choosing a spatially enabled database is whether you need the geometry or the geography data type.
- The geometry type represents data in a flat coordinate system.
- The geography type represents data in a round-earth (spherical or ellipsoidal) coordinate system, usually latitude/longitude on WGS 84 (LL84 in FME).
There are various trade-offs for each data type, so there is no one-size-fits-all answer. The correct choice will depend on your specific use case (this answer on StackExchange provides a good overview). For each database, we will detail whether it supports geometry or geography types.
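To make the difference concrete, here is a minimal sketch in plain Python (not tied to any particular database) that measures the same one-degree step in longitude two ways: as a flat, geometry-style distance and as a geography-style great-circle distance. The haversine formula and the mean Earth radius constant are assumptions of this sketch; real databases typically use more precise ellipsoidal calculations for geography types.

```python
import math

EARTH_RADIUS_KM = 6371.0088  # mean Earth radius; an assumption of this sketch


def planar_distance(p, q):
    """Geometry-style distance: treat (lon, lat) as flat Cartesian x/y."""
    return math.hypot(q[0] - p[0], q[1] - p[1])


def great_circle_km(p, q):
    """Geography-style distance: haversine formula on a sphere, in km."""
    lon1, lat1, lon2, lat2 = map(math.radians, (p[0], p[1], q[0], q[1]))
    a = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


# Two pairs of (lon, lat) points, each one degree of longitude apart.
equator = ((0.0, 0.0), (1.0, 0.0))
north = ((0.0, 60.0), (1.0, 60.0))

for label, (p, q) in (("equator", equator), ("60N", north)):
    print(label, planar_distance(p, q), round(great_circle_km(p, q), 1))
```

The flat calculation reports the same distance (one "degree") for both pairs, while the great-circle calculation gives roughly 111 km at the equator and roughly 56 km at 60° north. That gap is the kind of trade-off to weigh when choosing between the two types.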
What is a Cloud-Native Relational Database?
Relational databases have been around for decades, and while you might not immediately associate them with handling big data, they might still be the right solution in certain scenarios. The new breed of proprietary cloud-optimized relational databases is particularly interesting.
Cloud-native databases are designed by the cloud vendor from the ground up to take advantage of the cloud architecture. They are often based on existing database engines (which makes it easier to move an on-premises database to the cloud) but are optimized for the cloud with a focus on performance and availability.
First, let’s have a look at where you might run into problems when handling large data volumes on a traditional relational database.
Limitations of Traditional Relational Databases
Relational databases have the most comprehensive support in terms of supported geometry types and spatial functions (view support here). This makes them an inviting option. For most workflows running on a relational database, you aren’t going to run into scaling or latency issues, but for this post, we are thinking BIG. So when might relational databases not be the best fit when working with big data?
- Large tables – If you have billions of rows in your database, then if the indexes aren’t perfect (or sometimes even if they are), querying might be slow. Since spatial queries can be very compute-intensive, this is a key consideration.
- Fluid schema – If the schema of the incoming data is not well defined, then a relational database’s fixed schema could pose a problem.
- A large number of concurrent connections – If your architecture has the potential to open a large number of connections to the database (for example, a serverless architecture powering an app that scales with the number of users), then scaling database network connections can be problematic.
- Real-time requirements – If your database is powering an app or service that requires extremely low latency and you have large data volumes, a relational database might be too slow.
Traditionally, if you were running into any of these issues, then it would be time to look at another database solution (e.g. NoSQL). Before you do that, it is worth looking at the cloud-native relational databases as they still offer the same level of spatial support as the managed relational databases, but they start to address some of the constraints highlighted above.
Spotlight on Cloud-Native Relational Databases
AWS, Azure and Google all offer a cloud-native relational database service. AWS and Azure have native geospatial support while Google offers support via a library.
| Database | Geospatial Support | FME Support |
|---|---|---|
| AWS Aurora | Spatial types: Geometry and Geography | Read, write and query. |
| Azure SQL Database | This is based on SQL Server with the same spatial support. Spatial types: Geometry and Geography | Read, write and query. |
| Google Cloud Spanner | Cloud Spanner does not natively support geospatial queries. Google’s S2 library, which uses spherical geometry and is used by Google itself on Google Maps, can be leveraged instead. | Read, write and query. |
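When a database has no native spatial types, a common pattern is to compute a cell key for each location in application code and store it in an ordinary indexed column. The sketch below illustrates that idea with a geohash, used here purely as a simple stand-in: it is not S2, it does not use S2's spherical cell scheme, and it is not Cloud Spanner's API. The function name and default precision are hypothetical.

```python
BASE32 = "0123456789bcdefghjkmnpqrstuvwxyz"  # standard geohash alphabet


def geohash(lat, lon, precision=8):
    """Encode a latitude/longitude pair as a geohash string."""
    lat_range, lon_range = [-90.0, 90.0], [-180.0, 180.0]
    chars, bits, value = [], 0, 0
    use_lon = True  # bits alternate, starting with longitude
    while len(chars) < precision:
        rng, coord = (lon_range, lon) if use_lon else (lat_range, lat)
        mid = (rng[0] + rng[1]) / 2
        if coord >= mid:
            value = (value << 1) | 1  # upper half of the range: emit a 1 bit
            rng[0] = mid
        else:
            value <<= 1               # lower half of the range: emit a 0 bit
            rng[1] = mid
        use_lon = not use_lon
        bits += 1
        if bits == 5:  # every 5 bits become one base-32 character
            chars.append(BASE32[value])
            bits, value = 0, 0
    return "".join(chars)


# Two nearby points; keys like these could live in a plain string column.
a = geohash(57.64911, 10.40744, 5)
b = geohash(57.64950, 10.40800, 5)
print(a, b, a == b)
```

Points that fall in the same cell share a key prefix, so a prefix or range scan on the key column can act as a coarse spatial filter. Nearby points on opposite sides of a cell boundary can still get different keys, which is one reason a dedicated library such as S2 is usually the better choice in practice.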
As you can see from the table, these databases are impressive in their support for spatial data. So how does this new breed of relational database solve some of the scaling and latency issues you might experience with traditional relational databases when working with large data volumes?
Note, if you are not ready to take the leap with a cloud-native database yet, the spatial support for traditional cloud relational databases (e.g. AWS RDS, Azure Databases) is detailed here.
Support for Large Data Volumes
All of the cloud-native databases listed above now support a serverless architecture with features such as:
- Scaling – Autoscale compute, memory and storage capacity as needed
- Performance – Software optimized for the infrastructure, coupled with active geo-replication, allows data to be served globally very quickly.
- High Availability – Storage and compute are separated, which means the data is safe even if the DB instances fail.
- Maintenance – Automated updates and backups lower the overhead of maintaining the infrastructure.
These features, coupled with per-second billing, really lend themselves to high-throughput spatial processing. As always there are some trade-offs compared to traditional relational databases, but on the whole it means you can now leverage the power that spatially enabled relational databases deliver, with the flexibility and scale that a serverless architecture brings.
Scaling database connections can be problematic with traditional relational databases. A key benefit of NoSQL databases is that applications don’t need to maintain a persistent network connection to interact with them: requests are stateless and happen over HTTPS. Database connections are also not a good fit for serverless functions that might only run for a few milliseconds. Cloud-native databases are beginning to address this.
AWS Aurora has added a feature called the Aurora Data API, which lets you interact with your database through an API. Crucially, the Data API doesn’t require a persistent connection: rather than opening a connection to the database, you run SQL statements over HTTPS. This is a big deal because Aurora supports PostGIS, so you can use the full power of PostGIS in a stateless manner. As your infrastructure scales, the number of requests scales with it, without you having to manage a growing pool of database connections.
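As a rough sketch of what this could look like from Python, the function below sends a PostGIS radius query through the Data API's `execute_statement` call as exposed by the boto3 `rds-data` client. The `places` table, its `name` and `geom` columns, the function name, and the ARN values are all hypothetical, and it assumes a PostgreSQL-compatible Aurora cluster with PostGIS installed and the Data API enabled.

```python
def find_nearby(client, cluster_arn, secret_arn, database, lon, lat, radius_m):
    """Return (name, WKT) rows within radius_m metres of a lon/lat point.

    `client` is expected to behave like boto3.client("rds-data").
    The `places` table and its columns are hypothetical.
    """
    sql = (
        "SELECT name, ST_AsText(geom) FROM places "
        "WHERE ST_DWithin(geom::geography, "
        "ST_SetSRID(ST_MakePoint(:lon, :lat), 4326)::geography, :radius_m)"
    )
    response = client.execute_statement(
        resourceArn=cluster_arn,   # ARN of the Aurora cluster
        secretArn=secret_arn,      # Secrets Manager secret with DB credentials
        database=database,
        sql=sql,
        parameters=[               # named parameters referenced as :name in the SQL
            {"name": "lon", "value": {"doubleValue": lon}},
            {"name": "lat", "value": {"doubleValue": lat}},
            {"name": "radius_m", "value": {"doubleValue": radius_m}},
        ],
    )
    # Each record is a list of typed field dicts, e.g. {"stringValue": "..."}.
    return [
        (row[0].get("stringValue"), row[1].get("stringValue"))
        for row in response.get("records", [])
    ]


# Example call (requires boto3, AWS credentials, and real ARNs):
#   import boto3
#   client = boto3.client("rds-data")
#   rows = find_nearby(client, CLUSTER_ARN, SECRET_ARN, "gis", -123.1, 49.28, 500.0)
```

Because each call is an independent HTTPS request, a short-lived serverless function can run this query without opening or pooling a database connection.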
If you have a workflow that has many transaction-oriented tasks and you need to use a relational database, but you have high throughput or require a highly available solution, then a proprietary cloud database might be worth a look. You can read and write data to all relational databases (traditional and cloud) with FME.
Next in the series, we’ll look at spatial support for cloud NoSQL databases.
Stewart Harper
Stewart is the Technical Director of Cloud Applications and Infrastructure at Safe. When he isn’t building location-based tools for the web, he’s probably skiing or mountain biking.