Original: “Spanner: Google’s Globally-Distributed Database”
Authors: James C. Corbett, Jeffrey Dean, Michael Epstein, Andrew Fikes, Christopher Frost, JJ Furman, Sanjay Ghemawat, Andrey Gubarev, Christopher Heiser, Peter Hochschild, Wilson Hsieh, Sebastian Kanthak, Eugene Kogan, Hongyi Li, Alexander Lloyd, Sergey Melnik, David Mwaura, David Nagle, Sean Quinlan, Rajesh Rao, Lindsay Rolig, Yasushi Saito, Michal Szymaniak, Christopher Taylor, Ruth Wang, Dale Woodford (Google, Inc.)
Introduction #
Spanner is a database. It’s distributed across the world.
Spanner provides a SQL interface and exposes controls for data placement.
Placement here means how many replicas each piece of data should have and which datacenters should hold them, so that users (the database clients) stay geographically close to the data they read; the paper calls this geographic locality.
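To make the placement idea concrete, here is a minimal sketch of the kind of policy such a configuration might express. The type and field names (ReplicationConfig, Regions, LeaderIn) are invented for this sketch and are not Spanner's real API.

```go
// Illustrative only: a toy model of the placement knobs described above
// (how many replicas, which datacenters, where the leader should sit).
// All names here are invented for this sketch, not Spanner's real API.
package placement

type Region string

// ReplicationConfig is the per-application placement policy:
// the number and geographic location of replicas.
type ReplicationConfig struct {
	NumReplicas int      // e.g. 1, 3, or 5 replicas per Paxos group
	Regions     []Region // datacenters the replicas should live in
	LeaderIn    Region   // preferred leader region, kept close to writers
}

// Example: keep data close to European clients while still surviving
// the loss of a single datacenter.
var EUPolicy = ReplicationConfig{
	NumReplicas: 3,
	Regions:     []Region{"eu-west", "eu-central", "us-east"},
	LeaderIn:    "eu-west",
}
```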
Spanner provides global availability: because data is replicated across the world, a disaster in any single place does not take the data down.
Spanner also rebalances data automatically across machines, both to tune for the workload and to cope with nodes being added or removed.
What interests me most is that Spanner delivers high-performance transactions even at a global level. Round-trip latency between Europe and the U.S. is on the order of 100 ms or more, and a bit more to some Asian countries such as Japan. Spanner provides general-purpose read-write transactions with atomic updates, plus lock-free read-only transactions. The latter are the critical part for overall transaction performance, because for most applications reads far outnumber writes; the read/write ratio is commonly somewhere between 10:1 and 1000:1.
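A toy sketch of the difference between the two transaction kinds, assuming a made-up Store interface. Real Spanner picks the snapshot timestamp with TrueTime rather than a local clock, and the real client API looks nothing like this; the point is only that read-only transactions read at a timestamp without locks, while read-write transactions go through locking.

```go
// Toy sketch contrasting the two transaction kinds the paper highlights.
// Types and methods are invented for illustration; they are not the real
// Spanner client API.
package txsketch

import "time"

type Timestamp = time.Time

type Store interface {
	// ReadAt returns the value of key as of ts, without taking locks:
	// any replica whose state is sufficiently up to date can serve it.
	ReadAt(key string, ts Timestamp) (string, error)
	// Lock/Unlock/Write model the two-phase-locking path that
	// read-write transactions go through at the Paxos leader.
	Lock(key string) error
	Unlock(key string)
	Write(key, value string) error
}

// ReadOnlyTxn: pick a snapshot timestamp once, then read lock-free.
func ReadOnlyTxn(s Store, keys []string) (map[string]string, error) {
	ts := time.Now() // stand-in for a TrueTime-chosen snapshot timestamp
	out := make(map[string]string, len(keys))
	for _, k := range keys {
		v, err := s.ReadAt(k, ts)
		if err != nil {
			return nil, err
		}
		out[k] = v
	}
	return out, nil
}

// ReadWriteTxn: take locks, apply all writes, release locks on return.
func ReadWriteTxn(s Store, updates map[string]string) error {
	for k := range updates {
		if err := s.Lock(k); err != nil {
			return err
		}
		defer s.Unlock(k)
	}
	for k, v := range updates {
		if err := s.Write(k, v); err != nil {
			return err
		}
	}
	return nil
}
```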
The basic data distribution model is a set of Paxos groups, each holding replicas of some key ranges (called directories). The number of replicas can be 1, 3, 5, or more, depending on application requirements, and it can be adjusted dynamically.
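A rough model of that distribution scheme, with invented names: directories are contiguous key ranges, each Paxos group replicates the directories assigned to it, and rebalancing moves a directory from one group to another.

```go
// Toy model of the data distribution described above: directories
// (contiguous key ranges) are the unit of placement, and each Paxos group
// replicates a set of directories. Names invented for illustration.
package distmodel

// Directory is a contiguous key range that moves between groups as a unit.
type Directory struct {
	StartKey, EndKey string
}

// PaxosGroup replicates the data of its directories across its replicas.
type PaxosGroup struct {
	ID          int
	Directories []Directory
	Replicas    []string // zone names; typically 1, 3, or 5 entries
}

// MoveDir sketches rebalancing: a directory is moved from one group to
// another, e.g. to change its replica count or geographic placement.
func MoveDir(d Directory, from, to *PaxosGroup) {
	for i, dir := range from.Directories {
		if dir == d {
			from.Directories = append(from.Directories[:i], from.Directories[i+1:]...)
			break
		}
	}
	to.Directories = append(to.Directories, d)
}
```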
What it looks like #
A Spanner deployment is called a universe.
A universe contains a set of zones. A zone is roughly the analogue of a data center, but it is an administrative abstraction used for isolation, and a single physical data center can contain more than one zone.
In addition to the zones, there are one universe master and one placement driver.
The universe master is basically just a status dashboard for the zones.
The placement driver handles automated data movement across zones: it carries out replication-configuration changes (data migration) and balances load.
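As a quick mental model (names invented, not taken from the paper or Spanner's code), the universe-level pieces might look like this:

```go
// Tiny illustrative model of the universe-level components; the field
// names are invented for this sketch.
package topology

// Universe is one Spanner deployment: a set of zones plus two singleton
// control services.
type Universe struct {
	Zones           []string // zone names; each zone's internals are described below
	UniverseMaster  string   // address of the status dashboard
	PlacementDriver string   // address of the service that moves data between zones
}
```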
A zone contains a zonemaster, which assigns data to spanservers; location proxies, which clients use to find the spanservers holding their data; and many spanservers. The spanserver is the core component of the system: it provides transactions and replication. Each spanserver runs a set of Paxos groups, and the transaction coordinator state is embedded in the Paxos groups, which makes the coordinators themselves highly available. Concurrency control is implemented with two-phase locking, using a lock table at each Paxos leader to store the lock state. The Paxos group leaders drive replication consensus and also act as participant leaders for transactions. When a transaction spans multiple shards/directories, one of the participant groups is chosen as the coordinator, its Paxos leader serves as the coordinator leader (this is described briefly in the last sentences of Section 2.1 of the paper), and the result is a two-phase commit across the involved groups.
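A heavily simplified sketch of that transaction path, with invented types: a lock table for two-phase locking at each group's leader, and a coordinator running two-phase commit across the participant groups.

```go
// Sketch of the transaction machinery described above, heavily simplified:
// a per-leader lock table for two-phase locking, and a coordinator that
// runs two-phase commit when a transaction spans several Paxos groups.
// All types and methods are invented for illustration.
package spansketch

import (
	"errors"
	"sync"
)

// LockTable maps keys (the paper uses key ranges) to lock state.
// In the paper it lives at the Paxos leader of each group.
type LockTable struct {
	mu   sync.Mutex
	held map[string]bool
}

func NewLockTable() *LockTable {
	return &LockTable{held: make(map[string]bool)}
}

// TryLock acquires a write lock or fails, rather than blocking.
func (lt *LockTable) TryLock(key string) bool {
	lt.mu.Lock()
	defer lt.mu.Unlock()
	if lt.held[key] {
		return false
	}
	lt.held[key] = true
	return true
}

func (lt *LockTable) Unlock(key string) {
	lt.mu.Lock()
	defer lt.mu.Unlock()
	delete(lt.held, key)
}

// Participant is the leader of one Paxos group taking part in a transaction.
type Participant interface {
	Prepare() error // log a prepare record through Paxos
	Commit()        // log a commit record through Paxos and apply it
	Abort()
}

// TwoPhaseCommit is run by the coordinator leader (the leader of one of the
// participant groups). If every participant prepares, the outcome is commit.
func TwoPhaseCommit(participants []Participant) error {
	for _, p := range participants {
		if err := p.Prepare(); err != nil {
			for _, q := range participants {
				q.Abort()
			}
			return errors.New("transaction aborted")
		}
	}
	for _, p := range participants {
		p.Commit()
	}
	return nil
}
```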