Original: “The Design of a Practical System for Fault-Tolerant Virtual Machines”
Authors: Daniel J. Scales, Mike Nelson, and Ganesh Venkitachalam (VMware, Inc{scales,mnelson,ganesh}@vmware.com)
introduce #
This paper (virtual machine fault tolerant) I think is one of the best paper for someone who wants to step into distributed system. It provides some basic opinions about how to replicate state.
background #
The background of this paper is to design a fault tolerant virtual machine service, and the most common idea for fault tolerant is to replicate, so they talk about how to replicate computation.
If these terminology sounds a bit scare, like what the f**k they are? can you just say something more understandable like what a human can figure out, don’t worry, it’s really easy, assume you have a windows os, you don’t want it to stop running while losing power connection or want it to survive even from an earthquake. so you just run 2 windows in different computers simultaneously. Now we have 2 running windows os, if one of them stops, the other one will still keep running, thus our excel application can survive, however, we want these 2 windows run the same excel, handle the same row, the critical part becomes how to synchronize the state between 2 windows, the problem is what we called replicated state machine.
The nature is how to replicate state sequence, and the result is a fault tolerant computation system.
what we have #
We have 2 instances(virtual machines, image it as windows).
One called primary, the other one called backup.
They are connected via network, they shared a net disk.
what to replicate #
But the first problem is what to replicate?
- the sequence of the current result. include the registers, memory bytes, and all what we have modified. That’s not a good idea, because the data is too large.
- the sequence of the operation. like we clicked some place. The real operation is the underlying instructions in the execution stack I guess. It doesn’t matter if our interests is just try to understand the replication instead of virtual machine. A terrible thing is operating system has some uncertain events like time and IO interrupt, as the result, this uncertain things need to be translated to certain things. They have a professional terminology called deterministic state machine to describe this uncertain operation, a common example is now function, assume you want to replicate the data between 2 database, set and get operation is certain, but now system call is uncertain.
how to replicate (which called protocol) #
The whole progress looks like this:
The clients send requests to the primary,
primary translate it to a couple of certain operations,
primary send it to the backup,
backup responds ack,
primary execute the request and response client,
done.
We have some problems here,
The first one is what if primary runs really fast, the backup falls behind a lot. sounds like after the primary respond, the backup found it will take him 2 days to catch up with the primary, that means if our primary stops running, we will lose our recent 2 days work!
The idea is to create a fixed length buffer to store the operations. We can assume the buffer can store 1 minutes work at most, if it’s full, the primary should pause and wait the backup to catch up with him. it’s like a block queue in inter process communication or a fixed length channel instance in golang. The purpose is to synchronize the speed of both sides.
Another problem is what if primary fails. If some machine fails, fault tolerant design starts work.
If the backup stops, everything seems still work, we fallback to the common scene: we have only one computer, it runs windows, it works.
If the primary stops, we want to shift to backup, but first, how does the backup recognize the primary has stopped. Just build a heartbeat to sense each other. If the primary find no heartbeat from backup, it can create a new backup. If the backup find no heartbeat, it try to become the primary, and repeat the behavior of the primary.
The progress of becoming a primary is usually called election in some consensus algorithm, vmft use a simple method to elect, use a cas(compare and set/swap) operation to write to a shared net disk. The winner of election is called primary, leader, master, main…
By the way, to discover to the broken nodes in a system,
we use heartbeat,
if we have only 2 nodes, they can heartbeat each other,
if we have 3 or 5 nodes for a certain usage, we can create full connection among them,
if we have unknown numbers of nodes, we can create a third party monitor service to probe all of the nodes.
compare with other system #
GFS:
vmft uses a more underlying model to replicate, it directly replicate the virtual machine, it theoretically means it can replicate all application run in the virtual machine like excel, database, game, but noticed that it must runs in low performance too.
GFS is used to replicate bytes block specifically, that means it replicate in the application level. however, it provides no consistency promise while vmft says it will return only the primary and the backup persist successfully.
That means the data stored in GFS could be not the same at all, but application runs in vmft will always keep the same.
something else #
There are bunch of things we don’t care about, like the virtual machines specific improvement. Our interest focus on the general replication.
Replicate the computation is not a good idea in some ways. because computation itself is expensive and it takes a lot of resources include not only the CPU, but also expensive memory, IO bandwidth. Calculate separately and replicate the result of a short period would be more competitive.