Storage Engine: from Primitives to Implementation

Todo: work in progress.

disk oriented

What a storage system provides, at bottom, is a way to interact with the disk. That is true of everything we call a database, a key-value store, a block or small-file store, a log, or a persistent queue. The disk is a block device, so the most efficient way to use it is to read and write a fixed number of bytes at a time: never a single bit or byte, always a larger chunk. The chunk size is usually 4KB, 8KB, or 16KB, depending on the properties of the disk and the needs of the upper-layer system. Whatever the size, we call the chunk a page. A page can be fixed length or variable length. Experience shows that fixed length means indexable. For example, in the HTTP/2 protocol every frame header is exactly 9 bytes, which makes it convenient to parse and easy to jump to a specific field at a known offset: the first 3 bytes are the payload length, the byte at offset 3 is the frame type, the byte at offset 4 holds the flags, and the last 4 bytes are the stream identifier.
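To make the fixed-offset idea concrete, here is a minimal sketch that decodes the 9-byte HTTP/2 frame header with Python's `struct` module. The function name `parse_frame_header` and the dictionary keys are illustrative choices, not part of any library.

```python
import struct

FRAME_HEADER_LEN = 9  # fixed by the HTTP/2 specification


def parse_frame_header(buf: bytes) -> dict:
    """Decode the 9-byte HTTP/2 frame header at the start of buf."""
    if len(buf) < FRAME_HEADER_LEN:
        raise ValueError("need at least 9 bytes")
    # Big-endian layout: 1 + 2 bytes of length, type, flags, 4-byte stream id.
    len_hi, len_lo, frame_type, flags, stream = struct.unpack(
        ">BHBBI", buf[:FRAME_HEADER_LEN]
    )
    return {
        "length": (len_hi << 16) | len_lo,   # 24-bit payload length
        "type": frame_type,
        "flags": flags,
        "stream_id": stream & 0x7FFFFFFF,    # the top bit is reserved
    }


# A DATA frame (type 0) with a 5-byte payload on stream 3.
header = parse_frame_header(bytes([0, 0, 5, 0, 1, 0, 0, 0, 3]))
```

Because every field sits at a known offset, the parser never has to scan for delimiters. The same property is what makes fixed-size pages cheap to address.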

Therefore, from an application’s point of view, the disk is basically split into pages. The page is the basic storage unit, and we fill it with content: a tuple for a relational database, a log entry for a write-ahead log, an offset for a B+ tree node, a chunk of a file, or even a plain blob.
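As a rough sketch of what "the disk is split into pages" means in code, the helpers below address a seekable binary file by page number instead of by byte. The 4KB page size and the names `write_page` and `read_page` are assumptions made for illustration.

```python
import io

PAGE_SIZE = 4096  # assumed page size; real systems pick 4KB, 8KB, 16KB, ...


def write_page(f, page_id: int, payload: bytes) -> None:
    """Write payload into page page_id, zero-padded to a full page."""
    if len(payload) > PAGE_SIZE:
        raise ValueError("payload does not fit in one page")
    f.seek(page_id * PAGE_SIZE)
    f.write(payload.ljust(PAGE_SIZE, b"\x00"))


def read_page(f, page_id: int) -> bytes:
    """Read exactly one page; the byte offset is just page_id * PAGE_SIZE."""
    f.seek(page_id * PAGE_SIZE)
    page = f.read(PAGE_SIZE)
    if len(page) != PAGE_SIZE:
        raise EOFError("page is past the end of the file")
    return page


# An in-memory file stands in for the disk here.
disk = io.BytesIO()
write_page(disk, 2, b"hello")
first_bytes = read_page(disk, 2)[:5]
```

The same two functions work on a file opened with `open(path, "r+b")`, since they only rely on `seek`, `read`, and `write`.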

Most of the time we design the page to be self-contained, meaning that a page carries some meta information about itself, such as the page size, page type, and page id. This holds in most cases, though not all. I bring it up because I think of the page as a primitive for constructing storage systems: if we master its principles, we can design and implement a storage system for a specific use.
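One possible shape for a self-contained page is a small fixed header followed by the payload. The field order and widths below (page size, page type, page id, payload length) are assumptions for this sketch, not the layout of any particular engine.

```python
import struct

PAGE_SIZE = 4096
# Hypothetical header: page size (u16), page type (u8), page id (u32),
# payload length (u16). "<" means little-endian with no padding.
HEADER = struct.Struct("<HBIH")


def encode_page(page_id: int, page_type: int, payload: bytes) -> bytes:
    """Build one full page: header, payload, then zero padding."""
    if HEADER.size + len(payload) > PAGE_SIZE:
        raise ValueError("payload does not fit in one page")
    head = HEADER.pack(PAGE_SIZE, page_type, page_id, len(payload))
    return (head + payload).ljust(PAGE_SIZE, b"\x00")


def decode_page(page: bytes):
    """Return (page_id, page_type, payload) using only the page itself."""
    size, page_type, page_id, length = HEADER.unpack_from(page, 0)
    if size != len(page):
        raise ValueError("page size in header does not match buffer")
    payload = page[HEADER.size:HEADER.size + length]
    return page_id, page_type, payload


page = encode_page(page_id=23, page_type=1, payload=b"tuple")
```

Since the header travels with the page, a reader that is handed a raw page can identify it without consulting any external catalog.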

Let’s talk more about pages. Pages are usually indexed by a directory, which records the position and the data range of every page. For example, page No. 23 is stored at position 42312 (or in slot #5), and its range is apple ~ application. But not all page content is ordered sequentially; it can also be arranged by hash. Think of MySQL, which provides both B+ tree indexes and hash indexes. The main difference is that a sequential structure supports range scans, while a hash index can be faster for point lookups. Using a directory to index pages is called a page heap.
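The sketch below shows the two arrangements side by side: an ordered directory that maps key ranges to pages and supports range scans, and a hash function that maps a key straight to a bucket. The class and function names are hypothetical, and the second directory entry is made up to give the example something to scan.

```python
import bisect
import zlib
from dataclasses import dataclass


@dataclass
class DirEntry:
    first_key: str
    last_key: str
    page_id: int
    offset: int  # byte position of the page on disk


class PageDirectory:
    """Ordered directory: each entry covers the key range [first_key, last_key]."""

    def __init__(self, entries):
        self.entries = sorted(entries, key=lambda e: e.first_key)
        self._firsts = [e.first_key for e in self.entries]

    def find(self, key):
        """Binary-search for the page whose range contains key, or None."""
        i = bisect.bisect_right(self._firsts, key) - 1
        if i >= 0 and key <= self.entries[i].last_key:
            return self.entries[i]
        return None

    def scan(self, lo, hi):
        """Return the pages whose ranges overlap [lo, hi], in key order."""
        return [e for e in self.entries if e.first_key <= hi and e.last_key >= lo]


def hash_bucket(key: str, n_buckets: int) -> int:
    """Hash arrangement: direct lookup, but neighbouring keys scatter."""
    return zlib.crc32(key.encode()) % n_buckets


directory = PageDirectory([
    DirEntry("apple", "application", 23, 42312),
    DirEntry("apply", "banana", 24, 46408),
])
hit = directory.find("applet")  # falls inside apple ~ application
```

With the hash arrangement there is no cheap equivalent of `scan`, because keys that are adjacent in sort order generally land in unrelated buckets.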

To track a page in the directory, we need to record what the page contains and whether the page is empty.
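A minimal sketch of that bookkeeping, assuming the directory only tracks how many bytes of each page are in use. The class name `FreeSpaceDirectory` and the first-fit policy are illustrative choices, not taken from a specific engine.

```python
PAGE_SIZE = 4096


class FreeSpaceDirectory:
    """Tracks, per page, how many bytes are used so free pages can be found."""

    def __init__(self):
        self.used = {}  # page_id -> bytes in use

    def add_page(self, page_id: int) -> None:
        self.used[page_id] = 0

    def is_empty(self, page_id: int) -> bool:
        return self.used[page_id] == 0

    def free_bytes(self, page_id: int) -> int:
        return PAGE_SIZE - self.used[page_id]

    def reserve(self, size: int):
        """First fit: return the lowest page id with room for size bytes."""
        for page_id in sorted(self.used):
            if self.free_bytes(page_id) >= size:
                self.used[page_id] += size
                return page_id
        return None  # caller would have to allocate a new page


directory = FreeSpaceDirectory()
directory.add_page(0)
directory.add_page(1)
target = directory.reserve(4000)  # lands in page 0, the first page with room
```

Keeping this information in the directory means an insert can pick a target page without reading every page from disk first.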

transaction

implementation: innodb

implementation: leveldb

graph LR
    A(user) -- put/get --> B(leveldb)
    B --> C(memtable)
    B --> D(sstable)
    B --> E(write ahead log)
    D --> F(disk)
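The flow in the diagram can be imitated with a toy in-memory model. This is only a sketch of the general memtable / write-ahead log / sstable idea, not LevelDB's actual code or API: the class `ToyLSM`, the tiny flush threshold, and the use of Python lists in place of on-disk files are all assumptions.

```python
import bisect


class ToyLSM:
    """Toy model: puts hit the log and memtable; full memtables become sstables."""

    def __init__(self, memtable_limit: int = 2):
        self.wal = []        # stand-in for the on-disk write-ahead log
        self.memtable = {}   # recent writes, held in memory
        self.sstables = []   # oldest first; each is a sorted list of (key, value)
        self.memtable_limit = memtable_limit

    def put(self, key, value) -> None:
        self.wal.append((key, value))  # log first, so the write can be replayed
        self.memtable[key] = value
        if len(self.memtable) >= self.memtable_limit:
            self._flush()

    def _flush(self) -> None:
        """Turn the memtable into an immutable sorted table and reset the log."""
        self.sstables.append(sorted(self.memtable.items()))
        self.memtable = {}
        self.wal = []

    def get(self, key):
        if key in self.memtable:
            return self.memtable[key]
        for table in reversed(self.sstables):  # newest table wins
            i = bisect.bisect_left(table, (key,))
            if i < len(table) and table[i][0] == key:
                return table[i][1]
        return None


db = ToyLSM(memtable_limit=2)
db.put("a", 1)
db.put("b", 2)  # second put fills the memtable and triggers a flush
db.put("a", 3)  # newer value shadows the one in the sstable
```

Reads check the memtable first and then the sorted tables from newest to oldest, which is why the later value for `"a"` is the one returned.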

