Karan Sharma

Ramblings on tech, cycling, and finance

18 Dec 22

Writing a disk-based key-value store in Golang

I'd been mulling around reading a computer science paper and implementing a project based on it. Distributed systems, Networking and Databases are some of the things that fascinate me a lot. However, I had been looking to implement a more approachable project to avoid getting inundated initially. And I happened to chance upon the Bitcask paper through Avinash's project: CaskDB.

After giving a quick read of this reasonably short paper, I decided to write a Golang implementation of the same, as it looked like an exciting project. If you're interested in checking out the complete project, checkout BarrelDB.


Bitcask is a disk-based key-value storage engine designed for fast read and write operations. It is mainly in production use by Riak (which is a distributed database) as one of the storage engines. Bitcask under the hood has a straightforward yet clever design. It writes to the file in an append-only mode. This means that writes are performed only by appending to the end of the file, thus avoiding the need to perform any random disk I/O seek.

Let's look at various components of Bitcask:

Format of the record

image

  • CRC: Stores the checksum of the value to ensure data consistency
  • Timestamp: Timestamp in UNIX format, stored as int32.
  • Expiry: If the record has an expiry defined, then the timestamp, in UNIX format, is stored as int32.
  • Key Size: Size of the key in bytes
  • Value Size: Size of the value in bytes
  • Key
  • Value

This additional metadata stored alongside the key/value is represented with a fixed-width header. Each field is represented as int32, so the total size of the header is 4*5 = 20 bytes. Here's the code which encodes and decodes this record:

type Record struct {
    Header Header
    Key    string
    Value  []byte
}

// Header represents the fixed width fields present at the start of every record.
type Header struct {
    Checksum  uint32
    Timestamp uint32
    Expiry    uint32
    KeySize   uint32
    ValSize   uint32
}

// Encode takes a byte buffer, encodes the value of header and writes to the buffer.
func (h *Header) encode(buf *bytes.Buffer) error {
    return binary.Write(buf, binary.LittleEndian, h)
}

// Decode takes a record object decodes the binary value the buffer.
func (h *Header) decode(record []byte) error {
    return binary.Read(bytes.NewReader(record), binary.LittleEndian, h)
}

The record is encoded in the binary format before storing it on the disk.

Datafile

A "datafile" (term used for the DB file on disk) is an append-only record of all the write operations. An instance of Bitcask can have several datafiles. However, there's only one "active" datafile. In BarrelDB, a goroutine runs in the background at regular intervals to check if the size of the active DB file has crossed the threshold and then rotates the active file. It appends this DB file to the list of "stale" data files. All the new writes only happen to the "active" data file, and the stale files are merged as a part of the "compaction" process (described later in the post).

image

Here's how a datafile is represented:

type DataFile struct {
    sync.RWMutex

    writer *os.File
    reader *os.File
    id     int

    offset int
}

It contains different handlers for writing and reading the file. The reason we have 2 file handlers instead of re-using the same one is that the writer is only opened in an "append-only" mode. In addition, since the active file can be rotated, the writer can be set to nil, ensuring no new writes ever happen on that file.

    writer, err := os.OpenFile(path, os.O_APPEND|os.O_CREATE|os.O_WRONLY, 0644)
    if err != nil {
        return nil, fmt.Errorf("error opening file for writing db: %w", err)
    }

    // Create a reader for reading the db file.
    reader, err := os.Open(path)
    if err != nil {
        return nil, fmt.Errorf("error opening file for reading db: %w", err)
    }

KeyDir

In addition to storing the file on disk, Bitcask also stores additional metadata, which defines how to retrieve the record. This hashtable is a map of keys with this metadata and is referred to as KeyDir. An important point to note here is that the value is never stored in this map. This makes it possible for Bitcask to handle datasets more significant than what the RAM can hold.

// KeyDir represents an in-memory hash for faster lookups of the key.
// Once the key is found in the map, the additional metadata, like the offset record
// and the file ID is used to extract the underlying record from the disk.
// Advantage is that this approach only requires a single disk seek of the db file
// since the position offset (in bytes) is already stored.
type KeyDir map[string]Meta

// Meta represents some additional properties for the given key.
// The actual value of the key is not stored in the in-memory hashtable.
type Meta struct {
    Timestamp  int
    RecordSize int
    RecordPos  int
    FileID     int
}

Here, RecordPos tells the record's position offset (in bytes) in the entire file. Since the position of the record is stored in memory along with the key, the retrieval of the key doesn't require more than a single disk seek. Bitcask achieves really low latency even with many keys in the database. A file system read-ahead cache also helps boost the performance and comes for free - no need to design a separate caching mechanism.

image

Compaction

As we looked at previously, a datafile is simply an append-only sequence of writes. Any modification of the key is merely a new record appended to the datafile. KeyDir overwrites the entry of the key with the new metadata, which contains the new location of the record. Thus all reads will automatically return the updated value.

Deletes are handled similarly by writing a "tombstone" record for the key. When the user requests the key after it's been deleted, BarrelDB can check whether that value equals the tombstone value and return an appropriate error.

As you would have guessed, our database will grow unbounded if we don't perform any garbage cleanup. The datafiles need to be pruned for deleting expired/deleted records and merging all stale files in a single active file - to keep the number of opened files in check. All of these processes are together called "Compaction".

Let's take a look at how each of these compaction routine works under the hood:

Merge

The merge process iterates over all the keys inside KeyDir and fetches their value. The value could come from a stale file as well. Once the new keys/values are updated, it writes them to a new active file. All the old file handlers are closed, and the stale files are deleted from the disk. The KeyDir is updated similarly since the new records live in a different position/file.

Hints File

Bitcask paper describes a way of creating a "hints" file initially loaded in the database for faster startup time. This file is essential to bootstrap KeyDir after a cold startup. This avoids iterating over all data files and reading their values sequentially. In BarrelDB, gob encoding is used to dump the KeyDir map as a gob dump.

// generateHints encodes the contents of the in-memory hashtable
// as `gob` and writes the data to a hints file.
func (b *Barrel) generateHints() error {
    path := filepath.Join(b.opts.dir, HINTS_FILE)
    if err := b.keydir.Encode(path); err != nil {
        return err
    }

    return nil
}

During the startup, BarrelDB checks the presence of a .hints file, decodes this gob dump, and loads the data in KeyDir.

Removing expired keys

A goroutine runs at a configurable interval to check if the value of the key has expired. If it has, it deletes the entry from KeyDir. During the following merge process, since this entry won't be present in KeyDir, it'll automatically be removed when the new datafile is created.

To check if the key has expired, a simple check, like comparing their timestamps in UNIX epoch format, is enough: time.Now().Unix() > int64(r.Header.Expiry).


Redis Server

In addition to using BarrelDB as a Go library, I also implemented a redis-compatible server. I found tidwall/redcon as an easy-to-use library to create a Redis-compatible server for Go applications. All I'd do was wrap BarrelDB API methods and define handlers for SET / GET.

I was able to use redis-cli and connect to the BarrelDB server:

127.0.0.1:6379> set hello world
OK
127.0.0.1:6379> get hello
"world"

Benchmarks

You can check the repo for the actual benchmarks. However, I'd like to point out some inferences of the results from redis-benchmark.

First, let's send 100000 requests to the server using 50 parallel clients. This command creates a unique key for each SET operation.

redis-benchmark -p 6379 -c 50 -t set -n 100000 -r 100000000

Summary:
  throughput summary: 145985.41 requests per second
  latency summary (msec):
          avg       min       p50       p95       p99       max
        0.179     0.016     0.183     0.207     0.399     1.727

So, 140k requests per second is not bad at all for a disk-based KV. But the exciting thing to note here is that the performance is predictable even if you increase the load by increasing clients:

redis-benchmark -p 6379 -c 200 -t set -n 100000 -r 100000000

Summary:
  throughput summary: 140845.08 requests per second
  latency summary (msec):
          avg       min       p50       p95       p99       max
        0.718     0.224     0.711     0.927     1.183     5.775

If we increase the number of requests (by 5x) as well, the throughput looks almost the same:

redis-benchmark -p 6379 -c 200 -t set -n 500000 -r 100000000

Summary:
  throughput summary: 138350.86 requests per second
  latency summary (msec):
          avg       min       p50       p95       p99       max
        0.748     0.056     0.711     0.879     1.135    63.135

This magic is all because of the way Bitcask uses Log structured hash table (just append-only records for writing data). Even with a lot of records, all it has to do is to write to the end of the file, which avoids any expensive I/O operations.

Summary

Overall, I am happy with BarrelDB implementation as I covered everything described in the paper. This project had excellent learning outcomes for me. I spent a lot of time coming up with a design for structuring different components and their API methods and handling all the edge scenarios during the compaction process. Although, full credit to Bitcask as it kept its design so elegant and minimal yet achieved some significant numbers in the benchmarks. This is also a reminder that simple need not necessarily mean less powerful.

I look forward to implementing a distributed KV store by adding support for multiple BarrelDB nodes connected via Raft. For now, gonna enjoy some chai and release this project to the WWW :)

Fin!