PipelineDB's Engineering Workflow


Engineering a database product is a unique and challenging endeavour. I began using databases as an engineer long before I started hacking on them, and I often wondered how these highly robust (well, usually) black boxes were developed. How are code changes applied and run by the engineer? How does testing work? How do you interactively step through the execution of a query, or track down the cause of a single corrupted bit on a remote server?

In this post, we'll share some of the high-level aspects of PipelineDB's engineering process, philosophy, and tooling as well as a few of the important lessons we've learned along the way.

Development

As a small engineering team, our development process has very little structure. We meet once every two weeks, typically for less than an hour, to discuss how the previous two weeks of work went and to identify what specifically should be completed in the next two-week sprint. Long-term projects that obviously can't be completed in a two-week timeframe get broken down into chunks that can be. Having an immediate focus on what needs to be finished helps us maintain a consistent, measured pace of development and protects us from our natural tendency to procrastinate.

GitHub is our primary organizational tool for development. All product work is organized into issues and milestones, and all new code is reviewed as a pull request from a feature branch into master. Once a pull request's diff has been reviewed and polished, all commits on the feature branch are squashed into a single commit which is then merged into master. We find that this approach makes it easier to quickly examine how code has evolved over time. Squashing feature-branch commits also makes it less likely that a work-in-progress commit will leave the source in an unstable state at that commit, which can complicate searching for when a bug was introduced.

We also use Slack quite a bit. We push all of our builds, tests, releases, deployments, and GitHub activity to various channels, which is conducive to good situational awareness for the entire team. It's also used for sharing restaurant recommendations, music, books, and general intellectual curiosities.

Each of us uses slightly different sets of tools to write code, but our team does have underlying philosophies about how that code should be written. Principally, correctness and stability are more important to us than performance. When building a database it's often tempting to be clever about performance at every possible opportunity, which can lead to code that is either overly complex or dangerously simple. In practice it is very difficult to make something both optimal and correct on the first pass. But once code is obviously correct, it is much easier to reason about how it can be made as fast as possible. The converse probably isn't true.

We also believe in frequent refactors. Refactors are highly profitable long-term product investments because they fortify the foundation upon which new things are built. We probably spend about a quarter of our time refactoring code, and as a result have very little technical debt.

The general process of writing and running code is pretty simple. After compiling and launching the latest server build with make run, we'll typically run a canonical query or specific test against the server, making it increasingly complex as the functionality becomes more complete:

pipeline -f some_test_script.sql

Logging is also very useful here, as the console's output is often the only immediate insight into what various processes are doing. It's not unusual for one of our feature branches to be littered with logging statements that ultimately get removed before merging into master. It's a primitive but tried-and-true approach to basically any type of development. Once the functionality is closer to completion, more exhaustive testing can be introduced.

Testing

Testing is probably where building a database differs most from building other, more common types of software. PipelineDB is based on PostgreSQL, so let's first look at Postgres' approach to testing.

PostgreSQL is mostly tested with what would typically be called integration tests [0]. That is, most of its tests verify the output of SQL statements run against an actual instance of PostgreSQL. A test is simply a .sql file containing arbitrary SQL statements, which is executed with a psql client against a PostgreSQL database named "regression". The output generated by psql is then compared (simply using diff) to a .out file containing the expected output of the corresponding .sql file. For example, here is the test for PostgreSQL's INSERT statement, and here is its expected output. Assertions are also used liberally throughout the code, which is effective at catching programmer errors.

PostgreSQL's approach to testing is remarkably simple. It may be surprising to learn that such a famously stable database is largely tested using little more than psql, diff, and PostgreSQL itself. But this simplicity provides realism. Each test involves executing the same code paths as a live deployment, complete with all of the various layers of the system interoperating with each other under realistic conditions. Given that a database's core purpose is to correctly respond to queries with results, this strategy makes a lot of sense. Many of PipelineDB's tests use this approach.
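
To make the mechanism concrete, here's a minimal sketch of the run-and-diff idea in Python. To be clear, this is not PostgreSQL's actual driver (that's pg_regress, which adds scheduling, setup, and reporting on top); the file and database names are just illustrative:

import subprocess
import sys

def run_sql_test(sql_file, expected_file, dbname="regression"):
    # Run the test file through psql and capture everything it prints
    result = subprocess.run(["psql", "-X", "-d", dbname, "-f", sql_file],
                            capture_output=True, text=True)

    # Compare psql's output against the expected .out file with diff
    diff = subprocess.run(["diff", "-u", expected_file, "-"],
                          input=result.stdout, capture_output=True, text=True)
    if diff.returncode != 0:
        print(diff.stdout)
        return False
    return True

if __name__ == "__main__":
    sys.exit(0 if run_sql_test("insert.sql", "expected/insert.out") else 1)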

But the psql -f test.sql | diff - test.out approach does have drawbacks. First, these tests can be brittle: any changes to logging, error messages, or system tables will likely break tests. Second, using SQL to write tests isn't conducive to writing long-running or randomized tests, which are both effective ways to test a database. To address this, we've added additional test suites to PipelineDB.

The most heavily used test suite we've added is based on pytest. PipelineDB's pytests still use a running instance of PipelineDB, but they aren't susceptible to the brittleness of comparing all database output to a predefined .out file. They also allow us to easily write expressive tests with more randomization and complexity than we would be able to achieve with the same amount of work using a SQL test. Here is an example of a PipelineDB pytest.
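
The real suite has its own fixtures for starting and stopping PipelineDB instances, but a hypothetical test in the same spirit looks something like this (the connection settings, stream, and view names here are made up):

import random
import time

import psycopg2
import pytest

# Hypothetical connection string; the real suite manages its own instances
DSN = "dbname=pipeline host=localhost port=5432"

@pytest.fixture
def db():
    conn = psycopg2.connect(DSN)
    conn.autocommit = True
    yield conn
    conn.close()

def test_grouped_count(db):
    cur = db.cursor()
    cur.execute("CREATE CONTINUOUS VIEW test_count AS "
                "SELECT x::integer, COUNT(*) FROM test_stream GROUP BY x")

    # Write a randomized number of events to the stream
    n = random.randint(1000, 10000)
    for _ in range(n):
        cur.execute("INSERT INTO test_stream (x) VALUES (%s)",
                    (random.randint(0, 100),))

    # Crude way to let the continuous query workers catch up; the real
    # suite handles synchronization more carefully
    time.sleep(1)

    # The per-group counts should sum to exactly the number of events written
    cur.execute("SELECT sum(count) FROM test_count")
    assert cur.fetchone()[0] == n

    cur.execute("DROP CONTINUOUS VIEW test_count")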

We run tests locally during development, but database tests can be quite resource intensive so we offload a lot of work to CircleCI. CircleCI runs tests for every commit to every branch and integrates nicely with Slack and GitHub, so it's great for testing during development. It also only took about 30 seconds to initially set up, which surprised us because we are probably not their typical user.

Our heavier-duty test runner is built on Buildbot, which we love. Once per night, Buildbot builds packages with optimized binaries, verifies the packages, installs them, and then runs tests against the installation. These optimized package tests are run on a variety of different platforms. Typically only very subtle, platform-dependent bugs are uncovered by the time tests are run on Buildbot.
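
In Buildbot, this kind of nightly run can be driven by little more than a timed scheduler pointed at the packaging builders. A rough sketch, with illustrative builder names rather than our actual configuration:

from buildbot.schedulers.timed import Nightly

# Kick off the optimized package builds and their test runs once per night
c['schedulers'].append(Nightly(
    name='nightly-packages',
    branch='master',
    builderNames=['ubuntu14-deb', 'centos7-rpm', 'osx-pkg'],
    hour=2, minute=0))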

PipelineDB's Buildbot tests are powered by EC2, using the fantastic EC2LatentBuildSlave. This build slave implementation will launch an EC2 instance, run the given build steps, and then shut it down. Configuring it is very simple; here's what our setup looks like:

USER_DATA = """#! /bin/bash
echo "%s" > /home/buildslave/platform
mkdir /usr/lib/pipelinedb
chown buildslave:buildslave /usr/lib/pipelinedb
PATH=$PATH:/usr/local/bin buildslave create-slave /home/buildslave %s "%s" %s
chown -R buildslave:buildslave /home/buildslave
su -l buildslave -c "buildslave start"
"""

c['slaves'] = [
    EC2IAMLatentBuildSlave(
        # How this build slave identifies itself to the build master
        'Ubuntu 14 DEB', secret('ubuntu14'),

        # What type of EC2 instance this slave should run on
        'm3.xlarge', 'us-west-2', 'ami-12345678',

        # This user data will launch a build slave instance that will begin
        # communicating with the build master instance
        user_data=USER_DATA % ('ubuntu14', MASTER_HOST, 'Ubuntu 14 DEB', secret('ubuntu14'))
    ),
    # ...
]

EC2IAMLatentBuildSlave is our own modified implementation of EC2LatentBuildSlave, the only modification being that it uses the IAM role associated with the buildmaster's EC2 instance, making it unnecessary to explicitly use keys anywhere. We also use Buildbot to auto-deploy documentation, allowing us to edit docs directly from GitHub and see the changes live at docs.pipelinedb.com a few seconds later. We even have a Mac OS X box in our SF office that Buildbot uses as an OS X build slave.

Although we are happy with the thoroughness of our tests, something we've learned over time is that it's more important to be able to quickly understand what's happening when something breaks than it is to strive for perfect test coverage. Note that 100% test coverage is not the same thing as 100% correct code. Something will eventually break in the most mysterious and unlikely of ways, so good diagnostic tools are critical. One such tool that has saved us more than once is provided by our internal stress-testing framework, juggernaut.

In particular, juggernaut supports running a workload in EC2 and will upload all crash data to S3 for postmortem examination. When a nasty bug takes hours of running at maximum load to reproduce, this is extremely useful. Here's what a given session might look like:

First we run the workload in EC2 (the workload can be any executable file):

$ juggernaut.py run repro-some-issue --remote

After running for a few hours and crashing, we can now do a postmortem on the core dump:

$ juggernaut.py debug 20150801000143-0094.tar.gz
downloaded and initialized debug package 20150801000143-0094

The debug package contains all libraries used by PipelineDB on the remote system, so regardless of the system we run gdb on, all of the symbols will be resolved. Downloading it to a local machine also makes it easier to use other tools on the core dump, such as a visual debugger. Now let's look at the core dump:

$ gdb attach 20150801000143-0094/core.5618.@000143

Program received signal SIGSEGV, Segmentation fault.
0x000000000056e8cb in parse_analyze (parseTree=0x0, sourceText=0xab9e0b "INSERT", paramTypes=0x0, numParams=0) at analyze.c:106
106             if (IsA(parseTree, SelectStmt))
(gdb) where
#0  0x000000000056e8cb in parse_analyze (parseTree=0x0, sourceText=0xab9e0b "INSERT", paramTypes=0x0, numParams=0) at analyze.c:106
#1  0x00000000007e4cb9 in pg_analyze_and_rewrite (parsetree=0x0, query_string=0xab9e0b "INSERT", paramTypes=0x0, numParams=0) at postgres.c:665
#2  0x000000000073e1c6 in InsertIntoStream (ins=0x2b54ab0, params=0x0) at stream.c:402
#3  0x00000000007e508d in exec_stream_inserts (ins=0x2b54ab0, pstmt=0x0, params=0x0) at postgres.c:892
#4  0x00000000007e52f2 in exec_simple_query (query_string=0x2b53cf8 "insert into stream (x) values (10);") at postgres.c:996
#5  0x00000000007e9eb3 in PostgresMain (argc=1, argv=0x2b3a728, dbname=0x2b3a588 "derek", username=0x2b3a568 "derek") at postgres.c:4285
#6  0x000000000076b3ae in BackendRun (port=0x2b5ad90) at postmaster.c:4227
#7  0x000000000076aa2f in BackendStartup (port=0x2b5ad90) at postmaster.c:3887
#8  0x0000000000766f5e in ServerLoop () at postmaster.c:1614
#9  0x0000000000766599 in PostmasterMain (argc=9, argv=0x2b39590) at postmaster.c:1261
#10 0x00000000006a967c in main (argc=9, argv=0x2b39590) at main.c:229
(gdb)

We can now step through various stack frames to figure out why parseTree was ultimately NULL when it shouldn't have been. Luckily we haven't had to use this tool very many times, but when we do need it we're glad we have it.
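
juggernaut's internals aren't shown in this post, but the gist of assembling one of these debug packages is simple: bundle the core dump and the binary together with every shared library the binary links against, then push the archive to S3. A rough, hypothetical sketch:

import subprocess
import tarfile

import boto3

def shared_libraries(binary):
    # Parse ldd output to find the resolved paths of every linked library
    out = subprocess.check_output(["ldd", binary], text=True)
    libs = []
    for line in out.splitlines():
        parts = line.split("=>")
        if len(parts) == 2 and parts[1].strip().startswith("/"):
            libs.append(parts[1].split()[0])
    return libs

def upload_debug_package(core_path, binary, bucket, name):
    # Bundle the core, the binary, and its libraries so symbols resolve
    # on whatever machine the package is later downloaded to
    archive = name + ".tar.gz"
    with tarfile.open(archive, "w:gz") as tar:
        tar.add(core_path)
        tar.add(binary)
        for lib in shared_libraries(binary):
            tar.add(lib)
    boto3.client("s3").upload_file(archive, bucket, archive)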

After correctness testing, juggernaut will be used for its primary purpose: performance profiling.

[0] There are a few other types of tests, such as isolation tests. See the PostgreSQL docs for more information.

Profiling

juggernaut is a loose collection of Python scripts designed to easily reproduce PipelineDB workloads for the purpose of consistent profiling. A juggernaut workload is simply an executable file, or a directory containing an executable file named "run". Here's an example of a simple workload:

#! /bin/bash

./execute "CREATE CONTINUOUS VIEW v AS 
 SELECT 
  a::integer, b::integer, c::integer, d::integer,
  e::integer, f::integer, g::integer, COUNT(*) FROM stream
 GROUP BY a, b, c, d, e, f, g"

./psycopg2-emit stream \
--a=int,0,10 \
--b=int,0,10 \
--c=int,0,10 \
--d=int,0,10 \
--e=int,0,10 \
--f=int,0,10 \
--g=int,0,10 \
--batchsize=10000 \
--duration=43200 \
--concurrency=4

./execute "SELECT sum(count) FROM v"
./execute "DROP CONTINUOUS VIEW v"

This workload creates a continuous view with a 7-column grouping and counts the number of times each group is seen. The psycopg2-emit script is used to randomly generate input streams, and here we're telling it to randomly generate 10 million possible groupings and write 10,000 events at a time from 4 threads for a duration of 12 hours. Running workloads in EC2 gives us a wealth of system and database-level metrics that are visualized by Ganglia. We also have a PipelineDB Ganglia plugin that we use to graph statistics from PipelineDB's various stats tables, which gives us great insight into how the system behaves under load over long periods of time and how different tuning parameters can affect performance. Here is an example of a Ganglia overview for a running workload.
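
psycopg2-emit itself isn't shown here, but conceptually it boils down to something like the following sketch (simplified: the real script takes a duration rather than a fixed batch count, parses command-line flags, and handles errors):

import random
import threading

import psycopg2

def emit(stream, columns, batchsize, batches, dsn="dbname=pipeline"):
    conn = psycopg2.connect(dsn)
    conn.autocommit = True
    cur = conn.cursor()
    insert = "INSERT INTO %s (%s) VALUES (%s)" % (
        stream, ", ".join(columns), ", ".join(["%s"] * len(columns)))
    for _ in range(batches):
        # Each column gets a random integer in [0, 10), per the flags above
        rows = [tuple(random.randrange(0, 10) for _ in columns)
                for _ in range(batchsize)]
        cur.executemany(insert, rows)
    conn.close()

# Roughly the invocation above: 4 threads writing batches of 10,000 events
threads = [threading.Thread(target=emit,
                            args=("stream", list("abcdefg"), 10000, 1000))
           for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()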

Long-running performance insight is great, but often we need more granular visibility for things like memory leak detection and CPU optimization. For this, we use valgrind. More specifically, we use the valgrind tools Massif and Callgrind. Massif is a heap memory profiler and is very effective at identifying memory leaks, especially when coupled with massif-visualizer. Can you spot the leak? Similarly, KCachegrind visualizes Callgrind's CPU profiling data.

All of this is put together in our nightly juggernaut build, which is run by Buildbot. It runs a few representative workloads, saves the results to a database, and compares them to the results of the previous build. If an obvious performance regression is detected, the build fails. The juggernaut build also runs memory profiling workloads and analyzes the results to determine if a memory leak may have been introduced, although that's not a very exact science.
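
The comparison against the previous build doesn't need to be sophisticated. Conceptually it's along these lines (the table, column, and workload names are illustrative, not juggernaut's actual schema):

import sys

import psycopg2

THRESHOLD = 0.10  # flag anything more than a 10% drop in throughput

cur = psycopg2.connect("dbname=juggernaut").cursor()
cur.execute("""
    SELECT events_per_sec FROM workload_results
    WHERE workload = %s ORDER BY run_at DESC LIMIT 2
""", ("count-7-col-grouping",))
rows = cur.fetchall()

if len(rows) == 2:
    current, previous = rows[0][0], rows[1][0]
    if current < previous * (1 - THRESHOLD):
        print("possible regression: %.0f events/s vs %.0f on the last build"
              % (current, previous))
        sys.exit(1)  # a non-zero exit status fails the Buildbot step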

Releasing

Ok, the code has been written, tested, and profiled, and it's time to cut a release. Quite a bit of work has been put into making PipelineDB's release process as automated as possible. Here's what releasing looks like, verbatim:

git branch -D release
git checkout master
git checkout -b release
git tag -a 0.8.0
git push origin release --tags --force

As soon as changes are seen on the release branch after pushing, Buildbot triggers release builds on all of our supported platforms. The release builds are actually quite similar to our nightlies, with some differences around how they're versioned. Once the packages are built, they are run through the test suite a final time. When they pass tests, packages are published and made available for public download at pipelinedb.com/download, at which time we'll see something like this in Slack. Debian and RPM packages are also published to PackageCloud, which hosts our apt and yum repositories for us. It's a really well-built service that seamlessly automates the process of creating and publishing to your own yum, apt and gem repos.
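
In Buildbot, that kind of trigger is just a scheduler filtered to the release branch, roughly like this (again with illustrative builder names rather than our actual configuration):

from buildbot.changes.filter import ChangeFilter
from buildbot.schedulers.basic import SingleBranchScheduler

# Build release packages as soon as anything is pushed to the release branch
c['schedulers'].append(SingleBranchScheduler(
    name='release-packages',
    change_filter=ChangeFilter(branch='release'),
    treeStableTimer=None,
    builderNames=['ubuntu14-deb', 'centos7-rpm', 'osx-pkg']))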

Thanks!

Database engineering is a challenging, complex type of development that demands a disciplined approach to maintaining quality and correctness. We hope you enjoyed learning about some of PipelineDB's high-level approaches to these challenges. And of course, if this sounds like something you'd be interested in being a part of, we're always hiring!

Finally, here's a list of the tools we use that were mentioned in this post:

GitHub
Slack
CircleCI
Buildbot
Amazon EC2
pytest
psycopg2
gdb
valgrind (Massif and Callgrind)
massif-visualizer
KCachegrind
Ganglia
PackageCloud