
PipelineDB 0.9.5


PipelineDB 0.9.5 is here, download it now!

Fast Binary Upgrades

Some PipelineDB releases change the schema of the internal system catalogs, which means those releases can't run against data directories created by previous versions. In these cases, it was previously necessary to dump the old database and restore it into a data directory created by the new version. That is no longer necessary.

PipelineDB 0.9.5 ships with the pipeline-upgrade tool, which automatically upgrades system catalog data from an old to a new version of PipelineDB. All other database data is simply copied at the operating-system level, because its on-disk layout will essentially never change. pipeline-upgrade takes the previous and new PipelineDB installations as arguments and performs the upgrade automatically from there. For example:

pipeline-upgrade -b <old bin dir> -d <old data dir> -B <new bin dir> -D <new data dir>
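
To make this concrete, a hypothetical invocation might look like the following (the installation and data directory paths are purely illustrative):

pipeline-upgrade -b /usr/lib/pipelinedb-0.9.4/bin -d /data/pipelinedb-0.9.4 \
                 -B /usr/lib/pipelinedb-0.9.5/bin -D /data/pipelinedb-0.9.5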

Output Streams

The stream of incremental on-disk changes applied to a continuous view as it reads from its input streams can now be accessed as its own stream, and therefore read by other continuous views and transforms for further processing. We call these "output streams".

A common requirement is to have data aggregated at different granularity levels for different periods of time, often with more granularity for more recent data. For example, you may want to keep the last month of data aggregated by hour, and the last year aggregated by day to keep infrequently accessed historical data from growing too large.

With PipelineDB, you could previously accomplish this by maintaining near-identical copies of the same continuous view, differing only in their granularity of summarization. Output streams instead allow you to more elegantly construct a processing tree in which "downstream" continuous views consume the output streams of more granular continuous views. Downstream continuous views then only need to read the aggregate output of upstream continuous views, rather than perform work on raw input events. This can often be a substantial efficiency win.

Let's look at a simple example to illustrate the power of output streams. Note that each output stream event has an old and a new tuple representing a change made to an individual continuous view row:

-- This CV will do nearly all of the work
CREATE CONTINUOUS VIEW v_hourly AS SELECT hour(arrival_timestamp), COUNT(*)
  FROM stream GROUP BY hour;

-- Now simply sum the stream of increments made to the upstream counts
CREATE CONTINUOUS VIEW v_daily AS SELECT day((new).hour), SUM((new).count - (old).count) AS count
  FROM output_of('v_hourly') GROUP BY day;
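
One subtlety: when a group appears in v_hourly for the first time, there is no previous row, so the old tuple will be null (our assumption here, following standard old/new tuple semantics). If so, the delta can be guarded with COALESCE; a minimal sketch:

-- Variant of v_daily that treats a missing old tuple as a count of zero
CREATE CONTINUOUS VIEW v_daily AS SELECT day((new).hour), SUM((new).count - COALESCE((old).count, 0)) AS count
  FROM output_of('v_hourly') GROUP BY day;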

And since output streams are just regular streams, they don't only work with continuous views. Transforms can process them too, so output streams ultimately unify all PipelineDB primitives in an elegant, recursive manner.

Trigger -> Transform + Output Stream

A few months ago we released continuous triggers, which have essentially become a subset of the functionality that output streams enable. We strive to keep PipelineDB as simple as possible and to minimize the surface area of potential code paths, so we decided to remove triggers entirely in favor of continuous transforms on output streams. Let's look at an example of creating a transform on an output stream that is equivalent to a continuous trigger from previous releases.

Here's a simple continuous trigger that calls a function whenever the avg value in a continuous view crosses a threshold of 10:

CREATE CONTINUOUS VIEW v AS SELECT avg(x) FROM stream;

CREATE TRIGGER threshold
  AFTER UPDATE ON v FOR EACH ROW
  WHEN (old.avg < 10 AND new.avg > 10)
  EXECUTE PROCEDURE do_something();

And the transform + output stream equivalent:

CREATE CONTINUOUS VIEW v AS SELECT avg(x) FROM stream;

CREATE CONTINUOUS TRANSFORM threshold AS
  SELECT (new).avg FROM output_of('v') WHERE (old).avg < 10 AND (new).avg > 10
  THEN EXECUTE PROCEDURE do_something();
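
For completeness, do_something() is an ordinary user-defined function. Assuming it follows trigger-function conventions, with each matching row arriving as NEW (a sketch, not something specified above), it might look like:

CREATE OR REPLACE FUNCTION do_something()
RETURNS trigger AS $$
BEGIN
  -- NEW holds the row output by the transform
  RAISE NOTICE 'avg crossed threshold: %', NEW.avg;
  RETURN NEW;
END;
$$ LANGUAGE plpgsql;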

The second approach reuses much more internal functionality than the trigger-based approach, and eliminates an entire abstraction (the continuous trigger). Continuous triggers are thus no longer supported.

ZeroMQ for IPC

PipelineDB 0.9.5 uses ZeroMQ for inter-process communication. Previous versions of PipelineDB used a simple shared-memory message broker, which no longer suited our needs after the introduction of output streams. Output streams make it possible for continuous views to both read from and write to streams, which introduced the possibility of stalling the system under certain saturated states: a process could block waiting for free space in a buffer that it was itself ultimately responsible for draining.

ZeroMQ handles our new requirements quite well, so after evaluating several IPC options (including home-grown implementations), we decided to leverage its battle-hardened implementation. Using socket-based IPC also positions PipelineDB well for doing more over the network as the system matures and evolves.

Removed Support for Inferred Streams

PipelineDB 0.9.5 removes support for inferred streams. Inferred streams made it possible to reference streams in continuous views and transforms without having to explicitly create a schema for them. The original intention was to make it as easy as possible for users to begin running continuous SQL queries on streams. But as PipelineDB evolved, inferred streams became somewhat cumbersome internally, requiring special-casing on various hot paths without much fundamental user benefit.

We thus decided to remove them entirely, so all streams must now be explicitly created. Note that this does not mean your streams must have a rigidly defined schema. Many of our largest users simply use streams with a single json column, and then leverage PipelineDB's built-in JSON support to manipulate schemaless events as desired.
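
For example, a stream can be created with a fixed schema, or with a single json column for schemaless events (the stream and column names here are hypothetical):

-- An explicitly created stream with a fixed schema
CREATE STREAM readings (sensor_id integer, value numeric);

-- Or a single json column for schemaless events
CREATE STREAM events (payload json);

-- Standard json operators work as usual in continuous views
CREATE CONTINUOUS VIEW clicks_by_url AS
  SELECT payload->>'url' AS url, COUNT(*)
  FROM events WHERE payload->>'type' = 'click'
  GROUP BY url;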

Stability, Performance, and Other Improvements

  • Generally increased performance and reduced CPU usage. This is a result of our migration to ZeroMQ, which not only uses fewer processes but also passes messages at the microbatch level. Previously we passed individual events between processes, each of which had a small amount of metadata associated with it. This metadata can be shared across a microbatch of events, so we factored it out and now generally pass batches of messages between processes, which is more efficient.

  • PipelineDB 0.9.5 has a significantly smaller footprint as a result of the removal of several background processes. The removal of triggers eliminated two background processes, and the use of ZeroMQ eliminated our IPC broker process.

  • We added jsonb_agg and jsonb_object_agg for jsonb aggregation support. Analogous json aggregates already existed, so we added these for consistency (see the short sketch after this list).

  • Various minor bug fixes.
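
As a quick illustration of the new jsonb aggregates, here's a sketch with hypothetical stream and column names:

CREATE STREAM metrics (host text, name text, value jsonb);

-- Collapse each host's metrics into a single jsonb object
CREATE CONTINUOUS VIEW metrics_by_host AS
  SELECT host, jsonb_object_agg(name, value) AS metrics
  FROM metrics GROUP BY host;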

Thanks!

We want to thank our growing community of smart, helpful, and thoughtful PipelineDB users who have consistently given amazing insight and feedback to help drive these releases.

Also, a huge thanks to PipelineDB's chief architect @usmanm, who is responsible for most of the good ideas and groundwork put in to achieve the fundamental improvements that come with this release. Thanks, and stay tuned for some big announcements coming up!

And go download PipelineDB!