
PipelineDB 0.8


We're excited to release PipelineDB 0.8 today; download it now! This release includes a complete rework of the query parser and analyzer modules, as well as a number of stability and performance improvements. Some of the highlights are:

5432

The default port has been changed from 6543 to 5432 (the default Postgres port). This was done to avoid some quirks on systems that already have libpq installed.

Support for Composite Data Types

Previously, you could only create probabilistic data structures over single-column references to native data types. That limitation no longer exists, so you can now do something like:

CREATE TYPE user AS (
  username text,
  app_id   int
);

CREATE CONTINUOUS VIEW global_uniques AS
  SELECT hll_agg(user::user) FROM global_stream;

Dealing with composite data types can be a little cumbersome though, so we also support using multiple columns without creating an explicit TYPE:

CREATE CONTINUOUS VIEW global_uniques AS
  SELECT hll_agg((username::text, app_id::int)) FROM global_stream;

This support extends to aggregates that internally use these data structures, e.g. count(DISTINCT ...).
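As a quick sketch of what that looks like (the view name here is just illustrative), counting distinct (username, app_id) pairs works the same way:

-- count distinct composite values; internally this uses the same
-- probabilistic structure as hll_agg above
CREATE CONTINUOUS VIEW distinct_users AS
  SELECT COUNT(DISTINCT (username::text, app_id::int)) FROM global_stream;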

There was also a bug in the way text types were hashed that could cause the same text input value to hash to different values. This release fixes that bug as well.

stream_insertion_commit_interval

We got a few complaints about the disk usage of continuous views constantly increasing, and about newly created continuous views sometimes not seeing any of the tuples being inserted.

The reason behind this was Postgres' use of MVCC for concurrency control. Every update in Postgres essentially creates a new version of the tuple, with the older one marked for deletion. However, an older version of a tuple can't be garbage collected until all transactions that can see it have completed. If you're inserting into a stream from a long-running transaction, the bloat generated by MVCC will not be cleaned up. And since continuous views are extremely write heavy, having long-running transactions will generate a lot of bloat.

Continuous views not seeing data turned out to be a problem caused by the same thing. All of our query metadata is stored in catalog tables, so a newly created continuous view will only be visible to transactions that start after the CREATE CONTINUOUS VIEW query has committed. If you have long-running transactions writing to streams, they won't see any continuous views added concurrently until they commit and start a new transaction.
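To make that concrete, here's a hypothetical session (the stream and column names are just for illustration):

BEGIN;
INSERT INTO global_stream (username, app_id) VALUES ('alice', 1);

-- Meanwhile, another session creates and commits a new continuous view
-- over global_stream. Per the visibility rules described above, this
-- still-open transaction doesn't see it, so...

INSERT INTO global_stream (username, app_id) VALUES ('bob', 2);
COMMIT;

-- Only inserts made by transactions that begin after the view's creation
-- committed will be reflected in the new continuous view.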

stream_insertion_commit_interval lets you specify an interval after which stream insertions automatically commit. The default value is 1 second. This behavior can be disabled by setting the value to -1.
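Here's a sketch of how you might tune it, assuming the setting behaves like a standard configuration parameter that accepts a time value (check the docs for the exact units it accepts):

-- commit stream insertions every 5 seconds instead of the default 1 second
SET stream_insertion_commit_interval TO '5s';

-- disable the automatic commits entirely
SET stream_insertion_commit_interval TO -1;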

Smarter Step-sizes for Sliding Window Queries

Sliding window queries work by splitting the query result into finer-grained groups along the time axis and aggregating over them at read time. Previously, the default step-size was 1 second, so a query like:

CREATE CONTINUOUS VIEW v AS
  SELECT x::int FROM stream WHERE arrival_timestamp > clock_timestamp() - interval '5 hour';

would need 18000 buckets keyed on arrival timestamp to store the partial aggregates for the data in a 5-hour window (5 hours × 3600 seconds per hour). The cost of this is that more work needs to happen at read time, as well as potentially using 18000x the space. Now we're a little more aggressive and pick a step-size one granularity level smaller than the window size (so minutes for a window specified in hours, which cuts this example down to 300 buckets).

Eventually, we want to introduce a fan-out factor and dynamically generate step-sizes, giving simpler control over these costs.

Oh, and you can explicitly change this granularity by using a date truncation function on the time column. For your convenience, we added simpler alternatives to date_trunc: second, minute, hour and day.
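As a sketch of what that looks like (the view name is just illustrative, and exactly how the truncation interacts with the sliding-window clause may vary), you could bucket the window above by hour instead of relying on the default step-size:

-- explicitly truncate the time column to hour granularity
CREATE CONTINUOUS VIEW v_hourly AS
  SELECT hour(arrival_timestamp), count(*)
  FROM stream
  WHERE arrival_timestamp > clock_timestamp() - interval '5 hour'
  GROUP BY hour;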

Parser & Analyzer Rework

Most of the parser work was done during my first few weeks at PipelineDB. Needless to say, the code was pretty sloppy because I knew nothing about Postgres internals. This rewrite was pretty necessary, and I'm glad we got it in sooner rather than later. Apart from making sure PipelineDB no longer chokes on some valid queries, the rework also reduces the disk usage of continuous views. Unfortunately, as a byproduct of this change, data directories from previous versions are incompatible, so you'll have to initialize a fresh database using pipeline-init.

Well, that's it for this version. Please let us know if you come across any issues or want some new features for the next release!