
PipelineDB 0.8.1

PipelineDB 0.8.1 is here, download it now! Some of the highlights of this release are:


Native Kafka Support

A lot of our users requested an easier way to integrate PipelineDB with Kafka. This release comes with native Kafka ingestion support in the form of an extension called pipeline_kafka. Check out Derek's blog post to learn more about how it works!
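As a rough sketch of what ingestion looks like with the extension (the broker address, topic, stream, and view names here are hypothetical, and the exact function signatures may differ--see Derek's post and the pipeline_kafka docs for the authoritative usage):

CREATE EXTENSION pipeline_kafka;

-- Register a Kafka broker to consume from (hypothetical host:port)
SELECT kafka_add_broker('localhost:9092');

-- A continuous view reading from a stream, as usual
CREATE CONTINUOUS VIEW clicks AS SELECT count(*) FROM click_stream;

-- Begin consuming a topic's messages into that stream
SELECT kafka_consume_begin('clicks_topic', 'click_stream');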

Filtered Space Saving for Top-K Queries

Support for Top-K queries is finally here. We implemented the Filtered Space Saving algorithm, which combines sketch- and counter-based techniques to create a space-efficient summary that can be used to answer Top-K queries. The following example keeps track of the top 3 values of x seen for each value of k.

CREATE CONTINUOUS VIEW v AS SELECT k::text, fss_agg(x::int, 3) FROM stream GROUP BY k;

INSERT INTO stream (k, x) VALUES ('a', 1), ('a', 1), ('a', 1), ('a', 1), ('a', 1);
INSERT INTO stream (k, x) VALUES ('a', 2), ('a', 2), ('a', 2);
INSERT INTO stream (k, x) VALUES ('a', 3), ('a', 3), ('a', 3), ('a', 3);
INSERT INTO stream (k, x) VALUES ('a', 4);
INSERT INTO stream (k, x) VALUES ('a', 5), ('a', 5), ('a', 5), ('a', 5), ('a', 5), ('a', 5);
INSERT INTO stream (k, x) VALUES ('b', 100), ('b', 100), ('b', 100);
INSERT INTO stream (k, x) VALUES ('b', 99), ('b', 99), ('b', 99), ('b', 99), ('b', 99), ('b', 99);
INSERT INTO stream (k, x) VALUES ('b', 101), ('b', 101), ('b', 101);
INSERT INTO stream (k, x) VALUES ('b', 102), ('b', 102), ('b', 102);

SELECT k, fss_topk(fss_agg) FROM v;
 k |   fss_topk
---+--------------
 a | {5,1,3}
 b | {99,100,101}
(2 rows)

This is an initial implementation of the FSS algorithm, so it might not be super performant yet; we will keep improving it to make it more efficient. For now, our implementation only supports fixed-length data types. Any Top-K algorithm requires storing some number of raw values of the items being tracked, so it can potentially become prohibitively space intensive for variable-length values. But if your use case requires storing variable-length values, let us know!

Sliding Window Improvements

We've made a number of improvements to sliding window queries. A few issues were causing vacuuming of expired tuples to fail, which led to unbounded disk usage--this should no longer be a problem. We've also added a storage parameter for continuous views called max_age, which provides a simpler way to define sliding window queries. The following two views are equivalent:

CREATE CONTINUOUS VIEW v AS SELECT count(*) FROM stream WHERE arrival_timestamp > clock_timestamp() - INTERVAL '1 minute';

CREATE CONTINUOUS VIEW v WITH (max_age = '1 minute') AS SELECT count(*) FROM stream;

Lastly, you can now create new sliding window views with a different window size on top of existing views. Previously, multiple continuous views had to be created for this, which caused the system to do more work and consume extra disk space. For example, if you want both a one-hour and a 15-minute window over the same query, you can now do something like:

CREATE CONTINUOUS VIEW v_1h WITH (max_age = '1 hour') AS SELECT count(*) FROM stream;

CREATE VIEW v_15m WITH (max_age = '15 minute') AS SELECT * FROM v_1h;

You can also use arrival_timestamp instead of max_age to describe the sliding window. All continuous views with sliding windows now expose an arrival_timestamp column.
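For instance, the 15-minute overlay view above could equivalently be written with an explicit arrival_timestamp predicate instead of the max_age parameter (a sketch; the v_5m name is hypothetical):

-- A narrower window over an existing sliding window view, expressed
-- with the arrival_timestamp column rather than max_age
CREATE VIEW v_5m AS
  SELECT * FROM v_1h
  WHERE arrival_timestamp > clock_timestamp() - INTERVAL '5 minutes';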

Performance Improvements & Bug Fixes

We added an EXPLICIT representation to our HyperLogLog implementation and tuned the way we union them so that HyperLogLogs for low cardinality sets are much more compact and performant. This also means that count(DISTINCT ...) queries where the count is small are much faster now. Space and performance improvements have also been made to our t-digest implementation (and consequently percentile_cont(...) queries) when the number of items inserted is low.
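To illustrate the kind of query that benefits (a generic sketch; user_stream and user_id are hypothetical names):

-- count(DISTINCT ...) is backed by HyperLogLog, so low-cardinality
-- groups now use the compact EXPLICIT representation
CREATE CONTINUOUS VIEW uniques AS
  SELECT day::date, count(DISTINCT user_id::int)
  FROM user_stream GROUP BY day;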

The combiner process now caches tuples in an LRU fashion, so looking up frequently updated tuples is significantly faster. This should improve performance across the board. A critical bug that caused some groupings to have duplicate entries in a continuous view has also been fixed in this release.

We added the stream_commit_interval configuration option in the previous release as a means to avoid long-running transactions that insert into streams. It ended up introducing some bugs, so it has been removed in this release.