← Back to all posts

PipelineDB 0.9.2


PipelineDB 0.9.2 is here, download it now!

Non-INNER stream-table JOINs

Stream-table joins now support LEFT and RIGHT JOINs as well. However a directional join is only supported if the stream is the relation on that side, so LEFT JOINs require the stream to be of the LHS and RIGHT JOINs require the stream to be on the RHS.

CREATE TABLE t (x int);
INSERT INTO t SELECT generate_series(1, 10) AS x;

CREATE CONTINUOUS VIEW cv_inner AS SELECT count(*) FROM stream JOIN t ON (t.x = stream.x::int);
CREATE CONTINUOUS VIEW cv_left AS SELECT count(*) FROM stream LEFT JOIN t ON (t.x = stream.x::int);
CREATE CONTINUOUS VIEW cv_right AS SELECT count(*) FROM t RIGHT JOIN stream ON (t.x = stream.x::int);

-- This should be accounted for in all CVs
INSERT INTO stream (x) SELECT generate_series(1, 5) AS x;

-- This should only be accounted for in non-INNER JOIN CVs
INSERT INTO stream (x) SELECT generate_series(11, 15) AS x;

SELECT * FROM cv_inner;
 count
-------
     5
(1 row)

SELECT * FROM cv_left;
 count
-------
    10
(1 row)

SELECT * FROM cv_right;
 count
-------
    10
(1 row)

Usability Improvements

continuous_query_ipc_shared_mem now sets the amount of shared memory used per database by continuous query processes, rather than the amount of shared memory used per process. Previously increasing the number of workers or combiners would linearly increase the amount of shared memory needed which was causing failures for some of our users who were running PipelineDB inside containers because enough shared memory wasn't allocated for the container. This change ensures that your system will continue to work with the same amount of shared memory regardless of the number of workers or combiners. It is still a good idea to increase the value of continuous_query_ipc_shared_mem when increasing the number of continuous query processes. We recommend at least 32mb per processes.

When creating a normal view on top of a sliding window continuous view, we now ensure that the max_age of the view being created is large enough compared to the step size of the sliding window continuous view. The way sliding window queries work is that the window is split into smaller steps and at read time all these steps are combined to return the current answer. The number of steps created depends on the value of step_factor. By default the step_factor for a sliding window query is 5, which means that a window is split into 20 smaller steps. So say in the case of max_age = '20 days', the step size will be 1 day. Now if you try to create a view on top of this sliding window query where max_age = '5 minutes', it doesn't make any sense because the data is not being stored with enough granularity. The system will not throw an error if the max_age of the regular view being created is not twice as large as the step size of the underlying continuous view.

Statistics Improvements

Previously, output_rows stats for continuous transform queries wasn't being updated. That is fixed in this release. We also added some average and throughput stats to pipeline_query_stats and pipeline_proc_stats.

\d pipeline_query_stats;
View "pg_catalog.pipeline_query_stats"
    Column     |  Type  | Modifiers
---------------+--------+-----------
 name          | text   |
 type          | text   |
 input_rows    | bigint |
 output_rows   | bigint |
 updated_rows  | bigint |
 input_bytes   | bigint |
 output_bytes  | bigint |
 updated_bytes | bigint |
 tuples_ps     | bigint |
 bytes_ps      | bigint |
 time_pb       | bigint |
 tuples_pb     | bigint |
 errors        | bigint |

\d pipeline_proc_stats;
        View "pg_catalog.pipeline_proc_stats"
    Column     |           Type           | Modifiers
---------------+--------------------------+-----------
 type          | text                     |
 pid           | integer                  |
 start_time    | timestamp with time zone |
 input_rows    | bigint                   |
 output_rows   | bigint                   |
 updated_rows  | bigint                   |
 input_bytes   | bigint                   |
 output_bytes  | bigint                   |
 updated_bytes | bigint                   |
 executions    | bigint                   |
 tuples_ps     | bigint                   |
 bytes_ps      | bigint                   |
 time_pb       | bigint                   |
 tuples_pb     | bigint                   |
 memory        | bigint                   |
 errors        | bigint                   |

tuples_ps and bytes_ps indicate the throughput of the process or view in terms of tuples and bytes per second. time_pb represents the time taken per batch and tuples_pb represents the tuples consumed per batch. All these statistics are averaged over the most recent 10 runs within the past 2 seconds.

memory in pipeline_proc_stats represents the average memory consumed by the process.

Performance Improvements

When copying tuples between different processes, we now copy batches of tuples rather than tuple at a time. This should improvement general throughput of the system. INSERT still copies tuples one at a time, but COPY copies in batches, so you should see max throughput improvement when using COPY.