PipelineDB 0.9.7 - Delta Streams and Towards a PostgreSQL Extension


PipelineDB 0.9.7 has shipped! Download it here.

This release contains some minor but necessary catalog improvements, so to migrate your existing installation to PipelineDB 0.9.7 you'll want to use the binary upgrade tool.

Without further ado, here's what PipelineDB 0.9.7 gives you:

Delta Streams

Delta streams are the most interesting new feature in this release. Output streams, released a couple of versions ago, give you a stream of the incremental changes made to a continuous view in the form of old and new tuples. Delta streams build on this abstraction by introducing a third tuple into the output stream: the delta tuple.
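As a sketch of how this fits together (the view, stream, and column names below are illustrative, based on the output_of() and combine() syntax PipelineDB documents):

```sql
-- A simple per-key count, and a second view that consumes its output stream.
CREATE CONTINUOUS VIEW counts AS
  SELECT x, count(*) FROM event_stream GROUP BY x;

-- (delta).count is the incremental change between the old and new tuples,
-- so combine() can roll those increments up into a grand total.
CREATE CONTINUOUS VIEW total AS
  SELECT combine((delta).count) AS grand_total
  FROM output_of('counts');
```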

Read the full post

PipelineDB 0.9.6


PipelineDB 0.9.6 is here, download it now!

PipelineDB 0.9.6 is primarily a maintenance release, but also includes one important new feature that PipelineDB users have wanted for a while: proper per-row time-to-live (TTL) support for continuous views.

TTL

A very common PipelineDB pattern is continuous aggregation with a timestamp-based column in the aggregation's grouping. For example:
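A hedged sketch of that pattern with the new TTL support, assuming the ttl and ttl_column storage parameters described in the 0.9.6 release (stream and view names are illustrative):

```sql
-- Rows older than one day become eligible for reaping, keyed on the minute column.
CREATE CONTINUOUS VIEW minutely_counts WITH (ttl = '1 day', ttl_column = 'minute') AS
  SELECT minute(arrival_timestamp) AS minute, count(*)
  FROM event_stream
  GROUP BY minute;
```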

Read the full post

PipelineDB Partners with HPE to Bring Streaming Analytics to Vertica


We're excited to announce a new partnership between PipelineDB and Hewlett-Packard Enterprise in an effort to bring streaming capabilities to Vertica and expand PipelineDB's usage in the enterprise. This partnership is the result of PipelineDB usage we were seeing from Vertica customers who wanted a tighter integration between the two products in order to enable use cases like realtime anomaly detection, continuous monitoring, and realtime reporting that require joining streaming data in PipelineDB with historical data from Vertica.

The partnership is a part of HPE's Big Data Platform Technology Alliance Program and enables Vertica users to add streaming capabilities without leaving the confines of SQL. PipelineDB and Vertica are both relational databases, making them highly compatible out of the box. The addition of streaming-SQL capabilities to Vertica's columnar database enables Vertica customers to build a lambda architecture in 100% SQL by continuously processing and acting on streaming data as it arrives and then either joining continuous views in PipelineDB with Vertica tables or sending data to Vertica for further analysis and archiving.

Read the full post

Announcing Stride - A Realtime Analytics API


We released PipelineDB just over a year ago and have seen consistently strong, increasing adoption during that period. We've been very fortunate to have an awesome and growing community of nice, thoughtful, smart (and patient) users who have helped us learn about what organizations on the bleeding-edge of data processing are doing and thus what the future looks like. And in our perpetual effort to build the future, it is with tremendous excitement that we're announcing the upcoming developer preview of our newest product: Stride.

Read the full post

PipelineDB 0.9.5


PipelineDB 0.9.5 is here, download it now!

Fast Binary Upgrades

Some PipelineDB releases change the schema of the internal system catalogs, which means these releases can't be run on previous versions' data directories. In these cases, users previously had to dump the old database and restore it into a data directory created by the new version. That is no longer necessary.

PipelineDB 0.9.5 ships with the pipeline-upgrade tool, which automatically upgrades system catalog data from an old to a new version of PipelineDB. All other database data is simply copied at the operating-system level because its on-disk layout will essentially never change. pipeline-upgrade takes the previous and new PipelineDB installations as arguments, and performs the upgrade automatically from there. For example:
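A hypothetical invocation, with flags modeled on PostgreSQL's pg_upgrade; the actual pipeline-upgrade flags may differ, and all paths are illustrative:

```shell
# -b/-B: old and new installations' bin directories
# -d/-D: old and new data directories
pipeline-upgrade \
  -b /usr/lib/pipelinedb-0.9.4/bin -B /usr/lib/pipelinedb-0.9.5/bin \
  -d /data/pipelinedb-0.9.4 -D /data/pipelinedb-0.9.5
```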

Read the full post

PipelineDB 0.9.3


PipelineDB 0.9.3 is here, download it now!

Merged PostgreSQL 9.5.3

Our last upstream merge was with PostgreSQL 9.5.0. Since then there have been three minor releases, so we decided to include all of the changes made in those releases in PipelineDB. The most significant change is support for 128-bit integers in the code base. See the issue here on PostgreSQL's mailing list. Some numeric aggregate functions involving large integers were using NumericAggState as their transition state, which is quite a large structure, and arithmetic operations on its numeric representation can be slow. If your platform supports __int128_t, you should see a 2.5x-4x performance improvement for sum(int8), avg(int8), var_*(int2), var_*(int4), stddev_*(int2) and stddev_*(int4).

Read the full post

PipelineDB 0.9.2


PipelineDB 0.9.2 is here, download it now!

Non-INNER stream-table JOINs

Stream-table joins now support LEFT and RIGHT JOINs as well. However, a directional join is only supported when the stream is the relation on that side: LEFT JOINs require the stream to be on the LHS, and RIGHT JOINs require it to be on the RHS.
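For instance, a LEFT JOIN with the stream on the LHS is valid (the table, stream, and column names here are illustrative):

```sql
-- event_stream is on the LHS, so a LEFT JOIN against the users table is allowed;
-- putting the stream on the RHS would require a RIGHT JOIN instead.
CREATE CONTINUOUS VIEW enriched AS
  SELECT s.user_id, u.plan, count(*)
  FROM event_stream s
  LEFT JOIN users u ON s.user_id = u.id
  GROUP BY s.user_id, u.plan;
```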

Read the full post

Continuous SQL Triggers


Since releasing PipelineDB as open source in July last year, one of the primary use cases we've seen it deployed for has been to build real-time dashboards. Real-time dashboards are powerful tools that allow you to get instant insight into the metrics you care about most. Once you have a real-time view of the metrics you care about, the natural next step is to take an action based on some specific condition or event. In SQL land, the way to react to events happening in your database is by using triggers.

Read the full post

PipelineDB 0.9.1


PipelineDB 0.9.1 is here, download it now!

This release brings continuous triggers to PipelineDB open-source. Previously, they were only available to our enterprise users, but now they're available to everyone!

Continuous Triggers

You can now create triggers on continuous views, which greatly simplifies building a real-time alerting system with PipelineDB. Check out our blog post to learn more about how continuous triggers work!
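A minimal sketch, assuming the trigger syntax mirrors standard PostgreSQL triggers; the view name, threshold, and send_alert() procedure are illustrative:

```sql
-- Fire whenever an update pushes a group's count past 1000.
CREATE TRIGGER high_error_rate
  AFTER UPDATE ON error_counts
  FOR EACH ROW
  WHEN (NEW.count > 1000)
  EXECUTE PROCEDURE send_alert();
```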

Debug Mode

PipelineDB packages now come with a debug build of the server binary. The debug binary is compiled at the -O0 optimization level with debug symbols and assertions enabled. Debug mode is designed to enable us to better support users when something goes wrong. It can be run in two ways:

Read the full post

PipelineDB Case Study: SmartNews


Company: SmartNews, Inc.

Industry: Ad Tech

Use Case: Real-time Reporting Dashboards

Background

SmartNews is a Tokyo-based news discovery app with over 5 million monthly active unique users (MAUs) that connects end users with news publisher sites. They use machine learning algorithms to make recommendations about relevant news content for their users. Their app is integrated with over 150 major publishers and makes money by serving ads, upselling free users to paid subscriptions, and by driving users to publisher paywalls.

Problem

Read the full post

PipelineDB 0.9.0 - Continuous Transforms, Streaming Topologies, and PostgreSQL 9.5 Compatibility


PipelineDB 0.9.0 is here, download it now!

This is a big release that we're very excited to announce today. In addition to various bug fixes and performance improvements, PipelineDB 0.9.0 adds a much-demanded new core abstraction called the continuous transform, a couple of new user-requested aggregates, and full compatibility with PostgreSQL 9.5.

Let's take a look at how all of this stuff can make your life easier.

Continuous Transforms

Continuous transforms are a generalized abstraction of the continuous view. Transforms make it possible to define arbitrary non-aggregate SQL queries that continuously run on a combination of input streams and tables. While quite similar to continuous views, the key difference between transforms and views is that instead of continuously updating an underlying table with the query output, transforms simply call an arbitrary function on any rows that the transform query produces. In fact, internally, a continuous view is effectively a continuous transform that happens to write to an underlying table.
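A sketch of the shape of a continuous transform; the stream name and the use of pipeline_stream_insert() to forward rows are illustrative assumptions:

```sql
-- Continuously filter a stream and forward matching rows to another stream,
-- where downstream continuous views or transforms can consume them.
CREATE CONTINUOUS TRANSFORM filtered AS
  SELECT x, y FROM event_stream WHERE x > 100
  THEN EXECUTE PROCEDURE pipeline_stream_insert('filtered_stream');
```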

Read the full post

PipelineDB 0.8.6


PipelineDB 0.8.6 is here, download it now!

In addition to various bug fixes and performance improvements, here are the highlights of this release:

Removed GIS From Core Codebase

We removed the PostGIS-compatible module that was previously built and installed by default with the PipelineDB core. While this module is still fully compatible with PipelineDB, most of our users weren't using GIS, and its external dependencies caused nearly all PipelineDB installation issues. With the module removed, building and installing PipelineDB is much simpler.

Read the full post

Announcing PipelineDB Cluster


Since launching PipelineDB in July of last year, we've been fortunate enough to have a steadily growing base of thoughtful, engaged and patient users who have played a key role in the evolution of PipelineDB. Many of these users are beginning to rely heavily on using PipelineDB in production, which is a trend we expect to continue and accelerate.

As a result, it is of critical importance to us that PipelineDB, the company, becomes a financially self-sustaining entity that can continue to improve and support its free and open-source core product without being dependent on anyone or anything that isn't as fanatical about building amazing infrastructure products as we are.

Read the full post

PipelineDB 0.8.5


PipelineDB 0.8.5 is here, download it now!

Database-level Control for Managing Continuous Queries

Previously, continuous_query_num_combiners + continuous_query_num_workers background workers would automatically start up for every database at system startup. If you only need continuous queries on a single database, this caused a lot of idle processes. Now you can use the ACTIVATE and DEACTIVATE commands to enable and disable continuous query execution, respectively. This state is persisted across restarts, so you only need to run ACTIVATE once.
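Usage is a pair of top-level commands; a sketch, assuming they apply to the database you're currently connected to:

```sql
-- Stop continuous query execution for this database (persists across restarts):
DEACTIVATE;

-- Turn it back on; only needs to be run once:
ACTIVATE;
```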

Read the full post

PipelineDB 0.8.4


PipelineDB 0.8.4 is here, download it now! Some of the highlights of this release are:

Multi-core Scalability Improvements

Previously, all worker processes used a single shared memory queue for IPC, which caused a lot of lock contention when running on machines with a high number of cores. We reworked the entire IPC infrastructure to use per-process queues so that they're lock-free for consumers. We also now write to these queues in batches rather than one tuple at a time, which greatly reduces lock contention for producers. In our load tests, we've seen a 4x improvement in throughput when running on a 32-core machine. Multi-core performance is going to be something we'll continue to focus on over the next few releases. Please let us know if you face any scalability issues!

Read the full post

PipelineDB 0.8.3


PipelineDB 0.8.3 is here, download it now! Some of the highlights of this release are:

Adhoc Continuous Queries

This release adds support for adhoc continuous queries. We have added a new ncurses-based command-line client called padhoc that allows you to run adhoc queries on streams and see the results update in real time.

This is a useful aid for debugging, or just experimenting with queries in general. Here is a demo of it in action:

Read the full post

SQL on Kafka


Originally developed at LinkedIn and open sourced in 2011, Kafka is a generic, JVM-based pub-sub service that is becoming the de facto standard messaging bus upon which organizations are building their real-time and stream-processing infrastructure. I tend to think of Kafka as the stream-processing analog to what HDFS has been to batch processing. Just as many transformative technologies have been built on top of HDFS (such as Hadoop's MapReduce), Kafka is (and will increasingly become) an integral component of stream-processing technology in general.

Read the full post

PipelineDB 0.8.1


PipelineDB 0.8.1 is here, download it now! Some of the highlights of this release are:

pipeline_kafka

A lot of our users requested an easier way to integrate PipelineDB with Kafka. This release comes with native Kafka ingestion support in the form of an extension called pipeline_kafka. Check out Derek's blog post to learn more about how it works!
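A hedged sketch of getting data flowing from Kafka into a stream; the function names below follow later pipeline_kafka documentation, so the 0.8.1-era API may differ, and the broker, topic, and stream names are illustrative:

```sql
CREATE EXTENSION pipeline_kafka;

-- Register a broker, then start consuming a topic into a stream.
SELECT pipeline_kafka.add_broker('localhost:9092');
SELECT pipeline_kafka.consume_begin('my_topic', 'my_stream');
```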

Filtered Space Saving for Top-K Queries

Support for Top-K queries is finally here. We implemented the Filtered Space-Saving algorithm, which combines sketch- and counter-based techniques to create a space-efficient summary that can be used to answer Top-K queries. The following example keeps track of the top 3 values of x seen for each value of k.
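A sketch of that example, using the fss_agg/fss_topk functions (the stream name is illustrative):

```sql
-- Track the top 3 values of x per value of k.
CREATE CONTINUOUS VIEW topk AS
  SELECT k, fss_agg(x, 3) FROM event_stream GROUP BY k;

-- Read out the current top-3 values for each group.
SELECT k, fss_topk(fss_agg) FROM topk;
```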

Read the full post

PipelineDB's Engineering Workflow


Engineering a database product is a unique and challenging endeavour. I began using databases as an engineer long before I started hacking on them, and I often wondered how these highly robust (well, usually) black boxes were developed. How are code changes applied and run by the engineer? How does testing work? How do you interactively step through the execution of a query, or track down the cause of a single corrupted bit on a remote server?

In this post, we'll share some of the high-level aspects of PipelineDB's engineering process, philosophy, and tooling as well as a few of the important lessons we've learned along the way.

Read the full post

PipelineDB 0.8


We're excited to release PipelineDB 0.8 today, download it now! This release includes a complete rework of the query parser and analyzer modules, as well as a number of stability and performance improvements. Some of the highlights are:

5432

The default port has been changed from 6543 to 5432 (the default Postgres port). This was done to avoid some quirks in systems with libpq already installed.

Support for Composite Data Types

Previously, you could only create probabilistic data structures over single-column references to native data types. That limitation no longer exists, so you could do something like:
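A sketch of the kind of query this enables (stream and column names are illustrative); the distinct count over the composite (x, y) value is estimated with a probabilistic structure such as HyperLogLog:

```sql
-- Count distinct (x, y) pairs, not just distinct values of a single column.
CREATE CONTINUOUS VIEW uniques AS
  SELECT count(DISTINCT (x, y)) FROM event_stream;
```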

Read the full post

Making Postgres Bloom


Whenever you think about low-latency stream computation, the key thing is figuring out how to summarize raw data, which might be too voluminous to compute on at low latency, while still maintaining the ability to answer relevant queries. For simple additive queries like counts and averages, it's pretty straightforward to condense the raw data down to a few numbers. But for more complex queries like calculating percentiles, heavy hitters, etc., summarizing the raw data becomes a little more tricky. Your fundamental trade-off in such cases is to either become a resource hog (read: lots of money) or lose a bit of precision by using a compact probabilistic summary data structure.

Read the full post

To Be Continuous


In Neal Stephenson's Snow Crash, Juanita Marquez describes her grandmother's uncanny ability to read people as being able to "condense fact from the vapor of nuance." I often think about that quote, because it eloquently describes what the heavy-handed term "big data" actually means.

Extracting value from large datasets is always a process of distillation. I was an early engineer at AdRoll, where I primarily focused on building out and scaling their data infrastructure. Its fundamental purpose was to distill large amounts of raw data for consumption by analytics dashboards, BI reports, and real time bidding algorithms.

Read the full post