To Be Continuous

In Neal Stephenson's Snow Crash, Juanita Marquez describes her grandmother's uncanny ability to read people as being able to "condense fact from the vapor of nuance." I often think about that quote, because it eloquently describes what the heavy-handed term "big data" actually means.

Extracting value from large datasets is always a process of distillation. I was an early engineer at AdRoll, where I primarily focused on building out and scaling their data infrastructure. Its fundamental purpose was to distill large amounts of raw data for consumption by analytics dashboards, BI reports, and real time bidding algorithms.

The infrastructure began as a Python script and a PostgreSQL server. As we grew explosively we moved to batch MapReduce and HBase, eventually evolving that into a large-scale streaming architecture based on Kafka, Storm, and HBase. The infrastructure became highly advanced and the scale grew by orders of magnitude, but the problem we were solving hadn't changed; we were still distilling data. We were just on the bleeding edge of how leading data-driven companies were doing it.

But not many organizations yet have the capability to continuously extract value from data in real time. Many of the tools and methods still used for data refinement have been around for quite some time, because little has changed in how most organizations think about data. In recent years, though, the general conception of data has begun a phase shift: once a solid to be stored in a silo and pulled out on demand, it is now becoming a fluid that continuously surges through complex networks of cooperating systems, enabling them or their users to do smart things as soon as information is produced.

This phase shift has led to the beginning of the displacement of classical data technologies by next-generation systems that are able to refine and understand information in real time. It is not surprising that our systems are evolving to process information continuously. After all, that is how all biological systems work, and living organisms are an exemplary model of evolved efficiency that software and hardware are striving to match.

PipelineDB combines these ideas of continuous processing and distillation to provide a simple and efficient platform for analyzing highly fluid, growing, changing datasets in real time. An evolution of the traditional static SQL database, PipelineDB runs predefined SQL queries continuously on streams without having to store raw data.

We think that most data-processing technology will be continuous in the future, so we couldn't be more excited to release PipelineDB as open source today.

Now let's get a bit more technical.

The Continuous View

PipelineDB's fundamental abstraction is what is called a continuous view. These are much like regular SQL views, except that their defining SELECT queries can include streams as a source to read from. The most important property of continuous views is that they only store their output in the database. That output is then continuously updated incrementally as new data flows through streams, and raw stream data is discarded once all continuous views have read it. Let's look at a canonical example:

CREATE CONTINUOUS VIEW counter AS SELECT COUNT(*) FROM stream;

Only one row would ever physically exist in PipelineDB for this continuous view, and its value would simply be incremented for each new event ingested.

But PipelineDB is much more powerful than that. Here is a continuous view that identifies how many unique sensors have been seen within 1,000 meters of San Francisco over a sliding 5-minute window:

CREATE CONTINUOUS VIEW sf_proximity_count AS
SELECT COUNT(DISTINCT sensor_id::integer) AS distinct_count
FROM geo_stream
WHERE ST_DWithin(

  -- Approximate SF coordinates (longitude, latitude)
  ST_GeographyFromText('SRID=4326;POINT(-122 37)'),

  sensor_coords::geography, 1000)
AND (arrival_timestamp > clock_timestamp() - interval '5 minutes');

The output of a continuous view is stored as a regular SQL table, so it can be further analyzed as such:

-- Deduplicated aggregate of the distinct counts
SELECT combine(distinct_count) FROM sf_proximity_count;

Fundamentally, what all of this means is that continuous views store observations about data but not raw data itself. Since observational data can often be much more information-dense than the data it is derived from, continuous views break the typical dependence between the amount of data ingested and the size and complexity of the database deployment.
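The idea can be sketched in a few lines of Python (a toy illustration, not PipelineDB internals): each event is folded into a small, fixed set of aggregates and then thrown away, so storage does not grow with the stream.

```python
class ContinuousAverage:
    """Stores only the distilled output: a count and a running sum."""

    def __init__(self):
        self.count = 0
        self.total = 0.0

    def ingest(self, value):
        # Incrementally update the stored observation; the raw event
        # itself is never retained.
        self.count += 1
        self.total += value

    @property
    def average(self):
        return self.total / self.count if self.count else 0.0


view = ContinuousAverage()
for reading in [3.0, 5.0, 10.0]:  # stand-in for an unbounded stream
    view.ingest(reading)

print(view.count, view.total, view.average)  # → 3 18.0 6.0
```

However many readings flow through, the "view" occupies two numbers of storage, which is the same property that lets a continuous view's footprint stay decoupled from ingest volume.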

As you've likely noticed, the main restriction of this approach is that continuous views must be known in advance, just like any other SQL view. However, it is still trivial to retroactively send data into a stream. Here's how one might stream archival logfile data from S3 into a PipelineDB stream in order to populate a continuous view:

s3cmd get s3://bucket/logfile.gz - | gunzip | pipeline -c "COPY stream (data) FROM STDIN"

We think that the primary consumer of information is becoming software itself and not necessarily analysts who need to frequently run dynamic, ad-hoc queries on granular data. And applications -- almost by definition -- are pre-programmed to issue certain types of queries to the database, with the only dynamic aspect being parameters such as date ranges. Continuous views are very well-suited for this type of workload.
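For instance, a dashboard could be backed by a continuous view that rolls the stream up per minute (the names below are hypothetical), with the application forever issuing the same query and varying only the time range:

```sql
-- Hypothetical per-minute rollup of the sensor stream
CREATE CONTINUOUS VIEW sightings_per_minute AS
  SELECT date_trunc('minute', arrival_timestamp) AS minute, COUNT(*) AS total
  FROM geo_stream GROUP BY minute;

-- The application's pre-programmed query; only the interval is a parameter
SELECT minute, total FROM sightings_per_minute
WHERE minute > now() - interval '1 hour' ORDER BY minute;
```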

ETL is Probably Not the Future

PipelineDB can also make the infrastructure that surrounds it more efficient. After meeting with over a hundred different data-driven companies to learn about their pain points, we discovered that one of the most beneficial aspects of PipelineDB's design is that it eliminates the necessity of an ETL stage for many data pipelines. ETL was often described as the most complex and burdensome part of these companies' infrastructure, so they were eager to imagine how it could be simplified or removed altogether. It is indeed difficult to envision a future in which leading organizations are still primarily running periodic, batch ETL jobs.

Most data pipelines begin with the production of granular, raw data and end with a highly distilled, organized view of it. ETL is often the bridge between these two extremes. In contrast, PipelineDB makes it possible to stream raw data directly into the database where only the distilled output of predefined continuous views is stored, effectively eliminating the need for complex processing in front of the database.

Here's what writing to a stream actually looks like. These statements each write a JSON event to a stream:

INSERT INTO stream (payload) VALUES ('{"device_type": 0, "latitude": 1.323, "longitude": 50.332}');
INSERT INTO stream (payload) VALUES ('{"device_type": 4, "latitude": 1.323, "longitude": 50.332}');

It's just a SQL INSERT, which means it is perfectly feasible to use existing SQL clients to stream data into PipelineDB at very high throughputs. It is also easy to imagine transforming the transaction, replication, or write-ahead logs of other datastores into writes to a PipelineDB stream, effectively making it possible to run continuous queries on datastores that aren't actually conducive to that themselves.
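As a sketch (with a made-up change-log format; the stream name follows the earlier examples), a tailer might rewrite each change record from another datastore's commit log as a stream INSERT, which any PostgreSQL client could then execute:

```python
import json

def change_to_insert(log_line, stream="stream"):
    """Rewrite one JSON change record as an INSERT into a stream.

    The log format here is hypothetical; real change logs would need
    proper parsing and SQL escaping.
    """
    record = json.loads(log_line)
    payload = json.dumps(record["new_row"])
    return "INSERT INTO %s (payload) VALUES ('%s');" % (stream, payload)

line = '{"op": "insert", "new_row": {"sensor_id": 7}}'
print(change_to_insert(line))
# → INSERT INTO stream (payload) VALUES ('{"sensor_id": 7}');
```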

Easy Does It

While PipelineDB is designed to make continuous data processing infrastructure simpler and more efficient, its most ambitious goal is to make the life of the developer as easy as possible. Most organizations don't yet have the infrastructure to process streaming data in real time, but not because they don't want that capability. It's generally because streaming infrastructure is currently only accessible to deeply technical organizations whose core competency is building large-scale software.

Accessibility is precisely why we decided to build a streaming platform that exposes all of its functionality via a ubiquitous, declarative query language -- SQL. Taking this a step further, PipelineDB is fully compatible with PostgreSQL and by design doesn't even have its own client libraries. PipelineDB is designed to work with any library that works with PostgreSQL, which in turn means almost any standard SQL library. Here is a producer that streams JSON events into a stream as they're appended to a file, using only bash and psql:

tail -f logfile.json | while read payload; do psql -c "INSERT INTO stream (payload) VALUES ('$payload')"; done

In fact, PipelineDB is a superset of PostgreSQL. It contains all of PostgreSQL and PostGIS's features and a large amount of advanced streaming functionality on top of all of that. It can thus even be used in place of a vanilla PostgreSQL deployment.

Finally, ease of adoption by developers will perhaps be the most important attribute of the infrastructure products that thrive in the long term, which is why we knew PipelineDB had to be an open-source product. We think that developers will increasingly be making major technology decisions within their respective organizations, and developers aren't waiting for salespeople to call them to learn about the best new products. As an engineer myself, I know I never was.

OK That's All

As more entities produce and capture more data, more quickly, simpler and more accessible data-processing tools will be needed. There is no future in which continuous data processing is not commonplace in the modern stack. PipelineDB aims to leverage three powerful components -- SQL, the relational model, and open source -- to give a wide range of users an easy way to tackle stream processing and realtime analytics.

PipelineDB is not the best tool for every job. It sucks at anything that doesn't fit within the confines of SQL, and is no better at ad hoc analytics than vanilla PostgreSQL. But if you're looking for an easy way to build realtime applications for which you know the queries you want to run ahead of time and where those queries can be expressed in SQL, then PipelineDB may save you a great deal of time, energy, and money.

But the great thing about having an open source product is that you don't have to take my word for it. We would be thrilled if you could give PipelineDB 0.7.7 a try and let us know what you think. Oh and by the way, there still may be a few bugs :)

Derek Nelson,
CEO, PipelineDB