PipelineDB Case Study: SmartNews


Company: SmartNews, Inc.

Industry: Ad Tech

Use Case: Real-time Reporting Dashboards

Background

SmartNews is a Tokyo-based news discovery app with over 5 million monthly active unique users (MAUs) that connects end users with news publisher sites. They use machine learning algorithms to recommend relevant news content to their users. Their app is integrated with over 150 major publishers and makes money by serving ads, upselling free users to paid subscriptions, and driving users to publisher paywalls.

Problem

SmartNews had limited engineering resources but wanted to move its batch ETL jobs for reporting and systems monitoring to a real-time streaming analytics architecture. They needed a solution that could:

  • Be deployed and managed by a single engineer who was only working part-time on their data infrastructure
  • Run seamlessly on Amazon EC2
  • Ingest data from Amazon Kinesis at ~50k events/second with tens of continuous queries running simultaneously
  • Provide real-time analytics up to the second
  • Support querying sliding windows over data streams
  • Easily integrate with Chartio and other in-house BI tools
  • Scale ingestion horizontally as data volumes grew
  • Support high availability

Solution

After evaluating Spark Streaming, Norikra, and PipelineDB, SmartNews chose PipelineDB for its streaming analytics infrastructure, primarily because of PipelineDB's simplicity and ease of use. SmartNews cites the following reasons for choosing PipelineDB over the alternatives:

  • Unification of real-time computation and persistent storage into a single system
  • Having a SQL-based approach to streaming analytics
  • PipelineDB being built into the PostgreSQL core, which enables usage of clients and other tools in the PostgreSQL ecosystem
  • Compatibility with existing BI tools like Chartio
  • Native support for efficient and low-error uniques counting using HyperLogLog
  • Ability to easily roll up granular aggregated data (e.g. aggregating minute-level statistics into hour-level statistics)
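The last two points can be sketched together in SQL. This is a minimal illustration, not SmartNews' actual schema: the view name uniques_by_minute is hypothetical, and it assumes a stream with a single JSONB column carrying a uuid field, like the ads_stream described later in this post. It relies on PipelineDB's combine aggregate, which merges the stored aggregate state of a continuous view column rather than re-reading raw events.

```sql
-- Hypothetical minute-level view; count(DISTINCT ...) in a continuous view
-- is backed by HyperLogLog, so each row stores a small sketch, not raw uuids.
CREATE CONTINUOUS VIEW uniques_by_minute AS
  SELECT
    minute(arrival_timestamp) AS ts_minute,
    count(DISTINCT (item->>'uuid')::text) AS uniques
  FROM ads_stream
  GROUP BY ts_minute;

-- Roll the minute-level rows up to hour-level rows at read time.
-- combine() merges the per-minute sketches, so a user seen in several
-- minutes of the same hour is still counted once for that hour.
SELECT
  date_trunc('hour', ts_minute) AS ts_hour,
  combine(uniques) AS uniques
FROM uniques_by_minute
GROUP BY ts_hour
ORDER BY ts_hour;
```

Simply summing the per-minute counts would overcount users who appear in more than one minute, which is why the roll-up merges aggregate state instead.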

PipelineDB Cluster

SmartNews worked extensively with the open-source edition of PipelineDB before purchasing a PipelineDB Cluster license. As SmartNews' data volumes grew and their use cases for PipelineDB became more mission-critical, they needed horizontal scaling across nodes to increase write throughput, and replication and failover between nodes for high availability. They have been running PipelineDB Cluster since December 2015 with no major hiccups.

Click here to get started with a free 30-day trial of PipelineDB Cluster or email us with any questions.

Technical Implementation

SmartNews has two PipelineDB clusters deployed on Amazon EC2, one with c3.4xlarge instances and the other with c4.xlarge instances. They use Fluentd as their logging layer and load each log record as a JSON object into Amazon Kinesis. Thanks to PostgreSQL's native support for JSON types, they are able to stream these JSON objects directly into PipelineDB, where they are read by continuous SQL queries, which we call continuous views.

SmartNews has six advertising-related streams. Each of these streams consists simply of a single jsonb column. For example, their ads_stream looks like this:

CREATE STREAM ads_stream ( item JSONB );
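For illustration, an event can be written to this stream with an ordinary INSERT. This is only a sketch: SmartNews loads events from Kinesis rather than by hand, and the field values below are made up. The field names are the ones referenced by the continuous views in this post.

```sql
-- Write one JSON event to the stream (illustrative values only).
INSERT INTO ads_stream (item) VALUES (
  '{"uuid": "user-123",
    "timestamp": 1451606400,
    "abtExpLabel": "exp_a",
    "abtGrpLabel": "control"}'::jsonb
);
```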

One of the continuous views consuming data from this stream, which counts ad impressions per hour, is:

CREATE CONTINUOUS VIEW ads_count AS
  SELECT
    hour(arrival_timestamp),
    count(*)
  FROM ads_stream
  GROUP BY hour(arrival_timestamp);

Note that in this example, only one row per hour is ever stored in PipelineDB, and that row is incrementally updated in real time as new data arrives.
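Because a continuous view can be read like any other table or view, a dashboard query against it can stay very small. As a sketch, the following reads the 24 most recent hourly rows; it orders by column position so that it does not assume any particular column names.

```sql
-- Read the most recent 24 hourly buckets from the continuous view.
-- Column 1 is the hour-truncated arrival timestamp, column 2 the count.
SELECT * FROM ads_count ORDER BY 1 DESC LIMIT 24;
```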

Another continuous view consuming data from this stream is used for computing the histogram of assignment statistics for their A/B tests.

CREATE CONTINUOUS VIEW abt_allocation AS
  SELECT
    to_char(to_timestamp((item->>'timestamp')::bigint + 3600*9), 'YYYY-MM-DD') AS ymd_jst,
    (item->>'abtExpLabel')::text AS abt_exp_label,
    (item->>'abtGrpLabel')::text AS abt_grp_label,
    count(distinct (item->>'uuid')::text) AS ucnt
  FROM ads_stream
  WHERE hour(arrival_timestamp) > clock_timestamp() - interval '3 days'
  GROUP BY ymd_jst, abt_exp_label, abt_grp_label;

This continuous view computes the number of unique users assigned to each user segment of the A/B tests SmartNews is running, grouped by day (in Japan Standard Time), experiment label, and group label, over a rolling period of 3 days. The output of this continuous view is used by SmartNews analysts to optimize campaigns and iterate on product features. Being able to optimize ad campaigns in real time reduces wasteful spending and increases overall campaign performance, driving more top-line revenue for SmartNews and their customers.
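As a sketch of how an analyst might read this view, the query below computes each group's share of the unique users within its experiment and day. The query is an assumption for illustration, not one taken from SmartNews; it uses only the columns defined in abt_allocation and a standard PostgreSQL window function.

```sql
-- Share of unique users per A/B group within each experiment and JST day.
SELECT
  ymd_jst,
  abt_exp_label,
  abt_grp_label,
  ucnt,
  round(ucnt::numeric
        / sum(ucnt) OVER (PARTITION BY ymd_jst, abt_exp_label), 4) AS share
FROM abt_allocation
ORDER BY ymd_jst, abt_exp_label, abt_grp_label;
```

Because the unique counts are HyperLogLog estimates, the shares are approximate, and a user who appears in more than one group is counted in each.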

As the number and complexity of these continuous views increase, a SQL-based interface reduces both upfront development time and ongoing operational overhead, especially with 30+ queries running simultaneously at scale and managed by a single engineer with other responsibilities.

Final Thoughts

Troubleshooting complex application code across multiple systems can be cumbersome and time-consuming. For SmartNews, avoiding that overhead was the key reason for choosing PipelineDB. Using PipelineDB, SmartNews was able to implement a production-grade streaming analytics architecture with less than one full-time engineer. It was easy for them to get started with PipelineDB's open-source edition in a meaningful way and then pay for enterprise features like scaling and HA when they needed them.

SmartNews plans to expand their PipelineDB deployment to include real-time anomaly detection and alerting in 2016.

Download our latest open-source release of PipelineDB here or contact us to get started with PipelineDB Cluster today!