Marriage of Hadoop and OLAP: Best of both worlds to make sense of 200 Terabytes of data

Like many other companies in the social networking world, Zoosk ingests a vast amount of data every day from user interactions, web logs, and financial transactions, as well as standard business metric data.  Making sense of this data and turning it into actionable intelligence is of utmost importance to Zoosk, where we are constantly trying to optimize our product offerings and business processes.  The question is: how do we most effectively leverage our data and turn it into business intelligence?

There are a few typical approaches to answering this question.

Traditionally, to gain business intelligence, one can leverage a star schema data warehouse with a multi-dimensional OLAP engine, giving the business a user-friendly toolset to quickly “slice and dice” data and identify trends and patterns.  These toolsets can be something users are already familiar with, such as Microsoft Excel and web dashboards. However, traditional data warehouses are typically not as flexible with semi-structured data, especially when data volume grows very large.
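To make “slice and dice” concrete, here is a minimal star schema sketch.  The table and column names are illustrative stand-ins, not our actual schema: each fact row carries surrogate keys into conformed dimension tables, so any metric can be aggregated by any combination of dimension attributes with simple joins.

    -- Illustrative star schema (names are hypothetical, not Zoosk's actual tables).
    CREATE TABLE dim_date (
        date_key      INT PRIMARY KEY,   -- surrogate key, e.g. 20120315
        calendar_date DATE,
        month_name    VARCHAR(10),
        year_number   INT
    );

    CREATE TABLE dim_geography (
        geo_key INT PRIMARY KEY,
        country VARCHAR(50),
        region  VARCHAR(50)
    );

    CREATE TABLE fact_subscription (
        date_key      INT REFERENCES dim_date (date_key),
        geo_key       INT REFERENCES dim_geography (geo_key),
        subscriptions INT,
        revenue_usd   DECIMAL(12,2)
    );

    -- "Slice and dice": revenue by month and country, one join per dimension.
    SELECT   d.year_number, d.month_name, g.country,
             SUM(f.revenue_usd) AS revenue
    FROM     fact_subscription f
    JOIN     dim_date      d ON f.date_key = d.date_key
    JOIN     dim_geography g ON f.geo_key  = g.geo_key
    GROUP BY d.year_number, d.month_name, g.country;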

More recently, one can leverage what is known as “Big Data”, using tools such as Hadoop and Hive, to cost-effectively deal with large amounts of data of unprecedented variety, velocity and volume.  Using MapReduce, Hadoop can process large amounts of data cheaply by splitting it into smaller chunks and processing them in parallel across a cluster of commodity nodes.  It also provides the advantage of being able to store unstructured data.  However, unlike a traditional data warehouse, analytics via Hadoop/Hive is still relatively primitive and currently lacks user-friendly tools for business users, who may not want to write their own HiveQL queries.
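For a sense of what that looks like, here is the kind of ad hoc HiveQL a power user might run directly against raw logs (the web_logs table and its columns are hypothetical stand-ins for our actual log schema).  Hive compiles the query into MapReduce jobs that scan the underlying log files in parallel:

    -- Hypothetical Hive table over raw web logs; names are illustrative.
    -- Hive turns this query into MapReduce jobs over the underlying files.
    SELECT   page_url,
             COUNT(*)                AS hits,
             COUNT(DISTINCT user_id) AS unique_visitors
    FROM     web_logs
    WHERE    log_date = '2012-03-15'   -- partition pruning keeps the scan bounded
    GROUP BY page_url
    ORDER BY hits DESC
    LIMIT 20;

Powerful, but not something most business users want to write by hand, which is what motivates the hybrid approach below.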

So what do we do at Zoosk?  Simple, do both!

The overall high-level architecture of the Zoosk Analytics Platform is as follows:

1) We first start with collecting as much data as we need.  Our website writes user behavior data to a Scribe log, while the web server collects vital web statistics in its web logs.  These logs are considered the “Big Data” portion of our source data.

2) In parallel, we pull mission-critical, structured production data directly from our user cluster databases.  These clusters hold user profile data, financial information and other transaction-based information.  This is considered the “Small Data” portion of our source data.

3) “Big Data” such as the Scribe and web logs is staged in Hadoop, with Hive layered on top to provide ad hoc analytical access to the underlying data.  This layer also acts as a first-stage data integration process, structuring and aggregating our “Big Data” into more manageable information (see the first sketch after this list).

4) Database scripts and Extract, Transform, Load (ETL) programs are used to load both our “Big Data” and “Small Data” sets into our RDBMS staging tables.

5) Using industry-proven methodologies in data warehouse / dimensional modeling, we conform all of our data into over 100 dimensions across 20+ fact tables, and store them in a centralized Enterprise Data Warehouse (EDW).  The EDW can be used for ad hoc query analysis as well (see the second sketch after this list), but its main purpose is to populate our OLAP multidimensional data marts.

6) The OLAP data marts have all the dimensional structures, KPIs, and metrics defined, so users can simply browse our cubes via Excel or web reports, easily navigate our user-friendly BI universe, or create their own reports based on a single version of truth.  The marts are aggregation-aware and can automatically direct queries for the most-used reports to our aggregation cache, providing sub-second query response times.

7) The Zoosk analytical platform enables our users to gain insights from the data via several complementary vehicles.  Power users can perform ad hoc analysis directly in Hive or against the SQL EDW.  Further, power users can connect to the OLAP cubes directly via toolsets such as MS Excel and perform pivot analysis.  Advanced visualization techniques, such as charts, gauges, sparklines, in-cell data bars and conditional formatting, can be employed for executive dashboard reporting.  Standard template reports are also available online and on mobile platforms for 24/7, anywhere data availability.
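As a sketch of steps 3 and 4, the first-stage Hive aggregation below collapses billions of raw log rows into daily per-user summaries small enough for a conventional ETL program to load into RDBMS staging tables.  All table and column names here are hypothetical:

    -- Step 3 (Hive): aggregate raw session logs into a manageable daily summary.
    -- Table and column names are illustrative, not our production schema.
    INSERT OVERWRITE TABLE daily_session_summary
    PARTITION (log_date = '2012-03-15')
    SELECT   user_id,
             COUNT(*)             AS session_count,
             SUM(page_views)      AS page_views,
             SUM(session_seconds) AS total_seconds
    FROM     session_logs
    WHERE    log_date = '2012-03-15'
    GROUP BY user_id;

    -- Step 4: an ETL program or bulk-load script (e.g. a Sqoop-style export)
    -- then moves each day's summary into an RDBMS staging table, where it is
    -- combined with the "Small Data" pulled from the production databases.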
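And as a sketch of the payoff from step 5’s conformed dimensions (again with illustrative names): because both fact tables share the same date dimension, one EDW query can line up behavioral and financial metrics side by side, with each fact pre-aggregated separately to avoid join fan-out.

    -- EDW query: the conformed date dimension lets two fact tables be
    -- cross-analyzed on the same attributes.  Names are hypothetical.
    WITH sessions_by_month AS (
        SELECT d.year_number, d.month_name, SUM(f.session_count) AS sessions
        FROM   fact_sessions f JOIN dim_date d ON f.date_key = d.date_key
        GROUP BY d.year_number, d.month_name
    ),
    revenue_by_month AS (
        SELECT d.year_number, d.month_name, SUM(f.revenue_usd) AS revenue
        FROM   fact_payments f JOIN dim_date d ON f.date_key = d.date_key
        GROUP BY d.year_number, d.month_name
    )
    SELECT s.year_number, s.month_name, s.sessions, r.revenue
    FROM   sessions_by_month s
    JOIN   revenue_by_month  r
      ON   s.year_number = r.year_number
     AND   s.month_name  = r.month_name;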
Below are a few interesting statistics about our analytics platform:

8 OLAP cubes

20+ Fact tables

90 conformed dimensions in the main cube alone

150+ cube dimensions

450+ measures across the measure groups in the main cube alone

3.6 billion session logs processed a year

10 billion interactions tracked a year

80 billion web traffic events tracked a year

130 billion rows of data ingested into Hadoop a year

40 Terabytes of data ingested into Hadoop a year

By combining traditional DW / BI architecture with newer technologies such as Hadoop / Hive, we are able to provide:

  1. A highly scalable infrastructure that easily handles unstructured data and scales to the large data volumes typical of consumer websites
  2. A structured, time-proven, enterprise-wide information repository that consolidates all incoming data and provides a single version of truth to all of our users
  3. A conformed analytic platform that enables cross-analysis across different fact metrics via conformed dimensions
  4. A user-friendly analytical platform that allows browsing and creation of custom reports without deep technical knowledge
  5. A performance-optimized analytical platform where OLAP and usage-based aggregations provide sub-second query response times for standard queries and reports