Tackling Big Data – Choosing the option that works for you
In the recent weeks I have been involved in a joint project with a company in the technology field that processes between 5 and 7 million events each day for their customers (most of these customers are retailers). Each of these events has between 40 – 50 attributes. Is this Big Data? By our accounts yes. This company has thus far only looked at accurately reflecting these events for their (rather classical) daily reports, but are now contemplating what kind of analytical approach they could take in order to learn more about their customers but also, equally important (if not more important), what can be extracted out of the data so that gives their customers unique insights, insights for which they would be willing to pay – thus generating an additional revenue stream for this company.
I guess this is a typical question for the modern company: How can I monetize my data. I embarked with them on this journey, in order to find the right solution for them.
As with any big data (or even with “regular size data”) projects, one can separate the stages into 3 areas
– ETL (extraction, normalization, adaptation, transformation etc)
– analytics (e.g. time series aggregation, filtering and, later on, predictions)
– visualization (displaying the aggregated data in a meaningful way)
When approaching this type of project (and not necessarily only in this area) companies are usually presented with two options: Black Box and Open Approach.
Black Box approach would mean in this context choosing a technology that has a very low barrier to entry (technology-wise). You just take your existing data and/or data-sources and you point them to this tool (or tools collection) and then spend most of your efforts into the analytics (to a lesser extent also into visualization domain)
The Open approach implies a inherently more complex setup. In here all the steps have to be automatized (for instance through scripting or otherwise) independently while making sure that they can talk with one another. This leads to a more difficult start (as measured from the start of the project until the first data is visualized), but is more flexible, has better performance and has a very good chance of being capable to be built entirely with open source components – thus leading to a lower software expenditure.
The usual suspects in a Black Box approach are the likes of Microsoft’s Power BI or Tibco’s Jaspersoft maybe with some enhancements from tools like Tableau (which can be used as a visual enhancer and as an ad-hoc-visual explorer on both scenarios).
More interesting, especially if you enjoy more freedom and are not afraid of taking seemingly separated components and craft a custom solution, is the Open approach where you have to fill in multiple boxes, but you have plenty of choices for each of them. I guess choosing between Black Blox and Open Approach it’s a combination of factors like: knowledge, time, available funds and how comfortable you are with open source solutions.
This is my take on this, by all means no-one should be offended of my subjective view.
Step 1: ETL step
If you’re lucky enough and the application(s) generating these events are recent enough (say, at most 5 years old) and you can somehow can get hold of the said events in real time (or near real time) then the preferred choice would be to throw them in a queue (Apache Kafka makes a great candidate and plays well with the tool one might use for aggregation – Apache Spark, but case-by-case analysis might lead to solutions like RabbitMQ or 0mq), eventually using a lightweight (for instance NodeJS) application that receives these events via a standard HTTP interface and puts them in a specially opened Kafka topic.
If you’re not lucky (or the application is old, or there’s no way to snoop on the events in realtime, as they happen) then you’ll probably have to deal with processing some batch files every once in a while, maybe based on some cron jobs that parse some log files. Or maybe use something like Apache Flume to quickly collect and transmit this data. Passing the data through these components is also a good opportunity to enhance it, comb it or de-duplicate it. For instance to ensure some unicity you might want to compute a hash of an event’s parameters and keep them in a circular queue for quick lookup. This way you can ensure unicity for the last X minutes.
Step 2: Analytics, aggregation,…
Once the information reaches the data bus, you have plenty of options.
For instance you might want to have a single consumer that sifts through the data and processes it, or you can implement a “standard’ lambda architecture whereby you have multiple consumers reading the same data, but one flow simply stores it for long-term & persistent storage, while the other aggregates it.
We’re already in the aggregation/filtering stage where the obvious choice these days (or at least a very serious candidate) is Apache Spark which offers the immediate benefit of being able to aggregate parameters based on streaming data, but also is well prepared for the future (i.e. do predictive tasks through MLlib). Considering also the fact that is nowadays easy for people that come to data science by coming from the statistics side and might be proficient in R, while the latest Spark version (1.5) introduces R as an accepted manipulation language, the choice is even more obvious. Since the events are constituting a time-series, it might make sense to pre-aggregate them on predefined intervals – say 1 min, 5min, 1hour, 1day and persist the aggregated results. If the events are having multiple attributes that can be filtered upon, then the interval aggregation can be done also by these attributes, thus ending with aggregation by time and 1 or more attribute. That is, of course, if the on-the-fly aggregation is not acceptable. For instance if you’re asking for the aggregated 1 minute data for a specific day to be displayed in a graph, waiting more than a few seconds might not be optimal and the pre-aggregation is required.
Step 3: Visualization
Once the data is persisted (in a aggregated or raw manner) it should be passed to the visualisation engine (or maybe it’s the other way around – the visualisation engine will ask for specific chunks of data). If you are able to retrieve this data, especially if the result is in JSON format, there are plenty of tools, starting from the classic chart libraries (where Google Charts stands out, but there are other libraries worth mentioning – like ChartJS, MorrisCharts, Rickshaw, RaphaelJS, FlotCharts and many others) which support building an exploratory framework around them and going all the way to newer projects that take a more Ad-Hoc/Get-Insights approach as are, for instance, Vega’s Polystar and Vega’s Data Voyager.
As said, this is my take on how to proceed with these 3 incipient steps when looking at data. But, as most often in technology, there really isn’t a ‘one-size-fits-all’ solution, things that might work well for some, might perform awful for others. As they say, YMMV.