Technically Speaking

The Official Bigstep Blog

 

Spark Structured Streaming in Practice

Bigstep Solution Architect Andrei Muraru @ the HUG UK & Big Data Analytics London Meetup

How does Spark Structured Streaming work with real-time big data workloads? Here’s a case study presented by Bigstep Solution Architect Andrei Muraru, during the Big Data Week 2016 global festival.

Spark Structured Streaming lets you express streaming computations in much the same way as batch computations on static data. Its built-in engine incrementally and continuously updates the final result as streaming data continues to arrive. Andrei’s presentation covers how a real-life implementation of Spark Structured Streaming on top of a Hadoop cluster is helping a large online retailer analyze clickstream data and combine it with customer history information.
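
To make the idea of incrementally updated results concrete, here is a minimal plain-Python sketch. It is not Spark code and not taken from the presentation: the names count_clicks_batch and IncrementalClickCounter are hypothetical, and the click records are made up for illustration. It only shows the property described above: folding new data into a running result, one micro-batch at a time, should give the same answer as running the equivalent query once over the full static data set.

```python
from collections import Counter
from typing import Dict, Iterable, List

Click = Dict[str, str]  # hypothetical record shape: {"user": ..., "page": ...}


def count_clicks_batch(clicks: Iterable[Click]) -> Counter:
    """Batch-style query: count clicks per page over a complete data set."""
    return Counter(click["page"] for click in clicks)


class IncrementalClickCounter:
    """Streaming-style version of the same query.

    Keeps a running result and folds each new micro-batch into it,
    instead of recomputing everything from scratch.
    """

    def __init__(self) -> None:
        self.result: Counter = Counter()

    def update(self, micro_batch: Iterable[Click]) -> Counter:
        # Run the batch query on the new rows only, then merge the counts.
        self.result.update(count_clicks_batch(micro_batch))
        return self.result


# Made-up sample data, split into two micro-batches.
batch_1: List[Click] = [
    {"user": "u1", "page": "/shoes"},
    {"user": "u2", "page": "/shoes"},
]
batch_2: List[Click] = [
    {"user": "u1", "page": "/socks"},
    {"user": "u3", "page": "/shoes"},
]

counter = IncrementalClickCounter()
counter.update(batch_1)
streaming_result = counter.update(batch_2)

# The same query run once over all the data, as if it were static.
static_result = count_clicks_batch(batch_1 + batch_2)
```

In this toy model the query logic is written once and reused for both the static and the streaming case, which is the convenience the Structured Streaming programming model is meant to offer.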

Processing real-time workloads is becoming an increasingly common approach to big data, with potentially game-changing implications for several industries. Here are some domains where real-time use cases have a significant impact: Transportation (real-time route optimization, fleet tracking and data analysis, smart public transportation), Healthcare (real-time personalized patient care, public health data analysis), Finance (fraud detection, stock analysis), and Retail (personalized customer targeting, real-time marketing).

The case study Andrei presented covers the challenges faced by a large retailer in trying to offer better-targeted services to its customers:

Real-Time Marketing

Individually targeted real-time suggestions based on current actions and history.

Bundled Products

Create product bundles based on current user actions and searches.

Complementary & Substitutes

Offer complementary consumable products, show possible substitutes.

The presentation also gives an overview of the architecture: the real-time clickstream and additional relational data flow into a Kafka cluster, Spark runs the main processing tasks and continuously appends data to HDFS, and the results are presented in Zeppelin and/or Qlik.
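
The data flow in this architecture can be modelled with a small plain-Python sketch. This is an illustration under stated assumptions, not the deployment's actual code and not the Spark or Kafka API: the JSON message format, the CUSTOMER_HISTORY lookup, and the enrich and append_to_sink helpers are all hypothetical. Lists of strings stand in for messages read from a queue, a dictionary stands in for the relational customer data, and a Python list stands in for append-only storage on HDFS.

```python
import json
from typing import Dict, Iterable, List

# Stand-in for the relational customer-history data (hypothetical fields).
CUSTOMER_HISTORY: Dict[str, Dict[str, object]] = {
    "u1": {"past_orders": 4, "segment": "returning"},
    "u2": {"past_orders": 0, "segment": "new"},
}


def enrich(raw_events: Iterable[str],
           history: Dict[str, Dict[str, object]]) -> List[dict]:
    """Parse JSON click events and join each one with its customer's history.

    Events from customers missing in `history` are dropped, which
    mirrors the behaviour of an inner join.
    """
    enriched = []
    for raw in raw_events:
        event = json.loads(raw)
        customer = history.get(event["user"])
        if customer is None:
            continue
        enriched.append({**event, **customer})
    return enriched


def append_to_sink(sink: List[dict], rows: List[dict]) -> None:
    """Append-only write; the list stands in for files on HDFS."""
    sink.extend(rows)


# Made-up messages, grouped into micro-batches as they might arrive.
micro_batches = [
    ['{"user": "u1", "page": "/shoes"}', '{"user": "u9", "page": "/hats"}'],
    ['{"user": "u2", "page": "/socks"}'],
]

sink: List[dict] = []
for micro_batch in micro_batches:
    append_to_sink(sink, enrich(micro_batch, CUSTOMER_HISTORY))
```

After both micro-batches are processed, the sink holds one enriched row per known customer, ready to be queried by a notebook or dashboard tool. Whether unknown customers should be dropped or kept with empty history fields is a design choice; the inner-join behaviour here is only one option.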

The Bigstep deployment was designed to trace every stage of online retail marketing, from browsing and ordering through to final delivery of the merchandise, and to derive insights from the data gathered along the way.

Key Takeaways:

Simple Design

Minimize the number of formats and technologies used.

Tuning and Swapping

Allow for easy tuning and swapping of components.

Think Forward

Take into account scalability, availability, and structural changes.

This Meetup was held by the Hadoop Users Group UK and Big Data Analytics as part of the Big Data Week Global Festival. The video of the Spark Structured Streaming in Practice presentation is available thanks to SkillsMatter.

There’s also an interesting Q&A at the end.

About the Speaker

Andrei Muraru is a Solution Architect at Bigstep. He has been working for several years in the big-data industry, designing and implementing complex projects for various companies. Andrei currently focuses on large-scale real-time implementations. He is helping customers begin their journey with big-data workloads by providing meaningful insights on the products and services that are appropriate for their use cases. His motto: “Doing a task more than twice? Then, Automate it!”
