Big Data Expert Interview: Andre Vermeulen
We’re excited to launch a new series of posts on our blog featuring Big Data industry experts and leaders. We will be interviewing an amazing lineup of guests who are ready to share their thinking, expertise and big data strategies.
To kick off the series, we spoke with Andre Vermeulen, Consulting Manager - Business Intelligence, Big Data, Data Science at Steria.
What problems does big data solve for your customers?
Andre Vermeulen: Our customers use Big Data in the following business areas:
1. Traffic Patterns - Several customers are storing and analyzing traffic patterns in networks, production flows and web navigation.
2. Custom Advertising - Customers are using mobile device identification and face recognition to adapt the advertising appearing on the billboards.
3. Retail Habits - Customers are storing transactions as the POS systems capture them and performing shopping cart analysis on the data. New technology in RFID enables customers to arrive, for instance, with a recipe they downloaded from a product’s QR code and then be guided through a store to where they can get the specific ingredients for the recipe.
4. Politics - There is major interest among politicians in using behavioral science to understand the hot topics of the day. The mood of the nation can be measured from data sources like Twitter and Facebook. Announcements can be tracked for effectiveness and impact on political indicators.
5. Weather and Climate Science - Climate simulations are used to predict likely weather patterns, which feed into product procurement in retail, crop planning in agriculture and even events planning.
7. Infectious diseases - Spatio-temporal epidemiological modelling is a major field of interest, with projections for Ebola and AIDS.
8. Process performance - Customers are simulating their current processes against newly planned processes for effective improvements.
9. Satellite imagery - Imagery from satellites is overlaid with other data to identify patterns.
10. Sensors - Customers are tracking cooling and heating costs by recording over seventy million data points, with readings every three seconds from in-store equipment sensors.
11. Mechatronics - By combining mechanical engineering, electrical engineering, telecommunications engineering, control engineering and computer engineering, customers are bringing different science fields together to improve products in areas such as medical mechatronics, medical imaging systems and robotic manufacturing.
12. Medical Research - Spectrograph, DNA Sequencing, RNA sequencing and cytogenomics as applied to heart disease and cancer research.
13. High-performance Analytics - Massive volumes of data that require highly performing analytics are now stored and processed.
14. Enterprise Resource Planning Analysis - Customers are using patterns to predict behavior that leads to actions such as churn.
15. Fraud Detection - Data from transactions plus behavioral science are used to detect fraudulent behavior.
16. Counter-terrorism - Customers are tracking the behavior of people and organisations that can be connected to terrorist activities.
What is a typical data set size for a client that is just getting started with big data? Does this vary by industry?
Andre Vermeulen: The size of the data set varies from project to project. For the Custom Advertising project we only had 20 TB of data but needed to deliver the top 10 results within 30 milliseconds. The project uses the velocity and variety parts of the Big Data triangle to achieve value.
For the Cancer Research project we consumed 240 TB of data per day from 20 DNA Sequencing devices and a data store of 12 petabytes. The project uses the velocity and volume parts of the Big Data triangle to achieve value.
For the Sensors project we consumed only about 10 TB, but the project uses 70 million data points per day from over 1500 sensors with over 400 different data structures. The readings arrive every 3 seconds and are used to check how effectively the company controls its warehouse cooling and shop heating requirements. The project uses the velocity and variety parts of the Big Data triangle to achieve value.
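To put the figures from the Cancer Research and Sensors projects into perspective, here is a rough back-of-envelope calculation. It is only a sketch: it assumes decimal units (1 TB = 10^12 bytes), a perfectly even load across the day, and no compression or discarding of data, none of which is stated in the interview. All variable names are illustrative.

```python
SECONDS_PER_DAY = 24 * 60 * 60  # 86,400 seconds

# Assumption: decimal units, as storage vendors usually quote them.
GB = 10**9
TB = 10**12
PB = 10**15

# Cancer Research project: figures quoted in the interview.
cancer_daily_bytes = 240 * TB
cancer_devices = 20
cancer_store_bytes = 12 * PB

# Average intake per sequencing device, in TB per day.
per_device_tb = cancer_daily_bytes / cancer_devices / TB

# Sustained ingest rate if the load were spread evenly over the day.
ingest_gb_per_s = cancer_daily_bytes / SECONDS_PER_DAY / GB

# Days of raw intake the store could hold if nothing were compressed or dropped.
days_of_intake = cancer_store_bytes / cancer_daily_bytes

# Sensors project: figures quoted in the interview.
sensor_points_per_day = 70_000_000
reading_interval_s = 3

# Average arrival rate of data points.
points_per_s = sensor_points_per_day / SECONDS_PER_DAY

# One stream sampled every 3 seconds yields this many readings per day.
readings_per_stream = SECONDS_PER_DAY // reading_interval_s

# Number of 3-second streams implied by the daily total.
implied_streams = sensor_points_per_day / readings_per_stream

print(f"Per device: {per_device_tb:.0f} TB/day")
print(f"Ingest rate: {ingest_gb_per_s:.2f} GB/s")
print(f"Store holds: {days_of_intake:.0f} days of raw intake")
print(f"Sensor rate: {points_per_s:.0f} points/s")
print(f"Implied streams: {implied_streams:.0f}")
```

Under these assumptions, each sequencing device averages 12 TB per day, and the 12 PB store corresponds to about 50 days of raw intake. The sensor total works out to roughly 2,400 three-second streams, which would be consistent with "over 1500 sensors" if some sensors report more than one value per reading; that last point is an inference, not something the interview states.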
Is big data a must for everybody?
Andre Vermeulen: No, big data is only a must if you have at least one of large volume, high velocity or diverse variety as a requirement, and you can prove business value. We have sadly been part of several proof-of-concept big data project teams that simply could not prove business value. Not every mountain of data contains the golden nuggets the business needs.
What services or applications are most often connected with big data deployments? Is Hadoop really everywhere?
Andre Vermeulen: We are seeing an increase in Hadoop-based data systems, but Hadoop is not the only big data technology we deploy. There is a major increase in technologies like CouchDB, MongoDB, Cassandra, MarkLogic, Riak, Oracle NoSQL Database and Neo4j. We are also seeing more customers attaching tools like Tableau or QlikView via Hadoop Hive. There is also a major increase in the use of R on big data sources like Hadoop, to achieve a variety of analytic functions or machine learning.
Everything seems to be in-memory today, how do you see this trend evolving?
Andre Vermeulen: Yes, it is evolving. The concept of in-memory is getting blurred by advances in solid-state drives (SSDs) that are getting to near-memory speed with hybrid cache SSD designs. The introduction of flash memory into servers is now starting to deliver 1 TB cards that support 285,000/385,000 read/write IOPS. Also, the introduction of graphics processing units (GPUs) accelerates the interrogation of data from the big data store. GPU devices like NVIDIA Tesla and AMD FirePro S9000 are now enabling high-velocity processing without using in-memory technology.
Cloud and Hadoop: in your view, how do they get along?
Andre Vermeulen: We are seeing a growth in Hadoop on bare-metal implementations due to the performance increase over traditional virtualised cloud use of Hadoop. The traditional cloud virtualisation of Hadoop nodes is not delivering the promised performance. Hadoop in the cloud works as a pay-per-use asset for a company.
Andre Vermeulen: 1. A private cloud that is ready 24/7 supports the company with full high availability.
2. Physical isolation of the Hadoop nodes.
3. Instant provisioning and cloning of nodes to adapt to changing business requirements.
4. Pay-per-use model enhances the business budgeting process of the data processing.
How do you see Hadoop moving in the future, e.g. Spark versus MapReduce?
Andre Vermeulen: With the introduction of Hadoop with YARN and Tez, the advantage of Spark over MapReduce is growing. Typically, Spark runs programs 70 to 100x faster than Hadoop MapReduce in memory and 5 to 10x faster on disk. Several of the big data analytics providers are also looking at providing support for Spark directly in their tools. The big drive is due to Spark supporting “map” and “reduce” operations, SQL queries, streaming data, and complex analytics like machine learning and graph algorithms out of the box.
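For readers unfamiliar with the "map" and "reduce" operations mentioned above, the pattern can be sketched in a few lines of plain Python. This is a single-machine illustration only: it does not use Spark's or Hadoop's APIs, and the function names and sample data are made up for the example. A map step turns each input line into partial word counts, and a reduce step merges the partial counts pairwise.

```python
from collections import Counter
from functools import reduce
from typing import Iterable


def map_phase(line: str) -> Counter:
    """Map step: turn one line of text into partial word counts."""
    return Counter(line.lower().split())


def reduce_phase(left: Counter, right: Counter) -> Counter:
    """Reduce step: merge two partial results by adding their counts."""
    return left + right


def word_count(lines: Iterable[str]) -> Counter:
    """Apply the map step to every line, then fold the results together."""
    return reduce(reduce_phase, map(map_phase, lines), Counter())


# Hypothetical sample input, just to exercise the functions.
sample_lines = ["big data is big", "data science uses data"]
counts = word_count(sample_lines)
print(counts.most_common(2))
```

Because the reduce step is associative, the partial counts can be merged in any grouping. That property is what lets frameworks built on this pattern split the work across many nodes and combine the results afterwards.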
Andre Vermeulen has worked with Big Data from the industry’s start in the 80s and has kept at the forefront of its developments ever since. His motto is: “A day spent not learning something new is a day wasted.” Andre’s fields of expertise are: Data Science, Business Intelligence Solutions, Advanced Statistical Analysis, High Performance and Distributed Computing, Mechatronics.