Interview with Dev Lakhani from Batch Insights on Big Data
Dev Lakhani has a background in Software Engineering and a degree in Computational Statistics from Oxford. Founder of Batch Insights, Dev has been actively working with the Hadoop infrastructure since its inception and is currently researching and contributing to the Apache Spark and Tachyon communities.
What problems does big data solve for your customers?
The first solution is being able to centralise their data onto a global distributed file system and (SQL/NoSQL) warehouse that has the capacity to store tens, hundreds or thousands of terabytes. This means fewer data silos, less duplication and less out-of-date information, and it is a simple way to achieve large savings on redundant processes and repeatedly ingested data.
Second, it helps them analyse all the granular information they collect, such as logs, ad clicks, basket purchases and trades, without having to subset or sample the data. In Data Science and Machine Learning, the ability to work with large (and complete) datasets is essential for prediction and classification accuracy. For finance in particular, all data has to be accounted for in order to meet stringent regulatory requirements. This is where big data is an ideal solution.
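The sampling point can be illustrated with a toy sketch in plain Python: a rare class (say, fraudulent trades) can shrink or vanish entirely in a small subsample, which is why modelling on the complete dataset matters for classification accuracy. The data below is invented purely for illustration:

```python
import random

random.seed(0)
# 10,000 trades, of which only 10 are fraudulent (0.1%).
trades = ["fraud"] * 10 + ["ok"] * 9990
random.shuffle(trades)

sample = random.sample(trades, 100)   # a 1% subsample
fraud_in_sample = sample.count("fraud")
fraud_in_full = trades.count("fraud")

# The full dataset always contains all 10 fraud cases; a small
# sample frequently contains none, so a model trained on it may
# never see the event it is supposed to predict.
```

With the complete dataset, `fraud_in_full` is always 10; `fraud_in_sample` is often 0.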
What is a typical data set size for a client that is just getting started with big data? Does this vary by industry?
It does vary by industry, but in general, proofs of concept tend to start at 1TB to 10TB in total, with production datasets growing to 50TB+ per day. The point is that it's more about data granularity and complexity than overall file size.
Is big data a must for everybody?
Big data is not always about the total size of the data; it also depends on the complexity, diversity and structure of the data. Sometimes you can make use of big data technology such as Cassandra or Spark simply because it offers an efficient way to store and analyse data, whether that is 4GB or 4PB. For example, some of our clients collect only about 5GB of data per day, but it has to be joined from 100 sources, so they have made use of big data tech to achieve this.
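The many-sources point can be sketched as a toy key-based join, here in plain Python with invented source names; in a real deployment this would be a distributed join in Spark or a wide-row model in Cassandra:

```python
from functools import reduce

# Toy records from three of the (hypothetical) hundred sources,
# each keyed by a shared customer id.
crm     = {"c1": {"name": "Alice"}, "c2": {"name": "Bob"}}
weblogs = {"c1": {"clicks": 42},    "c2": {"clicks": 7}}
billing = {"c1": {"spend": 120.0},  "c2": {"spend": 35.5}}

def join_sources(*sources):
    """Merge the per-key fields from every source into one record per key."""
    def merge(acc, src):
        for key, fields in src.items():
            acc.setdefault(key, {}).update(fields)
        return acc
    return reduce(merge, sources, {})

customers = join_sources(crm, weblogs, billing)
# customers["c1"] == {"name": "Alice", "clicks": 42, "spend": 120.0}
```

The complexity here is not volume but the fan-in: at 100 sources, schema drift and join keys dominate the problem, which is what the distributed tooling manages.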
As mentioned above, big data is an architectural shift from traditional systems, which means that you can perform distributed analytics on centralised data. What's important to know is that big data tech often comes with distributed processing, redundancy, resiliency and load/performance balancing across multiple nodes built in for "free". That is a vast improvement over single-server architectures, where developers and architects had to implement such non-functional requirements themselves.
What services or applications are most often connected with big data deployments? Is Hadoop really everywhere?
HDFS (Hadoop Distributed File System) is normally everywhere as a first step. For a lot of our clients we do not complicate the solution architecture with tools such as Flume, Hue, Hive or HBase unless there is a commercial driver for their use. In our experience, HDFS with MapReduce and higher-level abstractions such as Pig and Cascading are normally ubiquitous.
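As an illustration of the MapReduce model that abstractions like Pig and Cascading compile down to, here is a toy word count in plain Python with the map, shuffle and reduce phases spelled out. On a real cluster each phase runs distributed across nodes, with the framework handling the shuffle:

```python
from collections import defaultdict

documents = ["big data big insight", "data lake data silo"]

# Map phase: each input record is turned into (key, value) pairs.
mapped = [(word, 1) for doc in documents for word in doc.split()]

# Shuffle phase: pairs are grouped by key (the framework does this,
# moving all values for a key to the same reducer).
groups = defaultdict(list)
for word, count in mapped:
    groups[word].append(count)

# Reduce phase: each key's values are combined into one result.
word_counts = {word: sum(counts) for word, counts in groups.items()}
# word_counts["data"] == 3, word_counts["big"] == 2
```

A Pig or Cascading script expresses the same pipeline declaratively; the point of the higher-level tools is to avoid hand-writing these three phases for every job.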
We typically see tools like R and BI front ends connected to big data deployments. We also see real time messaging like Kafka being used to read and write streams of data to internal clients.
Everything seems to be in-memory today, how do you see this trend evolving?
From a development point of view, in-memory processing is really powerful and helps us leverage real-time analytics for big data. Spark, with its caching and efficient operations on distributed data, is a good example of how in-memory processing is taking the lead.
Furthermore, new frameworks for off-heap memory storage in Java are helping developers break through garbage-collection bottlenecks. This means the data lake can be stored off-heap, ready for use, facilitating fast distributed caching.
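The value of caching an intermediate dataset can be sketched in plain Python: without a cache, every downstream query recomputes the expensive transformation, which is roughly what calling `.cache()` on a Spark RDD or DataFrame avoids. The counter here is just for illustration:

```python
recomputations = 0

def expensive_transform(records):
    """Stand-in for a costly distributed transformation."""
    global recomputations
    recomputations += 1
    return [r * 2 for r in records]

raw = [1, 2, 3, 4]

# Without caching: each downstream query re-runs the transformation.
total = sum(expensive_transform(raw))      # recomputation #1
maximum = max(expensive_transform(raw))    # recomputation #2

# With caching: compute once, keep the result in memory, reuse it.
recomputations = 0
cached = expensive_transform(raw)          # analogous to rdd.cache()
total = sum(cached)
maximum = max(cached)
# Only one recomputation, however many queries follow.
```

Spark applies the same idea across a cluster's memory, and off-heap stores such as Tachyon extend it by keeping that cached data outside the JVM heap, away from garbage-collection pauses.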
Cloud and Hadoop: in your view, how do they get along?
Hadoop and big data are about scaling out, and the best way to do this is via the cloud and on-demand instance provisioning. Improvements in cloud-based storage, such as SSDs, and in inter-node communication speeds have also helped address the IO bottlenecks traditionally encountered by Hadoop.
Cloud and Hadoop: what cloud features are important for Hadoop?
Being able to easily connect (SSH) on to nodes with root access, set up firewalls and easily provision new nodes is really important for production systems. Security is also paramount for our clients with encryption and restricted access being a priority, be it at the hardware, OS or application level. In-built monitoring on hardware and software resources is essential for production deployments.
How do you see Hadoop moving in the future, e.g. Spark versus Map Reduce?
Spark is an elegant solution for replacing Hadoop. With a simplified architecture that focuses more on distributed datasets than on batch operations, it is a natural progression from traditional MapReduce. What's more, the integration of graph processing, machine learning and SQL is a clever strategic decision that makes the framework a complete built-in stack, avoiding the patchwork approach Hadoop took when it was first conceived.
Dev Lakhani has a background in Software Engineering and a degree in Computational Statistics from Oxford. He is a founder of Batch Insights, a Big Data consultancy that has worked on numerous Big Data architectures and data science projects in Tier 1 banking, global telecoms, retail, media and fashion. Dev has been actively working with the Hadoop infrastructure since its inception and is currently researching and contributing to the Apache Spark and Tachyon communities.