Expert Interview with Christian Prokopp
Christian Prokopp, Principal Consultant at Big Data Partnership, kindly accepted the invitation to take part in our Expert Interviews series. Christian has an excellent combination of skills: he’s a Ph.D. data scientist, experienced solution architect and business consultant. Needless to say, we were very happy to talk to him about big data, Hadoop, interesting use cases and market trends.
1. What problems does Hadoop solve for your customers?
The specific problems vary widely since we work with all types of industries, e.g. global banks, car manufacturers, media companies, oil and gas, startups and anything in between. Generally speaking, Hadoop often forms the backbone of a new (big) data infrastructure. All types of data can be landed in HDFS in their raw format, and with YARN, Hadoop has been transformed into a flexible processing platform supporting online and offline computation. A common scenario is initially offloading ETL from expensive and specialised data warehouse systems, then enriching the ETL process with more data and building new products and features, often including analytics and machine learning.
Importantly, Hadoop is rarely the sole solution to big data needs. It is good practice to start a big data journey by taking stock of existing infrastructure and business goals, and then identifying what a future architecture may look like. The result regularly consists of a mix of big data technologies, often but not always including Hadoop. This discovery process is difficult since it requires deep and broad knowledge of the big data space, and consequently it is one of the most sought-after services Big Data Partnership offers.
2. What is a typical data set size for a client that is just getting started with big data? Does this vary by industry?
Big data is not necessarily about size. Even GBs can justify big data technology if velocity or variety is an issue. Consider loading ‘mere’ GBs into a traditional data warehouse and applying machine learning to it, especially if the data varies, is semi-structured, noisy and changes regularly, and you need insight from it continuously and in near real time without affecting the existing processes serviced by your data warehouse.
Of course, we also encounter the traditional ‘big’ data problems where companies expect to collect and analyse exabytes of data in the future. The latter is often related to sensor data, which with the Internet of Things will bring the next data explosion. We already have billions of phones and cars collecting data. Cars, for example, can collect many thousands of distinct signals. Recent research has demonstrated that self-powering, low-cost sensors/chips will soon become a reality. The future may belong to trillions of meshed low-cost sensor devices, and the data produced by them will be staggering.
3. Is Hadoop a must for everybody? What other alternatives are out there?
Hadoop is useful since it is a multi-purpose cluster resource management platform with linear scalability and shared, resilient storage. However, it is not the only option. Some specific use cases can be solved with other technologies. Elasticsearch with Logstash and Kibana (ELK) is one good example where collecting event data (logs), transforming it, making it searchable and visualising it can be done without Hadoop. Yet often enough customers will find it valuable to collect the data and run complex processing against it, e.g. for predictions and pattern detection.
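The ELK workflow described above — collect raw events, transform them into structured records, and make them searchable — can be illustrated with a minimal stdlib sketch. Note this is an illustration of the idea only, not the actual Logstash or Elasticsearch APIs; the log format and field names are hypothetical.

```python
import re
from collections import defaultdict

# Hypothetical raw log lines, standing in for events shipped by Logstash.
LOGS = [
    '2014-11-03 12:00:01 ERROR payment timeout',
    '2014-11-03 12:00:02 INFO user login',
    '2014-11-03 12:00:05 ERROR payment declined',
]

LINE = re.compile(r'(?P<date>\S+) (?P<time>\S+) (?P<level>\S+) (?P<message>.+)')

def transform(line):
    """Parse a raw log line into a structured event (the transform step)."""
    return LINE.match(line).groupdict()

def index_by(events, field):
    """Build a simple inverted index, loosely analogous to how
    Elasticsearch makes events searchable by field."""
    index = defaultdict(list)
    for event in events:
        index[event[field]].append(event)
    return index

events = [transform(line) for line in LOGS]
errors = index_by(events, 'level')['ERROR']
print(len(errors))  # → 2
```

In the real stack, Kibana would then sit on top of the indexed events for visualisation — the point being that this entire pipeline needs no Hadoop cluster.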
4. Which are some of the most common problems your clients are bringing to the table?
ETL offloading is a common one. Clients often store only a subset of their data in their data warehouse; the cleaning and transformation is expensive, and what they actually want - to store and analyse all the data quickly and flexibly - is not achievable in that environment. Another one is enriching existing data with third-party data, e.g. social media or mined data, which usually is less structured.
5. Everything seems to be in-memory today, how do you see this trend evolving?
Near real time (though this means different things to different people) is a regular challenge, and being able to store relevant data in-memory and process it in streams or micro batches will become the norm. In fact, tools like Spark will probably merge what we today understand as the Kappa and Lambda architectures. A lot of the complexity of doing things both offline and online for large-scale and near-real-time processing, e.g. building models and then making predictions with them, will go away.
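The key idea above — one piece of business logic shared between an offline batch path and an online micro-batch path, which Spark makes practical — can be sketched in plain Python. This is a conceptual illustration, not Spark code; the event schema and `score` function are hypothetical.

```python
from typing import Iterable, Iterator

def score(event: dict) -> dict:
    """Shared business logic: flag high-value events. In a Spark-style
    setting the same function would run in both batch and streaming jobs."""
    return {**event, 'high_value': event['amount'] > 100}

def batch_job(events: Iterable[dict]) -> list:
    # Offline path: process the full historical data set at once.
    return [score(e) for e in events]

def stream_job(events: Iterable[dict], batch_size: int = 2) -> Iterator[list]:
    # Online path: apply the identical logic in micro batches as data arrives.
    buffer = []
    for e in events:
        buffer.append(e)
        if len(buffer) == batch_size:
            yield [score(x) for x in buffer]
            buffer = []
    if buffer:
        yield [score(x) for x in buffer]

history = [{'amount': 50}, {'amount': 150}, {'amount': 200}]
print(batch_job(history)[1]['high_value'])       # → True
print(sum(len(b) for b in stream_job(history)))  # → 3
```

The Lambda architecture maintains these two paths as separate codebases; the promise of Spark (and the Kappa view) is that `score` is written once and the framework handles both execution modes.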
6. Cloud and Hadoop: in your view, how do they get along?
They can work well together for bursting, variable loads, small and medium deployments. Hadoop works well with commodity hardware but truly performant clusters require decent hardware and network connectivity. Cloud offerings are improving in this regard and some innovative solutions like BigStep are certainly something I will keep an eye on.
7. How do you see Hadoop moving in the future, e.g. Spark versus Map Reduce?
MapReduce will not go away for a while since it has been the underpinning of everything Hadoop in recent years. The future belongs to more flexible solutions, and Spark (on YARN) will play a big role in this. In particular, the ability to run the same logic in offline and online fashion is extremely valuable, as is the in-memory caching option, though even on disk Spark is highly efficient. Hive on Spark is another exciting development, which is backed by Hortonworks.
8. What is the next big development in big data?
I think polyglot persistence will revolutionise the way we work with data. The idea of persisting data in whatever store is most suitable and joining it with tools like Apache Drill will remove a lot of the work we are doing now. Imagine, for example, that you could join data transparently from Redis, Cassandra, HBase, HDFS, and an RDBMS as simply as you write SQL queries against a single RDBMS today. This is an exciting vision, and not too far off thanks to the support for the project from the likes of MapR.
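To make the polyglot-persistence idea concrete, here is a minimal sketch of the kind of cross-store join a tool like Drill would express declaratively in SQL. The two "stores" are modelled as plain Python dicts and lists; the store contents and field names are invented for illustration.

```python
# Two hypothetical stores: a key-value cache (think Redis) and a
# wide-column style table (think Cassandra/HBase), modelled as plain data.
sessions = {                      # key-value store: user_id -> last_seen
    'u1': '2014-11-01',
    'u2': '2014-11-03',
}
orders = [                        # column-store rows of order data
    {'user_id': 'u1', 'total': 30},
    {'user_id': 'u1', 'total': 70},
    {'user_id': 'u2', 'total': 20},
]

def join_user_orders():
    """Hand-written equivalent of the SQL join a query engine like Drill
    would run transparently across the heterogeneous stores."""
    result = []
    for order in orders:
        last_seen = sessions.get(order['user_id'])
        if last_seen is not None:
            result.append({**order, 'last_seen': last_seen})
    return result

joined = join_user_orders()
print(len(joined))             # → 3
print(joined[0]['last_seen'])  # → 2014-11-01
```

The appeal of the vision is that this glue code disappears: the engine knows how to read each store, and the user writes only the join.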
Christian Prokopp, PhD, is a Principal Consultant at Big Data Partnership in London, developing big data strategies, architectures and products for clients. Previously, Christian worked as a big data architect and senior data scientist at Rangespan, a wholesale e-commerce analytics platform, until a successful exit when Google acquired the company. In his spare time, Christian writes for his blog and works as a ghostwriter and speaker.