MapR’s chief data engineer, Michael Hausenblas, looks at how organisations approach the task of generating insights and how emerging technologies are breaking the notion of the one-size-fits-all relational database.
Do the following examples sound familiar? You set off in your car to the airport to catch a flight, run into road works on the motorway and get stuck in a huge traffic jam; you’re now late, so you’ll miss your flight. Or you head to the park in a t-shirt and shorts on a clear autumn day, and within hours the rain starts coming down. Both situations are common and are the result of insufficient insight. A quick check of online traffic information would have prompted an alternative route to the airport, while a glance at an online weather report would have suggested a sweater for that trip to the park.
Both mistakes stem from the same underlying lack of insight, and the same holds true for data-driven business decisions. More often than not, I see businesses make bad decisions because they have no way of generating insight from the mass of data available to them.
It would therefore be wise to employ a holistic approach to driving business decisions, based on the internal and external data you have at your disposal. Yet there are only a few multi-national companies I know of, Google among them, that really ‘walk the talk’ and base their strategic and tactical decisions to a great extent on data. But how can housing providers do the same, and what are the key factors to consider?
- Data provenance – For the entire data lifecycle, you need to know where your data came from, understand what processes the data has been through, identify who is responsible for the data, and finally track who has access to it.
- Data immutability – Do you keep the data in the rawest form possible, never changing the original data but only deriving views from it (see the sketch after this list)? And can you afford the storage to do so?
- Automation – Assess how much of your data acquisition, cleansing, de-duplication, processing and transformation processes benefit from automation.
- Business-driven – You need to elicit clear and ‘SMART’ goals from the business so that you can prove the efficiency and effectiveness of your processes in enabling the business to make faster and better decisions.
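To make the provenance and immutability points concrete, here is a minimal Python sketch, assuming a simple append-only JSON-lines file as the raw store; the file name, field names and helper functions are all invented for illustration, not a recommendation of any particular technology.

```python
import json
from datetime import datetime, timezone

RAW_LOG = "raw_events.jsonl"  # hypothetical append-only store; lines are never rewritten

def ingest(event: dict) -> None:
    """Append a raw event, stamped with basic provenance metadata."""
    record = {
        **event,
        "ingested_at": datetime.now(timezone.utc).isoformat(),
        "source": event.get("source", "unknown"),  # who or what produced the data
    }
    with open(RAW_LOG, "a") as f:  # append-only: the original data is immutable
        f.write(json.dumps(record) + "\n")

def derive_view(predicate):
    """Build a disposable view by re-reading the raw log; the raw data stays intact."""
    with open(RAW_LOG) as f:
        events = (json.loads(line) for line in f)
        return [e for e in events if predicate(e)]

# Example: record a repair request, then derive a view over such requests.
ingest({"type": "repair_request", "property": "12A", "source": "tenant_portal"})
repairs = derive_view(lambda e: e.get("type") == "repair_request")
```

The point of the pattern is that any view can be thrown away and rebuilt from the raw log, which is what makes immutability easy to reason about, even if it is not always cheap to store.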
To get through the checklist above, a number of hurdles need to be overcome when it comes to managing and analysing data to generate insights. The software author Martin Fowler popularised the term ‘polyglot persistence’, which neatly gets to the heart of the problem. He suggests that any decent-sized enterprise will have a variety of data storage technologies for different types of data. Large amounts of data will still be managed in relational databases, but increasingly we’ll first ask how we want to manipulate the data and only then work out which technology is best for analysing it.
Polyglot persistence
In a world of data, most of it stored in databases, different technologies are designed to solve different problems, and using a single database engine for all requirements usually leads to non-performant solutions. Make no mistake, polyglot persistence as a ‘meme’ has a direct impact on how you design and implement solutions for large-scale data processing. Moreover, it will influence the way you think about the tools you deploy and how you generate insights. Rather than the one-size-fits-all approach we’ve been used to implementing via the likes of Oracle and others over the past ten or more years, we should now consider a tool-belt approach. And, as an architect, it’s up to you to select the right combination of tools for the tasks at hand. In many ways, the Hadoop ecosystem is designed to be that tool belt and offers many options.
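To see why a single engine struggles, consider the following self-contained Python sketch; the data set and function names are invented purely for illustration. A key-driven look-up touches a single record, while an aggregation has to scan them all, and a storage engine tuned for one access pattern is rarely strong at the other.

```python
# Illustrative only: two access patterns over the same transaction data.
# A key-value store optimises the first; an analytic/columnar engine the second.

transactions = {
    "txn-001": {"customer": "alice", "amount": 42.50},
    "txn-002": {"customer": "bob",   "amount": 19.99},
    "txn-003": {"customer": "alice", "amount": 7.25},
}

def lookup(txn_id: str) -> dict:
    """Quick, key-driven look-up: one record, constant time."""
    return transactions[txn_id]

def total_spend(customer: str) -> float:
    """Scan-and-aggregate: every record must be read."""
    return sum(t["amount"] for t in transactions.values()
               if t["customer"] == customer)

print(lookup("txn-002"))     # {'customer': 'bob', 'amount': 19.99}
print(total_spend("alice"))  # 49.75
```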
Which system you choose will depend on the types of data you’re dealing with and the type of insight you are trying to gather. For example, a customer’s shopping basket and a financial transaction involve different data sets and different workloads. Is it a quick, key-driven look-up? Do you need to scan and aggregate data over many records? Do you have ad-hoc queries? Or timed, repeated jobs that run in batch mode? Is low latency your primary concern? And of course, all the tooling should not only be available at scale, in the petabytes and beyond, but must also be reliable and perform well.
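As one deliberately simplified answer to those questions, the sketch below maps the workload types just described to tools from the Hadoop ecosystem; the pairings are illustrative assumptions on my part, not a prescriptive or exhaustive list.

```python
# Illustrative mapping only: one plausible Hadoop-ecosystem tool per workload type.
WORKLOAD_TO_TOOL = {
    "quick key-driven look-up":    "HBase (wide-column store)",
    "scan and aggregate in batch": "Hive or MapReduce",
    "ad-hoc interactive queries":  "Apache Drill or Impala",
    "low-latency processing":      "Storm or Spark Streaming",
}

def suggest_tool(workload: str) -> str:
    """Return a candidate tool, or advise profiling for an unknown workload."""
    return WORKLOAD_TO_TOOL.get(workload, "profile the workload first")

print(suggest_tool("ad-hoc interactive queries"))  # Apache Drill or Impala
```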
The notion of polyglot persistence and its direct impact on generating insight and decision-making is a broad topic, but it is an important conversation that should form the basis of technology adoption. Distributions of Hadoop are already starting that process by letting the best tool work on the most relevant part of a problem while providing a common framework for analysis. One day, the majority of platforms will hopefully embrace polyglot persistence, but the days of the big monolithic database as the only starting point for generating insights are definitely at an end.
Michael Hausenblas is the chief data engineer for MapR EMEA.