Friday, January 11, 2013

Big Data, NoSQL, Now What?

In my previous post, I outlined three high-level BI use cases and their key attributes:

  1. Corporate/enterprise BI
  2. Data scientist
  3. Multi-tenant SaaS

In this post, I dive a little deeper into the Corporate/Enterprise BI use case and discuss the appropriate technologies.

Corporate/Enterprise BI

Let's first consider the case of corporate or enterprise BI.  Corporations have extremely complex structured data and derived measures.  Unique data elements are produced by product development, manufacturing, finance, demand planning, inventory management, sales operations and sales planning, order management and fulfillment, HR, and more.  There are also data shared among all of these departments.  The classic problem of multiple versions of the truth has not gone away, and vast sums of money are wasted in operational inefficiencies as armies of BI/IT/IS people (whatever you happen to call them) and analysts wrestle with reconciling these data while trying to present a coherent picture to management.

This problem has not been adequately solved, though a great deal of the writing on BI, data architecture, and data warehousing suggests that it has.  It's almost as if the industry decided the problem magically went away once it got distracted by the mountains of unstructured data piling up from clickstreams, Splunk, social media, and the like.  These problems certainly haven't gone away for the name-brand companies you know.  Operational dollars and insights remain on the table as these problems go unaddressed.  We know how to solve them, but the discipline to do so is lacking, and companies are getting distracted.  The derived measures in particular give BI departments headaches.

Can these problems be solved with Hadoop?  MapReduce?  Are the answers buried somewhere in mountains of unstructured data, waiting to be discovered by advanced analytics?  Clearly not.  Do the analysts in your company know how to write MapReduce jobs?  Do they know how to create dashboards and scorecards that retrieve their data from MongoDB?  Again, of course, they most certainly do not.

The solution here lies in the traditional data warehouse space.  The power of the Teradata platform has been democratized through competition from Netezza, Greenplum, Vertica, Exadata, and the like.  These technologies enable data warehouse specialists to manage and store structured data with clear lineage and auditable data definitions and calculations.  They allow analysts and BI specialists familiar with SQL to build powerful queries, using BI tools such as Cognos, Business Objects, SAS, and MicroStrategy, or just plain SQL, to ask novel questions without having to know how to write MapReduce jobs in JavaScript or Python.
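To make the contrast concrete, here is a minimal sketch of the kind of question a SQL-fluent analyst can answer directly against the warehouse, with no MapReduce involved. The `orders` table, its columns, and the data are all hypothetical, and SQLite stands in for a real warehouse platform like Teradata or Netezza:

```python
# A hedged sketch: a plain SQL question of the sort analysts ask every day.
# The schema and data are invented for illustration; SQLite stands in for
# a real data warehouse appliance.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (region TEXT, product TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?, ?)",
    [("West", "widget", 100.0), ("West", "gadget", 250.0),
     ("East", "widget", 75.0), ("East", "widget", 125.0)],
)

# "Revenue by region" expressed declaratively -- the analyst says WHAT they
# want; the database engine decides HOW to compute it.
rows = conn.execute(
    "SELECT region, SUM(amount) AS revenue "
    "FROM orders GROUP BY region ORDER BY region"
).fetchall()
print(rows)  # [('East', 200.0), ('West', 350.0)]
```

The point is the division of labor: the query states the question, and the engine handles execution, which is exactly the skill set BI analysts already have.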

New 'Big Data' technologies have a place in the corporate and enterprise world, but they don't yet replace the old school data warehouse.

Saturday, December 15, 2012

Untangling the Big Data and NoSQL Hype

First Installment: The Use Cases

Introduction

The world of software and all things computer related is disappointingly prone to wildly exaggerated trends built on poorly understood and overloaded terms.  We've got a long and rich history of these terms.  Big Data and NoSQL are two examples that have exploded recently.  How does someone who's been toiling away in corporate IT, going from one project to another, fighting fires, and keeping their corporate masters happy make sense of all of this?  Are you a traditional BI professional trying to figure out what all of this means to you?  These are observations and conclusions that I've pulled together over the last couple of years in my quest to make sense of it all.

The first thing I learned is that it's extremely important to understand the unique characteristics of three high level use cases and how they are different.  With this awareness and understanding, it becomes much easier to evaluate whether or not 'Big Data' or NoSQL fits in your world.

Use Case 1: Corporate BI

If you've been involved in data warehousing and BI for the past 25 years, then you are familiar with this use case.  I've boiled this use case down to the following key parameters:

  1. Very complex structured data model.
  2. 10 to 100 terabytes of active, interesting data.
  3. Thousands to tens of thousands of standard views of data across business units and levels within the business.
  4. Tens of thousands of individual unique data elements.
  5. Hundreds of unique data subject areas.
  6. Data integrated across many business functions, many applications.
  7. High rate of new/novel report or information requests.
  8. Large number of analysts with BI tool skills (Cognos, Business Objects, etc.) who are NOT software engineers (Java developers, etc.).
  9. Total number of distinct users limited to the corporation's employees (< 10,000 users, only a handful of whom actively submit on-demand reports or interact with data directly).
  10. The BI consumers are (usually) relatively less technologically sophisticated business users.  The folks managing the data and producing the BI solutions are NOT the consumers of those solutions.

Use Case 2: Data Scientist

If you are like me, you've been bombarded recently by the term 'data scientist'.  Rather than proffer a definition, let me instead list the key parameters that describe them and their use case:
 
  1. They are technical.  Not quite software engineers, but they can write JavaScript, Python, and MapReduce jobs.
  2. They know their way around multiple NoSQL technologies, including Hadoop, HBase, and Hive, in addition to perhaps one of MongoDB, CouchDB, or Riak.
  3. They work with 10 TB to petabytes of data.
  4. Low complexity of data: Typically, and perhaps controversially given the hyperbolic language used to describe data science, the complexity of the structured data is much less than in the corporate BI world.  True, there is some unstructured data in the typical sense, but the complexity of, and necessity for, conforming data across hundreds of subject areas, as in the corporate BI case, is simply not there (most of the time).
  5. For a given set of data, the number of concurrent data scientists analyzing it is relatively small, perhaps fewer than 10, compared with the large department of centralized and distributed analysts producing solutions in the corporate BI case.
  6. Data scientists typically work with 'big data' technologies such as Hadoop and other NoSQL tools.  MapReduce is not an ad-hoc query language, nor can it discover hidden patterns in data on its own.  They have to tell it what to 'map' and then 'reduce'.  To do that, they have to have a specific thing in mind they are looking for, which implies that there is a specific question they are investigating.  Once the data have been returned by the MapReduce job, the data scientist typically does something else with them, such as data mining/predictive analytics, traditional statistics, and/or data visualization.  All of this requires a lot of thought, planning, and effort.
  7. Ultimate consumers of the results may be business decision makers or the data scientists themselves.
  8. The expected turnaround time can be longer than in the corporate BI scenario.
  9. Data scientists can exist anywhere in the corporate world and be looking at almost any kind of data in any sort of domain.  This may seem to conflict with number 4 above.  I maintain that, most of the time, the data that the data scientist is working with at any given moment is less complex, and because he/she is not building a solution that must meet the needs of demanding, production-oriented business users as in the case of corporate BI, less care need be taken in data definition and data modeling.
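Item 6 above can be sketched in a few lines. This is a hedged, pure-Python stand-in for a real Hadoop job, with invented clickstream records; the point is only that the question must be encoded up front as a map step and a reduce step:

```python
# A hedged sketch of the MapReduce pattern from item 6: you must already
# know the question (here, "hits per page") before the job runs.
# Records are invented for illustration; this is not a real Hadoop job.
from itertools import groupby
from operator import itemgetter

records = [
    {"user": "u1", "page": "/home"},
    {"user": "u2", "page": "/pricing"},
    {"user": "u1", "page": "/pricing"},
]

# Map: emit (key, value) pairs -- the choice of key IS the question.
mapped = [(r["page"], 1) for r in records]

# Shuffle/sort: group intermediate pairs by key, as the framework would.
mapped.sort(key=itemgetter(0))

# Reduce: combine the values for each key.
hits = {page: sum(count for _, count in group)
        for page, group in groupby(mapped, key=itemgetter(0))}
print(hits)  # {'/home': 1, '/pricing': 2}
```

Notice there is no way to ask this job a different question after the fact; changing the question means writing and running a new map and reduce, which is exactly why the turnaround is longer than firing off an ad-hoc SQL query.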

Use Case 3: SaaS Solution

Software as a service solutions often have a BI/reporting/dashboard component.  There is a wide range of SaaS applications.  For my purposes, I'm going to take the case where the application is reasonably complex, with many types of reference data and transactions.  The characteristics of this use case are:

  1. Large data volumes (10 TB plus), especially in multi-tenant systems.
  2. Medium-complexity data models.  Not as complex as a corporation and all of its processes, but not as simple as counting the clicks on an ad banner and grouping by different criteria.
  3. Hundreds of thousands to millions of users.  Much, much higher than the number of corporate users or data scientists.
  4. Relatively stable reporting/BI views of data revolving around well understood business questions.
  5. Customer-facing analytic capabilities like OLAP are considered premium offerings and are fairly rare.