The Pragmatic Data Guy: December 2012

First Installment: The Use Cases

Introduction

The world of software and all things computer-related is disappointingly prone to wildly exaggerated trends with poorly understood and overloaded terms. We've got a long and rich history of these terms. Big Data and NoSQL are two examples that have exploded recently. How does someone who's been toiling away in corporate IT, going from one project to another, fighting fires, and keeping the corporate masters happy make sense of all of this? Are you a traditional BI professional trying to figure out what all of this means to you? These are observations and conclusions that I've pulled together over the last couple of years in my quest to make sense of it all.

The first thing I learned is that it's extremely important to understand the unique characteristics of three high-level use cases and how they differ. With this awareness and understanding, it becomes much easier to evaluate whether or not 'Big Data' or NoSQL fits in your world.

Use Case 1: Corporate BI

If you've been involved in data warehousing and BI for the past 25 years, then you are familiar with this use case. I've boiled this use case down to the following key parameters:

Very complex structured data model.
10 to 100 terabytes of active, interesting data.
Thousands to tens of thousands of standard views of data across business units and levels within the business.
Tens of thousands of individual unique data elements.
Hundreds of unique data subject areas.
Data integrated across many business functions, many applications.
High rate of new/novel report or information requests.
Large number of analysts with BI tool skills (like Cognos, Business Objects, etc.) but they are NOT software engineers (Java developers, etc.).
Total number of distinct users limited by employees of the corporation (< 10,000 users, only a handful of which are actively submitting on-demand reports or interacting with data directly).
Relatively less technologically sophisticated (usually) business users are the BI consumers. The folks managing the data and producing the BI solutions are NOT the consumers of the solutions.

Use Case 2: Data Scientist

If you are like me, you've been bombarded recently by the term 'data scientist'. Rather than proffer a definition, let me instead list the key parameters that describe them and their use case below:

They are technical. Not software engineers exactly, but they can write JavaScript, Python, and MapReduce jobs.
They know their way around multiple NoSQL technologies, including Hadoop, HBase, and Hive, in addition to perhaps one of MongoDB, CouchDB, or Riak.
Work with data volumes from 10 TB to petabytes.
Low complexity of data: Typically, and perhaps controversially given the hyperbolic language used to describe data science, the complexity of the structured data is much less than in the corporate BI world. True, some of the data is unstructured in the typical sense, but the complexity of, and the need for conformation of, data across hundreds of subject areas as in the corporate BI case is simply not there (most of the time).
For a given set of data, the number of data scientists analyzing it concurrently is relatively small, perhaps fewer than 10, compared with a large department of centralized and distributed analysts producing solutions for the corporate BI case.
Data scientists typically work with 'big data' technologies such as Hadoop, and other NoSQL technologies. MapReduce is not an ad-hoc query language, nor can it discover hidden patterns in data. They have to tell it what to go out and 'map' and then 'reduce'. In order to do that, they have to have a specific thing in mind they are looking for, which implies that there is a specific question they are investigating. Once the data have been returned by the MapReduce task, the data scientist typically does something else, such as data mining/predictive analytics, traditional statistics, and/or data visualization. All of this requires a lot of thought, planning, and effort.
Ultimate consumers of the results may be business decision makers or the data scientists themselves.
The expected turnaround time can be longer than the corporate BI scenario.
Data scientists can exist anywhere in the corporate world and be looking at almost any kind of data in any sort of domain. This may seem to conflict with the fourth point above (low complexity of data). I maintain that, most of the time, the data that the data scientist is working with at any given moment is less complex, and because he/she is not building a solution that must meet the needs of demanding, production-oriented business users as in the case of corporate BI, less care can be taken in data definition and data modeling.
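
To make the MapReduce point above concrete, here's a minimal sketch in plain Python (not Hadoop) of what "telling it what to map and then reduce" looks like. The sample records, the function names (map_phase, shuffle, reduce_phase, run_job), and the question being asked (page views per page) are all made up for illustration and are not any real framework's API. The thing to notice is that the question is baked into the code before a single record is read.

```python
from collections import defaultdict
from functools import reduce

# Hypothetical input: (user_id, page) page-view records.
records = [
    ("alice", "/home"),
    ("bob", "/pricing"),
    ("alice", "/pricing"),
    ("carol", "/home"),
    ("bob", "/home"),
]


def map_phase(record):
    """Emit (key, value) pairs; here, a count of 1 for each page viewed."""
    _user, page = record
    yield (page, 1)


def shuffle(pairs):
    """Group emitted values by key, roughly what a framework does between phases."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Collapse all values for one key into a single result."""
    return key, reduce(lambda a, b: a + b, values)


def run_job(records):
    """Run map, shuffle, and reduce in sequence on an in-memory list."""
    pairs = (pair for record in records for pair in map_phase(record))
    grouped = shuffle(pairs)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


views_per_page = run_job(records)
print(views_per_page)  # {'/home': 3, '/pricing': 2}
```

If the question changes to, say, views per user, map_phase has to be rewritten to emit a different key and the whole job rerun. That is the sense in which this is not an ad-hoc query language.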

Use Case 3: SaaS Solution

Software as a service solutions often have a BI/reporting/dashboard component. There is a wide range of SaaS applications. For my purposes, I'm going to take the case where the application is reasonably complex, with many types of reference data and transactions. The characteristics of the SaaS use case are:

Large data volumes (10 TB plus), especially in multi-tenant systems.
Medium complexity data models. Not as complex as a corporation and all of its processes, but not as simple as counting the number of clicks on an ad banner and grouping by different criteria.
Hundreds of thousands to millions of users. Much, much higher than corporate users or data scientists.
Relatively stable reporting/BI views of data revolving around well understood business questions.
Customer-facing analytic capabilities like OLAP are considered premium offerings and are fairly rare.

Saturday, December 15, 2012

Untangling the Big Data and NoSQL Hype
