Big Fast Data – Perishable insight at scale
Data is perishable if there is only a limited amount of time to act upon the event it describes. Market data, clickstreams, mobile devices, sensors, and transactions may contain valuable, but perishable, insights. Hadoop is the de facto tool for analytics at scale, providing the ability to batch process petabytes of data in minutes or hours. But what about the computational explosion we’re seeing with these new data sources? The new challenge is processing millions of events per second, and acting on them before their value expires.
Big Fast Data poses a subtle set of challenges for business people and developers because questions are asked in a fundamentally different manner. In a traditional query model, you store data and then run queries on that data as needed. This is a query-driven, or request/response, model. Any application with a Submit button follows this model. In a streaming data model, you store queries and then continuously run data through those queries. This is an event-driven, or continuous query, model. If you analyze time series data using temporal patterns, like Bollinger Bands, you are already familiar with this model. Let’s see how a bank could use Big Fast Data to improve customer satisfaction through a Customer 360 perspective.
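The inversion between the two models can be sketched in a few lines of plain Python. This is a toy illustration, not any product's API; the names (`run_query`, `ContinuousQuery`) are mine.

```python
# Query-driven (request/response): data is stored first, a query runs on demand.
table = [{"user": "a", "amount": 120}, {"user": "b", "amount": 45}]

def run_query(rows, predicate):
    """Evaluate the query once, against data already at rest."""
    return [r for r in rows if predicate(r)]

on_demand = run_query(table, lambda r: r["amount"] > 100)

# Event-driven (continuous query): the query is stored first,
# and each arriving event is run through it immediately.
class ContinuousQuery:
    """Holds a standing predicate; every incoming event is tested on arrival."""
    def __init__(self, predicate):
        self.predicate = predicate
        self.matches = []

    def on_event(self, event):
        if self.predicate(event):
            self.matches.append(event)

standing = ContinuousQuery(lambda r: r["amount"] > 100)
for event in table:  # in a real stream, events arrive one at a time, forever
    standing.on_event(event)
```

Both paths find the same matches; the difference is when the question is asked relative to when the data arrives.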
In the banking world, disintermediation, or cutting out the middleman, is a serious and constant threat. Google’s definition actually includes the example “investing directly in the securities market rather than through a bank”. Sometime in the near future, Google could include Google Wallet in its definition of disintermediation. DBS Bank, based in Singapore, has been fairly well covered as an innovator in using Big Fast Data to manage its customer experience. DBS uses its client information to categorize customers, statistically predict behavior, and send targeted promotional offers to its clients’ cell phones while they are engaged with a business partner. While the technical details of its implementation are proprietary, some basic architectural assumptions can be made. A location-aware promotion application requires continuous, event-driven queries that constantly monitor, analyze, and respond to the changing status of subscribers and promotions. These queries are multidimensional, matching location, context, preferences, socioeconomic factors, and spending habits. And all of these dimensions are in flux.
If you are sending offers to people’s phones while they are shopping, you need to be very timely. Assuming people move once per minute, one million people would generate almost seventeen thousand events per second. DBS has four million customers. Bank of America has 57 million, Chase has 50 million, Citibank has 200 million, and Wells Fargo has 70 million. This isn’t to say that a bank with a larger customer base than DBS cannot replicate this experience; it merely serves as a reminder that we always need to engineer for scale beyond what we can envision today.
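The back-of-the-envelope arithmetic behind those numbers is simple: one movement event per customer per minute, divided by sixty seconds.

```python
# Event rate under the assumption above: each customer moves once per minute.
def events_per_second(customers, moves_per_minute=1):
    return customers * moves_per_minute / 60

# One million movers alone is roughly 16,667 events per second.
rates = {
    "DBS (4M)": events_per_second(4_000_000),
    "Wells Fargo (70M)": events_per_second(70_000_000),
    "Citibank (200M)": events_per_second(200_000_000),
}
```

At Citibank's scale the same assumption works out to over three million events per second, which is why the architecture below leans so heavily on horizontal scalability.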
So how do you engineer for scale for a real-time, location-aware, Customer 360 view? One way is to take all of your customer data and put it into a central log for real-time subscription. I’ll provide a brief implementation roadmap using a Kafka -> Storm -> MongoDB -> Pentaho pipeline, since this is the fastest path to implementation that I’ve built.
In this context, data will be understood as events, and the fundamental unit of storage for events is a log. A log is an append-only, totally ordered sequence of events, ordered by time. When you separate data from the context in which it was created and just store it as a timestamped, schema-last log, you have an immutable record of what happened when. This enables parallelism and fault tolerance, which are the core of a distributed system. Functional computation over immutable inputs is idempotent: applying the computation multiple times does not change the result beyond the initial application. Massively parallel computation is built on immutable inputs and functional calculations, and that is a prerequisite for an event-driven system at scale.
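A minimal sketch of these two ideas, an append-only time-ordered log and an idempotent computation over it, might look like this. The `Log` class and `latest_position` function are illustrative, not part of any library discussed here.

```python
from typing import Any, Optional

class Log:
    """A minimal append-only, time-ordered event log (a sketch, not Kafka)."""
    def __init__(self):
        self._events = []  # records are only ever appended, never mutated

    def append(self, payload: Any, ts: float) -> int:
        self._events.append((ts, payload))
        return len(self._events) - 1  # offset of the new record

    def read_from(self, offset: int):
        """Replayable: the same offset always yields the same records."""
        return list(self._events[offset:])

log = Log()
log.append({"device": "phone-1", "lat": 1.29, "lon": 103.85}, ts=1.0)
log.append({"device": "phone-1", "lat": 1.30, "lon": 103.86}, ts=2.0)

# A functional computation over immutable inputs is idempotent:
# running it twice over the same slice of the log gives the same answer.
def latest_position(events):
    return max(events, key=lambda e: e[0])[1]

first = latest_position(log.read_from(0))
second = latest_position(log.read_from(0))
```

Because the log never changes once written, any worker can replay any slice of it at any time and arrive at the same result, which is exactly the property a fault-tolerant distributed system needs.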
Apache Kafka provides a highly reliable and scalable publish-subscribe enterprise messaging system. Kafka has producers: applications that send data to be stored. In the system we are building, each producer is required to bundle each unit of data as a log entry and send it to a specific, predefined topic in Kafka. Considering that most new Internet of Things devices broadcast their data as JSON, storing that data as a payload in an ordered flow is trivial. In this context, the log acts as a kind of messaging system with durability guarantees and strong ordering semantics. A subscriber to a topic is always able to identify the freshest data, while the underlying mechanism of the producer stays hidden. Decoupling data from the engine that produced it is critical in building a large, distributed system: it creates too much overhead if you need to alert subscribers every time a producer moves from a relational database to a key-value store.
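The topic-and-offset semantics described above can be sketched with a small in-memory class. To be clear, this is an illustration of the contract, not Kafka's actual client API; real Kafka partitions, replicates, and persists topics across brokers.

```python
from collections import defaultdict

class Topic:
    """In-memory sketch of topic semantics: producers append, each
    subscriber tracks its own offset, and ordering is preserved."""
    def __init__(self):
        self._log = []
        self._offsets = defaultdict(int)  # subscriber name -> next unread offset

    def publish(self, payload):
        """Producer side: append a payload to the end of the topic."""
        self._log.append(payload)

    def poll(self, subscriber):
        """Subscriber side: return everything not yet seen, oldest first."""
        start = self._offsets[subscriber]
        self._offsets[subscriber] = len(self._log)
        return self._log[start:]

locations = Topic()
locations.publish({"customer": 42, "lat": 1.29, "lon": 103.85})
locations.publish({"customer": 42, "lat": 1.30, "lon": 103.86})

fresh = locations.poll("promo-engine")  # both events, in publish order
later = locations.poll("promo-engine")  # nothing new has arrived yet
```

Note that the subscriber never learns anything about who produced the events or how; that opacity is the decoupling the paragraph above argues for.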
Apache Storm provides massively scalable event collection. Storm is a distributed computation framework built around a directed acyclic graph (a topology) consisting of streams of data (tuples) moving from data sources (spouts) through functions (bolts). Tuples are dynamically typed key-value pairs, and it is not difficult to imagine a log entry as a key-value pair with a common key and a value whose schema is derived on read. There is a built-in spout for Kafka that allows the topic to be configured within a particular Storm topology. Within a bolt, you can apply functions, transformations, calculations, aggregations, map/reduce, and pretty much anything else you can think of doing. Bolts can be wired to execute serially or in parallel. You’ll find that the idempotency of the log structure lends itself to this style of functional programming: immutability manages complexity in distributed systems. The final bolt, the terminator bolt, can write to any number of systems: Hadoop, HBase, Cassandra, Hive or, in this case, MongoDB.
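The spout-to-bolt-to-terminal-bolt flow can be mimicked with plain Python generators. This is a toy shaped like the concepts above, not Storm's actual (Java) API; the function names and the "region" enrichment are illustrative.

```python
def spout():
    """Emits (key, value) tuples, like a Kafka spout reading a topic."""
    yield ("phone-1", {"lat": 1.29, "lon": 103.85, "ts": 1.0})
    yield ("phone-1", {"lat": 1.30, "lon": 103.86, "ts": 2.0})

def enrich_bolt(stream):
    """A bolt: tag each tuple with a region (a stand-in for a geo lookup)."""
    for key, value in stream:
        value = dict(value, region="downtown")  # copy, never mutate the input
        yield key, value

def sink_bolt(stream, store):
    """Terminal bolt: write each tuple to the store (MongoDB in our pipeline)."""
    for key, value in stream:
        store.append({"device": key, **value})

store = []
sink_bolt(enrich_bolt(spout()), store)  # wire the topology end to end
```

Because `enrich_bolt` copies rather than mutates its input, replaying the same spout output always produces the same store contents, which is the idempotency argument from the log discussion carried through the topology.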
We’re processing location-based data from mobile and IoT devices, and that data is more often than not JSON-formatted. Storm can read JSON data out of the box, and MongoDB uses BSON, or Binary JSON, as its internal storage format. Building massive, distributed, real-time systems is not easy, and this architecture is designed to avoid pain points. One could argue that ETL is incompatible with a real-time system; while that is perhaps a little strong, it is not incorrect. Besides, if you can take timestamped JSON data, retrieve it in time order, and store it without unnecessary ETL, why wouldn’t you?
Now that we have streamed data into MongoDB, we need to analyze it. Pentaho has a MongoDB driver that works across its platform, so you can perform visual analytics, data integration, and predictive analytics in one product without writing code. I have found that missing any one of those elements is a barrier to moving from POC to production. There will be customer data that is stored in neither Kafka nor MongoDB but that is needed to provide the customer context required to shape your promotional offerings. Rather than trying to break down silos, it’s best to report on that data in place, using the authentication, auditing, and governance protocols that have already been established; the data integration tool provides this. The visual analytics tools really show their power when you build real-time dashboards combining geospatial information from your new MongoDB system with your existing data warehouse. In my opinion, though, it’s the machine learning algorithms powering the predictive analytics in the data science pack that unlock the true value of the system we’ve just built. A scalable, real-time, location-aware Customer 360 system cannot be limited to standard reporting techniques. The data is too rich, dense, and broad to risk not using classification, regression, clustering, and association algorithms. I also have a personal bias towards open standards, and Pentaho Predictive Analytics is compliant with the Predictive Model Markup Language (PMML). This means your data scientists can take existing models they built in R against smaller data sets and run them against the much larger MongoDB data, for example.
We live in interesting times. While many organizations are still struggling to build out their Hadoop data lakes to handle the data explosion, the mobile device revolution has brought us a computational explosion. The best way to handle this new problem is to face it head-on with the best tools for the job. Architect for scale, code to immutable logs, don’t do ETL, and remember that data science is an integral part of business intelligence when the data is too much to wrap your head around.