Modern Data Governance
Data Management Realities in a Modern Data Architecture
We tend to put our data into different categories. These categories may be based on the data source: clickstream, web and social, geo-location, IoT, server logs, and the like are considered modern (think schema-on-read); ERP, CRM, SCM, and LOB-specific OLTP are considered traditional (think schema-on-write); mainframe is considered legacy (think mission-critical). The categories may also be based on the processing type. In a traditional query or batch model, you store data and then run queries on the data as needed (think query-driven model). In a streaming data model, you store queries and then continuously run data through the queries (think event-driven model). With all of the new technologies, the hype, and the vendor puffery, it’s easy to concentrate on the category and forget the context. Modern data refers to data, not to technologies, and it is the responsibility of those of us who architect, develop, and implement data technologies to appreciate this difference. There have been many hard-won lessons learned in enterprise data management, and the criticality of data governance may well top the list.
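The contrast between the two processing models can be sketched in a few lines of JVM code. This is a minimal illustration, not taken from any particular product; the threshold predicate and the values are invented for the example:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.IntPredicate;

public class ProcessingModels {

    // Query-driven (batch): the data is stored; a query runs over it on demand.
    static long batchCount(List<Integer> store, IntPredicate query) {
        return store.stream().filter(query::test).count();
    }

    // Event-driven (streaming): the query is stored; each event runs through it.
    static List<Integer> streamingMatches(int[] events, IntPredicate standingQuery) {
        List<Integer> matches = new ArrayList<>();
        for (int e : events) {
            if (standingQuery.test(e)) matches.add(e); // fires as data arrives
        }
        return matches;
    }

    public static void main(String[] args) {
        IntPredicate largeValue = v -> v > 40; // hypothetical business rule
        System.out.println(batchCount(List.of(3, 47, 12, 99), largeValue));        // 2
        System.out.println(streamingMatches(new int[]{5, 63, 41, 8}, largeValue)); // [63, 41]
    }
}
```

The same predicate appears in both models; what changes is whether the data or the query is the thing that sits still.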
According to the Data Governance Institute:
Data governance is a system of decision rights and accountabilities for information-related processes, executed according to agreed upon models, which describe who can take what actions with what information, and when, under what circumstances, using what methods.
Data governance has been a weak spot in Hadoop implementations. In fact, data governance and security are the two areas most often blamed for turning big data POCs into science projects at the enterprise level. Recognizing this, both Cloudera and Hortonworks have put forth significant effort and made huge strides in this area. Cloudera Navigator is “a fully integrated data management and security tool for the Hadoop platform. Data management and security capabilities are critical for enterprise customers that are in highly regulated industries and have stringent compliance requirements”. This is a closed-source, production-grade application that Cloudera is using as a product differentiator, and I would strongly recommend that any Cloudera customer in a regulated environment look into implementing it. Hortonworks took an open-source, community-driven approach and established a Data Governance Initiative along with Aetna, Merck, Target, Chase, and other business partners. The Apache Atlas project, which has been accepted by Apache as an incubator project and is available in Hortonworks Data Platform 2.3, is tasked with providing an extensible framework to enable enterprises to meet their data governance requirements. Or, as they put it:
As enterprises across all major industries deploy Hadoop into corporate data and processing environments, a common approach to working with metadata and data governance becomes a necessity. Apache Atlas was created by a consortium of enterprises to meet this need. Atlas enhances governance capabilities in Hadoop for both prescriptive and forensic models enriched by taxonomical metadata. Atlas, at its core, is designed to exchange metadata with other tools and processes within and outside of the Hadoop stack. Atlas enables platform-agnostic governance controls that effectively address enterprise compliance requirements.
I was able to manually build a generic Hadoop installation in Docker and then deploy Atlas, so I can confirm that it works outside of the Hortonworks 2.3 sandbox. However, I can also confirm that it is still in its early stages and needs significant work, particularly on the front end, before it can be considered production-grade. At the very least, you should wait for the next release, which includes Apache Ranger integration, since policy without security is suggestion, not governance. Still, after looking into the code base and evaluating the architecture, I think this could be a game-changer for the master data management component of a data governance strategy. When I first looked into the project, I expected a fairly loose coupling of Apache Falcon and Apache Ranger, with perhaps a high-level API for abstraction. That, by the way, would have been helpful, but I wondered whether it would be sufficiently robust to keep enterprises from rolling their own. A quick check of the master POM showed me a toolset that should give pause to anyone considering an in-house implementation. The combination of an open-source community approach (which I believe results in better software in the long term) and some of the underlying technology choices is compelling enough to make me think this may be more than just a big data governance tool.
I wasn’t surprised to see that they had followed current best practices for building a scalable architecture. Scala supports data-parallel operations on collections, actors for concurrency and distribution, and futures for asynchronous programming, and it is my go-to language for scalable applications on the JVM. For ease of development, Akka is a toolkit and runtime for building highly concurrent, distributed, and resilient message-driven applications on the JVM, and Spray is an open-source toolkit for building REST/HTTP-based integration layers on top of Scala and Akka. Being asynchronous, actor-based, fast, lightweight, modular, and testable, it’s a great way to connect your Scala applications to the world. Adding fastutil was a nice touch: fastutil makes it possible to handle very large collections, in particular collections whose size exceeds 2^31 elements. A RESTful API exposes a JSON-centric interface, and the extensible, modular plug-in architecture already has HIPAA, SOX, Dodd-Frank, and other compliance policies available.
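Atlas itself builds on Scala, Akka, and Spray; as a rough illustration of the futures style of asynchronous programming described above, here is an analogous sketch in plain Java using `CompletableFuture`. The `fetchMetadata` helper and the entity names are invented for the example, not part of the Atlas API:

```java
import java.util.concurrent.CompletableFuture;

public class AsyncSketch {

    // Simulate a metadata lookup that runs off the caller's thread.
    static CompletableFuture<String> fetchMetadata(String entity) {
        return CompletableFuture.supplyAsync(() -> "metadata-for-" + entity);
    }

    public static void main(String[] args) {
        // Two lookups run concurrently; the results are combined without
        // blocking until the final join().
        CompletableFuture<String> combined =
            fetchMetadata("orders")
                .thenCombine(fetchMetadata("customers"),
                             (a, b) -> a + " | " + b);
        System.out.println(combined.join());
    }
}
```

The appeal of this style is composition: callers describe what to do with results when they arrive, rather than parking threads on blocking calls.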
But it was the inclusion of graph databases that made me sit back and fork the project on GitHub. Apache TinkerPop provides graph computing capabilities for both graph databases (OLTP) and graph analytic systems (OLAP). Titan is a scalable graph database optimized for storing and querying graphs containing hundreds of billions of vertices and edges distributed across a multi-machine cluster; it is a transactional database that can support thousands of concurrent users executing complex graph traversals in real time. Neo4j first made me think about master data management as a graph rather than a relational system when they discussed master data management as a use case, and that pretty much changed my thinking on MDM algorithmically. Then, when I reflected on Atlas’s stated goal of providing a centralized location for all metadata inside the Hadoop cluster as well as a single interface point for metadata exchange with platforms outside of Hadoop, I realized that I could use Atlas for more than just managing governance in Hadoop: I could use Hadoop to manage governance. That’s the game-changing part.
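To see why a graph model fits master data management, consider a minimal sketch of a downstream-lineage traversal, the kind of question ("what does this dataset feed?") that takes recursive joins in SQL but is a natural walk in a graph database. This is plain Java, not the TinkerPop or Titan API, and the dataset names are hypothetical:

```java
import java.util.*;

public class LineageGraph {
    // Adjacency list: dataset -> datasets derived from it.
    private final Map<String, List<String>> edges = new HashMap<>();

    void addEdge(String from, String to) {
        edges.computeIfAbsent(from, k -> new ArrayList<>()).add(to);
    }

    // Everything downstream of a source entity, via depth-first traversal.
    Set<String> downstream(String start) {
        Set<String> seen = new LinkedHashSet<>();
        Deque<String> work = new ArrayDeque<>(List.of(start));
        while (!work.isEmpty()) {
            String node = work.pop();
            for (String next : edges.getOrDefault(node, List.of())) {
                if (seen.add(next)) work.push(next);
            }
        }
        return seen;
    }

    public static void main(String[] args) {
        LineageGraph g = new LineageGraph();
        g.addEdge("raw_clicks", "sessionized");
        g.addEdge("sessionized", "daily_rollup");
        g.addEdge("crm_extract", "daily_rollup");
        System.out.println(g.downstream("raw_clicks")); // [sessionized, daily_rollup]
    }
}
```

A production graph store adds transactions, distribution, and a traversal language on top, but the shape of the question is the same: follow edges from an entity, at any depth, without knowing the path in advance.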
As an architect, you WILL need to accommodate modern data in the enterprise. All hype aside, with so much data of legitimate business value being generated outside the enterprise, standard OLTP and OLAP solutions will no longer be a key differentiator in most industries. Before the business pushes you into a timeframe that forces the shortcuts we have learned will come back to haunt us, start working a data governance initiative into your modern data stack now. If smaller departmental teams take the lead before you do, their inexperience with enterprise data integrity issues will become your problem. By taking Atlas a step further as a platform, you may be able to build a single application that enforces all of your enterprise data governance policies and procedures in one place, and does it better than you did before.