The Underfunded Data Democracy
The growth of enterprise data and its impact on IT are well documented. As the business’s requests for insights have outpaced IT’s ability to deliver well-governed data solutions, the market has responded with a variety of self-service business intelligence, data preparation, and data science tools and platforms. These tools, along with a paradigm shift (and acquiescence) by IT, have given analysts direct access to extract and analyze both governed and ungoverned data sources. Analysts can now quickly define and publish BI logical models on top of databases and big data datasets, a capability that used to be reserved for the BI architect. As the analyst toolbox has evolved, we’ve also seen organizational changes in the form of disparate analyst groups thriving under the umbrella of various lines of business.
Decentralized Analysis But Not Control
Despite this democratization of data, the days of a central BI Competency Center (now evolved into the Analytics Center of Excellence) are far from over. In fact, these cross-functional committees are needed now more than ever, because providing sufficient data governance, documentation, and support to these disparate groups is critical to the success of a self-service analytics strategy. The centralized BI report factory is dying, and centralized governance and development cannot keep up with the increased demand. Yet despite these committees’ best efforts to provide self-service documentation, data dictionaries and metadata repositories tend to get out of date, and the volume of data lineage and ‘how do I query this’ emails and chats inundates the data engineering team.
One Problem, Multiple Vendors
Some ETL and BI vendors have integrated metadata catalogs and search capabilities within their platforms, with varying degrees of success. Ultimately, it remains difficult for a single vendor to be everything to everyone. Bringing to market and maintaining intuitive applications that help users of different technical skill levels manage data pipelines, conduct self-service analytics (including data blending at granular levels), and collaborate on cross-functional data documentation is a tall order.
Out of the big data movement, and with the continued trend toward a ‘roll your own’ collection of cloud apps, we now see SaaS metadata collaboration tools entering the landscape to help with these problems.
Best Features of Smart Cataloging Tools
Some of the most useful features in the latest smart cataloging tools include:
Crowdsourcing: Let domain data experts across the organization contribute metadata definitions
Machine Learning: Use machine learning and heuristics to generate content and to recommend anomaly detections, data value mappings, join conditions, and data quality scores
Collaboration: Allow cross-functional team discussions about data availability, breadth, and quality. In some cases, vendors may choose to integrate with existing chat tools like Slack and HipChat.
Ease of use: Fit into the analyst’s workflow by allowing search, querying, and data extraction directly within the application, with results sent to a file, database, or BI workbook
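To make the heuristics idea a little more concrete, here is a minimal sketch of two of the recommendations mentioned above: a data quality score and a suggested join condition. It is an illustration only and does not reflect how any particular vendor does it. It assumes tables are small in-memory dictionaries mapping column names to lists of values, and the function names quality_score and recommend_joins, the min_overlap threshold, and the sample tables are all hypothetical.

```python
from itertools import product


def quality_score(values):
    """Share of non-missing values in a column (a simple completeness metric)."""
    if not values:
        return 0.0
    present = [v for v in values if v is not None and v != ""]
    return len(present) / len(values)


def recommend_joins(left, right, min_overlap=0.5):
    """Suggest column pairs whose distinct values overlap (Jaccard similarity)."""
    suggestions = []
    for (lname, lvals), (rname, rvals) in product(left.items(), right.items()):
        lset = {v for v in lvals if v is not None}
        rset = {v for v in rvals if v is not None}
        union = lset | rset
        if not union:
            continue
        overlap = len(lset & rset) / len(union)
        if overlap >= min_overlap:
            suggestions.append((lname, rname, round(overlap, 2)))
    # Strongest candidates first
    return sorted(suggestions, key=lambda s: s[2], reverse=True)


# Hypothetical sample tables, for illustration only
orders = {
    "order_id": [1, 2, 3, 4],
    "cust_id": [10, 11, 10, 12],
    "note": ["rush", None, "", None],
}
customers = {
    "id": [10, 11, 12, 13],
    "region": ["east", "west", "east", None],
}

print(quality_score(orders["note"]))        # 0.25
print(recommend_joins(orders, customers))   # [('cust_id', 'id', 0.75)]
```

A real catalog would presumably work from sampled column profiles rather than full value sets, and would combine value overlap with other signals such as column names, data types, and observed query logs.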
It’ll be interesting to see how adoption of these commercial tools proceeds and to what extent they meet the needs of distributed data governance. Moreover, the evolution of open source projects such as Apache Atlas promises further innovation and disruption in this space.