Analytics 2020: The Data Management Journey
Cloud and Data Management: these two topics have dominated most of our conversations with both new and existing customers for the past two years. The challenges we hear are strikingly similar regardless of the industry in which customers operate. Data continues to grow, and with it the cost of managing it. Many businesses implemented pure-play Hadoop-based Data Lakes to address the challenge but saw little traction in data usage by the business. The common story is that business users still spend 80% of their time massaging data and only 20% using it. Now, many are looking to the cloud for additional savings, but that alone will not address the challenge of low usage. In this post, we discuss the Data Management journey and share what we have seen repeatedly deliver the best results for our customers.
The Beginning: Data Marts and Data Warehouses
Data Management initiatives started against the backdrop of the proliferation of Enterprise Resource Planning (ERP) systems. It is now well known that online transaction processing (OLTP) systems such as ERPs are ill-equipped to handle historical and time-series data analyses. OLTP data models and databases are neither designed nor tuned for complex queries. Furthermore, data quality problems introduced while entering transactions often led to incomplete results. As the number of OLTP systems grew, so did the problems, as data now resided in silos, often in disparate formats. Data Management initiatives were born to build data repositories: Data Warehouses (DW) and Data Marts (DM), which leverage star schemas, or Operational Data Stores (ODS), which leverage a mildly normalized format. However, data needed to be consolidated, standardized and deduplicated to improve quality and usability before any of these solutions could live up to their potential. Analytical tools were built to capitalize on these models to provide curated metadata, versatile analytics, dashboards and alerting capabilities. Data governance, comprising security, metadata management and quality, was delivered only through a combination of databases, analytics tools, and data profiling and quality tools. Data integration tools were developed to centralize these functions and provide a security and administration layer, but these tools introduced a lag before new data was available for analysis. As a system integrator delivering Analytics and BI for over 15 years, BIAS has actively traversed this journey and has built numerous highly successful data warehouses and data marts that continue to provide business value to customers to this day.
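To make the transformation step concrete, here is a minimal Python sketch of consolidating two silos, standardizing the rows and deduplicating them. Everything in it is an assumption made for illustration: the `Customer` fields, the `COUNTRY_ALIASES` mapping, the sample rows and the choice of email address as the matching key are hypothetical, and real data integration tools apply far richer matching rules.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Customer:
    name: str
    email: str
    country: str


# Hypothetical mapping used to standardize country spellings.
COUNTRY_ALIASES = {"usa": "US", "united states": "US", "u.s.": "US", "uk": "GB"}


def standardize(record: dict) -> Customer:
    """Trim whitespace and normalize case and country codes."""
    country = record["country"].strip()
    return Customer(
        name=" ".join(record["name"].split()).title(),
        email=record["email"].strip().lower(),
        country=COUNTRY_ALIASES.get(country.lower(), country.upper()),
    )


def consolidate(*sources: list) -> list:
    """Merge several silos, standardize each row, and drop duplicates by email."""
    seen = {}
    for source in sources:
        for record in source:
            customer = standardize(record)
            # Keep the first occurrence of each email address (assumed match key).
            seen.setdefault(customer.email, customer)
    return list(seen.values())


# Hypothetical rows from two OLTP silos describing overlapping customers.
erp_rows = [{"name": "ada  lovelace", "email": "Ada@Example.com ", "country": "uk"}]
crm_rows = [
    {"name": "Ada Lovelace", "email": "ada@example.com", "country": "GB"},
    {"name": "grace hopper", "email": "grace@example.com", "country": "USA"},
]

customers = consolidate(erp_rows, crm_rows)
```

In this sketch the two spellings of the same customer collapse into one standardized record, which is the kind of cleanup a warehouse load has to perform before star-schema reporting can be trusted.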
And then came the data onslaught brought about by the “World Wide Web” and the smartphones…the “Big Data” boom!
Medieval Times: Data Lakes
In 2015-2018, the Analytics market shifted direction towards “data discovery” due to a strong thrust on Big Data: unstructured data of a wider variety, in higher volume and, for the most part, not well understood. Data discovery focused more on a directional understanding of information than on accuracy, which led to faster growth fueled by Data Visualization and Self-Service features. While governed analytics remained the mainstay of enterprise-level decision-makers due to their need for reliable information, department-level decision-makers invested in simpler tools to meet immediate business demands. This phase, therefore, saw the birth of “shadow IT,” where Data Management layers were relegated to “parked raw data” in Hadoop systems while visualization tools were often pointed directly at OLTP systems. Hadoop-based “data lakes” lowered the unit cost of managing data but took the focus away from high-powered analytics of the data assets, which, paradoxically, was the main reason businesses wanted to collect large amounts of data. Instead, business users grappled with tools that promised better analytics on unstructured data in Hadoop, but often found ROI distant and feeble. The honeymoon period with pure Hadoop-based data lakes was short-lived and is nearly over, but the concept of “data discovery” is here to stay; after all, the web and mobile revolutions are generating more data at a faster pace than ever.
On the Analytics side, most pure-play data visualization tools in the market lacked enterprise robustness in security, high availability and scalability. This limited their deployments to departmental solutions used to query OLTP silos. As a result, most enterprises had to manage and support multiple analytical tools: one enterprise-class analytics tool providing governed, reliable, accurate analytics and one or more departmental shadow IT frameworks that leveraged visualization tools. While IT costs went up due to higher training, administration and maintenance costs associated with multiple tools, the worst impact was that enterprise data remained in silos within the different tools. Many good enterprise-class tools now bridge the gap on visualizations and provide IT with an opportunity to eliminate shadow IT, consolidate the multiple Analytics frameworks and reduce their “analytics” costs. These tools can only live up to their potential if they can access a well-designed and efficient Data Management layer.
Modern Times: Governed Data Lakes
Accordingly, 2019 saw a concerted focus back towards “governed data management” as organizations recognized that the right balance of “speed vs. quality” is necessary for high-value analyses. Businesses prefer reliable information in a timely manner rather than raw data delivered in haste. A Data Lake may be a good central location to collect data from myriad enterprise sources, but it is important to compartmentalize data so that the valuable time of business users is spent looking at the compartment housing the correct data set. The approach also allows powerful data scientists to work with either raw or mildly processed data according to their needs in their compartment while keeping the transformed data for use by business users in another compartment. This compartmentalization allows for curated dashboards to be available to the executives in a timely manner. The security framework in this approach is applied uniformly and simultaneously to all compartments, thus making it easier to manage and administer.
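The compartment idea can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not a description of any particular product: the zone names, the role names, the `POLICY` table and the `GovernedDataLake` class are all hypothetical. The point it shows is that a single policy is checked on every read and write, whichever compartment is touched.

```python
from enum import Enum


class Zone(Enum):
    RAW = "raw"              # data as collected from source systems
    PROCESSED = "processed"  # mildly processed data
    CURATED = "curated"      # transformed data for business users


# Hypothetical role-to-zone policy: one table governs every compartment.
POLICY = {
    "pipeline": set(Zone),
    "data_scientist": {Zone.RAW, Zone.PROCESSED},
    "business_user": {Zone.CURATED},
    "executive": {Zone.CURATED},
}


class GovernedDataLake:
    """In-memory stand-in for a lake whose zones share one security policy."""

    def __init__(self, policy):
        self._policy = policy
        self._zones = {zone: {} for zone in Zone}

    def _authorize(self, role, zone):
        # The same check runs for every zone, so security is applied uniformly.
        if zone not in self._policy.get(role, set()):
            raise PermissionError(f"{role} may not access the {zone.value} zone")

    def write(self, role, zone, name, rows):
        self._authorize(role, zone)
        self._zones[zone][name] = rows

    def read(self, role, zone, name):
        self._authorize(role, zone)
        return self._zones[zone][name]


lake = GovernedDataLake(POLICY)
lake.write("pipeline", Zone.RAW, "orders", [{"id": 1, "amount": "10.5"}])
lake.write("pipeline", Zone.CURATED, "orders", [{"id": 1, "amount": 10.5}])

curated = lake.read("business_user", Zone.CURATED, "orders")

try:
    lake.read("business_user", Zone.RAW, "orders")
    denied = False
except PermissionError:
    denied = True
```

Here the business user reads the curated data set but is refused the raw one, while a data scientist role would see the raw and processed compartments. Because the policy lives in one place, changing who may see what does not require touching each compartment separately.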
Essentially, what we are proposing is a design that provides a convergence of Data Lakes and Data Warehouses into a single extensible architecture that BIAS refers to as a “Governed Data Lake.” Our approach provides a central location for data, one that enables multiple levels of processed data to be seamlessly leveraged by stakeholders according to their individual needs while maintaining security controls.
Gartner seems to agree. According to Gartner's 2020 Magic Quadrant for Analytics and Business Intelligence Platforms report, “By 2023, 90% of the world’s top 500 companies will have converged analytics governance into broader data and analytics governance initiatives.” In this report, Gartner has substantially modified its evaluation criteria for Analytics and Business Intelligence (ABI) Platforms and has identified “Security” and “Manageability,” the two cornerstones of “Governance,” as the top two critical functionalities.
So, What about Cloud?
The Data Management train that started with purely curated Data Warehouses and Data Marts nearly a quarter-century ago has taken the not-so-scenic route through the Data Lake/Data Swamp regions and is now full steam ahead towards the greener pastures of the “Governed Data Lake.” The journey has not always had a well-defined track, so while data has been the driver, the tools and technologies related to databases, data integration and analytics have often found themselves anticipating the curves on the route to stay the course, with varying degrees of success. Through it all, the goal has always been to offer the best information to enterprise business users in a timely manner to enable better decision making.
A “Governed Data Lake” can be built either on-premises or in the cloud. However, the cloud approach speeds up the build process thanks to several innovations in the DevOps space. Construction, configuration, administration and maintenance of Data Management, including infrastructure and technologies such as databases, integration and analytics tools, are substantially easier in the cloud. This leads to both greater agility in deployment and a reduction in effort…and, therefore, cost.
So, where are we heading in the Analytics space in the coming decade?
I will cover that in the next post. Might I add, I’m very excited about the 2020 Data Management trends as we continue serving our customers through their Data Management journey. Please contact BIAS for more information.
Author – Ashish Bokil
The following thoughts, intentions, strategies and/or solutions are those of the blog authors and do not represent the position of anyone other than the authors.