Data lakes are cool, but you don’t have to jump in head-first. It’s easy to start by dipping a toe: Integrating a legacy data warehouse into a data lake leverages the structured systems that have been built over the years while taking advantage of the ever-increasing volume of unstructured data across the organization.
A Familiar Structure
Data warehouses are familiar. Generally, IT teams know how to work with them. Over the past two decades, companies have invested heavily in building data warehouses, and that investment goes well beyond servers and people: physical facilities, energy consumption, front-end systems, integration and so on.
When initiating a data warehouse implementation, the desired outcome must be articulated. If key performance indicators (KPIs) aren’t defined in advance, for example, or if business goals or market conditions change, you’re back to square one: It’s time-consuming and difficult to write new queries.
Data warehouses provide a snapshot in time through basic reporting of structured data, meaning clearly defined data types that are easily searchable. It’s a simple solution for reporting, or for calculations over a huge amount of data; however, the data is often outdated by the time it reaches the analysis. Real-time insights are hard to glean from a traditional data warehouse, where it can often take days to generate reports.
Data Lakes Are More Fluid
Unconstrained by the rigid schema of a relational database, a data lake collates and aggregates all of an organization’s available data sources: not just structured data, but also unstructured data such as documents, emails and social media engagements. This diverse information is quick and easy to access and report on, which facilitates the discovery of trends.
A data lake offers insight into not just what happened yesterday and what’s going on now; it also supports predictive analytics to forecast what will happen tomorrow. It leverages machine learning not just to inform decisions but, through prescriptive analytics, to discern the optimal decision.
The data lake itself may reduce costs while allowing more agility and scalability. Living in the cloud, it requires no on-site hardware and only limited resources and support. Because the data lake is still a new technology that is retaining every piece of data that might ever need to be accessed, there is a perception that it lacks some of the security features that are expected from a traditional data warehouse. These concerns are being addressed; security features are integral to the value of data lake solutions.
Proposing A Solution For Diverse Data Requirements
The flexibility and analysis potential of a data lake are hard to argue against, but what about the hundreds of thousands of dollars and hours that have been invested in a data warehouse? If KPIs and analytics are standard, a data warehouse is well suited for day-to-day reporting and indicators that draw on predefined information for historical analysis. Increasingly, companies with such warehouses will also leverage the technology of the data lake for real-time analysis of the full spectrum of their structured and unstructured data.
For established organizations, a combination solution is, in my opinion, probably the best way forward.
The hybrid solution isn’t as simple as bolting on a data lake, though. The data lake requires fewer resources to manage the hardware and software. But because it does not require data analysis to be defined in advance, as a data warehouse does, there may be a need to bring a business intelligence analyst on staff to help the organization find the data hiding in the depths of the data lake.
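To make the "defined in advance" distinction concrete, here is a minimal sketch in Python. It is an illustration only, not a description of any particular product: the column names, the sample events and both helper functions (`load_into_warehouse`, `query_lake`) are hypothetical. The warehouse-style loader fixes its columns before any data arrives, while the lake-style query decides which field matters only when the question is asked.

```python
import json

# Hypothetical warehouse table: columns are fixed before any data is loaded.
WAREHOUSE_COLUMNS = ("order_id", "amount")


def load_into_warehouse(record: dict) -> tuple:
    """Keep only the predefined columns; raise KeyError if one is missing."""
    return tuple(record[col] for col in WAREHOUSE_COLUMNS)


# Hypothetical raw events as they might land in a lake: stored as-is, mixed shapes.
raw_events = [
    '{"order_id": 1, "amount": 40.0, "channel": "web"}',
    '{"order_id": 2, "amount": 15.5, "channel": "store", "note": "gift wrap"}',
    '{"email_subject": "Where is my order?", "order_id": 1}',
]


def query_lake(raw_lines, field):
    """Pick the field of interest at query time, skipping events that lack it."""
    values = []
    for line in raw_lines:
        event = json.loads(line)
        if field in event:
            values.append(event[field])
    return values
```

In this sketch the extra `channel` and `note` fields are dropped on the way into the warehouse, and the email-style event cannot be loaded at all. The lake keeps everything, which is why someone still has to know which questions to ask of it.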
A data lake solution is generally faster and less costly to initiate. With data warehouses, the company must buy the servers and allocate personnel for two to three months of setup, and in my experience it can easily be four months until the first report is generated. Even then, stakeholders may determine that the data warehouse isn’t able to give them what they wanted, particularly in today’s agile business environment. Based in the cloud, data lakes can begin delivering almost immediately. Companies can start small and scale easily; if the data lake isn’t giving them what they need, they can simply stop the service and move on to another solution.
Considerations When Deciding On A Data Warehouse Or Data Lake Model
- Do you have a legacy data warehouse? If your organization has invested in an on-premises data warehouse, the business case for decommissioning it will be tough. If your business relies on both structured and unstructured data, you may want to consider a hybrid. If your KPIs and reporting only require simple, historical, standard analytics, the data warehouse should suffice.
- Are you building a new data repository and analytics solution? A data lake that can scale with your requirements is probably the fastest and most cost-effective option to implement. Keep in mind, though, that you may need to bring in a data analyst to get the most out of your unstructured data.
- Are your business needs reliant on historical data, or would your organization benefit from prescriptive analytics? If historical data meets your business needs, the data warehouse will fulfill your requirements. If you want to leverage machine learning technology that can suggest which decision will produce the optimal outcome, a hybrid solution or data lake is your best bet.
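The three questions above can be read as a rough decision rule. The sketch below is one possible reading, not a definitive method: the `Needs` fields and the `recommend` function are hypothetical names, and it simplifies by always pointing a brand-new build toward a data lake, as the second question suggests.

```python
from dataclasses import dataclass


@dataclass
class Needs:
    # Hypothetical yes/no answers to the three questions in the list above.
    has_legacy_warehouse: bool
    uses_unstructured_data: bool
    wants_prescriptive_analytics: bool


def recommend(needs: Needs) -> str:
    """Return 'data warehouse', 'hybrid' or 'data lake' for a set of answers."""
    # Needs that go beyond simple, historical, standard analytics.
    beyond_warehouse = (
        needs.uses_unstructured_data or needs.wants_prescriptive_analytics
    )
    if needs.has_legacy_warehouse:
        # Keep the existing investment; add a lake only if the needs call for it.
        return "hybrid" if beyond_warehouse else "data warehouse"
    # Simplifying assumption: a new build starts with a data lake.
    return "data lake"
```

For example, `recommend(Needs(True, True, False))` returns `"hybrid"`, matching the advice for an organization with a legacy warehouse that also relies on unstructured data.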
Learning To Swim In The Data Lake
This new technology requires a different attitude: a different way of thinking and working. The data lake is an abstract concept, and it can be hard to grasp; sometimes, people just don’t get it, or they may be skeptical of the benefits of unorganized information. As companies move more of their systems into the cloud and integrate more tightly with their suppliers’ systems, however, the data lake model is establishing itself as the prevailing technology, even if many organizations continue to support their traditional data warehouses.