What I Learned About Data Lakes
What is a data lake? It is a way for an organization to collect, store, and analyze all of its data in one place. Data is stored in its native format, which makes it easier for users to explore. Data lakes also automate routine data management and support a broad range of analytical use cases.
The data lake's complexity and many data streams cause two common problems: low data quality and difficulty managing the data. Without adequate data quality and governance, the lake becomes a swamp. The end-user needs cataloged metadata to know where the data came from and how it was derived. For instance, if there is a calculated field, the end-user needs to know what it represents. Without good data governance, it becomes increasingly difficult to manage the many streams of data. Data governance means tracking where the data comes from, who touched it, and how it relates to other data. The data lake's utility decreases as the data becomes unreliable, and users revert to the siloed data sources they trust. To keep a lake from turning into a swamp, it needs to be a single repository without silos, be deployable to a nearly unlimited number of users, and maintain data governance.
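To make the idea of cataloged metadata more concrete, here is a minimal sketch of a catalog that records where each dataset came from, who touched it, and which datasets it was derived from. The CatalogEntry class, the lineage function, and every dataset and person name below are hypothetical, invented for illustration; real catalog tools have their own models.

```python
from dataclasses import dataclass, field


@dataclass
class CatalogEntry:
    """Hypothetical metadata record for one dataset in the lake."""
    name: str
    source: str                                        # where the data came from
    description: str                                   # what the data represents
    derived_from: list = field(default_factory=list)   # upstream dataset names
    touched_by: list = field(default_factory=list)     # who modified the data


def lineage(catalog, name):
    """Return every upstream dataset of `name`, nearest ancestors first."""
    seen = []
    queue = list(catalog[name].derived_from)
    while queue:
        parent = queue.pop(0)
        if parent in seen:
            continue
        seen.append(parent)
        if parent in catalog:
            queue.extend(catalog[parent].derived_from)
    return seen


# Illustrative entries only; none of these datasets come from the article.
catalog = {
    "erp_export": CatalogEntry(
        "erp_export", "ERP system", "Raw nightly ERP extract"),
    "pos_sales": CatalogEntry(
        "pos_sales", "point-of-sale system", "Raw sales transactions"),
    "product_costs": CatalogEntry(
        "product_costs", "derived", "Unit cost per product",
        derived_from=["erp_export"], touched_by=["alice"]),
    "daily_margin": CatalogEntry(
        "daily_margin", "derived", "Calculated field: sales minus cost per day",
        derived_from=["pos_sales", "product_costs"], touched_by=["bob"]),
}

upstream = lineage(catalog, "daily_margin")
print(upstream)
```

With a record like this, an end-user looking at the calculated daily_margin field can see what it represents and trace it back to the raw sources it depends on.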
You may be familiar with data warehouses, which have been around approximately 20 years longer than data lakes. Data warehouses store highly curated data in a schema that defines the rows and columns. The data attributes need to be known upfront to create the table schema. As a result, the business must transform unstructured data to align with the table structure before loading it. Common unstructured and semi-structured data includes logs, JSON, XML, images, videos, and social media data. Data lakes do not impose a schema when data is stored and can hold this data without transformation.
Interestingly, end-users can access most data lakes through a data warehouse, where a schema is required. Instead of creating the schema upfront, the data lake applies it at the time of analysis, using the data warehouse as a presentation layer. Data scientists define the schema for their machine learning, predictive analytics, and data discovery projects using familiar SQL tools.
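Defining the schema at the time of analysis is often called schema-on-read. The following is a rough sketch of the idea using only the Python standard library rather than a real warehouse or SQL engine. The sample records, the SALES_SCHEMA mapping, and the function names are all assumptions made up for illustration.

```python
import json

# Raw events as they might land in the lake: stored untouched, with
# inconsistent fields and types. All values here are made up.
RAW_LINES = [
    '{"store": "A", "amount": "19.99", "ts": "2021-03-01T09:15:00"}',
    '{"store": "B", "amount": 5, "coupon": "SPRING"}',
    '{"store": "A", "amount": "7.50", "ts": "2021-03-01T10:02:00"}',
]

# The analyst picks the schema at read time: column name -> type converter.
SALES_SCHEMA = {"store": str, "amount": float}


def read_with_schema(lines, schema):
    """Parse raw JSON lines and project each record onto the given schema."""
    rows = []
    for line in lines:
        record = json.loads(line)
        row = {}
        for column, convert in schema.items():
            value = record.get(column)
            # Fields missing from a record become None instead of failing.
            row[column] = convert(value) if value is not None else None
        rows.append(row)
    return rows


def total_by_store(rows):
    """Sum the amount column per store, skipping rows without an amount."""
    totals = {}
    for row in rows:
        if row["amount"] is None:
            continue
        totals[row["store"]] = totals.get(row["store"], 0.0) + row["amount"]
    return totals


rows = read_with_schema(RAW_LINES, SALES_SCHEMA)
totals = total_by_store(rows)
print(totals)
```

The raw lines never change. A different project could read the same data with a different schema, for example {"store": str, "coupon": str}, without any upfront transformation.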
Data lakes came about to improve data exploration, interactive data analysis, and event-driven analytics. It is challenging to know in advance where correlations in data will come from. By accessing data streams instead of siloed warehouses, it becomes possible to identify new patterns by exploring vast amounts of data at once. Often, the pursuit of one business question leads to more questions. In a siloed data warehouse, additional data means the data scientist needs to extract, transform, and load (ETL) more data for analysis. With a data lake, this interactive data analysis is possible without the additional ETL steps. Monitoring business processes requires fresh data that may not be available in a data warehouse that is updated only each morning. Keeping dashboards and reports current requires a data lake that ingests data streams regularly.
Some data lake solutions include Hadoop (multiple variations), Snowflake, Amazon Simple Storage Service (S3), Microsoft Azure Blob Storage, and Google Cloud Storage. Depending on the provider, they can be cloud-based services that reduce upfront costs and scale flexibly, eliminating time-consuming capacity planning.
Data-driven businesses rely on enterprise resource planning, customer relationship management, and point-of-sale systems to make business decisions. Data scientists can combine this business-generated information with third-party data streams such as weather forecasts, IoT sensors, financial markets, and social media to extract new insights.