Data lakes have arisen to solve a growing problem: the need for a scalable, low-cost data repository that allows organizations to easily store all data types from a diverse set of sources, and then analyze that data to make evidence-based decisions.
Data lakes are an ideal way to gather, store and analyze enormous amounts of data in one location. The modern cloud data lake leverages the power, flexibility, and near-infinite scalability of the cloud.
A Deep Dive Into Cloud Data Lakes
The term data lake was coined by James Dixon in 2010 to describe a new type of data repository for storing massive amounts of raw data in its native form, in a single location.
Getting the Data Flowing
Prior to the data lake, there was the data warehouse. Data warehouses were built primarily for analytics. They used relational databases and schemas to define tables of structured data in orderly columns and rows.
In contrast, the data lake’s goal was to enable organizations to explore, refine, and analyze huge amounts of information — a petabyte’s worth or more, constantly arriving from multiple sources — without a predetermined notion of structure.
The original data lake:
Some Problems With Older Data Lakes
The core data lake technology was based on the Apache Hadoop ecosystem.
Many on-premises data lake projects failed to fulfill the promise of data lake computing due to a number of reasons:
- Burdensome complexity
- Slow time to value
- Heavy system management efforts
A complex distributed architecture, and its need for custom coding for data transformation and integration, made it difficult to derive useful analytics and contributed to Hadoop’s demise.
The original promise of the data lake remains: a way for organizations to collect, store, and analyze all their data in one place.
And as cloud computing maintains its popularity, this new paradigm reveals its potential: modern cloud technologies allow you to create innovative, cost-effective, and versatile data lakes. They even allow you to extend existing Hadoop data lakes, cloud object stores (computer data storage architecture that manages data as objects), and other technologies.
Data Lake Requirements
To be truly useful, a data lake must:
- Be able to easily store data in native formats
- Facilitate user-friendly exploration of that data
- Automate routine data management activities, and
- Support a broad range of analytics use cases
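The first requirement — storing data in its native format in one place — can be illustrated with a small sketch. Here a local directory stands in for a cloud object store, and the file names and records are hypothetical:

```python
import csv
import json
import pathlib

# A local directory stands in for a cloud object store in this sketch.
lake = pathlib.Path("lake/raw")
lake.mkdir(parents=True, exist_ok=True)

# Semi-structured data lands in its native JSON form, no schema defined up front.
events = [{"user": "u1", "action": "click"}, {"user": "u2", "action": "view"}]
(lake / "events.json").write_text(json.dumps(events))

# Structured data lands in its native CSV form alongside it.
with open(lake / "orders.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerows([["order_id", "amount"], ["1001", "25.00"]])

# Exploration: a reader can list and load everything from one repository.
stored = sorted(p.name for p in lake.iterdir())
print(stored)  # ['events.json', 'orders.csv']
```

The point is not the local filesystem but the pattern: heterogeneous formats coexist in one repository and remain readable as-is.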
Most of today’s data lakes can’t effectively organize all data and must be filled from a number of data streams, each of which delivers data at a different frequency. Without adequate data quality and data governance, even well-constructed data lakes can quickly become data swamps — unorganized pools of data that are difficult to use, understand, and share with business users. The greater the quantity and variety of data, the more significant this problem becomes.
Other common problems include:
- Poor performance — slow queries, unnecessary data reads, and delays waiting on data teams
- Difficulty managing and scaling environments — multiple systems, silos, and copies
- Increasing licensing costs for hardware and software
Cue the Cloud Data Lake
The early data lakes of 2010 became data swamps and left many organizations struggling to produce the insights they needed.
Some organizations leveraged cloud object storage services to create their own data lakes from scratch. These services allow customers to store unlimited amounts of data in their native formats, and enable them to conduct analytics.
Although customers no longer have to manage the hardware, as was needed with Hadoop, they still have to create, integrate, and manage the software:
- Setting up procedures to transform the data
- Establishing policies and procedures for identity management, security, data governance, and other essential activities, and
- Figuring out how to obtain high-performance analytics
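The first of those do-it-yourself tasks — writing your own transformation procedures — might look like this minimal sketch, where the nested event shape is hypothetical:

```python
import json

# Raw, semi-structured events as they might land in the lake (hypothetical shape).
raw = '[{"user": {"id": 7, "country": "US"}, "ts": "2021-01-01", "action": "click"}]'

def transform(record):
    """Flatten one nested event into an analytics-friendly row."""
    return {
        "user_id": record["user"]["id"],
        "country": record["user"]["country"],
        "ts": record["ts"],
        "action": record["action"],
    }

rows = [transform(r) for r in json.loads(raw)]
print(rows[0]["user_id"])  # 7
```

Every such procedure is custom code the customer must write, test, and maintain — which is exactly the integration burden described above.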
Rise of the Modern Data Lake
In recent years, a far better data lake paradigm has arisen: a blend between popular object stores and a flexible, high-performance cloud-built data warehouse.
These solutions have become the foundation for the modern data lake: a place where structured and semi-structured data can be staged in its raw form — either in the data warehouse itself or in an associated object storage service.
Modern data lakes provide the environment to easily store, load, integrate, and analyze data, deriving deep insights that inform data-driven decision-making.
These modern data lakes provide near-unlimited storage capacity and compute power, scaling as needed.
Why the Modern Cloud Data Lake
- Being able to store unlimited amounts of diverse data makes the cloud well-suited for data lakes
- This environment can be operated with familiar SQL tools
- Because all storage objects and necessary compute resources are internal to the modern data lake platform, data can be accessed and analytics can be executed quickly and efficiently
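The “familiar SQL tools” point can be sketched with standard SQL. Here Python’s built-in sqlite3 engine stands in for a cloud data lake platform’s SQL layer, and the clickstream rows are hypothetical:

```python
import json
import sqlite3

# sqlite3 stands in for a data lake platform's SQL engine in this sketch; the
# point is only that ordinary SQL works once raw data has been staged and loaded.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE clicks (user_id TEXT, page TEXT)")

# Semi-structured events (hypothetical) loaded into a queryable table.
events = json.loads(
    '[{"user_id": "u1", "page": "/home"},'
    ' {"user_id": "u1", "page": "/cart"},'
    ' {"user_id": "u2", "page": "/home"}]'
)
conn.executemany("INSERT INTO clicks VALUES (:user_id, :page)", events)

# A familiar SQL aggregation over the staged data.
top = conn.execute(
    "SELECT page, COUNT(*) AS visits FROM clicks GROUP BY page ORDER BY visits DESC"
).fetchall()
print(top)  # [('/home', 2), ('/cart', 1)]
```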
Who the Modern Data Lake Is For
E-commerce retailers use modern data lakes to collect clickstream data for monitoring web-shopping activities:
- They analyze browser data in conjunction with customer buying histories to predict outcomes
- With these insights, retailers provide timely, relevant, and consistent interactions for acquiring, serving, and retaining customers
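The retail pattern above — analyzing browsing events alongside purchase histories — reduces to a join. A minimal sketch, with entirely hypothetical customers and products:

```python
# Hypothetical clickstream events and purchase histories; joining them sketches
# how browsing behavior is analyzed alongside what a customer already bought.
clicks = [
    {"customer": "c1", "viewed": "running-shoes"},
    {"customer": "c2", "viewed": "headphones"},
]
purchases = {"c1": ["running-socks"], "c2": []}

def enrich(event):
    """Attach a customer's purchase history to a browsing event."""
    history = purchases.get(event["customer"], [])
    return {**event, "history": history, "returning": bool(history)}

enriched = [enrich(e) for e in clicks]
print(enriched[0]["returning"])  # True: c1 has bought before
```

The enriched records are what downstream models or campaigns would consume to deliver those timely, relevant interactions.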
Oil and gas companies use data lakes to improve geologic exploration and make their extraction operations more efficient and productive:
- Data from hundreds or thousands of sensors helps them discover trends, predict equipment failures, streamline maintenance cycles, and understand operations
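A toy version of that sensor use case: comparing each new reading against a moving average of recent ones to flag values that may precede equipment failure. The readings and threshold are invented for illustration:

```python
# Hypothetical sensor readings; a simple moving-average threshold sketches how
# streams from many sensors can surface readings that precede equipment failure.
readings = [70.1, 70.4, 69.9, 70.2, 84.6]
window = 4

def is_anomaly(history, value, tolerance=5.0):
    """Flag a reading that deviates from the recent average by more than tolerance."""
    baseline = sum(history) / len(history)
    return abs(value - baseline) > tolerance

history, alerts = readings[:window], []
for value in readings[window:]:
    if is_anomaly(history, value):
        alerts.append(value)
    history = history[1:] + [value]  # slide the window forward

print(alerts)  # [84.6]
```

Real deployments would run far richer models over far more data, but the shape — stream in, baseline, flag, act — is the same.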
Banks and financial service companies use data lakes to analyze market risks and determine which products and services to offer.
All customer-focused organizations can use data lakes to collect and analyze data from social media sites, customer relationship management (CRM) systems, and other sources. They can use data to gauge customer sentiment, adjust go-to-market strategies, mitigate customer-support problems, and extend targeted offers to customers and prospects.
It’s important to remember that traditional data lakes fail because of their inherent complexity, poor performance, and lack of governance.
Modern cloud data lakes overcome these challenges thanks to foundational tenets such as:
- No silos — easily ingest petabytes of structured, semi-structured, and unstructured data into a single repository
- Instant elasticity — supply any amount of compute resources to any workload, dynamically change the size of a compute cluster without affecting running queries, and scale services to easily include additional compute clusters to complete intense workloads faster
- Concurrent operations — deploy to a near-unlimited number of users and workloads to access a single copy of your data, without affecting performance
- Embedded governance — present fresh and accurate data to users, with a focus on collaboration, data quality, access control, and metadata (data about the data) management
- Transactional consistency — confidently combine data to enable multi-statement transactions and cross-database joins (combining tables that reside in different databases)
- Fully managed — with a software-as-a-service (SaaS) solution, the data platform itself largely manages and handles provisioning, data protection, security, backups, and performance tuning, allowing you to focus on analytical endeavors rather than on managing hardware and software
This is part 1 of a 5 part story.
The information contained in this article was inspired by Cloud Data Lakes for Dummies provided by Snowflake resources. I am NOT an employee of Snowflake nor am I affiliated with its employees or infrastructure.
I’m a Data Engineer always looking to broaden my knowledge and I’m passionate about Big Data. I also enjoy spreading the knowledge I gain in topics that I find interesting. This article is meant to do just that. I play video games, watch TV, and love to travel (when it’s safe to do so). I’m a happy father of 2 fur babies and when I’m not giving them attention, I’m blogging about data and big data infrastructures!