
Democratizing Your Data With a Modern Cloud Data Lake

Data lakes have arisen to solve a growing problem: the need for a scalable, low-cost data repository that allows organizations to easily store all data types from a diverse set of sources, and then analyze that data to make evidence-based decisions.

Data lakes are an ideal way to gather, store and analyze enormous amounts of data in one location. The modern cloud data lake leverages the power, flexibility, and near-infinite scalability of the cloud.

A Deep Dive Into Cloud Data Lakes

James Dixon coined the term data lake in 2010 to describe a new type of data repository for storing massive amounts of raw data, in its native form, in a single location.

Prior to the data lake, there was the data warehouse. Data warehouses were built primarily for analytics. They used relational databases and schemas to define tables of structured data in orderly columns and rows.

In contrast, the data lake was designed to let organizations explore, refine, and analyze huge amounts of information (a petabyte’s worth or more) without a predetermined notion of structure, even as that data constantly arrives from multiple sources.

The original data lake failed to deliver the desired rapid insights.

The core data lake technology was based on the Apache Hadoop ecosystem, in which HDFS (the Hadoop Distributed File System) let customers store data in its native form.
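To make the schema-on-read idea concrete, here is a minimal sketch, assuming a PySpark job reading raw JSON files that have been landed in HDFS; the path and field names are hypothetical:

```python
# A minimal sketch: reading raw JSON landed in HDFS with PySpark.
# The HDFS path and the field names are hypothetical examples.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("hdfs-raw-read").getOrCreate()

# Schema-on-read: Spark infers the structure from the raw files at read time,
# so the data can stay in its native JSON form on HDFS.
events = spark.read.json("hdfs://namenode:8020/landing/clickstream/2021/")

events.printSchema()  # inspect the inferred schema
events.select("user_id", "page_url", "event_time").show(5)
```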

Many on-premises data lake projects failed to fulfill the promise of data lake computing for a number of reasons:

  • Burdensome complexity
  • Slow time to value
  • Heavy system management efforts

A complex distributed architecture, and its need for custom coding for data transformation and integration, made it difficult to derive useful analytics and contributed to Hadoop’s demise.

The original promise of the data lake remains: a way for organizations to collect, store, and analyze all their data in one place.

And as cloud computing continues to gain adoption, this new paradigm is showing its potential: modern cloud technologies let you create innovative, cost-effective, and versatile data lakes. They even allow you to extend existing Hadoop data lakes, cloud object stores (storage architectures that manage data as objects), and other technologies.

To be truly useful, a data lake must:

  • Be able to easily store data in native formats
  • Facilitate user-friendly exploration of that data
  • Automate routine data management activities, and
  • Support a broad range of analytics use cases

Most of today’s data lakes can’t effectively organize all data and must be filled from a number of data streams, each of which delivers data at a different frequency. Without adequate data quality and data governance, even well-constructed data lakes can quickly become data swamps — unorganized pools of data that are difficult to use, understand, and share with business users. The greater the quantity and variety of data, the more significant this problem becomes.

Other common problems include:

  • Poor performance — slow queries, unnecessary data reads, and delays as requests funnel through data teams
  • Difficulty managing and scaling environments — multiple systems, silos, and copies
  • Increasing costs for hardware and software licenses

The early data lakes of 2010 became data swamps and left many organizations struggling to produce their needed insights.

As cloud computing matured, object stores from Amazon, Microsoft, and other vendors emerged as interim data lake solutions.

Some organizations leveraged these object storage environments to create their own data lakes from scratch. These solutions allow customers to store unlimited amounts of data in native formats and enable them to conduct analytics on that data.
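As a minimal sketch of that approach, the snippet below assumes Amazon S3 as the object store and the boto3 client; the bucket and key names are hypothetical:

```python
# A minimal sketch of landing raw data in a cloud object store, assuming
# Amazon S3 and the boto3 library. Bucket and key names are hypothetical.
import boto3

s3 = boto3.client("s3")

# Store a raw JSON export in its native format; no schema is imposed on write.
s3.upload_file(
    Filename="exports/orders_2021-06-01.json",
    Bucket="acme-data-lake",
    Key="raw/orders/2021/06/01/orders.json",
)

# List what has landed in the raw zone so far.
response = s3.list_objects_v2(Bucket="acme-data-lake", Prefix="raw/orders/")
for obj in response.get("Contents", []):
    print(obj["Key"], obj["Size"])
```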

Although customers no longer have to manage the hardware, as was needed with Hadoop, they still have to create, integrate, and manage the software:

  • Setting up procedures to transform the data
  • Establishing policies and procedures for identity management, security, data governance, and other essential activities, and
  • Figuring out how to obtain high-performance analytics

In recent years, a far better data lake paradigm has arisen: a blend between popular object stores and a flexible, high-performance cloud-built data warehouse.

These solutions have become the foundation for the modern data lake: a place where structured and semi-structured data can be staged in its raw form — either in the data warehouse itself or in an associated object storage service.

Modern data lakes provide the environment to easily store, load, integrate, and analyze data in order to derive deep insights that inform data-driven decision-making.

A modern data lake dramatically simplifies the effort to derive insights and value from all stored data and ultimately produces faster business results.

These modern data lakes provide near-unlimited storage capacity and scale computing power up or down as needed.

  • Being able to store unlimited amounts of diverse data makes the cloud well-suited for data lakes
  • This environment can be operated with familiar SQL tools (see the sketch after this list)
  • Because all storage objects and necessary compute resources are internal to the modern data lake platform, data can be accessed and analytics can be executed quickly and efficiently
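As one illustration of the point about SQL tools, the sketch below assumes a Snowflake-style cloud data warehouse and the snowflake-connector-python client; the connection parameters, table, and column names are placeholders, and the exact semi-structured syntax varies by platform:

```python
# A hedged sketch of querying semi-structured data staged in a modern cloud
# data lake platform with plain SQL. Assumes a Snowflake-style warehouse and
# the snowflake-connector-python package; all names and credentials are placeholders.
import snowflake.connector

conn = snowflake.connector.connect(
    account="my_account",
    user="analyst",
    password="********",
    warehouse="ANALYTICS_WH",
    database="DATA_LAKE",
    schema="RAW",
)

# Query raw JSON events held in a VARIANT column without flattening them first.
query = """
    SELECT
        event:page_url::string AS page_url,
        COUNT(*)               AS views
    FROM clickstream_events
    GROUP BY 1
    ORDER BY views DESC
    LIMIT 10
"""

for page_url, views in conn.cursor().execute(query):
    print(page_url, views)

conn.close()
```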

E-commerce retailers use modern data lakes to collect clickstream data for monitoring web-shopping activities:

  • They analyze browser data in conjunction with customer buying histories to predict outcomes (a simplified sketch follows this list)
  • With these insights, retailers provide timely, relevant, and consistent interactions for acquiring, serving, and retaining customers
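A deliberately simplified sketch of that pattern, using pandas with made-up data and column names, might join product-page views to purchase histories to flag likely buyers:

```python
# A simplified, hypothetical sketch of combining clickstream data with purchase
# history in pandas; the columns and the scoring rule are illustrative only.
import pandas as pd

clicks = pd.DataFrame({
    "customer_id": [1, 1, 2, 3, 3, 3],
    "page": ["home", "product", "home", "product", "cart", "product"],
})
purchases = pd.DataFrame({
    "customer_id": [1, 3],
    "lifetime_orders": [4, 12],
})

# Count product-page views per customer, then attach each customer's buying history.
product_views = (
    clicks[clicks["page"] == "product"]
    .groupby("customer_id")
    .size()
    .rename("product_views")
    .reset_index()
)
profile = product_views.merge(purchases, on="customer_id", how="left").fillna(0)

# A naive "likely buyer" flag: engaged browsers who have purchased before.
profile["likely_buyer"] = (profile["product_views"] >= 2) & (profile["lifetime_orders"] > 0)
print(profile)
```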

Oil and gas companies use data lakes to improve geologic exploration and make their extraction operations more efficient and productive:

  • Data from hundreds or thousands of sensors helps them discover trends, predict equipment failures, streamline maintenance cycles, and understand operations

Banks and financial service companies use data lakes to analyze market risks and determine which products and services to offer.

All customer-focused organizations can use data lakes to collect and analyze data from social media sites, customer relationship management (CRM) systems, and other sources. They can use data to gauge customer sentiment, adjust go-to-market strategies, mitigate customer-support problems, and extend targeted offers to customers and prospects.

It’s important to remember that traditional data lakes fail because of their inherent complexity, poor performance, and lack of governance.

Modern cloud data lakes overcome these challenges thanks to foundational tenets such as:

  • No silos — easily ingest petabytes of structured, semi-structured, and unstructured data into a single repository
  • Instant elasticity — supply any amount of compute resources to any workload, dynamically change the size of a compute cluster without affecting running queries, and scale services to easily include additional compute clusters to complete intense workloads faster (a brief sketch follows this list)
  • Concurrent operations — deploy to a near-unlimited number of users and workloads to access a single copy of your data, without affecting performance
  • Embedded governance — present fresh and accurate data to users, with a focus on collaboration, data quality, access control, and metadata (data about the data) management
  • Transactional consistency — confidently combine data to enable multi-statement transactions and cross-database joins (queries that combine data across databases)
  • Fully managed — with a software-as-a-service (SaaS) solution, the data platform itself largely manages and handles provisioning, data protection, security, backups, and performance tuning, allowing you to focus on analytical endeavors rather than on managing hardware and software
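As a brief illustration of the elasticity tenet above, a Snowflake-style platform can resize a virtual warehouse with a single SQL statement while existing queries keep running; the warehouse name is a placeholder, and other platforms expose this capability differently:

```python
# A hedged sketch of on-demand compute scaling, assuming a Snowflake-style
# platform and the snowflake-connector-python client; all names are placeholders.
import snowflake.connector

conn = snowflake.connector.connect(account="my_account", user="admin", password="********")
cur = conn.cursor()

# Scale the compute cluster up for an intense workload, then back down afterwards;
# queries already running are not interrupted by the resize.
cur.execute("ALTER WAREHOUSE ANALYTICS_WH SET WAREHOUSE_SIZE = 'LARGE'")
# ... run the heavy analytics here ...
cur.execute("ALTER WAREHOUSE ANALYTICS_WH SET WAREHOUSE_SIZE = 'XSMALL'")

conn.close()
```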

This is part 1 of a 5-part story.

The information contained in this article was inspired by Cloud Data Lakes for Dummies provided by Snowflake resources. I am NOT an employee of Snowflake nor am I affiliated with its employees or infrastructure.

I’m a Data Engineer always looking to broaden my knowledge and I’m passionate about Big Data. I also enjoy spreading the knowledge I gain in topics that I find interesting. This article is meant to do just that. I play video games, watch TV, and love to travel (when it’s safe to do so). I’m a happy father of 2 fur babies and when I’m not giving them attention, I’m blogging about data and big data infrastructures!

