This is part 4 of a five-part story about how to democratize your data using a modern data lake built for the cloud.
Here we discuss the benefits of a modern, cloud-built data lake and how it can help reduce infrastructure costs and leverage multiple forms of data to gain further insights for business decisions.
The Benefits of a Modern Cloud Data Lake
There and Back Again
The first two generations of data lakes were constructed either by using open source technologies (like Apache Hadoop) or by customizing an object store from a cloud storage provider (Amazon Web Services, Microsoft Azure, or Google Cloud Platform).
These earlier approaches created multiple issues:
- Getting raw data into the data lake was straightforward, but getting insights from that data was nearly impossible
- IT departments had to over-plan and therefore overpay for enough compute to accommodate infrequent spikes in activity — eventually, activity surpassed physical compute limits, causing queries and other workloads to stall
- Technical professionals who knew how to administer and manage these installations were in short supply
- Security and governance were an afterthought, leaving data at risk and organizations behind in compliance
These limitations inspired modern, cloud-built solutions that collect large amounts of varying data types, in their raw forms, and store them all in a cohesive data lake that can draw from a variety of versatile storage repositories.
Increasing Scalability Options
The expensive hardware of on-premises data lakes limits independent scalability. Basic data-access architecture requires you to store data on the compute nodes in the cluster, forcing you to size these clusters to accommodate peak processing loads. But much of this capacity goes unused most of the time and these massive clusters create significant processing overhead, which constrains performance.
In contrast, a data lake that leverages a cloud-built data warehouse offers scalability, resiliency, and throughput far beyond what on-premises data centers provide, at a fraction of the cost. Storage and compute resources are fully independent yet logically integrated, and designed to scale automatically and independently of each other.
Here are some guidelines to consider to ensure scalability:
- Choose a platform that gives you elastic scaling of compute and storage
- Build your data lake to enable multiple, independently scalable compute clusters that share a single copy of the data without contention between workloads
- Take advantage of auto-scaling when concurrency surges
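The auto-scaling guideline above can be sketched as a toy sizing policy. This is a minimal illustration in Python with hypothetical thresholds and cluster-size names — not any vendor's actual scaling logic:

```python
# Toy auto-scaling policy: pick a compute cluster size from current query
# concurrency. The thresholds and size names below are hypothetical.

SIZES = ["x-small", "small", "medium", "large"]

def pick_cluster_size(concurrent_queries: int) -> str:
    """Scale compute up as concurrency surges, independent of storage."""
    if concurrent_queries <= 4:
        return SIZES[0]
    if concurrent_queries <= 16:
        return SIZES[1]
    if concurrent_queries <= 64:
        return SIZES[2]
    return SIZES[3]

print(pick_cluster_size(2))    # low concurrency -> smallest cluster
print(pick_cluster_size(100))  # concurrency surge -> largest cluster
```

The key design point the sketch mirrors: the decision depends only on workload concurrency, never on how much data is stored, because compute and storage scale independently.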
Traditional data lakes were notorious for long timelines and runaway costs.
Data lakes built on an object store were less expensive, but required custom integration and tedious administration.
A cloud data lake will save you the significant expense of buying, maintaining, and securing an on-premises system.
Gaining Insights From All Forms of Data
With diverse data sources and metadata located and integrated in a single system, business users can obtain data-driven insights more easily, without asking technical staff for help.
In order to accommodate all possible business needs, your data lake should be versatile enough to ingest and immediately query information of varying types. This includes unstructured data such as audio and video files, and semi-structured data such as JSON, Avro, and XML. It also includes open source file formats such as Apache Parquet and ORC, as well as traditional CSV and relational formats.
Your data lake should also enable native data loading and analytics on these mixed data formats with complete integrity, and store these diverse data types in their native form, without creating new data silos.
Here are some guidelines for smooth data management:
- Establish a complete metadata layer to guide user analytics
- Standardize an architecture that supports JSON, Avro, Parquet, and XML data
- Use pipeline tools that allow for native data loading with transactional integrity
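As a minimal sketch of the last guideline — loading semi-structured records with transactional integrity — here is a toy pipeline using only Python's standard library. The table name and records are made up for illustration, and sqlite3 merely stands in for a data lake's table layer:

```python
import json
import sqlite3

# Toy pipeline: land raw JSON records in a table inside one transaction,
# so a failure rolls back the whole batch (all-or-nothing loading).
records = [
    {"device_id": "sensor-1", "temp_c": 21.5},
    {"device_id": "sensor-2", "temp_c": 19.8},
]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE raw_events (payload TEXT)")

try:
    with conn:  # opens a transaction; commits on success, rolls back on error
        for rec in records:
            # Store each record in its native JSON form, no reshaping
            conn.execute("INSERT INTO raw_events VALUES (?)", (json.dumps(rec),))
except sqlite3.Error:
    print("batch rolled back, table unchanged")

count = conn.execute("SELECT COUNT(*) FROM raw_events").fetchone()[0]
print(count)  # 2
```

Because the payload stays as native JSON, no new silo is created: downstream consumers parse the stored documents on read rather than depending on a load-time schema.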
Boosting Productivity for Business and IT
First-generation data lakes (based on Hadoop) required administrators to constantly attend to planning, resource allocation, performance optimization, and other complex tasks.
Although cloud object stores eliminate the security and hardware management overhead, they still require lots of manual tuning for analytic performance.
With a modern, cloud-built data lake, security, tuning, and performance optimizations are built into the managed service as a package.
Simplify the Environment
Today’s data comes from a mix of sources: relational and NoSQL databases, IoT devices, and data generated by SaaS and enterprise applications. These data sources have different formats, models, and structures, and are often stored in different types of platforms.
A modern data lake can consolidate these multiple types of data.
Examining the Benefits of Object Storage
You can store virtually any kind of data in object storage, without the extensive planning, software customization, programming, and server provisioning required in Hadoop environments.
You can even use SQL and other familiar tools to explore, visualize, and analyze the data.
With a cloud data lake pre-integrated with relatively inexpensive object storage, you don’t have to stitch together lots of open source software packages to obtain these capabilities.
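To make the "familiar tools" point concrete, here is a small sketch of exploring file-format data with plain SQL. The CSV content stands in for a file pulled from object storage, the table and column names are invented, and sqlite3 is used only as a stand-in SQL engine:

```python
import csv
import io
import sqlite3

# Hypothetical CSV, as if fetched from an object store bucket
raw_csv = """region,sales
east,100
west,250
east,50
"""

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, amount INTEGER)")

# Land the rows, then explore them with ordinary SQL
reader = csv.DictReader(io.StringIO(raw_csv))
conn.executemany(
    "INSERT INTO sales VALUES (?, ?)",
    [(row["region"], int(row["sales"])) for row in reader],
)

for region, total in conn.execute(
    "SELECT region, SUM(amount) FROM sales GROUP BY region ORDER BY region"
):
    print(region, total)  # east 150, then west 250
```

The takeaway is the workflow, not the toy engine: once data lands in a queryable layer over object storage, an analyst uses the SQL they already know rather than a Hadoop-specific toolchain.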
The information contained in this article was inspired by Cloud Data Lakes for Dummies provided by Snowflake resources. I am NOT an employee of Snowflake nor am I affiliated with its employees or infrastructure.
I’m a Data Engineer always looking to broaden my knowledge and I’m passionate about Big Data. I also enjoy spreading the knowledge I gain in topics that I find interesting. This article is meant to do just that. I play video games, watch TV, and love to travel (when it’s safe to do so). I’m a happy father of 2 fur babies and when I’m not giving them attention, I’m blogging about data and big data infrastructures!