This is part 3 of a 5-part story about how to democratize your data using a modern data lake built for the cloud.
Here we look at what it means to modernize a data lake and what a modern architecture looks like for this new implementation.
Modernizing a Data Lake
Some of today’s most valuable data doesn’t come with a predefined structure, and that’s where data lakes shine.
The Right Architecture
Modern data lakes are more versatile than their predecessors. They often take the form of a cloud-based analytics layer that optimizes query performance against data stored in a data warehouse or an external object store, enabling deeper and more efficient analytics.
Collecting a Range of Data Types
A complete data lake strategy should accommodate all types of data: JSON, tables, CSV files, and Optimized Row Columnar (ORC) and Parquet data stores.
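To make that concrete, here is a minimal sketch of a single ingestion path that accepts several of these formats. The filenames and row shapes are made up for illustration; JSON and CSV are parsed with the standard library, while columnar formats such as Parquet and ORC would typically require a library like pyarrow and are left as a stub.

```python
import csv
import io
import json

def parse_records(filename: str, raw: str) -> list:
    """Dispatch on file extension and return a list of row dicts.

    JSON and CSV are handled here; Parquet/ORC readers would plug in
    the same way via a library such as pyarrow.
    """
    if filename.endswith(".json"):
        data = json.loads(raw)
        return data if isinstance(data, list) else [data]
    if filename.endswith(".csv"):
        return list(csv.DictReader(io.StringIO(raw)))
    raise NotImplementedError(f"add a columnar reader for {filename}")

# Both sources land in the same row-dict shape, ready for one load path.
rows = parse_records("events.csv", "id,action\n1,login\n2,logout")
rows += parse_records("events.json", '[{"id": "3", "action": "login"}]')
```

The point of the sketch is the single normalized output shape: whatever the source format, downstream loading and analytics see one kind of record.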
If you’re also storing and analyzing unstructured data, you may need a separate repository like an external object store — a data storage architecture that manages data as objects rather than as file hierarchies or data blocks. A separate repository may also be necessary if you have specialized needs for data sovereignty and data retention, or if you must comply with certain industry regulations that govern where your data is stored.
Whether your data is stored in one location or multiple locations, having an integrated cloud analytics layer reduces risk and makes life simpler. You don’t have to move data among multiple data marts (smaller, static versions of your data lake), and you won’t have multiple potential points of failure. The entire environment can be queried as a single source of truth via SQL and other familiar tools.
Continuously Loading Data
Data-driven organizations need real-time analytic systems that can continuously ingest data into cloud storage environments:
- Analysts need up-to-date data to observe trends and identify opportunities
- Data scientists require current data to develop machine learning models
- Executives need up-to-the-minute data to guide their organizations
Data pipeline tools can migrate on-premises application data into a cloud data lake:
- Bulk-load processes work best for initial transfers, even with terabytes of data
- After that, you’ll most likely want to capture incremental changes
- Real-time data feeds and streaming applications with low latency are becoming the industry norm in many data streaming architectures
Make sure your data pipeline can move data continuously as well as in batch mode. It must also handle the complex transformations required to rationalize different data types without reducing the performance of production workloads. A continuous data pipeline automatically detects new data as it arrives in your cloud storage environment and loads it into the data lake, without manual intervention.
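The bulk-then-incremental pattern above can be sketched in a few lines. This is a simplified illustration, not any vendor's actual auto-ingest mechanism: each poll compares a storage listing against files already seen and loads only the new arrivals.

```python
class ContinuousLoader:
    """Minimal sketch of continuous ingestion: each poll detects files
    that have appeared in cloud storage since the last poll and loads
    only those, so the initial bulk load and later incremental changes
    flow through the same path."""

    def __init__(self):
        self.seen = set()
        self.loaded = []

    def poll(self, listing):
        new_files = [f for f in listing if f not in self.seen]
        for f in new_files:
            self.loaded.append(f)   # stand-in for the actual load step
            self.seen.add(f)
        return new_files

loader = ContinuousLoader()
loader.poll(["day1.parquet"])                        # initial bulk load
new = loader.poll(["day1.parquet", "day2.parquet"])  # incremental change
```

In production this polling loop is usually replaced by event notifications from the object store, but the seen/new bookkeeping is the same idea.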
Enabling Secure Data Sharing
Many businesses enhance their operations by tapping into third-party data repositories, services, and streams for even deeper insights.
Traditional data sharing methods, such as FTP, APIs, and email, require you to copy the shared data and send it to your data consumers. Unfortunately, these methods are cumbersome, costly, and risky. They also produce static data that quickly becomes dated and must be refreshed with up-to-date versions. This means constant data movement and management.
Modern data sharing technologies enable organizations to easily share slices of their data, and receive shared data, in a secure and governed way. These robust data sharing methods allow you to share live data without moving it from place to place. You can also set up data-sharing services that turn your data lake into a profit center.
A multi-tenant cloud-built data lake enables organizations to share live data and receive shared data from diverse sources without having to move that data. And there’s no contention or competition for resources.
Customizing Workloads for Optimal Performance
A modern cloud data lake can deliver all the resources you need, with the instant elasticity to scale with demand. You can meet user demands for data volume, velocity, and variety.
Enable a high-performance SQL layer
A high-performance SQL layer becomes essential when a work group suddenly starts querying datasets that span a quarter's worth of data or more.
For example, a supply chain analyst who normally evaluates day-to-day performance might want to suddenly access a rolling set of data for an entire month or quarter.
Maintain workload isolation
Many users will likely be accessing your data lake at the same time, which can consume huge amounts of resources. This means your data lake must isolate workloads and allocate resources to the jobs that truly matter.
If you have a regular event that requires a burst of compute resources, ensure your data lake architecture enables workload isolation.
Cloud vendors should allow you to configure these resources to “go to sleep” automatically after a predetermined period of inactivity. This ensures resources are available whenever you need them, without paying for idle capacity.
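The auto-suspend behavior described above amounts to a simple idle timer. Here is a hedged sketch with invented names and a 300-second window; real services implement this inside the platform, but the logic is roughly this:

```python
class Warehouse:
    """Sketch of auto-suspend: compute 'goes to sleep' after a
    configurable idle window, so billing stops when the resources
    aren't needed, and resumes automatically on the next query."""

    def __init__(self, auto_suspend_secs=300.0):
        self.auto_suspend_secs = auto_suspend_secs
        self.running = True
        self.last_activity = 0.0

    def run_query(self, now):
        self.running = True          # auto-resume on demand
        self.last_activity = now

    def tick(self, now):
        if self.running and now - self.last_activity >= self.auto_suspend_secs:
            self.running = False     # suspended: no further usage charges

wh = Warehouse(auto_suspend_secs=300)
wh.run_query(now=0.0)
wh.tick(now=60.0)    # still inside the idle window: stays up
wh.tick(now=400.0)   # idle past the window: suspended
```

A short suspend window saves the most money but adds resume latency for bursty workloads; tuning it per warehouse is the usual compromise.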
Interacting with data in an object store
Your data lake should allow you to query data housed in Amazon S3, Microsoft Azure Blob Storage, or Google Cloud Storage where it resides.
You can maintain the data lake as a single source of truth, even for multi-cloud environments. Having a single source of truth eliminates the time-consuming task of keeping multiple data repositories in sync.
The SQL abstraction layer must have a simple way to connect to the data in its raw format, so that when you want to interact with it you don’t have to move it from place to place. This ideally involves materialized views since they can significantly improve the query performance on external files. Materialized views precalculate metadata and stats about data in external files, thus speeding up the queries that are run on them. In essence, they materialize the parts of your data lake you query most frequently. These views are automatically refreshed in the background, eliminating the need to build a data extract, transform, and load (ETL) layer or orchestration pipeline.
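The file-level statistics a materialized view precalculates are what make pruning possible. Below is a toy sketch (the paths, column names, and stats are invented) showing how precomputed min/max values let a query skip external files that cannot contain matching rows:

```python
# Precalculated per-file statistics, as a materialized view might store
# them for files sitting in an external object store.
file_stats = {
    "sales/2023-q1.parquet": {"min_date": "2023-01-01", "max_date": "2023-03-31"},
    "sales/2023-q2.parquet": {"min_date": "2023-04-01", "max_date": "2023-06-30"},
}

def files_for_range(start, end):
    """Use the precomputed min/max stats to skip files whose date range
    cannot overlap the requested window (file pruning)."""
    return [
        path
        for path, s in file_stats.items()
        if s["min_date"] <= end and s["max_date"] >= start
    ]

# A query over May only needs to read the Q2 file.
hits = files_for_range("2023-05-01", "2023-05-31")
```

The query engine never opens the Q1 file at all, which is where most of the speedup on external data comes from.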
Here are the things you can achieve with external tables:
- Query data directly from Amazon S3, Microsoft Azure Blob Storage, or Google Cloud Storage, and optionally ingest it natively into your data warehouse
- Maintain your data lake as a single source of truth, eliminating the need to copy and transfer data
- Achieve fast analytics on data from an external source, wherever it resides
External tables store file-level metadata about the data files such as the file path, a version identifier, and partitioning information. This enables querying data stored in files in a data lake as if it were inside a single database.
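A rough model of that metadata, with hypothetical paths and partition keys, shows how a partition predicate resolves to the specific files a query must touch:

```python
from dataclasses import dataclass, field

@dataclass
class ExternalFile:
    """File-level metadata an external table keeps per data file:
    the file path, a version identifier, and partitioning information."""
    path: str
    version: str
    partition: dict = field(default_factory=dict)

catalog = [
    ExternalFile("s3://lake/orders/region=eu/part-0.parquet", "v1", {"region": "eu"}),
    ExternalFile("s3://lake/orders/region=us/part-0.parquet", "v1", {"region": "us"}),
]

def files_matching(predicate):
    """Resolve a partition predicate against the catalog, so a query
    on the external table reads only the relevant files."""
    return [
        f.path for f in catalog
        if all(f.partition.get(k) == v for k, v in predicate.items())
    ]

eu_files = files_matching({"region": "eu"})
```

This catalog lookup is what lets a SQL engine treat a pile of object-store files as if they were rows in a single database table.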
Resizing compute clusters
A flexible, cloud-built data lake can automatically scale concurrency by transparently creating a new cluster and then automatically balancing the load. When the load subsides and the queries catch up, the second cluster automatically spins down.
Look for a cloud provider that allows you to dynamically expand independent resources to handle sudden concurrency issues. You should be able to specify the number of clusters you would like or let it happen automatically.
For example, if you have a 4-node cluster and you’re expecting a temporary surge in data, you can easily add more compute power. You might need to resize the cluster to 16 or 32 nodes and then, once the team is done with its analytics, you can scale the cluster back to its original configuration. To ensure uninterrupted service, make sure your cloud vendor allows you to scale the service while the cluster is running.
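The multi-cluster behavior described above reduces to a small control loop. This sketch uses invented thresholds and limits; a real platform makes these decisions transparently, but the shape is the same: scale out when queries queue, scale back in when the queue drains.

```python
class MultiClusterWarehouse:
    """Sketch of concurrency scaling: add a cluster when the query
    queue builds up, spin extras down when the load subsides."""

    def __init__(self, min_clusters=1, max_clusters=4, queue_threshold=10):
        self.clusters = min_clusters
        self.min_clusters = min_clusters
        self.max_clusters = max_clusters
        self.queue_threshold = queue_threshold

    def rebalance(self, queued_queries):
        if queued_queries > self.queue_threshold and self.clusters < self.max_clusters:
            self.clusters += 1          # transparently start a new cluster
        elif queued_queries == 0 and self.clusters > self.min_clusters:
            self.clusters -= 1          # queries caught up: spin one down
        return self.clusters

wh = MultiClusterWarehouse()
wh.rebalance(queued_queries=25)   # surge: scale out to a second cluster
wh.rebalance(queued_queries=0)    # queue drained: back to one cluster
```

Bounding the cluster count (the `max_clusters` cap here) is what keeps a runaway workload from scaling costs without limit.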
Using Metadata to Create a User-Friendly Environment
Ultimately, a data lake should democratize access to the data.
People of all skill levels (managers, financial analysts, executives, data scientists, and data engineers) should have easy and ready access to the system to perform the analytics they need.
Of course, this doesn’t mean users have free rein:
- Security, access control, and governance are essential
- Role-based access mechanisms define precisely which data people are allowed to see
- Personally identifiable information (PII) can be housed within protected databases, and storage resources can be physically or virtually isolated to distinguish workloads from each other
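The access-control points above can be illustrated with a small sketch. The role names, tables, and PII columns here are hypothetical; the idea is that a grant defines both which tables a role may read and whether PII columns come back masked:

```python
PII_COLUMNS = {"email", "ssn"}

# Hypothetical role grants: which tables a role may read, and whether
# it may see personally identifiable information unmasked.
ROLE_GRANTS = {
    "analyst":       {"tables": {"sales", "customers"}, "see_pii": False},
    "data_engineer": {"tables": {"sales", "customers"}, "see_pii": True},
}

def read_row(role, table, row):
    """Role-based access sketch: deny tables the role was never granted,
    and mask PII columns unless the role is explicitly allowed them."""
    grant = ROLE_GRANTS.get(role)
    if grant is None or table not in grant["tables"]:
        raise PermissionError(f"role {role!r} may not read {table!r}")
    if grant["see_pii"]:
        return row
    return {k: ("***" if k in PII_COLUMNS else v) for k, v in row.items()}

masked = read_row("analyst", "customers", {"id": 1, "email": "a@b.com"})
```

In a real platform these checks live in the database engine (roles, grants, masking policies) rather than application code, but the policy model is the same.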
To uphold necessary compliance requirements, you should continually monitor and audit user access. The cloud-built data lake should have an intuitive, graphical user interface for IT managers and data engineers that provides access to the data, metadata, data processing, and auditing functions.
5 Characteristics of a Data Lake Built for the Cloud
- A multi-cluster, shared-data architecture
- Independent scaling of compute and storage resources
- The ability to add users without affecting performance
- Tools to load and query data simultaneously, without degrading performance
- A robust metadata service that is fundamental to the object storage environment
The information contained in this article was inspired by Cloud Data Lakes for Dummies provided by Snowflake resources. I am NOT an employee of Snowflake nor am I affiliated with its employees or infrastructure.
I’m a Data Engineer always looking to broaden my knowledge and I’m passionate about Big Data. I also enjoy spreading the knowledge I gain in topics that I find interesting. This article is meant to do just that. I play video games, watch TV, and love to travel (when it’s safe to do so). I’m a happy father of 2 fur babies and when I’m not giving them attention, I’m blogging about data and big data infrastructures!