This is the fifth and final part of a five-part story about how to democratize your data using a modern data lake built for the cloud.
In this final part, we discuss the steps needed to ensure your built-for-the-cloud data lake meets your business needs.
6 Steps for Planning Your Cloud Data Lake
Building a modern data lake requires technology that can easily store data in raw form, provide immediate exploration of that data, refine it in a consistent and managed way, and support a broad range of operational analytics.
These steps should be followed when getting started:
1 — Identify the data
Identify the exact data sources, types, and locations of the data you plan to load into your data lake. Then consider how extensively that data will be used.
Do you plan to share data within your ecosystem to enrich analytics? If so, how are you sharing that data now? Identify archaic data sharing methods such as FTP and email, and consider how you can replace them with a modern data-sharing architecture.
2 — Consider the repository
A data lake uses a single repository to efficiently store all your data.
The repository can support a range of use cases:
- Data archival
- Data integration across a variety of data sources
- ETL offloading from legacy data warehouses
- Complex data processing across batch, streaming, and machine-learning workloads
Will you stage data from an existing data warehouse or data store? Will all your data land in a cloud storage bucket such as Amazon S3, Microsoft Azure Blob Storage, or Google Cloud Storage? If so, will you have to integrate the data lake with the storage bucket? If not, the cloud data lake can serve as your central data repository.
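If a cloud storage bucket is your landing zone, it pays to decide on an object-key layout up front. Here is a minimal sketch of one common convention (zone/source/date partitions); the function name, zone names, and partition scheme are assumptions for illustration, not a prescribed standard:

```python
from datetime import date

def landing_key(source: str, load_date: date, filename: str) -> str:
    """Build a partitioned object key for the raw zone of a data lake.

    The zone/source/date layout shown here is a common convention, not a
    requirement; adjust the partitions to match how the data will be queried.
    """
    return (
        f"raw/source={source}"
        f"/year={load_date.year:04d}/month={load_date.month:02d}"
        f"/day={load_date.day:02d}/{filename}"
    )

# e.g. an orders extract landing on 2021-03-05:
key = landing_key("orders_db", date(2021, 3, 5), "orders.parquet")
# -> "raw/source=orders_db/year=2021/month=03/day=05/orders.parquet"
```

A consistent layout like this lets downstream tools prune by date partition instead of scanning the whole bucket.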
3 — Define the pipeline
Consider initial data loads as well as incremental updates. Do you have historical datasets you would like to migrate? If so, you will likely want to set up a one-time transfer of this historical information to the data lake.
Will you continually refresh that data as new transactions occur? If so, you’ll want to establish a continuous stream of data moving through the pipeline.
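One simple way to support both the one-time historical load and ongoing refreshes is a high-water-mark pattern: each run picks up only records newer than the last successful load. The sketch below assumes records carry an `updated_at` timestamp; the field and function names are illustrative, not from any particular tool:

```python
from datetime import datetime

def incremental_batch(records, last_watermark):
    """Select only records newer than the last successful load.

    `records` is any iterable of dicts with an `updated_at` timestamp.
    Returns the new batch plus the advanced watermark to persist for
    the next run.
    """
    batch = [r for r in records if r["updated_at"] > last_watermark]
    new_watermark = max((r["updated_at"] for r in batch), default=last_watermark)
    return batch, new_watermark

# The initial historical load uses an early cutoff as the first watermark;
# subsequent runs pass in the watermark saved from the previous run.
rows = [
    {"id": 1, "updated_at": datetime(2021, 1, 1)},
    {"id": 2, "updated_at": datetime(2021, 2, 1)},
]
batch, wm = incremental_batch(rows, datetime(2021, 1, 15))
# batch contains only id=2; wm advances to 2021-02-01
```

The same function serves both phases: seed the watermark far in the past for the historical transfer, then persist and reuse it for continuous updates.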
4 — Check pertinent use cases
Do you want to replace or augment an existing Hadoop data lake?
Do you want to create a new data lake from scratch, using object storage from a general-purpose cloud provider, or add to an existing object store?
Do you want to work with a provider to configure a data lake using pre-integrated technologies?
If you have a large investment in a traditional data lake or warehouse, then you’ll likely want to complement and extend that investment with new technologies.
5 — Apply governance and security
You need to decide who is responsible for governing and securing your data, both for initial data loads and on a continuous basis as new data is ingested.
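In practice this usually means mapping roles to the data-lake zones they may read. The toy sketch below shows the idea only; in a real deployment you would express these grants through your platform's IAM or role-based access controls, and the role and zone names here are hypothetical:

```python
# Hypothetical role-to-zone grants; a real system would use the
# cloud platform's IAM or warehouse RBAC rather than a Python dict.
GRANTS = {
    "data_engineer": {"raw", "staged", "curated"},
    "analyst": {"curated"},
}

def can_read(role: str, zone: str) -> bool:
    """Return True if the role has been granted read access to the zone."""
    return zone in GRANTS.get(role, set())

# Analysts see only curated data; unknown roles get nothing.
assert can_read("analyst", "curated")
assert not can_read("analyst", "raw")
```

Whoever owns governance should review these grants as new sources are ingested, so access keeps pace with the data.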
6 — Keep it simple
Once your implementation is complete, you shouldn’t have to install, configure, update, or maintain hardware or software. Backups, performance tuning, security updates, and other management requirements should be part of the basic service.
This is part 5 of a five-part story.
The information in this article was inspired by Cloud Data Lakes for Dummies, a resource provided by Snowflake. I am NOT an employee of Snowflake, nor am I affiliated with its employees or infrastructure.
I’m a Data Engineer always looking to broaden my knowledge and I’m passionate about Big Data. I also enjoy spreading the knowledge I gain in topics that I find interesting. This article is meant to do just that. I play video games, watch TV, and love to travel (when it’s safe to do so). I’m a happy father of 2 fur babies and when I’m not giving them attention, I’m blogging about data and big data infrastructures!