This is part 2 of a 5-part story about how to democratize your data using a modern data lake built for the cloud.
Here we discuss why the modern data lake emerged and what business needs it arose to address.
Why the Modern Data Lake Emerged
The Difference Between a Data Warehouse and a Data Lake
Data warehouses:
- Emerged as a method for organizations to store and organize their data for analytics (i.e. to ask questions about the data to reveal answers, trends, and other insights)
- Orchestrate data marts (databases that meet demands for specific groups of users)
- Handle thousands or even millions of queries a day (vital queries such as order trends, customer demographics, and business forecasting)
- Store data according to predefined schemas (row-column definitions that dictate how the data is organized) — data attributes must be known up front
Data lakes:
- Store semi-structured data types in their native formats, without requiring a schema to be defined upfront
- Can contain new, semi-structured data types from sources like clickstreams, mobile apps, or social media networks
- At a minimum, are capable of storing mixed data types
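To make the schema distinction concrete, here is a minimal Python sketch of schema-on-write (warehouse) versus schema-on-read (lake). The record shapes and field names are made up for illustration:

```python
import json

# Schema-on-write (data warehouse style): columns must be known up front,
# and a record that doesn't fit the schema is rejected at load time.
WAREHOUSE_SCHEMA = {"order_id": int, "customer": str, "amount": float}

def load_into_warehouse(record: dict) -> dict:
    if set(record) != set(WAREHOUSE_SCHEMA):
        raise ValueError(f"record does not match schema: {record}")
    return {col: typ(record[col]) for col, typ in WAREHOUSE_SCHEMA.items()}

# Schema-on-read (data lake style): store each raw event as-is and decide
# how to interpret it only when a query needs it.
raw_events = [
    '{"order_id": 1, "customer": "ada", "amount": "19.99"}',
    '{"click": "/home", "device": "mobile"}',  # a new, unplanned shape
]
lake = [json.loads(e) for e in raw_events]  # no schema enforced at load
mobile_clicks = [e for e in lake if e.get("device") == "mobile"]
```

The warehouse loader rejects the clickstream event outright, while the lake stores it and lets a later query decide what it means.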
With a modern, cloud-built data lake, you get the power of a data warehouse and the flexibility of the data lake, and you leave limitations of both systems behind. You also get the unlimited resources of the cloud, automatically.
Staying Afloat in the Data Storm
Data is being gathered in increasing quantities, from more diverse sets of sources.
Harnessing data from many enterprise applications
Now, organizations rely on dozens of enterprise applications, including software-as-a-service (SaaS) solutions.
Business applications that generate huge amounts of data include:
- Credit card transactions — for fraud detection and prevention
- Social media data — for recruiting and talent management
- Supplier data — for manufacturing resource planning and supply chain management
- Order-fulfillment data — for inventory control, revenue forecasting, and budgeting
Unifying device-generated data
Petabytes (millions of gigabytes) of data come from people using various types of digital and mechanical devices.
The volume and complexity of these semi-structured data sources can quickly overwhelm a conventional data warehouse. Bottlenecks in data processing cause analytics jobs to hang, or even crash the system.
The purpose of the modern, cloud-built data lake is to streamline these diverse data management activities: to store all of this data easily and efficiently, and to make it useful.
Keeping Your Data in the Cloud
With the majority of data now in the cloud, the natural place to integrate this data is also in the cloud. In the cloud, you can focus on scaling cost-effectively, on the order of magnitude necessary to handle massive volumes of varying data.
Not all cloud systems are created equal, however. It’s not just a matter of taking the same technologies and architectures from inside an on-premises data center and moving them elsewhere. The best data lake solutions are designed first and foremost for the cloud.
To take full advantage of what the cloud offers, a solution must be built for the cloud, from the ground up.
Democratizing Your Analytics
To realize the full potential of your data, your organization needs to make analytic activities accessible to the other 90% of business users.
Here are 4 benefits of cloud-built analytics:
- Data Exploration — the cloud offers on-demand, elastic scalability for discovering trends and patterns, since it’s difficult to know in advance precisely how much computing power is needed to analyze huge datasets
- Interactive Data Analysis — the dynamic elasticity of the cloud gives you the flexibility and adaptability to perform additional queries without slowing down other workloads
- Batch Processing — scheduling and sending a large set of queries to the data lake or data warehouse for execution can drain performance on a fixed-capacity system; elastic cloud resources absorb the load
- Event-driven Analytics — ingesting and processing streaming data requires an elastic data lake to handle variations and spikes in data flow
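Event-driven analytics, the last bullet above, can be sketched as a tiny micro-batching consumer that absorbs spikes by grouping events as they arrive. The event shape and batch size here are illustrative:

```python
from collections import deque

def micro_batch(events, batch_size=3):
    """Group a stream of events into micro-batches for processing."""
    buffer = deque()
    for event in events:
        buffer.append(event)
        if len(buffer) >= batch_size:
            yield list(buffer)
            buffer.clear()
    if buffer:  # flush whatever remains when the stream pauses
        yield list(buffer)

# A spiky stream of clickstream events (illustrative values).
stream = [{"page": "/home"}, {"page": "/cart"}, {"page": "/home"},
          {"page": "/pay"}, {"page": "/home"}]
batches = list(micro_batch(stream))
home_views = sum(e["page"] == "/home" for b in batches for e in b)
```

A real elastic data lake does this at scale, spinning resources up for a spike and back down afterward, but the buffering-and-flush pattern is the same.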
A growing trend is to build analytics into cloud business applications, which must serve many types of users and the varied queries (workloads) those users run to analyze the data.
Reduce Risk, Protect Data
Your organization’s data is incredibly valuable. But if your data is valuable to you, it’s also valuable to your stakeholders, as well as malevolent actors. Sensitive information can get into the wrong hands or be improperly erased, and you may lose valuable customers.
Implementing Compliance and Governance
Organizations can’t ignore the increasingly rigorous privacy regulations.
Data governance ensures data is properly classified, accessed, protected, and used:
- It involves establishing strategies and policies to ensure the data lake processing environment complies with regulatory requirements
- These policies verify data quality and standardization to ensure the data is properly prepared to meet organizational needs
- The types of information that fall under data governance guidelines include credit card information, social security numbers, names, dates of birth, etc.
Implementing effective data governance policies early in the data lake processes helps to avoid potential pitfalls:
- Poor access control
- Inadequate metadata management
- Unacceptable data quality
- Insufficient data security
Remember, data governance isn’t a technology. It’s an organizational commitment that involves people, processes, and tools.
There are 5 basic steps to formulating a strong data governance practice:
- Establish a core team of stakeholders to create a data governance framework — identify issues with current data management policies and areas needing improvement
- Define the problems you’re hoping to solve (e.g. better regulatory compliance, increased data security, and improved data quality), then determine what you need to change (e.g. fine-tuning access rights, protecting sensitive data, or consolidating data silos)
- Assess what tools and skills are needed to execute the data governance program (e.g. skills in data modeling, data cataloging, data quality, and reporting)
- Inventory your data to see what you have, how it’s classified, where it resides, who can access it, and how it’s used
- Identify capabilities and gaps, then figure out how to fill those gaps by hiring in-house specialists or by using partner tools and services
A data lake achieves effective governance by following proven data management principles:
- Add context to metadata to make it easier to track where data is coming from, who touched that data, and how various data sets relate to one another
- Ensure quality data is delivered across business processes, and
- Provide a means to catalog enterprise data
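These principles can be sketched as a minimal catalog entry that records source, ownership, and lineage. The field names are hypothetical, not any particular catalog tool's schema:

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone

@dataclass
class CatalogEntry:
    """A minimal data-catalog record: where data came from, who owns it,
    who touched it, and how it relates to other datasets (lineage)."""
    name: str
    source: str
    owner: str
    derived_from: list = field(default_factory=list)  # lineage links
    last_touched_by: str = ""
    last_touched_at: str = ""

    def record_touch(self, user: str):
        """Track who touched this dataset, and when."""
        self.last_touched_by = user
        self.last_touched_at = datetime.now(timezone.utc).isoformat()

orders_raw = CatalogEntry("orders_raw", source="erp_export", owner="sales-ops")
orders_clean = CatalogEntry("orders_clean", source="pipeline",
                            owner="data-eng", derived_from=["orders_raw"])
orders_clean.record_touch("jdoe")
```

Even this toy entry answers the governance questions above: where `orders_clean` came from, who owns it, and which upstream dataset it depends on.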
Ensuring Data Quality
Data security hinges on traceability: you must know where your data comes from, where it is, who has access to it, how it’s used, and how to delete it when required.
Data governance also involves oversight to ensure the quality of the data your organization shares with its constituents. Bad data can lead to missed or poor business decisions, loss of revenue, and increased costs.
Data stewards — people charged with overseeing data quality — can identify when data is corrupt or inaccurate, when it’s not being refreshed often enough to remain relevant, or when it’s being analyzed out of context.
Ideally, data quality tasks are assigned to business users who own and manage the data, since they’re in the best position to note inaccuracies and inconsistencies. These data stewards work with IT professionals and data scientists to establish data quality rules and processes.
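The kinds of rules stewards establish can be sketched as simple programmatic checks: a completeness check plus a freshness check. Field names and the 24-hour threshold are illustrative:

```python
from datetime import datetime, timedelta, timezone

def check_quality(rows, required_fields, max_age_hours=24):
    """Flag rows with missing required fields; warn if nothing is fresh."""
    issues = []
    for i, row in enumerate(rows):
        missing = [f for f in required_fields if row.get(f) in (None, "")]
        if missing:
            issues.append((i, f"missing fields: {missing}"))
    newest = max(datetime.fromisoformat(r["updated_at"]) for r in rows)
    if datetime.now(timezone.utc) - newest > timedelta(hours=max_age_hours):
        issues.append((-1, "data not refreshed recently enough to stay relevant"))
    return issues

now = datetime.now(timezone.utc).isoformat()
rows = [
    {"customer": "ada", "updated_at": now},
    {"customer": "", "updated_at": now},  # a steward would flag this row
]
issues = check_quality(rows, required_fields=["customer"])
```

In practice these rules live in data quality tooling rather than ad hoc scripts, but the steward's job is exactly this: codify what "good" data looks like and flag everything that isn't.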
Data Protection, Availability, and Retention
Cloud infrastructures can fail, accidental data deletions can occur, other human errors can happen, and bad actors can attack — resulting in data loss, data inconsistencies, and data corruption. This is why cloud data lakes must incorporate redundant processes and procedures to keep your data available and protected. Regulatory compliance and certification requirements may also dictate that data is retained for a certain minimum length of time, which can be years.
All cloud data lakes should protect data and ensure business continuity by performing periodic backups. If a particular storage device fails, the analytic operations and applications that need that data can automatically switch to a redundant copy of that data on another device. Data retention requirements call for maintaining copies of all your data.
Complete data protection should go beyond just duplicating data within the same physical region or zone of a cloud compute and storage provider. It’s important to replicate that data among multiple geographically dispersed locations to offer the best possible data protection.
It’s important to pay attention to performance as well:
- Without the right technology, data backups and replication can consume valuable compute resources and interfere with analytic workloads
- A modern cloud data lake should manage replication programmatically in the background, without interfering with whatever workloads are executing at the time
- Good data backup, protection, and replication procedures minimize, if not prevent, performance degradation and data availability interruptions
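One core idea behind safe replication, copying and then verifying the copy before trusting it, can be sketched with a checksum comparison. A local temp directory stands in for a second region here:

```python
import hashlib
import os
import shutil
import tempfile

def sha256_of(path):
    """Compute a file's SHA-256 checksum in streaming fashion."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(8192), b""):
            h.update(chunk)
    return h.hexdigest()

def replicate(src, dst):
    """Copy a file to a (stand-in for a remote) location, then verify
    the replica's integrity before treating it as a valid backup."""
    shutil.copyfile(src, dst)
    if sha256_of(src) != sha256_of(dst):
        raise IOError("replica checksum mismatch; replica is corrupt")
    return dst

# Demo: a temp directory stands in for a geographically separate region.
region_b = tempfile.mkdtemp()
src = os.path.join(region_b, "orders.data")
with open(src, "wb") as f:
    f.write(b"fake columnar data")
replica = replicate(src, os.path.join(region_b, "orders_replica.data"))
```

A modern data lake runs this kind of copy-and-verify loop programmatically in the background, across regions, without touching the compute resources your queries are using.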
End-to-End Data Security
All aspects of a data lake — architecture, implementation, and operation — must center on protecting its data, in transit and at rest.
A data protection strategy should address external interfaces, access control, data storage, and physical infrastructure, as well as monitoring, alerts, and cyber security practices.
Data encryption is a fundamental aspect of security.
Data should be encrypted when it’s stored, when it’s moved for staging into a data lake, when it’s placed into a database object in the lake itself, and when it’s cached within a virtual data lake.
Query results must be encrypted.
End-to-end encryption should be the default, with methods that keep the customer in control, such as customer-managed keys.
Once data is encrypted, it can only be decrypted with the corresponding key. In order to fully protect the data, you also have to protect the key that decodes that data.
The best data lakes employ AES 256-bit encryption with a hierarchical key model rooted in a dedicated hardware security module. This method encrypts the encryption keys and initiates key-rotation processes that limit the time during which any single key can be used. Encryption and key management should be transparent to the user and should not interfere with performance.
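The hierarchical key model can be sketched in Python. The toy XOR cipher below is a stand-in for AES-256 and is not secure; the point is the structure: a root key wraps per-dataset data keys, and rotation swaps data keys without ever exposing the root key:

```python
import hashlib
import secrets

def toy_cipher(key: bytes, data: bytes) -> bytes:
    """XOR with a key-derived stream: a stand-in for AES-256, NOT real crypto."""
    stream = b""
    counter = 0
    while len(stream) < len(data):
        stream += hashlib.sha256(key + counter.to_bytes(8, "big")).digest()
        counter += 1
    return bytes(a ^ b for a, b in zip(data, stream))

# Hierarchical key model: a root key (held in the HSM in a real system)
# wraps per-dataset data keys, so data keys are never stored in the clear.
root_key = secrets.token_bytes(32)

def wrap(data_key):    # encrypt a data key under the root key
    return toy_cipher(root_key, data_key)

def unwrap(wrapped):   # the toy cipher is symmetric
    return toy_cipher(root_key, wrapped)

data_key = secrets.token_bytes(32)
wrapped_key = wrap(data_key)
ciphertext = toy_cipher(data_key, b"sensitive row")

# Key rotation: decrypt with the old key, re-encrypt under a fresh one.
# The root key never leaves the module and is never exposed.
plaintext = toy_cipher(unwrap(wrapped_key), ciphertext)
new_key = secrets.token_bytes(32)
ciphertext = toy_cipher(new_key, plaintext)
wrapped_key = wrap(new_key)
```

Because only the small wrapped keys are re-issued during rotation, the root key's exposure window stays minimal, which is exactly what the hierarchical model buys you.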
Automating updates and logging
Security updates should be automatic and be applied to all relevant software components of your modern cloud data lake solution, as soon as those updates are available.
If using a cloud provider, that vendor should perform periodic security testing (also known as penetration testing) to proactively check for security flaws.
As added protection, file integrity monitoring (FIM) tools can ensure that critical system files aren’t tampered with. All security events should be automatically logged in a tamper-resistant security information and event management (SIEM) system. The vendor must administer these measures consistently and automatically, and they must not affect query performance.
For authentication, make sure connections to the cloud provider leverage standard security technologies such as Transport Layer Security (TLS) 1.2 and IP whitelisting (a whitelist of permitted IP addresses from which connections will be accepted).
A cloud data lake should also support the SAML 2.0 standard so you can leverage existing password security requirements as well as existing user roles. Multi-factor authentication (MFA) should be required to prevent users from being able to log in with stolen credentials. With MFA, users are challenged with a secondary verification request, such as a one-time security code sent as an email or text message.
Once a user is authenticated, it’s important to enforce authorization to specific data based on each user’s “need to know”.
A modern data lake must support multilevel, role-based access control (RBAC) functionality so each user requesting access to the data lake is authorized to access only data that he or she is explicitly permitted to see. Discretionary and role-based access control should be applied to all database objects (e.g. tables, schemas, any virtual extensions to the data lake). As an added restriction, secure views can be used to further restrict access.
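Role-based access control can be sketched as a mapping from roles to permitted objects, with role inheritance providing the multilevel part. The role and table names here are hypothetical; in a real data lake these grants are managed with SQL GRANT statements, not Python dicts:

```python
# Hypothetical role -> object grants.
GRANTS = {
    "analyst": {"sales.orders_clean"},
    "steward": {"sales.orders_clean", "sales.orders_raw"},
}
# Multilevel RBAC: a role can inherit every grant of the roles it contains.
ROLE_INHERITS = {"admin": {"steward", "analyst"}}

def allowed_objects(role):
    """Resolve a role's grants, including anything inherited from child roles."""
    objs = set(GRANTS.get(role, set()))
    for child in ROLE_INHERITS.get(role, set()):
        objs |= allowed_objects(child)
    return objs

def can_read(role, obj):
    """Authorize access only to objects the role is explicitly granted."""
    return obj in allowed_objects(role)
```

The key property is deny-by-default: a user sees only what their role is explicitly permitted to see, and anything else is invisible.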
Data breaches can cost millions of dollars to remedy, and permanently damage relationships with customers.
Industry-standard attestation reports verify that cloud vendors use appropriate security controls and features. For example, cloud vendors need to demonstrate they adequately monitor and respond to threats and security incidents, and that they have sufficient incident response procedures in place.
In addition to industry-standard technology certifications, verify your cloud provider also complies with all applicable government and industry regulations.
Ask your providers to supply attestation reports to verify they adequately monitor and respond to threats and security incidents and have sufficient incident response procedures in place. Make sure they provide a copy of the entire report for each pertinent standard, and not just the cover letters.
Isolating your data
You may want to isolate your data lake from all other data lakes, if it runs in a multi-tenant cloud environment.
Isolation should extend to the virtual machine layer. Your cloud vendor should isolate each customer’s data storage environment from every other customer’s storage environment, with independent directories encrypted using customer-specific keys.
Facts About Data Security
Effective security is complex and costly.
Cloud-built data lakes shift the responsibility for data center security to the SaaS cloud vendor. A properly architected and secured cloud data lake can be more secure than an on-premises data center.
You should be aware: security capabilities vary widely among vendors. The most basic cloud data lakes provide only rudimentary security capabilities, leaving things such as encryption, access control, and security monitoring to the customer.
This is part 2 of a 5-part story.
The information contained in this article was inspired by Cloud Data Lakes for Dummies provided by Snowflake resources. I am NOT an employee of Snowflake nor am I affiliated with its employees or infrastructure.
I’m a Data Engineer always looking to broaden my knowledge and I’m passionate about Big Data. I also enjoy spreading the knowledge I gain in topics that I find interesting. This article is meant to do just that. I play video games, watch TV, and love to travel (when it’s safe to do so). I’m a happy father of 2 fur babies and when I’m not giving them attention, I’m blogging about data and big data infrastructures!