The Data Lake is a data-centered architecture featuring a repository capable of storing vast quantities of data in various formats. Data from enterprise systems, databases, web server logs, social media, and third-party sources is ingested into the Data Lake in a secure and governed manner. Data is cleansed, conformed, integrated, and modeled into refined and fit-for-purpose zones for exploratory and analytical consumption. Business and technical metadata, including lineage, is captured in the data catalog for search and discovery. Security policies, including entitlements, are also applied.
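As a rough illustration of the catalog step, the sketch below registers a refined-zone table in the AWS Glue Data Catalog with a few business and technical properties attached. The bucket, database, table, column, and lineage key names are illustrative assumptions, not names from the session.

```python
import boto3

glue = boto3.client("glue")

# Register a refined-zone table so it is searchable and discoverable.
glue.create_table(
    DatabaseName="refined_zone",  # assumed catalog database
    TableInput={
        "Name": "customer_orders",
        "Description": "Cleansed and conformed orders, partitioned by ingest date",
        "TableType": "EXTERNAL_TABLE",
        "Parameters": {
            # Business/technical metadata and a coarse lineage hint as key-value pairs
            "classification": "parquet",
            "source_system": "orders-db",  # assumed upstream system
        },
        "StorageDescriptor": {
            "Columns": [
                {"Name": "order_id", "Type": "string"},
                {"Name": "customer_id", "Type": "string"},
                {"Name": "order_total", "Type": "double"},
            ],
            "Location": "s3://example-datalake/refined/customer_orders/",  # assumed bucket
            "InputFormat": "org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat",
            "OutputFormat": "org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat",
            "SerdeInfo": {
                "SerializationLibrary": "org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe"
            },
        },
    },
)
```

Once registered, the table is immediately discoverable and queryable by services such as Athena and EMR.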
Data can flow into the Data Lake through either batch processing or real-time processing of streaming data. Additionally, data itself is no longer constrained by initial schema decisions and can be exploited more freely by the enterprise. Rising above this repository is a set of capabilities that allows IT to provide Data and Analytics as a Service (DaaS) in a supply-and-demand model: IT takes the role of the data provider (supplier), while business users (data scientists, business analysts) are the consumers.
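To make the two ingestion paths concrete, here is a minimal sketch of both: a batch upload of an extract file into a raw zone on S3, and a streaming put of a single event into Kinesis. The bucket, stream, file, and field names are assumptions for illustration.

```python
import json
import boto3

s3 = boto3.client("s3")
kinesis = boto3.client("kinesis")

# Batch path: land a nightly extract in the raw zone of the lake.
s3.upload_file(
    Filename="orders_2023-01-01.csv",  # assumed local extract
    Bucket="example-datalake",         # assumed bucket
    Key="raw/orders/ingest_date=2023-01-01/orders.csv",
)

# Streaming path: push an individual event into a Kinesis stream
# for real-time processing before it lands in the lake.
event = {"order_id": "o-123", "customer_id": "c-456", "order_total": 42.0}
kinesis.put_record(
    StreamName="orders-events",        # assumed stream
    Data=json.dumps(event).encode("utf-8"),
    PartitionKey=event["customer_id"],
)
```

In practice, a consumer such as a Lambda function or Kinesis Data Firehose would then deliver the streamed events into the lake.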
AWS provides an extensive set of tools and services for implementing serverless data lake architectures.
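For example, a serverless, pay-per-query read of the lake might look like the following Athena sketch; the database, table, and results bucket reuse the assumed names from the earlier sketches.

```python
import boto3

athena = boto3.client("athena")

# Run a SQL query directly against the files in S3; no cluster is provisioned.
response = athena.start_query_execution(
    QueryString=(
        "SELECT customer_id, SUM(order_total) AS total "
        "FROM customer_orders GROUP BY customer_id LIMIT 10"
    ),
    QueryExecutionContext={"Database": "refined_zone"},
    ResultConfiguration={"OutputLocation": "s3://example-datalake/athena-results/"},
)
print(response["QueryExecutionId"])  # poll get_query_execution for completion status
```

Because Athena executes against data in place, there is no infrastructure to keep running between queries.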
In this session, Akshay Goel and Alberto Artasanchez from Knowledgent will walk us through stories from the trenches: the challenges they have faced implementing petabyte-scale data lakes and how they have overcome them. We will also learn about S3, DynamoDB, Kinesis, AWS Glue, Athena, Lambda, EMR, QuickSight, SageMaker, and other AWS services applicable to Data Lakes.