Apache Iceberg: The Definitive Guide

Book description

Traditional data architecture patterns are severely limited. To use these patterns, you have to ETL data into each tool, a cost-prohibitive process for making warehouse features available to all of your data. The lack of flexibility with these patterns requires you to lock into a set of proprietary tools and formats, which creates data silos and data drift. This practical book shows you a better way.

Apache Iceberg provides the capabilities, performance, scalability, and savings that fulfill the promise of an open data lakehouse. By following the lessons in this book, you'll be able to achieve interactive, batch, machine learning, and streaming analytics with this high-performance open source format. Authors Tomer Shiran, Jason Hughes, and Alex Merced from Dremio show you how to get started with Iceberg.

With this book, you'll learn:

  • The architecture of Apache Iceberg tables
  • What happens under the hood when you perform operations on Iceberg tables
  • How to further optimize Iceberg tables for maximum performance
  • How to use Iceberg with popular data engines such as Apache Spark, Apache Flink, and Dremio

Discover why Apache Iceberg is a foundational technology for implementing an open data lakehouse.

Table of contents

  1. Foreword by Gerrit Kazmaier
  2. Foreword by Raghu Ramakrishnan
  3. Foreword by Rick Sears
  4. Preface
    1. About This Book
    2. Why We Wrote This Book
    3. What You Will Find Inside
    4. How to Use This Book
    5. Feedback and Questions
    6. Conventions Used in This Book
    7. Using Code Examples
    8. O’Reilly Online Learning
    9. How to Contact Us
    10. Acknowledgments
  5. I. Fundamentals of Apache Iceberg
  6. 1. Introduction to Apache Iceberg
    1. How Did We Get Here? A Brief History
      1. Foundational Components of a System Designed for OLAP Workloads
      2. Bringing It All Together
    2. The Data Warehouse
      1. A Brief History
      2. Pros and Cons of a Data Warehouse
    3. The Data Lake
      1. A Brief History
      2. Pros and Cons of a Data Lake
    4. Should I Run Analytics on a Data Lake or a Data Warehouse?
    5. The Data Lakehouse
    6. What Is a Table Format?
    7. Hive: The Original Table Format
    8. Modern Data Lake Table Formats
    9. What Is Apache Iceberg?
      1. How Apache Iceberg Came to Be
      2. The Apache Iceberg Architecture
      3. Key Features of Apache Iceberg
    10. Conclusion
  7. 2. The Architecture of Apache Iceberg
    1. The Data Layer
      1. Datafiles
      2. Delete Files
    2. The Metadata Layer
      1. Manifest Files
      2. Manifest Lists
      3. Metadata Files
      4. Puffin Files
    3. The Catalog
    4. Conclusion
  8. 3. Lifecycle of Write and Read Queries
    1. Writing Queries in Apache Iceberg
      1. Create the Table
      2. Insert the Query
      3. Merge Query
    2. Reading Queries in Apache Iceberg
      1. The SELECT Query
      2. The Time-Travel Query
    3. Conclusion
  9. 4. Optimizing the Performance of Iceberg Tables
    1. Compaction
    2. Hands-on with Compaction
      1. Compaction Strategies
      2. Automating Compaction
    3. Sorting
    4. Z-order
    5. Partitioning
      1. Hidden Partitioning
      2. Partition Evolution
      3. Other Partitioning Considerations
    6. Copy-on-Write Versus Merge-on-Read
      1. Copy-on-Write
      2. Merge-on-Read
      3. Configuring COW and MOR
    7. Other Considerations
      1. Metrics Collection
      2. Rewriting Manifests
      3. Optimizing Storage
      4. Write Distribution Mode
      5. Object Storage Considerations
      6. Datafile Bloom Filters
    8. Conclusion
  10. 5. Iceberg Catalogs
    1. Requirements of an Iceberg Catalog
    2. Catalog Comparison
      1. The Hadoop Catalog
      2. The Hive Catalog
      3. The AWS Glue Catalog
      4. The Nessie Catalog
      5. The REST Catalog
      6. The JDBC Catalog
      7. Other Catalogs
    3. Catalog Migration
      1. Using the Apache Iceberg Catalog Migration CLI
      2. Using an Engine
    4. Conclusion
  11. II. Hands-on with Apache Iceberg
  12. 6. Apache Spark
    1. Configuration
      1. Configuring Apache Iceberg and Spark
      2. Configuring the Catalogs
      3. Starting Spark with All the Configurations (AWS Glue Example)
    2. Data Definition Language Operations
      1. CREATE TABLE
      2. ALTER TABLE
      3. Alter a Table with Iceberg’s Spark SQL Extensions
      4. DROP TABLE
    3. Reading Data
      1. The Select All Query
      2. The Filter Rows Query
      3. Aggregation Queries
      4. Using Window Functions
    4. Writing Data
      1. INSERT INTO
      2. MERGE INTO
      3. INSERT OVERWRITE
      4. DELETE FROM
      5. UPDATE
    5. Iceberg Table Maintenance Procedures
      1. Expire Snapshots
      2. Rewrite Datafiles
      3. Rewrite Manifests
      4. Remove Orphan Files
    6. Conclusion
  13. 7. Dremio’s SQL Query Engine
    1. Configuration
    2. Data Definition Language Operations
      1. CREATE TABLE
      2. ALTER TABLE
      3. DROP TABLE
    3. Reading Data
      1. Using the SELECT Query
      2. Filtering Rows
      3. Using Aggregated Queries
      4. Using Window Functions
    4. Writing Data
      1. INSERT INTO
      2. COPY INTO
      3. MERGE INTO
      4. DELETE
      5. UPDATE
    5. Iceberg Table Maintenance
      1. Expire Snapshots
      2. Rewrite Datafiles
      3. Rewrite Manifests
    6. Conclusion
  14. 8. AWS Glue
    1. Configuration
      1. Creating a Glue Database
      2. Configuring the Glue ETL Job
    2. Create a Table Using the Glue Data Catalog
      1. Read the Table
      2. Insert the Data
    3. Conclusion
  15. 9. Apache Flink
    1. Configuration
      1. Prerequisites
      2. Start the Flink Cluster and Flink SQL Client
    2. Data Definition Language Operations
      1. CREATE CATALOG
      2. CREATE DATABASE
      3. CREATE TABLE
      4. ALTER TABLE
      5. DROP TABLE
    3. Reading Data
      1. Flink SQL Batch Read
      2. Flink SQL Streaming Read
      3. Metadata Table
    4. Writing Data
      1. INSERT INTO
      2. INSERT OVERWRITE
      3. UPSERT
    5. Flink DataFrame and Table API with Apache Iceberg Tables
      1. Prerequisites
      2. Configuring the Flink Job
      3. Starting the Cluster and Building the Package
      4. Running the Job
    6. Conclusion
  16. III. Apache Iceberg in Practice
  17. 10. Apache Iceberg in Production
    1. Apache Iceberg Metadata Tables
      1. The history Metadata Table
      2. The metadata_log_entries Metadata Table
      3. The snapshots Metadata Table
      4. The files Metadata Table
      5. The manifests Metadata Table
      6. The partitions Metadata Table
      7. The all_data_files Metadata Table
      8. The all_manifests Metadata Table
      9. The refs Metadata Table
      10. The entries Metadata Table
      11. Using the Metadata Tables in Conjunction
    2. Isolation of Changes with Branches
      1. Table Branching and Tagging
      2. Catalog Branching and Tagging
    3. Multitable Transactions
    4. Rolling Back Changes
      1. Rolling Back at the Table Level
      2. Rolling Back at the Catalog Level
    5. Conclusion
  18. 11. Streaming with Apache Iceberg
    1. Streaming with Spark
      1. Streaming into Iceberg with Spark
      2. Streaming from Iceberg with Spark
    2. Streaming with Flink
      1. Streaming into Iceberg with Flink
      2. Example of Streaming into Iceberg with Flink
    3. Streaming with Kafka Connect
      1. The Iceberg Kafka Sink
    4. Streaming with AWS
    5. Conclusion
  19. 12. Governance and Security
    1. Securing Datafiles
      1. Securing Files: Best Practices
      2. Hadoop Distributed File System
      3. Amazon Simple Storage Service
      4. Azure Data Lake Storage
      5. Google Cloud Storage
    2. Securing and Governing at the Semantic Layer
      1. Semantic Layer Best Practices
      2. Dremio
      3. Trino
    3. Securing and Governing at the Catalog Level
      1. Nessie
      2. Tabular
      3. AWS Glue and Lake Formation
    4. Additional Security and Governance Considerations
    5. Conclusion
  20. 13. Migrating to Apache Iceberg
    1. Migration Considerations
      1. Three-Step In-Place Migration Plan
      2. Four-Phase Shadow Migration Plan
    2. Migrating Hive Tables to Apache Iceberg
      1. The Snapshot Procedure
      2. The Migrate Procedure
    3. Migrating Delta Lake to Apache Iceberg
    4. Migrating Apache Hudi to Apache Iceberg
    5. Migrating Individual Files to Apache Iceberg
      1. Using the add_files Procedure
      2. Migrating from Delta Lake or Apache Hudi Without Preserving History
    6. Migrating from Anywhere by Rewriting Data
      1. Migrating Data to a New Iceberg Table
      2. Migrating Data into an Existing Iceberg Table
    7. Conclusion
  21. 14. Real-World Use Cases of Apache Iceberg
    1. Ensuring High-Quality Data with Write-Audit-Publish in Apache Iceberg
      1. WAP Using Iceberg’s Branching Feature
    2. Running BI Workloads on the Data Lake
      1. Land the Raw Data into the Data Lake
      2. Curate Virtual Data Marts/Data Products
      3. Create a Reflection to Accelerate Our Dashboard
      4. Connect Our View to Our BI Tool
      5. Benefits of Running BI Workloads on the Data Lake
    3. Implementing Change Data Capture with Apache Iceberg
      1. Create Apache Iceberg Tables
      2. Apply Updates from Operational Systems
      3. Create the Change Log View to Capture Changes
      4. Merge Changed Data in the Aggregated Table
    4. Conclusion
  22. Index
  23. About the Authors

Product information

  • Title: Apache Iceberg: The Definitive Guide
  • Author(s): Tomer Shiran, Jason Hughes, Alex Merced
  • Release date: May 2024
  • Publisher(s): O'Reilly Media, Inc.
  • ISBN: 9781098148621