Data Science on AWS

Book description

With this practical book, AI and machine learning practitioners will learn how to successfully build and deploy data science projects on Amazon Web Services. The Amazon AI and machine learning stack unifies data science, data engineering, and application development to help level up your skills. This guide shows you how to build and run pipelines in the cloud, then integrate the results into applications in minutes instead of days. Throughout the book, authors Chris Fregly and Antje Barth demonstrate how to reduce cost and improve performance.

  • Apply the Amazon AI and ML stack to real-world use cases for natural language processing, computer vision, fraud detection, conversational devices, and more
  • Use automated machine learning to implement a specific subset of use cases with SageMaker Autopilot (a minimal code sketch of this workflow follows this list)
  • Dive deep into the complete model development lifecycle for a BERT-based NLP use case including data ingestion, analysis, model training, and deployment
  • Tie everything together into a repeatable machine learning operations pipeline
  • Explore real-time ML, anomaly detection, and streaming analytics on data streams with Amazon Kinesis and Managed Streaming for Apache Kafka
  • Learn security best practices for data science projects and workflows including identity and access management, authentication, authorization, and more
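
As a taste of the Autopilot workflow mentioned in the list above, the sketch below launches an AutoML job with the SageMaker Python SDK. This is a minimal, hedged illustration rather than an excerpt from the book: it assumes an AWS account with SageMaker access and an execution role, and the S3 path, job name, target column, and instance type are hypothetical placeholders.

    # Minimal sketch (not from the book): launch a SageMaker Autopilot job
    # using the SageMaker Python SDK's AutoML class. The bucket, job name,
    # and column name below are hypothetical placeholders.
    import sagemaker
    from sagemaker.automl.automl import AutoML

    session = sagemaker.Session()
    role = sagemaker.get_execution_role()  # assumes a SageMaker notebook/Studio environment

    automl = AutoML(
        role=role,
        target_attribute_name="star_rating",  # the column Autopilot learns to predict
        max_candidates=10,                    # cap the number of candidate pipelines
        sagemaker_session=session,
    )

    # Autopilot performs data preprocessing, algorithm selection, and
    # hyper-parameter tuning against the CSV training data in S3.
    automl.fit("s3://my-bucket/reviews/train.csv", job_name="reviews-autopilot")

    # Deploy the best candidate behind a real-time SageMaker endpoint.
    predictor = automl.deploy(initial_instance_count=1, instance_type="ml.m5.large")

Chapter 3 covers this workflow in depth, including the Autopilot UI and making predictions from Amazon Athena and Amazon Redshift ML.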

Table of contents

  Preface
    1. Overview of the Chapters
    2. Who Should Read This Book
    3. Other Resources
    4. Conventions Used in This Book
    5. Using Code Examples
    6. O’Reilly Online Learning
    7. How to Contact Us
    8. Acknowledgments
  1. Introduction to Data Science on AWS
    1. Benefits of Cloud Computing
      1. Agility
      2. Cost Savings
      3. Elasticity
      4. Innovate Faster
      5. Deploy Globally in Minutes
      6. Smooth Transition from Prototype to Production
    2. Data Science Pipelines and Workflows
      1. Amazon SageMaker Pipelines
      2. AWS Step Functions Data Science SDK
      3. Kubeflow Pipelines
      4. Managed Workflows for Apache Airflow on AWS
      5. MLflow
      6. TensorFlow Extended
      7. Human-in-the-Loop Workflows
    3. MLOps Best Practices
      1. Operational Excellence
      2. Security
      3. Reliability
      4. Performance Efficiency
      5. Cost Optimization
    4. Amazon AI Services and AutoML with Amazon SageMaker
      1. Amazon AI Services
      2. AutoML with SageMaker Autopilot
    5. Data Ingestion, Exploration, and Preparation in AWS
      1. Data Ingestion and Data Lakes with Amazon S3 and AWS Lake Formation
      2. Data Analysis with Amazon Athena, Amazon Redshift, and Amazon QuickSight
      3. Evaluate Data Quality with AWS Deequ and SageMaker Processing Jobs
      4. Label Training Data with SageMaker Ground Truth
      5. Data Transformation with AWS Glue DataBrew, SageMaker Data Wrangler, and SageMaker Processing Jobs
    6. Model Training and Tuning with Amazon SageMaker
      1. Train Models with SageMaker Training and Experiments
      2. Built-in Algorithms
      3. Bring Your Own Script (Script Mode)
      4. Bring Your Own Container
      5. Pre-Built Solutions and Pre-Trained Models with SageMaker JumpStart
      6. Tune and Validate Models with SageMaker Hyper-Parameter Tuning
    7. Model Deployment with Amazon SageMaker and AWS Lambda Functions
      1. SageMaker Endpoints
      2. SageMaker Batch Transform
      3. Serverless Model Deployment with AWS Lambda
    8. Streaming Analytics and Machine Learning on AWS
      1. Amazon Kinesis Streaming
      2. Amazon Managed Streaming for Apache Kafka
      3. Streaming Predictions and Anomaly Detection
    9. AWS Infrastructure and Custom-Built Hardware
      1. SageMaker Compute Instance Types
      2. GPUs and Amazon Custom-Built Compute Hardware
      3. GPU-Optimized Networking and Custom-Built Hardware
      4. Storage Options Optimized for Large-Scale Model Training
    10. Reduce Cost with Tags, Budgets, and Alerts
    11. Summary
  2. Data Science Use Cases
    1. Innovation Across Every Industry
    2. Personalized Product Recommendations
      1. Recommend Products with Amazon Personalize
      2. Generate Recommendations with Amazon SageMaker and TensorFlow
      3. Generate Recommendations with Amazon SageMaker and Apache Spark
    3. Detect Inappropriate Videos with Amazon Rekognition
    4. Demand Forecasting
      1. Predict Energy Consumption with Amazon Forecast
      2. Predict Demand for Amazon EC2 Instances with Amazon Forecast
    5. Identify Fake Accounts with Amazon Fraud Detector
    6. Enable Privacy-Leak Detection with Amazon Macie
    7. Conversational Devices and Voice Assistants
      1. Speech Recognition with Amazon Lex
      2. Text-to-Speech Conversion with Amazon Polly
      3. Speech-to-Text Conversion with Amazon Transcribe
    8. Text Analysis and Natural Language Processing
      1. Translate Languages with Amazon Translate
      2. Classify Customer-Support Messages with Amazon Comprehend
      3. Extract Resume Details with Amazon Textract and Comprehend
    9. Cognitive Search and Natural Language Understanding
    10. Intelligent Customer Support Centers
    11. Industrial AI Services and Predictive Maintenance
    12. Home Automation with AWS IoT and Amazon SageMaker
    13. Extract Medical Information from Healthcare Documents
    14. Self-Optimizing and Intelligent Cloud Infrastructure
      1. Predictive Auto Scaling for Amazon EC2
      2. Anomaly Detection on Streams of Data
    15. Cognitive and Predictive Business Intelligence
      1. Ask Natural-Language Questions with Amazon QuickSight
      2. Train and Invoke SageMaker Models with Amazon Redshift
      3. Invoke Amazon Comprehend and SageMaker Models from Amazon Aurora SQL Database
      4. Invoke SageMaker Model from Amazon Athena
      5. Run Predictions on Graph Data Using Amazon Neptune
    16. Educating the Next Generation of AI and ML Developers
      1. Build Computer Vision Models with AWS DeepLens
      2. Learn Reinforcement Learning with AWS DeepRacer
      3. Understand GANs with AWS DeepComposer
    17. Program Nature’s Operating System with Quantum Computing
      1. Quantum Bits Versus Digital Bits
      2. Quantum Supremacy and the Quantum Computing Eras
      3. Cracking Cryptography
      4. Molecular Simulations and Drug Discovery
      5. Logistics and Financial Optimizations
      6. Quantum Machine Learning and AI
      7. Programming a Quantum Computer with Amazon Braket
      8. AWS Center for Quantum Computing
    18. Increase Performance and Reduce Cost
      1. Automatic Code Reviews with CodeGuru Reviewer
      2. Improve Application Performance with CodeGuru Profiler
      3. Improve Application Availability with DevOps Guru
    19. Summary
  3. Automated Machine Learning
    1. Automated Machine Learning with SageMaker Autopilot
    2. Track Experiments with SageMaker Autopilot
    3. Train and Deploy a Text Classifier with SageMaker Autopilot
      1. Train and Deploy with SageMaker Autopilot UI
      2. Train and Deploy a Model with the SageMaker Autopilot Python SDK
      3. Predict with Amazon Athena and SageMaker Autopilot
      4. Train and Predict with Amazon Redshift ML and SageMaker Autopilot
    4. Automated Machine Learning with Amazon Comprehend
      1. Predict with Amazon Comprehend’s Built-in Model
      2. Train and Deploy a Custom Model with the Amazon Comprehend UI
      3. Train and Deploy a Custom Model with the Amazon Comprehend Python SDK
    5. Summary
  4. Ingest Data into the Cloud
    1. Data Lakes
      1. Import Data into the S3 Data Lake
      2. Describe the Dataset
    2. Query the Amazon S3 Data Lake with Amazon Athena
      1. Access Athena from the AWS Console
      2. Register S3 Data as an Athena Table
      3. Update Athena Tables as New Data Arrives with AWS Glue Crawler
      4. Create a Parquet-Based Table in Athena
    3. Continuously Ingest New Data with AWS Glue Crawler
    4. Build a Lake House with Amazon Redshift Spectrum
      1. Export Amazon Redshift Data to S3 Data Lake as Parquet
      2. Share Data Between Amazon Redshift Clusters
    5. Choose Between Amazon Athena and Amazon Redshift
    6. Reduce Cost and Increase Performance
      1. S3 Intelligent-Tiering
      2. Parquet Partitions and Compression
      3. Amazon Redshift Table Design and Compression
      4. Use Bloom Filters to Improve Query Performance
      5. Materialized Views in Amazon Redshift Spectrum
    7. Summary
  5. Explore the Dataset
    1. Tools for Exploring Data in AWS
    2. Visualize Our Data Lake with SageMaker Studio
      1. Prepare SageMaker Studio to Visualize Our Dataset
      2. Run a Sample Athena Query in SageMaker Studio
      3. Dive Deep into the Dataset with Athena and SageMaker
    3. Query Our Data Warehouse
      1. Run a Sample Amazon Redshift Query from SageMaker Studio
      2. Dive Deep into the Dataset with Amazon Redshift and SageMaker
    4. Create Dashboards with Amazon QuickSight
    5. Detect Data-Quality Issues with Amazon SageMaker and Apache Spark
      1. SageMaker Processing Jobs
      2. Analyze Our Dataset with Deequ and Apache Spark
    6. Detect Bias in Our Dataset
      1. Generate and Visualize Bias Reports with SageMaker Data Wrangler
      2. Detect Bias with a SageMaker Clarify Processing Job
      3. Integrate Bias Detection into Custom Scripts with SageMaker Clarify Open Source
      4. Mitigate Data Bias by Balancing the Data
    7. Detect Different Types of Drift with SageMaker Clarify
    8. Analyze Our Data with AWS Glue DataBrew
    9. Reduce Cost and Increase Performance
      1. Use a Shared S3 Bucket for Nonsensitive Athena Query Results
      2. Approximate Counts with HyperLogLog
      3. Dynamically Scale a Data Warehouse with AQUA for Amazon Redshift
      4. Improve Dashboard Performance with QuickSight SPICE
    10. Summary
  6. Prepare the Dataset for Model Training
    1. Perform Feature Selection and Engineering
      1. Select Training Features Based on Feature Importance
      2. Balance the Dataset to Improve Model Accuracy
      3. Split the Dataset into Train, Validation, and Test Sets
      4. Transform Raw Text into BERT Embeddings
      5. Convert Features and Labels to Optimized TensorFlow File Format
    2. Scale Feature Engineering with SageMaker Processing Jobs
      1. Transform with scikit-learn and TensorFlow
      2. Transform with Apache Spark and TensorFlow
    3. Share Features Through SageMaker Feature Store
      1. Ingest Features into SageMaker Feature Store
      2. Retrieve Features from SageMaker Feature Store
    4. Ingest and Transform Data with SageMaker Data Wrangler
    5. Track Artifact and Experiment Lineage with Amazon SageMaker
      1. Understand Lineage-Tracking Concepts
      2. Show Lineage of a Feature Engineering Job
      3. Understand the SageMaker Experiments API
    6. Ingest and Transform Data with AWS Glue DataBrew
    7. Summary
  7. Train Your First Model
    1. Understand the SageMaker Infrastructure
      1. Introduction to SageMaker Containers
      2. Increase Availability with Compute and Network Isolation
    2. Deploy a Pre-Trained BERT Model with SageMaker JumpStart
    3. Develop a SageMaker Model
      1. Built-in Algorithms
      2. Bring Your Own Script
      3. Bring Your Own Container
    4. A Brief History of Natural Language Processing
    5. BERT Transformer Architecture
    6. Training BERT from Scratch
      1. Masked Language Model
      2. Next Sentence Prediction
    7. Fine-Tune a Pre-Trained BERT Model
    8. Create the Training Script
      1. Set Up the Train, Validation, and Test Dataset Splits
      2. Set Up the Custom Classifier Model
      3. Train and Validate the Model
      4. Save the Model
    9. Launch the Training Script from a SageMaker Notebook
      1. Define the Metrics to Capture and Monitor
      2. Configure the Hyper-Parameters for Our Algorithm
      3. Select Instance Type and Instance Count
      4. Putting It All Together in the Notebook
      5. Download and Inspect Our Trained Model from S3
      6. Show Experiment Lineage for Our SageMaker Training Job
      7. Show Artifact Lineage for Our SageMaker Training Job
    10. Evaluate Models
      1. Run Some Ad Hoc Predictions from the Notebook
      2. Analyze Our Classifier with a Confusion Matrix
      3. Visualize Our Neural Network with TensorBoard
      4. Monitor Metrics with SageMaker Studio
      5. Monitor Metrics with CloudWatch Metrics
    11. Debug and Profile Model Training with SageMaker Debugger
      1. Detect and Resolve Issues with SageMaker Debugger Rules and Actions
      2. Profile Training Jobs
    12. Interpret and Explain Model Predictions
    13. Detect Model Bias and Explain Predictions
      1. Detect Bias with a SageMaker Clarify Processing Job
      2. Feature Attribution and Importance with SageMaker Clarify and SHAP
    14. More Training Options for BERT
      1. Convert TensorFlow BERT Model to PyTorch
      2. Train PyTorch BERT Models with SageMaker
      3. Train Apache MXNet BERT Models with SageMaker
      4. Train BERT Models with PyTorch and AWS Deep Java Library
    15. Reduce Cost and Increase Performance
      1. Use Small Notebook Instances
      2. Test Model-Training Scripts Locally in the Notebook
      3. Profile Training Jobs with SageMaker Debugger
      4. Start with a Pre-Trained Model
      5. Use 16-Bit Half Precision and bfloat16
      6. Mixed 32-Bit Full and 16-Bit Half Precision
      7. Quantization
      8. Use Training-Optimized Hardware
      9. Spot Instances and Checkpoints
      10. Early Stopping Rule in SageMaker Debugger
    16. Summary
  8. Train and Optimize Models at Scale
    1. Automatically Find the Best Model Hyper-Parameters
      1. Set Up the Hyper-Parameter Ranges
      2. Run the Hyper-Parameter Tuning Job
      3. Analyze the Best Hyper-Parameters from the Tuning Job
      4. Show Experiment Lineage for Our SageMaker Tuning Job
    2. Use Warm Start for Additional SageMaker Hyper-Parameter Tuning Jobs
      1. Run HPT Job Using Warm Start
      2. Analyze the Best Hyper-Parameters from the Warm-Start Tuning Job
    3. Scale Out with SageMaker Distributed Training
      1. Choose a Distributed-Communication Strategy
      2. Choose a Parallelism Strategy
      3. Choose a Distributed File System
      4. Launch the Distributed Training Job
    4. Reduce Cost and Increase Performance
      1. Start with Reasonable Hyper-Parameter Ranges
      2. Shard the Data with ShardedByS3Key
      3. Stream Data on the Fly with Pipe Mode
      4. Enable Enhanced Networking
    5. Summary
  9. Deploy Models to Production
    1. Choose Real-Time or Batch Predictions
    2. Real-Time Predictions with SageMaker Endpoints
      1. Deploy Model Using SageMaker Python SDK
      2. Track Model Deployment in Our Experiment
      3. Analyze the Experiment Lineage of a Deployed Model
      4. Invoke Predictions Using the SageMaker Python SDK
      5. Invoke Predictions Using HTTP POST
      6. Create Inference Pipelines
      7. Invoke SageMaker Models from SQL and Graph-Based Queries
    3. Auto-Scale SageMaker Endpoints Using Amazon CloudWatch
      1. Define a Scaling Policy with AWS-Provided Metrics
      2. Define a Scaling Policy with a Custom Metric
      3. Tuning Responsiveness Using a Cooldown Period
      4. Auto-Scale Policies
    4. Strategies to Deploy New and Updated Models
      1. Split Traffic for Canary Rollouts
      2. Shift Traffic for Blue/Green Deployments
    5. Testing and Comparing New Models
      1. Perform A/B Tests to Compare Model Variants
      2. Reinforcement Learning with Multiarmed Bandit Testing
    6. Monitor Model Performance and Detect Drift
      1. Enable Data Capture
      2. Understand Baselines and Drift
    7. Monitor Data Quality of Deployed SageMaker Endpoints
      1. Create a Baseline to Measure Data Quality
      2. Schedule Data-Quality Monitoring Jobs
      3. Inspect Data-Quality Results
    8. Monitor Model Quality of Deployed SageMaker Endpoints
      1. Create a Baseline to Measure Model Quality
      2. Schedule Model-Quality Monitoring Jobs
      3. Inspect Model-Quality Monitoring Results
    9. Monitor Bias Drift of Deployed SageMaker Endpoints
      1. Create a Baseline to Detect Bias
      2. Schedule Bias-Drift Monitoring Jobs
      3. Inspect Bias-Drift Monitoring Results
    10. Monitor Feature Attribution Drift of Deployed SageMaker Endpoints
      1. Create a Baseline to Monitor Feature Attribution
      2. Schedule Feature Attribution Drift Monitoring Jobs
      3. Inspect Feature Attribution Drift Monitoring Results
    11. Perform Batch Predictions with SageMaker Batch Transform
      1. Select an Instance Type
      2. Set Up the Input Data
      3. Tune the SageMaker Batch Transform Configuration
      4. Prepare the SageMaker Batch Transform Job
      5. Run the SageMaker Batch Transform Job
      6. Review the Batch Predictions
    12. AWS Lambda Functions and Amazon API Gateway
    13. Optimize and Manage Models at the Edge
    14. Deploy a PyTorch Model with TorchServe
    15. TensorFlow-BERT Inference with AWS Deep Java Library
    16. Reduce Cost and Increase Performance
      1. Delete Unused Endpoints and Scale In Underutilized Clusters
      2. Deploy Multiple Models in One Container
      3. Attach a GPU-Based Elastic Inference Accelerator
      4. Optimize a Trained Model with SageMaker Neo and TensorFlow Lite
      5. Use Inference-Optimized Hardware
    17. Summary
  10. Pipelines and MLOps
    1. Machine Learning Operations
    2. Software Pipelines
    3. Machine Learning Pipelines
      1. Components of Effective Machine Learning Pipelines
      2. Steps of an Effective Machine Learning Pipeline
    4. Pipeline Orchestration with SageMaker Pipelines
      1. Create an Experiment to Track Our Pipeline Lineage
      2. Define Our Pipeline Steps
      3. Configure the Pipeline Parameters
      4. Create the Pipeline
      5. Start the Pipeline with the Python SDK
      6. Start the Pipeline with the SageMaker Studio UI
      7. Approve the Model for Staging and Production
      8. Review the Pipeline Artifact Lineage
      9. Review the Pipeline Experiment Lineage
    5. Automation with SageMaker Pipelines
      1. GitOps Trigger When Committing Code
      2. S3 Trigger When New Data Arrives
      3. Time-Based Schedule Trigger
      4. Statistical Drift Trigger
    6. More Pipeline Options
      1. AWS Step Functions and the Data Science SDK
      2. Kubeflow Pipelines
      3. Apache Airflow
      4. MLflow
      5. TensorFlow Extended
    7. Human-in-the-Loop Workflows
      1. Improving Model Accuracy with Amazon A2I
      2. Active-Learning Feedback Loops with SageMaker Ground Truth
    8. Reduce Cost and Improve Performance
      1. Cache Pipeline Steps
      2. Use Less-Expensive Spot Instances
    9. Summary
  11. Streaming Analytics and Machine Learning
    1. Online Learning Versus Offline Learning
    2. Streaming Applications
    3. Windowed Queries on Streaming Data
      1. Stagger Windows
      2. Tumbling Windows
      3. Sliding Windows
    4. Streaming Analytics and Machine Learning on AWS
    5. Classify Real-Time Product Reviews with Amazon Kinesis, AWS Lambda, and Amazon SageMaker
    6. Implement Streaming Data Ingest Using Amazon Kinesis Data Firehose
      1. Create Lambda Function to Invoke SageMaker Endpoint
      2. Create the Kinesis Data Firehose Delivery Stream
      3. Put Messages on the Stream
    7. Summarize Real-Time Product Reviews with Streaming Analytics
    8. Setting Up Amazon Kinesis Data Analytics
      1. Create a Kinesis Data Stream to Deliver Data to a Custom Application
      2. Create AWS Lambda Function to Send Notifications via Amazon SNS
      3. Create AWS Lambda Function to Publish Metrics to Amazon CloudWatch
      4. Transform Streaming Data in Kinesis Data Analytics
      5. Understand In-Application Streams and Pumps
    9. Amazon Kinesis Data Analytics Applications
      1. Calculate Average Star Rating
      2. Detect Anomalies in Streaming Data
      3. Calculate Approximate Counts of Streaming Data
      4. Create Kinesis Data Analytics Application
      5. Start the Kinesis Data Analytics Application
      6. Put Messages on the Stream
    10. Classify Product Reviews with Apache Kafka, AWS Lambda, and Amazon SageMaker
    11. Reduce Cost and Improve Performance
      1. Aggregate Messages
      2. Consider Kinesis Firehose Versus Kinesis Data Streams
      3. Enable Enhanced Fan-Out for Kinesis Data Streams
    12. Summary
  12. Secure Data Science on AWS
    1. Shared Responsibility Model Between AWS and Customers
    2. Applying AWS Identity and Access Management
      1. IAM Users
      2. IAM Policies
      3. IAM User Roles
      4. IAM Service Roles
      5. Specifying Condition Keys for IAM Roles
      6. Enable Multifactor Authentication
      7. Least Privilege Access with IAM Roles and Policies
      8. Resource-Based IAM Policies
      9. Identity-Based IAM Policies
    3. Isolating Compute and Network Environments
      1. Virtual Private Cloud
      2. VPC Endpoints and PrivateLink
      3. Limiting Athena APIs with a VPC Endpoint Policy
    4. Securing Amazon S3 Data Access
      1. Require a VPC Endpoint with an S3 Bucket Policy
      2. Limit S3 APIs for an S3 Bucket with a VPC Endpoint Policy
      3. Restrict S3 Bucket Access to a Specific VPC with an S3 Bucket Policy
      4. Limit S3 APIs with an S3 Bucket Policy
      5. Restrict S3 Data Access Using IAM Role Policies
      6. Restrict S3 Bucket Access to a Specific VPC with an IAM Role Policy
      7. Restrict S3 Data Access Using S3 Access Points
    5. Encryption at Rest
      1. Create an AWS KMS Key
      2. Encrypt the Amazon EBS Volumes During Training
      3. Encrypt the Uploaded Model in S3 After Training
      4. Store Encryption Keys with AWS KMS
      5. Enforce S3 Encryption for Uploaded S3 Objects
      6. Enforce Encryption at Rest for SageMaker Jobs
      7. Enforce Encryption at Rest for SageMaker Notebooks
      8. Enforce Encryption at Rest for SageMaker Studio
    6. Encryption in Transit
      1. Post-Quantum TLS Encryption in Transit with KMS
      2. Encrypt Traffic Between Training-Cluster Containers
      3. Enforce Inter-Container Encryption for SageMaker Jobs
    7. Securing SageMaker Notebook Instances
      1. Deny Root Access Inside SageMaker Notebooks
      2. Disable Internet Access for SageMaker Notebooks
    8. Securing SageMaker Studio
      1. Require a VPC for SageMaker Studio
      2. SageMaker Studio Authentication
    9. Securing SageMaker Jobs and Models
      1. Require a VPC for SageMaker Jobs
      2. Require Network Isolation for SageMaker Jobs
    10. Securing AWS Lake Formation
    11. Securing Database Credentials with AWS Secrets Manager
    12. Governance
      1. Secure Multiaccount AWS Environments with AWS Control Tower
      2. Manage Accounts with AWS Organizations
      3. Enforce Account-Level Permissions with SCPs
      4. Implement Multiaccount Model Deployments
    13. Auditability
      1. Tag Resources
      2. Log Activities and Collect Events
      3. Track User Activity and API Calls
    14. Reduce Cost and Improve Performance
      1. Limit Instance Types to Control Cost
      2. Quarantine or Delete Untagged Resources
      3. Use S3 Bucket KMS Keys to Reduce Cost and Increase Performance
    15. Summary
  Index

Product information

  • Title: Data Science on AWS
  • Author(s): Chris Fregly, Antje Barth
  • Release date: April 2021
  • Publisher(s): O'Reilly Media, Inc.
  • ISBN: 9781492079392