Data Engineering for Machine Learning Pipelines
Year of publication: 2024
Author: Narayanan P.K.
Publisher: Apress
ISBN: 979-8-8688-0602-5
Language: English
Format: PDF
Quality: Publisher's layout or text (eBook)
Interactive table of contents: Yes
Number of pages: 631
Description: This book covers modern data engineering functions and important Python libraries to help you develop state-of-the-art ML pipelines and integration code.
The book begins by explaining data analytics and transformation, delving into the Pandas library, its capabilities, and nuances. It then explores emerging libraries such as Polars and CuDF, providing insights into GPU-based computing and cutting-edge data manipulation techniques. The text discusses the importance of data validation in engineering processes, introducing tools such as Great Expectations and Pandera to ensure data quality and reliability. The book delves into API design and development, with a specific focus on leveraging the power of FastAPI. It covers authentication, authorization, and real-world applications, enabling you to construct efficient and secure APIs using FastAPI. Also explored is concurrency in data engineering, examining Dask's capabilities from basic setup to crafting advanced machine learning pipelines. The book includes development and delivery of data engineering pipelines using leading cloud platforms such as AWS, Google Cloud, and Microsoft Azure. The concluding chapters concentrate on real-time and streaming data engineering pipelines, emphasizing Apache Kafka and workflow orchestration in data engineering. Workflow tools such as Airflow and Prefect are introduced to seamlessly manage and automate complex data workflows.
What sets this book apart is its blend of theoretical knowledge and practical application, a structured path from basic to advanced concepts, and insights into using state-of-the-art tools. With this book, you gain access to cutting-edge techniques and insights that are reshaping the industry. This book is not just an educational tool. It is a career catalyst, and an investment in your future as a data engineering expert, poised to meet the challenges of today's data-driven world.
What You Will Learn
- Elevate your data wrangling jobs by utilizing the power of both CPU and GPU computing, and learn to process data using Pandas 2.0, Polars, and CuDF at unprecedented speeds
- Design data validation pipelines, construct efficient data service APIs, develop real-time streaming pipelines, and master workflow orchestration to streamline your engineering projects
- Leverage concurrent programming to develop machine learning pipelines and get hands-on experience in development and deployment of machine learning pipelines across AWS, GCP, and Azure
Table of Contents
About the Author
About the Technical Reviewer
Introduction
Chapter 1: Core Technologies in Data Engineering
Chapter 2: Data Wrangling using Pandas
Chapter 3: Data Wrangling using Rust’s Polars
Chapter 4: GPU Driven Data Wrangling Using CuDF
Chapter 5: Getting Started with Data Validation using Pydantic and Pandera
Chapter 6: Data Validation using Great Expectations
Chapter 7: Introduction to Concurrency Programming and Dask
Chapter 8: Engineering Machine Learning Pipelines using DaskML
Chapter 9: Engineering Real-time Data Pipelines using Apache Kafka
Chapter 10: Engineering Machine Learning and Data REST APIs using FastAPI
Chapter 11: Getting Started with Workflow Management and Orchestration
Chapter 12: Orchestrating Data Engineering Pipelines using Apache Airflow
Chapter 13: Orchestrating Data Engineering Pipelines using Prefect
Chapter 14: Getting Started with Big Data and Cloud Computing
Chapter 15: Engineering Data Pipelines Using Amazon Web Services
Chapter 16: Engineering Data Pipelines Using Google Cloud Platform
Chapter 17: Engineering Data Pipelines Using Microsoft Azure
Index