Name: Modern Data Engineering with Apache Spark
ISBN: 978-1-4842-7452-1

Overview

Authors:

Scott Haines

Provides a practical approach to data engineering through the lens of Apache Spark
Includes lessons from the author’s experience in managing massive data pipelines
Gives you a toolbox of solutions to draw on when solving future problems

25k Accesses
5 Citations
13 Altmetric

Access this book

eBook USD 49.99
Softcover Book USD 64.99
Prices exclude VAT (USA)

Table of contents (15 chapters)

Front Matter
  Pages i-xxv

The Fundamentals of Data Engineering with Spark
1. Introduction to Modern Data Engineering
  Pages 3-29
2. Getting Started with Apache Spark
  Pages 31-57
3. Working with Data
  Pages 59-91
4. Transforming Data with Spark SQL and the DataFrame API
  Pages 93-115
5. Bridging Spark SQL with JDBC
  Pages 117-151
6. Data Discovery and the Spark SQL Catalog
  Pages 153-202
7. Data Pipelines and Structured Spark Applications
  Pages 203-252

The Streaming Pipeline Ecosystem
8. Workflow Orchestration with Apache Airflow
  Pages 255-295
9. A Gentle Introduction to Stream Processing
  Pages 297-322
10. Patterns for Writing Structured Streaming Applications
  Pages 323-363
11. Apache Kafka and Spark Structured Streaming
  Pages 365-404
12. Analytical Processing and Insights
  Pages 405-450

Advanced Techniques
13. Advanced Analytics with Spark Stateful Structured Streaming
  Pages 453-488
14. Deploying Mission-Critical Spark Applications on Spark Standalone
  Pages 489-521
15. Deploying Mission-Critical Spark Applications on Kubernetes
  Pages 523-571

Back Matter
  Pages 573-585

About this book

Leverage Apache Spark within a modern data engineering ecosystem. This hands-on guide will teach you how to write fully functional applications, follow industry best practices, and learn the rationale behind these decisions. With Apache Spark as the foundation, you will follow a step-by-step journey beginning with the basics of data ingestion, processing, and transformation, and ending up with an entire local data platform running Apache Spark, Apache Zeppelin, Apache Kafka, Redis, MySQL, Minio (S3), and Apache Airflow.

Apache Spark applications solve a wide range of data problems, from traditional data loading and processing to rich SQL-based analysis, complex machine learning workloads, and even near real-time processing of streaming data. Spark fits well as a central foundation for any data engineering workload. This book will teach you to write interactive Spark applications using Apache Zeppelin notebooks, write and compile reusable applications and modules, and fully test both batch and streaming applications. You will also learn to containerize your applications using Docker and to run and deploy your Spark applications using a variety of tools such as Apache Airflow, Docker, and Kubernetes.

Reading this book will empower you to take advantage of Apache Spark to optimize your data pipelines and teach you to craft modular and testable Spark applications. You will create and deploy mission-critical streaming Spark applications in a low-stress environment that paves the way for your own path to production.

What You Will Learn

Simplify data transformation with Spark Pipelines and Spark SQL
Bridge data engineering with machine learning
Architect modular data pipeline applications
Build reusable application components and libraries
Containerize your Spark applications for consistency and reliability
Use Docker and Kubernetes to deploy your Spark applications
Speed up application experimentation using Apache Zeppelin and Docker
Understand serializable structured data and data contracts
Harness effective strategies for optimizing data in your data lakes
Build end-to-end Spark Structured Streaming applications using Redis and Apache Kafka
Embrace testing for your batch and streaming applications
Deploy and monitor your Spark applications

Who This Book Is For

Professional software engineers who want to take their current skills and apply them to new and exciting opportunities within the data ecosystem; practicing data engineers who are looking for a guiding light while traversing the many challenges of moving from batch to streaming modes; data architects who wish to provide clear and concise direction for how best to harness and use Apache Spark within their organization; and those interested in the ins and outs of becoming a modern data engineer in today's fast-paced and data-hungry world.

Authors and Affiliations

San Jose, USA

Scott Haines

About the author

Scott Haines is a full-stack engineer with a current focus on real-time, highly available, trustworthy analytics systems. He works at Twilio as a Principal Software Engineer on the Voice Insights team, where he helps drive Spark adoption, creates streaming pipeline architectures, and helps to architect and build out a massive stream and batch processing platform.
Prior to Twilio, Scott wrote the backend Java APIs for Yahoo Games as well as the real-time game ranking and ratings engine (built on Storm) to provide personalized recommendations and page views for 10 million customers. He finished his tenure at Yahoo working for Flurry Analytics, where he wrote the alerts and notifications system for mobile devices.

Bibliographic Information

Book Title: Modern Data Engineering with Apache Spark
Book Subtitle: A Hands-On Guide for Building Mission-Critical Streaming Applications
Authors: Scott Haines
DOI: https://doi.org/10.1007/978-1-4842-7452-1
Publisher: Apress, Berkeley, CA
eBook Packages: Professional and Applied Computing, Apress Access Books, Professional and Applied Computing (R0)
Softcover ISBN: 978-1-4842-7451-4 (Published: 23 March 2022)
eBook ISBN: 978-1-4842-7452-1 (Published: 22 March 2022)
Edition Number: 1
Number of Pages: XXV, 585
Number of Illustrations: 59 b/w illustrations
Topics: Java; Statistics, general; Database Management; Data Mining and Knowledge Discovery
