dg-publish: true
created: 2024-07-01
modified: 2024-07-01
title: Data Pipelines Pocket Reference
source: kindle

@tags:: #lit✍/📚book/highlights
@links::
@ref:: Data Pipelines Pocket Reference
@author:: James Densmore

=this.file.name

Book cover of "Data Pipelines Pocket Reference"

Reference

Notes

Chapter 1. Introduction to Data Pipelines

What Are Data Pipelines?

Quote

Data pipelines are sets of processes that move and transform data from various sources to a destination where new value can be derived.
- Location 90
-

Quote

In practice, however, pipelines typically consist of multiple steps including data extraction, data preprocessing, data validation, and at times training or running a machine learning model before delivering data to its final destination.
- Location 94
-

Quote

Figure 1-1. A simple pipeline that loads server log data into an S3 Bucket, does some basic processing and structuring, and loads the results into an Amazon Redshift database.
- Location 98
-

Who Builds Data Pipelines?

Quote

Data engineers take pride in ensuring the validity and timeliness of the data they deliver. That means testing, alerting, and creating contingency plans for when something goes wrong.
- Location 107
-

SQL and Data Warehousing Fundamentals
Quote

Experienced data engineers know how to write high-performance SQL and understand the fundamentals of data warehousing and data modeling.
- Location 113
-

Python and/or Java
Distributed Computing
Quote

One popular example of distributed computing in analytics is the Hadoop ecosystem, which includes distributed file storage via Hadoop Distributed File System (HDFS), processing via MapReduce, data analysis via Pig, and more. Apache Spark is another popular distributed processing framework, which is quickly surpassing Hadoop in popularity.
- Location 125
-

Basic System Administration
Quote

A data engineer is expected to be proficient on the Linux command line and be able to perform tasks such as analyze application logs, schedule cron jobs, and troubleshoot firewall and other security settings.
- Location 130
-

A Goal-Oriented Mentality

Why Build Data Pipelines?

Quote

What most people outside of analytics often fail to appreciate is that to generate what is seen, there’s a complex machinery that is unseen. For every dashboard and insight that a data analyst generates and for each predictive model developed by a data scientist, there are data pipelines working behind the scenes.
- Location 143
-

How Are Pipelines Built?

Quote

pipelines are not just built — they are monitored, maintained, and extended. Data engineers are tasked with not just delivering data once, but building pipelines and supporting infrastructure that deliver and process it reliably, securely, and on time.
- Location 159
-

Chapter 2. A Modern Data Infrastructure

Diversity of Data Sources

Quote

Figure 2-1. The key components of a modern data infrastructure.
- Location 172
-

Source System Ownership
Quote

Figure 2-2. A simple pipeline with data from multiple sources loaded into an S3 bucket and then a Redshift database.
- Location 181
-

Quote

The term data ingestion refers to extracting data from one source and loading it into another.
- Location 183
-
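
A minimal sketch of an ingestion in that spirit, assuming a hypothetical Postgres source table and a made-up S3 bucket as the destination (psycopg2 and boto3 are just common choices, not the book's prescribed tools):

```python
# Hypothetical example: extract rows from a source database and load the
# result into cloud storage, where a warehouse can pick it up later.
import csv
import io

import boto3      # assumes AWS credentials are already configured
import psycopg2   # assumes network access to the source database

conn = psycopg2.connect("dbname=app_db user=etl_user")  # hypothetical DSN
cur = conn.cursor()
cur.execute("SELECT id, email, created_at FROM customers;")  # hypothetical table

buffer = io.StringIO()
csv.writer(buffer).writerows(cur.fetchall())

# Load: write the extracted CSV into an S3 bucket (bucket name is made up).
s3 = boto3.client("s3")
s3.put_object(
    Bucket="my-ingest-bucket",
    Key="customers.csv",
    Body=buffer.getvalue().encode("utf-8"),
)
```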

Ingestion Interface and Data Structure
Quote

First, what is the interface to the data? Some of the most common include the following:
- A database behind an application, such as a Postgres or MySQL database
- A layer of abstraction on top of a system such as a REST API
- A stream processing platform such as Apache Kafka
- A shared network file system or cloud storage bucket containing logs, comma-separated value (CSV) files, and other flat files
- A data warehouse or data lake
- Data in HDFS or HBase database
- Location 197
-

Quote

In addition to the interface, the structure of the data will vary. Here are some common examples:
- JSON from a REST API
- Well-structured data from a MySQL database
- JSON within columns of a MySQL database table
- Semistructured log data
- CSV, fixed-width format (FWF), and other flat file formats
- JSON in flat files
- Stream output from Kafka
- Location 200
-

Quote

Semistructured data such as JSON is increasingly common and has the advantage of the structure of attribute-value pairs and nesting of objects. However, unlike a relational database, there is no guarantee that each object in the same dataset will have the same structure.
- Location 206
-
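
For example (illustrative records, not from the book), two objects in the same JSON dataset may not share the same keys, so extraction code has to read attributes defensively:

```python
import json

# Two records from the "same" dataset with different shapes.
raw = """
[{"id": 1, "email": "a@example.com", "address": {"city": "Boston"}},
 {"id": 2, "name": "Beth"}]
"""

for record in json.loads(raw):
    # .get() avoids a KeyError when an attribute is missing from an object.
    email = record.get("email")
    city = (record.get("address") or {}).get("city")
    print(record["id"], email, city)
```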

Data Volume
Quote

Though the design decisions at each step in a pipeline must take data volume into consideration, high volume does not mean high value.
- Location 216
-

Data Cleanliness and Validity

Quote

There are many common characteristics of “messy data,” including, but not limited to, the following:
- Duplicate or ambiguous records
- Orphaned records
- Incomplete or missing records
- Text encoding errors
- Inconsistent formats (for example, phone numbers with or without dashes)
- Mislabeled or unlabeled data
- Location 227
-

Quote

There’s no magic bullet for ensuring data cleanliness and validity, but in a modern data ecosystem, there are key characteristics and approaches that we’ll see throughout this book:
- Assume the worst, expect the best: Pristine datasets only exist in academic literature. Assume your input datasets will contain numerous validity and consistency issues, but build pipelines that identify and cleanse data in the interest of clean output.
- Clean and validate data in the system best suited to do so: There are times when it’s better to wait to clean data until later in a pipeline. For example, modern pipelines tend to follow an extract-load-transform (ELT) rather than extract-transform-load (ETL) approach for data warehousing (more in Chapter 3). It’s sometimes optimal to load data into a data lake in a fairly raw form and to worry about structuring and cleaning later in the pipeline. In other words, use the right tool for the right job rather than rushing the cleaning and validation processes.
- Validate often: Even if you don’t clean up data early in a pipeline, don’t wait until the end of the pipeline to validate it. You’ll have a much harder time determining where things went wrong. Conversely, don’t validate once early in a pipeline and assume all will go well in subsequent steps. Chapter 8 digs deeper into validation.
- Location 231
-
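
A minimal sketch of the “validate often” idea, using an in-memory SQLite table as a stand-in for a warehouse table a pipeline just loaded (table and checks are hypothetical; real pipelines would run comparable row-count, duplicate, and null checks after each load):

```python
import sqlite3

# Stand-in for a freshly loaded warehouse table.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE orders (order_id INTEGER, customer_id INTEGER);
    INSERT INTO orders VALUES (1, 10), (1, 10), (2, NULL);
""")

# Checks run right after the load step, not only at the end of the pipeline.
dupes = conn.execute(
    "SELECT COUNT(*) FROM (SELECT order_id FROM orders "
    "GROUP BY order_id HAVING COUNT(*) > 1);").fetchone()[0]
missing = conn.execute(
    "SELECT COUNT(*) FROM orders WHERE customer_id IS NULL;").fetchone()[0]

for count, problem in [(dupes, "duplicate order_id values"),
                       (missing, "orders missing customer_id")]:
    if count:
        print(f"validation failed: {count} {problem}")
```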

Latency and Bandwidth of the Source System
Quote

Data extraction steps in pipelines must contend with API rate limits, connection time-outs, slow downloads, and source system owners who are unhappy due to strain placed on their systems.
- Location 247
-
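
A common defensive pattern for these issues is a request timeout plus retries with exponential backoff; a minimal sketch using the requests library (the API URL is made up):

```python
import time

import requests


def fetch_with_retries(url, attempts=3, timeout=10):
    """Retry on timeouts and rate-limit responses with exponential backoff."""
    for attempt in range(attempts):
        try:
            response = requests.get(url, timeout=timeout)
            if response.status_code == 429:          # rate limited by the source
                raise requests.HTTPError("rate limited")
            response.raise_for_status()
            return response.json()
        except requests.RequestException:
            if attempt == attempts - 1:
                raise
            time.sleep(2 ** attempt)                  # back off: 1s, 2s, 4s, ...


records = fetch_with_retries("https://api.example.com/v1/orders")  # hypothetical API
```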

Cloud Data Warehouses and Data Lakes

Quote

A data warehouse is a database where data from different systems is stored and modeled to support analysis and other activities related to answering questions with it. Data in a data warehouse is structured and optimized for reporting and analysis queries.
- Location 263
-

Quote

A data lake is where data is stored, but without the structure or query optimization of a data warehouse.
- Location 265
-

Data Ingestion Tools

Data Transformation and Modeling Tools

Quote

Data transformation: Transforming data is a broad term that is signified by the T in an ETL or ELT process. A transformation can be something as simple as converting a timestamp stored in a table from one time zone to another. It can also be a more complex operation that creates a new metric from multiple source columns that are aggregated and filtered through some business logic.
- Location 293
-
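
As a sketch of the simpler case mentioned above, converting a timestamp between time zones (the zone names are only examples):

```python
from datetime import datetime
from zoneinfo import ZoneInfo  # standard library in Python 3.9+

# A timestamp as it might arrive from a source system, stored in UTC.
created_at_utc = datetime(2021, 6, 1, 14, 30, tzinfo=ZoneInfo("UTC"))

# Transform: convert to the time zone the analysts report in.
created_at_local = created_at_utc.astimezone(ZoneInfo("America/New_York"))
print(created_at_local.isoformat())  # 2021-06-01T10:30:00-04:00
```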

Quote

Data modeling: Data modeling is a more specific type of data transformation. A data model structures and defines data in a format that is understood and optimized for data analysis. A data model is usually represented as one or more tables in a data warehouse.
- Location 298
-
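
A minimal sketch of what such a model can look like, again using in-memory SQLite as a stand-in for the warehouse: a daily orders table built from a raw orders table (all table and column names are hypothetical):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Raw table as loaded by ingestion.
    CREATE TABLE orders (order_id INTEGER, order_date TEXT, amount REAL);
    INSERT INTO orders VALUES (1, '2021-06-01', 20.0), (2, '2021-06-01', 35.5),
                              (3, '2021-06-02', 12.0);

    -- Data model: one row per day, optimized for reporting queries.
    CREATE TABLE daily_orders AS
    SELECT order_date, COUNT(*) AS order_count, SUM(amount) AS revenue
    FROM orders
    GROUP BY order_date;
""")

print(conn.execute("SELECT * FROM daily_orders ORDER BY order_date;").fetchall())
```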

Quote

For example, for the sake of protecting personally identifiable information (PII) it may be desirable to turn an email address into a hashed value that is stored in the final destination. Such a transformation is usually performed during the ingestion process.
- Location 304
-
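
A minimal sketch of that kind of ingestion-time transform (the salt value is hypothetical; in practice it would come from a secrets store, not the code):

```python
import hashlib

SALT = "load-from-a-secrets-store"  # hypothetical; never hardcode in a real pipeline


def hash_email(email: str) -> str:
    """Replace an email address with a consistent, non-reversible value."""
    normalized = email.strip().lower()
    return hashlib.sha256((SALT + normalized).encode("utf-8")).hexdigest()


print(hash_email("Jane.Doe@example.com"))  # same input always yields the same hash
```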

Workflow Orchestration Platforms

Quote

it’s important to introduce a workflow orchestration platform to your data infrastructure. These platforms manage the scheduling and flow of tasks in a pipeline. Imagine a pipeline with a dozen tasks ranging from data ingestions written in Python to data transformations written in SQL that must run in a particular sequence throughout the day. It’s not a simple challenge to schedule and manage dependencies between each task.
- Location 318
-
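
Apache Airflow is one widely used example of such a platform; a minimal sketch of a DAG with two dependent tasks (the task logic and names are placeholders):

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator


def extract():
    print("extract data from the source")          # placeholder task logic


def transform():
    print("run SQL transforms in the warehouse")    # placeholder task logic


with DAG(
    dag_id="example_pipeline",
    start_date=datetime(2021, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    extract_task = PythonOperator(task_id="extract", python_callable=extract)
    transform_task = PythonOperator(task_id="transform", python_callable=transform)

    extract_task >> transform_task  # transform runs only after extract succeeds
```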

Quote

Workflow orchestration platforms are also referred to as workflow management systems (WMSs), orchestration platforms, or orchestration frameworks.
- Location 323
-

Directed Acyclic Graphs

Customizing Your Data Infrastructure

Quote

What’s important is understanding your constraints (dollars, engineering resources, security, and legal risk tolerance) and the resulting trade-offs.
- Location 357
-

Chapter 3. Common Data Pipeline Patterns

ETL and ELT

Quote

The extract step gathers data from various sources in preparation for loading and transforming.
- Location 377
-

Quote

The load step brings either the raw data (in the case of ELT) or the fully transformed data (in the case of ETL) into the final destination. Either way, the end result is loading data into the data warehouse, data lake, or other destination.
- Location 379
-

Quote

The transform step is where the raw data from each source system is combined and formatted in such a way that it’s useful to analysts, visualization tools, or whatever use case your pipeline is serving.
- Location 382
-

Quote

The combination of the extraction and loading steps is often referred to as data ingestion. Especially in ELT and the EtLT subpattern (note the lowercase t), which is defined later in this chapter, extraction and loading capabilities are often tightly coupled and packaged together in software frameworks. When designing pipelines, however, it is still best to consider the two steps as separate due to the complexity of coordinating extracts and loads across different systems and infrastructure.
- Location 386
-
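
The difference between the patterns is largely where the transform runs and in what order the steps happen; a schematic sketch with placeholder function bodies (not the book's implementation):

```python
def extract():        # pull data out of source systems
    return [{"order_id": 1, "amount": 20.0}]


def load(rows):       # write rows to the warehouse or data lake
    print(f"loaded {len(rows)} rows")
    return rows


def transform(rows):  # reshape for analysis (in ELT this runs inside the warehouse)
    return [{**r, "amount_cents": int(r["amount"] * 100)} for r in rows]


# ETL: transform before loading, typically on a separate processing system.
load(transform(extract()))

# ELT: load raw data first, then transform inside the destination warehouse.
transform(load(extract()))
```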

The Emergence of ELT over ETL

Quote

The majority of today’s data warehouses are built on highly scalable, columnar databases that can both store and run bulk transforms on large datasets in a cost-effective manner.
- Location 406
-

Quote

Each row of the database is stored together on disk, in one or more blocks depending on the size of each record. If a record is smaller than a single block or not cleanly divisible by the block size, it leaves some disk space unused.
- Location 412
-
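
Following the book's simplified picture of one record per block, a quick back-of-the-envelope calculation of the unused space (block and record sizes are made-up numbers):

```python
import math

block_size_kb = 4       # hypothetical block size
record_size_kb = 2.5    # hypothetical average record size

blocks_per_record = math.ceil(record_size_kb / block_size_kb)
wasted_kb = blocks_per_record * block_size_kb - record_size_kb
print(f"{wasted_kb} KB unused per record")  # 1.5 KB of each 4 KB block sits empty
```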

Quote

Figure 3-1. A table stored in a row-based storage database. Each block contains a record (row) from the table.
- Location 414
-