---
dg-publish: true
created: 2024-07-01
modified: 2024-07-01
title: Data Pipelines Pocket Reference
source: kindle
---
Data Pipelines Pocket Reference
@tags:: #lit/📚book/highlights
@links::
@ref:: Data Pipelines Pocket Reference
@author:: James Densmore
=this.file.name
Reference
=this.ref
Notes
Chapter 1. Introduction to Data Pipelines
What Are Data Pipelines?
Data pipelines are sets of processes that move and transform data from various sources to a destination where new value can be derived.
- Location 90
-
In practice, however, pipelines typically consist of multiple steps including data extraction, data preprocessing, data validation, and at times training or running a machine learning model before delivering data to its final destination.
- Location 94
-
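A minimal sketch of that multi-step shape (extract, preprocess, validate, deliver); the file paths, field names, and function bodies are placeholders, not the book's example:

```python
# Hypothetical sketch of the extract -> preprocess -> validate -> load shape
# described in the highlight above. Paths and field names are made up.
import json


def extract(path: str) -> list[dict]:
    # Pull raw records from a source (a local JSON file stands in for it).
    with open(path) as f:
        return json.load(f)


def preprocess(records: list[dict]) -> list[dict]:
    # Light cleanup before validation, e.g. dropping empty records.
    return [r for r in records if r]


def validate(records: list[dict]) -> list[dict]:
    # Fail fast if a required field is missing.
    for r in records:
        if "id" not in r:
            raise ValueError(f"record missing 'id': {r}")
    return records


def load(records: list[dict], destination: str) -> None:
    # Deliver to the final destination (a file stands in for a warehouse).
    with open(destination, "w") as f:
        json.dump(records, f)


if __name__ == "__main__":
    load(validate(preprocess(extract("raw_events.json"))), "clean_events.json")
```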
Figure 1-1. A simple pipeline that loads server log data into an S3 Bucket, does some basic processing and structuring, and loads the results into an Amazon Redshift database.
- Location 98
-
Who Builds Data Pipelines?
Data engineers take pride in ensuring the validity and timeliness of the data they deliver. That means testing, alerting, and creating contingency plans for when something goes wrong.
- Location 107
-
SQL and Data Warehousing Fundamentals
Experienced data engineers know how to write high-performance SQL and understand the fundamentals of data warehousing and data modeling.
- Location 113
-
Python and/or Java
Distributed Computing
One popular example of distributed computing in analytics is the Hadoop ecosystem, which includes distributed file storage via Hadoop Distributed File System (HDFS), processing via MapReduce, data analysis via Pig, and more. Apache Spark is another popular distributed processing framework, which is quickly surpassing Hadoop in popularity.
- Location 125
-
Basic System Administration
A data engineer is expected to be proficient on the Linux command line and be able to perform tasks such as analyze application logs, schedule cron jobs, and troubleshoot firewall and other security settings.
- Location 130
-
A Goal-Oriented Mentality
Why Build Data Pipelines?
What most people outside of analytics often fail to appreciate is that to generate what is seen, there's a complex machinery that is unseen. For every dashboard and insight that a data analyst generates and for each predictive model developed by a data scientist, there are data pipelines working behind the scenes.
- Location 143
-
How Are Pipelines Built?
pipelines are not just built; they are monitored, maintained, and extended. Data engineers are tasked with not just delivering data once, but building pipelines and supporting infrastructure that deliver and process it reliably, securely, and on time.
- Location 159
-
Chapter 2. A Modern Data Infrastructure
Diversity of Data Sources
Figure 2-1. The key components of a modern data infrastructure.
- Location 172
-
Source System Ownership
Figure 2-2. A simple pipeline with data from multiple sources loaded into an S3 bucket and then a Redshift database.
- Location 181
-
The term data ingestion refers to extracting data from one source and loading it into another.
- Location 183
-
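A toy illustration of that definition, with a local SQLite database standing in for the source system and a CSV file for the destination; the table and column names are made up:

```python
# Hypothetical ingestion sketch: extract rows from one system (SQLite here,
# standing in for a source database) and load them into another (a CSV file).
import csv
import sqlite3

conn = sqlite3.connect("source.db")  # assumed source database
rows = conn.execute("SELECT id, email, created_at FROM customers").fetchall()
conn.close()

with open("customers_extract.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["id", "email", "created_at"])  # header row
    writer.writerows(rows)
```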
Ingestion Interface and Data Structure
First, what is the interface to the data? Some of the most common include the following:
- A database behind an application, such as a Postgres or MySQL database
- A layer of abstraction on top of a system, such as a REST API
- A stream processing platform such as Apache Kafka
- A shared network file system or cloud storage bucket containing logs, comma-separated value (CSV) files, and other flat files
- A data warehouse or data lake
- Data in HDFS or an HBase database
- Location 197
-
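A hedged sketch of pulling from the REST API style of interface; the endpoint and pagination parameter are hypothetical:

```python
# Hypothetical REST API extraction. The endpoint and "page" parameter are
# made-up examples of the API interface listed above.
import requests


def extract_from_api(base_url: str) -> list[dict]:
    records, page = [], 1
    while True:
        resp = requests.get(base_url, params={"page": page}, timeout=30)
        resp.raise_for_status()
        batch = resp.json()
        if not batch:          # empty page means we've read everything
            break
        records.extend(batch)
        page += 1
    return records


# records = extract_from_api("https://example.com/api/orders")
```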
In addition to the interface, the structure of the data will vary. Here are some common examples:
- JSON from a REST API
- Well-structured data from a MySQL database
- JSON within columns of a MySQL database table
- Semistructured log data
- CSV, fixed-width format (FWF), and other flat file formats
- JSON in flat files
- Stream output from Kafka
- Location 200
-
Semistructured data such as JSON is increasingly common and has the advantage of the structure of attribute-value pairs and nesting of objects. However, unlike a relational database, there is no guarantee that each object in the same dataset will have the same structure.
- Location 206
-
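A small illustration of that caveat, using made-up event records where one object is missing a nested attribute; `dict.get()` with defaults keeps the flattening step from breaking:

```python
# The same dataset can contain objects with different shapes, as noted above.
# Field names are illustrative.
events = [
    {"user": "a1", "device": {"os": "ios", "version": "17.1"}},
    {"user": "b2"},  # no "device" object at all
]

flattened = [
    {
        "user": e.get("user"),
        "device_os": e.get("device", {}).get("os"),            # None when absent
        "device_version": e.get("device", {}).get("version"),  # None when absent
    }
    for e in events
]
```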
Data Volume
Though the design decisions at each step in a pipeline must take data volume into consideration, high volume does not mean high value.
- Location 216
-
Data Cleanliness and Validity
- Location 223
-
There are many common characteristics of "messy data," including, but not limited to, the following:
- Duplicate or ambiguous records
- Orphaned records
- Incomplete or missing records
- Text encoding errors
- Inconsistent formats (for example, phone numbers with or without dashes)
- Mislabeled or unlabeled data
- Location 227
-
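A quick sketch of handling two of those issues (duplicates and inconsistent phone formats), with sample rows invented for illustration:

```python
# Illustrative cleanup for duplicate records and inconsistent phone formats.
records = [
    {"id": 1, "phone": "555-867-5309"},
    {"id": 1, "phone": "5558675309"},      # duplicate id, different formatting
    {"id": 2, "phone": "(555) 111-2222"},
]


def normalize_phone(phone: str) -> str:
    # Strip everything except digits so formats compare equal.
    return "".join(ch for ch in phone if ch.isdigit())


deduped = {}
for r in records:
    r["phone"] = normalize_phone(r["phone"])
    deduped.setdefault(r["id"], r)  # keep the first record seen for each id

clean = list(deduped.values())
```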
There's no magic bullet for ensuring data cleanliness and validity, but in a modern data ecosystem, there are key characteristics and approaches that we'll see throughout this book:
- Assume the worst, expect the best: Pristine datasets only exist in academic literature. Assume your input datasets will contain numerous validity and consistency issues, but build pipelines that identify and cleanse data in the interest of clean output.
- Clean and validate data in the system best suited to do so: There are times when it's better to wait to clean data until later in a pipeline. For example, modern pipelines tend to follow an extract-load-transform (ELT) rather than extract-transform-load (ETL) approach for data warehousing (more in Chapter 3). It's sometimes optimal to load data into a data lake in a fairly raw form and to worry about structuring and cleaning later in the pipeline. In other words, use the right tool for the right job rather than rushing the cleaning and validation processes.
- Validate often: Even if you don't clean up data early in a pipeline, don't wait until the end of the pipeline to validate it. You'll have a much harder time determining where things went wrong. Conversely, don't validate once early in a pipeline and assume all will go well in subsequent steps. Chapter 8 digs deeper into validation.
- Location 231
-
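A minimal sketch of the "validate often" idea: cheap checks that can run after each step rather than only at the end; the thresholds and field names are assumptions:

```python
# Run inexpensive checks after each pipeline step, not just at the end.
def validate_batch(rows: list[dict], expected_min_rows: int = 1) -> None:
    if len(rows) < expected_min_rows:
        raise ValueError(f"expected at least {expected_min_rows} rows, got {len(rows)}")
    null_ids = sum(1 for r in rows if r.get("id") is None)
    if null_ids:
        raise ValueError(f"{null_ids} rows are missing an id")


# validate_batch(extracted_rows)      # right after extraction
# validate_batch(transformed_rows)    # and again after transformation
```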
Latency and Bandwidth of the Source System
Data extraction steps in pipelines must contend with API rate limits, connection time-outs, slow downloads, and source system owners who are unhappy due to strain placed on their systems.
- Location 247
-
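One common way to cope with rate limits and timeouts is retrying with exponential backoff; a sketch using the `requests` library, with a hypothetical URL:

```python
# Retry with exponential backoff so a rate-limited or slow source system
# isn't hammered. The URL is hypothetical.
import time

import requests


def get_with_retries(url: str, max_attempts: int = 5) -> requests.Response:
    for attempt in range(max_attempts):
        try:
            resp = requests.get(url, timeout=30)
            if resp.status_code == 429:      # rate limited by the source
                time.sleep(2 ** attempt)     # back off and try again
                continue
            resp.raise_for_status()
            return resp
        except requests.exceptions.Timeout:
            time.sleep(2 ** attempt)
    raise RuntimeError(f"giving up on {url} after {max_attempts} attempts")
```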
Cloud Data Warehouses and Data Lakes
A data warehouse is a database where data from different systems is stored and modeled to support analysis and other activities related to answering questions with it. Data in a data warehouse is structured and optimized for reporting and analysis queries.
- Location 263
-
A data lake is where data is stored, but without the structure or query optimization of a data warehouse.
- Location 265
-
Data Ingestion Tools
Data Transformation and Modeling Tools
Data transformation: Transforming data is a broad term that is signified by the T in an ETL or ELT process. A transformation can be something as simple as converting a timestamp stored in a table from one time zone to another. It can also be a more complex operation that creates a new metric from multiple source columns that are aggregated and filtered through some business logic.
- Location 293
-
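The time-zone example above, sketched with the standard library (`zoneinfo`, Python 3.9+); the timestamp and target zone are arbitrary:

```python
# Convert a UTC timestamp to another time zone during the transform step.
from datetime import datetime, timezone
from zoneinfo import ZoneInfo

utc_ts = datetime(2024, 7, 1, 14, 30, tzinfo=timezone.utc)
local_ts = utc_ts.astimezone(ZoneInfo("America/New_York"))
print(local_ts.isoformat())  # 2024-07-01T10:30:00-04:00
```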
Data modeling: Data modeling is a more specific type of data transformation. A data model structures and defines data in a format that is understood and optimized for data analysis. A data model is usually represented as one or more tables in a data warehouse.
- Location 298
-
For example, for the sake of protecting personally identifiable information (PII) it may be desirable to turn an email address into a hashed value that is stored in the final destination. Such a transformation is usually performed during the ingestion process.
- Location 304
-
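A sketch of that kind of PII transformation: hashing a normalized email with a salt during ingestion; the salt shown is a placeholder, not a recommended value:

```python
# Hash an email address during ingestion so the raw value never lands in the
# destination. The salt would normally come from secure configuration.
import hashlib


def hash_email(email: str, salt: str = "replace-with-secret-salt") -> str:
    normalized = email.strip().lower()
    return hashlib.sha256((salt + normalized).encode("utf-8")).hexdigest()


# hash_email("Jane.Doe@example.com") -> 64-character hex digest
```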
Workflow Orchestration Platforms
it's important to introduce a workflow orchestration platform to your data infrastructure. These platforms manage the scheduling and flow of tasks in a pipeline. Imagine a pipeline with a dozen tasks ranging from data ingestions written in Python to data transformations written in SQL that must run in a particular sequence throughout the day. It's not a simple challenge to schedule and manage dependencies between each task.
- Location 318
-
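Not a real orchestration platform, but the core idea of dependency-ordered tasks can be sketched with a topological sort over a made-up DAG (`graphlib` is in the standard library from Python 3.9):

```python
# Tasks with dependencies form a DAG; a scheduler runs them in a valid order.
# Task names and dependencies are made up.
from graphlib import TopologicalSorter

dag = {
    "load_to_s3": {"extract_orders", "extract_customers"},  # runs after both extracts
    "load_to_warehouse": {"load_to_s3"},
    "transform_sql_models": {"load_to_warehouse"},
}

for task in TopologicalSorter(dag).static_order():
    print(f"running {task}")  # a real platform would execute, monitor, and retry here
```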
Workflow orchestration platforms are also referred to as workflow management systems (WMSs), orchestration platforms, or orchestration frameworks.
- Location 323
-
Directed Acyclic Graphs
Customizing Your Data Infrastructure
What's important is understanding your constraints (dollars, engineering resources, security, and legal risk tolerance) and the resulting trade-offs.
- Location 357
-
Chapter 3. Common Data Pipeline Patterns
ETL and ELT
The extract step gathers data from various sources in preparation for loading and transforming.
- Location 377
-
The load step brings either the raw data (in the case of ELT) or the fully transformed data (in the case of ETL) into the final destination. Either way, the end result is loading data into the data warehouse, data lake, or other destination.
- Location 379
-
The transform step is where the raw data from each source system is combined and formatted in such a way that it's useful to analysts, visualization tools, or whatever use case your pipeline is serving.
- Location 382
-
The combination of the extraction and loading steps is often referred to as data ingestion. Especially in ELT and the EtLT subpattern (note the lowercase t), which is defined later in this chapter, extraction and loading capabilities are often tightly coupled and packaged together in software frameworks. When designing pipelines, however, it is still best to consider the two steps as separate due to the complexity of coordinating extracts and loads across different systems and infrastructure.
- Location 386
-
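A schematic contrast of the two orderings, with a Python list standing in for the warehouse and all functions as illustrative stubs:

```python
# ETL transforms before loading; ELT loads raw data and transforms in the
# destination. All names here are stubs for illustration only.
def extract(source: list[dict]) -> list[dict]:
    return list(source)


def transform(rows: list[dict]) -> list[dict]:
    return [{**r, "amount_usd": r["amount_cents"] / 100} for r in rows]


def load(rows: list[dict], warehouse: list[dict]) -> None:
    warehouse.extend(rows)


source = [{"order_id": 1, "amount_cents": 1250}]

# ETL: transform before loading into the destination.
etl_warehouse: list[dict] = []
load(transform(extract(source)), etl_warehouse)

# ELT: load raw data first, then transform inside the destination.
elt_warehouse: list[dict] = []
load(extract(source), elt_warehouse)
elt_warehouse[:] = transform(elt_warehouse)  # stands in for in-warehouse SQL
```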
The Emergence of ELT over ETL
The majority of today's data warehouses are built on highly scalable, columnar databases that can both store and run bulk transforms on large datasets in a cost-effective manner.
- Location 406
-
Each row of the database is stored together on disk, in one or more blocks depending on the size of each record. If a record is smaller than a single block or not cleanly divisible by the block size, it leaves some disk space unused.
- Location 412
-
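A back-of-the-envelope illustration of that unused-space point, assuming a 4 KB block and a made-up record size:

```python
# How much of a block goes unused when fixed-size records don't divide it evenly.
BLOCK_SIZE = 4096   # bytes; a common block size (assumption)
record_size = 1500  # bytes per row (made-up)

records_per_block = BLOCK_SIZE // record_size                 # 2
wasted_per_block = BLOCK_SIZE - records_per_block * record_size
print(f"{records_per_block} records per block, {wasted_per_block} bytes unused")
# -> 2 records per block, 1096 bytes unused
```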
Figure 3-1. A table stored in a row-based storage database. Each block contains a record (row) from the table.
- Location 414
-