dg-publish: true
created: 2024-07-01
modified: 2024-07-01
title: Data Pipelines Pocket Reference
source: reader
Data Pipelines Pocket Reference
@tags:: #lit/book/highlights
@links::
@ref:: Data Pipelines Pocket Reference
@author:: James Densmore
=this.file.name
Reference
=this.ref
Notes
Chapter 1. Introduction to Data Pipelines
What Are Data Pipelines?
Data pipelines are sets of processes that move and transform data from various sources to a destination where new value can be derived.
- View Highlight
-
In practice, however, pipelines typically consist of multiple steps including data extraction, data preprocessing, data validation, and at times training or running a machine learning model before delivering data to its final destination.
- View Highlight
-
Figure 1-1. A simple pipeline that loads server log data into an S3 Bucket, does some basic processing and structuring, and loads the results into an Amazon Redshift database.
- View Highlight
-
Who Builds Data Pipelines?
Data engineers take pride in ensuring the validity and timeliness of the data they deliver. That means testing, alerting, and creating contingency plans for when something goes wrong. And yes, something will eventually go wrong!
- View Highlight
-
SQL and Data Warehousing Fundamentals
Experienced data engineers know how to write high-performance SQL and understand the fundamentals of data warehousing and data modeling.
- View Highlight
-
Python and/or Java
Distributed Computing
One popular example of distributed computing in analytics is the Hadoop ecosystem, which includes distributed file storage via Hadoop Distributed File System (HDFS), processing via MapReduce, data analysis via Pig, and more. Apache Spark is another popular distributed processing framework, which is quickly surpassing Hadoop in popularity.
- View Highlight
-
Basic System Administration
A data engineer is expected to be proficient on the Linux command line and be able to perform tasks such as analyze application logs, schedule cron jobs, and troubleshoot firewall and other security settings.
- View Highlight
-
A Goal-Oriented Mentality
Why Build Data Pipelines?
What most people outside of analytics often fail to appreciate is that to generate what is seen, there's a complex machinery that is unseen. For every dashboard and insight that a data analyst generates and for each predictive model developed by a data scientist, there are data pipelines working behind the scenes.
- View Highlight
-
How Are Pipelines Built?
In addition, pipelines are not just built; they are monitored, maintained, and extended. Data engineers are tasked with not just delivering data once, but building pipelines and supporting infrastructure that deliver and process it reliably, securely, and on time.
- View Highlight
-
Chapter 2. A Modern Data Infrastructure
Diversity of Data Sources
Figure 2-1. The key components of a modern data infrastructure.
- View Highlight
-
Source System Ownership
The term data ingestion refers to extracting data from one source and loading it into another.
- View Highlight
-
Ingestion Interface and Data Structure
(highlight:: First, what is the interface to the data? Some of the most common include the following:
• A database behind an application, such as a Postgres or MySQL database
• A layer of abstraction on top of a system such as a REST API
• A stream processing platform such as Apache Kafka
• A shared network file system or cloud storage bucket containing logs, comma-separated value (CSV) files, and other flat files
• A data warehouse or data lake
• Data in HDFS or HBase database
In addition to the interface, the structure of the data will vary. Here are some common examples:
• JSON from a REST API
• Well-structured data from a MySQL database
• JSON within columns of a MySQL database table
• Semistructured log data
• CSV, fixed-width format (FWF), and other flat file formats
• JSON in flat files
• Stream output from Kafka)
- View Highlight
-
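As a concrete illustration of the REST API interface and JSON structure listed in the highlight above, here is a minimal extraction sketch. The endpoint URL, field names, and output path are hypothetical placeholders, not anything defined in the book.

```python
import csv
import requests

# Hypothetical REST API endpoint returning a JSON array of order records.
API_URL = "https://api.example.com/v1/orders"

response = requests.get(API_URL, params={"updated_since": "2024-07-01"})
response.raise_for_status()
orders = response.json()  # a list of attribute-value objects

# Flatten the fields we care about into a CSV flat file for loading later.
with open("orders_extract.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["order_id", "customer_id", "order_total"])
    for order in orders:
        # .get() guards against objects that lack a given attribute,
        # since semistructured JSON offers no schema guarantee.
        writer.writerow([
            order.get("order_id"),
            order.get("customer_id"),
            order.get("order_total"),
        ])
```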
Semistructured data such as JSON is increasingly common and has the advantage of the structure of attribute-value pairs and nesting of objects. However, unlike a relational database, there is no guarantee that each object in the same dataset will have the same structure.
- View Highlight
-
Data Volume
Though the design decisions at each step in a pipeline must take data volume into consideration, high volume does not mean high value.
- View Highlight
-
Data Cleanliness and Validity
(highlight:: There are many common characteristics of "messy data," including, but not limited to, the following:
• Duplicate or ambiguous records
• Orphaned records
• Incomplete or missing records
• Text encoding errors
• Inconsistent formats (for example, phone numbers with or without dashes)
• Mislabeled or unlabeled data)
- View Highlight
-
(highlight:: There's no magic bullet for ensuring data cleanliness and validity, but in a modern data ecosystem, there are key characteristics and approaches that we'll see throughout this book:
Assume the worst, expect the best
Pristine datasets only exist in academic literature. Assume your input datasets will contain numerous validity and consistency issues, but build pipelines that identify and cleanse data in the interest of clean output.
Clean and validate data in the system best suited to do so
There are times when it's better to wait to clean data until later in a pipeline. For example, modern pipelines tend to follow an extract-load-transform (ELT) rather than extract-transform-load (ETL) approach for data warehousing (more in Chapter 3). It's sometimes optimal to load data into a data lake in a fairly raw form and to worry about structuring and cleaning later in the pipeline. In other words, use the right tool for the right job rather than rushing the cleaning and validation processes.
Validate often
Even if you don't clean up data early in a pipeline, don't wait until the end of the pipeline to validate it. You'll have a much harder time determining where things went wrong. Conversely, don't validate once early in a pipeline and assume all will go well in subsequent steps. Chapter 8 digs deeper into validation.)
- View Highlight
-
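A minimal sketch of the "validate often" idea: run a simple check right after each ingestion instead of waiting until the end of the pipeline. The table name, key column, and the duplicate-key check itself are hypothetical, and the cursor is assumed to be a DB-API cursor for whatever warehouse client you use.

```python
def check_no_duplicate_keys(cursor, table, key_column):
    """Fail fast if an ingestion produced duplicate keys in the target table."""
    # Hypothetical table/column names; adapt to your warehouse schema.
    sql = f"""
        SELECT {key_column}, COUNT(*) AS cnt
        FROM {table}
        GROUP BY {key_column}
        HAVING COUNT(*) > 1;
    """
    cursor.execute(sql)
    duplicates = cursor.fetchall()
    if duplicates:
        raise ValueError(
            f"Validation failed: {len(duplicates)} duplicate "
            f"{key_column} values in {table}"
        )

# Example usage, assuming `cursor` is a DB-API cursor for the warehouse:
# check_no_duplicate_keys(cursor, "orders_staging", "order_id")
```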
Latency and Bandwidth of the Source System
Data extraction steps in pipelines must contend with API rate limits, connection time-outs, slow downloads, and source system owners who are unhappy due to strain placed on their systems.
- View Highlight
-
Cloud Data Warehouses and Data Lakes
A data warehouse is a database where data from different systems is stored and modeled to support analysis and other activities related to answering questions with it. Data in a data warehouse is structured and optimized for reporting and analysis queries.
- View Highlight
-
A data lake is where data is stored, but without the structure or query optimization of a data warehouse. It will likely contain a high volume of data as well as a variety of data types. For example, a single data lake might contain a collection of blog posts stored as text files, flat file extracts from a relational database, and JSON objects containing events generated by sensors in an industrial system. It can even store structured data like a standard database, though it's not optimized for querying such data in the interest of reporting and analysis.
- View Highlight
-
Data Ingestion Tools
Data Transformation and Modeling Tools
(highlight:: Data transformation
Transforming data is a broad term that is signified by the T in an ETL or ELT process. A transformation can be something as simple as converting a timestamp stored in a table from one time zone to another. It can also be a more complex operation that creates a new metric from multiple source columns that are aggregated and filtered through some business logic.)
- View Highlight
-
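To make the time-zone example above concrete, here is a small sketch of that kind of simple transformation in Python; the column values and zones are illustrative only.

```python
from datetime import datetime
from zoneinfo import ZoneInfo  # standard library, Python 3.9+

def utc_to_local(ts_utc: datetime, tz_name: str = "America/New_York") -> datetime:
    """Convert a UTC timestamp (as stored in a source table) to a local time zone."""
    return ts_utc.replace(tzinfo=ZoneInfo("UTC")).astimezone(ZoneInfo(tz_name))

# Example: a timestamp extracted from a source system, assumed to be UTC.
order_created_utc = datetime(2024, 7, 1, 14, 30, 0)
print(utc_to_local(order_created_utc))  # 2024-07-01 10:30:00-04:00
```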
(highlight:: Data modeling
Data modeling is a more specific type of data transformation. A data model structures and defines data in a format that is understood and optimized for data analysis. A data model is usually represented as one or more tables in a data warehouse. The process of creating data models is discussed in more detail in Chapter 6.)
- View Highlight
-
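A data model of the kind described above often boils down to a SQL statement that builds an analysis-ready table in the warehouse. The table and column names below are invented for illustration; this is just the general shape of a model a transform step might create.

```python
# Hypothetical daily revenue model built from an ingested orders table.
# In practice this SQL would be executed by the transform step of the pipeline.
daily_revenue_model_sql = """
    CREATE TABLE IF NOT EXISTS daily_revenue AS
    SELECT
        CAST(order_created_at AS DATE) AS order_date,
        COUNT(*)                       AS order_count,
        SUM(order_total)               AS total_revenue
    FROM orders
    GROUP BY CAST(order_created_at AS DATE);
"""
```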
For example, for the sake of protecting personally identifiable information (PII) it may be desirable to turn an email address into a hashed value that is stored in the final destination. Such a transformation is usually performed during the ingestion process.
- View Highlight
-
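A minimal sketch of the PII-masking transformation described above, using Python's standard hashlib; the salt handling is simplified for illustration and should come from a secret store in practice.

```python
import hashlib

def hash_email(email: str, salt: str = "replace-with-a-secret-salt") -> str:
    """Return a one-way hash of an email address so no raw PII reaches the warehouse."""
    normalized = email.strip().lower()
    return hashlib.sha256((salt + normalized).encode("utf-8")).hexdigest()

print(hash_email("Jane.Doe@example.com"))
```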
Workflow Orchestration Platforms
it's important to introduce a workflow orchestration platform to your data infrastructure. These platforms manage the scheduling and flow of tasks in a pipeline. Imagine a pipeline with a dozen tasks ranging from data ingestions written in Python to data transformations written in SQL that must run in a particular sequence throughout the day. It's not a simple challenge to schedule and manage dependencies between each task.
- View Highlight
-
Workflow orchestration platforms are also referred to as workflow management systems (WMSs), orchestration platforms, or orchestration frameworks.
- View Highlight
-
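As one example of what such a platform looks like in practice, below is a minimal sketch of an Apache Airflow DAG (Airflow is one popular orchestration framework; the DAG id, task names, and script paths are hypothetical). It schedules a Python-based extract followed by a SQL-based transform once a day, with the dependency between the two tasks declared explicitly.

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator

with DAG(
    dag_id="orders_pipeline",
    schedule_interval="@daily",
    start_date=datetime(2024, 7, 1),
    catchup=False,
) as dag:

    # Hypothetical scripts; each task is one step in the pipeline.
    extract_orders = BashOperator(
        task_id="extract_orders",
        bash_command="python /pipelines/extract_orders.py",
    )

    transform_orders = BashOperator(
        task_id="transform_orders",
        bash_command="python /pipelines/run_transform_sql.py",
    )

    # Declare the dependency: the transform only runs after the extract succeeds.
    extract_orders >> transform_orders
```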
Directed Acyclic Graphs
The inefficient use of disk space due to records leaving empty space in blocks is a reasonable trade-off in this case, as the speed of reading and writing single records frequently is what's most important. However, in analytics the situation is reversed. Instead of the need to read and write small amounts of data frequently, we often read and write a large amount of data infrequently. In addition, it's less likely that an analytical query requires many, or all, of the columns in a table but rather a single column of a table with many columns.
- View Highlight
-
Customizing Your Data Infrastructure
Figure 3-2. A table stored in a column-based storage database. Each disk block contains data from the same column. The two columns involved in our example query are highlighted. Only these blocks must be accessed to run the query. Each block contains data of the same type, making compression optimal
- View Highlight
-
- [note::Columnar databases are more efficient than row-based databases for analytical queries because they access data by column. Since columns have the same data type and analysts frequently request large amounts of data from only a subset of the available columns, storing data as columns facilitates more efficient queries.]
What's important is understanding your constraints (dollars, engineering resources, security, and legal risk tolerance) and the resulting trade-offs.
- View Highlight
-
EtLT Subpattern
Chapter 3. Common Data Pipeline Patterns
(highlight:: Some examples of the type of transformation that fits into the EtLT subpattern include the following:
• Deduplicate records in a table
• Parse URL parameters into individual components
• Mask or otherwise obfuscate sensitive data
These types of transforms are either fully disconnected from business logic or, in the case of something like masking sensitive data, at times required as early in a pipeline as possible for legal or security reasons.)
- View Highlight
-
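As a small illustration of the "parse URL parameters into individual components" transform (the lowercase t in EtLT), here is a sketch using Python's standard library; the URL and parameter names are made up.

```python
from urllib.parse import urlparse, parse_qs

# A hypothetical page-view URL captured in a log record.
url = "https://shop.example.com/products?utm_source=newsletter&utm_campaign=july&item=1234"

parsed = urlparse(url)
params = parse_qs(parsed.query)

# Individual components ready to be loaded as separate columns.
record = {
    "path": parsed.path,
    "utm_source": params.get("utm_source", [None])[0],
    "utm_campaign": params.get("utm_campaign", [None])[0],
    "item_id": params.get("item", [None])[0],
}
print(record)
```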
ETL and ELT
ELT for Data Analysis
The extract step gathers data from various sources in preparation for loading and transforming.
- View Highlight
-
The load step brings either the raw data (in the case of ELT) or the fully transformed data (in the case of ETL) into the final destination. Either way, the end result is loading data into the data warehouse, data lake, or other destination.
- View Highlight
-
The transform step is where the raw data from each source system is combined and formatted in such a way that it's useful to analysts, visualization tools, or whatever use case your pipeline is serving.
- View Highlight
-
The combination of the extraction and loading steps is often referred to as data ingestion. Especially in ELT and the EtLT subpattern (note the lowercase t), which is defined later in this chapter, extraction and loading capabilities are often tightly coupled and packaged together in software frameworks. When designing pipelines, however, it is still best to consider the two steps as separate due to the complexity of coordinating extracts and loads across different systems and infrastructure.
- View Highlight
-
Figure 3-3. The ELT pattern allows for a clean split of responsibilities between data engineers and data analysts (or data scientists). Each role can work autonomously with the tools and languages they are comfortable in.
- View Highlight
-
The Emergence of ELT over ETL
ELT for Data Science
Though data science is a broad field, in general, data scientists will need access to more granular, and at times raw, data than data analysts do. While data analysts build data models that produce metrics and power dashboards, data scientists spend their days exploring data and building predictive models.
- View Highlight
-
The majority of today's data warehouses are built on highly scalable, columnar databases that can both store and run bulk transforms on large datasets in a cost-effective manner.
- View Highlight
-
ELT for Data Products and Machine Learning
(highlight:: Data is used for more than analysis, reporting, and predictive models. It's also used for powering data products. Some common examples of data products include the following:
• A content recommendation engine that powers a video streaming home screen
• A personalized search engine on an e-commerce website
• An application that performs sentiment analysis on user-generated restaurant reviews)
- View Highlight
-
Each row of the database is stored together on disk, in one or more blocks depending on the size of each record. If a record is smaller than a single block or not cleanly divisible by the block size, it leaves some disk space unused.
- View Highlight
-
Figure 3-1. A table stored in a row-based storage database. Each block contains a record (row) from the table.
- View Highlight
-
Steps in a Machine Learning Pipeline
Though the data you ingest may differ, the logic remains primarily the same for pipelines built for analytics as well as ML, but with one additional consideration for ML pipelines. That is, ensuring that the data you ingest is versioned in a way that ML models can later refer to as a specific dataset for training or validation.
- View Highlight
-
(highlight:: Data preprocessing
The data that's ingested is unlikely to be ready to use in ML development. Preprocessing is where data is cleaned and otherwise prepared for models. For example, this is the step in a pipeline where text is tokenized, features are converted to numerical values, and input values are normalized.)
- View Highlight
-
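A bare-bones sketch of the preprocessing ideas mentioned above (tokenizing text and normalizing numeric inputs) using only the standard library; real pipelines typically rely on dedicated libraries for these steps.

```python
import re

def tokenize(text: str) -> list[str]:
    """Lowercase and split review text into word tokens."""
    return re.findall(r"[a-z0-9']+", text.lower())

def min_max_normalize(values: list[float]) -> list[float]:
    """Scale numeric feature values into the 0-1 range."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]

print(tokenize("The pasta was GREAT, the service wasn't."))
print(min_max_normalize([3.0, 10.0, 7.5]))
```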
(highlight:: Model training
After new data is ingested and preprocessed, ML models need to be retrained.)
- View Highlight
-
(highlight:: Model deployment
Deploying models to production can be the most challenging part of going from research-oriented machine learning to a true data product. Here, not only is versioning of datasets necessary, but versioning of trained models is also needed. Often, a REST API is used to allow for querying of a deployed model, and API endpoints for various versions of a model will be used. It's a lot to keep track of and takes coordination between data scientists, machine learning engineers, and data engineers to get to a production state. A well-designed pipeline is key to gluing it together.)
- View Highlight
-
Validating Ingested Data
- View Highlight
-
In pipelines built for data analysts, validation often happens after data ingestion (extract-load) as well as after data modeling (transform). In ML pipelines, validation of the data that's ingested is also important. Don't confuse this critical step with validation of the ML model itself, which is of course a standard part of ML development.
- View Highlight
-
Incorporate Feedback in the Pipeline
Further Reading on ML Pipelines
Building Machine Learning Pipelines by Hannes Hapke and Catherine Nelson (O'Reilly, 2020)
- View Highlight
-
Chapter 4. Data Ingestion: Extracting Data
Setting Up Your Python Environment
Don't Add Your Config Files to a Git Repo!
The safest way to ensure exclusion in the repo is to give your config files an extension like .conf and add a line to your .gitignore file with *.conf.
- View Highlight
-
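A sketch of how this can look in practice, assuming the credentials live in a pipeline.conf file with an INI-style layout (the section and key names here are placeholders): the file is read at runtime with configparser and kept out of version control by a *.conf entry in .gitignore.

```python
import configparser

# pipeline.conf is excluded from the repo via a "*.conf" line in .gitignore.
# Hypothetical example contents (INI format):
#
#   [mysql_source]
#   hostname = my-db-host
#   username = extract_user
#   password = not-in-git
#
parser = configparser.ConfigParser()
parser.read("pipeline.conf")

hostname = parser.get("mysql_source", "hostname")
username = parser.get("mysql_source", "username")
password = parser.get("mysql_source", "password")
```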
Setting Up Cloud File Storage
Every major public cloud provider has a service similar to S3. Equivalents on other public clouds are Azure Storage in Microsoft Azure and Google Cloud Storage (GCS) in GCP.
- View Highlight
-
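For example, uploading an extracted flat file to an S3 bucket with boto3 might look like the following sketch; the bucket name, object key, and credential setup are assumptions, and equivalents exist for Azure Storage and GCS.

```python
import boto3

# Assumes AWS credentials are configured via environment variables,
# an AWS profile, or an attached IAM role.
s3 = boto3.client("s3")

local_file = "orders_extract.csv"
bucket_name = "my-pipeline-bucket"                      # hypothetical bucket
object_key = "extracts/2024-07-01/orders_extract.csv"   # date-stamped key

s3.upload_file(local_file, bucket_name, object_key)
```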
Extracting Data from a MySQL Database
(highlight:: Extracting data from a MySQL database can be done in two ways:
• Full or incremental extraction using SQL
• Binary Log (binlog) replication)
- View Highlight
-
Binary Log replication, though more complex to implement, is better suited to cases where the data volume of changes in source tables is high, or there is a need for more frequent data ingestions from the MySQL source.
- View Highlight
-
Full or Incremental MySQL Table Extraction
- View Highlight
-
In a full extraction, every record in the table is extracted on each run of the extraction job. This is the least complex approach, but for high-volume tables it can take a long time to run.
- View Highlight
-
In an incremental extraction, only records from the source table that have changed or been added since the last run of the job are extracted. The timestamp of the last extraction can either be stored in an extraction job log table in the data warehouse or retrieved by querying the maximum timestamp in a LastUpdated column in the destination table in the warehouse.
- View Highlight
-
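A sketch of what an incremental extraction can look like, assuming pymysql for the MySQL source and psycopg2 for a Postgres-compatible warehouse; the hosts, credentials, and table/column names are all placeholders.

```python
import csv
import psycopg2   # assumed warehouse client (e.g., Redshift/Postgres)
import pymysql    # assumed MySQL client

# 1. Find the high-water mark in the warehouse destination table.
warehouse_conn = psycopg2.connect(
    host="my-warehouse-host", dbname="analytics",
    user="loader", password="not-in-git",
)
warehouse_cursor = warehouse_conn.cursor()
warehouse_cursor.execute("SELECT MAX(LastUpdated) FROM Orders;")
last_updated_warehouse = warehouse_cursor.fetchone()[0]

# 2. Pull only the source rows added or changed since that timestamp.
mysql_conn = pymysql.connect(
    host="my-mysql-host", user="extract_user",
    password="not-in-git", db="shop",
)
mysql_cursor = mysql_conn.cursor()
mysql_cursor.execute(
    "SELECT OrderId, OrderStatus, LastUpdated FROM Orders WHERE LastUpdated > %s;",
    (last_updated_warehouse,),
)

# 3. Write the changed rows to a flat file, ready to load into the warehouse.
with open("orders_incremental.csv", "w", newline="") as f:
    csv.writer(f).writerows(mysql_cursor.fetchall())

mysql_cursor.close()
mysql_conn.close()
```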
Caching Last Updated Dates
Be sure to store the MAX(LastUpdated) value from the destination table in the data warehouse and not the time the extraction job started or finished. Even a small lag in the time logged for job execution could mean missed or duplicated records from the source table in the next run.
- View Highlight
-
- [note::I should fix this in my Readwise Anki script!]
(highlight:: Though incremental extraction is ideal for optimal performance, there are some downsides and reasons why it may not be possible for a given table.
First, with this method, deleted rows are not captured. If a row is deleted from the source MySQL table, you won't know, and it will remain in the destination table as if nothing changed.
Second, the source table must have a reliable timestamp for when it was last updated (the LastUpdated column in the previous example). It's not uncommon for source system tables to be missing such a column or have one that is not updated reliably. There's nothing stopping developers from updating records in the source table and forgetting to update the LastUpdated timestamp.)
- View Highlight
-
(highlight:: When it comes time for the load step, full extracts are usually loaded by first truncating the destination table and loading in the newly extracted data. In that case, you're left with only the latest version of the row in the data warehouse.
When loading data from an incremental extraction, the resulting data is appended to the data in the destination table. In that case, you have both the original record as well as the updated version. Having both can be valuable when it comes time to transform and analyze data)
- View Highlight
-
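A sketch of the two load behaviors described above, expressed as the SQL a load step might issue; the table names and the warehouse-specific load command are placeholders.

```python
# Full extraction: replace the destination table's contents entirely.
full_load_sql = [
    "TRUNCATE TABLE Orders;",
    # Warehouse-specific bulk load (COPY on Redshift/Snowflake, LOAD DATA on MySQL, etc.)
    "COPY Orders FROM 's3://my-pipeline-bucket/orders_full.csv' <credentials and options>;",
]

# Incremental extraction: append the changed rows, keeping prior versions too.
incremental_load_sql = [
    "COPY Orders FROM 's3://my-pipeline-bucket/orders_incremental.csv' <credentials and options>;",
]
```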
Never assume a LastUpdated column in a source system is reliably updated. Check with the owner of the source system and confirm before relying on it for an incremental extraction.
- View Highlight
-
Beware of large extraction jobs, whether full or incremental, putting strain on the source MySQL database, and even blocking production queries from executing. Consult with the owner of the database and consider setting up a replica to extract from, rather than extracting from the primary source database.
- View Highlight
-
Binary Log Replication of MySQL Data
Binlog replication is a form of change data capture (CDC). Many source data stores have some form of CDC that you can use.
- View Highlight
-
The MySQL binlog is a log that keeps a record of every operation performed in the database. For example, depending on how it's configured, it will log the specifics of every table creation or modification, as well as every INSERT, UPDATE, and DELETE operation.
- View Highlight
-
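One way to consume the binlog from Python is the python-mysql-replication package; the following is a hedged sketch under that assumption, with the connection settings and server_id as placeholders.

```python
from pymysqlreplication import BinLogStreamReader
from pymysqlreplication.row_event import (
    DeleteRowsEvent,
    UpdateRowsEvent,
    WriteRowsEvent,
)

mysql_settings = {
    "host": "my-mysql-host",
    "port": 3306,
    "user": "replica_user",
    "passwd": "not-in-git",
}

stream = BinLogStreamReader(
    connection_settings=mysql_settings,
    server_id=100,  # must be unique among replicas of this source
    only_events=[WriteRowsEvent, UpdateRowsEvent, DeleteRowsEvent],
    resume_stream=True,
)

for event in stream:
    for row in event.rows:
        # Each row event corresponds to an INSERT, UPDATE, or DELETE in the binlog.
        print(type(event).__name__, event.table, row)

stream.close()
```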
Consider Using a Prebuilt Framework
Consult with Source System Owners
Extracting Data from a PostgreSQL Database
Full or Incremental Postgres Table Extraction
Replicating Data Using the Write-Ahead Log
Extracting Data from MongoDB
Extracting Data from a REST API
Extracting
- View Highlight
-
Streaming Data Ingestions with Kafka and Debezium
Chapter 5. Data Ingestion: Loading Data
Storing large amounts of data in a cloud storage-based data lake is less expensive than storing it in a warehouse (this is not true for Snowflake data lakes that use the same storage as Snowflake data warehouses). In addition, because it's unstructured or semistructured data (no predefined schema), making changes to the types or properties of data stored is far easier than modifying a warehouse schema. JSON documents are an example of the type of semistructured data that you might encounter in a data lake. If a data structure is frequently changing, you may consider storing it in a data lake, at least for the time being.
- View Highlight
-
Using Your File Storage as a Data Lake
Unlike a data warehouse, a data lake stores data in many formats in a raw and sometimes unstructured form. It's cheaper to store, but is not optimized for querying in the same way that structured data in a warehouse is.
- View Highlight
-
Chapter 6. Transforming Data
Noncontextual Transformations
Deduplicating Records in a Table
(highlight:: it is possible for duplicate records to exist in a table of data that has been ingested into a data warehouse. There are a number of reasons it happens:
• An incremental data ingestion mistakenly overlaps a previous ingestion time window and picks up some records that were already ingested in a previous run.
• Duplicate records were inadvertently created in a source system.
• Data that was backfilled overlapped with subsequent data loaded into the table during ingestion.)
- View Highlight
-
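A common way to handle this in the transform step is a window function that keeps one copy of each duplicated record. Below is a hedged sketch of the SQL, shown as the statement a transform task might run; the table, key, and ordering columns are hypothetical, and syntax details vary by warehouse.

```python
# Keep the most recently ingested copy of each order and drop the rest.
dedupe_sql = """
    CREATE TABLE Orders_deduped AS
    SELECT OrderId, OrderStatus, LastUpdated
    FROM (
        SELECT
            OrderId,
            OrderStatus,
            LastUpdated,
            ROW_NUMBER() OVER (
                PARTITION BY OrderId
                ORDER BY LastUpdated DESC
            ) AS row_rank
        FROM Orders
    ) ranked
    WHERE row_rank = 1;
"""
```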