dg-publish: true
created: 2024-07-01
modified: 2024-07-01
title: Data Pipelines Pocket Reference
source: reader
Data Pipelines Pocket Reference
@tags:: #lit/book/highlights
@links::
@ref:: Data Pipelines Pocket Reference
@author:: James Densmore
=this.file.name
Reference
=this.ref
Notes
Chapter 1. Introduction to Data Pipelines
What Are Data Pipelines?
Data pipelines are sets of processes that move and transform data from various sources to a destination where new value can be derived.
- View Highlight
-
In practice, however, pipelines typically consist of multiple steps including data extraction, data preprocessing, data validation, and at times training or running a machine learning model before delivering data to its final destination.
- View Highlight
-
Figure 1-1. A simple pipeline that loads server log data into an S3 Bucket, does some basic processing and structuring, and loads the results into an Amazon Redshift database.
- View Highlight
-
Who Builds Data Pipelines?
Data engineers take pride in ensuring the validity and timeliness of the data they deliver. That means testing, alerting, and creating contingency plans for when something goes wrong. And yes, something will eventually go wrong!
- View Highlight
-
SQL and Data Warehousing Fundamentals
Experienced data engineers know how to write high-performance SQL and understand the fundamentals of data warehousing and data modeling.
- View Highlight
-
Python and/or Java
Distributed Computing
One popular example of distributed computing in analytics is the Hadoop ecosystem, which includes distributed file storage via Hadoop Distributed File System (HDFS), processing via MapReduce, data analysis via Pig, and more. Apache Spark is another popular distributed processing framework, which is quickly surpassing Hadoop in popularity.
- View Highlight
-
Basic System Administration
A data engineer is expected to be proficient on the Linux command line and be able to perform tasks such as analyze application logs, schedule cron jobs, and troubleshoot firewall and other security settings.
- View Highlight
-
A Goal-Oriented Mentality
Why Build Data Pipelines?
What most people outside of analytics often fail to appreciate is that to generate what is seen, there's a complex machinery that is unseen. For every dashboard and insight that a data analyst generates and for each predictive model developed by a data scientist, there are data pipelines working behind the scenes.
- View Highlight
-
How Are Pipelines Built?
In addition, pipelines are not just built; they are monitored, maintained, and extended. Data engineers are tasked with not just delivering data once, but building pipelines and supporting infrastructure that deliver and process it reliably, securely, and on time.
- View Highlight
-
Chapter 2. A Modern Data Infrastructure
Diversity of Data Sources
Figure 2-1. The key components of a modern data infrastructure.
- View Highlight
-
Source System Ownership
The term data ingestion refers to extracting data from one source and loading it into another.
- View Highlight
-
Ingestion Interface and Data Structure
(highlight:: First, what is the interface to the data? Some of the most common include the following:
• A database behind an application, such as a Postgres or MySQL database
• A layer of abstraction on top of a system such as a REST API
• A stream processing platform such as Apache Kafka
• A shared network file system or cloud storage bucket containing logs, comma-separated value (CSV) files, and other flat files
• A data warehouse or data lake
• Data in HDFS or HBase database
In addition to the interface, the structure of the data will vary. Here are some common examples:
• JSON from a REST API
• Well-structured data from a MySQL database
• JSON within columns of a MySQL database table
• Semistructured log data
• CSV, fixed-width format (FWF), and other flat file formats
• JSON in flat files
• Stream output from Kafka)
- View Highlight
-
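As a concrete illustration of the REST API interface and JSON structure listed in the highlight above, here is a minimal extraction sketch. The endpoint URL, field names, and output path are hypothetical placeholders, not anything defined in the book.

```python
import csv
import requests

# Hypothetical REST API endpoint returning a JSON array of order records.
API_URL = "https://api.example.com/v1/orders"

response = requests.get(API_URL, params={"updated_since": "2024-07-01"})
response.raise_for_status()
orders = response.json()  # a list of attribute-value objects

# Flatten the fields we care about into a CSV flat file for loading later.
with open("orders_extract.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["order_id", "customer_id", "order_total"])
    for order in orders:
        # .get() guards against objects that lack a given attribute,
        # since semistructured JSON offers no schema guarantee.
        writer.writerow([
            order.get("order_id"),
            order.get("customer_id"),
            order.get("order_total"),
        ])
```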
Semistructured data such as JSON is increasingly common and has the advantage of the structure of attribute-value pairs and nesting of objects. However, unlike a relational database, there is no guarantee that each object in the same dataset will have the same structure.
- View Highlight
-
Data Volume
Though the design decisions at each step in a pipeline must take data volume into consideration, high volume does not mean high value.
- View Highlight
-
Data Cleanliness and Validity
(highlight:: There are many common characteristics of "messy data," including, but not limited to, the following:
• Duplicate or ambiguous records
• Orphaned records
• Incomplete or missing records
• Text encoding errors
• Inconsistent formats (for example, phone numbers with or without dashes)
• Mislabeled or unlabeled data)
- View Highlight
-
(highlight:: There's no magic bullet for ensuring data cleanliness and validity, but in a modern data ecosystem, there are key characteristics and approaches that we'll see throughout this book:
Assume the worst, expect the best
Pristine datasets only exist in academic literature. Assume your input datasets will contain numerous validity and consistency issues, but build pipelines that identify and cleanse data in the interest of clean output.
Clean and validate data in the system best suited to do so
There are times when it's better to wait to clean data until later in a pipeline. For example, modern pipelines tend to follow an extract-load-transform (ELT) rather than extract-transform-load (ETL) approach for data warehousing (more in Chapter 3). It's sometimes optimal to load data into a data lake in a fairly raw form and to worry about structuring and cleaning later in the pipeline. In other words, use the right tool for the right job rather than rushing the cleaning and validation processes.
Validate often
Even if you don't clean up data early in a pipeline, don't wait until the end of the pipeline to validate it. You'll have a much harder time determining where things went wrong. Conversely, don't validate once early in a pipeline and assume all will go well in subsequent steps. Chapter 8 digs deeper into validation.)
- View Highlight
-
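A minimal sketch of the "validate often" idea: run a simple check right after each ingestion instead of waiting until the end of the pipeline. The table name, key column, and the duplicate-key check itself are hypothetical, and the cursor is assumed to be a DB-API cursor for whatever warehouse client you use.

```python
def check_no_duplicate_keys(cursor, table, key_column):
    """Fail fast if an ingestion produced duplicate keys in the target table."""
    # Hypothetical table/column names; adapt to your warehouse schema.
    sql = f"""
        SELECT {key_column}, COUNT(*) AS cnt
        FROM {table}
        GROUP BY {key_column}
        HAVING COUNT(*) > 1;
    """
    cursor.execute(sql)
    duplicates = cursor.fetchall()
    if duplicates:
        raise ValueError(
            f"Validation failed: {len(duplicates)} duplicate "
            f"{key_column} values in {table}"
        )

# Example usage, assuming `cursor` is a DB-API cursor for the warehouse:
# check_no_duplicate_keys(cursor, "orders_staging", "order_id")
```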
Latency and Bandwidth of the Source System
Data extraction steps in pipelines must contend with API rate limits, connection time-outs, slow downloads, and source system owners who are unhappy due to strain placed on their systems.
- View Highlight
-
Cloud Data Warehouses and Data Lakes
A data warehouse is a database where data from different systems is stored and modeled to support analysis and other activities related to answering questions with it. Data in a data warehouse is structured and optimized for reporting and analysis queries.
- View Highlight
-
A data lake is where data is stored, but without the structure or query optimization of a data warehouse. It will likely contain a high volume of data as well as a variety of data types. For example, a single data lake might contain a collection of blog posts stored as text files, flat file extracts from a relational database, and JSON objects containing events generated by sensors in an industrial system. It can even store structured data like a standard database, though it's not optimized for querying such data in the interest of reporting and analysis.
- View Highlight
-
Data Ingestion Tools
Data Transformation and Modeling Tools
(highlight:: Data transformation
Transforming data is a broad term that is signified by the T in an ETL or ELT process. A transformation can be something as simple as converting a timestamp stored in a table from one time zone to another. It can also be a more complex operation that creates a new metric from multiple source columns that are aggregated and filtered through some business logic.)
- View Highlight
-
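To make the time-zone example above concrete, here is a small sketch of that kind of simple transformation in Python; the column values and zones are illustrative only.

```python
from datetime import datetime
from zoneinfo import ZoneInfo  # standard library, Python 3.9+

def utc_to_local(ts_utc: datetime, tz_name: str = "America/New_York") -> datetime:
    """Convert a UTC timestamp (as stored in a source table) to a local time zone."""
    return ts_utc.replace(tzinfo=ZoneInfo("UTC")).astimezone(ZoneInfo(tz_name))

# Example: a timestamp extracted from a source system, assumed to be UTC.
order_created_utc = datetime(2024, 7, 1, 14, 30, 0)
print(utc_to_local(order_created_utc))  # 2024-07-01 10:30:00-04:00
```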
(highlight:: Data modeling
Data modeling is a more specific type of data transformation. A data model structures and defines data in a format that is understood and optimized for data analysis. A data model is usually represented as one or more tables in a data warehouse. The process of creating data models is discussed in more detail in Chapter 6.)
- View Highlight
-
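A data model of the kind described above often boils down to a SQL statement that builds an analysis-ready table in the warehouse. The table and column names below are invented for illustration; this is just the general shape of a model a transform step might create.

```python
# Hypothetical daily revenue model built from an ingested orders table.
# In practice this SQL would be executed by the transform step of the pipeline.
daily_revenue_model_sql = """
    CREATE TABLE IF NOT EXISTS daily_revenue AS
    SELECT
        CAST(order_created_at AS DATE) AS order_date,
        COUNT(*)                       AS order_count,
        SUM(order_total)               AS total_revenue
    FROM orders
    GROUP BY CAST(order_created_at AS DATE);
"""
```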
For example, for the sake of protecting personally identifiable information (PII) it may be desirable to turn an email address into a hashed value that is stored in the final destination. Such a transformation is usually performed during the ingestion process.
- View Highlight
-
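A minimal sketch of the PII-masking transformation described above, using Python's standard hashlib; the salt handling is simplified for illustration and should come from a secret store in practice.

```python
import hashlib

def hash_email(email: str, salt: str = "replace-with-a-secret-salt") -> str:
    """Return a one-way hash of an email address so no raw PII reaches the warehouse."""
    normalized = email.strip().lower()
    return hashlib.sha256((salt + normalized).encode("utf-8")).hexdigest()

print(hash_email("Jane.Doe@example.com"))
```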
Workflow Orchestration Platforms
it's important to introduce a workflow orchestration platform to your data infrastructure. These platforms manage the scheduling and flow of tasks in a pipeline. Imagine a pipeline with a dozen tasks ranging from data ingestions written in Python to data transformations written in SQL that must run in a particular sequence throughout the day. It's not a simple challenge to schedule and manage dependencies between each task.
- View Highlight
-
Workflow orchestration platforms are also referred to as workflow management systems (WMSs), orchestration platforms, or orchestration frameworks.
- View Highlight
-
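As one example of what such a platform looks like in practice, below is a minimal sketch of an Apache Airflow DAG (Airflow is one popular orchestration framework; the DAG id, task names, and script paths are hypothetical). It schedules a Python-based extract followed by a SQL-based transform once a day, with the dependency between the two tasks declared explicitly.

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator

with DAG(
    dag_id="orders_pipeline",
    schedule_interval="@daily",
    start_date=datetime(2024, 7, 1),
    catchup=False,
) as dag:

    # Hypothetical scripts; each task is one step in the pipeline.
    extract_orders = BashOperator(
        task_id="extract_orders",
        bash_command="python /pipelines/extract_orders.py",
    )

    transform_orders = BashOperator(
        task_id="transform_orders",
        bash_command="python /pipelines/run_transform_sql.py",
    )

    # Declare the dependency: the transform only runs after the extract succeeds.
    extract_orders >> transform_orders
```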
Directed Acyclic Graphs
The inefficient use of disk space due to records leaving empty space in blocks is a reasonable trade-off in this case, as the speed of reading and writing single records frequently is what's most important. However, in analytics the situation is reversed. Instead of the need to read and write small amounts of data frequently, we often read and write a large amount of data infrequently. In addition, it's less likely that an analytical query requires many, or all, of the columns in a table but rather a single column of a table with many columns.
- View Highlight
-
Customizing Your Data Infrastructure
Figure 3-2. A table stored in a column-based storage database. Each disk block contains data from the same column. The two columns involved in our example query are highlighted. Only these blocks must be accessed to run the query. Each block contains data of the same type, making compression optimal
- View Highlight
-
- [note::Columnar databases are more efficient than row-based databases for analytical queries because they access data by column. Since columns have the same data type and analysts frequently request large amounts of data from only a subset of the available columns, storing data as columns facilitates more efficient queries.]
What's important is understanding your constraints (dollars, engineering resources, security, and legal risk tolerance) and the resulting trade-offs.
- View Highlight
-
EtLT Subpattern
Chapter 3. Common Data Pipeline Patterns
(highlight:: Some examples of the type of transformation that fits into the EtLT subpattern include the following:
• Deduplicate records in a table
• Parse URL parameters into individual components
• Mask or otherwise obfuscate sensitive data
These types of transforms are either fully disconnected from business logic or, in the case of something like masking sensitive data, at times required as early in a pipeline as possible for legal or security reasons.)
- View Highlight
-
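As a small illustration of the "parse URL parameters into individual components" transform (the lowercase t in EtLT), here is a sketch using Python's standard library; the URL and parameter names are made up.

```python
from urllib.parse import urlparse, parse_qs

# A hypothetical page-view URL captured in a log record.
url = "https://shop.example.com/products?utm_source=newsletter&utm_campaign=july&item=1234"

parsed = urlparse(url)
params = parse_qs(parsed.query)

# Individual components ready to be loaded as separate columns.
record = {
    "path": parsed.path,
    "utm_source": params.get("utm_source", [None])[0],
    "utm_campaign": params.get("utm_campaign", [None])[0],
    "item_id": params.get("item", [None])[0],
}
print(record)
```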
ETL and ELT
ELT for Data Analysis
The extract step gathers data from various sources in preparation for loading and transforming.
- View Highlight
-
The load step brings either the raw data (in the case of ELT) or the fully transformed data (in the case of ETL) into the final destination. Either way, the end result is loading data into the data warehouse, data lake, or other destination.
- View Highlight
-
The transform step is where the raw data from each source system is combined and formatted in such a way that it's useful to analysts, visualization tools, or whatever use case your pipeline is serving.
- View Highlight
-
The combination of the extraction and loading steps is often referred to as data ingestion. Especially in ELT and the EtLT subpattern (note the lowercase t), which is defined later in this chapter, extraction and loading capabilities are often tightly coupled and packaged together in software frameworks. When designing pipelines, however, it is still best to consider the two steps as separate due to the complexity of coordinating extracts and loads across different systems and infrastructure.
- View Highlight
-
Figure 3-3. The ELT pattern allows for a clean split of responsibilities between data engineers and data analysts (or data scientists). Each role can work autonomously with the tools and languages they are comfortable in.
- View Highlight
-
The Emergence of ELT over ETL
ELT for Data Science
Though data science is a broad field, in general, data scientists will need access to more granular, and at times raw, data than data analysts do. While data analysts build data models that produce metrics and power dashboards, data scientists spend their days exploring data and building predictive models.
- View Highlight
-
The majority of today's data warehouses are built on highly scalable, columnar databases that can both store and run bulk transforms on large datasets in a cost-effective manner.
- View Highlight
-
ELT for Data Products and Machine Learning
(highlight:: Data is used for more than analysis, reporting, and predictive models. It's also used for powering data products. Some common examples of data products include the following:
• A content recommendation engine that powers a video streaming home screen
• A personalized search engine on an e-commerce website
• An application that performs sentiment analysis on user-generated restaurant reviews)
- View Highlight
-
Each row of the database is stored together on disk, in one or more blocks depending on the size of each record. If a record is smaller than a single block or not cleanly divisible by the block size, it leaves some disk space unused.
- View Highlight
-
Figure 3-1. A table stored in a row-based storage database. Each block contains a record (row) from the table.
- View Highlight
-
Steps in a Machine Learning Pipeline
Though the data you ingest may differ, the logic remains primarily the same for pipelines built for analytics as well as ML, but with one additional consideration for ML pipelines. That is, ensuring that the data you ingest is versioned in a way that ML models can later refer to as a specific dataset for training or validation.
- View Highlight
-
(highlight:: Data preprocessing
The data that's ingested is unlikely to be ready to use in ML development. Preprocessing is where data is cleaned and otherwise prepared for models. For example, this is the step in a pipeline where text is tokenized, features are converted to numerical values, and input values are normalized.)
- View Highlight
-
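A bare-bones sketch of the preprocessing ideas mentioned above (tokenizing text and normalizing numeric inputs) using only the standard library; real pipelines typically rely on dedicated libraries for these steps.

```python
import re

def tokenize(text: str) -> list[str]:
    """Lowercase and split review text into word tokens."""
    return re.findall(r"[a-z0-9']+", text.lower())

def min_max_normalize(values: list[float]) -> list[float]:
    """Scale numeric feature values into the 0-1 range."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]

print(tokenize("The pasta was GREAT, the service wasn't."))
print(min_max_normalize([3.0, 10.0, 7.5]))
```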
(highlight:: Model training
After new data is ingested and preprocessed, ML models need to be retrained.)
- View Highlight
-
(highlight:: Model deployment
Deploying models to production can be the most challenging part of going from research-oriented machine learning to a true data product. Here, not only is versioning of datasets necessary, but versioning of trained models is also needed. Often, a REST API is used to allow for querying of a deployed model, and API endpoints for various versions of a model will be used. It's a lot to keep track of and takes coordination between data scientists, machine learning engineers, and data engineers to get to a production state. A well-designed pipeline is key to gluing it together.)
- View Highlight
-
Validating Ingested Data
- View Highlight
-
In pipelines built for data analysts, validation often happens after data ingestion (extract-load) as well as after data modeling (transform). In ML pipelines, validation of the data that's ingested is also important. Don't confuse this critical step with validation of the ML model itself, which is of course a standard part of ML development.
- View Highlight
-
Incorporate Feedback in the Pipeline
Further Reading on ML Pipelines
Building Machine Learning Pipelines by Hannes Hapke and Catherine Nelson (O'Reilly, 2020)
- View Highlight
-
Chapter 4. Data Ingestion: Extracting Data
Setting Up Your Python Environment
Don't Add Your Config Files to a Git Repo!
The safest way to ensure exclusion in the repo is to give your config files an extension like .conf and add a line to your .gitignore file with *.conf.
- View Highlight
-
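A sketch of how this can look in practice, assuming the credentials live in a pipeline.conf file with an INI-style layout (the section and key names here are placeholders): the file is read at runtime with configparser and kept out of version control by a *.conf entry in .gitignore.

```python
import configparser

# pipeline.conf is excluded from the repo via a "*.conf" line in .gitignore.
# Hypothetical example contents (INI format):
#
#   [mysql_source]
#   hostname = my-db-host
#   username = extract_user
#   password = not-in-git
#
parser = configparser.ConfigParser()
parser.read("pipeline.conf")

hostname = parser.get("mysql_source", "hostname")
username = parser.get("mysql_source", "username")
password = parser.get("mysql_source", "password")
```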
Setting Up Cloud File Storage
Every major public cloud provider has a service similar to S3. Equivalents on other public clouds are Azure Storage in Microsoft Azure and Google Cloud Storage (GCS) in GCP.
- View Highlight
-
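For example, uploading an extracted flat file to an S3 bucket with boto3 might look like the following sketch; the bucket name, object key, and credential setup are assumptions, and equivalents exist for Azure Storage and GCS.

```python
import boto3

# Assumes AWS credentials are configured via environment variables,
# an AWS profile, or an attached IAM role.
s3 = boto3.client("s3")

local_file = "orders_extract.csv"
bucket_name = "my-pipeline-bucket"                      # hypothetical bucket
object_key = "extracts/2024-07-01/orders_extract.csv"   # date-stamped key

s3.upload_file(local_file, bucket_name, object_key)
```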
Extracting Data from a MySQL Database
(highlight:: Extracting data from a MySQL database can be done in two ways:
• Full or incremental extraction using SQL
• Binary Log (binlog) replication)
- View Highlight
-
Binary Log replication, though more complex to implement, is better suited to cases where the data volume of changes in source tables is high, or there is a need for more frequent data ingestions from the MySQL source.
- View Highlight
-
Full or Incremental MySQL Table Extraction
- View Highlight
-
In a full extraction, every record in the table is extracted on each run of the extraction job. This is the least complex approach, but for high-volume tables it can take a long time to run.
- View Highlight
-
In an incremental extraction, only records from the source table that have changed or been added since the last run of the job are extracted. The timestamp of the last extraction can either be stored in an extraction job log table in the data warehouse or retrieved by querying the maximum timestamp in a LastUpdated column in the destination table in the warehouse.
- View Highlight
-
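A sketch of what an incremental extraction can look like, assuming pymysql for the MySQL source and psycopg2 for a Postgres-compatible warehouse; the hosts, credentials, and table/column names are all placeholders.

```python
import csv
import psycopg2   # assumed warehouse client (e.g., Redshift/Postgres)
import pymysql    # assumed MySQL client

# 1. Find the high-water mark in the warehouse destination table.
warehouse_conn = psycopg2.connect(
    host="my-warehouse-host", dbname="analytics",
    user="loader", password="not-in-git",
)
warehouse_cursor = warehouse_conn.cursor()
warehouse_cursor.execute("SELECT MAX(LastUpdated) FROM Orders;")
last_updated_warehouse = warehouse_cursor.fetchone()[0]

# 2. Pull only the source rows added or changed since that timestamp.
mysql_conn = pymysql.connect(
    host="my-mysql-host", user="extract_user",
    password="not-in-git", db="shop",
)
mysql_cursor = mysql_conn.cursor()
mysql_cursor.execute(
    "SELECT OrderId, OrderStatus, LastUpdated FROM Orders WHERE LastUpdated > %s;",
    (last_updated_warehouse,),
)

# 3. Write the changed rows to a flat file, ready to load into the warehouse.
with open("orders_incremental.csv", "w", newline="") as f:
    csv.writer(f).writerows(mysql_cursor.fetchall())

mysql_cursor.close()
mysql_conn.close()
```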
Caching Last Updated Dates
Be sure to store the MAX(LastUpdated) value from the destination table in the data warehouse and not the time the extraction job started or finished. Even a small lag in the time logged for job execution could mean missed or duplicated records from the source table in the next run.
- View Highlight
-
- [note::I should fix this in my Readwise Anki script!]
(highlight:: Though incremental extraction is ideal for optimal performance, there are some downsides and reasons why it may not be possible for a given table.
First, with this method, deleted rows are not captured. If a row is deleted from the source MySQL table, you won't know, and it will remain in the destination table as if nothing changed.
Second, the source table must have a reliable timestamp for when it was last updated (the LastUpdated column in the previous example). It's not uncommon for source system tables to be missing such a column or have one that is not updated reliably. There's nothing stopping developers from updating records in the source table and forgetting to update the LastUpdated timestamp.)
- View Highlight
-
(highlight:: When it comes time for the load step, full extracts are usually loaded by first truncating the destination table and loading in the newly extracted data. In that case, you're left with only the latest version of the row in the data warehouse.
When loading data from an incremental extraction, the resulting data is appended to the data in the destination table. In that case, you have both the original record as well as the updated version. Having both can be valuable when it comes time to transform and analyze data)
- View Highlight
-
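A sketch of the two load behaviors described above, expressed as the SQL a load step might issue; the table names and the warehouse-specific load command are placeholders.

```python
# Full extraction: replace the destination table's contents entirely.
full_load_sql = [
    "TRUNCATE TABLE Orders;",
    # Warehouse-specific bulk load (COPY on Redshift/Snowflake, LOAD DATA on MySQL, etc.)
    "COPY Orders FROM 's3://my-pipeline-bucket/orders_full.csv' <credentials and options>;",
]

# Incremental extraction: append the changed rows, keeping prior versions too.
incremental_load_sql = [
    "COPY Orders FROM 's3://my-pipeline-bucket/orders_incremental.csv' <credentials and options>;",
]
```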
Never assume a LastUpdated column in a source system is reliably updated. Check with the owner of the source system and confirm before relying on it for an incremental extraction.
- View Highlight
-
Beware of large extraction jobs, whether full or incremental, putting strain on the source MySQL database, and even blocking production queries from executing. Consult with the owner of the database and consider setting up a replica to extract from, rather than extracting from the primary source database.
- View Highlight
-
Binary Log Replication of MySQL Data
Binlog replication is a form of change data capture (CDC). Many source data stores have some form of CDC that you can use.
- View Highlight
-
The MySQL binlog is a log that keeps a record of every operation performed in the database. For example, depending on how it's configured, it will log the specifics of every table creation or modification, as well as every INSERT, UPDATE, and DELETE operation.
- View Highlight
-
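One way to consume the binlog from Python is the python-mysql-replication package; the following is a hedged sketch under that assumption, with the connection settings and server_id as placeholders.

```python
from pymysqlreplication import BinLogStreamReader
from pymysqlreplication.row_event import (
    DeleteRowsEvent,
    UpdateRowsEvent,
    WriteRowsEvent,
)

mysql_settings = {
    "host": "my-mysql-host",
    "port": 3306,
    "user": "replica_user",
    "passwd": "not-in-git",
}

stream = BinLogStreamReader(
    connection_settings=mysql_settings,
    server_id=100,  # must be unique among replicas of this source
    only_events=[WriteRowsEvent, UpdateRowsEvent, DeleteRowsEvent],
    resume_stream=True,
)

for event in stream:
    for row in event.rows:
        # Each row event corresponds to an INSERT, UPDATE, or DELETE in the binlog.
        print(type(event).__name__, event.table, row)

stream.close()
```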
Consider Using a Prebuilt Framework
Consult with Source System Owners
Extracting Data from a PostgreSQL Database
Full or Incremental Postgres Table Extraction
Replicating Data Using the Write-Ahead Log
Extracting Data from MongoDB
Extracting Data from a REST API
Extracting
- View Highlight
-
Streaming Data Ingestions with Kafka and Debezium
Chapter 5. Data Ingestion: Loading Data
Storing large amounts of data in a cloud storage-based data lake is less expensive than storing it in a warehouse (this is not true for Snowflake data lakes that use the same storage as Snowflake data warehouses). In addition, because it's unstructured or semistructured data (no predefined schema), making changes to the types or properties of data stored is far easier than modifying a warehouse schema. JSON documents are an example of the type of semistructured data that you might encounter in a data lake. If a data structure is frequently changing, you may consider storing it in a data lake, at least for the time being.
- View Highlight
-
Using Your File Storage as a Data Lake
Unlike a data warehouse, a data lake stores data in many formats in a raw and sometimes unstructured form. It's cheaper to store, but is not optimized for querying in the same way that structured data in a warehouse is.
- View Highlight
-
Chapter 6. Transforming Data
Noncontextual Transformations
Deduplicating Records in a Table
(highlight:: it is possible for duplicate records to exist in a table of data that has been ingested into a data warehouse. There are a number of reasons it happens:
• An incremental data ingestion mistakenly overlaps a previous ingestion time window and picks up some records that were already ingested in a previous run.
• Duplicate records were inadvertently created in a source system.
• Data that was backfilled overlapped with subsequent data loaded into the table during ingestion.)
- View Highlight
-
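A common way to handle this in the transform step is a window function that keeps one copy of each duplicated record. Below is a hedged sketch of the SQL, shown as the statement a transform task might run; the table, key, and ordering columns are hypothetical, and syntax details vary by warehouse.

```python
# Keep the most recently ingested copy of each order and drop the rest.
dedupe_sql = """
    CREATE TABLE Orders_deduped AS
    SELECT OrderId, OrderStatus, LastUpdated
    FROM (
        SELECT
            OrderId,
            OrderStatus,
            LastUpdated,
            ROW_NUMBER() OVER (
                PARTITION BY OrderId
                ORDER BY LastUpdated DESC
            ) AS row_rank
        FROM Orders
    ) ranked
    WHERE row_rank = 1;
"""
```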