Fundamentals of Data Engineering

@tags:: #lit/📚book/highlights
@links:: data engineering, data governance, data management, data orchestration,
@ref:: Fundamentals of Data Engineering
@author:: Joe Reis and Matt Housley

=this.file.name

Book cover of "Fundamentals of Data Engineering"

Reference

Notes

Quote

What This Book Isn’t
- Location 104
-

Quote

What This Book Is About
- LocationĀ 109
-

Quote

Who Should Read This Book
- LocationĀ 128
-

Quote

Prerequisites
- LocationĀ 138
-

Quote

What You’ll Learn and How It Will Improve Your Abilities
- Location 154
-

Quote

Navigating This Book
- LocationĀ 166
-

Quote

Conventions Used in This Book
- LocationĀ 199
-

Quote

Foundation and Building Blocks
- LocationĀ 250
-

Quote

Data Engineering Described
- LocationĀ 252
-

Quote

What Is Data Engineering?
- LocationĀ 256
-

Quote

Data Engineering Defined
- LocationĀ 287
-

Quote

a data engineer gets data, stores it, and prepares it for consumption by data scientists, analysts, and others.
- LocationĀ 288
-

Quote

Data engineering is the development, implementation, and maintenance of systems and processes that take in raw data and produce high-quality, consistent information that supports downstream use cases, such as analysis and machine learning. Data engineering is the intersection of security, data management, DataOps, data architecture, orchestration, and software engineering. A data engineer manages the data engineering lifecycle, beginning with getting data from source systems and ending with serving data for use cases, such as analysis or machine learning.
- Location 291
-
- [note::It'd be fun to make a Venn diagram of data engineering]

Quote

The Data Engineering Lifecycle
- LocationĀ 297
-

Quote

Figure 1-1. The data engineering lifecycle
- LocationĀ 301
-

Quote

Evolution of the Data Engineer
- LocationĀ 313
-

Quote

The term big data is essentially a relic to describe a particular time and approach to handling large amounts of data.
- LocationĀ 398
-

Quote

big data processing has become so accessible that it no longer merits a separate term; every company aims to solve its data problems, regardless of actual data size. Big data engineers are now simply data engineers.
- LocationĀ 400
-

Quote

Data engineering is increasingly a discipline of interoperation, and connecting various technologies like LEGO bricks, to serve ultimate business goals.
- LocationĀ 411
-

Quote

Figure 1-3. Matt Turck’s Data Landscape in 2012 versus 2021
- Location 413
-

Quote

As tools and workflows simplify, we’ve seen a noticeable shift in the attitudes of data engineers. Instead of focusing on who has the “biggest data,” open source projects and services are increasingly concerned with managing and governing data, making it easier to use and discover, and improving its quality.
- Location 421
-

Quote

Data Engineering and Data Science
- LocationĀ 434
-

Quote

Figure 1-4. Data engineering sits upstream from data science
- LocationĀ 440
-

Quote

Figure 1-5. The Data Science Hierarchy of Needs
- LocationĀ 446
-

Quote

Rogati argues that companies need to build a solid data foundation (the bottom three levels of the hierarchy) before tackling areas such as AI and ML.
- LocationĀ 450
- artificial intelligence (ai), data hierarchy of needs, machine learning (ml),
- [note::Hierarchy = Data Science Hierarchy of Needs]

Quote

In an ideal world, data scientists should spend more than 90% of their time focused on the top layers of the pyramid: analytics, experimentation, and ML. When data engineers focus on these bottom parts of the hierarchy, they build a solid foundation for data scientists to succeed.
- LocationĀ 452
-

Quote

Data Engineering Skills and Activities
- LocationĀ 461
-

Quote

a data engineer juggles a lot of complex moving parts and must constantly optimize along the axes of cost, agility, scalability, simplicity, reuse, and interoperability
- LocationĀ 467
-

Quote

data engineers are now focused on balancing the simplest and most cost-effective, best-of-breed services that deliver value to the business.
- LocationĀ 476
-

Quote

A data engineer typically does not directly build ML models, create reports or dashboards, perform data analysis, build key performance indicators (KPIs), or develop software applications. A data engineer should have a good functioning understanding of these areas to serve stakeholders best.
- LocationĀ 478
-

Quote

Data Maturity and the Data Engineer
- LocationĀ 481
-

Quote

Data maturity is the progression toward higher data utilization, capabilities, and integration across the organization,
- LocationĀ 484
-

Quote

data maturity does not simply depend on the age or revenue of a company. An early-stage startup can have greater data maturity than a 100-year-old company with annual revenues in the billions. What matters is the way data is leveraged as a competitive advantage.
- LocationĀ 488
-

Quote

Stage 1: Starting with data
- LocationĀ 496
-

Quote

A data engineer should focus on the following in organizations getting started with data:
- Get buy-in from key stakeholders, including executive management. Ideally, the data engineer should have a sponsor for critical initiatives to design and build a data architecture to support the company’s goals.
- Define the right data architecture (usually solo, since a data architect likely isn’t available). This means determining business goals and the competitive advantage you’re aiming to achieve with your data initiative. Work toward a data architecture that supports these goals. See Chapter 3 for our advice on “good” data architecture.
- Identify and audit data that will support key initiatives and operate within the data architecture you designed.
- Build a solid data foundation for future data analysts and data scientists to generate reports and models that provide competitive value. In the meantime, you may also have to generate these reports and models until this team is hired.
- Location 508
-

Quote

This is a delicate stage with lots of pitfalls. Here are some tips for this stage:
- Organizational willpower may wane if a lot of visible successes don’t occur with data. Getting quick wins will establish the importance of data within the organization. Just keep in mind that quick wins will likely create technical debt. Have a plan to reduce this debt, as it will otherwise add friction for future delivery.
- Get out and talk to people, and avoid working in silos. We often see the data team working in a bubble, not communicating with people outside their departments and getting perspectives and feedback from business stakeholders. The danger is you’ll spend a lot of time working on things of little use to people.
- Avoid undifferentiated heavy lifting. Don’t box yourself in with unnecessary technical complexity. Use off-the-shelf, turnkey solutions wherever possible. Build custom solutions and code only where this creates a competitive advantage.
- Location 516
- data culture,
- [note::This stage = "Getting started with data"]

Quote

Stage 2: Scaling…
- Location 525
-

Quote

In organizations that are in stage 2 of data maturity, a data engineer’s goals are to do the following:
- Establish formal data practices
- Create scalable and robust data architectures
- Adopt DevOps and DataOps practices
- Build systems that support ML
- Continue to avoid…
- Location 528
-

Quote

Issues to watch out for include the following:
- As we grow more sophisticated with data, there’s a temptation to adopt bleeding-edge technologies based on social proof from Silicon Valley companies. This is rarely a good use of your time and energy. Any technology decisions should be driven by the value they’ll deliver to your customers.
- The main bottleneck for scaling is not cluster nodes, storage, or technology but the data engineering team. Focus on solutions that are simple to deploy and manage to expand your team’s throughput.
- You’ll be tempted to frame yourself as a technologist, a data genius who can deliver magical products. Shift your focus instead to pragmatic leadership…
- Location 533
-

Quote

Stage 3: Leading…
- Location 541
-

Quote

In organizations in stage 3 of data maturity, a data engineer will continue building on prior stages, plus they will do the following:
- Create automation for the seamless introduction and usage of new data
- Focus on building custom tools and systems that leverage data as a competitive advantage
- Focus on the “enterprisey” aspects of data, such as data management (including data governance and quality) and DataOps
- Deploy tools that expose and disseminate data throughout the organization, including data catalogs, data lineage tools, and metadata management systems
- Collaborate efficiently with software…
- Location 545
-

Quote

Issues to watch out for include the following:
- At this stage, complacency is a significant danger. Once organizations reach stage 3, they must constantly focus on maintenance and improvement or risk falling back to a lower stage.
- Technology distractions are a more significant danger here than in the other stages. There’s a temptation to pursue expensive hobby projects that don’t deliver…
- Location 552
-
- [note::"Utilize custom-built technology only where it provides a competitive advantage" is a recurring theme in each data maturity stage (1-3)]

Quote

The Background and Skills of a…
- Location 560
-

Quote

If you’re pivoting your career into data engineering, we’ve found that the transition is easiest when moving from an adjacent field, such as software engineering, ETL development, database administration, data science, or data analysis.
- Location 568
- career pivot, career, data engineering,

Quote

Zooming out, a data engineer must also understand the requirements of data consumers (data analysts and data scientists) and the broader implications of data across the organization. Data engineering is a holistic practice; the best data engineers view their responsibilities through business and technical lenses.
- LocationĀ 575
- data engineering, ankified,

Quote

Business Responsibilities
- LocationĀ 577
-

Quote

Know how to communicate with nontechnical and technical people.
- LocationĀ 581
-

Quote

We suggest paying close attention to organizational hierarchies, who reports to whom, how people interact, and which silos exist. These observations will be invaluable to your success.
- LocationĀ 582
- stakeholder mapping,
- [note::For any problem you're trying to solve, it helps to map the stakeholders who can influence or are influenced by your progress towards a solution.]

Quote

Understand how to scope and gather business and product requirements.
- LocationĀ 584
-

Quote

Data engineering is a holistic practice; the best data engineers view their responsibilities through business and technical lenses.
- LocationĀ 585
- data engineering,
- [note::Emphasis on and]

Quote

Many technologists mistakenly believe these practices are solved through technology. We feel this is dangerously wrong. Agile, DevOps, and DataOps are fundamentally cultural, requiring buy-in across the organization.
- LocationĀ 586
- work culture, devops, agile, dataops,

Quote

Know how to optimize for time to value, the total cost of ownership, and opportunity cost. Learn to monitor costs to avoid surprises.
- LocationĀ 589
-

Quote

success or failure is rarely a technology issue.
- LocationĀ 595
-
- [note::Stakeholder communication is paramount]

Quote

Technical Responsibilities
- LocationĀ 598
-

Quote

data engineers now focus on high-level abstractions or writing pipelines as code within an orchestration framework.
- LocationĀ 614
-

Quote

a data engineer who can’t write production-grade code will be severely hindered, and we don’t see this changing anytime soon.
- Location 617
-

Quote

At the time of this writing, the primary languages of data engineering are SQL, Python, a Java Virtual Machine (JVM) language (usually Java or Scala), and bash:
- LocationĀ 620
-

Quote

Understanding Java or Scala will be beneficial if youā€™re using a popular open source data framework.
- LocationĀ 633
-

Quote

Even today, data engineers frequently use command-line tools like awk or sed to process files in a data pipeline or call bash commands from orchestration frameworks. If you’re using Windows, feel free to substitute PowerShell for bash.
- Location 636
-
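As a concrete sketch of calling a bash command from an orchestration framework, the following Python task wrapper runs a shell command and captures its output; the `sed` invocation and filename in the comment are hypothetical examples, not from the book.

```python
import subprocess

def run_shell_step(command: str) -> str:
    """Run one shell command as a pipeline step and return its stdout.

    Orchestration frameworks commonly wrap shell calls in a small task
    like this; check=True fails the step loudly if the command errors.
    """
    result = subprocess.run(
        command, shell=True, capture_output=True, text=True, check=True
    )
    return result.stdout

# Hypothetical usage: normalize delimiters with sed before loading.
# run_shell_step("sed 's/;/,/g' raw_events.csv")
```

In a real orchestrator, the equivalent construct would be a bash operator or task, but the idea is the same: shell tools remain first-class citizens in pipelines.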

Quote

Data engineers also do well to develop expertise in composing SQL with other operations, either within frameworks such as Spark and Flink or by using orchestration to combine multiple tools. Data engineers should also learn modern SQL semantics for dealing with JavaScript Object Notation (JSON) parsing and nested data and consider leveraging a SQL management framework such as dbt (Data Build Tool).
- LocationĀ 645
- career capital, skills, sql,
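To make the JSON-parsing point concrete, here is a minimal sketch using SQLite's built-in `json_extract` via Python's standard library; warehouses such as Snowflake or BigQuery each have their own JSON syntax, and the table and field names here are invented for illustration.

```python
import sqlite3

def extract_city(payload: str) -> str:
    """Pull a nested JSON field out of a text column using SQL alone."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE events (payload TEXT)")
    conn.execute("INSERT INTO events VALUES (?)", (payload,))
    # json_extract walks a JSONPath-like expression into the document.
    (city,) = conn.execute(
        "SELECT json_extract(payload, '$.address.city') FROM events"
    ).fetchone()
    conn.close()
    return city

# extract_city('{"address": {"city": "Denver"}}') returns 'Denver'
```

The same pattern, querying semi-structured fields directly in SQL rather than flattening them first in application code, is what tools like dbt build on.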

Quote

Data engineers may also need to develop proficiency in secondary programming languages, including R, JavaScript, Go, Rust, C/C++, C#, and Julia. Developing in these languages is often necessary when popular across the company or used with domain-specific data tools. For instance, JavaScript has proven popular as a language for user-defined functions in cloud data warehouses. At the same time, C# and PowerShell are essential in companies that leverage Azure and the Microsoft ecosystem.
- Location 652
- programming languages, data engineering,

Quote

The Continuum of Data Engineering Roles, from A to B
- LocationĀ 665
-

Quote

Data Engineers Inside an Organization
- LocationĀ 689
-

Quote

Internal-Facing Versus External-Facing Data Engineers
- LocationĀ 693
-

Quote

Data Engineers and Other Technical Roles
- LocationĀ 718
-

Quote

Figure 1-12. Key technical stakeholders of data engineering
- LocationĀ 722
-

Quote

Upstream stakeholders
- LocationĀ 729
-

Quote

Data architects design the blueprint for organizational data management, mapping out processes and overall data architecture and systems. They also serve as a bridge between an organization’s technical and nontechnical sides.
- Location 735
-

Quote

Data architects implement policies for managing data across silos and business units, steer global strategies such as data management and data governance, and guide significant initiatives. Data architects often play a central role in cloud migrations and greenfield cloud design.
- LocationĀ 739
-

Quote

Nevertheless, data architects will remain influential visionaries in enterprises, working hand in hand with data engineers to determine the big picture of architecture practices and data strategies.
- LocationĀ 744
-

Quote

In well-run technical organizations, software engineers and data engineers coordinate from the inception of a new project to design application data for consumption by analytics and ML applications.
- LocationĀ 758
-

Quote

A data engineer should work together with software engineers to understand the applications that generate data, the volume, frequency, and format of the generated data, and anything else that will impact the data engineering lifecycle, such as data security and regulatory compliance. For example, this might mean setting upstream expectations on what the data software engineers need to do their jobs.
- LocationĀ 759
- data engineering,

Quote

Downstream stakeholders
- LocationĀ 769
-

Quote

If data engineers do their job and collaborate successfully, data scientists shouldn’t spend their time collecting, cleaning, and preparing data after initial exploratory work. Data engineers should automate this work as much as possible.
- Location 784
-

Quote

Whereas data scientists are forward-looking, a data analyst typically focuses on the past or present.
- LocationĀ 790
-

Quote

Data Engineers and Business Leadership
- LocationĀ 822
-

Quote

Data engineers often support data architects by acting as the glue between the business and data science/analytics.
- LocationĀ 826
-

Quote

Data in the C-suite
- LocationĀ 827
-

Quote

CIOs will work with engineers and architects to map out major initiatives and make strategic decisions on adopting major architectural elements, such as enterprise resource planning (ERP) and customer relationship management (CRM) systems, cloud migrations, data systems, and internal-facing IT.
- LocationĀ 846
-

Quote

A CTO owns the key technological strategy and architectures for external-facing applications, such as mobile, web apps, and IoT—all critical data sources for data engineers.
- Location 851
- business roles,

Quote

Data engineers and project managers
- LocationĀ 876
-

Quote

These large initiatives often benefit from project management (in contrast to product management, discussed next). Whereas data engineers function in an infrastructure and service delivery capacity, project managers direct traffic and serve as gatekeepers.
- LocationĀ 881
-

Quote

Data engineers and product managers
- LocationĀ 889
-

Quote

Data engineers and other management roles
- LocationĀ 896
-

Quote

For more information on data teams and how to structure them, we recommend John Thompson’s Building Analytics Teams (Packt) and Jesse Anderson’s Data Teams (Apress). Both books provide strong frameworks and perspectives on the roles of executives with data, who to hire, and how to construct the most effective data team for your company.
- Location 900
-

Quote

Conclusion
- LocationĀ 907
-

Quote

Additional Resources
- LocationĀ 917
-

Quote

“Big Data Will Be Dead in Five Years” by Lewis Gavin
- Location 920
-

Quote

“Data as a Product vs. Data as a Service” by Justin Gage
- Location 923
-

Quote

“The Downfall of the Data Engineer” by Maxime Beauchemin
- Location 928
-

Quote

“How Creating a Data-Driven Culture Can Drive Success” by Frederik Bussler
- Location 931
-

Quote

The Information Management Body of Knowledge website
- Location 933
-

Quote

“Information Management Body of Knowledge” Wikipedia page
- Location 934
-

Quote

“Information Management” Wikipedia page
- Location 935
-

Quote

“On Complexity in Big Data” by Jesse Anderson (O’Reilly)
- Location 936
-

Quote

“What Is a Data Architect? IT’s Data Framework Visionary” by Thor Olavsrud
- Location 943
-

Quote

The Data Engineering Lifecycle
- LocationĀ 980
-

Quote

data engineering lifecycle comprises stages that turn raw data ingredients into a useful end product, ready for consumption by analysts, data scientists, ML engineers, and others.
- LocationĀ 989
- data engineering,

Quote

What Is the Data Engineering Lifecycle?
- LocationĀ 989
-

Quote

Figure 2-1. Components and undercurrents of the data engineering lifecycle
- LocationĀ 996
-

Quote

The Data Lifecycle Versus the Data Engineering Lifecycle
- LocationĀ 1009
-

Quote

source system is the origin of the data used in the data engineering lifecycle. For example, a source system could be an IoT device, an application message queue, or a transactional database.
- LocationĀ 1018
-

Quote

Generation: Source Systems
- LocationĀ 1018
-

Quote

Engineers also need to keep an open line of communication with source system owners on changes that could break pipelines and analytics.
- LocationĀ 1025
-

Quote

The following is a starting set of evaluation questions of source systems that data engineers must consider:
- What are the essential characteristics of the data source? Is it an application? A swarm of IoT devices?
- How is data persisted in the source system? Is data persisted long term, or is it temporary and quickly deleted?
- At what rate is data generated? How many events per second? How many gigabytes per hour?
- What level of consistency can data engineers expect from the output data? If you’re running data-quality checks against the output data, how often do data inconsistencies occur—nulls where they aren’t expected, lousy formatting, etc.?
- How often do errors occur?
- Will the data contain duplicates?
- Will some data values arrive late, possibly much later than other messages produced simultaneously?
- What is the schema of the ingested data? Will data engineers need to join across several tables or even several systems to get a complete picture of the data?
- If schema changes (say, a new column is added), how is this dealt with and communicated to downstream stakeholders?
- How frequently should data be pulled from the source system?
- For stateful systems (e.g., a database tracking customer account information), is data provided as periodic snapshots or update events from change data capture (CDC)? What’s the logic for how changes are performed, and how are these tracked in the source database?
- Who/what is the data provider that will transmit the data for downstream consumption?
- Will reading from a data source impact its performance?
- Does the source system have upstream data dependencies? What are the characteristics of these upstream systems?
- Are data-quality checks in place to check for late or missing data?
- Location 1049
- considerations, data source systems,
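A few of the questions above (unexpected nulls, duplicate records) lend themselves to simple automated checks. This sketch is illustrative, not the book's code; it assumes records arrive as Python dicts with unique field names.

```python
def audit_records(records):
    """Count basic data-quality problems of the kind the questions above
    probe for: nulls where they aren't expected, and duplicate records."""
    seen = set()
    with_nulls = 0
    duplicates = 0
    for rec in records:
        if any(value is None for value in rec.values()):
            with_nulls += 1
        # A record's sorted items form a stable identity for duplicate checks.
        key = tuple(sorted(rec.items()))
        if key in seen:
            duplicates += 1
        seen.add(key)
    return {"records": len(records), "with_nulls": with_nulls, "duplicates": duplicates}
```

Checks like these would normally run continuously against the ingested output, so inconsistency rates can be tracked over time rather than discovered downstream.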

Quote

A data engineer should know how the source generates data, including relevant quirks or nuances. Data engineers also need to understand the limits of the source systems they interact with.
- LocationĀ 1069
-

Quote

Storage
- LocationĀ 1089
-

Quote

few data storage solutions function purely as storage, with many supporting complex transformation queries; even object storage solutions may support powerful query capabilities—e.g., Amazon S3 Select.
- Location 1093
-

Quote

Here are a few key engineering questions to ask when choosing a storage system for a data warehouse, data lakehouse, database, or object storage:
- Is this storage solution compatible with the architecture’s required write and read speeds?
- Will storage create a bottleneck for downstream processes?
- Do you understand how this storage technology works? Are you utilizing the storage system optimally or committing unnatural acts? For instance, are you applying a high rate of random access updates in an object storage system? (This is an antipattern with significant performance overhead.)
- Will this storage system handle anticipated future scale? You should consider all capacity limits on the storage system: total available storage, read operation rate, write volume, etc.
- Will downstream users and processes be able to retrieve data in the required service-level agreement (SLA)?
- Are you capturing metadata about schema evolution, data flows, data lineage, and so forth? Metadata has a significant impact on the utility of data. Metadata represents an investment in the future, dramatically enhancing discoverability and institutional knowledge to streamline future projects and architecture changes.
- Is this a pure storage solution (object storage), or does it support complex query patterns (i.e., a cloud data warehouse)?
- Is the storage system schema-agnostic (object storage)? Flexible schema (Cassandra)? Enforced schema (a cloud data warehouse)?
- How are you tracking master data, golden records, data quality, and data lineage for data governance? (We have more to say on these in “Data Management”.)
- How are you handling regulatory compliance and data sovereignty? For example, can you store your data in certain geographical locations but not others?
- Location 1102
-

Quote

Understanding data access frequency
- LocationĀ 1120
-

Quote

Data access frequency will determine the temperature of your data. Data that is most frequently accessed is called hot data. Hot data is commonly retrieved many times per day, perhaps even several times per second—for example, in systems that serve user requests. This data should be stored for fast retrieval, where “fast” is relative to the use case. Lukewarm data might be accessed every so often—say, every week or month. Cold data is seldom queried and is appropriate for storing in an archival system. Cold data is often retained for compliance purposes or in case of a catastrophic failure in another system.
- Location 1123
-
- [note::Data Temperature = Frequency of Data Retrieval]
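The hot/lukewarm/cold distinction can be sketched as a tiny classifier. The monthly-access thresholds below are illustrative assumptions, not figures from the book.

```python
def data_temperature(accesses_per_month: float) -> str:
    """Classify data 'temperature' by access frequency.

    Thresholds are assumptions for illustration: roughly daily access
    counts as hot, weekly-to-monthly as lukewarm, anything rarer as cold.
    """
    if accesses_per_month >= 30:
        return "hot"
    if accesses_per_month >= 1:
        return "lukewarm"
    return "cold"
```

A rule like this could drive storage-tier selection, e.g., routing cold data to archival object storage classes.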

Quote

Selecting a storage system
- LocationĀ 1136
-

Quote

Ingestion
- LocationĀ 1144
-

Quote

source systems and ingestion represent the most significant bottlenecks of the data engineering lifecycle.
- LocationĀ 1148
-

Quote

Key engineering considerations for the ingestion phase
- LocationĀ 1153
-

Quote

When preparing to architect or build a system, here are some primary questions about the ingestion stage:
- What are the use cases for the data I’m ingesting? Can I reuse this data rather than create multiple versions of the same dataset?
- Are the systems generating and ingesting this data reliably, and is the data available when I need it?
- What is the data destination after ingestion?
- How frequently will I need to access the data?
- In what volume will the data typically arrive?
- What format is the data in? Can my downstream storage and transformation systems handle this format?
- Is the source data in good shape for immediate downstream use? If so, for how long, and what may cause it to be unusable?
- If the data is from a streaming source, does it need to be transformed before reaching its destination? Would an in-flight transformation be appropriate, where the data is transformed within the stream itself?
- Location 1154
-

Quote

Batch versus streaming
- LocationĀ 1167
-

Quote

Virtually all data we deal with is inherently streaming. Data is nearly always produced and updated continually at its source.
- LocationĀ 1168
-

Quote

Batch ingestion is simply a specialized and convenient way of processing this stream in large chunks—for example, handling a full day’s worth of data in a single batch.
- Location 1170
-
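The point that a batch is just a stream cut into chunks can be shown with a small sketch: a generator that groups a time-ordered stream of events into one batch per day. The `(timestamp, payload)` event shape is a hypothetical choice for illustration.

```python
from datetime import datetime
from itertools import groupby

def daily_batches(events):
    """Yield (day, batch) pairs from a time-ordered event stream,
    one batch per calendar day: batch ingestion as a special case
    of stream processing."""
    # groupby assumes the stream is already ordered by timestamp.
    for day, batch in groupby(events, key=lambda event: event[0].date()):
        yield day, list(batch)

stream = [
    (datetime(2023, 5, 1, 9, 0), "page_view"),
    (datetime(2023, 5, 1, 17, 30), "checkout"),
    (datetime(2023, 5, 2, 8, 15), "page_view"),
]
# daily_batches(stream) yields a two-event batch for May 1 and a
# one-event batch for May 2.
```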

Quote

Key considerations for batch versus stream ingestion
- LocationĀ 1184
-

Quote

The following are some questions to ask yourself when determining whether streaming ingestion is an appropriate choice over batch ingestion:
- If I ingest the data in real time, can downstream storage systems handle the rate of data flow?
- Do I need millisecond real-time data ingestion? Or would a micro-batch approach work, accumulating and ingesting data, say, every minute?
- What are my use cases for streaming ingestion? What specific benefits do I realize by implementing streaming? If I get data in real time, what actions can I take on that data that would be an improvement upon batch?
- Will my streaming-first approach cost more in terms of time, money, maintenance, downtime, and opportunity cost than simply doing batch?
- Are my streaming pipeline and system reliable and redundant if infrastructure fails?
- What tools are most appropriate for the use case? Should I use a managed service (Amazon Kinesis, Google Cloud Pub/Sub, Google Cloud Dataflow) or stand up my own instances of Kafka, Flink, Spark, Pulsar, etc.? If I do the latter, who will manage it? What are the costs and trade-offs?
- If I’m deploying an ML model, what benefits do I have with online predictions and possibly continuous training?
- Am I getting data from a live production instance? If so, what’s the impact of my ingestion process on this source system?
- Location 1186
-

Quote

In the push model of data ingestion, a source system writes data out to a target, whether a database, object store, or filesystem. In the pull model, data is retrieved from the source system.
- LocationĀ 1202
- data ingestion,

Quote

Push versus pull
- LocationĀ 1202
-

Quote

continuous CDC,
- LocationĀ 1213
-
- [note::What does CDC stand for? -> "Change Data Capture"]

Quote

Transformation
- LocationĀ 1225
-

Quote

Immediately after ingestion, basic transformations map data into correct types (changing ingested string data into numeric and date types, for example), putting records into standard formats, and removing bad ones. Later stages of transformation may transform the data schema and apply normalization. Downstream, we can apply large-scale aggregation for reporting or featurize data for ML processes.
- LocationĀ 1231
-
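A minimal sketch of these immediate post-ingestion transformations: cast string fields to numeric and date types, and drop records that fail. The field names are hypothetical, not from the book.

```python
from datetime import date
from typing import Optional

def clean_record(raw: dict) -> Optional[dict]:
    """Cast ingested string fields to proper types; return None for
    bad records so they can be removed from the pipeline."""
    try:
        return {
            "order_id": int(raw["order_id"]),
            "amount": float(raw["amount"]),
            "order_date": date.fromisoformat(raw["order_date"]),
        }
    except (KeyError, ValueError):
        return None  # remove bad records rather than pass them downstream

rows = [
    {"order_id": "42", "amount": "19.99", "order_date": "2023-05-01"},
    {"order_id": "oops", "amount": "?", "order_date": ""},
]
cleaned = [rec for rec in map(clean_record, rows) if rec is not None]
# Only the first record survives; the malformed one is dropped.
```

Later stages would layer schema changes, normalization, and aggregation on top of typed, standardized records like these.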

Quote

Key considerations for the transformation phase
- LocationĀ 1234
-

Quote

When considering data transformations within the data engineering lifecycle, it helps to consider the following:
- What’s the cost and return on investment (ROI) of the transformation? What is the associated business value?
- Is the transformation as simple and self-isolated as possible?
- What business rules do the transformations support?
- Location 1235
- data transformation,

Quote

Serving Data
- LocationĀ 1263
-

Quote

Analytics
- LocationĀ 1276
-

Quote

Figure 2-5. Types of analytics
- LocationĀ 1282
-

Quote

Multitenancy
- LocationĀ 1314
-

Quote

Machine learning
- LocationĀ 1321
-

Quote

The following are some considerations for the serving data phase specific to ML:
- Is the data of sufficient quality to perform reliable feature engineering? Quality requirements and assessments are developed in close collaboration with teams consuming the data.
- Is the data discoverable? Can data scientists and ML engineers easily find valuable data?
- Where are the technical and organizational boundaries between data engineering and ML engineering? This organizational question has significant architectural implications.
- Does the dataset properly represent ground truth? Is it unfairly biased?
- Location 1340
-

Quote

Reverse ETL
- LocationĀ 1349
-

Quote

Major Undercurrents Across the Data Engineering Lifecycle
- LocationĀ 1370
-

Quote

Figure 2-7. The major undercurrents of data engineering
- LocationĀ 1379
-

Quote

Security
- LocationĀ 1381
-

Quote

The principle of least privilege means giving a user or system access to only the essential data and resources to perform an intended function. A common antipattern we see with data engineers with little security experience is to give admin access to all users. This is a catastrophe waiting to happen!
- LocationĀ 1386
- data security,

Quote

Data Management
- LocationĀ 1403
-

Quote

The Data Management Association International (DAMA) Data Management Body of Knowledge (DMBOK), which we consider to be the definitive book for enterprise data management, offers this definition: Data management is the development, execution, and supervision of plans, policies, programs, and practices that deliver, control, protect, and enhance the value of data and information assets throughout their lifecycle.
- LocationĀ 1410
-

Quote

data governance engages people, processes, and technologies to maximize data value across an organization while protecting data with appropriate security controls.
- LocationĀ 1438
-

Quote

The core categories of data governance are discoverability, security, and accountability. Within these core categories are subcategories, such as data quality, metadata, and privacy.
- Location 1450
-

Quote

In a data-driven company, data must be available and discoverable. End users should have quick and reliable access to the data they need to do their jobs. They should know where the data comes from, how it relates to other data, and what the data means.
- LocationĀ 1453
-

Quote

We divide metadata into two major categories: autogenerated and human generated. Modern data engineering revolves around automation, but metadata collection is often manual and error prone.
- LocationĀ 1462
- data quality, metadata,

Quote

Metadata tools are only as good as their connectors to data systems and their ability to share metadata.
- LocationĀ 1470
-

Quote

Documentation and internal wiki tools provide a key foundation for metadata management, but these tools should also integrate with automated data cataloging. For example, data-scanning tools can generate wiki pages with links to relevant data objects.
- LocationĀ 1477
-

Quote

DMBOK identifies four main categories of metadata that are useful to data engineers:
- Business metadata
- Technical metadata
- Operational metadata
- Reference metadata
- Location 1481
-

Quote

Business metadata relates to the way data is used in the business, including business and data definitions, data rules and logic, how and where data is used, and the data owner(s). A data engineer uses business metadata to answer nontechnical questions about who, what, where, and how. For example, a data engineer may be tasked with creating a data pipeline for customer sales analysis. But what is a customer? Is it someone who’s purchased in the last 90 days? Or someone who’s purchased at any time the business has been open? A data…
- Location 1484
- metadata, business metadata,

Quote

Technical metadata describes the data created and used by systems across the data engineering lifecycle. It includes the data model and schema, data lineage, field mappings, and pipeline workflows. A data engineer uses technical metadata to create, connect, and monitor various systems across the data engineering lifecycle. Here are some common types of technical metadata that a data…
- Location 1491
- metadata,

Quote

Pipeline metadata captured in orchestration systems provides details of the workflow schedule, system and data dependencies, configurations…
- Location 1497
- data orchestration, data pipelines,

Quote

Data-lineage metadata tracks the origin and changes to data, and its dependencies, over time. As data flows through the data engineering lifecycle, it evolves through transformations and combinations with other data. Data lineage provides an audit trail of…
- Location 1500
-

Quote

Schema metadata describes the structure of data stored in a system such as a database, a data warehouse, a…
- Location 1503
-

Quote

Operational metadata describes the operational results of various systems and includes statistics about processes, job IDs, application runtime logs, data used in a process, and error logs. A data engineer uses operational metadata to determine whether a…
- Location 1508
-

Quote

Reference metadata is data used to classify other data. This is also referred to as lookup data. Standard examples of reference data are internal codes, geographic codes, units of measurement, and internal calendar standards. Note that much of reference data is fully managed internally, but items such as geographic codes might come from standard external references. Reference data is…
Reference metadata is data used to classify other data. This is also referred to as lookup data. Standard examples of reference data are internal codes, geographic codes, units of measurement, and internal calendar standards. Note that much of reference data is fully managed internally, but items such as geographic codes might come from standard external references. Reference data isā€¦
- LocationĀ 1514
-

Quote

Data accountability means assigning an individual to govern a portion of data. The responsible person then coordinates the governanceā€¦
- LocationĀ 1522
-

Quote

Data quality is the optimization of data toward the desired state and orbits the question, ā€œWhat do you get compared with what you expect?ā€ Data should conform to the expectations in the business metadata. Does the data match the definition agreed upon by the business?
- LocationĀ 1535
-

Quote

According to Data Governance: The Definitive Guide, data quality is defined by three main characteristics:4 Accuracy Is the collected data factually correct? Are there duplicate values? Are the numeric values accurate? Completeness Are the records complete? Do all required fields contain valid values? Timeliness Are records available in a timely fashion?
- LocationĀ 1542
- data quality,
- [note::Data quality is characterized by:

  1. Accuracy
  2. Completeness
  3. Timeliness]
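
The three characteristics can be expressed as simple batch checks. A minimal sketch in Python; the field names (`id`, `updated_at`) and the freshness threshold are illustrative assumptions, not from the book:

```python
from datetime import datetime, timedelta, timezone

def check_quality(records, required_fields, max_age):
    """Evaluate a batch of records against the three data-quality traits:
    accuracy (duplicates), completeness (missing fields), timeliness (staleness)."""
    now = datetime.now(timezone.utc)
    seen_ids = set()
    duplicates = incomplete = stale = 0
    for r in records:
        if r["id"] in seen_ids:
            duplicates += 1          # accuracy: duplicate values
        seen_ids.add(r["id"])
        if any(r.get(f) in (None, "") for f in required_fields):
            incomplete += 1          # completeness: required fields missing
        if now - r["updated_at"] > max_age:
            stale += 1               # timeliness: record older than allowed
    return {"duplicates": duplicates, "incomplete": incomplete, "stale": stale}
```

In practice these checks would run as assertions in a pipeline (or via a testing framework), failing the batch or alerting when counts exceed agreed thresholds.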

Quote

Master data is data about business entities such as employees, customers, products, and locations.
- LocationĀ 1560
-
- [note::Do people in the field have qualms about the use of "master" in this term?]

Quote

Master data management (MDM) is the practice of building consistent entity definitions known as golden records. Golden records harmonize entity data across an organization and with its partners.
- LocationĀ 1565
-
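
A golden record can be sketched as a merge that reconciles conflicting copies of the same entity. The survivorship rule here (most recently updated non-empty value wins) is an illustrative assumption; real MDM tools let you configure such rules per field:

```python
def build_golden_record(copies):
    """Merge entity copies from several systems into one golden record.
    Survivorship rule (an assumption for this sketch): for each field,
    keep the value from the most recently updated copy that has one."""
    golden = {}
    for copy in sorted(copies, key=lambda c: c["updated_at"]):
        for field, value in copy.items():
            if field != "updated_at" and value not in (None, ""):
                golden[field] = value  # later copies overwrite earlier ones
    return golden
```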

Quote

Data modeling and design
- LocationĀ 1575
-

Quote

Whereas we traditionally think of data modeling as a problem for database administrators (DBAs) and ETL developers, data modeling can happen almost anywhere in an organization.
- LocationĀ 1580
-

Quote

Data engineers need to understand modeling best practices as well as develop the flexibility to apply the appropriate level and type of modeling to the data source and use case.
- LocationĀ 1591
-

Quote

Data lineage
- LocationĀ 1593
-

Quote

Data lineage describes the recording of an audit trail of data through its lifecycle, tracking both the systems that process the data and the upstream data it depends on.
- LocationĀ 1597
- blue,

Quote

We also note that Andy Petrellaā€™s concept of Data Observability Driven Development (DODD) is closely related to data lineage. DODD observes data all along its lineage. This process is applied during development, testing, and finally production to deliver quality and conformity to expectations.
- LocationĀ 1603
-

Quote

Data integration and interoperability is the process of integrating data across tools and processes. As we move away from a single-stack
- LocationĀ 1608
- blue,

Quote

Data integration and interoperability
- LocationĀ 1608
-

Quote

Increasingly, integration happens through general-purpose APIs rather than custom database connections. For example, a data pipeline might pull data from the Salesforce API, store it to Amazon S3, call the Snowflake API to load it into a table, call the API again to run a query, and then export the results to S3 where Spark can consume them.
- LocationĀ 1614
-
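
The pipeline described above can be sketched schematically. The client objects below are hypothetical stand-ins, not the real Salesforce, boto3, or Snowflake SDKs; only the shape of the flow (API pull, object-store landing, warehouse load, query, export) follows the quote:

```python
def run_pipeline(salesforce, s3, snowflake):
    """Schematic version of the API-driven pipeline: each step is an API call,
    with object storage as the interchange layer between systems."""
    raw = salesforce.get("/services/data/query", params={"q": "SELECT Id FROM Account"})
    s3.put_object(bucket="landing", key="accounts.json", body=raw)
    snowflake.execute("COPY INTO accounts FROM @landing/accounts.json")
    result = snowflake.execute("SELECT COUNT(*) FROM accounts")
    s3.put_object(bucket="exports", key="account_count.csv", body=str(result))
    return result
```

The point of the sketch is that no custom database connection appears anywhere: every hop is a general-purpose API call.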

Quote

Data lifecycle management
- LocationĀ 1622
-

Quote

This means we have pay-as-you-go storage costs instead of large up-front capital expenditures for an on-premises data lake. When every byte shows up on a monthly AWS statement, CFOs see opportunities for savings.
- LocationĀ 1627
-

Quote

Ethics and privacy
- LocationĀ 1638
-

Quote

Data used to live in the Wild West, freely collected and traded like baseball cards. Those days are long gone. Whereas dataā€™s ethical and privacy implications were once considered nice to have, like security, theyā€™re now central to the general data lifecycle. Data engineers need to do the right thing when no one else is watching, because everyone will be watching someday.
- LocationĀ 1643
-

Quote

Ensure that your data assets are compliant with a growing number of data regulations, such as GDPR and CCPA. Please take this seriously. We offer tips throughout the book to ensure that youā€™re baking ethics and privacy into the data engineering lifecycle.
- LocationĀ 1649
-

Quote

DataOps maps the best practices of Agile methodology, DevOps, and statistical process control (SPC) to data. Whereas DevOps aims to improve the release and quality of software products, DataOps does the same thing for data products.
- LocationĀ 1653
-

Quote

DataOps
- LocationĀ 1653
-

Quote

Like DevOps, DataOps borrows much from lean manufacturing and supply chain management, mixing people, processes, and technology to reduce time to value. As DataKitchen (experts in DataOps) describes it:7 DataOps is a collection of technical practices, workflows, cultural norms, and architectural patterns that enable: Rapid innovation and experimentation delivering new insights to customers with increasing velocity Extremely high data quality and very low error rates Collaboration across complex arrays of people, technology, and environments Clear measurement, monitoring, and transparency of results
- LocationĀ 1660
- devops, dataops,

Quote

Observability and monitoring
- LocationĀ 1712
-

Quote

As we tell our clients, ā€œData is a silent killer.ā€ Weā€™ve seen countless examples of bad data lingering in reports for months or years. Executives may make key decisions from this bad data, discovering the error only much later. The outcomes are usually bad and sometimes catastrophic for the business. Initiatives are undermined and destroyed, years of work wasted. In some of the worst cases, bad data may lead companies to financial ruin.
- LocationĀ 1713
-

Quote

Observability, monitoring, logging, alerting, and tracing are all critical to getting ahead of any problems along the data engineering lifecycle. We recommend you incorporate SPC to understand whether events being monitored are out of line and which incidents are worth responding to.
- LocationĀ 1723
-
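
SPC here means treating a pipeline metric (say, daily row counts) as a process and flagging values outside control limits. A minimal three-sigma sketch; the metric and the choice of three sigmas are illustrative assumptions:

```python
from statistics import mean, stdev

def out_of_control(history, latest, sigmas=3.0):
    """Flag `latest` if it falls outside mean +/- sigmas * stdev of `history`.
    This is the classic SPC control-limit rule applied to a pipeline metric."""
    mu = mean(history)
    sd = stdev(history)
    return abs(latest - mu) > sigmas * sd
```

The value of the SPC framing is exactly what the quote says: it separates ordinary variation (no response needed) from incidents worth responding to.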

Quote

The purpose of DODD is to give everyone involved in the data chain visibility into the data and data applications so that everyone involved in the data value chain has the ability to identify changes to the data or data applications at every stepā€”from ingestion to transformation to analysisā€”to help troubleshoot or prevent data issues. DODD focuses on making data observability a first-class consideration in the data engineering lifecycle.
- LocationĀ 1731
-

Quote

Incident response
- LocationĀ 1736
-

Quote

Trust takes a long time to build and can be lost in minutes. Incident response is as much about retroactively responding to incidents as proactively addressing them before they happen.
- LocationĀ 1748
-

Quote

DataOps summary
- LocationĀ 1750
-

Quote

Data Architecture
- LocationĀ 1761
-

Quote

Orchestration
- LocationĀ 1774
-

Quote

Orchestration is the process of coordinating many jobs to run as quickly and efficiently as possible on a scheduled cadence.
- LocationĀ 1782
- blue,
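
Coordinating many jobs means respecting dependencies between them. A minimal sketch using a topological sort over a hypothetical job graph; real orchestrators such as Airflow express the same idea as DAGs, plus scheduling, retries, and monitoring:

```python
from graphlib import TopologicalSorter

def run_in_order(jobs, deps):
    """Run jobs so that each starts only after its upstream dependencies finish.
    `jobs` maps a name to a callable; `deps` maps each job to the set of
    jobs it depends on."""
    order = []
    for name in TopologicalSorter(deps).static_order():
        jobs[name]()        # in a real orchestrator this is scheduled work
        order.append(name)
    return order
```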

Quote

Software Engineering
- LocationĀ 1812
-

Quote

Core data processing code
- LocationĀ 1821
-

Quote

Whether in ingestion, transformation, or data serving, data engineers need to be highly proficient and productive in frameworks and languages such as Spark, SQL, or Beam;
- LocationĀ 1823
-

Quote

Itā€™s also imperative that a data engineer understand proper code-testing methodologies, such as unit, regression, integration, end-to-end, and smoke.
- LocationĀ 1825
-

Quote

Development of open source frameworks
- LocationĀ 1827
-

Quote

Infrastructure as code
- LocationĀ 1851
-

Quote

Pipelines as code
- LocationĀ 1862
-

Quote

General-purpose problem solving
- LocationĀ 1870
-

Quote

In practice, regardless of which high-level tools they adopt, data engineers will run into corner cases throughout the data engineering lifecycle that require them to solve problems outside the boundaries of their chosen tools and to write custom code. When using frameworks like Fivetran, Airbyte, or Matillion, data engineers will encounter data sources without existing connectors and need to write something custom. They should be proficient in software engineering to understand APIs, pull and transform data, handle exceptions, and so forth.
- LocationĀ 1871
-
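
When no connector exists, the custom code usually boils down to: call the API, handle transient failures, and hand the data off. A hedged sketch of paginated extraction with retries; the pagination shape, retry count, and backoff policy are assumptions, not any particular vendor's API:

```python
import time

def extract(fetch_page, max_retries=3, backoff_s=1.0):
    """Pull all pages from a paginated API, retrying transient errors.
    `fetch_page(cursor)` returns (records, next_cursor) or raises IOError;
    a next_cursor of None signals the final page."""
    records, cursor = [], None
    while True:
        for attempt in range(max_retries):
            try:
                page, cursor = fetch_page(cursor)
                break
            except IOError:
                if attempt == max_retries - 1:
                    raise
                time.sleep(backoff_s * (2 ** attempt))  # exponential backoff
        records.extend(page)
        if cursor is None:
            return records
```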

Quote

Conclusion
- LocationĀ 1876
-

Quote

A data engineer has several top-level goals across the data lifecycle: produce optimum ROI and reduce costs (financial and opportunity), reduce risk (security, data quality), and maximize data value and utility.
- LocationĀ 1886
- data lifecycle, data engineering,

Quote

Additional Resources
- LocationĀ 1891
-

Quote

ā€œDemocratizing Data at Airbnbā€ by Chris Williams et al.
- LocationĀ 1898
-

Quote

ā€œFive Steps to Begin Collecting the Value of Your Dataā€ Lean-Data web page
- LocationĀ 1899
-

Quote

ā€œGetting Started with DevOps Automationā€ by Jared Murrell
- LocationĀ 1900
-

Quote

ā€œStaying Ahead of Debtā€ by Etai Mizrahi
- LocationĀ 1907
-

Quote

ā€œWhat Is Metadataā€ by Michelle Knight
- LocationĀ 1908
-

Quote

3 Chris Williams et al., ā€œDemocratizing Data at Airbnb,ā€ The Airbnb Tech Blog, May 12, 2017, https://oreil.ly/dM332.
- LocationĀ 1913
-

Quote

ethical behavior is doing the right thing when no one is watching,
- LocationĀ 1921
-

Quote

ā€œWhat Is DataOps,ā€ DataKitchen FAQ page, accessed May 5, 2022, https://oreil.ly/Ns06w.
- LocationĀ 1923
-

Quote

Designing Good Data Architecture
- LocationĀ 1931
-

Quote

What Is Data Architecture?
- LocationĀ 1936
-

Quote

Enterprise Architecture Defined
- LocationĀ 1947
-

Quote

Figure 3-1. Data architecture is a subset of enterprise architecture
- LocationĀ 1952
-

Quote

TOGAF is The Open Group Architecture Framework, a standard of The Open Group. Itā€™s touted as the most widely used architecture framework today.
- LocationĀ 1962
-

Quote

Gartner is a global research and advisory company that produces research articles and reports on trends related to enterprises. Among other things, it is responsible for the (in)famous Gartner Hype Cycle.
- LocationĀ 1971
-

Quote

Data Architecture Defined
- LocationĀ 2031
-

Quote

ā€œGoodā€ Data Architecture
- LocationĀ 2068
-

Quote

Principles of Good Data Architecture
- LocationĀ 2090
-

Quote

Principle 1: Choose Common Components Wisely
- LocationĀ 2115
-

Quote

Principle 2: Plan for Failure
- LocationĀ 2138
-

Quote

Principle 3: Architect for Scalability
- LocationĀ 2160
-

Quote

Principle 4: Architecture Is Leadership
- LocationĀ 2175
-

Quote

Principle 5: Always Be Architecting
- LocationĀ 2195
-

Quote

Principle 6: Build Loosely Coupled Systems
- LocationĀ 2214
-

Quote

Principle 7: Make Reversible Decisions
- LocationĀ 2255
-

Quote

Principle 8: Prioritize Security
- LocationĀ 2272
-

Quote

Principle 9: Embrace FinOps
- LocationĀ 2322
-

Quote

Major Architecture Concepts
- LocationĀ 2370
-

Quote

Domains and Services
- LocationĀ 2376
-

Quote

Distributed Systems, Scalability, and Designing for Failure
- LocationĀ 2400
-

Quote

Tight Versus Loose Coupling: Tiers, Monoliths, and Microservices
- LocationĀ 2440
-

Quote

User Access: Single Versus Multitenant
- LocationĀ 2553
-

Quote

Event-Driven Architecture
- LocationĀ 2570
-

Quote

Brownfield Versus Greenfield Projects
- LocationĀ 2591
-

Quote

Examples and Types of Data Architecture
- LocationĀ 2630
-

Quote

Data Warehouse
- LocationĀ 2635
-

Quote

Data Lake
- LocationĀ 2713
-

Quote

Convergence, Next-Generation Data Lakes, and the Data Platform
- LocationĀ 2745
-

Quote

Modern Data Stack
- LocationĀ 2767
-

Quote

Lambda Architecture
- LocationĀ 2788
-

Quote

Kappa Architecture
- LocationĀ 2807
-

Quote

The Dataflow Model and Unified Batch and Streaming
- LocationĀ 2821
-

Quote

Architecture for IoT
- LocationĀ 2842
-

Quote

Data Mesh
- LocationĀ 2905
-

Quote

Other Data Architecture Examples
- LocationĀ 2927
-

Quote

Whoā€™s Involved with Designing a Data Architecture?
- LocationĀ 2940
-

Quote

Additional Resources
- LocationĀ 2959
-

Quote

Choosing Technologies Across the Data Engineering Lifecycle
- LocationĀ 3111
-

Quote

Team Size and Capabilities
- LocationĀ 3146
-

Quote

Speed to Market
- LocationĀ 3163
-

Quote

Cost Optimization and Business Value
- LocationĀ 3190
-

Quote

Total Cost of Ownership
- LocationĀ 3199
-

Quote

Total Opportunity Cost of Ownership
- LocationĀ 3229
-

Quote

Today Versus the Future: Immutable Versus Transitory Technologies
- LocationĀ 3263
-

Quote

Our Advice
- LocationĀ 3300
-

Quote

On Premises
- LocationĀ 3317
-

Quote

Hybrid Cloud
- LocationĀ 3425
-

Quote

Decentralized: Blockchain and the Edge
- LocationĀ 3465
-

Quote

Our Advice
- LocationĀ 3473
-

Quote

Cloud Repatriation Arguments
- LocationĀ 3499
-

Quote

Build Versus Buy
- LocationĀ 3555
-

Quote

Open Source Software
- LocationĀ 3578
-

Quote

Proprietary Walled Gardens
- LocationĀ 3649
-

Quote

Our Advice
- LocationĀ 3686
-

Quote

Monolith Versus Modular
- LocationĀ 3699
-

Quote

The Distributed Monolith Pattern
- LocationĀ 3757
-

Quote

Our Advice
- LocationĀ 3774
-

Quote

Serverless Versus Servers
- LocationĀ 3785
-

Quote

How to Evaluate Server Versus Serverless
- LocationĀ 3835
-

Quote

Our Advice
- LocationĀ 3855
-

Quote

Optimization, Performance, and the Benchmark Wars
- LocationĀ 3873
-

Quote

Big Data...for the 1990s
- LocationĀ 3894
-

Quote

Nonsensical Cost Comparisons
- LocationĀ 3901
-

Quote

Asymmetric Optimization
- LocationĀ 3906
-

Quote

Caveat Emptor
- LocationĀ 3912
-

Quote

Undercurrents and Their Impacts on Choosing Technologies
- LocationĀ 3914
-

Quote

Data Management
- LocationĀ 3920
-

Quote

Data Architecture
- LocationĀ 3940
-

Quote

Orchestration Example: Airflow
- LocationĀ 3946
-

Quote

Software Engineering
- LocationĀ 3966
-

Quote

Additional Resources
- LocationĀ 3981
-

Quote

The Data Engineering Lifecycle in Depth
- LocationĀ 4028
-

Quote

Data Generation in Source Systems
- LocationĀ 4029
-

Quote

Sources of Data: How Is Data Created?
- LocationĀ 4043
-

Quote

Source Systems: Main Ideas
- LocationĀ 4061
-

Quote

Files and Unstructured Data
- LocationĀ 4064
-

Quote

Application Databases (OLTP Systems)
- LocationĀ 4085
-

Quote

Online Analytical Processing System
- LocationĀ 4137
-

Quote

Change Data Capture
- LocationĀ 4152
-

Quote

Database Logs
- LocationĀ 4191
-

Quote

Insert-Only
- LocationĀ 4212
-

Quote

Messages and Streams
- LocationĀ 4231
-

Quote

Types of Time
- LocationĀ 4264
-

Quote

Source System Practical Details
- LocationĀ 4287
-

Quote

Data Sharing
- LocationĀ 4588
-

Quote

Third-Party Data Sources
- LocationĀ 4605
-

Quote

Message Queues and Event-Streaming Platforms
- LocationĀ 4620
-

Quote

Whom Youā€™ll Work With
- LocationĀ 4726
-

Quote

Undercurrents and Their Impact on Source Systems
- LocationĀ 4760
-

Quote

Data Management
- LocationĀ 4778
-

Quote

Data Architecture
- LocationĀ 4813
-

Quote

Software Engineering
- LocationĀ 4838
-

Quote

Additional Resources
- LocationĀ 4865
-

Quote

Raw Ingredients of Data Storage
- LocationĀ 4924
-

Quote

Magnetic Disk Drive
- LocationĀ 4934
-

Quote

Solid-State Drive
- LocationĀ 4975
-

Quote

Random Access Memory
- LocationĀ 4990
-

Quote

Networking and CPU
- LocationĀ 5025
-

Quote

Data Storage Systems
- LocationĀ 5098
-

Quote

Single Machine Versus Distributed Storage
- LocationĀ 5103
-

Quote

Eventual Versus Strong Consistency
- LocationĀ 5117
-

Quote

File Storage
- LocationĀ 5146
-

Quote

Block Storage
- LocationĀ 5205
-

Quote

Object Storage
- LocationĀ 5281
-

Quote

Cache and Memory-Based Storage Systems
- LocationĀ 5428
-

Quote

The Hadoop Distributed File System
- LocationĀ 5448
-

Quote

Streaming Storage
- LocationĀ 5474
-

Quote

Indexes, Partitioning, and Clustering
- LocationĀ 5488
-

Quote

Data Engineering Storage Abstractions
- LocationĀ 5542
-

Quote

The Data Warehouse
- LocationĀ 5560
-

Quote

The Data Lake
- LocationĀ 5573
-

Quote

The Data Lakehouse
- LocationĀ 5581
-

Quote

Data Platforms
- LocationĀ 5598
-

Quote

Stream-to-Batch Storage Architecture
- LocationĀ 5606
-

Quote

Big Ideas and Trends in Storage
- LocationĀ 5616
-

Quote

Data Catalog
- LocationĀ 5622
-

Quote

Data Sharing
- LocationĀ 5644
-

Quote

Separation of Compute from Storage
- LocationĀ 5672
-

Quote

Data Storage Lifecycle and Data Retention
- LocationĀ 5769
-

Quote

Single-Tenant Versus Multitenant Storage
- LocationĀ 5862
-

Quote

Whom Youā€™ll Work With
- LocationĀ 5888
-

Quote

Undercurrents
- LocationĀ 5898
-

Quote

Security
- LocationĀ 5902
-

Quote

Conclusion
- LocationĀ 5956
-

Quote

Ingestion
- LocationĀ 5987
-

Quote

What Is Data Ingestion?
- LocationĀ 5994
-

Quote

Key Engineering Considerations for the Ingestion Phase
- LocationĀ 6029
-

Quote

Bounded Versus Unbounded Data
- LocationĀ 6048
-

Quote

Frequency
- LocationĀ 6065
-

Quote

Synchronous Versus Asynchronous Ingestion
- LocationĀ 6093
-

Quote

Serialization and Deserialization
- LocationĀ 6121
-

Quote

Throughput and Scalability
- LocationĀ 6130
-

Quote

Reliability and Durability
- LocationĀ 6146
-

Quote

Payload
- LocationĀ 6163
-

Quote

Push Versus Pull Versus Poll Patterns
- LocationĀ 6237
-

Quote

Batch Ingestion Considerations
- LocationĀ 6253
-

Quote

Snapshot or Differential Extraction
- LocationĀ 6272
-

Quote

File-Based Export and Ingestion
- LocationĀ 6281
-

Quote

ETL Versus ELT
- LocationĀ 6290
-

Quote

Inserts, Updates, and Batch Size
- LocationĀ 6304
-

Quote

Data Migration
- LocationĀ 6315
-

Quote

Message and Stream Ingestion Considerations
- LocationĀ 6330
-

Quote

Schema Evolution
- LocationĀ 6336
-

Quote

Late-Arriving Data
- LocationĀ 6346
-

Quote

Ordering and Multiple Delivery
- LocationĀ 6356
-

Quote

Replay
- LocationĀ 6360
-

Quote

Time to Live
- LocationĀ 6366
-

Quote

Message Size
- LocationĀ 6377
-

Quote

Error Handling and Dead-Letter Queues
- LocationĀ 6383
-

Quote

Consumer Pull and Push
- LocationĀ 6395
-

Quote

Location
- LocationĀ 6403
-

Quote

Ways to Ingest Data
- LocationĀ 6411
-

Quote

Direct Database Connection
- LocationĀ 6415
-

Quote

Change Data Capture
- LocationĀ 6447
-

Quote

Message Queues and Event-Streaming Platforms
- LocationĀ 6522
-

Quote

Managed Data Connectors
- LocationĀ 6548
-

Quote

Moving Data with Object Storage
- LocationĀ 6561
-

Quote

Databases and File Export
- LocationĀ 6579
-

Quote

Practical Issues with Common File Formats
- LocationĀ 6589
-

Quote

Shell
- LocationĀ 6607
-

Quote

SFTP and SCP
- LocationĀ 6625
-

Quote

Web Interface
- LocationĀ 6653
-

Quote

Web Scraping
- LocationĀ 6659
-

Quote

Transfer Appliances for Data Migration
- LocationĀ 6677
-

Quote

Data Sharing
- LocationĀ 6694
-

Quote

Whom Youā€™ll Work With
- LocationĀ 6705
-

Quote

Upstream Stakeholders
- LocationĀ 6710
-

Quote

Downstream Stakeholders
- LocationĀ 6723
-

Quote

Undercurrents
- LocationĀ 6737
-

Quote

Security
- LocationĀ 6740
-

Quote

Data Management
- LocationĀ 6746
-

Quote

DataOps
- LocationĀ 6791
-

Quote

Orchestration
- LocationĀ 6834
-

Quote

Software Engineering
- LocationĀ 6844
-

Quote

Conclusion
- LocationĀ 6855
-

Quote

Additional Resources
- LocationĀ 6861
-

Quote

Queries, Modeling, and Transformation
- LocationĀ 6877
-

Quote

Queries
- LocationĀ 6903
-

Quote

What Is a Query?
- LocationĀ 6908
-

Quote

The Life of a Query
- LocationĀ 6957
-

Quote

The Query Optimizer
- LocationĀ 6970
-

Quote

Improving Query Performance
- LocationĀ 6978
-

Quote

Queries on Streaming Data
- LocationĀ 7123
-

Quote

Data Modeling
- LocationĀ 7258
-

Quote

What Is a Data Model?
- LocationĀ 7275
-

Quote

Conceptual, Logical, and Physical Data Models
- LocationĀ 7292
-

Quote

Normalization
- LocationĀ 7328
-

Quote

Techniques for Modeling Batch Analytical Data
- LocationĀ 7503
-

Quote

Modeling Streaming Data
- LocationĀ 7881
-

Quote

Transformations
- LocationĀ 7907
-

Quote

Batch Transformations
- LocationĀ 7931
-

Quote

Materialized Views, Federation, and Query Virtualization
- LocationĀ 8270
-

Quote

Streaming Transformations and Processing
- LocationĀ 8338
-

Quote

Whom Youā€™ll Work With
- LocationĀ 8395
-

Quote

Upstream Stakeholders
- LocationĀ 8405
-

Quote

Downstream Stakeholders
- LocationĀ 8417
-

Quote

Undercurrents
- LocationĀ 8424
-

Quote

Security
- LocationĀ 8427
-

Quote

Data Management
- LocationĀ 8436
-

Quote

DataOps
- LocationĀ 8458
-

Quote

Data Architecture
- LocationĀ 8479
-

Quote

Orchestration
- LocationĀ 8490
-

Quote

Software Engineering
- LocationĀ 8495
-

Quote

Conclusion
- LocationĀ 8515
-

Quote

Additional Resources
- LocationĀ 8528
-

Quote

Serving Data for Analytics, Machine Learning, and Reverse ETL
- LocationĀ 8614
-

Quote

General Considerations for Serving Data
- LocationĀ 8627
-

Quote

Trust
- LocationĀ 8633
-

Quote

Whatā€™s the Use Case, and Whoā€™s the User?
- LocationĀ 8668
-

Quote

Data Products
- LocationĀ 8687
-

Quote

Self-Service or Not?
- LocationĀ 8709
-

Quote

Data Definitions and Logic
- LocationĀ 8732
-

Quote

Data Mesh
- LocationĀ 8757
-

Quote

Analytics
- LocationĀ 8768
-

Quote

Business Analytics
- LocationĀ 8779
-

Quote

Operational Analytics
- LocationĀ 8820
-

Quote

Embedded Analytics
- LocationĀ 8856
-

Quote

Machine Learning
- LocationĀ 8884
-

Quote

What a Data Engineer Should Know About ML
- LocationĀ 8899
-

Quote

Ways to Serve Data for Analytics and ML
- LocationĀ 8933
-

Quote

File Exchange
- LocationĀ 8938
-

Quote

Databases
- LocationĀ 8959
-

Quote

Streaming Systems
- LocationĀ 8989
-

Quote

Query Federation
- LocationĀ 8999
-

Quote

Data Sharing
- LocationĀ 9018
-

Quote

Semantic and Metrics Layers
- LocationĀ 9026
-

Quote

Serving Data in Notebooks
- LocationĀ 9053
-

Quote

Reverse ETL
- LocationĀ 9094
-

Quote

Whom Youā€™ll Work With
- LocationĀ 9125
-

Quote

Undercurrents
- LocationĀ 9141
-

Quote

Security
- LocationĀ 9147
-

Quote

Data Management
- LocationĀ 9171
-

Quote

DataOps
- LocationĀ 9185
-

Quote

Data Architecture
- LocationĀ 9197
-

Quote

Orchestration
- LocationĀ 9204
-

Quote

Software Engineering
- LocationĀ 9219
-

Quote

Conclusion
- LocationĀ 9236
-

Quote

Additional Resources
- LocationĀ 9245
-

Quote

Security, Privacy, and the Future of Data Engineering
- LocationĀ 9285
-

Quote

Security and Privacy
- LocationĀ 9287
-

Quote

The Power of Negative Thinking
- LocationĀ 9320
-

Quote

Always Be Paranoid
- LocationĀ 9329
-

Quote

Processes
- LocationĀ 9337
-

Quote

Security Theater Versus Security Habit
- LocationĀ 9341
-

Quote

Active Security
- LocationĀ 9351
-

Quote

Shared Responsibility in the Cloud
- LocationĀ 9371
-

Quote

Always Back Up Your Data
- LocationĀ 9378
-

Quote

An Example Security Policy
- LocationĀ 9387
-

Quote

Technology
- LocationĀ 9416
-

Quote

Patch and Update Systems
- LocationĀ 9420
-

Quote

Encryption
- LocationĀ 9426
-

Quote

Logging, Monitoring, and Alerting
- LocationĀ 9448
-

Quote

Network Access
- LocationĀ 9473
-

Quote

Security for Low-Level Data Engineering
- LocationĀ 9491
-

Quote

Conclusion
- LocationĀ 9509
-

Quote

Additional Resources
- LocationĀ 9512
-

Quote

The Future of Data Engineering
- LocationĀ 9518
-

Quote

The Data Engineering Lifecycle Isnā€™t Going Away
- LocationĀ 9534
-

Quote

The Decline of Complexity and the Rise of Easy-to-Use Data Tools
- LocationĀ 9543
-

Quote

The Cloud-Scale Data OS and Improved Interoperability
- LocationĀ 9572
-

Quote

ā€œEnterpriseyā€ Data Engineering
- LocationĀ 9617
-

Quote

Titles and Responsibilities Will Morph...
- LocationĀ 9630
-

Quote

Moving Beyond the Modern Data Stack, Toward the Live Data Stack
- LocationĀ 9652
-

Quote

The Live Data Stack
- LocationĀ 9674
-

Quote

Streaming Pipelines and Real-Time Analytical Databases
- LocationĀ 9687
-

Quote

The Fusion of Data with Applications
- LocationĀ 9722
-

Quote

The Tight Feedback Between Applications and ML
- LocationĀ 9731
-

Quote

Dark Matter Data and the Rise of...Spreadsheets?!
- LocationĀ 9742
-

Quote

Conclusion
- LocationĀ 9774
-

Quote

Serialization and Compression Technical Details
- LocationĀ 9805
-

Quote

Serialization Formats
- LocationĀ 9809
-

Quote

Row-Based Serialization
- LocationĀ 9817
-

Quote

Columnar Serialization
- LocationĀ 9849
-

Quote

Hybrid Serialization
- LocationĀ 9903
-

Quote

Database Storage Engines
- LocationĀ 9922
-

Quote

Compression: gzip, bzip2, Snappy, Etc.
- LocationĀ 9934
-

Quote

Cloud Networking
- LocationĀ 9956
-

Quote

Cloud Network Topology
- LocationĀ 9959
-

Quote

Data Egress Charges
- LocationĀ 9965
-

Quote

Availability Zones
- LocationĀ 9973
-

Quote

GCP-Specific Networking and Multiregional Redundancy
- LocationĀ 10000
-

Quote

Direct Network Connections to the Clouds
- LocationĀ 10016
-

Quote

The Future of Data Egress Fees
- LocationĀ 10029
-

Quote

About the Authors
- LocationĀ 11858
-

Quote

Joe Reis is a business-minded data nerd whoā€™s worked in the data industry for 20 years, with responsibilities ranging from statistical modeling, forecasting, machine learning, data engineering, data architecture, and almost everything else in between. Joe is the CEO and cofounder of Ternary Data, a data engineering and architecture consulting firm based in Salt Lake City, Utah. In addition, he volunteers with several technology groups and teaches at the University of Utah.
- LocationĀ 11859
-

Quote

Matt Housley is a data engineering consultant and cloud specialist. After some early programming experience with Logo, Basic, and 6502 assembly, he completed a PhD in mathematics at the University of Utah. Matt then began working in data science, eventually specializing in cloud-based data engineering. He cofounded Ternary Data with Joe Reis, where he leverages his teaching experience to train future data engineers and advise teams on robust data architecture. Matt and Joe also pontificate on all things data on The Monday Morning Data Chat.
- LocationĀ 11863
-


dg-publish: true
created: 2024-07-01
modified: 2024-07-01
title: Fundamentals of Data Engineering
source: kindle

@tags:: #litāœ/šŸ“šbook/highlights
@links:: data engineering, data governance, data management, data orchestration,
@ref:: Fundamentals of Data Engineering
@author:: Joe Reis and Matt Housley

=this.file.name

Book cover of "Fundamentals of Data Engineering"

Reference

Notes

Quote

What This Book Isnā€™t
- LocationĀ 104
-

Quote

What This Book Is About
- LocationĀ 109
-

Quote

Who Should Read This Book
- LocationĀ 128
-

Quote

Prerequisites
- LocationĀ 138
-

Quote

What Youā€™ll Learn and How It Will Improve Your Abilities
- LocationĀ 154
-

Quote

Navigating This Book
- LocationĀ 166
-

Quote

Conventions Used in This Book
- LocationĀ 199
-

Quote

Foundation and Building Blocks
- LocationĀ 250
-

Quote

Data Engineering Described
- LocationĀ 252
-

Quote

What Is Data Engineering?
- LocationĀ 256
-

Quote

Data Engineering Defined
- LocationĀ 287
-

Quote

a data engineer gets data, stores it, and prepares it for consumption by data scientists, analysts, and others.
- LocationĀ 288
-

Quote

Data engineering is the development, implementation, and maintenance of systems and processes that take in raw data and produce high-quality, consistent information that supports downstream use cases, such as analysis and machine learning. Data engineering is the intersection of security, data management, DataOps, data architecture, orchestration, and software engineering. A data engineer manages the data engineering lifecycle, beginning with getting data from source systems and ending with serving data for use cases, such as analysis or machine learning.
- LocationĀ 291
-
- [note::I'd be fun to make a Venn diagram of data engineering]

Quote

The Data Engineering Lifecycle
- LocationĀ 297
-

Quote

Figure 1-1. The data engineering lifecycle
- LocationĀ 301
-

Quote

Evolution of the Data Engineer
- LocationĀ 313
-

Quote

The term big data is essentially a relic to describe a particular time and approach to handling large amounts of data.
- LocationĀ 398
-

Quote

big data processing has become so accessible that it no longer merits a separate term; every company aims to solve its data problems, regardless of actual data size. Big data engineers are now simply data engineers.
- LocationĀ 400
-

Quote

Data engineering is increasingly a discipline of interoperation, and connecting various technologies like LEGO bricks, to serve ultimate business goals.
- LocationĀ 411
-

Quote

Figure 1-3. Matt Turckā€™s Data Landscape in 2012 versus 2021
- LocationĀ 413
-

Quote

As tools and workflows simplify, weā€™ve seen a noticeable shift in the attitudes of data engineers. Instead of focusing on who has the ā€œbiggest data,ā€ open source projects and services are increasingly concerned with managing and governing data, making it easier to use and discover, and improving its quality.
- LocationĀ 421
-

Quote

Data Engineering and Data Science
- LocationĀ 434
-

Quote

Figure 1-4. Data engineering sits upstream from data science
- LocationĀ 440
-

Quote

Figure 1-5. The Data Science Hierarchy of Needs
- LocationĀ 446
-

Quote

Rogati argues that companies need to build a solid data foundation (the bottom three levels of the hierarchy) before tackling areas such as AI and ML.
- LocationĀ 450
- artificial intelligence (ai), data hierarchy of needs, machine learning (ml),
- [note::Hierarchy = Data Science Hierarchy of Needs]

Quote

In an ideal world, data scientists should spend more than 90% of their time focused on the top layers of the pyramid: analytics, experimentation, and ML. When data engineers focus on these bottom parts of the hierarchy, they build a solid foundation for data scientists to succeed.
- LocationĀ 452
-

Quote

Data Engineering Skills and Activities
- LocationĀ 461
-

Quote

a data engineer juggles a lot of complex moving parts and must constantly optimize along the axes of cost, agility, scalability, simplicity, reuse, and interoperability
- LocationĀ 467
-

Quote

data engineers are now focused on balancing the simplest and most cost-effective, best-of-breed services that deliver value to the business.
- LocationĀ 476
-

Quote

A data engineer typically does not directly build ML models, create reports or dashboards, perform data analysis, build key performance indicators (KPIs), or develop software applications. A data engineer should have a good functioning understanding of these areas to serve stakeholders best.
- LocationĀ 478
-

Quote

Data Maturity and the Data Engineer
- LocationĀ 481
-

Quote

Data maturity is the progression toward higher data utilization, capabilities, and integration across the organization,
- LocationĀ 484
-

Quote

data maturity does not simply depend on the age or revenue of a company. An early-stage startup can have greater data maturity than a 100-year-old company with annual revenues in the billions. What matters is the way data is leveraged as a competitive advantage.
- LocationĀ 488
-

Quote

Stage 1: Starting with data
- LocationĀ 496
-

Quote

A data engineer should focus on the following in organizations getting started with data:
- Get buy-in from key stakeholders, including executive management. Ideally, the data engineer should have a sponsor for critical initiatives to design and build a data architecture to support the company’s goals.
- Define the right data architecture (usually solo, since a data architect likely isn’t available). This means determining business goals and the competitive advantage you’re aiming to achieve with your data initiative. Work toward a data architecture that supports these goals. See Chapter 3 for our advice on “good” data architecture.
- Identify and audit data that will support key initiatives and operate within the data architecture you designed.
- Build a solid data foundation for future data analysts and data scientists to generate reports and models that provide competitive value. In the meantime, you may also have to generate these reports and models until this team is hired.
- LocationĀ 508
-

Quote

This is a delicate stage with lots of pitfalls. Here are some tips for this stage:
- Organizational willpower may wane if a lot of visible successes don’t occur with data. Getting quick wins will establish the importance of data within the organization. Just keep in mind that quick wins will likely create technical debt. Have a plan to reduce this debt, as it will otherwise add friction for future delivery.
- Get out and talk to people, and avoid working in silos. We often see the data team working in a bubble, not communicating with people outside their departments and getting perspectives and feedback from business stakeholders. The danger is you’ll spend a lot of time working on things of little use to people.
- Avoid undifferentiated heavy lifting. Don’t box yourself in with unnecessary technical complexity. Use off-the-shelf, turnkey solutions wherever possible. Build custom solutions and code only where this creates a competitive advantage.
- LocationĀ 516
- data culture,
- [note::This stage = "Getting started with data"]

Quote

Stage 2: Scaling…
- LocationĀ 525
-

Quote

In organizations that are in stage 2 of data maturity, a data engineer’s goals are to do the following:
- Establish formal data practices
- Create scalable and robust data architectures
- Adopt DevOps and DataOps practices
- Build systems that support ML
- Continue to avoid…
- LocationĀ 528
-

Quote

Issues to watch out for include the following:
- As we grow more sophisticated with data, there’s a temptation to adopt bleeding-edge technologies based on social proof from Silicon Valley companies. This is rarely a good use of your time and energy. Any technology decisions should be driven by the value they’ll deliver to your customers.
- The main bottleneck for scaling is not cluster nodes, storage, or technology but the data engineering team. Focus on solutions that are simple to deploy and manage to expand your team’s throughput.
- You’ll be tempted to frame yourself as a technologist, a data genius who can deliver magical products. Shift your focus instead to pragmatic leadership…
- LocationĀ 533
-

Quote

Stage 3: Leading…
- LocationĀ 541
-

Quote

In organizations in stage 3 of data maturity, a data engineer will continue building on prior stages, plus they will do the following:
- Create automation for the seamless introduction and usage of new data
- Focus on building custom tools and systems that leverage data as a competitive advantage
- Focus on the “enterprisey” aspects of data, such as data management (including data governance and quality) and DataOps
- Deploy tools that expose and disseminate data throughout the organization, including data catalogs, data lineage tools, and metadata management systems
- Collaborate efficiently with software…
- LocationĀ 545
-

Quote

Issues to watch out for include the following:
- At this stage, complacency is a significant danger. Once organizations reach stage 3, they must constantly focus on maintenance and improvement or risk falling back to a lower stage.
- Technology distractions are a more significant danger here than in the other stages. There’s a temptation to pursue expensive hobby projects that don’t deliver…
- LocationĀ 552
-
- [note::"Utilize custom-built technology only where it provides a competitive advantage" is a recurring theme in each data maturity stage (1-3)]

Quote

The Background and Skills of a…
- LocationĀ 560
-

Quote

If you’re pivoting your career into data engineering, we’ve found that the transition is easiest when moving from an adjacent field, such as software engineering, ETL development, database administration, data science, or data analysis.
- LocationĀ 568
- career pivot, career, data engineering,

Quote

Zooming out, a data engineer must also understand the requirements of data consumers (data analysts and data scientists) and the broader implications of data across the organization. Data engineering is a holistic practice; the best data engineers view their responsibilities through business and technical lenses.
- LocationĀ 575
- data engineering, ankified,

Quote

Business Responsibilities
- LocationĀ 577
-

Quote

Know how to communicate with nontechnical and technical people.
- LocationĀ 581
-

Quote

We suggest paying close attention to organizational hierarchies, who reports to whom, how people interact, and which silos exist. These observations will be invaluable to your success.
- LocationĀ 582
- stakeholder mapping,
- [note::For any problem you're trying to solve, it helps to map the stakeholders who can influence or are influenced by your progress towards a solution.]

Quote

Understand how to scope and gather business and product requirements.
- LocationĀ 584
-

Quote

Data engineering is a holistic practice; the best data engineers view their responsibilities through business and technical lenses.
- LocationĀ 585
- data engineering,
- [note::Emphasis on and]

Quote

Many technologists mistakenly believe these practices are solved through technology. We feel this is dangerously wrong. Agile, DevOps, and DataOps are fundamentally cultural, requiring buy-in across the organization.
- LocationĀ 586
- work culture, devops, agile, dataops,

Quote

Know how to optimize for time to value, the total cost of ownership, and opportunity cost. Learn to monitor costs to avoid surprises.
- LocationĀ 589
-

Quote

success or failure is rarely a technology issue.
- LocationĀ 595
-
- [note::Stakeholder communication is paramount]

Quote

Technical Responsibilities
- LocationĀ 598
-

Quote

data engineers now focus on high-level abstractions or writing pipelines as code within an orchestration framework.
- LocationĀ 614
-

Quote

a data engineer who can’t write production-grade code will be severely hindered, and we don’t see this changing anytime soon.
- LocationĀ 617
-

Quote

At the time of this writing, the primary languages of data engineering are SQL, Python, a Java Virtual Machine (JVM) language (usually Java or Scala), and bash:
- LocationĀ 620
-

Quote

Understanding Java or Scala will be beneficial if youā€™re using a popular open source data framework.
- LocationĀ 633
-

Quote

Even today, data engineers frequently use command-line tools like awk or sed to process files in a data pipeline or call bash commands from orchestration frameworks. If youā€™re using Windows, feel free to substitute PowerShell for bash.
- LocationĀ 636
-

Quote

Data engineers also do well to develop expertise in composing SQL with other operations, either within frameworks such as Spark and Flink or by using orchestration to combine multiple tools. Data engineers should also learn modern SQL semantics for dealing with JavaScript Object Notation (JSON) parsing and nested data and consider leveraging a SQL management framework such as dbt (Data Build Tool).
- LocationĀ 645
- career capital, skills, sql,

Quote

Data engineers may also need to develop proficiency in secondary programming languages, including R, JavaScript, Go, Rust, C/C++, C#, and Julia. Developing in these languages is often necessary when they are popular across the company or used with domain-specific data tools. For instance, JavaScript has proven popular as a language for user-defined functions in cloud data warehouses. At the same time, C# and PowerShell are essential in companies that leverage Azure and the Microsoft ecosystem.
- LocationĀ 652
- programming languages, data engineering,

Quote

The Continuum of Data Engineering Roles, from A to B
- LocationĀ 665
-

Quote

Data Engineers Inside an Organization
- LocationĀ 689
-

Quote

Internal-Facing Versus External-Facing Data Engineers
- LocationĀ 693
-

Quote

Data Engineers and Other Technical Roles
- LocationĀ 718
-

Quote

Figure 1-12. Key technical stakeholders of data engineering
- LocationĀ 722
-

Quote

Upstream stakeholders
- LocationĀ 729
-

Quote

Data architects design the blueprint for organizational data management, mapping out processes and overall data architecture and systems. They also serve as a bridge between an organization’s technical and nontechnical sides.
- LocationĀ 735
-

Quote

Data architects implement policies for managing data across silos and business units, steer global strategies such as data management and data governance, and guide significant initiatives. Data architects often play a central role in cloud migrations and greenfield cloud design.
- LocationĀ 739
-

Quote

Nevertheless, data architects will remain influential visionaries in enterprises, working hand in hand with data engineers to determine the big picture of architecture practices and data strategies.
- LocationĀ 744
-

Quote

In well-run technical organizations, software engineers and data engineers coordinate from the inception of a new project to design application data for consumption by analytics and ML applications.
- LocationĀ 758
-

Quote

A data engineer should work together with software engineers to understand the applications that generate data, the volume, frequency, and format of the generated data, and anything else that will impact the data engineering lifecycle, such as data security and regulatory compliance. For example, this might mean setting upstream expectations on what the data software engineers need to do their jobs.
- LocationĀ 759
- data engineering,

Quote

Downstream stakeholders
- LocationĀ 769
-

Quote

If data engineers do their job and collaborate successfully, data scientists shouldn’t spend their time collecting, cleaning, and preparing data after initial exploratory work. Data engineers should automate this work as much as possible.
- LocationĀ 784
-

Quote

Whereas data scientists are forward-looking, a data analyst typically focuses on the past or present.
- LocationĀ 790
-

Quote

Data Engineers and Business Leadership
- LocationĀ 822
-

Quote

Data engineers often support data architects by acting as the glue between the business and data science/analytics.
- LocationĀ 826
-

Quote

Data in the C-suite
- LocationĀ 827
-

Quote

CIOs will work with engineers and architects to map out major initiatives and make strategic decisions on adopting major architectural elements, such as enterprise resource planning (ERP) and customer relationship management (CRM) systems, cloud migrations, data systems, and internal-facing IT.
- LocationĀ 846
-

Quote

A CTO owns the key technological strategy and architectures for external-facing applications, such as mobile, web apps, and IoT—all critical data sources for data engineers.
- LocationĀ 851
- business roles,

Quote

Data engineers and project managers
- LocationĀ 876
-

Quote

These large initiatives often benefit from project management (in contrast to product management, discussed next). Whereas data engineers function in an infrastructure and service delivery capacity, project managers direct traffic and serve as gatekeepers.
- LocationĀ 881
-

Quote

Data engineers and product managers
- LocationĀ 889
-

Quote

Data engineers and other management roles
- LocationĀ 896
-

Quote

For more information on data teams and how to structure them, we recommend John Thompson’s Building Analytics Teams (Packt) and Jesse Anderson’s Data Teams (Apress). Both books provide strong frameworks and perspectives on the roles of executives with data, who to hire, and how to construct the most effective data team for your company.
- LocationĀ 900
-

Quote

Conclusion
- LocationĀ 907
-

Quote

Additional Resources
- LocationĀ 917
-

Quote

“Big Data Will Be Dead in Five Years” by Lewis Gavin
- LocationĀ 920
-

Quote

“Data as a Product vs. Data as a Service” by Justin Gage
- LocationĀ 923
-

Quote

“The Downfall of the Data Engineer” by Maxime Beauchemin
- LocationĀ 928
-

Quote

“How Creating a Data-Driven Culture Can Drive Success” by Frederik Bussler
- LocationĀ 931
-

Quote

The Information Management Body of Knowledge website
- LocationĀ 933
-

Quote

“Information Management Body of Knowledge” Wikipedia page
- LocationĀ 934
-

Quote

“Information Management” Wikipedia page
- LocationĀ 935
-

Quote

“On Complexity in Big Data” by Jesse Anderson (O’Reilly)
- LocationĀ 936
-

Quote

“What Is a Data Architect? IT’s Data Framework Visionary” by Thor Olavsrud
- LocationĀ 943
-

Quote

The Data Engineering Lifecycle
- LocationĀ 980
-

Quote

data engineering lifecycle comprises stages that turn raw data ingredients into a useful end product, ready for consumption by analysts, data scientists, ML engineers, and others.
- LocationĀ 989
- data engineering,

Quote

What Is the Data Engineering Lifecycle?
- LocationĀ 989
-

Quote

Figure 2-1. Components and undercurrents of the data engineering lifecycle
- LocationĀ 996
-

Quote

The Data Lifecycle Versus the Data Engineering Lifecycle
- LocationĀ 1009
-

Quote

source system is the origin of the data used in the data engineering lifecycle. For example, a source system could be an IoT device, an application message queue, or a transactional database.
- LocationĀ 1018
-

Quote

Generation: Source Systems
- LocationĀ 1018
-

Quote

Engineers also need to keep an open line of communication with source system owners on changes that could break pipelines and analytics.
- LocationĀ 1025
-

Quote

The following is a starting set of evaluation questions of source systems that data engineers must consider:
- What are the essential characteristics of the data source? Is it an application? A swarm of IoT devices?
- How is data persisted in the source system? Is data persisted long term, or is it temporary and quickly deleted?
- At what rate is data generated? How many events per second? How many gigabytes per hour?
- What level of consistency can data engineers expect from the output data? If you’re running data-quality checks against the output data, how often do data inconsistencies occur—nulls where they aren’t expected, lousy formatting, etc.?
- How often do errors occur?
- Will the data contain duplicates?
- Will some data values arrive late, possibly much later than other messages produced simultaneously?
- What is the schema of the ingested data? Will data engineers need to join across several tables or even several systems to get a complete picture of the data?
- If schema changes (say, a new column is added), how is this dealt with and communicated to downstream stakeholders?
- How frequently should data be pulled from the source system?
- For stateful systems (e.g., a database tracking customer account information), is data provided as periodic snapshots or update events from change data capture (CDC)? What’s the logic for how changes are performed, and how are these tracked in the source database?
- Who/what is the data provider that will transmit the data for downstream consumption?
- Will reading from a data source impact its performance?
- Does the source system have upstream data dependencies? What are the characteristics of these upstream systems?
- Are data-quality checks in place to check for late or missing data?
- LocationĀ 1049
- considerations, data source systems,

Quote

A data engineer should know how the source generates data, including relevant quirks or nuances. Data engineers also need to understand the limits of the source systems they interact with.
- LocationĀ 1069
-

Quote

Storage
- LocationĀ 1089
-

Quote

few data storage solutions function purely as storage, with many supporting complex transformation queries; even object storage solutions may support powerful query capabilities—e.g., Amazon S3 Select.
- LocationĀ 1093
-

Quote

Here are a few key engineering questions to ask when choosing a storage system for a data warehouse, data lakehouse, database, or object storage:
- Is this storage solution compatible with the architecture’s required write and read speeds?
- Will storage create a bottleneck for downstream processes?
- Do you understand how this storage technology works? Are you utilizing the storage system optimally or committing unnatural acts? For instance, are you applying a high rate of random access updates in an object storage system? (This is an antipattern with significant performance overhead.)
- Will this storage system handle anticipated future scale? You should consider all capacity limits on the storage system: total available storage, read operation rate, write volume, etc.
- Will downstream users and processes be able to retrieve data in the required service-level agreement (SLA)?
- Are you capturing metadata about schema evolution, data flows, data lineage, and so forth? Metadata has a significant impact on the utility of data. Metadata represents an investment in the future, dramatically enhancing discoverability and institutional knowledge to streamline future projects and architecture changes.
- Is this a pure storage solution (object storage), or does it support complex query patterns (i.e., a cloud data warehouse)?
- Is the storage system schema-agnostic (object storage)? Flexible schema (Cassandra)? Enforced schema (a cloud data warehouse)?
- How are you tracking master data, golden records data quality, and data lineage for data governance? (We have more to say on these in “Data Management”.)
- How are you handling regulatory compliance and data sovereignty? For example, can you store your data in certain geographical locations but not others?
- LocationĀ 1102
-

Quote

Understanding data access frequency
- LocationĀ 1120
-

Quote

Data access frequency will determine the temperature of your data. Data that is most frequently accessed is called hot data. Hot data is commonly retrieved many times per day, perhaps even several times per second—for example, in systems that serve user requests. This data should be stored for fast retrieval, where “fast” is relative to the use case. Lukewarm data might be accessed every so often—say, every week or month. Cold data is seldom queried and is appropriate for storing in an archival system. Cold data is often retained for compliance purposes or in case of a catastrophic failure in another system.
- LocationĀ 1123
-
- [note::Data Temperature = Frequency of Data Retrieval]
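The temperature metaphor lends itself to a simple tiering rule. Below is a minimal Python sketch, not from the book: the function name and the access thresholds are hypothetical and would need tuning to a real workload and SLA.

```python
# Hypothetical thresholds: tune these to your own workload and SLAs.
HOT_ACCESSES_PER_DAY = 1      # hot data: retrieved at least daily
WARM_ACCESSES_PER_MONTH = 1   # warm data: retrieved at least monthly

def classify_temperature(accesses_last_30_days: int) -> str:
    """Bucket a dataset by recent access frequency (its 'temperature')."""
    if accesses_last_30_days >= 30 * HOT_ACCESSES_PER_DAY:
        return "hot"    # fast storage: SSD, in-memory cache
    if accesses_last_30_days >= WARM_ACCESSES_PER_MONTH:
        return "warm"   # standard object storage
    return "cold"       # archival tier, retained for compliance/recovery
```

In practice a rule like this might drive lifecycle policies that migrate objects between storage classes automatically.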

Quote

Selecting a storage system
- LocationĀ 1136
-

Quote

Ingestion
- LocationĀ 1144
-

Quote

source systems and ingestion represent the most significant bottlenecks of the data engineering lifecycle.
- LocationĀ 1148
-

Quote

Key engineering considerations for the ingestion phase
- LocationĀ 1153
-

Quote

When preparing to architect or build a system, here are some primary questions about the ingestion stage:
- What are the use cases for the data I’m ingesting? Can I reuse this data rather than create multiple versions of the same dataset?
- Are the systems generating and ingesting this data reliably, and is the data available when I need it?
- What is the data destination after ingestion?
- How frequently will I need to access the data?
- In what volume will the data typically arrive?
- What format is the data in? Can my downstream storage and transformation systems handle this format?
- Is the source data in good shape for immediate downstream use? If so, for how long, and what may cause it to be unusable?
- If the data is from a streaming source, does it need to be transformed before reaching its destination? Would an in-flight transformation be appropriate, where the data is transformed within the stream itself?
- LocationĀ 1154
-

Quote

Batch versus streaming
- LocationĀ 1167
-

Quote

Virtually all data we deal with is inherently streaming. Data is nearly always produced and updated continually at its source.
- LocationĀ 1168
-

Quote

Batch ingestion is simply a specialized and convenient way of processing this stream in large chunksā€”for example, handling a full dayā€™s worth of data in a single batch.
- LocationĀ 1170
-
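The "batch is just a chunked stream" idea can be sketched in a few lines of Python. The `micro_batches` helper below is illustrative, not from any framework: make the window very large (a day of events) and you have classic batch ingestion; make it small and you have micro-batching.

```python
from typing import Iterable, Iterator, List

def micro_batches(stream: Iterable[dict], batch_size: int) -> Iterator[List[dict]]:
    """Group a continuous stream of events into fixed-size chunks.

    Batch ingestion is this same pattern with a very large window,
    e.g., one chunk per day instead of one per N events.
    """
    batch: List[dict] = []
    for event in stream:
        batch.append(event)
        if len(batch) >= batch_size:
            yield batch
            batch = []
    if batch:  # flush the final partial chunk
        yield batch
```

Real systems typically window by time rather than by count, but the shape of the logic is the same.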

Quote

Key considerations for batch versus stream ingestion
- LocationĀ 1184
-

Quote

The following are some questions to ask yourself when determining whether streaming ingestion is an appropriate choice over batch ingestion: If I ingest the data in real time, can downstream storage systems handle the rate of data flow? Do I need millisecond real-time data ingestion? Or would a micro-batch approach work, accumulating and ingesting data, say, every minute? What are my use cases for streaming ingestion? What specific benefits do I realize by implementing streaming? If I get data in real time, what actions can I take on that data that would be an improvement upon batch? Will my streaming-first approach cost more in terms of time, money, maintenance, downtime, and opportunity cost than simply doing batch? Are my streaming pipeline and system reliable and redundant if infrastructure fails? What tools are most appropriate for the use case? Should I use a managed service (Amazon Kinesis, Google Cloud Pub/Sub, Google Cloud Dataflow) or stand up my own instances of Kafka, Flink, Spark, Pulsar, etc.? If I do the latter, who will manage it? What are the costs and trade-offs? If Iā€™m deploying an ML model, what benefits do I have with online predictions and possibly continuous training? Am I getting data from a live production instance? If so, whatā€™s the impact of my ingestion process on this source system?
- LocationĀ 1186
-

Quote

In the push model of data ingestion, a source system writes data out to a target, whether a database, object store, or filesystem. In the pull model, data is retrieved from the source system.
- LocationĀ 1202
- data ingestion,

Quote

Push versus pull
- LocationĀ 1202
-
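The difference between the two models comes down to which side initiates the transfer. A toy sketch, with hypothetical class names rather than real APIs:

```python
class Target:
    """A stand-in for a database, object store, or filesystem."""
    def __init__(self):
        self.records = []
    def receive(self, record):          # push model: the source calls this
        self.records.append(record)

class SourceSystem:
    def __init__(self, rows):
        self._rows = rows
    def push_to(self, target: Target):  # push: source initiates the write
        for row in self._rows:
            target.receive(row)
    def fetch(self):                    # pull: the ingestion process initiates the read
        return list(self._rows)

source = SourceSystem([{"id": 1}, {"id": 2}])

push_target = Target()
source.push_to(push_target)   # push model
pulled = source.fetch()       # pull model: same data, opposite initiator
```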

Quote

continuous CDC,
- LocationĀ 1213
-
- [note::What does CDC stand for? -> "Change Data Capture"]

Quote

Transformation
- LocationĀ 1225
-

Quote

Immediately after ingestion, basic transformations map data into correct types (changing ingested string data into numeric and date types, for example), putting records into standard formats, and removing bad ones. Later stages of transformation may transform the data schema and apply normalization. Downstream, we can apply large-scale aggregation for reporting or featurize data for ML processes.
- LocationĀ 1231
-
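Those early-stage basics (casting strings to proper types, standardizing records, dropping bad ones) might look like the following Python sketch. The field names and schema here are hypothetical, purely for illustration.

```python
from datetime import date
from typing import Optional

def clean_record(raw: dict) -> Optional[dict]:
    """Cast ingested string fields to proper types; return None for bad records."""
    try:
        return {
            "order_id": int(raw["order_id"]),
            "amount": float(raw["amount"]),
            "order_date": date.fromisoformat(raw["order_date"]),
        }
    except (KeyError, ValueError):
        return None  # drop records that fail basic validation

raw_rows = [
    {"order_id": "42", "amount": "19.99", "order_date": "2024-01-15"},
    {"order_id": "oops", "amount": "?", "order_date": "2024-01-15"},  # bad record
]
clean_rows = [r for r in (clean_record(row) for row in raw_rows) if r is not None]
```

In a production pipeline, rejected records would usually be routed to a dead-letter location for inspection rather than silently discarded.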

Quote

Key considerations for the transformation phase
- LocationĀ 1234
-

Quote

When considering data transformations within the data engineering lifecycle, it helps to consider the following:
- What’s the cost and return on investment (ROI) of the transformation? What is the associated business value?
- Is the transformation as simple and self-isolated as possible?
- What business rules do the transformations support?
- LocationĀ 1235
- data transformation,

Quote

Serving Data
- LocationĀ 1263
-

Quote

Analytics
- LocationĀ 1276
-

Quote

Figure 2-5. Types of analytics
- LocationĀ 1282
-

Quote

Multitenancy
- LocationĀ 1314
-

Quote

Machine learning
- LocationĀ 1321
-

Quote

The following are some considerations for the serving data phase specific to ML:
- Is the data of sufficient quality to perform reliable feature engineering? Quality requirements and assessments are developed in close collaboration with teams consuming the data.
- Is the data discoverable? Can data scientists and ML engineers easily find valuable data?
- Where are the technical and organizational boundaries between data engineering and ML engineering? This organizational question has significant architectural implications.
- Does the dataset properly represent ground truth? Is it unfairly biased?
- LocationĀ 1340
-

Quote

Reverse ETL
- LocationĀ 1349
-

Quote

Major Undercurrents Across the Data Engineering Lifecycle
- LocationĀ 1370
-

Quote

Figure 2-7. The major undercurrents of data engineering
- LocationĀ 1379
-

Quote

Security
- LocationĀ 1381
-

Quote

The principle of least privilege means giving a user or system access to only the essential data and resources to perform an intended function. A common antipattern we see with data engineers with little security experience is to give admin access to all users. This is a catastrophe waiting to happen!
- LocationĀ 1386
- data security,
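At its core, least privilege is deny-by-default with narrow, explicit grants. A rough illustration in Python; the roles and grants below are hypothetical, not from any specific access-control system:

```python
# Each role is granted only the (resource, action) pairs it needs.
# Notably, no role gets blanket admin access.
GRANTS = {
    "analyst":      {("sales_db", "read")},
    "pipeline_etl": {("sales_db", "read"), ("warehouse", "write")},
}

def is_allowed(role: str, resource: str, action: str) -> bool:
    """Allow only explicitly granted pairs; everything else is denied."""
    return (resource, action) in GRANTS.get(role, set())
```

Real systems express this through IAM policies, database grants, and the like, but the deny-by-default posture is the same.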

Quote

Data Management
- LocationĀ 1403
-

Quote

The Data Management Association International (DAMA) Data Management Body of Knowledge (DMBOK), which we consider to be the definitive book for enterprise data management, offers this definition: Data management is the development, execution, and supervision of plans, policies, programs, and practices that deliver, control, protect, and enhance the value of data and information assets throughout their lifecycle.
- LocationĀ 1410
-

Quote

data governance engages people, processes, and technologies to maximize data value across an organization while protecting data with appropriate security controls.
- LocationĀ 1438
-

Quote

The core categories of data governance are discoverability, security, and accountability. Within these core categories are subcategories, such as data quality, metadata, and privacy.
- LocationĀ 1450
-

Quote

In a data-driven company, data must be available and discoverable. End users should have quick and reliable access to the data they need to do their jobs. They should know where the data comes from, how it relates to other data, and what the data means.
- LocationĀ 1453
-

Quote

We divide metadata into two major categories: autogenerated and human generated. Modern data engineering revolves around automation, but metadata collection is often manual and error prone.
- LocationĀ 1462
- data quality, metadata,

Quote

Metadata tools are only as good as their connectors to data systems and their ability to share metadata.
- LocationĀ 1470
-

Quote

Documentation and internal wiki tools provide a key foundation for metadata management, but these tools should also integrate with automated data cataloging. For example, data-scanning tools can generate wiki pages with links to relevant data objects.
- LocationĀ 1477
-

Quote

DMBOK identifies four main categories of metadata that are useful to data engineers:
- Business metadata
- Technical metadata
- Operational metadata
- Reference metadata
- LocationĀ 1481
-

Quote

Business metadata relates to the way data is used in the business, including business and data definitions, data rules and logic, how and where data is used, and the data owner(s). A data engineer uses business metadata to answer nontechnical questions about who, what, where, and how. For example, a data engineer may be tasked with creating a data pipeline for customer sales analysis. But what is a customer? Is it someone who’s purchased in the last 90 days? Or someone who’s purchased at any time the business has been open? A data…
- LocationĀ 1484
- metadata, business metadata,

Quote

Technical metadata describes the data created and used by systems across the data engineering lifecycle. It includes the data model and schema, data lineage, field mappings, and pipeline workflows. A data engineer uses technical metadata to create, connect, and monitor various systems across the data engineering lifecycle. Here are some common types of technical metadata that a data…
- LocationĀ 1491
- metadata,

Quote

Pipeline metadata captured in orchestration systems provides details of the workflow schedule, system and data dependencies, configurations…
- LocationĀ 1497
- data orchestration, data pipelines,

Quote

Data-lineage metadata tracks the origin and changes to data, and its dependencies, over time. As data flows through the data engineering lifecycle, it evolves through transformations and combinations with other data. Data lineage provides an audit trail of…
- LocationĀ 1500
-

Quote

Schema metadata describes the structure of data stored in a system such as a database, a data warehouse, a…
- LocationĀ 1503
-

Quote

Operational metadata describes the operational results of various systems and includes statistics about processes, job IDs, application runtime logs, data used in a process, and error logs. A data engineer uses operational metadata to determine whether a…
- LocationĀ 1508
-

Quote

Reference metadata is data used to classify other data. This is also referred to as lookup data. Standard examples of reference data are internal codes, geographic codes, units of measurement, and internal calendar standards. Note that much of reference data is fully managed internally, but items such as geographic codes might come from standard external references. Reference data is…
- LocationĀ 1514
-

Quote

Data accountability means assigning an individual to govern a portion of data. The responsible person then coordinates the governance…
- LocationĀ 1522
-

Quote

Data quality is the optimization of data toward the desired state and orbits the question, “What do you get compared with what you expect?” Data should conform to the expectations in the business metadata. Does the data match the definition agreed upon by the business?
- LocationĀ 1535
-

Quote

According to Data Governance: The Definitive Guide, data quality is defined by three main characteristics:
- Accuracy: Is the collected data factually correct? Are there duplicate values? Are the numeric values accurate?
- Completeness: Are the records complete? Do all required fields contain valid values?
- Timeliness: Are records available in a timely fashion?
- LocationĀ 1542
- data quality,
- [note::Data quality is characterized by:

  1. Accuracy
  2. Completeness
  3. Timeliness]
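A minimal check for each of the three characteristics might look like the Python sketch below. The required fields, the duplicate-key test (standing in for "accuracy"), and the 24-hour lateness SLA are all assumptions for illustration.

```python
from datetime import datetime, timedelta

# Hypothetical data contract: required fields and an arrival SLA.
REQUIRED_FIELDS = {"user_id", "email", "created_at"}
MAX_LATENESS = timedelta(hours=24)

def quality_report(records, now):
    """Score a batch along accuracy, completeness, and timeliness."""
    complete = [
        r for r in records
        if REQUIRED_FIELDS <= r.keys()
        and all(r[f] is not None for f in REQUIRED_FIELDS)
    ]
    # Accuracy (one narrow proxy): no duplicate primary keys.
    ids = [r["user_id"] for r in complete]
    no_duplicates = len(ids) == len(set(ids))
    # Timeliness: every complete record arrived within the SLA window.
    timely = all(now - r["created_at"] <= MAX_LATENESS for r in complete)
    return {
        "completeness": len(complete) / len(records) if records else 1.0,
        "accuracy_no_duplicates": no_duplicates,
        "timeliness": timely,
    }
```

A report like this would typically run per batch and feed alerting, so quality regressions surface before downstream consumers notice.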

Quote

Master data is data about business entities such as employees, customers, products, and locations.
- LocationĀ 1560
-
- [note::Do people in the field have qualms about the use of "master" in this term?]

Quote

Master data management (MDM) is the practice of building consistent entity definitions known as golden records. Golden records harmonize entity data across an organization and with its partners.
- LocationĀ 1565
-
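A toy sketch of harmonizing duplicate entity records into a golden record. The merge rule here (latest non-null value wins) is an assumed policy for illustration; real MDM systems apply much richer survivorship rules:

```python
def golden_record(duplicates):
    """Merge duplicate entity records into one golden record.
    Assumed rule: the most recently updated non-null value wins."""
    merged = {}
    for rec in sorted(duplicates, key=lambda r: r["updated_at"]):
        for field, value in rec.items():
            if value is not None:
                merged[field] = value  # later records overwrite earlier ones
    return merged

# Two versions of the same customer from different source systems.
customer_versions = [
    {"customer_id": 42, "name": "A. Smith", "phone": None, "updated_at": 1},
    {"customer_id": 42, "name": "Alice Smith", "phone": "555-0100", "updated_at": 2},
]
print(golden_record(customer_versions))
# {'customer_id': 42, 'name': 'Alice Smith', 'phone': '555-0100', 'updated_at': 2}
```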

Quote

Data modeling and design
- LocationĀ 1575
-

Quote

Whereas we traditionally think of data modeling as a problem for database administrators (DBAs) and ETL developers, data modeling can happen almost anywhere in an organization.
- LocationĀ 1580
-

Quote

Data engineers need to understand modeling best practices as well as develop the flexibility to apply the appropriate level and type of modeling to the data source and use case.
- LocationĀ 1591
-

Quote

Data lineage
- LocationĀ 1593
-

Quote

Data lineage describes the recording of an audit trail of data through its lifecycle, tracking both the systems that process the data and the upstream data it depends on.
- LocationĀ 1597
- blue,

Quote

We also note that Andy Petrellaā€™s concept of Data Observability Driven Development (DODD) is closely related to data lineage. DODD observes data all along its lineage. This process is applied during development, testing, and finally production to deliver quality and conformity to expectations.
- LocationĀ 1603
-

Quote

Data integration and interoperability is the process of integrating data across tools and processes. As we move away from a single-stack
- LocationĀ 1608
- blue,

Quote

Data integration and interoperability
- LocationĀ 1608
-

Quote

Increasingly, integration happens through general-purpose APIs rather than custom database connections. For example, a data pipeline might pull data from the Salesforce API, store it to Amazon S3, call the Snowflake API to load it into a table, call the API again to run a query, and then export the results to S3 where Spark can consume them.
- LocationĀ 1614
-
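The pipeline described above is just a composition of API calls. A sketch of that shape with stand-in functions — none of these are real Salesforce, S3, or Snowflake SDK calls, only hypothetical placeholders showing how the hops chain together:

```python
# All three functions are stand-ins, not real vendor SDK calls.
def extract_from_api():
    """e.g., pull records from a SaaS REST API."""
    return [{"id": 1, "amount": 120.0}, {"id": 2, "amount": 75.5}]

def stage_to_object_storage(records):
    """e.g., write the records as a JSON file to an object-store bucket."""
    return "s3://bucket/raw/2024-01-01.json", records

def load_into_warehouse(path, records):
    """e.g., ask the warehouse to COPY the staged file into a table."""
    return {"table": "orders", "rows_loaded": len(records), "source": path}

# The pipeline is the composition of the hops; each hop is a plain API call.
path, recs = stage_to_object_storage(extract_from_api())
print(load_into_warehouse(path, recs))
```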

Quote

Data lifecycle management
- LocationĀ 1622
-

Quote

This means we have pay-as-you-go storage costs instead of large up-front capital expenditures for an on-premises data lake. When every byte shows up on a monthly AWS statement, CFOs see opportunities for savings.
- LocationĀ 1627
-

Quote

Ethics and privacy
- LocationĀ 1638
-

Quote

Data used to live in the Wild West, freely collected and traded like baseball cards. Those days are long gone. Whereas dataā€™s ethical and privacy implications were once considered nice to have, like security, theyā€™re now central to the general data lifecycle. Data engineers need to do the right thing when no one else is watching, because everyone will be watching someday.
- LocationĀ 1643
-

Quote

Ensure that your data assets are compliant with a growing number of data regulations, such as GDPR and CCPA. Please take this seriously. We offer tips throughout the book to ensure that youā€™re baking ethics and privacy into the data engineering lifecycle.
- LocationĀ 1649
-

Quote

DataOps maps the best practices of Agile methodology, DevOps, and statistical process control (SPC) to data. Whereas DevOps aims to improve the release and quality of software products, DataOps does the same thing for data products.
- LocationĀ 1653
-

Quote

DataOps
- LocationĀ 1653
-

Quote

Like DevOps, DataOps borrows much from lean manufacturing and supply chain management, mixing people, processes, and technology to reduce time to value. As Data Kitchen (experts in DataOps) describes it:7 DataOps is a collection of technical practices, workflows, cultural norms, and architectural patterns that enable:
- Rapid innovation and experimentation delivering new insights to customers with increasing velocity
- Extremely high data quality and very low error rates
- Collaboration across complex arrays of people, technology, and environments
- Clear measurement, monitoring, and transparency of results
- LocationĀ 1660
- devops, dataops,

Quote

Observability and monitoring
- LocationĀ 1712
-

Quote

As we tell our clients, ā€œData is a silent killer.ā€ Weā€™ve seen countless examples of bad data lingering in reports for months or years. Executives may make key decisions from this bad data, discovering the error only much later. The outcomes are usually bad and sometimes catastrophic for the business. Initiatives are undermined and destroyed, years of work wasted. In some of the worst cases, bad data may lead companies to financial ruin.
- LocationĀ 1713
-

Quote

Observability, monitoring, logging, alerting, and tracing are all critical to getting ahead of any problems along the data engineering lifecycle. We recommend you incorporate SPC to understand whether events being monitored are out of line and which incidents are worth responding to.
- LocationĀ 1723
-
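The SPC idea above can be sketched with simple Shewhart-style control limits: flag a monitored metric only when it falls outside mean ± 3σ of its history. The row counts and the 3σ threshold are illustrative assumptions:

```python
import statistics

def control_limits(history, sigmas=3):
    """Compute control limits (mean +/- sigmas * stdev) from historical values."""
    mean = statistics.mean(history)
    sd = statistics.stdev(history)
    return mean - sigmas * sd, mean + sigmas * sd

def out_of_control(value, history):
    """True when the new observation is worth responding to."""
    lo, hi = control_limits(history)
    return not (lo <= value <= hi)

# Daily row counts from a pipeline; a sudden drop should trip an alert,
# while normal day-to-day variation should not.
row_counts = [1000, 1010, 990, 1005, 995, 1002, 998]
print(out_of_control(400, row_counts))   # True  -> incident worth responding to
print(out_of_control(1001, row_counts))  # False -> normal variation
```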

Quote

The purpose of DODD is to give everyone involved in the data chain visibility into the data and data applications so that everyone involved in the data value chain has the ability to identify changes to the data or data applications at every stepā€”from ingestion to transformation to analysisā€”to help troubleshoot or prevent data issues. DODD focuses on making data observability a first-class consideration in the data engineering lifecycle.
- LocationĀ 1731
-

Quote

Incident response
- LocationĀ 1736
-

Quote

Trust takes a long time to build and can be lost in minutes. Incident response is as much about retroactively responding to incidents as proactively addressing them before they happen.
- LocationĀ 1748
-

Quote

DataOps summary
- LocationĀ 1750
-

Quote

Data Architecture
- LocationĀ 1761
-

Quote

Orchestration
- LocationĀ 1774
-

Quote

Orchestration is the process of coordinating many jobs to run as quickly and efficiently as possible on a scheduled cadence.
- LocationĀ 1782
- blue,
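At its core, an orchestrator runs jobs in dependency order. A toy illustration of that ordering using only the standard library (the task names and DAG are made up; real orchestrators such as Airflow add scheduling, retries, and monitoring on top of this):

```python
from graphlib import TopologicalSorter

# A toy DAG: each task maps to the set of upstream tasks it depends on.
dag = {
    "extract": set(),
    "transform": {"extract"},
    "quality_check": {"transform"},
    "load": {"transform"},
    "notify": {"quality_check", "load"},
}

# An orchestrator must run tasks in an order that respects the dependencies.
run_order = list(TopologicalSorter(dag).static_order())
print(run_order)  # "extract" first, "notify" last
```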

Quote

Software Engineering
- LocationĀ 1812
-

Quote

Core data processing code
- LocationĀ 1821
-

Quote

Whether in ingestion, transformation, or data serving, data engineers need to be highly proficient and productive in frameworks and languages such as Spark, SQL, or Beam;
- LocationĀ 1823
-

Quote

Itā€™s also imperative that a data engineer understand proper code-testing methodologies, such as unit, regression, integration, end-to-end, and smoke.
- LocationĀ 1825
-
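A minimal example of the first methodology on that list, a unit test: one small transformation function exercised in isolation. The function and its cases are illustrative assumptions:

```python
import unittest

def normalize_email(raw):
    """Tiny transformation under test: trim and lowercase an email address."""
    return raw.strip().lower()

class NormalizeEmailTest(unittest.TestCase):
    """A unit test: checks one function, with no external dependencies."""

    def test_trims_and_lowercases(self):
        self.assertEqual(normalize_email("  Alice@Example.COM "),
                         "alice@example.com")

    def test_idempotent(self):
        once = normalize_email("Bob@x.com")
        self.assertEqual(normalize_email(once), once)

unittest.main(argv=["ignored"], exit=False)
```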

Quote

Development of open source frameworks
- LocationĀ 1827
-

Quote

Infrastructure as code
- LocationĀ 1851
-

Quote

Pipelines as code
- LocationĀ 1862
-

Quote

General-purpose problem solving
- LocationĀ 1870
-

Quote

In practice, regardless of which high-level tools they adopt, data engineers will run into corner cases throughout the data engineering lifecycle that require them to solve problems outside the boundaries of their chosen tools and to write custom code. When using frameworks like Fivetran, Airbyte, or Matillion, data engineers will encounter data sources without existing connectors and need to write something custom. They should be proficient in software engineering to understand APIs, pull and transform data, handle exceptions, and so forth.
- LocationĀ 1871
-
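One recurring piece of that custom code is exception handling around an unreliable source. A sketch of retry-with-backoff for a source without an existing connector — the error type, flaky source, and backoff parameters are all assumptions for illustration:

```python
import time

class TransientAPIError(Exception):
    """Stand-in for a retryable failure, e.g., a rate limit or timeout."""

def fetch_with_retry(fetch, max_attempts=3, base_delay=0.01):
    """Call an extract function, retrying transient failures with backoff."""
    for attempt in range(1, max_attempts + 1):
        try:
            return fetch()
        except TransientAPIError:
            if attempt == max_attempts:
                raise  # give up; let orchestration and alerting take over
            time.sleep(base_delay * 2 ** (attempt - 1))

calls = {"n": 0}
def flaky_source():
    """Stands in for an HTTP call to a source with no managed connector."""
    calls["n"] += 1
    if calls["n"] < 3:
        raise TransientAPIError("rate limited")
    return [{"id": 1}]

print(fetch_with_retry(flaky_source))  # succeeds after two retried failures
```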

Quote

Conclusion
- LocationĀ 1876
-

Quote

A data engineer has several top-level goals across the data lifecycle: produce optimum ROI and reduce costs (financial and opportunity), reduce risk (security, data quality), and maximize data value and utility.
- LocationĀ 1886
- data lifecycle, data engineering,

Quote

Additional Resources
- LocationĀ 1891
-

Quote

ā€œDemocratizing Data at Airbnbā€ by Chris Williams et al.
- LocationĀ 1898
-

Quote

ā€œFive Steps to Begin Collecting the Value of Your Dataā€ Lean-Data web page
- LocationĀ 1899
-

Quote

ā€œGetting Started with DevOps Automationā€ by Jared Murrell
- LocationĀ 1900
-

Quote

ā€œStaying Ahead of Debtā€ by Etai Mizrahi
- LocationĀ 1907
-

Quote

ā€œWhat Is Metadataā€ by Michelle Knight
- LocationĀ 1908
-

Quote

3 Chris Williams et al., ā€œDemocratizing Data at Airbnb,ā€ The Airbnb Tech Blog, May 12, 2017, https://oreil.ly/dM332.
- LocationĀ 1913
-

Quote

ethical behavior is doing the right thing when no one is watching,
- LocationĀ 1921
-

Quote

ā€œWhat Is DataOps,ā€ DataKitchen FAQ page, accessed May 5, 2022, https://oreil.ly/Ns06w.
- LocationĀ 1923
-

Quote

Designing Good Data Architecture
- LocationĀ 1931
-

Quote

What Is Data Architecture?
- LocationĀ 1936
-

Quote

Enterprise Architecture Defined
- LocationĀ 1947
-

Quote

Figure 3-1. Data architecture is a subset of enterprise architecture
- LocationĀ 1952
-

Quote

TOGAF is The Open Group Architecture Framework, a standard of The Open Group. Itā€™s touted as the most widely used architecture framework today.
- LocationĀ 1962
-

Quote

Gartner is a global research and advisory company that produces research articles and reports on trends related to enterprises. Among other things, it is responsible for the (in)famous Gartner Hype Cycle.
- LocationĀ 1971
-

Quote

Data Architecture Defined
- LocationĀ 2031
-

Quote

ā€œGoodā€ Data Architecture
- LocationĀ 2068
-

Quote

Principles of Good Data Architecture
- LocationĀ 2090
-

Quote

Principle 1: Choose Common Components Wisely
- LocationĀ 2115
-

Quote

Principle 2: Plan for Failure
- LocationĀ 2138
-

Quote

Principle 3: Architect for Scalability
- LocationĀ 2160
-

Quote

Principle 4: Architecture Is Leadership
- LocationĀ 2175
-

Quote

Principle 5: Always Be Architecting
- LocationĀ 2195
-

Quote

Principle 6: Build Loosely Coupled Systems
- LocationĀ 2214
-

Quote

Principle 7: Make Reversible Decisions
- LocationĀ 2255
-

Quote

Principle 8: Prioritize Security
- LocationĀ 2272
-

Quote

Principle 9: Embrace FinOps
- LocationĀ 2322
-

Quote

Major Architecture Concepts
- LocationĀ 2370
-

Quote

Domains and Services
- LocationĀ 2376
-

Quote

Distributed Systems, Scalability, and Designing for Failure
- LocationĀ 2400
-

Quote

Tight Versus Loose Coupling: Tiers, Monoliths, and Microservices
- LocationĀ 2440
-

Quote

User Access: Single Versus Multitenant
- LocationĀ 2553
-

Quote

Event-Driven Architecture
- LocationĀ 2570
-

Quote

Brownfield Versus Greenfield Projects
- LocationĀ 2591
-

Quote

Examples and Types of Data Architecture
- LocationĀ 2630
-

Quote

Data Warehouse
- LocationĀ 2635
-

Quote

Data Lake
- LocationĀ 2713
-

Quote

Convergence, Next-Generation Data Lakes, and the Data Platform
- LocationĀ 2745
-

Quote

Modern Data Stack
- LocationĀ 2767
-

Quote

Lambda Architecture
- LocationĀ 2788
-

Quote

Kappa Architecture
- LocationĀ 2807
-

Quote

The Dataflow Model and Unified Batch and Streaming
- LocationĀ 2821
-

Quote

Architecture for IoT
- LocationĀ 2842
-

Quote

Data Mesh
- LocationĀ 2905
-

Quote

Other Data Architecture Examples
- LocationĀ 2927
-

Quote

Whoā€™s Involved with Designing a Data Architecture?
- LocationĀ 2940
-

Quote

Additional Resources
- LocationĀ 2959
-

Quote

Choosing Technologies Across the Data Engineering Lifecycle
- LocationĀ 3111
-

Quote

Team Size and Capabilities
- LocationĀ 3146
-

Quote

Speed to Market
- LocationĀ 3163
-

Quote

Cost Optimization and Business Value
- LocationĀ 3190
-

Quote

Total Cost of Ownership
- LocationĀ 3199
-

Quote

Total Opportunity Cost of Ownership
- LocationĀ 3229
-

Quote

Today Versus the Future: Immutable Versus Transitory Technologies
- LocationĀ 3263
-

Quote

Our Advice
- LocationĀ 3300
-

Quote

On Premises
- LocationĀ 3317
-

Quote

Hybrid Cloud
- LocationĀ 3425
-

Quote

Decentralized: Blockchain and the Edge
- LocationĀ 3465
-

Quote

Our Advice
- LocationĀ 3473
-

Quote

Cloud Repatriation Arguments
- LocationĀ 3499
-

Quote

Build Versus Buy
- LocationĀ 3555
-

Quote

Open Source Software
- LocationĀ 3578
-

Quote

Proprietary Walled Gardens
- LocationĀ 3649
-

Quote

Our Advice
- LocationĀ 3686
-

Quote

Monolith Versus Modular
- LocationĀ 3699
-

Quote

The Distributed Monolith Pattern
- LocationĀ 3757
-

Quote

Our Advice
- LocationĀ 3774
-

Quote

Serverless Versus Servers
- LocationĀ 3785
-

Quote

How to Evaluate Server Versus Serverless
- LocationĀ 3835
-

Quote

Our Advice
- LocationĀ 3855
-

Quote

Optimization, Performance, and the Benchmark Wars
- LocationĀ 3873
-

Quote

Big Data...for the 1990s
- LocationĀ 3894
-

Quote

Nonsensical Cost Comparisons
- LocationĀ 3901
-

Quote

Asymmetric Optimization
- LocationĀ 3906
-

Quote

Caveat Emptor
- LocationĀ 3912
-

Quote

Undercurrents and Their Impacts on Choosing Technologies
- LocationĀ 3914
-

Quote

Data Management
- LocationĀ 3920
-

Quote

Data Architecture
- LocationĀ 3940
-

Quote

Orchestration Example: Airflow
- LocationĀ 3946
-

Quote

Software Engineering
- LocationĀ 3966
-

Quote

Additional Resources
- LocationĀ 3981
-

Quote

The Data Engineering Lifecycle in Depth
- LocationĀ 4028
-

Quote

Data Generation in Source Systems
- LocationĀ 4029
-

Quote

Sources of Data: How Is Data Created?
- LocationĀ 4043
-

Quote

Source Systems: Main Ideas
- LocationĀ 4061
-

Quote

Files and Unstructured Data
- LocationĀ 4064
-

Quote

Application Databases (OLTP Systems)
- LocationĀ 4085
-

Quote

Online Analytical Processing System
- LocationĀ 4137
-

Quote

Change Data Capture
- LocationĀ 4152
-

Quote

Database Logs
- LocationĀ 4191
-

Quote

Insert-Only
- LocationĀ 4212
-

Quote

Messages and Streams
- LocationĀ 4231
-

Quote

Types of Time
- LocationĀ 4264
-

Quote

Source System Practical Details
- LocationĀ 4287
-

Quote

Data Sharing
- LocationĀ 4588
-

Quote

Third-Party Data Sources
- LocationĀ 4605
-

Quote

Message Queues and Event-Streaming Platforms
- LocationĀ 4620
-

Quote

Whom Youā€™ll Work With
- LocationĀ 4726
-

Quote

Undercurrents and Their Impact on Source Systems
- LocationĀ 4760
-

Quote

Data Management
- LocationĀ 4778
-

Quote

Data Architecture
- LocationĀ 4813
-

Quote

Software Engineering
- LocationĀ 4838
-

Quote

Additional Resources
- LocationĀ 4865
-

Quote

Raw Ingredients of Data Storage
- LocationĀ 4924
-

Quote

Magnetic Disk Drive
- LocationĀ 4934
-

Quote

Solid-State Drive
- LocationĀ 4975
-

Quote

Random Access Memory
- LocationĀ 4990
-

Quote

Networking and CPU
- LocationĀ 5025
-

Quote

Data Storage Systems
- LocationĀ 5098
-

Quote

Single Machine Versus Distributed Storage
- LocationĀ 5103
-

Quote

Eventual Versus Strong Consistency
- LocationĀ 5117
-

Quote

File Storage
- LocationĀ 5146
-

Quote

Block Storage
- LocationĀ 5205
-

Quote

Object Storage
- LocationĀ 5281
-

Quote

Cache and Memory-Based Storage Systems
- LocationĀ 5428
-

Quote

The Hadoop Distributed File System
- LocationĀ 5448
-

Quote

Streaming Storage
- LocationĀ 5474
-

Quote

Indexes, Partitioning, and Clustering
- LocationĀ 5488
-

Quote

Data Engineering Storage Abstractions
- LocationĀ 5542
-

Quote

The Data Warehouse
- LocationĀ 5560
-

Quote

The Data Lake
- LocationĀ 5573
-

Quote

The Data Lakehouse
- LocationĀ 5581
-

Quote

Data Platforms
- LocationĀ 5598
-

Quote

Stream-to-Batch Storage Architecture
- LocationĀ 5606
-

Quote

Big Ideas and Trends in Storage
- LocationĀ 5616
-

Quote

Data Catalog
- LocationĀ 5622
-

Quote

Data Sharing
- LocationĀ 5644
-

Quote

Separation of Compute from Storage
- LocationĀ 5672
-

Quote

Data Storage Lifecycle and Data Retention
- LocationĀ 5769
-

Quote

Single-Tenant Versus Multitenant Storage
- LocationĀ 5862
-

Quote

Whom Youā€™ll Work With
- LocationĀ 5888
-

Quote

Undercurrents
- LocationĀ 5898
-

Quote

Security
- LocationĀ 5902
-

Quote

Conclusion
- LocationĀ 5956
-

Quote

Ingestion
- LocationĀ 5987
-

Quote

What Is Data Ingestion?
- LocationĀ 5994
-

Quote

Key Engineering Considerations for the Ingestion Phase
- LocationĀ 6029
-

Quote

Bounded Versus Unbounded Data
- LocationĀ 6048
-

Quote

Frequency
- LocationĀ 6065
-

Quote

Synchronous Versus Asynchronous Ingestion
- LocationĀ 6093
-

Quote

Serialization and Deserialization
- LocationĀ 6121
-

Quote

Throughput and Scalability
- LocationĀ 6130
-

Quote

Reliability and Durability
- LocationĀ 6146
-

Quote

Payload
- LocationĀ 6163
-

Quote

Push Versus Pull Versus Poll Patterns
- LocationĀ 6237
-

Quote

Batch Ingestion Considerations
- LocationĀ 6253
-

Quote

Snapshot or Differential Extraction
- LocationĀ 6272
-

Quote

File-Based Export and Ingestion
- LocationĀ 6281
-

Quote

ETL Versus ELT
- LocationĀ 6290
-

Quote

Inserts, Updates, and Batch Size
- LocationĀ 6304
-

Quote

Data Migration
- LocationĀ 6315
-

Quote

Message and Stream Ingestion Considerations
- LocationĀ 6330
-

Quote

Schema Evolution
- LocationĀ 6336
-

Quote

Late-Arriving Data
- LocationĀ 6346
-

Quote

Ordering and Multiple Delivery
- LocationĀ 6356
-

Quote

Replay
- LocationĀ 6360
-

Quote

Time to Live
- LocationĀ 6366
-

Quote

Message Size
- LocationĀ 6377
-

Quote

Error Handling and Dead-Letter Queues
- LocationĀ 6383
-

Quote

Consumer Pull and Push
- LocationĀ 6395
-

Quote

Location
- LocationĀ 6403
-

Quote

Ways to Ingest Data
- LocationĀ 6411
-

Quote

Direct Database Connection
- LocationĀ 6415
-

Quote

Change Data Capture
- LocationĀ 6447
-

Quote

Message Queues and Event-Streaming Platforms
- LocationĀ 6522
-

Quote

Managed Data Connectors
- LocationĀ 6548
-

Quote

Moving Data with Object Storage
- LocationĀ 6561
-

Quote

Databases and File Export
- LocationĀ 6579
-

Quote

Practical Issues with Common File Formats
- LocationĀ 6589
-

Quote

Shell
- LocationĀ 6607
-

Quote

SFTP and SCP
- LocationĀ 6625
-

Quote

Web Interface
- LocationĀ 6653
-

Quote

Web Scraping
- LocationĀ 6659
-

Quote

Transfer Appliances for Data Migration
- LocationĀ 6677
-

Quote

Data Sharing
- LocationĀ 6694
-

Quote

Whom Youā€™ll Work With
- LocationĀ 6705
-

Quote

Upstream Stakeholders
- LocationĀ 6710
-

Quote

Downstream Stakeholders
- LocationĀ 6723
-

Quote

Undercurrents
- LocationĀ 6737
-

Quote

Security
- LocationĀ 6740
-

Quote

Data Management
- LocationĀ 6746
-

Quote

DataOps
- LocationĀ 6791
-

Quote

Orchestration
- LocationĀ 6834
-

Quote

Software Engineering
- LocationĀ 6844
-

Quote

Conclusion
- LocationĀ 6855
-

Quote

Additional Resources
- LocationĀ 6861
-

Quote

Queries, Modeling, and Transformation
- LocationĀ 6877
-

Quote

Queries
- LocationĀ 6903
-

Quote

What Is a Query?
- LocationĀ 6908
-

Quote

The Life of a Query
- LocationĀ 6957
-

Quote

The Query Optimizer
- LocationĀ 6970
-

Quote

Improving Query Performance
- LocationĀ 6978
-

Quote

Queries on Streaming Data
- LocationĀ 7123
-

Quote

Data Modeling
- LocationĀ 7258
-

Quote

What Is a Data Model?
- LocationĀ 7275
-

Quote

Conceptual, Logical, and Physical Data Models
- LocationĀ 7292
-

Quote

Normalization
- LocationĀ 7328
-

Quote

Techniques for Modeling Batch Analytical Data
- LocationĀ 7503
-

Quote

Modeling Streaming Data
- LocationĀ 7881
-

Quote

Transformations
- LocationĀ 7907
-

Quote

Batch Transformations
- LocationĀ 7931
-

Quote

Materialized Views, Federation, and Query Virtualization
- LocationĀ 8270
-

Quote

Streaming Transformations and Processing
- LocationĀ 8338
-

Quote

Whom Youā€™ll Work With
- LocationĀ 8395
-

Quote

Upstream Stakeholders
- LocationĀ 8405
-

Quote

Downstream Stakeholders
- LocationĀ 8417
-

Quote

Undercurrents
- LocationĀ 8424
-

Quote

Security
- LocationĀ 8427
-

Quote

Data Management
- LocationĀ 8436
-

Quote

DataOps
- LocationĀ 8458
-

Quote

Data Architecture
- LocationĀ 8479
-

Quote

Orchestration
- LocationĀ 8490
-

Quote

Software Engineering
- LocationĀ 8495
-

Quote

Conclusion
- LocationĀ 8515
-

Quote

Additional Resources
- LocationĀ 8528
-

Quote

Serving Data for Analytics, Machine Learning, and Reverse ETL
- LocationĀ 8614
-

Quote

General Considerations for Serving Data
- LocationĀ 8627
-

Quote

Trust
- LocationĀ 8633
-

Quote

Whatā€™s the Use Case, and Whoā€™s the User?
- LocationĀ 8668
-

Quote

Data Products
- LocationĀ 8687
-

Quote

Self-Service or Not?
- LocationĀ 8709
-

Quote

Data Definitions and Logic
- LocationĀ 8732
-

Quote

Data Mesh
- LocationĀ 8757
-

Quote

Analytics
- LocationĀ 8768
-

Quote

Business Analytics
- LocationĀ 8779
-

Quote

Operational Analytics
- LocationĀ 8820
-

Quote

Embedded Analytics
- LocationĀ 8856
-

Quote

Machine Learning
- LocationĀ 8884
-

Quote

What a Data Engineer Should Know About ML
- LocationĀ 8899
-

Quote

Ways to Serve Data for Analytics and ML
- LocationĀ 8933
-

Quote

File Exchange
- LocationĀ 8938
-

Quote

Databases
- LocationĀ 8959
-

Quote

Streaming Systems
- LocationĀ 8989
-

Quote

Query Federation
- LocationĀ 8999
-

Quote

Data Sharing
- LocationĀ 9018
-

Quote

Semantic and Metrics Layers
- LocationĀ 9026
-

Quote

Serving Data in Notebooks
- LocationĀ 9053
-

Quote

Reverse ETL
- LocationĀ 9094
-

Quote

Whom Youā€™ll Work With
- LocationĀ 9125
-

Quote

Undercurrents
- LocationĀ 9141
-

Quote

Security
- LocationĀ 9147
-

Quote

Data Management
- LocationĀ 9171
-

Quote

DataOps
- LocationĀ 9185
-

Quote

Data Architecture
- LocationĀ 9197
-

Quote

Orchestration
- LocationĀ 9204
-

Quote

Software Engineering
- LocationĀ 9219
-

Quote

Conclusion
- LocationĀ 9236
-

Quote

Additional Resources
- LocationĀ 9245
-

Quote

Security, Privacy, and the Future of Data Engineering
- LocationĀ 9285
-

Quote

Security and Privacy
- LocationĀ 9287
-

Quote

The Power of Negative Thinking
- LocationĀ 9320
-

Quote

Always Be Paranoid
- LocationĀ 9329
-

Quote

Processes
- LocationĀ 9337
-

Quote

Security Theater Versus Security Habit
- LocationĀ 9341
-

Quote

Active Security
- LocationĀ 9351
-

Quote

Shared Responsibility in the Cloud
- LocationĀ 9371
-

Quote

Always Back Up Your Data
- LocationĀ 9378
-

Quote

An Example Security Policy
- LocationĀ 9387
-

Quote

Technology
- LocationĀ 9416
-

Quote

Patch and Update Systems
- LocationĀ 9420
-

Quote

Encryption
- LocationĀ 9426
-

Quote

Logging, Monitoring, and Alerting
- LocationĀ 9448
-

Quote

Network Access
- LocationĀ 9473
-

Quote

Security for Low-Level Data Engineering
- LocationĀ 9491
-

Quote

Conclusion
- LocationĀ 9509
-

Quote

Additional Resources
- LocationĀ 9512
-

Quote

The Future of Data Engineering
- LocationĀ 9518
-

Quote

The Data Engineering Lifecycle Isnā€™t Going Away
- LocationĀ 9534
-

Quote

The Decline of Complexity and the Rise of Easy-to-Use Data Tools
- LocationĀ 9543
-

Quote

The Cloud-Scale Data OS and Improved Interoperability
- LocationĀ 9572
-

Quote

ā€œEnterpriseyā€ Data Engineering
- LocationĀ 9617
-

Quote

Titles and Responsibilities Will Morph...
- LocationĀ 9630
-

Quote

Moving Beyond the Modern Data Stack, Toward the Live Data Stack
- LocationĀ 9652
-

Quote

The Live Data Stack
- LocationĀ 9674
-

Quote

Streaming Pipelines and Real-Time Analytical Databases
- LocationĀ 9687
-

Quote

The Fusion of Data with Applications
- LocationĀ 9722
-

Quote

The Tight Feedback Between Applications and ML
- LocationĀ 9731
-

Quote

Dark Matter Data and the Rise of...Spreadsheets?!
- LocationĀ 9742
-

Quote

Conclusion
- LocationĀ 9774
-

Quote

Serialization and Compression Technical Details
- LocationĀ 9805
-

Quote

Serialization Formats
- LocationĀ 9809
-

Quote

Row-Based Serialization
- LocationĀ 9817
-

Quote

Columnar Serialization
- LocationĀ 9849
-

Quote

Hybrid Serialization
- LocationĀ 9903
-

Quote

Database Storage Engines
- LocationĀ 9922
-

Quote

Compression: gzip, bzip2, Snappy, Etc.
- LocationĀ 9934
-

Quote

Cloud Networking
- LocationĀ 9956
-

Quote

Cloud Network Topology
- LocationĀ 9959
-

Quote

Data Egress Charges
- LocationĀ 9965
-

Quote

Availability Zones
- LocationĀ 9973
-

Quote

GCP-Specific Networking and Multiregional Redundancy
- LocationĀ 10000
-

Quote

Direct Network Connections to the Clouds
- LocationĀ 10016
-

Quote

The Future of Data Egress Fees
- LocationĀ 10029
-

Quote

About the Authors
- LocationĀ 11858
-

Quote

Joe Reis is a business-minded data nerd whoā€™s worked in the data industry for 20 years, with responsibilities ranging from statistical modeling and forecasting to machine learning, data engineering, data architecture, and almost everything else in between. Joe is the CEO and cofounder of Ternary Data, a data engineering and architecture consulting firm based in Salt Lake City, Utah. In addition, he volunteers with several technology groups and teaches at the University of Utah.
- LocationĀ 11859
-

Quote

Matt Housley is a data engineering consultant and cloud specialist. After some early programming experience with Logo, Basic, and 6502 assembly, he completed a PhD in mathematics at the University of Utah. Matt then began working in data science, eventually specializing in cloud-based data engineering. He cofounded Ternary Data with Joe Reis, where he leverages his teaching experience to train future data engineers and advise teams on robust data architecture. Matt and Joe also pontificate on all things data on The Monday Morning Data Chat.
- LocationĀ 11863
-