Fundamentals of Data Engineering
@created:: 2024-01-24
@tags:: #lit✍/📚book/highlights
@links:: data engineering, data governance, data management, data orchestration,
@ref:: Fundamentals of Data Engineering
@author:: Joe Reis and Matt Housley
2023-02-05 Joe Reis and Matt Housley - Fundamentals of Data Engineering
Reference
Notes
What This Book Isn’t
- Location 104
-
What This Book Is About
- Location 109
-
Who Should Read This Book
- Location 128
-
Prerequisites
- Location 138
-
What You’ll Learn and How It Will Improve Your Abilities
- Location 154
-
Navigating This Book
- Location 166
-
Conventions Used in This Book
- Location 199
-
Foundation and Building Blocks
- Location 250
-
Data Engineering Described
- Location 252
-
What Is Data Engineering?
- Location 256
-
Data Engineering Defined
- Location 287
-
a data engineer gets data, stores it, and prepares it for consumption by data scientists, analysts, and others.
- Location 288
-
Data engineering is the development, implementation, and maintenance of systems and processes that take in raw data and produce high-quality, consistent information that supports downstream use cases, such as analysis and machine learning. Data engineering is the intersection of security, data management, DataOps, data architecture, orchestration, and software engineering. A data engineer manages the data engineering lifecycle, beginning with getting data from source systems and ending with serving data for use cases, such as analysis or machine learning.
- Location 291
-
- [note::I'd be fun to make a Venn diagram of data engineering]
The Data Engineering Lifecycle
- Location 297
-
Figure 1-1. The data engineering lifecycle
- Location 301
-
Evolution of the Data Engineer
- Location 313
-
The term big data is essentially a relic to describe a particular time and approach to handling large amounts of data.
- Location 398
-
big data processing has become so accessible that it no longer merits a separate term; every company aims to solve its data problems, regardless of actual data size. Big data engineers are now simply data engineers.
- Location 400
-
Data engineering is increasingly a discipline of interoperation, and connecting various technologies like LEGO bricks, to serve ultimate business goals.
- Location 411
-
Figure 1-3. Matt Turck’s Data Landscape in 2012 versus 2021
- Location 413
-
As tools and workflows simplify, we’ve seen a noticeable shift in the attitudes of data engineers. Instead of focusing on who has the “biggest data,” open source projects and services are increasingly concerned with managing and governing data, making it easier to use and discover, and improving its quality.
- Location 421
-
Data Engineering and Data Science
- Location 434
-
Figure 1-4. Data engineering sits upstream from data science
- Location 440
-
Figure 1-5. The Data Science Hierarchy of Needs
- Location 446
-
Rogati argues that companies need to build a solid data foundation (the bottom three levels of the hierarchy) before tackling areas such as AI and ML.
- Location 450
- artificial intelligence (ai), data hierarchy of needs, machine learning (ml),
- [note::Hierarchy = Data Science Hierarchy of Needs]
In an ideal world, data scientists should spend more than 90% of their time focused on the top layers of the pyramid: analytics, experimentation, and ML. When data engineers focus on these bottom parts of the hierarchy, they build a solid foundation for data scientists to succeed.
- Location 452
-
Data Engineering Skills and Activities
- Location 461
-
a data engineer juggles a lot of complex moving parts and must constantly optimize along the axes of cost, agility, scalability, simplicity, reuse, and interoperability
- Location 467
-
data engineers are now focused on balancing the simplest and most cost-effective, best-of-breed services that deliver value to the business.
- Location 476
-
A data engineer typically does not directly build ML models, create reports or dashboards, perform data analysis, build key performance indicators (KPIs), or develop software applications. A data engineer should have a good functioning understanding of these areas to serve stakeholders best.
- Location 478
-
Data Maturity and the Data Engineer
- Location 481
-
Data maturity is the progression toward higher data utilization, capabilities, and integration across the organization,
- Location 484
-
data maturity does not simply depend on the age or revenue of a company. An early-stage startup can have greater data maturity than a 100-year-old company with annual revenues in the billions. What matters is the way data is leveraged as a competitive advantage.
- Location 488
-
Stage 1: Starting with data
- Location 496
-
A data engineer should focus on the following in organizations getting started with data:
- Get buy-in from key stakeholders, including executive management. Ideally, the data engineer should have a sponsor for critical initiatives to design and build a data architecture to support the company’s goals.
- Define the right data architecture (usually solo, since a data architect likely isn’t available). This means determining business goals and the competitive advantage you’re aiming to achieve with your data initiative. Work toward a data architecture that supports these goals. See Chapter 3 for our advice on “good” data architecture.
- Identify and audit data that will support key initiatives and operate within the data architecture you designed.
- Build a solid data foundation for future data analysts and data scientists to generate reports and models that provide competitive value. In the meantime, you may also have to generate these reports and models until this team is hired.
- Location 508
-
This is a delicate stage with lots of pitfalls. Here are some tips for this stage:
- Organizational willpower may wane if a lot of visible successes don’t occur with data. Getting quick wins will establish the importance of data within the organization. Just keep in mind that quick wins will likely create technical debt. Have a plan to reduce this debt, as it will otherwise add friction for future delivery.
- Get out and talk to people, and avoid working in silos. We often see the data team working in a bubble, not communicating with people outside their departments and getting perspectives and feedback from business stakeholders. The danger is you’ll spend a lot of time working on things of little use to people.
- Avoid undifferentiated heavy lifting. Don’t box yourself in with unnecessary technical complexity. Use off-the-shelf, turnkey solutions wherever possible. Build custom solutions and code only where this creates a competitive advantage.
- Location 516
- data culture,
- [note::This stage = "Getting started with data"]
Stage 2: Scaling…
- Location 525
-
In organizations that are in stage 2 of data maturity, a data engineer’s goals are to do the following:
- Establish formal data practices
- Create scalable and robust data architectures
- Adopt DevOps and DataOps practices
- Build systems that support ML
- Continue to avoid…
- Location 528
-
Issues to watch out for include the following:
- As we grow more sophisticated with data, there’s a temptation to adopt bleeding-edge technologies based on social proof from Silicon Valley companies. This is rarely a good use of your time and energy. Any technology decisions should be driven by the value they’ll deliver to your customers.
- The main bottleneck for scaling is not cluster nodes, storage, or technology but the data engineering team. Focus on solutions that are simple to deploy and manage to expand your team’s throughput.
- You’ll be tempted to frame yourself as a technologist, a data genius who can deliver magical products. Shift your focus instead to pragmatic leadership…
- Location 533
-
Stage 3: Leading…
- Location 541
-
In organizations in stage 3 of data maturity, a data engineer will continue building on prior stages, plus they will do the following:
- Create automation for the seamless introduction and usage of new data
- Focus on building custom tools and systems that leverage data as a competitive advantage
- Focus on the “enterprisey” aspects of data, such as data management (including data governance and quality) and DataOps
- Deploy tools that expose and disseminate data throughout the organization, including data catalogs, data lineage tools, and metadata management systems
- Collaborate efficiently with software…
- Location 545
-
Issues to watch out for include the following:
- At this stage, complacency is a significant danger. Once organizations reach stage 3, they must constantly focus on maintenance and improvement or risk falling back to a lower stage.
- Technology distractions are a more significant danger here than in the other stages. There’s a temptation to pursue expensive hobby projects that don’t deliver…
- Location 552
-
- [note::"Utilize custom-built technology only where it provides a competitive advantage" is a recurring theme in each data maturity stage (1-3)]
The Background and Skills of a…
- Location 560
-
If you’re pivoting your career into data engineering, we’ve found that the transition is easiest when moving from an adjacent field, such as software engineering, ETL development, database administration, data science, or data analysis.
- Location 568
- career pivot, pink, career,
Zooming out, a data engineer must also understand the requirements of data consumers (data analysts and data scientists) and the broader implications of data across the organization. Data engineering is a holistic practice; the best data engineers view their responsibilities through business and technical lenses.
- Location 575
- favorite,
Business Responsibilities
- Location 577
-
Know how to communicate with nontechnical and technical people.
- Location 581
-
We suggest paying close attention to organizational hierarchies, who reports to whom, how people interact, and which silos exist. These observations will be invaluable to your success.
- Location 582
- stakeholder mapping,
- [note::For any problem you're trying to solve, it helps to map the stakeholders who can influence or are influenced by your progress towards a solution.]
Understand how to scope and gather business and product requirements.
- Location 584
-
Data engineering is a holistic practice; the best data engineers view their responsibilities through business and technical lenses.
- Location 585
-
- [note::Emphasis on and]
Many technologists mistakenly believe these practices are solved through technology. We feel this is dangerously wrong. Agile, DevOps, and DataOps are fundamentally cultural, requiring buy-in across the organization.
- Location 586
- favorite,
Know how to optimize for time to value, the total cost of ownership, and opportunity cost. Learn to monitor costs to avoid surprises.
- Location 589
-
success or failure is rarely a technology issue.
- Location 595
-
- [note::Stakeholder communication is paramount]
Technical Responsibilities
- Location 598
-
data engineers now focus on high-level abstractions or writing pipelines as code within an orchestration framework.
- Location 614
-
a data engineer who can’t write production-grade code will be severely hindered, and we don’t see this changing anytime soon.
- Location 617
-
At the time of this writing, the primary languages of data engineering are SQL, Python, a Java Virtual Machine (JVM) language (usually Java or Scala), and bash:
- Location 620
-
Understanding Java or Scala will be beneficial if you’re using a popular open source data framework.
- Location 633
-
Even today, data engineers frequently use command-line tools like awk or sed to process files in a data pipeline or call bash commands from orchestration frameworks. If you’re using Windows, feel free to substitute PowerShell for bash.
- Location 636
-
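As a sketch of the pattern above, a Python task (standing in for an orchestration framework step) can shell out to awk. The column-extraction command is an illustrative assumption, and the snippet assumes awk is available on the PATH:

```python
import subprocess

def extract_second_column(text: str) -> str:
    """Call awk from Python, the way an orchestration task might shell out
    to command-line tools inside a pipeline step (illustrative sketch)."""
    result = subprocess.run(
        ["awk", "{print $2}"],  # print the second whitespace-delimited field
        input=text, capture_output=True, text=True, check=True,
    )
    return result.stdout

raw = "2023-01-01 alice 42\n2023-01-02 bob 17\n"
print(extract_second_column(raw))  # one username per line
```

The same pattern applies to sed or any other command-line filter; `check=True` ensures a failing tool fails the pipeline step loudly.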
Data engineers also do well to develop expertise in composing SQL with other operations, either within frameworks such as Spark and Flink or by using orchestration to combine multiple tools. Data engineers should also learn modern SQL semantics for dealing with JavaScript Object Notation (JSON) parsing and nested data and consider leveraging a SQL management framework such as dbt (Data Build Tool).
- Location 645
- career capital, skills, sql,
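A minimal illustration of SQL JSON semantics, using SQLite's `json_extract` as a stand-in for warehouse-specific syntax (the table and payload here are invented; BigQuery, Snowflake, and others each expose their own JSON operators, and SQLite needs JSON support compiled in, which modern builds include):

```python
import sqlite3

# Create a toy table where a column holds a nested JSON document,
# then pull a nested field out with a JSON path expression in SQL.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (id INTEGER, payload TEXT)")
conn.execute(
    "INSERT INTO events VALUES (1, '{\"user\": {\"name\": \"alice\", \"plan\": \"pro\"}}')"
)
row = conn.execute(
    "SELECT json_extract(payload, '$.user.name') FROM events"
).fetchone()
print(row[0])  # alice
```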
Data engineers may also need to develop proficiency in secondary programming languages, including R, JavaScript, Go, Rust, C/C++, C#, and Julia. Developing in these languages is often necessary when popular across the company or used with domain-specific data tools. For instance, JavaScript has proven popular as a language for user-defined functions in cloud data warehouses. At the same time, C# and PowerShell are essential in companies that leverage Azure and the Microsoft ecosystem.
- Location 652
- programming languages, data engineering,
The Continuum of Data Engineering Roles, from A to B
- Location 665
-
Data Engineers Inside an Organization
- Location 689
-
Internal-Facing Versus External-Facing Data Engineers
- Location 693
-
Data Engineers and Other Technical Roles
- Location 718
-
Figure 1-12. Key technical stakeholders of data engineering
- Location 722
-
Upstream stakeholders
- Location 729
-
Data architects design the blueprint for organizational data management, mapping out processes and overall data architecture and systems. They also serve as a bridge between an organization’s technical and nontechnical sides.
- Location 735
-
Data architects implement policies for managing data across silos and business units, steer global strategies such as data management and data governance, and guide significant initiatives. Data architects often play a central role in cloud migrations and greenfield cloud design.
- Location 739
-
Nevertheless, data architects will remain influential visionaries in enterprises, working hand in hand with data engineers to determine the big picture of architecture practices and data strategies.
- Location 744
-
In well-run technical organizations, software engineers and data engineers coordinate from the inception of a new project to design application data for consumption by analytics and ML applications.
- Location 758
-
A data engineer should work together with software engineers to understand the applications that generate data, the volume, frequency, and format of the generated data, and anything else that will impact the data engineering lifecycle, such as data security and regulatory compliance. For example, this might mean setting upstream expectations on what the data software engineers need to do their jobs.
- Location 759
- data engineering,
Downstream stakeholders
- Location 769
-
If data engineers do their job and collaborate successfully, data scientists shouldn’t spend their time collecting, cleaning, and preparing data after initial exploratory work. Data engineers should automate this work as much as possible.
- Location 784
-
Whereas data scientists are forward-looking, a data analyst typically focuses on the past or present.
- Location 790
-
Data Engineers and Business Leadership
- Location 822
-
Data engineers often support data architects by acting as the glue between the business and data science/analytics.
- Location 826
-
Data in the C-suite
- Location 827
-
CIOs will work with engineers and architects to map out major initiatives and make strategic decisions on adopting major architectural elements, such as enterprise resource planning (ERP) and customer relationship management (CRM) systems, cloud migrations, data systems, and internal-facing IT.
- Location 846
-
A CTO owns the key technological strategy and architectures for external-facing applications, such as mobile, web apps, and IoT—all critical data sources for data engineers.
- Location 851
- business roles,
Data engineers and project managers
- Location 876
-
These large initiatives often benefit from project management (in contrast to product management, discussed next). Whereas data engineers function in an infrastructure and service delivery capacity, project managers direct traffic and serve as gatekeepers.
- Location 881
-
Data engineers and product managers
- Location 889
-
Data engineers and other management roles
- Location 896
-
For more information on data teams and how to structure them, we recommend John Thompson’s Building Analytics Teams (Packt) and Jesse Anderson’s Data Teams (Apress). Both books provide strong frameworks and perspectives on the roles of executives with data, who to hire, and how to construct the most effective data team for your company.
- Location 900
-
Conclusion
- Location 907
-
Additional Resources
- Location 917
-
“Big Data Will Be Dead in Five Years” by Lewis Gavin
- Location 920
-
“Data as a Product vs. Data as a Service” by Justin Gage
- Location 923
-
“The Downfall of the Data Engineer” by Maxime Beauchemin
- Location 928
-
“How Creating a Data-Driven Culture Can Drive Success” by Frederik Bussler
- Location 931
-
The Information Management Body of Knowledge website
- Location 933
-
“Information Management Body of Knowledge” Wikipedia page
- Location 934
-
“Information Management” Wikipedia page
- Location 935
-
“On Complexity in Big Data” by Jesse Anderson (O’Reilly)
- Location 936
-
“What Is a Data Architect? IT’s Data Framework Visionary” by Thor Olavsrud
- Location 943
-
The Data Engineering Lifecycle
- Location 980
-
data engineering lifecycle comprises stages that turn raw data ingredients into a useful end product, ready for consumption by analysts, data scientists, ML engineers, and others.
- Location 989
-
What Is the Data Engineering Lifecycle?
- Location 989
-
Figure 2-1. Components and undercurrents of the data engineering lifecycle
- Location 996
-
The Data Lifecycle Versus the Data Engineering Lifecycle
- Location 1009
-
source system is the origin of the data used in the data engineering lifecycle. For example, a source system could be an IoT device, an application message queue, or a transactional database.
- Location 1018
-
Generation: Source Systems
- Location 1018
-
Engineers also need to keep an open line of communication with source system owners on changes that could break pipelines and analytics.
- Location 1025
-
The following is a starting set of evaluation questions of source systems that data engineers must consider:
- What are the essential characteristics of the data source? Is it an application? A swarm of IoT devices?
- How is data persisted in the source system? Is data persisted long term, or is it temporary and quickly deleted?
- At what rate is data generated? How many events per second? How many gigabytes per hour?
- What level of consistency can data engineers expect from the output data? If you’re running data-quality checks against the output data, how often do data inconsistencies occur—nulls where they aren’t expected, lousy formatting, etc.?
- How often do errors occur?
- Will the data contain duplicates?
- Will some data values arrive late, possibly much later than other messages produced simultaneously?
- What is the schema of the ingested data? Will data engineers need to join across several tables or even several systems to get a complete picture of the data?
- If schema changes (say, a new column is added), how is this dealt with and communicated to downstream stakeholders?
- How frequently should data be pulled from the source system?
- For stateful systems (e.g., a database tracking customer account information), is data provided as periodic snapshots or update events from change data capture (CDC)? What’s the logic for how changes are performed, and how are these tracked in the source database?
- Who/what is the data provider that will transmit the data for downstream consumption?
- Will reading from a data source impact its performance?
- Does the source system have upstream data dependencies? What are the characteristics of these upstream systems?
- Are data-quality checks in place to check for late or missing data?
- Location 1049
- considerations, data source systems,
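A few of these questions (unexpected nulls, duplicates, late arrivals) can be turned into automated spot checks. The sketch below is a hypothetical audit over a batch of event dicts; the field names and the lateness threshold are assumptions for illustration:

```python
from datetime import datetime, timedelta

def audit_batch(events, expected_keys, max_lateness=timedelta(hours=1)):
    """Count unexpected nulls, duplicate ids, and late-arriving records."""
    issues = {"nulls": 0, "duplicates": 0, "late": 0}
    seen_ids = set()
    for event in events:
        if any(event.get(k) is None for k in expected_keys):
            issues["nulls"] += 1
        if event["id"] in seen_ids:
            issues["duplicates"] += 1
        seen_ids.add(event["id"])
        if event["received_at"] - event["created_at"] > max_lateness:
            issues["late"] += 1
    return issues

now = datetime(2023, 1, 1, 12, 0)
batch = [
    {"id": 1, "value": 10, "created_at": now, "received_at": now},
    {"id": 1, "value": 10, "created_at": now, "received_at": now},  # duplicate
    {"id": 2, "value": None, "created_at": now, "received_at": now},  # null
    {"id": 3, "value": 5, "created_at": now,
     "received_at": now + timedelta(hours=2)},  # late arrival
]
print(audit_batch(batch, expected_keys=["value"]))
# {'nulls': 1, 'duplicates': 1, 'late': 1}
```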
A data engineer should know how the source generates data, including relevant quirks or nuances. Data engineers also need to understand the limits of the source systems they interact with.
- Location 1069
-
Storage
- Location 1089
-
few data storage solutions function purely as storage, with many supporting complex transformation queries; even object storage solutions may support powerful query capabilities—e.g., Amazon S3 Select.
- Location 1093
-
Here are a few key engineering questions to ask when choosing a storage system for a data warehouse, data lakehouse, database, or object storage:
- Is this storage solution compatible with the architecture’s required write and read speeds?
- Will storage create a bottleneck for downstream processes?
- Do you understand how this storage technology works? Are you utilizing the storage system optimally or committing unnatural acts? For instance, are you applying a high rate of random access updates in an object storage system? (This is an antipattern with significant performance overhead.)
- Will this storage system handle anticipated future scale? You should consider all capacity limits on the storage system: total available storage, read operation rate, write volume, etc.
- Will downstream users and processes be able to retrieve data in the required service-level agreement (SLA)?
- Are you capturing metadata about schema evolution, data flows, data lineage, and so forth? Metadata has a significant impact on the utility of data. Metadata represents an investment in the future, dramatically enhancing discoverability and institutional knowledge to streamline future projects and architecture changes.
- Is this a pure storage solution (object storage), or does it support complex query patterns (i.e., a cloud data warehouse)?
- Is the storage system schema-agnostic (object storage)? Flexible schema (Cassandra)? Enforced schema (a cloud data warehouse)?
- How are you tracking master data, golden records, data quality, and data lineage for data governance? (We have more to say on these in “Data Management”.)
- How are you handling regulatory compliance and data sovereignty? For example, can you store your data in certain geographical locations but not others?
- Location 1102
-
Understanding data access frequency
- Location 1120
-
Data access frequency will determine the temperature of your data. Data that is most frequently accessed is called hot data. Hot data is commonly retrieved many times per day, perhaps even several times per second—for example, in systems that serve user requests. This data should be stored for fast retrieval, where “fast” is relative to the use case. Lukewarm data might be accessed every so often—say, every week or month. Cold data is seldom queried and is appropriate for storing in an archival system. Cold data is often retained for compliance purposes or in case of a catastrophic failure in another system.
- Location 1123
-
- [note::Data Temperature = Frequency of Data Retrieval]
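A toy classifier makes the temperature idea concrete; the cutoffs below are assumptions for illustration, not thresholds from the book:

```python
def data_temperature(accesses_per_day: float) -> str:
    """Classify data by access frequency (illustrative thresholds only)."""
    if accesses_per_day >= 1:        # retrieved many times per day, or more
        return "hot"
    if accesses_per_day >= 1 / 30:   # touched every week or month or so
        return "lukewarm"
    return "cold"                    # seldom queried; archival/compliance tier

print(data_temperature(500))    # hot: e.g., data serving user requests
print(data_temperature(0.1))    # lukewarm: roughly every ten days
print(data_temperature(0.001))  # cold: candidate for archival storage
```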
Selecting a storage system
- Location 1136
-
Ingestion
- Location 1144
-
source systems and ingestion represent the most significant bottlenecks of the data engineering lifecycle.
- Location 1148
-
Key engineering considerations for the ingestion phase
- Location 1153
-
When preparing to architect or build a system, here are some primary questions about the ingestion stage:
- What are the use cases for the data I’m ingesting? Can I reuse this data rather than create multiple versions of the same dataset?
- Are the systems generating and ingesting this data reliably, and is the data available when I need it?
- What is the data destination after ingestion?
- How frequently will I need to access the data?
- In what volume will the data typically arrive?
- What format is the data in? Can my downstream storage and transformation systems handle this format?
- Is the source data in good shape for immediate downstream use? If so, for how long, and what may cause it to be unusable?
- If the data is from a streaming source, does it need to be transformed before reaching its destination? Would an in-flight transformation be appropriate, where the data is transformed within the stream itself?
- Location 1154
-
Batch versus streaming
- Location 1167
-
Virtually all data we deal with is inherently streaming. Data is nearly always produced and updated continually at its source.
- Location 1168
-
Batch ingestion is simply a specialized and convenient way of processing this stream in large chunks—for example, handling a full day’s worth of data in a single batch.
- Location 1170
-
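The point that a batch is just a stream chunked by a window can be sketched in a few lines, grouping a (time-sorted) event stream into one batch per calendar day:

```python
from itertools import groupby

# A continuous stream of (timestamp, event) pairs, already sorted by time.
events = [
    ("2023-01-01T08:00", "login"),
    ("2023-01-01T17:30", "purchase"),
    ("2023-01-02T09:15", "login"),
]

# "Batch ingestion" here is just slicing the stream on a day boundary:
# the first 10 characters of the ISO timestamp are the calendar date.
daily_batches = {
    day: [name for _, name in group]
    for day, group in groupby(events, key=lambda e: e[0][:10])
}
print(daily_batches)
# {'2023-01-01': ['login', 'purchase'], '2023-01-02': ['login']}
```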
Key considerations for batch versus stream ingestion
- Location 1184
-
The following are some questions to ask yourself when determining whether streaming ingestion is an appropriate choice over batch ingestion:
- If I ingest the data in real time, can downstream storage systems handle the rate of data flow?
- Do I need millisecond real-time data ingestion? Or would a micro-batch approach work, accumulating and ingesting data, say, every minute?
- What are my use cases for streaming ingestion? What specific benefits do I realize by implementing streaming? If I get data in real time, what actions can I take on that data that would be an improvement upon batch?
- Will my streaming-first approach cost more in terms of time, money, maintenance, downtime, and opportunity cost than simply doing batch?
- Are my streaming pipeline and system reliable and redundant if infrastructure fails?
- What tools are most appropriate for the use case? Should I use a managed service (Amazon Kinesis, Google Cloud Pub/Sub, Google Cloud Dataflow) or stand up my own instances of Kafka, Flink, Spark, Pulsar, etc.? If I do the latter, who will manage it? What are the costs and trade-offs?
- If I’m deploying an ML model, what benefits do I have with online predictions and possibly continuous training?
- Am I getting data from a live production instance? If so, what’s the impact of my ingestion process on this source system?
- Location 1186
-
In the push model of data ingestion, a source system writes data out to a target, whether a database, object store, or filesystem. In the pull model, data is retrieved from the source system.
- Location 1202
-
Push versus pull
- Location 1202
-
continuous CDC,
- Location 1213
-
- [note::What does CDC stand for? -> "Change Data Capture"]
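A toy contrast of the push and pull models, with invented Source and Target classes: in push, the source initiates delivery; in pull, the target reads from the source on its own schedule, the cursor here standing in for something like a CDC position:

```python
class Target:
    """Destination for ingested records (database, object store, etc.)."""
    def __init__(self):
        self.records = []

    def receive(self, record):
        # Push model: the source calls this to deliver data.
        self.records.append(record)

class Source:
    """Source system holding rows to be ingested."""
    def __init__(self, rows):
        self.rows = rows

    def push_to(self, target):
        # Push model: source-initiated delivery to the target.
        for row in self.rows:
            target.receive(row)

    def read_since(self, cursor):
        # Pull model: the target asks for rows after a cursor position.
        return self.rows[cursor:]

source = Source(["r1", "r2", "r3"])

pushed = Target()
source.push_to(pushed)                       # source decides when data moves

pulled = Target()
pulled.records.extend(source.read_since(1))  # target decides, from a cursor

print(pushed.records)  # ['r1', 'r2', 'r3']
print(pulled.records)  # ['r2', 'r3']
```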
Transformation
- Location 1225
-
Immediately after ingestion, basic transformations map data into correct types (changing ingested string data into numeric and date types, for example), putting records into standard formats, and removing bad ones. Later stages of transformation may transform the data schema and apply normalization. Downstream, we can apply large-scale aggregation for reporting or featurize data for ML processes.
- Location 1231
-
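A minimal sketch of such a post-ingestion transformation, with invented field names: cast string fields to proper types and drop records that fail the casts:

```python
from datetime import date

def transform(raw_rows):
    """Map ingested string data into correct types; remove bad records."""
    clean = []
    for row in raw_rows:
        try:
            clean.append({
                "order_id": int(row["order_id"]),                 # string -> int
                "amount": float(row["amount"]),                   # string -> float
                "order_date": date.fromisoformat(row["order_date"]),  # string -> date
            })
        except (ValueError, KeyError):
            continue  # malformed record: drop it rather than poison downstream
    return clean

raw = [
    {"order_id": "42", "amount": "19.99", "order_date": "2023-01-05"},
    {"order_id": "oops", "amount": "x", "order_date": "not-a-date"},
]
print(transform(raw))  # keeps only the well-formed first row
```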
Key considerations for the transformation phase
- Location 1234
-
When considering data transformations within the data engineering lifecycle, it helps to consider the following:
- What’s the cost and return on investment (ROI) of the transformation? What is the associated business value?
- Is the transformation as simple and self-isolated as possible?
- What business rules do the transformations support?
- Location 1235
-
Serving Data
- Location 1263
-
Analytics
- Location 1276
-
Figure 2-5. Types of analytics
- Location 1282
-
Multitenancy
- Location 1314
-
Machine learning
- Location 1321
-
The following are some considerations for the serving data phase specific to ML:
- Is the data of sufficient quality to perform reliable feature engineering? Quality requirements and assessments are developed in close collaboration with teams consuming the data.
- Is the data discoverable? Can data scientists and ML engineers easily find valuable data?
- Where are the technical and organizational boundaries between data engineering and ML engineering? This organizational question has significant architectural implications.
- Does the dataset properly represent ground truth? Is it unfairly biased?
- Location 1340
-
Reverse ETL
- Location 1349
-
Major Undercurrents Across the Data Engineering Lifecycle
- Location 1370
-
Figure 2-7. The major undercurrents of data engineering
- Location 1379
-
Security
- Location 1381
-
The principle of least privilege means giving a user or system access to only the essential data and resources to perform an intended function. A common antipattern we see with data engineers with little security experience is to give admin access to all users. This is a catastrophe waiting to happen!
- Location 1386
- data security,
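Least privilege amounts to deny-by-default plus narrow grants. A toy allowlist sketch (the principals and permission names are invented):

```python
# Each principal is granted only the permissions its function requires.
# Nothing here holds "admin"; anything not explicitly granted is denied.
GRANTS = {
    "reporting_service": {"analytics.read"},
    "ingest_job": {"raw.read", "raw.write"},
}

def is_allowed(principal: str, permission: str) -> bool:
    """Deny by default: unknown principals and ungranted permissions fail."""
    return permission in GRANTS.get(principal, set())

print(is_allowed("reporting_service", "analytics.read"))  # True
print(is_allowed("reporting_service", "raw.write"))       # False
print(is_allowed("random_user", "analytics.read"))        # False
```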
Data Management
- Location 1403
-
The Data Management Association International (DAMA) Data Management Body of Knowledge (DMBOK), which we consider to be the definitive book for enterprise data management, offers this definition: Data management is the development, execution, and supervision of plans, policies, programs, and practices that deliver, control, protect, and enhance the value of data and information assets throughout their lifecycle.
- Location 1410
-
data governance engages people, processes, and technologies to maximize data value across an organization while protecting data with appropriate security controls.
- Location 1438
-
The core categories of data governance are discoverability, security, and accountability. Within these core categories are subcategories, such as data quality, metadata, and privacy.
- Location 1450
-
In a data-driven company, data must be available and discoverable. End users should have quick and reliable access to the data they need to do their jobs. They should know where the data comes from, how it relates to other data, and what the data means.
- Location 1453
-
We divide metadata into two major categories: autogenerated and human generated. Modern data engineering revolves around automation, but metadata collection is often manual and error prone.
- Location 1462
- data quality, metadata,
Metadata tools are only as good as their connectors to data systems and their ability to share metadata.
- Location 1470
-
Documentation and internal wiki tools provide a key foundation for metadata management, but these tools should also integrate with automated data cataloging. For example, data-scanning tools can generate wiki pages with links to relevant data objects.
- Location 1477
-
DMBOK identifies four main categories of metadata that are useful to data engineers:
- Business metadata
- Technical metadata
- Operational metadata
- Reference metadata
- Location 1481
-
Business metadata relates to the way data is used in the business, including business and data definitions, data rules and logic, how and where data is used, and the data owner(s). A data engineer uses business metadata to answer nontechnical questions about who, what, where, and how. For example, a data engineer may be tasked with creating a data pipeline for customer sales analysis. But what is a customer? Is it someone who’s purchased in the last 90 days? Or someone who’s purchased at any time the business has been open? A data…
- Location 1484
- metadata, business metadata,
Technical metadata describes the data created and used by systems across the data engineering lifecycle. It includes the data model and schema, data lineage, field mappings, and pipeline workflows. A data engineer uses technical metadata to create, connect, and monitor various systems across the data engineering lifecycle. Here are some common types of technical metadata that a data…
- Location 1491
- metadata,
Pipeline metadata captured in orchestration systems provides details of the workflow schedule, system and data dependencies, configurations…
- Location 1497
- data orchestration, data pipelines,
Data-lineage metadata tracks the origin and changes to data, and its dependencies, over time. As data flows through the data engineering lifecycle, it evolves through transformations and combinations with other data. Data lineage provides an audit trail of…
- Location 1500
-
Schema metadata describes the structure of data stored in a system such as a database, a data warehouse, a…
- Location 1503
-
Operational metadata describes the operational results of various systems and includes statistics about processes, job IDs, application runtime logs, data used in a process, and error logs. A data engineer uses operational metadata to determine whether a…
- Location 1508
-
Reference metadata is data used to classify other data. This is also referred to as lookup data. Standard examples of reference data are internal codes, geographic codes, units of measurement, and internal calendar standards. Note that much of reference data is fully managed internally, but items such as geographic codes might come from standard external references. Reference data is…
- Location 1514
-
Data accountability means assigning an individual to govern a portion of data. The responsible person then coordinates the governance…
- Location 1522
-
Data quality is the optimization of data toward the desired state and orbits the question, “What do you get compared with what you expect?” Data should conform to the expectations in the business metadata. Does the data match the definition agreed upon by the business?
- Location 1535
-
According to Data Governance: The Definitive Guide, data quality is defined by three main characteristics:4
- Accuracy: Is the collected data factually correct? Are there duplicate values? Are the numeric values accurate?
- Completeness: Are the records complete? Do all required fields contain valid values?
- Timeliness: Are records available in a timely fashion?
- Location 1542
- data quality,
- [note::Data quality is characterized by:
- Accuracy
- Completeness
- Timeliness]
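The three characteristics can be expressed as simple automated checks. The record layout and thresholds below are made-up examples, not from the book:

```python
from datetime import datetime, timedelta

def check_quality(records, required_fields, max_age_days=1, now=None):
    """Run accuracy, completeness, and timeliness checks on a list of dicts."""
    now = now or datetime.utcnow()
    ids = [r["id"] for r in records]
    return {
        # Accuracy: no duplicate keys (one simple proxy for factual correctness)
        "accuracy": len(ids) == len(set(ids)),
        # Completeness: every required field is present and non-null
        "completeness": all(r.get(f) is not None
                            for r in records for f in required_fields),
        # Timeliness: every record arrived within the freshness window
        "timeliness": all(now - r["loaded_at"] <= timedelta(days=max_age_days)
                          for r in records),
    }
```

In practice each failed check would feed monitoring or alerting rather than just returning a flag.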
Master data is data about business entities such as employees, customers, products, and locations.
- Location 1560
-
- [note::Do people in the field have qualms about the use of "master" in this term?]
Master data management (MDM) is the practice of building consistent entity definitions known as golden records. Golden records harmonize entity data across an organization and with its partners.
- Location 1565
-
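A golden record can be sketched as a field-by-field survivorship merge across source systems. The precedence rule here (most trusted source wins each field) is one illustrative choice, not the book's prescription:

```python
def build_golden_record(source_records, source_priority):
    """Merge per-source versions of one entity, field by field,
    preferring the highest-priority source that has a value."""
    golden = {}
    # Order sources so the most trusted one claims each field first.
    ordered = sorted(source_records.items(),
                     key=lambda kv: source_priority.index(kv[0]))
    for _, record in ordered:
        for field, value in record.items():
            if field not in golden and value is not None:
                golden[field] = value
    return golden
```

For example, a CRM might be trusted for names while billing is trusted to fill in missing emails.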
Data modeling and design
- Location 1575
-
Whereas we traditionally think of data modeling as a problem for database administrators (DBAs) and ETL developers, data modeling can happen almost anywhere in an organization.
- Location 1580
-
Data engineers need to understand modeling best practices as well as develop the flexibility to apply the appropriate level and type of modeling to the data source and use case.
- Location 1591
-
Data lineage
- Location 1593
-
Data lineage describes the recording of an audit trail of data through its lifecycle, tracking both the systems that process the data and the upstream data it depends on.
- Location 1597
- blue,
We also note that Andy Petrella’s concept of Data Observability Driven Development (DODD) is closely related to data lineage. DODD observes data all along its lineage. This process is applied during development, testing, and finally production to deliver quality and conformity to expectations.
- Location 1603
-
Data integration and interoperability is the process of integrating data across tools and processes. As we move away from a single-stack
- Location 1608
- blue,
Data integration and interoperability
- Location 1608
-
Increasingly, integration happens through general-purpose APIs rather than custom database connections. For example, a data pipeline might pull data from the Salesforce API, store it to Amazon S3, call the Snowflake API to load it into a table, call the API again to run a query, and then export the results to S3 where Spark can consume them.
- Location 1614
-
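The pipeline described above can be sketched as a linear sequence of steps. The client functions are injected placeholders standing in for the real Salesforce, S3, and Snowflake SDKs, whose actual call signatures differ:

```python
def run_sales_pipeline(extract_salesforce, put_s3, snowflake_load, snowflake_query):
    """Run the pipeline from the text. Each argument is a caller-supplied
    function wrapping the relevant service's API."""
    raw = extract_salesforce()                       # pull data from the Salesforce API
    raw_path = put_s3("raw/sales.json", raw)         # land it in object storage
    snowflake_load("sales", raw_path)                # load the landed file into a table
    result = snowflake_query(
        "SELECT region, SUM(amount) FROM sales GROUP BY region")
    return put_s3("results/summary.json", result)    # export results for Spark to consume
```

Injecting the clients keeps the flow testable without touching the actual services.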
Data lifecycle management
- Location 1622
-
This means we have pay-as-you-go storage costs instead of large up-front capital expenditures for an on-premises data lake. When every byte shows up on a monthly AWS statement, CFOs see opportunities for savings.
- Location 1627
-
Ethics and privacy
- Location 1638
-
Data used to live in the Wild West, freely collected and traded like baseball cards. Those days are long gone. Whereas data’s ethical and privacy implications were once considered nice to have, like security, they’re now central to the general data lifecycle. Data engineers need to do the right thing when no one else is watching, because everyone will be watching someday.
- Location 1643
-
Ensure that your data assets are compliant with a growing number of data regulations, such as GDPR and CCPA. Please take this seriously. We offer tips throughout the book to ensure that you’re baking ethics and privacy into the data engineering lifecycle.
- Location 1649
-
DataOps maps the best practices of Agile methodology, DevOps, and statistical process control (SPC) to data. Whereas DevOps aims to improve the release and quality of software products, DataOps does the same thing for data products.
- Location 1653
-
DataOps
- Location 1653
-
Like DevOps, DataOps borrows much from lean manufacturing and supply chain management, mixing people, processes, and technology to reduce time to value. As Data Kitchen (experts in DataOps) describes it:7 DataOps is a collection of technical practices, workflows, cultural norms, and architectural patterns that enable:
- Rapid innovation and experimentation delivering new insights to customers with increasing velocity
- Extremely high data quality and very low error rates
- Collaboration across complex arrays of people, technology, and environments
- Clear measurement, monitoring, and transparency of results
- Location 1660
- dev ops, data ops,
Observability and monitoring
- Location 1712
-
As we tell our clients, “Data is a silent killer.” We’ve seen countless examples of bad data lingering in reports for months or years. Executives may make key decisions from this bad data, discovering the error only much later. The outcomes are usually bad and sometimes catastrophic for the business. Initiatives are undermined and destroyed, years of work wasted. In some of the worst cases, bad data may lead companies to financial ruin.
- Location 1713
-
Observability, monitoring, logging, alerting, and tracing are all critical to getting ahead of any problems along the data engineering lifecycle. We recommend you incorporate SPC to understand whether events being monitored are out of line and which incidents are worth responding to.
- Location 1723
-
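A minimal SPC-style check flags a monitored metric when it drifts beyond control limits derived from its history. This is a generic sketch of the idea (mean plus or minus three standard deviations), not a prescribed implementation:

```python
import statistics

def out_of_control(history, observation, sigmas=3.0):
    """Return True if the new observation falls outside the control limits
    (mean +/- sigmas * stdev) computed from historical values."""
    mean = statistics.mean(history)
    stdev = statistics.stdev(history)
    return abs(observation - mean) > sigmas * stdev
```

Applied to, say, a table's daily row count, this separates routine variation from incidents worth responding to.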
The purpose of DODD is to give everyone involved in the data value chain visibility into the data and data applications, so that they can identify changes to the data or data applications at every step, from ingestion to transformation to analysis, to help troubleshoot or prevent data issues. DODD focuses on making data observability a first-class consideration in the data engineering lifecycle.
- Location 1731
-
Incident response
- Location 1736
-
Trust takes a long time to build and can be lost in minutes. Incident response is as much about retroactively responding to incidents as proactively addressing them before they happen.
- Location 1748
-
DataOps summary
- Location 1750
-
Data Architecture
- Location 1761
-
Orchestration
- Location 1774
-
Orchestration is the process of coordinating many jobs to run as quickly and efficiently as possible on a scheduled cadence.
- Location 1782
- blue,
While many other interesting open source orchestration projects exist, such as Luigi and Conductor, Airflow is arguably the mindshare leader for the time being.
- Location 1802
- conductor, luigi, data orchestration, airflow,
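At its core, an orchestrator like Airflow resolves a graph of job dependencies into a valid execution order. A minimal sketch of that idea using topological sorting, independent of any particular tool:

```python
from graphlib import TopologicalSorter

def execution_order(dependencies):
    """dependencies maps each job to the set of jobs it must run after.
    Returns one valid run order honoring every dependency."""
    return list(TopologicalSorter(dependencies).static_order())
```

A real orchestrator adds scheduling, retries, and parallel execution of independent jobs on top of this ordering.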
Software Engineering
- Location 1812
-
Core data processing code
- Location 1821
-
Whether in ingestion, transformation, or data serving, data engineers need to be highly proficient and productive in frameworks and languages such as Spark, SQL, or Beam;
- Location 1823
-
It’s also imperative that a data engineer understand proper code-testing methodologies, such as unit, regression, integration, end-to-end, and smoke.
- Location 1825
-
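As one example from that list, a unit test exercises a single transformation in isolation. The function under test and its cases below are illustrative, not from the book:

```python
import unittest

def normalize_email(raw):
    """Example transformation under test: trim whitespace and lowercase an email."""
    return raw.strip().lower()

class TestNormalizeEmail(unittest.TestCase):
    def test_strips_whitespace_and_lowercases(self):
        self.assertEqual(normalize_email("  Ada@Example.COM "), "ada@example.com")

    def test_clean_input_is_unchanged(self):
        self.assertEqual(normalize_email("ada@example.com"), "ada@example.com")
```

Regression, integration, end-to-end, and smoke tests widen the same discipline to whole pipelines and environments.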
Development of open source frameworks
- Location 1827
-
Streaming
- Location 1840
-
Infrastructure as code
- Location 1851
-
Pipelines as code
- Location 1862
-
General-purpose problem solving
- Location 1870
-
In practice, regardless of which high-level tools they adopt, data engineers will run into corner cases throughout the data engineering lifecycle that require them to solve problems outside the boundaries of their chosen tools and to write custom code. When using frameworks like Fivetran, Airbyte, or Matillion, data engineers will encounter data sources without existing connectors and need to write something custom. They should be proficient in software engineering to understand APIs, pull and transform data, handle exceptions, and so forth.
- Location 1871
-
Conclusion
- Location 1876
-
A data engineer has several top-level goals across the data lifecycle: produce optimum ROI and reduce costs (financial and opportunity), reduce risk (security, data quality), and maximize data value and utility.
- Location 1886
- favorite,
Additional Resources
- Location 1891
-
“Democratizing Data at Airbnb” by Chris Williams et al.
- Location 1898
-
“Five Steps to Begin Collecting the Value of Your Data” Lean-Data web page
- Location 1899
-
“Getting Started with DevOps Automation” by Jared Murrell
- Location 1900
-
“Staying Ahead of Debt” by Etai Mizrahi
- Location 1907
-
“What Is Metadata” by Michelle Knight
- Location 1908
-
3 Chris Williams et al., “Democratizing Data at Airbnb,” The Airbnb Tech Blog, May 12, 2017, https://oreil.ly/dM332.
- Location 1913
-
ethical behavior is doing the right thing when no one is watching,
- Location 1921
-
“What Is DataOps,” DataKitchen FAQ page, accessed May 5, 2022, https://oreil.ly/Ns06w.
- Location 1923
-
Designing Good Data Architecture
- Location 1931
-
What Is Data Architecture?
- Location 1936
-
Enterprise Architecture Defined
- Location 1947
-
Figure 3-1. Data architecture is a subset of enterprise architecture
- Location 1952
-
TOGAF is The Open Group Architecture Framework, a standard of The Open Group. It’s touted as the most widely used architecture framework today.
- Location 1962
-
Gartner is a global research and advisory company that produces research articles and reports on trends related to enterprises. Among other things, it is responsible for the (in)famous Gartner Hype Cycle.
- Location 1971
-
Data Architecture Defined
- Location 2031
-
“Good” Data Architecture
- Location 2068
-
Principles of Good Data Architecture
- Location 2090
-
Principle 1: Choose Common Components Wisely
- Location 2115
-
Principle 2: Plan for Failure
- Location 2138
-
Principle 3: Architect for Scalability
- Location 2160
-
Principle 4: Architecture Is Leadership
- Location 2175
-
Principle 5: Always Be Architecting
- Location 2195
-
Principle 6: Build Loosely Coupled Systems
- Location 2214
-
Principle 7: Make Reversible Decisions
- Location 2255
-
Principle 8: Prioritize Security
- Location 2272
-
Principle 9: Embrace FinOps
- Location 2322
-
Major Architecture Concepts
- Location 2370
-
Domains and Services
- Location 2376
-
Distributed Systems, Scalability, and Designing for Failure
- Location 2400
-
Tight Versus Loose Coupling: Tiers, Monoliths, and Microservices
- Location 2440
-
User Access: Single Versus Multitenant
- Location 2553
-
Event-Driven Architecture
- Location 2570
-
Brownfield Versus Greenfield Projects
- Location 2591
-
Examples and Types of Data Architecture
- Location 2630
-
Data Warehouse
- Location 2635
-
Data Lake
- Location 2713
-
Convergence, Next-Generation Data Lakes, and the Data Platform
- Location 2745
-
Modern Data Stack
- Location 2767
-
Lambda Architecture
- Location 2788
-
Kappa Architecture
- Location 2807
-
The Dataflow Model and Unified Batch and Streaming
- Location 2821
-
Architecture for IoT
- Location 2842
-
Data Mesh
- Location 2905
-
Other Data Architecture Examples
- Location 2927
-
Who’s Involved with Designing a Data Architecture?
- Location 2940
-
Additional Resources
- Location 2959
-
Choosing Technologies Across the Data Engineering Lifecycle
- Location 3111
-
Team Size and Capabilities
- Location 3146
-
Speed to Market
- Location 3163
-
Cost Optimization and Business Value
- Location 3190
-
Total Cost of Ownership
- Location 3199
-
Total Opportunity Cost of Ownership
- Location 3229
-
Today Versus the Future: Immutable Versus Transitory Technologies
- Location 3263
-
Our Advice
- Location 3300
-
On Premises
- Location 3317
-
Hybrid Cloud
- Location 3425
-
Decentralized: Blockchain and the Edge
- Location 3465
-
Our Advice
- Location 3473
-
Cloud Repatriation Arguments
- Location 3499
-
Build Versus Buy
- Location 3555
-
Open Source Software
- Location 3578
-
Proprietary Walled Gardens
- Location 3649
-
Our Advice
- Location 3686
-
Monolith Versus Modular
- Location 3699
-
The Distributed Monolith Pattern
- Location 3757
-
Our Advice
- Location 3774
-
Serverless Versus Servers
- Location 3785
-
How to Evaluate Server Versus Serverless
- Location 3835
-
Our Advice
- Location 3855
-
Optimization, Performance, and the Benchmark Wars
- Location 3873
-
Big Data...for the 1990s
- Location 3894
-
Nonsensical Cost Comparisons
- Location 3901
-
Asymmetric Optimization
- Location 3906
-
Caveat Emptor
- Location 3912
-
Undercurrents and Their Impacts on Choosing Technologies
- Location 3914
-
Data Management
- Location 3920
-
Data Architecture
- Location 3940
-
Orchestration Example: Airflow
- Location 3946
-
Software Engineering
- Location 3966
-
Additional Resources
- Location 3981
-
The Data Engineering Lifecycle in Depth
- Location 4028
-
Data Generation in Source Systems
- Location 4029
-
Sources of Data: How Is Data Created?
- Location 4043
-
Source Systems: Main Ideas
- Location 4061
-
Files and Unstructured Data
- Location 4064
-
Application Databases (OLTP Systems)
- Location 4085
-
Online Analytical Processing System
- Location 4137
-
Change Data Capture
- Location 4152
-
Database Logs
- Location 4191
-
Insert-Only
- Location 4212
-
Messages and Streams
- Location 4231
-
Types of Time
- Location 4264
-
Source System Practical Details
- Location 4287
-
Data Sharing
- Location 4588
-
Third-Party Data Sources
- Location 4605
-
Message Queues and Event-Streaming Platforms
- Location 4620
-
Whom You’ll Work With
- Location 4726
-
Undercurrents and Their Impact on Source Systems
- Location 4760
-
Data Management
- Location 4778
-
Data Architecture
- Location 4813
-
Software Engineering
- Location 4838
-
Additional Resources
- Location 4865
-
Raw Ingredients of Data Storage
- Location 4924
-
Magnetic Disk Drive
- Location 4934
-
Solid-State Drive
- Location 4975
-
Random Access Memory
- Location 4990
-
Networking and CPU
- Location 5025
-
Data Storage Systems
- Location 5098
-
Single Machine Versus Distributed Storage
- Location 5103
-
Eventual Versus Strong Consistency
- Location 5117
-
File Storage
- Location 5146
-
Block Storage
- Location 5205
-
Object Storage
- Location 5281
-
Cache and Memory-Based Storage Systems
- Location 5428
-
The Hadoop Distributed File System
- Location 5448
-
Streaming Storage
- Location 5474
-
Indexes, Partitioning, and Clustering
- Location 5488
-
Data Engineering Storage Abstractions
- Location 5542
-
The Data Warehouse
- Location 5560
-
The Data Lake
- Location 5573
-
The Data Lakehouse
- Location 5581
-
Data Platforms
- Location 5598
-
Stream-to-Batch Storage Architecture
- Location 5606
-
Big Ideas and Trends in Storage
- Location 5616
-
Data Catalog
- Location 5622
-
Data Sharing
- Location 5644
-
Separation of Compute from Storage
- Location 5672
-
Data Storage Lifecycle and Data Retention
- Location 5769
-
Single-Tenant Versus Multitenant Storage
- Location 5862
-
Whom You’ll Work With
- Location 5888
-
Undercurrents
- Location 5898
-
Security
- Location 5902
-
Conclusion
- Location 5956
-
Ingestion
- Location 5987
-
What Is Data Ingestion?
- Location 5994
-
Key Engineering Considerations for the Ingestion Phase
- Location 6029
-
Bounded Versus Unbounded Data
- Location 6048
-
Frequency
- Location 6065
-
Synchronous Versus Asynchronous Ingestion
- Location 6093
-
Serialization and Deserialization
- Location 6121
-
Throughput and Scalability
- Location 6130
-
Reliability and Durability
- Location 6146
-
Payload
- Location 6163
-
Push Versus Pull Versus Poll Patterns
- Location 6237
-
Batch Ingestion Considerations
- Location 6253
-
Snapshot or Differential Extraction
- Location 6272
-
File-Based Export and Ingestion
- Location 6281
-
ETL Versus ELT
- Location 6290
-
Inserts, Updates, and Batch Size
- Location 6304
-
Data Migration
- Location 6315
-
Message and Stream Ingestion Considerations
- Location 6330
-
Schema Evolution
- Location 6336
-
Late-Arriving Data
- Location 6346
-
Ordering and Multiple Delivery
- Location 6356
-
Replay
- Location 6360
-
Time to Live
- Location 6366
-
Message Size
- Location 6377
-
Error Handling and Dead-Letter Queues
- Location 6383
-
Consumer Pull and Push
- Location 6395
-
Location
- Location 6403
-
Ways to Ingest Data
- Location 6411
-
Direct Database Connection
- Location 6415
-
Change Data Capture
- Location 6447
-
Message Queues and Event-Streaming Platforms
- Location 6522
-
Managed Data Connectors
- Location 6548
-
Moving Data with Object Storage
- Location 6561
-
Databases and File Export
- Location 6579
-
Practical Issues with Common File Formats
- Location 6589
-
Shell
- Location 6607
-
SSH
- Location 6617
-
SFTP and SCP
- Location 6625
-
Web Interface
- Location 6653
-
Web Scraping
- Location 6659
-
Transfer Appliances for Data Migration
- Location 6677
-
Data Sharing
- Location 6694
-
Whom You’ll Work With
- Location 6705
-
Upstream Stakeholders
- Location 6710
-
Downstream Stakeholders
- Location 6723
-
Undercurrents
- Location 6737
-
Security
- Location 6740
-
Data Management
- Location 6746
-
DataOps
- Location 6791
-
Orchestration
- Location 6834
-
Software Engineering
- Location 6844
-
Conclusion
- Location 6855
-
Additional Resources
- Location 6861
-
Queries, Modeling, and Transformation
- Location 6877
-
Queries
- Location 6903
-
What Is a Query?
- Location 6908
-
The Life of a Query
- Location 6957
-
The Query Optimizer
- Location 6970
-
Improving Query Performance
- Location 6978
-
Queries on Streaming Data
- Location 7123
-
Data Modeling
- Location 7258
-
What Is a Data Model?
- Location 7275
-
Conceptual, Logical, and Physical Data Models
- Location 7292
-
Normalization
- Location 7328
-
Techniques for Modeling Batch Analytical Data
- Location 7503
-
Modeling Streaming Data
- Location 7881
-
Transformations
- Location 7907
-
Batch Transformations
- Location 7931
-
Materialized Views, Federation, and Query Virtualization
- Location 8270
-
Streaming Transformations and Processing
- Location 8338
-
Whom You’ll Work With
- Location 8395
-
Upstream Stakeholders
- Location 8405
-
Downstream Stakeholders
- Location 8417
-
Undercurrents
- Location 8424
-
Security
- Location 8427
-
Data Management
- Location 8436
-
DataOps
- Location 8458
-
Data Architecture
- Location 8479
-
Orchestration
- Location 8490
-
Software Engineering
- Location 8495
-
Conclusion
- Location 8515
-
Additional Resources
- Location 8528
-
Serving Data for Analytics, Machine Learning, and Reverse ETL
- Location 8614
-
General Considerations for Serving Data
- Location 8627
-
Trust
- Location 8633
-
What’s the Use Case, and Who’s the User?
- Location 8668
-
Data Products
- Location 8687
-
Self-Service or Not?
- Location 8709
-
Data Definitions and Logic
- Location 8732
-
Data Mesh
- Location 8757
-
Analytics
- Location 8768
-
Business Analytics
- Location 8779
-
Operational Analytics
- Location 8820
-
Embedded Analytics
- Location 8856
-
Machine Learning
- Location 8884
-
What a Data Engineer Should Know About ML
- Location 8899
-
Ways to Serve Data for Analytics and ML
- Location 8933
-
File Exchange
- Location 8938
-
Databases
- Location 8959
-
Streaming Systems
- Location 8989
-
Query Federation
- Location 8999
-
Data Sharing
- Location 9018
-
Semantic and Metrics Layers
- Location 9026
-
Serving Data in Notebooks
- Location 9053
-
Reverse ETL
- Location 9094
-
Whom You’ll Work With
- Location 9125
-
Undercurrents
- Location 9141
-
Security
- Location 9147
-
Data Management
- Location 9171
-
DataOps
- Location 9185
-
Data Architecture
- Location 9197
-
Orchestration
- Location 9204
-
Software Engineering
- Location 9219
-
Conclusion
- Location 9236
-
Additional Resources
- Location 9245
-
Security, Privacy, and the Future of Data Engineering
- Location 9285
-
Security and Privacy
- Location 9287
-
The Power of Negative Thinking
- Location 9320
-
Always Be Paranoid
- Location 9329
-
Processes
- Location 9337
-
Security Theater Versus Security Habit
- Location 9341
-
Active Security
- Location 9351
-
Shared Responsibility in the Cloud
- Location 9371
-
Always Back Up Your Data
- Location 9378
-
An Example Security Policy
- Location 9387
-
Technology
- Location 9416
-
Patch and Update Systems
- Location 9420
-
Encryption
- Location 9426
-
Logging, Monitoring, and Alerting
- Location 9448
-
Network Access
- Location 9473
-
Security for Low-Level Data Engineering
- Location 9491
-
Conclusion
- Location 9509
-
Additional Resources
- Location 9512
-
The Future of Data Engineering
- Location 9518
-
The Data Engineering Lifecycle Isn’t Going Away
- Location 9534
-
The Decline of Complexity and the Rise of Easy-to-Use Data Tools
- Location 9543
-
The Cloud-Scale Data OS and Improved Interoperability
- Location 9572
-
“Enterprisey” Data Engineering
- Location 9617
-
Titles and Responsibilities Will Morph...
- Location 9630
-
Moving Beyond the Modern Data Stack, Toward the Live Data Stack
- Location 9652
-
The Live Data Stack
- Location 9674
-
Streaming Pipelines and Real-Time Analytical Databases
- Location 9687
-
The Fusion of Data with Applications
- Location 9722
-
The Tight Feedback Between Applications and ML
- Location 9731
-
Dark Matter Data and the Rise of...Spreadsheets?!
- Location 9742
-
Conclusion
- Location 9774
-
Serialization and Compression Technical Details
- Location 9805
-
Serialization Formats
- Location 9809
-
Row-Based Serialization
- Location 9817
-
Columnar Serialization
- Location 9849
-
Hybrid Serialization
- Location 9903
-
Database Storage Engines
- Location 9922
-
Compression: gzip, bzip2, Snappy, Etc.
- Location 9934
-
Cloud Networking
- Location 9956
-
Cloud Network Topology
- Location 9959
-
Data Egress Charges
- Location 9965
-
Availability Zones
- Location 9973
-
GCP-Specific Networking and Multiregional Redundancy
- Location 10000
-
Direct Network Connections to the Clouds
- Location 10016
-
CDNs
- Location 10023
-
The Future of Data Egress Fees
- Location 10029
-
About the Authors
- Location 11858
-
Joe Reis is a business-minded data nerd who’s worked in the data industry for 20 years, with responsibilities ranging from statistical modeling, forecasting, machine learning, data engineering, data architecture, and almost everything else in between. Joe is the CEO and cofounder of Ternary Data, a data engineering and architecture consulting firm based in Salt Lake City, Utah. In addition, he volunteers with several technology groups and teaches at the University of Utah.
- Location 11859
-
Matt Housley is a data engineering consultant and cloud specialist. After some early programming experience with Logo, Basic, and 6502 assembly, he completed a PhD in mathematics at the University of Utah. Matt then began working in data science, eventually specializing in cloud-based data engineering. He cofounded Ternary Data with Joe Reis, where he leverages his teaching experience to train future data engineers and advise teams on robust data architecture. Matt and Joe also pontificate on all things data on The Monday Morning Data Chat.
- Location 11863
-