📗 In this lesson

In this lesson, we’ll cover the basics of data engineering, orchestration, and where Dagster fits in.

We’ll also walk you through a preview of the project you’ll build in this course.

🎯 Lesson objectives


What’s data engineering?

Data engineering is the practice of designing and building software for collecting, storing, and managing data. The most common goal in data engineering is to enable stakeholders (such as product managers, marketing, or the C-suite) to make informed decisions with data. Other common goals include providing data to external users, generating features for a machine learning model, and empowering applications to react to events.

To enable all of these workflows, data engineers build the infrastructure and processes that produce data when it’s needed. Many of these processes must contend with significant difficulties.

Data engineering can be a difficult and time-consuming process due to the complexities of managing data from disparate sources. As data becomes larger and more complex, manual workflows become too time-intensive and unreliable. This is when data practitioners may consider adopting an orchestrator.


What’s an orchestrator?

An orchestrator is a tool that manages and coordinates complex workflows and data pipelines. The field of orchestration has continued to evolve alongside data engineering.

The first orchestrators were built to solve a simple problem: I need to run some scripts as a sequence of steps at specific times, and each step must wait for the step before it to finish before it starts. As time passed and the ceiling of what was possible in data engineering rose, people needed more from their orchestrators.
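That original problem can be sketched in a few lines of plain Python (the step names and functions here are hypothetical, chosen only to illustrate the pattern):

```python
# A minimal sketch of what early orchestrators automated: run a fixed
# sequence of steps, where each step starts only after the previous
# one finishes and receives its result.

def extract():
    # Pretend this pulls rows from a source system.
    return [1, 2, 3]

def transform(rows):
    # Pretend this cleans or reshapes the extracted rows.
    return [r * 10 for r in rows]

def load(rows):
    # Pretend this writes the transformed rows to a destination.
    return rows

def run_pipeline():
    # Each step waits for the previous step to finish before starting.
    rows = extract()
    rows = transform(rows)
    return load(rows)

run_pipeline()
```

An orchestrator's scheduler would trigger `run_pipeline` at specific times, removing the need for a human to run the script manually.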

Modern orchestrators are also used to create robustness and resiliency. When something breaks, orchestrators enable practitioners to understand where, when, and why it broke. Users expect orchestrators to let them monitor their workflows, both individual steps and entire sequences of steps. At a glance, users can see what succeeded or had an error, how long each step took to run, and how a step connects to other steps. This information makes data engineering easier: people can develop data pipelines faster and troubleshoot issues more efficiently.
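The kind of metadata an orchestrator records for each step can be illustrated with a small sketch (this is not any specific orchestrator's API, just a toy model of the status, duration, and upstream links described above):

```python
import time

# An illustrative record of one step's execution: its status (success
# or error), how long it took, and which steps it depends on.
def run_and_record(name, step, upstream=()):
    record = {"step": name, "upstream": list(upstream)}
    start = time.perf_counter()
    try:
        record["result"] = step()
        record["status"] = "success"
    except Exception as exc:
        record["status"] = "error"
        record["error"] = str(exc)
    record["duration_s"] = time.perf_counter() - start
    return record

# Run two connected steps and collect their records, the way an
# orchestrator's UI would surface them at a glance.
runs = [
    run_and_record("extract", lambda: [1, 2, 3]),
    run_and_record("transform", lambda: 6, upstream=["extract"]),
]
```

With records like these, a user can answer "what failed, how long did it take, and what feeds into it" without re-reading the pipeline's code.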

The first orchestrators removed the need for humans to run scripts manually at specific times. Today’s orchestrators continue to automate and reduce the need for human intervention. Orchestrators can retry a step when it fails, send a notification if something goes wrong, prevent multiple steps from querying a database at the same time, or run a different step depending on the result of the step before it.
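Two of those behaviors, retrying a failed step and branching on a step's result, can be sketched in plain Python (the helper, the step, and the branch names are all illustrative, not any specific orchestrator's API):

```python
import time

def run_with_retry(step, retries=3, delay=0.0):
    """Run a step, retrying up to `retries` total attempts if it raises."""
    for attempt in range(1, retries + 1):
        try:
            return step()
        except Exception:
            if attempt == retries:
                raise  # out of attempts: surface the failure
            time.sleep(delay)  # wait before the next attempt

attempts = {"count": 0}

def flaky_step():
    # Simulates a transient failure: errors on the first attempt,
    # succeeds on the second.
    attempts["count"] += 1
    if attempts["count"] < 2:
        raise RuntimeError("transient failure")
    return "fresh"

result = run_with_retry(flaky_step)

# Branch: choose the next step based on the previous step's result.
if result == "fresh":
    next_step = "load_full"
else:
    next_step = "load_incremental"
```

Real orchestrators express the same ideas declaratively, as retry policies and conditional dependencies attached to steps, rather than as hand-written control flow.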

There are multiple types of orchestrators. Some are general-purpose and accommodate many kinds of workflows; others are built for specific domains such as infrastructure, microservices, or data pipelines.