Your Pipeline Will Fail, But Here's How Data Engineers Handle It
Why Every Data Engineer Needs to Plan for Failure

The pipeline will fail. Not something you want to hear, but it will.
As sure as the sun will rise, that pipeline will eventually have a bad day.
Which, in turn, means you’re in for it too.
When a pipeline fails, there’s nowhere to hide. All eyes will instantly turn to you and your team, expecting you to get your sh*t together — quickly. This all depends on the pipeline, of course, but in my experience, as soon as the pipeline stops, the clock is ticking. If the pipeline involves money or people, that clock ticks even faster.
It’s just a matter of time. Eventually, the pipeline will break for some reason — whether it’s a schema change somewhere up the chain, a network failure, or human error. The pipeline will catch fire, and you need to be ready.
A lot of wide-eyed, bushy-tailed data engineers rock up on day one thinking it’s all sunshine and rainbows, where pipelines run smoothly and everything is perfect. The truth is, 90% of the time things go well, and no one even bats an eyelid. No one notices the pipelines running day in and day out with no issues (well, your team cares, obviously, but folks outside that bubble are none the wiser).
It’s the 10% that will get you. The 10% that you need to watch out for. People only take notice when things stop working. Remember that. A data pipeline that serves others is all about trust. When things break, that trust is broken. As soon as someone out there questions the reliability of the data you serve, it’s game over.
So the trick is to be on it, to play the data game defensively.
Some ways to cover your ass
Common Sense
You cannot put a price on good ol’ common sense when you work with data. Common sense and high attention to detail will likely save your ass more times than anything else on this list. Human error is one of the biggest causes of failure I’ve seen, and people leaning on AI even more so.
Design for Failures
Think cost control, scaling, retry mechanisms, and other things on this list. If you build pipelines with the worst-case scenario in mind, you will likely avoid a lot of pain.
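For instance, a transient network blip should not take down the whole run. Below is a minimal retry sketch with exponential backoff and jitter; the extract_batch() call and the exception types are placeholders for whatever flaky step your pipeline actually has.

```python
import random
import time

def with_retries(task, max_attempts=4, base_delay=2.0):
    """Run a task, retrying transient failures with exponential backoff."""
    for attempt in range(1, max_attempts + 1):
        try:
            return task()
        except (ConnectionError, TimeoutError) as exc:
            if attempt == max_attempts:
                raise  # out of attempts: let the failure surface and alert
            # Back off 2s, 4s, 8s... plus jitter so retries don't stampede
            delay = base_delay * (2 ** (attempt - 1)) + random.uniform(0, 1)
            print(f"Attempt {attempt} failed ({exc}); retrying in {delay:.1f}s")
            time.sleep(delay)

# Hypothetical usage: wrap a flaky extract step
# rows = with_retries(lambda: extract_batch("orders", run_date="2025-01-01"))
```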
Testing
Unit testing and integration testing, although painful at times, will cover your ass when things go south. Tests will catch things early (hopefully) long before they reach production. Tests are there to catch the small things you often overlook when you are moving fast. Write tests!
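Here is a rough sketch of what that looks like for a small transform. The dedupe_orders() function and its columns are made up for illustration; the point is that the edge case lives in a test rather than being discovered in production.

```python
# test_transforms.py -- run with pytest
import pandas as pd

def dedupe_orders(df: pd.DataFrame) -> pd.DataFrame:
    """Keep only the latest row per order_id (illustrative transform)."""
    return (
        df.sort_values("updated_at")
        .drop_duplicates("order_id", keep="last")
        .reset_index(drop=True)
    )

def test_dedupe_orders_keeps_latest_row():
    df = pd.DataFrame({
        "order_id": [1, 1, 2],
        "amount": [10.0, 12.5, 99.0],
        "updated_at": pd.to_datetime(["2025-01-01", "2025-01-02", "2025-01-01"]),
    })
    result = dedupe_orders(df)
    assert len(result) == 2  # one row per order
    assert result.loc[result.order_id == 1, "amount"].item() == 12.5  # latest row wins
```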
Version Control
If your pipeline is not in any sort of version control, stop reading now and go do it. This is beyond important, and there’s no need for me to spell out why.
Monitoring and Alerting
Good monitoring and alerting is this: if something breaks, it lets you know about it through every available form of communication known to man. Bad monitoring and alerting is Jim strolling over to your desk asking why his reports aren’t uploaded today.
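A bare-bones version of the good kind, assuming a generic incoming-webhook URL (Slack, Teams, whatever your team uses); the URL and job name below are placeholders.

```python
import json
import traceback
import urllib.request

ALERT_WEBHOOK = "https://hooks.example.com/pipeline-alerts"  # hypothetical webhook URL

def alert(message: str) -> None:
    """Post a failure message to a chat webhook so no one finds out from Jim."""
    payload = json.dumps({"text": message}).encode("utf-8")
    req = urllib.request.Request(
        ALERT_WEBHOOK, data=payload, headers={"Content-Type": "application/json"}
    )
    urllib.request.urlopen(req, timeout=10)

def run_with_alerting(job_name, job):
    try:
        job()
    except Exception:
        alert(f"🚨 {job_name} failed:\n{traceback.format_exc()}")
        raise  # still fail loudly in the scheduler

# Hypothetical usage:
# run_with_alerting("daily_orders_load", load_daily_orders)
```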
Documentation
If your documentation is your code, you are dead to me. If Joe Schmoe is the oracle in your team who knows everything, so you don’t need documentation, you may as well give up. Documentation will make or break a good data engineering team. Write good docs!
Backups and Restore
Backup everything. Coming from DBA land, this is drilled into you from day one. It scares the sh*t out of me that it isn’t a priority for more people in the Data Engineering world. When something goes south, the first question is always: is there a backup?
Do backups and here’s a pro tip: test them regularly!
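One way to take that tip seriously is a scheduled restore drill: load the latest dump into a scratch database and check that the key tables actually came back. The sketch below assumes PostgreSQL dumps written with pg_dump -Fc; the paths, database name, and table list are all illustrative.

```python
import subprocess

LATEST_DUMP = "/backups/warehouse_latest.dump"   # hypothetical path
SCRATCH_DB = "restore_check"                     # throwaway database for the drill
TABLES_TO_CHECK = ["orders", "customers"]        # hypothetical key tables

def run(cmd):
    return subprocess.run(cmd, check=True, capture_output=True, text=True)

def restore_and_verify():
    # Restore the latest dump into the scratch database
    run(["pg_restore", "--clean", "--if-exists", "--no-owner",
         "-d", SCRATCH_DB, LATEST_DUMP])
    # Sanity check: every key table should come back with rows in it
    for table in TABLES_TO_CHECK:
        out = run(["psql", "-d", SCRATCH_DB, "-tAc", f"SELECT count(*) FROM {table}"])
        count = int(out.stdout.strip())
        assert count > 0, f"Restore check failed: {table} is empty"
        print(f"{table}: {count} rows restored")

if __name__ == "__main__":
    restore_and_verify()
```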
Data Validation and Checking
“These numbers don’t match” are some of the worst words you will hear in data engineering. The great data teams of the world have validation rules for every part of the pipeline. The quality of the data you deliver is paramount, so build checks that (surprise, surprise) check your data on the way through; that way, Karen in finance doesn’t have to ask you why there are duplicates in her data.
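As a rough sketch, the last step before handing data over can be a handful of checks like these. The column names are illustrative and dedicated validation tools exist, but even this much stops a lot of awkward questions.

```python
import pandas as pd

def validate_output(df: pd.DataFrame) -> list[str]:
    """Return a list of validation failures; an empty list means ship it."""
    problems = []
    if df.empty:
        problems.append("output is empty")
        return problems
    if (dupes := df["invoice_id"].duplicated().sum()):
        problems.append(f"{dupes} duplicate invoice_id values")
    if (nulls := df["amount"].isna().sum()):
        problems.append(f"{nulls} rows with a null amount")
    if (df["amount"] < 0).any():
        problems.append("negative amounts found")
    return problems

# Hypothetical usage at the end of the pipeline:
# issues = validate_output(final_df)
# if issues:
#     raise ValueError("Validation failed: " + "; ".join(issues))
```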
The Reality is…
There are probably hundreds of things you can do to avoid failure, and you could do it all, and STILL, at some appointed hour, that pipeline will fail. That’s just life and the way things go when you work with data. On those days, you need to take it on the chin, fix it, and move on — but fix it better, learn, and improve — be better. It’s all you can do if you’ve done your best to mitigate failure.
Some of the most common failures I’ve seen in my time as a data engineer and DBA are as follows: schema changes upstream (that no one thought to mention to you or your team), changes in data volume and resource issues, permissions changes, human errors (like poorly written queries locking up the database for all), code rushed into production that killed the system, data quality problems, and upgrades gone wrong. I could go on, but I tell you all this because things go wrong from time to time.
You need to learn to deal with failures in your pipelines. As they say, hindsight is 20/20: it’s always easier to look back and spot the warning signs. If you stick around long enough in the game, you will learn to do things right, and with time and experience, you will learn to spot potential issues before they crop up.
The number one rule I have when dealing with failures is to go slow. Instinct is to rush and fix things fast, but I’ve seen rushing burn people time and again. Slow is the only way out of problems. So, go slooooooow.
📩 Think this is valuable? Share it with someone who might benefit.
Thanks for reading! I send this email weekly. If you would like to receive it, join other readers and subscribe below to get my latest articles.
👉 Enjoyed this post? Click the ❤️ button so more people can discover it on Substack 🙏
🚀 Want to get in touch? Feel free to find me on LinkedIn.
🎯 Want to read more of my articles? Find me on Medium.
✍🏼 Write Content for Art of Data Engineering Publication.