I recently read, “one big choice shapes a hundred more.” This couldn’t be truer when it comes to data.
Every choice you make has a ripple effect. The problem is, most people don’t stop to think about the repercussions of a choice until it’s too late.
Example: “Let me patch this aggregation job real quick. Don’t worry, we’ll come back and fix it later” — That ‘quick fix’ is now in production, buddy, and it’s not going anywhere.
Fast forward two years: people are questioning what this piece of code is and why it’s skewing the numbers so drastically. This little gremlin’s not going to be pulled out easily, especially if fixing it risks hurting someone’s monthly figures — that's the truth of it.
They say, “Data is a silent killer.” It really is. Data will bite you hard if you are not careful. It’s all about choices.
Don’t get me wrong — most choices are decent enough. But no one notices those when things go smoothly. It’s the bad decisions that will come back and nail you and your team. Then it’s all hands on deck.
🤯 Classic Bad Decisions
Not building things that are simple to change (i.e., reversible, modular, loosely coupled).
Not sticking to tools and technologies the team is familiar with.
Not planning and diving headfirst into building.
Skipping merrily over data quality.
Not prioritising requirements, overengineering a simple problem.
“Shiny object syndrome.”
You need to build your house on a solid foundation, and you also need to get your house in order.
If you fail to do both, then you’re stacking sandcastles and the time bombs (yes, the bombs — plural) are ticking. It’s only a matter of time.
Maybe it’s because I was a DBA, or maybe it’s because I’ve been through the trenches and been burned (more than a few times). I’ve had some hard lessons, and there are some “data pillars” I believe in.
Pillar 1 — You Can’t Lose or Corrupt Data
There’s no level zero in data engineering, but if there were, this would be it:
YOU CANNOT LOSE DATA!
This is basic stuff — back up your work, back up your data however you choose, but know how to get it back if you lose it. Then monitor things to catch issues before they escalate.
Backups, Recovery, and Monitoring — in that order. Reliable data ingestion is your priority number one.
Prevent data loss.
Know how to get it back (and practice this). Just because it’s in the cloud doesn’t mean it’s magically safe.
Understand the SLAs for your data recovery. How much data is your company okay with losing? Do you know? I bet you don’t.
Every company has different levels of tolerance for failure. Banks have zero tolerance, your online shoe store might be more lenient.
⚡ The Basics: Data Backups, Recovery, and Monitoring.
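As a minimal sketch of “prevent, recover, and actually practice recovery,” here is a hypothetical Python example using SQLite (the principle matters here, not the tool): take a consistent snapshot of a live database, then verify that the copy restores cleanly. The function names are my own for illustration.

```python
import sqlite3

def backup_database(src_path: str, dest_path: str) -> None:
    """Copy a live SQLite database using the online backup API,
    which gives a consistent snapshot even during writes."""
    src = sqlite3.connect(src_path)
    dest = sqlite3.connect(dest_path)
    with dest:
        src.backup(dest)
    src.close()
    dest.close()

def verify_restore(backup_path: str) -> bool:
    """A restore you have never tested is not a backup.
    Open the copy and run an integrity check on it."""
    conn = sqlite3.connect(backup_path)
    (result,) = conn.execute("PRAGMA integrity_check").fetchone()
    conn.close()
    return result == "ok"
```

The point is the second function: a backup job that never proves it can restore is only half a backup job.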
Pillar 2 — Control Who Can Access or Modify Data
Ignore security, and you are leaving the door wide open for disaster.
Data breaches and compliance violations are the price of security negligence. If the sound of that doesn’t scare you, then nothing will. They are one-way tickets out of that very same door you left wide open.
Security is non-negotiable and will bring you and your team front and center — for all the wrong reasons. There is no excuse to ignore it. There is no winning if you take a careless approach to security.
You need to understand:
Principle of least privilege: Give only the essential permissions needed to do the job, nothing more. You need to do this!
Log and audit everything: If someone needs access, you need an audit trail with damn good reasons for why access was granted — because when something goes wrong, the first questions will be: Who gave access? And why?
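The two rules above can be enforced in code. Here is a hypothetical sketch (the names and the in-memory log are mine, not a real library): no access grant goes through without a documented reason, and every grant leaves an audit record answering “who, what, and why.”

```python
import time

# In production this would be an append-only, tamper-evident store.
AUDIT_LOG = []

def grant_access(grantor: str, grantee: str, role: str, reason: str) -> None:
    """Grant the narrowest role that does the job, and never
    without a recorded reason -- because when something goes wrong,
    'who gave access, and why?' must be answerable."""
    if not reason.strip():
        raise ValueError("Access grants require a documented reason.")
    AUDIT_LOG.append({
        "ts": time.time(),
        "grantor": grantor,
        "grantee": grantee,
        "role": role,
        "reason": reason,
    })
```

The design choice is that the audit trail is not optional logging bolted on later; the grant and the record are one operation.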
There is a real problem out in the wild. Blanket access is given during setup and then never revisited. I’ve seen it time and again. Watch any data tutorial online when it comes to security — they give full access and move on. This is what they’re teaching people out there. This is an epic fail.
The biggest security risk to data is YOU. Don’t be the person who leaves the door wide open.
🧐 Security isn’t optional — it’s a foundational pillar of everything you build.
Pillar 3 — Make Reversible Decisions
One of the worst things in data is living with the problem. You know the one. Everyone knows about it. Everyone sees it. No one talks about it. You know it’s a problem when something breaks in it, and everyone is too damn scared to touch it for fear of it falling to pieces.
These tend to be systems that were built willy-nilly: they started out with good intentions, then over time patches were added, things bolted on and coupled together, and bada bing, bada boom — a Frankenstein.
I tell you this because in data (especially nowadays), you cannot afford to make decisions or build things that are hard to change. You need to be agile, adaptable. The things you build should be easy to change. Simple to reverse.
The questions you need to ask yourself when you do anything:
Is this a reversible or irreversible decision?
How hard will it be to manage going forward?
What will this look like in 6 months, 1 year, or 2 years?
Irreversible decisions can ruin a team and its reputation, and once that’s happened, it’s game over. Trust is hard to earn and easy to lose in this game.
Avoid irreversible decisions by:
Taking your time to fully plan and understand a task, project, or system.
Finding time to think before diving into the doing part.
Pillar 4 — Build Loosely Coupled Systems (If You Can)
Think of a house of cards — one change here affects ten more, which affect ten more. Everything is connected and dependent on another. A change to one component should not bring half a dozen other things to a halt.
If you build systems or processes that are tightly coupled, then you are building what the experts call a “rat’s nest.” If you find yourself in this situation, you best buckle up — it’s going to get wild.
Every code change you push will leave you thinking in the back of your mind: What am I about to break? Did I remember everything?
This is no way to work.
I’ll be the first to admit that fully decoupling isn’t always possible — some things just have dependencies, and that’s that. But you should aim to decouple as much as you can. Think reusability. Loose coupling makes life easier, and scalability becomes a real option.
How to aim for decoupling:
Define clear boundaries and interfaces between components and processes.
Use modular designs to separate responsibilities.
Minimise dependencies — only connect where there is no other alternative.
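One way to express “clear boundaries and interfaces” is structural typing. This is a minimal Python sketch with hypothetical names: the extraction side depends only on a small `Sink` interface, so the destination can be swapped without touching the upstream code.

```python
from typing import Protocol

class Sink(Protocol):
    """The boundary: extractors know this interface, not the warehouse."""
    def write(self, rows: list[dict]) -> None: ...

class ConsoleSink:
    """One implementation: dump rows to stdout for debugging."""
    def write(self, rows: list[dict]) -> None:
        for row in rows:
            print(row)

class InMemorySink:
    """Another implementation: collect rows, handy for tests."""
    def __init__(self) -> None:
        self.rows: list[dict] = []

    def write(self, rows: list[dict]) -> None:
        self.rows.extend(rows)

def extract_and_load(sink: Sink) -> None:
    """The pipeline only sees the interface, never a concrete sink."""
    rows = [{"id": 1}, {"id": 2}]  # stand-in for the real extraction
    sink.write(rows)
```

Replacing `ConsoleSink` with a real warehouse writer later is a one-line change at the call site — that is loose coupling paying for itself.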
📩 Think this is valuable? Share it with someone who might benefit.
Thanks for reading! I send this email weekly. If you would like to receive it, join other readers and subscribe below to get my latest articles.
👉 If you enjoyed reading this post and think it’s valuable, share it with someone who might benefit, or click the ❤️ button on this post so more people can discover it on Substack 🙏
🚀 Want to get in touch? Feel free to find me on LinkedIn.
🎯 Want to read more of my articles? Find me on Medium.
✍🏼 Write Content for Art of Data Engineering Publication.