HOW TO PREPARE FOR DATA ENGINEERING INTERVIEWS
My experience interviewing with 15+ companies for data engineering roles
It is often said that software engineering and software engineering interviewing are two distinct things. Fortunately, this is much less true of data engineering interviews. You are usually asked to solve coding problems in 10-15 minutes, but these become manageable with a little practice, and the questions are typically close to what you will be solving on the job. Having interviewed with 15+ companies (a decent sample size), I can summarize most data engineering or analytics engineering interviews as follows:
Recruiter Screen
This is the first round in the process. Usually, a recruiter or a hiring manager will discuss the role with you and your interest in it. While these calls may be informal, don’t be tempted to treat them like a casual conversation. Be prepared to answer general questions such as what you’ve done in the past as an engineer, why you are interested in data engineering and what you are looking forward to in your next role. When asked if you have follow-up questions, ask what the interview stages are (how many coding rounds etc.) and how long the process typically takes. You can also ask if they have an interview preparation guide and keep it handy throughout the process.
Coding Screen (usually in Python)
This round usually lasts 10-20 minutes: you are given a problem and asked to solve it in Python. The questions usually involve hash maps, arrays or strings. LeetCode’s easy and medium problems are enough preparation to ace this round.
To best prepare for this round, I recommend solving around 20 diverse LeetCode problems that involve arrays, strings, lists and hash maps. I will also write out a separate guide for this section.
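To give a feel for the level of difficulty, here is a minimal sketch of two classic easy problems that lean on hash maps, arrays and strings. These are my own illustrative picks (the function names are made up for this example), not questions from any specific interview.

```python
from __future__ import annotations

from collections import Counter


def two_sum(nums: list[int], target: int) -> tuple[int, int] | None:
    """Return the indices of two numbers that add up to target, or None."""
    seen: dict[int, int] = {}  # value -> index where we saw it
    for i, value in enumerate(nums):
        complement = target - value
        if complement in seen:
            return seen[complement], i
        seen[value] = i
    return None


def first_unique_char(text: str) -> int:
    """Return the index of the first non-repeating character, or -1."""
    counts = Counter(text)  # hash map of character -> occurrences
    for i, ch in enumerate(text):
        if counts[ch] == 1:
            return i
    return -1
```

Both solutions trade a little extra memory for a single pass over the input, which is the kind of reasoning interviewers usually want to hear you say out loud.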
SQL Round
In my opinion, the SQL round can range from easy to difficult depending on your level of expertise. The good news is that practicing from a few resources can help you ace this round. I used only two resources, and they helped me ace every SQL round:
LeetCode database (SQL) exercises
Pgexercises.com (if you are pressed for time, practice only the aggregates section)
Be sure to review window functions, self-joins and CASE WHEN statements, as these were tricky for me in the beginning. If time permits, read up on how to write optimized and organized SQL queries (using CTEs etc.).
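As a rough practice sketch, the snippet below combines a CTE, a self-join, a window function and CASE WHEN in one query. It runs against an in-memory SQLite database from Python; the employees table, its rows and the salary threshold are all invented for illustration, and window functions assume SQLite 3.25 or newer.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE employees (
    id INTEGER PRIMARY KEY,
    name TEXT,
    dept TEXT,
    salary INTEGER,
    manager_id INTEGER
);
INSERT INTO employees VALUES
    (1, 'Ana', 'data', 150, NULL),
    (2, 'Ben', 'data', 120, 1),
    (3, 'Cy',  'data', 100, 1),
    (4, 'Dee', 'web',  130, NULL),
    (5, 'Eli', 'web',  90,  4);
""")

query = """
WITH ranked AS (                        -- CTE keeps the query readable
    SELECT e.name, e.dept, e.salary,
           m.name AS manager,           -- self-join: employee -> manager
           RANK() OVER (                -- window function: rank within dept
               PARTITION BY e.dept ORDER BY e.salary DESC
           ) AS dept_rank
    FROM employees AS e
    LEFT JOIN employees AS m ON e.manager_id = m.id
)
SELECT name, dept, manager, dept_rank,
       CASE WHEN salary >= 120 THEN 'high' ELSE 'standard' END AS band
FROM ranked
ORDER BY dept, dept_rank;
"""

rows = conn.execute(query).fetchall()
for row in rows:
    print(row)
```

The LEFT JOIN keeps employees without a manager in the result, which is a common follow-up question for self-joins.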
Data Modeling
For data modeling, the best way to practice is to think about the interviewing company and its business model, then design a mock data warehouse for it. A good example of what this round might be like is in this video.
Another very helpful resource for me was Grokking the System Design Interview. You only need to go through the data warehouse section in each chapter and design a mock data warehouse for practice.
I also recommend knowing the different data modeling techniques:
Snowflake schema
Star schema
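To make the difference between the two concrete, here is a small sketch for a hypothetical online store, again using in-memory SQLite from Python. All table and column names are assumptions made up for this example. In the star version the product dimension is denormalized (the category name lives on the product row); in the snowflake version the category is normalized into its own table, which costs one extra join.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Star schema: one fact table surrounded by denormalized dimensions.
CREATE TABLE dim_customer (customer_key INTEGER PRIMARY KEY, name TEXT, country TEXT);
CREATE TABLE dim_product  (product_key INTEGER PRIMARY KEY, name TEXT, category_name TEXT);
CREATE TABLE fact_orders (
    order_id     INTEGER PRIMARY KEY,
    customer_key INTEGER REFERENCES dim_customer(customer_key),
    product_key  INTEGER REFERENCES dim_product(product_key),
    quantity     INTEGER,
    amount       INTEGER
);

-- Snowflake variant of the product dimension: category is split out.
CREATE TABLE dim_category   (category_key INTEGER PRIMARY KEY, category_name TEXT);
CREATE TABLE dim_product_sf (
    product_key  INTEGER PRIMARY KEY,
    name         TEXT,
    category_key INTEGER REFERENCES dim_category(category_key)
);

INSERT INTO dim_customer VALUES (1, 'Ana', 'US'), (2, 'Ben', 'DE');
INSERT INTO dim_product VALUES
    (10, 'Laptop', 'Electronics'), (11, 'Phone', 'Electronics'), (12, 'Desk', 'Furniture');
INSERT INTO dim_category VALUES (1, 'Electronics'), (2, 'Furniture');
INSERT INTO dim_product_sf VALUES (10, 'Laptop', 1), (11, 'Phone', 1), (12, 'Desk', 2);
INSERT INTO fact_orders VALUES
    (100, 1, 10, 1, 1000), (101, 1, 12, 2, 400), (102, 2, 11, 1, 600);
""")

# Revenue by category in the star schema: a single join from the fact table.
star = conn.execute("""
    SELECT p.category_name, SUM(f.amount)
    FROM fact_orders AS f
    JOIN dim_product AS p ON f.product_key = p.product_key
    GROUP BY p.category_name
    ORDER BY p.category_name
""").fetchall()

# Same question in the snowflake schema: one extra join to reach the category.
snowflake = conn.execute("""
    SELECT c.category_name, SUM(f.amount)
    FROM fact_orders AS f
    JOIN dim_product_sf AS p ON f.product_key = p.product_key
    JOIN dim_category AS c ON p.category_key = c.category_key
    GROUP BY c.category_name
    ORDER BY c.category_name
""").fetchall()

print(star)
print(snowflake)
```

Both queries return the same totals; the trade-off to talk through in an interview is simpler queries (star) versus less duplicated dimension data (snowflake).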
Other DE concepts (optional if you are a new grad or intern; not optional if you have industry experience)
Almost every interview with the hiring manager as an experienced DE involved some variation of the below question:
Walk me through a data modeling challenge you recently worked on and what were some considerations you made? How did you solve it?
OR
Walk me through a pipeline you’ve built
To answer the above question, you must know the tech stack and data architecture you’ve worked with (e.g., what kind of data warehouse the company uses).
Articulating the right answer took me some time and practice with other data engineers. Spend a week or two reviewing and rewriting this for yourself. Also rehearse this answer with another data engineer who will give you feedback on your answer.
In addition to that, the concepts below typically come up in interviews as well, so be sure to brush up on them. I found that select chapters from two books (Fundamentals of Data Engineering and Designing Data-Intensive Applications) helped in learning these (and honestly, I am still reading and learning from those books):
Batch vs Streaming pipelines
OLAP vs OLTP
Basics of Hadoop
Basics of Spark
ETL vs ELT
ACID Compliance
Data Warehouse vs Data Lake vs Data Lakehouse
Row vs Column Oriented Databases
Kappa and Lambda Architectures
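For one of these concepts, row vs column oriented storage, a toy sketch in plain Python can help build intuition. The order records below are invented, and real databases use far more sophisticated on-disk formats; the point is only that the same data can be laid out record by record or column by column.

```python
# Row-oriented layout: each record is stored together.
# Handy for OLTP-style work such as fetching or updating one whole order.
row_store = [
    {"order_id": 1, "country": "US", "amount": 30},
    {"order_id": 2, "country": "DE", "amount": 45},
    {"order_id": 3, "country": "US", "amount": 25},
]


def to_columns(records):
    """Pivot a list of records into one list per column.

    Assumes every record has the same keys.
    """
    columns = {}
    for record in records:
        for key, value in record.items():
            columns.setdefault(key, []).append(value)
    return columns


# Column-oriented layout: each column is stored together.
# Handy for OLAP-style work such as aggregating a single column.
column_store = to_columns(row_store)

# The row layout has to touch every field of every record to sum one column.
total_from_rows = sum(record["amount"] for record in row_store)

# The column layout only reads the one column it needs.
total_from_columns = sum(column_store["amount"])

print(column_store)
print(total_from_rows, total_from_columns)
```

This also ties back to OLAP vs OLTP: analytical warehouses tend to favor columnar layouts, while transactional databases tend to favor row layouts.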
I’ve summarized a list of resources in this starter template. Feel free to download this to help with your preparation.