Remoteria

Interview guide

Data Engineer Interview Questions & Answers Guide (2026)

A hiring-manager’s interview kit for data engineers — with specific “what to look for” notes on every answer, red flags to watch, and a practical test.

Key facts

Role
Data Engineer
Technical questions
14
Behavioral
7
Role-fit
5
Red flags
8
Practical test
Included

How to use this guide

Pick 4-6 technical questions across difficulties, 2-3 behavioral, and 1-2 role-fit questions for a 45-minute interview. For senior roles, weight the harder technical and role-fit questions more heavily. Always close with the practical test so you are hiring on evidence, not impressions. The “what to look for” notes double as a scoring rubric: strong answers touch most of the points; weak answers miss them or replace them with platitudes.

Technical questions — Easy

1. Write a SQL query that deduplicates an events table keeping only the latest event per event_id.

Easy

What to look for: QUALIFY ROW_NUMBER() OVER (PARTITION BY event_id ORDER BY ingested_at DESC) = 1 in Snowflake/BigQuery, or a subquery with the same ROW_NUMBER filtered to rn = 1 in Redshift. Bonus: tie-breaker logic for equal timestamps.
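As a minimal sketch of the subquery form (SQLite, which has window functions but no QUALIFY; table and column names are illustrative):

```python
import sqlite3

# Dedup pattern from the note above: keep only the latest row per event_id.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE events (event_id TEXT, payload TEXT, ingested_at TEXT);
    INSERT INTO events VALUES
        ('e1', 'old',  '2026-01-01 00:00:00'),
        ('e1', 'new',  '2026-01-02 00:00:00'),
        ('e2', 'only', '2026-01-01 12:00:00');
""")
rows = conn.execute("""
    SELECT event_id, payload, ingested_at
    FROM (
        SELECT *,
               ROW_NUMBER() OVER (
                   PARTITION BY event_id
                   ORDER BY ingested_at DESC   -- add a tie-breaker column here
               ) AS rn
        FROM events
    )
    WHERE rn = 1
    ORDER BY event_id
""").fetchall()
print(rows)  # one row per event_id, the latest by ingested_at
```

A strong candidate will also mention what happens on timestamp ties: without a deterministic tie-breaker in the ORDER BY, the surviving row is arbitrary.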

Technical questions — Medium

1. Explain the difference between a Type 1 and Type 2 slowly-changing dimension, and when you would pick each.

Medium

What to look for: Type 1 overwrites (no history); Type 2 versions rows with valid_from/valid_to or is_current. Type 2 when downstream needs point-in-time accuracy (commission calcs, historical reporting); Type 1 when only current state matters (contact info).
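The two behaviors side by side, on a hypothetical customer-city change (SQLite sketch, illustrative names):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE dim_customer_t1 (customer_id TEXT PRIMARY KEY, city TEXT);
    CREATE TABLE dim_customer_t2 (
        customer_id TEXT, city TEXT,
        valid_from TEXT, valid_to TEXT, is_current INTEGER
    );
    INSERT INTO dim_customer_t1 VALUES ('c1', 'Berlin');
    INSERT INTO dim_customer_t2 VALUES ('c1', 'Berlin', '2025-01-01', NULL, 1);
""")

# Type 1: overwrite in place -- history is gone.
conn.execute("UPDATE dim_customer_t1 SET city = 'Lisbon' WHERE customer_id = 'c1'")

# Type 2: close the current row, then insert a new version.
conn.execute("""
    UPDATE dim_customer_t2
    SET valid_to = '2026-04-01', is_current = 0
    WHERE customer_id = 'c1' AND is_current = 1
""")
conn.execute("INSERT INTO dim_customer_t2 VALUES ('c1', 'Lisbon', '2026-04-01', NULL, 1)")

t1 = conn.execute("SELECT COUNT(*) FROM dim_customer_t1").fetchone()[0]
t2 = conn.execute("SELECT COUNT(*) FROM dim_customer_t2").fetchone()[0]
print(t1, t2)  # 1 row (current state only) vs 2 rows (point-in-time queryable)
```

The Type 2 table answers "what city was c1 in on 2025-06-01?"; the Type 1 table cannot.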

2. A nightly dbt run just took 6 hours instead of the usual 45 minutes. Walk me through triage.

Medium

What to look for: Check Snowflake query history / BigQuery job stats, find the long-running models, check for schema changes or volume spikes on upstream sources, look for accidentally-full-refreshed incrementals, inspect warehouse sizing and concurrent workloads. Systematic, not guessing.
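One concrete triage move is diffing per-model execution times between a normal run and the slow run using the run_results.json artifact dbt writes. A simplified sketch (the artifact shape is reduced to the fields used; model names are hypothetical):

```python
def model_timings(run_results: dict) -> dict:
    # Map each model's unique_id to its execution time in seconds.
    return {r["unique_id"]: r["execution_time"] for r in run_results["results"]}

def slowest_regressions(baseline: dict, slow: dict, top_n: int = 3):
    base, cur = model_timings(baseline), model_timings(slow)
    deltas = {m: cur[m] - base.get(m, 0.0) for m in cur}
    return sorted(deltas.items(), key=lambda kv: kv[1], reverse=True)[:top_n]

baseline = {"results": [
    {"unique_id": "model.proj.stg_events", "execution_time": 120.0},
    {"unique_id": "model.proj.fct_orders", "execution_time": 300.0},
]}
slow = {"results": [
    {"unique_id": "model.proj.stg_events", "execution_time": 130.0},
    {"unique_id": "model.proj.fct_orders", "execution_time": 14400.0},
]}
worst = slowest_regressions(baseline, slow)
print(worst[0])  # fct_orders regressed by ~14,100s -> start triage there
```

This narrows "the run is slow" to "these models are slow" before touching warehouse sizing.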

3. Your Fivetran bill jumped 3x last month. How do you investigate and cut it?

Medium

What to look for: Check monthly active rows per connector, disable unused tables/columns, move low-value sources to a cheaper cadence or Airbyte, reconsider high-volume event sources that are better ingested via streaming. Pragmatic FinOps.

4. Compare Airflow vs Dagster vs Prefect. When would you pick each?

Medium

What to look for: Airflow = mature, massive community, painful local dev. Dagster = asset-first, better typing and testing, smaller ecosystem. Prefect = flexible Python-first, dynamic workflows. Should have an opinion based on team size and workflow shape, not dogma.

5. Explain partitioning vs clustering in BigQuery (or micro-partitions + clustering in Snowflake). How do they affect cost?

Medium

What to look for: BigQuery: partition on a date column (4,000-partition limit per table); clustering sorts data within partitions. Snowflake: automatic micro-partitions; clustering keys reorder data across them. Both prune at query time. Should know that WHERE predicates must hit the partition/cluster keys for pruning to kick in.

6. A source API changed a field name from "customer_id" to "account_id" without notice. How do you design pipelines to survive this?

Medium

What to look for: Raw/staging split with select * ingestion, dbt source freshness + column-presence tests, schema change alerts via Monte Carlo or dbt source tests. The rename happens in the staging layer; mart contracts stay stable. Red flag: brittle explicit column lists all the way down.
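The raw/staging split in miniature: raw keeps the whole payload, and the staging layer owns the rename, so the mart contract never sees it. A SQLite sketch (field names are illustrative; assumes SQLite built with JSON support, the default in modern builds):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE raw_customers (payload TEXT);
    -- rows ingested before the rename
    INSERT INTO raw_customers VALUES ('{"customer_id": "c1", "plan": "pro"}');
    -- rows ingested after the unannounced rename
    INSERT INTO raw_customers VALUES ('{"account_id": "c2", "plan": "free"}');

    CREATE VIEW stg_customers AS
    SELECT COALESCE(
               json_extract(payload, '$.account_id'),
               json_extract(payload, '$.customer_id')
           ) AS customer_id,        -- downstream contract stays stable
           json_extract(payload, '$.plan') AS plan
    FROM raw_customers;
""")
rows = conn.execute(
    "SELECT customer_id, plan FROM stg_customers ORDER BY customer_id"
).fetchall()
print(rows)  # both eras of the payload resolve to the same column
```

The COALESCE bridge is a temporary shim; the durable fix is an alert on the schema change plus a deliberate rename in staging.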

7. Explain the difference between a data lake, a warehouse, and a lakehouse.

Medium

What to look for: Lake = raw object storage (S3) with open formats. Warehouse = structured, SQL-first, compute+storage. Lakehouse (Iceberg, Delta, Hudi) = lake storage with ACID/table semantics readable from warehouse engines. Should have an opinion on when lakehouse is worth the complexity.

8. Your dbt PR runs pass but you are worried about downstream breakage. How do you validate?

Medium

What to look for: Datafold data-diff or dbt-compare on affected models, dbt build --select state:modified+, check consumer dashboards in a dev project, coordinate with the analytics team. Not "merge and hope".
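A data-diff in the spirit of the tools mentioned above, reduced to its core: compare the dev build of a model against production on the primary key and report added, removed, and changed rows (SQLite sketch, illustrative names):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE prod_fct_orders (order_id TEXT PRIMARY KEY, amount REAL);
    CREATE TABLE dev_fct_orders  (order_id TEXT PRIMARY KEY, amount REAL);
    INSERT INTO prod_fct_orders VALUES ('o1', 10.0), ('o2', 20.0);
    INSERT INTO dev_fct_orders  VALUES ('o1', 10.0), ('o2', 25.0), ('o3', 5.0);
""")
prod = dict(conn.execute("SELECT order_id, amount FROM prod_fct_orders"))
dev = dict(conn.execute("SELECT order_id, amount FROM dev_fct_orders"))
diff = {
    "added":   sorted(set(dev) - set(prod)),
    "removed": sorted(set(prod) - set(dev)),
    "changed": sorted(k for k in set(prod) & set(dev) if prod[k] != dev[k]),
}
print(diff)  # which keys the PR adds, drops, or mutates
```

An empty diff on an unrelated refactor, or an explainable diff on an intentional change, is the evidence that replaces "merge and hope".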

Technical questions — Hard

1. Walk me through how you would design an incremental dbt model for an events table that receives 200M rows/day and has late-arriving records up to 7 days late.

Hard

What to look for: Use is_incremental with a lookback window on event_timestamp, merge strategy on (event_id) to dedupe, partitioning/clustering on event_date, unique tests. Should understand the cost of a too-short lookback (missed data) vs too-long (bigger merge). Red flag: "insert only".
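One incremental run with a 7-day lookback, sketched in SQLite (in dbt this is the is_incremental() filter plus the merge strategy; all names are illustrative): select source rows newer than the high-water mark minus the lookback, then upsert on event_id so late or re-delivered rows replace earlier versions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE src_events (event_id TEXT, event_date TEXT, payload TEXT);
    CREATE TABLE fct_events (event_id TEXT PRIMARY KEY, event_date TEXT, payload TEXT);
    INSERT INTO fct_events VALUES ('e1', '2026-04-01', 'v1');  -- already loaded
    INSERT INTO src_events VALUES
        ('e1', '2026-04-01', 'v2'),   -- late correction inside the window
        ('e2', '2026-04-08', 'v1'),   -- genuinely new
        ('e0', '2026-03-01', 'v1');   -- older than the lookback, skipped
""")
# High-water mark minus the 7-day lookback window.
(watermark,) = conn.execute(
    "SELECT DATE(MAX(event_date), '-7 days') FROM fct_events").fetchone()
conn.execute("""
    INSERT INTO fct_events
    SELECT event_id, event_date, payload FROM src_events
    WHERE event_date >= ?
    ON CONFLICT(event_id) DO UPDATE SET
        event_date = excluded.event_date,
        payload    = excluded.payload
""", (watermark,))
rows = conn.execute(
    "SELECT event_id, payload FROM fct_events ORDER BY event_id").fetchall()
print(rows)  # e1 corrected to v2, e2 added, e0 outside the window untouched
```

The watermark line is where the cost trade-off from the note lives: shrink the 7 days and you miss late rows like e1; grow it and every run merges more data.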

2. How would you set up CDC from a production Postgres database to Snowflake with a 5-minute freshness target?

Hard

What to look for: Debezium → Kafka → Snowflake Kafka connector, or Fivetran with HVR, or Snowflake’s native PG connector. Should discuss handling of deletes, out-of-order events, and WAL retention on the source. Bonus: Kafka Connect schema evolution.
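The consumer side of that pipeline can be sketched as a toy CDC apply loop: sort the batch by log position (LSN) so out-of-order delivery within a batch is harmless, collapse creates and updates into an upsert, and handle deletes explicitly. Event shape is illustrative, loosely Debezium-like:

```python
import sqlite3

events = [
    {"lsn": 3, "op": "u", "id": "u1", "email": "new@x.com"},
    {"lsn": 1, "op": "c", "id": "u1", "email": "old@x.com"},   # arrived late
    {"lsn": 2, "op": "c", "id": "u2", "email": "b@x.com"},
    {"lsn": 4, "op": "d", "id": "u2", "email": None},
]
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id TEXT PRIMARY KEY, email TEXT)")
for ev in sorted(events, key=lambda e: e["lsn"]):   # restore log order
    if ev["op"] == "d":
        conn.execute("DELETE FROM users WHERE id = ?", (ev["id"],))
    else:   # create and update collapse into one idempotent upsert
        conn.execute(
            "INSERT INTO users VALUES (?, ?) "
            "ON CONFLICT(id) DO UPDATE SET email = excluded.email",
            (ev["id"], ev["email"]),
        )
state = conn.execute("SELECT id, email FROM users ORDER BY id").fetchall()
print(state)  # u1 at its latest email, u2 deleted
```

Candidates who reach for soft deletes here (a deleted_at column instead of DELETE) get bonus points, since hard deletes destroy audit history in the warehouse.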

3. Write the DDL and a dbt incremental model for tracking user subscription status over time with Type 2 SCD semantics.

Hard

What to look for: Target table with user_id, status, valid_from, valid_to, is_current. dbt snapshot with check or timestamp strategy, or a custom incremental with merge. Should handle closing the prior row in the same run.
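The "close the prior row in the same run" mechanics, sketched as two statements against SQLite (in dbt this is a snapshot or a custom merge incremental; names are illustrative):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE dim_subscription (
        user_id TEXT, status TEXT,
        valid_from TEXT, valid_to TEXT, is_current INTEGER
    );
    CREATE TABLE src_subscription (user_id TEXT, status TEXT);
    INSERT INTO dim_subscription VALUES
        ('u1', 'trial', '2026-01-01', NULL, 1),
        ('u2', 'paid',  '2026-01-01', NULL, 1);
    INSERT INTO src_subscription VALUES ('u1', 'paid'), ('u2', 'paid'), ('u3', 'trial');
""")
run_date = "2026-04-12"
# Step 1: close current rows whose status changed in the source.
conn.execute("""
    UPDATE dim_subscription SET valid_to = ?, is_current = 0
    WHERE is_current = 1 AND EXISTS (
        SELECT 1 FROM src_subscription s
        WHERE s.user_id = dim_subscription.user_id
          AND s.status <> dim_subscription.status)
""", (run_date,))
# Step 2: open new versions for changed users and brand-new users.
conn.execute("""
    INSERT INTO dim_subscription
    SELECT s.user_id, s.status, ?, NULL, 1
    FROM src_subscription s
    WHERE NOT EXISTS (
        SELECT 1 FROM dim_subscription d
        WHERE d.user_id = s.user_id AND d.is_current = 1)
""", (run_date,))
current = conn.execute(
    "SELECT user_id, status FROM dim_subscription WHERE is_current = 1 ORDER BY user_id"
).fetchall()
history = conn.execute("SELECT COUNT(*) FROM dim_subscription").fetchone()[0]
print(current, history)  # u1 versioned, u2 untouched, u3 newly opened
```

Unchanged users (u2) correctly get no new row, which is the detail "insert only" answers miss.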

4. What is exactly-once delivery in Kafka, and is it really achievable?

Hard

What to look for: Idempotent producer + transactional writes gives effectively-once within Kafka. End-to-end requires idempotent consumers (upserts on a primary key). "Exactly once" is a system property, not a delivery guarantee. Red flag: believes vendors claiming pure exactly-once.
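The idempotent-consumer half of the answer fits in a few lines: upsert on a primary key, and a redelivered message is harmless. A sketch (message shape is illustrative):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE balances (account TEXT PRIMARY KEY, balance REAL)")

def consume(msg):
    # Upsert keyed on account: replaying msg leaves the same final state.
    conn.execute(
        "INSERT INTO balances VALUES (?, ?) "
        "ON CONFLICT(account) DO UPDATE SET balance = excluded.balance",
        (msg["account"], msg["balance"]),
    )

msg = {"account": "a1", "balance": 42.0}
consume(msg)
consume(msg)   # simulated redelivery after a consumer crash
result = conn.execute("SELECT account, balance FROM balances").fetchall()
print(result)  # one row, applied once -- the duplicate did not double-apply
```

Note the message carries absolute state (the balance), which is what makes replay safe; increment-style messages need offset tracking or a dedup key instead, and a strong candidate will make that distinction unprompted.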

5. We have PII in our raw ingestion layer and need to ship SOC 2. How do you handle it?

Hard

What to look for: Column-level masking / dynamic data masking in Snowflake, tokenization for fields that need joinability, role-based access, row access policies, audit logging, encryption at rest. Separate PII schema with restricted role. Bonus: differential privacy or k-anonymity for analytics marts.
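A toy version of the tokenization-vs-masking distinction: a deterministic salted hash keeps a PII field joinable across tables without exposing the raw value, while masking handles display. Real warehouses do this with masking policies and a secrets-managed salt; everything here is illustrative:

```python
import hashlib

SALT = "rotate-me"  # hypothetical; in production this comes from a secret store

def tokenize(email: str) -> str:
    # Same input -> same token, so joins on the tokenized column still work.
    return hashlib.sha256((SALT + email).encode()).hexdigest()[:16]

def mask(email: str) -> str:
    # Display-only masking: keep the first character and the domain.
    local, _, domain = email.partition("@")
    return local[0] + "***@" + domain

t1 = tokenize("ada@example.com")
t2 = tokenize("ada@example.com")
masked = mask("ada@example.com")
print(t1 == t2, masked)  # True a***@example.com
```

The interview signal is knowing when each applies: tokenize where downstream joins need the field, mask where humans only need to eyeball it, and restrict the raw column to a narrow role either way.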

Behavioral questions

1. Tell me about the worst data quality incident you responded to. What was the root cause and what did you put in place afterwards?

What to look for: Specific story, real root cause (not "someone messed up"), durable fix (test, contract, monitor) not just a patch. Owns the failure.

2. Walk me through a dbt or warehouse refactor that saved real money. How did you scope it and prove the savings?

What to look for: Specific numbers, understood the cost driver (full-refresh, inefficient joins, oversized warehouse), measured before/after, did not break the analytics team.

3. Describe a disagreement with a data analyst or ML engineer about schema or model design. How did it resolve?

What to look for: Brought trade-offs (flexibility vs performance), understood the consumer use case, landed on a decision with data. Not "I am the platform owner, my way."

4. Tell me about a time you carried a pager for pipelines. What was the on-call like?

What to look for: Concrete: alert volume, escalation policies, what they fixed at the runbook level to reduce pages. Shows operational maturity.

5. How do you onboard yourself into an unfamiliar data warehouse?

What to look for: Lineage exploration, source → staging → mart walkthrough, read a few business-critical dbt models, talk to the top analyst consumers, look at 30 days of alert history.

6. What is the most painful migration you have led (e.g. Redshift to Snowflake, Airflow to Dagster)?

What to look for: Specific pain points (SQL dialect differences, orchestration semantics, cost surprises), phased rollout, rollback plan, coordination with stakeholders.

7. How do you handle requests from analysts that would require breaking changes to a mart model?

What to look for: Versioning (dbt_v2 models, new columns additively, deprecation windows), stakeholder comms, not just refusing.

Role-fit questions

1. Our stack is Snowflake + dbt + Airflow + Fivetran. Anything there you have not used in production, and how would you ramp?

What to look for: Honest gap assessment and concrete ramp plan. Fakery is a red flag.

2. How do you feel about being on-call for data pipelines?

What to look for: Treats on-call as ownership, has runbook discipline, does not resent it but also advocates for reducing pager load.

3. Why data engineering instead of backend or ML engineering?

What to look for: Real answer — enjoys the warehouse domain, cares about data as a product, likes the intersection of engineering and analytics. Not "I drifted into it."

4. What does a healthy analytics org look like to you from a data engineer’s perspective?

What to look for: Analysts own SQL, engineers own platform and marts, clear contracts between the two, metrics governance. Not "engineers do everything because analysts cannot SQL."

5. Where do you draw the line between what belongs in dbt vs what belongs in application code?

What to look for: dbt for analytical transforms on warehouse data. Application code for operational logic that writes back to production. Should not put business-critical transactional logic in dbt, and should not build marts in microservices.

Red flags

The red flags are called out inline in the “what to look for” notes above — for example, brittle explicit column lists all the way down, “insert only” incremental designs, believing vendor claims of pure exactly-once delivery, or faking experience with the stack. Any one of them alone is usually reason to pass, especially combined with weak answers elsewhere.

Practical test

4-hour take-home: given a mock Postgres OLTP schema (users, orders, order_items, products) and a stream of event data in S3 (JSON), build a dbt project that lands a daily marts.fct_orders and marts.dim_customers with Type 2 SCD on customer email. Requirements: incremental models, at least 5 dbt tests, a README explaining your staging/intermediate/mart layer decisions, a docker-compose that runs dbt against DuckDB locally, and one Great Expectations suite. We grade on modeling decisions (35%), SQL correctness (25%), testing and data quality (20%), and documentation (20%). Bonus for catching the seeded duplicate-order bug in the raw data.
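The seeded bug is exactly what a dbt unique test on order_id catches; the check reduces to one GROUP BY. A SQLite sketch with illustrative data:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE raw_orders (order_id TEXT, amount REAL);
    INSERT INTO raw_orders VALUES ('o1', 10.0), ('o2', 20.0), ('o2', 20.0);
""")
dupes = conn.execute("""
    SELECT order_id, COUNT(*) AS n
    FROM raw_orders
    GROUP BY order_id
    HAVING COUNT(*) > 1
""").fetchall()
print(dupes)  # a passing unique test returns zero rows
```

Submissions that run this check unprompted against the raw layer, document the duplicate, and dedupe it in staging earn the bonus.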

Scoring rubric

Score each answer 1-4: (1) Misses most of the rubric or gives platitudes; (2) Hits some points but cannot go deep when pressed; (3) Covers the rubric and can defend the answer under follow-ups; (4) Adds unprompted nuance, trade-offs, or real examples beyond the rubric. Hire at an average of 3.0+ across technical, behavioral, and role-fit, with zero red flags, and a pass on the practical test.


Written by Syed Ali

Founder, Remoteria

Syed Ali founded Remoteria after a decade building distributed teams across 4 continents. He has helped 500+ companies source, vet, onboard, and scale offshore talent in engineering, design, marketing, and operations.

  • 10+ years building distributed remote teams
  • 500+ successful offshore placements across US, UK, EU, and APAC
  • Specialist in offshore vetting and cross-timezone team integration

Last updated: April 12, 2026