Machine Learning Engineer Interview Questions & Answers Guide (2026)
A hiring-manager’s interview kit for machine learning engineers — with specific “what to look for” notes on every answer, red flags to watch, and a practical test.
Key facts
- Role: Machine Learning Engineer
- Technical questions: 15
- Behavioral: 7
- Role-fit: 5
- Red flags: 8
- Practical test: Included
How to use this guide
Pick 4-6 technical questions across difficulties, 2-3 behavioral, and 1-2 role-fit for a 45-minute interview. For senior roles, weight harder technical and role-fit higher. Always close with the practical test so you are hiring on evidence, not impressions. The “what to look for” notes are a scoring rubric: strong answers touch most points, weak answers miss them or replace them with platitudes.
Technical questions — Easy
1. Walk me through the last ML system you shipped to production. Who used it, what was the business metric, and what lift did it deliver?
What to look for: Specific system, named business metric, specific lift numbers, real users. Red flag: Kaggle or "I got 0.94 AUC" with no production context.
2. When would you pick LightGBM over a neural network for a tabular problem?
What to look for: Almost always for tabular — faster to train, better out-of-box, handles missing values, native categorical support, interpretable. NN only if you have embeddings, sequences, or multi-modal inputs. Pragmatic answer preferred over "NNs are modern".
3. When is it better to ship a rules engine instead of an ML model?
What to look for: Small data, clear domain rules, regulatory interpretability required, low-stakes problem, or baseline for comparison. Senior answer: "ML is not free — rules win when rules capture the signal." Red flag: wants to ML everything.
Technical questions — Medium
1. You are building a churn model. What does your week-one data audit look like?
What to look for: Distribution checks, duplicate detection, target leakage hunt (any feature only known post-churn), label definition sanity (30-day vs 90-day window), class imbalance, missing data patterns. Not "I train a random forest first."
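A week-one audit like the one above can be sketched in a few lines of pandas. This is an illustrative fragment, not a full audit: the `week_one_audit` helper and the 0.95 correlation cutoff for the leakage screen are assumptions, and the toy `days_since_cancellation` column is a deliberately planted leak (it is only knowable after churn).

```python
import numpy as np
import pandas as pd

def week_one_audit(df: pd.DataFrame, target: str) -> dict:
    """First-pass audit: duplicates, missingness, class imbalance,
    and a crude target-leakage screen via per-feature correlation."""
    report = {
        "n_rows": len(df),
        "n_duplicates": int(df.duplicated().sum()),
        "missing_frac": df.isna().mean().sort_values(ascending=False).head(10).to_dict(),
        "positive_rate": float(df[target].mean()),
    }
    # Near-perfect correlation with the target is a classic leakage smell.
    numeric = df.select_dtypes(include=np.number).drop(columns=[target])
    corr = numeric.corrwith(df[target]).abs().sort_values(ascending=False)
    report["leakage_suspects"] = corr[corr > 0.95].index.tolist()
    return report

# Toy data: 'days_since_cancellation' is only populated for churned users,
# so it leaks the label almost perfectly.
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "tenure_months": rng.integers(1, 60, 1000),
    "churned": rng.integers(0, 2, 1000),
})
df["days_since_cancellation"] = df["churned"] * rng.normal(20, 1, 1000)

audit = week_one_audit(df, "churned")
print(audit["leakage_suspects"])
```

A strong candidate will also mention checks this sketch omits: per-snapshot distribution comparisons over time and a point-in-time review of how each feature is computed.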
2. Explain the difference between data drift, prediction drift, and concept drift. How do you monitor each?
What to look for: Data drift = input distribution shift (KS test, PSI). Prediction drift = output distribution shift. Concept drift = relationship between X and Y changes. Need all three plus ground-truth business metric. Evidently / Arize for the tooling.
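A data drift check like PSI fits in a few lines of numpy. A minimal sketch: the bin count and the usual 0.1 / 0.25 thresholds are industry conventions, not hard rules.

```python
import numpy as np

def psi(reference: np.ndarray, current: np.ndarray, n_bins: int = 10) -> float:
    """Population Stability Index between a reference (training) sample and
    a current (production) sample. Rule of thumb: < 0.1 stable, 0.1-0.25
    moderate shift, > 0.25 investigate."""
    # Bin edges come from the reference distribution's quantiles.
    edges = np.quantile(reference, np.linspace(0, 1, n_bins + 1))
    edges[0], edges[-1] = -np.inf, np.inf
    ref_frac = np.histogram(reference, bins=edges)[0] / len(reference)
    cur_frac = np.histogram(current, bins=edges)[0] / len(current)
    # Small epsilon avoids log(0) on empty bins.
    eps = 1e-6
    ref_frac, cur_frac = ref_frac + eps, cur_frac + eps
    return float(np.sum((cur_frac - ref_frac) * np.log(cur_frac / ref_frac)))

rng = np.random.default_rng(42)
train = rng.normal(0, 1, 10_000)
same = rng.normal(0, 1, 10_000)       # same distribution: PSI near 0
shifted = rng.normal(0.5, 1, 10_000)  # mean shift: PSI clearly elevated
print(round(psi(train, same), 3), round(psi(train, shifted), 3))
```

The same function applied to model scores instead of inputs gives a prediction drift monitor; concept drift still needs delayed ground-truth labels.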
3. Explain LoRA vs full fine-tuning for a Llama 3.1 8B model. When would you pick each?
What to look for: LoRA: small adapters (few MB), much cheaper, multiple adapters serve from one base model, enough for most domain adaptation. Full: needed for significant shifts in domain or capability, much more compute. QLoRA as memory-efficient middle ground.
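The core of LoRA is just a low-rank update to a frozen weight matrix. A toy numpy illustration of the math (dimensions scaled down from Llama's 4096; in practice you would use a library such as Hugging Face PEFT rather than hand-rolling this):

```python
import numpy as np

rng = np.random.default_rng(0)
d, r, alpha = 1024, 16, 32          # hidden size (scaled down), LoRA rank, scaling

W = rng.normal(size=(d, d))         # frozen base weight (never updated)
A = rng.normal(size=(r, d)) * 0.01  # trainable down-projection
B = np.zeros((d, r))                # trainable up-projection, zero-initialized
                                    # so training starts from the base model

def adapted_forward(x):
    # LoRA forward pass: y = x W^T + (alpha / r) * x A^T B^T
    # Only A and B receive gradients; W stays frozen.
    return x @ W.T + (alpha / r) * (x @ A.T) @ B.T

x = rng.normal(size=(1, d))
assert np.allclose(adapted_forward(x), x @ W.T)  # zero-init: adapter is a no-op

full, adapter = d * d, 2 * d * r
print(f"full: {full:,}  adapter: {adapter:,}  ratio: {adapter / full:.4f}")
# About 3% of the weights at this toy size; at Llama-scale dims, well under 1%.
```

The parameter count is the whole argument: serving many adapters off one frozen base is what makes LoRA cheap, and a candidate should be able to derive that ratio on a whiteboard.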
4. How do you handle class imbalance on a fraud model where positives are 0.3% of data?
What to look for: Stratified sampling, class weights, SMOTE with caveats (usually worse than class weights), calibrated probabilities with Platt/isotonic, PR-AUC not ROC-AUC, threshold tuning for precision/recall trade-off. Red flag: "I downsample to 50/50".
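Two of those rubric points, class weights and PR-AUC over ROC-AUC, can be demonstrated with a synthetic rare-class experiment in scikit-learn. The dataset shape and model choice here are arbitrary:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import average_precision_score, roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic fraud-like data: roughly 0.3% positives.
X, y = make_classification(n_samples=100_000, n_features=20,
                           weights=[0.997], flip_y=0, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

# class_weight="balanced" up-weights the rare class instead of throwing
# away 99%+ of the negatives via naive downsampling.
clf = LogisticRegression(class_weight="balanced", max_iter=1000).fit(X_tr, y_tr)
scores = clf.predict_proba(X_te)[:, 1]

# ROC-AUC looks flattering on rare classes; PR-AUC tells the harder truth.
print(f"ROC-AUC: {roc_auc_score(y_te, scores):.3f}")
print(f"PR-AUC:  {average_precision_score(y_te, scores):.3f}")
```

The gap between the two numbers is the point: a candidate quoting only ROC-AUC on a 0.3% class rate has not looked at what the model does to the precision of an actual fraud queue.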
5. What is the difference between a feature store’s offline and online stores, and why does it matter?
What to look for: Offline (warehouse) for training, online (Redis / DynamoDB / Postgres) for serving, must stay in sync to avoid train/serve skew. Point-in-time correctness on offline to prevent leakage. Feast/Tecton design patterns.
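Point-in-time correctness is easy to demonstrate with `pandas.merge_asof`: each label row may only join the latest feature snapshot written at or before the label timestamp, never a future one. Column names below are hypothetical.

```python
import pandas as pd

# Feature snapshots written over time to the offline store.
features = pd.DataFrame({
    "user_id": [1, 1, 2],
    "event_time": pd.to_datetime(["2026-01-01", "2026-01-10", "2026-01-05"]),
    "txn_count_7d": [3, 9, 1],
})
# Training labels with their observation timestamps.
labels = pd.DataFrame({
    "user_id": [1, 2],
    "label_time": pd.to_datetime(["2026-01-07", "2026-01-06"]),
    "churned": [0, 1],
})

# Point-in-time join: user 1's label on Jan 7 sees the Jan 1 snapshot
# (txn_count_7d=3), not the Jan 10 snapshot written after the label.
train = pd.merge_asof(
    labels.sort_values("label_time"),
    features.sort_values("event_time"),
    left_on="label_time", right_on="event_time",
    by="user_id", direction="backward",
)
print(train[["user_id", "churned", "txn_count_7d"]])
```

A naive latest-value join here would silently hand the model a post-label feature, which is exactly the leakage the point-in-time guarantee exists to prevent.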
6. Walk me through setting up a shadow deployment for a new model.
What to look for: Route production traffic to both old and new models, log predictions, compare distributions and specific predictions, no user impact. Run for enough time to cover weekly seasonality. Then ramp A/B with guardrails.
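Once both models' predictions are logged, the comparison step reduces to a handful of summary statistics. A sketch with invented metric names; the alerting thresholds are left to the reader:

```python
import numpy as np

def shadow_compare(champion_preds, challenger_preds, threshold=0.5):
    """Compare logged predictions from a shadow run: the challenger saw the
    same production traffic, but its outputs never reached users."""
    champion_preds = np.asarray(champion_preds)
    challenger_preds = np.asarray(challenger_preds)
    return {
        # Shift in average score: a cheap prediction-drift signal.
        "mean_shift": float(challenger_preds.mean() - champion_preds.mean()),
        # Fraction of requests where the two models would decide differently.
        "decision_disagreement": float(
            np.mean((champion_preds >= threshold) != (challenger_preds >= threshold))
        ),
        # Tail of per-request score differences: catches localized blowups.
        "p99_abs_diff": float(np.quantile(np.abs(challenger_preds - champion_preds), 0.99)),
    }

rng = np.random.default_rng(1)
old = rng.beta(2, 5, 10_000)                                # logged champion scores
new = np.clip(old + rng.normal(0.02, 0.05, 10_000), 0, 1)   # shadow model scores
report = shadow_compare(old, new)
print(report)
```

Aggregate agreement is not enough on its own; a good candidate will also diff specific high-stakes predictions before ramping the A/B.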
7. Explain why cross-validation can give you an overly optimistic estimate in time-series problems.
What to look for: Random k-fold leaks future into past. Need TimeSeriesSplit / expanding or rolling window CV that respects temporal order. Same issue with grouped data (user ID, household).
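The contrast is easy to show with scikit-learn: shuffled `KFold` happily trains on indices that come after its test fold, while `TimeSeriesSplit` never does.

```python
import numpy as np
from sklearn.model_selection import KFold, TimeSeriesSplit

X = np.arange(12).reshape(-1, 1)  # 12 ordered time steps

# Shuffled k-fold: training folds routinely contain indices later than the
# test fold, so the model trains on the future and predicts the past.
kf = KFold(n_splits=3, shuffle=True, random_state=0)
leaks = [tr.max() > te.min() for tr, te in kf.split(X)]
print("k-fold splits that train on the future:", sum(leaks), "of", len(leaks))

# TimeSeriesSplit: every training window strictly precedes its test window.
for tr, te in TimeSeriesSplit(n_splits=3).split(X):
    assert tr.max() < te.min()
    print("train:", tr, "test:", te)
```

For grouped data the analogous fix is `GroupKFold`, so all rows from one user land on the same side of the split.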
8. What does calibration mean for a classification model, and how do you fix a miscalibrated model?
What to look for: Predicted probabilities match empirical frequencies. Tree ensembles and NNs often miscalibrated. Fix with Platt scaling or isotonic regression on a held-out set. Matters when downstream uses probabilities (pricing, ranking, thresholds).
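A scikit-learn sketch of the fix: wrap a tree ensemble in `CalibratedClassifierCV` with isotonic regression and compare Brier scores. On a synthetic dataset like this, isotonic often, though not always, improves the score; the dataset and model sizes are arbitrary.

```python
from sklearn.calibration import CalibratedClassifierCV
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import brier_score_loss
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=10_000, n_features=20, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

rf = RandomForestClassifier(n_estimators=50, random_state=0).fit(X_tr, y_tr)

# Isotonic regression, fit on internal CV folds, remaps raw scores so that
# predicted probabilities better match observed frequencies on held-out data.
calibrated = CalibratedClassifierCV(
    RandomForestClassifier(n_estimators=50, random_state=0),
    method="isotonic", cv=3,
).fit(X_tr, y_tr)

raw = brier_score_loss(y_te, rf.predict_proba(X_te)[:, 1])
cal = brier_score_loss(y_te, calibrated.predict_proba(X_te)[:, 1])
print(f"Brier score raw: {raw:.4f}  calibrated: {cal:.4f}")
```

A strong candidate will also mention plotting a reliability curve and that isotonic needs enough held-out data to avoid overfitting the mapping, with Platt scaling as the small-data fallback.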
Technical questions — Hard
1. Your offline AUC improved from 0.82 to 0.87 but the online A/B test shows no business metric change. What is going on?
What to look for: Offline-online gap causes: proxy metric diverging from business (AUC on rare class with no revenue correlation), selection bias in training data, feedback loops, delayed labels, threshold choice. Should be comfortable saying the offline win was meaningless.
2. Walk me through deploying a PyTorch model to a real-time endpoint with P99 latency under 100ms at 500 RPS.
What to look for: TorchScript or ONNX export, Triton or TorchServe, batch inference with dynamic batching, GPU if needed, warm pools, caching. Load test with Locust or k6. Should discuss trade-offs of quantization (int8, fp16).
3. Your model predicts loan approval. How do you audit for fairness?
What to look for: Disaggregate metrics across protected attributes (race, gender, age), check disparate impact (4/5 rule), equalized odds, demographic parity trade-offs. Acknowledge you cannot satisfy all definitions simultaneously. Document thresholds before shipping.
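The four-fifths rule itself is a one-liner once approvals are disaggregated by group. A toy illustration, with group names and approval rates invented:

```python
import numpy as np
import pandas as pd

def disparate_impact(df, group_col, approved_col):
    """Selection rate per group and the disparate-impact ratio
    (min rate / max rate). The four-fifths rule flags ratios below 0.8."""
    rates = df.groupby(group_col)[approved_col].mean()
    ratio = rates.min() / rates.max()
    return rates, float(ratio)

df = pd.DataFrame({
    "group": ["A"] * 1000 + ["B"] * 1000,
    "approved": np.r_[np.repeat([1, 0], [620, 380]),   # group A: 62% approved
                      np.repeat([1, 0], [450, 550])],  # group B: 45% approved
})
rates, ratio = disparate_impact(df, "group", "approved")
print(rates.to_dict(), f"DI ratio: {ratio:.2f}")  # 0.45 / 0.62 < 0.8: flag it
```

This is only the first check; equalized odds and calibration-within-groups require the ground-truth labels as well, and the candidate should say which definitions the business is choosing between and why.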
4. Your training data is 2 TB. Walk me through the training pipeline.
What to look for: Spark / Dask for feature prep in the warehouse or lakehouse, sampling strategies, data loaders with shuffling and sharding, multi-GPU with DDP or Ray, checkpointing. Not "I load it into pandas."
Behavioral questions
1. Tell me about a model you killed in production. Why did you pull it?
What to look for: Honest about failure — drifted, business case changed, costs exceeded value, fairness issue. Shows willingness to sunset vs defend a pet project.
2. Describe a time you pushed back on a product manager asking for ML.
What to look for: Proposed a simpler solution (heuristic, rules, lookup table), showed the data did not support the ask, or scoped down to a realistic baseline. Not just rubber-stamping.
3. Walk me through debugging a model that worked in dev but performed badly in production.
What to look for: Train/serve skew, feature pipeline differences, data distribution shift since training, threshold misalignment, label definition mismatch. Systematic causal analysis.
4. Tell me about a time you were paged for a production ML incident.
What to look for: Real incident — drift alert, pipeline failure, latency regression. How they triaged, who they looped in, what the durable fix was.
5. How do you set expectations with stakeholders on ML project timelines?
What to look for: Week-one audit to scope, baseline first, honest about the data-cleanliness bottleneck, updates on expected ceiling not just "we are trying".
6. Describe your onboarding into an unfamiliar ML codebase.
What to look for: Read the training script, run a local reproduction, inspect the model registry and lineage, talk to the last owner, look at the last 3 months of monitoring alerts.
7. What is the most rigorous experiment you have run and what made it rigorous?
What to look for: Pre-registered hypothesis, power analysis, guardrail metrics, analysis plan before results, honest reporting of null results.
Role-fit questions
1. Our stack is [SageMaker / Vertex / Databricks] with [MLflow / W&B]. Gaps?
What to look for: Honest about what they have not used, concrete ramp plan. Not faking.
2. How do you feel about carrying on-call for production models?
What to look for: Treats model ops as part of the job, has done it, has opinions on reducing pager load through better monitoring and tests.
3. How much research do you want in this role vs shipping?
What to look for: Honest about preference. Red flag: wants to publish papers at a product startup. Green flag: ships, then optimizes, knows when to read a paper vs copy XGBoost defaults.
4. Do you prefer classical ML or LLM / foundation model work?
What to look for: Has an opinion, but is comfortable with both. Red flag: thinks classical ML is beneath them.
5. What is your stance on interpretability vs accuracy?
What to look for: Context-dependent. High-stakes (credit, healthcare) interpretability required. Low-stakes ranking — accuracy wins. Knows SHAP, LIME, partial dependence for post-hoc explanation.
Red flags
Any one of these on its own is usually reason to pass; combined with weak answers elsewhere, it certainly is.
- Cannot tell you how a tree in XGBoost actually splits.
- Treats production deployment as someone else’s problem.
- Has never monitored a model after it launched.
- Ships without a baseline comparison.
- Uses ROC-AUC on a 0.1% class rate and calls it great.
- Does not know the difference between offline and online evaluation.
- Claims experience with LLM fine-tuning but cannot explain LoRA.
- Gets defensive when asked about failures — no model ever fails in their stories.
Practical test
6-hour take-home: given an anonymized churn dataset (200K rows, 40 features, monthly snapshots), build a churn model and produce a short writeup. Deliverables: (1) a data audit notebook flagging any leakage, drift, or imbalance issues; (2) a baseline model and a tuned model with MLflow tracking; (3) an evaluation section with calibration plot, PR curve, and a lift table by decile; (4) a FastAPI endpoint that serves the model from a Docker container; (5) a 500-word README covering how you would monitor this in production and what would trigger retraining. We grade on data audit depth (25%), modeling decisions and evaluation rigor (30%), serving quality (20%), and production judgment in the writeup (25%). Bonus for catching the seeded target-leakage column in the feature set.
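For graders, deliverable (3)'s lift table by decile can be computed along these lines. A sketch assuming pandas, with a synthetic score standing in for the candidate's model:

```python
import numpy as np
import pandas as pd

def lift_by_decile(y_true, scores):
    """Rank by model score, cut into deciles (1 = highest scores),
    and compare each decile's churn rate to the base rate."""
    df = pd.DataFrame({"y": y_true, "score": scores})
    df["decile"] = pd.qcut(df["score"].rank(method="first"), 10,
                           labels=list(range(10, 0, -1))).astype(int)
    base = df["y"].mean()
    table = df.groupby("decile")["y"].agg(["mean", "count"])
    table["lift"] = table["mean"] / base
    return table.sort_index()

rng = np.random.default_rng(0)
y = rng.binomial(1, 0.15, 10_000)
# A score loosely correlated with the label, so top deciles show lift > 1.
scores = y * 0.3 + rng.normal(0, 0.5, 10_000)
table = lift_by_decile(y, scores)
print(table)
```

A useful model concentrates churners in decile 1 (lift well above 1) and starves decile 10 (lift below 1); a flat lift column means the scores carry no ranking signal.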
Scoring rubric
Score each answer 1-4: (1) Misses most of the rubric or gives platitudes; (2) Hits some points but cannot go deep when pressed; (3) Covers the rubric and can defend the answer under follow-ups; (4) Adds unprompted nuance, trade-offs, or real examples beyond the rubric. Hire at an average of 3.0+ across technical, behavioral, and role-fit, with zero red flags, and a pass on the practical test.
Written by Syed Ali
Founder, Remoteria
Syed Ali founded Remoteria after a decade building distributed teams across 4 continents. He has helped 500+ companies source, vet, onboard, and scale pre-vetted offshore talent in engineering, design, marketing, and operations.
- 10+ years building distributed remote teams
- 500+ successful offshore placements across US, UK, EU, and APAC
- Specialist in offshore vetting and cross-timezone team integration
Last updated: April 12, 2026