2026-01-17
Technical Phone Screen Questions for ML Engineers
The phone screen is your first real technical conversation with an ML engineer candidate. Get this wrong, and you waste everyone's time. Get it right, and you'll filter out the resume-padders while identifying candidates who actually understand machine learning fundamentals, systems design, and practical trade-offs.
This guide walks you through the exact questions I recommend asking, how to score responses, and what red flags to watch for. These questions work because they reveal both depth and practical experience—not just what someone memorized from a course.
Why Phone Screening ML Engineers Is Different
Machine learning hiring is fundamentally different from general software engineering. A candidate might be a strong backend engineer but hopeless at ML. The reverse is also true.
Here's what makes ML screening unique:
- Math matters more — but you need to assess conceptual understanding, not ability to derive gradient descent from first principles
- Production context is critical — many candidates understand theory but have never shipped an ML system
- Tool preferences vary wildly — PyTorch vs TensorFlow, scikit-learn vs custom solutions, and this reveals how they think
- Scope ambiguity is common — candidates confuse data engineering, analytics, and ML work
- Trade-offs are nuanced — there's rarely a single "right" answer, making communication critical
The phone screen should take 45-60 minutes. Spend the first 5 minutes on background, 40-50 on technical content, and 5 on their questions. This length gives you enough signal without exhausting either party.
Foundation Questions: Testing Core Understanding
Start with questions that reveal whether someone understands ML fundamentals. These aren't gotchas—they're filters that separate practitioners from people who completed a Coursera course.
1. "Walk me through how you'd approach a binary classification problem from scratch. What are the first things you'd do?"
Why this works: This is open-ended enough to reveal their process, but constrained enough to be answerable in 5 minutes. You're looking for:
- Do they talk about data exploration first, or jump straight to algorithms?
- Do they mention class imbalance, missing values, or feature distributions?
- Are they thinking about the business context (what does a false positive vs false negative cost)?
- Do they know what baseline metrics mean (accuracy alone is often useless)?
What good looks like:
"I'd start by understanding the data—how many samples, class balance, what the features are. Then I'd build a simple baseline, maybe logistic regression, to establish a benchmark. Then iterate: feature engineering, trying different models, cross-validation, measuring performance on the right metrics."
Red flag:
"I'd use XGBoost or a neural network."
(They jumped to tools without mentioning data, baseline, or evaluation.)
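To make the "simple baseline" step concrete, here is a minimal, framework-free sketch of the majority-class baseline (`majority_baseline` is an illustrative helper, not something the question requires the candidate to write):

```python
from collections import Counter

def majority_baseline(labels):
    """Accuracy of always predicting the most common class.

    Any real model should beat this floor; if it doesn't, something
    is wrong with the features, the labels, or the evaluation.
    """
    counts = Counter(labels)
    majority_class, count = counts.most_common(1)[0]
    return majority_class, count / len(labels)

# On a 90/10 imbalanced dataset, "always predict 0" already scores 90%,
# which is why raw accuracy is a poor headline metric here.
labels = [0] * 90 + [1] * 10
print(majority_baseline(labels))  # (0, 0.9)
```

A candidate who mentions this kind of floor before naming any algorithm is showing exactly the process you want.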
2. "You're training a model and notice the training loss is decreasing, but validation loss is increasing. What's happening and how would you debug it?"
Why this works: Overfitting is the most common ML problem in practice. This tests:
- Do they know what overfitting looks like?
- Can they propose concrete debugging steps?
- Do they understand regularization approaches?
What good looks like:
"That's overfitting. The model is memorizing training data rather than generalizing. I'd check: training set size relative to model complexity, whether I'm doing cross-validation properly, add regularization (L1/L2, dropout if it's neural), get more data, or simplify the model. I'd also check for leakage—whether any target information leaked into features."
Red flag:
"I'd add more hidden layers."
(That makes overfitting worse, not better.)
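The divergence in the question can also be caught mechanically. A minimal sketch of patience-based early stopping, one standard guard a strong answer might mention (`should_stop_early` is a hypothetical helper, not a library function):

```python
def should_stop_early(val_losses, patience=3):
    """Return True once validation loss has failed to improve for
    `patience` consecutive epochs — the train-loss-down,
    val-loss-up divergence described above."""
    if len(val_losses) <= patience:
        return False
    best_so_far = min(val_losses[:-patience])
    recent = val_losses[-patience:]
    return min(recent) >= best_so_far

# Validation loss bottoms out at epoch 3, then climbs: classic overfitting.
history = [0.90, 0.70, 0.55, 0.50, 0.52, 0.56, 0.61]
print(should_stop_early(history))  # True
```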
3. "What's the difference between precision and recall? When would you optimize for one over the other?"
Why this works: This filters candidates who passed ML courses from those who've shipped systems. Real-world ML is about trade-offs.
What good looks like:
"Precision is true positives divided by predicted positives—it answers 'when the model says yes, how often is it right?' Recall is true positives divided by actual positives—'of all the real positives, how many did we catch?' You optimize for precision when false positives are expensive (e.g., spam filtering, where blocking a legitimate email hurts), and for recall when false negatives cost more (e.g., cancer screening, security threats)."
Red flag:
"I always use accuracy" or "I optimize for AUC"
(These dodge the real question about business context.)
Experience Questions: Separating Theorists from Practitioners
Now move into their actual work. A candidate might answer foundation questions perfectly but have never deployed a model. These questions force real stories.
4. "Describe the most recent ML system you built. Walk me through the entire pipeline from data to production."
Why this works: This is your stress test. You'll find out:
- Do they understand end-to-end systems, or just the model training piece?
- Do they know about data pipelines, feature stores, model serving, monitoring?
- How do they handle edge cases?
- Do they take ownership or pass blame?
What good looks like:
A coherent story covering: data sources → data cleaning → feature engineering → training pipeline → evaluation → deployment strategy → monitoring → retraining triggers.
They mention specific tools and trade-offs they made.
They discuss failures and what they learned.
Red flag:
- Only talks about the model ("I used random forest and got 85% accuracy")
- Vague about deployment ("It goes to production")
- No monitoring or retraining strategy
- Dismisses important parts as "someone else's job"
5. "Tell me about a time a model you built performed well in testing but failed in production. What happened?"
Why this works: Everyone ships a bad model eventually. How they respond reveals maturity.
- Do they understand data distribution shift?
- Do they blame external factors or take responsibility?
- Did they learn and implement safeguards?
- Are they defensive or reflective?
What good looks like:
"We built a demand forecasting model that worked great on historical data but failed when we deployed it. Turned out the data distribution had shifted—new product categories, different seasonality. We didn't have monitoring set up to catch it immediately. Now I always set up automated alerts for prediction distribution changes and implement retraining pipelines."
Red flag:
"It never failed" or "The data was bad"
(Unrealistic or shows no learning.)
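The monitoring described in the good answer can start very simply. A sketch of a mean-shift alert on prediction scores (`drift_alert` is a hypothetical helper; production systems typically compare full distributions using PSI or KS tests rather than just the mean):

```python
def drift_alert(baseline_preds, live_preds, threshold=0.15):
    """Flag a shift in mean predicted score between a baseline window
    (captured at deploy time) and a live window."""
    base_mean = sum(baseline_preds) / len(baseline_preds)
    live_mean = sum(live_preds) / len(live_preds)
    return abs(live_mean - base_mean) > threshold

baseline = [0.2, 0.3, 0.25, 0.35, 0.3]   # scores at deploy time
live = [0.6, 0.55, 0.7, 0.65, 0.6]       # scores this week
print(drift_alert(baseline, live))  # True — the distribution has moved
```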
6. "How do you handle missing data in a production system? What's your approach?"
Why this works: Missing data is omnipresent in real ML. This tests:
- Do they know multiple strategies (imputation, deletion, flagging)?
- Do they think about missing data patterns?
- Do they understand the downstream impact?
What good looks like:
"It depends on the amount and pattern. If it's random and small, I might impute with mean/median or use a more sophisticated method like KNN imputation. If it's structural—like a sensor that sometimes doesn't report—I'd flag that as a feature, because it's informative. I'd never silently drop data in production. I'd monitor imputation rates to catch when data quality degrades."
Red flag:
"Drop the rows" or "Just use mean imputation everywhere"
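The "flag missingness as a feature" idea from the good answer looks like this in a minimal pure-Python sketch (`impute_with_flag` is illustrative; in practice you'd likely reach for scikit-learn's `SimpleImputer` with `add_indicator=True`):

```python
def impute_with_flag(values):
    """Median-impute missing values and emit a was-missing indicator,
    so the model can learn whether missingness itself is informative."""
    observed = sorted(v for v in values if v is not None)
    mid = len(observed) // 2
    median = (observed[mid] if len(observed) % 2
              else (observed[mid - 1] + observed[mid]) / 2)
    imputed = [median if v is None else v for v in values]
    flags = [1 if v is None else 0 for v in values]
    return imputed, flags

readings = [10.0, None, 12.0, 11.0, None]  # e.g., a flaky sensor
print(impute_with_flag(readings))
# ([10.0, 11.0, 12.0, 11.0, 11.0], [0, 1, 0, 0, 1])
```

Monitoring the rate of 1s in the flag column is the cheap data-quality alarm the answer mentions.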
Systems and Trade-offs Questions
These questions separate senior-level candidates. They're less about right answers and more about thinking clearly under uncertainty.
7. "You have a recommendation model that's too slow for real-time serving. How do you approach optimizing it?"
Why this works: Real production constraints dominate ML decisions. This reveals:
- Do they understand the full stack (model, infrastructure, caching)?
- Can they prioritize between many options?
- Do they think about trade-offs between accuracy and speed?
What good looks like:
A response that considers: profiling to find bottlenecks → model compression (quantization, pruning, knowledge distillation) → caching and memoization → batch serving if real-time isn't required → approximate algorithms → distributed serving → accepting lower accuracy for speed when appropriate.
Red flag:
"Use a faster model like linear regression"
(Assumes accuracy doesn't matter without asking.)
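Caching is often the cheapest of these levers. A sketch using Python's `functools.lru_cache` to memoize a simulated feature-store lookup (the feature values here are dummies):

```python
from functools import lru_cache

lookups = 0  # counts trips to the (simulated) feature store

@lru_cache(maxsize=100_000)
def user_features(user_id):
    """Stand-in for an expensive per-user feature fetch; the cache
    turns repeated requests for hot users into memory reads."""
    global lookups
    lookups += 1
    return (user_id % 7, user_id % 3)  # dummy feature vector

for uid in [1, 2, 1, 2, 1, 3]:
    user_features(uid)
print(lookups)  # 3 — only the first request per user hits the store
```

A candidate who reaches for profiling and caching before swapping the model is showing the prioritization you're probing for.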
8. "How would you approach building a recommendation system for [product/domain they're familiar with]? What would your first version look like, and how would you iterate?"
Why this works: This is realistic system design. You're looking for:
- Pragmatism — do they start simple or over-engineer?
- Iteration mindset — do they plan for learning and improvement?
- Full-system thinking — data, features, model, infrastructure, metrics
What good looks like:
"I'd start simple: item popularity baseline. Then add collaborative filtering—either user-based or matrix factorization depending on data scale. Measure: diversity, coverage, business metrics like engagement. Add content-based features gradually. Only move to deep learning if simpler methods hit a wall. I'd A/B test everything."
Red flag:
"I'd use a deep neural network" as the first answer
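The popularity baseline in that answer fits in a few lines. A sketch assuming interaction events arrive as `(user_id, item_id)` pairs:

```python
from collections import Counter

def top_k_popular(interactions, k=3):
    """First-version recommender: rank items by raw interaction count.
    `interactions` is a list of (user_id, item_id) events."""
    counts = Counter(item for _, item in interactions)
    return [item for item, _ in counts.most_common(k)]

events = [("u1", "a"), ("u2", "a"), ("u3", "b"), ("u1", "b"),
          ("u2", "c"), ("u3", "a")]
print(top_k_popular(events, k=2))  # ['a', 'b']
```

Everything after this (collaborative filtering, content features, deep models) should have to beat this list in an A/B test to earn its complexity.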
9. "Explain feature engineering. Why is it often more important than model selection?"
Why this works: This separates theorists from practitioners. In real ML, feature engineering beats algorithm selection most of the time.
What good looks like:
"Features determine ceiling—you can't compensate for bad features with a better algorithm. I spend more time on features: understanding which raw inputs matter, creating domain-specific features, feature interactions, temporal features if it's time-series. Models can be simple if features are good. I've seen linear models with great features beat complex models with poor features."
Red flag:
"I just use whatever features are in the data" or "I let the model learn features automatically"
Deep Learning and Specific Framework Questions
If the role involves deep learning, ask these. Otherwise, you can skip them.
10. "What's the difference between batch normalization and layer normalization? When would you use each?"
Why this works: This tests deep learning intuition.
What good looks like:
"Batch norm normalizes each feature across the examples in a batch; layer norm normalizes across the features of a single example. Batch norm is efficient but behaves differently at train vs test time (it needs running statistics) and degrades with small batch sizes. Layer norm is batch-size independent and consistent between train and test. Transformers and most modern sequence models use layer norm; CNNs still commonly use batch norm."
Red flag:
"I don't know" or conflating them with dropout
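The axis difference is easy to show without a framework. A pure-Python sketch (the learned scale/shift parameters and running statistics of real implementations are omitted for clarity):

```python
def layer_norm(example, eps=1e-5):
    # Normalize across the FEATURE dimension of one example.
    mean = sum(example) / len(example)
    var = sum((v - mean) ** 2 for v in example) / len(example)
    return [(v - mean) / (var + eps) ** 0.5 for v in example]

def batch_norm(batch, eps=1e-5):
    # Normalize each feature across the BATCH dimension.
    n = len(batch)
    out_cols = []
    for col in zip(*batch):
        mean = sum(col) / n
        var = sum((v - mean) ** 2 for v in col) / n
        out_cols.append([(v - mean) / (var + eps) ** 0.5 for v in col])
    return [list(row) for row in zip(*out_cols)]

batch = [[1.0, 2.0, 3.0], [4.0, 6.0, 8.0]]
# layer_norm touches one example at a time, so batch size never matters;
# batch_norm's statistics collapse when the batch is tiny.
print([round(v, 2) for v in layer_norm(batch[0])])  # [-1.22, 0.0, 1.22]
```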
11. "You're training a large language model and it's running out of memory. What are some strategies?"
Why this works: Modern ML is constrained by memory. This reveals whether they understand practical limitations.
What good looks like:
"Gradient checkpointing to trade compute for memory, mixed precision training, distributed training across GPUs, optimizer state sharding, smaller batch sizes, or using a smaller model. I'd profile to understand where memory is going first."
Red flag:
"Get more GPUs"
(That's not a technical solution, it's a budget increase.)
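One concrete version of the "smaller batch sizes" lever is gradient accumulation: sum gradients over several micro-batches and step once, which matches the full-batch gradient while holding far fewer activations in memory at any moment. A framework-free sketch with a one-parameter model and squared-error loss (the data and model are toys chosen for illustration):

```python
def grad(w, batch):
    """Mean gradient of (w*x - y)^2 over a batch of (x, y) pairs."""
    return sum(2 * (w * x - y) * x for x, y in batch) / len(batch)

data = [(1.0, 2.0), (2.0, 4.0), (3.0, 5.0), (4.0, 9.0)]
w = 0.5

# Full batch in one pass (high activation memory in a real network)...
full = grad(w, data)

# ...equals the size-weighted average over micro-batches of size 2.
micro_batches = [data[0:2], data[2:4]]
accumulated = sum(grad(w, mb) * len(mb) for mb in micro_batches) / len(data)

print(abs(full - accumulated) < 1e-12)  # True
```

A candidate who knows this equivalence (and when it breaks, e.g., with batch norm in the network) is demonstrating real training-infrastructure experience.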
Metrics and Evaluation Questions
Metrics are often where recruiters see the biggest gaps. Candidates can build models but don't know how to evaluate them properly.
12. "We're building a spam classifier. How would you measure its performance? What metrics matter?"
Why this works: This forces thinking about business context in evaluation.
What good looks like:
"Accuracy is misleading if spam is imbalanced. I'd use precision (when we flag spam, are we right?) and recall (what percent of real spam do we catch?). F1 if they're equally important. ROC-AUC to see trade-off across thresholds. But ultimately I care about: true positive rate for users (emails we correctly identified), false positive rate (legitimate emails we incorrectly blocked), and cost of each type of error. I'd measure on a held-out test set and potentially different time periods to catch distribution shift."
Red flag:
"I'd use accuracy" or single metric obsession
13. "How do you set up cross-validation? Why not just use a single train/test split?"
Why this works: This is fundamental but often misunderstood.
What good looks like:
"Cross-validation gives you a variance estimate—you see how stable performance is across different train/test splits. With a single split you might get lucky. I typically use k-fold (k=5 or 10), stratified for classification to preserve class distribution. Time-series is different—you'd use time-based splits to avoid leakage."
Red flag:
"Train/test split is fine" or not knowing what stratification is
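A minimal sketch of the stratified split itself (no shuffling, for clarity; in practice you'd use scikit-learn's `StratifiedKFold`, which shuffles within each class):

```python
from collections import defaultdict

def stratified_folds(labels, k=5):
    """Split sample indices into k folds while preserving the class
    ratio in each fold."""
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    folds = [[] for _ in range(k)]
    # Deal each class's indices round-robin so every fold
    # gets its proportional share.
    for indices in by_class.values():
        for i, idx in enumerate(indices):
            folds[i % k].append(idx)
    return folds

labels = [0] * 10 + [1] * 5  # 2:1 imbalance
for fold in stratified_folds(labels, k=5):
    print(fold)  # every fold holds two 0s and one 1
```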
Communication and Process Questions
Technical skills matter, but so does how they work with others.
14. "How would you explain why a model's performance degraded to a non-technical stakeholder?"
Why this works: Real ML work is collaborative. This tests communication.
What good looks like:
"I'd start with the high-level impact: 'Performance dropped from 92% to 85%.' Then explain the likely causes in simple terms: 'The underlying data changed—new kinds of customers, different patterns.' Then what we're doing: 'We're retraining the model with fresh data and adding monitoring to catch this sooner.' Avoid jargon unless they ask."
Red flag:
"It's complicated" or heavy jargon-dumping
15. "Tell me about a disagreement you had with a data scientist, engineer, or PM about an ML approach. How did you resolve it?"
Why this works: This reveals collaboration maturity and ability to handle disagreement.
What good looks like:
A specific story with: the disagreement → their perspective → the other person's perspective → how they resolved it → what they learned.
Red flag:
"I was right" or no concrete example
Coding and Implementation Questions
For senior roles, ask one quick coding question. Nothing complex—just enough to verify they can write actual code.
16. "Implement a function that calculates precision, recall, and F1 score given true labels and predictions. You can use whatever language you're comfortable with."
Why this works: This is a 5-minute implementation that tests:
- Do they actually code?
- Can they handle edge cases (empty arrays, no positives)?
- Are they methodical or sloppy?
You're not looking for perfect code—it's a phone screen. You're looking for logical thinking and ability to articulate what they're doing.
Red flag:
- Can't start the problem
- Gets confused about what precision/recall are while implementing
- Doesn't consider edge cases
- Dismisses it as too simple (it's not about simplicity, it's about confirming they can code)
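For reference, one reasonable Python answer, including the zero-denominator edge cases you'd hope the candidate notices (this is one acceptable shape, not the only one):

```python
def classification_metrics(y_true, y_pred):
    """Precision, recall, and F1 for binary labels (1 = positive)."""
    if len(y_true) != len(y_pred):
        raise ValueError("label and prediction lengths differ")
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    # Edge cases: no predicted positives / no actual positives.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return precision, recall, f1

print(tuple(round(m, 3)
            for m in classification_metrics([1, 0, 1, 1, 0],
                                            [1, 0, 0, 1, 1])))
# (0.667, 0.667, 0.667)
```

Listen for whether they ask "what happens when there are no predicted positives?" unprompted; that question is the real signal.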
Scoring Rubric for Phone Screens
After the call, rate each area on this rubric:
| Area | Strong (3) | Adequate (2) | Weak (1) |
|---|---|---|---|
| Fundamentals | Clear on classification, overfitting, metrics | Knows concepts but vague on application | Confused or missing basics |
| Production Experience | Built and deployed multiple systems, understands full pipeline | Built systems but gaps in deployment/monitoring | Theory only, no production work |
| Problem-Solving | Methodical approach, considers trade-offs, asks clarifying questions | Jumps to solutions, but ultimately finds answer | Scattered, gives up or over-complicates |
| Communication | Clear explanations, handles follow-ups well | Adequate but uses jargon or unclear | Defensive, hard to follow |
| Tool Knowledge | Deep in specific tools, understands when to use each | Knows tools but hasn't deeply mastered any | Unfamiliar with common tools |
| Flags | No red flags | Minor concerns | Major concerns (lying, gaps in understanding) |
Scoring system:
- 18+ points: Strong move forward to technical round
- 14-17 points: Conditional pass (depends on role/seniority)
- 13 or below: Likely pass
Common Red Flags That Should End the Call Early
Some warning signs are deal-breakers:
- Confidence with no foundation — They're certain about wrong answers. ("Overfitting is solved with more hidden layers." "Accuracy is always the right metric.")
- Resume-padding — Their "5 years ML experience" amounts to using scikit-learn tutorials. Press: "Tell me specifically what you built." If they can't, they're padding.
- Framework obsession over fundamentals — They know PyTorch syntax but can't explain why batch normalization helps. (Tools change; understanding doesn't.)
- No production experience yet claiming seniority — A "Senior ML Engineer" who's never deployed a model is a bad hire.
- Dismissive about important topics — "Data cleaning is boring, I just use the raw data" or "I don't care about monitoring." (This person will cause production fires.)
- Can't articulate their own work — They built something but can't explain it clearly. Either they didn't build it or they don't understand it.
What NOT to Do in Phone Screens
Don't ask:
- Whiteboard complex math derivations (not predictive of job performance)
- Trivia about frameworks ("What's the default learning rate in Adam?")
- Gotcha questions designed to trick them
- Only hard problems (screening should be easier than on-site)
Do:
- Listen more than you talk
- Follow up on vague answers ("Can you be more specific?")
- Ask about failures and learnings (most signal-rich)
- Confirm their interest and salary expectations early
Tailoring Questions to Role and Seniority
For junior ML engineers (0-2 years):
- Focus on fundamentals, learning ability, and one shipped project
- Be more forgiving of gaps in production knowledge
- Questions 1-6, 12-13, 16
For mid-level (2-5 years):
- Full range of questions
- Emphasis on production systems (questions 4-5, 7)
- Systems thinking (question 8)
For senior (5+ years):
- Deep dives into complex problems (questions 7-9)
- Leadership and mentorship questions
- Architecture and trade-off thinking
- Challenge their assumptions appropriately
Using This Guide Practically
- Customize 3-4 questions to your specific domain — If you're hiring for NLP, ask about text preprocessing. Computer vision? Ask about augmentation and model architectures.
- Ask follow-ups ruthlessly — When they give a surface-level answer, dig deeper. "Why did you choose that approach?" "What would you do differently now?"
- Listen for language — Do they say "we" or "I"? Do they take responsibility or blame others? Do they ask questions back?
- Take notes during the call — Quote them. You'll need evidence when comparing candidates.
- Don't try to teach during the screen — If you correct them or explain something, you're wasting screening time. Save teaching for on-site or after hire.
Making Better ML Hiring Decisions
Phone screens are just the first filter. Strong performance here indicates someone who understands ML fundamentals, has shipped systems, and can communicate clearly.
However, phone screens miss things: ability to work in your specific stack, cultural fit, ability to learn from your team, and collaboration under real pressure. That's what on-site rounds are for.
Use these questions to build signal quickly and fairly. Ask them consistently across candidates so you can compare. Reference Zumo to supplement your phone screens with objective GitHub data on a candidate's coding activity, contribution patterns, and technical depth—adding another dimension to your evaluation beyond what any single interview can reveal.
FAQ
How long should a phone screen take?
45-60 minutes is ideal. Spend 5 minutes on background/rapport, 40-50 on technical questions (3-4 deep dives rather than 10 shallow ones), and 5 on their questions. Going longer exhausts both parties without adding signal.
Should I let candidates prepare for the phone screen?
Yes. Tell them the general topics (ML fundamentals, a past project, systems design) but don't give them exact questions. The goal is to see how they think under slight pressure, not trick them.
What if they can't answer a question?
Ask: "How would you approach figuring this out?" Many ML engineers haven't memorized everything—but strong ones know how to investigate and learn. That's often more valuable than having the answer memorized.
Should I ask coding questions on the phone?
For senior roles (5+), one simple implementation question is good—something they can talk through. For junior/mid-level, you can do this or skip it if you're doing a longer technical on-site. The goal isn't to gatekeep but to confirm they can actually code.
How much weight should phone screen results carry?
Phone screens are ~40% of hiring signal. They're good at filtering obviously unqualified candidates and identifying strong fundamentals. They're bad at predicting culture fit, team collaboration, or ability to learn your specific stack. Use them to decide who gets on-site, not who to hire.