2026-01-17
Technical Phone Screen Questions for ML Engineers
The phone screen is your first real technical conversation with an ML engineer candidate. Get this wrong, and you waste everyone's time. Get it right, and you'll filter out the resume-padders while identifying candidates who actually understand machine learning fundamentals, systems design, and practical trade-offs.
This guide walks you through the exact questions I recommend asking, how to score responses, and what red flags to watch for. These questions work because they reveal both depth and practical experience—not just what someone memorized from a course.
Why Phone Screening ML Engineers Is Different
Machine learning hiring is fundamentally different from general software engineering. A candidate might be a strong backend engineer but hopeless at ML. The reverse is also true.
Here's what makes ML screening unique:
- Math matters more — but you need to assess conceptual understanding, not ability to derive gradient descent from first principles
- Production context is critical — many candidates understand theory but have never shipped an ML system
- Tool preferences vary wildly — PyTorch vs TensorFlow, scikit-learn vs custom solutions, and this reveals how they think
- Scope ambiguity is common — candidates confuse data engineering, analytics, and ML work
- Trade-offs are nuanced — there's rarely a single "right" answer, making communication critical
The phone screen should take 45-60 minutes. Spend the first 5 minutes on background, 40-50 on technical content, and 5 on their questions. This length gives you enough signal without exhausting either party.
Foundation Questions: Testing Core Understanding
Start with questions that reveal whether someone understands ML fundamentals. These aren't gotchas—they're filters that separate practitioners from people who completed a Coursera course.
1. "Walk me through how you'd approach a binary classification problem from scratch. What are the first things you'd do?"
Why this works: This is open-ended enough to reveal their process, but constrained enough to be answerable in 5 minutes. You're looking for:
- Do they talk about data exploration first, or jump straight to algorithms?
- Do they mention class imbalance, missing values, or feature distributions?
- Are they thinking about the business context (what does a false positive vs false negative cost)?
- Do they know what baseline metrics mean (accuracy alone is often useless)?
What good looks like:
"I'd start by understanding the data—how many samples, class balance, what the features are. Then I'd build a simple baseline, maybe logistic regression, to establish a benchmark. Then iterate: feature engineering, trying different models, cross-validation, measuring performance on the right metrics."
Red flag:
"I'd use XGBoost or a neural network."
(They jumped to tools without mentioning data, baseline, or evaluation.)
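To make the "simple baseline" step concrete, here is a minimal, framework-free sketch of the majority-class baseline (`majority_baseline` is an illustrative helper, not something the question requires the candidate to write):

```python
from collections import Counter

def majority_baseline(labels):
    """Accuracy of always predicting the most common class.

    Any real model should beat this floor; if it doesn't, something
    is wrong with the features, the labels, or the evaluation.
    """
    counts = Counter(labels)
    majority_class, count = counts.most_common(1)[0]
    return majority_class, count / len(labels)

# On a 90/10 imbalanced dataset, "always predict 0" already scores 90%,
# which is why raw accuracy is a poor headline metric here.
labels = [0] * 90 + [1] * 10
print(majority_baseline(labels))  # (0, 0.9)
```

A candidate who mentions this kind of floor before naming any algorithm is showing exactly the process you want.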
2. "You're training a model and notice the training loss is decreasing, but validation loss is increasing. What's happening and how would you debug it?"
Why this works: Overfitting is the most common ML problem in practice. This tests:
- Do they know what overfitting looks like?
- Can they propose concrete debugging steps?
- Do they understand regularization approaches?
What good looks like:
"That's overfitting. The model is memorizing training data rather than generalizing. I'd check: training set size relative to model complexity, whether I'm doing cross-validation properly, add regularization (L1/L2, dropout if it's neural), get more data, or simplify the model. I'd also check for leakage—whether any target information leaked into features."
Red flag:
"I'd add more hidden layers."
(That makes overfitting worse, not better.)
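The divergence in the question can also be caught mechanically. A minimal sketch of patience-based early stopping, one standard guard a strong answer might mention (`should_stop_early` is a hypothetical helper, not a library function):

```python
def should_stop_early(val_losses, patience=3):
    """Return True once validation loss has failed to improve for
    `patience` consecutive epochs — the train-loss-down,
    val-loss-up divergence described above."""
    if len(val_losses) <= patience:
        return False
    best_so_far = min(val_losses[:-patience])
    recent = val_losses[-patience:]
    return min(recent) >= best_so_far

# Validation loss bottoms out at epoch 3, then climbs: classic overfitting.
history = [0.90, 0.70, 0.55, 0.50, 0.52, 0.56, 0.61]
print(should_stop_early(history))  # True
```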
3. "What's the difference between precision and recall? When would you optimize for one over the other?"
Why this works: This filters candidates who passed ML courses from those who've shipped systems. Real-world ML is about trade-offs.
What good looks like:
"Precision is true positives divided by predicted positives—it answers 'when the model says yes, how often is it right?' Recall is true positives divided by actual positives—'of all the real positives, how many did we catch?' You optimize for precision when false positives are expensive (e.g., spam filtering, where blocking a legitimate email hurts), and for recall when false negatives cost more (e.g., cancer screening, security threats)."
Red flag:
"I always use accuracy" or "I optimize for AUC"
(These dodge the real question about business context.)
Experience Questions: Separating Theorists from Practitioners
Now move into their actual work. A candidate might answer foundation questions perfectly but have never deployed a model. These questions force real stories.
4. "Describe the most recent ML system you built. Walk me through the entire pipeline from data to production."
Why this works: This is your stress test. You'll find out:
- Do they understand end-to-end systems, or just the model training piece?
- Do they know about data pipelines, feature stores, model serving, monitoring?
- How do they handle edge cases?
- Do they take ownership or pass blame?
What good looks like:
A coherent story covering: data sources → data cleaning → feature engineering → training pipeline → evaluation → deployment strategy → monitoring → retraining triggers.
They mention specific tools and trade-offs they made.
They discuss failures and what they learned.
Red flag:
- Only talks about the model ("I used random forest and got 85% accuracy")
- Vague about deployment ("It goes to production")
- No monitoring or retraining strategy
- Dismisses important parts as "someone else's job"
5. "Tell me about a time a model you built performed well in testing but failed in production. What happened?"
Why this works: Everyone ships a bad model eventually. How they respond reveals maturity.
- Do they understand data distribution shift?
- Do they blame external factors or take responsibility?
- Did they learn and implement safeguards?
- Are they defensive or reflective?
What good looks like:
"We built a demand forecasting model that worked great on historical data but failed when we deployed it. Turned out the data distribution had shifted—new product categories, different seasonality. We didn't have monitoring set up to catch it immediately. Now I always set up automated alerts for prediction distribution changes and implement retraining pipelines."
Red flag:
"It never failed" or "The data was bad"
(Unrealistic or shows no learning.)
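The monitoring described in the good answer can start very simply. A sketch of a mean-shift alert on prediction scores (`drift_alert` is a hypothetical helper; production systems typically compare full distributions using PSI or KS tests rather than just the mean):

```python
def drift_alert(baseline_preds, live_preds, threshold=0.15):
    """Flag a shift in mean predicted score between a baseline window
    (captured at deploy time) and a live window."""
    base_mean = sum(baseline_preds) / len(baseline_preds)
    live_mean = sum(live_preds) / len(live_preds)
    return abs(live_mean - base_mean) > threshold

baseline = [0.2, 0.3, 0.25, 0.35, 0.3]   # scores at deploy time
live = [0.6, 0.55, 0.7, 0.65, 0.6]       # scores this week
print(drift_alert(baseline, live))  # True — the distribution has moved
```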
6. "How do you handle missing data in a production system? What's your approach?"
Why this works: Missing data is omnipresent in real ML. This tests:
- Do they know multiple strategies (imputation, deletion, flagging)?
- Do they think about missing data patterns?
- Do they understand the downstream impact?
What good looks like:
"It depends on the amount and pattern. If it's random and small, I might impute with mean/median or use a more sophisticated method like KNN imputation. If it's structural—like a sensor that sometimes doesn't report—I'd flag that as a feature, because it's informative. I'd never silently drop data in production. I'd monitor imputation rates to catch when data quality degrades."
Red flag:
"Drop the rows" or "Just use mean imputation everywhere"
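The "flag missingness as a feature" idea from the good answer looks like this in a minimal pure-Python sketch (`impute_with_flag` is illustrative; in practice you'd likely reach for scikit-learn's `SimpleImputer` with `add_indicator=True`):

```python
def impute_with_flag(values):
    """Median-impute missing values and emit a was-missing indicator,
    so the model can learn whether missingness itself is informative."""
    observed = sorted(v for v in values if v is not None)
    mid = len(observed) // 2
    median = (observed[mid] if len(observed) % 2
              else (observed[mid - 1] + observed[mid]) / 2)
    imputed = [median if v is None else v for v in values]
    flags = [1 if v is None else 0 for v in values]
    return imputed, flags

readings = [10.0, None, 12.0, 11.0, None]  # e.g., a flaky sensor
print(impute_with_flag(readings))
# ([10.0, 11.0, 12.0, 11.0, 11.0], [0, 1, 0, 0, 1])
```

Monitoring the rate of 1s in the flag column is the cheap data-quality alarm the answer mentions.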
Systems and Trade-offs Questions
These questions separate senior-level candidates. They're less about right answers and more about thinking clearly under uncertainty.
7. "You have a recommendation model that's too slow for real-time serving. How do you approach optimizing it?"
Why this works: Real production constraints dominate ML decisions. This reveals:
- Do they understand the full stack (model, infrastructure, caching)?
- Can they prioritize between many options?
- Do they think about trade-offs between accuracy and speed?
What good looks like:
A response that considers: profiling to find bottlenecks → model compression (quantization, pruning, knowledge distillation) → caching and memoization → batch serving if real-time isn't required → approximate algorithms → distributed serving → accepting lower accuracy for speed when appropriate.
Red flag:
"Use a faster model like linear regression"
(Assumes accuracy doesn't matter without asking.)
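Caching is often the cheapest of these levers. A sketch using Python's `functools.lru_cache` to memoize a simulated feature-store lookup (the feature values here are dummies):

```python
from functools import lru_cache

lookups = 0  # counts trips to the (simulated) feature store

@lru_cache(maxsize=100_000)
def user_features(user_id):
    """Stand-in for an expensive per-user feature fetch; the cache
    turns repeated requests for hot users into memory reads."""
    global lookups
    lookups += 1
    return (user_id % 7, user_id % 3)  # dummy feature vector

for uid in [1, 2, 1, 2, 1, 3]:
    user_features(uid)
print(lookups)  # 3 — only the first request per user hits the store
```

A candidate who reaches for profiling and caching before swapping the model is showing the prioritization you're probing for.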
8. "How would you approach building a recommendation system for [product/domain they're familiar with]? What would your first version look like, and how would you iterate?"
Why this works: This is realistic system design. You're looking for:
- Pragmatism — do they start simple or over-engineer?
- Iteration mindset — do they plan for learning and improvement?
- Full-system thinking — data, features, model, infrastructure, metrics
What good looks like:
"I'd start simple: item popularity baseline. Then add collaborative filtering—either user-based or matrix factorization depending on data scale. Measure: diversity, coverage, business metrics like engagement. Add content-based features gradually. Only move to deep learning if simpler methods hit a wall. I'd A/B test everything."
Red flag:
"I'd use a deep neural network" as the first answer
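The popularity baseline in that answer fits in a few lines. A sketch assuming interaction events arrive as `(user_id, item_id)` pairs:

```python
from collections import Counter

def top_k_popular(interactions, k=3):
    """First-version recommender: rank items by raw interaction count.
    `interactions` is a list of (user_id, item_id) events."""
    counts = Counter(item for _, item in interactions)
    return [item for item, _ in counts.most_common(k)]

events = [("u1", "a"), ("u2", "a"), ("u3", "b"), ("u1", "b"),
          ("u2", "c"), ("u3", "a")]
print(top_k_popular(events, k=2))  # ['a', 'b']
```

Everything after this (collaborative filtering, content features, deep models) should have to beat this list in an A/B test to earn its complexity.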
9. "Explain feature engineering. Why is it often more important than model selection?"
Why this works: This separates theorists from practitioners. In real ML, feature engineering beats algorithm selection most of the time.
What good looks like:
"Features determine ceiling—you can't compensate for bad features with a better algorithm. I spend more time on features: understanding which raw inputs matter, creating domain-specific features, feature interactions, temporal features if it's time-series. Models can be simple if features are good. I've seen linear models with great features beat complex models with poor features."
Red flag:
"I just use whatever features are in the data" or "I let the model learn features automatically"
Deep Learning and Specific Framework Questions
If the role involves deep learning, ask these. Otherwise, you can skip them.
10. "What's the difference between batch normalization and layer normalization? When would you use each?"
Why this works: This tests deep learning intuition.
What good looks like:
"Batch norm normalizes each feature across the examples in a batch; layer norm normalizes across the features of a single example. Batch norm is efficient but behaves differently at train vs test time (it needs running statistics) and degrades with small batch sizes. Layer norm is batch-size independent and consistent between train and test. Transformers and most modern sequence models use layer norm; CNNs still commonly use batch norm."
Red flag:
"I don't know" or conflating them with dropout
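The axis difference is easy to show without a framework. A pure-Python sketch (the learned scale/shift parameters and running statistics of real implementations are omitted for clarity):

```python
def layer_norm(example, eps=1e-5):
    # Normalize across the FEATURE dimension of one example.
    mean = sum(example) / len(example)
    var = sum((v - mean) ** 2 for v in example) / len(example)
    return [(v - mean) / (var + eps) ** 0.5 for v in example]

def batch_norm(batch, eps=1e-5):
    # Normalize each feature across the BATCH dimension.
    n = len(batch)
    out_cols = []
    for col in zip(*batch):
        mean = sum(col) / n
        var = sum((v - mean) ** 2 for v in col) / n
        out_cols.append([(v - mean) / (var + eps) ** 0.5 for v in col])
    return [list(row) for row in zip(*out_cols)]

batch = [[1.0, 2.0, 3.0], [4.0, 6.0, 8.0]]
# layer_norm touches one example at a time, so batch size never matters;
# batch_norm's statistics collapse when the batch is tiny.
print([round(v, 2) for v in layer_norm(batch[0])])  # [-1.22, 0.0, 1.22]
```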
11. "You're training a large language model and it's running out of memory. What are some strategies?"
Why this works: Modern ML is constrained by memory. This reveals whether they understand practical limitations.
What good looks like:
"Gradient checkpointing to trade compute for memory, mixed precision training, distributed training across GPUs, optimizer state sharding, smaller batch sizes, or using a smaller model. I'd profile to understand where memory is going first."
Red flag:
"Get more GPUs"
(That's not a technical solution, it's a budget increase.)
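One concrete version of the "smaller batch sizes" lever is gradient accumulation: sum gradients over several micro-batches and step once, which matches the full-batch gradient while holding far fewer activations in memory at any moment. A framework-free sketch with a one-parameter model and squared-error loss (the data and model are toys chosen for illustration):

```python
def grad(w, batch):
    """Mean gradient of (w*x - y)^2 over a batch of (x, y) pairs."""
    return sum(2 * (w * x - y) * x for x, y in batch) / len(batch)

data = [(1.0, 2.0), (2.0, 4.0), (3.0, 5.0), (4.0, 9.0)]
w = 0.5

# Full batch in one pass (high activation memory in a real network)...
full = grad(w, data)

# ...equals the size-weighted average over micro-batches of size 2.
micro_batches = [data[0:2], data[2:4]]
accumulated = sum(grad(w, mb) * len(mb) for mb in micro_batches) / len(data)

print(abs(full - accumulated) < 1e-12)  # True
```

A candidate who knows this equivalence (and when it breaks, e.g., with batch norm in the network) is demonstrating real training-infrastructure experience.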
Metrics and Evaluation Questions
Metrics are often where recruiters see the biggest gaps. Candidates can build models but don't know how to evaluate them properly.
12. "We're building a spam classifier. How would you measure its performance? What metrics matter?"
Why this works: This forces thinking about business context in evaluation.
What good looks like:
"Accuracy is misleading if spam is imbalanced. I'd use precision (when we flag spam, are we right?) and recall (what percent of real spam do we catch?). F1 if they're equally important. ROC-AUC to see trade-off across thresholds. But ultimately I care about: true positive rate for users (emails we correctly identified), false positive rate (legitimate emails we incorrectly blocked), and cost of each type of error. I'd measure on a held-out test set and potentially different time periods to catch distribution shift."
Red flag:
"I'd use accuracy" or single metric obsession
13. "How do you set up cross-validation? Why not just use a single train/test split?"
Why this works: This is fundamental but often misunderstood.
What good looks like:
"Cross-validation gives you a variance estimate—you see how stable performance is across different train/test splits. With a single split you might get lucky. I typically use k-fold (k=5 or 10), stratified for classification to preserve class distribution. Time-series is different—you'd use time-based splits to avoid leakage."
Red flag:
"Train/test split is fine" or not knowing what stratification is
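A minimal sketch of the stratified split itself (no shuffling, for clarity; in practice you'd use scikit-learn's `StratifiedKFold`, which shuffles within each class):

```python
from collections import defaultdict

def stratified_folds(labels, k=5):
    """Split sample indices into k folds while preserving the class
    ratio in each fold."""
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    folds = [[] for _ in range(k)]
    # Deal each class's indices round-robin so every fold
    # gets its proportional share.
    for indices in by_class.values():
        for i, idx in enumerate(indices):
            folds[i % k].append(idx)
    return folds

labels = [0] * 10 + [1] * 5  # 2:1 imbalance
for fold in stratified_folds(labels, k=5):
    print(fold)  # every fold holds two 0s and one 1
```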
Communication and Process Questions
Technical skills matter, but so does how they work with others.
14. "How would you explain why a model's performance degraded to a non-technical stakeholder?"
Why this works: Real ML work is collaborative. This tests communication.
What good looks like:
"I'd start with the high-level impact: 'Performance dropped from 92% to 85%.' Then explain the likely causes in simple terms: 'The underlying data changed—new kinds of customers, different patterns.' Then what we're doing: 'We're retraining the model with fresh data and adding monitoring to catch this sooner.' Avoid jargon unless they ask."
Red flag:
"It's complicated" or heavy jargon-dumping
15. "Tell me about a disagreement you had with a data scientist, engineer, or PM about an ML approach. How did you resolve it?"
Why this works: This reveals collaboration maturity and ability to handle disagreement.
What good looks like:
A specific story with: the disagreement → their perspective → the other person's perspective → how they resolved it → what they learned.
Red flag:
"I was right" or no concrete example
Coding and Implementation Questions
For senior roles, ask one quick coding question. Nothing complex—just enough to verify they can write actual code.
16. "Implement a function that calculates precision, recall, and F1 score given true labels and predictions. You can use whatever language you're comfortable with."
Why this works: This is a 5-minute implementation that tests:
- Do they actually code?
- Can they handle edge cases (empty arrays, no positives)?
- Are they methodical or sloppy?
You're not looking for perfect code—it's a phone screen. You're looking for logical thinking and ability to articulate what they're doing.
Red flag:
- Can't start the problem
- Gets confused about what precision/recall are while implementing
- Doesn't consider edge cases
- Dismisses it as too simple (it's not about simplicity, it's about confirming they can code)
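For reference, one reasonable Python answer, including the zero-denominator edge cases you'd hope the candidate notices (this is one acceptable shape, not the only one):

```python
def classification_metrics(y_true, y_pred):
    """Precision, recall, and F1 for binary labels (1 = positive)."""
    if len(y_true) != len(y_pred):
        raise ValueError("label and prediction lengths differ")
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    # Edge cases: no predicted positives / no actual positives.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return precision, recall, f1

print(tuple(round(m, 3)
            for m in classification_metrics([1, 0, 1, 1, 0],
                                            [1, 0, 0, 1, 1])))
# (0.667, 0.667, 0.667)
```

Listen for whether they ask "what happens when there are no predicted positives?" unprompted; that question is the real signal.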
Scoring Rubric for Phone Screens
After the call, rate each area on this rubric:
| Area | Strong (3) | Adequate (2) | Weak (1) |
|---|---|---|---|
| Fundamentals | Clear on classification, overfitting, metrics | Knows concepts but vague on application | Confused or missing basics |
| Production Experience | Built and deployed multiple systems, understands full pipeline | Built systems but gaps in deployment/monitoring | Theory only, no production work |
| Problem-Solving | Methodical approach, considers trade-offs, asks clarifying questions | Jumps to solutions, but ultimately finds answer | Scattered, gives up or over-complicates |
| Communication | Clear explanations, handles follow-ups well | Adequate but uses jargon or unclear | Defensive, hard to follow |
| Tool Knowledge | Deep in specific tools, understands when to use each | Knows tools but hasn't deeply mastered any | Unfamiliar with common tools |
| Flags | No red flags | Minor concerns | Major concerns (lying, gaps in understanding) |
Scoring system:
- 18+ points: Strong move forward to technical round
- 14-17 points: Conditional pass (depends on role/seniority)
- 13 or below: Likely pass
Common Red Flags That Should End the Call Early
Some warning signs are deal-breakers:
- Confidence with no foundation — They're certain about wrong answers. ("Overfitting is solved with more hidden layers." "Accuracy is always the right metric.")
- Resume-padding — Their "5 years ML experience" amounts to using scikit-learn tutorials. Press: "Tell me specifically what you built." If they can't, they're padding.
- Framework obsession over fundamentals — They know PyTorch syntax but can't explain why batch normalization helps. (Tools change; understanding doesn't.)
- No production experience yet claiming seniority — A "Senior ML Engineer" who's never deployed a model is a bad hire.
- Dismissive about important topics — "Data cleaning is boring, I just use the raw data" or "I don't care about monitoring." (This person will cause production fires.)
- Can't articulate their own work — They built something but can't explain it clearly. Either they didn't build it or they don't understand it.
What NOT to Do in Phone Screens
Don't ask:
- Whiteboard complex math derivations (not predictive of job performance)
- Trivia about frameworks ("What's the default learning rate in Adam?")
- Gotcha questions designed to trick them
- Only hard problems (screening should be easier than on-site)
Do:
- Listen more than you talk
- Follow up on vague answers ("Can you be more specific?")
- Ask about failures and learnings (most signal-rich)
- Confirm their interest and salary expectations early
Tailoring Questions to Role and Seniority
For junior ML engineers (0-2 years):
- Focus on fundamentals, learning ability, and one shipped project
- Be more forgiving of gaps in production knowledge
- Questions 1-6, 12-13, 16
For mid-level (2-5 years):
- Full range of questions
- Emphasis on production systems (questions 4-5, 7)
- Systems thinking (question 8)
For senior (5+ years):
- Deep dives into complex problems (questions 7-9)
- Leadership and mentorship questions
- Architecture and trade-off thinking
- Challenge their assumptions appropriately
Using This Guide Practically
- Customize 3-4 questions to your specific domain — If you're hiring for NLP, ask about text preprocessing. Computer vision? Ask about augmentation and model architectures.
- Ask follow-ups ruthlessly — When they give a surface-level answer, dig deeper. "Why did you choose that approach?" "What would you do differently now?"
- Listen for language — Do they say "we" or "I"? Do they take responsibility or blame others? Do they ask questions back?
- Take notes during the call — Quote them. You'll need evidence when comparing candidates.
- Don't try to teach during the screen — If you correct them or explain something, you're wasting screening time. Save teaching for on-site or after hire.
Making Better ML Hiring Decisions
Phone screens are just the first filter. Strong performance here indicates someone who understands ML fundamentals, has shipped systems, and can communicate clearly.
However, phone screens miss things: ability to work in your specific stack, cultural fit, ability to learn from your team, and collaboration under real pressure. That's what on-site rounds are for.
Use these questions to build signal quickly and fairly. Ask them consistently across candidates so you can compare. Reference Zumo to supplement your phone screens with objective GitHub data on a candidate's coding activity, contribution patterns, and technical depth—adding another dimension to your evaluation beyond what any single interview can reveal.
FAQ
How long should a phone screen take?
45-60 minutes is ideal. Spend 5 minutes on background/rapport, 40-50 on technical questions (3-4 deep dives rather than 10 shallow ones), and 5 on their questions. Going longer exhausts both parties without adding signal.
Should I let candidates prepare for the phone screen?
Yes. Tell them the general topics (ML fundamentals, a past project, systems design) but don't give them exact questions. The goal is to see how they think under slight pressure, not trick them.
What if they can't answer a question?
Ask: "How would you approach figuring this out?" Many ML engineers haven't memorized everything—but strong ones know how to investigate and learn. That's often more valuable than having the answer memorized.
Should I ask coding questions on the phone?
For senior roles (5+), one simple implementation question is good—something they can talk through. For junior/mid-level, you can do this or skip it if you're doing a longer technical on-site. The goal isn't to gatekeep but to confirm they can actually code.
How much weight should phone screen results carry?
Phone screens are ~40% of hiring signal. They're good at filtering obviously unqualified candidates and identifying strong fundamentals. They're bad at predicting culture fit, team collaboration, or ability to learn your specific stack. Use them to decide who gets on-site, not who to hire.