AI Agent Developer Interview Questions & Answers Guide (2026)
A hiring manager's interview kit for AI agent developers, with specific "what to look for" notes on every answer, red flags to watch, and a practical test.
Key facts
- Role: AI Agent Developer
- Technical questions: 15
- Behavioral questions: 7
- Role-fit questions: 5
- Red flags: 8
- Practical test: included
How to use this guide
Pick 4-6 technical questions across difficulties, 2-3 behavioral, and 1-2 role-fit for a 45-minute interview. For senior roles, weight harder technical and role-fit higher. Always close with the practical test so you are hiring on evidence, not impressions. The “what to look for” notes are a scoring rubric: strong answers touch most points, weak answers miss them or replace them with platitudes.
Technical questions — Easy
1. Walk me through a production agent you shipped. What was the task, what framework did you use, and what went wrong in the first two weeks after launch?
What to look for: Specific agent, specific framework choice and why, concrete failure mode (infinite tool loop, rate-limit cascades, prompt regression, cost spike), concrete fix. Red flag: everything went perfectly, or it was only a demo.
Technical questions — Medium
1. Your agent calls a tool in a loop and never terminates. How do you diagnose and prevent it?
What to look for: Max-step guard in LangGraph / LangChain, require the model to return a "done" signal, log each tool call with state, use a stronger model for planning, add observation summaries. Not just "increase the timeout".
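The max-step guard the rubric asks about can be sketched framework-free. A minimal version, where `call_model` and `run_tool` are hypothetical stand-ins for your framework's model call and tool executor:

```python
def run_agent(task, call_model, run_tool, max_steps=10):
    """Tool-calling loop that hard-stops after max_steps iterations.

    call_model(history) returns ("done", answer) or ("tool", tool_args);
    both callables are assumed interfaces, not a real library API.
    """
    history = [("user", task)]
    for step in range(max_steps):
        action = call_model(history)
        if action[0] == "done":
            return action[1]
        observation = run_tool(action[1])
        # Log each tool call with its step index so loops are visible in traces.
        history.append((f"step-{step}", observation))
    raise RuntimeError(f"Agent exceeded {max_steps} steps without a 'done' signal")
```

In LangGraph the same idea is the `recursion_limit` on graph execution; the point is that the cap fails loudly instead of burning tokens silently.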
2. Claude Sonnet 4.5 vs GPT-4o vs Gemini 1.5 Pro for a long-context document Q&A. How do you pick?
What to look for: Actual context window sizes, prompt-caching differences (Anthropic caching is generous), structured output fidelity, cost per million tokens, latency characteristics. Should have tried at least two on the same eval set.
3. Explain how you would enforce structured output from an LLM and handle the case where the model returns malformed JSON.
What to look for: OpenAI structured outputs / Anthropic tool use / Instructor / Zod with retries. Pydantic validation with a retry loop that feeds the validation error back to the model. Not "I regex the output."
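The retry-with-feedback pattern the rubric describes is small enough to sketch with stdlib `json`; in practice you would swap the `validate` callable for a Pydantic `model_validate` or Instructor. `call_model` is a hypothetical LLM call returning a string:

```python
import json

def extract_json(call_model, prompt, validate, max_retries=3):
    """Request JSON, validate it, and feed errors back on failure.

    validate(obj) raises ValueError with a message describing what is wrong;
    that message is appended to the conversation so the model can self-correct.
    """
    messages = [("user", prompt)]
    last_err = None
    for _ in range(max_retries):
        raw = call_model(messages)
        try:
            obj = json.loads(raw)
            validate(obj)
            return obj
        except (json.JSONDecodeError, ValueError) as err:
            last_err = err
            messages.append(("user", f"Invalid output ({err}); return corrected JSON only."))
    raise ValueError(f"No valid JSON after {max_retries} attempts: {last_err}")
```

Strong candidates mention that the error message itself is the repair signal; a bare "try again" retry converges much more slowly.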
4. How do you build a golden eval set and use it in CI?
What to look for: 50-200 representative inputs (happy paths, edge cases, known failures), expected outputs or pass criteria, LLM-as-judge with a structured rubric, exact-match where possible. CI runs on every prompt/model change, diffs results, flags regressions. Tools: promptfoo, LangSmith, custom harness.
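The skeleton of a custom harness is simple; this sketch does exact-match grading only (open-ended answers need an LLM-as-judge rubric, which is deliberately omitted here):

```python
def run_golden_evals(agent, golden):
    """golden: iterable of (input, expected) pairs. Returns a list of failures.

    Each failure records the input, expected output, and actual output so a
    CI job can print a readable diff.
    """
    failures = []
    for query, expected in golden:
        got = agent(query)
        if got != expected:
            failures.append({"input": query, "expected": expected, "got": got})
    return failures
```

In CI, exit non-zero whenever `failures` is non-empty so a prompt or model change that regresses the golden set blocks the merge.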
5. Compare LangGraph vs a custom state machine for agent orchestration.
What to look for: LangGraph: built-in state, checkpointing, human-in-the-loop, observability hooks. Custom: more control, less lock-in, simpler for linear flows. Opinion based on team and flow complexity, not dogma. Red flag: "LangChain is bad" with no alternative plan.
6. Chunking strategy for technical documentation where code blocks must not be split. How do you implement it?
What to look for: Markdown-aware splitter (LangChain MarkdownHeaderTextSplitter or custom), respect code fences, include heading ancestry in chunk metadata, overlap at heading boundaries, token-count cap per chunk. Re-evaluate chunking against a retrieval eval.
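The fence-respecting split is the part candidates most often get wrong, so it is worth asking them to sketch it. A toy version (real implementations also track heading ancestry and token counts; this one splits at headings only):

```python
def split_markdown(text):
    """Split markdown at heading boundaries without ever splitting a code fence.

    Tracks fence state so a `#` line inside a code block (e.g. a Python
    comment) is never mistaken for a heading.
    """
    chunks, current, in_fence = [], [], False
    for line in text.splitlines():
        if line.strip().startswith("```"):
            in_fence = not in_fence
        # Start a new chunk at a heading, but never while inside a fence.
        if line.startswith("#") and not in_fence and current:
            chunks.append("\n".join(current))
            current = []
        current.append(line)
    if current:
        chunks.append("\n".join(current))
    return chunks
```

A follow-up probe: what happens with an unclosed fence in the source docs? The sketch above would stop splitting for the rest of the file, which is a reasonable failure mode to discuss.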
7. Walk me through setting up observability for a LangGraph agent.
What to look for: LangSmith / Langfuse tracing, tag by user and workflow, capture tool inputs/outputs, token usage, latency per step. Alert on error rate, latency P95, cost per day. Run replayable traces to debug specific user sessions.
8. How do you handle a 60-minute agent task without the client hanging?
What to look for: Background worker with a job queue (Inngest, Temporal, Celery, Trigger.dev), SSE or polling for progress, checkpoints in LangGraph so runs are resumable, webhook on completion. Not "just increase the timeout."
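The shape of the answer can be demonstrated in-process: enqueue the task, return a job id immediately, and let the client poll for status. Production versions swap the thread for a real queue (Celery, Temporal, Inngest) plus durable checkpoints; this is a sketch of the pattern only:

```python
import threading
import uuid

JOBS = {}  # job_id -> {"status": ..., "result": ...}; a DB row in production

def submit(task_fn, *args):
    """Start task_fn in the background and return a job id immediately."""
    job_id = str(uuid.uuid4())
    JOBS[job_id] = {"status": "running", "result": None}

    def worker():
        try:
            JOBS[job_id]["result"] = task_fn(*args)
            JOBS[job_id]["status"] = "done"
        except Exception as err:
            JOBS[job_id]["status"] = f"failed: {err}"

    threading.Thread(target=worker, daemon=True).start()
    return job_id

def status(job_id):
    """What the polling endpoint (or SSE stream) would serve."""
    return JOBS[job_id]
```

A strong candidate will point out that the in-memory dict dies with the process, which is exactly why the rubric asks for a queue and checkpoints.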
9. Your agent returns confidently wrong answers. Walk through the diagnosis.
What to look for: Check retrieval quality first (are the right chunks coming back?), then grounding (is the model citing the context or freewheeling?), prompt (does it tell the model to refuse when uncertain?), model choice, evaluation of confidence calibration. Not "I add 'do not hallucinate' to the prompt."
Technical questions — Hard
1. Design a RAG pipeline for a customer support agent answering from 20k Zendesk articles that update daily.
What to look for: Chunking strategy (semantic or heading-based, not fixed size), embedding model (text-embedding-3-large or Voyage), hybrid search (BM25 + vector), re-ranker, incremental re-indexing via webhook, deletion hooks, eval on real past tickets. Should not say "just stuff it in Pinecone."
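If a candidate mentions hybrid search, a good probe is how the BM25 and vector result lists get combined. One standard answer is reciprocal rank fusion, which is compact enough to write out:

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of doc ids (e.g. BM25 and vector results).

    Standard RRF: score(d) = sum over lists of 1 / (k + rank). k=60 comes
    from the original RRF paper; tune it against your retrieval eval.
    """
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)
```

Documents that appear in both lists float to the top without any score normalization, which is why RRF is a common default before adding a learned re-ranker.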
2. You are paying $4,000/month on OpenAI tokens and need to cut it in half without losing quality. Walk me through it.
What to look for: Tag logs by workflow first and identify the top 3 cost drivers. Prompt caching (Anthropic), context truncation, summary memory, route easy queries to gpt-4o-mini or Haiku, cache embeddings, batch embeddings. Measure against the eval set to prove no regression.
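The routing idea is worth pushing on with a concrete sketch. The heuristic and model names below are illustrative assumptions, not a recommendation; the real test is whether the candidate insists on validating any routing rule against the golden eval set before shipping it:

```python
CHEAP_MODEL = "gpt-4o-mini"   # illustrative model names
STRONG_MODEL = "gpt-4o"

def pick_model(query, context_docs):
    """Crude complexity heuristic: short, single-question, low-context
    queries go to the cheap model; everything else escalates."""
    hard = (
        len(query.split()) > 50      # long, multi-part request
        or query.count("?") > 1      # several questions in one turn
        or len(context_docs) > 5     # heavy retrieval context
    )
    return STRONG_MODEL if hard else CHEAP_MODEL
```

Weak answers stop at "use a cheaper model"; strong ones describe routing plus an eval-backed proof that quality held on the routed traffic.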
3. What is prompt injection, and how do you defend against it in an agent that calls tools on behalf of a user?
What to look for: Untrusted input modifying agent behavior. Defenses: separate system prompt channel (when available), sandboxed tools, explicit user confirmation for destructive actions, output filtering, least-privilege tool permissions, testing with an adversarial prompt set. Not "I tell the model to ignore injections."
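The confirmation-gate defense can be probed concretely: destructive tools should require an out-of-band user confirmation no matter what the model (or injected text) asks for. A minimal sketch, with illustrative tool names:

```python
# Tools whose effects cannot be undone must be gated on a real user prompt.
DESTRUCTIVE = {"delete_record", "send_email", "issue_refund"}

def dispatch_tool(name, args, tools, confirm):
    """Least-privilege dispatch. `confirm(name, args) -> bool` is a real
    user-facing prompt, never model output, so injected instructions
    cannot grant their own approval."""
    if name not in tools:
        raise PermissionError(f"Unknown or unauthorized tool: {name}")
    if name in DESTRUCTIVE and not confirm(name, args):
        return {"status": "blocked", "reason": "user declined confirmation"}
    return {"status": "ok", "result": tools[name](**args)}
```

The key design point to listen for: the allowlist and the confirmation live outside the model's reach, so a successful injection can at worst request an action, never execute it.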
4. Explain the difference between similarity-based retrieval and knowledge-graph retrieval. When do you reach for the graph?
What to look for: Vector similarity finds topically similar text. Graph (GraphRAG, Neo4j) captures relationships and multi-hop queries (who worked with whom, cause-effect chains). Graph is heavier and only worth it when relationship queries matter. Most agents only need vector + BM25.
5. You need a browser-using agent to fill forms on 20 SaaS tools. How do you build it reliably?
What to look for: Playwright / Browserbase / Anthropic Computer Use for controlled environments, selector strategies (semantic role-based, not brittle XPath), screenshots for vision verification, retries with different strategies, sandboxing/isolation, audit logging, human fallback.
Behavioral questions
1. Tell me about a prompt change that broke production. How did you find out and what did you put in place after?
What to look for: Real story — eval regression not caught, user complaint, silent quality drop. Follow-up: eval in CI, versioning prompts, canary rollouts. Owns the failure.
2. Describe a disagreement with product about agent scope or reliability targets.
What to look for: Brought eval numbers, pushed back on promises the agent could not hit, proposed a narrower happy-path launch. Not just saying yes to scope creep.
3. Walk me through the most painful bug you debugged in an agent.
What to look for: Specific: race condition in tool calling, context pollution, incorrect JSON schema causing retry storms. Systematic debugging story with traces and evals.
4. How do you stay current on fast-moving LLM tooling without chasing every shiny new thing?
What to look for: Uses evals to decide when a new model/framework is actually better, reads provider changelogs and a few trusted sources, has a sandbox project. Not "I install every new package."
5. Tell me about when you pushed back against using AI for a feature.
What to look for: Proposed a rules-based solution, or deterministic workflow, because the use case did not need LLMs. Senior judgment.
6. How do you work with a non-technical stakeholder who wants "an agent that does everything"?
What to look for: Scopes to specific success criteria, builds a baseline, measures, iterates. Treats "everything" as a red flag and extracts concrete user tasks.
7. Describe your worst LLM cost spike and how you caught it.
What to look for: Real number ("went from $500 to $4,200 overnight"), detection mechanism, root cause (loop, retry storm, unexpected traffic), post-mortem action.
Role-fit questions
1. Our stack is [LangGraph + Pinecone + Claude + Vercel]. Anything there you have not used, and how would you ramp?
What to look for: Honest gap assessment plus a concrete ramp plan with a small evaluation project. Fakery is a red flag.
2. How do you feel about being on-call for an AI system where the failure modes are subtle (quality drift, not crashes)?
What to look for: Treats eval regressions as incidents, has run this kind of ops before, advocates for alerting on task success rate not just 500s.
3. Where does your real interest sit — prompt and eval craft, agent architecture, fine-tuning, or infra?
What to look for: Has an honest specialty. Role needs all four but the candidate should be strongest in one and competent in the rest.
4. What is your take on open-source vs frontier API models for production?
What to look for: Open-source wins on cost at volume and on data control; frontier wins on quality and speed of iteration. Context-dependent, not dogma.
5. What LLM features released in the last 6 months changed how you build agents?
What to look for: Concrete, current examples: prompt caching, extended thinking / reasoning models, native structured outputs, computer use, 1M context. Tells you they are actually in the trenches.
Red flags
Any one of these is usually reason to pass, especially when combined with weak answers elsewhere.
- Describes every LLM problem as "we just need a better prompt".
- Has never written an eval set or used LLM-as-judge.
- Confuses a LangChain notebook with a production agent.
- Cannot explain the difference between tool calling and structured output.
- Dismisses retrieval quality problems with "we should just fine-tune".
- Has never looked at per-call token usage or cost.
- Proposes multi-agent orchestration for problems a single model call would solve.
- Does not use version control or evals when iterating on prompts.
Practical test
5-hour take-home: build a documentation Q&A agent over a provided corpus of ~500 markdown files (mixed technical docs). Requirements:
1. An ingestion script with a chunking strategy of your choice and embeddings stored in pgvector or Chroma.
2. A LangGraph or Vercel AI SDK agent that retrieves, re-ranks, and answers with citations back to source files.
3. A golden eval set of 20 Q/A pairs with an LLM-as-judge harness that runs from the CLI.
4. A FastAPI or Next.js endpoint with SSE streaming.
5. A README covering cost per query, latency, and what you would change with another week.

Grading: retrieval quality on the eval set (30%), code quality and structure (25%), eval harness rigor (20%), production readiness and writeup (25%). Bonus for catching the 5 seeded "unanswerable" questions and having the agent refuse gracefully.
Scoring rubric
Score each answer 1-4: (1) Misses most of the rubric or gives platitudes; (2) Hits some points but cannot go deep when pressed; (3) Covers the rubric and can defend the answer under follow-ups; (4) Adds unprompted nuance, trade-offs, or real examples beyond the rubric. Hire at an average of 3.0+ across technical, behavioral, and role-fit, with zero red flags, and a pass on the practical test.
Written by Syed Ali
Founder, Remoteria
Syed Ali founded Remoteria after a decade building distributed teams across 4 continents. He has helped 500+ companies source, vet, onboard, and scale pre-vetted offshore talent in engineering, design, marketing, and operations.
- 10+ years building distributed remote teams
- 500+ successful offshore placements across US, UK, EU, and APAC
- Specialist in offshore vetting and cross-timezone team integration
Last updated: April 12, 2026