
LH02 - Agentic Evals Engineer

Remote · Posted April 2, 2026


About the Project

We are partnering with a rapidly growing engineering organization (100+ engineers) building production-grade Agentic AI systems that power complex internal workflows and customer-facing products. These systems are built around LLM-driven agents orchestrated with LangGraph, integrated with MCP (Model Context Protocol) servers, backend services, and domain-specific tools. These are multi-step, stateful systems, not simple prompt-response applications. A major focus area includes document-centric workflows, where agents must:

  • Parse and interpret structured and unstructured documents
  • Extract entities and structured data
  • Validate fields and business logic
  • Route decisions based on extracted information
  • Operate across multi-step, stateful workflows
  • Trigger downstream tools and services
  • Maintain state across long-running workflows

These systems often combine LLMs with traditional NLP models (NER, classifiers, rule-based validators, OCR pipelines, etc.). As they scale, evaluation becomes mission-critical, especially for correctness, consistency, reliability, and regression protection across complex workflows.

This role focuses on designing and operating robust evaluation frameworks for agentic systems, document-processing pipelines, and voice-based agents in production. It is not a research-only role and is not limited to prompt grading: it is a systems-level evaluation engineering role deeply integrated with orchestration, structured extraction, NLP validation, voice agents, and production reliability. This is a long-term, high-impact position with direct influence over how quality is defined and enforced across AI-powered systems.

Role and Responsibilities

Agentic System Evaluation

Design and implement evaluation frameworks for multi-step agent workflows. Agent evaluation differs from traditional LLM evaluation because agents must be evaluated across decision sequences, tool calls, and workflow trajectories, not just final outputs. You will work on:

  • Evaluation frameworks for multi-step agent workflows
  • Tool invocation correctness
  • State transitions and memory behavior
  • Planner / executor coordination
  • End-to-end task completion validation
  • Trajectory evaluation (sequence of agent actions)
  • Deterministic replay and trace-based debugging
  • Structured output validation
  • Failure classification and root cause analysis
  • Automated regression testing pipelines for LangGraph systems
  • Scenario-based testing for long-running workflows
  • Evaluation datasets and scenario generation
  • Measuring agent reliability over time

You will help define what it means for an agent system to be correct, not just whether a single response looks good.
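As a minimal illustration (not the team's actual framework; all names here are invented), trajectory evaluation can start as simply as comparing an agent's recorded tool-call sequence against an expected trajectory and reporting where the run first diverged:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Step:
    """One recorded agent action: which tool was called, with what args."""
    tool: str
    args: tuple

def trajectory_match(actual: list, expected: list) -> dict:
    """Score one agent run against an expected trajectory.

    Returns an exact-match flag plus the first divergence index, which
    is usually where debugging a multi-step workflow should begin.
    """
    prefix = 0
    for a, e in zip(actual, expected):
        if a != e:
            break
        prefix += 1
    exact = actual == expected
    return {
        "exact": exact,
        "divergence_step": None if exact else prefix,
        "expected_len": len(expected),
        "actual_len": len(actual),
    }

# Hypothetical run: the agent skipped the validation tool call.
expected = [Step("parse_doc", ("invoice.pdf",)),
            Step("extract_fields", ("invoice",)),
            Step("validate_fields", ("invoice",))]
actual = [Step("parse_doc", ("invoice.pdf",)),
          Step("extract_fields", ("invoice",))]
result = trajectory_match(actual, expected)
# result reports exact=False, diverging at step 2 (validate_fields never ran)
```

A production harness would score richer trajectory properties (state transitions, memory reads/writes, partial-credit orderings), but they build on the same compare-against-expected-sequence core.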

Document Processing & NLP Evaluation

A major part of this role focuses on document ingestion and document-processing pipelines. You will design evaluation frameworks for systems that perform:

  • Document parsing
  • Entity extraction (NER)
  • Field extraction and normalization
  • Document classification
  • Schema validation
  • Business rule validation
  • Cross-document consistency checks
  • OCR pipeline evaluation
  • Structured output extraction using LLMs
  • Hybrid systems combining:
    • LLM extraction
    • Traditional NLP models
    • Rules and validators
    • Heuristics and confidence thresholds

You will build evaluation harnesses that compare:
  • LLM-based extraction
  • Traditional NLP models (NER models)
  • Hybrid rule + model systems
  • OCR + extraction pipelines
  • Multiple prompts or extraction strategies

You will implement evaluation metrics such as:
  • Precision / Recall / F1 for entity extraction
  • Field-level accuracy scoring
  • Document-level extraction accuracy
  • Schema conformance validation
  • Cross-field validation
  • Cross-document consistency
  • Confidence calibration
  • Error categorization
  • Drift detection over time

You will also help develop:
  • Gold datasets and labeling standards
  • Adversarial and edge-case document test sets (messy PDFs, OCR noise, ambiguous fields, conflicting values)
  • Evaluation datasets representing real production documents
  • Regression test suites for document workflows

Many workflows will involve pipelines where:
  • OCR → NER → Extraction → Validation → Agent decision → Tool call → Workflow routing
  • Extracted entities feed into agent reasoning
  • Business logic depends on extraction confidence
  • Incorrect extraction causes workflow failures downstream

You will help evaluate the entire pipeline, not just individual models.
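By way of a self-contained sketch (the entity tuples and sample data are invented for illustration), entity-level precision/recall/F1 for extraction reduces to set comparisons between predicted and gold entities:

```python
def entity_prf(gold: set, pred: set) -> dict:
    """Entity-level precision/recall/F1 via exact-match set comparison.

    Each entity is a hashable tuple, e.g. (text, label) or
    (start, end, label); partial-credit and type-relaxed variants
    build on the same true-positive counting.
    """
    tp = len(gold & pred)  # entities predicted exactly as annotated
    precision = tp / len(pred) if pred else 0.0
    recall = tp / len(gold) if gold else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {"precision": precision, "recall": recall, "f1": f1}

# Hypothetical gold annotations vs. model output for one document
gold = {("Acme Corp", "ORG"), ("2026-04-02", "DATE"), ("$1,200.00", "MONEY")}
pred = {("Acme Corp", "ORG"), ("$1,200.00", "MONEY"), ("Acme", "ORG")}
scores = entity_prf(gold, pred)
# 2 true positives out of 3 predicted and 3 gold: P = R = F1 = 2/3
```

Note the boundary-mismatch case (`"Acme"` vs. `"Acme Corp"`) counts as both a false positive and a false negative under exact match, which is exactly the kind of error-categorization decision a document-evaluation harness has to make explicit.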

Voice Agent & Conversational System Evaluation

In addition to document and agent workflow evaluation, this role will also support evaluation of voice-based agentic systems. We evaluate voice agents using platforms such as Hamming.ai, which provides automated testing, simulated calls, regression testing, and production monitoring for AI voice agents. Voice agent evaluation includes:

  • Monitoring production calls for agent behavior correctness
  • Evaluating multi-turn conversation flows
  • Task completion success rates
  • Detecting when agents go off-script
  • Evaluating tool usage during calls
  • Latency and interruption handling
  • ASR / transcription error impact
  • Conversation state tracking
  • Dialogue planning failures
  • Regression testing using simulated calls
  • Converting failed production calls into regression tests
  • Tracking business outcomes from calls

Voice agent testing platforms can simulate thousands of calls with different personas, accents, noise conditions, and scenarios to detect failures before deployment. This role may involve building:
  • Call evaluation metrics
  • Conversation scoring frameworks
  • Regression testing pipelines for voice agents
  • Production monitoring dashboards for conversational quality
  • Failure classification systems for voice agent errors
  • Conversation trajectory evaluation
  • Call outcome classification and scoring

This work is similar to agent evaluation but applied to real-time conversational systems operating over voice, which introduces additional failure modes such as speech recognition errors, latency, interruptions, and dialogue management issues.
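To make call scoring concrete, here is a deliberately simplified offline scorer (the slot names, forbidden phrases, and transcript format are all invented assumptions, not any platform's API) that checks task completion and off-script behavior for one transcript:

```python
import re

REQUIRED_SLOTS = {"appointment_date", "callback_number"}  # hypothetical task
FORBIDDEN = [re.compile(r"\bguarantee\b", re.I)]  # invented off-script phrases

def score_call(transcript: list, extracted_slots: dict) -> dict:
    """Score one voice-agent call offline.

    transcript: [{"role": "agent" | "caller", "text": ...}, ...]
    Returns task-completion and off-script findings; a real harness
    would also fold in ASR confidence, latency, and interruptions.
    """
    missing = REQUIRED_SLOTS - extracted_slots.keys()
    off_script = [
        turn["text"]
        for turn in transcript
        if turn["role"] == "agent"
        and any(p.search(turn["text"]) for p in FORBIDDEN)
    ]
    return {
        "task_complete": not missing,
        "missing_slots": sorted(missing),
        "off_script_utterances": off_script,
        "agent_turns": sum(t["role"] == "agent" for t in transcript),
    }

# Hypothetical failed production call, replayed as a regression case
call = [
    {"role": "agent", "text": "Hi, when would you like to come in?"},
    {"role": "caller", "text": "Tuesday morning."},
    {"role": "agent", "text": "I can guarantee that slot is free."},
]
result = score_call(call, {"appointment_date": "Tuesday"})
# flags the missing callback_number and the off-script "guarantee" turn
```

Converting a failed production call into a fixture like `call` above is the regression-testing loop described earlier: once scored, the same transcript can gate future agent releases.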

Evaluation Infrastructure & Tooling

You will help build the infrastructure that prevents AI regressions in production systems. This includes:

  • Offline and batch evaluation pipelines
  • Evaluation datasets and scenario generators
  • Regression testing frameworks for agents and document pipelines
  • CI/CD integration for AI systems
  • Deterministic replay systems
  • Structured output validation pipelines
  • Evaluation dashboards
  • Drift detection systems
  • Model vs. model comparison systems
  • Production vs. staging comparison systems
  • Latency and cost tracking
  • Evaluation logging and observability
  • Failure classification and error taxonomy systems

You will help design hybrid evaluation approaches, including:
  • Rule-based validators
  • Deterministic scoring layers
  • LLM-as-judge systems
  • Confidence thresholding strategies
  • Ensemble evaluation methods
  • Human-in-the-loop review systems

You will work closely with agent and backend engineers to improve:
  • Determinism
  • Observability
  • Debuggability
  • Testability
  • Reliability of document workflows
  • Reliability of voice agents
  • Reliability of agent orchestration systems
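One common shape for the hybrid approaches listed above, sketched with invented field names and a stubbed judge (any real judge backend is an assumption): run cheap deterministic validators first, and only spend an LLM-as-judge call on records that survive the rules.

```python
from typing import Callable

def schema_ok(record: dict) -> bool:
    """Deterministic layer 1: required fields present with sane types."""
    return (isinstance(record.get("invoice_id"), str)
            and isinstance(record.get("total"), (int, float))
            and record["total"] >= 0)

def business_rules_ok(record: dict) -> bool:
    """Deterministic layer 2: line items must sum to the stated total."""
    return abs(sum(record.get("line_items", [])) - record.get("total", 0)) < 0.01

def evaluate(record: dict, llm_judge: Callable[[dict], bool]) -> dict:
    """Hybrid verdict with the stage that decided it, for error taxonomies."""
    if not schema_ok(record):
        return {"verdict": "fail", "stage": "schema"}
    if not business_rules_ok(record):
        return {"verdict": "fail", "stage": "business_rules"}
    # Only rule-clean records pay for (and can be blamed on) the judge.
    return {"verdict": "pass" if llm_judge(record) else "fail",
            "stage": "llm_judge"}

def stub_judge(record: dict) -> bool:
    # Stand-in for an LLM-as-judge call; always approves.
    return True

good = {"invoice_id": "INV-7", "total": 120.0, "line_items": [100.0, 20.0]}
bad = {"invoice_id": "INV-8", "total": 120.0, "line_items": [100.0]}
evaluate(good, stub_judge)  # reaches the judge and passes
evaluate(bad, stub_judge)   # fails at business_rules; judge never runs
```

Recording the deciding `stage` is what makes failure classification cheap later: every regression lands in a named bucket (schema, business rules, or judge disagreement) instead of a generic "eval failed".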

What We’re Looking For

Must-Haves

  • 4+ years of professional engineering experience in backend systems, ML systems, NLP, or AI infrastructure
  • Strong proficiency in Python
  • Experience working with:
    • LLM-based systems
    • Structured output extraction
    • Multi-step agent workflows
    • Document processing systems
  • Strong understanding of NLP fundamentals, including:
    • Named Entity Recognition (NER)
    • Classification models
    • Precision / Recall / F1 metrics
    • Dataset creation and evaluation methodology
  • Experience building evaluation pipelines for document extraction systems
  • Familiarity with structured validation tools (JSON Schema, Pydantic, etc.)
  • Experience designing regression testing systems for AI pipelines
  • Ability to distinguish between:
    • Model extraction failures
    • OCR noise issues
    • Prompt design flaws
    • Orchestration bugs
    • Tool failures
    • Business logic errors
    • Voice conversation failures
  • Strong analytical and debugging skills in ambiguous, non-deterministic systems

We Are Specifically Looking For Candidates Who Have

  • Evaluated document processing systems in production
  • Built entity-level scoring frameworks
  • Compared traditional NLP models vs LLM extraction systems
  • Designed hybrid validation systems (rules + models)
  • Diagnosed extraction or routing failures in multi-step AI workflows
  • Built gold datasets for structured document evaluation
  • Built evaluation frameworks for agent systems
  • Worked on voice agents, conversational AI, or call automation systems
  • Evaluated multi-turn conversations or task completion in conversational systems
  • Built regression testing systems for AI workflows

If your experience is limited to prompt iteration or academic NLP benchmarking without real-world document pipelines or agent workflows, this role is likely not the right fit.

Nice-to-Haves

  • Experience with OCR pipelines and noisy document inputs
  • Experience evaluating PDFs, invoices, contracts, medical or financial documents
  • Familiarity with LangGraph-based orchestration systems
  • Experience with MCP servers or tool-exposed agents
  • Experience in regulated environments (finance, healthcare, legal)
  • Exposure to human-in-the-loop evaluation systems
  • Experience designing model confidence calibration systems
  • Background in search, ranking, or information retrieval evaluation
  • Experience with conversational AI or voice agents
  • Experience evaluating multi-turn dialogues or call center automation systems

Why Join Us?

  • Define how AI system quality is measured in production
  • Work at the intersection of:
    • Agentic orchestration
    • NLP evaluation
    • Document processing
    • Voice agents
    • Backend systems
    • Production AI reliability
  • Build infrastructure that prevents silent failures in high-stakes workflows
  • High ownership and architectural influence
  • Help shape the evaluation standards for hybrid AI systems combining:
    • LLM agents
    • Classical NLP
    • Document pipelines
    • Voice agents
    • Multi-step workflows

Application Instructions

Submit everything here: https://www.neuronhire.com/join-pool-form

Please include:

  • Your résumé/CV highlighting document processing, NLP, evaluation systems, agent workflows, or conversational AI
  • Links to repositories or documentation showing:
    • NER evaluation pipelines
    • Extraction benchmarking systems
    • Agent regression frameworks
    • Evaluation tooling
  • Your availability and compensation expectations
  • Brief description of a complex document-processing, agent, or voice system failure you diagnosed and how you evaluated or debugged it (optional but highly valued)

Ready to apply?

Send us your info and we'll reach out within 2 business days.

Apply Now