
LH02 - Agentic Evals Engineer

Remote · Posted April 2, 2026


About the Project

We are partnering with a rapidly growing engineering organization (100+ engineers) building production-grade Agentic AI systems that power complex internal workflows and customer-facing products. These systems are built around LLM-driven agents orchestrated with LangGraph, integrated with MCP (Model Context Protocol) servers, backend services, and domain-specific tools. These are multi-step, stateful systems, not simple prompt-response applications. A major focus area includes document-centric workflows, where agents must:

  • Parse and interpret structured and unstructured documents
  • Extract entities and structured data
  • Validate fields and business logic
  • Route decisions based on extracted information
  • Operate across multi-step, stateful workflows
  • Trigger downstream tools and services
  • Maintain state across long-running workflows

These systems often combine LLMs with traditional NLP models (NER, classifiers, rule-based validators, OCR pipelines, etc.). As they scale, evaluation becomes mission-critical, especially for correctness, consistency, reliability, and regression protection across complex workflows.

This role focuses on designing and operating robust evaluation frameworks for agentic systems, document-processing pipelines, and voice-based agents in production. It is not a research-only role and is not limited to prompt grading: it is a systems-level evaluation engineering role deeply integrated with orchestration, structured extraction, NLP validation, voice agents, and production reliability. This is a long-term, high-impact position with direct influence over how quality is defined and enforced across AI-powered systems.

Role and Responsibilities

Agentic System Evaluation

Design and implement evaluation frameworks for multi-step agent workflows. Agent evaluation differs from traditional LLM evaluation because agents must be evaluated across decision sequences, tool calls, and workflow trajectories, not just final outputs. You will work on:

  • Evaluation frameworks for multi-step agent workflows
  • Tool invocation correctness
  • State transitions and memory behavior
  • Planner / executor coordination
  • End-to-end task completion validation
  • Trajectory evaluation (sequence of agent actions)
  • Deterministic replay and trace-based debugging
  • Structured output validation
  • Failure classification and root cause analysis
  • Automated regression testing pipelines for LangGraph systems
  • Scenario-based testing for long-running workflows
  • Evaluation datasets and scenario generation
  • Measuring agent reliability over time

You will help define what it means for an agent system to be correct, not just whether a single response looks good.
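As a minimal illustration (not the team's actual framework; all names here are invented), trajectory evaluation can start as simply as comparing an agent's recorded tool-call sequence against an expected trajectory and reporting where the run first diverged:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Step:
    """One recorded agent action: which tool was called, with what args."""
    tool: str
    args: tuple

def trajectory_match(actual: list, expected: list) -> dict:
    """Score one agent run against an expected trajectory.

    Returns an exact-match flag plus the first divergence index, which
    is usually where debugging a multi-step workflow should begin.
    """
    prefix = 0
    for a, e in zip(actual, expected):
        if a != e:
            break
        prefix += 1
    exact = actual == expected
    return {
        "exact": exact,
        "divergence_step": None if exact else prefix,
        "expected_len": len(expected),
        "actual_len": len(actual),
    }

# Hypothetical run: the agent skipped the validation tool call.
expected = [Step("parse_doc", ("invoice.pdf",)),
            Step("extract_fields", ("invoice",)),
            Step("validate_fields", ("invoice",))]
actual = [Step("parse_doc", ("invoice.pdf",)),
          Step("extract_fields", ("invoice",))]
result = trajectory_match(actual, expected)
# result reports exact=False, diverging at step 2 (validate_fields never ran)
```

A production harness would score richer trajectory properties (state transitions, memory reads/writes, partial-credit orderings), but they build on the same compare-against-expected-sequence core.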

Document Processing & NLP Evaluation

A major part of this role focuses on document ingestion and document-processing pipelines. You will design evaluation frameworks for systems that perform:

  • Document parsing
  • Entity extraction (NER)
  • Field extraction and normalization
  • Document classification
  • Schema validation
  • Business rule validation
  • Cross-document consistency checks
  • OCR pipeline evaluation
  • Structured output extraction using LLMs
  • Hybrid systems combining:
    • LLM extraction
    • Traditional NLP models
    • Rules and validators
    • Heuristics and confidence thresholds

You will build evaluation harnesses that compare:
  • LLM-based extraction
  • Traditional NLP models (NER models)
  • Hybrid rule + model systems
  • OCR + extraction pipelines
  • Multiple prompts or extraction strategies

You will implement evaluation metrics such as:
  • Precision / Recall / F1 for entity extraction
  • Field-level accuracy scoring
  • Document-level extraction accuracy
  • Schema conformance validation
  • Cross-field validation
  • Cross-document consistency
  • Confidence calibration
  • Error categorization
  • Drift detection over time

You will also help develop:
  • Gold datasets and labeling standards
  • Adversarial and edge-case document test sets (messy PDFs, OCR noise, ambiguous fields, conflicting values)
  • Evaluation datasets representing real production documents
  • Regression test suites for document workflows

Many workflows will involve pipelines where:
  • OCR → NER → Extraction → Validation → Agent decision → Tool call → Workflow routing
  • Extracted entities feed into agent reasoning
  • Business logic depends on extraction confidence
  • Incorrect extraction causes workflow failures downstream

You will help evaluate the entire pipeline, not just individual models.
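By way of a self-contained sketch (the entity tuples and sample data are invented for illustration), entity-level precision/recall/F1 for extraction reduces to set comparisons between predicted and gold entities:

```python
def entity_prf(gold: set, pred: set) -> dict:
    """Entity-level precision/recall/F1 via exact-match set comparison.

    Each entity is a hashable tuple, e.g. (text, label) or
    (start, end, label); partial-credit and type-relaxed variants
    build on the same true-positive counting.
    """
    tp = len(gold & pred)  # entities predicted exactly as annotated
    precision = tp / len(pred) if pred else 0.0
    recall = tp / len(gold) if gold else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {"precision": precision, "recall": recall, "f1": f1}

# Hypothetical gold annotations vs. model output for one document
gold = {("Acme Corp", "ORG"), ("2026-04-02", "DATE"), ("$1,200.00", "MONEY")}
pred = {("Acme Corp", "ORG"), ("$1,200.00", "MONEY"), ("Acme", "ORG")}
scores = entity_prf(gold, pred)
# 2 true positives out of 3 predicted and 3 gold: P = R = F1 = 2/3
```

Note the boundary-mismatch case (`"Acme"` vs. `"Acme Corp"`) counts as both a false positive and a false negative under exact match, which is exactly the kind of error-categorization decision a document-evaluation harness has to make explicit.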

Voice Agent & Conversational System Evaluation

In addition to document and agent workflow evaluation, this role will also support evaluation of voice-based agentic systems. We evaluate voice agents using platforms such as Hamming.ai, which provides automated testing, simulated calls, regression testing, and production monitoring for AI voice agents. Voice agent evaluation includes:

  • Monitoring production calls for agent behavior correctness
  • Evaluating multi-turn conversation flows
  • Task completion success rates
  • Detecting when agents go off-script
  • Evaluating tool usage during calls
  • Latency and interruption handling
  • ASR / transcription error impact
  • Conversation state tracking
  • Dialogue planning failures
  • Regression testing using simulated calls
  • Converting failed production calls into regression tests
  • Tracking business outcomes from calls

Voice agent testing platforms can simulate thousands of calls with different personas, accents, noise conditions, and scenarios to detect failures before deployment. This role may involve building:
  • Call evaluation metrics
  • Conversation scoring frameworks
  • Regression testing pipelines for voice agents
  • Production monitoring dashboards for conversational quality
  • Failure classification systems for voice agent errors
  • Conversation trajectory evaluation
  • Call outcome classification and scoring

This work is similar to agent evaluation but applied to real-time conversational systems operating over voice, which introduces additional failure modes such as speech recognition errors, latency, interruptions, and dialogue management issues.
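To make call scoring concrete, here is a deliberately simplified offline scorer (the slot names, forbidden phrases, and transcript format are all invented assumptions, not any platform's API) that checks task completion and off-script behavior for one transcript:

```python
import re

REQUIRED_SLOTS = {"appointment_date", "callback_number"}  # hypothetical task
FORBIDDEN = [re.compile(r"\bguarantee\b", re.I)]  # invented off-script phrases

def score_call(transcript: list, extracted_slots: dict) -> dict:
    """Score one voice-agent call offline.

    transcript: [{"role": "agent" | "caller", "text": ...}, ...]
    Returns task-completion and off-script findings; a real harness
    would also fold in ASR confidence, latency, and interruptions.
    """
    missing = REQUIRED_SLOTS - extracted_slots.keys()
    off_script = [
        turn["text"]
        for turn in transcript
        if turn["role"] == "agent"
        and any(p.search(turn["text"]) for p in FORBIDDEN)
    ]
    return {
        "task_complete": not missing,
        "missing_slots": sorted(missing),
        "off_script_utterances": off_script,
        "agent_turns": sum(t["role"] == "agent" for t in transcript),
    }

# Hypothetical failed production call, replayed as a regression case
call = [
    {"role": "agent", "text": "Hi, when would you like to come in?"},
    {"role": "caller", "text": "Tuesday morning."},
    {"role": "agent", "text": "I can guarantee that slot is free."},
]
result = score_call(call, {"appointment_date": "Tuesday"})
# flags the missing callback_number and the off-script "guarantee" turn
```

Converting a failed production call into a fixture like `call` above is the regression-testing loop described earlier: once scored, the same transcript can gate future agent releases.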

Evaluation Infrastructure & Tooling

You will help build the infrastructure that prevents AI regressions in production systems. This includes:

  • Offline and batch evaluation pipelines
  • Evaluation datasets and scenario generators
  • Regression testing frameworks for agents and document pipelines
  • CI/CD integration for AI systems
  • Deterministic replay systems
  • Structured output validation pipelines
  • Evaluation dashboards
  • Drift detection systems
  • Model vs. model comparison systems
  • Production vs. staging comparison systems
  • Latency and cost tracking
  • Evaluation logging and observability
  • Failure classification and error taxonomy systems

You will help design hybrid evaluation approaches, including:
  • Rule-based validators
  • Deterministic scoring layers
  • LLM-as-judge systems
  • Confidence thresholding strategies
  • Ensemble evaluation methods
  • Human-in-the-loop review systems

You will work closely with agent and backend engineers to improve:
  • Determinism
  • Observability
  • Debuggability
  • Testability
  • Reliability of document workflows
  • Reliability of voice agents
  • Reliability of agent orchestration systems
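One common shape for the hybrid approaches listed above, sketched with invented field names and a stubbed judge (any real judge backend is an assumption): run cheap deterministic validators first, and only spend an LLM-as-judge call on records that survive the rules.

```python
from typing import Callable

def schema_ok(record: dict) -> bool:
    """Deterministic layer 1: required fields present with sane types."""
    return (isinstance(record.get("invoice_id"), str)
            and isinstance(record.get("total"), (int, float))
            and record["total"] >= 0)

def business_rules_ok(record: dict) -> bool:
    """Deterministic layer 2: line items must sum to the stated total."""
    return abs(sum(record.get("line_items", [])) - record.get("total", 0)) < 0.01

def evaluate(record: dict, llm_judge: Callable[[dict], bool]) -> dict:
    """Hybrid verdict with the stage that decided it, for error taxonomies."""
    if not schema_ok(record):
        return {"verdict": "fail", "stage": "schema"}
    if not business_rules_ok(record):
        return {"verdict": "fail", "stage": "business_rules"}
    # Only rule-clean records pay for (and can be blamed on) the judge.
    return {"verdict": "pass" if llm_judge(record) else "fail",
            "stage": "llm_judge"}

def stub_judge(record: dict) -> bool:
    # Stand-in for an LLM-as-judge call; always approves.
    return True

good = {"invoice_id": "INV-7", "total": 120.0, "line_items": [100.0, 20.0]}
bad = {"invoice_id": "INV-8", "total": 120.0, "line_items": [100.0]}
evaluate(good, stub_judge)  # reaches the judge and passes
evaluate(bad, stub_judge)   # fails at business_rules; judge never runs
```

Recording the deciding `stage` is what makes failure classification cheap later: every regression lands in a named bucket (schema, business rules, or judge disagreement) instead of a generic "eval failed".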

What We’re Looking For

Must-Haves

  • 4+ years of professional engineering experience in backend systems, ML systems, NLP, or AI infrastructure
  • Strong proficiency in Python
  • Experience working with:
    • LLM-based systems
    • Structured output extraction
    • Multi-step agent workflows
    • Document processing systems
  • Strong understanding of NLP fundamentals, including:
    • Named Entity Recognition (NER)
    • Classification models
    • Precision / Recall / F1 metrics
    • Dataset creation and evaluation methodology
  • Experience building evaluation pipelines for document extraction systems
  • Familiarity with structured validation tools (JSON Schema, Pydantic, etc.)
  • Experience designing regression testing systems for AI pipelines
  • Ability to distinguish between:
    • Model extraction failures
    • OCR noise issues
    • Prompt design flaws
    • Orchestration bugs
    • Tool failures
    • Business logic errors
    • Voice conversation failures
  • Strong analytical and debugging skills in ambiguous, non-deterministic systems

We Are Specifically Looking For Candidates Who Have

  • Evaluated document processing systems in production
  • Built entity-level scoring frameworks
  • Compared traditional NLP models vs LLM extraction systems
  • Designed hybrid validation systems (rules + models)
  • Diagnosed extraction or routing failures in multi-step AI workflows
  • Built gold datasets for structured document evaluation
  • Built evaluation frameworks for agent systems
  • Worked on voice agents, conversational AI, or call automation systems
  • Evaluated multi-turn conversations or task completion in conversational systems
  • Built regression testing systems for AI workflows

If your experience is limited to prompt iteration or academic NLP benchmarking without real-world document pipelines or agent workflows, this role is likely not the right fit.

Nice-to-Haves

  • Experience with OCR pipelines and noisy document inputs
  • Experience evaluating PDFs, invoices, contracts, medical or financial documents
  • Familiarity with LangGraph-based orchestration systems
  • Experience with MCP servers or tool-exposed agents
  • Experience in regulated environments (finance, healthcare, legal)
  • Exposure to human-in-the-loop evaluation systems
  • Experience designing model confidence calibration systems
  • Background in search, ranking, or information retrieval evaluation
  • Experience with conversational AI or voice agents
  • Experience evaluating multi-turn dialogues or call center automation systems

Why Join Us?

  • Define how AI system quality is measured in production
  • Work at the intersection of:
    • Agentic orchestration
    • NLP evaluation
    • Document processing
    • Voice agents
    • Backend systems
    • Production AI reliability
  • Build infrastructure that prevents silent failures in high-stakes workflows
  • High ownership and architectural influence
  • Help shape the evaluation standards for hybrid AI systems combining:
    • LLM agents
    • Classical NLP
    • Document pipelines
    • Voice agents
    • Multi-step workflows

Application Instructions

Submit everything here: https://www.neuronhire.com/join-pool-form

Please include:

  • Your résumé/CV highlighting document processing, NLP, evaluation systems, agent workflows, or conversational AI
  • Links to repositories or documentation showing:
    • NER evaluation pipelines
    • Extraction benchmarking systems
    • Agent regression frameworks
    • Evaluation tooling
  • Your availability and compensation expectations
  • Brief description of a complex document-processing, agent, or voice system failure you diagnosed and how you evaluated or debugged it (optional but highly valued)

Ready to apply?

Send us your info and we'll reach out within 2 business days.

Apply Now