Role Overview

We are hiring highly analytical and research-oriented professionals with strong academic or industry research backgrounds to contribute to advanced AI benchmark development. This role focuses on building complex tasks that require reading, analyzing, and synthesizing large-scale document collections across multiple domains.

The ideal candidate can design structured research tasks, create accurate ground-truth datasets, and develop evaluation frameworks for multi-agent AI systems.

Key Responsibilities

Build benchmark tasks requiring deep reading, reasoning, and synthesis across large document collections
Curate real-world research corpora such as academic papers, case studies, and technical reports
Design complex questions requiring comprehensive multi-source analysis
Create structured ground-truth oracles in JSON with precise and verifiable answers
Develop LLM judge prompts to evaluate outputs field-by-field against defined oracles
Create decomposition guides for parallel multi-agent research workflows
Support improvement of AI systems through high-quality benchmark design
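
As an illustration of the oracle and field-by-field evaluation work listed above, here is a minimal sketch of a deterministic check against a JSON oracle. It is not the team's actual tooling: the function name `compare_to_oracle`, the field names, and the sample values are all hypothetical, and a real LLM judge prompt would handle free-text fields that exact comparison cannot.

```python
import json
import math


def _is_number(value):
    # bool is a subclass of int in Python, so exclude it explicitly.
    return isinstance(value, (int, float)) and not isinstance(value, bool)


def compare_to_oracle(oracle, output, rel_tol=0.0):
    """Return a per-field verdict: 'match', 'mismatch', or 'missing'.

    Hypothetical helper: only fields present in the oracle are scored.
    """
    verdicts = {}
    for field, expected in oracle.items():
        if field not in output:
            verdicts[field] = "missing"
            continue
        actual = output[field]
        if _is_number(expected) and _is_number(actual):
            # Exact by default; pass rel_tol to allow rounding differences.
            ok = math.isclose(expected, actual, rel_tol=rel_tol)
        elif isinstance(expected, str) and isinstance(actual, str):
            # Ignore case and surrounding whitespace for short string answers.
            ok = expected.strip().casefold() == actual.strip().casefold()
        else:
            ok = expected == actual
        verdicts[field] = "match" if ok else "mismatch"
    return verdicts


# Hypothetical oracle and model output for a single benchmark question.
oracle = json.loads(
    '{"sample_size": 412, "primary_outcome": "Mortality", "p_value": 0.03}'
)
candidate = json.loads('{"sample_size": 412, "primary_outcome": "mortality "}')

verdicts = compare_to_oracle(oracle, candidate)
print(verdicts)
```

Scoring each field separately, rather than the whole answer at once, makes it clear which part of a multi-source answer an agent got wrong.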

Mandatory Requirements

Research Background

5+ years of academic or industry research experience in any scientific or technical domain
Strong reading comprehension and ability to extract structured insights from unstructured text
High attention to detail, with a focus on exact values and factual accuracy

Technical Skills

Experience with JSON and structured data design
Ability to create schemas and validate output formats
Python scripting skills for judge scripts, automation, and data processing
Experience with AI coding benchmarks such as SWE-bench or Terminal-Bench
Comfortable with Docker, including Dockerfiles, image builds, and debugging container environments

Preferred Qualifications

Experience with systematic reviews, meta-analyses, or literature surveys
Familiarity with medical, legal, or scientific document analysis
Experience in NLP or information extraction tasks
Knowledge of LLM evaluation benchmarks such as MMLU, GPQA, or SimpleQA
Experience curating datasets for AI evaluation

