Role Overview
We are hiring analytical, research-oriented professionals with strong academic or industry research backgrounds to contribute to advanced AI benchmark development. The role focuses on building complex tasks that require reading, analyzing, and synthesizing large document collections across multiple domains.
The ideal candidate can design structured research tasks, create accurate ground-truth datasets, and develop evaluation frameworks for multi-agent AI systems.
Key Responsibilities
- Build benchmark tasks requiring deep reading, reasoning, and synthesis across large document collections
- Curate real-world research corpora such as academic papers, case studies, and technical reports
- Design complex questions requiring comprehensive multi-source analysis
- Create structured ground-truth oracles in JSON with precise and verifiable answers
- Develop LLM judge prompts to evaluate outputs field-by-field against defined oracles
- Create decomposition guides for parallel multi-agent research workflows
- Support the improvement of AI systems through high-quality benchmark design
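To give a sense of the oracle and field-by-field judging work listed above, here is a minimal sketch in Python. The question ID, field names, values, and the `score_fields` helper are all hypothetical and are not taken from any actual benchmark. It covers only exact and numeric matching; fields that need semantic comparison would typically go to an LLM judge prompt instead.

```python
import json
import math

# Hypothetical ground-truth oracle for one benchmark question.
# All field names and values are illustrative assumptions.
ORACLE_JSON = """
{
  "question_id": "example-001",
  "fields": {
    "sample_size": 128,
    "primary_outcome": "reduction in error rate",
    "effect_size": 0.42
  }
}
"""


def normalize(value):
    """Lowercase strings and collapse whitespace; leave other types unchanged."""
    if isinstance(value, str):
        return " ".join(value.lower().split())
    return value


def score_fields(oracle_fields, answer_fields, rel_tol=1e-6):
    """Compare an answer to the oracle one field at a time.

    Returns a dict mapping each oracle field name to True (match) or False.
    A field missing from the answer counts as a mismatch.
    """
    results = {}
    for name, expected in oracle_fields.items():
        actual = answer_fields.get(name)
        is_number = isinstance(actual, (int, float)) and not isinstance(actual, bool)
        if isinstance(expected, float) and is_number:
            # Float fields are compared with a small relative tolerance.
            results[name] = math.isclose(expected, actual, rel_tol=rel_tol)
        else:
            # Everything else must match exactly after string normalization.
            results[name] = normalize(expected) == normalize(actual)
    return results


oracle = json.loads(ORACLE_JSON)["fields"]
answer = {
    "sample_size": 128,
    "primary_outcome": "Reduction in  error rate",
    "effect_size": 0.40,
}
print(score_fields(oracle, answer))
```

Reporting one result per field, rather than a single pass/fail, makes it easier to see which parts of an answer a system got wrong.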
Mandatory Requirements
Research Background
- 5+ years of academic or industry research experience in any scientific or technical domain
- Strong reading comprehension and ability to extract structured insights from unstructured text
- Close attention to detail, with a focus on exact values and factual accuracy
Technical Skills
- Experience with JSON and structured data design
- Ability to create schemas and validate output formats
- Python scripting skills for judge scripts, automation, and data processing
- Experience with AI coding benchmarks such as SWE-bench or Terminal-Bench
- Comfortable with Docker, including Dockerfiles, image builds, and debugging container environments
Preferred Qualifications
- Experience with systematic reviews, meta-analyses, or literature surveys
- Familiarity with medical, legal, or scientific document analysis
- Experience in NLP or information extraction tasks
- Knowledge of LLM evaluation benchmarks such as MMLU, GPQA, or SimpleQA
- Experience curating datasets for AI evaluation