i18n Evaluation Lead — Program & Engineering
Raindrops Technology
Job description
Quantryx.ai (Raindrops Technology) is hiring an i18n Evaluation Lead to own both program delivery and operations engineering for a multilingual AI evaluation engagement supporting a leading technology company's conversational AI product. This is a combined lead and hands-on engineering role: you'll manage a team of Linguistic QA Analysts while building and maintaining the automated evaluation pipeline that powers the entire operation.
What You'll Do
Program Leadership
Own end-to-end program delivery for multilingual evaluation across 7 locales
Serve as primary client interface with the technology partner's program team
Manage and mentor a team of Linguistic QA Analysts across Spanish (US & MX), French (Canada), Italian, Portuguese (Brazil), and Japanese
Design and maintain multi-dimension rating questionnaires and calibration protocols
Ensure inter-rater reliability standards are met across all dimensions and locales
Deliver headroom reports with actionable recommendations to client stakeholders
Conduct quarterly business reviews and rubric alignment sessions
Engineering & Tooling
Design, build, and maintain the end-to-end evaluation pipeline: query generation, model invocation, response capture, rating UI, and statistical analysis
Develop and operate the rating UI used by Linguistic QA Analysts
Implement automated headroom calculation, delta tracking, and report generation
Build real-time dashboards for quality scores, reliability metrics, and trend analysis
Manage data infrastructure: version-controlled query sets, encrypted eval data, audit logs
Implement blind double-rating protocols and adjudication workflows in tooling
Ensure pipeline scalability for large-volume multilingual evaluation runs
What We're Looking For
7+ years in NLP evaluation, i18n quality, or language technology — with at least 3 years in a leadership or program management capacity
Strong software engineering skills: Python and/or Node.js for pipeline automation
Experience managing multilingual evaluation or localization teams
Statistical literacy: hypothesis testing, confidence intervals, inter-rater reliability metrics (Cohen's kappa)
Experience building rating/annotation UIs or evaluation tooling for human assessment
Database design (SQL + NoSQL) and cloud infrastructure experience (GCP preferred)
Familiarity with CI/CD, data visualization frameworks, and dashboard tooling
Excellent client-facing communication and presentation skills
Bachelor's degree required; Master's in Linguistics, Computational Linguistics, CS, or related field preferred
Nice to Have
Experience with Google's evaluation methodologies or vendor program structures
Background in both program management and engineering (rare but ideal for this role)
Familiarity with multimodal AI evaluation