**The Mission** We are seeking a versatile developer to help us push the boundaries of LLM capabilities. In this role, you won't just be writing code; you will be designing the "Gold Standard" benchmarks used to evaluate how AI models reason, execute, and solve problems in real-world technical enviro