Research
Data Scientist - AI Benchmarks
Remote (Worldwide) · Full-time · Remote OK
Position Filled · Closed August 30, 2025
About This Role
We sought a Data Scientist to design and implement evaluation frameworks for our AI systems. This research-focused role was essential in measuring and improving AI performance across various use cases in the travel and airline industry.
You would have been responsible for answering critical questions: How good is our AI really? How do we measure "good"? How do we systematically improve? This role required both rigorous scientific thinking and practical engineering skills.
What You Would Do
- Design comprehensive benchmarks for evaluating LLM and AI agent performance
- Develop custom metrics and scoring systems tailored to our specific use cases (booking assistance, customer service, content generation)
- Build automated evaluation pipelines that run continuously across model versions
- Analyze model performance across dimensions: accuracy, latency, cost, safety, and user satisfaction
- Conduct A/B tests and statistical analyses to measure real-world impact of AI improvements
- Create reproducible evaluation datasets and maintain benchmark integrity
- Research and implement emerging evaluation methodologies from academic literature
- Collaborate with engineering on model selection, fine-tuning decisions, and prompt optimization
- Present findings and recommendations to technical and non-technical stakeholders
- Contribute to internal and external publications on AI evaluation
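To make the evaluation-pipeline responsibility concrete, here is a minimal sketch of an automated scoring harness. The dataset, metric, and airline-flavored examples are invented for illustration; a real benchmark would cover more dimensions (latency, cost, safety) and far more samples.

```python
def exact_match(prediction: str, reference: str) -> bool:
    """Case-insensitive exact-match metric for short-answer tasks."""
    return prediction.strip().lower() == reference.strip().lower()

def evaluate(outputs: list[str], references: list[str]) -> dict:
    """Score a batch of model outputs against a reference set."""
    if len(outputs) != len(references):
        raise ValueError("outputs and references must align")
    hits = sum(exact_match(o, r) for o, r in zip(outputs, references))
    return {"n": len(outputs), "accuracy": hits / len(outputs)}

# Toy benchmark run: two of the three answers match their references.
report = evaluate(
    ["Flight AA100 departs at 09:05", "Gate B12", "No checked bags"],
    ["flight aa100 departs at 09:05", "Gate B12", "One checked bag"],
)
print(report)
```

Running a harness like this across model versions, with the same frozen reference set, is what keeps benchmark results comparable over time.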
Requirements
- Strong background in statistics, mathematics, data science, or related quantitative field
- Experience designing and implementing ML evaluation methodologies
- Proficiency in Python with data analysis libraries (pandas, numpy, scipy)
- Understanding of LLM capabilities, limitations, and failure modes
- Experience with experimental design and statistical hypothesis testing
- Excellent analytical thinking and documentation skills
- Ability to communicate complex findings clearly
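As an example of the statistical hypothesis testing this role relied on, the sketch below runs a chi-square test on a hypothetical A/B rollout: did a candidate assistant resolve more bookings than the control? The counts are invented; in practice you would also check effect size and confidence intervals, not just the p-value.

```python
from scipy.stats import chi2_contingency

# Hypothetical rollout counts: resolved vs. unresolved bookings per variant.
a_resolved, a_total = 420, 1000   # control model
b_resolved, b_total = 480, 1000   # candidate model

table = [
    [a_resolved, a_total - a_resolved],
    [b_resolved, b_total - b_resolved],
]
chi2, p_value, dof, _ = chi2_contingency(table)
print(f"chi2={chi2:.2f}, dof={dof}, p={p_value:.4f}")
if p_value < 0.05:
    print("Reject H0: resolution rates differ between variants.")
```

With 1,000 sessions per arm, a six-point difference in resolution rate is comfortably significant at the 0.05 level; with smaller samples the same gap might not be.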
Nice to Have
- Published research in AI/ML evaluation, NLP, or related fields
- Experience with human evaluation studies and annotation
- Knowledge of prompt engineering and LLM optimization techniques
- Familiarity with ML experiment tracking tools (MLflow, Weights & Biases)
- Background in information retrieval or search quality evaluation
What We Offered
- Competitive salary in USD
- Fully remote position with asynchronous work culture
- Research publication opportunities and conference attendance
- Annual conference and learning budget
- Flexible schedule to accommodate deep work
- Opportunity to shape how the industry evaluates AI systems
This position has been filled. Check our careers page for current openings.