Research
Data Scientist - AI Benchmarks
Remote (Worldwide) · Full-time · Remote OK
Position Filled · Closed August 30, 2025
About This Role
We sought a Data Scientist to design and implement evaluation frameworks for our AI systems. This research-focused role was essential in measuring and improving AI performance across various use cases in the travel and airline industry.
You would have been responsible for answering critical questions: How good is our AI really? How do we measure "good"? How do we systematically improve? This role required both rigorous scientific thinking and practical engineering skills.
What You Would Do
- Design comprehensive benchmarks for evaluating LLM and AI agent performance
- Develop custom metrics and scoring systems tailored to our specific use cases (booking assistance, customer service, content generation)
- Build automated evaluation pipelines that run continuously across model versions
- Analyze model performance across dimensions: accuracy, latency, cost, safety, and user satisfaction
- Conduct A/B tests and statistical analyses to measure real-world impact of AI improvements
- Create reproducible evaluation datasets and maintain benchmark integrity
- Research and implement emerging evaluation methodologies from academic literature
- Collaborate with engineering on model selection, fine-tuning decisions, and prompt optimization
- Present findings and recommendations to technical and non-technical stakeholders
- Contribute to internal and external publications on AI evaluation
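To make the evaluation-pipeline responsibility concrete, here is a minimal sketch of an automated scoring harness. The dataset, metric, and airline-flavored examples are invented for illustration; a real benchmark would cover more dimensions (latency, cost, safety) and far more samples.

```python
def exact_match(prediction: str, reference: str) -> bool:
    """Case-insensitive exact-match metric for short-answer tasks."""
    return prediction.strip().lower() == reference.strip().lower()

def evaluate(outputs: list[str], references: list[str]) -> dict:
    """Score a batch of model outputs against a reference set."""
    if len(outputs) != len(references):
        raise ValueError("outputs and references must align")
    hits = sum(exact_match(o, r) for o, r in zip(outputs, references))
    return {"n": len(outputs), "accuracy": hits / len(outputs)}

# Toy benchmark run: two of the three answers match their references.
report = evaluate(
    ["Flight AA100 departs at 09:05", "Gate B12", "No checked bags"],
    ["flight aa100 departs at 09:05", "Gate B12", "One checked bag"],
)
print(report)
```

Running a harness like this across model versions, with the same frozen reference set, is what keeps benchmark results comparable over time.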
Requirements
- Strong background in statistics, mathematics, data science, or related quantitative field
- Experience designing and implementing ML evaluation methodologies
- Proficiency in Python with data analysis libraries (pandas, numpy, scipy)
- Understanding of LLM capabilities, limitations, and failure modes
- Experience with experimental design and statistical hypothesis testing
- Excellent analytical thinking and documentation skills
- Ability to communicate complex findings clearly
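As an example of the statistical hypothesis testing this role relied on, the sketch below runs a chi-square test on a hypothetical A/B rollout: did a candidate assistant resolve more bookings than the control? The counts are invented; in practice you would also check effect size and confidence intervals, not just the p-value.

```python
from scipy.stats import chi2_contingency

# Hypothetical rollout counts: resolved vs. unresolved bookings per variant.
a_resolved, a_total = 420, 1000   # control model
b_resolved, b_total = 480, 1000   # candidate model

table = [
    [a_resolved, a_total - a_resolved],
    [b_resolved, b_total - b_resolved],
]
chi2, p_value, dof, _ = chi2_contingency(table)
print(f"chi2={chi2:.2f}, dof={dof}, p={p_value:.4f}")
if p_value < 0.05:
    print("Reject H0: resolution rates differ between variants.")
```

With 1,000 sessions per arm, a six-point difference in resolution rate is comfortably significant at the 0.05 level; with smaller samples the same gap might not be.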
Nice to Have
- Published research in AI/ML evaluation, NLP, or related fields
- Experience with human evaluation studies and annotation
- Knowledge of prompt engineering and LLM optimization techniques
- Familiarity with ML experiment tracking tools (MLflow, Weights & Biases)
- Background in information retrieval or search quality evaluation
What We Offered
- Competitive salary in USD
- Fully remote position with asynchronous work culture
- Research publication opportunities and conference attendance
- Annual conference and learning budget
- Flexible schedule to accommodate deep work
- Opportunity to shape how the industry evaluates AI systems
This position has been filled. Check our careers page for current openings.