Mindrift is seeking a detail-oriented professional to support the evaluation and improvement of large language model agents through structured testing and analysis. In this role, you will design realistic scenarios that reflect human workflows and define clear success criteria for assessing AI-generated behavior. You will collaborate on AI-focused projects, ensuring evaluation tasks are precise, repeatable, and aligned with production standards.
Responsibilities:
- Design structured test cases that simulate complex, real-world human tasks.
- Define gold-standard behaviors and scoring logic for agent evaluation (see the sketch after this list).
- Review and analyze agent logs, decision paths, and failure modes.
- Work with repositories and testing frameworks to validate scenarios.
- Refine prompts, instructions, and test cases for clarity and difficulty.
- Ensure evaluation scenarios are reusable, scalable, and production-ready.
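For context, the snippet below is a minimal sketch of what a structured test case with simple scoring logic might look like. The field names, keywords, and scoring rule are illustrative assumptions only, not Mindrift's actual schema or tooling.

```python
# A minimal, hypothetical sketch -- field names, keywords, and the scoring rule
# are illustrative assumptions, not an actual Mindrift schema or framework.
import json

test_case = {
    "id": "calendar-reschedule-001",
    "scenario": "User asks the agent to move a meeting to Tuesday at 10:00.",
    "gold_behavior": "Agent confirms the new time and notifies all attendees.",
    "success_keywords": ["tuesday", "10:00", "notif"],  # illustrative criteria
}

def score(agent_output: str, keywords: list[str]) -> float:
    """Toy scoring logic: fraction of expected keywords found in the agent's output."""
    text = agent_output.lower()
    return sum(kw in text for kw in keywords) / len(keywords)

if __name__ == "__main__":
    print(json.dumps(test_case, indent=2))
    sample = "Meeting moved to Tuesday at 10:00; all attendees have been notified."
    print(f"Score: {score(sample, test_case['success_keywords']):.2f}")  # -> 1.00
```

In practice, test cases of this kind are typically stored in structured formats such as JSON or YAML so they remain reusable and easy to validate within a testing framework.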
Requirements:
- Bachelor’s or Master’s degree in a relevant technical or analytical field.
- Background in quality assurance, software testing, data analysis, or NLP.
- Strong understanding of test design principles and edge case coverage.
- Excellent written English communication skills.
- Ability to work with structured formats such as JSON or YAML.
- Basic experience with Python and JavaScript.
Benefits:
- Flexible, remote freelance work schedule.
- Competitive hourly compensation based on skills and experience.
- Hands-on experience with advanced AI systems.
- Opportunity to build a strong portfolio in AI evaluation.
This role offers a unique opportunity to contribute directly to the development of responsible and high-quality AI technologies.