Quality Lead
About the role
You will own Whym’s recommendation quality end to end. Every suggestion we put in front of a user passes through your systems. You will build the data pipeline that feeds the engine, the ranking that orders what we show, and the evaluation that tells us whether it works.
Quality is central to the product’s promise. Users respond to the concept. Whether they stay depends on whether each suggestion lands.
This is a tiny startup. Everyone does customer support, including the CEO and CTO. You will too. Some weeks you will debug a ranker. Other weeks you will review user feedback, build a labeling UI, or sit with a tester to understand why a suggestion missed. You should find that appealing, not beneath you.
What you will work on
- Ranking and relevance. Design and ship the algorithms that decide what Whym suggests to whom, and in what order. Own the full stack: retrieval, candidate generation, ranking, and freshness. Iterate based on eval signals and user feedback. Today’s ranker is an LLM self-score placeholder. You will replace it with something learned.
- Evaluation. Build the framework that tells us whether a quality change helped or hurt. Design rubrics. Run human eval. Operate LLM-as-judge at scale. Maintain offline test sets. Set up A/B tests. Build agentic harnesses for optimization. Create the metrics dashboards the team lives by. Without eval, quality work is guesswork. You will make sure we are not guessing.
- Data pipeline and indexing. Own how activity data enters Whym. You will design the index, decide what we own versus fetch on demand, and expand source coverage as we open new markets. Scraping, integration, and data partnerships run through this function.
- Quality analytics. Answer the hard questions with data. Where are we failing? Which suggestions convert? What does a good session look like? The analytics muscle sits next to ranking and eval because all three only work when they feed each other.
What we’re looking for
- 8+ years in engineering, with meaningful time on search, feed, recommender, or ranking systems at consumer scale.
- You have shipped ranking or retrieval systems that real users depend on. We want to hear about specific decisions you made, what you measured, and how the system improved over time.
- Strong data science and ML foundations. You can design an experiment, reason about metrics, and tell a good signal from a noisy one.
- Hands-on with evaluation. You know how to build test sets, write rubrics, and run both human and LLM-based eval. You have opinions about what breaks when you try to scale eval, and how to fix it.
- AI-native. You use modern AI tools in your daily work. You have run LLM-as-judge pipelines and understand their failure modes. You adopt new techniques before they are mainstream.
- Fluent with data infrastructure. You can design a pipeline, debug a schema, and argue for or against owning an index. You do not need a platform team to get work done.
- Strong product instinct. You connect ranking decisions to user outcomes. You know which metric matters and can say why in plain language.
- Clear writer and communicator. You document decisions. You present results without spin. You disagree without posturing. You know what AI slop looks like and you do not turn it in as final work.
- Comfortable with ambiguity. You thrive at early stage, where the definition of quality is itself evolving.
- High agency. You notice what needs doing and do it without waiting to be asked.
Strong pluses
- Experience at Google Search, YouTube, TikTok, Instagram, Netflix, Spotify, or another recommender-driven product at scale.
- Background building evaluation frameworks from scratch, not just operating an existing one.
- Experience with local discovery, location-based products, or anything involving fresh real-world inventory.
- Published work on ranking, recommendations, or evaluation.
- Experience managing agentic or LLM-driven systems in production.
- A point of view on how modern AI changes what ranking and evaluation look like.
- Experience at an early-stage startup where you wore many hats.
- Systems thinker. You build processes that scale. You document what you build.
Location and working model
Whym is a hybrid team based in the Bay Area, California, and Bend, Oregon. We work together in person three days a week and reserve the other days for focused remote work. We prefer candidates who can work with us at our hubs, but will make room for exceptional candidates who want to be fully remote. Bend-based and remote teammates travel to the Bay Area regularly so the whole team can spend time together.
About Whym
Whym is an AI-powered app that helps people make time for what matters most. We suggest activities aligned with your values and help you coordinate plans with close friends. We are a two-person founding team with experience from Google, LinkedIn, Lyft, and earlier-stage startups. Whym is a Delaware C-Corp Public Benefit Corporation, pre-seed.
Interested?
Send a short note about your relevant work to jobs@whym.co.