Two years of end-to-end UX research spanning brand transition, information architecture, design systems, data visualization, the AI Health Coach, agentic memory critical user journeys (CUJs), conversational quality evaluation, and a general availability launch to 13M+ users.
When I joined the Google Health team, Fitbit was one of the most recognized health and fitness brands in the world, and Google was about to retire it. The Fitbit app would become the Google Health app, a centralized health destination built on Gemini-powered AI, with an entirely new brand identity, information architecture, and product vision.
This wasn't a redesign. It was a full product transformation: new name, new logo, new navigation structure, new AI capabilities, and a new contract with users about what a health app could do for them. My research sat at the center of that transformation, from the earliest brand validation and information architecture (IA) studies, through tiered internal testing, to the agentic AI evaluation pipelines that shaped the Google Health Coach at launch.
The research challenge: How do you validate a brand transition that millions of people feel ownership over, test a new AI that doesn't yet exist, and maintain research velocity across a product team running on sprint cadences — all without compromising the rigor that a regulated health product demands?
The research program wasn't a single study — it was a sustained, phased effort that evolved as the product evolved. Each track fed the next, and several ran concurrently during the most intensive build phases.
The most novel research challenge on this project was evaluating an agentic AI system — one that reasons, plans, and takes actions across multiple turns — in a health context where the stakes of a poor response are high. Standard usability testing wasn't sufficient. We needed evaluation infrastructure that could scale with the product's development cadence.
I led the development of a conversational quality codebook — a structured taxonomy of what "good" looks like for the Health Coach across response types: health guidance, goal setting, progress reflection, and safety edge cases. This became the foundation for human evaluation protocols and, eventually, the autorater pipelines that could assess AI responses at scale without requiring a researcher in every loop.
The codebook informed prompt engineering decisions, shaped the evaluative pipeline architecture, and provided the KPI framework used to benchmark the Coach at launch — connecting UX research directly to the model's development roadmap.
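As an illustrative sketch only (the actual codebook is under NDA and is not reproduced here), a codebook of this kind can be represented as structured data, and one common check that human raters apply it consistently is Cohen's kappa. Only the four response types come from the description above; the criterion names, the pass/fail labels, and the `cohen_kappa` helper are hypothetical assumptions.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class Criterion:
    """One codebook entry: what raters look for in a response."""
    name: str
    description: str


# Response types are taken from the case study; every criterion is a
# hypothetical placeholder, not a real codebook entry.
CODEBOOK = {
    "health_guidance": [
        Criterion("grounded_in_history", "References relevant prior user context"),
    ],
    "goal_setting": [
        Criterion("actionable", "Proposes a concrete, achievable next step"),
    ],
    "progress_reflection": [
        Criterion("accurate_summary", "Reflects the user's data without distortion"),
    ],
    "safety_edge_case": [
        Criterion("safe_escalation", "Directs the user to appropriate care"),
    ],
}


def cohen_kappa(rater_a, rater_b):
    """Chance-corrected agreement between two raters on the same items."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("raters must label the same non-empty set of items")
    n = len(rater_a)
    # Share of items on which both raters chose the same label.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    counts_a, counts_b = Counter(rater_a), Counter(rater_b)
    # Agreement expected by chance, from each rater's label frequencies.
    labels = counts_a.keys() | counts_b.keys()
    expected = sum(counts_a[label] * counts_b[label] for label in labels) / (n * n)
    if expected == 1.0:
        # Both raters used a single identical label; agreement is trivially perfect.
        return 1.0
    return (observed - expected) / (1 - expected)


# Hypothetical labels from two raters scoring the same four responses
# against a single criterion.
rater_a = ["pass", "pass", "fail", "fail"]
rater_b = ["pass", "fail", "fail", "fail"]
kappa = cohen_kappa(rater_a, rater_b)
```

With these sample labels the raters agree on 3 of 4 items (0.75) against a chance agreement of 0.5, so `kappa` is 0.5. Under this assumed workflow, a low kappa on a criterion would suggest its definition needs tightening before it is handed to an automated rater.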
The insight that changed the direction: Early evaluation revealed that users didn't just want accurate responses — they wanted responses that felt grounded in their history. A technically correct answer from an AI that seemed to have forgotten last week's conversation scored lower on trust than a slightly less precise answer that referenced prior context. This finding directly shaped the memory CUJ prioritization.
The Google Health app launched to general availability on May 19, 2026 — automatically updating for all existing Fitbit users worldwide, bringing the best of Fitbit's tracking heritage together with Gemini-powered AI health coaching. Research shaped every layer of that experience: the brand that users trusted enough to accept, the IA that made their data legible, the AI Coach that felt personal rather than generic, and the quality frameworks that kept it honest.
Specific research artifacts, study designs, participant data, and internal findings are protected under NDA. Screenshots, prototypes, and additional detail are available for discussion in a portfolio review. Reach out to set that up.
The most important shift I made on this project was moving from thinking about research as studying a product to thinking about research as building evaluation infrastructure. When the product is an agentic AI system that changes with every model update, one-time studies aren't enough. You need frameworks — codebooks, pipelines, rubrics — that can assess quality continuously as the product evolves.
The conversational quality codebook was the most impactful artifact I built here, not because it answered a research question, but because it gave the entire team a shared definition of "good" — one that engineering could implement, product could prioritize against, and researchers could evaluate consistently. That's the kind of infrastructure that outlasts any individual study.
The other lesson: in a health context, trust is not a feature. It's the precondition for everything else. Users didn't engage with the Health Coach's capabilities until they trusted it wouldn't mislead them. Research's job was to find exactly where that trust broke — and fix it before launch.