Google Health · 2024–2026 · 0 → GA Launch · 13M+ Users · Agentic AI

From Fitbit to Google Health. Leading research across a full product transition and GA launch

Two years of end-to-end UX research spanning brand transition, information architecture, design systems, data viz, AI Health Coach, agentic memory CUJs, conversational quality evaluation, and a 13M+ user general availability launch.

Google Health app four screens
Role
Senior Mixed-Methods UX Researcher
Timeline
Oct 2024 – Present
Methods
Longitudinal IDIs, concept testing, evaluative usability, agentic AI evaluation, user feedback analysis, internal testing
Impact
Fitbit → Google Health GA launch to 13M+ users; unblocked the rebrand; shaped the app IA, AI Health Coach CUJs, and agentic conversation quality frameworks
Context

The biggest rebrand in Google Health history & the research that made it possible

When I joined the Google Health team, Fitbit was one of the most recognized health and fitness brands in the world, and Google was about to retire it. The Fitbit app would become the Google Health app, a centralized health destination built on Gemini-powered AI, with an entirely new brand identity, information architecture, and product vision.

This wasn't a redesign. It was a full product transformation: new name, new logo, new navigation structure, new AI capabilities, and a new contract with users about what a health app could do for them. My research sat at the center of that transformation, from the earliest brand validation and IA studies through tiered internal testing, to the agentic AI evaluation pipelines that shaped the Google Health Coach at launch.

The research challenge: How do you validate a brand transition that millions of people feel ownership over, test a new AI that doesn't yet exist, and maintain research velocity across a product team running on sprint cadences — all without compromising the rigor that a regulated health product demands?

Google Health app — the four-tab layout (Today, Fitness, Sleep, Health) that research helped define. Source: Google Blog, May 2026.

Research program

Six research tracks across two years

The research program wasn't a single study — it was a sustained, phased effort that evolved as the product evolved. Each track fed the next, and several ran concurrently during the most intensive build phases.

Track 01
Brand transition & rebrand validation
Validated the Fitbit → Google Health transition across user segments. Concept testing on brand perception, trust transfer, and emotional response to the identity shift. Unblocked Director-level sign-off on the rebrand strategy.
Track 02
Today page & IA research
Foundational research on the new four-tab information architecture (Today, Fitness, Sleep, Health). Studied how users navigate, what they prioritize in the Today stream, and where the current IA created friction or confusion.
Track 03
Health Coach personalization & data viz dashboards
Longitudinal research on how users want the AI Health Coach to personalize over time. Visualization research for peace-of-mind health metrics dashboards — understanding which data needs to be immediately readable vs. exploratory.
Track 04
Agentic AI & conversational quality
Led evaluative research on agentic AI conversation dynamics. Developed a conversational quality codebook for the Health Coach. Designed and validated autorater pipelines to evaluate AI subagent responses at scale, informing KPI benchmarks for CSAT and adoption.
Track 05
User & agentic memory CUJs
Research into Critical User Journeys for memory — how users expect the AI to remember context, goals, and history across sessions. Studied the trust and transparency dynamics of agentic memory in a health context.
Track 06
Internal testing → public preview → GA
Managed tiered launch research across internal testing and public preview feedback analysis, triangulating signals across build versions, UI changes, and AI subagent behavior. Aligned research intake and capacity planning to the release cadence through GA launch.

Deep dive — agentic AI evaluation

Building the quality framework for an AI health coach

The most novel research challenge on this project was evaluating an agentic AI system — one that reasons, plans, and takes actions across multiple turns — in a health context where the stakes of a poor response are high. Standard usability testing wasn't sufficient. We needed evaluation infrastructure that could scale with the product's development cadence.

I led the development of a conversational quality codebook — a structured taxonomy of what "good" looks like for the Health Coach across response types: health guidance, goal setting, progress reflection, and safety edge cases. This became the foundation for human evaluation protocols and, eventually, the autorater pipelines that could assess AI responses at scale without requiring a researcher in every loop.

The codebook informed prompt engineering decisions, shaped the evaluative pipeline architecture, and provided the KPI framework used to benchmark the Coach at launch — connecting UX research directly to the model's development roadmap.
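As an illustration only, one common way to check an autorater against human evaluation is to measure chance-corrected agreement (Cohen's kappa) per codebook dimension. The sketch below is a hypothetical example, not the pipeline used on this project: the dimension names, labels, and sample ratings are all invented for the example.

```python
from collections import Counter

# Hypothetical codebook dimensions and ratings, invented for illustration.
# Each list holds one label per evaluated Health Coach response.
HUMAN_LABELS = {
    "grounded_in_history": ["pass", "pass", "fail", "fail"],
    "safety_handling": ["pass", "fail", "pass", "fail"],
}
AUTORATER_LABELS = {
    "grounded_in_history": ["pass", "pass", "fail", "pass"],
    "safety_handling": ["pass", "fail", "pass", "fail"],
}


def cohen_kappa(human, auto):
    """Chance-corrected agreement between two raters on the same items."""
    if len(human) != len(auto) or not human:
        raise ValueError("both raters must label the same non-empty item set")
    n = len(human)
    # Share of items where the two raters gave the same label.
    observed = sum(h == a for h, a in zip(human, auto)) / n
    # Agreement expected by chance, from each rater's label frequencies.
    h_counts, a_counts = Counter(human), Counter(auto)
    labels = set(human) | set(auto)
    expected = sum(h_counts[label] * a_counts[label] for label in labels) / (n * n)
    if expected == 1.0:
        # Both raters used a single identical label; treat as full agreement.
        return 1.0
    return (observed - expected) / (1 - expected)


def agreement_by_dimension(human, auto):
    """Kappa for every codebook dimension present in the human labels."""
    return {dim: cohen_kappa(human[dim], auto[dim]) for dim in human}


if __name__ == "__main__":
    for dim, kappa in agreement_by_dimension(HUMAN_LABELS, AUTORATER_LABELS).items():
        print(f"{dim}: kappa={kappa:.2f}")
```

Reporting agreement per dimension rather than as one overall number shows which parts of a codebook an autorater can be trusted on and which still need a human in the loop.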

The insight that changed the direction: Early evaluation revealed that users didn't just want accurate responses — they wanted responses that felt grounded in their history. A technically correct answer from an AI that seemed to have forgotten last week's conversation scored lower on trust than a slightly less precise answer that referenced prior context. This finding directly shaped the memory CUJ prioritization.


What research produced

Deliverables

Brand research
Concept testing report on the Fitbit → Google Health transition — validating brand perception, trust transfer, and unblocking rebrand approval at Director level.
IA research
Foundational study on the four-tab IA and Today stream — user mental models, navigation patterns, and priority hierarchy informing the final tab structure.
Convo quality codebook
Structured taxonomy defining quality dimensions for Health Coach responses — the foundation for human evaluation protocols and autorater pipeline development.
Autorater pipeline
Co-developed AI evaluation tooling to assess conversational quality across AI subagents at scale — establishing CSAT and adoption KPI benchmarks for launch.
Memory CUJ docs
Critical User Journey documentation for user and agentic memory — trust, transparency, and context expectations that shaped the Health Coach memory architecture.
Launch readiness
Tiered testing feedback synthesis across internal → public preview → GA — triangulating signals across build versions, UI changes, and subagent behavior.

Impact

Global rollout to 13M+ users

The Google Health app launched to general availability on May 19, 2026 — automatically updating for all existing Fitbit users worldwide, bringing the best of Fitbit's tracking heritage together with Gemini-powered AI health coaching. Research shaped every layer of that experience: the brand that users trusted enough to accept, the IA that made their data legible, the AI Coach that felt personal rather than generic, and the quality frameworks that kept it honest.

13M+
Users at GA launch — Fitbit's entire user base transitioned to Google Health
GA
Full public launch May 2026, following internal → public preview → GA research pipeline
6
Research tracks spanning brand, IA, AI coaching, memory, evaluation, and launch readiness

Specific research artifacts, study designs, participant data, and internal findings are protected under NDA. Screenshots, prototypes, and additional detail are available for discussion in a portfolio review. Reach out to set that up →



Reflection

What this project taught me about researching AI at scale

The most important shift I made on this project was moving from thinking about research as studying a product to thinking about research as building evaluation infrastructure. When the product is an agentic AI system that changes with every model update, one-time studies aren't enough. You need frameworks — codebooks, pipelines, rubrics — that can assess quality continuously as the product evolves.

The conversational quality codebook was the most impactful artifact I built here, not because it answered a research question, but because it gave the entire team a shared definition of "good" — one that engineering could implement, product could prioritize against, and researchers could evaluate consistently. That's the kind of infrastructure that outlasts any individual study.

The other lesson: in a health context, trust is not a feature. It's the precondition for everything else. Users didn't engage with the Health Coach's capabilities until they trusted it wouldn't mislead them. Research's job was to find exactly where that trust broke — and fix it before launch.
