packages [5,11] provide off-the-shelf solutions. Custom
analysis of large-scale log data is made easier via modern
distributed systems [4,8] and specialized programming
languages [e.g. 12]. Web usage mining techniques can be
used to segment visitors to a site according to their behavior
[3]. Multiple vendors support rapid deployment and
analysis of user surveys, and some also provide software for
large-scale remote usability or benchmarking tests [e.g. 14].
A large body of work exists on the proper design and
analysis of controlled A/B tests [e.g. 10] where two similar
populations of users are given different user interfaces, and
their responses can be rigorously measured and compared.

Despite this progress, it can still be challenging to use these
tools effectively. Standard web analytics metrics may be
too generic to apply to a particular product goal or research
question. The sheer amount of data available can be
overwhelming, and it is necessary to scope out exactly what
to look for, and what actions will be taken as a result.
Several experts suggest a best practice of focusing on a
small number of key business or user goals, and using
metrics to help track progress towards them [2, 9, 10]. We
share this philosophy, but have found that this is often
easier said than done. Product teams have not always
agreed on or clearly articulated their goals, which makes
defining related metrics difficult.

It is clear that metrics should not stand alone. They should
be triangulated with findings from other sources, such as
usability studies and field studies [6,9], which leads to
better decision-making [15]. Also, they are primarily useful
for evaluation of launched products, and are not a substitute
for early or formative user research. We sought to create a
framework that would combine large-scale attitudinal and
behavioral data, and complement, not replace, existing user
experience research methods in use at our company.

PULSE METRICS

The most commonly used large-scale metrics are focused
on business or technical aspects of a product, and they (or
similar variations) are widely used by many organizations
to track overall product health. We call these PULSE
metrics: Page views, Uptime, Latency, Seven-day active
users (i.e. the number of unique users who used the product
at least once in the last week), and Earnings.
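
As a minimal illustration (not drawn from any particular analytics package), the seven-day active users count can be computed directly from access logs. The Python sketch below assumes hypothetical (timestamp, user_id) log records; the count_seven_day_actives helper and its field names are our own invention.

from datetime import datetime, timedelta

def count_seven_day_actives(log_records, as_of):
    """Count unique users seen in the 7 days ending at `as_of`.

    `log_records` is assumed to be an iterable of (timestamp, user_id)
    pairs, e.g. parsed out of server access logs.
    """
    window_start = as_of - timedelta(days=7)
    active = {user_id
              for timestamp, user_id in log_records
              if window_start <= timestamp < as_of}
    return len(active)

# Example with made-up records: repeat visits by the same user count once.
records = [
    (datetime(2010, 1, 4, 10, 0), "user_a"),
    (datetime(2010, 1, 5, 12, 30), "user_b"),
    (datetime(2010, 1, 6, 9, 15), "user_a"),
]
print(count_seven_day_actives(records, as_of=datetime(2010, 1, 8)))  # -> 2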

These metrics are all extremely important, and are related to
user experience – for example, a product that has a lot of
outages (low uptime) or is very slow (high latency) is
unlikely to attract users. An e-commerce site whose
purchasing flow has too many steps is likely to earn less
money. A product with an excellent user experience is
more likely to see increases in page views and unique users.

However, these are all either very low-level or indirect
metrics of user experience, making them problematic when
used to evaluate the impact of user interface changes. They
may also have ambiguous interpretation – for example, a
rise in page views for a particular feature may occur
because the feature is genuinely popular, or because a
confusing interface leads users to get lost in it, clicking
around to figure out how to escape. A change that brings in
more revenue in the short term may result in a poorer user
experience that drives away users in the longer term.

A count of unique users over a given time period, such as
seven-day active users, is commonly used as a metric of
user experience. It measures the overall volume of the user
base, but gives no insight into the users’ level of
commitment to the product, such as how frequently each of
them visited during the seven days. It also does not
differentiate between new users and returning users. In a
worst-case retention scenario of 100% turnover in the user
base from week to week, the count of seven-day active
users could still increase, in theory.
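
To make the turnover caveat concrete, consider the following small sketch (plain Python with invented user IDs): the seven-day active count grows from one week to the next even though not a single user from the first week returns, and only splitting the count into new versus returning users reveals this.

week_1 = {"u1", "u2", "u3"}          # unique users active in week 1
week_2 = {"u4", "u5", "u6", "u7"}    # unique users active in week 2

returning = week_2 & week_1          # users seen in both weeks
new_users = week_2 - week_1          # users first seen in week 2

print(len(week_1), len(week_2))      # 3 4  -- the headline count grows
print(len(returning))                # 0    -- week-over-week retention is zero
print(len(new_users))                # 4    -- all growth comes from new users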

HEART METRICS

Based on the shortcomings we saw in PULSE, both for
measuring user experience quality and for providing
actionable data, we created a complementary metrics
framework, HEART: Happiness, Engagement, Adoption,
Retention, and Task success. These are categories, from
which teams can then define the specific metrics that they
will use to track progress towards goals. The Happiness and
Task Success categories are generalized from existing user
experience metrics: Happiness incorporates satisfaction,
and Task Success incorporates both effectiveness and
efficiency. Engagement, Adoption, and Retention are new
categories, made possible by large-scale behavioral data.

The framework originated from our experiences of working
with teams to create and track user-centered metrics for
their products. We started to see patterns in the types of
metrics we were using or suggesting, and realized that
generalizing these into a framework would make the
principles more memorable, and usable by other teams.

It is not always appropriate to employ metrics from every
category, but referring to the framework helps to make an
explicit decision about including or excluding a particular
category. For example, Engagement may not be meaningful
in an enterprise context, if users are expected to use the
product as part of their work. In this case a team may
choose to focus more on Happiness or Task Success. But it
may still be meaningful to consider Engagement at a feature
level, rather than the overall product level.

Happiness

We use the term “Happiness” to describe metrics that are
attitudinal in nature. These relate to subjective aspects of
user experience, like satisfaction, visual appeal, likelihood
to recommend, and perceived ease of use. With a general,
well-designed survey, it is possible to track the same
metrics over time to see progress as changes are made.
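
As one way to picture this kind of tracking (the response format and 1-7 scale below are assumptions on our part, not a description of any specific survey tool), weekly satisfaction ratings can be averaged so that each release is compared against earlier weeks:

from collections import defaultdict
from statistics import mean

# Hypothetical survey responses: (ISO week, satisfaction rating on a 1-7 scale)
responses = [
    ("2010-W01", 5), ("2010-W01", 6), ("2010-W01", 4),
    ("2010-W02", 6), ("2010-W02", 6), ("2010-W02", 5),
]

by_week = defaultdict(list)
for week, rating in responses:
    by_week[week].append(rating)

for week in sorted(by_week):
    print(week, round(mean(by_week[week]), 2))   # 2010-W01 5.0, 2010-W02 5.67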

For example, our site has a personalized homepage,
iGoogle. The team tracks a number of metrics via a weekly
in-product survey, to understand the impact of changes and
