Designing with Data: Improving the User Experience with A/B Testing
On the surface, design practices and data science may not seem like obvious partners. But these disciplines actually work toward the same goal, helping designers and product managers understand users so they can craft elegant digital experiences. While data can enhance design, design can bring deeper meaning to data.
This practical guide shows you how to conduct data-driven A/B testing for making design decisions on everything from small tweaks to large-scale UX concepts. Complete with real-world examples, this book shows you how to make data-driven design part of your product design workflow.
About the Author
Rochelle King is Global VP of Design and User Experience at Spotify, where she is responsible for the teams that oversee user research and craft the product experience. Prior to Spotify, Rochelle was VP of User Experience and Product Services at Netflix, where she managed the Design, Enhanced Content, Content Marketing, and Localization teams. Collectively, these groups were responsible for the UI, layout, metadata (editorial and visual assets), and presentation of the Netflix service internationally across all platforms. Rochelle has over 14 years of experience working on consumer-facing products. You can find her on Twitter @rochelleking.
Dr. Elizabeth Churchill is a Director of User Experience at Google. Her work focuses on the connected ecosystems of the Social Web and Internet of Things. For two decades, Elizabeth has been a research leader at well-known corporate R&D organizations including Fuji Xerox's research lab in Silicon Valley (FXPAL), the Palo Alto Research Center (PARC), eBay Research Labs in San Jose, and Yahoo! in Santa Clara, California.
Elizabeth has contributed groundbreaking research in a number of areas, publishing over 100 peer reviewed articles, coediting 5 books in HCI-related fields, contributing as a regular columnist for the Association of Computing Machinery's (ACM) Interactions magazine since 2008, and publishing an academic textbook, Foundations for Designing User-Centered Systems. She has also launched successful products, and has more than 50 patents granted or pending.
Caitlin Tan is a User Researcher at Spotify, and a recent graduate from MIT.
Most helpful customer reviews
15 of 15 people found the following review helpful.
Strengths and Weaknesses
By Ron Kohavi
The strength of this book is that it's written for designers, a group that sometimes considers A/B testing as "competing" with the creative process. The authors point out the complementary value and call the "genius designer" a myth. The weakness of the book is that the statistics are wrong at times, which may mislead readers.
I have been using A/B tests and more sophisticated controlled experiments for over a decade, including leading the ExP Platform at Microsoft, which is used to run over 12,000 experiment treatments/year. Some of my work is referenced in this book, so please take this review in the appropriate context.
Here are some key points I loved:
• Great observations, such as: "[Ensure] you're running meaningful learning tests rather than relying on A/B testing as a 'crutch' — that is, where you stop thinking carefully and critically about your overarching goal(s) and run tests blindly, just because you can."
• Nice quotations from multiple people doing A/B testing in the industry
• Good observations about insensitive metrics such as NPS, which take "significant change in experience and a long time to change what users think about a company." Another example, which is even more extreme, is stock price. You could run experiments and watch the stock ticker. Good luck with that insensitive metric.
• Good observation about metrics that "can't fail," such as clicks on a feature that didn't exist.
• Netflix found "a very strong correlation between viewing hours and retention....used viewing hours (or content consumption) as their strongest proxy metric for retention."
Coming up with short-term metrics predictive of long-term success is one of the hardest things.
• "Deviating significantly from your existing experience requires more resources and effort than making small iterations.”
• For those who "worry that A/B testing and using data in the design process might stifle creativity.... generating a large variety of different hypotheses prior to designing forces you and your team to be more creative."
• Nice references to Dan McKinley's observations that most features are killed for lack of usage, and that unexciting features, such as "emails to people who gave up in the midst of a purchase had much bigger potential impact to the business."
• "…changing something about the algorithm that increases response speed (e.g., content download on mobile devices or in getting search results); users see the same thing but the experience is more responsive, and feels smoother. Although these performance variables aren’t “visible” to the user and may not be part of visual design, these variables strongly influence the user experience."
Great point about the importance of performance and the fact that this cannot be measured in prototypes or sketches. We ran multiple "slowdown" experiments to measure the value of perf.
• Interesting discussion of "painted door" tests and the point that it's a questionable technique that misleads users. It's also unable to measure a key metric, repeat usage: once you slam into the painted door, you know not to do it again.
• Nice concept of "Experiment 0," an experiment to run before the one being planned.
• "inconclusive result doesn’t mean that you didn’t learn anything. You might have learned that the behavior you were targeting is in fact not as impactful as you were hoping for."
• An important point to remember "When analyzing and interpreting your results, remember that A/ B testing shows you behaviors but not why they occurred."
• “There is a difference between using data to make and inform decisions in one part of an organization versus having it be universally embraced by the entire organization.”
• "One could believe that a designer or product person who doesn’t know the right answer must not have enough experience. Actually it’s almost inversely true. Because I have some experience, I know that we don’t know the right answer until we test."
• "steer people away from using phrases like 'my idea is ...' and toward saying 'my hypothesis is...'"
• "one of the most important aspects of experimental work is triangulating with other sources and types of data."
• The book addresses ethics, rarely discussed
Here are some things I didn’t like:
• The book is verbose. I read the electronic version, but the paperback is 370 pages, giving a sense of the size.
• Very few surprising "eye-opening" examples. Several of the papers on exp-platform, such as the Rules of Thumb paper, and the Sept-Oct 2017 HBR article on experimentation have surprising examples showing the humbling value of A/B testing. The A/B Testing book by Siroker and Koomen has great examples.
• The authors fall into a common pitfall of misinterpreting p-values. For example, they write
o "a p-value helps quantify the probability of seeing differences observed in the data of your experiment simply by chance."
But p-value is a conditional probability, assuming the null (no difference).
o "p = 0.05 or less to be statistically significant. This means we have 95% confidence in our result."
This is wrong. The p-value is conditioned on the null hypothesis being true; it is not the probability that the observed effect is real.
o "A false positive is when you conclude that there is a difference between groups based on a test, when in fact there is no difference in the world.... This means that 5% of the time, we will have a false positive."
o "around 1 in 20 A/ B experiments will result in a false positive, and therefore a false learning! Worse yet, with every new treatment you add, your error rate will increase by another 5% so with 4 additional treatments, your error rate could be as high as 25%." Both halves are wrong. p-value of 0.05 does not equate to 5% false positive rate, and adding treatments does not linearly add 5%; it's 1-0.95^4 = 18%
o "But getting a p-value below that twice due to chance has a probability of much less than 1% — about 1 in every 400 times."
1/400 assumes you can multiply the two p-values. Need to use Fisher's combined probability test (meta-analysis)
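Both corrections can be checked numerically. The sketch below uses only the standard textbook formulas and the Python standard library; the closed-form chi-squared survival function is valid here because Fisher's statistic always has an even number of degrees of freedom (2 per p-value):

```python
import math

def family_wise_error_rate(alpha, k):
    """Probability of at least one false positive across k independent
    tests, each run at significance level alpha (assumes independence
    and that every null hypothesis is true)."""
    return 1 - (1 - alpha) ** k

def fisher_combined_p(p_values):
    """Fisher's combined probability test: X = -2 * sum(ln p_i) follows
    a chi-squared distribution with 2k degrees of freedom under the
    joint null. Uses the closed-form survival function for even df."""
    k = len(p_values)
    x = -2 * sum(math.log(p) for p in p_values)
    half = x / 2
    return math.exp(-half) * sum(half ** i / math.factorial(i) for i in range(k))

# Four additional treatments at alpha = 0.05: ~18.5%, not 25%.
print(round(family_wise_error_rate(0.05, 4), 3))   # 0.185

# Two independent p = 0.05 results: combined p ~ 0.017, not 1/400.
print(round(fisher_combined_p([0.05, 0.05]), 3))   # 0.017
```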
• "Sometimes, one metric is constrained by another. If you’re trying to evaluate your hypotheses on the basis of app open rate and app download rate, for instance, app download rate is the upper bound for app open rate because you must download the app in order to open it. This means that app open rate will require a bigger sample to measure, and you’ll need to have at least that big of a sample in your test."
The idea that constrained metrics require larger samples is wrong as phrased. Triggering to smaller populations is highly beneficial in practice. For example, if you make a change to the checkout process, analyze only users who started checking out. While the sample size is smaller, the average treatment effect is larger. Including users who provably have a zero treatment effect is always bad.
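A back-of-the-envelope power calculation makes the triggering point concrete. All rates below are made-up illustrative numbers, and the formula is the standard two-proportion normal approximation:

```python
def n_per_arm(p, delta, z_alpha=1.96, z_beta=0.84):
    """Rough per-arm sample size to detect an absolute lift `delta`
    on base conversion rate `p` at alpha=0.05 (two-sided), ~80% power,
    using the standard normal-approximation formula."""
    return 2 * p * (1 - p) * ((z_alpha + z_beta) / delta) ** 2

# Hypothetical: 10% of users start checkout; the change lifts checkout
# completion from 50% to 52% among those (triggered) users.
trigger_rate = 0.10
n_triggered = n_per_arm(p=0.50, delta=0.02)          # analyze triggered users only
users_needed_triggered = n_triggered / trigger_rate  # total traffic to collect them

# Same effect diluted across everyone: base rate 5%, lift only 0.2%.
users_needed_diluted = n_per_arm(p=0.05, delta=0.002)

print(round(users_needed_triggered))  # 98000
print(round(users_needed_diluted))    # 186200
```

Even ignoring the extra noise the zero-effect users contribute, the triggered analysis needs roughly half the total traffic here.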
• "Larger companies with many active users generally roll out an A/B test to 1% or less, because they can afford to keep the experimental group small while still collecting a large enough sample in a reasonable amount of time without underpowering their experiment."
This is the 1% fallacy. Large companies want to be able to detect small differences. If Bing doesn't detect a 0.5% degradation to revenue in a US test, it might not realize the idea is going to lose $15M/year. The experiment must be sufficiently powered to detect small degradations in high-variance metrics like revenue that we care about. Most Bing experiments run at 10-20%, after an initial canary test at 0.5%
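The sensitivity argument can be sketched with the standard minimum-detectable-effect formula for a mean metric; the traffic volume and standard deviation below are hypothetical:

```python
import math

def min_detectable_effect(sigma, n_per_arm, z=1.96 + 0.84):
    """Smallest absolute difference in means detectable at alpha=0.05
    (two-sided) with ~80% power, given per-arm sample size n_per_arm."""
    return z * sigma * math.sqrt(2 / n_per_arm)

daily_users = 10_000_000   # hypothetical site traffic
sigma = 30.0               # hypothetical std dev of revenue per user
days = 14

for allocation in (0.01, 0.20):              # fraction of traffic in the test
    n = daily_users * days * allocation / 2  # split across two arms
    mde = min_detectable_effect(sigma, n)
    print(f"{allocation:.0%} allocation: MDE ${mde:.3f} per user")
```

With these numbers, a 1% allocation can only detect a ~$0.14 shift in revenue per user, while 20% detects ~$0.03; sensitivity improves with the square root of the allocation, so small degradations in high-variance metrics stay invisible at 1%.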
• "It’s always great to see the results you hoped to see!"
The value of an A/B test is the delta between expected and actual results. Some of the best examples are ones where the results are MUCH BETTER than what was expected.
• "if you run too many experiments concurrently, you risk having user exposed to multiple variables at the same time, creating experimental confounds such that you will not be able to tell which one is having an impact on behavior, ultimately muddying your results."
You can test for interactions. Bing, Booking.com, Facebook, Google, all run hundreds of concurrent experiments. This is a (mostly) solved problem.
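A quick simulation (made-up conversion rates, purely additive effects) illustrates why independently randomized concurrent experiments can be analyzed separately, and how an interaction contrast would flag a real conflict if one existed:

```python
import random

random.seed(0)

def converted(in_a, in_b):
    """Assumed conversion model: purely additive lifts from experiments
    A and B, with no true interaction (hypothetical rates)."""
    rate = 0.10 + (0.02 if in_a else 0.0) + (0.01 if in_b else 0.0)
    return random.random() < rate

# Independently randomize every user into experiment A and experiment B.
counts = {(a, b): [0, 0] for a in (False, True) for b in (False, True)}
for _ in range(400_000):
    a, b = random.random() < 0.5, random.random() < 0.5
    cell = counts[(a, b)]
    cell[0] += converted(a, b)
    cell[1] += 1

def rate(a, b):
    conv, n = counts[(a, b)]
    return conv / n

# Interaction contrast: does A's lift change depending on B?
interaction = (rate(True, True) - rate(False, True)) \
            - (rate(True, False) - rate(False, False))
print(round(interaction, 3))  # close to 0: the experiments don't confound each other
```

When the contrast is significantly nonzero, the two experiments genuinely interact and need a joint analysis; otherwise, each can be read off its own control/treatment split.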
2 of 2 people found the following review helpful.
The old is new again.
By Jerry Saperstein
I cut my teeth on mail-order marketing, what they now call direct-response. Might even be called something else now.
One of the cardinal rules of mail-order marketing was – and remains – test, test, test.
You tested everything: the headline, the copy, the color of the paper, its weight, everything, until you had tested enough to determine the most efficient marketing package.
This book’s primary authors are eminently qualified and highly experienced. Even better, they are graceful writers.
The authors define “A/B testing [as] a methodology to compare two or more versions of an experience to see which one performs the best relative to some objective measure”. In other words, you test to find out what works best.
The book is intended to acquaint designers and product managers with launching digital products and using data to guide the product's refinement. In other words, the book shows designers and product managers how to use the wealth of available data to better market their product.
Over the course of the first six chapters, they do precisely that. This stuff is really good. The authors, one with Spotify in her background, the other with Netflix, truly understand the concept, mechanics and worth of testing.
The last two chapters smelled too much like political correctness for my taste and, in my opinion, could have been left out without harming the value of the book.
If you are not thoroughly experienced with the concept of A/B testing in marketing vehicles, you will benefit from this book.
0 of 0 people found the following review helpful.
Improve the user experience and the product and save money; touches on Six Sigma principles, but has its own spin on the details
By Courtland J. Carpenter
The book could be more adaptable to various industries, but the work the authors discuss reminds me of the test design and data use from a Six Sigma course I took from Purdue a few years back. You can improve your process by designing experiments that show how to improve overall production. One of my professors described his best way of finding flaws as walking the plant floor to discover where the process fell down; once some of those flaws are fixed, you stop getting outlier data and the process is streamlined.
While this is not exactly what the book is about (it's not totally Six Sigma), it does touch on the same things, just from a more data-centered direction. What it does do is give you ideas for how to design those experiments, what kind of data to use, and then how to read the results. Testing itself is very data-centered. I work as an embedded software tester, and the design starts with customer requirements, which turn into system requirements that include both hardware and software, and then become the software requirements for the testing I do, from which I create software test verification requirements. The further removed you get from the customer requirements, the more central the data becomes, and going down that path it often becomes difficult to get a good design.
Here's an example from my experience. A new type of emission control system was being put on all the trucks my former company produced. They had a new multiplexed line of line-haul trucks (what you may call semi trucks) coming out. They had a couple of years to design the new emissions control system, but because the requirements pipeline did not start from the data needed to control the system, the system requirements detailed a pair of state diagrams describing the function that were horrible. One had 13 states and over 30 transitions; the other had about 11 states and over 25 transitions. It was implemented correctly in software, but the operation was kludgy at best. Engine variants made it worse, and the vehicle launch had to be pushed back because of the failures. The test group and system group, working from the data-centered side of things, designed two state diagrams, one with 4 states and 5 transitions, the other with 3 states and 4 transitions; it did the same thing but worked nearly flawlessly. It took six weeks of working 7 days a week to re-implement the software, and the lost revenue and poor launch probably cost millions, which shows one of the reasons to consider the methods this book supports. Not perfect, but a good start; recommended.