Rules for a successful multivariate test (Billy’s Optimization Guide Part 3)

If you missed it, see Part 1 (A/B Split Testing) and Part 2 (Multivariate Test Basics).

With the basics of Part 2 down, it’s time to start designing a multivariate test. Every optimization project has different challenges and goals. Luckily, though, there are a few rules that apply to every multivariate test design. These rules fit into two categories: technical rules and content rules.

Technical rules:

  1. Choose the appropriate multivariate test type (full or fractional factorial)
  2. Determine the number of factors and levels that can be tested based on estimated conversion traffic (choose a test array)
  3. Stop the test when it has stabilized, not based on your earlier estimations

These rules ensure statistical significance by constraining the test to the appropriate size at the beginning and then letting the test gather the proper amount of data at the end.

Running a test full factorial, if your traffic supports it, may be a good choice if you’re testing content that you believe to have many interactions or if you only want to test 2 factors with 2 levels each.  (Note: the smallest fractional factorial test size is 3 factors with 2 levels each.)  Typically though, you’ll want to run a fractional factorial test to save time and expand the number of factors and levels you can test.

To find out how many factors and levels you can test, you need some idea of your predicted page views and conversions, as well as an estimate of lift. Lift matters because a large lift will get you more conversions, so your test will stabilize more quickly. Because of this, I would be conservative with lift estimates to ensure that the test is not designed too large. At Widemile, we have a large list of arrays available to our tool and have calculated the approximate conversions each needs to stabilize, allowing me to look at the three criteria I listed and find the arrays that are statistically viable for testing. You should look for something similar with your tool of choice.
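As a rough illustration of that sizing step, here is a minimal Python sketch. The function names, the candidate array sizes, the traffic figures and the 100-conversion sample size per experiment are all assumptions made up for the example. They are not Widemile’s actual arrays or thresholds, so substitute the numbers your own tool provides.

```python
def expected_conversions(page_views, baseline_rate, lift):
    """Estimate conversions over the test period (hypothetical helper).

    lift is a fraction, e.g. 0.10 for an estimated 10% lift.
    """
    return page_views * baseline_rate * (1 + lift)


def viable_array_sizes(array_sizes, conversions, sample_size_per_experiment):
    """Keep only the arrays whose total conversion need fits the estimate."""
    return [
        n for n in array_sizes
        if n * sample_size_per_experiment <= conversions
    ]


# Assumed inputs: 50,000 page views, 2% baseline conversion rate,
# and a deliberately conservative 10% lift estimate.
estimate = expected_conversions(50_000, 0.02, 0.10)
print(round(estimate))  # 1100

# Assumed candidate arrays (number of experiments in each array).
print(viable_array_sizes([4, 8, 16, 32], estimate, 100))  # [4, 8]
```

Lowering the lift estimate lowers the expected conversions, which is why a conservative estimate steers you toward a smaller, safer array.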

To figure out when a test has stabilized, I prefer to look primarily at level influence stabilization, with experiment conversion rate stabilization for support. Widemile Optimize shows this using graphs, so I simply look for horizontal trending of lines, meaning winning levels and experiments stay winners and their level of influence or conversion rates stay fairly constant (look horizontal) over 3-5 days. If you don’t have graphs available, look at the historical cumulative conversion rate for your experiments and see whether there is a lot of variance across the latest few days of your test.
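If you are working from raw numbers rather than graphs, the cumulative check can be sketched in a few lines of Python. This is only one possible way to do it: the 5-day window and the 5% tolerance are assumptions I picked for illustration, and the function names are hypothetical. It also covers only the conversion rate side, not level influence.

```python
def cumulative_rates(daily_visitors, daily_conversions):
    """Cumulative conversion rate at the end of each day."""
    rates = []
    total_visitors = 0
    total_conversions = 0
    for visitors, conversions in zip(daily_visitors, daily_conversions):
        total_visitors += visitors
        total_conversions += conversions
        rates.append(total_conversions / total_visitors if total_visitors else 0.0)
    return rates


def looks_stable(rates, window=5, tolerance=0.05):
    """True if the last `window` cumulative rates stay within `tolerance`
    (relative to their mean). Both defaults are assumed values."""
    if len(rates) < window:
        return False
    recent = rates[-window:]
    mean = sum(recent) / window
    if mean == 0:
        return False
    return (max(recent) - min(recent)) / mean <= tolerance


# Made-up data for one experiment: noisy early days, steadier later.
visitors = [1000] * 10
conversions = [20, 40, 25, 30, 30, 30, 30, 30, 30, 30]
rates = cumulative_rates(visitors, conversions)

print(looks_stable(rates[:5]))  # False
print(looks_stable(rates))      # True
```

Run the same check for every experiment in the test, and also confirm that the winners have stayed winners over the window before stopping.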

Content rules:

  1. Every item you test should answer an important question
  2. Test variety not quantity
  3. Test opposites first then refine
  4. Remember you can run more than one test

The content rules are closely tied together. In effect, they ensure that the items selected for testing have purpose and that they don’t needlessly expand the size of your test, reducing its efficiency. I begin designing tests by creating hypotheses regarding issues with the page, and then choose factors and design levels to address those issues.

An example hypothesis is “Having a hero shot on the right side of the page causes users to ignore the important value proposition on the left side.” To test this, I would choose hero shot position as a factor and then have “left side hero shot” as the baseline level and “right side hero shot” as the second level. This example also illustrates that, beyond headlines and images, testing layout is possible with creative use of CSS and sometimes JavaScript. As long as you can switch from one level to another and each level works with the other factors and levels, you are at liberty to test anything.

Coming back to the rules, make sure that you are testing as few items as possible to find out what you need.  Before testing a collection of lifestyle hero shots, choose one and test it against an iconic hero shot.  This will save you the time of going down a path of testing something that may not work.

Lastly, you aren’t going to be able to get the best page on the first run or even second, third, etc.  If you knew what your audience liked 100% of the time then you wouldn’t need testing.  Remember to think of your overall test plan beyond just the first run, so that you can answer all the questions you need without having to force everything into one test.

In summary, determine what you’re trying to achieve, select the proper testing method to meet those goals and then make sure to be purposeful and efficient with the content you end up testing in front of your visitors.  Testing and optimization is not difficult, although it can be tough to start.  Follow these rules and you’ll be on your way to conquering conversion rates, bounce rates, funnel drop-offs and many other metrics.

Photo credit: Aranda\Lasch (CC)

An Essential Primer on Full and Fractional Factorial Test Design

What are full and fractional factorial test designs? How do they relate to optimization and what about interactions?

Once you get down and dirty with testing, these questions matter. Whether selecting an optimization platform or trying to thoroughly understand the tests you are building, grasping these concepts will put you in greater control and allow you to design and analyze your tests more effectively.

As simply as possible, I hope to educate you and other marketers about full and fractional factorial test designs and why fractional factorial is the best choice for multivariate testing of online campaigns.

Note: “Partial factorial” and “fractional factorial” are the same. Also, if you don’t have a thorough understanding of experiments and interactions, please read those first.

The tests used in optimization are from the design of experiments field. (From Wikipedia: “Design of experiments is the design of all information-gathering exercises where variation is present, whether under the full control of the experimenter or not.”) The two types of tests I will focus on are fractional factorial and full factorial.

Here is an example I will use to explain these concepts. Below is a test matrix outlining a test for a landing page with 5 factors with 2 levels each. Don’t let the vocabulary scare you away. This simply means that there are 5 parts of the page being tested and 2 variations of each.

Recipe Matrix: 5 factors = 5 parts (hero shot, headline, etc.) and 2 levels = 2 variations

These factors and their respective levels make up the possible combinations for a landing page. The combinations displayed are called experiments.

Let’s calculate the total number of experiments possible. (Even if you know how to do this already, it is important for understanding the distinction between fractional and full factorial.) There are 2 levels for each factor, so you can have 2×2×2×2×2 (2 to the 5th power) = 32 possible experiments. This means there are exactly 32 combinations of hero shots, headlines, sub headlines, button text and main copy from our matrix outlined above. Note that if we add another factor, it becomes 2 to the 6th power, or 64 possible experiments. Likewise, if you add 2 more levels to any one of the existing 5 factors, the total also increases from 32 to 4×2×2×2×2 = 64 experiments.
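The same arithmetic can be written as a tiny Python sketch: the number of possible experiments is just the product of the level counts. The function name is my own, chosen for the example.

```python
from math import prod


def count_experiments(levels_per_factor):
    """Number of distinct page combinations across all factors."""
    return prod(levels_per_factor)


base = [2, 2, 2, 2, 2]            # 5 factors, 2 levels each
extra_factor = base + [2]          # add a sixth 2-level factor
extra_levels = [4, 2, 2, 2, 2]     # add 2 more levels to one factor

print(count_experiments(base))          # 32
print(count_experiments(extra_factor))  # 64
print(count_experiments(extra_levels))  # 64
```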

In testing, each experiment must get a minimum amount of measurable conversions, known as the sample size per experiment. This ensures that there is enough data for a solid statistical analysis. Therefore the more experiments you have, the more conversions you need. You can think of conversion data as time also, since the longer you leave your web page up, the more data you get.

Now we’re ready to go back to the difference between the two test designs. Full factorial testing requires that every possible experiment combination is shown, so our 5-factor test would need to display all 32 experiments. This means that with a sample size of 100 conversions per experiment, 3,200 conversions will be required. Fractional factorial works differently: it displays a much smaller number of experiments, about 8 in this case, so it would need about 800 conversions.
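To make the comparison concrete, here is a small Python sketch of that calculation. The 100-conversion sample size and the 32 and 8 experiment counts come from the example above. The 40 conversions per day used to turn the totals into a duration is a made-up figure, included only to show how the gap translates into time.

```python
from math import ceil


def conversions_needed(num_experiments, sample_size_per_experiment):
    """Total conversions required so every experiment reaches its sample size."""
    return num_experiments * sample_size_per_experiment


def days_to_complete(total_conversions, conversions_per_day):
    """Rough test duration at a steady daily conversion volume."""
    return ceil(total_conversions / conversions_per_day)


full = conversions_needed(32, 100)
fractional = conversions_needed(8, 100)
print(full, fractional)  # 3200 800

# Assumed traffic: 40 conversions per day across the whole test.
print(days_to_complete(full, 40))        # 80
print(days_to_complete(fractional, 40))  # 20
```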

Since full factorial gathers additional data, it reveals all possible interactions, but as seen by the numbers above, there is a trade-off. More data equals more information but more data also equals a longer test duration. The minimum data requirements for full factorial are very high since you are showing every experiment.

Even if you are using full factorial to get the same amount of information as a fractional factorial test, it will take more time since you need more data to see statistically relevant differences between the many experiments.

You might be wondering how fractional factorial can be accurate if interactions are possible.

Random interactions of high relevance are very rare, especially when looking for interactions of more than 2 factors. You really need to design tests where you look for meaningful interactions that are based on true business requirements rather than hoping for a random and low influence interaction between a red button, a hero shot and a headline.

Whatever the interaction is, you need to be able to understand your audience and infer why there was an interaction in the first place. Only then are you ready to start designing for interactions.

Tests should not be filled with random levels. They should be carefully designed for success by focusing on testable hypotheses around the audience. Could a 1 pixel drop shadow on a button interacting with the copyright statement ever be truly significant, and not a victim of random error? Is it worth sacrificing thousands of conversions to learn a lesson that won’t result in any relevant increase of real world conversions?

There are interactions that might make sense to measure, and others that should be left out because of the amount of testing time they add.

This brings me to fractional factorial. It is possible for fractional factorial tests to detect interactions. How so? Using our example of a 5-factor test, fractional factorial can include everything from main effects only all the way to 4-factor interaction effects. Full factorial’s only difference is that it is the full extension and also includes the 5-factor interaction effect.
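To see how quickly the number of effects grows, here is a short Python sketch that counts them for the 5-factor example. It assumes 2-level factors, where each main effect or interaction is a single term, and the function names are hypothetical.

```python
from math import comb


def effects_by_order(num_factors):
    """Number of effects of each order: 1 = main effects, 2 = 2-factor
    interactions, and so on (valid for 2-level factors)."""
    return {k: comb(num_factors, k) for k in range(1, num_factors + 1)}


def effects_up_to(num_factors, max_order):
    """Total effects when testing everything up to `max_order`."""
    return sum(comb(num_factors, k) for k in range(1, max_order + 1))


print(effects_by_order(5))  # {1: 5, 2: 10, 3: 10, 4: 5, 5: 1}

for order in range(1, 6):
    print(order, effects_up_to(5, order))
# 1 5
# 2 15
# 3 25
# 4 30
# 5 31
```

The 31 effects at the full-factorial end, plus the overall average, account for all 32 experiments, while a main-effects-only test has just 5 effects to estimate.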

Fractional factorial is not a one-trick pony. It is a continuum ranging from testing for no interactions (only main effects) to testing interactions of one factor fewer than full factorial. It is exactly what the name fractional implies; even one less is a “fraction” of full factorial. It gives you the power to make trade-offs between testing only main effects and testing for interactions based on intelligent test design.
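For the curious, the main-effects end of that continuum can be sketched in Python. This builds one standard textbook 8-run design for 5 two-level factors, in which three factors are varied in every combination and the remaining two are derived from them. It is an illustration of the general technique only, not Widemile’s arrays or any particular tool’s method, and the names are my own.

```python
from itertools import combinations, product


def fractional_design_8_runs():
    """8-run design for 5 two-level factors, levels coded as -1 and +1.

    Factors A, B, C take every combination; D and E are derived
    (D = A*B, E = A*C), so only main effects can be separated.
    """
    runs = []
    for a, b, c in product((-1, 1), repeat=3):
        runs.append((a, b, c, a * b, a * c))
    return runs


def column(design, index):
    return [run[index] for run in design]


design = fractional_design_8_runs()
print(len(design))  # 8

# Each factor shows level 1 and level 2 equally often.
print(all(sum(column(design, i)) == 0 for i in range(5)))  # True

# Every pair of factors is orthogonal: each level pairing appears equally often.
print(all(
    sum(x * y for x, y in zip(column(design, i), column(design, j))) == 0
    for i, j in combinations(range(5), 2)
))  # True
```

Because every factor is balanced against every other, the 8 experiments are enough to estimate all 5 main effects, at the cost of mixing them with some interactions.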

Once you decide to test for all possible interactions, you are committing to a full-factorial test and incurring the associated traffic requirements. I’d love to see a test design that is designed for full interactions and still makes sense! Not having the ability to reduce the number of interactions is a huge detriment rather than a benefit of solutions limited to full-factorial testing.

Radically shorter test times allow for many more smart marketing ideas to be tested and adapted based on what you learn from each test run. You, the marketer, have the ability to analyze your results and tweak follow-on tests to capitalize on what you learn. This common-sense approach is what hypothesis-based testing is all about and is very powerful. Focus on testing smart ideas to increase your conversion rate. That’s what matters most.

The graph below illustrates how much information is gained and the amount of testing needed, based on the number of interactions tested.

[Figure: effects graph]

In my experience, the red area shows how valuable the data is based on which effects are being tested, while the blue area shows the amount of data (or time) needed to gather the data to confirm those effects. The x-axis goes from left to right, from main effects to full factorial (5-factor effects).

At Widemile, we believe it is more effective to perform quick, successive tests detecting only main effects rather than randomly hoping for interactions. While interactions might give you small or even large gains, they will likely never trump the gains from additional testing, nor offset the time and money lost looking for random interactions. The additional time required for full factorial tests is large, and not many marketers want to wait more than a month for a test to complete.

Fractional factorial is preferred by a few camps, including Widemile, Omniture’s Test&Target (formerly Offermatica) and Interwoven’s Optimost. Full factorial is used in Google’s free Website Optimizer and some tools offered by smaller providers.

Testing for all interactions sacrifices a lot of time. With the speed that audiences, marketing campaigns and seasons can change, it is important to get the most testing done in the least amount of time without sacrificing the quality of the data. Fractional factorial allows you to do just that, making it the wisest choice for multivariate testing.

What is Taguchi? How does it relate to testing?


Multivariate testing is a buzzword these days, but the buzzword within multivariate testing seems to be Taguchi. However, that term is being abused. Do you know what Taguchi really means? I wasn’t even positive, so to get some background, I did some research and talked with Vladimir (Widemile’s Chief Scientist).

The name and method come from Genichi Taguchi. His method, also known as Robust Design, was created to improve product manufacturing quality. It therefore falls into an area of engineering called Quality Engineering.

Does this sound aligned with website testing? Not really, and this is the problem with using the term Taguchi for website testing. The goals of manufacturing and the goals of a website are not the same.

What most people are attempting to grasp when using the term Taguchi is fractional factorial test design. (I discussed this at length in my post about the difference between Widemile’s technology and Google Optimizer.) The Taguchi method uses a fractional factorial test design and is under the umbrella of fractional factorial testing but is not the only or best fractional factorial method. In fact, even within manufacturing, the Taguchi method was the inspiration for many new techniques but many statisticians find it flawed.*

It is important to differentiate the Taguchi method from fractional factorial test design, since one is a basis for manufacturing while the other is purely related to design of experiments. You need to ensure that the math and science behind your testing are based on methods whose only end goal is optimizing your website. So if your testing tool uses the Taguchi method for testing, you had better ask what that really means.

So does Widemile use Taguchi? We don’t use the Taguchi method; however, we do use fractional factorial test design. I like to say that our platform goes beyond Taguchi because it was specifically made for optimizing web content.

Don’t get sucked into the Taguchi method. It is just a buzzword used by your fellow marketers. Just because a technology doesn’t use Taguchi doesn’t mean you should count it out.

*Read more after the jump for Vladimir’s explanation of the Taguchi method and its criticisms

Google Optimizer is slow (or Not all Multivariate Testing is the same)

*Update: Hello!  If you’ve found this article after reading the book Always Be Testing, I encourage you to take a look at a more recent and in-depth article I’ve written here: An Essential Primer on Full and Fractional Factorial Test Design.  Thanks for visiting!

Without knowing better, people might assume that there’s only one method of multivariate testing and that it was figured out long ago by math and statistics wizards. I have learned otherwise from Widemile’s own math wizard, Chief Scientist Vladimir Brayman.

(Just as a side note, he does not have a typical office. Rather than papers and folders strewn about, he has statistics and math books. Lucky for me though, he has a great skill at distilling all the goodness in those books and teaching me what I need to know, in a way I understand.)

Most recently, we discussed why Widemile’s technology trumps Google Optimizer.

Having a strong creative team and testing experts ensures better results than giving a marketer a tool like Google Optimizer; that’s easy for most people to understand. But explaining how Widemile’s technology can test more, faster, is a little more complicated.

Let’s explore how Google’s testing works versus Widemile’s. Google Optimizer uses full factorial test design, meaning it creates a page for every combination of your tested page elements. So if you wanted to test 4 different hero shots, 4 buttons and 4 headlines, that would require 4×4×4 = 64 page combinations. The disadvantage of this method is that you need significant traffic for each of the 64 pages, meaning you need either a lot of traffic or a lot of time; most companies will need both.
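A quick Python sketch shows what “a page for every combination” looks like in practice. The element labels below are placeholders invented for the example.

```python
from itertools import product

# Placeholder labels for the 4 variations of each tested element.
hero_shots = ["hero 1", "hero 2", "hero 3", "hero 4"]
buttons = ["button 1", "button 2", "button 3", "button 4"]
headlines = ["headline 1", "headline 2", "headline 3", "headline 4"]

# Full factorial: every hero shot with every button with every headline.
pages = list(product(hero_shots, buttons, headlines))

print(len(pages))  # 64
print(pages[0])    # ('hero 1', 'button 1', 'headline 1')
print(pages[-1])   # ('hero 4', 'button 4', 'headline 4')
```

Each of those 64 tuples is a separate page that has to earn its own share of traffic.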

To solve this, Widemile’s optimization platform uses fractional factorial test design. This method tests only a small fraction of the total possible page combinations and uses statistical analysis to derive almost all of the same information that would be found in a full factorial test. This works because only marginal information is gained by testing all 64 page combinations, while testing a few important combinations tells us nearly everything we need to know.

Google actually criticizes fractional factorial test design (look here where it says “A note about ‘fractional factorial testing’”), saying that it requires the same number of impressions but cannot derive the depth of conclusions that a full factorial design can. While it is true that full factorial squeezes out the most information, that comes at the cost of extending the test many times longer than a fractional factorial test, all to learn the smallest influences.

Doing successive tests to find high influence items with fractional factorial testing will get much higher gains than getting every ounce of information out of one extremely long full factorial test. In addition, with a carefully designed fractional factorial test you can learn all the major influences and the interactions between elements on the page.

Fractional factorial test design gets you a completed test in weeks rather than months or even years, and because of that, you can test more than you would normally be able to in the same time frame. You can either test more in one larger test, or do many smaller successive tests.

Not to say that Google Optimizer isn’t a great tool, especially since it is free, but any company that spends thousands of dollars on SEM has a lot to gain by using technology that gets rapid results.

If you have any questions about this, let me know and I’ll try to answer them or get you an answer.