At eTail West one wise retailer told this audience, “80% of what you think you know about your site is wrong.” His message was one among many ways marketers urged their audiences to test (and test often).
The problem is, marketers often assume testing in itself is enough to negate the impact of personal bias, even as many brazenly ignore the statistics behind running a valid test. The result is several surprisingly common misconceptions that can turn even the most well-intentioned tester into a mindless, hypothesis-confirming drone.
One of the biggest mistakes a marketer can make is failing to understand the difference between one-tailed and two-tailed tests. And we don’t blame them. Testing vendors don’t necessarily provide the option to calculate statistical significance in more than one way, and if they don’t, they probably aren’t going to bother explaining the difference. I’m not here to say a one-tailed test is inherently useless, but rather that it is a risky point of confusion when you are judging the validity of your testing campaigns and making decisions about the user experience on your site or mobile app.
Now let’s get into some detail.
One-Tailed Tests: What is a One-Tailed Test?
A one-tailed test allows you to determine if one mean is greater or less than another mean, but not both. A direction must be chosen prior to testing.
In other words, a one-tailed test tells you the effect of a change in one direction and not the other. Think of it this way: if you are trying to decide whether to buy a brand name product or a generic product at your local drugstore, a one-tailed test of the effectiveness of the product would only tell you if the generic product worked better than the brand name. You would have no insight into whether the product was equivalent or worse.
Because the test cannot flag a negative effect, a generic product that in fact works worse (meaning it doesn’t work very well at all!) would simply look like it has a minimal impact, and you would go ahead and purchase it because it is cheaper.
If this is the case, you’re probably wondering when a one-tailed test should be used. One-tailed tests should be used only when you are not worried about missing an effect in the untested direction.
But how does this impact optimization? If you’re running a test and only using a one-tailed test, you will only see significance if your new variant outperforms the default. There are 2 outcomes: the new variant wins or we cannot distinguish it from the default.
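To make the two outcomes concrete, here is a minimal sketch of a one-tailed significance check on conversion rates. It assumes a pooled two-proportion z-test, which is one common choice and not necessarily what any given testing vendor uses; the function name and the visitor counts are hypothetical.

```python
from math import sqrt
from statistics import NormalDist

def one_tailed_p_value(conv_a, n_a, conv_b, n_b):
    """Upper-tailed p-value for the alternative 'variant B converts better than default A'."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se = sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    # Only large positive z values count as evidence
    return 1 - NormalDist().cdf(z)

# Hypothetical counts: the variant converts at 7.5% versus 10% for the default
p = one_tailed_p_value(conv_a=200, n_a=2000, conv_b=150, n_b=2000)
verdict = "variant wins" if p < 0.05 else "cannot distinguish from default"
print(verdict)
```

With these counts the variant is clearly worse, yet the only thing the one-tailed test can report is that it cannot be distinguished from the default.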
Two-Tailed Tests: What is a Two-Tailed Test?
A two-tailed test allows you to determine if two means are different from one another. A direction does not have to be specified prior to testing.
In other words, a two-tailed test takes into account the possibility of both a positive and a negative effect.
Let’s head back to the drug store. If you were doing a two-tailed test of the generic against the brand name product, you would have insight into whether the generic product was better than, equivalent to, or worse than the brand name product. In this instance, you can make a more educated decision: if the generic product is equivalent, you would purchase it because it is cheaper, but if it is far less effective than the brand name product, you’d probably shell out the extra money for the brand name product. You wouldn’t want to waste your money on an ineffective product, would you?
So when should a two-tailed test be used? Two-tailed tests should be used when you are willing to accept any of the following: one mean being greater than, lower than, or similar to the other.
And how does this impact optimization? When running a test, if you are using a two-tailed test you will see significance if your new variant’s mean is different from that of the default. There are 3 outcomes: the new variant wins, loses or is similar to the default.
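The three outcomes can be sketched the same way. This is an illustration under assumptions, not a vendor's method: it uses a pooled two-proportion z-test, a 0.05 cutoff, and hypothetical names and counts. Note that "similar" here only means no significant difference was detected.

```python
from math import sqrt
from statistics import NormalDist

def two_tailed_result(conv_a, n_a, conv_b, n_b, alpha=0.05):
    """Return (p_value, outcome) for default A versus variant B."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se = sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    # Both tails count: a large |z| in either direction is evidence of a difference
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    if p_value >= alpha:
        outcome = "similar to default"
    elif z > 0:
        outcome = "variant wins"
    else:
        outcome = "variant loses"
    return p_value, outcome

# Hypothetical counts: the variant converts at 7.5% versus 10% for the default
print(two_tailed_result(200, 2000, 150, 2000)[1])
```

Unlike a one-tailed check, the same data now surfaces the losing variant instead of hiding it behind an inconclusive result.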
Which Testing Approach Should You Be Using?
Two-tailed tests mitigate the risk involved with predicting how future visitors will be impacted by the tested content. By accounting for all possible outcomes, this approach provides more valuable, unbiased insights which can be reported on with confidence. Testing is supposed to make it easier for marketers to understand the impact of a certain change without the need for IT intervention, but when the difference between one-tailed and two-tailed tests goes ignored, both the marketer’s time and the IT resources are at risk of being wasted. Don’t fall victim to personal bias or winning results which are devoid of real meaning. A two-tailed approach does require more traffic and time, but that is a small price to pay for reliable results. Now get testing!
In statistical significance testing, a one-tailed test and a two-tailed test are alternative ways of computing the statistical significance of a parameter inferred from a data set, in terms of a test statistic. A two-tailed test is appropriate if the estimated value may be more than or less than the reference value, for example, whether a test taker may score above or below the historical average. A one-tailed test is appropriate if the estimated value may depart from the reference value in only one direction, for example, whether a machine produces more than one-percent defective products. Alternative names are one-sided and two-sided tests; the terminology "tail" is used because the extreme portions of distributions, where observations lead to rejection of the null hypothesis, are small and often "tail off" toward zero as in the normal distribution or "bell curve".
One-tailed tests are used for asymmetric distributions that have a single tail, such as the chi-squared distribution, which are common in measuring goodness-of-fit, or for one side of a distribution that has two tails, such as the normal distribution, which is common in estimating location; this corresponds to specifying a direction. Two-tailed tests are only applicable when there are two tails, such as in the normal distribution, and correspond to considering either direction significant.
In the approach of Ronald Fisher, the null hypothesis H0 will be rejected when the p-value of the test statistic is sufficiently extreme (vis-a-vis the test statistic's sampling distribution) and thus judged unlikely to be the result of chance. In a one-tailed test, "extreme" is decided beforehand as either meaning "sufficiently small" or meaning "sufficiently large"; values in the other direction are considered not significant. In a two-tailed test, "extreme" means "either sufficiently small or sufficiently large", and values in either direction are considered significant. For a given test statistic there is a single two-tailed test, and two one-tailed tests, one for each direction. Data that yield a given p-value in a two-tailed test will, in the corresponding one-tailed tests for the same test statistic, be considered either twice as significant (half the p-value), if the data is in the direction specified by the test, or not significant at all (p-value above 0.05), if the data is in the direction opposite that specified by the test.
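The relationship between the single two-tailed test and the two one-tailed tests can be sketched for a symmetric sampling distribution. The example assumes the test statistic is a standard normal z value; the helper name `p_values` is illustrative only.

```python
from statistics import NormalDist

def p_values(z):
    """Return (two_tailed, upper_tailed, lower_tailed) p-values for a z statistic."""
    cdf = NormalDist().cdf(z)
    upper = 1 - cdf          # one-tailed test that treats large values as extreme
    lower = cdf              # one-tailed test that treats small values as extreme
    two = 2 * min(upper, lower)
    return two, upper, lower

two, upper, lower = p_values(2.0)
# In the direction of the data the one-tailed p-value is half the two-tailed one;
# in the opposite direction it is 1 - two / 2, which can never fall below 0.5.
print(round(two, 4), round(upper, 4), round(lower, 4))
```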
For example, if flipping a coin, testing whether it is biased towards heads is a one-tailed test, and getting data of "all heads" would be seen as highly significant, while getting data of "all tails" would be not significant at all (p = 1). By contrast, testing whether it is biased in either direction is a two-tailed test, and either "all heads" or "all tails" would be seen as highly significant data. In medical testing, one is generally interested in whether a treatment results in outcomes that are better than chance, which suggests a one-tailed test. However, a worse outcome is also interesting for the scientific field, so one should use a two-tailed test, which corresponds instead to testing whether the treatment results in outcomes that are different from chance, either better or worse. In the archetypal lady tasting tea experiment, Fisher tested whether the lady in question was better than chance at distinguishing two types of tea preparation, not whether her ability was different from chance, and thus he used a one-tailed test.
Coin flipping example
In coin flipping, the null hypothesis is a sequence of Bernoulli trials with probability 0.5, yielding a random variable X which is 1 for heads and 0 for tails, and a common test statistic is the sample mean (of the number of heads). If testing for whether the coin is biased towards heads, a one-tailed test would be used: only large numbers of heads would be significant. In that case a data set of five heads (HHHHH), with sample mean of 1, has a 1/32 = 0.03125 ≈ 0.03 chance of occurring (5 consecutive flips with 2 outcomes each, so (1/2)^5 = 1/32), and thus would have p ≈ 0.03 and would be significant (rejecting the null hypothesis) if using 0.05 as the cutoff. However, if testing for whether the coin is biased towards heads or tails, a two-tailed test would be used, and a data set of five heads (sample mean 1) is as extreme as a data set of five tails (sample mean 0), so the p-value would be 2/32 = 0.0625 ≈ 0.06 and this would not be significant (not rejecting the null hypothesis) if using 0.05 as the cutoff.
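The arithmetic of the five-heads example can be checked with an exact binomial calculation. This is a small sketch; the function name is hypothetical, and doubling the smaller tail is used for the two-tailed value, which matches the symmetric fair-coin case described here.

```python
from math import comb

def binomial_tail_p_values(heads, flips):
    """Exact (one_tailed, two_tailed) p-values under a fair-coin null hypothesis."""
    total = 2 ** flips
    # Upper tail: probability of at least this many heads
    upper = sum(comb(flips, k) for k in range(heads, flips + 1)) / total
    # Lower tail: probability of at most this many heads
    lower = sum(comb(flips, k) for k in range(0, heads + 1)) / total
    two_tailed = min(1.0, 2 * min(upper, lower))
    return upper, two_tailed

one_tailed, two_tailed = binomial_tail_p_values(heads=5, flips=5)
print(one_tailed, two_tailed)  # 0.03125 0.0625
```

At a 0.05 cutoff the one-tailed value rejects the null hypothesis and the two-tailed value does not, for exactly the same five flips.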
The p-value was introduced by Karl Pearson in the Pearson's chi-squared test, where he defined P (original notation) as the probability that the statistic would be at or above a given level. This is a one-tailed definition, and the chi-squared distribution is asymmetric, only assuming positive or zero values, and has only one tail, the upper one. It measures goodness of fit of data with a theoretical distribution, with zero corresponding to exact agreement with the theoretical distribution; the p-value thus measures how likely the fit would be this bad or worse.
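As an illustration of this upper-tail definition, the sketch below runs a goodness-of-fit check of coin-flip counts against a fair coin. It assumes one degree of freedom, where the chi-squared upper-tail probability has the closed form erfc(sqrt(x / 2)); the function name and the counts are made up for the example.

```python
from math import erfc, sqrt

def chi_squared_gof_coin(heads, tails):
    """Pearson goodness-of-fit of observed counts against a fair coin (1 degree of freedom)."""
    expected = (heads + tails) / 2
    statistic = ((heads - expected) ** 2 + (tails - expected) ** 2) / expected
    # P is the probability that the statistic is at or above the observed level.
    # For 1 degree of freedom, P(chi2 >= x) = erfc(sqrt(x / 2)).
    p_value = erfc(sqrt(statistic / 2))
    return statistic, p_value

statistic, p_value = chi_squared_gof_coin(heads=60, tails=40)
print(statistic, round(p_value, 4))
```

A perfect 50/50 split gives a statistic of zero and a p-value of 1, since every possible fit is at least that bad.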
The distinction between one-tailed and two-tailed tests was popularized by Ronald Fisher in the influential book Statistical Methods for Research Workers, where he applied it especially to the normal distribution, which is a symmetric distribution with two equal tails. The normal distribution is a common measure of location, rather than goodness-of-fit, and has two tails, corresponding to the estimate of location being above or below the theoretical location (e.g., sample mean compared with theoretical mean). In the case of a symmetric distribution such as the normal distribution, the one-tailed p-value is exactly half the two-tailed p-value:
Some confusion is sometimes introduced by the fact that in some cases we wish to know the probability that the deviation, known to be positive, shall exceed an observed value, whereas in other cases the probability required is that a deviation, which is equally frequently positive and negative, shall exceed an observed value; the latter probability is always half the former.
— Ronald Fisher, Statistical Methods for Research Workers
Fisher emphasized the importance of measuring the tail (the observed value of the test statistic and all more extreme values) rather than simply the probability of the specific outcome itself, in his The Design of Experiments (1935). He explains that a specific set of data may be unlikely (under the null hypothesis) while more extreme outcomes are likely; seen in this light, data that are unlikely but not extreme should not be considered significant.
If the test statistic follows a Student's t-distribution under the null hypothesis, which is common where the underlying variable follows a normal distribution with unknown scaling factor, then the test is referred to as a one-tailed or two-tailed t-test. If the test is performed using the actual population mean and variance, rather than an estimate from a sample, it would be called a one-tailed or two-tailed Z-test.
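A small numerical illustration of this point, using made-up numbers: in 100 flips of a fair coin, exactly 50 heads is individually an unlikely outcome, yet it is the least extreme result possible, so its tail probability is 1.

```python
from math import comb

flips, heads = 100, 50

# Probability of this one specific count under the fair-coin null hypothesis
point_probability = comb(flips, heads) / 2 ** flips

# Two-tailed p-value: all counts at least as far from the expected 50 as the observed one
deviation = abs(heads - flips / 2)
p_value = sum(
    comb(flips, k) for k in range(flips + 1) if abs(k - flips / 2) >= deviation
) / 2 ** flips

print(round(point_probability, 4), p_value)
```

Judged by the point probability alone (about 0.08), even the most typical result would look suspicious; the tail probability avoids that.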
The statistical tables for t and for Z provide critical values for both one- and two-tailed tests. That is, they provide the critical values that cut off an entire region at one or the other end of the sampling distribution as well as the critical values that cut off the regions (of half the size) at both ends of the sampling distribution.
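For the Z case, the tabulated critical values can be reproduced with the standard library. This sketch assumes a 0.05 significance level; the t critical values would additionally need the degrees of freedom and are not computed here.

```python
from statistics import NormalDist

alpha = 0.05
z = NormalDist()

# One-tailed: the whole rejection region (size alpha) sits at one end
one_tailed_critical = z.inv_cdf(1 - alpha)
# Two-tailed: a region of size alpha / 2 at each end
two_tailed_critical = z.inv_cdf(1 - alpha / 2)

print(round(one_tailed_critical, 3), round(two_tailed_critical, 3))  # 1.645 1.96
```

The two-tailed cutoff is further from zero, which is why a two-tailed test needs a larger observed effect (or more data) to reach significance.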
- Kock, N. (2015). One-tailed or two-tailed P values in PLS-SEM? International Journal of e-Collaboration, 11(2), 1-7.
- Mundry, R., & Fischer, J. (1998). Use of statistical programs for nonparametric tests of small samples often leads to incorrect P values: Examples from Animal Behaviour. Animal Behaviour, 56(1), 256-259.
- Pillemer, D. B. (1991). One-versus two-tailed hypothesis tests in contemporary educational research. Educational Researcher, 20(9), 13-17.
- John E. Freund (1984). Modern Elementary Statistics, sixth edition. Prentice Hall. ISBN 0-13-593525-3 (Section "Inferences about Means", chapter "Significance Tests", page 289.)
- J M Bland, D G Bland (BMJ, 1994). Statistics Notes: One and two sided tests of significance.
- Pearson, Karl (1900). "On the criterion that a given system of deviations from the probable in the case of a correlated system of variables is such that it can be reasonably supposed to have arisen from random sampling". Philosophical Magazine. Series 5. 50 (302): 157–175. doi:10.1080/14786440009463897.
- Fisher, Ronald (1925). Statistical Methods for Research Workers. Edinburgh: Oliver & Boyd. ISBN 0-05-002170-2.
- Fisher, Ronald A. (1971). The Design of Experiments (9th ed.). Macmillan. ISBN 0-02-844690-9.