I’m now in the middle of a project to find out how the use of emoticons in IM conversations relates to the use of actual facial expressions and, together with my colleagues, I have to set up an experiment. We have this plan about how we’re going to do it: we have interesting literature about the subject, we have a nice and original experiment design, we’ve found the technology we need to carry it out and we’ve almost figured out who we want as participants. But at the end of our to-do list for today “# of participants” is still there, sporting a devilish wink >;-) So, how is sample size chosen when doing an experiment?
I think, because that’s how the process goes in my head, I’ll divide considerations into practical and statistical. I just want to list the factors that need to be taken into consideration and how to reach the best compromise. I guess I’ll leave worrying about the validity of introspective task analysis for some other day :P
I’ll start with the practical factors: money and time. The things that need to be taken into account are:
- Duration of the actual experiment for each participant
- How long does it take to analyze data for each participant? (dependent on kind of media, coding method and type of analysis)
- How much does each participant cost? (test costs + experimenter’s hours + monetary rewards for participants if any)
These may seem obvious but, for example, the time for data analysis is very often underestimated and rarely (seriously) considered a function of the number of participants. Something these factors have in common is that they are all limiting factors. Practical factors are going to tell you how many participants you can (or, more accurately, can’t) have, but how many do you need? Hopefully, that’s what statistical considerations should tell you, or in other words: how good you need your results to be.
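To make the "limiting factors" idea concrete, here is a minimal sketch of how the practical cap could be computed. The function name and all the numbers are hypothetical; the only point is that analysis time is counted per participant, just like session time and cost.

```python
def max_participants(budget, hours_available, cost_per_participant, hours_per_participant):
    """Largest number of participants that fits both the money and the time limit.

    hours_per_participant should include the session itself AND the time
    needed to code and analyze that participant's data.
    """
    by_money = int(budget // cost_per_participant)
    by_time = int(hours_available // hours_per_participant)
    return min(by_money, by_time)


# Hypothetical figures: 1 h session + 3 h of coding/analysis per participant.
n_max = max_participants(budget=2000, hours_available=120,
                         cost_per_participant=45, hours_per_participant=4)
print("Practical upper limit:", n_max)
```

With these made-up figures, time rather than money turns out to be the binding constraint, which is exactly the case that tends to be overlooked.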
So, let’s say that you want to test a new interface against an old one. You may want to know if the mean number of errors that people make using the new interface (μ2) is smaller than for the old interface (μ1). I’m going to use this example because I think it’s pretty common; the same principles apply to other kinds of tests. So your hypotheses look like this:
H0: μ1 ≤ μ2
H1: μ1 > μ2
And you test n users with the old interface and n users with the new interface.
Generally, you’d choose what you want your type I error (α, or probability of rejecting H0 if H0 is true) to be. The idea behind this is that α caps how often you’d wrongly declare the new interface better when it isn’t, so if you do reject H0 you can report the result at the 100(1-α)% confidence level; that’s why α is the kind of thing that you’d like to choose arbitrarily. Once you choose an α, and with everything else (sample size, effect size, variance) held fixed, you end up with a β (probability of type II error, or of failing to reject H0 if it’s false). I don’t think that an extensive explanation of why β is determined by your choice of α would be relevant here, so this is the quick one. If you get too meticulous about the evidence you need to say that the new interface is better (low α), it’s more likely that you’ll end up discarding the claim even if it’s true (high β) just because you’re being too picky. On the other hand, if you definitely don’t want to miss the new interface in case it’s better (low β), you will adopt it even when the evidence is not so strong, and it’s more likely that you adopt it even if it’s not really better (high α). I hope this makes sense to you; if not, you can read more about it here.
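If you prefer to see the trade-off rather than take my word for it, here is a small simulation sketch. Everything in it is an assumption made for illustration: error counts are treated as normally distributed with a known standard deviation (so it uses a z-test rather than a t-test), and the means, δ, n and number of trials are made-up values.

```python
import random
from math import sqrt
from statistics import NormalDist


def simulate_error_rates(alpha, delta=0.5, sigma=1.0, n=20, trials=2000, seed=1):
    """Estimate type I and type II error rates of a one-sided z-test by simulation.

    H0: mu_old <= mu_new, H1: mu_old > mu_new, with n users per interface.
    """
    rng = random.Random(seed)
    z_crit = NormalDist().inv_cdf(1 - alpha)   # rejection threshold
    se = sigma * sqrt(2 / n)                   # std. error of the mean difference

    def rejects(mu_old, mu_new):
        old = [rng.gauss(mu_old, sigma) for _ in range(n)]
        new = [rng.gauss(mu_new, sigma) for _ in range(n)]
        z = (sum(old) / n - sum(new) / n) / se
        return z > z_crit

    # Type I: H0 is true (no difference) but we reject it anyway.
    type_i = sum(rejects(5.0, 5.0) for _ in range(trials)) / trials
    # Type II: H1 is true (new interface is better by delta) but we fail to reject.
    type_ii = sum(not rejects(5.0, 5.0 - delta) for _ in range(trials)) / trials
    return type_i, type_ii


for alpha in (0.10, 0.05, 0.01):
    type_i, type_ii = simulate_error_rates(alpha)
    print(f"alpha={alpha:.2f}: type I ~ {type_i:.3f}, type II ~ {type_ii:.3f}")
```

The estimated type I rate should land close to the α you asked for, while the estimated type II rate climbs as α gets stricter.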
So β depends on the α you choose, but what other things does it depend on? If we’re considering the scenario in which you had to reject H0 but didn’t (this is what type II error, β, is all about), then μ1 > μ2. This means that there is actually a δ = μ1 – μ2. The image below shows a possible distribution of the errors made by the users in the old interface (left) and the new interface (right) when there is a δ (see the vertical line). If you look at the image you can easily see how, as the difference between the means, δ, increases (i.e. the curve on the right moves further right), the type II error (which is proportional to the purple area) decreases. So β will not only depend on our choice of α but also on how big the effect that the experimenter is trying to detect is (to what extent the new interface is better than the old one). Simply put: big effects are easier to detect than small effects.

The image above also shows how β depends on the population variances. It’s easy to see that “flattening” the curves (increasing variance) would increase the purple area.
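The same two dependencies can be checked numerically. This is only a sketch under simplifying assumptions that are mine, not part of the experiment above: normal data, a common standard deviation σ that is known (so a z-test approximation instead of the exact t-test), and n users per interface. The function name and the example values are hypothetical.

```python
from math import sqrt
from statistics import NormalDist


def type_ii_error(alpha, delta, sigma, n):
    """Approximate beta for the one-sided test H0: mu1 <= mu2 vs H1: mu1 > mu2."""
    z_crit = NormalDist().inv_cdf(1 - alpha)   # rejection threshold under H0
    shift = (delta / sigma) * sqrt(n / 2)      # mean of the test statistic under H1
    return NormalDist().cdf(z_crit - shift)    # P(fail to reject | H1 is true)


# Bigger effects are easier to detect (sigma fixed at 2.0, n = 20 per group).
for delta in (0.5, 1.0, 2.0):
    print(f"delta={delta}: beta={type_ii_error(0.05, delta, 2.0, 20):.2f}")

# Flatter curves (larger sigma) make the same effect harder to detect.
for sigma in (1.0, 2.0, 4.0):
    print(f"sigma={sigma}: beta={type_ii_error(0.05, 1.0, sigma, 20):.2f}")
```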
However, given our choice of α and the fact that the difference in the means and variances are inherent characteristics of the populations, it would seem that there is little we can do to further improve our experiment (i.e. decrease β), that is until we take a closer look at what happens when we change sample size. To do this, I’ll assume that you agree with me that, if you have an experiment like this, you probably want to do a two-sample t-test. A t-test is what you would use when your test statistic follows a t-distribution, which happens when your populations should ideally have a normal distribution but your sample is too small to accurately estimate the variance (as the sample size goes to infinity, the t-distribution approaches the normal distribution because the variance estimate becomes exact). The image below shows how the t distribution changes with sample size (k).

When the sample size (k) increases, the curve rises and gets slimmer, the same effect we would get with a decreasing variance, making the type II error smaller (the purple area shrinks). So sample size is the variable you can manipulate to get results you’re more confident with.
So, how do you do it? How much is enough? Luckily, some nice people came up with something called Operating Characteristic curves. OC curves show the probability of failing to reject H0 as a function of the difference between means (δ) and the standard deviation (σ) for different numbers of participants. You can see an example in the image below.

The example in the image shows the probability of accepting H0 as a function of d = (μ1 – μ2)/σ, for different sample sizes (different curves), for the two-sided t-test with an α of 0.05. You can see there that while increasing sample size from 5 to 15 may be a good idea, you may want to think twice about going from 15 to 20 (you may want to if you want to detect very small effects) even if the amount of time and money you’re adding to your original investment is the same in both cases.
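If you don’t have a printed chart at hand, a rough version of these curves can be tabulated. Note the assumptions: this sketch uses a one-sided z-test approximation with known σ, so the numbers will not match a two-sided t-test chart exactly; the function name and the grid of d values are hypothetical.

```python
from math import sqrt
from statistics import NormalDist


def oc_curve(n, alpha=0.05, d_values=(0.5, 1.0, 1.5, 2.0)):
    """Approximate P(fail to reject H0) at each standardized difference d,
    for n participants per group."""
    z_crit = NormalDist().inv_cdf(1 - alpha)
    return [NormalDist().cdf(z_crit - d * sqrt(n / 2)) for d in d_values]


print("        d=0.5  d=1.0  d=1.5  d=2.0")
for n in (5, 10, 15, 20):
    row = "   ".join(f"{beta:.2f}" for beta in oc_curve(n))
    print(f"n={n:2d}:  {row}")
```

Reading down a column shows the diminishing returns: each extra block of five participants buys a smaller drop in β than the previous one.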
So once you have your α and you have estimated the variance (which you have to do for your test anyway), you decide what is the minimum difference in means (δ) that you would like to detect and how likely you are willing to be to miss it (β), which is the same as saying you want a 100(1-β)% chance of detecting it. When you have this information, the procedure goes like this:
- Look for the right OC curve set for your α
- Calculate d
- Find the point in the graph with coordinates (d, β)
- Move horizontally to the left until you meet the closest OC curve, that curve will give you the number of participants you need (n)
- Then move down to look up the d that will give the β you want for the curve you just chose. This d (multiplied by σ) is the minimum difference in means that you can be 100(1-β)% confident to detect
Finally, check that the sample size you found you need according to statistical considerations is within the limits you set due to practical considerations (if not, adjust your budget or expectations and repeat ;-)
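As a cross-check on the chart lookup, the whole procedure can be approximated with a closed-form formula. This is a sketch under assumptions that are mine: a one-sided test, known σ and the normal approximation n ≈ 2((z_α + z_β)/d)², which tends to give a slightly smaller n than the exact t-test would. The function name, the example values and the practical limit `n_max` are hypothetical.

```python
from math import ceil
from statistics import NormalDist


def required_n(alpha, beta, d):
    """Approximate participants needed PER GROUP to detect a standardized
    difference d = delta / sigma with type I error alpha and type II error beta."""
    z_alpha = NormalDist().inv_cdf(1 - alpha)
    z_beta = NormalDist().inv_cdf(1 - beta)
    return ceil(2 * ((z_alpha + z_beta) / d) ** 2)


n = required_n(alpha=0.05, beta=0.20, d=0.5)
n_max = 30  # hypothetical per-group limit from the practical considerations

if n <= n_max:
    print(f"{n} participants per group: within the practical limit")
else:
    print(f"{n} participants per group: over the limit of {n_max}, "
          "so adjust the budget or the expectations and repeat")
```

Since there are two groups, the total number of participants is twice the value returned.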
Hi, my name is Luz Caballero. I'm a User Experience designer/researcher.
Hey thank you for the clear explanation, Luz. I had it before, but it’s always good to see it again. Statistics always seem to slip away from mind after a while.