Once an effect size has been specified, various methods can be used to calculate an appropriate sample size for desired levels of α and power1,2. Multiple online calculators and software packages can be used for such calculations (see below).
An extensive review of this subject is beyond the scope of this article, and researchers are encouraged to consult a statistician; however, there are several important factors that should be considered.
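As a concrete illustration, such a calculation can be performed with the Python statsmodels package (one of many options; the effect size, α, and power values below are placeholders, and a two-sample t test with equal group sizes is assumed):

```python
# Per-group sample size for a two-sample t test via statsmodels.
# Effect size is Cohen's d; all values are illustrative placeholders.
from statsmodels.stats.power import TTestIndPower

n_per_group = TTestIndPower().solve_power(
    effect_size=0.5,         # minimum difference worth detecting, in SD units
    alpha=0.05,              # Type I error rate
    power=0.8,               # 1 - beta
    alternative='two-sided',
)
print(f"Required sample size per group: {n_per_group:.1f}")  # ~63.8
```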
Effect Size
The actual effect size in an experiment is rarely known beforehand, and neither is the variance in the data. Both are usually approximations informed by historical or pilot study data, which may or may not reflect the outcomes of the proposed experiment. When specifying an effect size for sample size calculations, it is critical that the value is set at the lower end of what would be considered scientifically important, because it determines the minimum difference that can be reliably detected with that sample size. For example, if the sample size is calculated to detect a difference of 2 standard deviations, that n will not be sufficient to detect any smaller effect with confidence.
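To make the dependence on effect size concrete, the commonly used normal approximation for comparing two group means, with the effect expressed as a standardized difference d (i.e. in standard deviation units), is:

n per group ≈ 2 × [z(1−α/2) + z(1−β)]² / d²

With α = 0.05 and power = 0.8, this gives 2 × (1.96 + 0.84)² ≈ 15.7, so n ≈ 15.7/d²: roughly 4 subjects per group for d = 2, but roughly 63 per group for d = 0.5. Because n scales with 1/d², halving the minimum detectable difference quadruples the required sample size.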
α and Power
A second critical factor is determining the appropriate levels for α and power (1-β). To a non-statistician, these values are a frequent source of confusion1.
The Type I error rate (α) is the easiest to grasp: it is the false positive rate and corresponds to the p value threshold used in statistical hypothesis testing. The 'standard' α value of 0.05 means that, when there is truly no difference between groups, there is still a 1 in 20 chance of detecting one by chance alone. As such, α and p values are easily misunderstood; they only support, but cannot prove, that two groups are different, and they are susceptible to bias.
As an example of such bias, suppose twenty research groups around the world are testing the same hypothesis: Drug A causes Effect B. At an α level of 0.05, there is a good chance that at least one of these groups will produce data appearing to show that Drug A causes Effect B, even if no such effect exists. Given that positive findings are more readily published than negative findings (i.e. publication/reporting bias), the effect may then be reported as real in this hypothetical situation (even if the other 19 groups failed to detect an effect).
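The probability at work here is easy to quantify: assuming the tests are independent and the null hypothesis is true for all of them, the chance that at least one of k groups obtains a false positive is

1 − (1 − α)^k = 1 − 0.95^20 ≈ 0.64 for k = 20 and α = 0.05

so there is roughly a 64% chance that at least one of the twenty groups reports a 'significant' effect of Drug A purely by chance.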
This statistical reality emphasizes the importance of reproducing studies and reporting negative results. Furthermore, some argue that a p value threshold of 0.01 or lower may be more appropriate than the 0.05 standard.
The power of an experiment (1-β) is related to, but distinct from, α. It is the probability of detecting a true positive (rejecting the null hypothesis when the alternative hypothesis is true). A higher-powered experiment has a greater chance of detecting an effect if one exists. Generally, power levels are set to 0.8 or higher, with high-risk experiments often using greater power levels (e.g. toxicology studies, in which high confidence in detecting effects is important).
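Conversely, for a fixed sample size the achieved power can be checked directly. Below is a minimal sketch, again using the Python statsmodels package (the effect size, n, and α values are illustrative placeholders):

```python
# Power achieved by a two-sample t test at a fixed sample size
# (effect size, n, and alpha below are illustrative placeholders).
from statsmodels.stats.power import TTestIndPower

achieved = TTestIndPower().power(
    effect_size=0.5,  # standardized difference (Cohen's d)
    nobs1=30,         # sample size per group (equal groups assumed)
    alpha=0.05,       # Type I error rate
)
print(f"Power with n = 30 per group: {achieved:.2f}")  # ~0.47, below the usual 0.8 target
```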
Sample Size Calculation Resources