How Many Samples?

Massart,Desire;Smeyers-Verbeke,Johanna;Vander Heyden,Yvan;

July 1, 2005

By Desiré L. Massart
Johanna Smeyers-Verbeke

LCGC Europe

Volume 18

Issue 7

Pages: 390–393

The sample size is determined by three factors: the size of the difference between the means that should be detected, the precision of the methods being compared and the significance levels at which the test is performed.

Chromatographers often have to compare two sets of data. For instance, when validating a new analytical method the analyst may want to compare a reference method with a new method by analysing replicates of the same material, or, preferably, several materials at least once.

To keep it simple, let us suppose one material is analysed n times for each method. The analyst hopes that no significant difference will be found between the mean results obtained with the two methods. In that instance it is concluded that the new method is not subject to bias. Before starting with the experiments, the analyst must decide how many (replicate) samples should be analysed to allow an adequate statistical analysis of the results.

In some instances, published ISO or other standards tell the analyst the number of materials, concentration levels and replicates to be analysed, but, in many other situations, analysts have to, or want to, make this decision themselves.

The question of sample size is much more general than the detection of a possible bias in method validation. It arises whenever an experiment must be performed that will end with a statistical hypothesis test. Suppose the yield of an extraction method is only 90% and a new procedure is developed for which initial tests indicate the yield could be higher. The analyst will then perform a certain number of extractions on the same material(s), followed by a t-test to decide if the yield is significantly higher than with the original method. The question is then how many extractions should be performed to decide whether there is an actual effect. This column will explain what is involved in a sample size calculation and the questions that need to be answered before starting.

Three Factors

The sample size is determined by three factors: the size of the difference between the means that should be detected, the precision of the methods being compared and the significance levels at which the test is performed.

The larger the acceptable difference between the means, δ (in our situation the bias), the fewer samples are required to detect this difference (if it occurs). Suppose the mean result of the reference method is 25.00 (in whatever units) and a bias of less than 0.01 is considered acceptable. Compare this with a situation where a bias of up to 5.00 is considered the limit: it is (much) more difficult to find the small bias of 0.01 than that of 5.00, and it can be expected that more samples will be needed to detect the small bias than the larger one.

This means that to determine the sample size, the analyst must define a priori what the acceptable bias is, before trying to prove that it is not larger. Although this appears good practice, especially from a quality assurance (QA) point of view, we find that, in practice, analysts are often unable, or unwilling, to state what an acceptable bias is in their particular situation.

Precision is used here as a general term for repeatability or reproducibility, estimated by the standard deviation around the mean obtained by measuring replicate samples. In statistical texts the term used is variance (i.e., the square of the standard deviation). It is easier to show that a certain bias is present when the precision is better, meaning that the standard deviation is smaller. Suppose that an analyst wants to show that the new method does not have a bias greater than 1.0. If the reference and the new method both have a standard deviation of 0.01, only a few samples will be needed. If the standard deviation is 10.0, it will be difficult to detect the 1.0 bias, if it exists, and many samples will be needed to do so.
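The effect of precision can be illustrated with a small simulation. The sketch below is not taken from the column: it assumes normally distributed results, the same known standard deviation for both methods, n = 6 replicates per method and a 5% two-sided criterion, and the function name detection_rate is purely illustrative. It counts how often a true bias of 1.0 is flagged as significant.

```python
import random
from statistics import NormalDist, mean


def detection_rate(delta, sigma, n=6, trials=2000, alpha=0.05, seed=1):
    """Fraction of simulated comparisons in which a true bias `delta` is detected.

    Assumption: both methods share the same, known standard deviation `sigma`,
    so a two-sided z-test on the difference of the two means is used.
    """
    rng = random.Random(seed)                      # fixed seed for reproducibility
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)   # about 1.96 for alpha = 0.05
    se = sigma * (2 / n) ** 0.5                    # standard error of the difference
    hits = 0
    for _ in range(trials):
        ref = [rng.gauss(25.0, sigma) for _ in range(n)]          # reference method
        new = [rng.gauss(25.0 + delta, sigma) for _ in range(n)]  # biased new method
        if abs(mean(new) - mean(ref)) / se > z_crit:
            hits += 1
    return hits / trials


if __name__ == "__main__":
    for sigma in (0.01, 1.0, 10.0):
        rate = detection_rate(1.0, sigma)
        print(f"sigma = {sigma:5.2f}: bias of 1.0 detected in {rate:.0%} of runs")
```

With a standard deviation of 0.01 the bias is found in essentially every run, whereas with a standard deviation of 10.0 six replicates detect it only rarely.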

The significance level is a more difficult matter. Two such levels have to be fixed, an α and a β-level.

Errors and Significance Levels of Two Kinds

Scientists must be careful when they accept a new hypothesis. The scientist who finds a difference between the means of data collected on the object of study, before and after applying a certain treatment, wants to be as sure as possible that the difference is meaningful (i.e., that it is not a result of measurement errors). It is, unfortunately, never possible to be 100% sure because of measurement and other random errors. Therefore, when a statistical test is applied, the risk of drawing an erroneous conclusion is computed. This type of error is called the type I error. The type I error, or error of the first kind, is the risk or probability of incorrectly rejecting the null hypothesis, the null hypothesis being that there is no difference between the two means. It should not exceed a certain level, very often 5%. This predetermined level is called the level of significance, α.

For the method validation problem presented in the introduction, the type I error would be to accept that there is a bias when there is not one. Analysts want to avoid this error and also a second one, namely deciding that there is no bias when there is one. In fact, there is always a bias between two methods, but it is hopefully too small to be of practical importance. Therefore, we should amend our earlier statement: the analyst would like to be as sure as possible that there is no bias larger than δ. Wrongly concluding that there is no such bias when one exists is a type II error.

The error of the second kind or type II error is the risk or probability of incorrectly accepting the null hypothesis. The level of significance for the type II error is called β. The type II error is the risk that a given bias δ will not be detected, although such a bias (or a larger one) exists. To accept that the bias is smaller than δ, the calculated risk of not detecting such a bias should not exceed β. β is fixed by the analyst after α has been set and has meaning only for a given δ.

In summary, there are two types of errors to consider when making a sample size calculation: wrongly concluding that there is a bias and wrongly concluding that there is no bias larger than δ. The more conservative the analyst is (i.e., the smaller the values of α and β), the less likely wrong decisions are, but also the higher the required sample size, n.
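To make the interplay of δ, the standard deviation, α and β concrete, the sketch below uses the standard normal-approximation formula n = 2 (z(1 – α/2) + z(1 – β))² σ²/δ² for a two-sided comparison of two means. This formula is not given in the column; it assumes that σ is known and equal for both methods, and the function name required_n is illustrative only.

```python
from math import ceil
from statistics import NormalDist


def required_n(delta, sigma, alpha=0.05, beta=0.05):
    """Replicates per method needed to detect a difference `delta`.

    Normal approximation for a two-sided test on two means, assuming the
    standard deviation `sigma` is known and the same for both methods.
    """
    z = NormalDist()
    z_alpha = z.inv_cdf(1 - alpha / 2)  # guards against the type I error
    z_beta = z.inv_cdf(1 - beta)        # guards against the type II error
    return ceil(2 * ((z_alpha + z_beta) * sigma / delta) ** 2)


if __name__ == "__main__":
    print("baseline       :", required_n(delta=1.0, sigma=0.5))
    print("smaller delta  :", required_n(delta=0.5, sigma=0.5))
    print("larger sigma   :", required_n(delta=1.0, sigma=1.0))
    print("less strict beta:", required_n(delta=1.0, sigma=0.5, beta=0.10))
```

Halving δ or doubling σ roughly quadruples n, while relaxing β from 0.05 to 0.10 lowers it. When the standard deviation is estimated from the data rather than known, more replicates are needed than this approximation suggests.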

Sample Size for Comparing Two Means

We will not go into the sometimes complex equations, but give some results. Figure 1 shows n, the required sample size, as a function of (a) λ = δ/(σ√2) and (b) λ = δ/(s√2). It is supposed here that the standard deviation is not significantly different for the two means (i.e., in our situation for the two methods). The symbol s is used when the standard deviation is estimated from the experiments and is, therefore, known only within a relatively large margin of error. When the standard deviation is known, or estimated with enough samples (20 or more) to consider it well enough known, the symbol used is σ.

Figure 1: The required sample size with α = 0.05 for the comparison of two means as a function of λ = δ/(σ√2) (a) and λ = δ/(s√2) (b).

The graphs are given for α = 0.05 and for two values of β. Let us compute the precision needed when the sample size n = 6, the number of replicates often used by analytical chemists when comparing methods. For β = 0.05 it is found that λ = 1.45 when the standard deviation is known and λ = 1.85 when it is estimated. In other words, to detect a difference δ = 1 or more between the two methods with n = 6 and α = β = 0.05, s should not be larger than 1/(1.85 × √2) = 0.38. When the standard deviation is known, a less precise method suffices: σ should then not be larger than 1/(1.45 × √2) = 0.49. It should be noted that while the value 0.05 is generally used for α, there is no guidance for β, and larger β values, such as β = 0.1, are often applied. For the latter value of β, σ should not be larger than 0.54. The larger β (i.e., the less conservative the test), the less precise the methods need to be.
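The arithmetic above can be checked in a few lines. In the sketch below, max_sd simply inverts λ = δ/(sd × √2) for the λ values read from Figure 1. The second function, lambda_known_sigma, is an assumption added for comparison: it is the usual normal approximation for a known σ, not the method used to draw the ISO graphs. Both function names are illustrative.

```python
from math import sqrt
from statistics import NormalDist


def max_sd(delta, lam):
    """Largest standard deviation allowed by lambda = delta / (sd * sqrt(2))."""
    return delta / (lam * sqrt(2))


def lambda_known_sigma(n, alpha=0.05, beta=0.05):
    """Normal-approximation lambda for a known sigma (an assumption, see text)."""
    z = NormalDist()
    return (z.inv_cdf(1 - alpha / 2) + z.inv_cdf(1 - beta)) / sqrt(n)


if __name__ == "__main__":
    # Lambda values read from Figure 1 for n = 6, alpha = beta = 0.05
    print(round(max_sd(1.0, 1.85), 2))  # estimated s -> 0.38
    print(round(max_sd(1.0, 1.45), 2))  # known sigma -> 0.49
    # Approximate lambda for known sigma, to compare with the value read from the graph
    print(round(lambda_known_sigma(6), 2))
```

The approximation gives a λ close to the 1.45 read from the graph for known σ. It cannot reproduce the 1.85 for an estimated s, which requires the t-distribution.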

Many more graphs can be found in ISO 3494-1976, not only for the comparison of two means, but also for the comparison of one mean with a given value and for the comparison of two variances.¹ Unfortunately, they do not cover all situations of interest (e.g., the situation where the precisions of the two methods are not equal, or where the standard deviation is known for one method and estimated for the other).

Sample Size in Method Validation

The literature contains some guidance on how to perform sample size calculations for the detection of bias and the comparison of precisions. An ISO standard describes how to proceed for the comparison of a new analytical method (ISO calls it an alternative method) with an existing international standard method studied in an interlaboratory fashion.²

In reference 3 the situation is considered where an individual laboratory wishes to compare a new method with an older, in-house validated method. For more complex validation designs, requiring analysis of variance (ANOVA) for the statistical computations, for example, no specific guidance is available.

Power Analysis

Sample size calculations can be performed for all hypothesis tests, such as the F-test and ANOVA. They are often found under the name of power analysis. The power of a test is defined as 1 – (the type II error), and n should be such that the power is not lower than 1 – β. Some software is available, for instance nQueryAdvisor®, which allows the sample size for several statistical tests to be computed.⁴

There is also specific software to calculate the power for clinical studies (e.g., a group of patients receives a placebo, a second group receives a certain dose of a new drug and a third group another, higher dose). The question is how many patients should be included in the studies to find, with a given level of significance, that the new drug has an effect (e.g., lowering blood pressure by a given amount).

Power analysis is also applied to compare experimental designs and to decide which of them has the highest power (i.e., is the most likely to indicate an effect, if such an effect exists). As far as we know there is no commercially available software directed at the specific needs of analysts.
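For the simple comparison of two means, the power can be sketched without specialised software. The example below is an illustration under stated assumptions, not a substitute for the packages mentioned above: it assumes a two-sided test, normally distributed results and a known standard deviation σ common to both methods. The names power_two_means and smallest_n are hypothetical.

```python
from math import sqrt
from statistics import NormalDist


def power_two_means(n, delta, sigma, alpha=0.05):
    """Power (1 - type II error) of a two-sided z-test comparing two means.

    Assumes n replicates per method and a known, common standard deviation.
    """
    z = NormalDist()
    z_crit = z.inv_cdf(1 - alpha / 2)
    shift = delta / (sigma * sqrt(2 / n))  # true difference in standard-error units
    # Probability that the test statistic falls in either rejection region
    return z.cdf(shift - z_crit) + z.cdf(-shift - z_crit)


def smallest_n(delta, sigma, alpha=0.05, beta=0.05, n_max=10_000):
    """Smallest n per method for which the power reaches 1 - beta."""
    for n in range(2, n_max + 1):
        if power_two_means(n, delta, sigma, alpha) >= 1 - beta:
            return n
    raise ValueError("required power not reached within n_max replicates")


if __name__ == "__main__":
    for n in (3, 6, 12):
        print(f"n = {n:2d}: power = {power_two_means(n, delta=1.0, sigma=0.5):.3f}")
    print("smallest n for power 0.95:", smallest_n(delta=1.0, sigma=0.5))
```

When δ = 0 the function returns α, as it should: with no true difference, the only way to reject the null hypothesis is a type I error.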

Conclusion

How many samples should be analysed is one of the most fundamental questions that can (and should) be asked. It is a simple question and it is frustrating, both for the analyst and for the statistician or chemometrician, that the answer is not so simple. To begin with, the analyst must make an a priori decision about two of the three factors mentioned earlier, namely the levels of α and β and the minimum bias δ (for method validation) or minimum effect considered relevant (for other applications). This often proves very difficult in practice. At least an estimate of the third factor, the standard deviation, must also be available. Moreover, when the experimental set-up is more complex than the comparison of two means (or two variances), the calculations are far from evident and the help of a professional statistician may be needed. It should cause no surprise if the statistician is not able to give the answer immediately: specialized knowledge is needed and the statistician may need time to find the answer.

Column editor, Desire Luc Massart, is an emeritus professor at the Vrije Universiteit Brussel, Belgium and performs research on chemometrics in process analysis and its use in the detection of counterfeiting products or illegal manufacturing processes. Johanna (An) Smeyers-Verbeke is a professor at the Vrije Universiteit Brussel and is head of the department of analytical chemistry and pharmaceutical technology. Yvan Vander Heyden is a professor at the same university and heads a research group on chemometrics and separation science.

References

1. ISO 3494-1976 Statistical interpretation of data — power of tests relating to means and variances (1976).

2. ISO 5725-6:1994(E), section 8 (Comparison of alternative methods).

3. S. Kutthatharmmakul, D.L. Massart and J. Smeyers-Verbeke, Chemometrics and Intelligent Laboratory Systems, 52, 61–73 (2000); www.vub.ac.be/fabi/validation/compar/index.html.

4. nQueryAdvisor®, Statistical Solutions Ltd (http://www.statsol.ie).
