Link to the last RSS article here: Resolving A Case of An Expired SAS 8.2 Installation. -- Ed.
Null Hypothesis Significance Testing
By Mike Clark, Research and Statistical Support Services Consultant
This is the beginning of a series discussing methodological approaches used in the social sciences. This article outlines the general problems and difficulties associated with a common method of statistical inference in psychology (my background) and other social science fields: null hypothesis significance testing (hereafter NHST). This introduction can serve as a starting point for researchers who are interested in examining these important issues in further detail. Subsequent articles will discuss alternative inference frameworks such as Bayesian analysis and likelihood estimation.
A Conceptual Overview
Statistical hypothesis testing involves setting up an initial hypothesis and then performing a set of calculations on the data that give us some basis for judging whether that hypothesis should be retained or rejected. A common example in the social sciences is the situation where the researcher is interested in whether the means of various groups differ in a specified population. For example, we may want to see if grade point averages vary across college classification (freshman, sophomore, junior, senior). Following NHST procedures, a hypothesis that we might initially hold is that there is no difference among the groups (i.e. that their means are equal). We then perform our statistical analysis. Our procedures may lead us to say that the data do not provide enough evidence to reject our initial hypothesis, or they may lead us to believe that the initial hypothesis is untenable, in which case we would conclude that the groups differ in the population. Subsequently, more statistical analyses, using similar logic, would be performed to discover specifically which groups differ.
A thought that might occur to many researchers in a discussion of NHST is that they didn’t know there was a problem, or that they were only vaguely aware of viable alternatives to NHST. This is a common result of basic statistics courses: students are not told that there are some subtle difficulties with NHST, and that alternatives might be more appropriate depending on what the researcher is trying to accomplish.
An important issue which is sometimes overlooked involves the practical interpretation of what we are doing. In NHST, we may state a null hypothesis that the difference between population groups is zero, or if we have more information, we may specify some other particular value (or in a single sample case, we may specify that the mean of the population from which the sample was drawn equals a given value). Nonetheless, it is almost impossible for the sample to produce exactly that specified value, no matter how adequately the data are sampled. For example, in a two-group design, our null hypothesis states that there are no population group differences while the alternative hypothesis states that the population groups are not equal. A more formal expression of these hypotheses (null and alternative) regarding the difference between two group means, with $\mu_1$ and $\mu_2$ denoting the two population means, is:
$H_0: \mu_1 - \mu_2 = 0$
$H_1: \mu_1 - \mu_2 \neq 0$
No matter how similar the freshman and sophomore populations look, the odds of the two samples having exactly the same mean GPA, regardless of class sizes, are next to zero, and yet a difference of exactly zero is what our null hypothesis is suggesting. This null hypothesis of no difference in the population is sometimes thought of as a “straw-man” statement, since we know that group samples will reflect some differences to some arbitrary decimal point. Having the observed sample difference, however small, declared statistically significant is then a function of having a large enough sample size (all other things being equal): if statistical significance is needed for the observed sample difference, one only needs to increase the sample sizes until the observed p-value reaches the cut-off criterion for significance.
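The dependence on sample size can be illustrated with a rough sketch. It assumes a two-sample z-test with a known, common standard deviation, and the numbers (a 0.02 grade-point difference, an SD of 0.5) are invented purely for illustration; the function names are likewise hypothetical.

```python
import math

def two_sided_p_from_z(z):
    """Two-sided p-value for a standard normal test statistic."""
    return math.erfc(abs(z) / math.sqrt(2.0))

def p_value_for_fixed_difference(mean_diff, sd, n_per_group):
    """p-value of a two-sample z-test when the observed difference and SD are held fixed."""
    standard_error = sd * math.sqrt(2.0 / n_per_group)
    return two_sided_p_from_z(mean_diff / standard_error)

# Hypothetical values: the same tiny observed difference at growing sample sizes.
for n in (100, 1_000, 10_000, 100_000):
    p = p_value_for_fixed_difference(mean_diff=0.02, sd=0.5, n_per_group=n)
    print(f"n per group = {n:>7}: p = {p:.4f}")
```

Under these assumptions the same observed difference moves from clearly non-significant to significant at the .05 level somewhere between 1,000 and 10,000 cases per group, even though nothing about the size of the difference has changed.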
Another source of confusion is related to the interpretation of NHST analysis results. Common sense would suggest that we are trying to determine the viability of a hypothesis. In other words, we want to know the probability that a hypothesis is true given the data at hand [p(H|D), the probability of a hypothesis given the data]. NHST, however, actually involves a different conditional statement. We are not looking for the probability of the hypothesis tested but rather the probability of the data if some hypothesis (the null hypothesis) were true [p(D|H), the probability of the data given some hypothesis]. The logic of NHST is that if the probability of the data given the null hypothesis is low enough, we might start thinking the data come from a world in which the null hypothesis is not true. Consequently, we reject the null hypothesis as a believable description of the population, and decide to believe an alternative explanation of events. Unfortunately, many researchers make the mistake of thinking that a failure to reject the null hypothesis has provided a probability that the null hypothesis is true. Researchers may say: “a failure to reject the null hypothesis means that my groups are equal within some specified probability”. However, this is a conditional statement that NHST does not evaluate.
Another misunderstood issue is interpreting the observed p-value in a valid way and choosing a corresponding cut-off value for it. For some researchers, there is a rigid adherence to p = .05 (or some other value, e.g. p = .01) as a cutoff point for significance. In other words, if the probability of the data under the null hypothesis is .045, these researchers will reject the null hypothesis. However, if the probability value is .055 (slightly above the cut-off), many researchers may not even discuss the result, or at best will give it a lower-class status of significance (i.e. “marginal significance”). However, the decision whether to accept or reject the null hypothesis is inherently a subjective one, despite many interpretations to the contrary. To conclude that a result is “marginally” or “highly” significant is nonsensical. A statistical result is or isn’t statistically significant depending on the researcher’s point of view, and regardless of the p-value obtained. But what exactly does this p-value represent?
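One way to see that p(D|H) and p(H|D) are different quantities is to work through Bayes’ theorem with made-up inputs. The sketch below is only an illustration: it treats “the test rejected the null” as the data, and the prior probabilities, the alpha level, and the power are all assumptions chosen for the example rather than values from any study.

```python
def prob_null_given_rejection(prior_null, alpha, power):
    """P(null is true | the test rejected it), computed with Bayes' theorem.

    alpha is P(reject | null true); power is P(reject | null false).
    """
    p_reject_and_null = alpha * prior_null
    p_reject_and_alt = power * (1.0 - prior_null)
    return p_reject_and_null / (p_reject_and_null + p_reject_and_alt)

# Hypothetical priors: the null is as likely as not, or the null is very likely.
for prior in (0.5, 0.9):
    posterior = prob_null_given_rejection(prior, alpha=0.05, power=0.5)
    print(f"prior P(H0) = {prior:.1f} -> P(H0 | rejection) = {posterior:.3f}")
```

With these invented inputs the probability of rejecting a true null is .05 in both cases, yet the probability that the null is true after a rejection comes out near .09 with the first prior and near .47 with the second. The two conditional probabilities answer different questions.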
P-Values and Error Rates
As mentioned previously, the observed p-value doesn’t represent the probability of the null hypothesis. Furthermore, the p-value also doesn’t tell us about the likelihood of any alternative hypothesis. Even if we had a specific alternative hypothesis, the p-value obtained with NHST only deals with the null hypothesis distribution of values (and a hypothetical one at that – e.g. can we really obtain a random sample from the population of all kids with ADHD?).
Historically, there have been at least two ways to approach “statistical significance”. In much of the social sciences, these two approaches are blended together in an almost incoherent fashion, and this hybrid has been promulgated in methodology texts. Fisher, a developer of the NHST methodology, even seemed to change his mind at one point as to how to interpret an NHST p-value. Fisher’s stance was that the observed p-value in NHST reflected our confidence in the null hypothesis. However, we already know this is a problematic interpretation, in the sense that the p-value is attached to the data (D), not to the hypothesis (H). Fisher also made no claims about an alternative hypothesis.
Neyman and Pearson, also developers of NHST, disagreed with Fisher’s approach. Their approach was to specify an acceptable significance level before the experiment was conducted, and they introduced the alpha (α) cut-off level, or Type I error rate (along with the concepts of Type II error rate, power, and the alternative hypothesis). In the Neyman and Pearson approach, a researcher should, before data analysis, specify the probability of making a Type I error (the probability of incorrectly rejecting the null hypothesis when it is actually true). This specification will determine decisions about the design of the experiment (e.g. sample size for the experiment).
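To make the link between the chosen error rates and the design concrete, here is a minimal sketch of a textbook sample-size calculation. It assumes a two-sided, two-group comparison and uses the normal approximation (a t-based calculation would give slightly larger numbers); the function name and the effect sizes passed to it are hypothetical.

```python
import math
from statistics import NormalDist

def n_per_group(effect_size_d, alpha=0.05, power=0.80):
    """Approximate cases needed per group to detect a standardized
    mean difference (effect_size_d) with a two-sided two-sample test."""
    z_alpha = NormalDist().inv_cdf(1.0 - alpha / 2.0)  # critical value for the Type I rate
    z_power = NormalDist().inv_cdf(power)              # quantile for 1 - Type II rate
    return math.ceil(2.0 * ((z_alpha + z_power) / effect_size_d) ** 2)

# Hypothetical effect sizes: a medium and a small standardized difference.
for d in (0.5, 0.2):
    print(f"d = {d}: about {n_per_group(d)} per group")
```

The point of the sketch is only that once alpha and the desired power are fixed in advance, the required sample size follows from the effect one hopes to detect, which is the sense in which the specification drives the design.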
Thus, if I set the error rate at 5%, or α = .05, and the null hypothesis is actually true, then conducting the same experiment many times (all things being equal) and performing the corresponding analyses of the data each time, I would expect to reject the null hypothesis incorrectly in about five of every 100 replications. With this approach, it makes no difference whether the obtained p-value is .045 or .001, since we would make the same sort of decision, to reject the null, as long as our test statistic (e.g. observed t-value) falls beyond our specified cutoff point (critical value). In fact, the reporting of a specific p-value makes no sense in this approach: our statistic either makes the cut or it doesn’t, based on our chosen alpha level.
The drawback with the Neyman-Pearson approach is that though we do have an idea as to a hypothetical long-run situation of events, we are at a loss as to where our particular scenario resides within that hypothetically infinite number of random samples and analyses. In other words, we’ve rejected the null, but we’ll never really know if this is the time that we’ve made the Type I error.
What if the analysis does not allow a rejection of the null hypothesis? What does that mean? Fisher thought that it meant we weren’t trying hard enough. Essentially, since we can’t prove any hypothesis, only falsify it (e.g. in a Popperian sense), conclusions can’t be drawn from a non-significant p-value. In other words, no matter how many white swans I see I can never prove that no black one exists, so if I don’t see a black one I must keep looking. Neyman and Pearson, by contrast, took a practical stance regarding this procedure. If we don’t reach our cut-off, then according to the rules we’ve laid out, we act as though the null hypothesis were true (i.e. decide one course of action rather than another).
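The long-run reading of the alpha level can be checked with a small simulation. This is a minimal sketch under stated assumptions: two groups drawn from the same normal population (so the null hypothesis is true in every replication), a z-test with a known standard deviation of 1, and arbitrary choices for the number of replications, group size, and random seed.

```python
import math
import random

CRITICAL_Z = 1.959964  # two-sided cut-off for alpha = .05

def simulate_type_i_rate(n_experiments=2000, n_per_group=30, seed=1):
    """Share of replications that reject a null hypothesis that is true."""
    rng = random.Random(seed)
    standard_error = math.sqrt(2.0 / n_per_group)  # known SD of 1 in both groups
    rejections = 0
    for _ in range(n_experiments):
        group_a = [rng.gauss(0.0, 1.0) for _ in range(n_per_group)]
        group_b = [rng.gauss(0.0, 1.0) for _ in range(n_per_group)]
        mean_diff = sum(group_a) / n_per_group - sum(group_b) / n_per_group
        if abs(mean_diff / standard_error) > CRITICAL_Z:
            rejections += 1  # a Type I error, since both groups share one population
    return rejections / n_experiments

print(f"Proportion of false rejections: {simulate_type_i_rate():.3f}")
```

The proportion should land close to .05, but the simulation says nothing about whether any single rejection within the run was one of the errors.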
There are instances, however, where researchers using the N-P method will use Fisherian phrases like ‘fail to reject the null’. In many journal articles and textbooks, researchers blend the two interpretations of NHST: the epistemic approach of Fisher, a procedure that tells us about the falsehood of a nil hypothesis, and the behavioristic Neyman-Pearson approach, which allows for making decisions but does not really infer anything. These researchers will often specify an alpha level (cut-off level) and then interpret the p-value in the Fisherian sense. In fact, some researchers will erroneously interpret the p-value as a kind of effect size, or strength of the finding (correlation, difference of means etc.), such that a p-value of .005 is taken to represent a stronger result than .035.
The crux of the matter is that researchers sometimes pay little attention to what the results mean practically before moving on to the next set of analyses. Poor research design in the social sciences often makes it difficult to detect important phenomena from study to study (e.g. low sample size leading to low statistical power). Additionally, practices like rigid adherence to cut-off values despite inadequate sample size contribute to a lack of replicability of important phenomena in the literature. Furthermore, editorial practices that tend to publish only statistically significant results (publication bias) have also led to spurious findings being reinforced in the literature as non-chance findings.
Poor methodological practice in the social sciences encourages finding a significant result for data rather than approaching data with a thoughtful, problem-solving attitude. Researchers who find themselves worrying about obtaining an observed p < .05 will often design their studies to ensure that such a result is found, and those studies will often be based on questions that are not all that interesting, with results that may be largely unenlightening. By contrast, confirmatory approaches combined with exploratory approaches (techniques that allow the data to speak for themselves) are flexible in the face of contradictory evidence, and assume enough competency on the part of those who will be interested in the results to make decisions about the data for themselves.
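A short sketch shows why a p-value cannot stand in for an effect size. It assumes a two-sample z-test on a standardized mean difference (Cohen’s d) with equal group sizes; the two “studies” and their numbers are invented for the illustration.

```python
import math

def z_test_p_value(cohens_d, n_per_group):
    """Two-sided p-value for a standardized mean difference under a z-test."""
    z = cohens_d * math.sqrt(n_per_group / 2.0)
    return math.erfc(abs(z) / math.sqrt(2.0))

# Hypothetical studies: (standardized effect size, cases per group).
studies = {
    "small study, large effect": (0.80, 10),
    "large study, tiny effect": (0.10, 2000),
}

for label, (d, n) in studies.items():
    print(f"{label}: d = {d:.2f}, n = {n}, p = {z_test_p_value(d, n):.4f}")
```

Under these assumptions the study with the much smaller effect produces the much smaller p-value, simply because it has far more cases. Ranking findings by p-value would therefore rank these two in the wrong order of strength.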
What To Do?
A first step toward good statistical inference would be to recognize that the process of data analysis is more subjective than it was previously presumed to be. Researchers must make decisions every step of the way: interpreting previous results, formulating hypotheses, designing potential experiments, analyzing results, and deciding what is important to investigate further. As statistical methodology is a major tool that researchers use to study the data collected, researchers must be thoughtful in their approach and decision-making with regard to how they proceed at each stage of the analytical process. Decisions will have to be made on the part of the experimenter, and the researcher would be advised to be flexible, cautious, open-minded, and to use modern methods appropriate to the analysis situation.
Significance testing is problematic, much more so than has been discussed here, and the reader is invited to look into some of the references provided below. Much of the problem seems to stem from a misunderstanding of the results. With a more careful approach and a basic understanding of the origins of the analyses we are conducting, NHST can provide much insight into the constructs social scientists concern themselves with.
Informative Web Links