A hypothesis test is a statistical procedure for checking whether a statement is true or false based on a sample of values from one or more random variables, usually measured empirically or simulated. More precisely, it lets us quantify how confident we are that the statement is true or false. Hypothesis tests can take many forms and test many different statements. They are closely related to confidence intervals.
Introduction
To start, you need a random sample of one or more random variables. To perform a hypothesis test, we introduce a test statistic $T$, which is a statistic defined on the sample. $T$ follows some probability distribution $f(t)$. However, it only does so under specific circumstances. These circumstances are defined as a set of mathematical constraints that need to be met for the test to succeed. One outcome of the test is accepting the null hypothesis $H_0$, a statement that says that the effect or law does not hold. In other words, "accepting the null hypothesis" is a complicated way of saying "we have no good evidence that the effect we're trying to prove exists; any apparent correlation is probably just random chance". This does not necessarily mean that the effect doesn't exist: it just means that our evidence can't prove that it does.
A clarification: the word "effect" should be taken very broadly. Hypotheses can take a lot of different shapes, so the "effect" can be literal, like improving the rate of recovery after taking a specific kind of medicine, or figurative, like a parameter having the value that we want or following a different distribution. It really depends on how the test is interpreted.[1] Accepting the null hypothesis means that the effect we have in mind is, according to our evidence, probably nonexistent. Perhaps the medicine does nothing to improve healing or the parameter has a different value from our original claim, but we can't know for sure: we'd need a better sample to know more.
The other possible outcome of the test is rejecting the null hypothesis. In this case, we say that the effect does hold and we instead accept the alternative hypothesis $H_1$. This is an alternative statement that says that the effect exists and gives insight into how it works; perhaps the medicine does work and we can quantify how well, or the parameter does have the value we expect.
How do we determine the outcome of the test? By definition, the null hypothesis $H_0$ is rejected if the value of the test statistic $T$ calculated on the sample lies in the critical region (or rejection region) of $T$'s distribution, $f(t)$. This region is an interval of the domain of $f$ (the sample space of $T$) and is determined by introducing a new parameter called the significance level $\alpha$ of the test:

$$\alpha = \int_{t_\alpha}^{+\infty} f(t) \, dt$$

Here we chose the critical region to be $[t_\alpha, +\infty)$, where $t_\alpha$ is some value in $T$'s sample space. The significance level is an integral of a PDF, so it represents a probability; "a probability of what?", you may ask.
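To make the link between the significance level and the critical region concrete, here is a minimal sketch. It assumes, purely for illustration, that the test statistic follows a standard normal distribution when the null hypothesis holds; the text above does not fix any particular distribution. The function names are hypothetical. It uses `statistics.NormalDist` from the Python standard library to go from a chosen significance level to the threshold of an upper-tail critical region, and back from a chosen threshold to the significance level it implies.

```python
from statistics import NormalDist

# Assumption for illustration: under the null hypothesis the test
# statistic follows a standard normal distribution.
NULL_DIST = NormalDist(mu=0.0, sigma=1.0)


def critical_value(alpha: float) -> float:
    """Threshold whose upper tail [threshold, +inf) has area alpha."""
    if not 0.0 < alpha < 1.0:
        raise ValueError("alpha must lie strictly between 0 and 1")
    return NULL_DIST.inv_cdf(1.0 - alpha)


def significance_from_threshold(threshold: float) -> float:
    """Area of the upper tail [threshold, +inf): the alpha a threshold implies."""
    return 1.0 - NULL_DIST.cdf(threshold)


def in_critical_region(t_observed: float, alpha: float) -> bool:
    """True when the observed statistic falls in the upper-tail critical region."""
    return t_observed >= critical_value(alpha)


if __name__ == "__main__":
    # Smaller alpha pushes the threshold further into the tail.
    for alpha in (0.10, 0.05, 0.01):
        print(f"alpha = {alpha:.2f} -> threshold = {critical_value(alpha):.3f}")
```

Under this assumption, shrinking the significance level moves the threshold further out, so only more extreme values of the statistic land in the critical region.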
Well, the interpretation is not that simple, for $\alpha$ is arbitrarily chosen.[2] Formally, if $f$ is the distribution of the test statistic $T$ when the null hypothesis $H_0$ holds, $\alpha$ is the probability that $T$ lands in the critical region by chance alone. Recall that the critical region is needed to determine if $H_0$ is accepted or rejected. If one chooses $\alpha$ to be really small, then the critical region is also really small, but that also means that it's really unlikely that $T$'s value on the sample falls in it. Thus, the smaller we choose $\alpha$ to be, the more extreme the value of $T$ needs to be to fall in the region. That's why we call it "significance level": we can adjust it to determine how likely or unlikely the test is to reject $H_0$, which in turn gives you different information.
A strict test (very low significance level $\alpha$) will give you a lot of confidence, because the only way you can reject the null hypothesis is with a very extreme value of the test statistic $T$. The more extreme the value is, the less likely it's just random chance, so the more confident you are that there's an actual effect going on. However, this comes at the cost of more false negatives. Maybe the effect does exist, but you put such high requirements onto it (very low $\alpha$) that the test fails (you accept the null hypothesis $H_0$) even if the effect exists, because it's not extreme enough to trigger the test's condition. For example, say you're testing a new medication. You claim that it's so effective that it cuts the recovery time in half. You find a test sample of patients, get your data, run the hypothesis test and it returns that the medication does not work. Well, here you'd be wrong, for the medication does indeed work, but it only cuts the recovery time by a third. Your test requirement was to cut it in half, however, so the test says it doesn't do that. The test is right, but you ought to be careful not to misinterpret it. Your question wasn't "does the medicine work?", it was "does the medicine work really well?". The test (correctly) says no and you should not mistake that for saying that the medicine doesn't work at all. It does, but you can't know that, because that's not what you asked when you set a very low $\alpha$.
Conversely, a loose test (high significance level $\alpha$) is more likely to reject the null hypothesis $H_0$ and give you a positive outcome, but you lose out on confidence. The bigger you make the critical region, the more prone the test is to false positives, as very low-confidence tests could be passed by sheer chance and random fluctuations.[3]
Basically, choosing $\alpha$ means choosing how demanding you want the test to be: a low $\alpha$ gives you more confident results, but you risk false negatives; a high $\alpha$ makes it more likely you get a positive outcome, but you risk false positives. Which $\alpha$ is right for your test depends on the context, the benefits of passing the test and the harm done by a false positive or negative. For more discussion on false positives and negatives, you can read Binary classification > Assessment. If you have no particular requirements, a value of $\alpha = 0.05$ is considered standard.
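The trade-off can be put into numbers with a small sketch. Everything specific here is an assumption made for illustration, not something taken from the text: a single-tail test on the mean of normally distributed data with known standard deviation `sigma`, a sample of size `n`, and a true effect that shifts the mean by `effect`. The function name is hypothetical. In that setting the false positive rate equals the significance level, while the false negative rate has a closed form that can be evaluated with the standard library.

```python
from math import sqrt
from statistics import NormalDist

STD_NORMAL = NormalDist()  # standard normal: mean 0, standard deviation 1


def false_negative_rate(alpha: float, effect: float, sigma: float, n: int) -> float:
    """Probability of accepting the null hypothesis although the mean is shifted.

    Assumed model: upper-tail z-test on the sample mean with known sigma.
    """
    # Threshold of the upper-tail critical region for the z statistic.
    z_alpha = STD_NORMAL.inv_cdf(1.0 - alpha)
    # With a true shift, the z statistic is normal with this mean and unit variance.
    shift = effect * sqrt(n) / sigma
    # A false negative happens when the statistic stays below the threshold.
    return STD_NORMAL.cdf(z_alpha - shift)


if __name__ == "__main__":
    print("alpha    false positive rate    false negative rate")
    for alpha in (0.001, 0.01, 0.05, 0.20):
        beta = false_negative_rate(alpha, effect=0.3, sigma=1.0, n=25)
        print(f"{alpha:<8} {alpha:<22} {beta:.3f}")
```

Reading the printed table from top to bottom, the false positive rate grows with the significance level while the false negative rate shrinks, which is the trade-off described above.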
So what happens when we complete the test? If the null hypothesis is accepted, we must understand that we don't have enough evidence to claim that the effect exists (or doesn't). If it is rejected, we claim that the effect does exist and that it behaves according to the alternative hypothesis. The reliability of these outcomes is expressed by the significance level.
This is the general idea of a hypothesis test: collect a sample, choose a hypothesis you want to prove, a test statistic to work with and a significance level to determine confidence, then see if the test statistic falls in the critical region. Based on that, determine which hypothesis you accept and go on from there.
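The procedure just described can be sketched end to end. This is one possible instance under stated assumptions, not the only way to run a test: the statistic is the z statistic of the sample mean, the data are assumed normal with known standard deviation `sigma`, the null hypothesis is "the true mean equals `mu0`" and the alternative is "the true mean is larger". The names `ZTestOutcome` and `one_sided_z_test` are hypothetical, and the sample values are made up for illustration.

```python
from dataclasses import dataclass
from math import sqrt
from statistics import NormalDist, fmean


@dataclass
class ZTestOutcome:
    statistic: float    # value of the test statistic on the sample
    threshold: float    # lower bound of the upper-tail critical region
    reject_null: bool   # True when the statistic lies in the critical region


def one_sided_z_test(sample, mu0: float, sigma: float, alpha: float = 0.05) -> ZTestOutcome:
    """Upper-tail z-test of 'mean == mu0' against 'mean > mu0' with known sigma."""
    n = len(sample)
    if n == 0:
        raise ValueError("sample must not be empty")
    # Test statistic: standardized distance of the sample mean from mu0.
    z = (fmean(sample) - mu0) / (sigma / sqrt(n))
    # Critical region [threshold, +inf) with tail area alpha under the null.
    threshold = NormalDist().inv_cdf(1.0 - alpha)
    return ZTestOutcome(statistic=z, threshold=threshold, reject_null=z >= threshold)


if __name__ == "__main__":
    # Made-up measurements, for illustration only.
    sample = [10.9, 11.4, 10.2, 11.8, 10.7, 11.1, 11.6, 10.5]
    print(one_sided_z_test(sample, mu0=10.0, sigma=1.0, alpha=0.05))
```

Running the same sample with a much stricter significance level raises the threshold, so the same data can go from rejecting the null hypothesis to accepting it.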
Now, this introduction outlined a specific kind of test: a so-called single tail, fixed significance level test using the critical region interpretation. "Critical region interpretation" because we're interpreting the outcome using the critical region outlined above; "fixed significance level" because we choose the significance level $\alpha$ as an arbitrary constant; "single tail" because we define the critical region on just one extreme of the test statistic's distribution. This is just one of many ways of doing tests, however. There are different interpretations, like the common $p$-value interpretation, that can be used to reach similar conclusions in slightly different ways. Similarly, there are other definitions, like double tail tests, that provide different properties for different scenarios. But at heart, all tests follow the same logic:
- Collect a sample
- Determine your null and alternative hypotheses
- Choose a test statistic
- Calculate the test statistic on the sample
- Accept or reject the null hypothesis based on some factor like significance level
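The last step of the list can also be carried out with a $p$-value instead of a critical region. The $p$-value is the probability, when the null hypothesis holds, of seeing a statistic at least as extreme as the observed one, and the null hypothesis is rejected when that probability does not exceed the significance level. The sketch below assumes, for illustration only, a statistic that is standard normal under the null hypothesis; the function names are hypothetical. It covers both a single (upper) tail and a double tail.

```python
from statistics import NormalDist

# Assumption for illustration: the statistic is standard normal under the null.
STD_NORMAL = NormalDist()


def p_value_upper_tail(z: float) -> float:
    """Probability under the null of a statistic at least as large as z."""
    return 1.0 - STD_NORMAL.cdf(z)


def p_value_two_tailed(z: float) -> float:
    """Probability under the null of a statistic at least as far from 0 as z."""
    return 2.0 * (1.0 - STD_NORMAL.cdf(abs(z)))


def reject_null(p_value: float, alpha: float = 0.05) -> bool:
    """Reject the null hypothesis when the p-value does not exceed alpha."""
    return p_value <= alpha


if __name__ == "__main__":
    z_observed = 2.1  # hypothetical observed statistic
    for name, p in (("upper tail", p_value_upper_tail(z_observed)),
                    ("two tails", p_value_two_tailed(z_observed))):
        print(f"{name}: p = {p:.4f}, reject null at 0.05: {reject_null(p)}")
```

For the single-tail case this gives the same decision as checking whether the statistic lies in the upper-tail critical region; the double-tail version splits the significance level across both extremes, so the same observed value yields a larger $p$-value.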
Different tests work better with different kinds of data. Given the choice, you should always strive to pick the test that maximizes the chance of true negatives (accepting the null hypothesis $H_0$ when the effect doesn't exist) and true positives (accepting the alternative hypothesis $H_1$ when the effect exists) on your specific data.
Footnotes
[1] Though in the end, it all comes back down to a mathematical statement.
[2] Actually, it doesn't have to be. Choosing the significance level $\alpha$ by hand is called a priori significance. You could instead do a posteriori significance by choosing the bounds of the critical region (i.e., $t_\alpha$) and seeing what that integrates to.
[3] As an aside, the probability that a test returns a true positive, that is, rejects the null hypothesis $H_0$ when the effect really exists, is called the power of the test. It's a metric that's often used in medical or industrial studies to determine how large the sample size should be.