Hypothesis test


A hypothesis test is a statistical procedure for checking whether a statement is true or false based on a sample of values from one or more random variables, usually empirically measured or simulated. More precisely, it allows us to quantify how confident we are that the statement is true or false. Hypothesis tests can take many forms and test many different statements. They are strongly related to confidence intervals.

Introduction

To start, you need a random sample $\{ X_{1},\ldots,X_{N} \}$ of $N$ random variables. In order to perform a hypothesis test, we need to introduce a test statistic, which is a statistic $T=T(X_{1},\ldots,X_{N})$ defined on the sample. $T$ follows some probability distribution $\phi_{0}(t)$. However, it only does so under specific circumstances. These circumstances are defined as a set of mathematical constraints; together they form the null hypothesis $H_{0}$, a statement that says that the effect or law does not hold. One outcome of the test is accepting the null hypothesis. In other words, "accepting the null hypothesis" is a complicated way of saying "we have no good evidence that the effect we're trying to prove exists; any apparent correlation is probably just random chance". This does not necessarily mean that the effect doesn't exist: it just means that our evidence can't prove that it does.
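To make the idea of a test statistic concrete, here is a minimal sketch that assumes a normally distributed sample with known standard deviation and uses the z-statistic as $T$. The choice of statistic, the function names and all numbers are illustrative assumptions, not something fixed by the text above. The simulation shows that $T$ behaves like a standard normal $\phi_{0}$ only when $H_{0}$ (here: "the true mean equals `mu0`") actually holds.

```python
import random
import statistics
from math import sqrt

def z_statistic(sample, mu0, sigma):
    """T = (sample mean - mu0) / (sigma / sqrt(N)); sigma is assumed known."""
    n = len(sample)
    return (statistics.fmean(sample) - mu0) / (sigma / sqrt(n))

def simulate_T(true_mu, mu0=0.0, sigma=1.0, n=25, runs=5000, seed=0):
    """Draw `runs` samples of size n from N(true_mu, sigma^2) and return T for each."""
    rng = random.Random(seed)
    return [
        z_statistic([rng.gauss(true_mu, sigma) for _ in range(n)], mu0, sigma)
        for _ in range(runs)
    ]

t_null = simulate_T(true_mu=0.0)   # H0 holds: T should look like N(0, 1)
t_shift = simulate_T(true_mu=0.5)  # H0 violated: T is centered near 0.5 * sqrt(25) = 2.5

print("under H0:  mean", statistics.fmean(t_null), "stdev", statistics.stdev(t_null))
print("shifted:   mean", statistics.fmean(t_shift), "stdev", statistics.stdev(t_shift))
```

When the true mean differs from `mu0`, the values of $T$ drift away from the bulk of $\phi_{0}$, which is exactly what a test tries to detect.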

A clarification: the word "effect" should be taken very broadly. Hypotheses can take a lot of different shapes, so the "effect" can be literal, like improving the rate of recovery after taking a specific kind of medicine, or figurative, like a parameter having the value that we want or $T$ following a different distribution $\phi_{1}$. It really depends on how the test is interpreted.[1] Accepting the null hypothesis means that the effect we have in mind is, according to our evidence, probably nonexistent. Perhaps the medicine does nothing to improve healing or the parameter has a different value from our original claim, but we can't know for sure: we'd need a better sample to know more.

The other possible outcome of the test is rejecting the null hypothesis. In this case, we say that the effect does hold and we instead accept the alternative hypothesis $H_{1}$. This is an alternative statement that says that the effect exists and gives insight into how it works; perhaps the medicine does work and we can quantify how well, or the parameter does have the value we expect.

How do we determine the outcome of the test? By definition, the null hypothesis is rejected if the value of $T$ calculated on the sample lies in the critical region (or rejection region) of $T$'s distribution, $\phi_{0}$. This region is an interval of the domain of $\phi_{0}$ (the sample space of $T$) and is determined by introducing a new parameter $\alpha \in [0,1]$ called the significance level of the test:

$$\alpha=\int_{t_{\alpha}}^{\infty} \phi_{0}(t) \, dt$$

Here we chose the critical region to be $[t_{\alpha},+\infty[$, where $t_{\alpha}$ is some value in $T$'s sample space. The significance level is an integral of a PDF, so it represents a probability; "a probability of what?", you may ask.

Well, the interpretation is not that simple, for $\alpha$ is arbitrarily chosen.[2] Formally, it is the probability that $T$ lands in the critical region when $H_{0}$ holds, that is, the probability of a false positive. Recall that the critical region is needed to determine if $H_{0}$ is accepted or rejected. If one chooses $\alpha$ to be really small, then the critical region is also really small, but that also means that it's really unlikely that $T$'s value on the sample falls in it. Thus, the smaller we choose $\alpha$ to be, the more extreme the value of $T$ needs to be to fall in the region. That's why we call it "significance level": we can adjust it to determine how likely or unlikely the test is to reject $H_{0}$, which in turn gives you different information.
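As a small illustration of how the critical region shrinks with $\alpha$, the sketch below assumes that $\phi_{0}$ is a standard normal distribution (an assumption made only for this example) and computes the bound $t_{\alpha}$ of the region $[t_{\alpha},+\infty[$ for a few significance levels. The helper name `critical_value` is hypothetical.

```python
from statistics import NormalDist

phi0 = NormalDist(mu=0.0, sigma=1.0)  # assumed null distribution of T

def critical_value(alpha):
    """Return t_alpha such that P(T >= t_alpha) = alpha under phi0."""
    return phi0.inv_cdf(1.0 - alpha)

for alpha in (0.10, 0.05, 0.01, 0.001):
    print(f"alpha = {alpha:<6} critical region = [{critical_value(alpha):.3f}, +inf[")
```

The smaller $\alpha$ is, the further out $t_{\alpha}$ sits, so $T$ has to take a more extreme value before $H_{0}$ is rejected.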

A strict test (very low $\alpha$) will give you a lot of confidence, because the only way you can reject the null hypothesis is with a very extreme value of $T$. The more extreme the value is, the less likely it's just random chance, so the more confident you are that there's an actual effect going on. However, this comes at the cost of more false negatives. Maybe the effect does exist, but you put such high requirements onto it (very low $\alpha$) that the test fails (you accept $H_{0}$) because the observed value is not extreme enough to trigger the test's condition.

For example, say you're testing a new medication. You claim that it's so effective that it cuts the recovery time in half. You find a test sample of patients, get your data, run the hypothesis test and it returns that the medication does not work. Well, here you'd be wrong, for the medication does indeed work, but it only cuts the recovery time by a third. Your test requirement was to cut it in half, however, so the test says it doesn't do that. The test is right, but you ought to be careful not to misinterpret it. Your question wasn't "does the medicine work?", it was "does the medicine work really well?". The test (correctly) says no and you should not mistake that for saying that the medicine doesn't work at all. It does, but this test can't tell you that, because that's not what you asked. Setting a very low $\alpha$ has a similar effect: it raises the bar so high that a real but modest effect may not clear it.

Conversely, a loose test (high $\alpha$) is more likely to reject $H_{0}$ and give you a positive outcome, but you lose out on confidence. The bigger you make the critical region, the more prone the test is to false positives, as very low-confidence tests could be passed by sheer chance and random fluctuations.[3]

Basically, choosing $\alpha$ means choosing how demanding you want the test to be: a low $\alpha$ gives you more confident results, but you risk false negatives; a high $\alpha$ makes it more likely you get a positive outcome, but you risk false positives. Which $\alpha$ is right for your test depends on the context, the benefits of passing the test and the harm done by a false positive or negative. For more discussion on false positives and negatives, you can read Binary classification > Assessment. If you have no particular requirements, a value of $\alpha=5\%$ is considered standard.
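The trade-off can be put into numbers with a rough sketch. It assumes that $T$ is standard normal when the effect does not exist and normal with mean 2 when it does; the shift of 2 and the names `phi1` and `error_rates` are made up purely for illustration.

```python
from statistics import NormalDist

phi0 = NormalDist(0.0, 1.0)  # distribution of T when the effect does not exist (assumed)
phi1 = NormalDist(2.0, 1.0)  # distribution of T when the effect exists (hypothetical shift)

def error_rates(alpha):
    """Return (false positive rate, false negative rate) for the region [t_alpha, +inf[."""
    t_alpha = phi0.inv_cdf(1.0 - alpha)
    false_positive = 1.0 - phi0.cdf(t_alpha)  # reject H0 although it holds; equals alpha
    false_negative = phi1.cdf(t_alpha)        # accept H0 although the effect exists
    return false_positive, false_negative

for alpha in (0.20, 0.05, 0.01):
    fp, fn = error_rates(alpha)
    print(f"alpha = {alpha:<5} false positives = {fp:.3f}  false negatives = {fn:.3f}")
```

Lowering $\alpha$ pushes the false positive rate down and the false negative rate up, and vice versa; how large the false negative rate gets depends entirely on the assumed shift.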

So what happens when we complete the test? If the null hypothesis is accepted, we must understand that we don't have enough evidence to claim that the effect exists, which is not the same as proving that it doesn't. If it is rejected, we claim that the effect does exist and that it behaves according to the alternative hypothesis. The reliability of these outcomes is expressed by the significance level.

This is the general idea of a hypothesis test: collect a sample, choose a hypothesis you want to prove, a test statistic to work with and a significance level to determine confidence, then see if the test statistic falls in the critical region. Based on that, determine which hypothesis you accept and go on from there.

Now, this introduction outlined a specific kind of test: a so-called single tail, fixed significance level test using the critical region interpretation. "Critical region interpretation" because we're interpreting the outcome using the critical region outlined above; "fixed significance level" because we choose $\alpha$ as an arbitrary constant; "single tail" because we define the critical region on just one extreme of $\phi_{0}$. This is just one of many ways of doing tests, however. There are different interpretations, like the common $p$-value interpretation, that can be used to reach similar conclusions in slightly different ways. Similarly, there are other definitions like double tail tests that provide different properties for different scenarios. But at heart, all tests follow the same logic:

  1. Collect a sample
  2. Determine your null and alternative hypotheses
  3. Choose a test statistic
  4. Calculate the test statistic on the sample
  5. Accept or reject the null hypothesis based on some factor like significance level
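The five steps above can be sketched end to end. This is only one possible instance: it assumes a single tail z-test with known standard deviation, a simulated sample, and made-up numbers (true mean 10.8, hypothesized mean 10). The function name `one_sided_z_test` is hypothetical.

```python
import random
from math import sqrt
from statistics import NormalDist, fmean

def one_sided_z_test(sample, mu0, sigma, alpha=0.05):
    """Test H0: mean == mu0 against H1: mean > mu0, with sigma assumed known."""
    # Step 4: calculate the test statistic on the sample.
    t = (fmean(sample) - mu0) / (sigma / sqrt(len(sample)))
    # Step 5: compare against the critical region [t_alpha, +inf[.
    t_alpha = NormalDist().inv_cdf(1.0 - alpha)
    return {"T": t, "t_alpha": t_alpha, "reject_H0": t >= t_alpha}

# Step 1: collect a sample (simulated here; the true mean 10.8 is made up).
rng = random.Random(42)
sample = [rng.gauss(10.8, 2.0) for _ in range(40)]

# Steps 2 and 3: H0 says the mean is 10, H1 says it is larger; T is the z-statistic.
result = one_sided_z_test(sample, mu0=10.0, sigma=2.0, alpha=0.05)
print(result)
```

Swapping in a different statistic or a different rule in step 5 changes the test, but not the overall structure.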

Different tests work better with different kinds of data. Given the choice, you should always strive to pick the test that maximizes the chance of true negatives (accepting $H_{0}$ when the effect doesn't exist) and true positives (accepting $H_{1}$ when the effect exists) on your specific data.
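As an example of how the variants mentioned earlier differ, here is a hedged sketch of the $p$-value interpretation for a single tail and a double tail test. It assumes a standard normal $\phi_{0}$, and the observed value `t_obs = 1.8` is hypothetical. Under this interpretation, $H_{0}$ is rejected when the $p$-value does not exceed $\alpha$.

```python
from statistics import NormalDist

phi0 = NormalDist(0.0, 1.0)  # assumed distribution of T under H0

def p_value_single_tail(t_obs):
    """P(T >= t_obs) under H0."""
    return 1.0 - phi0.cdf(t_obs)

def p_value_double_tail(t_obs):
    """P(|T| >= |t_obs|) under H0, valid because phi0 is symmetric around 0."""
    return 2.0 * (1.0 - phi0.cdf(abs(t_obs)))

alpha = 0.05
t_obs = 1.8  # hypothetical observed value of T
print("single tail rejects H0:", p_value_single_tail(t_obs) <= alpha)
print("double tail rejects H0:", p_value_double_tail(t_obs) <= alpha)
```

With these numbers the single tail test rejects $H_{0}$ while the double tail test does not, because the double tail test splits $\alpha$ across both extremes of $\phi_{0}$.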

Footnotes

  1. Though in the end, it all comes back down to a mathematical statement.

  2. Actually, it doesn't have to be. Choosing $\alpha$ by hand is called a priori significance. You could instead do a posteriori significance by choosing the bounds of the critical region (i.e., $t_{\alpha}$) instead, and seeing what $\alpha$ that integrates to.

  3. As an aside, the probability that a test has of returning a true positive, so rejecting $H_{0}$ when the effect does exist, is called the power of the test. It's a metric that's often used in medical or industrial studies to determine how large the sample size should be.