2 Probability and Statistics, in theory
It must be said that I have no desire to subject you, the reader, to the same philosophical torment that I have undergone in my pursuit of clarity of this topic, which I could not even claim to have reached. Moreover, given that this is an experimental physics thesis, it is exceedingly likely that you are a pragmatist, and wish to move swiftly on to the sections that may be useful to your life. Nonetheless, I’d like you to allow me the courtesy of proposing that there may in fact be something truly useful in revisiting these foundations, if only to reaffirm their non-utility.
For this section, my most heavily used and recommended resource is Cousins (2018), which I heard Bob himself refer to in a post-PHYSTAT chat as his “life’s work”. Do give it a read – so many useful gems by someone who took a lot of time to build bridges between science and statistics!
2.1 Probability, distributions, and likelihood
Both the pre-amble to and the beginning of my PhD studies were decorated with thoughts on the nature of probability. At the time, I had just come out of a Masters degree project that involved the application of nested sampling (Skilling 2006) to infer the momenta of invisible neutrinos. This technique was invented by John Skilling, a prolific figure in the realm of Bayesian statistics, if only for his advocacy thereof (if you’re planning to be indisposed for a time, you may find Skilling (2008) of interest as reading material, but do make sure to dilute it with things like Senn (2011) and Mayo (2018)). This started me down a rabbit hole on the much-contended topic of the definition of probability. It turned out the topic was vast, with arguments both mathematical and philosophical dating back many hundreds of years, many of which are still alive in the present day. By this, I was particularly surprised – my undergraduate courses had spent a mere ten minutes or so on the subject, and I thought I understood it perfectly well given the time allocated! “Surely”, I mused in confusion, “probability, as I’ve encountered it both intuitively and mathematically, is just

$$P(A) = \lim_{N \to \infty} \frac{N_A}{N},$$

the fraction of times $A$ occurs in $N$ repeated trials, right?”
Alas, the contentious nature of humanity would not allow me such a privilege. However, it is from this place of existential angst that we shall proceed, and uncover where this definition of probability encounters its limits (but also where it is useful).
2.1.1 Axioms of probability theory
For probability $P$ over events $A, B, \ldots$ in a sample space $\Omega$, the (Kolmogorov) axioms are:

- $P(A) \geq 0$ for any event $A$
- $P(\Omega) = 1$ always (at least one event will occur)
- The union ($\cup$) of variables for mutually exclusive events satisfies $P(A \cup B) = P(A) + P(B)$ (i.e. only if they cannot occur at the same time)
The next notion to cover is that of conditional probability – the probability of $A$ given that $B$ has occurred:

$$P(A \mid B) = \frac{P(A \cap B)}{P(B)}.$$
Depending on whose axioms you follow, this quantity may be either derived (as in (Shafer and Vovk 2006)) or axiomatically stated (as in (Finetti 1970)). We can also expand this definition generally to include as many variables as we want, e.g. for 3 variables:

$$P(A \cap B \cap C) = P(A \mid B \cap C)\, P(B \mid C)\, P(C),$$

and then for N variables:

$$P(A_1 \cap \cdots \cap A_N) = \prod_{i=1}^{N} P(A_i \mid A_{i+1} \cap \cdots \cap A_N),$$

where the conditioning set is empty for $i = N$, i.e. the last factor is just $P(A_N)$.
The axioms above lead to several other properties. For instance, the notion of conditional probability leads to the equation known as Bayes’ Theorem:

$$P(A \mid B) = \frac{P(B \mid A)\, P(A)}{P(B)},$$

since the notion of “and” ($\cap$) is symmetric in its arguments: $P(A \cap B) = P(A \mid B)\, P(B) = P(B \mid A)\, P(A)$.
Another important notion is that of independence:

- For independent events $A$ and $B$, $P(A \cap B) = P(A)\, P(B)$ (i.e. the occurrence of $A$ cannot influence that of $B$, and vice versa)
- This also implies $P(A \mid B) = P(A)$
- This also implies $P(B \mid A) = P(B)$
Independence is an important assumption for HEP – we assume that events in our particle colliders occur without influencing each other. This assumption fundamentally changes any modelling we do of a collider physics process in a probabilistic sense, as it reduces the modelling to event-level probabilities (and the joint distribution over all events is recovered through multiplication as above).
2.1.2 Interpretations of $P$
The previous section discussed axioms and properties for probability $P$, but not what $P$ itself should mean. Suppose we ask: “what is the probability that it rained in my area yesterday?” A frequency-based definition like the one above has no answer, since yesterday’s weather is not a repeatable trial.

The reason we faced issues there is because we tried to assign a probability to a fact. Similar questions could look like “what is the probability that the Higgs Boson exists?”, or “what are the odds of the next prime minister of the U.K. being a walrus?” – they’re all things that will either be true or untrue. We may like to make some probabilistic statements about these facts though (I’m particularly interested in the chances of the latter question): to do this, we need to shift our interpretation of $P$ from a long-run frequency to a degree of belief – the Bayesian interpretation.
Now, doing things the Bayesian way doesn’t mean we assign probabilities to facts with no reason (though we are free to do so). We end up incorporating prior information on these facts, e.g. the probability of it raining generally in your area, through Bayes’ theorem in Equation 2.4.
2.1.3 Probability density
We now shift our discussion from the big $P$ of discrete events to continuous random variables, where we instead work with the probability density function (pdf) $p(x)$, defined such that

$$P(a \leq X \leq b) = \int_a^b p(x)\, \mathrm{d}x.$$

Here, we’re using capital letters for random variables (the thing that follows the distribution), and realizations of that random variable are in lowercase letters. We can also define the cumulative density function (cdf), defined for random variable $X$ as

$$F(x) = P(X \leq x) = \int_{-\infty}^{x} p(x')\, \mathrm{d}x'.$$

This gives us the relation

$$p(x) = \frac{\mathrm{d}F(x)}{\mathrm{d}x}.$$

Pretty much all the same relations for big $P$ carry over to the density $p$, including conditioning and Bayes’ theorem.
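As a quick numerical check of this relation (a plain-Python sketch, using the standard normal as the example distribution), a finite-difference derivative of the cdf recovers the pdf:

```python
import math

# Standard normal pdf and cdf (the cdf via the error function)
def pdf(x):
    return math.exp(-x**2 / 2) / math.sqrt(2 * math.pi)

def cdf(x):
    return 0.5 * (1 + math.erf(x / math.sqrt(2)))

# A central finite difference of the cdf should recover the pdf
x, h = 0.7, 1e-5
numeric_pdf = (cdf(x + h) - cdf(x - h)) / (2 * h)
print(numeric_pdf, pdf(x))  # both ≈ 0.312
```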
2.1.3.1 i.i.d.
A common expression you’ll see about a set of random variables is that they’re i.i.d., which stands for independent and identically distributed. This refers to the situation when:
- Samples are drawn from the same probability distribution
- Each sample drawn (random variable) is independent from any other sample
An example is just drawing some random sample, e.g. `np.random.uniform(size=10)`. That would be 10 i.i.d. samples from a uniform distribution.
2.1.4 Change of variables formula
If the distribution over $X$ is known, and we define a new random variable $Y = f(X)$ for some invertible function $f$, what is the distribution of $Y$? Conservation of probability demands that the cdfs agree: $F_Y(y) = F_X(x)$ for $y = f(x)$ (taking $f$ to be monotonically increasing for now).

Differentiating both sides with respect to $y$ gives

$$p_Y(y) = p_X(x) \left| \frac{\mathrm{d}x}{\mathrm{d}y} \right|,$$

where we’ve inserted the absolute value to ensure this holds for both monotonically increasing and decreasing functions. The multi-dimensional version of this involves the determinant of the Jacobian matrix, defined for a function involving vectors $\mathbf{y} = f(\mathbf{x})$ as $J_{ij} = \partial y_i / \partial x_j$.

This leads to the formula

$$p_Y(\mathbf{y}) = p_X(\mathbf{x}) \left| \det J \right|^{-1},$$

which can be thought of as manipulating the space in the neighbourhood of each point, with the Jacobian determinant tracking the local change in volume.
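Here’s a small sketch of the 1-D formula in action, assuming $X$ is standard normal and $Y = e^X$ (so $x = \log y$ and $|\mathrm{d}x/\mathrm{d}y| = 1/y$); the transformed density agrees with a finite-difference derivative of the cdf of $Y$:

```python
import math

def pdf_x(x):  # standard normal pdf
    return math.exp(-x**2 / 2) / math.sqrt(2 * math.pi)

def cdf_x(x):  # standard normal cdf
    return 0.5 * (1 + math.erf(x / math.sqrt(2)))

# Y = exp(X): the change of variables formula says
# p_Y(y) = p_X(log y) * |dx/dy| = p_X(log y) / y
def pdf_y(y):
    return pdf_x(math.log(y)) / y

# Cross-check: F_Y(y) = F_X(log y), so differentiating that cdf should agree
y, h = 1.5, 1e-5
numeric = (cdf_x(math.log(y + h)) - cdf_x(math.log(y - h))) / (2 * h)
print(numeric, pdf_y(y))  # both ≈ 0.245
```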
2.1.5 Expectation values
The expectation value for a quantity $g(X)$, where $X$ follows the distribution with pdf $p(x)$, is defined as

$$\mathbb{E}[g(X)] = \int g(x)\, p(x)\, \mathrm{d}x,$$

which I loosely think of as the average value that $g(X)$ would take over many repeated samples of $X$.
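A minimal Monte Carlo sketch of this idea: averaging $g(x) = x^2$ over many uniform samples approximates the integral $\int_0^1 x^2\, \mathrm{d}x = 1/3$:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(size=1_000_000)  # i.i.d. samples from Unif(0, 1)

# Monte Carlo estimate of E[X^2]; the exact value is 1/3
mc_estimate = (x**2).mean()
print(mc_estimate)  # ≈ 0.333
```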
2.1.6 Commonly used distributions
2.1.6.1 Normal
We define the normal distribution – commonly known as a “Gaussian” in HEP – with the pdf

$$p(x \mid \mu, \sigma) = \frac{1}{\sqrt{2\pi}\, \sigma} \exp\left( -\frac{(x - \mu)^2}{2\sigma^2} \right).$$

The parameters $\mu$ and $\sigma$ are respectively the mean and standard deviation of the distribution; we’ll use the shorthand $\mathcal{N}(\mu, \sigma^2)$.
The normal distribution is exceedingly useful as a modelling tool due to the central limit theorem, which states that the distribution of the sum (and consequently the mean) of i.i.d. random variables calculated on a sample tends to a normal distribution as the size of your sample goes to infinity. This type of distribution is known as a sampling distribution as it involves a quantity that is computed on a per-sample basis.
The specific normal distribution you end up with from the central limit theorem is $\mathcal{N}(\mu, \sigma^2/n)$ for the mean of $n$ i.i.d. random variables that each have mean $\mu$ and variance $\sigma^2$.
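The theorem is easy to see empirically; here’s a sketch (with arbitrary sample sizes) using means of uniform draws, where $\mu = 1/2$ and $\sigma^2 = 1/12$:

```python
import numpy as np

rng = np.random.default_rng(42)
n, n_repeats = 100, 50_000

# Draw n_repeats samples of size n from Unif(0, 1) and take each sample's mean;
# the CLT says these means tend to N(1/2, (1/12) / n)
means = rng.uniform(size=(n_repeats, n)).mean(axis=1)

print(means.mean())  # ≈ 0.5
print(means.std())   # ≈ sqrt((1/12) / 100) ≈ 0.0289
```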
2.1.6.2 Poisson
Another very common distribution in HEP is the Poisson distribution. It’s useful when you want to model the occurrence rate of an event (and what happens on average). It’s defined for independent events occurring at a constant average rate; the probability of observing $k$ events is

$$P(k \mid \lambda) = \frac{\lambda^k e^{-\lambda}}{k!},$$

where $\lambda$ is both the mean and the variance of the distribution.

One commonly used notion is the approximation of a Poisson distribution with a normal distribution with mean $\lambda$ and standard deviation $\sqrt{\lambda}$, which becomes increasingly accurate as $\lambda$ grows large.
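A quick check of this approximation (with a hypothetical rate of $\lambda = 100$), comparing the Poisson pmf to the matching normal pdf at a few values of $k$:

```python
import math

def poisson_pmf(k, lam):
    return lam**k * math.exp(-lam) / math.factorial(k)

def normal_pdf(x, mu, sigma):
    return math.exp(-((x - mu) ** 2) / (2 * sigma**2)) / (sigma * math.sqrt(2 * math.pi))

# For large lambda, Poisson(lambda) ≈ N(mean=lambda, std=sqrt(lambda))
lam = 100
for k in (90, 100, 110):
    print(poisson_pmf(k, lam), normal_pdf(k, lam, math.sqrt(lam)))  # close pairs
```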
2.1.6.3 Uniform
The uniform distribution is a very simple construct, with pdf

$$p(x \mid a, b) = \begin{cases} \dfrac{1}{b - a}, & a \leq x \leq b, \\ 0, & \text{otherwise.} \end{cases}$$

We’ll use the shorthand of $\mathrm{Unif}(a, b)$ to refer to this distribution.
2.1.7 Nuisance parameters
When our probability model $p(x \mid \theta)$ contains parameters that are necessary to describe the data, but that we don’t care about reporting in our final result, we call them nuisance parameters. We’ll split the parameters as $\theta = (\mu, \nu)$, with $\mu$ the parameters of interest and $\nu$ the nuisance parameters.

If we have some prior knowledge about $\nu$ – say, from an auxiliary measurement – we can fold it into the model. In the case where the distribution of that auxiliary measurement is known, this typically enters the likelihood as a multiplicative constraint term on $\nu$.
To extract information about parameters of interest $\mu$ alone, we need to rid ourselves of the dependence on $\nu$. Two common approaches:

Profiling: for each value of $\mu$, we substitute the value of $\nu$ that maximizes the likelihood at that $\mu$:

$$\hat{\hat{\nu}}(\mu) = \arg\max_{\nu} L(\mu, \nu).$$

Here, we’re essentially picking our best guess of the value of the nuisance parameters given a specified $\mu$.
Marginalization: we simply integrate away the dependence on the nuisance parameters entirely:

$$p(x \mid \mu) = \int p(x \mid \mu, \nu)\, \pi(\nu)\, \mathrm{d}\nu.$$

Note that this requires a specification of a prior pdf $\pi(\nu)$ for the nuisance parameters, making this a Bayesian-flavoured operation.
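Here’s a minimal sketch of profiling on a toy likelihood (all numbers hypothetical): a single measurement $x \sim \mathcal{N}(\mu + \nu, 1)$ with a constraint $\nu \sim \mathcal{N}(0, 0.5)$, where the profiled value of $\nu$ at fixed $\mu$ can be found by a simple scan:

```python
import numpy as np

# Toy setup (hypothetical numbers): one measurement x ~ N(mu + nu, 1),
# plus an auxiliary constraint on the nuisance parameter, nu ~ N(0, 0.5)
x_obs = 1.2

def twice_nll(mu, nu):
    # -2 log L up to an additive constant: measurement term + constraint term
    return (x_obs - mu - nu) ** 2 + (nu / 0.5) ** 2

# Profiling: at fixed mu, scan nu and keep its best-fit (minimizing) value
mu = 0.7
nu_grid = np.linspace(-2.0, 2.0, 4001)
nu_hat_hat = nu_grid[np.argmin(twice_nll(mu, nu_grid))]

# Setting the derivative to zero gives nu = (x_obs - mu) / 5 analytically
print(nu_hat_hat)  # ≈ 0.1
```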
2.2 Metrics of probability
2.2.1 Fisher information and the Cramér–Rao bound
A useful quantity in many ways is the Fisher information matrix. We can write the definition for an element of this matrix for a likelihood function $L(\theta)$ as

$$I_{ij}(\theta) = -\mathbb{E}\left[ \frac{\partial^2 \log L(\theta)}{\partial \theta_i\, \partial \theta_j} \right].$$
We’ll focus on the fact that you can extract parameter uncertainty estimates from this quantity. How? We turn to the Cramér–Rao bound, which states that if we have an unbiased estimator $\hat{\theta}$ for the likelihood parameters, its covariance is bounded from below by the inverse of the Fisher information:

$$\mathrm{cov}(\hat{\theta}) \geq I(\theta)^{-1},$$

where the matrix inequality means that $\mathrm{cov}(\hat{\theta}) - I(\theta)^{-1}$ is positive semi-definite. Since the information from i.i.d. samples adds, the bound on the variance shrinks like $1/n$ for sample size $n$.
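To make this concrete, here’s a sketch for a case with a known answer: for $n$ samples from $\mathcal{N}(\mu, \sigma)$ with $\sigma$ known, the Fisher information for $\mu$ is $n/\sigma^2$, which a finite-difference second derivative of the negative log-likelihood reproduces (the sample size and $\sigma$ here are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(1)
sigma, n = 2.0, 1000
data = rng.normal(5.0, sigma, size=n)

def nll(mu):  # negative log-likelihood for N(mu, sigma), sigma known, up to a constant
    return np.sum((data - mu) ** 2) / (2 * sigma**2)

# Fisher information for mu: I = -E[d^2 log L / d mu^2] = n / sigma^2
mu0, h = 5.0, 1e-3
fisher = (nll(mu0 + h) - 2 * nll(mu0) + nll(mu0 - h)) / h**2
print(fisher, n / sigma**2)  # both ≈ 250

# Cramér–Rao: any unbiased estimator of mu has variance >= 1/fisher = sigma^2 / n
print(1 / fisher)  # ≈ 0.004
```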
2.2.2 Kullback-Leibler divergence
The Kullback-Leibler divergence (KL divergence) is a commonly used quantity in machine learning and information theory to represent the “distance” between a probability distribution $p$ and a reference distribution $q$:

$$D_{\mathrm{KL}}(p \parallel q) = \int p(x) \log \frac{p(x)}{q(x)}\, \mathrm{d}x.$$

One interesting thing to note is that this is not a distance in the typical sense, since it’s an asymmetric number, i.e. $D_{\mathrm{KL}}(p \parallel q) \neq D_{\mathrm{KL}}(q \parallel p)$ in general.
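A tiny numerical illustration of this asymmetry, using two hypothetical discrete distributions:

```python
import math

# Two discrete distributions over three outcomes
p = [0.6, 0.3, 0.1]
q = [0.2, 0.5, 0.3]

def kl(p, q):  # D_KL(p || q) = sum_i p_i * log(p_i / q_i)
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))

print(kl(p, q), kl(q, p))  # ≈ 0.396 vs ≈ 0.365: not symmetric
```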
2.3 Inference
Statistical inference can be viewed as the inverse of the data generating process. Let’s look at an example.
Say that the Lorax created the Earth. Moreover, say that he did so by rolling a six-sided die, and putting the result of the die into a World Generating Machine, which merely requires a single number as input. As a neutral party that cannot confirm the Lorax’s roll, but can definitely observe the outcome, we may wonder: assuming we have a model of the World Generating Machine, what number did the Lorax roll to produce all that we see around us today? Moreover, can we factor in our prior suspicion that a Lorax rolled a die to select a number at random?
Put in a more general way: given a parametrized model of the world around us, potential suspicions about the parameters themselves, and some data that could be described by that model, which values of the parameters could have produced the data? Moreover, is the model a good description of the data at all? Could another model have described the data more accurately? These types of questions fall within the umbrella of statistical inference.
Let us add a layer of specificity to our questions. The key inquiries that are typically dealt with through inference are:
- Point estimation: What value(s) of my model parameters best describe the data?
- Interval estimation: How can I specify a plausible range for my model parameters based on the data they attempt to describe? (One can obtain a symbiotic relation between interval and point estimation by viewing a point estimate as the limit of shrinking intervals)
- Hypothesis testing: Given a model of the data generating process, to what extent does it describe the data when…
- …compared to another model with different form?
- …compared to the same model with different parameters?
- …not being compared to any particular alternative? (this has the particular name of the goodness-of-fit test)
These questions are where the schools of probability see their strongest divide; whether we can assign probabilities to facts – e.g. what are the odds that neutrinos have masses – fundamentally alters the kind of question we can ask of our model/data. For example, instead of a point estimate of the neutrino masses, a Bayesian procedure could infer their distribution given the data (and some prior beliefs about their values), also termed the posterior distribution. Similarly, Bayesian intervals can say “what are the chances the true parameter value lies in this interval” (which frequentist procedures cannot determine), and hypothesis testing can evaluate the probability of the hypothesis itself.
However, these Bayesian procedures undertake the weight of the specification of a prior distribution for all of these quantities, which will (in the limited data regime, a.k.a. real life) strongly affect the result. Moreover, the frequentist procedures are indeed still useful for making exact, even if convoluted, statements about the model in the presence of observed data.
A brief aside: you’ll notice that the common thread throughout these statements is the model. Like many statistics-adjacent HEP researchers before me, I will emphasize that the likelihood model is the most important ingredient for any inference procedure. In the limit of a well-designed model and sufficient data, even a room full of the most polarized statisticians, hopelessly resigned to their particular school of thought, will agree on basic statements about the validity of a hypothesis. Even failing this, a Bayesian would not ignore a vanishingly small $p$-value, nor would a frequentist dismiss a sharply peaked posterior out of hand.
But enough with the aside, or I’ll be subject to the same criticism of misplaced effort. Let’s review the different methods that implement answers to our questions of inference.
2.3.1 Point estimation
Given a probability model $p(x \mid \theta)$ and observed data $x$, the most common point estimate in HEP is the maximum likelihood estimate (MLE):

$$\hat{\theta} = \arg\max_{\theta} L(\theta) = \arg\max_{\theta} p(x \mid \theta),$$

where the likelihood $L(\theta)$ is the model evaluated on the observed data, viewed as a function of the parameters.

In practice, we calculate $\hat{\theta}$ by minimizing twice the negative log-likelihood, $-2 \log L(\theta)$, which shares the same optimum.

The prefactor of 2 leads to some nice theoretical properties via Wilks’ theorem, but that’s a later discussion (the 2 doesn’t matter for optimization purposes). We calculate the MLE a lot in HEP for a few reasons:
- It has a Normal sampling distribution in the asymptotic limit (so we can easily quote errors using a sample standard deviation)
- In the finite data regime, the MLE is effective for many distributions we care about (e.g. Normal, Poisson)
- Simple to understand and debug!
Other methods of point estimation could be as simple as quoting the mode (value with highest density), median (value that partitions the distribution into equal segments of probability), or arithmetic mean (sum of the data divided by the size of the data). These are more useful in the absence of a good probability model.
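As a sketch of the MLE in practice (with made-up event counts), here’s a brute-force minimization of $-2\log L$ for a Poisson model, which lands on the analytic answer (the sample mean):

```python
import numpy as np

counts = np.array([3, 7, 4, 6, 5])  # hypothetical observed event counts

# 2x the negative log Poisson likelihood, dropping the constant log(k!) terms
def twice_nll(lam):
    return 2 * np.sum(lam - counts * np.log(lam))

# Brute-force the minimum on a grid; the Poisson MLE is analytically the sample mean
lam_grid = np.linspace(0.1, 15.0, 14_901)
mle = lam_grid[np.argmin([twice_nll(lam) for lam in lam_grid])]

print(mle, counts.mean())  # both ≈ 5.0
```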
2.3.2 Confidence intervals
A significant part of my early PhD days was spent trying to understand the confidence interval. I will do my best to try and make you bang your head against the wall a little less than me (metaphorically, I hope).
We begin with some statistic of the observed data $t(x)$ (you can just think of the data $x$ itself for now). Loosely, a confidence interval for a parameter $\mu$ is the set of values of $\mu$ under which the observed data doesn’t appear too extreme. To make this precise, we need to answer two questions:

- How do we define a notion of “extreme” for the data?
- How do we construct the interval itself to satisfy this property (namely, that it contains exactly those values of $\mu$ that consistently treat the observed data as not extreme)?
For the first point, there’s a two-part answer: we need a way to order the data with a notion of extremity (rank it from least to most extreme), and then also determine the cutoff for which points are considered extreme or not. This cutoff isn’t going to be by value, but by proportion; we’ll place our yardstick such that all values of $x$ beyond a chosen quantile of its distribution – say, the most extreme 5% – are designated as extreme.

For 1-D $x$, the simplest ordering is by the value of $x$ itself, which leads to one-sided upper or lower limits. We can also look at a central ordering, which produces an interval that can be constructed at a given confidence level $1 - \alpha$ by cutting away a probability of $\alpha/2$ from each tail of the distribution.
Let’s make this concrete with an example: we’ll take the same distribution studied in (Cousins 2018) (Section 6.4), where we have a normal distribution with width parametrized as 1/5th of the mean:

$$p(x \mid \mu) = \mathcal{N}(x \mid \mu, (\mu/5)^2).$$

From here, we can examine the pdf (at a given value of $\mu$), and see which values of $x$ it treats as extreme.

Let’s now use the aforementioned method to construct a central confidence interval for $\mu$ at confidence level $1 - \alpha$, given an observed value of $x$. Finding these limits can be viewed as an optimization problem: we just need to choose the endpoints $\mu_-$ and $\mu_+$ such that the probability of data more extreme than the observed $x$ is exactly $\alpha/2$ under each. We can visualize the probability density for each of the endpoint values $\mu_-$ and $\mu_+$ to see this.
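Here’s a sketch of that optimization for this model, with a hypothetical observed value $x = 10$ and a 68% central interval, using simple bisection on the tail probabilities:

```python
import math

def norm_cdf(x, mu, sigma):
    return 0.5 * (1 + math.erf((x - mu) / (sigma * math.sqrt(2))))

# Model from the text: x ~ N(mu, sigma = mu / 5); observed value is hypothetical
x_obs, alpha = 10.0, 0.32  # 68% central interval

def bisect(f, lo, hi, tol=1e-10):
    # simple bisection: f(lo) and f(hi) must have opposite signs
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if f(lo) * f(mid) <= 0:
            hi = mid
        else:
            lo = mid
    return 0.5 * (lo + hi)

# mu_-: probability above x_obs equals alpha/2; mu_+: probability below equals alpha/2
mu_lo = bisect(lambda mu: (1 - norm_cdf(x_obs, mu, mu / 5)) - alpha / 2, 1.0, x_obs)
mu_hi = bisect(lambda mu: norm_cdf(x_obs, mu, mu / 5) - alpha / 2, x_obs, 50.0)
print(mu_lo, mu_hi)  # ≈ 8.34 and ≈ 12.48
```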
Neyman construction
A different way to build confidence intervals (that generalizes to multiple dimensions) can be found in the form of the Neyman construction. Here, we start by constructing an acceptance interval for the data $x$ at each fixed value of the parameter $\mu$, chosen such that it contains a fraction $1 - \alpha$ of the probability.

The endpoints of each acceptance interval are determined by the chosen ordering principle for the data (e.g. the central ordering above).

We can then draw the acceptance intervals for many values of $\mu$, forming a confidence belt; the confidence interval for an observed value of $x$ is then the set of all $\mu$ whose acceptance intervals contain that $x$.

Even though the appearance of the plot in Figure 2.4 makes it look so, it is completely not required that the resulting confidence set be a single connected interval – for some orderings and models, it can be a union of disjoint pieces.
Likelihood ratio ordering
A fourth ordering principle for the data was proposed by Feldman & Cousins in (Feldman and Cousins 1998), and is based on the likelihood ratio

$$\lambda(x) = \frac{p(x \mid \mu)}{p(x \mid \hat{\mu})},$$

where $\hat{\mu}$ is the value of $\mu$ that maximizes the likelihood for the given $x$ (restricted to physically allowed values); data is ranked from least to most extreme in order of decreasing $\lambda$.
Profiling
In the case where the likelihood parameters can be split into parameters of interest $\mu$ and nuisance parameters $\nu$, the ratio becomes the profile likelihood ratio

$$\lambda(x) = \frac{p(x \mid \mu, \hat{\hat{\nu}}(\mu))}{p(x \mid \hat{\mu}, \hat{\nu})},$$

where $\hat{\hat{\nu}}(\mu)$ is the value of $\nu$ that maximizes the likelihood at fixed $\mu$ (as in the profiling of Section 2.1.7), and $(\hat{\mu}, \hat{\nu})$ is the global MLE of all parameters.
Coverage
Despite the fact that a confidence interval is quoted in values of $\mu$, the probabilistic statement it makes is about the interval itself: intervals constructed with this procedure will contain (cover) the true value of $\mu$ in a fraction $1 - \alpha$ of repeated experiments. This property is known as coverage.

How do we know our confidence set has this property? Well, the Neyman confidence set has it by definition: if we think about just the single distribution at the true value $\mu^{\mathrm{true}}$, its acceptance interval contains the observed data with probability $1 - \alpha$ by construction, so $\mu^{\mathrm{true}}$ ends up inside the resulting confidence set exactly that often.

This is all well and good, but may ring some alarm bells – we only have one interval in practice, right? Isn’t there a chance that we draw data such that the confidence interval falls in the 5% that don’t cover the true, unknown value of $\mu$?
In short, yes. But that doesn’t mean confidence intervals are useless – a single 95% interval is overwhelmingly more likely to contain the true parameter value than not. Remember: we’re always going to have inherent uncertainty on the statements we make on unknown true parameter values – this uncertainty is expressed as a probability distribution for Bayesian methods, and as a statement about sampling properties (results in the limit of many repeated experiments) for frequentist methods.
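Coverage is straightforward to demonstrate by simulation; here’s a sketch using a simpler model (a unit-width normal with known $\sigma$, all numbers arbitrary) and the textbook central 95% interval $x \pm 1.96\sigma$:

```python
import numpy as np

rng = np.random.default_rng(7)
mu_true, sigma, n_experiments = 3.0, 1.0, 100_000

# Repeat the experiment many times: observe x ~ N(mu_true, 1), then
# quote the central 95% interval [x - 1.96, x + 1.96] for mu
x = rng.normal(mu_true, sigma, size=n_experiments)
covered = (x - 1.96 <= mu_true) & (mu_true <= x + 1.96)

print(covered.mean())  # ≈ 0.95: the quoted intervals cover mu_true 95% of the time
```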
2.3.3 Frequentist hypothesis testing
Hypothesis testing gives us a framework to make qualitative statements about how extreme the observed data seems under a proposed theory, often in comparison to various alternatives. Note we use “extreme” in the same way as with confidence intervals, so we’re going to need to establish an ordering principle for the data. Let’s go into more detail.
Once again, we concern ourselves with a statistic of the observed data $t(x)$. The theories under scrutiny are phrased as hypotheses – typically a null hypothesis $H_0$ and an alternative $H_1$ – which can be:

- Two totally separate probability models $p_0(x)$ and $p_1(x)$, which can differ in number of parameters and functional form
- A set of hypotheses within a probability model; $p(x \mid \theta)$ induces a family of hypotheses parametrized by $\theta$, for instance, with any two values of $\theta$ potentially serving as $H_0$ and $H_1$

A hypothesis at a point (e.g. a single value of $\theta$) is termed a simple hypothesis, while one spanning a set of parameter values is termed composite.
So, how do we actually do the testing itself? We’ll need an ordering principle to rank data from least to most extreme. To do this, we usually concentrate on defining a good test statistic of the data $t(x)$, whose value induces the ordering.

Under which distribution is this ordering and designation of “extreme” defined? The answer is the distribution of $t(x)$ under the null hypothesis $H_0$.

One more thing we can deduce from the above: our statements on how extreme the observed data appears are fundamentally under the assumption that $H_0$ is true.

If the observed data lands beyond our extremity cutoff – in the so-called critical region – we reject $H_0$.

So if our testing revolves around $H_0$, where does the alternative come in? It guides the choice of test statistic (and hence the ordering): we’d like data that is more compatible with $H_1$ to be ranked as more extreme under $H_0$.
When framing things in terms of error, we can see that the size $\alpha$ of the test (one minus the confidence level) is the probability of rejecting $H_0$ when it is in fact true – a Type I error. Conversely, failing to reject $H_0$ when the alternative is true is a Type II error, occurring with probability $\beta$; the power of the test is $1 - \beta$.

The quantities $\alpha$ and $\beta$ are in tension: demanding a smaller Type I error rate generally comes at the cost of reduced power.
$p$-values
So we set up a test by choosing the size $\alpha$ ahead of time, then check whether the observed data lands in the corresponding critical region. We can compress this procedure into a single number: the $p$-value, the probability under $H_0$ of obtaining data at least as extreme as that actually observed.

Note that the quantity described here is a statistic of the data – it is emphatically not the probability that $H_0$ is true (a statement a frequentist cannot even make).

We can then rephrase a hypothesis test as rejecting $H_0$ whenever the observed $p$-value falls below the size $\alpha$.
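As a sketch, here’s the one-sided $p$-value for a Gaussian-distributed test statistic (the usual “z-score” setting), computed with the complementary error function:

```python
import math

# One-sided p-value under H0 for an observed deviation of z standard deviations:
# the probability of a Gaussian fluctuation at least this large
def p_value(z):
    return 0.5 * math.erfc(z / math.sqrt(2))

print(p_value(1.64))  # ≈ 0.05: borderline for a test of size alpha = 0.05
print(p_value(5.0))   # ≈ 2.9e-7: the one-sided p-value behind the HEP "5 sigma" convention
```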
Optimality
What defines an optimal test? Well, for a fixed size $\alpha$, the best test is the one with the highest power $1 - \beta$ – the one most likely to correctly reject $H_0$ when the alternative holds.

In the case where $H_0$ and $H_1$ are both point hypotheses, the Neyman–Pearson lemma tells us that the likelihood ratio $p(x \mid H_1)\, /\, p(x \mid H_0)$ is the most powerful test statistic.

When we have a composite hypothesis, things get trickier, since we’ve only worked out the best test statistic for two point hypotheses. Instead of most powerful test, we need to change our language: we want the uniformly most powerful test across the many point hypotheses contained within the composite hypothesis.

A proxy way to address this is the following: we know that the likelihood ratio is optimal for two point hypotheses. If we’re testing a point null (a single value of $\theta$) against a composite alternative, we can let the data pick the alternative by plugging in the MLE, recovering the (profile) likelihood ratio from earlier – not guaranteed to be uniformly most powerful, but a well-motivated default.
2.3.4 Duality between hypothesis test and intervals
It was probably hard to watch the number of times I defined “extreme” in this very specific way that seems to apply to both intervals and tests, without commenting on how they’re similar or different. That’s because they’re one-to-one in many ways!
Say we make a test of size $\alpha$ for the hypothesis $\theta = \theta_0$: the test rejects exactly when the data looks too extreme under $\theta_0$, which is precisely the condition for $\theta_0$ to fall outside a confidence interval at confidence level $1 - \alpha$ built with the same ordering.

I hope this is clear enough, it took me a little while to see initially! One can then invert a test to produce a related confidence interval, which just means taking the set of values $\theta_0$ that are not rejected by a test of size $\alpha$ on the observed data.
2.3.5 Bayesian procedures
I won’t talk much about the Bayesian methodology here since I didn’t explicitly use it in my PhD, but it’s worth mentioning for contrast. Moreover, if you’re a student within HEP reading this, I don’t want to make it seem like there’s only one way of doing things – there are analyses that use these methods, and I think they should be explored more! See something like (Sivia and Skilling 2006) for a comprehensive introduction.
As I’ve mentioned a few times, the Bayesian interpretation of probability admits distributions for facts; one can ask “how likely is supersymmetry?” or “what’s the distribution for my signal strength?” These questions are often answered via use of Bayes’ theorem, which states that for data $x$ and parameters $\theta$:

$$p(\theta \mid x) = \frac{p(x \mid \theta)\, p(\theta)}{p(x)}.$$

We call $p(\theta)$ the prior, $p(x \mid \theta)$ the likelihood, $p(\theta \mid x)$ the posterior, and $p(x)$ the evidence (or marginal likelihood).

One thing to note: Bayes’ theorem does not use the Bayesian definition of probability unless we decide to – it’s derived from the axioms of probability, and is thus invariant of interpretation. Using this theorem in a Bayesian way only holds when we have a true but unknown quantity (e.g. a parameter of nature like a particle mass) that we choose to endow with a probability distribution representing our beliefs.
Armed with this equation, Bayesian procedures include:
- Posterior estimation (e.g. through Markov chain Monte Carlo and its variants)
- Can then perform point estimation via maximum a posteriori (the mode of the posterior)
- Credible intervals: in contrast to confidence intervals, where the endpoints are the random variables, $\theta$ itself is now the random variable, and the interval is constructed such that it contains the true value with some allocated probability.
- Hypothesis testing: Bayesian hypothesis tests do exist, and involve the model evidence; I won’t talk about them here.
- Evidence estimation, e.g. nested sampling (Skilling 2006), which also comes with a posterior estimate as a by-product.
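As a minimal sketch of posterior estimation (no MCMC needed in one dimension), here’s Bayes’ theorem applied on a grid for a Poisson rate with a flat prior and a hypothetical observed count:

```python
import numpy as np

k_obs = 7  # hypothetical observed event count

# Grid over the Poisson rate parameter lambda, with a flat prior
lam = np.linspace(0.01, 30.0, 3000)
prior = np.ones_like(lam)
likelihood = lam**k_obs * np.exp(-lam)  # Poisson likelihood, dropping the 1/k! constant

# Bayes' theorem on a grid: posterior ∝ likelihood × prior, then normalize
posterior = likelihood * prior
posterior /= posterior.sum() * (lam[1] - lam[0])  # crude normalization to unit area

# Maximum a posteriori estimate: the posterior mode (equals the MLE for a flat prior)
map_estimate = lam[np.argmax(posterior)]
print(map_estimate)  # ≈ 7
```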