David G. Khachatrian

Note: Long post ahead.

To attempt to aid in navigation, please find a table of contents below. The post itself follows.

Introduction

I had a great time taking the MITx course on the Fundamentals of Statistics earlier this year. I learned a lot, both from the course itself and from the process of mulling over its contents with fellow students in the discussion forum. I wanted to hold onto some of the more enlightening realizations by logging them here. (Some light edits have been made for cleanliness and/or sanitization purposes, but otherwise posts have remained untouched.)

Questions I answered.

What to do in practice when Wald’s test and the likelihood ratio test give you different answers?

Question:

In practice, when you encounter a situation similar to the one presented in the recitation, where two different tests give you two different results, how do you decide whether to reject or fail to reject the null hypothesis? Thanks!

Answer:

Just spitballing here, but:

As the TA suggests, in the finite $n$ regime, one of the distributions might be “closer” to the $\chi^2$ distribution than the other. So our choice of test matters. And if we’re being good statisticians, we aren’t playing around with the data to post-hoc decide which test we want to use.

In this case, my thought process would be: “What do I care about more: (1) whether my null hypothesis for $\theta$ itself is close to $\hat{\theta}^{MLE}$, or (2) whether my null hypothesis for $\theta$ is about as likely to give the observed data as $\hat{\theta}^{MLE}$?” Remember what the professor mentioned in lecture: looking at our log-likelihood graph, Wald’s test compares closeness along the “$x$” (i.e. $\theta$) axis, while the likelihood ratio test compares closeness along the “$y$” axis.

This seems to go back to the “estimation vs. prediction” dichotomy. If we use Wald’s test, we’re interested in our parameter, so we’re presumably interested in the statistical model we have set up and in interpreting its parameters. In the likelihood-ratio test, we’re interested in the likelihood, so we’re presumably not too interested in what our model’s internals are as long as it spits out “good” answers often.

As a concrete example: Say you’re working at a marketing firm and you have a bunch of data on customer activity. You’re tasked with determining whether we should keep using Advertising Channel $A$ (because the cost of that channel is increasing, or something). Thanks to prior work, you’ve been given or have come up with a model that takes in customer activity (including exposure to $A$; say the corresponding parameter is $\theta_i$) and spits out the profit from a customer with that background, $f_{\vec{\theta}}(x)$. Say $\frac{\partial}{\partial \theta_i} f > 0$. Previous projections have estimated that $\theta_i \approx c$. Now, you have two ideas:

  1. “Alright, well, I have this model and I want to know whether $A$ is helping us out overall. That’s the same as $\theta_i$ being positive. The status quo is that it’s still as lucrative as it was in the past, so let’s set up a hypothesis test where $H_0: \theta_i = c,\ H_1: \theta_i \leq 0$ and see if the data leads us to reject this hypothesis. Let’s use Wald’s test!”

  2. “OK, I wanna see whether $A$ seems to be helping us out. Well, are we far more likely to have gotten the data we got if $\theta_i = c > 0$ compared to if $\theta_i = 0$? Alright, my hypothesis test is $H_0: \theta_i = c, H_1: \theta_i \leq 0$ – let’s use the likelihood ratio test and see whether there’s evidence leading us to drop this channel.”

It’s somewhat subtle, but (1) is thinking more about the “internals” of the model while (2) is thinking more about the “outputs” of the model.
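
To make the contrast concrete, here is a toy sketch of my own (not from the recitation): both statistics for testing a Bernoulli parameter. For finite $n$ the two p-values generally differ and can land on opposite sides of a cutoff.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)

# Toy setup: n Bernoulli(p) observations, testing H0: p = p0 against H1: p != p0.
n, p_true, p0 = 40, 0.35, 0.5
x = rng.binomial(1, p_true, size=n)
p_hat = x.mean()                      # MLE (assumed strictly between 0 and 1 here)

def loglik(p):
    return np.sum(x * np.log(p) + (1 - x) * np.log(1 - p))

# Wald: closeness along the theta ("x") axis, scaled by the Fisher information at the MLE.
wald = n * (p_hat - p0) ** 2 / (p_hat * (1 - p_hat))
# Likelihood ratio: closeness along the log-likelihood ("y") axis.
lrt = 2 * (loglik(p_hat) - loglik(p0))

# Both statistics are asymptotically chi-squared with 1 degree of freedom under H0.
print("Wald p-value:", stats.chi2.sf(wald, df=1))
print("LRT  p-value:", stats.chi2.sf(lrt, df=1))
```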

A question on simulation.

[Context note: This question is on why the Kolmogorov-Smirnov test has a pivotal distribution (i.e. a distribution that does not depend on unknown parameters) even for finite $n$. (For $n \to \infty$, we have the distribution converging to the distribution of the random variable $K = \sup_{t\in [0,1]} \lvert B(t) \rvert$, where $B(t)$ is the Brownian bridge (a Wiener process where $B(t=0) = B(t=1) = 0$).)]

Question:

In a simulation, we have multiple runs, and we vary something randomly between the runs. Then we look at the results in aggregate.

In this case, given a sample of data points, isn’t everything known? The empirical distribution is known. The CDF values of the hypothetical (proposed) distribution have to be computed at the sample data points. You cannot randomly pick from a uniform distribution. It is true that the CDF is generally a mapping $\mathbb{R} \to \mathbb{R}$, but if we use a random variable as the input, the output is itself a random variable, and that random variable is uniformly distributed. But how can we make random picks from it in this case?

I am not sure, but are we saying we resample with replacement the data in each run (the bootstrap approach)? Then you can get different empirical CDFs and hypothetical CDFs in each run.

It would be useful if someone could provide basic pseudo-code for the simulation.

Answer:

It has taken/is taking me a while to grok this myself, but I’ll share my current understanding in case it helps.

The key is that the test statistic we’re dealing with has a pivotal distribution in the sense that the statistic is drawn from a distribution not depending on our unknown parameters. Which seems crazy, but it’s because:

  1. If $T \sim F_T$, $S = F_T(T) \sim U[0,1]$.

    What’s $F_S(s)$? $F_S(s) = Pr[S \leq s] = Pr[F_T(T) \leq s] = Pr[T \leq F^{-1}_T(s)] = F_T(F_T^{-1}(s)) = s$, so $S$ is $U[0,1]$. In words? Thinking specifically of the equality $F_S(s) = Pr[T \leq F^{-1}_T(s)]$, $F_S(s)$ is asking “What proportion of $T$ falls below the $s$’th quantile of $T$?” The proportion is $s$ (“duh”).

  2. Under the null (where $T \sim F$), letting $X = F(T) \sim U[0,1]$, $F_n(t) =\frac{1}{n}\sum_i\mathbf{1}[T_i \leq t] = \frac{1}{n}\sum_i\mathbf{1}[X_i \leq F(t)]$.

    So the questions “What proportion of $T_i \sim F$ falls below $t$?” and “What proportion of $X_i = F(T_i)\sim U[0,1]$ falls below the quantile corresponding to $t$?” are the same. No matter what $F$ is.

Using (1) and (2), we convert the K-S statistic $T_n$ into something like “$\text{the empirical CDF of a } U[0,1] \text{ with } n \text{ samples} - \text{the actual CDF of a } U[0,1]$”.

But we can simulate the distribution of this statistic. Draw $n$ iid $U[0,1]$ r.v.’s. Calculate their empirical CDF $F_n(t)$ (“What fraction of these $n$ random uniform variables happened to fall under $t$ in this simulation?”). Subtract $t$ (the actual CDF of a $U[0,1]$) and take the largest absolute difference over all $t$. That number is one simulated value of the statistic. Repeat this many times. You now have a good estimate of the distribution that $T_n$ follows.
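
In response to the request for pseudo-code, here is that procedure as a runnable sketch of my own; `F0` stands for whatever fully specified null CDF you are testing against, and depending on your course’s convention you may want to multiply by $\sqrt{n}$ (that doesn’t change the logic).

```python
import numpy as np

rng = np.random.default_rng(0)

def ks_statistic_uniform(u):
    """D_n = sup_t |F_n(t) - t| for a sample u that should be U[0,1] under H_0."""
    u = np.sort(u)
    n = len(u)
    i = np.arange(1, n + 1)
    # The sup is attained at (or just to the left of) one of the sample points.
    return max(np.max(i / n - u), np.max(u - (i - 1) / n))

def simulate_null(n, n_sims=10_000):
    """Simulate the pivotal null distribution of the statistic for sample size n."""
    return np.array([ks_statistic_uniform(rng.uniform(size=n)) for _ in range(n_sims)])

# Later, once the field data x (an array of n observations) comes in, with F0 the
# fully specified null CDF (hypothetical; plug in your own, e.g. stats.norm.cdf):
#   d_obs = ks_statistic_uniform(F0(x))      # by (1)-(2), F0(X_i) ~ U[0,1] under H_0
#   p_value = np.mean(simulate_null(len(x)) >= d_obs)
```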

This was all prep work – all done before the people out in the field went out and collected observations $X_1, \cdots, X_n$ and brought them back to you. Finally, you can calculate $T_n$ for your data. Now you compare it to the distribution of $T_n$ (which doesn’t depend on $F$, as we’ve shown above) that you created via simulation, get your p-value/etc., and Bob’s your uncle.

EDIT: Slight fix/clarification about $F_S(s)$.

How do p-values really help us?

Question:

Speaking loosely, p-values tell us $P(\text{data} \mid H_0)$, whereas the question we are trying to answer through a statistical test is typically $P(H_0 \mid \text{data})$. These are of course related by Bayes’ rule, but the frequentist approach we have been considering so far in this course makes no attempt to move from $P(\text{data} \mid H_0)$ to $P(H_0 \mid \text{data})$. Indeed, when discussing them, including in this class, it seems as if frequentists conflate these two probabilities and speak as if the p-value tells us directly about the probability that $H_0$ is true.

What am I missing?

Answer:

Having previously read pieces about p-values to this effect, I have also had a bit of difficulty reconciling things.

However, I was reminded by these lectures that the p-value being (loosely) $P(\text{data} \mid H_0)$ rather than $P(H_0 \mid \text{data})$ is totally fine, because hypothesis tests are never meant to confirm the null hypothesis. They test whether the given data is “reasonable” under the null compared to an alternative, and there is an inherent asymmetry between $H_0$ and $H_1$.

This is different from conjuring up a means of computing $P(H_0 \mid \text{data})$ and $P(H_1 \mid \text{data})$ and comparing the two – that removes the asymmetry built into the hypothesis testing framework and sounds more like maximum likelihood estimation to me. (To see how this is different, consider that Type I and Type II errors are treated differently; if $H_0$ and $H_1$ were “interchangeable”, what would Type I and Type II errors even mean?)

As far as I currently understand it, this is admittedly a philosophical difference. I’m reminded of the professor’s analogy earlier of “innocent until proven guilty” or “starting on Island $H_0$ and seeing if my evidence sends me toward $H_1$”. This is the hypothesis testing mindset. On the other hand, in a Bayesian framework, we assign prior probabilities, but aside from that there’s no real asymmetry between different choices. Even if you said A is more likely than B a priori (Bayesian), that’s not quite the same as “I’m operating under the belief that A is true until I receive enough evidence to start to believe that this hypothesis A is very unlikely to be true” (hypothesis testing).

(Admittedly, I was trying to convince myself as I wrote this, so apologies if it’s not very clear haha.)

Final conclusions of this Recitation

[Context Note: Here, “OLS” = “Ordinary Least Squares”, “PCR” = “Principal Component Regression”, “RR” = “Ridge Regression”]

Question:

The conclusions at the end of the Ridge Estimator video were not clear (at least to me). Would someone kindly elaborate a bit more on the conclusions drawn after looking at the bias and variance of the OLS, PCR, and ridge estimators?

Thanks in advance.

Regards

Answer:

What part’s confusing?

OLS “is forced” to use all $p$ dimensions of $\beta$, even if some of those dimensions are essentially collinear (say coordinates $i$ and $j$). This makes the variance shoot up, because it is hard to figure out which coordinate should “account for” the variation seen along those coordinates: there is little “separate” variation between the two that you can use to distinguish them.

For an extreme example, imagine that the data only pins down $\beta_i + \beta_j = \text{const} = 10$, so that coordinates $i$ and $j$ are perfectly collinear. How do you “choose” $\hat{\beta_i}$ and $\hat{\beta_j}$? You can set the first to whatever you want, say $\hat{\beta_i} := 42e^\pi$, or $3$, or $0$, or $500{,}000{,}000$, and you can still pick a $\hat{\beta_j}$ that makes the overall model fit the data equally well. So that’s infinite variance. The closer you get to collinear, the more your estimator’s variance explodes.

How do you remedy this? PCR takes the tack of “let’s drop the directions of the data with little/no variance”. This makes the remaining coordinates less collinear with one another, reducing variance (but increasing bias, because you’ve essentially assumed that $\beta$ along those dropped principal components is zero).

RR instead decides, “Let’s just add a small fudge factor to my optimization problem that ends up increasing my bias and decreasing my variance – hopefully the variance decreases more than the bias increases”. (If you take a Bayesian perspective, this method assumes a prior that the squared norm of your estimator is relatively small.)
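
If a numerical picture helps, here is a small simulation sketch of my own (not from the recitation): two nearly collinear columns, many repeated noise draws, and the spread of the OLS versus ridge coefficient estimates.

```python
import numpy as np

rng = np.random.default_rng(0)

# Two nearly collinear predictors; true beta = (5, 5); compare estimator spread.
n, n_reps, lam = 100, 2000, 10.0
x1 = rng.normal(size=n)
x2 = x1 + 0.01 * rng.normal(size=n)            # almost an exact copy of x1
X = np.column_stack([x1, x2])
beta_true = np.array([5.0, 5.0])

ols_estimates, ridge_estimates = [], []
for _ in range(n_reps):
    y = X @ beta_true + rng.normal(size=n)
    ols_estimates.append(np.linalg.solve(X.T @ X, X.T @ y))
    ridge_estimates.append(np.linalg.solve(X.T @ X + lam * np.eye(2), X.T @ y))

# OLS coefficients swing wildly (huge variance, tiny bias); ridge coefficients
# are tightly clustered but pulled away from the truth (small variance, some bias).
print("OLS   std devs:", np.std(ols_estimates, axis=0), " means:", np.mean(ols_estimates, axis=0))
print("Ridge std devs:", np.std(ridge_estimates, axis=0), " means:", np.mean(ridge_estimates, axis=0))
```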

Threads I made.

Why we need the multivariate CLT.

Took me a second to appreciate why we needed to use the multivariate CLT in this example. You might be thinking what I thought: “After all, aren’t we just dealing with two functions of $X_i$ (namely $\sum_i X_i$ and $\sum_i X_i^2$)?”

But the fact that we’re dealing with two functions of $X_i$ should give us all the more reason to want to use the multivariate CLT, not less. The point of using the multivariate CLT is to capture the covariances between the coordinates of interest, and one could reasonably suspect that our two functions of $X_i$ are correlated. And in fact, if we had $X_i \sim N(\mu, \sigma^2)$, we have $Cov(X_i, X_i^2) = 2\mu\sigma^2$. It’s only because we had $\mu = 0$ in the problem statement that one might get lucky and “get away” with an incorrect method of convergence.
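
For reference, that covariance value follows from the Gaussian moments $E[X_i^2] = \mu^2 + \sigma^2$ and $E[X_i^3] = \mu^3 + 3\mu\sigma^2$:

$$\mathrm{Cov}(X_i, X_i^2) = E[X_i^3] - E[X_i]\,E[X_i^2] = (\mu^3 + 3\mu\sigma^2) - \mu(\mu^2 + \sigma^2) = 2\mu\sigma^2.$$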

(Oh, and thanks for the recitation section! Really helped make a little more concrete when/why/how to use the multivariate CLT/delta method when working with an estimator.)


4/9 EDIT: Fix typo in value of covariance term.

Some clarifications regarding Fisher information.

(I believe the below discussion will not give away any answers (given that $n$ is not mentioned in any of the prompts/graders), but if it seems off to staff/seems too “hinty”, I understand if it gets redacted.)

You may (as I did) scratch your head at the description of the Fisher information in this context. “We’ve always described $I(\theta) = -\mathbb{E}_X[\ell_1''(\theta; X)]$ (the negative expected second derivative of the log-likelihood) in terms of one observation, not $n$ of them! And the grader doesn’t mention using $n$ anywhere (e.g. in the variance)! What gives?”

The above is in fact the Fisher information for one observation, $I_1(\theta)$. Assuming i.i.d. samples, each observation contributes the same amount of Fisher information in expectation, so if we view $n$ observations, we have $I_n(\theta) = n I_1(\theta)$. The definition given in this problem is for $I_n(\theta)$. Since (a) asks for the $n \times p$ design matrix $\mathbb{X}$ rather than a single-observation $1 \times p$ row vector $\mathbf{X}_i^T$, one may expect that the resulting Fisher information is closer to $I_n(\theta)$ than to $I_1(\theta)$. (Though I believe an $E_Y[\dots]$ surrounding the sum in the problem’s definition would be appropriate.)
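
Spelling out the additivity step (under the usual regularity conditions, writing $\ell(\theta; X_i)$ for the log-likelihood of a single observation):

$$I_n(\theta) = -\mathbb{E}\left[\frac{\partial^2}{\partial\theta^2}\sum_{i=1}^n \ell(\theta; X_i)\right] = \sum_{i=1}^n\left(-\mathbb{E}\left[\ell''(\theta; X_i)\right]\right) = n\, I_1(\theta).$$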

You’ll also notice that in (b), the problem asks for the variance and not the “asymptotic variance”. We have normally been dealing with the asymptotic variance $Var(\sqrt{n}\hat{\theta})$, which for the MLE is $Var(\sqrt{n}\hat{\theta}^{MLE}) = f(I_1(\theta))$ (avoiding stating $f$ explicitly in case that would be too much of a hint for the problems). If we instead consider the actual variance of our estimator, then (dividing both sides by $n$ and recalling that $I_n(\theta) = n I_1(\theta)$) it turns out that $Var(\hat{\theta}^{MLE}) = f(I_n(\theta))$, using the Fisher information for $n$ samples.

Clarification of $\mu$ in (3)

[Context note: This post is from when we were discussing ridge regression and attempting to convert our likelihood function into a Gaussian “standard” form.]

For those who got lost when we determine $\mu$ (like me), here’s how I’m understanding it. Recall $X$ is $n \times p$, $\beta$ is $p$-dimensional.

Our goal is to get the expression $-\beta^T X^TY - Y^TX\beta + \beta^T X^T X \beta + \lambda\beta^T\beta$ to look like $\frac{1}{\sigma^2}(\beta - \mu)^T\Sigma^{-1}(\beta - \mu)$ for some $\mu$ and $\Sigma$. Expanding the second expression, we get $\frac{1}{\sigma^2}(\beta - \mu)^T\Sigma^{-1}(\beta - \mu) = \frac{1}{\sigma^2}(\beta - \mu)^T(\Sigma^{-1}\beta - \Sigma^{-1}\mu) = \frac{1}{\sigma^2}(\beta^T\Sigma^{-1}\beta - \beta^T\Sigma^{-1}\mu - \mu^T\Sigma^{-1}\beta + \mu^T\Sigma^{-1}\mu)$

Let’s match like terms. $\beta^T X^T X \beta + \lambda\beta^T\beta = \beta^T(X^T X + \lambda I_p)\beta$, which matches with our first term $\frac{1}{\sigma^2}\beta^T\Sigma^{-1}\beta$, so $\Sigma^{-1} = \sigma^2 (X^T X + \lambda I_p)$.

What about $\mu$? Let’s match $- \frac{1}{\sigma^2}\beta^T\Sigma^{-1}\mu$ with the only term in our expression whose dependence on $\beta$ is a prepended $\beta^T$, namely $-\beta^T X^TY$. Then $- \frac{1}{\sigma^2}\beta^T\Sigma^{-1}\mu = -\beta^T X^TY \implies \mu = \sigma^2\Sigma X^T Y = (X^T X + \lambda I_p)^{-1} X^T Y$
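
As a quick numerical sanity check (my own addition, not part of the original derivation), this $\mu$ should zero out the gradient of the ridge objective $\lVert Y - X\beta\rVert^2 + \lambda\lVert\beta\rVert^2$; that is, it is exactly the ridge estimator.

```python
import numpy as np

rng = np.random.default_rng(0)

# Made-up data just for the check.
n, p, lam = 50, 3, 2.0
X = rng.normal(size=(n, p))
Y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(size=n)

mu = np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ Y)
grad_at_mu = -2 * X.T @ (Y - X @ mu) + 2 * lam * mu
print(grad_at_mu)   # ~ all zeros: mu minimizes the ridge objective
```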

Reminder reference for “$AA^T$ is positive definite”.

Why, if $A$ is of full row rank, is $AA^T$ positive definite? One definition is that a matrix $M$ is positive definite iff $x^T M x > 0$ for all nonzero vectors $x$ (of the appropriate dimension). Then setting $M = AA^T$, we get $x^T A A^T x = (A^T x)^T(A^T x) = \lvert\lvert A^T x\rvert\rvert{}^2$, which is greater than zero for nonzero vectors $x$ because $A$ has full row rank (so $A^T x = \textbf{0}$ only when $x = \textbf{0}$).
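
A two-line numerical illustration of my own (a random wide matrix has full row rank with probability 1):

```python
import numpy as np

rng = np.random.default_rng(3)
A = rng.normal(size=(3, 5))          # 3 x 5, full row rank almost surely
print(np.linalg.eigvalsh(A @ A.T))   # eigenvalues of A A^T: all strictly positive
```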

(Note: It seems we’re using some different “transpose” convention here/in this unit in general compared to earlier ones ($X_i \beta$ as opposed to the often-previously-seen $X_i^T \beta$).)

A pleasantly simple reminder of a few equivalent definitions for positive definiteness is available here.

Proofs of Glivenko-Cantelli Theorem and Strong Law of Large Numbers.

The beginning of this set of notes proves the Glivenko-Cantelli Theorem. It uses the Strong Law of Large Numbers at the end, for which you can find a very approachable proof (using the stricter conditions $E[X^2], E[X^4] < \infty$) here. The mathematician Terence Tao gives a proof of the two Laws of Large Numbers under the weaker condition $E[\lvert X\rvert] < \infty$ here.

Note/reminder: We’re using denominator-layout notation for matrix calculus.

I very often get stymied by notation when needing to perform matrix calculus, e.g. $\frac{d}{d\textbf{x}}\textbf{Ax}$, as one may be inclined to do for certain problems.

If you find yourself checking relevant Wikipedia articles to double-check calculations (e.g. https://en.wikipedia.org/wiki/Matrix_calculus), you’ll find that we’re using what’s called denominator-layout notation for our matrix calculus. This aligns with the fact that we have $\nabla f(x_1,x_2,\cdots, x_n) = \frac{d}{d\textbf{x}}f$ as a column vector (i.e. an $n \times 1$ vector).
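
For instance, two derivatives that tend to come up (with $\mathbf{A}$ a constant matrix) are, in denominator layout,

$$\frac{\partial\, \mathbf{A}\mathbf{x}}{\partial \mathbf{x}} = \mathbf{A}^T, \qquad \frac{\partial\, \mathbf{x}^T\mathbf{A}\mathbf{x}}{\partial \mathbf{x}} = (\mathbf{A} + \mathbf{A}^T)\,\mathbf{x},$$

whereas numerator layout would give the transposes of these.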

Extremely hand-wavey way to justify (d-r).

[Context note: This is in the context of understanding the degrees of freedom for the chi-square test (a test to see whether a discrete distribution matches a proposed null hypothesis) or the likelihood-ratio test (a test to compare whether there is evidence that the unconstrained maximum-likelihood estimator for the parameter vector differs greatly from a constrained maximum-likelihood estimator).]

4/14/19 EDIT: After continuing forward, I have a different hand-wavey way of determining the number of degrees of freedom for an asymptotic chi-square distribution. Basically, $df= \text{df of our distribution using our estimator} - \text{df of our distribution under the null hypothesis}$. (Recall that if we’re using an identifiable statistical model, we have a one-to-one correspondence between $\theta$ and $\mathbb{P}_\theta$.)

The idea here is that in this likelihood-ratio test, our unconstrained MLE has all $d$ dimensions to “play with”, while our constrained MLE under the null has only $r$ dimensions to “play with”. This leads to an asymptotic $\chi^2_{d-r}$ distribution.

I prefer the above reasoning over my previous explanation because it seems to generalize a bit more sanely to at least some other hypothesis test cases, e.g. the chi-square test that comes up in the following lecture. (It’s a bit less obvious whether this can explain the fact that $n\hat{\sigma}^2/\sigma^2 \sim \chi^2_{n-1}$, as this is not a test statistic arising from a hypothesis test, but hopefully the justification provided for that case is clear enough already.)
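
As a sanity check on the $d - r$ heuristic, here is a toy simulation of my own (a Gaussian mean with identity covariance, where the null pins the last $d - r$ coordinates at zero); the simulated quantiles of the likelihood-ratio statistic line up with $\chi^2_{d-r}$.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)

# X_i ~ N(theta, I_d); H0 fixes the last (d - r) coordinates of theta at 0.
d, r, n, n_sims = 5, 2, 200, 5000
lrt_stats = []
for _ in range(n_sims):
    x = rng.normal(size=(n, d))                # data generated under H0 (theta = 0)
    xbar = x.mean(axis=0)
    # Unconstrained MLE: xbar. Constrained MLE: xbar on the first r coordinates, 0 elsewhere.
    # Twice the log-likelihood gap collapses to n * (sum of squares of the last d - r means).
    lrt_stats.append(n * np.sum(xbar[r:] ** 2))

qs = [0.5, 0.9, 0.95, 0.99]
print("simulated quantiles:  ", np.quantile(lrt_stats, qs))
print("chi2_{d-r} quantiles: ", stats.chi2.ppf(qs, df=d - r))
```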

I leave my earlier post below.


Note: as stated in the title, this is extremely hand-wavey and is just my way of trying to make sense of the somewhat odd-without-context assertion of Wilks’ Theorem. Hopefully this doesn’t just make things more confusing haha.

For the likelihood ratio test, our null and alternative hypotheses concern $(d-r)$ coordinates of our parameter, so it makes sense for our distribution to have the same “dimensionality”, leading to $\chi^2_{d-r}$.

To provide some vaguely relevant backup: When we use the multivariate delta method to collapse some multivariate parameter to a scalar (say $g: \mathbb{R}^d \to \mathbb{R}$, $g(\vec{\theta}) = \sum_{i=1}^d \theta^{(i)}$) and normalize accordingly, we asymptotically approach a one-dimensional Normal distribution, not a $d$-dimensional or a $(d-1)$-dimensional one.

In fact, in the above example, $d-1$ coordinates are free to vary (the set of solutions to $\sum_{i=1}^d \theta^{(i)} = C$ sits on a $(d-1)$-dimensional plane) and everything “after” the $r = d-1$’th coordinate is constrained by the “first” $r$. And $d-r = d- (d-1) = 1$ gives the dimensionality we “expect”.

Some errata.

In (4): the TA states that a property of projection matrices is that they are equal to their transpose. This is not true – one can have “oblique” projection matrices $P$ that are not equal to $P^T$ (see, for instance, this example from Wikipedia). The main requirement is that a projection matrix be idempotent, i.e. $P^2 = P$.
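
As a concrete instance (my own small example), the matrix below is idempotent but not symmetric, so it is a projection that is not an orthogonal projection:

$$P = \begin{pmatrix} 1 & 1 \\ 0 & 0 \end{pmatrix}, \qquad P^2 = P, \qquad P \neq P^T.$$

It projects onto the span of $(1, 0)^T$ along the direction $(1, -1)^T$, and those two directions are not orthogonal.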

The TA’s comments following that are true for the case of orthogonal projection matrices (but not for all projection matrices). Note that the eigendecomposition of a diagonalizable square matrix $A$ is $V\Lambda V^{-1}$ (Wikipedia entry); for the case of symmetric matrices, the eigenvectors can be chosen orthonormal, so $V^{-1} = V^T$ and we obtain the relationship shown in the video.

[…]

All in all, this was an incredibly helpful recitation – thank you!

Threads made by others.

Deep Dive/Recitation on Chi-Squared, Student’s t Distributions, and Independence of the Sample Mean and Variance for iid Normal Variables

[Post by @markweitzman]

I have posted five videos on YouTube covering a deep dive/recitation on the chi-squared and Student’s t distributions and Cochran’s theorem. Slides from the videos are also available below. The production values/standards leave much to be desired, but I welcome feedback on overall usefulness, especially with respect to content. The videos are described below, along with their links; I recommend videos 2 and 3 especially if you had difficulty with the heavy linear algebra in the recitation above. I have tried to use a more simplified approach. The remainder of the videos are more mathematical, involving the calculation of pdfs.

Video 1: Derivation of the Chi-Squared pdf (13:42)

Video 2: Behavior of iid normal variables under an orthogonal transformation (17:45)

Video 3: Independence of the sample mean and sample variance for iid normal variables (15:11)

Video 4: PDFs for the sample mean and sample variance (23:32)

Video 5: PDF for the Student’s t distribution (22:34)

[…]

External writeups.

Derivation of the Welch-Satterthwaite equation.

In addition to the posts above, I ended up spending a long time trying to understand where the Welch-Satterthwaite equation (used to calculate an “effective degrees of freedom” for a t-test involving two samples of different sizes) comes from. Seeing that there were no clear explanations online, I wrote up my newfound understanding of the equation as an answer to the Math StackExchange post that would often pop up when I attempted to look up relevant information online.