Large-Sample Regression Theory

Author

Sushovan Majhi

Published

April 16, 2024

Defining Random Variables

We write a k-vector (of scalars) as a row {\boldsymbol{x}}= \begin{bmatrix} x_1 & x_2 & \ldots & x_k \end{bmatrix}. The transpose of {\boldsymbol{x}} is the column vector {\boldsymbol{x}}^T= \begin{bmatrix} x_1 \\ x_2\\ \vdots \\ x_k \end{bmatrix}.

We use uppercase letters X,Y,Z,\ldots to denote random variables. Random vectors are denoted by bold uppercase letters {\boldsymbol{X}},{\boldsymbol{Y}},{\boldsymbol{Z}},\ldots, and written as a row vector. For example, {\boldsymbol{X}}= \begin{bmatrix} X_{[1]} & X_{[2]} & \ldots & X_{[k]} \end{bmatrix}.

To distinguish random matrices from random vectors, we denote a random matrix by {\mathbb{X}}.

The expectation of {\boldsymbol{X}} is defined as {\mathbb{E}\left[ {\boldsymbol{X}} \right]}= \begin{bmatrix} {\mathbb{E}\left[ X_{[1]} \right]} & {\mathbb{E}\left[ X_{[2]} \right]} & \ldots & {\mathbb{E}\left[ X_{[k]} \right]} \end{bmatrix}. The k\times k covariance matrix of {\boldsymbol{X}} is defined as \begin{aligned} {\mathbb{V}\left[ {\boldsymbol{X}} \right]} &={\mathbb{E}\left[ ({\boldsymbol{X}}-{\mathbb{E}\left[ {\boldsymbol{X}} \right]})^T({\boldsymbol{X}}-{\mathbb{E}\left[ {\boldsymbol{X}} \right]}) \right]} \\ &=\begin{bmatrix} \sigma_1^2 & \sigma_{12} & \ldots & \sigma_{1k} \\ \sigma_{21} & \sigma_{2}^2 & \ldots & \sigma_{2k} \\ \vdots & \vdots & \ddots & \vdots \\ \sigma_{k1} & \sigma_{k2} & \ldots & \sigma_{k}^2 \\ \end{bmatrix}_{k\times k} \end{aligned}

where \sigma_j^2={\mathbb{V}\left[ X_{[j]} \right]} and \sigma_{ij}={\text{Cov}\left[ X_{[i]},X_{[j]} \right]} for i,j=1,2,\ldots,k and i\neq j.
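As a quick numerical check of these definitions, the following sketch (illustrative only; the particular distribution is made up and NumPy is assumed to be available) estimates {\mathbb{V}\left[ {\boldsymbol{X}} \right]} from simulated draws by forming the sample analogue of {\mathbb{E}\left[ ({\boldsymbol{X}}-{\mathbb{E}\left[ {\boldsymbol{X}} \right]})^T({\boldsymbol{X}}-{\mathbb{E}\left[ {\boldsymbol{X}} \right]}) \right]} with each observation stored as a row.

```python
import numpy as np

rng = np.random.default_rng(0)

# Draw n i.i.d. copies of a 3-vector X (one row per observation).
n = 100_000
X = rng.multivariate_normal(mean=[0.0, 1.0, 2.0],
                            cov=[[2.0, 0.5, 0.0],
                                 [0.5, 1.0, 0.3],
                                 [0.0, 0.3, 1.5]],
                            size=n)

# Sample analogue of V[X] = E[(X - E[X])^T (X - E[X])] with X a row vector.
Xc = X - X.mean(axis=0)          # center each component
V_hat = (Xc.T @ Xc) / n          # k x k sample covariance matrix

print(np.round(V_hat, 2))                                # close to the true covariance matrix
print(np.round(np.cov(X, rowvar=False, bias=True), 2))   # agrees with NumPy's built-in estimator
```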

Theorem 1 (Linearity of Expectation) Let {\mathbb{A}}_{l\times k},{\mathbb{B}}_{m\times l} be fixed matrices and {\boldsymbol{c}} a fixed vector of size l. If {\boldsymbol{X}} and {\boldsymbol{Y}} are random vectors of size k and m, respectively, such that {\mathbb{E}\left[ ||{\boldsymbol{X}}|| \right]}<\infty and {\mathbb{E}\left[ ||{\boldsymbol{Y}}|| \right]}<\infty, then {\mathbb{E}\left[ {\mathbb{A}}{\boldsymbol{X}}+{\boldsymbol{Y}}{\mathbb{B}}+{\boldsymbol{c}} \right]}={\mathbb{A}}{\mathbb{E}\left[ {\boldsymbol{X}} \right]}+{\mathbb{E}\left[ {\boldsymbol{Y}} \right]}{\mathbb{B}}+{\boldsymbol{c}}.

Conditional Expectation and the BLP

Let us roll two dice, and define random variables X and Y as the difference and the sum of the face-values, respectively. Depending on what nature decides to choose when the dice are rolled, the random variable X can output a number from \{-5,-4,\ldots,4,5\} and Y a number from \{2,3,\ldots,12\}.

If X=5, then the face-values must be (6,1), so Y=7 with certainty; that is, {\mathbb{E}\left[ Y|X=5 \right]}=7.
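To see the conditional expectation at work, here is a short simulation (a sketch, not part of the original example beyond the dice setup) that estimates {\mathbb{E}\left[ Y|X=x \right]} by averaging Y over the rolls in which X=x.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 1_000_000

d1 = rng.integers(1, 7, size=n)   # first die
d2 = rng.integers(1, 7, size=n)   # second die
X = d1 - d2                       # difference of face-values
Y = d1 + d2                       # sum of face-values

# Estimate the CEF E[Y | X = x] by averaging Y over rolls with X == x.
for x in range(-5, 6):
    print(x, round(Y[X == x].mean(), 3))
# By symmetry, E[Y | X = x] = 7 for every x; in particular E[Y | X = 5] = 7,
# since X = 5 forces the roll (6, 1).
```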

CEF Bivariate

(Optional) CEF Multivariate

Theorem 2 (Characterization of CEF) If  {\mathbb{E}\left[ Y^2 \right]}<\infty and {\boldsymbol{X}} is a random vector such that Y=m({\boldsymbol{X}})+e, then the following statements are equivalent:
1. m({\boldsymbol{X}})={\mathbb{E}\left[ Y|{\boldsymbol{X}} \right]}, the CEF of Y given {\boldsymbol{X}}
2. {\mathbb{E}\left[ e|{\boldsymbol{X}} \right]}=0

Best Linear Predictor

Let Y be a random variable and {\boldsymbol{X}} be a random vector of k variables. We denote the best linear predictor of Y given {\boldsymbol{X}} by \mathscr{P}[Y|{\boldsymbol{X}}]. It’s also called the linear projection of Y on {\boldsymbol{X}}.

Theorem 3 (Best Linear Predictor) Under the following assumptions

  1. {\mathbb{E}\left[ Y^2 \right]}<\infty
  2. {\mathbb{E}\left[ ||{\boldsymbol{X}}||^2 \right]}<\infty
  3. {\mathbb{Q}}_{{\boldsymbol{XX}}}\stackrel{\text{def}}{=}{\mathbb{E}\left[ {\boldsymbol{X}}^T{\boldsymbol{X}} \right]} is positive-definite

the best linear predictor exists uniquely, and has the form \mathscr{P}[Y|{\boldsymbol{X}}]={\boldsymbol{X}}{\boldsymbol{\beta}}, where {\boldsymbol{\beta}}=\left({\mathbb{E}\left[ {\boldsymbol{X}}^T{\boldsymbol{X}} \right]}\right)^{-1}{\mathbb{E}}[{\boldsymbol{X}}^TY] is a column vector.
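As an illustration of Theorem 3, the population moments {\mathbb{E}\left[ {\boldsymbol{X}}^T{\boldsymbol{X}} \right]} and {\mathbb{E}}[{\boldsymbol{X}}^TY] can be approximated by sample averages. The sketch below does this for a made-up joint distribution (the data-generating process is purely illustrative); the CEF is nonlinear, so {\boldsymbol{X}}{\boldsymbol{\beta}} is only the best linear approximation.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 200_000

# X = [1, X1, X2] (row vector with a constant); Y depends on X nonlinearly.
X1 = rng.normal(size=n)
X2 = rng.uniform(-1, 1, size=n)
X = np.column_stack([np.ones(n), X1, X2])
Y = 1.0 + 2.0 * X1 - X2 + 0.5 * X1**2 + rng.normal(size=n)

# Sample analogues of E[X^T X] and E[X^T Y].
Qxx = X.T @ X / n
Qxy = X.T @ Y / n

beta = np.linalg.solve(Qxx, Qxy)   # beta = (E[X^T X])^{-1} E[X^T Y]
print(beta.round(3))
```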

In the following theorem, we show that the BLP error is uncorrelated with the explanatory variables.

Theorem 4 (Best Linear Predictor Error) If the BLP exists, the linear projection error \varepsilon=Y-\mathscr{P}[Y|{\boldsymbol{X}}] satisfies the following properties:

  1. {\mathbb{E}}[{\boldsymbol{X}}^T\varepsilon]={\boldsymbol{0}}
  2. moreover, {\mathbb{E}}[\varepsilon]=0 if {\boldsymbol{X}}=\begin{bmatrix}1 & X_{[1]} & \ldots & X_{[k]} \end{bmatrix} contains a constant.
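The properties in Theorem 4 are easy to check numerically. In the following sketch (again with a made-up distribution, chosen so that the CEF is nonlinear), the BLP coefficient is approximated on one large sample, and the sample analogues of {\mathbb{E}}[{\boldsymbol{X}}^T\varepsilon] and {\mathbb{E}}[\varepsilon] are then evaluated on an independent sample; both come out close to zero.

```python
import numpy as np

rng = np.random.default_rng(3)

def draw(n):
    """Draw (X, Y) with X = [1, X1] and a nonlinear CEF, so the BLP only approximates the CEF."""
    X1 = rng.exponential(scale=1.0, size=n)
    X = np.column_stack([np.ones(n), X1])
    Y = np.sin(X1) + rng.normal(size=n)
    return X, Y

# Approximate the population BLP coefficient on one large sample ...
X, Y = draw(1_000_000)
beta = np.linalg.solve(X.T @ X, X.T @ Y)

# ... and check the error properties on an independent sample.
X, Y = draw(1_000_000)
eps = Y - X @ beta                       # linear projection error

print((X.T @ eps / len(Y)).round(4))     # ~ [0, 0]: E[X^T eps] = 0
print(round(eps.mean(), 4))              # ~ 0: E[eps] = 0 since X contains a constant
```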

Large-Sample Regression

We assume that the best linear predictor, \mathscr{P}[Y|{\boldsymbol{X}}], of Y given {\boldsymbol{X}} is {\boldsymbol{X}}{\boldsymbol{\beta}}, so that Y={\boldsymbol{X}}{\boldsymbol{\beta}}+\varepsilon, where \varepsilon is the projection error. From Theorem 4 (assuming {\boldsymbol{X}} contains a constant), we have {\mathbb{E}\left[ \varepsilon \right]}=0,\text{ and }{\mathbb{E}\left[ {\boldsymbol{X}}^T\varepsilon \right]}={\boldsymbol{0}}.

We also assume that the dataset \{(Y_i,{\boldsymbol{X}}_i)\} is taken i.i.d. from the joint distribution of (Y,{\boldsymbol{X}}). For each i, we can write Y_i={\boldsymbol{X}}_i{\boldsymbol{\beta}}+\varepsilon_i. In matrix notation, we can write {\boldsymbol{Y}}={\mathbb{X}}{\boldsymbol{\beta}}+{\boldsymbol{\varepsilon}}. Then {\mathbb{E}\left[ {\boldsymbol{\varepsilon}} \right]}={\boldsymbol{0}},\text{ and } {\mathbb{E}\left[ {\mathbb{X}}^T{\boldsymbol{\varepsilon}} \right]}={\boldsymbol{0}}.

Consistency of OLS Estimators

Asymptotic Normality

We start by deriving an alternative expression for the OLS estimator \widehat{{\boldsymbol{\beta}}} using matrix notation.

\begin{aligned} \widehat{{\boldsymbol{\beta}}} &=\left[{\mathbb{X}}^T{\mathbb{X}}\right]^{-1}{\mathbb{X}}^T{\boldsymbol{Y}} \\ &=\left[{\mathbb{X}}^T{\mathbb{X}}\right]^{-1}{\mathbb{X}}^T({\mathbb{X}}{\boldsymbol{\beta}}+{\boldsymbol{\varepsilon}}) \\ &=\left[{\mathbb{X}}^T{\mathbb{X}}\right]^{-1}({\mathbb{X}}^T{\mathbb{X}}){\boldsymbol{\beta}}+ \left[{\mathbb{X}}^T{\mathbb{X}}\right]^{-1}{\mathbb{X}}^T{\boldsymbol{\varepsilon}} \\ &={\boldsymbol{\beta}} + \left[{\mathbb{X}}^T{\mathbb{X}}\right]^{-1}{\mathbb{X}}^T{\boldsymbol{\varepsilon}} \end{aligned}

So, \widehat{{\boldsymbol{\beta}}}-{\boldsymbol{\beta}} = \left[{\mathbb{X}}^T{\mathbb{X}}\right]^{-1}{\mathbb{X}}^T{\boldsymbol{\varepsilon}} \tag{1}
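For concreteness, the following sketch (simulated data; the design and coefficients are illustrative) computes \widehat{{\boldsymbol{\beta}}}=\left[{\mathbb{X}}^T{\mathbb{X}}\right]^{-1}{\mathbb{X}}^T{\boldsymbol{Y}} and verifies that the decomposition in Equation 1 holds exactly in the sample.

```python
import numpy as np

rng = np.random.default_rng(4)
n, k = 5_000, 3
beta = np.array([1.0, 2.0, -0.5])

X = np.column_stack([np.ones(n), rng.normal(size=(n, k - 1))])   # n x k design matrix
eps = rng.normal(size=n) * (1 + 0.5 * np.abs(X[:, 1]))           # heteroskedastic errors
Y = X @ beta + eps

XtX_inv = np.linalg.inv(X.T @ X)
beta_hat = XtX_inv @ (X.T @ Y)                 # OLS estimator

# Equation 1: beta_hat - beta equals (X'X)^{-1} X' eps.
print(np.allclose(beta_hat - beta, XtX_inv @ (X.T @ eps)))   # True
```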

Multiplying both sides of Equation 1 by \sqrt{n}, and noting that {\mathbb{X}}^T{\mathbb{X}}=\sum_{i=1}^n{\boldsymbol{X}}_i^T{\boldsymbol{X}}_i and {\mathbb{X}}^T{\boldsymbol{\varepsilon}}=\sum_{i=1}^n{\boldsymbol{X}}_i^T\varepsilon_i, we get \begin{aligned} \sqrt{n}\left(\widehat{{\boldsymbol{\beta}}}-{\boldsymbol{\beta}}\right) &=\left( \frac{1}{n}\sum\limits_{i=1}^n{\boldsymbol{X}}_i^T{\boldsymbol{X}}_i \right)^{-1} \left( \frac{1}{\sqrt{n}}\sum\limits_{i=1}^n{\boldsymbol{X}}_i^T\varepsilon_i \right) \\ &=\widehat{{\mathbb{Q}}}_{{\boldsymbol{XX}}}^{-1} \left( \frac{1}{\sqrt{n}}\sum\limits_{i=1}^n{\boldsymbol{X}}_i^T\varepsilon_i \right) \end{aligned} From the consistency of the OLS estimators, we already have \widehat{{\mathbb{Q}}}_{{\boldsymbol{XX}}}\xrightarrow[p]{\quad\quad}{\mathbb{Q}}_{{\boldsymbol{XX}}}. Our aim now is to understand the distribution of the stochastic term (the second factor) in the above expression.

We first note (from the i.i.d. assumption and Theorem 4) that {\mathbb{E}\left[ {\boldsymbol{X}}_i^T\varepsilon_i \right]}={\mathbb{E}\left[ {\boldsymbol{X}}^T\varepsilon \right]}={\boldsymbol{0}}. Let us compute the covariance matrix of {\boldsymbol{X}}_i^T\varepsilon_i. Since the expectation vector is zero, we have {\mathbb{V}}[{\boldsymbol{X}}_i^T\varepsilon_i]={\mathbb{E}\left[ {\boldsymbol{X}}_i^T\varepsilon_i\left({\boldsymbol{X}}_i^T\varepsilon_i\right)^T \right]}={\mathbb{E}\left[ {\boldsymbol{X}}^T{\boldsymbol{X}}\varepsilon^2 \right]}\stackrel{\text{def}}{=}{\mathbb{A}}. Since the observations \{(Y_i,{\boldsymbol{X}}_i)\} are independent, so are the vectors \{{\boldsymbol{X}}_i^T\varepsilon_i\}, each being a function of the i-th observation only. By the (multivariate) Central Limit Theorem, as n\to\infty \frac{1}{\sqrt{n}}\sum\limits_{i=1}^n{\boldsymbol{X}}_i^T\varepsilon_i \xrightarrow[d]{\quad\quad}\mathcal{N}({\boldsymbol{0}},{\mathbb{A}}). There is a small technicality here: every entry of {\mathbb{A}} must be finite. This can be guaranteed by a stronger regularity condition on the moments, e.g., {\mathbb{E}\left[ Y^4 \right]}<\infty and {\mathbb{E}\left[ ||{\boldsymbol{X}}||^4 \right]}<\infty. Putting everything together (using Slutsky's theorem), we conclude \sqrt{n}(\widehat{{\boldsymbol{\beta}}}-{\boldsymbol{\beta}})\xrightarrow[d]{\quad\quad} {\mathbb{Q}}_{{\boldsymbol{XX}}}^{-1}\mathcal{N}({\boldsymbol{0}},{\mathbb{A}}) =\mathcal{N}\left({\boldsymbol{0}},\left[{\mathbb{Q}}_{{\boldsymbol{XX}}}^{-1}\right]^T{\mathbb{A}}{\mathbb{Q}}_{{\boldsymbol{XX}}}^{-1}\right) =\mathcal{N}\left({\boldsymbol{0}},{\mathbb{Q}}_{{\boldsymbol{XX}}}^{-1}{\mathbb{A}}{\mathbb{Q}}_{{\boldsymbol{XX}}}^{-1}\right), where the last equality uses the symmetry of {\mathbb{Q}}_{{\boldsymbol{XX}}}.

Theorem 5 (Asymptotic Distribution of OLS Estimators) We assume the following:
1. The observations \{(Y_i,{\boldsymbol{X}}_i)\}_{i=1}^n are i.i.d from the joint distribution of (Y,{\boldsymbol{X}})
2. {\mathbb{E}\left[ Y^4 \right]}<\infty
3. {\mathbb{E}\left[ ||{\boldsymbol{X}}||^4 \right]}<\infty
4. {\mathbb{Q}}_{{\boldsymbol{XX}}}={\mathbb{E}\left[ {\boldsymbol{X}}^T{\boldsymbol{X}} \right]} is positive-definite.

Under these assumptions, as n\to\infty \sqrt{n}(\widehat{{\boldsymbol{\beta}}}-{\boldsymbol{\beta}})\xrightarrow[d]{\quad\quad} \mathcal{N}\left({\boldsymbol{0}},{\mathbb{V}}_{{\boldsymbol{\beta}}}\right), where {\mathbb{V}}_{{\boldsymbol{\beta}}}\stackrel{\text{def}}{=}{\mathbb{Q}}_{{\boldsymbol{XX}}}^{-1}{\mathbb{A}}{\mathbb{Q}}_{{\boldsymbol{XX}}}^{-1} and {\mathbb{A}}={\mathbb{E}\left[ {\boldsymbol{X}}^T{\boldsymbol{X}}\varepsilon^2 \right]}.
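A small Monte Carlo experiment (a sketch under an arbitrary heteroskedastic data-generating process, not part of the theorem) illustrates Theorem 5: the sampling variance of \sqrt{n}(\widehat{\beta}_1-\beta_1) across repeated samples is close to the corresponding diagonal entry of {\mathbb{V}}_{{\boldsymbol{\beta}}}.

```python
import numpy as np

rng = np.random.default_rng(5)
beta = np.array([1.0, 2.0])
n, reps = 500, 2_000

# Sampling distribution of sqrt(n) * (beta_hat_1 - beta_1) across repeated samples.
draws = np.empty(reps)
for r in range(reps):
    X = np.column_stack([np.ones(n), rng.normal(size=n)])
    eps = rng.normal(size=n) * (1 + np.abs(X[:, 1]))        # heteroskedastic errors
    Y = X @ beta + eps
    beta_hat = np.linalg.solve(X.T @ X, X.T @ Y)
    draws[r] = np.sqrt(n) * (beta_hat[1] - beta[1])

# Approximate the population sandwich V_beta = Qxx^{-1} A Qxx^{-1} with one large sample.
m = 1_000_000
Xs = np.column_stack([np.ones(m), rng.normal(size=m)])
es = rng.normal(size=m) * (1 + np.abs(Xs[:, 1]))
Qxx = Xs.T @ Xs / m
A = (Xs * (es**2)[:, None]).T @ Xs / m
V = np.linalg.inv(Qxx) @ A @ np.linalg.inv(Qxx)

print(round(draws.var(), 3), round(V[1, 1], 3))   # Monte Carlo variance vs. asymptotic variance
```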

The covariance matrix {\mathbb{V}}_{{\boldsymbol{\beta}}} is called the asymptotic variance matrix of \widehat{{\boldsymbol{\beta}}}. The matrix is sometimes referred to as the sandwich form.

Covariance Matrix Estimation

We now turn our attention to the estimation of the sandwich matrix using a finite sample.

Heteroskedastic Variance

Theorem 5 showed that the asymptotic covariance matrix of \sqrt{n}(\widehat{{\boldsymbol{\beta}}}-{\boldsymbol{\beta}}) is {\mathbb{V}}_{{\boldsymbol{\beta}}} ={\mathbb{Q}}_{{\boldsymbol{XX}}}^{-1}{\mathbb{A}}{\mathbb{Q}}_{{\boldsymbol{XX}}}^{-1}. Without imposing any homoskedasticity condition, we estimate {\mathbb{V}}_{{\boldsymbol{\beta}}} using a plug-in estimator.

We have already seen that \widehat{{\mathbb{Q}}}_{{\boldsymbol{XX}}}=\frac{1}{n}\sum\limits_{i=1}^n{\boldsymbol{X}}_i^T{\boldsymbol{X}}_i is a natural estimator for {\mathbb{Q}}_{{\boldsymbol{XX}}}. For {\mathbb{A}}, we use the moment estimator \widehat{{\mathbb{A}}}=\frac{1}{n}\sum\limits_{i=1}^n{\boldsymbol{X}}_i^T{\boldsymbol{X}}_ie_i^2, where e_i=(Y_i-{\boldsymbol{X}}_i\widehat{{\boldsymbol{\beta}}}) is the i-th residual. As it turns out, \widehat{{\mathbb{A}}} is a consistent estimator for {\mathbb{A}}.

As a result, we get the following plug-in estimator for {\mathbb{V}}_{{\boldsymbol{\beta}}}: \widehat{{\mathbb{V}}}_{{\boldsymbol{\beta}}}= \widehat{{\mathbb{Q}}}_{{\boldsymbol{XX}}}^{-1}\widehat{{\mathbb{A}}}\widehat{{\mathbb{Q}}}_{{\boldsymbol{XX}}}^{-1} This estimator, often labeled HC0, is also consistent. For a proof, see Hansen (2013).

As a consequence, we can get the following estimator for the variance, {\mathbb{V}}_{\widehat{{\boldsymbol{\beta}}}}, of \widehat{{\boldsymbol{\beta}}} in the heteroskedastic case. \begin{aligned} \widehat{{\mathbb{V}}}\left[\widehat{{\boldsymbol{\beta}}}\right] &=\frac{1}{n}\widehat{{\mathbb{V}}}_{{\boldsymbol{\beta}}}^{\text{HC0}} \\ &=\frac{1}{n}\widehat{{\mathbb{Q}}}_{{\boldsymbol{XX}}}^{-1}\widehat{{\mathbb{A}}}\widehat{{\mathbb{Q}}}_{{\boldsymbol{XX}}}^{-1} \\ &=\frac{1}{n}\left(\frac{1}{n}\sum\limits_{i=1}^n{\boldsymbol{X}}_i^T{\boldsymbol{X}}_i\right)^{-1} \left(\frac{1}{n}\sum\limits_{i=1}^ne_i^2{\boldsymbol{X}}_i^T{\boldsymbol{X}}_i\right) \left(\frac{1}{n}\sum\limits_{i=1}^n{\boldsymbol{X}}_i^T{\boldsymbol{X}}_i\right)^{-1} \\ &=\left({\mathbb{X}}^T{\mathbb{X}}\right)^{-1} {\mathbb{X}}^T{\mathbb{D}}{\mathbb{X}} \left({\mathbb{X}}^T{\mathbb{X}}\right)^{-1} \end{aligned} where {\mathbb{D}} is an n\times n diagonal matrix with diagonal entries e_1^2,e_2^2,\ldots,e_n^2. The estimator, \widehat{{\mathbb{V}}}\left[\widehat{{\boldsymbol{\beta}}}\right], is referred to as the robust error variance estimator for the OLS coefficients \widehat{{\boldsymbol{\beta}}}.
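A minimal sketch of the HC0 computation follows (simulated, heteroskedastic data; the cross-check against statsmodels mentioned in the last comment is optional and assumes that package is installed). Note that forming the n\times n matrix {\mathbb{D}} explicitly is unnecessary: scaling the rows of {\mathbb{X}} by e_i^2 gives {\mathbb{X}}^T{\mathbb{D}}{\mathbb{X}} directly.

```python
import numpy as np

rng = np.random.default_rng(6)
n = 1_000

X = np.column_stack([np.ones(n), rng.normal(size=n), rng.uniform(size=n)])
beta = np.array([1.0, 2.0, -0.5])
eps = rng.normal(size=n) * (1 + np.abs(X[:, 1]))      # heteroskedastic errors
Y = X @ beta + eps

XtX_inv = np.linalg.inv(X.T @ X)
beta_hat = XtX_inv @ (X.T @ Y)
e = Y - X @ beta_hat                                  # residuals

# HC0 sandwich: (X'X)^{-1} X'DX (X'X)^{-1}, with D = diag(e_1^2, ..., e_n^2).
XtDX = (X * (e**2)[:, None]).T @ X                    # X'DX without building the n x n matrix D
V_hat = XtX_inv @ XtDX @ XtX_inv

print(np.sqrt(np.diag(V_hat)).round(4))               # robust (HC0) standard errors
# These should agree with statsmodels' results.HC0_se, if that package is available.
```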