Answer each of the following questions with True, False, or
Uncertain and justify your answers with a concise argument or a proof if necessary. You
may cite theorems proved in class but you must show precisely how you are applying
the theorem and check that the assumptions required for the theorem hold. Correct
answers without justification will receive minimal credit.
(a) (5 points) The sample mean of an i.i.d. sample $\{x_i\}_{i=1}^n$ is an unbiased estimator
for E[X].
(b) (5 points) The residuals from an OLS regression, $\{\hat{u}_i\}_{i=1}^n$, must contain some
negative values.
(c) (5 points) Suppose, given an i.i.d. sample $\{x_i, y_i\}_{i=1}^n$ and a univariate regression
of Y on X, you are able to reject a two-sided hypothesis test of H0 : β1 = 0 with
significance level α. Then your data will also reject a one-sided hypothesis test of
H0 : β1 = 0 with the same significance level.
(d) (5 points) Suppose you are interested in the relationship between age and health,
and you divide the data into three mutually exclusive age categories, child =
I{age < 18}, adult = I{18 ≤ age < 65}, and senior = I{age ≥ 65}. Then we
must omit one of the age categories in the following regression:
Y = β0 + β1child + β2adult + β3senior + U
(e) (5 points) Suppose the true regression model is $Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + U$, with
$\mathrm{cov}(X_1, U) = \mathrm{cov}(X_2, U) = 0$. If we omit $X_2$ from the regression and estimate the
naive regression $Y = \beta_0 + \beta_1 X_1 + \varepsilon$, we will recover a $\hat{\beta}_1$ that is consistent as long
as $\mathrm{cov}(X_1, X_2) = 0$.
(f) (5 points) In a multivariate regression, Y = β0 + β1X1 + β2X2 + U, the test
statistic of an F-test for a hypothesis test of H0 : β1 = β2 = 0 is always nonnegative.
(g) (5 points) If an OLS estimation returns a high $R^2$, it is likely that the model
shows a causal relationship between X and Y.
2. (25 POINTS) Esther Duflo of MIT has studied one of the largest school construction
projects in Indonesia, which occurred between 1973 and 1978. She uses data from
two Indonesian censuses (1973 and 1978), where she restricts the sample to individuals
aged 12-17 in each census. She observes an individual’s region of birth and the location
of school construction sites to study whether school construction increases educational
attainment. Let $Y_i$ be the years of education individual i completed. Duflo defines two
types of regions: regions of high exposure to school construction projects and regions
of low exposure to school construction projects. Consider the following regression:
Y = β0 + β1C78 + β2High + β3(High × C78) + U
where $C78_i = 1$ if individual i is in the 1978 census and 0 if individual i is in the 1973
census. Let $High_i = 1$ if individual i’s region of birth has high exposure to school
construction and 0 if i’s region has low exposure to construction.
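To make the role of the interaction term concrete, here is a minimal sketch of how the three regressors could be constructed from the two indicators. The four rows are hypothetical, one per combination of census year and exposure type, and are not data from the study; the names `individuals` and `design` are illustrative only.

```python
# Hypothetical individuals as (C78, High) pairs, one per census/exposure combination.
individuals = [(0, 0), (1, 0), (0, 1), (1, 1)]

# Each design row is [intercept, C78, High, High x C78].
design = [[1, c78, high, high * c78] for c78, high in individuals]

for row in design:
    print(row)
```

The interaction column equals 1 only for individuals who are both in the 1978 census and born in a high-exposure region.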
(a) (5 points) What null hypothesis would you test if you wished to see if individuals
in 1978 in regions of low exposure were no different in educational attainment
compared to individuals in 1973 in regions of low exposure? Explain.
(b) (5 points) What null hypothesis would you test if you wished to see if individuals
in 1978 in regions of high exposure were no different in educational attainment
than individuals in 1973 in regions of high exposure? Explain.
(c) (5 points) What null hypothesis would you test if you wished to see if the change
in educational attainment from 1973 to 1978 in areas of low construction is no
different from the change in educational attainment from 1973 to 1978 in areas of
high construction? Explain.
(d) (5 points) Duflo finds $\hat{\beta}_3 = 0.15$ with $SE(\hat{\beta}_3) = 0.03$. How do you interpret this
estimate in words?
(e) (5 points) Given that $\hat{\beta}_3 = 0.15$ with $SE(\hat{\beta}_3) = 0.03$, perform a hypothesis test that
the school construction project had no effect on educational attainment in Indonesia.
Do you reject your null hypothesis at the 10% significance level? (Hint: if $W \sim t_\infty$,
then $\Pr\{W \le 1.96\} = 0.975$ and $\Pr\{W \le 1.645\} = 0.95$; assume that $n \to \infty$
here.)
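The two probabilities quoted in the hint can be checked numerically. The following is a minimal sketch, assuming that a t distribution with infinite degrees of freedom coincides with the standard normal, and using Python's standard-library `statistics.NormalDist`; the variable names are illustrative and not part of the exam.

```python
from statistics import NormalDist

# Assumption: t with infinite degrees of freedom is the standard normal.
std_normal = NormalDist(mu=0.0, sigma=1.0)

# Cumulative probabilities at the two cutoffs quoted in the hint.
p_196 = std_normal.cdf(1.96)
p_1645 = std_normal.cdf(1.645)

print(round(p_196, 3))
print(round(p_1645, 3))
```

`std_normal.inv_cdf(0.975)` goes the other way, from a probability to the corresponding cutoff.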
3. (40 POINTS) Medicaid is a health insurance program for low-income Americans.
To be eligible for Medicaid your income must be below a certain threshold. Many
states have expanded access to Medicaid, increasing the income eligibility threshold,
as Nebraska did in the 2018 election. In this problem we will consider the effects of
Medicaid expansion. Specifically, how much additional health care do individuals use
when they have access to health insurance? Consider the following regression:
Y = β0 + β1D + U (1)
Let $D_i = 1$ if the person is covered by Medicaid, and 0 otherwise. (Those not covered
by Medicaid may have another type of insurance or they may be uninsured.) Let Y = 1
if the individual has seen a primary care physician in the past year and 0 if not.
(a) (5 points) How should we interpret β1?
(b) (5 points) Do you think $\mathrm{cov}(U, D) = 0$ here? Why or why not?
(c) (5 points) Do you think an OLS estimate $\hat{\beta}_1$ will overstate or understate the true
causal effect of having Medicaid insurance? Explain.
To estimate a causal effect, researchers Amy Finkelstein and Kate Baicker and
their team found a unique natural experiment in Oregon. Oregon aimed to expand Medicaid but did not have the budget to cover all newly eligible adults.
Hence Oregon implemented a lottery among newly eligible individuals, which was
randomized. Let $Z_i = 1$ if individual i won the lottery for Medicaid and $Z_i = 0$ if
the individual lost the lottery. And let’s reconsider the regression of Equation 1,
for the sample of lottery entrants. Let Y be an indicator for seeing a primary care
physician in the year following the lottery, and let D be an indicator for being
enrolled in Medicaid after the lottery.
(d) (5 points) State the two conditions necessary for Z to be a valid instrument.
(Please state both in math and in words what the conditions mean in this context.)
(e) (5 points) Do you think the exogeneity condition is likely to hold in this context?
Why or why not?
(f) (5 points) Suppose the two conditions for a valid instrument are satisfied. How
would you implement the regression? (Hint: here you can discuss either the
two stage least squares approach, or you can demonstrate how to use the two
conditions: E[U] = 0 and E[ZU] = 0 and estimate the parameters by the analogy
principle.)
(g) (5 points) The researchers carry out their instrumental variables regression and
obtain an estimate: $\hat{\beta}_1^{IV} = 0.12$. How would you interpret this estimate and the
magnitude of the estimate? (For context, the fraction of individuals who see a
primary care physician among the non-winners of the lottery is .24.)
(h) (5 points) Suppose a skeptical reader of the study argues that even though the
instrument is valid and satisfies the necessary assumptions, the regression is still
omitting important determinants of health, like age and other risk factors. Does
omitting these affect the causal interpretation of $\hat{\beta}_1^{IV}$?