Regression quiz

May 7, 2020 | admin | Uncategorized

1. (35 POINTS) Answer each of the following questions with True, False, or
Uncertain and justify your answers with a concise argument or a proof if necessary. You
may cite theorems proved in class but you must show precisely how you are applying
the theorem and check that the assumptions required for the theorem hold. Correct
answers without justification will receive minimal credit.
(a) (5 points) The sample mean of an i.i.d. sample {x_i}_{i=1}^n is an unbiased estimator
for E[X].
(b) (5 points) The residuals from an OLS regression, {û_i}_{i=1}^n, must contain some
negative values.
(c) (5 points) Suppose, given an i.i.d. sample {x_i, y_i}_{i=1}^n and a univariate regression
of Y on X, you are able to reject a two-sided hypothesis test of H0 : β1 = 0 with
significance level α. Then your data will also reject a one-sided hypothesis test of
H0 : β1 = 0 with the same significance level.
(d) (5 points) Suppose you are interested in the relationship between age and health,
and you divide the data into three mutually exclusive age categories, child =
I{age < 18}, adult = I{18 ≤ age < 65}, and senior = I{age ≥ 65}. Then we must
omit one of the age categories in the following regression:
Y = β0 + β1 child + β2 adult + β3 senior + U
(e) (5 points) Suppose the true regression model is Y = β0 + β1X1 + β2X2 + U, with
cov(X1, U) = cov(X2, U) = 0. If we omit X2 from the regression and estimate the
naive regression Y = β0 + β1X1 + ε, we will recover a β̂1 that is consistent as long
as cov(X1, X2) = 0.
(f) (5 points) In a multivariate regression, Y = β0 + β1X1 + β2X2 + U, the test
statistic of an F-test for a hypothesis test of H0 : β1 = β2 = 0 is always nonnegative.
(g) (5 points) If an OLS estimation returns a high R², it is likely that the model shows
a causal relationship between X and Y.

2. (25 POINTS) Esther Duflo of MIT has studied one of the largest school construction
projects in Indonesia, which occurred between 1973 and 1978. She uses data from two
Indonesian censuses (1973 and 1978), where she restricts the sample to individuals
aged 12-17 in each census. She observes an individual’s region of birth and the
location of school construction sites to study whether school construction increases
educational attainment. Let Y_i be the years of education individual i completed.
Duflo defines two types of regions: regions of high exposure to school construction
projects and regions of low exposure to school construction projects. Consider the
following regression:
Y = β0 + β1 C78 + β2 High + β3 (High × C78) + U
where C78_i = 1 if individual i is in the 1978 census and 0 if individual i is in the
1973 census. Let High_i = 1 if individual i’s region of birth has high exposure to
school construction and 0 if i’s region has low exposure to construction.
(a) (5 points) What null hypothesis would you test if you wished to see if individuals
in 1978 in regions of low exposure were no different in educational attainment
compared to individuals in 1973 in regions of low exposure? Explain.
(b) (5 points) What null hypothesis would you test if you wished to see if individuals
in 1978 in regions of high exposure were no different in educational attainment
than individuals in 1973 in regions of high exposure? Explain.
(c) (5 points) What null hypothesis would you test if you wished to see if the change
in educational attainment from 1973 to 1978 in areas of low construction is no
different from the change in educational attainment from 1973 to 1978 in areas of
high construction? Explain.
(d) (5 points) Duflo finds β̂3 = .15 with SE(β̂3) = .03. How do you interpret this
estimate in words?
(e) (5 points) Given that β̂3 = .15 with SE(β̂3) = .03, perform a hypothesis test that
the school construction project had no effect on educational attainment in Indonesia.
Do you reject your null hypothesis at the 10% significance level? (Hint: if W ∼ t_∞,
then Pr{W ≤ 1.96} = .975 and Pr{W ≤ 1.645} = .95; assume that n → ∞ here.)

3. (40 POINTS) Medicaid is a health insurance program for low-income Americans. To
be eligible for Medicaid your income must be below a certain threshold. Many states
have expanded access to Medicaid by raising the income eligibility threshold, as
Nebraska did in the 2018 election. In this problem we will consider the effects of
Medicaid expansion. Specifically, how much additional health care do individuals use
when they have access to health insurance? Consider the following regression:
Y = β0 + β1D + U    (1)
Let D_i = 1 if person i is covered by Medicaid, and 0 otherwise. (Those not covered
by Medicaid may have another type of insurance or they may be uninsured.) Let
Y_i = 1 if individual i has seen a primary care physician in the past year and 0 if not.
(a) (5 points) How should we interpret β1?
(b) (5 points) Do you think cov(U, D) = 0 here? Why or why not?
(c) (5 points) Do you think an OLS estimate β̂1 will overstate or understate the true
causal effect of having Medicaid insurance? Explain.
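As a reference for Equation (1): when the only regressor is a 0/1 indicator, the OLS slope is numerically identical to the difference in the sample means of Y between the two groups, and the intercept is the mean of Y in the D = 0 group. The sketch below checks this on simulated data. The function name, the sample size, and the probabilities are made up for illustration and are not drawn from any Medicaid data. The identity is purely mechanical and says nothing about whether the difference is causal.

```python
import numpy as np

def ols_intercept_slope(d, y):
    """OLS of y on a constant and a single regressor d."""
    d = np.asarray(d, dtype=float)
    y = np.asarray(y, dtype=float)
    slope = np.cov(d, y)[0, 1] / np.var(d, ddof=1)
    intercept = y.mean() - slope * d.mean()
    return intercept, slope

rng = np.random.default_rng(0)
n = 10_000
d = rng.integers(0, 2, size=n)                    # hypothetical coverage indicator
y = (rng.random(n) < 0.3 + 0.2 * d).astype(float)  # hypothetical visit indicator

b0, b1 = ols_intercept_slope(d, y)
mean_uncovered = y[d == 0].mean()
diff_means = y[d == 1].mean() - mean_uncovered

print("OLS intercept:", b0, " mean of y when d = 0:", mean_uncovered)
print("OLS slope:    ", b1, " difference in means: ", diff_means)
```

Because Y is itself binary, each group mean is a sample proportion, so the slope is a difference in proportions.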
To estimate a causal effect, researchers Amy Finkelstein and Kate Baicker and their
team found a unique natural experiment in Oregon. Oregon aimed to expand Medicaid
but did not have the budget to cover all newly eligible adults. Hence Oregon
implemented a randomized lottery among newly eligible individuals. Let Z_i = 1 if
individual i won the lottery for Medicaid and Z_i = 0 if the individual lost the
lottery. Let’s reconsider the regression in Equation (1) for the sample of lottery
entrants. Let Y be an indicator for seeing a primary care physician in the year
following the lottery, and let D be an indicator for being enrolled in Medicaid after
the lottery.
(d) (5 points) State the two conditions necessary for Z to be a valid instrument.
(Please state both in math and in words what the conditions mean in this context.)
(e) (5 points) Do you think the exogeneity condition is likely to hold in this context?
Why or why not?
(f) (5 points) Suppose the two conditions for a valid instrument are satisfied. How
would you implement the regression? (Hint: here you can discuss either the two stage
least squares approach, or you can demonstrate how to use the two conditions
E[U] = 0 and E[ZU] = 0 and estimate the parameters by the analogy principle.)
(g) (5 points) The researchers carry out their instrumental variables regression and
obtain an estimate β̂1^IV = .12. How would you interpret this estimate and the
magnitude of the estimate? (For context, the fraction of individuals who see a
primary care physician among the non-winners of the lottery is .24.)
(h) (5 points) Suppose a skeptical reader of the study argues that even though the
instrument is valid and satisfies the necessary assumptions, the regression is still
omitting important determinants of health, like age and other risk factors. Does
omitting these affect the causal interpretation of β̂1^IV?
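The two routes named in the hint to part (f) can be compared numerically. The sketch below replaces E[U] = 0 and E[ZU] = 0 with their sample analogues, solves for the two coefficients, and checks that the resulting slope coincides with a hand-rolled two stage least squares slope and with the ratio of mean differences between lottery winners and losers. All data are simulated. The function names, the unobserved `need` variable, and every probability are assumptions made for illustration; they are not the Oregon data or the researchers' code.

```python
import numpy as np

def ols(x, y):
    """OLS intercept and slope of y on a constant and x."""
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    slope = np.cov(x, y)[0, 1] / np.var(x, ddof=1)
    return y.mean() - slope * x.mean(), slope

def iv_by_moments(z, d, y):
    """Solve the sample analogues of E[U] = 0 and E[ZU] = 0 for (b0, b1)."""
    z, d, y = (np.asarray(a, dtype=float) for a in (z, d, y))
    b1 = np.cov(z, y)[0, 1] / np.cov(z, d)[0, 1]
    b0 = y.mean() - b1 * d.mean()
    return b0, b1

rng = np.random.default_rng(1)
n = 20_000
z = rng.integers(0, 2, size=n)      # hypothetical randomized lottery outcome
need = rng.integers(0, 2, size=n)   # hypothetical unobserved confounder
d = (rng.random(n) < 0.1 + 0.3 * z + 0.3 * need).astype(float)  # enrollment
y = (rng.random(n) < 0.2 + 0.1 * d + 0.2 * need).astype(float)  # doctor visit

b0_iv, b1_iv = iv_by_moments(z, d, y)

# Two stage least squares by hand: fit d on z, then regress y on the fitted values.
a0, a1 = ols(z, d)
d_hat = a0 + a1 * z
_, b1_2sls = ols(d_hat, y)

# With a binary instrument the same slope is a ratio of mean differences.
wald = (y[z == 1].mean() - y[z == 0].mean()) / (d[z == 1].mean() - d[z == 0].mean())

# Residuals built from the IV coefficients satisfy both sample moment conditions.
u_hat = y - b0_iv - b1_iv * d

_, b1_ols = ols(d, y)
print("OLS slope:", b1_ols)
print("IV slope (moments):", b1_iv, " 2SLS:", b1_2sls, " ratio of differences:", wald)
print("mean(u_hat):", u_hat.mean(), " mean(z * u_hat):", (z * u_hat).mean())
```

The three IV numbers agree up to floating-point error because they are algebraically the same estimator when there is one binary instrument and one endogenous regressor. The OLS slope is printed only for comparison.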