29 Nov 2016 • on Data Analysis Capstone

Assignment 2: Methods

This is the second assignment for the Data Analysis Capstone from the Data Analysis and Interpretation course offered by Wesleyan University. You can see all the previous content here.

This assignment describes the methods used in the research.

Methods

Sample

This research uses the QoG Standard Dataset 2016 [1], which consists of approximately 2500 variables from more than 100 data sources. The variables used here were extracted from four different databases:

Environmental Performance Data (EPI) [2];
International Monetary Fund (IMF) [3];
World Bank - World Development Indicators (WDI) [4];
World Economic Forum (WEF) [5].

In the QoG Standard CS dataset, data from 2012 are prioritized. If no data are available for a country for 2012, data for 2013 are included; if no data exist for 2013, data for 2011 are included, and so on up to a maximum of +/- 3 years.
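The year-priority rule can be illustrated with a minimal sketch. The function names and the example values are hypothetical, and the alternating order beyond 2011 (2014, 2010, 2015, 2009) is an assumption about what "and so on" means; the codebook remains the authoritative description.

```python
from typing import Dict, List, Optional, Tuple


def year_search_order(target: int = 2012, max_offset: int = 3) -> List[int]:
    """Years to try, nearest to the target first.

    Assumption: at each distance the later year is tried before the
    earlier one, matching the 2012 -> 2013 -> 2011 order in the text.
    """
    order = [target]
    for offset in range(1, max_offset + 1):
        order.extend([target + offset, target - offset])
    return order


def pick_observation(
    values_by_year: Dict[int, Optional[float]],
    target: int = 2012,
    max_offset: int = 3,
) -> Optional[Tuple[int, float]]:
    """Return (year, value) for the first year in the search order with data."""
    for year in year_search_order(target, max_offset):
        value = values_by_year.get(year)
        if value is not None:
            return year, value
    return None  # nothing available within +/- max_offset years


# Hypothetical country with no data for 2012 or 2013.
example = {2011: 87.0, 2014: 90.0}
chosen = pick_observation(example)
```

Here `chosen` holds the 2011 observation, because 2011 comes before 2014 in the search order.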

In the codebook you can find a detailed description of all data sources and variables sorted by original data sources.

The number of observations differs by variable. The variables with the most observations are Incidence of tuberculosis (per 100,000 people), Air Quality, and Water and Sanitation, each with N = 191. The variables with the fewest observations are Smoking prevalence, females and Smoking prevalence, males, each with N = 127.

After dropping the countries with missing information, a total of N = 109 countries remained for the analysis.
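Dropping countries with missing information amounts to listwise deletion: a country is kept only if every variable of interest has a value. A minimal sketch follows, assuming each country is a dictionary and missing values are stored as None or NaN. The column names and numbers are made up for illustration and are not taken from the dataset.

```python
import math
from typing import Dict, List, Optional


def is_missing(value: Optional[float]) -> bool:
    """Treat None and NaN as missing (an assumption about the encoding)."""
    return value is None or (isinstance(value, float) and math.isnan(value))


def drop_incomplete(rows: List[Dict], required: List[str]) -> List[Dict]:
    """Keep only rows that have a value for every required variable."""
    return [
        row for row in rows
        if not any(is_missing(row.get(name)) for name in required)
    ]


# Hypothetical rows; the keys are illustrative, not QoG variable codes.
countries = [
    {"country": "A", "tb_success": 85.0, "smoking_f": 12.0},
    {"country": "B", "tb_success": None, "smoking_f": 20.0},
    {"country": "C", "tb_success": 78.0, "smoking_f": float("nan")},
]
complete = drop_incomplete(countries, ["tb_success", "smoking_f"])
```

Only country A survives in this toy example, which is the same mechanism that reduces the sample to the countries with complete data.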

Measures

For this work, the variables that will be used are:

Tuberculosis treatment success rate (% of new cases)
Health expenditure per capita, PPP (constant 2011 international dollar)
Water and Sanitation: Access to Drinking Water and Access to Sanitation
Air Quality: Household Air Quality, Air Pollution - Average Exposure to PM2.5 and Air Pollution
Smoking prevalence, females (% of adults)
Smoking prevalence, males (% of adults)
Business impact of tuberculosis
Tuberculosis case detection rate (%, all forms)
Incidence of tuberculosis (per 100,000 people)
GDP (PPP) (share of world total) (%)

All variables are quantitative and will be used without any further data management.

Analysis

The distributions for the predictors and the tuberculosis treatment success rate response variable were evaluated by examining the mean, standard deviation and minimum and maximum values.
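The summary described above can be sketched with Python's standard library. The `describe` helper and the sample values are hypothetical, and the sample (n - 1) standard deviation is an assumption, since the text does not say which form was used.

```python
from statistics import mean, stdev
from typing import Dict, List


def describe(values: List[float]) -> Dict[str, float]:
    """Mean, sample standard deviation, minimum and maximum of one variable."""
    return {
        "mean": mean(values),
        "std": stdev(values),  # sample standard deviation (n - 1 denominator)
        "min": min(values),
        "max": max(values),
    }


# Made-up treatment success rates, for illustration only.
summary = describe([70.0, 80.0, 85.0, 90.0])
```

Running `describe` once per predictor and once for the response variable gives the table of descriptive statistics.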

Scatter plots were also examined. To test bivariate associations between individual predictors and the tuberculosis treatment success rate response variable, Pearson correlations were used.
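For reference, the Pearson correlation coefficient between a predictor and the response can be computed as in the sketch below. The `pearson_r` function is a hypothetical helper written from the textbook formula. It returns only the coefficient; the p-value used for significance testing is left out.

```python
import math
from typing import Sequence


def pearson_r(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation: covariance divided by the product of the spreads."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("x and y must have the same length (at least 2)")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    cross = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = sum((a - mean_x) ** 2 for a in x)
    spread_y = sum((b - mean_y) ** 2 for b in y)
    # A constant variable has zero spread and makes the ratio undefined.
    return cross / math.sqrt(spread_x * spread_y)
```

Values close to +1 or -1 indicate a strong linear association, and values close to 0 indicate a weak one.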

Lasso regression with the least angle regression selection algorithm was used to identify the subset of variables that best predicted the tuberculosis treatment success rate.
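To show how lasso selects variables, here is a minimal sketch that deliberately swaps in coordinate descent with soft-thresholding in place of the least angle regression algorithm used in this work. Both target the same L1-penalized least squares objective, but the code is not the procedure that produced the results. It assumes the predictor columns are already centered and have nonzero variance; all names are hypothetical.

```python
from typing import List, Tuple


def soft_threshold(value: float, threshold: float) -> float:
    """Shrink value toward zero by threshold; small values become exactly 0."""
    if value > threshold:
        return value - threshold
    if value < -threshold:
        return value + threshold
    return 0.0


def lasso_coordinate_descent(
    X: List[List[float]], y: List[float], alpha: float, n_iter: int = 100
) -> Tuple[float, List[float]]:
    """Minimize (1/(2n)) * sum(residual^2) + alpha * sum(|coef|).

    Assumes centered columns in X, so the intercept is the mean of y.
    """
    n, p = len(X), len(X[0])
    intercept = sum(y) / n
    coefs = [0.0] * p
    residual = [value - intercept for value in y]
    for _ in range(n_iter):
        for j in range(p):
            col = [row[j] for row in X]
            # Correlation of column j with the residual that ignores coef j.
            rho = sum(c * (r + c * coefs[j]) for c, r in zip(col, residual)) / n
            scale = sum(c * c for c in col) / n
            updated = soft_threshold(rho, alpha) / scale
            # Keep the residual in sync with the changed coefficient.
            residual = [r - c * (updated - coefs[j]) for c, r in zip(col, residual)]
            coefs[j] = updated
    return intercept, coefs


# Toy data: y depends only on the first column (y = 5 + 3 * x1).
X_toy = [[-1.0, -1.0], [1.0, -1.0], [-1.0, 1.0], [1.0, 1.0]]
y_toy = [2.0, 8.0, 2.0, 8.0]
intercept_toy, coefs_toy = lasso_coordinate_descent(X_toy, y_toy, alpha=1.0)
```

With the penalty set to 1.0, the first coefficient is shrunk from 3 to 2 and the irrelevant second coefficient stays at exactly zero, which is how the method drops predictors from the model.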

Because the data set has few observations, the lasso regression model was estimated on the entire data set (N = 109). All predictor variables were standardized to have mean = 0 and standard deviation = 1 prior to conducting the lasso regression analysis. Cross validation was performed using k-fold cross validation with 10 folds. The change in the cross validation mean squared error at each step was used to identify the best subset of predictor variables. Predictive accuracy was assessed by determining the mean squared error rate of the training data prediction algorithm when applied to observations in the test data set.
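The preprocessing and validation steps above can be sketched in plain Python. The helper names are hypothetical. Using the population standard deviation for standardization and a fixed shuffle seed for the folds are both assumptions, because neither detail is stated here.

```python
import random
from statistics import mean, pstdev
from typing import List, Tuple


def standardize(column: List[float]) -> List[float]:
    """Rescale one predictor to mean 0 and standard deviation 1."""
    center = mean(column)
    spread = pstdev(column)  # population form; fails on a constant column
    return [(value - center) / spread for value in column]


def k_fold_indices(n: int, k: int = 10, seed: int = 0) -> List[Tuple[List[int], List[int]]]:
    """Split row indices 0..n-1 into k (train, validation) pairs."""
    indices = list(range(n))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]
    splits = []
    for i, held_out in enumerate(folds):
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        splits.append((train, held_out))
    return splits


def mean_squared_error(actual: List[float], predicted: List[float]) -> float:
    """Average squared difference between observed and predicted values."""
    return mean([(a - p) ** 2 for a, p in zip(actual, predicted)])


# With N = 109 and 10 folds, each held-out fold has 10 or 11 countries.
splits = k_fold_indices(109, k=10)
fold_sizes = sorted(len(held_out) for _, held_out in splits)
```

In each of the 10 rounds the model is fitted on the training indices and `mean_squared_error` is computed on the held-out fold; averaging those 10 values gives the cross validation error for one candidate model.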

References

[1] QoG Standard Dataset 2016

[2] Environmental Performance Data

[3] International Monetary Fund

[4] World Bank - World Development Indicators

[5] World Economic Forum