Assignment 3: Preliminary Results

This is the third assignment for the Data Analysis Capstone from the Data Analysis and Interpretation course offered by Wesleyan University. You can see all the previous content here.

In this assignment, we present the preliminary results obtained in the research.

Results

Descriptive Statistics

Table 1 shows descriptive statistics for the quantitative data analytic variables. The average of the response variable, tuberculosis treatment success rate, was 78.29%, with a minimum success rate of 0% and a maximum of 100%.

Table 1. Descriptive Statistics for Data Analytic Variables.
Analysis Variable                 N      Mean   Std Dev   Minimum   Maximum
Air Quality                     109     78.97     18.73     14.30    100.00
Water and Sanitation access     109     55.56     33.05      2.88    100.00
GDP PPP share of world total    109      0.81      2.44      0.00     19.57
Health expenditure per capita   109   1424.78   1626.97     34.81   8845.18
Smoking prevalence females      109     11.57     10.22      0.40     39.80
Smoking prevalence males        109     34.44     12.83      8.90     71.80
TB case detection rate          109     75.28     17.89     16.00    120.00
Incidence of TB                 109    128.83    195.29      1.60   1042.00
TB treatment success rate       109     78.29     15.64      0.00    100.00
Business impact of TB           109      5.24      1.05      2.27      6.84
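
The columns of Table 1 can be computed for any single variable with Python's standard library. The sketch below is only an illustration: the function name `describe` and the five sample values are hypothetical, and it assumes the Std Dev column is the sample standard deviation (n - 1 denominator).

```python
import statistics

def describe(values):
    """Return the summary columns used in Table 1 for one variable."""
    return {
        "N": len(values),
        "Mean": statistics.mean(values),
        "Std Dev": statistics.stdev(values),  # sample standard deviation (assumed)
        "Minimum": min(values),
        "Maximum": max(values),
    }

# Hypothetical treatment success rates (%) for five countries, not the real data.
example_rates = [62.0, 85.0, 78.0, 91.0, 74.0]
summary = describe(example_rates)
print({name: round(value, 2) for name, value in summary.items()})
```

Applying the same function to each of the ten variables would give one row of the table per variable.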


Bivariate Analysis

Scatter plots for the association between the tuberculosis treatment success rate response variable and the quantitative predictors (Figure 1) revealed that only GDP PPP share of world total, Smoking prevalence males, and Incidence of TB tended to increase as the treatment success rate increased. All the other predictors tended to decrease as the treatment success rate increased.

Figure 1. Association between predictors and tuberculosis treatment success rate.

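Table 2 shows the Pearson correlation coefficients and p-values for the association between each predictor and the response variable. GDP PPP share of world total, Smoking prevalence males, and Incidence of TB were not significantly associated with the tuberculosis treatment success rate (p > 0.05). The remaining six predictors were all significantly and negatively associated with it (p < 0.01).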

Table 2. Pearson correlation coefficients for the association between predictors and tuberculosis treatment success rate.
Analysis Variable               Pearson r       p-value
Air Quality                      -0.26776        0.0049
Water and Sanitation access      -0.38838    3.0049e-05
GDP PPP share of world total      0.04870        0.6150
Health expenditure per capita    -0.37709    5.3036e-05
Smoking prevalence females       -0.41092    9.0657e-06
Smoking prevalence males          0.07624       0.43071
TB case detection rate           -0.30539       0.00124
Incidence of TB                   0.16489       0.08664
Business impact of TB            -0.33497       0.00037
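
The p-values in Table 2 can be reproduced from the correlation coefficient and the sample size alone, using the standard t-test for a Pearson correlation with n - 2 degrees of freedom. The following is a minimal sketch that assumes SciPy is available; the helper name `pearson_p_value` is illustrative and not part of the original analysis.

```python
import math
from scipy import stats

def pearson_p_value(r, n):
    """Two-sided p-value for a Pearson correlation r from n paired observations."""
    df = n - 2
    # Under the null hypothesis of no correlation, this statistic follows
    # a t distribution with n - 2 degrees of freedom.
    t_stat = r * math.sqrt(df / (1.0 - r ** 2))
    return 2.0 * stats.t.sf(abs(t_stat), df)

# Coefficients taken from Table 2, with N = 109 from Table 1.
p_air_quality = pearson_p_value(-0.26776, 109)
p_gdp_share = pearson_p_value(0.04870, 109)
print(f"{p_air_quality:.4f} {p_gdp_share:.4f}")
```

With the raw data at hand, `scipy.stats.pearsonr(x, y)` returns the coefficient and the p-value in a single call.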


Multivariable Analysis

Figure 2 shows that only four of the nine predictors were retained in the model selected by the lasso regression analysis; the other five were excluded. Smoking prevalence females (%) and Health expenditure per capita were most strongly associated with the tuberculosis treatment success rate, followed by Air quality and, lastly, Water and sanitation access (Table 3).

Figure 2. Regression Coefficients Progression for Lasso Paths.

Table 3. Lasso Regression Coefficients.
Analysis Variable               Coefficient
Air Quality                        -0.95440
Water and Sanitation access        -0.94431
Health expenditure per capita      -1.41151
Smoking prevalence females         -3.17685


Because the data set has few observations (N = 109), the data were not split into training and test sets. The least angle regression algorithm with k = 10-fold cross-validation was used to estimate the lasso regression model on the full data set. The change in the cross-validation mean squared error at each step was used to identify the best subset of predictor variables.
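
A procedure of this kind can be sketched with scikit-learn's `LassoLarsCV`, which fits the lasso path by least angle regression and picks the penalty by cross-validation. This is only an illustration under stated assumptions: the data below are synthetic stand-ins (109 rows, 9 predictors), the standardization step is an assumption, and none of the variable names come from the original analysis.

```python
import numpy as np
from sklearn.linear_model import LassoLarsCV
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the real data: 109 rows, 9 predictors (hypothetical values).
rng = np.random.default_rng(0)
X = rng.normal(size=(109, 9))
# In this toy data only the first two predictors truly influence the response.
y = 78.0 - 4.0 * X[:, 0] - 3.0 * X[:, 1] + rng.normal(scale=3.0, size=109)

# Standardize predictors so that coefficient magnitudes are comparable (assumed step).
X_std = StandardScaler().fit_transform(X)

# Lasso estimated with least angle regression; penalty chosen by 10-fold cross-validation.
model = LassoLarsCV(cv=10, precompute=False).fit(X_std, y)

retained = np.flatnonzero(model.coef_)        # indices of predictors kept in the model
mean_cv_mse = model.mse_path_.mean(axis=1)    # cross-validation MSE averaged over folds
train_mse = float(np.mean((y - model.predict(X_std)) ** 2))
r_squared = model.score(X_std, y)             # R-square on the full data set

print(retained, round(train_mse, 2), round(r_squared, 4))
```

Each column of `model.mse_path_` holds the error curve of one fold, which is the kind of information a per-fold error plot displays.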

Figure 3 shows that there is variability across the individual cross-validation folds, but the change in the mean squared error as variables are added to the model follows the same pattern for each fold.

Figure 3. Mean squared error on each fold.

The mean squared error for the data was 193.10 and the R-square value was 0.2034, indicating that the selected model explained 20.34% of the variance in the tuberculosis treatment success rate for the data set.
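
The reported R-square can be cross-checked against the MSE and the standard deviation of the response variable in Table 1, since R-square = 1 - MSE / Var(y). The sketch below assumes that the Std Dev in Table 1 is the sample standard deviation (n - 1 denominator) and that the MSE was computed on the same 109 observations; the variable names are illustrative.

```python
n = 109            # number of observations (Table 1)
std_dev = 15.64    # standard deviation of TB treatment success rate (Table 1)
mse = 193.10       # mean squared error reported for the lasso model

# Convert the sample variance (n - 1 denominator) to the population variance
# (n denominator), which matches how the MSE is averaged. This is an assumption.
variance = std_dev ** 2 * (n - 1) / n

r_squared = 1 - mse / variance
print(round(r_squared, 4))
```

The result is approximately 0.203, in line with the reported 0.2034 once rounding of the inputs is taken into account.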