This is the third assignment for the Data Analysis Capstone from the Data Analysis and Interpretation course offered by Wesleyan University. You can see all the previous content here.
In this assignment, we present the results obtained in the research.
Results
Descriptive Statistics
Table 1 shows descriptive statistics for the quantitative data analytic variables. The average of the response variable, tuberculosis treatment success rate, was 78.29%, with a minimum success rate of 0% and a maximum of 100%.
Analysis Variable | N | Mean | Std Dev | Minimum | Maximum |
---|---|---|---|---|---|
Air Quality | 109 | 78.97 | 18.73 | 14.30 | 100.00 |
Water and Sanitation access | 109 | 55.56 | 33.05 | 2.88 | 100.00 |
GDP PPP share of world total | 109 | 0.81 | 2.44 | 0.00 | 19.57 |
Health expenditure per capita | 109 | 1424.78 | 1626.97 | 34.81 | 8845.18 |
Smoking prevalence females | 109 | 11.57 | 10.22 | 0.40 | 39.80 |
Smoking prevalence males | 109 | 34.44 | 12.83 | 8.90 | 71.80 |
TB case detection rate | 109 | 75.28 | 17.89 | 16.00 | 120.00 |
Incidence of TB | 109 | 128.83 | 195.29 | 1.60 | 1042.00 |
TB treatment success rate | 109 | 78.29 | 15.64 | 0.00 | 100.00 |
Business impact of TB | 109 | 5.24 | 1.05 | 2.27 | 6.84 |
Bivariate Analysis
Scatter plots of the association between the response variable, tuberculosis treatment success rate, and the quantitative predictors (Figure 1) revealed that only GDP PPP share of world total, Smoking prevalence males and Incidence of TB were positively associated with the treatment success rate. All other predictors tended to decrease as the treatment success rate increased.
Table 2 shows the Pearson correlation coefficient and p-value for each predictor. The three positively correlated variables (GDP PPP share of world total, Smoking prevalence males and Incidence of TB) were not significantly associated with the response variable, tuberculosis treatment success rate. All other predictors showed significant negative correlations (p < 0.01).
Analysis Variable | Pearson | p-value |
---|---|---|
Air Quality | -0.26776 | 0.0049 |
Water and Sanitation access | -0.38838 | 3.0049e-05 |
GDP PPP share of world total | 0.04870 | 0.6150 |
Health expenditure per capita | -0.37709 | 5.3036e-05 |
Smoking prevalence females | -0.41092 | 9.0657e-06 |
Smoking prevalence males | 0.07624 | 0.43071 |
TB case detection rate | -0.30539 | 0.00124 |
Incidence of TB | 0.16489 | 0.08664 |
Business impact of TB | -0.33497 | 0.00037 |
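The p-values in Table 2 can be related to the correlation coefficients through the usual t test for a Pearson correlation. The sketch below is a rough check, not the code used for the analysis: it assumes all 109 observations entered each correlation (so 107 degrees of freedom), and the helper names `t_statistic`, `t_pdf` and `two_sided_p` are hypothetical. It uses only the standard library and integrates the t density numerically; in practice a routine such as `scipy.stats.pearsonr` returns both the coefficient and the p-value directly.

```python
import math


def t_statistic(r: float, n: int) -> float:
    """t statistic for testing H0: rho = 0, given Pearson r and sample size n."""
    return r * math.sqrt((n - 2) / (1.0 - r * r))


def t_pdf(x: float, df: int) -> float:
    """Density of Student's t distribution with df degrees of freedom."""
    log_norm = (math.lgamma((df + 1) / 2) - math.lgamma(df / 2)
                - 0.5 * math.log(df * math.pi))
    return math.exp(log_norm - (df + 1) / 2 * math.log1p(x * x / df))


def two_sided_p(t: float, df: int, upper: float = 60.0, steps: int = 20000) -> float:
    """Two-sided p-value: twice the upper-tail area, via Simpson's rule.

    The tail beyond `upper` is ignored; `steps` must be even.
    """
    a = abs(t)
    if a >= upper:
        return 0.0
    h = (upper - a) / steps
    total = t_pdf(a, df) + t_pdf(upper, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(a + i * h, df)
    return 2.0 * total * h / 3.0


n = 109  # assumption: every correlation in Table 2 used all 109 observations
for name, r in [("Air Quality", -0.26776), ("GDP PPP share of world total", 0.04870)]:
    t = t_statistic(r, n)
    p = two_sided_p(t, n - 2)
    print(f"{name}: r = {r:+.5f}, t = {t:+.3f}, p = {p:.4f}")
```

With n fixed, the p-value depends only on the magnitude of r, which is why the three weakest correlations in Table 2 are also the three non-significant ones.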
Multivariable Analysis
Figure 2 shows that only four of the nine predictors were retained in the model selected by the lasso regression analysis; the other five were excluded. Smoking prevalence females and Health expenditure per capita were most strongly associated with tuberculosis treatment success rate, followed by Air Quality and, lastly, Water and Sanitation access (Table 3).
Analysis Variable | Coef |
---|---|
Air Quality | -0.95440 |
Water and Sanitation access | -0.94431 |
Health expenditure per capita | -1.41151 |
Smoking prevalence females | -3.17685 |
Because the data set has few observations, the data were not split into training and test sets. The least angle regression algorithm with k=10 fold cross-validation was used to estimate the lasso regression model on the full data set. The change in the cross-validation mean squared error at each step was used to identify the best subset of predictor variables.
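The procedure described above can be sketched with scikit-learn's `LassoLarsCV`. This is a minimal illustration under stated assumptions, not the analysis code: the data are synthetic random numbers (109 rows and 9 predictors, matching only the dimensions of the study), and standardizing the predictors before fitting is an assumption, since the text does not say how they were scaled.

```python
import numpy as np
from sklearn.linear_model import LassoLarsCV
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (NOT the study data): 109 rows, 9 predictors.
rng = np.random.default_rng(0)
n_samples, n_predictors = 109, 9
X = rng.normal(size=(n_samples, n_predictors))
# The synthetic response depends on the first two predictors only.
y = 78.0 - 3.0 * X[:, 0] - 1.5 * X[:, 1] + rng.normal(scale=10.0, size=n_samples)

# Assumption: predictors are standardized so coefficients are comparable.
X_std = StandardScaler().fit_transform(X)

# Lasso fitted with least angle regression; the penalty is chosen by
# 10-fold cross-validation on the full data set (no train/test split).
model = LassoLarsCV(cv=10, precompute=False).fit(X_std, y)

retained = np.flatnonzero(model.coef_)   # indices of predictors kept
fold_mse = model.mse_path_               # shape: (n_cv_alphas, n_folds)
mean_mse = fold_mse.mean(axis=1)         # cross-validation average per alpha

# In-sample fit statistics, computed on the same data used for fitting.
y_pred = model.predict(X_std)
mse = float(np.mean((y - y_pred) ** 2))
r_squared = model.score(X_std, y)

print("retained predictor indices:", retained)
print("selected alpha:", model.alpha_)
print(f"MSE = {mse:.2f}, R-squared = {r_squared:.4f}")
```

Plotting each column of `fold_mse` against `model.cv_alphas_` gives one curve per fold, and `mean_mse` gives their average.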
Figure 3 shows that there is variability across the individual cross-validation folds, but the change in the mean squared error as variables are added to the model follows the same pattern for each fold.
The mean squared error for the data was MSE = 193.10 and the R-squared value was 0.2034, indicating that the selected model explained 20.34% of the variance in tuberculosis treatment success rate for the dataset.
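The reported MSE and R-squared can be cross-checked against the standard deviation of the response in Table 1, since R-squared equals one minus the ratio of the MSE to the variance of the response. The sketch below is only a consistency check under assumptions: it treats the Table 1 standard deviation (15.64) as a sample standard deviation with an n - 1 denominator, and it works from rounded table values, so small discrepancies are expected.

```python
n = 109                # observations (Table 1)
sd_response = 15.64    # Std Dev of TB treatment success rate (Table 1)
mse = 193.10           # reported mean squared error

# Assumption: Table 1 reports the sample standard deviation (n - 1 denominator).
var_sample = sd_response ** 2
# Convert to the population variance (n denominator), which is what
# R-squared = 1 - SS_res / SS_tot uses once both sums are divided by n.
var_population = var_sample * (n - 1) / n

r2_population = 1.0 - mse / var_population
r2_sample = 1.0 - mse / var_sample

print(f"R-squared using population variance: {r2_population:.4f}")
print(f"R-squared using sample variance:     {r2_sample:.4f}")
```

Under these assumptions the population-variance version lands within rounding distance of the reported 0.2034, which suggests the MSE and R-squared were computed on the same 109 observations.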