This is the first assignment for the data analysis tool course, second of a series of five courses from Data Analysis and Interpretation ministered from Wesleyan University. The content of the first course you can see here.
For this assignment, we have to make the Analysis of Variance (ANOVA) in our data. As I am already working with a quantitative response variable (the incidence of new breast cancer in 100,000 female residents during the 2002 year) and a categorical explanatory variable (mean of the sugar consumption (grams per person and day) between 1961 and 2002), it is possible to use the ANOVA technique and, as the explanatory variable have five categories, it will be necessary to use the post hoc test for ANOVA.
The hypothesis that is being approached is that the quantity of sugar consumption in countries increases the incidence of new breast cancer cases.
The complete program for this assignment can be download here and the dataset here.
Analysis of Variance
Below is the code to make the ANOVA technique:
First I have imported the libraries and management the data like in the other assignments presented until here.
import pandas
import numpy
import statistics
import statsmodels.formula.api as smf
import statsmodels.stats.multicomp as multi
# Import the data set to memory
data = pandas.read_csv("separatedData.csv", low_memory = False)
# Change data type among variables to numeric
data["breastCancer100th"] = data["breastCancer100th"].convert_objects(convert_numeric=True)
data["meanSugarPerson"] = data["meanSugarPerson"].convert_objects(convert_numeric=True)
# Create a subData with only the variables breastCancer100th, meanSugarPerson, meanFoodPerson, meanCholesterol
sub1=data[['breastCancer100th', 'meanSugarPerson']]
# Create the conditions to a new variable named sugar_consumption that will categorize the meanSugarPerson answers
def sugar_consumption (row):
if 0 < row['meanSugarPerson'] <= 30 : return 0 # Desirable between 0 and 30 g.
if 30 < row['meanSugarPerson'] <= 60 : return 1 # Raised between 30 and 60 g.
if 60 < row['meanSugarPerson'] <= 90 : return 2 # Borderline high between 60 and 90 g.
if 90 < row['meanSugarPerson'] <= 120 : return 3 # High between 90 and 120 g.
if row['meanSugarPerson'] > 120 : return 4 # Very high under 120g.
# Add the new variable sugar_consumption to subData
sub1['sugar_consumption'] = sub1.apply (lambda row: sugar_consumption (row),axis=1)
Then I have created a new sub data and make the ANOVA method.
# creating a sub data with only the breast cancer cases and the sugar consumption mean
sub2 = sub1[['breastCancer100th', 'sugar_consumption']].dropna()
# using ols function for calculating the F-statistic and associated p value
model1 = smf.ols(formula='breastCancer100th ~ C(sugar_consumption)', data=sub2).fit()
print (model1.summary())
OLS Regression Results
==============================================================================
Dep. Variable: breastCancer100th R-squared: 0.411
Model: OLS Adj. R-squared: 0.392
Method: Least Squares F-statistic: 21.59
Date: Fri, 29 Jul 2016 Prob (F-statistic): 1.55e-13
Time: 19:03:11 Log-Likelihood: -560.14
No. Observations: 129 AIC: 1130.
Df Residuals: 124 BIC: 1145.
Df Model: 4
Covariance Type: nonrobust
=============================================================================================
coef std err t P>|t| [95.0% Conf. Int.]
---------------------------------------------------------------------------------------------
Intercept 20.6481 3.651 5.655 0.000 13.421 27.875
C(sugar_consumption)[T.1] 2.9940 5.682 0.527 0.599 -8.251 14.239
C(sugar_consumption)[T.2] 14.0744 4.995 2.818 0.006 4.189 23.960
C(sugar_consumption)[T.3] 25.3777 4.995 5.081 0.000 15.492 35.263
C(sugar_consumption)[T.4] 45.5661 5.520 8.254 0.000 34.640 56.493
==============================================================================
Omnibus: 2.575 Durbin-Watson: 1.688
Prob(Omnibus): 0.276 Jarque-Bera (JB): 2.533
Skew: 0.336 Prob(JB): 0.282
Kurtosis: 2.864 Cond. No. 5.76
==============================================================================
Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
We can see that the F-test is 21.59 and the p-value is lower than 0.05 indicating that the null hypothesis is false.
For the mean and the standard deviation we have:
# means for breast cancer by sugar consumption
print ('means for breast cancer by sugar consumption')
m1= sub2.groupby('sugar_consumption').mean()
print (m1)
# standard deviations for breast cancer by sugar consumption
print ('standard deviations for breast cancer by sugar consumption')
sd1 = sub2.groupby('sugar_consumption').std()
print (sd1)
means for breast cancer by sugar consumption
sugar_consumption breastCancer100th
0 20.648148
1 23.642105
2 34.722581
3 46.025806
4 66.214286
standard deviations for breast cancer by sugar consumption
sugar_consumption breastCancer100th
0 6.607535
1 10.970228
2 16.280432
3 26.222649
4 25.255302
We can see that the mean for each category is different but there are some categories that have a close value. However, as we have five categories in the explanatory variable, we need to make a Post hoc test in order to avoid the type 1 error.
Post hoc test
# Post hoc test for ANOVA (as the categorical variable have five categories)
mc1 = multi.MultiComparison(sub2['breastCancer100th'], sub2['sugar_consumption'])
res1 = mc1.tukeyhsd()
print(res1.summary())
Multiple Comparison of Means - Tukey HSD,FWER=0.05
==============================================
group1 group2 meandiff lower upper reject
----------------------------------------------
0 1 2.994 -12.7348 18.7227 False
0 2 14.0744 0.2475 27.9013 True
0 3 25.3777 11.5508 39.2045 True
0 4 45.5661 30.2834 60.8489 True
1 2 11.0805 -4.2234 26.3843 False
1 3 22.3837 7.0799 37.6875 True
1 4 42.5722 25.9412 59.2031 True
2 3 11.3032 -2.0384 24.6448 False
2 4 31.4917 16.6466 46.3368 True
3 4 20.1885 5.3433 35.0336 True
----------------------------------------------
The Post hoc test presented us that only three comparison groups have the null hypothesis confirmed. The groups are:
- (0) Desirable with (1) Raised
- (1) Raised with (2) Borderline high
- (2) Borderline high with (3) High
This means that the amount of new breast cancer cases among the population that intakes sugar in the proportions of this groups doesn’t have a significant difference. However, to the other seven comparisons, the post hoc test presents that the null hypothesis is false, demonstrating that the high consumption of sugar has a relation with a greater incidence of breast cancer case.