Hypothesis Testing with Python
The code below shows a general method for running a t-test, ANOVA, and Chi-Square test in Python. Feel free to ask questions. Note that impute is the input data frame.
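The post does not show what impute looks like, so here is a minimal sketch of a stand-in frame for anyone who wants to run the code. The column names (acctno, credit, active, tenure_bin, sales, reason) and the reason codes are taken from the code in this thread; everything else (the row count, the tenure labels, the 0/1 coding of active, and all values) is an assumption made purely for illustration.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for `impute`; every value below is made up for illustration.
rng = np.random.default_rng(0)
n = 500

active = rng.integers(0, 2, size=n)  # assumed coding: 1 = active, 0 = deactivated
reasons = np.array(["COMP", "DEBT", "MOVE", "NEED", "TECH"])

impute = pd.DataFrame({
    "acctno": np.arange(1, n + 1),                                # unique account number
    "credit": rng.integers(0, 2, size=n),                         # credit flag, compared as 0 vs. 1
    "active": active,
    "tenure_bin": rng.choice(["<1yr", "1-3yr", ">3yr"], size=n),  # hypothetical tenure labels
    "sales": rng.gamma(shape=2.0, scale=50.0, size=n),
    # Active accounts get a blank reason (' '), matching the filter used in the ANOVA step.
    "reason": np.where(active == 1, " ", rng.choice(reasons, size=n)),
})

print(impute.head())
```

With a frame shaped like this, the pivot tables and tests in the thread should run end to end, although the p-values will of course be meaningless.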
################### Part IV: Statistical Analyses. ##########################
# 1) Test the association between account status and different factors by Chi-Square Test.
import scipy
import pandas as pd  # needed for pd.DataFrame below
from scipy.stats import chi2
from scipy.stats import chi2_contingency  # import SciPy's built-in function

impute.head()
impute.info()

############## For reporting, the summary includes grand totals. ##############
############## However, for the Chi-Square test, the contingency table excludes grand totals. ##############
summary = impute.pivot_table('acctno',            ### analysis variable
                             index=['credit'],    ### rows
                             columns=['active'],  ### columns
                             aggfunc='nunique',
                             margins=True)
summary

df = impute.pivot_table('acctno',            ### analysis variable
                        index=['credit'],    ### rows
                        columns=['active'],  ### columns
                        aggfunc='nunique',
                        margins=False)

######### Use the chi2_contingency() function to do the Chi-Square test. ##########
### Degrees of freedom are calculated with the following formula:
### DOF = (r-1)(c-1), where r and c are the numbers of levels of the treatment and outcome variables.
# "correction=False" means no Yates' correction is used!
# The statistic is stored as chi2_stat so it does not shadow the imported chi2 distribution.
chi2_stat, p_value, DOF, expected = chi2_contingency(df, correction=False)
results = pd.DataFrame({'Chi-Square': round(chi2_stat, 2),
                        'Degrees of Freedom': DOF,
                        'p_value': p_value}, index=['Chi-Square Test Output'])
### p-value < 0.05: we reject H0 and conclude that there is an association between credit and account status.

######## Do the same thing on DealerType/RatePlan vs. Status etc. ############
### 2) Test the association between account status and tenure segment.
summary = impute.pivot_table('acctno',
                             index=['tenure_bin'],
                             columns=['active'],
                             aggfunc='nunique',
                             margins=True)

df = impute.pivot_table('acctno',
                        index=['tenure_bin'],
                        columns=['active'],
                        aggfunc='nunique',
                        margins=False)
# "correction=False" means no Yates' correction is used!
# The statistic is stored as chi2_stat so it does not shadow the imported chi2 distribution.
chi2_stat, p_value, DOF, expected = chi2_contingency(df, correction=False)
results = pd.DataFrame({'Chi-Square': round(chi2_stat, 2),
                        'Degrees of Freedom': DOF,
                        'p_value': p_value}, index=['Chi-Square Test Output'])
### p-value < 0.05: we reject H0 and conclude that there is an association between tenure and account status.

This discussion was modified 3 years, 11 months ago by Datura.
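To make the DOF = (r-1)(c-1) formula and the expected counts concrete, here is a minimal sketch that computes the Chi-Square test by hand and cross-checks it against chi2_contingency(). The 3x2 table is made up for illustration and is not taken from the impute data.

```python
import numpy as np
from scipy.stats import chi2, chi2_contingency

# Made-up 3x2 contingency table (rows: factor levels, columns: account status).
observed = np.array([[30, 10],
                     [25, 25],
                     [15, 35]])

# Expected count for each cell = row total * column total / grand total.
row_totals = observed.sum(axis=1, keepdims=True)
col_totals = observed.sum(axis=0, keepdims=True)
expected = row_totals * col_totals / observed.sum()

# Pearson Chi-Square statistic and degrees of freedom (r-1)(c-1).
stat = ((observed - expected) ** 2 / expected).sum()
dof = (observed.shape[0] - 1) * (observed.shape[1] - 1)
p_value = chi2.sf(stat, dof)  # upper-tail probability of the chi-square distribution

# Cross-check against SciPy, with Yates' correction turned off.
stat_sp, p_sp, dof_sp, expected_sp = chi2_contingency(observed, correction=False)

print(f"by hand: chi2={stat:.4f}, dof={dof}, p={p_value:.6f}")
print(f"scipy:   chi2={stat_sp:.4f}, dof={dof_sp}, p={p_sp:.6f}")
```

The two lines of output should agree, which also shows what the imported chi2 distribution is useful for: turning a statistic and its degrees of freedom into a p-value.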
### 3) Test on means of Sales or other numeric variables by Student t-test and ANOVA.
import scipy.stats as stats
### Two sample t-test
summary = impute.pivot_table('sales',
                             columns=['credit'],
                             aggfunc='mean',
                             margins=False)

# Run the independent two-sample t-test.
# Note: equal_var=False gives Welch's t-test, which does not assume equal variances.
tStat, p_value = stats.ttest_ind(impute[impute['credit'] == 0]['sales'],
                                 impute[impute['credit'] == 1]['sales'],
                                 equal_var=False,
                                 nan_policy='omit')
print("t-Stat:{0}, p-value:{1}".format(tStat, p_value))
# We cannot reject H0, so there is no significant evidence that mean sales differ
# between good-credit and bad-credit people.

### Multiple sample ANOVA test. We look into the deactivated people only. ###################
summary = impute.pivot_table('sales',
                             columns=['reason'],
                             aggfunc='mean',
                             margins=False)

### We need to drop null values on the analysis variable; otherwise we will get nan results.
impute.isnull().sum()
# Filter on impute (not sub, which does not exist yet): keep rows with a reason and non-null sales.
sub = impute[(impute['reason'] != ' ') & (impute['sales'].notnull())]

groups = [sub[sub['reason'] == r]['sales'] for r in ['COMP', 'DEBT', 'MOVE', 'NEED', 'TECH']]
FStat, p_value = stats.f_oneway(*groups)

# ANOVA degrees of freedom: k-1 between groups and N-k within groups.
df_between = len(groups) - 1
df_within = sum(len(g) for g in groups) - len(groups)
print("F-Stat: {0}, DOF=({1}, {2}), p-value: {3}".format(FStat, df_between, df_within, p_value))
# Since p-value < 0.05, we reject H0 and conclude that the sales means
# differ significantly among people with different deactivation reasons.

### Note: From the ANOVA analysis above, we know that treatment differences are statistically significant,
### but ANOVA does not tell us which treatments differ significantly from each other.
### To find the pairs of significantly different treatments, we perform a multiple
### pairwise comparison (post-hoc comparison) analysis using the Tukey HSD test.

#################### Tukey's multi-comparison method #########################
# This method tests at p < 0.05 (correcting for the fact that multiple comparisons
# are being made, which would normally increase the probability of a significant
# difference being identified). A result of 'reject = True' means that a significant
# difference has been observed.
# load packages
from statsmodels.stats.multicomp import (pairwise_tukeyhsd, MultiComparison)

### class statsmodels.sandbox.stats.multicomp.MultiComparison(data, groups, group_order=None)
sub.info()

MultiComp = MultiComparison(sub['sales'],                       ### analysis variable
                            sub['reason'].replace(' ', ' NA'),  ### group variable
                            group_order=None)                   ### group_order: the desired order for the group mean results to be reported in.
results = MultiComp.tukeyhsd().summary()
results
############################ End of Program. #################################
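For anyone who wants to see where the one-way ANOVA F statistic and its two degrees of freedom come from, here is a minimal sketch that computes them by hand and compares the result with stats.f_oneway(). The three groups and their values are made up for illustration; they are not the deactivation-reason groups from the impute data.

```python
import numpy as np
from scipy import stats

# Made-up sales figures for three groups; the values are illustrative only.
groups = [np.array([12.0, 15.0, 11.0, 14.0]),
          np.array([18.0, 21.0, 19.0]),
          np.array([10.0, 9.0, 13.0, 12.0, 11.0])]

k = len(groups)                         # number of groups
n_total = sum(len(g) for g in groups)   # total number of observations
grand_mean = np.concatenate(groups).mean()

# Between-group and within-group sums of squares.
ss_between = sum(len(g) * (g.mean() - grand_mean) ** 2 for g in groups)
ss_within = sum(((g - g.mean()) ** 2).sum() for g in groups)

# Degrees of freedom: k-1 between groups, N-k within groups.
df_between = k - 1
df_within = n_total - k

# F is the ratio of the two mean squares; the p-value is the upper tail of F.
f_stat = (ss_between / df_between) / (ss_within / df_within)
p_value = stats.f.sf(f_stat, df_between, df_within)

# Cross-check against SciPy's one-way ANOVA.
f_sp, p_sp = stats.f_oneway(*groups)

print(f"by hand: F={f_stat:.4f}, DOF=({df_between}, {df_within}), p={p_value:.6f}")
print(f"scipy:   F={f_sp:.4f}, p={p_sp:.6f}")
```

Note that the F-test has a pair of degrees of freedom, (k-1, N-k), rather than a single N-1 value.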