import warnings
warnings.filterwarnings('ignore')
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import statsmodels.api as sm
from statsmodels.formula.api import ols
from statsmodels.stats.multicomp import pairwise_tukeyhsd
from scipy import stats as st
Our company is working on the dosing regimen for a new therapeutic compound. The compound will be administered in a tablet with a time-release coating. Both the thickness of the time-release coating and the total mass of compound in the tablet will influence the plasma concentration profile of the drug over time. Furthermore, biological variation in kidney function creates variation in each sample.
We have conducted the first round of exploratory experiments examining the effect of both the mass of the compound per tablet and the thickness of the time release coating in mice. There were four levels of coating thickness and four levels of compound mass included in the experiment. Five mice were subjected to each combination of factor levels, for a total or 80 measurements. Blood samples were collected on all mice at 4 and 8 hours. As a junior employee your access to proprietary knowledge is limited, so the values of the levels are not supplied. Your task is to perform a basic exploratory data analysis with two-way ANOVA and write up the results to present to the rest of the team. The tablet thickness is labeled "t" and the mass of the compound is labeled "c", each level is denoted with an integer.
The data sets are given below.
Your conclusions should be clearly stated in your Jupyter Notebook.
For a review of two-way ANOVA see this resource: http://www.real-statistics.com/two-way-anova/two-factor-anova-with-replication/_
df_4 = pd.read_csv('data_4.csv')
df_4.head()
Unnamed: 0 | t1 | t2 | t3 | t4 | |
---|---|---|---|---|---|
0 | c1 | 0.0208 | 0.3208 | 0.4208 | 0.7208 |
1 | c1 | 0.0000 | 0.1436 | 0.2436 | 0.5436 |
2 | c1 | 0.4664 | 0.7664 | 0.8664 | 1.1664 |
3 | c1 | 0.0000 | 0.0000 | 0.0000 | 0.0000 |
4 | c1 | 1.9642 | 2.2642 | 2.3642 | 2.6642 |
df_4 = pd.melt(df_4, id_vars=['Unnamed: 0'], var_name='Thickness')
df_4 = df_4.rename(columns={"Unnamed: 0": "Mass","value":'Plasma_Concentration'})
df_4.head()
Mass | Thickness | Plasma_Concentration | |
---|---|---|---|
0 | c1 | t1 | 0.0208 |
1 | c1 | t1 | 0.0000 |
2 | c1 | t1 | 0.4664 |
3 | c1 | t1 | 0.0000 |
4 | c1 | t1 | 1.9642 |
df_8 = pd.read_csv('data_8.csv')
df_8.head()
Unnamed: 0 | t1 | t2 | t3 | t4 | |
---|---|---|---|---|---|
0 | c1 | 0.9201 | 1.2201 | 1.3201 | 1.6201 |
1 | c1 | 0.3800 | 0.6800 | 0.7800 | 1.0800 |
2 | c1 | 0.3652 | 0.6652 | 0.7652 | 1.0652 |
3 | c1 | 0.0000 | 0.0000 | 0.0018 | 0.3018 |
4 | c1 | 1.4187 | 1.7187 | 1.8187 | 2.1187 |
df_8 = pd.melt(df_8, id_vars=['Unnamed: 0'], var_name='Thickness')
df_8 = df_8.rename(columns={"Unnamed: 0": "Mass","value":'Plasma_Concentration'})
df_8.head()
Mass | Thickness | Plasma_Concentration | |
---|---|---|---|
0 | c1 | t1 | 0.9201 |
1 | c1 | t1 | 0.3800 |
2 | c1 | t1 | 0.3652 |
3 | c1 | t1 | 0.0000 |
4 | c1 | t1 | 1.4187 |
df_4.columns
Index(['Mass', 'Thickness', 'Plasma_Concentration'], dtype='object')
df_4.boxplot(column=['Plasma_Concentration'],by = 'Mass',figsize = (5,6))
plt.show()
df_4.boxplot(column=['Plasma_Concentration'],by = 'Thickness',figsize = (5,6))
plt.show()
df_8.boxplot(column=['Plasma_Concentration'],by = 'Mass',figsize = (5,6))
plt.show()
df_8.boxplot(column=['Plasma_Concentration'],by = 'Thickness',figsize = (5,6))
plt.show()
model = ols('Plasma_Concentration ~ C(Mass) + C(Thickness) + C(Mass):C(Thickness)', data=df_4).fit()
sm.stats.anova_lm(model, typ=2)
sum_sq | df | F | PR(>F) | |
---|---|---|---|---|
C(Mass) | 20.702632 | 3.0 | 6.368337 | 0.000761 |
C(Thickness) | 4.075817 | 3.0 | 1.253762 | 0.297774 |
C(Mass):C(Thickness) | 0.054839 | 9.0 | 0.005623 | 1.000000 |
Residual | 69.351877 | 64.0 | NaN | NaN |
model = ols('Plasma_Concentration ~ C(Mass) + C(Thickness)', data=df_8).fit()
sm.stats.anova_lm(model, typ=2)
sum_sq | df | F | PR(>F) | |
---|---|---|---|---|
C(Mass) | 3.670774 | 3.0 | 3.130997 | 0.030712 |
C(Thickness) | 4.584936 | 3.0 | 3.910734 | 0.011984 |
Residual | 28.528345 | 73.0 | NaN | NaN |
We can see the following p-values for each of the factors in the table:
4 Hours
Since the p-value for C(Mass) and is smaller than .05, this means that Mass has a statistically significant influence on the plasma concentration at 4 hours.
Since the p-value for C(Thickness) is greater than .05, this means that Thickness has no statistically significant influence on the plasma concentration at 4 hours.
8 Hours
Since the p-values for C(Thickness) is smaller than .05, this means that Thickness has a statistically significant influence on the plasma concentration at 8 hours.
Since the p-values for C(Mass) is greater than .05, this means that Mass has no statistically significant influence on the plasma concentration at 8 hours.
Ans: C4 and T4
Ans: C4 and T4
Ans: C3 and T3
Perform a I-way ANOVA for the factor "Soil" (HINT: check out the df.dropna() function)
Perform a 2-way ANOVA for the factors "Soil" and "Biochar" with the interaction
Perform a Tukey's HSD posthoc test on your 2-way ANOVA (Hint: first use the df.dropna() function again, but only using the "Grav" column)
df = pd.read_excel('Biocharcoals.xlsx')
df.head()
Soil | Biochar | Replicate | Treatment | day | Temp | DeltaCO2 | FractionCO2 | Grav | Nitrate | |
---|---|---|---|---|---|---|---|---|---|---|
0 | Chehalis | 350 | 1 | Chehalis.350 | 0 | NaN | NaN | NaN | 0.427248 | 5.079171 |
1 | Chehalis | 350 | 1 | Chehalis.350 | 1 | 24.30 | 0.005872 | 0.707971 | 0.420356 | 11.092042 |
2 | Chehalis | 350 | 1 | Chehalis.350 | 2 | 24.60 | 0.006985 | 0.642727 | 0.419754 | 10.118050 |
3 | Chehalis | 350 | 1 | Chehalis.350 | 3 | 25.15 | 0.004096 | 0.458804 | 0.422973 | 10.903952 |
4 | Chehalis | 350 | 1 | Chehalis.350 | 4 | 25.20 | 0.003023 | 0.356825 | 0.424141 | 12.643656 |
df = df.dropna()
df.head()
Soil | Biochar | Replicate | Treatment | day | Temp | DeltaCO2 | FractionCO2 | Grav | Nitrate | |
---|---|---|---|---|---|---|---|---|---|---|
1 | Chehalis | 350 | 1 | Chehalis.350 | 1 | 24.30 | 0.005872 | 0.707971 | 0.420356 | 11.092042 |
2 | Chehalis | 350 | 1 | Chehalis.350 | 2 | 24.60 | 0.006985 | 0.642727 | 0.419754 | 10.118050 |
3 | Chehalis | 350 | 1 | Chehalis.350 | 3 | 25.15 | 0.004096 | 0.458804 | 0.422973 | 10.903952 |
4 | Chehalis | 350 | 1 | Chehalis.350 | 4 | 25.20 | 0.003023 | 0.356825 | 0.424141 | 12.643656 |
5 | Chehalis | 350 | 1 | Chehalis.350 | 7 | 25.06 | 0.005722 | 0.262590 | 0.421858 | 21.088875 |
df_grouped = df.groupby('Treatment', as_index = False)['Grav'].agg({'mean','std'},axis = 0).reset_index(level=[0])
df_grouped
Treatment | mean | std | |
---|---|---|---|
0 | Chehalis.0 | 0.396191 | 0.025297 |
1 | Chehalis.350 | 0.431703 | 0.029243 |
2 | Chehalis.500 | 0.443616 | 0.026858 |
3 | Chehalis.700 | 0.438545 | 0.060386 |
4 | Willamette.0 | 0.347809 | 0.046381 |
5 | Willamette.350 | 0.348051 | 0.021975 |
6 | Willamette.500 | 0.350081 | 0.020354 |
7 | Willamette.700 | 0.349836 | 0.020466 |
df_grouped.set_index('Treatment').plot.bar(color = ['b','r'],rot=45)
plt.title("Mean and Standard Deviation of Grav by Treatment")
plt.ylabel('Values')
plt.show()
df['Soil'].value_counts()
Chehalis 108 Willamette 108 Name: Soil, dtype: int64
Chehalis_idx = df["Soil"] == "Chehalis"
Willamette_idx = df["Soil"] == "Willamette"
Chehalis_Grav = df[Chehalis_idx]["Grav"]
Willamette_Grav = df[Willamette_idx]["Grav"]
stat, pval = st.f_oneway(Chehalis_Grav,Willamette_Grav)
print('p-values',pval)
if pval < 0.05: # alpha value is 0.05 or 5%
print("Null Hypothesis is Rejected")
else:
print("Failed to Reject Null Hypothesis")
p-values 2.6477869219833055e-38 Null Hypothesis is Rejected
This means we do have sufficient evidence to say that there is a statistically significant difference between the mean Grav of the two groups.
We can perform one-way ANOVA with two groups (Ideally used for 3 or more groups). T test is generally the more go to approach for two groups
df['Biochar'].value_counts()
350 54 500 54 700 54 None 54 Name: Biochar, dtype: int64
idx_350 = df["Biochar"] == 350
idx_500 = df["Biochar"] == 500
idx_700 = df["Biochar"] == 700
idx_None = df["Biochar"] == 'None'
Grav_350 = df[idx_350]["Grav"]
Grav_500 = df[idx_500]["Grav"]
Grav_700 = df[idx_700]["Grav"]
Grav_None = df[idx_None]["Grav"]
stat, pval = st.f_oneway(Grav_350,Grav_500,Grav_700,Grav_None)
print('p-values',pval)
if pval < 0.05: # alpha value is 0.05 or 5%
print("Null Hypothesis is Rejected")
else:
print("Failed to Reject Null Hypothesis")
p-values 0.06719121812784043 Failed to Reject Null Hypothesis
model = ols('Grav ~ C(Soil) + C(Biochar) + C(Soil):C(Biochar)', data=df).fit()
sm.stats.anova_lm(model, typ=2)
sum_sq | df | F | PR(>F) | |
---|---|---|---|---|
C(Soil) | 0.333354 | 1.0 | 285.829386 | 6.400521e-41 |
C(Biochar) | 0.020300 | 3.0 | 5.801913 | 7.934483e-04 |
C(Soil):C(Biochar) | 0.017062 | 3.0 | 4.876608 | 2.680177e-03 |
Residual | 0.242584 | 208.0 | NaN | NaN |
tukey = pairwise_tukeyhsd(endog=df['Grav'],groups=df['Soil'],alpha=0.05)
print(tukey)
Multiple Comparison of Means - Tukey HSD, FWER=0.05 ========================================================= group1 group2 meandiff p-adj lower upper reject --------------------------------------------------------- Chehalis Willamette -0.0786 0.001 -0.0883 -0.0689 True ---------------------------------------------------------
The 95% confidence interval for the mean difference between group Chehalis and group Willamette is (-0.0883, -0.0689), and since this interval does not contain zero we know that the difference between these two group means is statistically significant.
Likewise, the p-value for the mean difference between group Chehalis and group Willamette is 0.001, which is smaller than our significance level of 0.05, so this also indicates that the difference between these two group means is statistically significant.
biochar = df['Biochar'].astype(str)
tukey = pairwise_tukeyhsd(endog=df['Grav'],groups=biochar,alpha=0.05)
print(tukey)
Multiple Comparison of Means - Tukey HSD, FWER=0.05 =================================================== group1 group2 meandiff p-adj lower upper reject --------------------------------------------------- 350 500 0.007 0.9 -0.0194 0.0333 False 350 700 0.0043 0.9 -0.022 0.0307 False 350 None -0.0179 0.2977 -0.0442 0.0085 False 500 700 -0.0027 0.9 -0.029 0.0237 False 500 None -0.0248 0.0727 -0.0512 0.0015 False 700 None -0.0222 0.132 -0.0485 0.0042 False ---------------------------------------------------
The 95% confidence interval for the mean difference between group 350 and group 500 is (-0.0194, 0.0333), and since this interval does contain zero we know that the difference between these two group means is not statistically significant.
Likewise, the p-value for the mean difference between group 350 and group 500 is 0.9, which is greater than our significance level of 0.05, so this also indicates that the difference between these two group means is not statistically significant.
If the interval contains zero, then we know that the difference in group means is not statistically significant. In the example above, All pairwise comparisons are not statistically significant.
No it does not corrobrate visually with graph in Question 2
Yes it reinforces our findings in One-way ANOVA performed above