1.
With the following code, we can create a dummy dataframe for an experiment.
First, suppose that we have recorded the scores of 60 subjects in each of 3 experimental conditions, for 180 scores in total.
>>> import numpy as np
>>> import pandas as pd
>>> n = 60
>>> x = np.array([23, 20, 32])
>>> x = x.repeat(n)
>>> x = x + 2 * np.random.randn(x.size)
>>> df = pd.DataFrame({"Score": x, "Condition": np.repeat(['verbal shadowing', 'working memory load', 'control'], n)})
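As a quick sanity check (an illustrative aside, not part of the question), the simulated dataframe can be inspected as follows — a minimal sketch assuming the setup above:

```python
import numpy as np
import pandas as pd

n = 60
x = np.array([23, 20, 32]).repeat(n)           # three condition means, 60 subjects each
x = x + 2 * np.random.randn(x.size)            # add Gaussian noise (sd = 2)
df = pd.DataFrame({"Score": x,
                   "Condition": np.repeat(['verbal shadowing',
                                           'working memory load',
                                           'control'], n)})

print(df.shape)                                # (180, 2)
print(df['Condition'].value_counts())          # 60 rows per condition
```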
How would you compute the mean score for each condition?
df['Condition'].apply(np.mean)

np.mean(df.groupby('Condition'))

df.groupby('Condition').mean()

df.groupby('Condition').np.mean()

Don't know.
2.
Now, if you were to visually check the homoscedasticity of the data in each cell of the design, which command would provide you with the best graph?
df.boxplot(by='Condition')

df.groupby('Condition').plot(kind='box')

df.groupby('Condition').boxplot()

Don't know.
3.
Actually, in each condition half of the subjects were tested in the morning and the other half in the afternoon. How would you add this factor to the current dataframe?
df.append(pd.Series(np.tile(np.repeat(['morning','afternoon'],30),3)),column=['Time'])

df.addcolumn(np.tile(np.repeat(['morning','afternoon'],30),3),name='Time')

df['Time']=pd.Series(np.tile(np.repeat(['morning','afternoon'],30),3),index=df.index)

Don't know.
4.
How would you write the formula (shown as xxxxxxx) in the following code in order to assess whether the "Time" factor should be kept in the analysis of the experiment data?
>>> import statsmodels as sm
>>> from statsmodels.formula.api import ols
>>> from statsmodels.graphics.api import interaction_plot
>>> from statsmodels.stats.anova import anova_lm
>>> formula = 'xxxxxxx'
>>> m = ols(formula, df)
>>> r = m.fit()
>>> print(anova_lm(r))
>>> r.resid.plot()
>>> interaction_plot(df['Condition'], df['Time'], df['Score'])
Score ~ Condition * Time

Score ~ Condition + Time

Score(Condition,Time)

Don't know.
5.
In a study on monozygotic twins, researchers are interested in comparing average reading speed within each pair of twins. What scipy.stats test function would you use in this case?
ttest_1samp()

ttest_ind()

ttest_rel()

Any of the above.

Don't know.
6.
What scipy.stats test function would you use if the study on those monozygotic twins was done with reaction time recordings, which are usually Poisson-like distributed?
mannwhitneyu()

wilcoxon()

ranksums()

Any of the above.

Don't know.
7.
A data file with some missing values has been loaded using Pandas read_csv() as
follows:
>>> import numpy as np
>>> import pandas as pd
>>> d = pd.read_csv("readings2.csv", na_values=".")
>>> d.count()
Treatment 44
Response 38
dtype: int64
>>> d.head()
Treatment Response
0 Treated 24
1 Treated 43
2 Treated 58
3 Treated 71
4 Treated 43
>>> d.Treatment[d.Response.isnull()]
8 Treated
17 Treated
29 Control
34 Control
41 Control
42 Control
Name: Treatment, dtype: object
>>> grp = d.groupby('Treatment')
>>> grp.agg([np.size, np.mean, np.std])
Response
size mean std
Treatment
Control 23 44.368421 15.724046
Treated 21 51.631579 10.111657
Suppose we have defined a function f that replaces every missing value with the mean of its input, like this:
>>> f = lambda x: x.fillna(x.mean())
How would you use this function to fill in missing values for each treatment group separately in the above data set?
grp.transform(f)

d.transform(f)

grp.filter(f)

d.filter(f)

Don't know.
8.
Using the statsmodels package, we fitted a regression line to a bivariate series of 20 observations. The results are shown below:
>>> import numpy as np
>>> from scipy import stats
>>> import statsmodels.api as sm
>>> n = 20
>>> x = np.random.uniform(0, 10, n)
>>> y = 1.1 + 0.8*x + np.random.normal(0, 5, n)
>>> X = sm.add_constant(x, prepend=False)
>>> m = sm.OLS(y, X)
>>> r = m.fit()
>>> print(r.summary())
OLS Regression Results
==============================================================================
Dep. Variable: y R-squared: 0.489
Model: OLS Adj. R-squared: 0.460
Method: Least Squares F-statistic: 17.20
Date: Mon, 19 Jan 2015 Prob (F-statistic): xxxxxxx
Time: 12:12:13 Log-Likelihood: -47.392
No. Observations: 20 AIC: 98.78
Df Residuals: 18 BIC: 100.8
Df Model: 1
Covariance Type: nonrobust
==============================================================================
coef std err t P>|t| [95.0% Conf. Int.]
------------------------------------------------------------------------------
x1 0.7781 0.188 4.147 0.001 0.384 1.172
const 2.1638 1.038 2.084 0.052 -0.018 4.345
==============================================================================
Omnibus: 0.372 Durbin-Watson: 2.174
Prob(Omnibus): 0.830 Jarque-Bera (JB): 0.515
Skew: 0.137 Prob(JB): 0.773
Kurtosis: 2.263 Cond. No. 9.63
==============================================================================
What is the p-value for the F-test (shown as xxxxxxx in the above output)?
1 - scipy.stats.f.cdf(r.fvalue, 1, 18)

1 - scipy.stats.f.ppf(17.20, 1, 18)

r.f_pvalue

Any of the above.

Don't know.
9.
In order to check whether a die is loaded, we recorded the outcome of throws until each face appeared at least 5 times. The results are stored in the following Python dictionary:
>>> rolls = {"value": ["one","two","three","four","five","six"], "count": [9,5,10,8,11,14]}
What command would you use, assuming that you imported the stats module from scipy with: from scipy import stats?
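As background (an illustrative aside, not one of the answer options): with the counts above there are 57 throws in total, so a fair die would be expected to show each face about 57/6 ≈ 9.5 times. A minimal sketch of that arithmetic:

```python
# Observed counts from the dice experiment above.
counts = [9, 5, 10, 8, 11, 14]

total = sum(counts)        # 57 throws in total
expected = total / 6       # 9.5 expected throws per face for a fair die

print(total, expected)     # 57 9.5
```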
stats.fisher_exact(rolls['count'])

stats.chisquare(rolls['count'])

stats.chi2(rolls['count'])

Any of the above.

Don't know.
10.
The 'low birth weight study', available as birthwt in the R MASS package, is a retrospective cohort study in which data from 189 mothers were recorded during 1986. Medical scientists were interested in potential risk factors associated with low infant birth weight.
The data can be imported as follows (data are fetched from the web):
>>> import statsmodels.api as sm
>>> import statsmodels.formula.api as smf
>>> d = sm.datasets.get_rdataset('birthwt', package='MASS').data
>>> d.head()
low age lwt race smoke ptl ht ui ftv bwt
85 0 19 182 2 0 0 0 1 0 2523
86 0 33 155 3 0 0 0 0 3 2551
87 0 20 105 1 1 0 0 0 1 2557
88 0 21 108 1 1 0 0 1 2 2594
89 0 18 107 1 1 0 0 1 0 2600
A baby is considered underweight if its weight is < 2.5 kg. This is recorded in the binary variable low. Explanatory variables of interest are the smoking status (smoke=1 means that the mother smoked during pregnancy), age (in years) and ethnicity (race, where 1=white, 2=black, 3=other).
Using smf.logit(), you can obtain the parameter estimates from a logistic regression model that includes all three predictors. What is the value of the odds-ratio for the smoke variable?
3.006

3.033

1.095

Don't know.
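As a reminder (an aside that does not give the answer away): in a logistic regression, the odds-ratio for a predictor is the exponential of its fitted coefficient, so a coefficient of 0 corresponds to an odds ratio of 1 (no effect). A minimal sketch, using a purely hypothetical coefficient:

```python
import numpy as np

# The odds ratio for a logit coefficient beta is exp(beta).
# A coefficient of 0 means the predictor does not change the odds.
beta = 0.0                  # hypothetical coefficient, for illustration only
odds_ratio = np.exp(beta)

print(odds_ratio)           # 1.0
```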