Assignment 3

This is the third (and last) quiz for the cogmaster-stats course. There are 10 multiple-choice questions in total, each with only one correct response. Rather than answering at random, please check the "Don't know" option. When you are done with the quiz, you can validate your responses by clicking on the Validate button at the bottom of the page.
Responses are not saved as you type them, so be sure to review all questions before submitting your answers.
If you have any questions, you can ask on Twitter (@cogmasterstats) or by email.
Due date: January 15.


Name
Email
Date (E.g., 21/08/2008; this is also checked internally.)

1. With the following code, we can create a dummy dataframe for an experiment. First, let's consider that scores were recorded on different subjects in a task with 3 levels of difficulty, with 60 subjects per level (as set by n in the code below).
>>> import numpy as np
>>> import pandas as pd
>>> n = 60                                   # number of subjects per condition
>>> x = np.array([23, 20, 32])               # mean score in each condition
>>> x = x.repeat(n)                          # one value per subject
>>> x = x + 2 * np.random.randn(x.size)      # add Gaussian noise (sd = 2)
>>> df = pd.DataFrame({"Score": x, "Condition": np.repeat(['verbal shadowing', 'working memory load', 'control'], n)})
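Before answering, it may help to have a quick look at the simulated dataframe; a minimal sketch using standard pandas inspection methods (any of them would do):
>>> df.head()                       # first few rows: one Score and one Condition per subject
>>> df.shape                        # (180, 2): 60 subjects in each of the 3 conditions
>>> df['Condition'].value_counts()  # number of subjects per condition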
How would you compute the mean score for each level of difficulty?
df['Condition'].apply(np.mean)
np.mean(df.groupby('Condition'))
df.groupby('Condition').mean()
df.groupby('Condition').np.mean()
Don't know.
2. Now, if you were to visually check the homoscedasticity of the data in each cell of the design, which command would provide you with the best graph?
df.boxplot(by='Condition')
df.groupby('Condition').plot(kind='box')
df.groupby('Condition').boxplot()
Don't know.
3. In fact, in each condition half of the subjects were tested in the morning and the other half in the afternoon. How would you add this factor to the current dataframe? (A small reminder of how np.repeat and np.tile differ is given after the answer options.)
df.append(pd.Series(np.tile(np.repeat(['morning','afternoon'],30),3)),column=['Time'])
df.addcolumn(np.tile(np.repeat(['morning','afternoon'],30),3),name='Time')
df['Time']=pd.Series(np.tile(np.repeat(['morning','afternoon'],30),3),index=df.index)
Don't know.
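As a reminder for the question above, here is how np.repeat and np.tile differ, shown on toy arrays rather than the full design:
>>> np.repeat(['a', 'b'], 2)   # repeats each element in place: 'a', 'a', 'b', 'b'
>>> np.tile(['a', 'b'], 2)     # repeats the whole array: 'a', 'b', 'a', 'b'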
4. How would you write the formula (shown as xxxxxxx) in the following code in order to assess the validity of keeping the "Time" factor in the analysis of the experimental data?
>>> import statsmodels as sm
>>> from statsmodels.formula.api import ols
>>> from statsmodels.graphics.api import interaction_plot
>>> from statsmodels.stats.anova import anova_lm
>>> formula = 'xxxxxxx'
>>> m = ols(formula, df)                     # build the model from the formula and the data
>>> r = m.fit()                              # fit it
>>> print anova_lm(r)                        # ANOVA table for the fitted model
>>> r.resid.plot()                           # inspect the residuals
>>> interaction_plot(df['Condition'], df['Time'], df['Score'])   # Condition x Time interaction
Score ~ Condition * Time
Score ~ Condition + Time
Score(Condition,Time)
Don't know.
5. In a study on monozygotic twins, researchers are interested in comparing the average reading speed within each pair of twins. What scipy.stats test function would you use in this case?
ttest_1samp()
ttest_ind()
ttest_rel()
Any of the above.
Don't know.
6. What scipy.stats test function would you use if the study on those monozygotic twins was done with reaction time recordings, which are usually Poisson-like distributed? (A quick way to check the distributional assumption is sketched after the answer options.)
mannwhitneyu()
wilcoxon()
ranksums()
Any of the above.
Don't know.
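One way to judge whether the paired t-test from the previous question remains defensible is to look at the distribution of the paired differences. A minimal sketch with hypothetical reaction-time values (for illustration only, not real recordings):
>>> import numpy as np
>>> from scipy import stats
>>> rt1 = np.random.gamma(2., 150., 30)   # hypothetical right-skewed RTs (ms) for the first twins
>>> rt2 = np.random.gamma(2., 160., 30)   # hypothetical RTs for the second twins
>>> stats.shapiro(rt1 - rt2)              # normality check on the paired differences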
7. A data file with some missing values has been loaded using Pandas read_csv() as follows:
>>> import numpy as np
>>> import pandas as pd
>>> d = pd.read_csv("readings2.csv", na_values=".")
>>> d.count()
Treatment    44
Response     38
dtype: int64
>>> d.head()
  Treatment  Response
0   Treated        24
1   Treated        43
2   Treated        58
3   Treated        71
4   Treated        43
>>> d.Treatment[d.Response.isnull()]
8     Treated
17    Treated
29    Control
34    Control
41    Control
42    Control
Name: Treatment, dtype: object
>>> grp = d.groupby('Treatment')
>>> grp.agg([np.size, np.mean, np.std])
           Response
               size       mean        std
Treatment
Control          23  44.368421  15.724046
Treated          21  51.631579  10.111657
Suppose we have defined a function f that replaces every missing value with the mean of the non-missing values, like this:
>>> f = lambda x: x.fillna(x.mean())
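To see what f does in isolation, here is a toy example on a small Series with one missing value (hypothetical numbers, not taken from readings2.csv):
>>> s = pd.Series([1.0, np.nan, 3.0])
>>> f(s)    # the NaN is replaced by the mean of 1.0 and 3.0, i.e. 2.0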
How would you use this function to fill missing values for each treatment group separately in the above data set?
grp.transform(f)
d.transform(f)
grp.filter(f)
d.filter(f)
Don't know.
8. Using the statsmodels package, we fitted a regression line to a bivariate sample of 20 observations. The results are shown below:
>>> import numpy as np
>>> from scipy import stats
>>> import statsmodels.api as sm
>>> n = 20
>>> x = np.random.uniform(0, 10, n)               # predictor
>>> y = 1.1 + 0.8*x + np.random.normal(0, 5, n)   # response with Gaussian noise
>>> X = sm.add_constant(x, prepend=False)         # add an intercept column
>>> m = sm.OLS(y, X)
>>> r = m.fit()
>>> print r.summary()
                            OLS Regression Results                            
==============================================================================
Dep. Variable:                      y   R-squared:                       0.489
Model:                            OLS   Adj. R-squared:                  0.460
Method:                 Least Squares   F-statistic:                     17.20
Date:                Mon, 19 Jan 2015   Prob (F-statistic):            xxxxxxx
Time:                        12:12:13   Log-Likelihood:                -47.392
No. Observations:                  20   AIC:                             98.78
Df Residuals:                      18   BIC:                             100.8
Df Model:                           1                                         
Covariance Type:            nonrobust                                         
==============================================================================
                 coef    std err          t      P>|t|      [95.0% Conf. Int.]
------------------------------------------------------------------------------
x1             0.7781      0.188      4.147      0.001         0.384     1.172
const          2.1638      1.038      2.084      0.052        -0.018     4.345
==============================================================================
Omnibus:                        0.372   Durbin-Watson:                   2.174
Prob(Omnibus):                  0.830   Jarque-Bera (JB):                0.515
Skew:                           0.137   Prob(JB):                        0.773
Kurtosis:                       2.263   Cond. No.                         9.63
==============================================================================
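As a side note on reading this table: with a single predictor, the overall F statistic is simply the square of the slope's t statistic, which can be checked directly from the numbers above:
>>> 4.147 ** 2    # roughly 17.2, matching the reported F-statistic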
What is the p-value for the F-test (shown as xxxxxxx in the above output)?
1 - scipy.stats.f.cdf(r.fvalue, 1, 18)
1 - scipy.stats.f.ppf(17.20, 1, 18)
r.f_pvalue
Any of the above.
Don't know.
9. In order to check whether a die is crooked, we recorded the outcome of successive throws until each face had appeared at least 5 times. The results are stored in the following Python dictionary:
>>> rolls = {"value": ["one", "two", "three", "four", "five", "six"], "count": [9, 5, 10, 8, 11, 14]}
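Before choosing a test, it is worth computing the expected counts under the hypothesis of a fair die; a quick check from the dictionary above:
>>> n_rolls = sum(rolls['count'])   # 57 throws in total
>>> n_rolls / 6.                    # expected count per face under a fair die: 9.5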
Which command would you use to test this, assuming that the stats module has been imported with: from scipy import stats?
stats.fisher_exact(rolls['count'])
stats.chisquare(rolls['count'])
stats.chi2(rolls['count'])
Any of the above.
Don't know.
10. The 'low birth weight' study, available as birthwt in the R MASS package, is a retrospective cohort study in which data from 189 mothers were recorded during 1986. Medical scientists were interested in potential risk factors associated with low infant birth weight. The data can be imported as follows (the data are fetched from the web):
>>> import statsmodels.api as sm
>>> import statsmodels.formula.api as smf
>>> d = sm.datasets.get_rdataset('birthwt', package='MASS').data
>>> d.head()
    low  age  lwt  race  smoke  ptl  ht  ui  ftv   bwt
85    0   19  182     2      0    0   0   1    0  2523
86    0   33  155     3      0    0   0   0    3  2551
87    0   20  105     1      1    0   0   0    1  2557
88    0   21  108     1      1    0   0   1    2  2594
89    0   18  107     1      1    0   0   1    0  2600	
A baby is considered underweight if its birth weight is below 2.5 kg. This is recorded in the binary variable low. Explanatory variables of interest are smoking status (smoke=1 means that the mother smoked during pregnancy), age (in years) and ethnicity (race, where 1=white, 2=black, 3=other). Using smf.logit(), you can obtain the parameter estimates from a logistic regression model that includes all three predictors. What is the value of the odds-ratio for the smoke variable?
3.006
3.033
1.095
Don't know.



When you are done, please validate your answers by clicking on the button.
Be careful: submitting this form is definitive; you will not be able to change your answers afterwards. Please check them one last time before submitting.