This blog post won’t give as much information on that initial rejection as I wanted it to – I sought permission from the journal to publish all correspondence from the reviewer and the editor, which was denied. Below you find only my response to the editorial decision. As context, the manuscript was rejected on the basis of one review, in which it was suggested that we had adopted some unconventional and even nefarious practices in gathering and analysing our data. These suggestions didn’t sit well with me, so I sent the following email to the Editor via the Elsevier Editorial System.
Dear Prof XXXX,
Thank you for your recent consideration of our manuscript, ‘”Old?” or “New?”: The test question provokes a goal-directed bias in memory decision-making’ (Ms. No. XXXXX-XX-XXX), for publication in Consciousness & Cognition. We were, of course, disappointed that you chose to reject the manuscript.
Having read the justification given for rejection, we respectfully wish to respond to the decision letter. Whilst we do not believe that this response will prompt reconsideration of the manuscript (your decision letter was clear) we believe it is important to respond for two reasons. First, to reassure you that we have not engaged in any form of data manipulation, and second, to state our concerns that the editorial team at Consciousness & Cognition appear to view it as acceptable that reviewers base their recommendations on poorly substantiated inferences they have made about the motivations of authors to engage in scientific misconduct.
As you highlighted in your decision letter, Reviewer 1 raised “substantial concerns about the manner in which [we] carried out the statistical analysis of [our] data”. The primary concern centred on our decision to exclude participants whose d’ was below a threshold of 0.1. In this reply we hope to demonstrate to you that use of such an exclusion criterion is not only standard practice, but it is indeed a desirable analysis step which should give readers more, not less, confidence in the analyses and their interpretation. We will then take the opportunity to demonstrate, if only to you, that our data behave exactly as one would expect them to under varying relaxations of the enforced exclusion criterion.
Recognition memory studies often require that participants exceed a performance (sensitivity; d’) threshold before they are included in the study. This is typically carried out when the studies themselves treat a non-sensitivity parameter as their primary dependent variable (as in our rejected paper), as a means of excluding participants that were unmotivated or disengaged from the task. Below are listed a small selection of studies published in the past 2 years which have used sensitivity-based exclusion criteria, along with the number of participants excluded and the thresholds used:
Craig, Berman, Jonides & Lustig (2013) – Memory & Cognition
Expt2 – 10/54, Expt3 – 5/54
Davidenko, N. & Flusberg, S.J. (2012) – Cognition
Expt1a – 1/56, Expt1b –1/26, Expt2b –3/26
chance (50%) accuracy
Gaspelin, Ruthruff, & Pashler, (2013). Memory & Cognition
Expt1 – 3/46, Expt2 – 1/49, Expt3 – 3/54
Johnson & Halpern (2012) – Memory & Cognition
Expt1 – 1/20, Expt2 – 3/25
Rummel, Kuhlmann, & Touron (2013) – Consciousness & Cognition
Word classification (prospective memory task)
Expt1 – 6/145
“prospective memory failure”
Sheridan & Reingold (2011) – Consciousness & Cognition
Expt1 – 5/64
“difficulty following instructions”
Shedden, Milliken, Watters & Monteiro (2013) – Consciousness & Cognition
Expt4 – 5/28
You will note that there is tremendous variation in the thresholds used, but that it is certainly not “unusual” as claimed by Reviewer 1, not even for papers published in Consciousness and Cognition. Of course, it would not be good scientific practice to accept the status quo uncritically, and we must therefore explain why the employed sensitivity-based exclusion criterion was appropriate for our study. The reasoning is that we were investigating an effect associated with higher order modulation of memory processes. If we had included participants with d’s below 0.1 (an overall accuracy rate of approximately 52% where chance responding is 50%) then it is reasonable to assume that these participants were not making memory decisions based on memory processes, but were at best contributing noise to the data (e.g. via random responding), and at worst systematically confounding the data. To demonstrate the latter point, one excluded participant from Experiment 1 had a d’ of -3.7 (overall accuracy rate of 13%), which was most likely due to responding using the opposite keys to those which they were instructed to use, substituting “new” responses for “old” responses. As we were investigating the effects of the test question on the proportion of “new” and “old” responses, it is quite conceivable that if they had displayed the same biases as our included participants did overall, they would have reduced our ability to detect a real effect. Including participants who did not meet our inclusion criteria, instead of helping us to find effects that hold across all participants, would have systematically damaged the integrity of our findings, leading to reduced estimates of effect size caused by ignoring the potential for influence of confounding variables.
If we had been given the opportunity to respond to Reviewer 1’s critique via the normal channels, we would have also corrected Reviewer 1’s inaccurate reading of our exclusion rates. As we stated very clearly in the manuscript, our sensitivity-based exclusion rates were 5%, 6% and 17%, not 25%, 6% and 17%. Reviewer 1 has conflated Experiment 1’s exclusions based on native language with exclusion based on sensitivity. As an aside, we justify exclusion based on language once again as a standard exclusion criterion in word memory experiments to ensure equivalent levels of word comprehension across participants. This is of particular importance when conducting online experiments which allow anyone across the world to participate. In coding the experiment, we thought it far more effective to allow all visitors to participate after first indicating their first language, with a view to excluding non-native speakers’ data from the study after they had taken part. We wanted all participants to have the opportunity to take part in the study (and receive feedback on their memory performance – a primary motivator for participation according to anecdotal accounts gleaned from social media) and to minimise any misreporting of first language which would add noise to the data without recourse for its removal.
We would next have responded to Reviewer 1’s claims that our conclusions are not generalisable based on the subset of analysed data by stating that Reviewer 1 is indeed partially correct. Our conclusions would not have been found had we included participants who systematically confounded the data (as discussed above) – as Reviewer 1 is at pains to point out, the effect is small. Nonetheless as demonstrated by our replication of the findings over three experiment and the following reanalyses, our findings are robust enough to withstand inclusion of some additional noise, within reason. To illustrate, we re-analysed the data under two new inclusion thresholds. The first, a d’ <= 0 threshold, equivalent to chance responding (Inclusion 1), and the second a full inclusion in which all participants were analysed (Inclusion 2). For the sake of brevity we list here the results as they relate to our primary manipulation, the effects of question on criterion placement.
Original – old emphasis c > new emphasis c, t(89) = 2.141, p = .035, d = 0.23.
old emphasis c > new emphasis c, t(90) = 2.32, p = .023, d = 0.24.
no difference between old and new emphasis c, t(94) = 1.66, p = .099, d = 0.17
Main effect of LOP, F(1,28) = 23.66, p = .001, ηp2 = .458, shallow > deep.
Main effect of emphasis, F(1,28) = 6.65, p = .015, ηp2 = .192, old? > new?.
No LOP x emphasis interaction, F(1,28) = 3.13, p = .088, ηp2 = .101.
Shallow LOP sig old > new, t(28) = 3.05, p = .005, d = 0.62.
Deep LOP no difference, t(28) = .70, p = .487, d = 0.13.
No change in excluded participants – results identical.
Main effect of LOP, F(1,30) = 20.84, p = .001, ηp2 = .410, shallow > deep.
Main effect of emphasis, F(1,30) = 8.73, p = .006, ηp2 = .225, old? > new?.
Sig LOP x emphasis interaction, F(1,30) = 4.28, p = .047, ηp2 = .125.
Shallow LOP sig old > new, t(30) = 3.50, p = .001, d = 0.64.
Deep LOP no difference, t(30) = .76, p = .454, d = 0.14.
No main effect of response, F(1,28) = 3.73, p = .064, ηp2 = .117
No main effect of question, F < 1.
Significant question x response interaction, F(1,28) = 8.50, p = .007, ηp2 = .233.
“Yes” response format, old > new, t(28) = 2.41, p = .023, d = 0.45.
“No” response format, new > old, t(28) = 2.77, p = .010, d = 0.52.
No change in excluded participants – results identical.
No main effect of response, F <1.
No main effect of question, F < 1.
No question x response interaction, F(1,34) = 3.07, p = .089, ηp2 = .083.
“Yes” response format, old > new, t(34) = 1.33, p = .19, d = 0.23.
“No” response format, new > old, t(34) = 1.84, p = .07, d = 0.32.
To summarise, including participants who responded anywhere above chance had no untoward effects on the results of our inferential statistics and therefore our interpretation cannot be called into question by the results of Inclusion 1. Inclusion 2 on the other hand had much more deleterious effects on the patterns of results reported in Experiments 1 and 3. This is exactly what one would expect given the example we described previously where the inclusion of a participant responding systematically below chance would elevate type II error. In this respect, our reanalysis in response to Reviewer 1’s comments does not weaken our interpretation of the findings.
As a final point, we wish to express our concerns about the nature of criticism made by Reviewer 1 and accepted by you as appropriate within peer-review for Consciousness & Cognition. Reviewer 1 states that we the authors “must accept the consequences of data that might disagree with their hypotheses”. This strongly suggests that we have not done so in the original manuscript and have therefore committed scientific misconduct or entered a grey-area verging on misconduct. We deny this allegation in the strongest possible terms and are confident we have demonstrated that this is absolutely not the approach we have taken through the evidence presented in this response. Indeed if Reviewer 1 wishes to make these allegations, they would do well to provide evidence beyond the thinly veiled remarks in their review. If they wish to do this, we volunteer full access to our data for them to conduct any tests to validate their claims, e.g. those carried out in Simonsohn (2013) in which a number of cases of academic misconduct and fraud are exposed through statistical methods. We, and colleagues we have spoken to about this decision, found it worrying that you chose to make your editorial decision on the strength of this unsubstantiated allegation and believe that at the very least we should have been given the opportunity to respond to the review, as we have done here, via official channels.
We thank you for your time.
Akira O’Connor & Ravi Mill
Craig. K. S., Berman, M.G., Jonides, J. & Lustig, C (2013) Escaping the recent past: Which stimulus dimensions influence proactive interference? Memory & Cognition 41, 650-670.
Davidenko, N. & Flusberg, S.J. (2012) Environmental inversion effects in face perception. Cognition 123(2), 442-447.
Gaspelin, N., Ruthruff, E. & Pashler, H. (2013) Divided attention: An undesirable difficulty in memory retention. Memory & Cognition 41, 978-988.
Johnson, S. K. & Halpern, A. R. (2012) Semantic priming of familiar songs. Memory & Cognition 40, 579-593.
Rummel, J., Kuhlmann, B.G. & Touron, D. R. (2013) Performance predictions affect attentional processes of event-based prospective memory. Consciousness and Cognition, 22 (3), 729-741.
Shedden, J. M., Milliken, B., Watters, S. & Monteiro, S (2013) Event-related potentials as brain correlates of item specific proportion congruent effects. Consciousness & Cognition, 22 (4), 1442-1455.
Sheridan, H. & Reingold, E. M. (2011) Recognition memory performance as a function of reported subjective awareness. Consciousness & Cognition 20 (4), 1363-1375.
Simonsohn, U. (2013) Just Post It: The Lesson From Two Cases of Fabricated Data Detected by Statistics Alone. Psychological Science, 24(10), 1875-1888.
The Editor’s very reasonable response was to recommend we resubmit the manuscript, which we did. The manuscript was then sent out for review to two new reviewers, and the process began again, this time with a happier ending.