The Power of Replication – Bem's Psi Research

Which raises the question of timing: Dienes provided an online app to compute Bayes-Factors for these prior distributions.

Improving the replicability of empirical research

For example, Honorton and Ferrari http: So you can see that Wagenmakers did not identify "the problem". In fact, Bem's work has stood up to criticism with notable robustness. Considered within the context of the general precognition literature, the only "problem" here seems to be the ignorance and causal misinterpretation of the literature on your part. Prior to this, you've used the words "a healthy scientific response".

In fact, this idea of stopping is extremely unhealthy. What you're recommending people do, and what it seems you've done, is to stop reading the scientific literature as soon as you find a conclusion that is in line with your expectations. This is extremely unscientific. It is starting with a conclusion, finding evidence for it and not bothering to look at the evidence against it.

There is another group of people who work in this way: Your behaviour in evaluating the precognition literature appears the same as creationists' behaviour in evaluating planetary science. Again, it seems you've stopped looking at the literature as soon as you found something in line with your explanations. Look at the key words and phrases you use: These are not the words of a person who is open-minded and evaluating the data honestly and with integrity; ie, a scientist.

Instead, they are the words of a person who is afraid of the idea that has been put before him and is grasping on to anything, no matter how tenuous, that might allow him to push the idea away. Why are you starting from the point of view that psi is impossible? The mechanism is not understood. Specific mechanisms that have been put forward as possible candidates for psi will be ruled out or not according to particular models or, if they do not rely on particular models, specific data; or they will not be ruled out.

Regardless, Bem has put forward no such mechanism. Your criticism of Bem shows too little work. I'll get to the rest when I have a little time to do it properly, but I have an answer for this: The impossibility of time travel.

OK, some comments now I've had 5 minutes to think this morning. This statement doesn't make sense and shows a clear lack of understanding about what the word "psi" means.

Psi, for this paper, is the hypothesis that information can travel backwards in time in a format the brain can register and use at some level.

It is not as generic as you say, so the rest of your analysis doesn't hold. In addition, there is no such thing as theory-free observation. This is the first thing that comes up in any philosophy of science course.

In science, there are the data, and there is your proposed mechanism for the production of that data. But you cannot merely 'report the data' because they are meaningless without the context of the theory which motivated that experimental design, etc.

Either Bem is proposing a quantum mechanism, in which case I think he has revealed himself to have no clear story, or he is not proposing a mechanism, in which case his hat tipping to quantum physics is mere noise ('look - their stuff is weird too!'). It's true I haven't dedicated my life to this topic; I have many more interesting things to do. But the reason I haven't is that Bem's work fails to pass a basic scientific 'sniff test'.

I'm not afraid of his work - I am entirely unconvinced by it. Your rather literal-minded and superficial analysis doesn't really alter that fact. You've linked to a popular magazine article that claims time travel is impossible because a physicist has shown that a single photon does not exceed the speed of light. It isn't clear why you believe the light barrier precludes psi.

You seem to be arguing against a theory of psi that depends on something travelling faster than light. As far as I know, there is no such proposed theory from serious parapsychologists like Bem. Your argument is a straw man. However, I asked you what made you believe that psi was impossible and you've answered that question, thank you. Now I ask you a further question in an effort to get to the heart of the matter: What made you believe that psi depended on time travel? Let's look at the first sentence of the abstract of Bem's paper which is identical to the first sentence of the text: Now let's look at the second sentence in the abstract: Again, you've shown that you do not understand what the word "psi" means or how it is distinct from the word "precognition".

Moreover, it seems that either you didn't read Bem's paper, or you didn't understand it. Psi is, for this paper and all other serious discussion, as generic as I've said. The rest of my analysis does hold. Again, this is a straw man. Yes, you are correct that there are underlying assumptions and theoretical frameworks from which parapsychologists must work in order to produce meaningful data.

However, these are not the same as formal scientific theory. They are not stated explicitly in parapsychology, nor would one expect them to be, for they are not stated explicitly in any field; that is why it is described as tacit knowledge and not just knowledge. The fact that tacit knowledge exists within the field of parapsychology is not an argument against the data produced by it.

That is why we use the term "psi", a place holder for an unexplained process. The fact that the data cannot be explained does not mean it is invalid. Look at, for example the discovery of X-Rays; Roentgen's X-Ray data shocked the world of physics and could not be explained for years but his data was valid. The absence of a formal scientific theory does not mean Bem's data is invalid. You keep referring to quantum physics as if Bem's data depends on it.

It does not depend on it and Bem has made no claim that it does. His discussion of quantum theory is simply the elucidation of leading theoretical approaches in the field. In scientific papers, such elucidation is common and lauded. For information to arrive from the future, it must effectively travel faster than light. For that information to affect judgements made by a person, that information must be in a format usable by the brain.

Bem's description is vague, but essentially contains both these premises; given the impossibility of time travel, as understood by modern physics, I'm allowed to start from the premise that psi (both precognition and premonition) is impossible. You started on quantum physics. I've just been pointing out why it doesn't help Bem. The issue of theory-dependence of data kicks in here: These come with the risk of Types I and II errors, which we suck up on the understanding that replication will identify when these have occurred.

Without a clear and useful mechanism, Bem cannot be sure he hasn't made a Type I error without replication, and the entire point of this particular post was to be grumpy at how those replications, which support the Type I error interpretation, are being blocked (that replication actually came out of someone's file drawer and doesn't count as a replication of Bem).

So his lack of a valid mechanism is critical, and the existence of informative failures to replicate is important. You have reined in your use of the word "impossible" and qualified it with the words "as understood by modern physics". This is the critical turn where we can see that, in fact, psi and precognition are not impossible but simply inconsistent with current physics.

Impossibility is not a premise. If you say that something is impossible, you have nowhere to go. You cannot "start" with something that might be disproved and call that thing "impossible". Again, it is clear that your use of the word "impossible" was an error and what you actually meant was just 'inconsistent with modern physics'.

As I noted in my previous comment, that is not the case. The history of the discovery of X-Rays shows that data can arise that appear ludicrous but which have meaning and validity.

That you believe Bem's data are ludicrous means nothing; it does not invalidate them or make them any less meaningful. You're still clinging desperately to the discussion of quantum physics as well. I don't know why you keep referring to it. Bem hasn't claimed any value due to quantum theory. The value in his data is not dependent on any particular theory, quantum or otherwise.

Bem doesn't need "help"; the data speaks for itself. Regardless, that's fine, be grumpy about journals. Just don't expect your views to be taken seriously when you use words like "impossible" and spout nonsense.

Imagine that Bem had invoked the ether in his paper. No one would begrudge me pointing out that modern science believes there is no such thing, and therefore it's not an entity you can use in your theorising.

Now remember that Bem has invoked time travelling information, and that modern science is currently full of evidence there can be no such thing. "No one would begrudge me pointing out that modern science believes there is no such thing." Actually, and this is quite ironic, I think you'll find there are modern ideas of an aether: Regardless, the problem is that you've confused your own personal beliefs that precognition is impossible with modern science's beliefs that precognition is only inconsistent with modern physics.

You're saying there can be no valid theories that contradict current scientific knowledge, even if there are data to support it. If that were the case, no new theories would ever emerge and science would stagnate. There are no ideas which are barred from theorising if the data allow. Regardless, Bem has not presented a theory. You can't rely on such an entity as part of a mechanism for a separate process.

You can, of course, conduct science about whether such an entity actually exists. Bem does the former; anyone wanting to rock the paradigmatic boat can do the latter. They are quite distinct and not contradictory, and have nothing to do with my personal preferences at all. This is the basis of various flavours of realism, mostly Hacking's entity realism which came up in Chapter 9 of Chemero's book which I blogged about recently.

I recommend you bone up on those, and maybe some work on how scientists actually go about their business (actually, Hacking's work might be useful for that too; I'll see if I can think of other examples). Also, this is a problem, not a reason he doesn't have to worry about these things. You are a fool. I don't think worry is necessary though as we have a method for solving the problem.

I don't think the Type I error problem is restricted to Bem's experiments on the paranormal of course. I blogged myself just today on one typical example in my mind from Psych Science here: Hi Neil I'm not much of a fan of Psych Science either, for the reasons you cite (e.g., this rubbish). That study you blogged about was terrible; Sabrina and I came up with about three things they could have done with that sort of experiment that might have been interesting in about 5 minutes, but of course they all required quite a bit of work and more space than Psych Science usually allows: Bigger journals like JPSP and the JEP series tend to be better, because they allow the space for detailed analyses and careful exposition of longer, more careful studies.

But replication is still a problem, both because it is not done and because it is not published when it is done. Although I largely agree with your comments, I don't think it's helpful to continually call the hypothesis and idea "ridiculous".

Theory-based research is all well and good, but let's get real, here - a large part of what experimental psychologists do is effect-chasing, and if they're lucky, the theory can get tagged on later. I would encourage any hypothesis that was truly falsifiable, because this allows us to ask questions of nature. How do we know that astrology is nonsense? Because we tested it. Do we need a coherent theory before we test it? Putting ideas to the test is what science is all about.

If we banned all hypotheses on grounds of "apparent ridiculousness" then we'd be slamming the door on some potentially groundbreaking findings. Bem's results are interesting. As the post states, this whole business raises many questions about the scientific process.

Compare the coverage that faster-than-light neutrinos got with the response Bem's paper gets. Both present an anomaly. What then happens once an anomaly is reported is the true test of a scientific process. A precise replication should have no degrees of freedom, because all the choices were already made by the original study.

If the effect being researched is real, then the results should still come out positive. If they were the result of exploiting the degrees of freedom, then they should vanish. There are also the other recognized benefits of replication. The most obvious is that any unrecognized quirky aspects of study execution or researcher biases should average out over multiple replications. For this reason it is critical for replications to be truly independent.

Another often missed reason why replications are important is simply to look at a fresh set of data. It is possible for a researcher, for example, to notice a trend in data that generates a hypothesis. That trend may have been entirely due to random clustering, however. If the data in which the trend was initially observed is used in a study, then the original random clustering can be carried forward, creating the false impression that the hypothesis is confirmed. Replication involves gathering an entirely new data set, so any prior random patterns would not carry forward.
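The argument about degrees of freedom can be illustrated with a small simulation. This is only a sketch under stated assumptions, not a model of any particular study: there is no true effect, the "original" researcher checks a z-test after every 10 participants and stops at the first p < .05, and the "precise replication" fixes the sample size in advance. The function names, sample sizes, and stopping rule are all hypothetical choices made for illustration.

```python
import math
import random


def two_sided_p(sample):
    """Two-sided z-test p-value for a mean of 0, assuming known unit variance."""
    z = sum(sample) / math.sqrt(len(sample))
    return math.erfc(abs(z) / math.sqrt(2))


def fixed_n_study(rng, n=100):
    """Replication-style study: sample size fixed in advance, one test at the end."""
    data = [rng.gauss(0.0, 1.0) for _ in range(n)]
    return two_sided_p(data) < 0.05


def peeking_study(rng, n_max=100, step=10):
    """Flexible study: test after every `step` participants, stop at first p < .05."""
    data = []
    while len(data) < n_max:
        data.extend(rng.gauss(0.0, 1.0) for _ in range(step))
        if two_sided_p(data) < 0.05:
            return True
    return False


def false_positive_rates(n_sims=2000, seed=1):
    """Return (flexible, fixed) false positive rates when the true effect is zero."""
    rng = random.Random(seed)
    flexible = sum(peeking_study(rng) for _ in range(n_sims)) / n_sims
    fixed = sum(fixed_n_study(rng) for _ in range(n_sims)) / n_sims
    return flexible, fixed


if __name__ == "__main__":
    flexible, fixed = false_positive_rates()
    print(f"false positives with optional stopping: {flexible:.3f}")
    print(f"false positives with fixed sample size: {fixed:.3f}")
```

Under these assumptions the flexible procedure produces "significant" results far more often than the nominal 5%, while the fixed design stays close to it. That is the sense in which a spurious effect obtained by exploiting degrees of freedom should vanish in a precise replication.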

Only if there is a real effect should the new data reflect the same pattern. Clearly that quote reflects the prevailing attitude among psi researchers of external skepticism of their claims and research.

Every skeptic who has voiced their opinion has likely been met with accusations of being dismissive and closed-minded. But this is a straw man. Skeptics are open to new discoveries, even paradigm-changing revolutionary ideas. Often I am asked specifically — what would it take to make me accept psi claims. I have given a very specific answer — one that applies to any extraordinary claim within medicine.

It would take research simultaneously displaying the following characteristics:. Most importantly, it would need to display all four of these characteristics simultaneously. Psi research, like most research into alternative medicine modalities like homeopathy and acupuncture, cannot do that, and that is why I remain skeptical. These are the same criteria that I apply to any claim in science.

In addition, I do think that prior probability should play a role — not in accepting or rejecting any claim a priori, but in setting the threshold for the amount and quality of evidence that will be convincing. This is reasonable — it would take more evidence to convince me that someone hit Bigfoot with their car than that they hit a deer with their car.

There is a word for someone who accepts the former claim with a low threshold of evidence. You can convince me that psi phenomena are real, but it would take evidence that is at least as solid as the evidence that implies that such phenomena are probably not possible. It is also important to recognize that the evidence for psi is so weak and of a nature that it is reasonable to conclude it is not real even without considering plausibility.
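The role of prior probability described above can be made concrete with Bayes' rule in odds form: posterior odds equal prior odds multiplied by the Bayes factor. The sketch below is only an illustration; the prior probabilities and the Bayes factor of 100 are hypothetical numbers chosen to show the mechanics, not estimates for psi or for any study discussed here.

```python
def posterior_probability(prior_probability, bayes_factor):
    """Update a prior probability with a Bayes factor (evidence for the claim)."""
    if not 0.0 < prior_probability < 1.0:
        raise ValueError("prior_probability must be strictly between 0 and 1")
    if bayes_factor < 0.0:
        raise ValueError("bayes_factor must be non-negative")
    prior_odds = prior_probability / (1.0 - prior_probability)
    posterior_odds = prior_odds * bayes_factor
    return posterior_odds / (1.0 + posterior_odds)


if __name__ == "__main__":
    # Hypothetical priors: a mundane claim versus an extraordinary one.
    for label, prior in [("deer", 0.1), ("Bigfoot", 1e-6)]:
        print(label, posterior_probability(prior, bayes_factor=100.0))
```

The same strength of evidence moves the mundane claim to a high posterior probability while leaving the extraordinary claim very unlikely, which is the threshold-setting role of plausibility argued for above.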

But it is probably not a coincidence that we consistently see either poor quality or negative research in areas that do have very low plausibility. The least important implication of the recent paper by Galak et al is that it provides further evidence against psi as a real phenomenon, and specifically against the claims of Daryl Bem. Psi is a highly implausible hypothesis that has already been sufficiently refuted, in my opinion, by prior research.

It therefore justifies initial skepticism toward any such data, especially when extraordinary claims are involved. The insights provided by this excellent paper reflect many of the points we have been making at SBM, and should be applied broadly and vigorously to alternative medicine claims. Novella also has produced two courses with The Great Courses, and published a book on critical thinking - also called The Skeptics Guide to the Universe.

This makes it easy to produce potentially false positive results and very hard to remove false positive results from the literature. The authors of the original article may claim that they do not care about effect sizes and that their theoretical claim was supported.

To avoid the problem that replication researchers have to invest large amounts of resources for little gain, it is important to realize that even a failure to replicate an original finding with the same sample size can undermine original claims and force researchers to provide stronger evidence for their original ideas in original articles. If they are right and the evidence is strong, others will be able to replicate the result in an exact replication study with the same sample size.

The main problem of Maxwell et al. They mention publication bias twice to warn readers that publication bias inflates effect sizes and biases power analyses, but they completely ignore the influence of publication bias on the credibility of successful original results Schimmack, ; Sterling; ; Sterling et al.

Ironically, Maxwell et al. This quote is not only an insult to Ritchie et al. First, Ritchie et al. There is nothing wrong with this statement, even if it is grounded in a healthy skepticism about supernatural abilities. More important, Maxwell et al. Given this wider context, it is entirely reasonable to favor the experimental artifact explanation over the alternative hypothesis that learning after an exam can still alter the exam outcome.

It is not clear why Maxwell et al. One reason why failed replication studies are so credible is that insiders know how incredible some original findings are.

The stark contrast between the apparent success rate and the true power to produce successful outcomes in original studies provided strong evidence that psychology is suffering from a replication crisis. This does not mean that all failed replications are false positives, but it does mean that it is not clear which findings are false positives and which findings are not. Whether this makes things better is a matter of opinion.

Publication bias also undermines the usefulness of meta-analysis for hypothesis testing. This result is meaningless because publication bias inflates effect sizes and the probability of obtaining a false positive result in the meta-analysis. Thus, when publication bias is present, unbiased replication studies provide the most credible evidence and the large number of replication failures means that more replication studies with larger samples are needed to see which hypotheses predict real effects with practical significance.
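How publication bias inflates a meta-analytic effect size can be sketched with a simulation. The assumptions here are deliberately extreme and hypothetical: the true effect is exactly zero, every study has 50 participants with unit-variance scores, and only studies with a significant positive result (z > 1.96) are published. The function names and parameter values are illustrative and do not describe any meta-analysis mentioned in this post.

```python
import math
import random


def simulate_literature(n_studies=2000, n_per_study=50, true_effect=0.0, seed=2):
    """Return (all observed effects, published effects) under selective publication."""
    rng = random.Random(seed)
    all_effects, published = [], []
    for _ in range(n_studies):
        sample = [rng.gauss(true_effect, 1.0) for _ in range(n_per_study)]
        mean = sum(sample) / n_per_study
        z = mean * math.sqrt(n_per_study)  # z-test with known unit variance
        all_effects.append(mean)
        if z > 1.96:  # only significant positive results reach the journals
            published.append(mean)
    return all_effects, published


def average(values):
    """Unweighted mean effect size, a crude stand-in for a meta-analytic estimate."""
    return sum(values) / len(values)


if __name__ == "__main__":
    all_effects, published = simulate_literature()
    print(f"studies run: {len(all_effects)}, published: {len(published)}")
    print(f"mean effect, all studies:       {average(all_effects):.3f}")
    print(f"mean effect, published studies: {average(published):.3f}")
```

Averaging every study recovers an effect near zero, whereas averaging only the published studies yields a clearly positive effect even though nothing real is there. An unbiased replication draws from the first distribution, not the second.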

The whole point of Maxwell et al. As I have pointed out, this conclusion is based on some misconceptions about the purpose of replication studies and by blissful ignorance about publication bias and questionable research practices that made it possible to publish successful replications of supernatural phenomena, while discrediting authors who spend time and resources on demonstrating that unbiased replication studies fail. The real answer to Maxwell et al.

In the end, Maxwell et al. As I have demonstrated, this is exactly the conclusion that readers should draw from failed replication studies, especially if (a) the original study was not preregistered, (b) the original study produced weak evidence e.

We can only speculate why the American Psychologist published a flawed and misleading article that gives original studies the benefit of the doubt and casts doubt on the value of replication studies when they fail. Fortunately, APA can no longer control what is published because scientists can avoid the censorship of peer-reviewed journals by publishing blogs and by criticizing peer-reviewed articles in open post-publication peer review on social media.

Is psychology suffering from a replication crisis? American Psychologist, 70, In , Daryl J. In an email exchange with Daryl Bem, I asked for some clarifications about the data, comments on the blog post, and permission to share the data. Bem granted me permission to share the data. He declined to comment on the blog post and did not provide an explanation for the decline effect. The undisclosed concoction of datasets is another questionable research practice that undermines the scientific integrity of significance tests reported in the original article.

At a minimum, Bem should issue a correction that explains how the nine datasets were created and what decision rules were used to stop data collection.

I find the enthusiasm explanation less plausible than you do. Given the lack of a plausible explanation for your data, I think JPSP should retract your article or at least issue an expression of concern because the published results are based on abnormally strong effect sizes in the beginning of each study.

As I pointed out in my article that you reviewed, this success makes it even more likely that some non-significant pilot studies were omitted.

Your success record is simply too good to be true Francis, Have you conducted any other studies since ? A non-significant result is overdue. Regarding the meta-analysis itself, most of these studies are severely underpowered and there is still evidence for publication bias after excluding your studies.
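The "too good to be true" reasoning rests on simple arithmetic: if each study has some probability (its power) of reaching significance, the chance that a whole series of independent studies all succeed is that probability raised to the number of studies. The sketch below shows only this core logic, not the full published test, and the power values are hypothetical assumptions rather than estimates for the studies under discussion.

```python
def prob_all_significant(power, n_studies):
    """Probability that n independent studies with equal power are all significant."""
    if not 0.0 <= power <= 1.0:
        raise ValueError("power must be between 0 and 1")
    if n_studies < 0:
        raise ValueError("n_studies must be non-negative")
    return power ** n_studies


if __name__ == "__main__":
    # Hypothetical power values for a series of nine studies.
    for power in (0.5, 0.6, 0.8):
        print(power, prob_all_significant(power, 9))
```

With an assumed power of 0.6, for example, nine significant results out of nine would be expected only about 1% of the time, which is why an unbroken run of successes invites questions about omitted non-significant studies.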

Any serious test of the hypothesis requires much larger sample sizes. However, the meta-analysis and the existence of ESP are not my concern. My concern is the way social psychologists have conducted research in the past and are responding to the replication crisis. We need to understand how researchers were able to produce seemingly convincing evidence like your 9 studies in JPSP that are difficult to replicate.

You are well aware that your article was published with reservations and concerns about the way social psychologists conducted research. You can make a real contribution to the history of psychology by contributing to the understanding of the research process that led to your results.

This is independent of any future tests of PSI with more rigorous studies. The hypothesis that strikes me as most plausible is that it is an experimenter effect whereby experimenters and their assistants begin with high expectations and enthusiasm but begin to get bored after conducting a lot of sessions.

This increasing lack of enthusiasm gets transmitted to the participants during the sessions. Now that I am retired and no longer have a laboratory with access to student assistants and participants, I, too, am shifting to online administration, so it will provide a rough test of this hypothesis. Were you planning to publish our latest exchange concerning the meta-analysis?

Thursday, January 25, I now started working on the meta-analysis. Can you please send me the original data for this study? I was not able to figure out how to leave a comment on your blog post at the website. I kept being asked to register a site of my own. So, I thought I would simply write you a note.

You are free to publish it as my response to your most recent post if you wish. But in the very first Table of our analysis, we presented the results for both the full sample of 90 studies and, separately, for the 69 replications conducted by independent researchers from 33 laboratories in 14 countries on 10, participants.

These 69 non-Bem-contaminated independent replications yielded a z score of 4. The Bayes Factor was 3. Of these 69 studies, 31 were exact replications in that the investigators used my computer programs for conducting the experiments, thereby controlling the stimuli, the number of trials, all event timings, and automatic data recording.

The data were also encrypted to ensure that no post-experiment manipulations were made on them by the experimenters or their assistants. My own data were similarly encrypted to prevent my own assistants from altering them. Both exact and modified replications were statistically significant and did not differ from one another. Both peer reviewed and non-peer reviewed replications were statistically significant and did not differ from one another.

Replications conducted prior to the publication of my own experiments and those conducted after their publication were each statistically significant and did not differ from one another. There was no evidence of p-hacking in the database, and the effect size for the non-Bem replications was 0.

This is also higher than the mean effect size of 0. For various reasons, you may not find our meta-analysis any more persuasive than my original publication, but your website followers might. Saturday, January 20, 6: I am sorry if you felt bothered by my emails, but I am confident that many psychologists are interested in your answers to my questions.

Saturday, January 20, 5: I hereby grant you permission to be the conduit for making my data available to those requesting them. At the moment, I am planning to follow up our meta-analysis of 90 experiments by setting up pre-registered studies. That seems to me to be the most profitable response to the methodological, statistical, and reporting critiques that have emerged since I conducted my original experiments more than a decade ago.

To respond to your most recent request, I am not planning at this time to write any commentary to your posts. I am happy to let replications settle the matter. Almost all of the participants in my studies at Cornell were unpaid volunteers taking psychology courses that offered or required participation in laboratory experiments.

Nor did I discard failed experiments or make decisions on the basis of the results obtained. What I did do was spend a lot of time and effort preparing and discarding early versions of written instructions, stimulus sets and timing procedures. These were pretested primarily on myself and my graduate assistants, who served repeatedly as pilot subjects.

Changes were not made on the basis of positive or negative results because we were only testing the procedures on ourselves. When I did decide to change a formal experiment after I had started it, I reported it explicitly in my article.

In several cases I wrote up the new trials as a modified replication of the prior experiment. In some cases the literature suggested that some parameters would be systematically related to the dependent variables in nonlinear fashion—e.

In that case, I incorporated the variable as a systematic independent variable. That is also reported in the article. It took you approximately 3 years to post your responses to my experiments after I sent you the data.

Understandable for a busy scholar. Saturday, January 20, , 1. I want to post my blog about Study 6 tomorrow. If you want to comment on it before I post it, please do so today. Monday, January 15, , Experiment 8, the first Retroactive Recall experiment was conducted in and its replication Experiment 9 was conducted in Monday, January 15, , 8. Thank you for your table. I think we are mostly in agreement (sorry if I confused you by calling studies datasets).

The numbers are supposed to correspond to the experiment numbers in your table. The only remaining inconsistency is that the datafile for study 8 shows year , while you have in your table. Retroactive Facilitation of Recall II. Monday, January 15, , 4. Here is my analysis of your Table. I will try to get to the rest of your commentary in the coming week.

Unless I have made a mistake in identifying them, I find agreement between us on most of the figures. You have listed the dates for both as , whereas my datafiles have listed for all participant sessions which describe the Precognitive Avoidance experiment and its replication. Perhaps I have misidentified the two Datasets.

The second discrepancy is that you have listed Dataset 8 as having participants, whereas I ran only 50 sessions with a revised method of selecting the negative stimulus for each trial. As noted in the article, this did not produce a significant difference in the size of the effect, so I included all sessions in the write-up of that experiment.

This permits readers to read the method sections along with the table. Perhaps it will also identify the discrepancy between our Tables. Precognitive Avoidance of Negative Stimuli 8? Monday, January 15, I am sorry to bother you with my requests.

It would be helpful if you could let me know if you are planning to respond to my questions and if so, when you will be able to do so? Saturday, January 13, 3. I put together a table that summarizes when studies were done and how they were combined into datasets. Saturday, January 13, 2. I wrote another blog post about Study 6. If you have any comments about this blog post or the earlier blog post, please let me know. Also, other researchers are interested in looking at the data and I still need to hear from you how to share the datafiles.

Friday, January 12, 7. Now that my question about Study 6 has been answered, I would like to hear your thoughts about my blog post. How do you explain the decline effect in your data; that is effect sizes decrease over the course of each experiment and when two experiments are combined into a single dataset, the decline effect seems to repeat at the beginning of the new study.

As I pointed out on my blog, I think there are two explanations see also Schooler, Either unpublished studies with negative results were omitted or measurement of PSI makes the effect disappear. What is probably most interesting is to know what you did when you encountered a promising pilot study. Did you then start collecting new data with this promising procedure or did you continue collecting data and retain the pilot data?
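The first of these explanations can be illustrated with a simulation showing how retaining promising pilot data, while dropping unpromising pilots, would by itself produce a decline effect. This is a sketch of the general mechanism only, under hypothetical assumptions (no true effect, pilots of 20 sessions, continuation to 100 sessions when the pilot mean exceeds 0.3 standard deviations); it is not a claim about how any dataset discussed here was actually collected.

```python
import random


def decline_after_promising_pilot(n_attempts=3000, pilot_n=20, total_n=100,
                                  pilot_cutoff=0.3, seed=3):
    """Return (mean early effect, mean late effect, number of continued studies).

    Every score is drawn from a distribution with a true effect of zero.
    """
    rng = random.Random(seed)
    early_means, late_means = [], []
    for _ in range(n_attempts):
        pilot = [rng.gauss(0.0, 1.0) for _ in range(pilot_n)]
        pilot_mean = sum(pilot) / pilot_n
        if pilot_mean <= pilot_cutoff:
            continue  # unpromising pilot is abandoned and never reported
        # Promising pilot: keep its data and collect the remaining sessions.
        rest = [rng.gauss(0.0, 1.0) for _ in range(total_n - pilot_n)]
        early_means.append(pilot_mean)
        late_means.append(sum(rest) / len(rest))
    n_kept = len(early_means)
    return sum(early_means) / n_kept, sum(late_means) / n_kept, n_kept


if __name__ == "__main__":
    early, late, kept = decline_after_promising_pilot()
    print(f"continued studies: {kept}")
    print(f"mean effect in retained pilot sessions: {early:.3f}")
    print(f"mean effect in later sessions:          {late:.3f}")
```

In the continued studies the early sessions show a sizeable effect because they were selected for it, while the later sessions hover around zero. Collecting entirely new data after a promising pilot would remove this pattern, which is why the question about pilot data matters.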

Friday, January 12, 2. You are also correct that the first 91 participants Spring semester of were exposed to 48 trials: We continued with that same protocol in the Fall semester of for 19 additional sessions, sessions At this point, it was becoming clear from post-session debriefings of participants that the erotic pictures from the International Affective Picture System (IAPS) were much too mild, especially for male participants. Recall that this was chronologically my first experiment and also the first one to use erotic materials.

The observation that mild erotic stimuli are insufficiently arousing, at least for college students, was later confirmed in our meta-analysis, which found that Wagenmakers' attempt to replicate my Experiment 1 Which of two curtains hides an erotic picture? In all my subsequent experiments with erotic materials, I used the stronger images and permitted participants to choose which kind of erotic images same-sex vs. For this reason, I decided to introduce more explicit erotic pictures into this attempted replication of the habituation protocol.

Finally, Sessions 40 sessions increased the number of trials to With the stronger erotic materials, we felt we needed to have relatively more neutral stimuli interspersed with the stronger erotic materials. Friday, January 12, I also figured out that the first 91 participants were exposed to 16 critical trials and participants 92 to were exposed to 30 critical trials. Can you please confirm this? Thursday, January 11, Wednesday, January 10, 5. This means your article reported two more significant results Study 6, Negative and Erotic than the data support.

This raises further concerns about the credibility of your published results, in addition to the decline effect that I found in your data except in Study 6, which also produced non-significant results. Do you still believe that your studies provided credible information about time-reversed causality or do you think that you may have capitalized on chance by conducting many pilot studies? Wednesday, January 10, 5: One Sample t-test data: Gender of participants matches.

Both retroactive habituation hypotheses were supported. On trials with negative picture pairs, participants preferred the target significantly more frequently than the nontarget, On trials with erotic picture pairs, participants preferred the target significantly less frequently than the nontarget, I will double check the datafiles that you sent me in against the one you are sending me now.

Wednesday, January 10, 4: Sorry for the delay. I have been busy re-programming my new experiments so they can be run online, requiring me to relearn the programming language. The confusion you have experienced arises because the data from Experiments 5 and 6 in my article were split differently for exposition purposes.

If you read the report of those two experiments in the article, you will see that Experiment 5 contained participants experiencing only negative and control stimuli. Experiment 6 contained participants who experienced negative, erotic, and control stimuli.

I started Experiment 5 my first precognitive experiment in the Spring semester of I ran the pre-planned sessions, using only negative and control stimuli. So after completing my sessions, I used what remained of the Spring semester to design and run a version of my own retroactive experiment that included erotic stimuli in addition to the negative and control stimuli.

I was able to run 50 sessions before the Spring semester ended, and I resumed that extended version of the experiment in the following Fall semester when student-subjects again became available until I had a total of sessions of this extended version. For purposes of analysis and exposition, I then divided the experiments as described in the article: No subjects or sessions have been added or omitted, just re-assembled to reflect the change in protocol. It contains all sessions ordered by dates.

The fields provided are: Saturday, January 6, Please reply as soon as possible to my email. Other researchers are interested in analyzing the data and if I submit my analyses some journals want me to provide data or an explanation why I cannot share the data. I hope to hear from you by the end of this week. Meanwhile I posted a blog post about your article. It has been well received by the scientific community.

I would like to encourage you to comment on it.