Monday, 14 November 2016

Is most published research wrong?

Earlier in the year I published an article on the blog entitled Let's start to take bad science seriously, which generated a lot of interest, as I hoped it would. It also increased awareness of the problem, which has been evident in papers submitted to Minerals Engineering, thanks also to the vigilance of our Associate Editor, Dr. Pablo Brito-Parada.
Now, to supplement this posting, here is a thought-provoking video, brought to my attention by Dr. Dee Bradshaw of the University of Cape Town. I shall not attempt to summarise it, but I would suggest that anyone interested in research publication and the scientific method take a close look at it and, hopefully, comment on its content.
Dr. Norman Lotter, of Flowsheets Metallurgical Consultants Inc., Canada, travels extensively giving lectures and workshops on sampling and statistics, and comments thus:
An interesting discussion. It took me to my bookshelf and "Statistics for Experimenters" by Box, Hunter and Hunter (1978), a book that I have used since the early eighties. Chapter 1, "Science and Statistics", offers a worthy discussion on the importance of experimental design and proper data analysis, and cautions as to the danger of incorrect data interpretation.

One point that the YouTube clip did not make was taught to me by Isobel Clark: "Make sure that you understand how the data are naturally distributed before you assume a Normal Distribution". Then tailor your approach accordingly.

In the field of flotation tests we are invariably dealing with small data sets rather than large ones, so one has to be cautious of the quirks and characteristics of these data.  A good example is the Bessel correction to the sample standard deviation to compensate for the underestimation of this parameter by small data sets.
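To put a number on the Bessel correction that Norm mentions, here is a minimal Python sketch (my own illustration; the sample size and number of trials are arbitrary choices). With small samples, dividing by n systematically understates the standard deviation, while dividing by n - 1 largely corrects this.

    import numpy as np

    rng = np.random.default_rng(1)
    true_sd = 1.0          # population standard deviation
    n = 5                  # a small sample, typical of repeat flotation tests
    trials = 100_000       # number of simulated small data sets

    samples = rng.normal(0.0, true_sd, size=(trials, n))
    sd_n = samples.std(axis=1, ddof=0)        # divide by n
    sd_bessel = samples.std(axis=1, ddof=1)   # divide by n - 1 (Bessel correction)

    print("mean SD, dividing by n    :", round(sd_n.mean(), 3))       # well below 1.0
    print("mean SD, dividing by n - 1:", round(sd_bessel.mean(), 3))  # much closer to 1.0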
 
Prof. Tim Napier-Munn, former JKMRC Director, is another member of the Minerals Engineering Editorial Board whom I turn to for advice on statistics and design of experiments, and I thoroughly recommend his textbook Statistical Methods for Minerals Engineers. Tim writes:
Yes, a nice video on interpreting the hypothesis test P-value. A well-known problem that has (rightly) been getting a lot of air time in recent years, mostly in connection with clinical trials and medical experiments. Some comments:
1. Despite the pitfalls, the use of a P-value in a hypothesis test is still massively better than guesswork, which is what our profession has often indulged in in the past. Looking at two unreplicated grade-recovery curves plotted according to Excel’s scaling rules and deciding by eye that they represent different metallurgy is no longer acceptable.
2. Fisher’s recommendation of P = 0.05 as a decision level was not ‘arbitrary’ as the video says, but a carefully considered compromise based on Fisher’s extensive experience at Rothamsted of designing and analysing agricultural experiments, which have a lot in common with mineral processing experiments: eg small samples and noisy data. I discuss his full quote in my stats course to explain the background to the choice of 0.05. And I also make the point that Fisher is dead, so we can choose whatever hurdle rate we like as long as we understand exactly what P implies, which many people don’t.
3. In my view (and I also emphasise this in the course and my book) decision-making using P-values should always be complemented by calculating and quoting the confidence limits on the effect found in the experiment, eg the improved recovery was 2% ± 1% with 95% confidence, and we are 95% confident that the improvement was at least 0.5% (the worst case scenario), as well as saying that because P = 0.03 then we are 97% confident that the improvement was not zero (we reject the null hypothesis with a 3% chance of being wrong in doing so). This idea has now been re-discovered and called ‘The New Statistics’ (and a book written about it) as a way of de-emphasising P-values. I believe that they should all be used together to get a full picture of the result.
4. As the video said, the key to a good experiment is power, ie enough repeats to achieve an acceptable chance of detecting an effect if it is really there. “n is king”.
5. People always underestimate the malign effect of random behaviour. To illustrate the point, in my book (page 117) I quote the example of an experiment in which repeat leach tests (it could just as well be flotation, to keep Dee and Norm happy!) are conducted to determine whether some change in conditions can increase extraction (recovery). Simple calculations show that if in truth there is no difference between the two conditions, then if the experimental error of the experiment is 1% (a low figure) there is still a probability of about 8% of getting a recovery difference as high as 2% by chance. If the experimental error is 4% (high but not rare) then the chance of getting false positives increases to 36%, ie over one third of experiments will produce a spurious improvement. This is why we need to have adequate sample sizes to minimise the chance of the wrong decision. [These figures are checked in the short sketch after these comments.]
6. Expectation bias (preferring results that comply with our prejudices) and the arbitrary removal of inconvenient data is still a problem in some cases.
7. Norm (Lotter) rightly makes the point about the nature of the data distribution. However, large samples can mitigate this effect thanks to the central limit theorem.
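As a quick check on the figures in Tim's point 5, and an example of the confidence-limit reporting he recommends in point 3, here is a minimal Python sketch. The 2% threshold and the 1% and 4% experimental errors are the values quoted above; the normality assumption and the replicated-trial numbers in the second part (mean improvement 2.0%, standard error 0.6%, 4 degrees of freedom) are hypothetical, chosen purely for illustration.

    import numpy as np
    from scipy import stats

    # Point 5: two single tests under identical true conditions, each recovery
    # measurement having standard deviation sigma (the "experimental error", %).
    for sigma in (1.0, 4.0):
        sd_diff = np.sqrt(2.0) * sigma                   # SD of the difference of two tests
        p_spurious = stats.norm.sf(2.0, scale=sd_diff)   # P(apparent improvement of 2% or more)
        print(f"sigma = {sigma}%: P(difference >= 2% by chance alone) = {p_spurious:.2f}")
    # gives ~0.08 for sigma = 1% and ~0.36 for sigma = 4%, the figures quoted in point 5

    # Point 3: quote confidence limits alongside P (hypothetical replicated trial).
    mean_impr, se, dof = 2.0, 0.6, 4
    lo, hi = stats.t.interval(0.95, dof, loc=mean_impr, scale=se)
    lower_95 = mean_impr - stats.t.ppf(0.95, dof) * se   # one-sided "worst case" limit
    p_two_sided = 2 * stats.t.sf(mean_impr / se, dof)
    print(f"improvement = {mean_impr}% (95% CI {lo:.1f}% to {hi:.1f}%), "
          f"at least {lower_95:.1f}% with 95% confidence, two-sided P = {p_two_sided:.3f}")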

So, Norman and Tim have kicked off the discussion. More views would be welcome.

11 comments:

  1. If all published research was right, there would not be a need for any more research, as over the years everyone would have buttoned down everything?

    Therefore one must assume that whatever they have said has the potential to be right, but should be treated with a good dose of scepticism?

    Replies
    1. Sorry Dave, but I cannot follow your argument. It appears that you are saying that if all research was carried out according to good practice then there would now be no need for any further research? Maybe you could clarify?

    2. That is exactly what I am saying. As science is an iteration, ideas and results regularly get flipped and often flopped, or even re-hashed. What I was trying to say was that if best practice were followed perfectly there would not be such a need to keep going. The answers would be achieved? The desire for the next round of funding etc will steer the conclusions and the direction of the results?

  2. Gents
    I sometimes wonder what page you are all on!
    You talk about statistical accuracy, understanding the data set you are dealing with etc.
    How do I reconcile that with the prescribed metallurgical testwork I see for flotation tests, which has no focus on understanding the fundamental flotation chemistry yet is used to define key recovery parameters for design and financial analysis, on projects that are considering investing billions?
    For example, on one project we reviewed, the reagents (collector and frother) had been defined as suitable and industry standard based on an unrelated project. That program failed to get reasonable recoveries or concentrate grades due to excessive mass recovery, i.e. gangue, and ridiculously long residence times. Taking the same samples to another laboratory and getting them to actually watch the flotation process, and add reagents in response to what the lab technician actually saw during the test, resulted in that project's outcome being significantly different: in this case, sea water flotation and a significant reduction in reagent addition. We don't actually do any statistical validation of the tests; we just get the reagents right, and the difference in performance is make or break for projects.
    Following this is the subsequent use of mathematical models to design the process plant, based on either poor or good results, and the tuning of flotation parameters to get desired outcomes rather than respecting the integrity of the results obtained.
    This, to me, is one of our greatest challenges, and if you look closely you will see that several of the major projects recently completed in Latin America are reaping the rewards of this approach and failing to achieve the desired metallurgical performance. Our industry, however, does not like to share these cases, and too often the source of the problem is never made public.

    Replies
    1. Stuart, what you say rings so true. I spent 4 years trying to sell statistically significant results we had completed to customers in South America, only to have them dismiss the results or show me how their results showed otherwise (usually based on a trivial number of data points). I see two hurdles which prevent the wider acceptance of statistical results in our industry: first, education in statistical analysis is very poor at university, and the majority of metallurgists/engineers/decision makers do not understand the concepts well enough to feel comfortable with the results; second, if the results do not agree with their belief or objective, then they can simply choose to ignore them. It is so easy to take work performed elsewhere, or determined using questionable methods, and put it into practice without question. This second reason also has a significant cost advantage, as it is cheap. While I was in Chile I had a lot of contact with the commercial laboratories there. These laboratories are well respected and their data accepted without question. My concern was that the laboratories were required to bid against each other for work, with the lowest bidder always winning the contract. As with all things, you get what you pay for, and these lowest bids were always achieved by cutting the number of repeats or samples ("n") first and foremost. Naturally this degrades the quality of the results generated. I do not blame the laboratories for this, as they are simply trying to survive in the world created by the mining companies themselves. Ultimately the blame falls back on the miners and their constant desire to sacrifice quality for cost savings (nothing new there).

      Michael Myllynen.

  3. Coincidentally, on the same theme, see Paul Coxon’s article in this month’s Materials World magazine.

  4. I was hoping for a meaty discussion; instead it is largely a rediscussion of basic hypothesis testing methods. I agree that the video is worth watching, although I would hope that it was targeted at either high school or mid-level university students rather than established academics.

    I recently presented at IMPS, Turkey, and in one of the earlier versions of my talk I said that in mineral processing culture a mineral processing engineer with 2nd-year stats is considered a guru. I ended up removing the comment; however, the point remains that mineral processors have not been able (as a community) to develop and utilise high-level quantitative skills. In some circles the 21st Century is being described as the 'mathematical age', as mathematical algorithms (and AI) become more commonplace in process optimisation.

    I have commented in other discussions that if mineral processors fail to develop high level quantitative skills then they will be removed from quantitative analysis of mineral processing data. That prediction has been proven correct. It is now commonplace to hear of Mining Companies establishing Data Centres with limited use of mineral processors. I am currently developing a research project with JKMRC which is addressing this important issue.

    So in answer to the opening question "Is most published research wrong?", I doubt it - particularly for research which takes hypothesis testing seriously. Journals in the biological domain are quite polarised: some demand clear evidence that a paper has been scrutinised by a statistician. However, there is a lot more to research than hypothesis testing. I would tend to question whether research (in the context of quantitative analysis in mineral processing) is very advanced. My answer would be "no".
    Stephen Gay, Principal MIDAS Tech International, Australia

    ReplyDelete
  5. The other point I found very pertinent in the video focused on the incentives for academics and for journals. I found that this point relates to your title more than the discussion of statistical metrics. The fact that journals are more likely to publish positive, unexpected or extraordinary results, and that researchers’ incentives are so strongly tied to journals' preferences, means that the p-value cannot be taken at face value. The survivorship bias for a paper, having been filtered through the 'tests-likely-to-be-done' and 'results-likely-to-be-published' filters, means that the p-value will always show an optimistic value, even when no p-hacking has been done.

    Ultimately, you can't complain about people doing what they're incentivised to do, especially when stakes are high and fields are competitive. Either you must be a force for change regarding the incentives, or you must update your priors to anticipate human nature.
    Charles Bradshaw
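    To put a rough number on this selection effect, here is a minimal simulation (my own sketch, not from the video, with arbitrary sample sizes): every experiment below tests a change that in truth does nothing, only the 'significant' results get published, and the published effect sizes are then systematically inflated.

      import numpy as np
      from scipy import stats

      rng = np.random.default_rng(42)

      # 10,000 experiments, each with n repeat tests of a change that in truth
      # does nothing (true effect = 0, unit experimental error).
      n, experiments = 5, 10_000
      data = rng.normal(0.0, 1.0, size=(experiments, n))
      t_stat, p_val = stats.ttest_1samp(data, popmean=0.0, axis=1)

      published = p_val < 0.05       # only "significant" results get written up
      effects = data.mean(axis=1)    # the apparent improvement in each experiment
      print(f"fraction published: {published.mean():.3f}")   # ~0.05, by construction
      print(f"mean |effect|, all experiments: {np.abs(effects).mean():.2f}")
      print(f"mean |effect|, published only : {np.abs(effects[published]).mean():.2f}")  # much larger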

  6. I feel the premise in the title, that most published research may be wrong, is possibly a bit too negative (but then negative sells better than positive sometimes, as 8 Nov 16 proves). That said, understanding the principles of proper design of experiments and how to analyze the results is a complicated process, especially when dealing with situations where a minor change can have large consequences. I strongly agree with Prof. Napier-Munn that random behavior and conditions easily skew your results.

    This is especially true in mineral processing, where we deal with large quantities of material (ore and water) and relatively small quantities (in relation) of reagents. In addition, the main components (ore and water) are not constants but change in proportion and characteristics, sometimes quickly.

    Just going out and running experiments can generate a large mass of data that is difficult to manage, and may have unintended bias and systematic and random errors.

    Mike Albrecht, Roberts Companies, USA

  7. I find it interesting that in all this discussion the measure of success is recovery.

    Recovery is a calculated value, derived from assaying and weighing samples, and both have inherent error.

    But how often do we see tests compared without a valid statistical comparison, where the back-calculated heads are different and yet a conclusion is drawn that one test gave superior results?
    - Assays: often the actual assaying methods for concentrate and tailings are different. For gold and silver, fire assay with a gravimetric finish for concentrates and multi-acid ICP for tailings, each having different detection increments.
    - Similarly for weighing products: tailings to whole grams and concentrates to maybe two decimal places.

    For flotation tests, the real measure of success is increased NSR (net smelter return), not recovery.

    A few thoughts to generate responses.
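    To illustrate how assay error alone feeds through into the calculated recovery, here is a minimal Monte Carlo sketch based on the standard two-product recovery formula; the gold grades and relative assay errors below are invented purely for illustration.

      import numpy as np

      rng = np.random.default_rng(0)

      def recovery(f, c, t):
          """Two-product recovery (%) from feed, concentrate and tailing grades."""
          return 100.0 * c * (f - t) / (f * (c - t))

      # Hypothetical gold grades (g/t) and relative assay errors; the concentrate
      # is assayed more precisely than the tailing, as noted above.
      f, c, t = 3.0, 60.0, 0.30
      trials = 100_000
      f_meas = f * (1 + rng.normal(0, 0.05, trials))   # 5% relative error on feed
      c_meas = c * (1 + rng.normal(0, 0.02, trials))   # 2% on concentrate (fire assay)
      t_meas = t * (1 + rng.normal(0, 0.10, trials))   # 10% on tailing (near detection limit)

      rec = recovery(f_meas, c_meas, t_meas)
      print(f"nominal recovery = {recovery(f, c, t):.1f}%")
      print(f"spread from assay error alone: SD = {rec.std():.1f} percentage points")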

  8. Wow what an interesting video and thanks much to Barry and Dee for sharing!

    In my experience I think our field is rife with similar problems; it might even be more common in mineral processing given the relatively few large research institutions compared to, say, sociology or medicine. Question for Barry: how many large research institutes (approx.) are represented among the editors of Minerals Engineering or other journals?

    I think the question is pertinent because, if a paper is submitted that refutes an orthodox theory (examples: Bond's third theory related to crack tip propagation, the Amira P9 flotation model, or the mechanism for xanthate activation of chalcopyrite), is the reviewer more likely to accept it or reject it based on the leanings of the research output of the particular institute in which he or she was "indoctrinated"? In the video they mention the fictional pentaquark, which in the space of two years was the topic of ~1000 publications. So if Joe Researcher then submits a paper disproving the pentaquark, what would be the likelihood that a reviewer rejects that paper? Would the likelihood change if the reviewer works for, worked for, or studied with an institute dedicated to pentaquarks? I would hope it wouldn't, but I suspect the answer depends on how many papers have already been published, rather than how good the research paper really is.

    Some of the comments above, I think, relate to engineering decisions (test interpretation, scale-up, and optimization), and I think engineering is always subject to time and budgetary constraints; when you don't have enough time or money, the easiest thing to drop is rigor. The 80/20 rule, or something like that, and I'm okay with that in some cases. It isn't always necessary to prove something out to five sigma to make an operational or engineering decision.

    Lastly, I'd suggest that the headline be moderated somewhat: maybe "most research conclusions are false" rather than "most research is false". I have sometimes arrived at very different conclusions than my colleagues when interpreting the same data.


If you have difficulty posting a comment, please email the comment to bwills@min-eng.com and I will submit it on your behalf.