Monday 14 November 2016

Is most published research wrong?

Earlier in the year I published an article on the blog entitled Let's start to take bad science seriously, which generated a lot of interest, as I hoped it would. It has also increased awareness of the problem, which has been evident in papers submitted to Minerals Engineering, thanks in part to the vigilance of our Associate Editor, Dr. Pablo Brito-Parada.
Now, to supplement that posting, here is a thought-provoking video, brought to my attention by Dr. Dee Bradshaw of the University of Cape Town. I shall not attempt to summarise it, but I would suggest that anyone interested in research publication and the scientific method takes a close look at it, and hopefully comments on its content.
Dr. Norman Lotter, of Flowsheets Metallurgical Consultants Inc., Canada, travels extensively giving lectures and workshops on sampling and statistics, and comments thus:
An interesting discussion. It took me to my bookshelf and "Statistics for Experimenters" by Box and Hunter (1979), a book that I have used since the early eighties. Chapter 1, "Science and Statistics" offers a worthy discussion on the importance of experimental design and proper data analysis, and cautions as to the danger of incorrect data interpretation.

One point that the YouTube clip did not make was taught to me by Isobel Clark: "Make sure that you understand how the data are naturally distributed before you assume a Normal Distribution". Then tailor your approach accordingly.

In the field of flotation tests we are invariably dealing with small data sets rather than large ones, so one has to be cautious of the quirks and characteristics of these data. A good example is Bessel's correction to the sample standard deviation, which compensates for the underestimation of this parameter by small data sets.
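As a minimal illustration (the recovery figures and variable names here are purely hypothetical), Bessel's correction simply divides the sum of squared deviations by n - 1 rather than n:

    # Hypothetical repeat flotation recoveries (%)
    from math import sqrt

    recoveries = [86.2, 84.9, 85.7, 86.8, 85.1]
    n = len(recoveries)
    mean = sum(recoveries) / n
    ss = sum((x - mean) ** 2 for x in recoveries)

    sd_population = sqrt(ss / n)        # divides by n: biased low for small samples
    sd_sample = sqrt(ss / (n - 1))      # Bessel's correction: divides by n - 1

    print(f"n = {n}: divide by n -> {sd_population:.3f}, divide by n-1 -> {sd_sample:.3f}")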
 
Prof. Tim Napier-Munn, former JKMRC Director, is another member of the Minerals Engineering Editorial Board whom I turn to for advice on statistics and design of experiments, and I thoroughly recommend his textbook Statistical Methods for Minerals Engineers. Tim writes:
Yes, a nice video on interpreting the hypothesis test P-value. A well-known problem that has (rightly) been getting a lot of air time in recent years, mostly in connection with clinical trials and medical experiments. Some comments:
1. Despite the pitfalls, the use of a P-value in a hypothesis test is still massively better than the guesswork which our profession has often indulged in in the past. Looking at two unreplicated grade-recovery curves plotted according to Excel’s scaling rules and deciding by eye that they represent different metallurgy is no longer acceptable.
2. Fisher’s recommendation of P = 0.05 as a decision level was not ‘arbitrary’ as the video says, but a carefully considered compromise based on Fisher’s extensive experience at Rothamsted of designing and analysing agricultural experiments, which have a lot in common with mineral processing experiments: eg small samples and noisy data. I discuss his full quote in my stats course to explain the background to the choice of 0.05. And I also make the point that Fisher is dead, so we can choose whatever hurdle rate we like as long as we understand exactly what P implies, which many people don’t.
3. In my view (and I also emphasise this in the course and my book) decision-making using P-values should always be complemented by calculating and quoting the confidence limits on the effect found in the experiment, eg the improved recovery was 2% ± 1% with 95% confidence, and we are 95% confident that the improvement was at least 0.5% (the worst case scenario). That is in addition to saying that, because P = 0.03, we are 97% confident that the improvement was not zero (we reject the null hypothesis with a 3% chance of being wrong in doing so). This idea has now been re-discovered and called ‘The New Statistics’ (and a book written about it) as a way of de-emphasising P-values. I believe that they should all be used together to get a full picture of the result. [A sketch of this style of reporting follows the list below.]
4. As the video said, the key to a good experiment is power, ie enough repeats to achieve an acceptable chance of detecting an effect if it is really there. “n is king”.
5. People always underestimate the malign effect of random behaviour. To illustrate the point, in my book (page 117) I quote the example of an experiment in which repeat leach tests (it could just as well be flotation, to keep Dee and Norm happy!) are conducted to determine whether some change in conditions can increase extraction (recovery). Simple calculations show that if in truth there is no difference between the two conditions, then if the experimental error of the experiment is 1% (a low figure) there is still a probability of about 8% of getting a recovery difference as high as 2% by chance. If the experimental error is 4% (high but not rare) then the chance of getting false positives increases to 36%, ie over one third of experiments will produce a spurious improvement. This is why we need adequate sample sizes to minimise the chance of a wrong decision. [A sketch reproducing these figures follows the list below.]
6. Expectation bias (preferring results that comply with our prejudices) and the arbitrary removal of inconvenient data are still problems in some cases.
7. Norm (Lotter) rightly makes the point about the nature of the data distribution. However, large samples can mitigate this effect thanks to the central limit theorem. [Also illustrated in a sketch below.]
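To illustrate Tim's point 3, here is a minimal Python sketch, with entirely hypothetical paired recoveries, of reporting the confidence limits and the worst case alongside the P-value:

    import numpy as np
    from scipy import stats

    baseline = np.array([85.1, 84.3, 86.0, 85.5, 84.8, 85.9])   # hypothetical repeats (%)
    modified = np.array([87.2, 86.1, 88.0, 87.5, 86.4, 87.9])

    diff = modified - baseline
    n = len(diff)
    mean_diff = diff.mean()
    se = diff.std(ddof=1) / np.sqrt(n)      # ddof=1 applies Bessel's correction

    t_crit = stats.t.ppf(0.975, df=n - 1)   # two-sided 95% limits on the improvement
    lo, hi = mean_diff - t_crit * se, mean_diff + t_crit * se

    t_one = stats.t.ppf(0.95, df=n - 1)     # one-sided 95% "worst case" lower bound
    worst_case = mean_diff - t_one * se

    t_stat, p_value = stats.ttest_rel(modified, baseline)   # paired t-test

    print(f"Improvement: {mean_diff:.2f}% (95% CI {lo:.2f} to {hi:.2f}%)")
    print(f"Worst case (95% one-sided lower bound): {worst_case:.2f}%")
    print(f"P-value: {p_value:.4f}")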
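A second sketch reproduces the figures in Tim's point 5, assuming that a single test result carries a normally distributed error, and shows the "n is king" message of point 4: replication shrinks the chance of a spurious improvement (the function name and numbers are illustrative only):

    from math import erfc, sqrt

    def chance_of_false_improvement(sigma, delta=2.0, n=1):
        # One-sided chance that the difference between the means of n repeats per
        # condition, each test having standard deviation sigma (%), exceeds delta (%)
        # purely by chance when the two conditions are in truth identical.
        sd_diff = sigma * sqrt(2.0 / n)      # standard deviation of (mean A - mean B)
        z = delta / sd_diff
        return 0.5 * erfc(z / sqrt(2))       # upper-tail area of a standard normal

    for sigma in (1.0, 4.0):
        for n in (1, 4):
            p = chance_of_false_improvement(sigma, n=n)
            print(f"error {sigma}%, {n} repeat(s) per condition: "
                  f"{100 * p:.0f}% chance of a spurious 2% gain")

With single tests this gives roughly 8% for a 1% experimental error and 36% for a 4% error, matching the figures quoted above; four repeats per condition cut both figures sharply.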
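Finally, a small sketch of point 7: individual values drawn from a strongly skewed (lognormal) distribution are far from normal, but means of even modest samples are much closer to normal, which is what lets t-based methods cope:

    import numpy as np

    rng = np.random.default_rng(1)
    single_values = rng.lognormal(mean=0.0, sigma=1.0, size=100_000)
    means_of_30 = rng.lognormal(mean=0.0, sigma=1.0, size=(100_000, 30)).mean(axis=1)

    for name, x in (("single values", single_values), ("means of 30", means_of_30)):
        skew = ((x - x.mean()) ** 3).mean() / x.std() ** 3   # sample skewness
        print(f"{name:13s}: skewness = {skew:.2f}")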

So, Norman and Tim have kicked off the discussion. More views would be welcome.

11 comments:

  1. If all published research was right, there would not be a need for any more research, as over the years everyone would have buttoned down everything?

    Therefore one must assume that whatever they have said has the potential to be right, but with a good dose of scepticism?

    Replies
    1. Sorry Dave, but I cannot follow your argument. It appears that you are saying that if all research was carried out according to good practice then there would now be no need for any further research? Maybe you could clarify?

    2. That is exactly what I am saying. As science is an iteration, ideas and results regularly get flipped and often flopped, or even re-hashed. What I was trying to say was that if best practice were followed perfectly there would not be such a need to keep going; the answers would be achieved? The desire for the next round of funding, etc., will steer the conclusions and the direction of the results?

  2. Gents
    I sometimes wonder what page you are all on!
    You talk about statistical accuracy, understanding the data set you are dealing with, etc.
    How do I reconcile that with prescribed metallurgical testwork for flotation that has no focus on understanding the fundamental flotation chemistry, yet is used to define key recovery parameters for the design and financial analysis of projects that are considering investing billions?
    For example, on one project we reviewed, the reagents (collector and frother) were defined, based on an unrelated project, as being suitable and industry standard. That program failed to get reasonable recoveries or concentrate grades due to excessive mass recovery, i.e. gangue, and ridiculously long residence times. Taking the same samples to another laboratory and getting them to actually watch the flotation process, and to add reagents in response to what the lab technician actually saw during the test, resulted in that project's outcome being significantly different: in this case, sea water flotation and a significant reduction in reagent addition. We don't actually do any statistical validation of the tests; we just get the reagents right, and the difference in performance is make or break for projects.
    Following this is the subsequent use of mathematical models to design the process plant, based on either poor or good results, and the tuning of flotation parameters to get desired outcomes rather than respecting the integrity of the results obtained.
    This, to me, is one of our greatest challenges, and if you look closely you will see that several of the major projects recently completed in Latin America are reaping the rewards of this approach and failing to achieve the desired metallurgical performance. Our industry, however, does not like to share these failures, and too often the source of the problem is never made public.

    Replies
    1. Stuart, what you say rings so true. I spent four years trying to sell statistically significant results we had completed to customers in South America, only to have them dismiss the results or show me how their results showed otherwise (usually based on a trivial number of data points). I see two hurdles which prevent the wider acceptance of statistical results in our industry: first, education in statistical analysis is very poor at university, and the majority of metallurgists/engineers/decision makers do not understand the concepts well enough to feel comfortable with the results; second, if the results do not agree with their belief or objective, then they can simply choose to ignore them. It is so easy to take work performed elsewhere, or determined using questionable methods, and put it into practice without question. This second reason also has a significant cost advantage, as it is cheap.

      While I was in Chile I had a lot of contact with the commercial laboratories there. These laboratories are well respected and their data accepted without question. My concern was that the laboratories were required to bid against each other for work, with the lowest bidder always winning the contract. As with all things, you get what you pay for, and these lowest bids were always achieved by cutting the number of repeats or samples ("n") first and foremost. Naturally this degrades the quality of the results generated. I do not blame the laboratories for this, as they are simply trying to survive in the world created by the mining companies themselves. Ultimately the blame falls back on the miners and their constant desire to sacrifice quality for cost savings (nothing new there).

      Michael Myllynen.

  3. Coincidentally, on the same theme, see Paul Coxon’s article in this month’s Materials World magazine.

  4. I was hoping for a meaty discussion; instead it is largely a rediscussion of basic hypothesis testing methods. I agree that the video is worth watching, although I would hope that it was targeted at high school or mid-level university students rather than established academics.

    I recently presented at IMPS, Turkey, and in an earlier version of my talk I said that in mineral processing culture a mineral processing engineer with 2nd-year stats is considered a guru. I ended up removing the comment; however, the point remains that mineral processors have not been able (as a community) to develop and utilise high-level quantitative skills. In some circles the 21st century is being described as the 'mathematical age', as mathematical algorithms (and AI) become more commonplace in process optimisation.

    I have commented in other discussions that if mineral processors fail to develop high-level quantitative skills then they will be removed from quantitative analysis of mineral processing data. That prediction has been proven correct: it is now commonplace to hear of mining companies establishing data centres with limited involvement of mineral processors. I am currently developing a research project with the JKMRC which addresses this important issue.

    So, in answer to the opening question "Is most published research wrong?", I doubt it, particularly for those papers which take hypothesis testing seriously. Journals in the biological domain are quite polarised: some demand clear evidence that a paper has been scrutinised by a statistician. However, there is a lot more to research than hypothesis testing. I would tend to question whether research (in the context of quantitative analysis in mineral processing) is very advanced. My answer would be "no".
    Stephen Gay, Principal MIDAS Tech International, Australia

  5. The other point I found very pertinent in the video focused on the incentives for academics and for journals. I found that this point relates to your title more than the discussion on statistical metrics. The fact that journals are more likely to publish positive, unexpected/extraordinary results, and that researchers’ incentives are so strongly tied to journals' preferences, means that the P-value cannot be taken at face value. The survivorship bias for a paper, having been filtered through the 'tests-likely-to-be-done' and 'results-likely-to-be-published' filters, means that the P-value will always look optimistic, even when no p-hacking has been done.
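    As a rough illustration (the prior, power and threshold below are made-up but in the spirit of the video): suppose only 10% of the ideas tested in a field are actually true, tests have 80% power, and the significance threshold is 0.05. Even with every individual P-value honestly reported, a large share of the published positives are wrong:

        def false_finding_rate(prior_true=0.10, power=0.80, alpha=0.05):
            # Of the results that come out "significant", what fraction are false?
            true_positives = prior_true * power
            false_positives = (1 - prior_true) * alpha
            return false_positives / (true_positives + false_positives)

        print(f"Share of 'positive' findings that are wrong: {false_finding_rate():.0%}")

    Under these assumptions roughly a third of the positive literature is spurious, before any p-hacking at all.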

    Ultimately, you can't complain about people doing what they're incentivised to do, especially when stakes are high and fields are competitive. Either you must be a force for change regarding the incentives, or you must update your priors to anticipate human nature.
    Charles Bradshaw

  6. I feel the premise in the title, that most published research may be wrong, is possibly a bit too negative (but then negatives sometimes sell better than positives, as 8 Nov 16 proves). That said, understanding the principles of proper design of experiments and how to analyze the results is a complicated process, especially when dealing with situations where a minor change can have large consequences. I strongly agree with Prof. Napier-Munn that random behavior and conditions easily skew your results.

    This is especially true in mineral processing, where we deal with large quantities of material (ore and water) and relatively small quantities (in relation) of reagents. In addition, the main components (ore and water) are not constants but change in proportion and characteristics, sometimes quickly.

    Just going out and running experiments can generate a large mass of data that is difficult to manage, and may have unintended bias and systematic and random errors.

    Mike Albrecht, Roberts Companies, USA

  7. I find it interesting that in all this discussion the measure of success is recovery.

    Recovery is a calculated value, derived from assaying and weighing samples, and both the assays and the weights have inherent error.

    But how often do we see a valid statistical comparison of tests when the back-calculated heads are different, and yet a conclusion is drawn that one test gave superior results?
    - Assays, for example: often the actual assaying methods for concentrate and tailings are different. For gold and silver, fire assay gravimetric for concentrates and multi-acid ICP for tailings, each having different detection increments.
    - Similarly for weighing products: tailings to whole grams and concentrates to maybe two decimal places.
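    As a rough sketch of the effect (made-up grades and assumed error levels), propagating random assay errors through the two-product recovery formula shows how much the calculated recovery can move before any real metallurgical change is made:

        import numpy as np

        rng = np.random.default_rng(0)
        f, c, t = 1.2, 25.0, 0.15                      # "true" head, concentrate and tailings grades (% metal)
        rel_err = {"f": 0.03, "c": 0.02, "t": 0.10}    # assumed relative assay errors (1 s.d.)

        N = 100_000
        fs = f * (1 + rel_err["f"] * rng.standard_normal(N))
        cs = c * (1 + rel_err["c"] * rng.standard_normal(N))
        ts = t * (1 + rel_err["t"] * rng.standard_normal(N))

        R = 100 * cs * (fs - ts) / (fs * (cs - ts))    # two-product recovery (%)
        print(f"True recovery: {100 * c * (f - t) / (f * (c - t)):.1f}%")
        print(f"Spread from assay error alone: mean {R.mean():.1f}%, s.d. {R.std():.2f}%")

    If the scatter from assaying alone is comparable to the 'improvement' claimed between two tests, the comparison proves nothing on its own.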

    For flotation tests, the real measure of success is increased NSR, not recovery.

    A few thoughts to generate responses.

  8. Wow, what an interesting video, and thanks much to Barry and Dee for sharing!

    In my experience our field is rife with similar problems; it might even be more common in mineral processing, given the relatively few large research institutions compared to, say, sociology or medicine. Question for Barry: how many large research institutes (approx.) are represented among the editors of Minerals Engineering or other journals?

    I think the question is pertinent because, if a paper is submitted that refutes an orthodox theory (examples: Bond's third theory related to crack tip propagation, the Amira P9 flotation model, or the mechanism for xanthate activation of chalcopyrite), is the reviewer more likely to accept it or reject it based on the leanings of the research output of the particular institute in which he or she was "indoctrinated"? In the video they mention the fictional pentaquark, which in the space of two years became the topic of ~1000 publications. So if Joe Researcher then submits a paper disproving the pentaquark, what would be the likelihood that a reviewer rejects that paper? Would the likelihood change if the reviewer works for, worked for, or studied with an institute dedicated to pentaquarks? I would hope it wouldn't, but I suspect the answer depends on how many papers have already been published, rather than on how good the research paper really is.

    Some of the comments above, I think, relate to engineering decisions (test interpretation, scale-up, and optimization), and engineering is always subject to time and budgetary constraints; when you don't have enough time or money, the easiest thing to drop is rigor. The 80/20 rule, or something like that, and I'm okay with that in some cases. It isn't always necessary to prove something out to five sigma to make an operational or engineering decision.

    Lastly, I'd suggest that the headline be moderated somewhat: maybe "most research conclusions are false" rather than "most research is false". I have sometimes arrived at very different conclusions from my colleagues when interpreting the same data.


If you have difficulty posting a comment, please email it to bwills@min-eng.com and I will submit it on your behalf.