As I work through the preliminary courses that I need before I begin studying machine learning, I am learning a significant amount about scientific study design and the analysis of research data. As a quick reminder, and out of gratitude for the service, I wish to mention once more that I am taking all of these courses via the free, totally online Khan Academy. If someone wants formal recognition for these courses, there is such an option. In my specific case, I just want to know the material.

A couple of days ago, I was reading a summary concerning the NFL-sponsored research about concussions. Apparently, the New York Times reported that a series of 13 articles published in the respected journal Neurosurgery were significantly flawed. Considering what we know now about the long-term risks of repeat concussions, with a special concern over children who begin playing football at a very young age, the fact that such research was overtly manipulated is criminal.

Apparently, there were glaring gaps in the collected data, some lasting years at a time. The obvious question is how such flaws could be missed by the very people who are entrusted with the review of scientific articles. A comment on this entire matter was submitted by a senior physician, and I include it below:

Shame on you.
Researchers lying due to publication pressure and money.
Plain and simple. Between … the American health care system incl. NEJM, we deserve zero intelligent respect.
They learned from Big Tobacco and of course the unmentioned King of All, the shining gem: Big Pharma. They pay only certain researchers and manipulate study design and if that doesn’t work: the data itself. The “Ethical” Pharmaceutical industry. How do you sleep at night? We blow away the world in ludicrous health care costs and inexcusable research. Shame. No other word.

Many years ago, I had a business, which I greatly enjoyed, preparing physical slides for various researchers here in Israel. This was before the time when computerized projectors were affordable, or even available. I bought a special printer that would record onto film the presentations I had prepared in a program called Harvard Graphics, which was at the time far better than PowerPoint, IMHO.

Because I was a physician myself, I understood the material that was being brought to me. Many times I would invest extra time in the graphical preparation of the slides, not for an extra fee, but for the personal pride in being able to present information in a clear and even entertaining way. I remember one specific slide that I literally spent hours on, intended to show the functioning of a particular receptor within a cell's membrane. I received very positive feedback on that particular slide, and I learned a great deal from preparing it. When you can make a living enjoying what you do, learning from it and contributing back, I personally don’t think it gets any better than that.

In one meeting with a very senior professor, at that time one of the most respected doctors in the country, I showed him a series of slides with the data he gave me. At one point, he was viewing a graph that was intended to show a correlation between treatment and outcomes. The farthest point on the graph, representing the highest dose treatment for the given outcome, was low, in relation to the other points on the graph. In other words, from his data, it appeared as if there was a leveling off of the response to the treatment, as dose increased.
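To make the shape of that graph concrete, here is a minimal sketch in Python. The doses and responses are numbers invented purely for illustration (they are not the professor's data), and `fit_line` is just a hypothetical helper. It fits a straight line through the lower doses and then shows how far the highest-dose point falls below where that line says it "should" be.

```python
# Hypothetical dose/response values, invented purely for illustration.
# They are NOT the data from the study described in this post.
doses = [1.0, 2.0, 3.0, 4.0, 5.0]
responses = [10.0, 20.0, 30.0, 40.0, 42.0]  # the last point levels off


def fit_line(xs, ys):
    """Ordinary least-squares fit; returns (slope, intercept)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


# Fit the line using only the lower doses, then check where the
# highest dose would fall if the trend really were straight.
slope, intercept = fit_line(doses[:-1], responses[:-1])
predicted_last = slope * doses[-1] + intercept
shortfall = predicted_last - responses[-1]

print(f"straight line predicts {predicted_last:.1f}")
print(f"measured value is      {responses[-1]:.1f}")
print(f"shortfall              {shortfall:.1f}")
```

The shortfall is the interesting part of the story the data is telling. A point like that is a finding to be reported and explained, not an inconvenience to be erased.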

The professor pointed to this dot of data, and asked me to move it. Initially, I didn’t understand the request. He repeated that he wanted me to move this specific point of information, so that it created a straight line with the other data. If I still have not made myself clear, I apologize. What this professor was asking me to do was to blatantly falsify the data that he had given me. I was not involved in the research. My name was in no way associated with the actual study. I was just preparing a set of slides. Nevertheless, I was appalled. I looked at the professor and simply stated “you can’t do that”. Foolishly, I thought that my declaration would be enough to jumble his brain matter back into its proper location. But he retorted immediately that what he was asking was acceptable and consistent with his research.

I was left with an ethical dilemma. I was witness to a gross violation of research standards. Falsifying data is as bad as it gets. We are not talking about playing with statistics until you get a straight line fit of your data, which looks great on the slide but has absolutely no clinical significance. In this case, we were talking about creating ex nihilo a new data point to satisfy a predetermined conclusion. Did I follow this up? No. I could have complained to the university and shown them the original data and slides. But I strongly suspected that this would be swept under the rug, and I justified to myself that no further action was worthwhile. Perhaps my chronic pain is G-d’s way of reminding me that there is no such thing as a minor ethical gaffe.

In the field of research, the term bias refers to anything that could possibly alter the recorded data, giving it an apparent correlation that does not actually exist. For example, when pharmaceutical companies run major studies on new medications, one of the things that is absolutely critical is to find groups of patients that reflect the real life variety of human genomes. If a pharmaceutical study is done that excludes all individuals of, for example, African-American descent, it is very difficult and very dangerous to make conclusions about this population subset, from the research done.

It could very well be that the medication being tested works very well, with few side effects, on the Caucasians, Asians and Hispanics that made up the research groups. But unless the authors specifically state in the research protocol and in their conclusions that this study cannot be applied to African-Americans, this is an overtly biased study against this group. There may be technical reasons why the research was done in this way. It may be that a totally separate study follows that is focused only on African-Americans. And there are legitimate research reasons for why this might be. The rule is that you can do any kind of research you want, as long as you are honest and clear about your hypothesis and stipulations, and then clearly address these issues in your conclusions. But to conveniently ignore such a point is fraudulent. And I wish I could say it was rare.

Pharmaceutical companies are under tremendous pressure. They make their money from selling medications. As time has progressed, it has become more and more expensive to create new medications that pass all of the requirements for release to the public. It is often quoted that it takes well over 10 years and billions of dollars to get a new medication onto pharmacy shelves. It’s not difficult to understand why a pharmaceutical company could be encouraged to modify its own data, in order to advance a new medication.

In some cases, if an announced new medication fails to be proven safe and effective, a company can fail entirely. Tens of thousands of people could lose their jobs, and investors, which include middle income families that count on their investments to get their kids through university or to retire comfortably, could be left with nothing. There is clearly a lot of pressure to get new medications out through the door and to have them on the market long enough, at least, to recoup the cost of the research.

In extreme cases, the cost of such actions is not just financial. People can die from unsafe medications. People can be denied known working medications because they are switched to the new medication that does not in fact have the claimed effect. This is a very dangerous game. And it is not being played for noble reasons. It is like many things, mostly about the money.

One of the attempts being made now to help overcome these types of hidden or manipulated data problems is to force researchers to publish the raw data on which their published paper is based. The intent is to allow others to review and double-check the data and the conclusions drawn from it.

This is also a fantastic idea because it allows other groups of researchers to make use of the same data but perhaps for a totally different endpoint. It could be that a paper that studies the effect of high blood pressure on long-term patients, could be used as a foundation for another study that is looking to study the positive effect of a certain medication. The raw data available from the first study could be incorporated into the second study and save a tremendous amount of time and money for everyone involved. If the data can be trusted from the first study, then this group of patients could potentially be reviewed for multiple other studies. You could even create a standard set of patient data that is used, at least, for initial investigation of a theory.

With the tremendous advances in genomics, whereby the entire set of genes of individuals is mapped out, there is so much data available that no one study will make use of it all. I have referred to this particular issue in its own blog post in the past, and the conclusion was that totally new uses for existing, inexpensive medications can be found, by reviewing genomic information that has already been collected by other groups and posted for free.

To refer back to the NFL concussion research, one of the questions to be asked is not why certain players developed long-term severe complications from repeated head injury; rather, I would like to know why so many other players sharing similar NFL experiences did not develop brain damage, severe depression and dementia. It might be luck. It might be a question of technique. Or, it might be that the still-healthy players are genetically different from the ones who develop complications. If that’s the case, a single blood test could tell a potential football player what positions to avoid. It would not end such a player’s career. It would definitely modify the trajectory of that career, but not necessarily end it. At the very least, the NFL player could make a far better informed decision about the risks involved in playing the game.

Imagine a doctor who accidentally gives a patient the wrong medication, but then simply lies on the chart to make it look as if the patient suffered a spontaneous heart attack. It’s hard to call this anything less than malpractice, if not manslaughter. When data is manipulated by any group of researchers, this can easily end up costing the lives or at least the health of countless patients. Perhaps the researcher, sitting in his or her lab, pipetting out doses of the medication for testing on rats, doesn’t see any real direct link to patient welfare. But it’s there. The moment a paper is published with the name of a researcher on it, there is the ethical expectation that the data was collected in the most valid way possible. Of course, humans make mistakes and these are forgivable, if the intent of the researchers was to act professionally and appropriately. But when data is consciously manipulated, it undermines any trust that both the general public and physicians have in published research.

I truly hope that within a few years, all journals will require submissions to include all of the raw data. With data analytics becoming easier and cheaper by the day, there is the real hope that every submitted study will have its data totally reviewed in order to identify [honest or dishonest] errors. Strictly speaking, those that submit the data should have already done this. But as is clear from this entire discussion, there is a need for an impartial judge that will scrutinize the data independent of any bias or pre-held prejudice. Whether it is possible to find such a human, is a whole different question. Maybe someone could do a study to see if finding such people is possible.
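As a small taste of what such an automated review could look like, here is a minimal sketch in Python of one very basic check: scanning the collection dates in a raw data set for long gaps, like the years-long gaps reported in the concussion data. The function name `find_gaps`, the one-year threshold and the sample dates are all assumptions made up for illustration. A real review would involve far more than this.

```python
from datetime import date


def find_gaps(dates, max_gap_days=365):
    """Return (start, end, days) for every pair of consecutive
    collection dates that are more than max_gap_days apart."""
    ordered = sorted(dates)
    gaps = []
    for earlier, later in zip(ordered, ordered[1:]):
        days = (later - earlier).days
        if days > max_gap_days:
            gaps.append((earlier, later, days))
    return gaps


# Hypothetical collection dates, invented purely for illustration.
records = [
    date(2000, 1, 15),
    date(2000, 6, 1),
    date(2003, 2, 10),  # nothing was recorded for years before this entry
    date(2003, 9, 5),
]

gaps = find_gaps(records)
for start, end, days in gaps:
    print(f"No data between {start} and {end} ({days} days)")
```

A check like this cannot say whether a gap is honest or dishonest. It can only flag the gap so that a human reviewer has to ask about it.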

Thanks for listening.