Big Data in Biotech Research

by jeeg 10. May 2013 21:01


The shift from “hypothesis-driven” to “data-driven” or “hypothesis-free” research is a major change for medical science. Although much research has always been based on new observations leading to the generation of new hypotheses, the scientific method is rooted in the idea of testing hypotheses to establish their validity. Omitting this step shifts the basis of research towards data-mining (the computational process of discovering patterns in large data sets), a sub-discipline of computer science. Enthusiasts of Big Data in healthcare see the main objective as identifying correlations between genotype and phenotype (the physical characteristics of a person) [1]. The use of large data sets and sophisticated statistical techniques increases the statistical power to detect weak correlations, such as those between SNPs (single nucleotide polymorphisms) and common, complex diseases. However, predictions based on multiple correlations can have low predictive value and/or clinical utility, and can be misleading for a variety of reasons, particularly when the effect size of each SNP is expected to be small. Some statisticians have therefore questioned the value of Big Data as a means to do research [2], whilst researchers from other disciplines such as evolutionary theory and psychiatry have highlighted the difficulties in making sense of all the information [3,4].
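The point about low predictive value can be illustrated with a short calculation. The sketch below is not taken from any of the cited studies: the function name and the sensitivity, specificity and prevalence figures are hypothetical, chosen only to show how a risk score that discriminates weakly between cases and non-cases performs when the disease is uncommon.

```python
def positive_predictive_value(sensitivity, specificity, prevalence):
    """Share of people flagged as high-risk who actually have or develop
    the disease, computed with Bayes' theorem."""
    true_positives = sensitivity * prevalence
    false_positives = (1.0 - specificity) * (1.0 - prevalence)
    return true_positives / (true_positives + false_positives)


# Hypothetical figures for a genetic risk score (illustrative assumptions only):
# it flags 60% of future cases, clears 70% of non-cases, and the disease
# affects 5% of the population.
ppv = positive_predictive_value(sensitivity=0.6, specificity=0.7, prevalence=0.05)
print(f"Positive predictive value: {ppv:.1%}")
```

Under these assumed figures, fewer than one in ten of the people told they are at high risk would go on to develop the disease, even though the underlying correlations could be statistically significant in a large enough data set.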

According to the philosopher of science Thomas Kuhn, a scientific paradigm defines:
•    what is to be observed and scrutinized
•    the kind of questions that are supposed to be asked and probed for answers in relation to this subject
•    how these questions are to be structured
•    how the results of scientific investigations should be interpreted
•    how an experiment is to be conducted, and what equipment is available to conduct it.

Like all science, Big Data or “hypothesis-free” science is based on hidden assumptions that define a paradigm. These include an emphasis on using biological data (particularly genomic data) to predict individual risks, rather than environmental or social data (although the latter may be integrated at some stage in the future); the treatment of genetic variants such as SNPs as fixed risk factors, rather than context-dependent ones; an assumption that the future identification of further genetic variants will increase the utility of personalised risk assessments sufficiently for their use to improve health outcomes; and a focus on individuals and individual actions (lifestyle changes or medical interventions) rather than population-level policy responses to improve public health (such as stricter regulation of medicines to prevent adverse drug reactions, or measures to restrict marketing of unhealthy foods).

[1] Chen J, Qian F, Yan W, Shen B (2013) Translational biomedical informatics in the cloud: present and future. Biomed Res Int, 2013:658925. doi:10.1155/2013/658925.
[2] Ioannidis JPA (2013) Informed consent, big data, and the oxymoron of research that is not research. Am J Bioeth, 13(4):40–42.
[3] Buchanan AV, Sholtis S, Richtsmeier J, Weiss KM (2009) What are genes “for” or where are traits “from”? What is the question? Bioessays, 31(2):198–208.
[4] Mewes HW (2013) Perspectives of a systems biology of the brain: the big data conundrum understanding psychiatric diseases. Pharmacopsychiatry, 46 Suppl 1:S2–9.


Dr Helen Wallace


GeneWatch UK


