First of all, I’m really saddened that Jim Gray went missing at sea. I’d like to express my deep condolences for him, for the recently missing flight MH370, and for all the other planes and boats that have gone missing in the ocean.
According to this reading, the four paradigms of science are experimental, theoretical, simulation (computational), and data-intensive scientific discovery. But in my experience and observation, most people today, researchers or not, still equate rigorous scientific research with experimental research.
In experimental research, researchers usually start with a question and a hypothesis about it, then go out and collect data to test that hypothesis. The amount of data collected is usually small, on the order of thousands of observations or fewer. Because the data are collected to answer one specific question, and because the collection process is carefully designed to reduce bias and to increase generalizability and representativeness, they are usually of high quality.
Nowadays, digital devices are everywhere, and they record volumes and volumes of data. A large part of these data was not recorded to answer any specific question. So “What do we do with these data? How do we turn them into insights?” are the questions for the 4th paradigm of scientific discovery. As a result, much of the research in this area is exploratory, and researchers in this area often get asked by experimentalists, “So what is your hypothesis?”
It goes without saying that the datasets for data-intensive science are large. However, large doesn’t mean good. Because these data were not collected to answer your specific research question, it takes great effort to curate them into something useful. As this reading says, there are 3 basic steps in data-intensive science: capture, curate, and analyze/visualize. Data curation could be the most time-consuming of the three, which is why Jim Gray advocates funding for generic data curation and analysis tools. However, my questions (or the parts that I don’t understand) are: Is there really a generic way of cleaning and curating data? Isn’t any method of data cleaning essentially a type of subjective bias? Are there successful Laboratory Information Management Systems (LIMS) today, seven years after this concept was brought up?
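To make my second question concrete, here is a minimal sketch of how the choice of cleaning rule changes the answer. Everything in it is an assumption for illustration: the readings are made up, the cutoff of 100.0 is arbitrary, and the three rules (`keep_all`, `drop_above`, `clip_above`) are hypothetical names of my own, not anything taken from the reading.

```python
from statistics import mean

# Made-up sensor readings for illustration only.
# The value 250.0 could be a device glitch or a real, rare event.
readings = [20.1, 19.8, 20.5, 21.0, 19.9, 250.0, 20.3]


def keep_all(values):
    """Rule 1: trust the device and keep every reading."""
    return list(values)


def drop_above(values, limit=100.0):
    """Rule 2: treat anything above the (arbitrary) limit as an error and discard it."""
    return [v for v in values if v <= limit]


def clip_above(values, limit=100.0):
    """Rule 3: keep every reading but cap extreme values at the limit."""
    return [min(v, limit) for v in values]


kept = keep_all(readings)
dropped = drop_above(readings)
clipped = clip_above(readings)

# The same raw data gives three different "average readings",
# depending only on which cleaning rule the analyst preferred.
for name, cleaned in [("keep all", kept), ("drop", dropped), ("clip", clipped)]:
    print(f"{name:>8}: n={len(cleaned)}, mean={mean(cleaned):.2f}")
```

None of the three rules is obviously wrong, yet each one gives a different mean. That is the kind of judgment call I have in mind when I ask whether a generic curation tool can avoid building in someone’s bias.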
Hey, T., Tansley, S., & Tolle, K. (Eds.). (2009). The Fourth Paradigm: Data-Intensive Scientific Discovery. Microsoft Research.