Guest Blog Post for HDF Group prior to our webinar
What does astrophysics have to do with psychiatry?
Most people would probably dismiss the question as impossible to answer. After all, astrophysics and psychiatry seem like entirely different scientific fields. Some would laugh off the question, or even mock the person asking it. One could imagine the same scene unfolding in a different era over the flatness of the Earth, or over any other commonly held belief that was later found to be untrue once more accurate and precise information became available.
With the advent of the scientific method, scientists agreed long ago on a set of steps that allow us to investigate these types of questions. Today, we collect buildings full of electronic data in order to answer questions about astrophysics, psychiatry, and dozens of other fields. The problem, however, is making data stored in a format specific to one field (such as astrophysics) useful for answering questions in another field (such as psychiatry), and using that data in a way that ensures reproducible results.
Storing, moving, and analyzing very large data sets is a problem we deal with daily at MyIRE. Genomics, imaging, and simulation can produce petabytes (peta = 10^15) of data. That’s thousands of terabytes (lots of CDs)! When it comes to storing massive quantities of data for processing at supercomputing speeds, no one has a track record comparable to HDF’s, proven at NASA and other large-scale data processors. For MyIRE, HDF5 is the obvious choice for the open-source, portable, scalable, self-describing file format needed to run experiments across fleets of computers.
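To put a petabyte in perspective, here is a rough back-of-the-envelope sketch in Python. The 700 MB CD capacity is our assumption for illustration; actual capacities vary by disc type.

```python
# Rough scale check: how many terabytes and CDs fit in one petabyte?
PETABYTE = 10**15         # bytes (peta = 10^15)
TERABYTE = 10**12         # bytes (tera = 10^12)
CD_BYTES = 700 * 10**6    # assumed capacity of one CD, in bytes


def terabytes_in(n_bytes):
    """Convert a byte count to terabytes."""
    return n_bytes / TERABYTE


def cds_needed(n_bytes, cd_bytes=CD_BYTES):
    """Number of whole CDs needed to hold n_bytes (rounded up)."""
    return -(-n_bytes // cd_bytes)  # ceiling division on integers


if __name__ == "__main__":
    print("Terabytes per petabyte:", terabytes_in(PETABYTE))
    print("CDs per petabyte:", cds_needed(PETABYTE))
```

Under that assumption, a single petabyte works out to roughly 1.4 million CDs.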
Storing, moving, and analyzing very small data sets is also a problem we deal with daily at MyIRE. Survey data and small-scale clinical studies may generate only megabytes (mega = 10^6) of data to be analyzed, but that data must be usable by teams working on many different computers with different software. In addition, for small data sets, MyIRE stores a lot of information about how the data was collected, relative to the amount of data actually collected.
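As an illustration of what "self-describing" can mean for a small data set, here is a minimal sketch using the h5py and NumPy Python packages. It is not MyIRE's actual schema: the dataset path, attribute names, and values are all hypothetical. The point is only that the notes about how the data was collected travel in the same file as the data.

```python
import os
import tempfile

import h5py
import numpy as np


def write_survey(path):
    """Store a tiny table of survey scores plus notes on how it was collected."""
    scores = np.array([[3, 4, 2], [5, 1, 4]], dtype="int32")  # made-up values
    with h5py.File(path, "w") as f:
        # Intermediate groups ("survey") are created automatically.
        dset = f.create_dataset("survey/scores", data=scores)
        # Hypothetical collection metadata, stored as HDF5 attributes.
        dset.attrs["instrument"] = "hypothetical 3-item questionnaire"
        dset.attrs["columns"] = "item_1,item_2,item_3"
        dset.attrs["n_participants"] = 2


def read_survey(path):
    """Read back both the scores and the metadata describing them."""
    with h5py.File(path, "r") as f:
        dset = f["survey/scores"]
        return dset[()], dict(dset.attrs)


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "survey_example.h5")
        write_survey(path)
        scores, meta = read_survey(path)
        print(scores)
        print(meta)
```

The same calls work whether the dataset holds two rows or billions, which is part of why one format can serve both ends of the scale.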
At first, these challenges seem like very different problems and use cases. Why not just store the small data in something portable that everyone can use, like a spreadsheet or SQL table, and keep the large data in HDF? To MyIRE, the answer is obvious: the critical need to reproduce results.
Scientists using massive or tiny data sets in their experiments all need software they can depend on to reach the same conclusion multiple times for any given question (that is, reproducibility). Without the ability to replicate results, the enormous amounts of time and money spent planning and conducting experiments, collecting data, and presenting results all go to waste. More importantly, volunteers and patients are subjected to unnecessary risks when science presents flawed results. When people create and store data in MyIRE, data sets are maintained as HDF5 files. MyIRE’s common interface allows the data sets to be combined, moved, and used repeatedly, whether the user is part of a team working for a large international organization or an individual doctor in a small-town office. MyIRE data sets can then be checked and cross-checked for errors and fraud. They can also be overlaid to discover additional insights for drug targeting.
With MyIRE and HDF, everyone from a doctor in a small-town office to big data genetics researchers can work together to find new and powerful insights across data sets using a common set of tools, and do so in a repeatable way. And, because of HDF5, any user, large or small, is powered by the same technology used by CERN and NASA. We knew we wanted all of MyIRE’s users to have the power of NASA in their pocket. HDF5 made that possible.
If that’s possible, perhaps it is also possible to discover: what does astrophysics have to do with psychiatry?