Big data promises better and faster diagnoses. But managing the increasing volume of data, creating common standards and guaranteeing patient privacy will require medical researchers to coordinate their efforts. Many Swiss organizations are working on it.
Florian Fisch
Swiss National Science Foundation | Science editor
On first hearing, ‘big data’ sounds great. The more we know about the human body, brain or gene expression, the better we can treat patients. That is why many researchers and companies try to sell their projects or products using this buzzword. But on closer examination, it quickly becomes clear that bigger data does not mean better knowledge. Often people are overwhelmed by the flood of data and struggle to make sense of it – especially in the life sciences.
Take the intensive care unit. Every day, a single critically ill patient generates up to 100 gigabytes of data, according to Switzerland’s National Research Programme “Big Data” (NRP 75). This data comes from patient monitoring, computed tomography and magnetic resonance imaging of the brain, laboratory results and biosensors. The monitoring system triggers about 700 alarms a day, or one alarm every two minutes, and most of them are false. Clearly, patient safety can be improved by getting to grips with that amount of information.
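The figures above can be checked with a few lines of arithmetic. The sketch below assumes decimal gigabytes (10^9 bytes) and an even spread of alarms over 24 hours; both are simplifying assumptions, not statements from NRP 75.

```python
# Back-of-the-envelope check of the intensive care figures quoted above.
# Assumptions: 1 gigabyte = 10**9 bytes, alarms spread evenly over the day.
MINUTES_PER_DAY = 24 * 60
SECONDS_PER_DAY = MINUTES_PER_DAY * 60

alarms_per_day = 700
bytes_per_day = 100 * 10**9  # upper bound of 100 gigabytes per patient

minutes_between_alarms = MINUTES_PER_DAY / alarms_per_day
megabytes_per_second = bytes_per_day / SECONDS_PER_DAY / 10**6

print(f"One alarm every {minutes_between_alarms:.1f} minutes")
print(f"Sustained data rate: {megabytes_per_second:.2f} MB/s per patient")
```

Under these assumptions, 700 alarms a day works out to one alarm roughly every 2.1 minutes, and 100 gigabytes a day to a sustained stream of a little over 1 megabyte per second for one patient.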
As we use more instruments and more assays, the volume of data keeps growing. It ranges from gene expression data to daily activities recorded on smartwatches worn by patients or trial participants. Even environmental information from where these people live is recorded. It is an enormous quantity of data from different sources and of varying levels of quality. The data mountain gets bigger every day. That is a simple fact.
Once it becomes possible to make sense of this flood of data – and currently the life sciences are struggling to stay afloat – there are many interesting uses. Take the safety of critically ill patients in the intensive care unit. A research project run by the University Hospital Zurich, ETH Zurich and IBM Research is working on procedures to filter out the false alarms, which make up most of the alarms sounding roughly every two minutes, and thereby to enable early detection of epileptic seizures and diagnosis of secondary brain damage caused by cellular processes.
With data mining and machine learning, the researchers want to improve the quality of the alarm system and rapidly propose innovations to the medical community. “With this project, we want to initiate a fundamental development in emergency and intensive care medicine – and thus significantly improve the way hospitals work in day-to-day practice”, says Emanuela Keller, professor at the University Hospital Zurich, in the NRP 75 press release.
Less urgent, but no less important, is deciding which therapy is suitable for a specific type of cancer or any other disease. This means looking for informative biomarkers in many kinds of data: “omics” data (such as DNA sequences, gene expression profiles and metabolite levels), images, data from biobanks, doctors’ diagnoses and environmental data. “Thanks to bioinformatics methods and tools and because we have now hundreds of thousands of data points, researchers can use tailor-made algorithms to identify biomarkers or common patterns. We can ask the algorithm: ‘please find a difference, between patients with disease X and healthy individuals’”, says Valérie Barbié from the Swiss Institute of Bioinformatics (SIB).
This could help to make the treatment decision for a cancer more precise, find unknown environmental factors that influence a lung problem or make it possible to diagnose a very rare bowel disease. SIB is working on the technical infrastructure, analysis methods, software tools and knowledge bases to make that dream come true.
The term big data comes from information technology: think of all mobile phone connection data or all the petabytes (10^15 bytes) CERN has gathered. That data is all very well structured. Not so in the life sciences, where the data is very heterogeneous – it varies even between healthy individuals and is very sensitive to small perturbations. In addition, there is an astronomical number of possible interactions with molecules or physical factors. “And you can easily trick your data to get a significant p-value with almost any data. Whatever you are putting into your algorithm, you will get the answer you want”, warns Valérie Barbié.
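One way this warning plays out is multiple testing: if enough variables are compared, some will look significant by chance alone. The sketch below is an illustration under stated assumptions, not a method used by SIB. It simulates 1,000 hypothetical biomarkers that are pure noise in both “patients” and “healthy individuals”, tests each one with a simple normal-approximation test, and counts how many fall below p = 0.05 with and without a Bonferroni correction. All names and parameters are invented for the example.

```python
import random
from statistics import NormalDist, mean, stdev


def two_sample_p_value(a, b):
    """Two-sided p-value for a difference in means (normal approximation)."""
    standard_error = (stdev(a) ** 2 / len(a) + stdev(b) ** 2 / len(b)) ** 0.5
    z = (mean(a) - mean(b)) / standard_error
    return 2 * (1 - NormalDist().cdf(abs(z)))


def count_false_positives(n_features=1000, n_per_group=50, alpha=0.05, seed=0):
    """Test many noise-only features; return (naive hits, Bonferroni hits)."""
    rng = random.Random(seed)
    p_values = []
    for _ in range(n_features):
        # Both groups come from the same distribution: no real effect exists.
        patients = [rng.gauss(0, 1) for _ in range(n_per_group)]
        healthy = [rng.gauss(0, 1) for _ in range(n_per_group)]
        p_values.append(two_sample_p_value(patients, healthy))
    naive = sum(p < alpha for p in p_values)
    # Bonferroni: divide the threshold by the number of tests performed.
    corrected = sum(p < alpha / n_features for p in p_values)
    return naive, corrected


naive_hits, bonferroni_hits = count_false_positives()
print(f"'Significant' noise features at p < 0.05: {naive_hits}")
print(f"Still significant after Bonferroni correction: {bonferroni_hits}")
```

With a 5% threshold, around 50 of the 1,000 noise features are expected to pass the uncorrected test even though none carries a real signal. Bonferroni is only one of several possible corrections and is used here because it is the simplest to show.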
To base conclusions on a solid foundation and deliver on promises, standards have to be established on how to generate, handle and analyze the data. The way data is produced has to be described clearly. The exact type of assay might also be important. Algorithms need to be able to distinguish between, for example, medium-quality data from a smartwatch and high-quality data from a medical brain scan. One of the challenges for the Swiss Personalized Health Network (SPHN) is therefore to implement standards, which are essential to aggregate, compare and analyze data stemming from different sources, such as hospitals in different parts of the country. Such standards have long been well established in other industries, such as the banking sector.
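What “describing how data is produced” could look like in practice is sketched below: each measurement carries its source and a quality tier, so an analysis can decide which data to admit. The field names and the two quality tiers are hypothetical and chosen only for illustration; they are not an SPHN standard.

```python
from dataclasses import dataclass
from enum import IntEnum


class Quality(IntEnum):
    """Hypothetical quality tiers; a real standard would define its own."""
    MEDIUM = 1
    HIGH = 2


@dataclass(frozen=True)
class Measurement:
    value: float
    unit: str
    source: str       # device or assay that produced the value
    quality: Quality  # how much an algorithm should trust the value


def at_least(measurements, minimum):
    """Keep only measurements whose quality tier meets the minimum."""
    return [m for m in measurements if m.quality >= minimum]


readings = [
    Measurement(72.0, "beats/min", "smartwatch", Quality.MEDIUM),
    Measurement(70.0, "beats/min", "clinical monitor", Quality.HIGH),
]

for m in at_least(readings, Quality.HIGH):
    print(f"{m.source}: {m.value} {m.unit}")
```

Keeping the provenance next to the value means a later analysis can filter or weight data without having to guess where it came from.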
“If the data you put into the analysis is not well characterized, the answers you get out will lead to wrong diagnoses”, says Valérie Barbié. That is a problem because every country, every hospital and every medical field has its own vocabulary and its own categories. A condition named colorectal cancer in one place can be recorded as colorectal adenocarcinoma in another, to name just one simple example. This means that the data is not comparable. And only now are hospitals across Switzerland starting to implement electronic health records (EHRs).
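The vocabulary problem described above can be illustrated with a small mapping table that translates local labels into one shared concept before records are pooled. This is a minimal sketch: the concept identifier and the mapping are invented for the example, and deciding whether two labels really denote the same condition is a clinical question that code cannot answer.

```python
from collections import defaultdict

# Hypothetical mapping from local hospital labels to a shared concept ID.
LOCAL_TO_SHARED = {
    "colorectal cancer": "CONCEPT:colorectal-cancer",
    "colorectal adenocarcinoma": "CONCEPT:colorectal-cancer",
}


def harmonize(label):
    """Return the shared concept for a local label, or None if unmapped."""
    key = " ".join(label.lower().split())  # normalize case and whitespace
    return LOCAL_TO_SHARED.get(key)


def group_by_concept(records):
    """Group records by shared concept; unmapped labels are kept separately."""
    groups = defaultdict(list)
    for record in records:
        concept = harmonize(record["diagnosis"])
        groups[concept if concept is not None else "UNMAPPED"].append(record)
    return dict(groups)


records = [
    {"hospital": "A", "diagnosis": "Colorectal cancer"},
    {"hospital": "B", "diagnosis": "colorectal  adenocarcinoma"},
    {"hospital": "C", "diagnosis": "bowel tumour"},
]

for concept, members in group_by_concept(records).items():
    print(concept, [m["hospital"] for m in members])
```

Labels that are not in the table are set aside rather than guessed at, so that someone with clinical knowledge can decide how to classify them.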
The way hospitals are run adds to the confusion. Huldrych Günthard from the University Hospital Zurich heads the Swiss HIV Cohort Study, which collects data that is comprehensive, well structured and consistent over long time periods. Even he struggles with this. As he explained to the Swiss research magazine Horizons in 2016: “Specific diagnoses in hospitals are sometimes distorted by economic factors, such as codifying invoices according to flat-rate payments.”
Here again, the SPHN comes into play: meetings are held to agree on international standards for naming and storing the information. To allow the exchange of data among researchers, the Swiss Academy of Medical Sciences (SAMS) has introduced a general consent with which patients can agree to the use of their data. In principle, therefore, the way is open for big studies.
Genomic data contains a lot of information: information that we hope to use to decide how best to treat many conditions. But more is hidden in there: for example, information about other family members who may not agree to be part of the research. Many risk factors are genetic too, and might tell people something they do not want to know, perhaps because it scares them or because they cannot do anything about it. Or perhaps they do not want their insurance to know, as it might increase the premium or exclude them from its services.
In a connected world, data is easily exchanged but also easily stolen. Many companies have fallen victim to attacks on their IT systems – even places considered to be safe. Hospitals with their decentralised structures and different systems are not the safest places for data to start with. And with the introduction of electronic health records it becomes easier to access large quantities of data – to the benefit of research, but to the detriment of safety.
How can we guarantee that the data is safe? Of course, data has to be anonymized. It has to be encrypted. Researchers should not be allowed to download data but have to work on secured research platforms. This is possible. Nordic and Baltic countries like Estonia, Sweden and Denmark have shown that it is possible to go digital without having to suffer major security breaches. This is important. After all, research relies on the general consent of patients and study participants.
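One building block of such a platform can be sketched in code: replacing direct identifiers with a keyed hash before data reaches researchers. Strictly speaking this is pseudonymization rather than full anonymization, because whoever holds the key can recompute the link, and the remaining fields may still allow re-identification. The record fields and function names below are hypothetical.

```python
import hashlib
import hmac
import secrets

# Fields treated as direct identifiers in this example (an assumption).
DIRECT_IDENTIFIERS = {"patient_id", "name", "address"}


def pseudonymize(patient_id, secret_key):
    """Derive a stable pseudonym from a patient ID using HMAC-SHA256."""
    digest = hmac.new(secret_key, patient_id.encode("utf-8"), hashlib.sha256)
    return digest.hexdigest()


def strip_direct_identifiers(record, secret_key):
    """Return a copy of the record without direct identifiers."""
    cleaned = {k: v for k, v in record.items() if k not in DIRECT_IDENTIFIERS}
    cleaned["pseudonym"] = pseudonymize(record["patient_id"], secret_key)
    return cleaned


# The key would stay with the data custodian, never with researchers.
key = secrets.token_bytes(32)
record = {
    "patient_id": "12345",
    "name": "Jane Example",
    "address": "Example Street 1",
    "diagnosis": "colorectal cancer",
}
print(strip_direct_identifiers(record, key))
```

The same patient always receives the same pseudonym under one key, so records can still be linked for research, while a different key produces unrelated pseudonyms.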
Biology is complex and delivers noisy data. To base conclusions on a solid foundation and deliver on promises, standards have to be established on how to generate, handle and analyse the data.