FAIR Data Cube, a FAIR data infrastructure for integrated multi-omics data analysis

Liao, Xiaofeng; Ederveen, Thomas H.A.; Niehues, Anna; de Visser, Casper; Huang, Junda; Badmus, Firdaws; Doornbos, Cenna; Orlova, Yuliia; Kulkarni, Purva; van der Velde, K. Joeri; Swertz, Morris A.; Brandt, Martin; van Gool, Alain J.; ’t Hoen, Peter A. C.

doi:10.1186/s13326-024-00321-2

Software
Open access
Published: 28 December 2024

Xiaofeng Liao¹,
Thomas H.A. Ederveen¹,
Anna Niehues¹,
Casper de Visser¹,
Junda Huang¹,
Firdaws Badmus¹,
Cenna Doornbos¹,
Yuliia Orlova⁵,
Purva Kulkarni¹,²,³,
K. Joeri van der Velde⁴,
Morris A. Swertz⁴,
Martin Brandt⁵,
Alain J. van Gool²,³ &
Peter A. C. ’t Hoen¹

Journal of Biomedical Semantics volume 15, Article number: 20 (2024)


Abstract

Motivation

We are witnessing an enormous growth in the amount of molecular profiling (-omics) data. The integration of multi-omics data is challenging. Moreover, human multi-omics data may be privacy-sensitive and can be misused to de-anonymize and (re-)identify individuals. Hence, most biomedical data is kept in secure and protected silos, and it remains a challenge to re-use these data without infringing the privacy of the individuals from whom the data were derived. Federated analysis of Findable, Accessible, Interoperable, and Reusable (FAIR) data is a privacy-preserving solution to make optimal use of these multi-omics data and transform them into actionable knowledge.

Results

The Netherlands X-omics Initiative is a National Roadmap Large-Scale Research Infrastructure aiming for efficient integration of data generated within X-omics and external datasets. To facilitate this, we developed the FAIR Data Cube (FDCube), which adopts and applies the FAIR principles. It helps researchers to create FAIR data and metadata, to facilitate re-use of their data, and to make their data analysis workflows transparent, while ensuring data security and privacy.

Introduction

It is now widely acknowledged that understanding the mechanisms underlying health and disease requires the concerted study of different molecular levels (Deoxyribonucleic Acid (DNA), Ribonucleic Acid (RNA), proteins, metabolites). Moreover, a transition is needed from static and somewhat simplified views to dynamic and more comprehensive views of biological pathways. These pathways are usually described by measuring the comprehensive assembly of molecular features in a biological system at the level of genes, transcripts, proteins and metabolites, i.e., the study of -omics: genomics, transcriptomics, proteomics and metabolomics. Currently, this is neither simple nor scalable. As such, there is an increasing need to combine -omics data from different sources (multi-omics) in order to achieve a better understanding of biological systems, but the data and their associated metadata are not always FAIR: Findable, Accessible, Interoperable, and Reusable [1]. For that reason, the Netherlands X-omics Initiative has developed a multi-omics data infrastructure that facilitates FAIR-compliant multi-omics data storage and analysis. The proposed data infrastructure provides an analysis environment for (federated) data handling and analysis while ensuring data security and privacy.

This paper introduces our solution for integrated analysis of FAIR multi-omics data in decentralized databases. In the remainder of this paper, the Related work section reviews existing work in this research direction. The Result section presents the design and implementation of the FDCube, and the Demonstration of FDCube in TWOC section showcases the use of the FDCube in the Trusted World of Corona (TWOC) project [2]. Finally, the Conclusion section discusses further developments.

Related work

There are several tools that aid researchers in managing research metadata in a FAIR manner, for instance the FAIR Data Station [3], the FAIR-in-a-box [4] approach, and the DataFAIRifier [5]. Most of these tools focus on the production of FAIR data, including ingestion, generation, and publication.

For a more comprehensive coverage of FAIR processes including data management, data security, data exchange, and federated analysis, additional tools are required. For example, MOLGENIS is an open-source web-application covering the typical flow of human genomics data including data collection, management, analysis, visualization, and sharing, as well as offering support to make data FAIR [6, 7]. MOLGENIS can be hosted on-site and stores the data locally in a PostgreSQL database. This offers all the advantages of a database system including a local access control system (in light of the European General Data Protection Regulation) together with detailed data management.

The Personal Health Train (PHT) [8] concept underlies a number of approaches for decentralised analysis of health-related data. The essence of the PHT approach is the analogy of a station representing the data source and a train representing the research question (or a computational request) visiting the data stations. Stations range from very large databases to small personal lockers containing the data of one person. Each station has its own set of house rules describing what a visiting ‘train’ is allowed to do with its data [8]. By moving trains towards stations rather than moving data, copying of data is avoided. The data therefore remain under complete control of the person or institute that generated them, which reduces privacy concerns around data sharing.
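The station and train analogy can be illustrated with a very small sketch. This is only a conceptual toy, not code from any PHT implementation: the class names `Station` and `Train`, the operation labels, and the records are all hypothetical, and real systems verify what a train does rather than trusting a label.

```python
from dataclasses import dataclass
from typing import Any, Callable, Iterable, List


@dataclass
class Train:
    """A computational request that travels to a station (hypothetical model)."""
    operation: str                       # label checked against the house rules
    run: Callable[[List[dict]], Any]     # code executed at the station


class Station:
    """A data source that keeps its records local and enforces house rules."""

    def __init__(self, name: str, records: Iterable[dict],
                 allowed_operations: Iterable[str]):
        self.name = name
        self._records = list(records)            # never handed out directly
        self.allowed_operations = set(allowed_operations)

    def receive(self, train: Train) -> Any:
        # House rules: refuse any train whose operation is not permitted.
        if train.operation not in self.allowed_operations:
            raise PermissionError(
                f"{self.name} does not allow '{train.operation}'")
        # The train runs where the data lives; only its result leaves.
        return train.run(self._records)


station = Station(
    "station-1",
    records=[{"id": "p1"}, {"id": "p2"}, {"id": "p3"}],  # placeholder records
    allowed_operations={"count"},
)
count_train = Train("count", run=len)
export_train = Train("export", run=lambda records: records)

n_records = station.receive(count_train)
```

In this toy, the counting train returns an aggregate, while `station.receive(export_train)` raises `PermissionError` because exporting raw records is not in the station's house rules.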

DataSHIELD [9] implements the idea of bringing algorithms to the data to ensure data privacy and security. DataSHIELD facilitates (co-)analysis of (harmonised) biomedical, healthcare and social-science data stored at one or multiple locations. The analysis requests are sent from a central analysis machine to several data-holding machines, which store the harmonised data to be co-analysed. The datasets are then analysed simultaneously, but in parallel. MOLGENIS developed a DataSHIELD implementation called Armadillo in its MOLGENIS suite.

Vantage6 [10, 11] is a different implementation of the PHT concept. Vantage6 enables collaboration between multiple parties by allowing them to participate in one or multiple studies across multiple data stations.

In terms of programming language, DataSHIELD restricts itself to a single language (R) [12] and to a pre-defined library of functions and algorithms. In contrast, Vantage6 allows the researcher to send a request to use their preferred programming language, as long as the language is supported by the targeted data station.

To advance and further build upon the currently available federated, FAIR solutions for the scientific community, we here present the FDCube for public use under an open MIT license. In contrast to the more generic MOLGENIS Armadillo approach, the FDCube contains more specialised services for the analysis of multi-omics data. For example, we adopt the Investigation, Study, Assay (ISA) metadata schema to capture metadata about (-omics) experiments in a hierarchical manner, in which the different omics layers can be integrated within one project and connected through common identifiers. To the best of our knowledge, this is the first federated infrastructure designed for multi-omics data analysis. The FDCube is developed based on the principle that data should be “as open as possible and as closed as necessary” [13]. By incorporating a FAIR Data Point (FDP) component [14], the metadata can be as open as possible and made FAIR-at-the-source. By integrating a Vantage6 component [10], data security and privacy can be ensured during federated analysis.

In comparison to other FAIR initiatives such as CEDAR [15], FAIRDOM [16] and the Omics Discovery Index (OmicsDI) [17], the FDCube has a number of additional strengths. First of all, CEDAR and FAIRDOM both focus mostly on general metadata management (i.e., FAIRification of datasets), whereas the FDCube provides additional solutions for -omics (meta)data. In addition to metadata generation and publication, the FDCube goes a step further by dealing with federated analysis tools and approaches in order to promote reusability of data. Furthermore, OmicsDI facilitates the access and dissemination of -omics datasets by indexing metadata from public datasets in various resources, but it expects data in a common XML format. Because OmicsDI does not use standard ontologies, it is difficult for it to adhere to the FAIR principles, whereas the FDCube supports the use of ontologies by utilizing the FAIR Data Station, which combines a set of ontologies to support the metadata model based on ISA.

Result

The FDCube is a technological framework for the storage, analysis and integration of multi-omics data. The FDCube reuses and extends existing open software components/modules and initiatives, including the FDP [14] and Vantage6 [10]. Further elements of the FDCube are the ISA metadata framework [18, 19] for capturing general study metadata, sample metadata (including basic sample characteristics), and assay metadata, and the Phenopackets [20] standard for capturing the phenotypic description of a patient/sample. The concept of the FDCube is illustrated in Fig. 1 and detailed below from the perspective of a dataset owner and a researcher, respectively. The complete and detailed documentation on the FDCube can also be found at https://github.com/Xomics/FAIRDataCube/wiki.

Dataset owner

A dataset owner (Fig. 1; right upper corner) acquires the dataset (1) and registers it (2, 3) by publishing the metadata on a metadata registry (4), such as an FDP. The FDP is a metadata repository that provides public access to metadata in accordance with the FAIR principles [14]. The FDP helps dataset owners to publish the metadata of their dataset, and helps other researchers to find and access information (metadata) about these registered datasets, including pointers to the actual location of the data (which can in theory be anywhere). This is irrespective of data access restrictions and licenses, which are typically arranged by the dataset owner at the place where the data is stored.

Considering the various metadata formats adopted by the different research communities who focus on multi-omics data, it is desirable to adopt a standard metadata format as a template for submitting study metadata. For this purpose, we employed the ISA metadata framework [18, 19] as our basic framework, to capture and standardize study (design) information from different -omics metadata schemes. The ISA metadata schema is widely adopted by a number of research communities, for example for submission of metabolomics data as implemented by EMBL’s European Bioinformatics Institute (EMBL-EBI) in their MetaboLights repository [21].

In biomedical studies, clinical characteristics and phenotypic information of study subjects may need to be collected in addition to (-omics or other) measurement data. This information is essential for making biologically relevant interpretations of experimental research data. Thus, phenotype data need to be standardized as well, so that researchers and clinicians can more easily link these phenotypic characteristics to other types of biomedical data. To achieve this, the Phenopackets framework [20], as developed by the Global Alliance for Genomics and Health, was adopted in the FDCube. This framework comprises a comprehensive data structure (data model) and makes use of common ontology terms in order to categorise and connect different types of phenotype data.

Researcher

The researcher (Fig. 1; left lower corner) can be both a dataset owner and a dataset consumer. As a dataset consumer, the user can search any FDP for any dataset of interest. For example, one could query an FDP that is part of an FDCube containing multi-omics molecular study data, provided its metadata is properly ontologized. Since all metadata is represented in a linked data format, the researcher can conduct semantic searches on datasets and their corresponding study information by using the SPARQL Protocol and RDF Query Language (SPARQL) query interface. The information that can be queried is the ontologized description of, for instance: study samples and their (biological) source; sample preparation details; methods and techniques applied; (-omics) measurement and (data) analysis strategies, workflows and reports, including the detected (molecular) data features; research group affiliations, etc. Example questions that may be asked are:

1. Find all studies which use mass spectrometry-based metabolomics and study a specific metabolic disorder;
2. Find datasets with more than two -omics types and more than 100 individuals;
3. Find measurements for proteins and metabolites that belong to a particular metabolic pathway.
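As a rough illustration of such a semantic search, the sketch below builds a SPARQL query that lists datasets whose title mentions a keyword, and encodes it as a GET request URL. The endpoint URL is a placeholder, and the use of `dcat:Dataset` with `dcterms:title` is an assumption about how the metadata is modelled; neither is taken from the FDCube documentation.

```python
from urllib.parse import urlencode

# Placeholder: replace with the SPARQL endpoint of the FDP being queried.
ENDPOINT = "https://example.org/sparql"


def build_dataset_query(keyword: str, limit: int = 20) -> str:
    """Return a SPARQL query listing datasets whose title contains `keyword`."""
    # Escape backslashes and double quotes so the keyword stays a valid
    # SPARQL string literal.
    safe = keyword.replace("\\", "\\\\").replace('"', '\\"')
    return f"""PREFIX dcat: <http://www.w3.org/ns/dcat#>
PREFIX dcterms: <http://purl.org/dc/terms/>
SELECT ?dataset ?title WHERE {{
  ?dataset a dcat:Dataset ;
           dcterms:title ?title .
  FILTER(CONTAINS(LCASE(STR(?title)), LCASE("{safe}")))
}}
LIMIT {int(limit)}"""


def build_request_url(endpoint: str, query: str) -> str:
    """Encode the query as a GET request (SPARQL 1.1 Protocol, query via GET)."""
    return endpoint + "?" + urlencode({"query": query})


query = build_dataset_query("metabolomics")
url = build_request_url(ENDPOINT, query)
print(query)
```

The resulting URL could be fetched with any HTTP client. The more specific questions listed above would need additional triple patterns that depend on the ontologies used in the published metadata.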

To analyze access-protected data and explore more complex research questions, the researcher can (1) send a computational request to a private data storage and computing environment. This is achieved by the Vantage6 component of the FDCube. If the request is accepted by the dataset owner, the (2) aggregated results of the computational request are calculated at the data storage side and sent back to the researcher through Vantage6. These aggregated results prevent (re)identification of individual samples.

Demonstration of FDCube in TWOC

We adopted the Trusted World of Corona (TWOC) project to demonstrate how to utilize the FDCube for integrated multi-omics federated analysis. The TWOC project aims to contribute to a more sustainable, innovative high-quality and person-oriented healthcare system. To this end, they created a platform in which humans and machines can meet based on FAIR data, protocols and algorithms.

In Fig. 2, we provide an example of the creation and application of the FDCube based on a public dataset on Coronavirus disease 2019 (COVID-19) featuring multi-omics patient data by Su et al., 2020 [22], which was FAIRified as part of the TWOC project. To demonstrate the added value of data FAIRification, we integrated the multi-omics data with data on molecular pathways from another FAIR resource: WikiPathways [23], as described in detail in Fig. 5. Below is an overview of the workflows for creating, filling, and using the FDCube.

Storage of raw and processed -omics data

A publicly available multi-modal dataset from COVID-19 patients [22] was prepared, harmonized and FAIRified as part of the TWOC project. The dataset consists of paired -omics data layers describing transcriptomics, proteomics, and metabolomics of blood samples, and includes comprehensive phenotype information (Fig. 2, in red). The FAIRified dataset, including documentation of the relevant (meta)data and their FAIRification processes, is publicly accessible at the TWOC’s demonstrator GitHub repository [24].

To allow interactive and joint querying of data and metadata through Vantage6 (Fig. 2, in yellow), we store the processed -omics data along with their feature annotation files. These are both stored in a flat-text tabular .csv format, with features as rows and samples as columns.
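A minimal sketch of reading a table in this layout (features as rows, samples as columns) is given below. The column names and values are made-up placeholders rather than content from the TWOC files, and the assumption that the first column holds the feature identifier is ours.

```python
import csv
import io

# Toy stand-in for a processed -omics table; names and values are made up.
EXAMPLE_CSV = """feature_id,sample_A,sample_B,sample_C
feat_1,1.0,2.0,3.0
feat_2,4.0,5.0,6.0
"""


def read_feature_table(handle):
    """Return (sample_ids, {feature_id: [one value per sample]})."""
    reader = csv.reader(handle)
    header = next(reader)
    sample_ids = header[1:]           # assumed: first column is the feature id
    table = {}
    for row in reader:
        if not row:                   # skip blank lines
            continue
        table[row[0]] = [float(v) for v in row[1:]]
    return sample_ids, table


def values_for_sample(sample_ids, table, sample_id):
    """Extract one column: the value of every feature for a single sample."""
    idx = sample_ids.index(sample_id)
    return {feature: values[idx] for feature, values in table.items()}


samples, features = read_feature_table(io.StringIO(EXAMPLE_CSV))
profile_b = values_for_sample(samples, features, "sample_B")
```

Keeping samples as columns means that a per-sample profile is a column lookup, while a per-feature comparison across samples is a single row.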

Creation of metadata

In the TWOC project, both the ISA metadata schema and the Phenopackets schema are adopted. The ISA metadata schema is used as a standard metadata schema to capture metadata about (-omics) experiments, which are serialized in a hierarchical ISA-JSON file using the ISA tools [19, 25]. The ISA tools also provide additional functionality to convert ISA objects into linked data file formats, for example Turtle (Terse RDF Triple Language) [26].
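To make the Investigation, Study, Assay hierarchy concrete, the sketch below assembles a simplified nested structure with plain dictionaries and serializes it as JSON. It deliberately does not use the ISA tools; the key names loosely follow ISA-JSON conventions but have not been validated against the ISA-JSON schema, and all identifiers, titles and file names are hypothetical.

```python
import json


def make_assay(filename, measurement_type, technology_type):
    """One assay: what was measured and with which kind of technology."""
    return {
        "filename": filename,
        "measurementType": {"annotationValue": measurement_type},
        "technologyType": {"annotationValue": technology_type},
    }


def make_study(identifier, title, assays):
    """A study groups the assays performed on the same set of samples."""
    return {"identifier": identifier, "title": title, "assays": list(assays)}


def make_investigation(identifier, title, studies):
    """The investigation is the top-level container of the hierarchy."""
    return {"identifier": identifier, "title": title, "studies": list(studies)}


# All values below are illustrative placeholders, not the TWOC metadata.
investigation = make_investigation(
    "INV-1",
    "Example multi-omics investigation",
    [
        make_study(
            "STUDY-1",
            "Example blood sample study",
            [
                make_assay("a_transcriptomics.txt",
                           "transcription profiling", "nucleotide sequencing"),
                make_assay("a_proteomics.txt",
                           "protein expression profiling", "mass spectrometry"),
                make_assay("a_metabolomics.txt",
                           "metabolite profiling", "mass spectrometry"),
            ],
        )
    ],
)

isa_json = json.dumps(investigation, indent=2)
```

Because the three assays hang under one study, the different -omics layers stay connected to the same project and samples.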

Example scripts, templates and their documentation are provided in our GitHub repository to assist researchers in capturing study and experimental (meta)data [27]. Notably, for phenotype data, a Python script was developed based on the Phenopackets data schema to automatically convert non-FAIRified phenotypic information into .csv format [27]. Furthermore, a YARRRML [28] template was written that embeds the Resource Description Framework (RDF) schema [29] of Phenopackets and makes use of the transformation service from FAIR-in-a-box [4]. This converts the .csv file into a linked data format. Finally, the linked data output, which includes study and experimental (meta)data as well as phenotypic information, is uploaded into the triplestore within the FDP (Fig. 2, in blue). This FAIRified linked data can subsequently be queried by the user through SPARQL to extract the requested study (meta)data information.
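The idea behind converting tabular phenotype data into linked data can be sketched by hand. The example below is not the YARRRML template or the FAIR-in-a-box transformation service; it swaps in a few lines of plain Python that emit N-Triples. The namespace, the predicate names and the rows are hypothetical placeholders and do not follow the Phenopackets RDF schema.

```python
import csv
import io
from urllib.parse import quote

BASE = "https://example.org/"  # hypothetical namespace
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
XSD_INTEGER = "http://www.w3.org/2001/XMLSchema#integer"

# Made-up rows standing in for a phenotype .csv file.
EXAMPLE_CSV = """subject_id,age,sex
S1,54,female
S2,61,male
"""


def escape_literal(text):
    """Escape backslashes and quotes for an N-Triples string literal."""
    return text.replace("\\", "\\\\").replace('"', '\\"')


def row_to_triples(row):
    """Yield one N-Triples line per statement about a single individual."""
    subject = f"<{BASE}individual/{quote(row['subject_id'], safe='')}>"
    yield f"{subject} <{RDF_TYPE}> <{BASE}vocab/Individual> ."
    yield f'{subject} <{BASE}vocab/age> "{int(row["age"])}"^^<{XSD_INTEGER}> .'
    yield f'{subject} <{BASE}vocab/sex> "{escape_literal(row["sex"])}" .'


def csv_to_ntriples(handle):
    """Convert every CSV row into triples and join them into one document."""
    lines = []
    for row in csv.DictReader(handle):
        lines.extend(row_to_triples(row))
    return "\n".join(lines) + "\n"


ntriples = csv_to_ntriples(io.StringIO(EXAMPLE_CSV))
print(ntriples)
```

A declarative mapping such as YARRRML expresses the same row-to-triple rules as configuration, which makes them easier to review and reuse than hand-written code.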

To best assist researchers in FAIRification of their experimental (meta)data that is used as input for the FDCube, a containerized environment was created for use of the ISA-API [30], with connection to the ISA cookbook [31].

Querying of metadata

The FDP portal can display complete or partial metadata in a human-readable format for browsing, searching and querying. The FAIRified metadata of the TWOC demonstrator dataset was published on an FDP portal [32], as shown in Fig. 3. A SPARQL query can be run against the metadata via the SPARQL query portal to extract any requested study (meta)data information, as illustrated in Fig. 4. After finding an interesting dataset via browsing or SPARQL queries, the researcher can run follow-up analyses on a target dataset, for example by submitting a computation request to the Vantage6 server and, if the request succeeds, retrieving the computation results from the data station via Vantage6.

Multi-omics data analysis

Together with the previous steps as described in Demonstration of FDCube in TWOC section, the FDCube’s capability to support multi-omics data analysis is demonstrated in this subsection, based on the TWOC demonstrator example project. The FDCube makes use of several FAIR resources and uses pathway information collected from WikiPathways [23] to analyse transcriptomics and proteomics data from COVID-19 patients. In this example, the dataset is processed as described in Storage of raw and processed -omics data. Data and code are publicly accessible at the TWOC’s GitHub repository [24, 33].

The example consists of the following steps, as illustrated in Fig. 5.

Querying FDPs to identify relevant COVID-19 resources and their storage location.
Fetching data of individuals participating in the study, including phenotypic information such as COVID-19 and ICU admission status.
Obtaining subject identifiers, and using them to fetch study samples including their measurements data, as collected from the subjects.
Retrieving experimental study group information (i.e., subject with COVID-19 disease, healthy control subject, ICU-admitted, and non-ICU-admitted patients) from the sample metadata in the FDP.
Identifying a COVID-19 relevant pathway (SARS-CoV-2 innate immunity evasion and cell-specific immune response, identifier WP5039), and
Retrieving the gene products for the identified pathway (proteins, genes, and metabolites) by querying the WikiPathways [23] SPARQL endpoint.
Identifying the proteins and genes in the COVID-19 data set that are part of the gene products retrieved. Then analyzing the identified transcript and protein feature levels for the different study groups. In this step, the BridgeDB web service was used for ontology-based cross-mapping of transcript and protein identifiers from the different data sources, which use different identifiers for the same features. The overlap of the features identified in both -omics datasets is illustrated in Fig. 6.
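The last step, mapping features from two -omics layers to a common identifier and intersecting them with a pathway, can be sketched as follows. The lookup tables below are hypothetical stand-ins for the result of an identifier-mapping service and a pathway query; none of the identifiers are real, and no web service is called.

```python
# Hypothetical lookup tables; all identifiers are invented placeholders.
TRANSCRIPT_TO_GENE = {"ENSG_A": "GENE1", "ENSG_B": "GENE2", "ENSG_C": "GENE3"}
PROTEIN_TO_GENE = {"PROT_X": "GENE2", "PROT_Y": "GENE3", "PROT_Z": "GENE4"}
PATHWAY_GENES = {"GENE2", "GENE3", "GENE4", "GENE5"}


def features_in_pathway(feature_ids, id_to_gene, pathway_genes):
    """Map feature ids to gene symbols and keep those found in the pathway.

    Returns {gene_symbol: [feature ids that map to it]}.
    """
    hits = {}
    for feature_id in feature_ids:
        gene = id_to_gene.get(feature_id)   # None when no mapping exists
        if gene in pathway_genes:
            hits.setdefault(gene, []).append(feature_id)
    return hits


transcript_hits = features_in_pathway(
    TRANSCRIPT_TO_GENE, TRANSCRIPT_TO_GENE, PATHWAY_GENES)
protein_hits = features_in_pathway(
    PROTEIN_TO_GENE, PROTEIN_TO_GENE, PATHWAY_GENES)

# Genes of the pathway that are measured in both -omics layers.
shared = sorted(set(transcript_hits) & set(protein_hits))
transcript_only = sorted(set(transcript_hits) - set(protein_hits))
protein_only = sorted(set(protein_hits) - set(transcript_hits))
```

Mapping both layers to one shared identifier first is what makes the set intersection meaningful, since the raw transcript and protein identifiers never match directly.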

One of the common features from the SARS-CoV-2 immune response pathway identified at both the transcript and protein level was Interleukin-10 (IL-10). The abundances of the transcript and protein were retrieved from the transcriptomics and proteomics datasets, together with the phenotype information of the individuals in which these abundance levels were measured. There were three groups of individuals, namely the COVID-19 patients in the ICU, the COVID-19 patients not in the ICU, and healthy individuals. The resulting box plots of IL-10 levels for these groups of individuals are presented in Figs. 7 and 8.
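Grouping abundance values by study group, as needed before drawing such box plots, might look like the sketch below. The records are arbitrary placeholder numbers with no biological meaning; they are not taken from the dataset and do not reproduce the figures.

```python
import statistics
from collections import defaultdict

# Arbitrary placeholder records: (subject id, study group, abundance).
RECORDS = [
    ("S1", "healthy", 3.0),
    ("S2", "healthy", 5.0),
    ("S3", "COVID-19 non-ICU", 1.0),
    ("S4", "COVID-19 non-ICU", 2.0),
    ("S5", "COVID-19 non-ICU", 6.0),
    ("S6", "COVID-19 ICU", 4.0),
    ("S7", "COVID-19 ICU", 4.0),
]


def summarise_by_group(records):
    """Collect values per study group and compute simple summary statistics."""
    grouped = defaultdict(list)
    for _subject, group, value in records:
        grouped[group].append(value)
    return {
        group: {
            "n": len(values),
            "median": statistics.median(values),
            "min": min(values),
            "max": max(values),
        }
        for group, values in grouped.items()
    }


summary = summarise_by_group(RECORDS)
```

The per-group value lists built inside `summarise_by_group` are the same inputs a plotting library would need to draw one box per group.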

The availability of FAIR data resources makes it possible to combine different data sources as shown in this multi-omics data analysis. This enables interoperability and reusability of data in a fast and efficient manner.

Federated analysis

This section demonstrates the federated analysis capabilities of the FDCube by showing how to deliver an algorithm to a dataset via the Vantage6 component. This example dataset is also a .csv file prepared from the TWOC demonstrator study. Unlike the previous example, where the dataset is publicly available on GitHub, this dataset remains in a secure environment managed by the dataset owner. The only way to access the dataset is through a Vantage6 component.

A Vantage6 node is typically installed at a dataset station. For security reasons, the dataset station could reside in an access-protected environment, for example a Digital Research Environment [34], which is a cloud-based, globally available research environment.

The Vantage6 server handles authentication, keeps track of all computation requests, assigns them to nodes for computation, and stores the returning results of the analyses. The Vantage6 server could also host a private Docker registry.

Vantage6 delivers the user’s computational request to a (FAIR) data station. A computation request consists of:

A reference to a Docker image, which contains the code (computation algorithm) that the researcher would like to run on the target dataset;
A list describing the dataset of interest and its purpose-of-use.

Figure 9 shows the Vantage6 user interface, at which a researcher can create a task to send to the data owner(s) for federated analysis.

In this example, we used an averaging algorithm hosted on Docker Hub (footnote 1). This algorithm expects an argument ‘column_name’ to be defined, and will compute the average over that column. We specified in the kwargs fields the parameter ‘column_name’ with value ‘age’. The averaging algorithm is dispatched to run on a Vantage6 node, where the dataset is stored. In this example, the dataset is a .csv file prepared from the FAIRified TWOC demonstrator study, which contains a column titled ‘age’. The ‘Database’ field in Fig. 9 is labeled as ‘default’, which is configurable in the Vantage6 node configuration file. For simplicity, this task is created for a collaboration with only one organization (in our example: Radboudumc).
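The ingredients of such a computation request can be written down as a small data structure. The sketch below is only an illustration of the information described in the text; the function `build_task_request` and its key names are hypothetical and are not the Vantage6 client API. The image reference, the ‘column_name’ argument, the ‘default’ database label and the organization name are the ones named in this example.

```python
import json


def build_task_request(image, column_name, database="default", organizations=()):
    """Assemble the pieces of a computation request (hypothetical layout)."""
    if not image:
        raise ValueError("a Docker image reference is required")
    return {
        "image": image,                                  # algorithm to run
        "input": {"kwargs": {"column_name": column_name}},
        "database": database,                            # label set in node config
        "organizations": list(organizations),            # who should run the task
    }


request = build_task_request(
    "harbor2.vantage6.ai/demo/average",
    column_name="age",
    organizations=["Radboudumc"],
)
payload = json.dumps(request, sort_keys=True)
print(payload)
```

In practice the task would be created through the Vantage6 user interface shown in Fig. 9 or through the Vantage6 client, which define their own field names.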

Figure 10 shows the result of running the averaging algorithm on the patients’ age in the TWOC dataset, which specifically calculates the average value in the column labelled ‘age’. This result can be passed back as the response to the computation request.
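How an average can be computed without moving individual records is sketched below. This is a generic illustration of the idea, not the code inside the averaging image: each station returns only a sum and a count, and a central step combines them. The demonstration above involves a single organization; two stations with made-up ages are used here purely to show the aggregation.

```python
import csv
import io


def local_partial(handle, column_name):
    """Runs at a data station: returns only the sum and count of a column."""
    total, count = 0.0, 0
    for row in csv.DictReader(handle):
        value = row.get(column_name)
        if value in ("", None):       # ignore missing values
            continue
        total += float(value)
        count += 1
    return {"sum": total, "count": count}


def combine_partials(partials):
    """Runs centrally: combines per-station aggregates into a global mean."""
    total = sum(p["sum"] for p in partials)
    count = sum(p["count"] for p in partials)
    if count == 0:
        raise ValueError("no values available to average")
    return total / count


# Made-up station contents; real data would never leave the station.
station_1 = io.StringIO("age\n30\n40\n")
station_2 = io.StringIO("age\n50\n60\n70\n")

partials = [local_partial(station_1, "age"), local_partial(station_2, "age")]
global_average = combine_partials(partials)
```

Returning sums and counts rather than per-station means keeps the combined result correct when stations hold different numbers of records.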

Conclusion

We have created the FDCube, a software and programmatic infrastructure to make (multi-)omics data FAIR, and to facilitate the management, reuse, integration and (federated) analysis of biomedical (-omics) data. The FDCube ensures data sovereignty, by utilizing Vantage6’s capability of ‘bringing research questions to data’ rather than ‘sending data to research questions’. Vantage6’s management capability covers comprehensive aspects (including organization, collaboration, users, roles, nodes and tasks), and makes FDCube a useful platform to carry out cross-organization federated analysis on decentralized datasets.

We used the FDCube in the TWOC project to demonstrate its capability and usage in creating and publishing ISA and phenotype metadata, browsing and querying the metadata on the FDP, and creating and running federated data analysis on a real dataset.

There are several ways to improve and extend the design and implementation of the current FDCube.

We are exploring the FAIR Data Station [3] for the creation of metadata, which allows a user to create a metadata template by selecting metadata fields and sheets corresponding to the user’s research, in our case, the ISA metadata schema. The metadata information captured will be ultimately transformed into a Linked Data file after a validation process.

A Beacon [35] component can be integrated into the FDCube. The reason for this integration is that an FDP (by design) only exposes metadata of datasets. In contrast, Beacon allows for more insight into the content of the dataset itself, for example the presence or absence of specific genomic mutations in a set of data [35]. The combined information from both metadata (via the FDP) and real data (via Beacon) would help a researcher to get more insight into possibly available datasets before designing a data analysis request as dictated by the researcher’s study questions.

Another potential direction is to integrate DataSHIELD with Vantage6 within the FDCube, in order to give Vantage6 users access to the rich set of analysis algorithms available in DataSHIELD.

Data availability

Not applicable.

Code availability

https://github.com/Xomics/FAIRDataCube.

Notes

1. harbor2.vantage6.ai/demo/average

Abbreviations

COVID-19: Coronavirus disease 2019
DNA: Deoxyribonucleic Acid
EMBL-EBI: EMBL’s European Bioinformatics Institute
FAIR: Findable, Accessible, Interoperable, and Reusable
FDCube: FAIR Data Cube
FDP: FAIR Data Point
ISA: Investigation, Study, Assay
PHT: Personal Health Train
RDF: Resource Description Framework
RNA: Ribonucleic Acid
SPARQL: SPARQL Protocol and RDF Query Language
TWOC: Trusted World of Corona
IL-10: Interleukin-10
OmicsDI: Omics Discovery Index

References

Wilkinson MD, Dumontier M, Aalbersberg IJ, Appleton G, Axton M, Baak A, et al. The FAIR Guiding Principles for scientific data management and stewardship. Sci Data. 2016;3(1):1–9.
Article Google Scholar
Trust World of Corona. https://www.health-holland.com/project/2020/trusted-world-of-corona. Accessed 19 Apr 2020.
Nijsse B, Schaap PJ, Koehorst JJ. FAIR Data Station for Lightweight Metadata Management & Validation of Omics Studies. bioRxiv. 2022. https://doiorg.publicaciones.saludcastillayleon.es/10.1101/2022.08.03.502622.
FiaB: FAIR-in-a-box. https://github.com/ejp-rd-vp/FiaB. Accessed 19 Apr 2020.
DataFAIRifier. https://github.com/MaastrichtU-CDS/DataFAIRifier. Accessed 19 Apr 2020.
van der Velde KJ, Imhann F, Charbon B, Pang C, van Enckevort D, Slofstra M, et al. MOLGENIS research: advanced bioinformatics data software for non-bioinformaticians. Bioinformatics. 2019;35(6):1076–8.
Article Google Scholar
van der Velde KJ, Singh G, Kaliyaperumal R, Liao X, de Ridder S, Rebers S, et al. FAIR Genomes metadata schema promoting Next Generation Sequencing data reuse in Dutch healthcare and research. Sci Data. 2022;9(1):169.
Article Google Scholar
Beyan O, Choudhury A, van Soest J, Kohlbacher O, Zimmermann L, Stenzhorn H, et al. Distributed Analytics on Sensitive Medical Data: The Personal Health Train. Data Intell. 2020;2(1–2):96–107.
Article Google Scholar
Gaye A, Marcon Y, Isaeva J, LaFlamme P, Turner A, Jones EM, et al. DataSHIELD: taking the analysis to the data, not the data to the analysis. Int J Epidemiol. 2014;43(6):1929–44. https://doiorg.publicaciones.saludcastillayleon.es/10.1093/ije/dyu188.
Article MATH Google Scholar
Moncada-Torres A, Martin F, Sieswerda M, van Soest J, Geleijnse G. VANTAGE6: an open source priVAcy preserviNg federaTed leArninG infrastructurE for Secure Insight eXchange. In: AMIA Annual Symposium Proceedings. 2020. pp. 870–7.
Smits D, van Beusekom B, Martin F, Veen L, Geleijnse G, Moncada-Torres A. An Improved Infrastructure for Privacy-Preserving Analysis of Patient Data. In: Proceedings of the International Conference of Informatics, Management, and Technology in Healthcare (ICIMTH), vol. 295. 2022. pp. 144–7.
R Core Team. R: A Language and Environment for Statistical Computing. Vienna, Austria; 2021. https://www.R-project.org/.
European Commission. Directorate-General for Research & Innovation. H2020 Programme Guidelines on FAIR Data Management in Horizon 2020. 2016.
da Silva Santos LOB, Burger K, Kaliyaperumal R, Wilkinson MD. FAIR Data Point: A FAIR-Oriented Approach for Metadata Publication. Data Intell. 2022;1–21. https://doiorg.publicaciones.saludcastillayleon.es/10.1162/dint_a_00160.
Musen MA, Bean CA, Cheung KH, Dumontier M, Durante KA, Gevaert O, et al. The center for expanded data annotation and retrieval. J Am Med Inform Assoc. 2015;22(6):1148–52. https://doiorg.publicaciones.saludcastillayleon.es/10.1093/jamia/ocv048.
Article Google Scholar
Wolstencroft K, Krebs O, Snoep JL, Stanford NJ, Bacall F, Golebiewski M, et al. FAIRDOMHub: a repository and collaboration environment for sharing systems biology research. Nucleic Acids Res. 2016;45(D1):D404–7. https://doiorg.publicaciones.saludcastillayleon.es/10.1093/nar/gkw1032.
Article Google Scholar
Perez-Riverol Y, Bai M, da Veiga Leprevost F, Squizzato S, Park YM, Haug K, et al. Discovering and linking public omics data sets using the Omics Discovery Index. Nat Biotechnol. 2017;35(5):406–9.
Article Google Scholar
Sansone SA, Rocca Serra P, Field D, Maguire E, Taylor C, Hofmann O, et al. Toward interoperable bioscience data. Nat Genet. 2012;44(2):121–6. https://doiorg.publicaciones.saludcastillayleon.es/10.1038/ng.1054.
Article Google Scholar
Johnson D, Batista D, Cochrane K, Davey RP, Etuk A, Gonzalez-Beltran A, et al. ISA API: An open platform for interoperable life science experimental metadata. GigaScience. 2021;10(9):Giab060. https://doiorg.publicaciones.saludcastillayleon.es/10.1093/gigascience/giab060.
Article Google Scholar
Ladewig MS, Jacobsen JOB, Wagner AH, Danis D, El Kassaby B, Gargano M, et al. GA4GH Phenopackets: A Practical Introduction. Adv Genet. https://doiorg.publicaciones.saludcastillayleon.es/10.1002/ggn2.202200016.
MetaboLights. https://www.ebi.ac.uk/metabolights/. Accessed 19 Apr 2020.
Su Y, Chen D, Yuan D, Lausted C, Choi J, Dai CL, et al. Multi-Omics Resolves a Sharp Disease-State Shift between Mild and Moderate COVID-19. Cell. 2020;183(6):1479–1495.e20. https://doiorg.publicaciones.saludcastillayleon.es/10.1016/j.cell.2020.10.037.
Article MATH Google Scholar
Agrawal A, Balcı H, Hanspers K, Coort SL, Martens M, Slenter DN, et al. WikiPathways 2024: next generation pathway database. Nucleic Acids Res. 2023;52(D1):D679–89. https://doiorg.publicaciones.saludcastillayleon.es/10.1093/nar/gkad960.
Article Google Scholar
TWOC demonstrator. https://github.com/Xomics/TWOCdemonstrator/tree/main/data/Su_2020_original/phenotypes_in_modules. Accessed 19 Apr 2020.
Rocca-Serra P, Maguire E, Taylor C, Field D, Wittenberger T, Santarsiero A, et al. 7 - Investigation-Study-Assay, a toolkit for standardizing data capture and sharing. In: Harland L, Forster M, editors. Open Source Software in Life Science Research. Woodhead Publishing Series in Biomedicine. Woodhead Publishing; 2012. pp. 173–88. https://doiorg.publicaciones.saludcastillayleon.es/10.1533/9781908818249.173.
Prud’hommeaux E, Carothers G, editor. RDF 1.1 Turtle. http://www.w3.org/TR/2014/REC-turtle-20140225/. Accessed 26 Dec 2024.
TWOC Demonstrator Tools. https://github.com/Xomics/TWOCdemonstrator/tree/main/tools. Accessed 19 Apr 2020.
Heyvaert P, De Meester B, Dimou A, Verborgh R, et al. Declarative Rules for Linked Data Generation at Your Fingertips! In: Gangemi A, Gentile AL, Nuzzolese AG, Rudolph S, Maleshkova M, Paulheim H, et al., editors. The Semantic Web: ESWC 2018 Satellite Events. Cham: Springer International Publishing; 2018. p. 213–7.
Chapter Google Scholar
Phenopackets RDF Sschema. https://github.com/LUMC-BioSemantics/phenopackets-rdf-schema. Accessed 19 Apr 2020.
ISA tools API. https://isa-tools.org/isa-api/content/index.html. Accessed 19 Apr 2020.
ISA tools environment. https://github.com/Xomics/Isatools_environment. Accessed 19 Apr 2020.
The FAIR Data Point in CMBI. https://fdp.cmbi.umcn.nl. Accessed 19 Apr 2020.
TWOC Demonstrator Interleukin-6 (IL-6) Analysis. https://github.com/Xomics/TWOCdemonstrator/blob/main/tools/python_read_omics/IL6.ipynb. Accessed 07 May 2020.
Digital Research Environment. https://www.radboudumc.nl/en/research/radboud-technology-centers/data-stewardship/digital-research-environment. Accessed 19 Apr 2020.
Rambla J, Baudis M, Ariosa R, Beck T, Fromont LA, Navarro A, et al. Beacon v2 and Beacon networks: A “lingua franca” for federated data discovery in biomedical genomics, and beyond. Hum Mutat. 2022;43(6):791–9. https://doi.org/10.1002/humu.24369.

Funding

This work was funded by a Dutch Research Council (NWO) grant to The Netherlands X-omics Initiative (project 184.034.019), a Horizon 2020 grant to the European Joint Programme on Rare Diseases (grant agreement number 825575), a Horizon 2020 grant to the EATRIS-Plus project (grant agreement number 871096), an NWO Open Science Fund grant (grant agreement number 17703) and an LSH HealthHolland grant to the Trusted World of Corona (TWOC) consortium.

Author information

Authors and Affiliations

Medical BioSciences Department, Radboud University Medical Center, Nijmegen, The Netherlands
Xiaofeng Liao, Thomas H.A. Ederveen, Anna Niehues, Casper de Visser, Junda Huang, Firdaws Badmus, Cenna Doornbos, Purva Kulkarni & Peter A. C. ’t Hoen
Translational Metabolic Laboratory, Department of Laboratory Medicine, Radboud University Medical Center, Nijmegen, The Netherlands
Purva Kulkarni & Alain J. van Gool
Department of Human Genetics, Radboud University Medical Center, Nijmegen, The Netherlands
Purva Kulkarni & Alain J. van Gool
Genomics Coordination Center, University of Groningen and University Medical Center Groningen, Groningen, The Netherlands
K. Joeri van der Velde & Morris A. Swertz
SURF, Science Park 140, 1098 XG, Amsterdam, The Netherlands
Yuliia Orlova & Martin Brandt

Authors

Xiaofeng Liao
Thomas H.A. Ederveen
Anna Niehues
Casper de Visser
Junda Huang
Firdaws Badmus
Cenna Doornbos
Yuliia Orlova
Purva Kulkarni
K. Joeri van der Velde
Morris A. Swertz
Martin Brandt
Alain J. van Gool
Peter A. C. ’t Hoen

Contributions

P.A.C.H., A.J.G. and M.A.S. conceived the project. J.H. worked on phenotype data modelling. A.N. and C.V. worked on ISA metadata. T.E. managed the connection to the TWOC project and the FAIRification of the presented dataset. P.K. worked on lipidomics metadata. C.D. promoted FDCube and provided scientific feedback. F.B. worked on the multi-omics analysis example. Y.O. implemented the FDCube as a catalog item in SURF Research Cloud. M.B. supported the hosting environment. K.J.V. provided insights from the MOLGENIS perspective. A.N. presented the high-level concept diagram. C.D. revised Fig. 2. X.L. implemented and set up the architecture of FDCube with help from all team members. X.L. wrote the manuscript with critical input and revisions from T.E., A.N., C.D., C.V., J.H., P.A.C.H., P.K., K.J.V. and A.J.G. All authors reviewed the manuscript.

Corresponding authors

Correspondence to Xiaofeng Liao or Peter A. C. ’t Hoen.

Ethics declarations

Ethics approval and consent to participate

Not applicable

Consent for publication

Not applicable

Competing interests

The authors declare no competing interests.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.

About this article

Cite this article

Liao, X., Ederveen, T., Niehues, A. et al. FAIR Data Cube, a FAIR data infrastructure for integrated multi-omics data analysis. J Biomed Semant 15, 20 (2024). https://doi.org/10.1186/s13326-024-00321-2

Received: 13 November 2023
Accepted: 02 December 2024
Published: 28 December 2024
DOI: https://doi.org/10.1186/s13326-024-00321-2
