Linking raw data, derived data and publications in a VO-compliant and FAIR open science environment: here we present some functionalities of a model for a science gateway in INAF
This paper grew out of a collaboration between a librarian (Francesca Martines) and the head of INAF IA2, the Italian center for Astronomical Archives (Cristina Knapic), to draft:
a DMP template to be used in different use cases, such as different data providers, telescopes and/or simulations
a study of a complete data model that is as FAIR as possible
policy suggestions/rules to enhance the citation of observational data in papers
Moreover, this work will give the INAF astronomical community an integrated system that improves author and paper citation by tracing the data sets used. In this paper we focus on the study of a complete data model that is as FAIR as possible, looking for an efficient way to identify and cite data and datasets and to link them to related publications. We briefly describe the work done in recent months and future perspectives.
Today science produces huge amounts of data, which need to be correctly managed at every stage of their existence, even after the project that generated them has finished. There is also a very strong push towards open science, based on the principle that the results of publicly funded research must be (more or less) freely available. It is therefore necessary to make data as FAIR (Findable, Accessible, Interoperable, and Reusable) as possible.
The skills of librarians can make a useful contribution to the management of research data beyond the classic tasks of cataloging and classification, since the essence of a librarian's work is to organize information in standard ways and allow its discovery and use. To achieve this goal, the information must be correctly structured and described, as is the case for the materials that librarians traditionally deal with and for any other type of information. Librarians' skills are therefore very important in Research Data Management (RDM), both in implementing archival standards and in advising on the standardization and semantics of data descriptors.
Based on these assumptions, a collaboration was developed between Francesca Martines, a librarian based at the Astronomical Observatory of Palermo, and Cristina Knapic, Head of INAF-IA2, based at the Astronomical Observatory of Trieste. IA2 is the Italian center for Astronomical Archives, an e-infrastructure of the Italian National Institute for Astrophysics (INAF). IA2 aims to coordinate various national initiatives to improve the quality of services for astronomical data and to facilitate access to and reuse of data for research purposes. It has three main activities: a) managing and hosting the archives of telescopes, including LBT, TNG, the Italian radio antennas and minor Italian telescopes; b) providing support for the publication of data and resources following the Virtual Observatory guidelines and suggestions; c) offering the INAF community medium and long-term storage support, including data sharing.
Our work has several objectives:
provide a Data Management Plan (DMP) template to be used in different cases by different data providers, telescopes and/or simulations
achieve a complete data model based on FAIR principles
develop policy suggestions/rules to enhance the citation of observational data in papers
Most astronomers' work is based on observational data, which are subsequently reduced, filtered, and processed to arrive at the final result, which is normally the subject of a publication. While raw data are normally stored in the archives of the observing facilities, and are therefore structured and equipped with metadata, the products of subsequent processing (i.e. what we call "derived data") normally fall outside the scope of the observing infrastructures. In fact, unless a specific Data Management Plan (DMP) establishes where the data must be stored and how they must be structured and provided with metadata, all the processing work done by researchers is normally stored on the PCs of individual users or, at best, in some generic storage space. As a result, these data are not findable, accessible, interoperable or reusable.
The best solution would be to have a DMP scheme, with related support, to guide researchers in managing their research data. Furthermore, an institutional data policy is essential to ensure the maximum consistency of metadata. At the same time, we are also looking for a management model that makes it possible to:
1) correctly identify and cite raw and processed datasets
2) link them to publications while making the data as FAIR as possible.
Since IA2 provides all INAF researchers with medium and long-term storage space, together with the possibility of creating Digital Object Identifiers (DOIs), it was decided to offer in the future the possibility of assigning a DOI to one or more folders that contain "data" in the broadest sense of the term. Raffaele D'Abrusco and Sherry Winkelman of the Chandra Data Archive (CDA) at the Chandra X-ray Center (CXC) confirmed that the key to tracing the path from the data to the publication is a unique ID, which can be the Observation ID or the Proposal ID.
Our proposal can be summarized as follows: derived data will be stored in INAF-IA2 cloud storage and a DOI will be assigned. During the DOI application process, the applicant should provide a set of information about the hosted data that will constitute the related metadata. In the case of data derived from observations, metadata of the raw data must also be provided, based on the IVOA Observation Core Components Data Model. In this way, a link is established between the raw and derived data. The link is represented by the metadata attached to the DOI, which, in addition to describing the data to which they point, also contain information about the observational data from which those data are derived. Thus, if one of the elements identifying the observation or the DOI of the derived data is cited in the publication, all the other elements can be accessed. The relationship between data and DOI can therefore be defined programmatically, while the relationship between the DOI of the dataset and the publication is managed outside the institution, but can be traced (fig. 1).
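The linking idea described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the INAF-IA2 implementation: the class names, the `build_index` and `resolve` helpers, and all identifier values are hypothetical, while the three observation fields borrow the names of ObsCore columns (`obs_id`, `obs_collection`, `obs_publisher_did`).

```python
from dataclasses import dataclass, field
from typing import Dict, List, Set


@dataclass(frozen=True)
class RawObservation:
    """Identifiers of a raw observation (field names follow ObsCore columns)."""
    obs_id: str
    obs_collection: str
    obs_publisher_did: str


@dataclass
class DerivedDataset:
    """Hypothetical metadata record attached to the DOI of a derived dataset."""
    doi: str
    title: str
    derived_from: List[RawObservation] = field(default_factory=list)


def build_index(datasets: List[DerivedDataset]) -> Dict[str, List[DerivedDataset]]:
    """Map every identifier (DOI, obs_id, publisher DID) to the records holding it."""
    index: Dict[str, List[DerivedDataset]] = {}
    for ds in datasets:
        keys = {ds.doi}
        for obs in ds.derived_from:
            keys.update({obs.obs_id, obs.obs_publisher_did})
        for key in keys:
            index.setdefault(key, []).append(ds)
    return index


def resolve(index: Dict[str, List[DerivedDataset]], cited: str) -> Set[str]:
    """Return every identifier reachable from the one cited in a publication."""
    linked: Set[str] = set()
    for ds in index.get(cited, []):
        linked.add(ds.doi)
        for obs in ds.derived_from:
            linked.update({obs.obs_id, obs.obs_publisher_did})
    return linked


if __name__ == "__main__":
    # All values below are made up for illustration.
    record = DerivedDataset(
        doi="10.0000/example-derived-1",
        title="Reduced spectra (example)",
        derived_from=[
            RawObservation("OBS-0001", "EXAMPLE", "ivo://example/archive?OBS-0001")
        ],
    )
    index = build_index([record])
    print(sorted(resolve(index, "OBS-0001")))
```

Citing either the observation identifier or the dataset DOI leads to the same set of linked identifiers, which is the property the proposal relies on. The link from DOI to publication is deliberately left out of the sketch, since it is managed outside the institution.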
Concerning the metadata assigned to the derived data, there are currently no formal standards for the citation or metadata description of datasets. Much work has been done by various organizations, including the Research Data Alliance (RDA), which has a working group on data citation that has issued several outputs, and the UK Digital Curation Centre (DCC), which has produced a guide on how to cite datasets and link them to publications. We used these and other works to identify a set of metadata to attribute to the derived data, in order to make them as FAIR as possible. INAF IA2 has developed a draft of the dataset metadata sheet (fig. 2) to be filled in by the applicant in order to obtain a DOI. Clearly this type of metadata can be attributed not only to datasets, but also to software, multimedia products and more, albeit with some differences.
Figure 2. Draft of the DOI metadata generator elements by INAF-IA2. Option values shown in the form include: each INAF structure, DG, DS; directories and archives (.7z, .gz, .tar, .tarbz2, .tbz2, .tar.gz, .tgz, .tar.tlz, .tlz, .tar.xz, .txz, .zip); KB, MB, GB, TB; associated papers, if any (DOI, title/author/publication date); Apache 2.0, BSD-3-Clause, GPLv3, MIT, Other; file upload (local).
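As an illustration of how a DOI request could be checked against the options in the draft sheet, the following Python sketch validates the license, size unit and container format of a submission. The option values are copied from Figure 2. The request field names (`license`, `size_unit`, `path`, `is_directory`) and both function names are assumptions made for this example and do not describe the actual INAF-IA2 form.

```python
from typing import Dict, List

# Option values copied from the draft metadata sheet (Figure 2).
ALLOWED_LICENSES = {"Apache 2.0", "BSD-3-Clause", "GPLv3", "MIT", "Other"}
ALLOWED_SIZE_UNITS = {"KB", "MB", "GB", "TB"}
ALLOWED_ARCHIVE_SUFFIXES = (
    ".7z", ".gz", ".tar", ".tarbz2", ".tbz2", ".tar.gz", ".tgz",
    ".tar.tlz", ".tlz", ".tar.xz", ".txz", ".zip",
)


def is_accepted_format(name: str, is_directory: bool = False) -> bool:
    """Accept plain directories or files ending in one of the listed suffixes."""
    if is_directory:
        return True
    return name.lower().endswith(ALLOWED_ARCHIVE_SUFFIXES)


def validate_request(request: Dict[str, object]) -> List[str]:
    """Return a list of problems; an empty list means the request passes."""
    problems: List[str] = []
    if request.get("license") not in ALLOWED_LICENSES:
        problems.append("license is not one of the listed options")
    if request.get("size_unit") not in ALLOWED_SIZE_UNITS:
        problems.append("size unit must be KB, MB, GB or TB")
    path = str(request.get("path", ""))
    if not is_accepted_format(path, bool(request.get("is_directory", False))):
        problems.append("data must be a directory or a listed archive type")
    return problems


if __name__ == "__main__":
    # Hypothetical submission used only to exercise the checks.
    print(validate_request(
        {"license": "MIT", "size_unit": "GB", "path": "spectra.tar.gz"}
    ))
```

Returning a list of problems rather than raising on the first one lets an applicant correct all fields in a single pass.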
Our work involves completing the model definition for citing raw and derived data. The next steps will be the drafting of a data citation policy based on the model developed, which obliges INAF personnel to correctly cite the data in their work, and its integration with INAF's Open Access policy and repository. At the same time, we are working on a DMP template to be provided to INAF personnel so that they can correctly manage their research data throughout the life cycle. All this will be proposed to the President and the Scientific Direction of INAF.
One of the main problems is the difficulty of checking and verifying that INAF researchers cite the data in their papers, since such verification would require a considerable staffing effort, which at the moment is difficult to envisage. This need could lead to the creation of new positions, such as Data Stewards, which INAF does not currently have. How this develops remains to be seen.
We would like to deeply thank Raffaele D'Abrusco and Sherry Winkelman of the Chandra Data Archive (CDA) at the Chandra X-ray Center (CXC), for their kindness and help.