The ADS is expanding its editorial policy to include the indexing of research objects such as data and software, and tracking their citations and usage
Today’s scholarly, born-digital articles are no longer best represented by single documents but rather consist of a narrative connecting a collection of research components. They contain references to other papers, data products, people, institutions, and funding sources, among others [1]. Since the whole of the products referenced by a scholarly article forms the best representation of the science discussed in the article, having a proper coverage of these sources will help in properly representing different aspects and stages of research life cycles. Just capturing citations to other scholarly publications would not only leave out important information, it would also omit proper attribution to researchers who contribute as authors of non-traditional research products. The ADS has implemented a workflow for capturing software citations, the concept of which was presented at LISA VIII [2]; this workflow allows the detection and ingest of citations to software products used in scholarly publications [3]. Since both data and software citations are crucial for the transparency of research results and for the transmission of credit [4], the ADS will implement indexing of high-level data products, in particular those published by NASA Archives, and track their citations. We will conclude by speculating how additional text mining and curation efforts can be used to further link the literature to additional resources mentioned in the papers.
The NASA Astrophysics Data System (ADS), a digital library widely used in Astronomy and related disciplines, provides a single discovery platform used on a daily basis by thousands of scientists and the general public. In order for the ADS to deliver on its mission, the project must ensure that our community’s needs and priorities are reflected in the content we curate and the services we provide. Because the ADS acts as a gatekeeper of the literature produced by our research community, decisions about what should be included in the ADS and how it should become discoverable have major consequences for a variety of individuals and projects. ADS users consist of more than just scientists in need of material pertinent to their research. Program managers may need to find the scholarly impact of their research portfolio (such as missions, instruments, facilities and grants), while librarians may use the ADS to maintain and curate their institutional bibliographies.
While the ADS was developed as a service in support of NASA Astrophysics research, it has been recognized as an important asset in NASA’s Science Mission Directorate (SMD) following the publication of its “Strategy for Data Management and Computing for Groundbreaking Science 2019-2024”1 and NASA’s recent efforts to promote Open Science. Open science is an evolving paradigm that seeks to foster greater inclusivity, diversity, and participation in the scientific process while increasing transparency and reproducibility. Providing support for Open Science requires building an information ecosystem which facilitates the discovery, access, and reuse of research artifacts which include literature, data, and software. In that regard, an information system such as the ADS can become a critical asset in supporting the FAIR principles (Findable, Accessible, Interoperable, Reusable) for data and software. To the extent that the ADS can enable the discovery and access to data and software connected to the scholarly literature, the system can be seen as a major component of the infrastructure supporting the goals of Open Science.
With the evolution of electronic publishing and the increasing availability of article content to indexing services, the ADS has been able to provide additional functionality to its users, such as the ability to search the full text of the manuscripts, rather than only their metadata. The availability of the full text (whether directly from a publisher or through an OA version of a paper) is also providing us with additional opportunities for metadata enrichment, including entity recognition, document classification, and resource linking. These developments, coupled with the adoption of best practices in publishing embodied in the FAIR principles, make it possible for the ADS to implement new approaches to the ingest and curation of its content.
In this paper we provide an overview of recent developments in ADS’s curation policies, highlighting the ways in which new content is selected, indexed, and linked to existing records in support of FAIR principles.
To further promote the FAIR principles and support Open Science goals, several publishers and scholarly societies have been encouraging (and in some cases mandating) that authors follow a number of best practices in the paper preparation process. Here we focus on a small subset of such requirements which have a significant impact on the process of content dissemination and discovery.
The introduction of Digital Object Identifiers (DOIs) over 20 years ago was a major step in creating a framework to support persistent link resolution and metadata indexing in scholarly publishing. Ten years later, the founding of DataCite2 along with community-driven efforts to promote data citation have generated universal agreement on the usefulness of registering research data products with DOIs.
While most scholarly users reap the benefits of DOI adoption thanks to their link persistence, the availability of metadata associated with these identifiers makes it possible for indexing services such as the ADS to easily acquire machine-readable versions of the associated resource metadata. While the descriptive metadata currently published by DataCite and other DOI-minting organizations do not provide the level of discipline-specific detail usually required by data discovery portals aimed at science use cases, their availability is nonetheless significant since it can enable a basic level of data discovery for these resources at web scale.
Since the introduction of the FAIR principles, a number of publishers have drafted publishing guidelines requiring that papers include an explicit statement about the availability of data and software required to support the paper’s conclusions. Some journals have gone even further aiming at enabling reproducibility of the paper’s results by providing the readers with the needed data and methods required to conduct the data analysis presented in the article. Examples of such requirements can be found in the AAS’s policy statement on software3, or the AGU’s Data & Software availability section4.
The inclusion of these sections in the manuscripts delivered to the ADS provides a way for us to extract from the full-text links to data and software products associated with them. This activity, in conjunction with additional heuristics described in Section 3.3, also provides us with a useful way to develop “data” filters enabling archive-specific selection of papers at search time (e.g. “find Cassini papers which have links to PDS data products”).
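The extraction step described above can be illustrated with a minimal sketch. It assumes the availability statement has already been isolated as plain text; the regular expression, the function name `extract_dois`, and the placeholder DOIs are illustrative assumptions, not the ADS pipeline itself.

```python
import re

# Assumed, simplified DOI pattern: a "10." prefix, a registrant code, a slash,
# and a suffix that runs until whitespace or a quote/angle bracket. Real DOIs
# may contain characters that this pattern does not handle.
DOI_PATTERN = re.compile(r"10\.\d{4,9}/[^\s\"<>]+")


def extract_dois(statement):
    """Return the unique DOIs found in a text fragment, in order of appearance."""
    found = []
    seen = set()
    for match in DOI_PATTERN.finditer(statement):
        # Drop punctuation that belongs to the sentence rather than the DOI.
        doi = match.group(0).rstrip(".,;)")
        if doi.lower() not in seen:  # DOIs are case-insensitive
            seen.add(doi.lower())
            found.append(doi)
    return found


# Placeholder statement with made-up DOIs, for illustration only.
statement = (
    "The data underlying this article are available at "
    "https://doi.org/10.5281/zenodo.0000001, and the analysis code is "
    "archived under doi:10.5281/zenodo.0000002."
)
print(extract_dois(statement))
```

A production workflow would presumably also resolve each identifier and check which archive or repository it belongs to before creating a link.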
Armed with increasingly open access to institutional, disciplinary, or general-purpose repositories for data and software, and empowered by DOI-minting technologies, scientists are now able and often willing to share their research software and data. The degree of enthusiasm towards this trend is often a personal matter, but appears to be in part a generational one. The tide of open science, amplified by mandates by funding agencies, requirements by publishers, and a cultural shift in the collaborative nature of the research enterprise, seems destined to change the culture for good.
The increased demands imposed on scientists by software and data sharing requirements as a prerequisite for publishing an article are tempered by a built-in reward system which turns software and data into citable research objects. This provides the scientists who have invested time and energy in preparing their data and software for public release a formal way to cite them in their paper, along with the prospect that they will be cited further by those who work with them in the future. The support for software and data citation, now universally accepted as a best practice in the physical sciences, usually involves the creation of a data or software release which can be uniquely identified by one or more DOIs. This, in turn, makes it possible for indexing services which track citations to identify the cited resources, which would otherwise be difficult to discover.
The ADS has used a well-established editorial process to identify and ingest content in its database5. The basic principles behind the selection criteria can be summarized as: in order for something to belong in the ADS it has to be relevant to astronomy, and it has to be scholarly. This usually meant considering the content produced by scholarly publishers, identifying the relevant astronomy journals and conference proceedings, and ingesting them as part of its “core collection.” In addition to the records identified by this initial process, the ADS aims to also ingest content that is connected to the core collection by citation analysis. Research papers and other artifacts that are connected to the core astronomy papers via formal citations represent the knowledge layer which is required to understand the research papers themselves. Thus the ADS periodically evaluates its coverage of the cited literature, leading us to make further decisions about what should be incorporated in our system on a regular basis.
This model has provided us with a successful methodology to increase the ADS data holdings over time based on the evolution of research disciplines and community feedback, including, most recently, the addition of exoplanet literature to our core collection. The process has also provided us with new opportunities to make decisions about what to incorporate based on citation analysis, rather than content classification. In this section we discuss some of the new workflows that the ADS is using to determine what content should be ingested in its system as well as how to enrich its existing records.
The availability of the full-text documents from publishers provides the ADS the opportunity to perform text mining efforts which were previously difficult or impossible. The typical source full-text article created by a publisher is a well-structured XML document where content is properly sectioned and annotated. This means, among other things, that the extraction of metadata, references, and other interesting parts of the text such as acknowledgments and the body of the paper is straightforward. Difficult tasks such as the extraction of affiliations, footnotes, or references, which used to require a number of heuristics, are now more easily accomplished.
In addition, it is now easy to identify hyperlinked resources and persistent identifiers such as ORCIDs and DOIs in these documents, and to extract the links associated with them. These links may appear in different sections of the paper, and the context in which they appear plays an important role in how the ADS treats them. This is particularly significant in the case of a citation of a resource vs. a mere mention of it. A citation is the formal inclusion in the bibliography section of a paper of a reference to a work. In the typical case, the reference itself is associated with the DOI of the cited work, if available, which makes its identification straightforward. Indexing systems such as the ADS and Web of Science will use the references from a paper’s bibliography to create and maintain their citation database.
In contrast, a mention of a work, whether a paper, software package, data product, or other artifact, may appear anywhere else in the paper. The typical places where such mentions are found in scientific articles are sections dedicated to acknowledgments, data and software availability statements, and footnotes. Including references to scholarly works as mentions will not generate a formal citation to them in the ADS. However, the presence of these mentions, when accompanied by a hyperlink (ideally via a persistent identifier) may be used by the ADS to create a link to the associated resource, depending on the context in which it appears, as discussed in section 3.3.
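The distinction between citations and mentions can be sketched in a few lines of code. The example below assumes a heavily simplified, JATS-like XML article in which links appear as the text of `ext-link` and `pub-id` elements; the sample document, the function name `classify_links`, and the context labels are hypothetical and are meant only to show how the enclosing section determines the treatment of a link.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified article; real publisher XML is far richer.
ARTICLE = """
<article>
  <body>
    <sec sec-type="data-availability">
      <p>Data are archived at <ext-link>https://doi.org/10.0000/example.data</ext-link>.</p>
    </sec>
  </body>
  <back>
    <ack><p>We used <ext-link>https://example.org/some-code</ext-link>.</p></ack>
    <ref-list>
      <ref id="r1"><pub-id pub-id-type="doi">10.0000/example.software</pub-id></ref>
    </ref-list>
  </back>
</article>
"""


def classify_links(xml_text):
    """Split identifiers into formal citations and mentions by their location."""
    citations, mentions = [], []

    def walk(element, context):
        # The nearest enclosing structural element decides the context.
        if element.tag == "ref-list":
            context = "bibliography"
        elif element.tag == "ack":
            context = "acknowledgments"
        elif element.tag == "fn":
            context = "footnote"
        elif element.tag == "sec" and element.get("sec-type"):
            context = element.get("sec-type")
        if element.tag in ("ext-link", "pub-id") and element.text:
            # Only identifiers inside the bibliography count as citations.
            target = citations if context == "bibliography" else mentions
            target.append((context, element.text.strip()))
        for child in element:
            walk(child, context)

    walk(ET.fromstring(xml_text), "body")
    return citations, mentions


citations, mentions = classify_links(ARTICLE)
print("citations:", citations)
print("mentions:", mentions)
```

Keeping the context label alongside each mention allows a later step to decide, per section type, whether a link to the resource should be created.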
Resources that are ingested in the ADS are represented by digital records which have, at a minimum, a set of metadata fields associated with them. What this means is that these resources have record identifiers (bibcodes) and metadata (author, title, abstract) assigned to them and are stored in a searchable database. We call this content “indexed” in the system. Most of the indexed content in the ADS consists of literature records, but a significant and fairly recent addition to our collection consists of records representing data and software. The first data products to be featured in the ADS, starting in 1995, were the data catalogs maintained by the VizieR database at CDS [5]. In the early 2000s we added observing proposals for a number of NASA missions and observatories. A decade ago we included NASA Astrophysics funding proposals, and, as of 2012, records from the Astrophysics Source Code Library [6]. More recently, as part of the Asclepias project [7], we have been indexing software products cited in the astronomy literature.
By definition, indexed records in the ADS represent scholarly research objects, and are no longer limited to scientific articles, but now include digital artifacts cited by the research literature. As such, they represent a more complete collection of the scholarly building blocks of the scientific research enterprise. Because they are described in the ADS as records with rich metadata, they are easily discoverable and citable. In addition, their metrics such as citation and usage are tracked by ADS and exposed by its services.
One of the distinguishing features of the ADS is the presence of links from its records to research archives and databases. This is particularly significant for the core astronomy journals which work with archive curators to create and maintain rich online resources associated with the articles they publish. As an example, 68% of articles published in the ApJ during 2020 have links to data products.6 Historically, the links have been created by curators upon the ingestion of datasets published with the article (as is the case for VizieR tables), cross-referencing of astronomical object measurements (SIMBAD and NED), or telescope bibliographies (Astrophysics Data Archives such as Chandra, MAST, IRSA, HEASARC, etc). The ADS has been working with collaborators at these institutions to enable links between its records and these datasets on a regular basis, allowing for the curated links to be exposed to the end user.
However, it is now possible to text-mine much of this information from papers which provide it. As mentioned in section 2.2, many journals now encourage the inclusion of information that identifies the origin of the data used in the papers, often including links to the archives or even specific data sets within them. Similarly, the AAS journals allow users to directly specify links to software and datasets which are properly tagged as such in the annotated digital documents. This means that it is now possible for the ADS to automatically identify links to data products found in these sections of the full-text documents, adding these links to the collection of the ones curated by the archives [8].
In addition to providing individual links to data products from the associated records, the ADS labels the links according to the archives that maintain them, thus creating archive-specific collections and making this information searchable. This makes it possible, for instance, to find papers that use data products hosted by a particular archive. As an example, one can use the ADS to identify papers studying Active Galactic Nuclei and making use of Chandra data7.
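As an illustration of such an archive-specific search, the sketch below builds a request URL for the ADS search API without sending it. The endpoint and the `q`, `fl`, and `rows` parameters follow the public ADS API; the use of a `data:` field with the label `Chandra` to select the archive, and of `abs:` to select the topic, are assumptions about the query syntax that should be checked against the current ADS documentation. The helper name `build_archive_query` is hypothetical.

```python
from urllib.parse import urlencode

# Public ADS search endpoint; sending a request additionally requires an
# API token passed in an Authorization header (not shown here).
ADS_SEARCH_URL = "https://api.adsabs.harvard.edu/v1/search/query"


def build_archive_query(topic, archive, rows=10):
    """Build a search URL for papers on a topic that link to a given archive."""
    # Assumption: "abs:" restricts the topic search and "data:" filters on
    # the archive label attached to the data links.
    query = f'abs:"{topic}" data:{archive}'
    params = {"q": query, "fl": "bibcode,title", "rows": rows}
    return f"{ADS_SEARCH_URL}?{urlencode(params)}"


url = build_archive_query("active galactic nuclei", "Chandra")
print(url)
```

Building the URL separately from sending it keeps the query logic easy to inspect and test without network access or credentials.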
Linked resources, as opposed to indexed resources, are not represented as ADS records but rather as typed links from ADS records to external data systems. While it is not possible to search the ADS for the metadata associated with the linked resource (including, e.g. its DOI or author list), the existence of a linked collection can be used as a filter in ADS searches as illustrated by the example above. Having a linked data collection available in the ADS also offers the data managers responsible for it the ability to evaluate the research impact of the data sources themselves. If one takes the publication of papers about a dataset as a proxy for its scientific impact, having papers connected to datasets means that a rich set of related bibliometric indicators becomes trivially available. These are often used by funding agencies to periodically review their priorities concerning data management strategies. Multiple studies [9][10][11], leveraging this methodology, have shown that re-use of archival data doubles the scientific output of the original research.
The experience gained with the Asclepias project has provided us with the ability to detect citations to a particular class of scholarly products (software packages) and trigger the automated retrieval and indexing of their metadata. This has led to the creation of a new ingest policy (ingest upon citation) and a corresponding workflow (citation capture pipeline). As the online availability of digital scholarly artifacts expands, the ADS is well positioned to discover their existence and index their metadata as appropriate by expanding the scope of the Asclepias workflow. Over the next several years we will focus our attention on the citation and indexing of research objects such as high-level datasets and notebooks, whenever these are formally cited in the literature and registered with a DOI. At the moment, both of these requirements are important: citation to a data or software product in an astronomy paper provides us with the information that such an object is relevant to research in our field, while the presence of the associated DOI provides us with the metadata necessary to ingest and index the resource.
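The two requirements of the ingest-upon-citation policy can be expressed as a small decision function. This is a minimal sketch under stated assumptions, not the actual citation capture pipeline: the `Reference` class, the `TARGET_TYPES` set, and the returned labels are all hypothetical names introduced for illustration.

```python
from dataclasses import dataclass
from typing import Optional

# Assumed set of resource types in scope for the policy (illustrative only).
TARGET_TYPES = {"software", "dataset", "notebook"}


@dataclass
class Reference:
    in_bibliography: bool  # True for a formal citation, False for a mention
    doi: Optional[str]     # DOI attached to the reference, if any
    resource_type: str     # type label, e.g. taken from the DOI metadata


def ingest_decision(ref, indexed_dois):
    """Decide what to do with one reference found in an astronomy paper."""
    if not ref.in_bibliography:
        return "skip"             # a mention does not trigger ingest
    if ref.doi is None:
        return "skip"             # without a DOI there is no metadata source
    if ref.doi in indexed_dois:
        return "record-citation"  # record exists; only the citation is new
    if ref.resource_type in TARGET_TYPES:
        return "ingest"           # fetch the DOI metadata and create a record
    return "skip"                 # other types are left to other workflows


indexed = {"10.0000/known.software"}
print(ingest_decision(Reference(True, "10.0000/new.dataset", "dataset"), indexed))
print(ingest_decision(Reference(True, "10.0000/known.software", "software"), indexed))
print(ingest_decision(Reference(False, "10.0000/new.dataset", "dataset"), indexed))
```

The order of the checks mirrors the text: the formal citation establishes relevance to the field, and the DOI supplies the metadata needed to create the record.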
In order for these efforts to be successful in promoting the FAIR goals of Open Science, all the relevant stakeholders need to work together by continuing to promote a set of best practices which follow a simple set of recommendations. The first recommendation to data archives is to properly register the research objects that they host (data, software, documentation) with DOIs, making their existence more easily discoverable and their retrieval more durable. The second recommendation is for publishers to require that authors be explicit about the use of software and data in their manuscripts and adopt the use of DOIs, whenever possible, to identify the corresponding resources. The third recommendation is for authors to use the reference section of a paper to formally cite all appropriate scientific contributions used in the research process. This should include software and data products when appropriate. Including formal citations will ensure that indexing services such as the ADS, CrossRef8 and Scholix9 will be able to properly identify the citation and ingest the relevant records if necessary.
The ADS will continue its efforts in leveraging Machine Learning techniques to enrich its corpus and make it more discoverable. However, no amount of AI is a substitute for curation, especially when this can be used to validate the results of automated tools and pipelines. This is where the collaboration with curators and librarians becomes important: the ADS is best-suited to provide services to improve the productivity of curation activities, rather than replace them.
With the addition of full-text indexing for all relevant astronomy publications, the ADS API and its notification services now provide the capability of identifying papers which mention particular telescopes, instruments, or even data products. While the ADS is actively working on improving the identification of these entities using a variety of machine learning efforts, accurate bibliographies and linked data generation will still require a supervised process facilitated by curators for some time to come. Are you a curator who is looking for a way to make these workflows more efficient while using ADS as part of your workflow? We would like to hear from you.
The NASA Astrophysics Data System is operated by the Smithsonian Astrophysical Observatory under NASA Cooperative Agreement 80NSSC21M0056.