We describe the infrastructure developed at the NASA Astrophysics Data System that implements a software citation detection, metadata capture and ingest, and event-driven notification system, highlighting challenges to promoting software into first-class research objects.
In September of 2016, the NASA Astrophysics Data System (ADS) started to work on the implementation of first-class support of software. This work was started as a result of the Asclepias project, funded through a grant from the Alfred P. Sloan Foundation to the American Astronomical Society. The main goal of the Asclepias project is to promote scientific software into an identifiable, citable, and preservable object. It highlighted the fact that no single stakeholder can solve the software citation problem. It requires close collaboration between a publisher (the American Astronomical Society), a repository (Zenodo) and an indexing service (ADS). This paper focuses on the contribution the ADS has made to this project. Five years later, the ADS has indexed just over 10k Zenodo software records, representing almost 19k citations. How did we get here? We describe the underlying infrastructure developed at ADS which implements a software citation detection, metadata capture and ingest, and event-driven notification system to a broker used by ADS collaborators. We include a discussion of the challenges that we have encountered in the implementation and operation of the system.
In 2016 the American Astronomical Society (AAS) successfully applied for a grant from the Sloan Foundation with a proposal titled “Enabling software citation and discovery workflows.” The proposal described a solution to build a technical framework and promote a set of social practices aimed at “fixing” many of the problems associated with software citations (Muench et al., 2020; Henneken, Accomazzi, Blanco-Cuaresma, Muench, & Holm Nielsen, 2017). The proposed approach was designed to enable workflows which provide practical, robust solutions to the following issues:
Status: software products are to be treated as a first-class citizen in scholarly abstracting and indexing systems and citations to them are to become an encouraged practice in the publication of scientific papers;
Preservation: releases of software products and associated documentation are to be deposited in a trusted repository, so that individual versions are archived as separate entities and proper authorship information is collected for each;
Identification: archived software releases will be assigned unique, persistent identifiers and metadata so that precise, persistent connections can be made between papers and software;
Attribution: software developers will be in full control of the process which determines the proper list of contributors for software packages on a release-by-release basis;
Credit: cited software and its impact will be represented in discipline-specific indexing systems which are used by the targeted community for discovery and evaluation of scholarly content.
The proposal described the problems with software citation afflicting the astronomical community, which at the time required hand curation of software records via the Astrophysics Source Code Library (ASCL; Allen et al., 2013) and did not allow citations to be specific at the version level, thus hindering reproducibility. At that time, software products were only “citable” (as tracked by the ADS) if they had a corresponding “software paper” in the ADS or an associated record in the ASCL. This meant that the software had to have been previously mentioned somewhere in the literature. A major deliverable of the Asclepias collaboration was to reduce the need for human curation and gatekeeping as prerequisites for a piece of software becoming part of the scholarly literature.
A significant fraction of this effort focused on updating the ADS infrastructure to allow indexing of versioned software records and detecting of citations to them in the scholarly literature. ADS development focused on enabling the following capabilities:
Detect, ingest and index records corresponding to DOI-based software citations found in journal articles;
Provide a rich search interface for software records, making them discoverable and more easily citable;
Provide citation statistics for software records;
Aggregate attributions of software records per author/contributor and version.
By far the most effort was spent on enabling the first goal: detection, ingest and indexing of new content found in citations. Until this effort, the ADS ingest processes had been based on a collection management policy which evaluates content based on the reputation of its publishing venue (e.g. journal status) and disciplinary relevance (e.g. Space Science topics of interest to astronomers). While citation analysis is periodically conducted to detect new sources of relevant information, the selection criteria are driven by a deliberate curatorial process.
As part of this effort, the ADS team adopted a new ingest policy and implemented the necessary workflow to support it. According to this policy (“ingest upon citation”), software products which are cited in the literature and which are properly archived, are automatically ingested in the ADS and their citations are properly accounted for. In this context, “properly archived” means that the software in question has been released, deposited in a trusted repository, registered with appropriate metadata, and assigned a DOI. Thus the workflow built by the ADS has been developed for software that has been (a) properly released and preserved in a FAIR-aligned repository such as Zenodo and (b) properly cited in the literature, including the specification of the software version and associated DOI.
It is important to emphasize that the ADS workflow can only exist because other workflows are in place. The workflow implemented at Zenodo makes it possible to assign a DOI to specific software releases published on GitHub; as a result, harvestable metadata is registered with DataCite. For software registered via a DOI, the workflow implemented at the AAS journals (and other journals) provides authors a means to properly cite that software when used in their articles (Muench, Accomazzi, & Holm Nielsen, 2017; Muench et al., 2020). The ADS implementation further contributes to this virtuous ecosystem by crediting the software citation in its system (rather than silently dropping it, as happened in the past) and providing an easy way to cite it further (via a formatted citation or BibTeX export).
Overall, the concerted efforts of the Asclepias team were aimed at implementing and improving FAIR-enabling processes and practices which have since gained increasing support across much of the scientific community and are now recognized as good practices. Still, it’s worth mentioning the specific workflows which the project aimed to enable for the benefit of the research community:
Author/Journal Workflow: this workflow focuses on the publishing process. The workflow centers on typical behaviors for journal authors: obtaining and inserting software citations into manuscripts; validating and typesetting software citations for published articles;
Reader Workflow: Readers of journal articles will be able to track the software behind an article to its source, including specific versions. They will have immediate access to necessary metadata (licensing, software language) to determine the applicability of that cited code for their own specific research project;
Developer Workflow: The primary actions by the software developer include: 1) maintaining software reference metadata in the codebase; 2) issuing releases to DOI repositories (Zenodo); 3) responding to curation pull requests issued by the Software Broker;
Software User Workflow: Software users can utilize different search engines to find software of interest to their research or software engineering project. These endpoints will expose the rich data that comes from combining software code metadata and citation metadata.
The success of our project can be measured against how well these workflows are supported and followed. The main astronomy journals all have explicit instructions for authors for how to incorporate software citations into their bibliographies. The author instructions for the AAS Journals (https://journals.aas.org/references/#software) have a separate section for software citations while those for Astronomy & Astrophysics only include an example of the citation of an ASCL record (https://www.aanda.org/for-authors/latex-issues/references); the author instructions for the Monthly Notices of the Royal Astronomical Society mention them as part of their data availability statement (https://academic.oup.com/journals/pages/authors/preparing_your_manuscript/research-data-policy#data2).
One of the core curation activities performed by the ADS project is the harvesting, aggregation and curation of scholarly publication records. Initially, this consisted in indexing the publication metadata, but over time it grew to the digitization of historical publications, followed by text mining of this content to allow the creation of a citation database. The ADS database currently consists of over 16M records accounting for over 150M citations. Its ingest workflow is designed to properly identify and reconcile manuscripts which may have appeared in different repositories, for example an early preprint published on the arXiv followed by a peer-reviewed paper published in the Astrophysical Journal. During this process, ADS persistent identifiers (bibcodes) are created for each ingested record, linked with any other identifier associated with the input records (DOIs, arXiv ids), and then reconciled with each other, producing a merged bibliographic record identified by a canonical bibcode.
Along with bibliographic records, the ADS has been indexing other non-traditional scholarly resources such as research proposals, high-level data products, and, since 2012, software packages registered in the ASCL. The curation workflows see to it that this content is not only accurately represented, but also correctly linked to connected resources, both internally (in the form of another ADS record) or externally (in the form of a link to another scholarly repository).
The ADS receives citation data from various sources in a variety of formats. The goal of the ADS citation processing pipeline is to parse individual citation strings and attempt to match these to existing ADS records (Demleitner et al., 2004; Accomazzi et al., 2007). In addition to tokenizing the input citation metadata, the parsing process detects the presence of a digital object identifier (DOI) in citation data and, when found, stores it separately. Besides looking for a DOI, the citation processing pipeline also looks for some other identifiers, some of which may be persistent (such as ASCL ids), others not (such as URLs). All these identifiers are extracted and considered further downstream for potential ingest or indexing. Currently only DOIs are used for the actual automated ingest of software metadata. ASCL records are already indexed in ADS via a regular feed from their publisher. Both kinds of cited ids, when found in citation lists, result in the assignment of software citations. Table 1 gives a snapshot of the number of DOIs cumulatively captured at different time intervals during the project development.
Table 1. The number of DOIs cumulatively captured at different time intervals. The Month column contains the month by the end of which the measurement was done. The second column lists the total number of DOIs captured by the end of that month. The next column shows the number of DOIs not corresponding with an ADS record. The last column corresponds with the number of Zenodo DOIs.
During this initial implementation of the ADS workflows, we limited ourselves to considering only those DOIs which were registered by Zenodo, one of our collaborators within the Asclepias project (van de Sandt et al., 2019; Nielsen & Sandt, 2019). All Zenodo DOIs found in citation lists are potential candidates for software records, but the only way to find out if any given DOI corresponds to a software record is to retrieve the metadata associated with it from the appropriate registration agency (in this case DataCite) and check the value of the resourceType field. Of course, other DOIs can correspond to software records as well, and we plan to expand our ingest policies to a wider array of repositories in the near future.
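As a minimal sketch of this check, the Python function below inspects an already-retrieved metadata record and reports whether it describes software. It assumes the record follows the DataCite JSON layout, in which the general resource type is stored under `types.resourceTypeGeneral`; the function name and the sample records are hypothetical and are not taken from the ADS pipeline.

```python
def is_software_record(metadata: dict) -> bool:
    """Return True if a DataCite-style metadata record describes software.

    Assumption: the general resource type is stored under
    metadata["types"]["resourceTypeGeneral"], as in DataCite JSON.
    """
    types = metadata.get("types") or {}
    resource_type = types.get("resourceTypeGeneral") or ""
    return resource_type.strip().lower() == "software"


# Hypothetical records for illustration only; the DOIs are placeholders.
software_record = {
    "doi": "10.5281/zenodo.EXAMPLE-1",
    "types": {"resourceTypeGeneral": "Software"},
}
dataset_record = {
    "doi": "10.5281/zenodo.EXAMPLE-2",
    "types": {"resourceTypeGeneral": "Dataset"},
}

for record in (software_record, dataset_record):
    action = "continue processing" if is_software_record(record) else "skip"
    print(record["doi"], "->", action)
```

Keeping the check separate from the network retrieval makes it easy to test against stored metadata and to reuse if other repositories are added later.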
In order for an indexing service, like the ADS, to be able to discover and attribute citations, and software citations in particular, the citation processing workflow has to be able to uniquely identify the cited source. The use of generally accepted patterns for citation strings (like the APA style citations), in combination with persistent identifiers (like DOIs), is the ideal way to make this identification possible. While the traditional ADS citation processing pipeline is able to deal with both structured and unstructured citation strings, with or without persistent identifiers for literature records, in the current implementation of the ADS Citation Capture Pipeline the presence of DOIs is required to enable the discovery and attribution of software citations.
An analysis of citation data for software products has exposed a number of problems with the way software products are listed in citation lists. Despite being provided with clear instructions on how software citations should be structured, a number of problematic citation patterns still prevent software citations from being correctly recognized and assigned to the proper record. Table 2 lists some examples of problematic software citations.
Droettboom, M., Caswell, T. A., Hunter, J., et al. 2017, https://zenodo.org/record/1098480. Source: D. Stansby and T. S. Horbury (2018), A&A, 613, A62
The authors tried to do the right thing by citing a Zenodo record. However, a URL is not a citable object in the strict sense, since there is no structured metadata registered with URLs
ECO Code, 2018, [online] Available: https://github.com/martin-danelljan/ECO. Source: L. Zhang, A. Gonzalez-Garcia, J. van de Weijer, M. Danelljan and F. S. Khan, "Synthetic Data Generation for End-to-End Thermal Infrared Tracking," in IEEE Transactions on Image Processing, vol. 28, no. 4, pp. 1837-1850, April 2019
Citing a Github URL has the same problem as citing a Zenodo URL. In addition, the authors did not follow the citation instructions provided by the developers within the Github repository. Here they ask authors to cite the software paper published in the Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017
Foreman-Mackey D. 2016 JOSS 24 doi:10.5281/zenodo.45906 Source: Guillermo Torres et al 2018 ApJ 866 67
This is an ambiguous citation: both the software paper in JOSS and the software record in Zenodo are cited. The developers specifically ask to cite the Zenodo record in their Github repository
Table 2. Illustration of a number of problematic software citations.
The software citation capture pipeline follows a shared pattern among all ADS back-office pipelines and it relies on the same technology backbone. The ingestion is triggered on a weekly basis by a time-based job scheduler (cron), which is the event that signals the beginning of the processing for the updated dataset. From that moment on, the system uses a message-broker (in our case RabbitMQ) to orchestrate and parallelize all the necessary data processing. The pipeline keeps the internal state of the processed data in a relational database (PostgreSQL) and it forwards the final properly formatted data to the ADS master pipeline, which centralizes the ingested data from all the existing ADS pipelines. Additionally, each new ingestion will also generate events to the external Software Broker maintained by Zenodo (https://asclepias-broker.readthedocs.io). These events have a pre-defined JSON payload schema containing source and target identifiers (e.g. bibcodes or DOIs) as well as their relationship (e.g. “Cites”, “IsIdenticalTo”).
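To make the shape of such an event concrete, the sketch below builds a payload with a source identifier, a target identifier and a relationship. The field names (`source`, `target`, `id`, `scheme`, `relationship`) and the citing bibcode are hypothetical placeholders; the actual schema is the one defined in the Software Broker documentation and may differ.

```python
import json

# Relationship names quoted in the text; the broker may accept others.
ALLOWED_RELATIONSHIPS = {"Cites", "IsIdenticalTo"}


def build_event(source_id, source_scheme, target_id, target_scheme, relationship):
    """Build an illustrative event payload (hypothetical field names)."""
    if relationship not in ALLOWED_RELATIONSHIPS:
        raise ValueError(f"unsupported relationship: {relationship}")
    return {
        "source": {"id": source_id, "scheme": source_scheme},
        "target": {"id": target_id, "scheme": target_scheme},
        "relationship": relationship,
    }


# The citing bibcode is a made-up placeholder, not a real ADS record.
event = build_event(
    "EXAMPLE-CITING-BIBCODE", "bibcode",
    "10.5281/zenodo.11813", "doi",
    "Cites",
)
payload = json.dumps(event, sort_keys=True)
print(payload)
```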
The input data associated with this initial triggering event consists of a list of citing ADS record identifiers (bibcodes) and their cited Zenodo DOIs. These DOIs (e.g. “10.5281/zenodo.11813”) were automatically matched using regular expressions against the textual content of all the articles ingested by ADS. Whenever it is possible, the input data also includes potential existing ADS records that may already represent or be linked to the detected DOI (e.g., the DOI “10.5281/zenodo.32790” and the ADS record identified by “2016ascl.soft05005B”, both citations for the software package named TMBIDL).
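The matching step can be illustrated with a short Python sketch. The regular expression is an assumption based on the DOI examples above (prefix 10.5281 followed by a numeric Zenodo identifier); the expressions used in production by the ADS are not given in this paper and are likely more elaborate.

```python
import re

# Assumed pattern: Zenodo DOIs have the prefix 10.5281 and a numeric suffix.
ZENODO_DOI = re.compile(r"10\.5281/zenodo\.(\d+)", re.IGNORECASE)


def find_zenodo_dois(text: str) -> list[str]:
    """Return the unique Zenodo DOIs found in text, in order of first appearance."""
    seen = {}
    for match in ZENODO_DOI.finditer(text):
        # Normalize the case of the prefix so that duplicates collapse.
        seen.setdefault(f"10.5281/zenodo.{match.group(1)}", None)
    return list(seen)


reference = "Foreman-Mackey D. 2016 JOSS 24 doi:10.5281/zenodo.45906"
print(find_zenodo_dois(reference))  # ['10.5281/zenodo.45906']

# A plain Zenodo URL carries no DOI and is therefore not matched.
print(find_zenodo_dois("https://zenodo.org/record/1098480"))  # []
```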
The first necessary step after the triggering event consists of comparing the current input data to the previously ingested data. This comparison leads to the identification of new, deleted and updated items in the input data. These detected changes should be understood as follows:
New items: new citations that were not previously processed by the pipeline (i.e., citations from a new article or new citations from an already existing article),
Updated items: citations that were already processed by the pipeline and which have been updated, typically indicating that new ADS records have been associated with them by the citation processing pipeline,
Deleted items: citations that were already processed by the pipeline which have been removed from the input data. These cases correspond to ADS records that have been merged (e.g., the same pre-print and publisher articles have been matched and merged), ADS records for which their identifier has changed (e.g., manual corrections due to incorrect name parsing) or a different matching pattern has been used (e.g., correcting regular expressions to avoid incorrectly interpreting certain strings as valid DOIs).
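The comparison described above can be sketched as a dictionary diff. This is an illustration under stated assumptions, not the pipeline's implementation: each citation is keyed by a (citing bibcode, cited DOI) pair, the value is the set of ADS records already associated with the DOI, and the citing identifiers `paper-A` to `paper-D` are made-up placeholders.

```python
def diff_citations(previous: dict, current: dict) -> tuple[dict, dict, dict]:
    """Classify citation entries as new, updated or deleted.

    Both arguments map a (citing_bibcode, cited_doi) pair to a frozenset of
    ADS bibcodes already associated with the cited DOI (possibly empty).
    """
    new = {key: value for key, value in current.items() if key not in previous}
    deleted = {key: value for key, value in previous.items() if key not in current}
    updated = {
        key: value
        for key, value in current.items()
        if key in previous and previous[key] != value
    }
    return new, updated, deleted


# Toy data for illustration; the citing identifiers are placeholders.
previous_run = {
    ("paper-A", "10.5281/zenodo.11813"): frozenset(),
    ("paper-B", "10.5281/zenodo.32790"): frozenset(),
    ("paper-C", "10.5281/zenodo.45906"): frozenset(),
}
current_run = {
    ("paper-A", "10.5281/zenodo.11813"): frozenset(),  # unchanged
    ("paper-B", "10.5281/zenodo.32790"): frozenset({"2016ascl.soft05005B"}),  # updated
    ("paper-D", "10.5281/zenodo.45906"): frozenset(),  # new; paper-C was removed
}

new, updated, deleted = diff_citations(previous_run, current_run)
print(sorted(new), sorted(updated), sorted(deleted))
```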
Once this first step has been completed, every detected change will be processed individually and in parallel using the message-broker system and specialized workers. The following steps will depend on the type of detected change:
New items: the metadata associated with the DOI is fetched from doi.org (in case of failure, datacite.org acts as a fallback service). If the field “resourceType” contains the value “software”, the processing of this item continues, otherwise it will not be processed any further,
The provided ADS identifiers (i.e., bibcodes) are checked against the ADS API and whenever an alternative bibcode is resolved to a canonical bibcode, an event is sent to the Zenodo Software Broker to signal that these two bibcodes are identical,
If this is the first time that the DOI is processed by the pipeline, a new ADS software record is created (via ADS master pipeline) with a new assigned bibcode. In addition, an event is sent to the Zenodo Software Broker to signal that the new bibcode and the DOI are identical,
If the DOI was already processed (i.e., it was already stored in the internal pipeline database), the existing ADS software record is updated (via the ADS master pipeline) with a new citation. Additionally, a citation event is sent to the Zenodo Software Broker which links the citing article with the detected DOI,
Updated items: the internal pipeline state is updated with the new information, but no further events are necessary,
Deleted items: the internal pipeline state is updated with a deleted flag. As mentioned above, these deletions are mostly related to bibcode changes, which are seen as a deletion plus a new addition; the system accounts for this by generating events to signal that the previous bibcode is identical to the new bibcode. In any case, no ADS software records are removed from the system once they are ingested, to ensure persistence and recoverability.
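The handling of a newly detected citation can be summarized in a short sketch. It is a simplification under several assumptions: the resource type has already been retrieved, the step that resolves alternative bibcodes is omitted, `make_bibcode` is a hypothetical stand-in for the record creation done by the ADS master pipeline, and a first-time DOI is assumed to also produce a citation event, which the description above does not state explicitly.

```python
def process_new_citation(citing_bibcode, doi, resource_type, known_dois, make_bibcode):
    """Return the events to emit for one new citation.

    known_dois maps already-processed DOIs to their ADS bibcodes and is
    updated in place. Events are (source, relationship, target) tuples.
    """
    if resource_type.strip().lower() != "software":
        return []  # not a software record: stop processing this item

    events = []
    if doi not in known_dois:
        # First time this DOI is seen: create a record and link it to the DOI.
        known_dois[doi] = make_bibcode(doi)
        events.append((known_dois[doi], "IsIdenticalTo", doi))
    # Assumption: every accepted citation yields a citation event.
    events.append((citing_bibcode, "Cites", doi))
    return events


def make_placeholder_bibcode(doi):
    """Hypothetical helper; real bibcodes are assigned by the ADS."""
    return "bibcode-for-" + doi


known = {}
first = process_new_citation(
    "paper-A", "10.5281/zenodo.11813", "Software", known, make_placeholder_bibcode
)
second = process_new_citation(
    "paper-B", "10.5281/zenodo.11813", "software", known, make_placeholder_bibcode
)
print(first)
print(second)
```

The second call emits only a citation event because the DOI is already present in the internal state, mirroring the distinction between first-time and previously processed DOIs.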
The changes from each new ingestion take less than 24 hours to become accessible through the ADS search interface. The implemented architecture has proven to be resilient and has been running smoothly since its deployment to production. Nevertheless, some challenges are not yet fully resolved. The main one concerns the trustworthiness and quality of the metadata retrieved using the DOI. Since this is an automated ingest process, no metadata quality control is applied to it. The depositing user provides titles, descriptions, and the author list (among other fields), and the pipeline uses that information directly with no manual verification or intervention. For instance, author names can be difficult to parse: names do not strictly follow a fixed pattern, and sometimes only GitHub handles are included. A related difficulty is that, at the moment, the system does not provide a way to override the metadata received, so it is up to the users to correct the data upstream. This may become a problem if co-authors ask ADS curators to make changes because the main author is no longer able to correct the metadata.
The ADS search interface allows users to do fielded and unfielded queries. Unfielded queries are free text queries; these only search the basic metadata of ADS records (authors, titles, abstracts, journal names). Fielded searches make use of a rich query language built into the ADS search engine. This query language is an extension of the Lucene search engine and has been developed within the ADS (Chyla et al., 2015). This query language consists of query modifiers and second order operators; while query modifiers essentially represent search filters, second order operators represent follow-up queries that work on the result set generated by an initial query. A detailed description of all query modifiers and second order operators can be found in the ADS Online Help (https://ui.adsabs.harvard.edu/help/). With these query modifiers, the collection of all software records in the ADS can be found with the fielded query doctype:software. This query finds 13,278 results (as of May 11, 2022), consisting of 10,481 Zenodo records, 2,780 ASCL records and 17 records from the Astrophysics Software Database (http://www1.astrophysik.uni-kiel.de/asd/, now defunct). If one sorts this result list by citation count, the interface displays the total number of cumulative citations to this list of results, which is in excess of 28,000. This simple query already provides a wealth of information, visible in the content displayed in the search results page (see figure 1). On the left menu bar, a list of facets is displayed, offering the user basic statistics as well as the option to filter the list of results based on authorship, publications, institutional affiliation, and bibliographic groups, among others.
The ADS search results page provides an interesting view of more “sociological” aspects of the selected records, like identifying the most cited works or most prolific authors and institutions. Inspection of the author facet shows that currently Brigitta Sipőcz is associated with the largest number of software records. As far as institutions are concerned, most software records have authors affiliated with one of the centers of the Helmholtz Association of German Research Centers. These are closely followed by Max Planck and MIT. Another interesting observation is that 37 software records are in the CfA bibliographic group (containing the bibliography of the Center for Astrophysics | Harvard & Smithsonian); since this is a collection curated by the library of the Center for Astrophysics, it means that these software records are considered by the CfA curators as official observatory publications. This kind of early endorsement is an essential part of the promotion of software to a first-class citizen in the world of scholarly research output. There are currently in excess of 1,000 software records in ADS which have had authorship claimed in ORCID, another sign that coders care about adding this content to their ORCID profiles.
The facets mentioned above can be used to filter the search results. Alternatively, query modifiers can be used to generate record sets with certain additional properties. For example, to just generate the set of Zenodo software records, one needs to execute the query bibstem:zndo doctype:software. The "bibstem" modifier refers to the journal abbreviation as used in the ADS. Since the ADS receives metadata for more than just software records from Zenodo, the second modifier is necessary to specify that only software records are required. Additional modifiers can be added to e.g. limit the range of publication years. A comprehensive list of query modifiers can be found in the ADS online Help (https://ui.adsabs.harvard.edu/help/search/comprehensive-solr-term-list).
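The same fielded queries can also be issued programmatically. The sketch below only builds the request and does not send it; it assumes the ADS API search endpoint at api.adsabs.harvard.edu, authentication with a personal API token passed as a Bearer header, and the returned fields `bibcode`, `title` and `citation_count`. None of these details are described in this paper, and the token shown is a placeholder.

```python
from urllib.parse import urlencode

# Assumed endpoint of the ADS search API (not described in this paper).
ADS_SEARCH_URL = "https://api.adsabs.harvard.edu/v1/search/query"


def build_search_request(query, token,
                         fields=("bibcode", "title", "citation_count"), rows=10):
    """Return the URL and headers for an ADS API search (no request is sent)."""
    params = {"q": query, "fl": ",".join(fields), "rows": rows}
    url = f"{ADS_SEARCH_URL}?{urlencode(params)}"
    headers = {"Authorization": f"Bearer {token}"}
    return url, headers


# Query for Zenodo software records, as given in the text.
url, headers = build_search_request("bibstem:zndo doctype:software", "YOUR-API-TOKEN")
print(url)
```

The resulting URL and headers can be passed to any HTTP client; the response format is documented in the ADS API documentation.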
To gain further insights into the uptake of software citations observed in the ADS, we can rely on a number of available tools which allow us to query the ADS for these records and analyze the associated citation network. Second order operators can be used to generate result sets by performing two-step queries on any ADS query (Kurtz, Chyla, & Team, 2020). The most useful second-order operators that we will use in this investigation on the citation of software records are the citations() and references() operators.
The citations() operator is useful when exploring where, in the literature, software records are being cited. It generates the list of all the works which cite one or more records appearing in the first-order query results. This is especially useful when assessing the success of a project like Asclepias, which aims at promoting citations to a particular class of scholarly works, because it provides an overview of the uptake of software citation practices. Note that this operator does not return all citation instances for a given set of results, but only a unique list of the citing publications; these numbers can differ significantly, which is important to know when doing bibliometrics. The references() operator works in the opposite direction: it generates the list of all the works which have been cited by one or more records appearing in the first-order query results.
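The difference between citation instances and citing publications can be made concrete with a toy example. The identifiers below are made-up placeholders and the function is only an illustration of the counting, not of how the ADS computes its statistics.

```python
def citation_counts(pairs):
    """Return (number of citations, number of unique citing papers).

    pairs is an iterable of (citing_paper, cited_record) tuples.
    """
    unique_pairs = set(pairs)
    citing_papers = {citing for citing, _ in unique_pairs}
    return len(unique_pairs), len(citing_papers)


# Toy data: paper-A cites two software records, paper-B cites one.
toy_pairs = [
    ("paper-A", "software-1"),
    ("paper-A", "software-2"),
    ("paper-B", "software-1"),
]
print(citation_counts(toy_pairs))  # (3, 2)
```

Here the operator would report two citing papers, while the software records collectively received three citations.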
As an example of the application of these operators, consider the list of software records currently ingested by ADS from Zenodo for the LMFIT package: title:lmfit bibstem:zndo doctype:software. This query, which currently returns 15 records that together have been cited more than 400 times, searches for all records which contain the term “lmfit” in their title and which were published in the Zenodo repository as software records. If we are interested in seeing the set of ADS records that cite these records, we can simply apply the citations() operator to the original query: citations(title:lmfit bibstem:zndo doctype:software). The result is a list of 388 papers citing any one of the LMFIT software records. All of these papers were published since 2016, when the Github integration was first implemented in Zenodo. The ADS search capabilities can also be used for quality assurance purposes: they offer a way to look for mention or citation patterns that fail to get recognized. The following query looks for mentions of the software package LMFIT in the full text of indexed publications and excludes records that cite the Zenodo records: full:lmfit -citations(title:lmfit bibstem:zndo doctype:software). This query first searches for all papers which contain a mention of the term “lmfit” in their full text (full:lmfit) and then removes the records which, according to ADS, cite any of the Zenodo software records which contain “lmfit” in their title (title:lmfit bibstem:zndo doctype:software). The resulting records are therefore papers which mention LMFIT but do not cite it formally (at least not in the recommended way, via a DOI). If we consider only papers published since 2016, we find that this list consists of 572 records. These are the papers that could have included a formal citation to LMFIT but instead only had a mention.
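The way these queries nest can be sketched with two small helper functions that compose query strings. The helper names are hypothetical conveniences; only the resulting query strings correspond to ADS query syntax as used in the text.

```python
def citations(query: str) -> str:
    """Wrap a first-order query in the citations() second-order operator."""
    return f"citations({query})"


def mentioned_but_not_cited(term: str, record_query: str) -> str:
    """Full-text mentions of term, excluding papers citing the given records."""
    return f"full:{term} -{citations(record_query)}"


lmfit_records = "title:lmfit bibstem:zndo doctype:software"
print(citations(lmfit_records))
print(mentioned_but_not_cited("lmfit", lmfit_records))
```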
One publication that meets these criteria is Brandstetter, Dominik et al. (2021), Computer Physics Communications, Volume 263, article id. 107905. It actually includes a formal citation for LMFIT, but in the following way:
M. Newville, R. Otten, T. Stensitzki, A.R.J. Nelson, A. Ingargiola, D.B. Allen, M. Rawlik, Lmfit - non-linear least-squares minimization and curve-fitting for python, copyright 2020, URL https://lmfit.github.io/lmfit-py/intro.html.
The citation includes the Github URL instead of the Zenodo DOI. We note that the LMFIT developers provide citation instructions in their Github repository (corresponding with version 0.8.0), while the ASCL record lists the concept DOI (corresponding with the latest version) as preferred citation.
Another example is Sebokolodi M.L.L et al. (2020), The Astrophysical Journal, Volume 903, Issue 1, id.36: in this publication the authors say that they used the LMFIT software package, but only include the Github URL as a footnote. A footnote is a “mention,” not a formal citation, and as such it is not even considered by the ADS Citation Capture pipeline.
The authors in Ko, B. et al. (2020), Physics of the Earth and Planetary Interiors, Volume 305, article id. 106490 included the following citation:
Newville et al., 2016 M. Newville, T. Stensitzki, D.B. Allen, M. Rawlik, A. Ingargiola, A. Nelson Lmfit: Non-linear Least-Square Minimization and Curve-Fitting for Python. Astrophysics Source Code Library (2016).
This is an incomplete citation, because it does not include the ASCL identifier (ascl:1606.014). A drawback of this way of citing is that it is not possible to cite a specific release of LMFIT.
Finally, the authors in Corporaal, A. et al. (2021), Astronomy & Astrophysics, Volume 650, id.L13 write that they used the LMFIT software package, but do not include any footnote, formal citation or any other link to the software.
We use the query from the previous section to retrieve all Zenodo software records. This results in 10,481 records. Their distribution over publication years is shown in figure 2.
The criterion for a Zenodo software record to be added is that it was cited, so the increase observed in figure 2 correlates with an increase in citations of software records. A logical follow-up question is: where are these records being cited? This can be answered with the query citations(bibstem:zndo doctype:software), which returns all publications that cite Zenodo software. The answer is found in the Publications facet, which lists all publication venues in the form of ADS bibstems (publication abbreviations). Table 3 lists the top 10 citing publication venues.
Number of citations
The Astrophysical Journal
Monthly Notices of the R.A.S.
Geoscientific Model Development
The Astronomical Journal
Nature Scientific Reports
Journal of Open Source Software
Physical Review D
Astronomy & Astrophysics
Table 3. Top citing publication venues overall (top 10, May 11, 2022).
Since the focus of the Asclepias project is on astronomy, it makes sense to restrict the publication venues to those in the ADS Astronomy collection. This can be done using the query citations(bibstem:zndo doctype:software) collection:astronomy. As mentioned, the caveat here is that this query returns citing papers, rather than the actual number of citations. Table 4 lists the top 10 of most citing publication venues, restricted to the Astronomy collection in the ADS. It lists this top 10 by both the actual number of citations and the number of citing publications. Note that the same journals are in the top 10, by either criterion. The only difference is that some journals switch places, depending on criterion. The table also illustrates the noticeable differences between the actual number of citations and the number of citing papers, which is indicative of the number of publications that cite multiple Zenodo software records.
The Astrophysical Journal
Monthly Notices of the R.A.S.
The Astronomical Journal
Astronomy & Astrophysics
The Astrophysical Journal Supplement Series
Physical Review D
Geophysical Research Letters
Journal of Geophysical Research (Planets)
Journal of Cosmology and Astroparticle Physics
Table 4. Top citing publication venues within astronomy, as measured by the actual number of citations and the number of citing papers (top 10, May 11, 2022).
Table 4 shows that most of the main astronomy journals (the Astrophysical Journal, the Astrophysical Journal Supplement Series, the Astronomical Journal, the Monthly Notices of the Royal Astronomical Society and Astronomy & Astrophysics) are represented in this top 10. Of the journals presented here, ApJ, MNRAS, and A&A publish roughly comparable numbers of articles per year but cite Zenodo software records at different rates. How this relates to the software citation policies implemented by these journals may need further study.
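The difference between the number of citations and the number of citing papers can be illustrated with a minimal sketch. The data below are invented for illustration only (the paper and record identifiers are hypothetical, and the numbers are not those of Table 4); each citing paper is listed with its venue and the Zenodo software records it cites.

```python
from collections import Counter

# Hypothetical toy data: (citing paper, venue, cited Zenodo software records).
citing_papers = [
    ("paper-1", "ApJ", ["zenodo-a", "zenodo-b", "zenodo-c"]),
    ("paper-2", "ApJ", ["zenodo-a"]),
    ("paper-3", "MNRAS", ["zenodo-b", "zenodo-d"]),
]


def count_by_venue(papers):
    """Return (citations per venue, citing papers per venue)."""
    citations = Counter()
    papers_per_venue = Counter()
    for _paper_id, venue, cited in papers:
        unique_cited = set(cited)  # count each cited record once per paper
        if not unique_cited:
            continue  # ignore papers that cite no software records
        citations[venue] += len(unique_cited)
        papers_per_venue[venue] += 1
    return citations, papers_per_venue


citations, papers_per_venue = count_by_venue(citing_papers)
for venue in citations:
    print(venue, citations[venue], papers_per_venue[venue])
```

A venue whose papers each cite several software records shows a larger gap between the two counts, which is the effect the two rankings in Table 4 expose.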
The ADS search interface allows us to explore some questions regarding the social aspects of software citation, specifically who is citing software. One of the workflows in the Asclepias project as a whole is to create awareness within the scientific community. We use the Institutions facet in the ADS interface to explore this question. The data in this facet are the results of the enhanced affiliation processing within the ADS, which maps variants of an affiliation onto a canonical affiliation (Templeton & Grant, 2021). The query that finds all works cited in publications generated at the Center for Astrophysics | Harvard & Smithsonian during the time period 2014-2021 is references(inst:(CfA OR SI/SAO) year:2014-2021). The citations for Zenodo software records are retrieved by adding bibstem:zndo doctype:software to this query. Table 5 lists the number of Zenodo software records cited during the period 2014-2021 for 3 institutes that are comparable in size in the sense of scholarly output.
Table 5. Number of publications that cite at least one Zenodo software record for three institutes that are comparable in size in the sense of scholarly output. These institutes are the Center for Astrophysics | Harvard & Smithsonian (CfA), Caltech and the Max Planck institutes.
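The queries quoted in this section follow a simple compositional pattern, which the sketch below makes explicit. The query strings are the ones given in the text; the helper function names are hypothetical, and the endpoint URL and the parameter names q, fl and rows are assumptions about the ADS search API that should be checked against its documentation (an API token would also be needed to actually send the request).

```python
from urllib.parse import urlencode

# Filter that selects Zenodo software records, as used throughout this section.
ZENODO_SOFTWARE = "bibstem:zndo doctype:software"


def citing_zenodo_software(extra_filters=""):
    """Query for publications that cite Zenodo software records."""
    query = f"citations({ZENODO_SOFTWARE})"
    return f"{query} {extra_filters}".strip()


def zenodo_software_cited_by(institutions, years):
    """Query for Zenodo software records cited by an institution's papers."""
    inst = " OR ".join(institutions)
    return f"references(inst:({inst}) year:{years}) {ZENODO_SOFTWARE}"


def search_url(query, rows=10,
               base="https://api.adsabs.harvard.edu/v1/search/query"):
    """Build a request URL; endpoint and parameter names are assumptions."""
    return f"{base}?{urlencode({'q': query, 'fl': 'bibcode', 'rows': rows})}"


print(citing_zenodo_software("collection:astronomy"))
print(zenodo_software_cited_by(["CfA", "SI/SAO"], "2014-2021"))
print(search_url(citing_zenodo_software()))
```

Keeping the Zenodo software filter in one constant means that the venue, collection and institution variants differ only in the operator wrapped around it.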
The numbers in this table are compatible with a significant increase in citations of Zenodo software records since 2016, a growth larger than the relative growth of the number of software records in Zenodo; this could suggest a successful change in culture. The fact that a similar change is seen in Europe suggests that this would be mostly due to updated journal policies. By the end of 2015 there were already 6k software records in Zenodo, but only since late 2016 did journals actively start incorporating policies for the inclusion of software citations.
We use the ADS search to investigate the context in which Zenodo software records are being cited in the refereed astronomy literature during the time period 2019-2020. The query citations(doctype:software bibstem:zndo) year:2020 property:refereed collection:astronomy generates a list of all refereed publications, published in 2020 and contained in the Astronomy collection, that cite a Zenodo software record. The clustering provided in the Paper Network provides a glimpse into this context. The Paper Network in the ADS detects groups of papers based on shared citations between those papers; the grouping is done using the Louvain algorithm (Blondel, Guillaume, Lambiotte, & Lefebvre, 2008). Inspection of the clustering indicates that the main topic areas are cosmology and stellar dynamics. Besides these two topic areas, the clustering also found heliophysics, planetary science and radio astronomy.
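The Louvain algorithm forms groups by greedily maximizing modularity, a measure of how much more strongly the nodes within a group are connected than would be expected by chance. The sketch below is not the ADS Paper Network implementation and does not implement Louvain itself; it only builds a small weighted graph from shared references and evaluates the modularity of two candidate groupings. The data are hypothetical, and interpreting "shared citations" as the number of references two papers have in common is an assumption.

```python
from itertools import combinations

# Hypothetical toy data: each paper with the set of references it cites.
references = {
    "P1": {"r1", "r2", "r3"},
    "P2": {"r1", "r2"},
    "P3": {"r2", "r3"},
    "P4": {"r7", "r8"},
    "P5": {"r7", "r8", "r9"},
    "P6": {"r3", "r8", "r9"},
}


def coupling_graph(refs):
    """Weighted edges: the number of references two papers share."""
    edges = {}
    for a, b in combinations(sorted(refs), 2):
        shared = len(refs[a] & refs[b])
        if shared:
            edges[(a, b)] = shared
    return edges


def modularity(edges, partition):
    """Modularity Q of a partition given as {paper: community label}."""
    two_m = 2 * sum(edges.values())
    degree, internal = {}, {}
    for (a, b), weight in edges.items():
        degree[a] = degree.get(a, 0) + weight
        degree[b] = degree.get(b, 0) + weight
        if partition[a] == partition[b]:
            label = partition[a]
            # An internal edge contributes to both endpoints.
            internal[label] = internal.get(label, 0) + 2 * weight
    total = {}
    for node, k in degree.items():
        total[partition[node]] = total.get(partition[node], 0) + k
    return sum(internal.get(c, 0) / two_m - (total[c] / two_m) ** 2
               for c in total)


edges = coupling_graph(references)
two_groups = {"P1": 0, "P2": 0, "P3": 0, "P4": 1, "P5": 1, "P6": 1}
one_group = {paper: 0 for paper in references}
print(modularity(edges, two_groups), modularity(edges, one_group))
```

For this toy graph the two-group split has modularity 1/3, while placing all papers in a single group gives 0; a modularity-maximizing method such as Louvain would therefore prefer the split.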
The power of the Paper Network is illustrated by using it to sub-cluster any of the initial clusters found. This will make the most sense for one of the bigger clusters. If we take the 200 publications associated with the largest cluster (purple) and generate the Paper Network for this cluster, we get the results shown in figure 3. This is a good example of how iterative clustering can show structure within a topic. The figure below shows how a cluster with a broad, general topic is split up into smaller segments with different aspects of that larger topic. It shows that the cluster label consisting of general terms, broadly suggesting topics related to galaxies and lensing, at a more detailed level relates to cosmological simulations, supernova classification and gas dynamics (all topics that heavily rely on computer modeling and data processing).
Similarly, we can run a query and require that the resulting publications are NOT in the Astronomy collection in the ADS. Figure 5 shows the Paper Network for this query. A significant number of the segments detected are related to Earth Sciences. This does not mean that this is the general context in which software is being cited. This is the context as it exists within the holdings of the ADS database.
Figure 6 shows the number of citations to Zenodo software records received from the main astronomy journals. Currently, 96% of the publications in the main astronomy journals that cite a Zenodo software record were published after 2016. A software package like astropy had releases in 2015 and the mechanism to add software records to Zenodo was already in place, but the earliest citations for Zenodo records for astropy started in 2016. This does not unequivocally show a change in culture, but it does not contradict it either.
Is the workflow, as implemented in the Asclepias project, sufficient to capture the majority of software citations in its current form? If it is not, can it be adapted to reach that level? To explore this question, we need to get a feel for how many software citations are not being captured in Asclepias in its current form.
An example of attributions being missed is the machine learning software Keras (https://github.com/keras-team/keras), which is formally cited on a regular basis. It could have accumulated 1,447 citations by now in the ADS, but it has not, because it is cited using the generic GitHub URL of the repository. This also means that citations for specific versions are not recorded. Version-specific citations are important from a reproducibility perspective, and they are also useful information for the software developers.
Table 6 provides an overview of the number of formal URL citations, as a function of year. These numbers are for all publications indexed in the ADS. In total there are 81,892 of these citations in the period 2011-2021. When restricted to the main astronomy journals, only 540 URL citations are found for this period (0.7%). For comparison, during this same period 14,785 Zenodo software records were cited, with 2,876 citations coming from the main astronomy journals (19%).
Table 6. Counts of URLs found in ADS citation data.
The results in Table 6 show that GitHub has become the main source of URL citations. These GitHub URLs probably represent software, although GitHub is sometimes used to store other types of information. The table also shows a significant increase in the number of URL citations. How many of the cited GitHub URLs correspond to Zenodo records, i.e. could have been replaced with a citation using a Zenodo DOI? It turns out that there are 4,343 such records among the 56,726 total GitHub citations during this period, or 7.7%. In other words, the great majority of GitHub URLs do not have a corresponding Zenodo record. One take-away message is that formal citation of software is on the rise, but relying only on citations with persistent identifiers does not seem to be enough to turn this into measurable quantities, unless something drastic happens. The fact that this is largely the case for publications outside of astronomy may be the result of the push for formal software citations using Zenodo DOIs by the main astronomy journals.
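Counting URL citations per repository requires normalizing the different ways in which the same GitHub URL appears in reference strings. The following is a minimal sketch of such a normalization, not the procedure used to produce Table 6; the regular expression, the function name and the example reference strings are all hypothetical.

```python
import re
from collections import Counter

# Matches github.com/<owner>/<repo>; owner and repo are captured separately.
GITHUB_REPO = re.compile(
    r"github\.com/([A-Za-z0-9][A-Za-z0-9-]*)/([A-Za-z0-9._-]+)",
    re.IGNORECASE,
)


def github_repository(reference):
    """Return 'owner/repo' in lower case, or None if no GitHub URL is found."""
    match = GITHUB_REPO.search(reference)
    if match is None:
        return None
    owner, repo = match.group(1), match.group(2)
    repo = repo.rstrip(".")  # drop sentence-final punctuation
    if repo.lower().endswith(".git"):
        repo = repo[:-4]  # drop the clone suffix
    return f"{owner}/{repo}".lower()


# Invented reference strings that point at the same repository in three forms.
reference_strings = [
    "Keras, https://github.com/keras-team/keras",
    "Keras. https://github.com/keras-team/keras.",
    "Keras, GitHub repository, http://github.com/Keras-Team/keras.git",
    "Author, A. 2020, ApJ, 900, 1",
]
counts = Counter(
    repo for repo in map(github_repository, reference_strings) if repo
)
print(counts)
```

A normalized owner/repo key of this kind could also be compared against the repository links recorded in Zenodo metadata to estimate how many URL citations have a DOI-based alternative.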
For bibliographic indices like the ADS to include all possible citations to software products, URL citations would have to be included as well. This is not trivial, since there is no metadata associated with most URLs. Publishers could choose to endorse GitHub URL citations via the Citation File Format support now available in GitHub (https://citation-file-format.github.io/). These files are plain text files with human- and machine-readable citation information for software. This would enable the citation capture pipeline in the ADS to parse the CITATION file and use the metadata provided to create a record and assign the citation. The recent GitHub announcement on Twitter (https://twitter.com/natfriedman/status/1420122675813441540) by its CEO Nat Friedman about built-in citation support drew a significant response. Unfortunately, that built-in support has a serious flaw at the time of writing: the automatically created BibTeX couples the current software version/tag release in GitHub to the citation DOI, regardless of what that DOI actually refers to.
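As a rough illustration of how metadata from a CITATION file could be turned into a record, the sketch below maps the content of such a file onto a minimal set of fields. It assumes the file has already been parsed with a YAML library into a Python dictionary; the input keys follow the Citation File Format, while the output field names, the function name and the example content are hypothetical and do not describe the ADS pipeline.

```python
def cff_to_record(cff):
    """Map parsed CITATION.cff content (a dict) to a minimal index record."""
    authors = []
    for person in cff.get("authors", []):
        if "name" in person:
            # Entity author, e.g. a team or an organization.
            authors.append(person["name"])
        else:
            family = person.get("family-names", "")
            given = person.get("given-names", "")
            authors.append(", ".join(p for p in (family, given) if p))
    return {
        "title": cff.get("title"),
        "authors": authors,
        "version": cff.get("version"),
        "doi": cff.get("doi"),  # None when the file declares no DOI
        "date": cff.get("date-released"),
        "url": cff.get("repository-code"),
        "doctype": "software",
    }


# Hypothetical content, as it might look after parsing a CITATION.cff file.
example = {
    "cff-version": "1.2.0",
    "message": "If you use this software, please cite it as below.",
    "title": "Example Tool",
    "authors": [
        {"family-names": "Doe", "given-names": "Jane"},
        {"name": "The Example Team"},
    ],
    "version": "1.0.0",
    "date-released": "2021-08-01",
    "repository-code": "https://github.com/example/example-tool",
}
record = cff_to_record(example)
print(record)
```

Note that the version and the DOI are carried over independently, so a record built this way inherits any mismatch between the two that is present in the file.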
One final takeaway is of a technical nature: the quality of the data in the Zenodo software records we create in our database is only as good as the quality of the metadata that was registered with their DOIs. In the case of e.g. journal articles, this is rarely an issue, as these publications go through a rigorous process of peer review and editorial copy-editing. In the case of Zenodo software records, however, quality control can be an issue, as there is no manual curation process in place to review the metadata provided by users when they create records in Zenodo. This is particularly relevant for author names: quite regularly, we see GitHub handles instead of actual names entered in the metadata records, making attribution problematic.
Since the focus of the Asclepias project has been the astronomy scholarly ecosystem, we discuss our conclusions within this context. Within this view, this paper focused on the ADS workflows. As the various query examples show, the ADS provides a powerful discovery environment not only for software cited in scholarly publications, but also for the context in which this software was cited. One drawback of the current approach is that only software citations that have been detected and successfully assigned are discoverable in this way; software that has not yet been cited, or software citations that currently cannot be processed, are not discoverable through the ADS.
Figure 6 shows a significant increase of citations to Zenodo software records from the main astronomy journals. This means that developers published their software (releases) to Zenodo, and that the authors who used the software in their research (and subsequent publications) included a citation with the Zenodo DOI. All of these actions are crucial steps in the Asclepias workflows. Without these steps, the ADS workflow would not be possible.
At the same time, Table 6 also shows a significant increase in citations using URLs, indicating a desire from the community to acknowledge software using other means. Right now, from the point of view of a bibliographic index (specifically the ADS), these are essentially lost citations. However, within the astronomy context, the portion of URL-based citations is relatively small, as shown in the discussion section.
The challenges we encountered in setting up the ADS workflow were mostly of a metadata processing character. In some publications, like the arXiv preprints, citation data comes as unformatted ASCII strings. Successfully parsing these strings critically depends on the level of sophistication in the processing software. Parsing structured metadata (usually a form of XML, like JATS) is more robust, but it does not necessarily mean that there are no problems. In some cases we receive citation data in XML format that are just plain ASCII reference strings in a single XML tag. In all other cases the problems are caused by incorrect data (not caught in the editing process) or data we cannot process (like URLs). So far we have not seen any evidence of software citations being detectable as such based on specific XML tags. From the ADS point of view, ideally, we would love to see the adoption of an attribute in the citation data that indicates whether the cited resource is software.
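To give a sense of what detection looks like for unformatted reference strings, the sketch below extracts Zenodo DOIs with a regular expression. It relies on the fact that Zenodo DOIs have the form 10.5281/zenodo.<number>; the function name and the example reference strings are hypothetical, and the actual ADS reference parsing is considerably more involved.

```python
import re

# Zenodo DOIs have the form 10.5281/zenodo.<number>.
ZENODO_DOI = re.compile(r"10\.5281/zenodo\.(\d+)", re.IGNORECASE)


def zenodo_dois(reference):
    """Return the Zenodo DOIs found in a raw reference string, normalized."""
    return [f"10.5281/zenodo.{number}"
            for number in ZENODO_DOI.findall(reference)]


# Invented reference strings in the style of unformatted citation data.
examples = [
    "Doe, J. 2021, Example Tool v1.0, Zenodo, doi:10.5281/zenodo.1234567",
    "Example Team, https://doi.org/10.5281/ZENODO.7654321.",
    "Author, A. 2020, ApJ, 900, 1",
]
for text in examples:
    print(zenodo_dois(text))
```

A match of this kind identifies the cited resource as a Zenodo record but not as software; that distinction still requires the metadata registered with the DOI, which is why an explicit attribute in the citation data would help.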
Research in data-intensive disciplines increasingly consumes and generates a variety of digital resources during the course of scientific investigations. This has steadily increased the need for means to systematically capture the life cycle of scientific investigations, and to provide a single entry point to all the related resources, including data, publications, presentations, computational resources (software, Jupyter Notebooks, protocols), and the researchers involved in the investigation. This means that for these research artifacts in general, with the emphasis on software in this paper, rich metadata is of crucial importance. The only way to make this metadata available in a systematic, programmatically accessible way is to register it together with the object itself. So, while technical solutions may be available to index URLs in a bibliographic database, this would preclude all the benefits that come with the use of persistent identifiers such as DOIs for software.
The Asclepias project is funded through a grant from the Alfred P. Sloan Foundation to the American Astronomical Society, 2016. The ADS is operated by the Smithsonian Astrophysical Observatory under NASA Cooperative Agreement 80NSSC21M0056. Zenodo has received funding from the European Union’s Horizon 2020 research and innovation programme under grant agreements (OpenAIRE-Advance) no. 777541 and (OpenAIRE-Connect) no. 731011.