A case study investigating the impact of informal software recognition on the astronomy community
Software has been a crucial contributor to scientific progress in astronomy for decades, but practices that enable machine-actionable citations have not been consistently applied to software itself. Instead, software citation behaviors developed independently from standard publication mechanisms and policies, resulting in human-readable software citations that cannot effectively represent the influence software has had in the field. These historical software citation behaviors need to be understood in order to improve software citation guidance and develop relevant publishing practices that fully support the astronomy community. To this end, a twenty-three year retrospective analysis of software citation practices in astronomy was developed. Astronomy publications were mined for 410 aliases associated with nine software packages and analyzed to identify practices and trends that negatively impact software citation implementation.
Librarians are accustomed to the idea that patrons tend not to contact them unless they are genuinely having trouble finding something. Patrons rarely reach out to librarians with questions that can be answered via Google Search. Instead, Google Search often functions as a “starting point” for research questions.1 This general trend in information-seeking behavior often leaves patrons defaulting to library services only when trying to locate exceptionally difficult-to-find resources. At the John G. Wolbach Library,2 located at the Center for Astrophysics, inquiries like these increasingly deal with research software. For example, the Wolbach Library receives patron requests concerning older pieces of software (Fig. 1).
Patrons also turn to Wolbach Library librarians when they encounter “link rot,” or non-functional hyperlinks (Fig. 2).3
Dead hyperlinks are particularly problematic when they are used as the primary method of identifying software in citations. This issue is prominent in astronomy and associated fields and will only become more prevalent as these domains become more computational and reliant on software.4 Therefore, finding and accessing research software and treating it as an essential scholarly object (i.e., a research artifact) is slowly becoming the norm. For these reasons, software citation practices are becoming ever more important and relevant to astronomy librarians. Without proper citation, locating and accessing any resource can become problematic, but this is especially true of software.
In order for software to be formally recognized through citation, it is imperative that citations be informed by and adhere to best practices. Best practices in this context, however, have only recently been defined through work primarily undertaken by the FORCE 11 Software Citation Working Group (SCWG).5 In 2016, the SCWG published the “Software Citation Principles.”6 Designed to guide authors on how software citation and attribution should be conducted, the Principles recognize software as integral to science and state that software citations must enable normative and legal credit to be given to software authors. The Principles also state that software needs to be uniquely and persistently identifiable so that access to the software itself and its associated metadata is possible.
Although the Software Citation Principles may seem straightforward, software citation presents new challenges for software authors, as well as for article authors wishing to cite software. This is because standard approaches to citation typically do not conform with the normative practices employed by people citing software. For example, our research has shown that astronomers tend to cite proxies to represent their software (e.g., papers and websites) instead of citing software directly [1]. Astronomers have taken this approach because for decades there was no standard way to persistently identify the software itself and there was no other way to take advantage of existing mechanisms for quantifying citations (e.g., citation indexing by the Astrophysics Data System (ADS)7). Software citation is also challenging because software exists as a dynamic object with many versions and associated authors, making proxy citation particularly problematic. The following examples and case study description illustrate the shortcomings of indirect software citation via proxies.
Consider how one would cite an archival object. For instance, the Wolbach Library is in possession of the oldest known photograph of the Moon. In the form of a daguerreotype, the photograph exists as a physical object, and is also described by many documents including an 1849 handbook by the daguerreotypist Samuel Dwight Humphrey (Fig. 3).
If a patron needed to cite this daguerreotype, their motivation may be to give credit to Humphrey for creating the object; he is, after all, the photographer. Additionally, the patron would also want to create a citation that specifically locates the daguerreotype in a physical place because it is an actual tangible thing. For these reasons librarians would discourage authors from citing a proxy like Humphrey’s handbook rather than the object itself. Although citing the handbook may give Humphrey some form of credit, it would not help anyone locate or access the actual daguerreotype (the handbook is not the object we are trying to refer to). Instead, standard practices would guide authors to cite the persistent identifier assigned to the archival record that describes the daguerreotype. In the case of the daguerreotype, citing an identifier is not a challenge. Archives create persistent identifiers for their holdings and these identifiers enable disambiguation between things like daguerreotypes and their associated literature. Identifiers assigned to archival records exist to represent the physical object and specify its location, and a separate identifier can be used to represent the handbook.
In this context, the daguerreotype and its handbook are akin to a piece of software and a website or a paper that describes the software. When attempting to give credit to the software’s authors using a citation, it would be most effective to directly cite a persistent identifier for the software itself, rather than citing any kind of proxy documentation associated with the software. This idea is reflected clearly in the Software Citation Principles, but problems arise because software itself typically does not have an identifier and historically has not been archived by traditional institutions. In these cases authors have few options other than to cite proxy identifiers if they want to give and receive credit for their code.
Minting identifiers for scholarly objects so that they may be located and connected to associated materials is now straightforward in most contexts except software. This is because unlike a daguerreotype or other type of digital object, software typically exists dynamically; it is continually developed in a decentralized way with many versions and associated authors and it usually interacts with other software dependencies that similarly are unlikely to have unique identifiers. Additionally, software has traditionally not been seen as a scholarly contribution in its own right, and therefore static versions of software have not traditionally been archived, making software identifiers historically extremely uncommon.
Although creating persistent identifiers for software is a very recent practice, researchers have nevertheless attempted to give credit for software used in their articles for decades. Our research has shown that in astronomy article authors usually attempt to give software authors credit by including URLs to websites associated with software in their article footnotes or citing papers that in some way describe the software.
Mentioning website URLs (e.g., GitHub repos, personal websites) rather than permanent identifiers (e.g., Zenodo DOIs) has become common practice for astronomers because URLs have simply been the only available way to represent software. However, these URLs are fragile (i.e., prone to link rot) and do not support persistent identification. For example, if an astronomer used the software package AstroBlend and wanted to give credit to the authors using a citation, they might go to the AstroBlend website8 and add the URL used to download the software to their paper in a footnote. This is troublesome, though, because despite AstroBlend being a relatively new software package from 2017, the URL used to download the software now leads to a 404 error.9
In addition to link rot, URLs pose further issues because they are not formally indexed. Unlike bibcodes, DOIs, catalog numbers, etc., it is infeasible to view how many authors have cited a particular URL. In astronomy, the ADS has cleverly devised a way to keep track of URL citations as usage statistics, but indices like the ADS are out of the ordinary. A similar workaround that astronomers have come up with is citing a paper pertaining to a software package instead of the software itself. Like Humphrey’s handbook and the daguerreotype, these “software papers” merely describe a software package and its history, functionality, etc., but often fail to connect readers to the actual code itself. Further, as previously stated, software is more complex since there are often many versions of the same package. This means that authors would have to publish new software papers for each new iteration of code. The practice of using software papers for citations makes it challenging to keep track of differing contributors for each new version, and it also means that authors have to carefully cite the right version of software via the correct software paper.
Additionally, citing a software paper may still not get readers to the actual code. Although some software papers may contain identifiers to properly archived code, pointing to the archive via URLs is still common and problematic as described above. Another related issue is that software papers on open source code may be published in non-open access journals, creating a barrier to entry and putting documentation for open source software behind paywalls. Some software papers are also not entirely dedicated to the software itself. For example, some papers are about scientific findings and only detail the software in a methods section, which makes citations to these papers for the purpose of citing software indistinguishable from citations for other reasons.
At the CfA, we conducted a case study in 2020 that attempted to address the current status of software citation in astronomy by better understanding how software citations have been used by astronomers over the past two decades.10 The goal of the case study was to get a sense of what needs to be done to get astronomers comfortable with creating and citing software identifiers. To do this, we first needed to describe and respond to the existing norms.
We began by identifying nine different software packages with different scopes and functionalities. Some packages were pipelines (like DEEP2 DEIMOS in the spec2d package11), while others were small community-developed packages with very specific applications. All of the packages studied were developed in whole or in part at the CfA and were chosen because of their likelihood to be cited and because their development spanned a long period; our earliest package was released in 1990 (SAO Image DS912) and the latest was released in 2017 (PlasmaPy13) (TABLE 1).
Table 1. Earliest Release
Software Package | Year
Astroblend | 2016
Astropy | 2012
PlasmaPy | 2017
RADMC-3D | 2004
SAO Image DS9 | 1990
spec2d | 2002
Stingray | 2015
TARDIS | 2013
WCSTools | 1996
A search was conducted for all mentions of these software packages over the last 20 years. This search was conducted using text-mining techniques to parse through XML files representing 76,791 full-text articles published in the following American Astronomical Society (AAS) journals: Astronomical Journal (AJ), Astrophysical Journal (ApJ), Astrophysical Journal Letters (ApJL), and the Astrophysical Journal Supplement Series (ApJS) between July 1995 and May 2018 (TABLE 2).14
Table 2. AAS XML Coverage
Journal | Begin Date | End Date
AJ | 1998 Jan | 2018 May
ApJ | 1996 Nov | 2018 May
ApJL | 1995 Jul | 2018 May
ApJS | 1997 Jan | 2018 Apr
Mentions of the selected software packages were identified in areas such as bibliographies, footnotes, acknowledgement sections, and other areas of papers (TABLE 3).
Table 3. Types of Recognizable Attempts at Attribution
Location | XML Tags
Bibliographic Entry | bib, bibr, citation-alternatives, collab, contrib-group, element-citation, mixed-citation, nlm-citation, person-group, pub-id, ref, ref-list, source, xref
Footnote | fn
Acknowledgment | ack
Other | back, ex-link
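To make the location categories in Table 3 concrete, the following Python sketch shows one way alias mentions could be attributed to a location while walking an article's XML. It is not the code used in the case study. The function name, the "No recognizable credit" label, the nearest-tagged-ancestor rule, and the sample article (including its example.org URL) are all assumptions made for illustration; only the tag-to-location mapping comes from Table 3.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# Tag-to-location mapping transcribed from Table 3.
BIBLIOGRAPHIC_TAGS = (
    "bib", "bibr", "citation-alternatives", "collab", "contrib-group",
    "element-citation", "mixed-citation", "nlm-citation", "person-group",
    "pub-id", "ref", "ref-list", "source", "xref",
)
LOCATION_BY_TAG = {tag: "Bibliographic Entry" for tag in BIBLIOGRAPHIC_TAGS}
LOCATION_BY_TAG.update(
    {"fn": "Footnote", "ack": "Acknowledgment", "back": "Other", "ex-link": "Other"}
)


def find_mentions(xml_text, aliases):
    """Return (alias, location) pairs for each text node containing an alias.

    Assumption: a mention belongs to its nearest ancestor listed in Table 3;
    text outside every listed tag is labeled "No recognizable credit".
    """
    lowered = [alias.lower() for alias in aliases]
    mentions = []

    def record(text, location):
        text = (text or "").lower()
        for alias in lowered:
            if alias in text:
                mentions.append((alias, location))

    def visit(element, inherited):
        location = LOCATION_BY_TAG.get(element.tag, inherited)
        record(element.text, location)
        for child in element:
            visit(child, location)
            # Tail text sits after the child but belongs to this element.
            record(child.tail, location)

    visit(ET.fromstring(xml_text), "No recognizable credit")
    return mentions


# Hypothetical miniature article, invented for this example.
SAMPLE_XML = """<article>
  <body><p>Images were displayed with DS9<xref>1</xref> and read with astropy.io routines.</p></body>
  <back>
    <ack><p>This work made use of Astropy.</p></ack>
    <fn><p>http://example.org/ds9</p></fn>
    <ref-list><ref><mixed-citation><source>Astropy Collaboration</source> 2013</mixed-citation></ref></ref-list>
  </back>
</article>"""

location_counts = Counter(loc for _, loc in find_mentions(SAMPLE_XML, ["astropy", "ds9"]))
print(location_counts)
```

The plain substring test is deliberately naive; it matches any text that merely contains an alias, which is one source of the false positives discussed later in this article.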
To assess the effectiveness of using the ADS full-text search to identify software mentions in papers, we conducted the same search over the same time period using the ADS API.15 This second search also ensured that our results were publisher independent since it included non-AAS journals.
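For readers unfamiliar with the ADS API, a full-text query for a single alias might be constructed roughly as follows. This is a sketch only: the exact queries used in the case study are not reproduced here, the helper names and the choice of returned fields are assumptions, and the code builds the request without sending it (a personal ADS API token is required to run a real search).

```python
from urllib.parse import urlencode

ADS_SEARCH_URL = "https://api.adsabs.harvard.edu/v1/search/query"


def build_fulltext_query_url(alias, start="1995-07", end="2018-05", rows=200):
    """Build a search URL that looks for alias in the full text of papers.

    The date window mirrors the coverage in Table 2; the field list and
    row count are illustrative assumptions.
    """
    escaped = alias.replace('"', '\\"')
    query = f'full:"{escaped}" pubdate:[{start} TO {end}]'
    params = {"q": query, "fl": "bibcode,pub,year", "rows": rows}
    return f"{ADS_SEARCH_URL}?{urlencode(params)}"


def auth_headers(token):
    """Return the bearer-token header that an authenticated request needs."""
    return {"Authorization": f"Bearer {token}"}


print(build_fulltext_query_url("astropy"))
```

Repeating such a request for every alias and collecting the returned bibcodes would give a per-alias list of papers to compare against the XML results.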
In order to be maximally comprehensive, we searched for more than just the titles (i.e., “astropy,” “spec2d,” etc.) of software packages. Instead, we searched for a list of carefully selected “aliases”: keywords and URLs pertaining to the software package and, if available, preferred citations as specified by each package’s author. In total, we identified 410 unique aliases for the nine software packages.16
As a normative phenomenon, authors suggesting their own “preferred citations” appears to be a practice unique to software. The preferred citation for spec2d, for example, includes a mention of the funder that supported its development. The Astropy Collaboration, on the other hand, recommends citing two Astropy papers in addition to including a specific acknowledgements message.17
It is completely up to the discretion of article authors to follow the instructions of preferred software citations. Given the confusing nature of these instructions, some of which are even in conflict with one another, their capacity to give “proper” credit as defined by the software authors is severely limited.
The compiled alias lists had limitations, and aliases that were confounding or ambiguous were not included. When searching for mentions of one software package called “stingray,” for example, we got hits for not only the software but for actual stingrays, the “stingray nebula,” other stingray-shaped objects, and even the Stingray Corvette. We also received false positives that included instruments named “stingray” in addition to unrelated email domains. For these reasons, despite efforts to remove false positives, our results for the Stingray software package likely include many erroneous results.
Another example of false positives came when searching for the DEEP2 DEIMOS pipeline in the spec2d package. False positives for this package included mentions of Deimos, which is also the name of the outermost Martian moon. Although cumbersome to work with and weed out when conducting the case study, these false positives really drove home the idea that relying on using search strings like titles to identify software is problematic when searching indices.
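The false-positive problem can be illustrated with a small Python sketch that flags matches for manual review when confounding words appear nearby. This is not the screening procedure used in the case study; the confounding-term lists, the context window, the function name, and the example sentences are all hypothetical.

```python
import re

# Hypothetical confounding terms, keyed by lowercase alias.
CONFOUNDING_TERMS = {
    "stingray": ("nebula", "corvette"),
    "deimos": ("phobos", "martian", "moon"),
}


def screen_mentions(text, alias, window=30):
    """Return (snippet, needs_review) pairs for each whole-word alias match.

    A match is flagged when a confounding term occurs within window
    characters of it, or when it directly follows "@" (an email domain).
    """
    pattern = re.compile(r"(?<!\w)" + re.escape(alias) + r"(?!\w)", re.IGNORECASE)
    terms = CONFOUNDING_TERMS.get(alias.lower(), ())
    results = []
    for match in pattern.finditer(text):
        start = max(0, match.start() - window)
        snippet = text[start:match.end() + window]
        follows_at = match.start() > 0 and text[match.start() - 1] == "@"
        near_confounder = any(term in snippet.lower() for term in terms)
        results.append((snippet, follows_at or near_confounder))
    return results


for sentence in (
    "Timing analysis was performed with Stingray.",
    "The Stingray Nebula is a young planetary nebula.",
    "Contact: observer@stingray.example.org",
):
    print(screen_mentions(sentence, "stingray"))
```

A heuristic like this only reduces the manual workload. It cannot tell a software mention from an instrument of the same name, which is why title strings are a weak substitute for identifiers.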
Use of the ADS API also incurred some limitations. The ADS API is not designed for keyword parsing and when we compared its results with those from the XML we did find some discrepancies. This is likely due to the ADS API’s use of full-text search, meaning that some search strings may get lost due to formatting-related syntax (e.g., stylization, whitespace, etc.). This is not a flaw in the ADS API itself; rather, full-text search is simply not a viable basis for quantifying software citations.
Astronomers want to give credit to software, and most probably believe they are doing so: it is unlikely that they realize that the kinds of “citations” they include in their papers are not necessarily machine-actionable and typically do not enable access to stable versions of software. Consequently, their citations do little to help software authors and others looking to access their code. Nevertheless, astronomers have been consistently trying to cite software, and mentions of software continue to increase over time (Fig. 4). Note that the drop-off in Fig. 4 is the result of incomplete data for the last year studied.
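A short Python sketch shows how formatting alone can hide a search string, and how normalizing text before matching can recover it. This is an assumption about the general mechanism, not a description of how the ADS indexes text; the function names and the sample string are hypothetical.

```python
import re
import unicodedata


def normalize(text):
    """Strip inline markup, collapse whitespace, and fold case."""
    text = unicodedata.normalize("NFKC", text)
    text = re.sub(r"<[^>]+>", "", text)  # drop tags such as <italic>
    text = re.sub(r"\s+", " ", text)     # line breaks and indentation
    return text.strip().casefold()


def contains_alias(text, alias):
    """Check for alias after normalizing both strings the same way."""
    return normalize(alias) in normalize(text)


raw = "displayed in <italic>SAO Image</italic>\n    DS9"
print("SAO Image DS9" in raw)                # exact match on the raw text
print(contains_alias(raw, "SAO Image DS9"))  # match after normalization
```

Removing tags without inserting a space can also join words that were meant to be separate, so normalization trades one kind of mismatch for another rather than eliminating the problem.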
Our ADS API search using the same software aliases also matched the trend (Fig. 5).
This suggests that software mentions are independent of publisher (non-AAS vs AAS) and that there is little need to convince astronomers that software citation is important. Rather, the need to change software citation practices such that citations are machine-actionable is paramount.
Of the 410 aliases searched, 109 were found for the nine packages. This means that ~26% of software aliases were found in our search. Many of the aliases were not in bibliographic entries as expected. Instead, many software mentions were found in footnotes and acknowledgment sections (Fig. 6).
Software aliases were also found in numerous unique places. Mentions of spec2d, for example, had aliases in 51 different locations. Such variability is problematic for indexing purposes, leading to issues in locating and identifying software citations. Concerningly, 343 papers were also found to include a mention of a software alias in the text, but these papers did not give any recognizable form of credit. For example, aliases such as “astropy.io” were found in-sentence without any footnote, reference, or hyperlinks.
Relying on full-text search and preferred citations results in software authors losing credit and hinders efforts to access software mentioned in astronomical studies. There is seemingly ample advocacy encouraging proper credit-giving practices; however, most software citations to date cannot be found using standard tools.
Librarians can support software citation even if they are not the ones archiving it. Librarians should encourage authors to archive copies of their code by making deposits of it in trusted repositories and to create machine-actionable citation files. The latter is accomplished by creating machine-actionable metadata (e.g., codemeta18 or CFF19 files). Librarians can also promote the idea that preferred citations and instructions about attribution only enable software citation when they actually point directly at archived code. Resources and guidelines can be shared that enable a change in norms, best practices, and behaviors.
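As an illustration of what such a machine-actionable file contains, the Python sketch below renders a minimal CITATION.cff using keys from the Citation File Format 1.2.0 schema. The function name is hypothetical, and so are the package name, author, version, DOI, and release date in the usage example; they are placeholders, not real records.

```python
def render_citation_cff(title, authors, version, doi, date_released):
    """Render a minimal CITATION.cff document as a string.

    authors is a list of (family_name, given_name) tuples. Values are
    emitted as double-quoted YAML strings to avoid type surprises.
    """
    def quoted(value):
        escaped = str(value).replace("\\", "\\\\").replace('"', '\\"')
        return f'"{escaped}"'

    lines = [
        "cff-version: 1.2.0",
        'message: "If you use this software, please cite it using these metadata."',
        f"title: {quoted(title)}",
        "type: software",
        "authors:",
    ]
    for family, given in authors:
        lines.append(f"  - family-names: {quoted(family)}")
        lines.append(f"    given-names: {quoted(given)}")
    lines.append(f"version: {quoted(version)}")
    lines.append(f"doi: {quoted(doi)}")
    lines.append(f"date-released: {quoted(date_released)}")
    return "\n".join(lines) + "\n"


# Placeholder values for illustration only.
print(render_citation_cff(
    title="examplepkg",
    authors=[("Doe", "Jane")],
    version="1.0.0",
    doi="10.5281/zenodo.0000000",
    date_released="2020-01-01",
))
```

The important design point is that the DOI should identify the archived release itself, so that the version field and the identifier describe the same static copy of the code.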
Librarians can also influence publishers by advocating for enforced software citation policies that follow software citation principles. Our case study suggests that it is insufficient to say that you need to give credit for software. People are already doing this. The issue is that these citations are not necessarily actionable or quantifiable. Librarians can also advocate for publishers to give article authors examples to build on and to make it clear how much editorial review software citations will receive. Article authors reasonably believe that if they make a mistake it will be caught by the publishers. However, at the present moment, the norms are in flux and article reviewers may or may not actually know what to look for in software citations.
For a more detailed overview of the case study, a comprehensive description and discussion was published in the Astrophysical Journal Supplement Series [2].
This research has made use of NASA’s Astrophysics Data System.