Abstract

It has been hypothesized that articles that practice open science have out-sized impact in their communities. We analyzed four years of citation data from the Astrophysical Journal, the Astrophysical Journal Letters, the Astrophysical Journal Supplement, and the Astronomical Journal to investigate whether including data and enhanced figure displays increases citations. Articles with these enhanced digital assets have equal or higher citations than articles without them. The implication is that authors who take the time to include or link to underlying tabular and figure data, or who use enhanced graphics such as animations or interactive figures, reap bibliometric benefits.

Introduction

The American Astronomical Society (AAS) has long offered its authors the ability to publish data with their peer-reviewed articles. In 1993, curated versions of long tables were mailed to subscribers on CD-ROMs. This series was discontinued when the data and author-generated video files could be hosted in the HTML article. More data options and enhanced graphics capabilities were subsequently added. Due to their nature, these “digital assets” are only available via the HTML article, which is the version of record. The digital assets available to all authors are:

Machine readable tables (MRT),
Data behind the figure (DbF),
Figure sets (FS),
Interactive figures (IF),
Animations, and
DOI links to external repositories containing data related to the article.

These items allow authors to communicate their science results beyond static figures and provide direct and indirect access to the underlying article data. All digital items are curated by two PhD data editors who work with authors to make sure the published content reflects the author’s original intent while ensuring conformity and standardization. Having curated data content means that tabular data can be read by numerous popular astronomical software packages, such as astropy [1][2] and TOPCAT [3], and are regularly ingested by CDS into their VizieR catalog service.

Over the 2017-2020 range, the two AAS data editors curated 3626 articles, of which 2122 articles had MRTs, 532 had DbFs, 407 had FSs, 35 had IFs, 633 had animations, and 444 had external data links.

Articles with any of these digital assets can be easily identified by a data tab at the top of the HTML article that links directly to the assets in the article. Recent examples are [4], which has an MRT for Table 2 and the spectra shown in Figures 4 and 5 as DbFs, and [5], which has an inline animation for Figure 12 and a multi-extension FITS file in a .tar.gz package containing the data behind Figure 2.

Approximately 20% of our articles published over the last six years have taken advantage of these options. Having a large fraction of AAS journal content available in these formats has obvious advantages to readers and archives, but is there a more tangible benefit for the author for doing the extra work required to incorporate these digital assets into a manuscript?

As a first step towards addressing this question in astronomy and astrophysics, we obtained from our publisher, the Institute of Physics Publishing (IOP), the read and citation information for almost 18,000 articles published between 2017 and 2020. Studies in other fields indicate that articles that make data available have higher citations than articles that do not [6]. With this data set, we test whether citations are greater for AAS articles with digital assets. The citation data come from the IOP’s internal Midas database, which is populated by the bibliographic database Web of Science.

The data set includes the Astrophysical Journal (ApJ), the Astrophysical Journal Letters (ApJL), the Astrophysical Journal Supplement (ApJS), and the Astronomical Journal (AJ).

Figure 1. The citation distribution for the 2017-2020 articles. The y-axis is the number of articles and the x-axis is the citation bin. The red squares show the non-digital asset articles while the blue diamonds are articles with digital assets. The purple "x"s show the digital asset articles multiplied by 4.3 to scale them to the same numbers as the non-digital asset articles.

Results

The total sample has 17,980 articles, of which 3,400 have digital assets. The articles with these assets have higher citations in both the mean and the median (10.9 and 5) than the 14,580 articles without (9.3 and 4). Figure 1 shows the distribution of the number of citations in bins five citations wide. The blue diamonds and red squares represent the articles with and without digital assets, respectively. Also included are the digital asset counts multiplied by 4.3 (purple ‘x’s), which scales them to the same total as the non-digital asset articles. Relatively fewer of the digital asset articles have fewer than five citations. The figure also shows that the digital asset articles are slightly overrepresented, in relative terms, across the majority of citation bins. Table 1 lists the citation distribution shown in Figure 1.
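The 4.3 scaling factor follows directly from the sample sizes quoted above (14,580 / 3,400). As a minimal sketch of how such a binned and scaled comparison could be computed, assuming fixed bins five citations wide and per-article citation counts held in a plain list (the function names and the short example list are hypothetical and are not the Midas data):

```python
from collections import Counter

# Sample sizes quoted in the text.
N_TOTAL = 17_980
N_WITH_ASSETS = 3_400
N_WITHOUT_ASSETS = N_TOTAL - N_WITH_ASSETS  # 14,580

# Factor that scales the digital asset counts up to the size of the
# non-digital asset sample (about 4.3).
SCALE = N_WITHOUT_ASSETS / N_WITH_ASSETS


def bin_citations(citations, width=5):
    """Count articles per bin; bin k covers [k * width, (k + 1) * width)."""
    return Counter(c // width for c in citations)


def scaled_counts(binned, scale):
    """Multiply every bin count by a constant factor."""
    return {k: n * scale for k, n in binned.items()}


# Hypothetical citation counts, for illustration only.
example_with_assets = [0, 3, 4, 7, 12, 12, 26]
binned = bin_citations(example_with_assets)
print(round(SCALE, 2))         # scale factor from the quoted sample sizes
print(sorted(binned.items()))  # (bin index, number of articles)
```

Scaling by a single constant preserves the shape of the distribution, so the scaled points can be compared bin by bin with the larger sample.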

Table 1. Citation Distribution. The table gives the citation bin size, total number of articles, number of articles without digital assets, number of articles with digital assets, and the number of articles with digital assets multiplied by 4.3 to scale to the same number of articles without assets.

With such a large data set, we can investigate how the citations vary across the different journals and over time. Table 2 shows the breakdown by journal. All of the journals have higher citations for digital asset articles except for the ApJL. The ApJL has the same median (six citations) for both groups, but the mean is greater for articles without digital assets (14.35) than for articles with digital assets (12.65). The reason is that the ApJL was dominated by relatively few articles with very high citation counts in this period. Of the 41 ApJL articles with more than 100 citations, 36 do not have digital assets, which greatly skews the mean values. In addition, of the six most highly cited articles in the database, four are in the ApJL, with 1,393; 1,098; 632; and 539 citations, respectively. Only the 632-citation Letter had digital assets.
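The pull of a few very highly cited Letters on the mean can be illustrated with a small sketch. The two samples below are synthetic and chosen only for illustration (they are not the ApJL data), and the helper name `summarize` is hypothetical. Both samples share a median, yet a single extreme value is enough to raise the mean of the first well above that of the second:

```python
from statistics import mean, median


def summarize(citations):
    """Return (mean, median) for a list of citation counts."""
    return mean(citations), median(citations)


# Synthetic samples: identical except for a single extreme value.
without_assets = [2, 4, 6, 6, 8, 10, 1000]
with_assets = [2, 4, 6, 6, 8, 10, 20]

mean_without, median_without = summarize(without_assets)
mean_with, median_with = summarize(with_assets)

print(median_without == median_with)  # the medians agree
print(mean_without > mean_with)       # the outlier dominates the mean
```

This is why the median is the more robust comparison for a journal whose citation distribution has a long tail.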

Table 2. Citation by Journal. The rows are the number of articles, the mean number of citations, the median number of citations, the standard deviation, the standard deviation in the mean, the maximum number of citations, the percent of the total number of articles, and the percent of the articles with no citations.
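As a rough sketch of how the rows described in this caption could be computed for one group of articles, assuming that "standard deviation in the mean" refers to the standard error (the sample standard deviation divided by the square root of the number of articles). The function name, dictionary keys, and example counts are hypothetical:

```python
from math import sqrt
from statistics import mean, median, stdev


def citation_summary(citations, n_total):
    """Compute the summary rows described in the table caption.

    `citations` holds one citation count per article in a group and
    `n_total` is the number of articles in the full sample.
    """
    n = len(citations)
    sd = stdev(citations)  # sample standard deviation (n - 1 denominator)
    return {
        "n_articles": n,
        "mean": mean(citations),
        "median": median(citations),
        "std_dev": sd,
        # Assumption: "standard deviation in the mean" is the standard error.
        "std_dev_of_mean": sd / sqrt(n),
        "max": max(citations),
        "percent_of_total": 100.0 * n / n_total,
        "percent_uncited": 100.0 * sum(c == 0 for c in citations) / n,
    }


# Hypothetical citation counts, for illustration only.
rows = citation_summary([0, 2, 4, 6, 8], n_total=20)
for name, value in rows.items():
    print(name, value)
```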

The ApJS results might be the most revealing, as the ApJS is by its nature a very data-rich journal. It has the highest fraction of articles with digital assets (45.1%). The median number of citations for both types is 6, but the mean for the digital asset articles is almost 1.5 citations higher. It would be interesting to further divide this subset to search for alternative reasons for the increase in citations. For example, are the same trends seen by sub-discipline, or in theory versus observational articles? Unfortunately, it would not be easy to categorize the data set further, and with fewer than 1,000 ApJS articles, any further subdivision of the data increases statistical uncertainties.

Splitting the data by publication year shows an interesting trend (see Table 3). As expected, the number of citations increases with time for all articles; however, the increase for digital asset articles is greater than for articles without digital assets. After four years, the mean and median citations for digital asset articles (18.4 and 12) are significantly larger than those for articles without these items (16.8 and 9). This is strong evidence that these articles have greater long-term impact.

Table 3. Citation by Publication Year. The rows are the number of articles, the mean number of citations, the median number of citations, the standard deviation, the standard deviation in the mean, the maximum number of citations, the percent of the total number of articles, and the percent of the articles with no citations.

What about the so-called open access citation advantage (OACA)? It is well known that an article outside of a subscription paywall will generate more citations. This is true whether the article is itself open access or has a preprint available. For example, [7] and [8] showed this effect for articles with astro-ph preprints. The usual explanation for this effect is that more people can read an open access article or preprint, and the greater accessibility is reflected in the higher citation rate. Could the OACA effect explain the observed citation increase?

It is unlikely, for three reasons. First, prior to the AAS journals becoming gold open access in 2022, all articles were open access one year after publication. This is a relatively short time span behind a paywall. Second, authors could also make their articles open access at publication by paying a higher author publication charge, but fewer than 6% of the 2017-2020 articles were open access. Third, posting to astro-ph at submission or after acceptance is widespread except in a few disciplines such as heliophysics and laboratory astrophysics. Effectively, the results of the vast majority of the articles in this study were available to the community at or shortly after publication, indicating that OACA cannot explain the increase in citations.

Caveats

While we present strong evidence that articles that adhere to open science principles by providing digital assets have greater scientific impact, there are important caveats to consider for this initial attempt to assess impact.

First, the citations come from the Web of Science, which likely does not capture all of the citations that ADS would. For example, citations of important but not peer-reviewed resources like ATEL, the GCN, or ASCL are not counted by the Web of Science. This may be an important effect in some disciplines like astronomical software or time domain astrophysics, but the Web of Science should be a good proxy for the majority of the articles in the data set.

Second, this is a bulk analysis and makes no attempt to differentiate between different types of articles, for example, theory versus data-rich observational articles. Another example would be differences between sub-disciplines. A better test would be to focus on a specific sub-discipline and compare citations between similar article types published in the same time frame. Unfortunately, splitting the articles into these categories is not trivial and will have to wait for a subsequent analysis.
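A like-for-like comparison of this kind could be sketched as follows. Because sub-discipline and article-type labels are not available, this illustration uses journal and publication year as stand-in strata; the record layout, function name, and example values are all hypothetical and are not drawn from the data set:

```python
from collections import defaultdict
from statistics import median


def median_by_stratum(records):
    """Group articles by (journal, year) and compare median citations.

    Each record is a tuple (journal, year, has_assets, citations).
    Returns {(journal, year): (median_with, median_without)} for strata
    that contain both kinds of article.
    """
    groups = defaultdict(lambda: {True: [], False: []})
    for journal, year, has_assets, citations in records:
        groups[(journal, year)][has_assets].append(citations)

    result = {}
    for key, split in groups.items():
        # Skip strata that lack one of the two groups; no comparison is possible.
        if split[True] and split[False]:
            result[key] = (median(split[True]), median(split[False]))
    return result


# Hypothetical records, for illustration only.
records = [
    ("AJ", 2017, True, 12), ("AJ", 2017, True, 8),
    ("AJ", 2017, False, 7), ("AJ", 2017, False, 5),
    ("ApJS", 2018, True, 9), ("ApJS", 2018, False, 6),
    ("ApJL", 2019, True, 4),  # no comparison article, so this stratum is dropped
]
print(median_by_stratum(records))
```

Comparing within strata keeps articles of very different ages or venues from being pooled, at the cost of smaller samples in each comparison.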

Recommendations

The preliminary evidence strongly suggests that articles with enhanced figures and underlying data content receive more citations than articles without. How can or should all AAS authors take advantage? We should all strive for reproducible articles, but what does it mean to “provide the data”? Here are some recommendations for future submissions.

While not perfect, astronomy is fortunate to have well established archives. Authors should take advantage and provide the necessary information for readers to obtain the original data sets. At a minimum, this should include observational information such as proposal/observation identifiers, observation dates, target names, etc. for the specific archives where the data can be found. The AAS already asks authors if they have used MAST or IPAC data in their articles at submission. Authors should take advantage of the available resources to facilitate links and references to the original archival data sets.

Highly processed data should be made available for reuse along with detailed descriptions of how the data were processed. Authors should always strive for long-term preservation of this type of data. This means publishing the data with the article (as MRTs, DbFs, etc.), in a specific institutional archive that accepts data (MAST, WISeREP, AAVSO, CDS, etc.), or in a general repository that issues DOIs, such as Zenodo, Figshare, or DataVerse. Storing data on a personal website all too often results in lost data. Authors should consult the data guide prior to submission to see what needs to be done, what tools or tutorials are available to help, and what options work best for their specific data needs.

Likewise, authors should also review the graphics guide to see how their science narrative can be improved by including animations, figure sets, or interactive figures.

Finally, authors should ask themselves prior to submission, “what data gets the attention?” Like it or not, time and resources are not infinite and not all data are created equal. Be judicious with your selections. It is better to put effort into a well formatted and well described data set that follows established guidelines [9] than to arbitrarily dump everything.

Conclusions

An investigation of the citations of 17,980 AAS Journal articles published between 2017 and 2020 shows that articles that take advantage of digital assets have higher citations than articles without these products. The effect is most pronounced for older articles. Currently only about 20% of articles have one or more digital assets, but these articles have out-sized citation impact.

Authors: Improve Your Bibliometrics with Digital Enhancements