Open Science Data without Curation. Is It Useful? An American Astronomical Society Publishing Perspective

The AAS has encouraged and enabled authors to practice good scientific citizenship by including relevant underlying research data in their published articles. A critical component is curation by our data editors, which is effectively a peer review of the published data.

Published on Apr 27, 2022

Abstract

The American Astronomical Society (AAS) publishes approximately 5000 scholarly astronomical articles each year in our six Journals. For the last 20 years we have accepted and solicited data from authors to be integrated into the published article. Data includes machine readable tables, data behind figures, interactive figures, and external links to data repositories.

All of these data types are reviewed by two Data Editors who create or edit metadata, enforce formatting standards, and obtain final author approval before publication. This process is time-consuming but ultimately results in robust and useful data products that are ready for immediate reader use and can be ingested into other databases, e.g., CDS/VizieR. This curation process is essentially a peer review of the data.

A publishing data policy that follows the tenets of the Open Science movement would mandate that all data used in the article be available at publication for reproducibility. This can mean archiving in public repositories or even self-archiving for a specific period of time, with the level of curation left up to the author.

Our data review work provides unique insights into the effort authors put into making data available and useful. In short, the quality is often lacking, which means significant challenges for the end user. Errors are common in both the metadata (generally inadequate documentation) and the data (missing data, duplication, significant digit issues, etc.). Poor data products result from a lack of author training in curation and from laziness. Given these author limitations, any Open Science policy without Data Editors does not fully support its underlying ethos of data reproducibility. Data needs to be treated with the same considerations as the science itself and reviewed accordingly.

Background

The American Astronomical Society (AAS) publishes six scholarly Journals. They are the Astrophysical Journal (ApJ), Astrophysical Journal Letters (ApJL), Astrophysical Journal Supplements (ApJS), the Astronomical Journal (AJ), the Planetary Science Journal (PSJ), and Research Notes of the AAS (RNAAS). In 2020, the AAS published over 5000 articles, which was approximately 30% of the total astronomical content. In addition to 36 Science Editors, we have one statistics editor and two data editors who work with manuscripts during the peer review process.

In the beginning

The AAS began its foray into electronic publishing with the VHS tape series for animations in 1992 and a CD-ROM series for tabular data in 1993. The CD-ROM series was billed as a way to keep data from being orphaned and make it available for the long term. While this innovative approach got electronic data to the reader in standardized formats, it was not timely. The VHS tapes and CD-ROMs were only mailed to subscribers once or twice a year, depending on how many animations and tables had been published since the last mailing. While both series were ultimately discontinued, they set the precedent of making electronic information available, predating the open data principle of the open science movement by five years.

In 1995, ApJ Letters was the first AAS journal to publish an HTML version of the accepted article. By 1998, all of the AAS Journals had HTML articles. With electronic publishing, supplementary materials and data products that were once hosted on private websites or mailed on physical devices could now be presented with the article. The first animation was published in 1998 [1]. To encourage and then process the publication of large tabular data sets, I was hired in 2000 as the first AAS Journals data editor. Continuing the precedent set with the CD-ROM series, I worked with authors to create a machine readable table (MRT) version of their data. MRTs use the VizieR catalog standard developed at CDS.

With the success of both electronic publishing and MRT adoption, the AAS expanded its online only capabilities. Sometimes these needs were driven by author demand. For example, Figure Sets were introduced in 2004 to allow authors to publish large numbers of similar figures online only without having to pay the exorbitant publication charges that would result from having all the figures in print. The first article with a Figure Set was published in 2005 [2]. It has 114 components in the Figure Set associated with Figure 9.

Next came support for authors to provide the data behind the figures, which began in 2010. Data from vector graphics are converted to MRTs while image data are generally provided in author supplied FITS files. This option has been used extensively to capture photometric and spectroscopic data on transient events such as supernovae, cataclysmic variables, and microlensing events.

In 2014, another data editor, Dr. August Muench, was hired to handle the growth in online only items and help build new data products. He was instrumental in the development (and continual support) of interactive figures and external repository support in 2014 and 2015, respectively. In addition, 2015 marked the end of print publication of AAS Journals. Our journals are electronic only, meaning that the HTML is the article of record. The PDF copy many readers are familiar with is only a derived copy, as it contains no electronic items.

In 2020, the data editors processed over 1000 articles with digital assets, which is approximately 20% of the total articles published in 2020 by the AAS. Since 2000, the AAS has published almost 15,000 MRTs, 4,277 animations, 1,484 Figure Sets, almost 1,500 figures with data, and 120 interactive figures. Note that interested parties can browse and view all the published animations, Figure Sets, and interactive figures via the Astronomy Image Explorer.

The AAS Publishing Data Philosophy

AAS Journals encourages data publication but has no hard policy and shuns generic "Data Availability" statements. Instead, the AAS encourages authors to utilize our different data publishing options to provide the data and information. This makes the information available in well structured formats, and since these are considered part of the published record, they will migrate with the article in the future, which makes them long lived. Unfortunately, authors are not always aware of our publishing capabilities, so the data editors also review new submissions for data content. A short report directed to the author is written by the data editor for candidate manuscripts. If the Science Editor agrees with the data editor, the report is sent to the author along with the reviewer’s comments to consider for revision. The report can include requests to provide data behind specific figures, information on how to improve tables or animations, tips on how to save money on publication charges by merging similar figures into a Figure Set, advice on citing software and archival data properly, and guidance on establishing third party repositories that issue DOIs for supplementary materials.

Why Data Curation is Necessary

Obviously any data that is lost, e.g., through a hard drive failure, is tragic. Likewise, data that lacks metadata or is poorly formatted can be as useless as lost data. Curation provides both a standard for utilizing the information and a long term home. Some of the earlier AAS attempts at data collection suffered when the curation component was neglected. Take the initial attempts at providing astronomical animations in the HTML article. Readers could download the author’s animation file, but they had to figure out how to display the movie, which could be problematic as there were no video or codec standards. In addition, animations did not always match the supporting figure in the article. With no descriptive text in the figure caption, readers had little information on the animation contents. So while early animation files are available with their HTML article, they are not always viewable for the reader due to changing standards and obsolete formats, much like the VHS series.

To address these issues, the AAS made numerous changes. First, animations now stream inline with the HTML article. Readers no longer have to download and play the video locally, although that is still an option. Second, we have a set of animation guidelines that recommend standardizing videos to the .mp4 format with an H.264 codec and a frame rate above 15 frames per second. Text describing what the animation shows must also be added to the figure caption so it is clear what to expect before clicking play. If authors do not or cannot do this, a data editor makes the necessary changes to the video files and figure caption text to ensure the highest quality animation experience for the Journal’s readers.
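As a rough illustration of how such guidelines could be checked automatically, here is a minimal Python sketch. It is not the AAS's actual tooling: the function name `check_animation`, the metadata keys, and the constants are hypothetical, and the metadata dictionary is assumed to have been extracted from the video file beforehand by some other tool.

```python
# Guideline values taken from the text above: .mp4 container,
# H.264 codec, frame rate above 15 frames per second.
REQUIRED_CONTAINER = "mp4"
REQUIRED_CODEC = "h264"
MIN_FRAME_RATE = 15.0


def check_animation(meta, caption):
    """Return a list of guideline problems for one animation.

    `meta` is an assumed dict with the keys 'container', 'codec', and 'fps'.
    `caption` is the figure caption text that should describe the animation.
    """
    problems = []
    if str(meta.get("container", "")).lower() != REQUIRED_CONTAINER:
        problems.append("container is not .mp4")
    if str(meta.get("codec", "")).lower() != REQUIRED_CODEC:
        problems.append("codec is not H.264")
    if float(meta.get("fps", 0.0)) <= MIN_FRAME_RATE:
        problems.append("frame rate is not above 15 fps")
    if not caption.strip():
        problems.append("caption does not describe the animation")
    return problems
```

An empty list means the animation passes all four checks; anything else is the kind of item a data editor would raise with the author or fix directly.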

The Benefits of Well-formatted, Standardized Data

The data editor’s job is to curate the data products the AAS publishes. This involves reviewing how the data relates to the article and standardizing the data. The first part ensures that the reader knows what is in the data, while the second fixes many problems with author generated data, including lack of documentation, errors in the data, missing data, unrealistic significant digits, etc. With good curation, the data will stand the test of time. As an example, look at the tables from the first articles in the 1995 CD-ROM series. While the CD-ROMs are no longer viable storage formats, the data has migrated to the web and is still readable and usable. Utilizing this standard for over 25 years gives readers a consistent format that can be read by a wide variety of tools and programs (TOPCAT, astropy, etc.).
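The kinds of problems listed above (duplication, missing data, unrealistic significant digits) lend themselves to simple automated screening. The following Python sketch is purely illustrative and is not the data editors' actual workflow: the function names, the column names, and the rule of comparing decimal places between a value and its uncertainty are all assumptions, and values are assumed to be plain decimal strings rather than scientific notation.

```python
def decimals(text):
    """Count the digits after the decimal point in a plain decimal string."""
    return len(text.partition(".")[2])


def audit_rows(rows, key, value, error):
    """Flag common problems in a table given as a list of dicts of strings.

    `key` names the identifier column, `value` the measured quantity,
    and `error` its uncertainty. Returns (row index, message) pairs.
    """
    issues = []
    seen = set()
    for i, row in enumerate(rows):
        name = row.get(key, "").strip()
        if name in seen:
            issues.append((i, "duplicate entry"))
        seen.add(name)

        val = row.get(value, "").strip()
        err = row.get(error, "").strip()
        if not val:
            issues.append((i, "missing value"))
            continue
        # A value quoted to more decimal places than its uncertainty
        # claims precision the measurement does not have.
        if err and decimals(val) > decimals(err):
            issues.append((i, "more decimal places than the uncertainty supports"))
    return issues


# Hypothetical example table; the object names and numbers are made up.
rows = [
    {"name": "obj-a", "mag": "15.2", "e_mag": "0.1"},
    {"name": "obj-a", "mag": "15.3", "e_mag": "0.1"},
    {"name": "obj-b", "mag": "", "e_mag": "0.1"},
    {"name": "obj-c", "mag": "15.234567", "e_mag": "0.1"},
]
report = audit_rows(rows, "name", "mag", "e_mag")
```

A screen like this only catches mechanical problems. Judging whether the metadata adequately documents the data still requires a human reviewer.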

Conclusions

Providing access to article data is only the first step for open science. Equally important is curated data that is available for the long term. Over the last 25+ years, the AAS has learned quite a bit about data publication and curation. We find that it is critical to adopt the best available standards, which provide wider integration and long term preservation. To achieve these standards, dedicated staff is necessary; otherwise you get a garbage in/garbage out scenario. In essence, data needs to be reviewed just like the science. Having in house data experts do the curation allows authors to concentrate on the science and not worry about preservation and standardization. Unfortunately, with only two data editors, the AAS cannot process as much as we would like. Training and education of authors should help reduce the load by producing better input.

Acknowledgments

I would like to thank the conference leaders for their tireless efforts in organizing and running the LISA IX meeting.

A PDF of the PowerPoint talk given at the LISA IX meeting is available at Zenodo at 10.5281/zenodo.4884917. A YouTube video of the talk is available at https://www.youtube.com/watch?v=_74DDH8SsjY.
