Illuminating the Characteristics and Accessibility of Data Behind Papers

This article describes a dissertation study that explored the features and accessibility of astronomy data used to produce scholarly papers. The study found that more than two-thirds of sampled papers correspond to some inaccessible data, raising considerations for open science.

Published on Apr 27, 2022

This paper informs the LISA community about a dissertation conducted in the field of Library and Information Science. To understand the extent to which astronomy data is accessible to other researchers, a questionnaire was sent to corresponding authors of a sample of astronomy journal articles published over the past two decades, with completed responses received from authors around the world. Respondents provided information about the various kinds, formats, and locations of data that contributed to their published papers, along with demographic information. Paper-level metadata was also incorporated into analysis of the questionnaire. The questionnaire was informed by a series of background interviews, which provided qualitative insight into the quantitative results. The interviews also incorporated a “think-aloud” exercise to map the searching strategies used by astronomers while reading the literature and attempting to locate data behind papers. An overview of study outcomes is presented, along with considerations for successfully integrating relevant data associated with publications.

1 Introduction

This paper summarizes a dissertation conducted in the field of Library and Information Science (LIS) that is relevant for astronomy librarianship and scholarly communication [1]. Research studies in LIS often combine qualitative and quantitative methodological approaches to draw conclusions about a population of interest [2]. The mixed-methods project presented here focused on astronomers as the target population, while seeking generalizable insights into scientific research practices and scholarly information behavior. An overarching goal of the project was to estimate the extent to which astronomy data is accessible to other researchers. The scholarly literature served as a lens for investigating the nature and characteristics of data used to produce published papers. An initial series of background interviews was conducted to better understand the practices and attitudes of astronomers, as well as the decision-making processes through which astronomers identify data while reading published papers. Based on analysis and interpretation of the interviews, a questionnaire was sent to corresponding authors of a sample of journal articles inquiring about the data underlying specific articles. Responses to the questionnaire were interpreted alongside author-level and bibliographic information associated with the sampled articles. The following sections present an overview of the study rationale (background), methodology, and outcomes, followed by a brief discussion and conclusions.

2 Background

Linking data to publications and other scholarly products such as software is an increasing priority for journal publishers, research communities, and the Open Science movement more broadly [3]. The FAIR principles for making data Findable, Accessible, Interoperable and Reusable have been widely adopted for data stewardship planning and implementation [4]. However, much research data across disciplines is not presently “FAIR”, particularly data corresponding to older papers. Heidorn [5] refers to such data as “dark data” that could support new discoveries if “brought to light”. Scholars have documented many reasons that scientific data may remain hidden, ranging from lack of time, resources, and incentives to concern about being “scooped” [6]. Data can also fall into the category of older “legacy” or “heritage” data for which significant contextual knowledge and effort would be required to migrate from older file formats and media to modern standards and infrastructures [7], [8].

These dynamics present challenges to research transparency. Moreover, in astronomy and other disciplines, phenomena change over time, requiring an empirical record that encompasses longer timespans and a collective ethic of data preservation [9]. In seeking to develop a more comprehensive understanding of the availability of various astronomy data associated with publications, the study described here was inspired by the adage, “You cannot manage what you do not measure” [10]. In other words, to overcome barriers to sharing and access and ensure that the needs of the research community are met, it is essential to scope the problem at hand. Through an exploratory survey of data behind papers, complemented by qualitative insight, this study aimed to inform current and future data management practices, infrastructures, and policies in astronomy and beyond.

3 Methodology

The study was guided by a series of research questions surrounding whether, how, and why data are available online for others to access. Research questions covered areas such as:

  • What are characteristics of the astronomy researchers most likely to possess inaccessible datasets?

  • What are the significant attributes of inaccessible astronomical data?

  • Where are accessible and inaccessible astronomical data located and in what format(s)?

A phased approach to study design and analysis was implemented, with two distinct components. First, a series of background interviews was conducted in which astronomers reviewed a small sample of journal articles while describing their decision-making processes to locate underlying data. Second, an online questionnaire was sent to corresponding authors of a much larger sample of journal articles to obtain information about data behind specific papers on a broader scale. These two phases are described below.

3.1. Phase I: Background interviews

Research field work was conducted in late 2018 and early 2019, including interviews with a convenience sample of 7 astronomers. Interviews lasted approximately one hour each. The first half of each interview was semi-structured and covered research and collaboration practices, data use, and data sharing. For the second half of each interview, participants engaged in a “think-aloud” activity. This task involved reviewing a pre-selection of approximately 5 published journal articles each and speaking aloud about decision-making processes to understand and locate underlying data while reading the articles. Twenty-seven unique papers were reviewed by participants, with 2 interviewees reviewing the same set of 5 papers for validation purposes (32 searches total). These articles were selected based on keywords shared with each interviewee’s most-cited publication. For analysis, the interviews were transcribed, and semi-structured and think-aloud portions were treated as separate datasets. Semi-structured interviews were analyzed using an open coding strategy to identify themes and sub-themes [11], following a Grounded Theory approach [12].

Think-aloud exercises were analyzed using protocol analysis, a method used in fields such as psychology to interpret decision-making processes through verbal reports from participants [13], [14]. Protocol “statements” were extracted from interview transcriptions and each statement was coded according to Belkin, Marchetti & Cool’s [15] eight facets of information-seeking with text: Scan, Search, Learn, Select, Recognize, Specify, Information, and Metainformation. Coding each statement according to this simple taxonomy illuminated distinctions in individual and article-level search patterns. To better understand these patterns, the percentage of statements in a search (defined as a single interviewee reviewing a single paper) in which each facet was used was calculated. Note that where appropriate, multiple facets/codes were applied to the same think-aloud statement. A k-means cluster analysis [16] was then conducted using the percentages of each facet occurring within each search. A 6-cluster model was created, revealing similarities in search strategies among clustered searches, to be discussed below.
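The facet-percentage and clustering steps described above can be sketched as follows. This is a minimal illustration and not the dissertation's actual analysis: the software used in the study is not stated here, so the sketch uses a plain Lloyd's-algorithm k-means written with the Python standard library. The function names (`facet_percentages`, `kmeans`), the search labels, and the coded statements are all hypothetical, and the toy example uses 2 clusters over 4 searches where the study fitted a 6-cluster model over 32 searches.

```python
import random
from collections import Counter

# The eight facets named in the text (Belkin, Marchetti & Cool).
FACETS = ["Scan", "Search", "Learn", "Select",
          "Recognize", "Specify", "Information", "Metainformation"]

def facet_percentages(statements):
    """Percentage of a search's statements that carry each facet code.

    `statements` is a list of sets of facet codes, one set per statement.
    A statement may carry several codes, so the values can sum to over 100.
    """
    counts = Counter(code for codes in statements for code in codes)
    n = len(statements)
    return [100.0 * counts[f] / n for f in FACETS]

def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def kmeans(points, k, iterations=100, seed=0):
    """Plain Lloyd's algorithm; returns one cluster label per point."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # initial centroids drawn from the data
    labels = None
    for _ in range(iterations):
        new_labels = [
            min(range(k), key=lambda c: squared_distance(p, centroids[c]))
            for p in points
        ]
        if new_labels == labels:  # assignments stable: converged
            break
        labels = new_labels
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # keep the previous centroid if a cluster empties
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels

# Hypothetical coded statements for four searches (illustrative, not study data).
searches = {
    "A.1": [{"Search"}, {"Search", "Metainformation"}, {"Metainformation"}, {"Select"}],
    "A.2": [{"Search", "Metainformation"}, {"Search"}, {"Metainformation"}, {"Search"}],
    "B.1": [{"Scan"}, {"Learn"}, {"Scan", "Learn"}, {"Information"}],
    "B.2": [{"Learn"}, {"Scan"}, {"Learn", "Information"}, {"Scan"}],
}
profiles = {name: facet_percentages(s) for name, s in searches.items()}
labels = kmeans(list(profiles.values()), k=2)
clusters = dict(zip(profiles, labels))
print(clusters)
```

In this toy input, the two searches dominated by “Search” and “Metainformation” statements end up in one cluster and the two dominated by “Scan” and “Learn” in the other, which mirrors the kind of grouping by search strategy that the cluster analysis is meant to surface.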

3.2. Phase II: Questionnaire

For the second phase of the study, an online questionnaire was sent to 1,571 corresponding authors of a large sample of astronomy journal articles on May 24, 2019. Each invitation to complete the questionnaire included a citation to a single sampled paper authored by the invitee/corresponding author. Invitees received no more than one invitation to complete the questionnaire, targeting one of their published journal articles.

Sampled journal articles belong to two subsamples. The first subsample was selected based on an initial, yet-unrealized goal of the study to perform text mining with a specific corpus of journal articles. This subsample included authors of papers published in Publications of the Astronomical Society of the Pacific (“PASP”) over approximately two decades (1994-2016) and for which bibliographic records included author email addresses (n=1,094). The second subsample was selected as part of a separate research project that conducted a topic analysis of grant proposals, described in Stahlman & Heidorn [17]. This subsample included authors of papers associated with National Science Foundation Astronomy & Astrophysics grants originating in 2016 (n=477), where papers were linked to grants through award numbers in the funding statements. By the completion deadline (June 7, 2019), 211 responses were received from corresponding authors of sampled journal articles – 104 responses related to the “PASP” subsample of journal articles and 107 responses related to the “NSF” subsample of journal articles.
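For orientation, the response rates implied by these counts can be worked out directly. The short sketch below assumes that every author in each subsample received an invitation, consistent with the 1,571 invitations reported above. The variable and function names are illustrative, and the rates themselves are derived here rather than quoted from the study.

```python
# Subsample sizes and response counts as reported in the text.
invited = {"PASP": 1094, "NSF": 477}
responded = {"PASP": 104, "NSF": 107}

def response_rate(responses, invitations):
    """Responses as a percentage of invitations."""
    return 100 * responses / invitations

rates = {name: response_rate(responded[name], invited[name]) for name in invited}
overall = response_rate(sum(responded.values()), sum(invited.values()))

for name, rate in rates.items():
    print(f"{name}: {rate:.1f}%")      # PASP: 9.5%, NSF: 22.4%
print(f"Overall: {overall:.1f}%")      # Overall: 13.4%
```

On these figures, the “NSF” subsample responded at more than twice the rate of the “PASP” subsample, which helps explain why the two subsamples contribute nearly equal numbers of responses despite their different sizes.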

The questionnaire was designed to obtain a comprehensive picture of data underlying specific journal articles to the extent possible. Questions targeted whether the respondent utilized various kinds, formats, and types of observational and archival data to produce the focal paper. Questions also targeted whether the respondent utilized simulation data and/or produced new derived data through the analysis process for the focal paper. Other data-related variables were captured as well, such as whether underlying data was discovered through reading the literature, whether other researchers have requested the data, whether specialized software would be required to understand the data, and importantly, whether the various data are published online for others to access. These variables were analyzed alongside career-related information about the respondents (such as profession, career stage, and tenure status) and bibliographic information associated with the papers. Descriptive statistics are presently documented in Stahlman [1]. Simple statistical tests were also conducted where possible for exploratory purposes (Welch’s t-test and chi-square). However, due to the relatively small number of respondents and sparse responses for many question areas, capabilities for inferential statistics were limited.
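The two exploratory tests named above can be sketched as follows. This is a standard-library illustration of Welch's t-test and a Pearson chi-square test on a 2x2 table, not the study's own analysis code. The function names, the choice of variables, and all input values are hypothetical and not study data.

```python
import math

def welch_t(sample_a, sample_b):
    """Welch's t statistic and Welch-Satterthwaite degrees of freedom."""
    na, nb = len(sample_a), len(sample_b)
    mean_a = sum(sample_a) / na
    mean_b = sum(sample_b) / nb
    var_a = sum((x - mean_a) ** 2 for x in sample_a) / (na - 1)
    var_b = sum((x - mean_b) ** 2 for x in sample_b) / (nb - 1)
    se_a, se_b = var_a / na, var_b / nb  # squared standard errors
    t = (mean_a - mean_b) / math.sqrt(se_a + se_b)
    df = (se_a + se_b) ** 2 / (se_a ** 2 / (na - 1) + se_b ** 2 / (nb - 1))
    return t, df

def chi_square_2x2(table):
    """Pearson chi-square statistic (no continuity correction) and p-value."""
    (a, b), (c, d) = table
    n = a + b + c + d
    row_totals = (a + b, c + d)
    col_totals = (a + c, b + d)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / n
            stat += (observed - expected) ** 2 / expected
    # With 1 degree of freedom, the upper tail equals erfc(sqrt(stat / 2)).
    p_value = math.erfc(math.sqrt(stat / 2))
    return stat, p_value

# Hypothetical publication years for papers with and without accessible data.
years_accessible = [2012, 2015, 2016, 2017, 2018]
years_inaccessible = [2001, 2009, 2014, 2016]
t, df = welch_t(years_accessible, years_inaccessible)
print(f"Welch t = {t:.2f}, df = {df:.1f}")

# Hypothetical 2x2 table: rows = tenured / not tenured,
# columns = data published online / not published online.
stat, p = chi_square_2x2([[20, 10], [10, 20]])
print(f"chi-square = {stat:.2f}, p = {p:.4f}")
```

Welch's variant is a natural choice when group sizes and variances differ, as they would for small, uneven groups of respondents. With sparse cells, however, the chi-square approximation becomes unreliable, which is consistent with the limits on inferential statistics noted above.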

4 Outcomes

The mixed-methods study was designed for evidence-based triangulation [18], where each data collection and analysis component contributed to holistic understanding of data accessibility and surrounding social implications. This section describes key results of each phase, followed by a synthesis of outcomes.

4.1. Interviews (Phase I)

The semi-structured portion of the background interviews (described in section 3.1) was primarily leveraged to inform design and interpretation of the questionnaire rather than for theory development. However, a forthcoming paper (currently in the revision process) will present a deeper exploration of these interviews in the context of data preservation. The think-aloud portion of the interviews resulted in a methodological demonstration through the 6-cluster k-means analysis of search facets. While another forthcoming paper will present these results in more detail, one cluster is presented in Figure 1 as an example of the technique.

Figure 1: Cluster example taken from Stahlman 2020; red indicates lower frequencies of the occurrence of a search facet, while blue indicates higher frequencies (C.4 = Kim & Joner 1997; F.3 = Seaton & Partridge 2001; F.4 = McKenzie & Schaefer 1999; G.2 = Krisciunas, Margon & Szkody 1998; G.5 = Veres et al. 2012).

This example shows a strategic pattern in which the context of the papers is already understood by the interviewees and data were collected by well-known facilities including Kitt Peak National Observatory (KPNO), Cerro Tololo Inter-American Observatory (CTIO), the Very Large Array (VLA), Sloan Digital Sky Survey (SDSS) and the Panoramic Survey Telescope and Rapid Response System (Pan-STARRS1). In this case, interviewees have less need to “learn” about the underlying data or “scan” the text looking for relevant information, with more emphasis placed on “searching” online for the known data and identifying “metainformation” descriptors about the papers and data. By grouping searches together based on these sorts of similarities, the method demonstrated by the study can potentially inform development of systems that support users and use of the literature to find data in more nuanced ways.

4.2. Questionnaire (Phase II)

Responses to the online questionnaire were received from 211 corresponding authors of sampled journal articles (described in section 3.2). Respondents provided information about specific sampled papers, as well as demographic and career-related details. To briefly summarize key demographics, a majority of questionnaire respondents (corresponding authors) held a Ph.D. degree at the time of questionnaire completion (87.6%), with most degrees in the field of Astronomy/Astrophysics (75.2%) and Physics (20.5%). Many respondents were early in their career at the time of questionnaire completion, with 35.2% between the ages of 30 and 39, and 66.7% earning their highest degree between 2011 and 2019. The international astronomy community was well-represented, with 41% of respondents from outside of the United States.

Specific papers addressed by questionnaire respondents (n=211) were published in 14 astronomy journals (see Table 1) between 1998 and 2019 (see Figure 2), with the largest share published in PASP (49.3%). As a result of the “NSF” subsample focus on newer papers, the mean year of publication is 2014 and the median year of publication is 2016. Most papers (78.7%) were categorized as research papers, while some fell primarily into the “astronomical instrumentation” category.

Table 1: Journals represented in questionnaire responses

Journal title

  • Publications of the Astronomical Society of the Pacific

  • Monthly Notices of the Royal Astronomical Society

  • Astrophysical Journal

  • Astronomical Journal

  • Astrophysical Journal Letters

  • Astronomy & Astrophysics

  • Astrophysical Journal Supplement Series

  • Nature Astronomy

  • Physical Review D

  • Classical and Quantum Gravity

  • Publications of the Astronomical Society of Japan

  • Solar Physics

Total journal articles represented = 211


Figure 2: Frequency by year of publication

A key result of the questionnaire analysis is that 69.2% of represented papers correspond to some inaccessible data of the following types: observational data (32.7%), simulation data (38.9%), and enhanced/derived data (26.5%). This prevalence is consistent over time since publication, with 68.1% of papers represented by the questionnaire and published before 2010 (n=47) and 69.5% of papers represented by the questionnaire and published after 2010 (n=164) corresponding to some inaccessible data. In the pre-2010 group, papers correspond to some inaccessible data of the following types: observational data (40.4%), simulation data (31.9%), and enhanced/derived data (14.9%). In the post-2010 group, papers correspond to some inaccessible data of the following types: observational data (30.5%), simulation data (40.9%), and enhanced/derived data (29.9%). A more rigorous analysis of these and the many other variables captured by the questionnaire will be presented in future publications, including exploring features of researchers, papers and research processes that can further point to various types, formats, and locations of inaccessible data.
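As a consistency check on the figures above, the period-level percentages can be converted back to whole-paper counts and pooled. The sketch below assumes each reported percentage is a paper count divided by the group size and rounded to one decimal place. The counts it produces are reconstructed under that assumption and are not reported in the text, and the dictionary keys and function name are illustrative.

```python
# Percentages reported in the text, by publication period.
groups = {
    "pre-2010": {"n": 47, "any": 68.1, "observational": 40.4,
                 "simulation": 31.9, "derived": 14.9},
    "post-2010": {"n": 164, "any": 69.5, "observational": 30.5,
                  "simulation": 40.9, "derived": 29.9},
}
reported_overall = {"any": 69.2, "observational": 32.7,
                    "simulation": 38.9, "derived": 26.5}

def implied_count(percent, n):
    """Nearest whole number of papers consistent with a rounded percentage."""
    return round(percent * n / 100)

total_n = sum(g["n"] for g in groups.values())
pooled = {}
for key in reported_overall:
    papers = sum(implied_count(g[key], g["n"]) for g in groups.values())
    pooled[key] = round(100 * papers / total_n, 1)
    print(f"{key}: {papers} of {total_n} papers = {pooled[key]}% "
          f"(reported {reported_overall[key]}%)")
```

Under this assumption, the pooled values reproduce all four overall percentages: 146, 69, 82, and 56 of 211 papers respectively. The period-level and overall figures are therefore mutually consistent.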

4.3. Triangulation

Taken together, the study components provided further evidence that much astronomy data is inaccessible. Newer papers represented in the study correspond more to inaccessible derived and simulation data, while older papers correspond more to inaccessible observational data. These results may indicate a trend towards derivative analyses with archival data and computational modeling, while observational data collected in the past may be endangered and remain hidden. The study also highlighted how researchers identify data through reading published articles, with possible implications for information system design and user interactions.

5 Discussion and Conclusions

Through a questionnaire and interviews, this study has shed some light on the prevalence of inaccessible data behind papers published over time. Within the context of open science, conversations about making data available for reuse [19], research transparency [20], and linking data to literature [21] are gaining traction. As we have seen in other LISA IX presentations, management of data behind papers has evolved through improved resources, policies, guidelines, and support for publishing data within and alongside the literature [22], [23], [24], [25]. However, despite the availability of curated data archives and open repositories, potentially valuable astronomy data remain at risk of being lost due to format obsolescence, “sunsetting” instruments, and insufficient resources and incentives for data sharing and preservation, particularly for individual researchers and smaller teams and facilities [26]. In a recent Astro2020 white paper, Lattis et al. [27] review the current state of data preservation as astronomical heritage, recommending focused effort and funding over the next decade towards rescuing older analog and digital data and making them discoverable (in other words, preventing “digital ‘landfills’ of very limited use”, p. 2). Through preservation and curation, such legacy data could be linked to associated journal articles and other products, including papers represented in the current study, augmenting the empirical record in alignment with open science.

The dissertation study summarized here stopped short of fully exploring the utility of various data behind papers. While funder and publisher policies increasingly mandate data sharing, Baker & Mayernik [28] point out that data production and knowledge production are separate activity streams, where making data Findable, Accessible, Interoperable and Reusable (FAIR) requires extra work and where scientific knowledge has been constructed successfully apart from laboriously sharing data as independent research products. As the two activity streams increasingly converge through open science practices, further resources and research can lead to enhanced support for authors and deeper understanding of the dynamics highlighted by the study presented here. For example, the study’s finding that newer papers may more often correspond to inaccessible simulation data points to current discussions across disciplines about making computationally intensive models and their output FAIR [29], [30]. Meanwhile, inaccessible observational data may fall into the category of astronomical heritage urgently in need of attention. Forthcoming and future work will build upon these insights to further examine the value of data for society alongside the benefits and costs of preservation.

6 Acknowledgments

The author is tremendously grateful to interviewees and questionnaire respondents for participating in this research. This dissertation was supported by the University of Arizona’s Social & Behavioral Sciences Research Institute (SBSRI) and the National Science Foundation (award #1542446).
