Abstract

The Unified Astronomy Thesaurus (UAT) is a living resource, with regular updates and opportunities for integration that ensure its place as a critical tool that meets the needs of the astronomy community, including researchers, authors, publishers, observatories, data centers, and librarians. The UAT has been integrated into multiple systems and institutions in recent years. We share the latest integrations with LISA participants and engage in a discussion on how we can work together to continue to build and leverage the UAT as a community. Future integrations and areas of exploration, such as ADS integration, applying the UAT as a controlled vocabulary for your institution’s needs, and the UAT’s potential as a testbed for Natural Language Processing, are also discussed.

1 Introduction

At LISA VIII, in 2017, the Unified Astronomy Thesaurus (UAT) Curator provided an update on the UAT. Since then, the UAT has grown considerably. These proceedings review the ways in which the entire astronomy and astrophysics community — researchers, publishers, librarians, and developers — can help ensure a comprehensive, open-source vocabulary remains available to all and continues to grow to meet the evolving needs of the astronomical community.

2 Evolution of the Unified Astronomy Thesaurus

2.1 Summary of the UAT: Purpose and Scope

The Unified Astronomy Thesaurus is an open, interoperable, and community-supported thesaurus that formalizes astronomical concepts and their inter-relationships. The UAT contains over 2,000 unique concepts, organized into 11 categories and arranged in a deep hierarchy. To better reflect astronomical knowledge, the UAT is constructed as a polyhierarchy, meaning that a concept can have more than one parent concept, and there can be multiple paths through the hierarchy to a specific concept [1]. For example, a researcher focused on variable stars might envision novae primarily as part of a cataclysmic variable star system, while an astronomer studying the lifecycle of a star might focus on novae as a potential endpoint of stellar life. Both conceptualizations are valid, and following paths in the UAT along the lines of both subjects will lead a user to the UAT concept of Novae.
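The polyhierarchy described above can be sketched as a small directed graph in which a concept lists more than one parent. This is only an illustration: the labels and links in `BROADER` are assumptions chosen to mirror the novae example and are not an extract of the actual UAT, and `paths_to_root` is a hypothetical helper name.

```python
# Toy polyhierarchy: each concept maps to its list of parent (broader) concepts.
# Labels and links are illustrative assumptions, not taken from a UAT release.
BROADER = {
    "Novae": ["Cataclysmic variable stars", "Stellar evolution endpoints"],
    "Cataclysmic variable stars": ["Variable stars"],
    "Stellar evolution endpoints": ["Stellar evolution"],
    "Variable stars": ["Stellar astronomy"],
    "Stellar evolution": ["Stellar astronomy"],
    "Stellar astronomy": [],
}


def paths_to_root(concept, broader=BROADER):
    """Return every path from a top-level concept down to `concept`."""
    parents = broader.get(concept, [])
    if not parents:
        # A concept with no parent is a top-level entry: the path is itself.
        return [[concept]]
    paths = []
    for parent in parents:
        for path in paths_to_root(parent, broader):
            paths.append(path + [concept])
    return paths


for path in paths_to_root("Novae"):
    print(" > ".join(path))
```

With this toy data, two distinct paths end at Novae, one through variable stars and one through stellar evolution, which is the property that distinguishes a polyhierarchy from a strict tree.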

The 2,122 concepts in the UAT create a standardized and consistent set of keywords that helps to codify the language used to describe astronomy. Drawing descriptive metadata from a controlled vocabulary helps to ensure that researchers are all using the same words and concepts to describe similar astronomical phenomena. Even though full-text searching is becoming more common, if two researchers use vastly different words to describe similar events in their latest papers, even the most sophisticated AI program might not be able to make connections between their two documents.

Updates to the UAT are driven by user suggestions and feedback, which are synthesized into releases towards the end of every calendar year. These updates frequently include adding new concepts, removing deprecated concepts, and adding content such as definitions or expanded connections between concepts. Guidelines for managing feedback and finalizing decisions, developed and documented in the four years since LISA VIII, are available on the UAT website along with information about the Curation Process. The process for submitting suggestions and contributing to the growth of the thesaurus, and the schedule for evaluating suggestions and releasing new versions, are also available there.

2.2 Milestones in UAT Development and Publisher Integration

2.2.1 Latest enhancements and release cycle

Since the previous LISA conference, held in 2017, there have been several updates to the UAT, which is now at version 4.0.1. These updates added over 350 new concepts in areas such as astrostatistics, laboratory astrophysics, astronomy software, and planetary science. The first three of these subjects were added as new branches, while the fourth underwent a much-needed overhaul and expansion.

During the 2020 release, over 850 concepts, representing almost a third of the UAT, received definitions derived from the Etymological Dictionary of Astronomy and Astrophysics. Though we have pulled all the definitions that we can from the dictionary, we plan to source additional definitions for many of the remaining concepts in subsequent years.

Over the last few years there were some major updates to the UAT website as well. In addition to the guidance documentation mentioned earlier, the concept browsing interface has been modernized and updated to be mobile friendly, making it easier to look through the existing UAT concepts. The search feature has also been updated from a simple auto-complete field into a true search for matches and partial matches within the content of the UAT. The search results now clearly highlight where they match the query (Figure 1), making it much more obvious why any particular concept was returned as a potential result. Lastly, the UAT website contains a Select Concept widget. This bit of code was developed by eJournalPress on behalf of the American Astronomical Society in order to assist authors with finding relevant concepts to attach to their manuscripts. It has been integrated into the AAS publication platform, the UAT website, and, since the source code is freely available with an open source MIT License, could be integrated into almost any website.

Figure 1. Example of *Search for UAT Concepts* with instances across terms highlighted.

2.2.2 UAT Adoption in AAS Journals

The American Astronomical Society (AAS) has been the owner and community steward of the open-source UAT since 2012 and established the UAT Steering Committee, which now reports to the AAS Publications Committee. AAS Science Editors contributed to the version of the UAT that was ultimately implemented in June 2019 for all AAS research journals: The Astronomical Journal, The Astrophysical Journal, The Astrophysical Journal Letters, The Astrophysical Journal Supplement Series, The Planetary Science Journal, and Research Notes of the AAS. The Astronomical Society of the Pacific’s Publications of the ASP has also adopted the UAT. AAS authors are now asked to assign UAT concepts to their submitted manuscripts using the concept selector widget noted above. For the AAS, the UAT replaced legacy astronomical subject keywords, which were no longer sufficient to cover modern astronomy and astrophysics, were infrequently updated, followed a loose hierarchy, and had no definitions or relationships.

3 Existing Use Cases and Real-world Applications of the UAT

3.1 ADS Integration

3.1.1 Improving Discovery in ADS

The NASA Astrophysics Data System (ADS) currently indexes over 2.6 million records of scholarly works in Astronomy and Astrophysics. Basic metadata for these records includes title, abstract, the list of authors, and, for a subset of these records, a list of keywords supplied by the publisher. Given the lack of a universal keyword system and the evolution of existing keyword systems over time, the ADS has in the past attempted to normalize keywords to help expose concepts in a uniform way. These efforts have yielded very limited results, with a majority of records in the system still lacking any descriptive keywords.

With the introduction of the UAT, information systems using it have an opportunity to overcome some of the existing shortcomings. Use of the UAT in ADS will alleviate a number of issues, which include: term normalization (allowing for multiple expressions of the same concept to be captured in a single entry); concept disambiguation (allowing for different meanings of the same words to be specified depending on their context); and concept hierarchy (providing a structured way to navigate research topics).
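Term normalization, the first of the issues listed above, can be illustrated with a minimal lookup sketch. The `VARIANTS` table, the `concept:NNNN` identifiers, and the `normalize` function are all hypothetical; they do not represent real UAT identifiers or how ADS implements normalization.

```python
# Hypothetical lookup table: free-text keyword variants -> one concept ID.
# The IDs are placeholders, not real UAT identifiers.
VARIANTS = {
    "fast radio burst": "concept:0001",
    "fast radio bursts": "concept:0001",
    "frb": "concept:0001",
    "frbs": "concept:0001",
    "red supergiant": "concept:0002",
    "red supergiant stars": "concept:0002",
}


def normalize(keywords):
    """Map raw keywords to concept IDs; return (sorted IDs, unmatched keywords)."""
    matched, unmatched = set(), []
    for raw in keywords:
        key = " ".join(raw.lower().split())  # fold case and collapse whitespace
        if key in VARIANTS:
            matched.add(VARIANTS[key])
        else:
            unmatched.append(raw)
    return sorted(matched), unmatched


ids, leftover = normalize(["FRBs", "Fast  radio burst", "Red Supergiant", "dust"])
print(ids)       # two concept IDs, because the first two keywords collapse into one
print(leftover)  # keywords with no known variant are kept for manual review
```

Returning the unmatched keywords separately, rather than dropping them, reflects the fact that a variant table is never complete and that unmatched terms are candidates for curation.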

During the next year, the ADS intends to integrate the UAT into its system by enabling the cross-linking of papers with UAT terms, increasing their exposure to users. To improve the coverage of UAT terms across the ADS corpus, we are developing machine-learning-based techniques to automatically assign UAT concepts to records in ADS that currently lack them. Finally, we will update ADS’s visualizations to use UAT terms rather than keywords to provide a better representation of the research topics associated with a list of records.

3.2 CDS Integrations

The Strasbourg Astronomical Data Center (CDS) uses different taxonomies or standard vocabularies across its different services, and efforts have been made to integrate the Unified Astronomy Thesaurus in CDS services.

3.2.1 SIMBAD otypes

In SIMBAD, the database of astronomical objects, the different astronomical object types (otypes) are represented in a hierarchy, with a name and a 3-character shortcode for each existing object type. A mapping of all SIMBAD otypes to UAT terms has been completed. The UAT is much broader than SIMBAD otypes, but this ensures that all SIMBAD otypes were present and properly described in the UAT.

3.2.2 VizieR and VO Registry

In the VizieR catalogue service, each catalogue is described by a set of keywords which are currently assigned from a small corpus derived from the list of ADC keywords. The catalogue descriptions are published in the Virtual Observatory registry to allow for resource discovery, and these resource descriptions are in turn harvested by the European EUDAT B2FIND research data collections discovery service. In this process, a mapping of VizieR keywords to UAT terms has been made.

3.2.3 Future plans

In the future, a direct use of UAT keywords in the VizieR catalogue descriptions for metadata searching and exportation is planned. This is not a trivial task because it requires training of the documentalists on a much larger vocabulary than the original VizieR catalogue keywords.

3.3 IVOA – VO Registry

VO Registries are distributed searchable collections of metadata for VO-compatible services and datasets. The most recent release of the IVOA VOResource metadata recommendation used for describing resources in VO registries adopted the Unified Astronomy Thesaurus for subject terms (IVOA 2018), which can be used for exploratory searches or for narrowing search results. This is a change from the previously recommended IAU Thesaurus.

Adoption of the UAT presents some challenges for the IVOA, addressed in the most recent Vocabularies in the VO, v.2 recommendation [2]. These challenges include the lack of human-readable URIs and the fact that the UAT is managed outside of IVOA. One proposed solution to this issue is to keep a mirror of the UAT under the IVOA namespace. This is currently being explored as a “Tech Study” and a draft is available on the IVOA Vocabularies website [3].

3.4 Hubble Space Telescope and James Webb Space Telescope Proposals

Space Telescope Science Institute (STScI) is responsible for the scientific operations of the Hubble Space Telescope (HST) and the science and flight operations of the James Webb Space Telescope (JWST). The Mikulski Archive for Space Telescopes (MAST) is located at STScI and holds astronomical data for dozens of missions including the flagship HST and upcoming JWST missions. In 2016, the institute saw an opportunity to align internal proposal vocabularies with the UAT. So far, STScI has completed two of three major integrations.

3.4.1 JWST Integration

STScI adopted UAT keywords for target level keywords for JWST proposals. These keywords are provided by the guest observer when submitting a proposal and describe the targets of the planned observations [4]. Target level keywords get attached to observational metadata and are searchable within the MAST Data Discovery Portal.

3.4.2 HST Integration

The UAT was formally adopted and aligned with JWST target keywords before it was later implemented for the shared HST and JWST proposal keywords [5]. Proposal keywords are applied to broadly categorize the type of research that will be conducted and are used for a variety of purposes including grouping proposals for expert panel review by the wider community. These panels ultimately decide which proposals will be granted time or orbits using HST and JWST, so matching the science to the individuals most capable of reviewing the proposed research is a critical part of the HST and JWST science operations process.

Aligning proposal terms with the UAT was challenging. Although the HST observatory has been in operation for three decades, it was necessary to consider modern vocabulary, new fields of astronomy, and a future in which infrared data is captured by JWST while aligning and updating the shared proposal keyword set. The Science Policy Group (SPG) and STScI Library worked together to align all existing proposal keywords, and add additional terms where needed to describe JWST’s unique capabilities. Nearly 80% of existing terms allowed for direct, synonymous mapping. Approximately 20% resulted in contributions and proposed new terms back to the UAT, and approximately 1.5% could not be mapped, and were reassessed or removed.

3.4.3 Remaining Phases of HST & JWST Integration

There are still a number of steps remaining before STScI achieves cohesive keyword mapping in its systems, but the institute is well on the way to offering a UAT aligned experience to proposers.

To satisfy software requirements, the JWST target keyword vocabulary is currently a subset of the UAT, represented using the preferred terms and not concept identifiers. This subset has been curated by subject matter experts from MAST, but there is not yet a process in place to update the vocabulary to keep up with new releases of the UAT. Direct integration with UAT URIs, as opposed to human-readable terms, and direct mapping of those URIs using the available UAT API will minimize the need for annual review and manual verification of terms as the UAT evolves. As with the Virtual Observatory (VO), the decision to link to a living vocabulary maintained by the broader community is anticipated to have a lasting, positive impact, both in keeping internal vocabularies current and in expanding and contributing back to the UAT.
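The difference between storing preferred terms and storing concept identifiers can be sketched with two invented releases. Everything in this example is an assumption made for illustration: the `concept:NNNN` identifiers, the rename between releases, and both helper functions. It does not use or describe the UAT API.

```python
# Two hypothetical releases: concept ID -> preferred label.
# The IDs and the label rename are invented for illustration.
RELEASE_OLD = {"concept:0101": "Extrasolar planets", "concept:0102": "Zodiacal cloud"}
RELEASE_NEW = {"concept:0101": "Exoplanets", "concept:0102": "Zodiacal cloud"}


def resolve_by_id(stored_ids, release):
    """Look up the current preferred label for each stored concept ID."""
    return [release[cid] for cid in stored_ids if cid in release]


def stale_labels(stored_labels, release):
    """Return stored labels that no longer match any preferred label."""
    current = set(release.values())
    return [label for label in stored_labels if label not in current]


stored_ids = ["concept:0101", "concept:0102"]
stored_labels = [RELEASE_OLD[cid] for cid in stored_ids]

# Identifiers still resolve after the rename; stored labels need manual review.
print(resolve_by_id(stored_ids, RELEASE_NEW))
print(stale_labels(stored_labels, RELEASE_NEW))
```

In this toy case the identifier-based tags pick up the new label automatically, while the label-based tags leave one stale entry to be checked by hand, which is the kind of annual verification that identifier-based integration is meant to reduce.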

3.5 Organic Adoption of the UAT

3.5.1 Wikidata & Similar Resources

There have been instances of organic adoption of the UAT, that is to say, instances in which the UAT is discovered and integrated into a resource, a project, or a workflow without direct communication or promotion by the UAT Steering Committee members. We welcome this type of adoption as it demonstrates the importance and value of the UAT as a resource and tool. More specifically, the Steering Committee was excited to learn that Wikidata, the Semantic Web knowledge base project affiliated with Wikipedia, has crosswalked nearly 1,000 records with UAT concept identifiers. An example is fast radio burst.

Additionally, there are other projects that build on Wikidata’s knowledge graph and leverage UAT taxonomic information. An example is Knowledia, which uses the UAT in its topic categorizations, such as red supergiant.

4 Future UAT Integrations

In this section, we explore future integrations with the UAT under investigation or in the early phases of adoption. We encourage you to consider how the UAT might be integrated in your local systems or other open-source tools to create linked vocabularies to describe research activities, data, and resources.

4.1 Tagging ORCID and Author Profiles

A simple way to introduce your users and researchers to the UAT is to encourage them to browse or search the UAT itself for the purpose of selecting terms that best describe their research interests. The user may wish to use just the human-readable form (Figure 2), or may find it useful to link to the UAT URIs. For societies and publishers, consider asking your members or authors to link to their tagged ORCID profile, or encourage them to tag their local society profile with UAT terms.

Figure 2. Example of ORCID Profile with UAT terms selected

4.2 Research with NLP and the UAT

As a curated and openly available controlled vocabulary, the UAT is also a promising research tool for learning about the evolution and community dynamics of astronomy as an intellectual domain, as well as a platform for innovation in machine learning and natural language processing applications.

Broadly speaking, controlled vocabularies offer robust capabilities for subject classification and efficient database searching [6]. Classification schemes like the UAT also inevitably reflect epistemic social structures that can be analyzed by social and information scientists as cultural information systems [7] [8]. Beyond subject classification and societal impact, controlled vocabularies have been used to measure field development in other domains, such as medicine [9] and physics [10]. Furthermore, in the areas of natural language processing (NLP) and machine learning, ongoing innovations in supervised and unsupervised machine learning, text analysis and computational linguistics techniques have opened a new research frontier in which text data can be used on a large scale to address pertinent research questions.

As discussed in Section 3.1, UAT terms will soon be automatically assigned to previously published documents, utilizing modern methodologies through which it should be possible to accurately predict terms for untagged documents [11], and at the same time methodologically informing other disciplines attempting a similar task. Finally, the UAT can also be used to observe how the use of concepts changes over time through what can be referred to as “semantic drift” or “concept drift” [12]. Overall, these and other research efforts depend on widespread use of the UAT over time, ensuring that a sufficient number of documents are tagged with UAT terms, in order to fully exploit the UAT as a research tool for social science, library and information science, and astronomy.

5 Benefits of a Controlled Vocabulary

5.1 Benefits to Science Policy and Proposal Groups

A community shared and controlled vocabulary benefits science policy groups and those responsible for managing the proposal process at their institution in a number of ways.

By aligning your internal terms with a community system, you naturally adopt a systematic process to review, update, and continually automate the maintenance of terms. This helps avoid a stagnation in internal vocabularies. By having the entire community’s input and vocabulary terms at your fingertips in a hierarchical structure, it is also easier to be more or less expansive, depending on your needs. As an example, the Science Policy Group at STScI did not need to think of each core concept in infrared astronomy on their own; they adopted proposal keywords such as Zodiacal cloud and Luminous infrared galaxies from the existing UAT.

With a more precise and hierarchical vocabulary, it is easier to match proposers to those most capable of reviewing the content from a scientific perspective. This will improve even more once the UAT is fully integrated in ADS, and reviewers' literature is pre-mapped to UAT terms.

For reporting, the hierarchical nature of the UAT has simplified the process. When asked to report on how many proposals support solar system research or galaxy formation, it is now possible to group terms from within related UAT branches, or group parent and child terms like galaxy quenching within the parent group of galaxy evolution.
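The grouping of parent and child terms for reporting can be sketched as a roll-up over a narrower-term map. The `NARROWER` structure, the proposal identifiers, and the helper names below are assumptions for illustration only; apart from galaxy quenching sitting under galaxy evolution, the hierarchy shown is not taken from the UAT or from STScI systems.

```python
from collections import Counter

# Toy narrower-term map (parent -> children); illustrative, not the real UAT.
NARROWER = {
    "Galaxy evolution": ["Galaxy quenching", "Galaxy mergers"],
    "Galaxy quenching": [],
    "Galaxy mergers": [],
    "Solar system": ["Zodiacal cloud"],
    "Zodiacal cloud": [],
}


def branch(concept, narrower=NARROWER):
    """Return the concept together with all of its descendants."""
    seen, stack = set(), [concept]
    while stack:
        node = stack.pop()
        if node not in seen:  # guard against revisiting in a polyhierarchy
            seen.add(node)
            stack.extend(narrower.get(node, []))
    return seen


def count_by_branch(proposals, parents):
    """Count proposals carrying at least one keyword inside each parent branch."""
    counts = Counter()
    for parent in parents:
        members = branch(parent)
        counts[parent] = sum(1 for kws in proposals.values() if members & set(kws))
    return counts


# Hypothetical proposal IDs and their assigned keywords.
PROPOSALS = {
    "P1": ["Galaxy quenching"],
    "P2": ["Galaxy mergers", "Zodiacal cloud"],
    "P3": ["Galaxy evolution"],
    "P4": ["Zodiacal cloud"],
}

report = count_by_branch(PROPOSALS, ["Galaxy evolution", "Solar system"])
print(dict(report))
```

Each proposal is counted once per branch even if it carries several keywords from that branch, and a proposal tagged in two branches contributes to both totals, so branch counts need not sum to the number of proposals.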

5.2 Benefits to Data Archives

Being able to classify observations according to target allows archive users to search by science area or target class. Furthermore, enabling search term expansion through the hierarchical structure of a true thesaurus allows for the discovery of related data across heterogeneous data sets. Keyword-enriched datasets can be linked to the literature, enabling further discovery outside the archive.

Astronomers often search archives based on the type of science that they would like to perform with the data they intend to collect through new observations, but most archives only tag the source/target of the observation or the original proposal terms. The scientific significance of archival observations often evolves over the life of the data archive, and often extends well beyond that anticipated by the proposers of the observations.

Keeping track of this evolution by tagging observations with terms reflecting their later use in the linked literature benefits both the archives and future users of the archival data. It is then possible to track the evolving significance and scientific legacy of data holdings and find relevant observations based on common vocabularies. UAT controlled vocabularies enable mission- and field-specific tagging of archival observations, and will allow for more comprehensive queries, based on the continual scientific relevance of the observations and the hierarchical structure of the thesaurus.

5.3 Benefits to Bibliographers

Curated mission bibliographies are powerful tools to assess the scientific impact of a mission and allow astronomers to back-search for observations that have contributed to one or more publications. The granular linking of observations and publications that lies at the core of the bibliographic effort is asymmetric though: the wealth of metadata and properties associated with each observation is not matched by a similarly comprehensive characterization of the scientific concepts discussed in the linked publications. In other words, it is relatively easy to retrieve publications linked to a selected set of observations, but the opposite is not true, leaving only connections to strictly bibliographic information (bibcode, journal, authors info, authors’ institutions, year of publication, etc). The UAT provides a great opportunity to tag publications associated with a set of observations with astrophysical concepts, sources, and phenomena during the data linking procedure, either by ingesting the UAT keywords applied to the manuscript during publication, or by manually classifying the research paper if no UAT keywords have already been applied.

5.4 Benefits to Publishers

It is a challenge for publishers to ensure that their vocabularies grow to reflect new research areas. The community ownership of the UAT helps in many ways:

Community ownership ensures that the coverage of the UAT meets the community’s needs, providing publishers with an accurate and up-to-date vocabulary that reflects the view of the subject matter experts. Also of value is the process by which the community agrees to the correct labels to describe terms. There has been some debate on this already in several areas of the UAT, in particular planetary and exoplanetary terms. Publishers would find it slow and costly to get this level of subject expertise for proprietary vocabularies.
With wide adoption the UAT offers a single vocabulary to provide a more cohesive user experience. This should extend well beyond publishers, to grant funding, data gathering, data publication, preprints, search and alerting. Publishers can be confident that their metadata aligns with community applications, and that users are familiar with the terms through other systems.
A strong motivation for publishers to adopt the UAT is to ensure that the keywords and terms that describe publications improve the discovery of the content. High-quality metadata helps with indexing for search and the growing number of automated discovery services that are based on AI. The UAT makes a high-quality, detailed vocabulary accessible to all discovery services due to its interoperability and open source code.

There are also some challenges in working with a community-owned vocabulary:

There are new boundaries between the different sources of vocabularies, and very likely overlap of terms that need deduplication.
The granularity of concepts in the vocabulary may not match with the application: for example, a specialist publication may need many low-level terms to describe the content, e.g., lipid biomarkers, compared with a general publication where high-level terms may be sufficient, e.g., biosignatures. The potential to accommodate these situations through hierarchical vocabulary is a strength, but work is also required to ensure that the vocabulary is suitable for most applications and does not grow beyond its intended scope or create unnecessary duplication with other community thesauri.

5.5 Benefits to Researchers

As more publishers adopt the UAT and the UAT is integrated into ADS, the Steering Committee envisions a future where a researcher can click on a hierarchical concept and then expand or drill down to additional concepts to aid in discovery and research using the literature, a data archive, or proposal software system.

Having an expansive, polyhierarchical vocabulary at one’s fingertips enables researchers to better describe their own research findings; they are not tied to a finite, antiquated, or imprecise list of keywords which may or may not adequately capture new phenomena or a deeper understanding of existing concepts. As noted in Section 2.1, drawing descriptive metadata from a controlled vocabulary helps to ensure that researchers are all using the same words and concepts to describe similar astronomical phenomena, thereby aiding each other’s mutual discovery of new research.

6 Community Resources

There are a number of official crosswalks between the UAT and former vocabularies, such as the former IVOA Thesaurus, the former IAU Thesaurus, and the legacy Astronomical Subject Keywords (ASK), which are still in use by many publishers considering a transition to the UAT. The official UAT GitHub repository includes existing crosswalks, with a new crosswalk added for Icarus journal keywords since the LISA IX conference.

Additional information about the UAT can be found on the UAT website.

7 Conclusion: Future of the UAT

The UAT will continue to influence the field of astronomy and astrophysics and in turn be augmented and updated by new discoveries in the field. As the UAT Steering Committee looks to the future and the continued development of the UAT, we welcome thoughts, input and feedback from the community. More specifically, we are eager to learn of UAT integrations at your institutions, and welcome feedback on the curation of existing or new concepts, their interrelationships, and the need for new branches or expansion of concept families. We look forward to ongoing conversations and the further success of the UAT.

Acknowledgments

The current UAT Steering Committee wishes to thank:

The LOC and SOC of the LISA IX committee, who worked to plan the international conference not once, but twice, due to the worldwide COVID-19 pandemic. We also thank: Dr. Ethan Vishniac, Editor-in-Chief for AAS Journals, for his support of the UAT from the start and his service as a former Steering Committee (SC) member; Jill Lagerstrom, for her efforts as an SC member in early stages of the UAT; the Mikulski Archive for Space Telescopes and the Space Telescope Science Policy Group for boldly exploring integration in the early days of the UAT; Dr. Markus Demleitner for championing the UAT in IVOA applications; and Dr. Michael Lesk of Rutgers University for pioneering possible NLP applications using the UAT.

Building the UAT as a Community