We review how the data stewardship work of the CDS documentalists has changed over the 50-year history of the CDS. Science-driven specialisations and adaptations to new technologies are highlighted in the context of providing high-quality services to the astronomy community.
The CDS is a science-driven data centre for reference astronomy data. Its objectives have been defined since its creation in 1972: collecting, improving and distributing data for the international astronomical community. However, science and techniques have evolved significantly over the past 50 years, resulting in a deluge of data in terms of both quantity and complexity. The CDS has constantly adapted to these changes to offer the community appropriate services and quality data with high added value. In this presentation we focus mainly on the impact of these changes on the skills of the CDS data stewards, the so-called documentalists, in terms of specialisation, adaptation to technologies and understanding of data, taking into account the evolution of astronomy, all while ensuring a high level of quality and quick availability of the data.
The CDS has evolved considerably since its creation in 1972 (50 years ago), following developments in both astronomy and technology. From the beginning the CDS goals have remained the same: collect useful data on objects in electronic form; improve them by critical evaluation and combination; distribute the results to the international community; and conduct research using these data.
One of the most important aims is to give astronomers the resources they need to conduct their research. That means following the evolution of research and making the database content constantly adapt and evolve. It also means following technological developments and making the database systems and user interfaces evolve. We also need to cope with the continuously increasing volume of data by adapting tools and work practices. The continuous increase in data complexity also requires greater knowledge and skills.
The CDS has been able to develop more and more services over the years, thanks to the evolution of techniques and science, and the availability of data: different versions of SIMBAD, the database dedicated to the identification, data and bibliography associated with the objects of interest; VizieR, the database dedicated to tables and other data attached to publications and catalogues; the Dictionary of Nomenclature of Celestial Objects, defining the origin, formats and definition of the acronyms, maintained in liaison with the IAU; Aladin, the atlas of the sky and image database; the CDS Portal, giving access to all the CDS services through a single page; and the X-Match Service, allowing one to make fast cross-matches of very large tables and catalogues.
A good illustration of how science and technology have evolved over the past 50 years at CDS is the BD (Bonner Durchmusterung) catalogue, first published in 1852. At the CDS, as illustrated in Figure 1, we first had the paper version of this catalogue, stored in books. The catalogue was reproduced on microfiches in the 1970s and saved on magnetic tape in the 1980s. To consult this catalogue, magnetic tapes had to be requested and sent through the postal service. Several weeks could pass between the request and the consultation of the data. Computers also evolved a lot between the 1980s and 1990s and today, from mainframe computers to workstations. Data storage has grown significantly, as have sharing capacities, with the advent of the internet. The first use of the internet in France (at that time its precursor, ARPANET) was a demo of SIMBAD made by the NASA partners at the LISA I conference in Washington (also an IAU meeting). The BD catalogue was entered as an electronic table in VizieR at the end of the 1990s. It is now available from a laptop, almost instantaneously from anywhere, and can be found and queried through the Virtual Observatory. As one can imagine, typing microfiches and creating online tables with interactive data are not the same, and the work practices of data stewards have evolved a lot during these decades!
As for astronomy and scientific development, a good illustration of the huge advances in the field is the emergence and development of ground and space observatories. These brought new wavelengths into SIMBAD, through observations made in the UV, X-ray or gamma-ray bands. IRAS also brought new data in the infrared, and Gaia added the G magnitude in the optical band. These surveys brought a large number of objects and new data into SIMBAD, like the redshifts added with IRAS, and the proper motions and parallaxes added with USNO. The level of astrometric precision grew spectacularly in 30 years, between the Hipparcos and the Gaia missions, from milliarcseconds to tens of microarcseconds! These missions also brought more accuracy in proper motions, parallaxes, etc.
The primary requirement of the SIMBAD database was to give the cross-identifications for objects and keep the link with the bibliography. After 1972, there was the CSI (Catalogue of Stellar Identifications) and the BSI (Bibliographical Star Index). SIMBAD was born from the fusion of the two. We had only stars in the database, and these were from just a few catalogues. The data were restricted to: object types, object names, coordinates, magnitudes in the B and V bands, spectral types, proper motions and some other measurements (as illustrated in Figure 2). Few publications were taken into account (dating back to 1950, and only those with stars). There were few people at CDS, but many collaborators from other institutes helped to build SIMBAD.
Today in SIMBAD we find much more data: object types, coordinates, magnitudes, radial velocities and/or redshifts, parallaxes, proper motions, spectral and morphological types, hierarchies between objects, occurrences and positions of the objects in articles. There are more fields, and better quality and precision (see Figure 3). All the CDS services (SIMBAD, VizieR and Aladin) can interact with each other and with the Virtual Observatory tools.
A good example of how documentalists have to adapt is the evolution of object types in SIMBAD. At the beginning there were only stars in the CSI. Galaxies were added in 1983. At that point there were only 4 object types: stars, galaxies, stars in galaxies and unknown. In the 1990s there were already 100 object types, and today we have more than 200 object types in SIMBAD. The new object type classification provides a hierarchy between the different types and a complex tree structure (Figure 4). Dealing with all the object types in SIMBAD today requires a rather good knowledge of astronomy!
As for VizieR, the documentalists need to deal with a wide variety of data. Catalogues are getting larger and larger, which means dealing with more and more rows and columns in tables (today we can have as many as 2 billion rows in one table). There are more data, and they are more and more precise. The documentalists have to describe each column and each datum, specify which format and units are used, standardise them, and add notes, ReadMe files, Unified Content Descriptors, links… They have to deal with various kinds of content (time series, surveys) and different formats, and homogenise the data to create standardised tables and other output formats. We also add data associated with publications to VizieR, like spectra, light curves, images… which have to be curated through downloads, plots and so on.
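As a rough illustration of what describing and homogenising a column can involve, the following minimal Python sketch attaches a unit and an explanation to a column and converts its values to a common unit. It is not the CDS tooling: the `ColumnDescription` class, the `homogenise` function and the small conversion table are hypothetical and only meant to show the idea.

```python
from dataclasses import dataclass

# Multiplicative factors from a source unit to a target unit.
# Illustrative subset only; a real pipeline would rely on a full unit library.
TO_STANDARD = {
    ("arcsec", "mas"): 1000.0,
    ("mas", "mas"): 1.0,
    ("arcmin", "deg"): 1.0 / 60.0,
    ("deg", "deg"): 1.0,
}


@dataclass
class ColumnDescription:
    """Hypothetical minimal description of one table column."""
    name: str
    unit: str
    explanation: str


def homogenise(values, column, target_unit):
    """Convert values to target_unit and return them with an updated description."""
    factor = TO_STANDARD[(column.unit, target_unit)]
    new_column = ColumnDescription(column.name, target_unit, column.explanation)
    return [v * factor for v in values], new_column


# Example: a parallax column published in arcseconds, stored in milliarcseconds.
plx = ColumnDescription("plx", "arcsec", "Trigonometric parallax")
values_mas, plx_mas = homogenise([0.5, 0.25], plx, "mas")
```

Keeping the description and the values together means that a unit change is always recorded in the metadata, which is the point of the column-by-column description work.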
A good example of how working methods adapt is the UCDs (Unified Content Descriptors), used for VizieR tables since 1998. UCDs give a semantic description of column contents. They form a list of standard words, and combinations of these words, that express semantic meaning. This allows one to find, select and compare data across different tables. UCDs are also a tool for managing data: they are used to validate catalogue descriptions, and since 2004 they have been a standard for VOTables in the Virtual Observatory.
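To make the idea of finding comparable columns across tables concrete, here is a minimal Python sketch. The table and column names are invented for illustration, and the `find_columns` helper is hypothetical; only the UCD words themselves (such as `phot.mag` or `pos.eq.ra`) follow the UCD1+ vocabulary.

```python
# Hypothetical column descriptions: table and column names are invented.
TABLES = {
    "catalogue_a": {
        "RAJ2000": "pos.eq.ra;meta.main",
        "DEJ2000": "pos.eq.dec;meta.main",
        "Vmag": "phot.mag;em.opt.V",
    },
    "catalogue_b": {
        "ra": "pos.eq.ra;meta.main",
        "dec": "pos.eq.dec;meta.main",
        "V": "phot.mag;em.opt.V",
        "plx": "pos.parallax",
    },
}


def ucd_words(ucd):
    """Split a UCD string into its individual words, case-insensitively."""
    return [word.strip().lower() for word in ucd.split(";")]


def find_columns(tables, required_words):
    """Return (table, column) pairs whose UCD contains all the required words."""
    required = {word.lower() for word in required_words}
    matches = []
    for table, columns in tables.items():
        for column, ucd in columns.items():
            if required <= set(ucd_words(ucd)):
                matches.append((table, column))
    return matches


# Columns holding a V-band magnitude, whatever they are called in each table.
v_mag_columns = find_columns(TABLES, ["phot.mag", "em.opt.V"])
```

The two catalogues name their V magnitude column differently, but the shared UCD lets a query select both without knowing the local column names.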
Standards have been used for many years in astronomy to describe bibliography and data. A few examples are Bibcodes, UCDs, the ReadMe files in VizieR, and the FITS format. These standards existed long before the Virtual Observatory era. Initially they were created as tools to manage or exchange data, and they were widely used internationally. As interoperability requires standards, some of them became Virtual Observatory standards, and thus contribute to the FAIRisation of astronomical data. Standards are a good example of the impact that data stewards’ practices may have on science.
Indeed, if science has an impact on documentalists’ activities, the work done by documentalists also has a definite impact on science. Since the beginning of the CDS, the regular work of the documentalists has made it possible to check the data, homogenise them, compare them and detect errors in the publications. These good practices ensure the accuracy and quality of the data, and also provide a high level of added value. They also enable data to be Findable, Accessible, Interoperable, and Reusable.
Today, the process from data reception to completion of data curation at the CDS is quite complicated and involves several services, which means several kinds of documentalists with different specialisations. The CDS is a cooperative team of around 38 people, equally divided into astronomers, documentalists and computer scientists. Teamwork has developed over the years to adapt to the evolution of both the quality and quantity of data. During the process there are ongoing interactions with the IT engineers to improve and/or develop specific tools, and ongoing interactions with the astronomers to ensure data and metadata quality and select the most relevant data. Documentalists share their expertise in-house, through common documentation, the sharing of good and new practices, regular meetings, training and seminars. They constantly adapt their practices to maintain the workflow and quality of the services.
To summarise the impact of the evolution of astronomy over the past decades on the documentalists’ activities at CDS, one first needs to mention the significant increase in the number of published articles and data (Figure 5). Along with this constraint, data stewards need a deep understanding of the different aspects of their work, namely: understanding, identification, selection and verification of data, and the homogenisation, description and correction of data. Their expertise has to evolve constantly through knowledge of new data and new science topics, and adaptation to new tools, new formats, new ways of working and new team organisation. This means permanent learning, adaptation and knowledge development. Today we have more tools to deal with data, but also more data to curate. Despite the extraordinary developments of technology and tools, the data stewards’ expertise remains essential, in particular for the parts of the work that cannot be automated and that require more knowledge and analytical skills.
Science and technology have evolved together during the past decades, producing data and making it possible to store, curate and share them. At the core of this process, the work of data stewards is essential: it involves curating data, producing added value, using and spreading standards, as well as producing FAIR data (see Figure 6). This enables Open Science, and the availability of data has been revolutionising scientific working methods. The data curated in this way are useful for science and have a high impact on its development. Thus the daily work of documentalists evolves with science, but is also fully part of its evolution.
We thank several retired members of the CDS who contributed to this paper by giving us information about the beginnings and history of the CDS: François Ochsenbein, Marc Wenger and Pascal Dubois. We would also like to recall that the CDS as we know it today is the result of 50 years of continuous teamwork and collaboration, so our thoughts go to all our collaborators around the world and through the years.