NASA Frontier Development Lab (FDL) is a research accelerator that brings together data scientists and space scientists to solve some of the most difficult space and planetary problems using AI. This project is a spin-off of one of the main FDL challenges, which identified star spots in Kepler data. In this spin-off, we look at applications of a specific AI technique, natural language processing (NLP), to time series (light curves) in order to identify both unique features and patterns in time series in general, and in light curves in particular. We construct and derive informational building blocks characteristic of the light curves of the stars in a subset of Kepler data, and we compare these methods to more traditional machine learning approaches (clustering). We show how this new methodology, rooted in NLP, can be a good alternative for the analysis of light curves and potentially for identifying exoplanetary transits as unique “linguistic” features.
The idea for this project came from asking two questions, one pertaining to advancing a potentially new methodology in machine learning, the other to astrophysics:
1. Can we use NLP to discover features in time series? If so, how does it compare to other methods, such as clustering?
2. Can we create a “dictionary” of star features that we can use as a genetic code to catalogue and identify any star, and that we can also use to simulate stars that we have not yet observed?
Starting with these questions, we embarked on exploratory research to understand whether a combination of ML methods applied to stellar light curves can help us discover features and patterns within time series in general, and within light curves in particular. The rationale, the big WHY, of such a methodological and science-specific exploration stems from a few facts that we tried to connect coherently: NLP is good at discovering patterns in messy, noisy, unstructured data (such as languages); NLP is great for creating vocabularies, dictionaries, and taxonomies; and NLP is also good at generating new, large texts (data) from small dictionaries and vocabularies. Based on these assumptions, our first methodological challenge was finding the best method or algorithm to create textual data (for our NLP goals) from numeric data (our given time series). In other words, the first step was to create the “words”, “letters”, or “n-grams” from light curve data. For this proof of concept, we used 632 original Kepler light curves, with the idea of scaling up to analyze and parse more than 110K light curves, data available during the FDL program (summer 2020); if this proves successful, we aim to add TESS light curve data afterwards as well. The light curve data we used therefore consists of 632 time series, collected over a period of about 4 years at a cadence of one observation every 20 minutes.
We used 6 different methodologies to create 6 different corpora from the entire dataset — each corpus is a collection of 632 individual “books”, where each book/light curve is a sequence of n-grams that we created based on these methods:
1.1. Bin-based (large) — we binned the flux values into 10 bins (1 order of magnitude) and assigned each bin a “binXX” n-gram;
1.2. Bin-based (small) — we binned the flux values into 100 bins (2 orders of magnitude) and assigned each bin a “binXXX” n-gram;
1.3. Peaks and troughs — for each run of consecutive peaks or troughs in the time series, we assigned a “posXX” or “negXX” n-gram, where “pos” marks a peak, “neg” marks a trough, and XX is the number of consecutive peaks or troughs observed in the data;
1.4. PD clustering-based — this method is based on measurements of entropy and complexity in the time series;
1.5. Zipf distribution-based — in this method, we fitted a Zipf distribution to each star light curve and created the n-grams based on the rank of the frequency of the data given by the distribution. The Zipf Law is one of the most important laws observed in human languages, but also in physical phenomena such as earthquakes, and is scale invariant, a very important property for pattern detection in a wide range of scales;
1.6. 3-movement-based — in this method, we classified each window of 3 consecutive data points in the light curves into one of 6 movement types.
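Two of the simpler transformations above (the bin-based and peaks-and-troughs methods) can be sketched in Python. The function names, bin count, and toy flux values below are illustrative assumptions for this sketch, not the project's actual code:

```python
import numpy as np

def bin_ngrams(flux, n_bins=10):
    """Method 1.1 sketch: map each flux value to a 'binXX' token (10 bins)."""
    edges = np.linspace(flux.min(), flux.max(), n_bins + 1)
    idx = np.clip(np.digitize(flux, edges) - 1, 0, n_bins - 1)
    return [f"bin{i:02d}" for i in idx]

def run_ngrams(flux):
    """Method 1.3 sketch: collapse runs of rises/falls into 'posXX'/'negXX'."""
    signs = np.sign(np.diff(flux))
    tokens, run, prev = [], 0, 0
    for s in signs:
        if s == prev and s != 0:
            run += 1
        else:
            if prev > 0:
                tokens.append(f"pos{run:02d}")
            elif prev < 0:
                tokens.append(f"neg{run:02d}")
            run, prev = 1, s
    if prev > 0:
        tokens.append(f"pos{run:02d}")
    elif prev < 0:
        tokens.append(f"neg{run:02d}")
    return tokens

flux = np.array([1.0, 1.2, 1.5, 1.1, 0.9, 1.3])
print(bin_ngrams(flux))
print(run_ngrams(flux))  # ['pos02', 'neg02', 'pos01']
```

Each light curve processed this way becomes one "book" of tokens, and the 632 books together form one corpus per method.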
Entropy measurements. A first observation from our analyses is that for methods 1.1 and 1.2 the Shannon entropy of the n-grams is closest to the Shannon entropy of the original light curves; these can be interpreted as the methods that best preserve the information in the light curve through the text transformation. Shannon entropy is one of the most important measures of information in natural language processing.
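The entropy comparison above amounts to computing the Shannon entropy of the symbol distribution in each n-gram "book" and in the discretized light curve. A minimal sketch (toy token sequence, not project data):

```python
import numpy as np
from collections import Counter

def shannon_entropy(symbols):
    """Shannon entropy in bits of a sequence of discrete symbols:
    H = -sum(p_i * log2(p_i)) over the empirical symbol frequencies."""
    counts = np.array(list(Counter(symbols).values()), dtype=float)
    p = counts / counts.sum()
    return float(-(p * np.log2(p)).sum())

# Entropy of a toy n-gram 'book'; comparing this to the entropy of the
# discretized flux indicates how much information the transformation kept.
tokens = ["bin01", "bin02", "bin02", "bin03", "bin01"]
print(shannon_entropy(tokens))
```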
PRELIMINARY RESULTS. Clustering. We tried many clustering methods on the raw data in order to extract features that we would use a posteriori for n-gram creation (e.g., unsupervised k-means, k-NN, hierarchical clustering, etc.). Of all the methods we tried, PD clustering, the entropy-based method behind our n-gram method 1.4, shows the most promising results in isolating specific features within the light curves. We also clustered the difference time series, which isolates specific features in the light curves even better.
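PD clustering is not part of standard ML libraries, but the differencing step can be illustrated with ordinary k-means on simple features of the difference series. The toy "curves" and the two features chosen here are assumptions for the sketch, not the project's pipeline:

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
# Toy stand-ins for light curves: flat/noisy vs. periodic (not Kepler data).
flat = [rng.normal(0.0, 0.01, 200) for _ in range(5)]
periodic = [np.sin(np.linspace(0, 20, 200)) + rng.normal(0.0, 0.01, 200)
            for _ in range(5)]
curves = flat + periodic

# Cluster on features of the *difference* time series, as in the text:
# the step-to-step variability separates variable from quiet curves.
diffs = [np.diff(c) for c in curves]
X = np.array([[d.std(), np.abs(d).mean()] for d in diffs])

labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
print(labels)
```

Differencing removes the slowly varying baseline, so features of the differenced series emphasize short-timescale structure, which is why it isolates features better here.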
Topic Modeling. After creating the n-grams, we performed topic modeling (TM), an NLP-specific method that groups the n-grams within a corpus based on their probability of occurrence within a star. The TM method showed us which star features are most likely to occur next to each other across all 632 light curves.