An exoplanet detection system based on stellar brightness time series and stellar parameters using convolutional neural networks.
This project is still under active development.
GAIA allows the downloading, processing and visualisation of stellar brightness time series and stellar parameters from missions such as Kepler/K2 or TESS. A separate module allows the training of deep learning models and the detection of exoplanet candidates. The whole system can be run locally or partially on the Google Cloud Platform.
Detection of exoplanets is possible using light curves (time series of stellar brightness), in which so-called TCEs (Threshold Crossing Events) are identified: repeated signals characterised by a period (the time between successive signals), a duration (the time from the beginning to the end of the signal) and an epoch (the time of the first observation). TCEs can represent transits of exoplanets, but can also result from other events such as binary systems or measurement errors. Such a transit is shown above. In addition to the primary transit (when the object obscures the star), a secondary transit (when the star obscures the object) is also distinguished. The purpose of training the models is to teach them to recognise transits of exoplanets and to distinguish them from other phenomena.
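These three parameters map naturally onto a small record; a minimal sketch of how a TCE could be represented (the field names are illustrative, not the project's actual schema):

from dataclasses import dataclass

@dataclass
class TCE:
    # Illustrative container for a Threshold Crossing Event.
    period: float    # time between successive signals, in days
    duration: float  # time from the beginning to the end of the signal, in days
    epoch: float     # time of the first observed signal, in days
    label: str       # e.g. planet candidate vs. false positive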
Data processing mainly involves transforming the light curves and creating local and global views from them. The local view contains the averaged brightness of the star during the transit and is created separately for each TCE. The global view, also created for each TCE, covers the entire TCE period.
Views are created for:
- stellar brightness,
- even and odd transits separately,
- light centroids (centres of the light source),
- secondary transits.
Other data processing steps include the standardisation of the time series and of the stellar parameters.
Raw Kepler time series downloaded from the MAST archive contain observations divided into several periods (quarters).
The plot above clearly shows the transits of two planets: Kepler-90 h - the largest planet in the system with the greatest transit depth, and Kepler-90 g - the second largest planet in size and transit depth.
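A minimal sketch of reading and stitching the quarterly files, assuming the standard Kepler light-curve FITS layout (a LIGHTCURVE extension with TIME and PDCSAP_FLUX columns); the local directory layout is hypothetical:

from glob import glob

import numpy as np
from astropy.io import fits

# Kepler-90 is KIC 11442793; each quarter is stored in a separate *_llc.fits file.
time, flux = [], []
for path in sorted(glob("data/raw/kepler/011442793/*_llc.fits")):
    with fits.open(path) as hdul:
        data = hdul["LIGHTCURVE"].data
        time.append(data["TIME"])
        flux.append(data["PDCSAP_FLUX"])

time = np.concatenate(time)
flux = np.concatenate(flux)
valid = ~np.isnan(time) & ~np.isnan(flux)  # gaps are encoded as NaN
time, flux = time[valid], flux[valid]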
Note that the brightness of the star is not constant even far away from transits; these fluctuations result from the natural variability of the observed star. This low-amplitude variability significantly complicates exoplanet detection and must therefore be removed. To remove the noise, a normalization curve is fitted to each series with the transits interpolated linearly. This interpolation allows the curve to be fitted only to the noise, without removing any changes caused by planets or other objects.
Then the original series is divided by the normalization curve. The result is a light curve with the noise removed and the transits preserved, as shown below.
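A minimal sketch of this normalization, assuming an in_transit boolean mask computed from the TCE ephemeris; the trend fit here uses a Savitzky-Golay filter, which is an assumption rather than the project's exact method:

import numpy as np
from scipy.signal import savgol_filter

def normalize(time: np.ndarray, flux: np.ndarray, in_transit: np.ndarray) -> np.ndarray:
    # Replace in-transit cadences by linear interpolation so the trend is
    # fitted to stellar variability only, not to the transits themselves.
    interpolated = np.interp(time, time[~in_transit], flux[~in_transit])
    # Fit a smooth normalization curve to the interpolated series.
    trend = savgol_filter(interpolated, window_length=101, polyorder=2)
    # Divide the original flux by the trend: noise removed, transits preserved.
    return flux / trend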
The final stage of processing is the creation of the global and local views. Both are phase-folded, meaning that all periods of the detected TCE are combined into one curve in which the detected event is centered and the values are averaged.
Centroid series are processed in a similar way, but only a local and a global view are created for them. Even and odd transits are extracted from the normalized curves, and local views of the star's brightness are created from them. A local view is also created for the secondary transit brightness.
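A minimal sketch of phase folding and view creation; the bin counts, local-view width and TCE values below are illustrative choices, not the project's settings:

import numpy as np

def phase_fold(time: np.ndarray, period: float, epoch: float) -> np.ndarray:
    # Fold all periods onto one axis so the detected event is centered at phase 0.
    return (time - epoch + 0.5 * period) % period - 0.5 * period

def make_view(phase: np.ndarray, flux: np.ndarray, num_bins: int, width: float) -> np.ndarray:
    # Average the folded flux into equal-width bins spanning [-width/2, width/2].
    edges = np.linspace(-width / 2, width / 2, num_bins + 1)
    idx = np.digitize(phase, edges) - 1
    return np.array([flux[idx == i].mean() if np.any(idx == i) else np.nan
                     for i in range(num_bins)])

# Reusing the normalized time/flux series from above (hypothetical TCE ephemeris).
phase = phase_fold(time, period=210.6, epoch=120.5)
global_view = make_view(phase, flux, num_bins=2001, width=210.6)  # whole period
local_view = make_view(phase, flux, num_bins=201, width=1.6)      # a few transit durations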
To run this project locally ensure you have Python 3.11 or newer installed on your machine:
$ python --version
# Python 3.11.8
Then install poetry and optionally tox:
$ pip install tox poetry
Install project dependencies:
$ poetry install
GAIA allows for efficient (asynchronous) downloading of stellar time series, TCE scalar values and stellar parameters from the official NASA and MAST archives via REST APIs. The script retries a download after an error occurs and can be stopped and resumed at any time without losing progress. The downloaded data is automatically saved in the specified local location as .fits files.
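A minimal sketch of the retry-and-resume idea, assuming aiohttp as the HTTP client; the retry policy and the shape of the URL-to-path mapping are illustrative, not the project's actual implementation:

import asyncio
from pathlib import Path

import aiohttp

async def fetch(session: aiohttp.ClientSession, url: str, dest: Path, retries: int = 3) -> None:
    if dest.exists():
        return  # already downloaded, so resuming simply skips the file
    for attempt in range(1, retries + 1):
        try:
            async with session.get(url) as response:
                response.raise_for_status()
                dest.write_bytes(await response.read())
                return
        except aiohttp.ClientError:
            if attempt == retries:
                raise
            await asyncio.sleep(2 ** attempt)  # back off before retrying

async def download_all(targets: dict[str, Path]) -> None:
    async with aiohttp.ClientSession() as session:
        await asyncio.gather(*(fetch(session, url, dest) for url, dest in targets.items()))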
To download data locally, run the following script from the top-level directory:
$ python -m gaia.scripts.download_data
NOTE The amount of data is significant (approx. 280 000 files, 120GB).
(Demo video: download-data.mp4)
Data preprocessing extracts the relevant information from the raw files and, in the case of TCEs, combines values from several sources into one file. It also converts the data to a format that is easier to process further (by default .parquet) and reduces its size from 120GB to just over 20GB. To preprocess raw data locally, run:
$ python -m gaia.scripts.preprocess_data
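After preprocessing, the .parquet files can be inspected directly, e.g. with pandas; the file name below is hypothetical, so check the actual output directory of the script:

import pandas as pd

tce = pd.read_parquet("data/interim/tce.parquet")  # hypothetical output path
print(tce.shape)
print(tce.head())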
The interactive visualizations provide a graphical representation of the TCE and stellar scalar data as well as of the stellar time series. Implemented as a website, the dashboard provides basic operations on charts (filtering, zooming, panning, selecting specific observations, etc.). To open the dashboard web page on localhost, run:
$ python -m gaia.scripts.run_dashboard
NOTE For the dashboard to work properly, the data must first be preprocessed, i.e. its format changed from .fits to .parquet using the preprocess_data.py script.
(Demo video: dashboard.mp4)
Data processing, implemented as a PySpark pipeline, transforms the interim data into the final features used to train the deep learning models. The final data is in the .tfrecords format. This step includes removing noise from light curves, creating the appropriate local and global views, splitting the data into training, validation and test sets, and normalizing observations. To create features locally, run:
$ python -m gaia.scripts.create_features
or
$ python -m gaia.scripts.submit_spark_create_features_job
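As a rough illustration of two of the steps listed above (the train/validation/test split and scalar normalization) expressed in plain PySpark; the column name, path and split ratios are assumptions, not the project's configuration:

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("create-features-sketch").getOrCreate()
features = spark.read.parquet("data/interim/features.parquet")  # hypothetical path

# Reproducible split into training, validation and test sets.
train, val, test = features.randomSplit([0.8, 0.1, 0.1], seed=42)

# Standardize a scalar column using statistics computed on the training set only.
stats = train.agg(F.mean("stellar_radius").alias("mu"),
                  F.stddev("stellar_radius").alias("sigma")).first()
train = train.withColumn("stellar_radius",
                         (F.col("stellar_radius") - stats["mu"]) / stats["sigma"])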
Not implemented yet.
Make sure you have Docker installed on your machine (docker --version).
To run part of this project on Google Cloud Platform (GCP), the following steps are required:
- Install gcloud.
- Create a new GCP project or select an existing one:
$ gcloud init
- Set up local Application Default Credentials (ADC) and create a credential JSON file:
$ gcloud auth application-default login
Learn more about ADC in the Google Cloud documentation.
- Set up the GOOGLE_APPLICATION_CREDENTIALS environment variable to provide the location of the credential JSON file. This environment variable is used by the GCP Python client libraries to communicate with GCP services (see the sketch after this list).
- Enable billing for the selected project.
- Enable Service Usage API.
- For data storage enable Google Cloud Storage API.
- For PySpark data processing enable Google Cloud Dataproc API.
- For DL models training enable Google Cloud Vertex AI API.
- For PySpark and DL models custom containers enable Artifact Registry API.
- Configure Docker to use the Google Cloud CLI to authenticate requests to Artifact Registry in the specified region (e.g. europe-central2):
$ gcloud auth configure-docker europe-central2-docker.pkg.dev
or add the following lines to the local Docker config (default location: ~/.docker/config.json):
"credHelpers": {
"europe-central2-docker.pkg.dev": "gcloud"
}
- Create an Artifact Registry container repository.
- Build and push the Docker image. The Docker image name MUST have the format {region}-docker.pkg.dev/{GCP_project_ID}/{repo}/{image_name}:{tag}, e.g.:
$ docker build -t europe-central2-docker.pkg.dev/project-132/test/test-dataproc:tag1 .
$ docker push europe-central2-docker.pkg.dev/project-132/test/test-dataproc:tag1
- Change the Dataproc config in gaia/configs/create_features/kepler_gcp.yaml to use your image.
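Once ADC and GOOGLE_APPLICATION_CREDENTIALS are set up (see the steps above), the GCP Python client libraries pick up the credentials automatically. A minimal sketch to verify this; listing Cloud Storage buckets is just an example:

from google.cloud import storage

client = storage.Client()  # uses ADC / GOOGLE_APPLICATION_CREDENTIALS automatically
for bucket in client.list_buckets():
    print(bucket.name)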