Abstract

Jorge Bruno Morgado
Reproducible science in the context of the SKA through the use of virtualization technologies

The Square Kilometre Array (SKA) is an ongoing effort to build the world's largest radio telescope, planned to go into full operation in 2027.

The first phase of the SKA will comprise 130,000 low-frequency antennas (50 MHz to 350 MHz) and ~200 mid-frequency antennas (350 MHz to 15.5 GHz), producing a raw data rate of ~10 Tb/s, requiring a computing power of 100 Pflop/s and an archiving capacity of ~500 PB/year. The next phase will increase the number of both low- and mid-frequency antennas by a factor of 10, and the computing requirements accordingly.

Such a volume of data is not suited to traditional analysis methods that rely on local storage and repeated computing trials, nor to reproducibility checks that rerun full data processing pipelines to validate results. Since the data cannot easily be reprocessed multiple times to check the validity of a scientific result, it is the algorithms and tools used to process the data that should be examined whenever possible.

We propose a data collection and processing pipeline focused on ensuring reproducible results while minimizing computing requirements, relying on virtualization and containerization. The aim is for all scientific results and intermediate data products to be intrinsically linked with the information necessary to unequivocally define every transformation applied to the data from the moment of observation, sparing resources that can be allocated to other endeavours.
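As a minimal sketch of the kind of linkage we have in mind, the hypothetical Python fragment below (not part of the SKA software stack; all names and record fields are illustrative) shows how a containerized pipeline step might attach a provenance record to its output: checksums of inputs and output, the digest of the container image that performed the transformation, and the parameters used.

```python
import hashlib
import json
from datetime import datetime, timezone
from pathlib import Path


def sha256sum(path: Path) -> str:
    """Content checksum used to identify a data product unambiguously."""
    h = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()


def write_provenance(inputs: list[Path], output: Path,
                     image_digest: str, parameters: dict) -> None:
    """Write a provenance record next to an output data product.

    `image_digest` is assumed to be the content-addressed identifier of
    the container image that ran this step (e.g. a sha256 digest), so the
    exact software environment can be recovered later.
    """
    record = {
        "output": {"path": output.name, "sha256": sha256sum(output)},
        "inputs": [{"path": p.name, "sha256": sha256sum(p)} for p in inputs],
        "container_image": image_digest,
        "parameters": parameters,
        "timestamp": datetime.now(timezone.utc).isoformat(),
    }
    prov_path = output.with_name(output.name + ".prov.json")
    prov_path.write_text(json.dumps(record, indent=2))
```

Given such records, validating a published result becomes a matter of checking the chain of digests and re-examining the containerized tools, rather than reprocessing the raw data.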