Abstract

Javier Moldon
A fully-reproducible workflow for the SKA Data Challenge 2 HI-FRIENDS solution

Reproducibility of data processing and scientific analysis gains particular relevance when dealing with large volumes of data from new and future observatories, for which analysis can be computationally expensive. In preparation for this situation, the SKAO is issuing a series of data challenges in which competing teams process simulated SKA data products. The second SKA data challenge (SDC2) consisted of finding and characterizing HI sources in a single 1 TB data cube, and included a Reproducibility Award for teams that presented an open and reproducible workflow following best practices that ensure reproducible and reusable results. The pipelines were evaluated against 29 reproducibility criteria covering different areas: being well documented, easy to install and use, released under an open license, with accessible source code, following coding standards, and containing code tests. We participated in the SDC2 as the HI-FRIENDS team and placed special emphasis on following the criteria for reproducibility and openness when developing our code. We developed a workflow managed with the workflow management system Snakemake, and made the code and the documentation publicly available on GitHub, Zenodo and WorkflowHub. Apart from complying with the SKAO reproducibility checklist, we suggested additional actions that could be implemented in future data challenges. Here we present the development procedure and standards we followed for our workflow, and the challenges we encountered.