Cleaning Images in the Cloud

In astronomical computing it is now possible to generate terabytes of charge-coupled device (CCD) image data daily, a capability shared by institutes both large and small. Every CCD image must undergo a pre-processing step of calibration and cleaning before magnitude values can be derived for stars and other light sources, and this must happen for each image before it is used to construct light curves for analysis. As the volume of data grows, so does this pre-processing burden. Existing data processing pipelines are either primarily sequential, and thus fail to exploit the parallel nature of the captured data, or rely on high-performance computing resources located close to the dataset. As datasets grow towards terabytes per day, sequential processing approaches create a bottleneck ahead of the creation and analysis of photometric light curves, and demand ever larger and more complex data centre solutions.
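As a minimal sketch of what the calibration and cleaning step involves, the standard CCD reduction arithmetic (bias subtraction, exposure-scaled dark removal, flat-field division) can be written as follows. The function and parameter names here are illustrative, not the BCO pipeline's actual interface, and real pipelines add further steps such as cosmic-ray rejection:

```python
import numpy as np

def calibrate_frame(raw, bias, dark, flat, exposure_s, dark_exposure_s):
    """Standard CCD calibration: remove bias and dark current, correct the flat field.

    All arguments are 2-D NumPy arrays of equal shape except the two
    exposure times, which are in seconds.
    """
    # Scale the (bias-subtracted) dark frame to the science exposure time.
    dark_scaled = (dark - bias) * (exposure_s / dark_exposure_s)
    # Normalise the flat so that dividing by it preserves the mean flux level.
    flat_norm = (flat - bias) / np.mean(flat - bias)
    return (raw - bias - dark_scaled) / flat_norm
```

Because each frame is calibrated independently, this step is embarrassingly parallel, which is precisely the property the distributed pipeline exploits.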

This research focuses on the calibration and reduction phase of astronomical pipelines, up to the point of creating magnitude values but prior to the production or analysis of light curves; light curve generation and analysis are beyond its scope. Using a 26 GB reference dataset from Blackrock Castle Observatory (BCO) in Cork, a data processing pipeline is proposed that incorporates the characteristics of distributed and cloud computing, such as elasticity, parallel processing, and the utilisation of commodity computing resources. This pipeline framework will demonstrate how a decentralised elastic computing module can be created to process terabytes of image data per day.

This research has already led to the creation of a distributed pipeline spanning three institutes, demonstrating a 98% reduction in processing time over an existing BCO processing pipeline. Further performance enhancements are being sought to demonstrate the feasibility of a parallel distributed cleaning pipeline for datasets in the order of tens of terabytes per day. Research is ongoing through a series of over 300 sizing and performance experiments using a mix of the Amazon Web Services infrastructure, the HEAnet storage infrastructure, and a private cloud spanning multiple institutes of technology in Ireland. Central to this research is the use of EC2 instances operating as worker nodes within the pipeline, accessing NGINX web servers that serve static image files hosted on AWS EBS and the HEAnet iSCSI storage farm. The pipeline is controlled by a series of Python scripts which launch instances from pre-configured AMIs; the instances obtain their work via the SQS service. The hypothesis under test is that 100 TB of raw astronomical data can be processed by a distributed processing pipeline in less than 24 hours. Such a system could be relevant to the data processing challenges facing the Large Synoptic Survey Telescope, due to come online within the next few years.
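The worker side of this arrangement can be sketched in a few lines of Python using boto3. The queue URL, image server hostname, and message format below are hypothetical stand-ins, not the project's real configuration; the sketch only shows the pattern of a worker node long-polling SQS for a frame name, fetching the frame from a static NGINX server, and deleting the message once the work is done:

```python
import json
import urllib.request

# Hypothetical endpoints -- placeholders, not the project's real configuration.
QUEUE_URL = "https://sqs.eu-west-1.amazonaws.com/123456789012/pipeline-work"
IMAGE_SERVER = "http://images.example.org"

def image_url(message_body):
    """Map an SQS message naming a FITS file to its URL on the static image server.

    Assumes a JSON body of the form {"filename": "ccd_001.fits"}.
    """
    work = json.loads(message_body)
    return f"{IMAGE_SERVER}/{work['filename']}"

def worker_loop():
    """Long-poll SQS for work, pull each raw frame from NGINX, then acknowledge."""
    import boto3  # assumed to be installed on the worker AMI
    sqs = boto3.client("sqs")
    while True:
        resp = sqs.receive_message(
            QueueUrl=QUEUE_URL,
            MaxNumberOfMessages=1,
            WaitTimeSeconds=20,  # long polling keeps idle workers cheap
        )
        for msg in resp.get("Messages", []):
            url = image_url(msg["Body"])
            local = "/tmp/" + url.rsplit("/", 1)[-1]
            urllib.request.urlretrieve(url, local)  # fetch the raw frame
            # ... run calibration and cleaning on `local` here ...
            sqs.delete_message(QueueUrl=QUEUE_URL,
                               ReceiptHandle=msg["ReceiptHandle"])
```

Because SQS hides a message while a worker holds it and redelivers it if the worker dies before deleting it, frames are never silently lost, which is what makes the pipeline elastic: workers can be added or terminated at any time.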

Talk at HEANET 2012 

Talk at HEANET 2010

Slide Presentation 2009

SPIE Paper 2012 Astronomical Data Processing in the Cloud

ACN Pipeline Github repository (PhD Research)

NIMBUS Pipeline Github repository (PhD Research)

Research Associates: Blackrock Castle Observatory Cork, ITTD Tallaght

