1 Introduction

The iterative map-maker developed for Smurf [1] is a sophisticated algorithm that is able to reduce the large amounts of data collected by SCUBA-2 in a reasonable amount of time. The algorithm measures and removes correlated noise by iteratively solving for sources of noise along with the sky signal. This is much less computationally intensive than a full generalized least-squares solution (e.g. [2]), which requires careful measurement of noise auto- and cross-power spectra and the inversion of very large matrices. Due to the iterative nature of the algorithm, however, the full data set must be accessed repeatedly. Since the data rate is high (tens of GB per hour of observation), reading the full data set (often many hours of data) from disk is slow. Re-reading the data on every iteration takes a considerable amount of time, so it is desirable to keep the data in memory once it has been read in for the first time. This limits the amount of data that can be reduced at a time: longer observations, with data sizes larger than the available memory, must be reduced in chunks and combined after the fact, if caching and re-reading the data is to be avoided.
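
To make the structure of the iteration concrete, the following self-contained sketch alternately estimates a common mode (the mean over detectors at each time slice) and a sky map (the mean of all samples falling in each pixel). It illustrates the principle only; the choice of models, the scan pattern and all sizes are hypothetical simplifications, not the actual Smurf implementation.

    /* Illustrative alternating estimation of a common-mode drift and
     * the sky signal.  Not the Smurf implementation: the models,
     * sizes and scan pattern are hypothetical stand-ins. */
    #include <stdio.h>
    #include <stdlib.h>

    #define NDET   8     /* detectors    */
    #define NSAMP  5000  /* time samples */
    #define NPIX   10    /* sky pixels   */
    #define NITER  20    /* iterations   */

    static double data[NDET][NSAMP]; /* simulated time-streams     */
    static int    pix[NDET][NSAMP];  /* pixel hit by each sample   */

    int main(void) {
        double sky[NPIX] = {0.0};    /* current sky estimate         */
        double com[NSAMP];           /* current common-mode estimate */

        /* Simulate: a one-pixel source, plus a drift shared by all
         * detectors, plus independent white noise.  Detectors are
         * offset on the sky so the common mode is not degenerate
         * with the sky signal. */
        for (int d = 0; d < NDET; d++)
            for (int t = 0; t < NSAMP; t++) {
                pix[d][t] = (t + 3 * d) % NPIX;
                double src   = (pix[d][t] == 3) ? 1.0 : 0.0;
                double drift = 5.0 * t / NSAMP;
                data[d][t] = src + drift
                           + 0.1 * ((double)rand() / RAND_MAX - 0.5);
            }

        for (int it = 0; it < NITER; it++) {
            /* Common mode: average over detectors at each time
             * slice, after removing the current sky estimate. */
            for (int t = 0; t < NSAMP; t++) {
                com[t] = 0.0;
                for (int d = 0; d < NDET; d++)
                    com[t] += data[d][t] - sky[pix[d][t]];
                com[t] /= NDET;
            }
            /* Sky: rebin the common-mode-subtracted residual, i.e.
             * average all samples that fall within each pixel. */
            double num[NPIX]  = {0.0};
            int    hits[NPIX] = {0};
            for (int d = 0; d < NDET; d++)
                for (int t = 0; t < NSAMP; t++) {
                    num[pix[d][t]] += data[d][t] - com[t];
                    hits[pix[d][t]]++;
                }
            for (int p = 0; p < NPIX; p++)
                if (hits[p]) sky[p] = num[p] / hits[p];
        }

        for (int p = 0; p < NPIX; p++)
            printf("pixel %d: %6.3f\n", p, sky[p]);
        return 0;
    }

Compiled with a C99 compiler, the printed sky should show the simulated source pixel standing out above a constant offset that is degenerate between the sky and common-mode models.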

The next generation of submillimetre instruments, such as SWCam [3], the short-wavelength camera for CCAT [4], will have 10 times as many detector elements as SCUBA-2 and will be sampled at 1 kHz, compared to SCUBA-2's 200 Hz. The resulting factor-of-50 increase in data volume (10 times the detectors at 5 times the sample rate) presents a serious problem for data reduction, both in terms of memory usage and processing time. Several approaches have been considered to cope with this increase:

(1) High-pass filter the time-streams to suppress low-frequency correlated noise. If the detectors' filtered noise power spectra are reasonably flat and uncorrelated, the map-making problem reduces to simple rebinning (averaging all samples that fall within a given sky pixel, as in the sky-estimation step of the sketch above), requiring only one pass through the data. The high-pass filter limits the angular scales recovered by the mapping process, however; this is fine for maps consisting only of point sources, but it will not be sufficient when larger scales are important.
(2) Use machines with large amounts of memory and limit the length of observations that can be reduced in a single pass. CCAT and SWCam are still several years away, and one can expect increases in hardware speed and capacity in the intervening years. By simply scaling the run time and memory usage for reducing 15 minutes of SCUBA-2 data, we estimate that a 32-core machine with 2 TB of memory could reduce 15 minutes of SWCam data in about an hour [5]. Four such machines would therefore be needed to keep up with the data as it is collected, and the reductions would still be limited to small maps.
(3) Write a parallel version of the map-maker to run on a distributed-memory cluster of machines. The iterative algorithm is not trivially parallelizable, as each iteration, and each model within each iteration, must be calculated sequentially. The solution, then, is to parallelize each model calculation. This is also non-trivial, as some models require access to all detectors at each time slice (e.g. the common-mode model), some models require all time samples for each detector (e.g. the high-pass filter), and some require all samples (e.g. the astronomical model). We solve this problem by message-passing via MPI¹; a sketch of the key redistribution follows this list.
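
The central operation is switching the distribution of the time-streams between a detector-split layout (each process holds full time-streams for a subset of detectors, as the filter requires) and a time-split layout (each process holds all detectors for a subset of time slices, as the common-mode estimate requires). The sketch below shows one way to express this redistribution as a parallel transpose with MPI_Alltoall; the sizes, and the assumption that they divide evenly among the processes, are hypothetical simplifications.

    /* Parallel transpose of a detectors-by-samples array between a
     * detector-split and a time-split layout.  Sizes are hypothetical
     * and assumed divisible by the number of processes. */
    #include <mpi.h>
    #include <stdio.h>
    #include <stdlib.h>

    #define NDET  64    /* total detectors    */
    #define NSAMP 4096  /* total time samples */

    int main(int argc, char **argv) {
        int rank, nproc;
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &nproc);
        if (NDET % nproc || NSAMP % nproc)
            MPI_Abort(MPI_COMM_WORLD, 1);

        int dloc = NDET / nproc;  /* detectors per process (detector-split) */
        int sloc = NSAMP / nproc; /* samples per process   (time-split)     */

        /* Detector-split layout: dloc full-length time-streams, as
         * needed for per-detector operations such as filtering. */
        double *A = malloc((size_t)dloc * NSAMP * sizeof *A);
        for (int d = 0; d < dloc; d++)
            for (int t = 0; t < NSAMP; t++)
                A[(size_t)d * NSAMP + t] = (rank * dloc + d) + 1e-6 * t;

        /* Pack a dloc x sloc tile for each destination process q:
         * our detectors, restricted to q's block of time slices. */
        double *send = malloc((size_t)dloc * NSAMP * sizeof *send);
        double *B    = malloc((size_t)NDET * sloc * sizeof *B);
        for (int q = 0; q < nproc; q++)
            for (int d = 0; d < dloc; d++)
                for (int t = 0; t < sloc; t++)
                    send[((size_t)q * dloc + d) * sloc + t] =
                        A[(size_t)d * NSAMP + (size_t)q * sloc + t];

        /* Every process exchanges one tile with every other process.
         * The received buffer is already in time-split order:
         * B[g * sloc + t] holds global detector g at local sample t. */
        MPI_Alltoall(send, dloc * sloc, MPI_DOUBLE,
                     B,    dloc * sloc, MPI_DOUBLE, MPI_COMM_WORLD);

        /* All NDET detectors are now present for this process's block
         * of time slices, as the common-mode estimate requires. */
        if (rank == 0)
            printf("rank 0: detector 5, local sample 0 -> %g\n",
                   B[5 * sloc]);

        free(A); free(send); free(B);
        MPI_Finalize();
        return 0;
    }

The inverse redistribution, from the time-split back to the detector-split layout, is the same MPI_Alltoall with the packing step performed on the other side.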

The distributed-memory parallelized version of the iterative map-maker has been described elsewhere [6]; in this document we add more detail and discuss some practical considerations.

A note on terminology: a distributed-memory cluster consists of a number of connected nodes. Each node may consist of a number of processors or cores. A single node can support multithreaded parallelization using shared memory between threads. The term process refers to a single executable that is run on a node; a process is able to spawn multiple threads to take advantage of a multicore node. The term system is used to refer to the set of all processes involved in the calculation.
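
As a concrete illustration of these terms, the minimal hybrid program below uses MPI to communicate between processes and spawns threads within each process. The document specifies MPI for message passing but not a particular threading model; OpenMP is used here purely as an assumed, common choice.

    /* Minimal hybrid example: processes communicate via MPI, and each
     * process spawns threads that share that process's memory.
     * OpenMP is an illustrative (assumed) choice of threading model. */
    #include <mpi.h>
    #include <omp.h>
    #include <stdio.h>

    int main(int argc, char **argv) {
        int provided, rank, nproc;
        /* FUNNELED: only each process's main thread makes MPI calls. */
        MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &nproc);

        #pragma omp parallel
        printf("process %d of %d (system), thread %d of %d (node)\n",
               rank, nproc, omp_get_thread_num(), omp_get_num_threads());

        MPI_Finalize();
        return 0;
    }

Built with, e.g., mpicc -fopenmp and launched with one process per node, each line of output identifies one (process, thread) pair in the system.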

¹Message Passing Interface: a standard defining a set of routines to pass messages between processes.