### 2 Clump Identification Algorithms

2.1 FellWalker
2.2 GaussClumps
2.3 ClumpFind
2.4 Reinhold

This section lists and describes the clump identification algorithms which are implemented within the CUPID FINDCLUMPS command.

Note, it is assumed that any background extended emission has been removed from the data before using FINDCLUMPS. This can be done using the FINDBACK command.

#### 2.1 FellWalker

This algorithm was developed specifically for CUPID to address some of the problems associated with ClumpFind. It is fully described in Berry (2015, Astronomy & Computing, vol. 10, pp. 22-31).

It is most simply described assuming the data array is 2-dimensional, although it applies in a similar manner to 3-dimensional data. Its name was chosen to suggest a parallel between a contour map of a 2D astronomical data array, and the height contours seen in a geographical map of a hilly area, such as is used by most fell-walkers. The algorithm used to assign a data pixel to a clump of emission can be compared to that of a fell-walker ascending a hill by following the line of steepest ascent (not perhaps the most sensible way to ascend a hill in practice, but one which lends some verisimilitude to the present algorithm!).

The algorithm considers every data pixel in the supplied array in turn as a potential start for a “walk” to a neighbouring peak. Pixels which are below a nominated background level (specified by configuration parameter FellWalker.Noise) are ignored and are flagged as not belonging to any emission clump. These are skipped over, as are pixels which have already been assigned to a clump. Once a pixel is found that has not yet been assigned to a clump and is above the background level, the algorithm proceeds to trace out a route from this pixel to a neighbouring peak. It does this in a series of steps (comparable to the steps of a fell-walker). At each step the algorithm looks at the 3×3 square of neighbouring pixels (a 3×3×3 cube for 3D data), and selects the neighbour which would produce the highest gradient for the next step. The algorithm then moves to that pixel and proceeds with the next step.

Eventually, this algorithm will reach a local maximum; a point from which all routes go down-hill. But this may be merely a noise spike, rather than a significant peak, and so a check is made over a more extended neighbourhood to see if a pixel with a higher data value than the current pixel can be found (the extent of this extended neighbourhood is specified by the configuration parameter FellWalker.MaxJump). If so, the algorithm “jumps” to the pixel with the highest value in the extended neighbourhood, and then proceeds as before to walk up-hill. If no pixel with a higher value is found within the extended neighbourhood, the pixel is designated as a significant peak, and is assigned a unique integer identifier. This integer is used to identify all pixels which are within the clump of emission containing the peak, and all pixels which were visited along the walk are assigned this identifier.
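
The uphill walk and the MaxJump check might be sketched as follows. This is illustrative Python only, not the CUPID implementation (which is written in C); the function names are invented here, and for simplicity the sketch compares neighbouring data values directly rather than gradients (i.e. it ignores the longer step length of diagonal moves).

```python
# Illustrative sketch of the FellWalker uphill walk, including the
# "jump" over noise spikes within the extended (MaxJump) neighbourhood.
# "data" is a 2D list of pixel values; names are this sketch's own.

def neighbours(x, y, nx, ny, radius=1):
    """Yield in-bounds pixel positions within a square of given radius."""
    for dx in range(-radius, radius + 1):
        for dy in range(-radius, radius + 1):
            if (dx or dy) and 0 <= x + dx < nx and 0 <= y + dy < ny:
                yield x + dx, y + dy

def walk_to_peak(data, x, y, maxjump=2):
    """Follow the line of steepest ascent from (x, y) to a peak."""
    nx, ny = len(data), len(data[0])
    path = [(x, y)]
    while True:
        # Step to the immediate neighbour giving the steepest ascent.
        best = max(neighbours(x, y, nx, ny), key=lambda p: data[p[0]][p[1]])
        if data[best[0]][best[1]] > data[x][y]:
            x, y = best
            path.append((x, y))
            continue
        # Local maximum reached: search the extended neighbourhood for a
        # higher pixel before accepting this as a significant peak.
        far = max(neighbours(x, y, nx, ny, maxjump),
                  key=lambda p: data[p[0]][p[1]])
        if data[far[0]][far[1]] > data[x][y]:
            x, y = far
            path.append((x, y))
            continue
        return (x, y), path  # a significant peak
```

In the full algorithm every pixel on `path` would then be assigned the integer identifier of the peak that the walk reached.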

If, in the process of walking up-hill, a pixel is visited which has already been assigned to a clump, then the walk is terminated at that point and all the pixels so far visited are assigned to the same clump.

In some cases, the initial part of a walk may be over very flat “terrain”. The significant part of a walk is considered to start when the average gradient (taken over a 4 step length) first reaches the value of configuration parameter FellWalker.FlatSlope. Any pixels visited prior to this point are deemed not to be in any clump. However, this only applies if the walk starts at or close to “sea level”. For walks which start from a higher level (i.e. from a pixel which has a data value greater than the selected background level plus twice the RMS noise level), the entire length of the walk is used, including any initial flat section.
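
The FlatSlope test can be pictured with the following fragment (illustrative Python; the function name and the exact handling of the 4-step window are this sketch's own assumptions):

```python
# Find where the "significant" part of a walk begins: the first point
# at which the mean gradient over a 4-step window reaches FlatSlope.
# "values" are the data values visited along the walk, in order.

def first_significant_step(values, flatslope, window=4):
    """Return the index at which the mean gradient over `window` steps
    first reaches `flatslope`; earlier pixels are deemed not to belong
    to any clump. Returns None if the whole walk is too flat."""
    for i in range(len(values) - window):
        mean_grad = (values[i + window] - values[i]) / window
        if mean_grad >= flatslope:
            return i
    return None
```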

Once all pixels in the data array have been considered as potential starts for such a walk, an array will have been created which holds an integer clump identifier for every data pixel. To reduce the effect of noise on the boundaries between clumps, a cellular automaton can be used to smooth the boundaries. This replaces each clump identifier by the most commonly occurring value within a 3×3 square (or 3×3×3 cube for 3D data) of neighbours. The number of times this cleaning process is applied is specified by configuration parameter FellWalker.CleanIter.
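
The smoothing step is a simple majority vote over each pixel's neighbourhood, as in the following sketch (illustrative Python for the 2D case; not the CUPID source):

```python
from collections import Counter

# Boundary-smoothing cellular automaton: each clump identifier is
# replaced by the most common identifier in its 3x3 neighbourhood.
# "labels" is a 2D list of integer clump identifiers.

def smooth_labels(labels):
    nx, ny = len(labels), len(labels[0])
    out = [row[:] for row in labels]
    for x in range(nx):
        for y in range(ny):
            votes = Counter()
            for dx in (-1, 0, 1):
                for dy in (-1, 0, 1):
                    if 0 <= x + dx < nx and 0 <= y + dy < ny:
                        votes[labels[x + dx][y + dy]] += 1
            out[x][y] = votes.most_common(1)[0][0]
    return out
```

Applying this more than once (CleanIter > 1) simply feeds the output back in as the new input.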

If the high data values in a clump form a plateau with slight undulations, then the above algorithm may create a separate clump for each undulation. This is probably inappropriate, especially if the dips between the undulations are less than or are comparable to the noise level in the data. This situation can arise for instance if the pixel-to-pixel noise is correlated on a scale equal to or larger than the value of the MaxJump configuration parameter. To avoid this, adjoining clumps are merged together if the dip between them is less than a specified value. Specifically, if two clumps with peak values PEAK1 and PEAK2, where PEAK1 is less than PEAK2, are adjacent to each other, and if the pixels along the interface between the two clumps all have data values which are larger than “PEAK1 - MinDip” (where MinDip is the value of the MinDip configuration parameter), then the two clumps are merged together.
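
The merging criterion itself is a one-line test, sketched here in illustrative Python (the function name and argument layout are this sketch's own; `interface_values` stands for the data values of the pixels along the boundary between the two clumps):

```python
# MinDip merging test: two adjacent clumps are merged if every pixel on
# their shared interface is higher than (lower peak - MinDip).

def should_merge(peak1, peak2, interface_values, mindip):
    lower = min(peak1, peak2)
    return all(v > lower - mindip for v in interface_values)
```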

The results of this merging process are the final clump allocations for every data pixel, from which the catalogue of clump parameters is produced.

#### 2.2 GaussClumps

This is based on the algorithm described by Stutzki & Güsten (1990, ApJ 356, 513). This algorithm proceeds by fitting a Gaussian profile to the brightest peak in the data. It then subtracts the fit from the data and iterates, fitting a new Gaussian to the brightest peak in the residuals. This continues until any one of the “termination criteria” described below is satisfied. Each fitted Gaussian is taken to be a single clump and is added to the output catalogue. In this algorithm, clumps may overlap, and for this reason each input pixel cannot simply be assigned to a single clump (as can be done for algorithms such as FellWalker or ClumpFind). Therefore, when using GaussClumps, the primary output NDF from FINDCLUMPS does not hold the clump index at each input pixel. Instead it holds the sum of all the fitted Gaussians that contribute to each input pixel position.

Any input variance component is used to scale the weight associated with each pixel value when performing the Gaussian fit. The most significant configuration parameters for this algorithm are: GaussClumps.FwhmBeam and GaussClumps.VeloRes, which determine the minimum clump size, and GaussClumps.Thresh, which (together with the ADAM parameter RMS) determines the third of the termination criteria listed below.

Note, this implementation of the GaussClumps algorithm is a completely independent re-write, and includes some differences from other GaussClumps implementations. Specifically, these include the following modifications.

• The Gaussian fitting is based on the SUMSL module (algorithm 611) from the TOMS library available from http://www.netlib.org/.
• Any available variance information is used to weight the pixels when doing the Gaussian fit. This is in addition to the Gaussian weighting function implied by configuration parameters GaussClumps.Wwidth and GaussClumps.Wmin.
• The termination criteria are different. FINDCLUMPS stops finding further clumps if any one of the following criteria is met.
(1) The total data sum in the fitted Gaussians equals or exceeds the total data sum in the supplied input data (this is the original termination criterion used by Stutzki & Güsten).
(2) The number of clumps already found equals the value of configuration parameter GaussClumps.MaxClumps.
(3) The number of consecutive fitted peaks with peak value below the value of configuration parameter GaussClumps.Thresh reaches the value of configuration parameter GaussClumps.NPad (the final group of NPad clumps are not included in the returned list of usable clumps).
(4) The number of consecutive failed attempts to fit a clump reaches the value of configuration parameter GaussClumps.MaxSkip.
• A clump will be ignored if its fitted peak value is a long way above or below the peak value of the previously fitted clump. The definition of “a long way” is more than GaussClumps.NSigma times the standard deviation of the previous GaussClumps.NPeak fitted peaks. This restriction is only imposed once GaussClumps.NPeak peaks have been fitted.
• In certain situations the chi-squared value that is minimised when fitting a Gaussian clump to a peak in the data array may be dominated by pixels that are largely unaffected by changes in the parameters of the Gaussian clumps [1]. This can result in a very poor fit to the clump. To avoid this, an attempt is made to identify such pixels and to lower the weight associated with them.
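
The overall fit-and-subtract loop, together with termination criteria (2) and (3), can be caricatured as follows. This is illustrative Python only, and a drastic simplification: the real algorithm performs a weighted least-squares fit of a full Gaussian model, whereas this 1D sketch simply reads off the residual peak and assumes a fixed FWHM. All names are this sketch's own.

```python
import math

def gaussian(x, amp, centre, fwhm):
    """Evaluate a 1D Gaussian with the given peak, centre and FWHM."""
    sigma = fwhm / 2.35482  # FWHM = 2*sqrt(2*ln 2) * sigma
    return amp * math.exp(-0.5 * ((x - centre) / sigma) ** 2)

def gaussclumps_sketch(data, rms, thresh=2.0, maxclumps=10, npad=3, fwhm=3.0):
    """Repeatedly 'fit' and subtract the brightest residual peak."""
    residuals = list(data)
    clumps, nbelow = [], 0
    while len(clumps) < maxclumps:              # criterion (2)
        amp = max(residuals)
        centre = residuals.index(amp)
        if amp < thresh * rms:                  # peak below Thresh*RMS
            nbelow += 1
            if nbelow >= npad:                  # criterion (3)
                break
        else:
            nbelow = 0
        clumps.append((amp, centre, fwhm))
        residuals = [r - gaussian(i, amp, centre, fwhm)
                     for i, r in enumerate(residuals)]
    return clumps
```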

#### 2.3 ClumpFind

This algorithm was developed by Jonathan Williams and is fully described in Williams, de Geus & Blitz (1994, ApJ 428, 693).

Briefly, it contours the data array at many different levels, starting at a value close to the peak value in the array and working down to a specified minimum contour level. At each contour level, all contiguous areas of pixels that are above the contour level are found and considered in turn. If such a set of pixels includes no pixels that have already been assigned to a clump (i.e. have already been identified at a higher contour level), then the set is marked as a new clump. If the set includes some pixels that have already been assigned to a clump, then, if all such pixels belong to the same clump, that clump is extended to include all the pixels in the set. If the set includes pixels that have already been assigned to two or more clumps, then the new pixels in the set are shared out between the two or more existing clumps. This sharing is done by assigning each new pixel to the closest clump. Note, this is based on the distance to the nearest pixel already assigned to the clump, not the distance to the central or peak pixel in the clump. The above paper refers to this as a “friends-of-friends” algorithm.

This process continues down to the lowest specified contour level, except that new clumps found at the lowest contour level are ignored. However, clumps found at higher contour levels are followed down to the lowest specified contour level.
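
A much-simplified 1D caricature of this contouring loop is sketched below (illustrative Python only). Real ClumpFind works on whole contiguous regions at each level and shares contested pixels by distance to the nearest already-assigned pixel; here each pixel is handled individually, ties simply adopt the first assigned neighbour found, and the rule that new clumps appearing at the lowest level are ignored is omitted.

```python
# Simplified ClumpFind-style contouring: work down through the contour
# levels, extending existing clumps and starting new ones.

def clumpfind_sketch(data, levels):
    """`levels` must be sorted from highest to lowest contour level.
    Returns a clump identifier for each pixel (0 = no clump)."""
    assign = [0] * len(data)
    next_id = 1
    for level in levels:
        for i, v in enumerate(data):
            if v < level or assign[i]:
                continue
            # Adopt the label of an already-assigned neighbour if any,
            # otherwise start a new clump at this contour level.
            nbr = [assign[j] for j in (i - 1, i + 1)
                   if 0 <= j < len(data) and assign[j]]
            assign[i] = nbr[0] if nbr else next_id
            if not nbr:
                next_id += 1
    return assign
```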

##### 2.3.1 Comparing CUPID ClumpFind with other Implementations

The CUPID implementation of ClumpFind is a completely independent re-write, based on the description of the algorithm in the 1994 Williams, de Geus & Blitz paper. Consequently, it differs slightly from other implementations such as the IDL implementation available from Williams' own web page at http://www.ifa.hawaii.edu/~jpw/. In particular, several extra configuration parameters have been added by CUPID to provide more detailed control of the algorithm. If you want to do a direct comparison between the CUPID implementation of the ClumpFind algorithm and another implementation, then you should set these extra parameters to the following values:

• ClumpFind.FwhmBeam = 0
• ClumpFind.MaxBad = 1.0
• ClumpFind.MinPix = 6 (for 3D data - use 20 for 2D data)
• ClumpFind.VeloRes = 0
• ClumpFind.IDLAlg = 1
• ClumpFind.AllowEdge = 1
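
Configuration values such as these are normally supplied to FINDCLUMPS through its CONFIG parameter, for example as a plain-text file of `name = value` lines. A sketch of such a file (the file name is arbitrary, and you should check the FINDCLUMPS documentation for the exact syntax used to reference it, e.g. `config=^clumpfind-idl.cfg`):

```
ClumpFind.FwhmBeam = 0
ClumpFind.MaxBad = 1.0
ClumpFind.MinPix = 6
ClumpFind.VeloRes = 0
ClumpFind.IDLAlg = 1
ClumpFind.AllowEdge = 1
```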

You should also note that the pixel co-ordinate systems used by the two implementations differ, and consequently the positions reported for the clump peaks will also differ. The IDL implementation of ClumpFind uses a pixel coordinate system in which the first pixel (i.e. the bottom left pixel of a 2D image displayed “normally”) is centred at (0,0). This differs from both the FITS and NDF definitions of pixel co-ordinates. In FITS, the centre of the first pixel in the array has a value of 1.0 on every pixel axis. In NDF, the centre of the first pixel has a value of $LBND(I)-0.5$ on axis $I$, where $LBND(I)$ is an attribute of the NDF known as the “pixel origin”. For an NDF which has been derived from a FITS file and which has been subjected to no geometric transformations, the pixel origin will be 1 on every pixel axis, resulting in the centre of the first pixel having a value of 0.5 on every pixel axis.
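
The offsets between the three conventions amount to a simple shift per axis, as this fragment illustrates (illustrative Python; the function names are this sketch's own):

```python
# Convert a peak position on one axis as reported by the IDL ClumpFind
# (first pixel centred at 0.0) to the FITS convention (first pixel
# centred at 1.0) and to NDF pixel coordinates (first pixel centred at
# LBND - 0.5, where LBND is the NDF pixel origin on that axis).

def idl_to_fits(p):
    return p + 1.0

def idl_to_ndf(p, lbnd=1):
    return p + (lbnd - 0.5)
```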

Some implementations of ClumpFind do not conform exactly to the description in the published paper. Specifically:

(1) The way in which areas containing merged clumps are divided up between individual clumps can differ slightly.
(2) The restriction that all peaks must extend at least as far as the second contour level is sometimes omitted.
(3) The restriction on the minimum number of pixels contained within a clump is sometimes varied in value.

CUPID provides an option to use either the published algorithm or the algorithm implemented by the IDL package available from Jonathan Williams' WWW site (as available on 28th April 2006). Setting the configuration parameter ClumpFind.IDLAlg to a non-zero value causes the IDL implementation to be used; the default value of zero causes the published algorithm to be used.

#### 2.4 Reinhold

This algorithm was developed by Kim Reinhold whilst working at the Joint Astronomy Centre in Hilo, Hawaii. Its overall strategy is first to identify pixels within the data array which mark the edges of clumps of emission. This typically produces a set of rings (in 2D), or shells (in 3D), outlining the clumps. However, these structures can be badly affected by noise in the data and so need to be cleaned up. This is done using a sequence of cellular automata which first dilate the rings or shells, and then erode them. After cleaning, all pixels within each ring or shell are assumed to belong to a single clump.

##### 2.4.1 Identifying the Clump Edges

The initial identification of the edges is done by considering a set of 1-dimensional profiles through the data array. Each such profile can be represented by a plot of data value against distance (in pixels) along the profile. For each such profile, the algorithm proceeds as follows.

(1) Find the highest remaining (i.e. unused) data value in the profile.
(2) If this value is less than a specified background level (given by the configuration parameter Reinhold.Noise), there are no remaining significant peaks in the profile, so continue with the next profile.
(3) Work outwards from the peak position along the profile in both directions to find the edges of the peak. A peak ends when the walk i) meets a pixel which has already been included within another peak, ii) encounters two adjacent pixels which are both below the background level, iii) finds that the average gradient of the profile over three adjacent pixels drops below a minimum value specified by the configuration parameter Reinhold.FlatSlope, or iv) reaches the end of the profile.
(4) If the peak was not truncated by reaching either end of the profile, and if it spans sufficient pixels, the positions of its two edges are marked in a mask array which has the same shape and size as the 2D or 3D data array. The minimum number of pixels a peak must span in order to be usable is given by the configuration parameter Reinhold.MinPix.
(5) The position of the peak itself is also marked, so long as its peak value is above a specified minimum value (given by configuration parameter Reinhold.Thresh).
(6) The pixels within the 1D profile which fall between the upper and lower edges of the peak are marked as “used”, and the algorithm loops back to step 1.

This algorithm is applied to each 1D profile in turn. These profiles are divided into groups of parallel profiles; the first group contains profiles which are parallel to the first pixel axis, the second group contains profiles which are parallel to the second pixel axis, etc. There are also groups which contain parallel profiles which proceed diagonally through the data array. Thus there is a group of parallel profiles for every pixel axis and for every possible diagonal direction. Within each group, there are sufficient profiles to ensure that every element in the data array is included in a profile within the group.

Once all profiles have been processed, a 2D or 3D array is available that identifies both the edges of the peaks and the peak positions themselves. Pixels which are flagged as peaks are only retained if the pixel was found to be a peak in every profile group. That is, pixels which appear as a peak when profiled in one direction but not when profiled in another are discarded.
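
The per-profile peak search described above might be sketched as follows (illustrative Python only; names are this sketch's own, and the "two pixels below the background" and "flat slope" stopping rules are implemented in a simplified form):

```python
# One pass of the Reinhold per-profile search: find the highest unused
# value, walk outwards to the peak's edges, and mark its pixels used.

def find_edges(profile, noise, flatslope, used):
    """Return (lower edge, peak, upper edge) indices, or None if no
    significant peak remains. `used` is updated in place."""
    candidates = [i for i in range(len(profile)) if not used[i]]
    if not candidates:
        return None
    peak = max(candidates, key=lambda i: profile[i])
    if profile[peak] < noise:
        return None                          # no significant peaks left
    edges = []
    for step in (-1, 1):                     # walk away in both directions
        i = peak
        while True:
            j = i + step
            if j < 0 or j >= len(profile):   # iv) end of profile
                break
            if used[j]:                      # i) met another peak
                break
            if profile[j] < noise and profile[i] < noise:
                break                        # ii) two adjacent low pixels
            i = j
            if abs(i - peak) >= 3:           # iii) flat-slope test
                mean_slope = (profile[i - 3 * step] - profile[i]) / 3.0
                if mean_slope < flatslope:
                    break
        edges.append(i)
    lo, hi = min(edges), max(edges)
    for k in range(lo, hi + 1):
        used[k] = True
    return lo, peak, hi
```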

##### 2.4.2 Cleaning the Clump Edges

The initial identification of clump edges results in a mask array in which each data pixel is marked as an edge pixel, a peak pixel, or neither. Usually, the edge pixels can be seen to follow the outline of the visible clumps, but will often be badly affected by noise in the data. For instance, there may be holes in the edges surrounding a peak, or spurious pixels may have been marked as edge pixels. Before continuing, it is necessary to reduce the effect of this noise. This is done in the two steps described below.

(1) The edge regions are “dilated” (i.e. thickened) using a cellular automaton which proceeds as follows: if a pixel is marked as an edge pixel, then all immediate neighbours of that pixel are also marked as edge pixels. Each pixel is considered to be the central pixel in a 3×3 square of neighbouring pixels for 2D data, or a 3×3×3 cube for 3D data.
(2) The thickened edge regions are then “eroded” (i.e. made thinner) using another cellular automaton which proceeds as follows: if the number of neighbouring edge pixels surrounding a central pixel is greater than a specified threshold value (given by the configuration parameter Reinhold.CAThresh), the central pixel is marked as an edge pixel; if the number of neighbouring edge pixels is equal to or below this threshold, it is not. This transformation can be applied repeatedly to increase the amount of erosion, by setting the configuration parameter Reinhold.CAIterations to a value greater than one.

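
These two cellular automata can be sketched as follows for the 2D case (illustrative Python only; names are this sketch's own, and `cathresh` stands in for Reinhold.CAThresh):

```python
# Dilate/erode cellular automata for cleaning a 2D edge mask (1 = edge).

def neighbour_count(mask, x, y):
    """Count edge pixels among the 8 immediate neighbours of (x, y)."""
    n = 0
    for dx in (-1, 0, 1):
        for dy in (-1, 0, 1):
            if (dx or dy) and 0 <= x + dx < len(mask) \
                    and 0 <= y + dy < len(mask[0]):
                n += mask[x + dx][y + dy]
    return n

def dilate(mask):
    """Mark every neighbour of an edge pixel as an edge pixel."""
    return [[1 if mask[x][y] or neighbour_count(mask, x, y) > 0 else 0
             for y in range(len(mask[0]))] for x in range(len(mask))]

def erode(mask, cathresh):
    """Keep a pixel as an edge pixel only if more than `cathresh` of
    its neighbours are edge pixels."""
    return [[1 if neighbour_count(mask, x, y) > cathresh else 0
             for y in range(len(mask[0]))] for x in range(len(mask))]
```
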
##### 2.4.3 Filling the Clump Edges

Once the edges have been cleaned up, the volume contained within the edges can be filled with an integer which identifies the associated peak. This algorithm proceeds as follows.

(1) The mask array is scanned for pixels which are marked as peaks. Recall that only those pixels which are seen to be peaks when profiled in all directions have been retained. Each of these pixels thus represents a local maximum in the data value, and has a significantly higher data value than any of the surrounding pixels. Each such peak is given a unique integer identifier, which is used in the following steps to label all pixels in the clump of emission surrounding the peak.
(2) A line of pixels parallel to the first pixel axis, and passing through the peak, is then considered. The line is followed away from the peak, in both directions, until pixels are encountered which are flagged as edge pixels. All the pixels along this line between the two edge pixels are assigned the clump identifier associated with the central peak.
(3) For each such pixel, another line of pixels parallel to the second pixel axis and passing through the pixel is considered. The line is followed away from the pixel, in both directions, until edge pixels are encountered. All the pixels along this line between the two edge pixels are also assigned the clump identifier associated with the central peak.
(4) For each such pixel, another line of pixels parallel to the third pixel axis and passing through the pixel is considered. The line is followed in the same way, and the enclosed pixels are likewise assigned the clump identifier associated with the central peak.

The above process will fill the volume between the edges, but may miss some pixels (e.g. if the initial line parallel to the first pixel axis spans the clump at an unusually narrow point). In order to alleviate this effect, the above process is repeated, but scanning the pixels axes in a different order (2,3,1 instead of 1,2,3). For 3D data, it is repeated a third time using the axis order (3,1,2).

Even so, it should be recognised that this filling algorithm will fail to fill certain shapes entirely. For instance, “S”-shaped clumps could not be filled completely using this algorithm.
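
The first two passes of the filling process can be sketched as follows for the 2D case (illustrative Python only; names are this sketch's own, `edges` is the cleaned edge mask, and `out` receives the clump identifiers):

```python
# Edge-bounded filling (2D case, axis order: rows first, then columns).

def span(edges_1d, start):
    """Pixels from `start` outwards in both directions, stopping just
    before an edge pixel (or the end of the line) is met."""
    lo = start
    while lo - 1 >= 0 and not edges_1d[lo - 1]:
        lo -= 1
    hi = start
    while hi + 1 < len(edges_1d) and not edges_1d[hi + 1]:
        hi += 1
    return range(lo, hi + 1)

def fill_clump(edges, peak, ident, out):
    """Fill the region around `peak` with `ident`: first along the line
    through the peak, then perpendicular lines through each filled pixel."""
    px, py = peak
    row = [edges[px][y] for y in range(len(edges[0]))]
    for y in span(row, py):
        out[px][y] = ident
        col = [edges[x][y] for x in range(len(edges))]
        for x in span(col, px):
            out[x][y] = ident
```

Repeating the process with the axis order reversed (and, for 3D data, with the third cyclic order) catches pixels missed when the initial line happens to cross the clump at a narrow point.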

##### 2.4.4 Cleaning up the Filled Clumps

The use of cellular automata to clean up the edges reduces the likelihood of “holes” within the clump edges, but does not eliminate this risk entirely. When the clump-filling algorithm described above encounters a hole in the edges surrounding a peak, the clump identifier value will “leak out” through the hole into the surrounding areas. This is where the limitations of the filling algorithm have a positive benefit, in that they prevent the leak from spreading round corners without limit. Instead, such leaks will tend to produce straight features radiating out from a clump parallel to a pixel axis, which will terminate as soon as they meet another edge.

It is thus possible for two or more clumps to “claim” a given pixel. This will happen if there are holes in the edges surrounding the peaks which allow the filling process to leak out. In this case, each pixel is assigned to the clump associated with the nearest peak.

Another cellular automaton is used once the filling process has been completed to reduce the artifacts created by these leaks. This cellular automaton replaces each clump identifier by the most commonly occurring clump identifier within a 3×3×3 cube (or 3×3 square for 2D data) of neighbours. This process can be repeated to increase the degree of cleaning, by assigning a value greater than one to the configuration parameter Reinhold.FixClumpsIterations.

The results of this cleaning process are the final clump allocations for every data pixel, from which the catalogue of clump parameters is produced.

[1] This can happen, for instance, when neighbouring peaks are present within the area of pixels used to fit a central peak.