This section lists and describes the clump identification algorithms which are implemented within the CUPID FINDCLUMPS command.
Note, it is assumed that any background extended emission has been removed from the data before using FINDCLUMPS. This can be done using the FINDBACK command.
This algorithm was developed specifically for CUPID to address some of the problems associated with ClumpFind. It is fully described in Berry (2015, Astronomy & Computing, vol. 10, pp. 22-31).
It is most simply described assuming the data array is 2-dimensional, although it applies in a similar manner to 3-dimensional data. Its name was chosen to suggest a parallel between a contour map of a 2D astronomical data array, and the height contours seen in a geographical map of a hilly area, such as used by most fell-walkers. The algorithm used to assign a data pixel to a clump of emission can be compared to that of a fell-walker ascending a hill by following the line of steepest ascent (not perhaps the most sensible way to ascend a hill in practice, but one which lends some verisimilitude to the present algorithm!).
The algorithm considers every data pixel in the supplied array in turn as a potential start for a “walk”
to a neighbouring peak. Pixels which are below a nominated background level (specified by
FellWalker.Noise) are ignored and are flagged as not belonging to any
emission clump. These are skipped over, as are pixels which have already been assigned to a clump.
Once a pixel is found that has not yet been assigned to a clump and is above the background level, the
algorithm proceeds to trace out a route from this pixel to a neighbouring peak. It does this in a series
of steps (comparable to the steps of a fell-walker). At each step the algorithm looks at the
3×3 square of neighbouring pixels (a 3×3×3 cube for 3D data), and selects the neighbour which would produce the highest gradient for the next
step. The algorithm then moves to that pixel and proceeds with the next step.
Eventually, this algorithm will reach a local maximum; a point from which all routes go down-hill. But
this may be merely a noise spike, rather than a significant peak, and so a check is made over a more
extended neighbourhood to see if a pixel with a higher data value than the current pixel can be found
(the extent of this extended neighbourhood is specified by the configuration parameter
FellWalker.MaxJump). If so, the algorithm “jumps” to the pixel with the highest value in the extended
neighbourhood, and then proceeds as before to walk up-hill. If no pixel with a higher value is found
within the extended neighbourhood, the pixel is designated as a significant peak, and is assigned a
unique integer identifier. This integer is used to identify all pixels which are within the clump of
emission containing the peak, and all pixels which were visited along the walk are assigned this identifier.
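The extended-neighbourhood check can be sketched as follows (an illustration only; `maxjump` stands in for the FellWalker.MaxJump parameter, expressed in pixels):

```python
import numpy as np

def check_local_max(data, pos, maxjump):
    """At a local maximum, search the surrounding (2*maxjump+1)-pixel square
    for a higher value; return the position to "jump" to, or None if `pos`
    qualifies as a significant peak. Illustrative sketch, 2D case."""
    y, x = pos
    ny, nx = data.shape
    ylo, yhi = max(0, y - maxjump), min(ny, y + maxjump + 1)
    xlo, xhi = max(0, x - maxjump), min(nx, x + maxjump + 1)
    region = data[ylo:yhi, xlo:xhi]
    iy, ix = np.unravel_index(np.argmax(region), region.shape)
    best = (int(ylo + iy), int(xlo + ix))
    return best if data[best] > data[pos] else None
```

A noise spike thus causes a jump towards the nearby true peak, while a genuine peak (with no higher pixel within MaxJump pixels) is accepted.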
If, in the process of walking up-hill, a pixel is visited which has already been assigned to a clump, then the walk is terminated at that point and all the pixels so far visited are assigned to the same clump.
In some cases, the initial part of a walk may be over very flat “terrain”. The significant part of a walk is
considered to start when the average gradient (taken over a 4-step length) first reaches the value of FellWalker.FlatSlope. Any pixels visited prior to this point are deemed not to be in any clump. However, this only applies if the walk starts at or close to “sea level”. For walks which start from a higher level (i.e. from a pixel which has a data value greater than the selected background level plus twice the RMS noise level), the entire length of the walk is used, including any initial flat section.
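The significant-start test might be sketched like this, where `values` holds the data values visited along a walk and `flatslope` stands in for FellWalker.FlatSlope (an illustration, not CUPID code):

```python
def walk_start(values, flatslope, nstep=4):
    """Return the index along a walk at which the mean gradient over `nstep`
    steps first reaches `flatslope`; pixels before this index are deemed not
    to belong to any clump. Returns None if the slope is never reached."""
    for i in range(len(values) - nstep):
        # Average gradient over the next nstep steps of the walk.
        if (values[i + nstep] - values[i]) / nstep >= flatslope:
            return i
    return None
```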
Once all pixels in the data array have been considered as potential starts for such a walk, an array will
have been created which holds an integer clump identifier for every data pixel. To reduce the effect of
noise on the boundaries between clumps, a cellular automaton can be used to smooth the boundaries. This replaces each clump identifier by the most commonly occurring value within a 3×3 square (or 3×3×3 cube for 3D data) of neighbours. The number of times this cleaning process is applied is specified by the FellWalker.CleanIter configuration parameter.
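A single smoothing pass of this kind can be sketched as follows (an illustration; CUPID's handling of ties and array edges may differ):

```python
import numpy as np
from collections import Counter

def smooth_boundaries(labels):
    """One pass of the boundary-smoothing cellular automaton: each clump
    identifier is replaced by the most common identifier within its 3x3
    neighbourhood (illustrative sketch, 2D case)."""
    out = labels.copy()
    ny, nx = labels.shape
    for y in range(ny):
        for x in range(nx):
            # Neighbourhood includes the pixel itself; it is clipped at edges.
            neigh = labels[max(0, y - 1):y + 2, max(0, x - 1):x + 2].ravel()
            out[y, x] = Counter(neigh.tolist()).most_common(1)[0][0]
    return out
```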
If the high data values in a clump form a plateau with slight undulations, then the above
algorithm may create a separate clump for each undulation. This is probably inappropriate,
especially if the dips between the undulations are less than or are comparable to the noise level
in the data. This situation can arise for instance if the pixel-to-pixel noise is correlated
on a scale equal to or larger than the value of the
MaxJump configuration parameter. To
avoid this, adjoining clumps are merged together if the dip between them is less than a
specified value. Specifically, if two clumps with peak values PEAK1 and PEAK2, where
PEAK1 is less than PEAK2, are adjacent to each other, and if the pixels along the interface
between the two clumps all have data values which are larger than “PEAK1 - MinDip” (where MinDip is the value of the FellWalker.MinDip configuration parameter), then the two clumps are merged into a single clump.
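The merging test can be sketched as follows, where `interface_values` holds the data values of the pixels along the interface between the two clumps (an illustration, not CUPID code):

```python
def should_merge(peak1, peak2, interface_values, mindip):
    """FellWalker MinDip test (sketch): two adjacent clumps with peak values
    peak1 and peak2 are merged if every pixel on the interface between them
    has a data value larger than min(peak1, peak2) - mindip."""
    return all(v > min(peak1, peak2) - mindip for v in interface_values)
```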
The results of this merging process are the final clump allocations for every data pixel, from which the catalogue of clump parameters is produced.
This is based on the algorithm described by Stutzki & Güsten (1990, ApJ 356, 513). This algorithm proceeds by fitting a Gaussian profile to the brightest peak in the data. It then subtracts the fit from the data and iterates, fitting a new Gaussian to the brightest peak in the residuals. This continues until any one of the “termination criteria” described below is satisfied. Each fitted Gaussian is taken to be a single clump and is added to the output catalogue. In this algorithm, clumps may overlap, and for this reason each input pixel cannot simply be assigned to a single clump (as can be done for algorithms such as FellWalker or ClumpFind). Therefore, when using GaussClumps, the primary output NDF from FINDCLUMPS does not hold the clump index at each input pixel. Instead it holds the sum of all the fitted Gaussians that contribute to each input pixel position.
Any input variance component is used to scale the weight associated with each pixel value when
performing the Gaussian fit. The most significant configuration parameters for this algorithm are:
GaussClumps.FwhmBeam and GaussClumps.VeloRes, which determine the minimum clump size, and GaussClumps.Thresh, which (together with the ADAM parameter RMS) determines the third of the termination criteria listed below.
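The overall fit-and-subtract iteration can be illustrated with the following highly simplified 1-D sketch. Unlike the real algorithm, it assumes a fixed, known Gaussian width rather than fitting all the Gaussian parameters, and it implements only a simple peak-threshold termination criterion:

```python
import numpy as np

def gaussclumps_1d(data, rms, thresh=2.0, sigma=2.0, max_clumps=10):
    """Simplified 1-D sketch of the GaussClumps iteration: repeatedly take
    the brightest residual peak, model it as a Gaussian of fixed width
    `sigma` (a real implementation fits all the Gaussian parameters),
    subtract the model, and stop when the peak falls below thresh*rms."""
    residual = data.astype(float)
    x = np.arange(residual.size)
    clumps = []
    for _ in range(max_clumps):
        centre = int(np.argmax(residual))
        amp = residual[centre]
        if amp < thresh * rms:          # termination criterion
            break
        model = amp * np.exp(-0.5 * ((x - centre) / sigma) ** 2)
        clumps.append((centre, amp))
        residual -= model               # fit is removed before iterating
    return clumps, residual
```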
Note, this implementation of the GaussClumps algorithm is a completely independent re-write, and includes some differences from other GaussClumps implementations. Specifically, these include the following modifications.
- The iterative fitting process terminates when the number of consecutive fitted peaks with values below the threshold implied by GaussClumps.Thresh reaches the value of configuration parameter GaussClumps.NPad (the final group of NPad clumps are not included in the returned list of usable clumps).
- Each fitted peak value is required to be within GaussClumps.NSigma times the standard deviation of the previous GaussClumps.NPeak fitted peaks. This restriction is only imposed once GaussClumps.NPeak peaks have been fitted.
This algorithm was developed by Jonathan Williams and is described fully in Williams, de Geus & Blitz (1994, ApJ 428, 693).
Briefly, it contours the data array at many different levels, starting at a value close to the peak value in the array and working down to a specified minimum contour level. At each contour level, all contiguous areas of pixels that are above the contour level are found and considered in turn. If such a set of pixels includes no pixels that have already been assigned to a clump (i.e. have already been identified at a higher contour level), then the set is marked as a new clump. If the set includes some pixels that have already been assigned to a clump, then, if all such pixels belong to the same clump, that clump is extended to include all the pixels in the set. If the set includes pixels that have already been assigned to two or more clumps, then the new pixels in the set are shared out between the two or more existing clumps. This sharing is done by assigning each new pixel to the closest clump. Note, this is based on the distance to the nearest pixel already assigned to the clump, not the distance to the central or peak pixel in the clump. The above paper refers to this as a “friends-of-friends” algorithm.
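The pixel-sharing rule can be sketched as follows (an illustration only; real implementations work on whole pixel masks rather than coordinate lists):

```python
def assign_shared_pixels(new_pixels, clump_pixels):
    """Sketch of ClumpFind's pixel sharing: each new pixel is assigned to
    the clump containing the nearest already-assigned pixel (the distance
    to the nearest member pixel, not to the clump's peak or centre)."""
    result = {}
    for p in new_pixels:
        best_clump, best_d2 = None, float("inf")
        for clump_id, members in clump_pixels.items():
            for q in members:
                d2 = (p[0] - q[0]) ** 2 + (p[1] - q[1]) ** 2
                if d2 < best_d2:
                    best_clump, best_d2 = clump_id, d2
        result[p] = best_clump
    return result
```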
This process continues down to the lowest specified contour level, except that new clumps found at the lowest contour level are ignored. However, clumps found at higher contour levels are followed down to the lowest specified contour level.
The CUPID implementation of ClumpFind is a completely independent re-write, based on the description of the algorithm in the 1994 Williams, de Geus & Blitz paper. Consequently, it differs slightly from other implementations, such as the IDL implementation available from Williams' own web page at http://www.ifa.hawaii.edu/~jpw/. In particular, several extra configuration parameters have been added by CUPID to provide more detailed control of the algorithm. If you want to do a direct comparison between the CUPID implementation of the ClumpFind algorithm and another implementation, then you should set these extra parameters to the following values:
ClumpFind.MinPix = 6 (for 3D data; use 20 for 2D data)
You should also note that the pixel co-ordinate systems used by the two implementations differ, and consequently the positions reported for the clump peaks will also differ. The IDL implementation of ClumpFind uses a pixel co-ordinate system in which the first pixel (i.e. the bottom-left pixel of a 2D image displayed “normally”) is centred at (0,0). This differs from both the FITS and NDF definitions of pixel co-ordinates. In FITS, the centre of the first pixel in the array has a value of 1.0 on every pixel axis. In NDF, the centre of the first pixel has a value of (p - 0.5) on each pixel axis, where p is an attribute of the NDF known as the “pixel origin”. For an NDF which has been derived from a FITS file and which has been subjected to no geometric transformations, the pixel origin will be 1.0 on every pixel axis, resulting in the centre of the first pixel having a value of 0.5 on every pixel axis.
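For example, a peak position reported by the IDL implementation can be converted to the NDF convention with a simple offset (assuming a pixel origin of 1.0):

```python
def idl_to_ndf_pixel(idl_coord, pixel_origin=1.0):
    """Convert a pixel position reported by the IDL ClumpFind (first pixel
    centred at 0.0) to the NDF convention, where the first pixel is centred
    at (pixel_origin - 0.5). With the default origin of 1.0, the first
    pixel centre is 0.5, as described in the text."""
    return idl_coord + (pixel_origin - 0.5)
```

So an IDL peak position of 0.0 corresponds to 0.5 in NDF pixel co-ordinates (the equivalent FITS co-ordinate is simply the IDL value plus 1.0).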
Some implementations of ClumpFind do not conform exactly to the description in the published paper. Specifically:
CUPID provides an option to use either the published algorithm, or the algorithm implemented by the
IDL package available from Jonathan Williams' WWW site (as available at 28th April 2006). Setting
the configuration parameter
ClumpFind.IDLAlg to a non-zero value will cause the IDL
implementation to be used. The default value of zero causes the published algorithm to be used.
This algorithm was developed by Kim Reinhold whilst working at the Joint Astronomy Centre in Hilo, Hawaii. Its overall strategy is first to identify pixels within the data array which mark the edges of clumps of emission. This typically produces a set of rings (in 2D), or shells (in 3D), outlining the clumps. However, these structures can be badly affected by noise in the data and so need to be cleaned up. This is done using a sequence of cellular automata which first dilate the rings or shells, and then erode them. After cleaning, all pixels within each ring or shell are assumed to belong to a single clump.
The initial identification of the edges is done by considering a set of 1-dimensional profiles through the data array. Each such profile can be represented by a plot of data value against distance (in pixels) along the profile. For each such profile, the algorithm proceeds as follows.
If the peak value is below the background level (specified by Reinhold.Noise), then there are no remaining significant peaks in the profile, so the algorithm continues with the next profile.
Reinhold.FlatSlope, or iv) the end of the profile is reached.
This algorithm is applied to each 1D profile in turn. These profiles are divided into groups of parallel profiles; the first group contains profiles which are parallel to the first pixel axis, the second group contains profiles which are parallel to the second pixel axis, etc. There are also groups which contain parallel profiles which proceed diagonally through the data array. Thus there is a group of parallel profiles for every pixel axis and for every possible diagonal direction. Within each group, there are sufficient profiles to ensure that every element in the data array is included in a profile within the group.
Once all profiles have been processed, a 2- or 3-dimensional array is available that identifies both the edges of the peaks and also the peak positions themselves. Pixels which are flagged as peaks are only retained if the pixel was found to be a peak in every profile group. That is, pixels which appear as a peak when profiled in one direction, but not when profiled in another, are discarded.
The initial identification of clump edges results in a mask array in which each data pixel is marked as either an edge pixel or a peak pixel or neither. Usually, the edge pixels can be seen to follow the outline of the visible clumps, but will often be badly affected by noise in the data. For instance, there may be holes in the edges surrounding a peak, or spurious pixels may have been marked as edge pixels. Before continuing, it is necessary to reduce the effect of this noise. This is done in the two steps described below.
If the number of edge pixels within the neighbourhood of a central pixel exceeds a threshold (specified by Reinhold.CAThresh), the central pixel is marked as an edge pixel. If the number of neighbouring edge pixels is equal to or below this threshold, the central pixel is not marked as an edge pixel. This transformation can be applied repeatedly to increase the amount of erosion by setting a value greater than one for the configuration parameter Reinhold.CAIterations.
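One dilation pass followed by one erosion pass might be sketched as follows (an illustration only; `cathresh` stands in for Reinhold.CAThresh, and CUPID's exact neighbourhood rules may differ):

```python
import numpy as np

def dilate_then_erode(edges, cathresh=3):
    """Sketch of the Reinhold edge-cleaning automata: one dilation pass
    marks any pixel adjacent to an edge pixel, then one erosion pass keeps
    a pixel as an edge only if more than `cathresh` of its 3x3 neighbours
    are edge pixels."""
    def neighbour_counts(mask):
        # Count edge pixels in each pixel's 3x3 neighbourhood (excluding it).
        ny, nx = mask.shape
        counts = np.zeros_like(mask, dtype=int)
        for y in range(ny):
            for x in range(nx):
                block = mask[max(0, y - 1):y + 2, max(0, x - 1):x + 2]
                counts[y, x] = int(block.sum()) - int(mask[y, x])
        return counts
    dilated = edges | (neighbour_counts(edges) > 0)
    eroded = neighbour_counts(dilated) > cathresh
    return eroded
```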
Once the edges have been cleaned up, the volume contained within the edges can be filled with an integer which identifies the associated peak. This algorithm proceeds as follows.
The above process will fill the volume between the edges, but may miss some pixels (e.g. if the initial line parallel to the first pixel axis spans the clump at an unusually narrow point). In order to alleviate this effect, the above process is repeated, but scanning the pixel axes in a different order (2,3,1 instead of 1,2,3). For 3D data, it is repeated a third time using the axis order (3,1,2).
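A single 1-D scan of the kind described above might be sketched as follows. This is a deliberately simplified illustration: it fills everything between the first and last edge pixels on the line, whereas the real algorithm must handle multiple clumps and peak identifiers:

```python
import numpy as np

def fill_row(edge_mask, clump_id):
    """Assign `clump_id` to the pixels lying strictly between the first and
    last edge pixels on a 1-D scan line, leaving pixels outside the edges
    unassigned (0). Illustrative sketch only."""
    ids = np.zeros(edge_mask.size, dtype=int)
    edge_pos = np.flatnonzero(edge_mask)
    if edge_pos.size >= 2:
        ids[edge_pos[0] + 1:edge_pos[-1]] = clump_id
    return ids
```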
Even so, it should be recognised that this filling algorithm will fail to fill certain shapes entirely. For instance, “S”-shaped clumps could not be filled completely using this algorithm.
The use of cellular automata to clean up the edges reduces the likelihood of “holes” within the clump edges, but does not eliminate this risk entirely. When the clump-filling algorithm described above encounters a hole in the edges surrounding a peak, the clump identifier value will “leak out” through the hole into the surrounding areas. This is where the limitations of the filling algorithm have a positive benefit, in that they prevent the leak from spreading round corners without limit. Instead, such leaks will tend to produce straight features radiating out from a clump parallel to a pixel axis, which will terminate as soon as they meet another edge.
It is thus possible for two or more clumps to “claim” a given pixel. This will happen if there are holes in the edges surrounding the peaks which allow the filling process to leak out. In this case, each pixel is assigned to the clump associated with the nearest peak.
Once the filling process has been completed, another cellular automaton is used to reduce the artifacts created by these leaks. This automaton replaces each clump identifier by the most commonly occurring clump identifier within a 3×3×3 cube (or 3×3 square for 2D data) of neighbours. This process can be repeated to increase the degree of cleaning, by assigning a value greater than one to the configuration parameter Reinhold.FixClumpsIterations.
The results of this cleaning process are the final clump allocations for every data pixel, from which the catalogue of clump parameters is produced.
[1] This can happen, for instance, when neighbouring peaks are present within the area of pixels used to fit a central peak.