20 Pairing two catalogues

 20.1 Requirements
 20.2 Running catpair
 20.3 Pairing criteria
 20.4 Rows in the output catalogue
 20.5 Multiple matches
 20.6 Pairing algorithm

catpair is provided to identify ‘corresponding’ objects in two catalogues; objects are considered to correspond if they have similar positions. An output catalogue is generated from the list of corresponding objects.

In astronomical catalogues the ‘corresponding’ rows in two catalogues are usually rows which contain data for the same astronomical object. Traditionally in relational database systems corresponding rows are identified by having identical values for some field, such as a name. For example, two rows might be considered to correspond if a name field in both catalogues adopted the value ‘NGC 1305’ for both rows. This operation is usually called joining the two catalogues.


            +------------------------------------------------------+
            |         x                    *                 *     |
            |  x                *                *                 |
            |                         x                x*          |
            |           x                  *                  *    |
            |    x          x*       *             *               |
       Dec. |                                           *          |
            |        x*            x         x*                    |
            |                                      *          *    |
            |   x        x*  x         *                           |
            |                                 *       x*           |
            |      x       x      x                          *     |
            |                             *                        |
            +------------------------------------------------------+
  
                                      R.A.
x - Object in primary dataset.
* - Object in secondary dataset.

Adjacent objects are pairs.


Figure 7: Two datasets for joining


In astronomical problems such joining by an exact match is relatively uncommon. A more common case is where corresponding objects are identified by similar positions in both catalogues. This situation is illustrated in Figure 7. The important point here is that, essentially because of measurement errors, the corresponding positions are merely similar, not an exact match. This circumstance makes establishing corresponding rows a much more complicated and problematic process. In practice the positions used are almost always some type of two-dimensional coordinates: usually celestial coordinates such as Right Ascension and Declination, or possibly Cartesian coordinates of some sort. In principle one-, three- or higher-dimensional coordinates could be used, though they are not important in practice. catpair only supports joining based on two-dimensional coordinates, though the coordinates may be either Cartesian or spherical-polar.

In CURSA this special sort of join based on an approximate match in two-dimensional coordinates is called pairing. Thus, in this usage, pairing is a special case of joining catalogues, albeit one which is important in astronomical practice.

catpair operates on two input catalogues, known as the primary and secondary catalogues. To fix ideas, think of the primary as being a small list of target objects which you have compiled, and the secondary as being a standard catalogue, such as the SAO star catalogue, one of the Durchmusterungen or Dreyer’s New General Catalogue of non-stellar objects. The final result of the pairing is a new catalogue containing the paired objects; the output catalogue.

If you wish to pair several catalogues to create a single output catalogue you should invoke catpair several times, creating intermediate paired catalogues as appropriate.

Pairing is a relatively complicated process and you must answer several prompts to fully specify the operations to be performed. The following two sections, ‘Requirements’ and ‘Running catpair’ respectively describe the requirements for catpair and how to run it. You should read at least these two sections. Subsequent sections describe various aspects of the pairing process in greater detail. While it is not strictly necessary to read these latter sections, they may help you to understand what catpair is doing and hence to use it more effectively.

20.1 Requirements

Obviously before running catpair you must have a primary and a secondary catalogue. The secondary catalogue must be sorted on the second column that is to be used for the pairing (usually this will be the y or Declination coordinate). If your secondary is not sorted in this way then use catsort (see Section 15, above) to create a suitably sorted secondary catalogue.

You need to know the names of the columns in both catalogues which contain the coordinates which are to be used for the pairing (and whether they are Cartesian or spherical-polar coordinates). If you are in doubt about the columns in the catalogues use catheader (see Section 13, above) to obtain the details. If the coordinates are Cartesian then the coordinates in both input catalogues must be in the same system, with the same units (see footnote 14), zero point and orientation. That is, a given value for the coordinates (say 23.5, 105.7) should correspond to the same position in both catalogues. If the coordinates are spherical-polar they must always be in units of radians. The coordinates in the two catalogues should be of the same type (equatorial, Galactic etc.) and if they are equatorial they should have the same system, epoch and equinox.

Finally you need to specify the critical distance, D, which determines whether two objects, one in each catalogue, are considered pairs or not. If the actual separation of the two objects is less than or equal to this distance then they are considered pairs; if it is greater then they are not. In catpair this critical distance may be either a constant, a column in the primary (so it varies for different objects in the primary) or an expression based on columns in the primary. In practice the value adopted for the critical distance is often derived from the errors associated with the positions in the catalogues. If you do not already know the errors on the positions in your catalogues, you could consult the textual information associated with the catalogue, which will often contain these details. Again use catheader (see Section 13) to access this information.

20.2 Running catpair

To run catpair simply type:

  catpair

By default catpair writes a summary of the pairing options specified as textual information in the output catalogue. This information is useful documentation of the pairing and you will usually want to retain it. However, you can specify that it is not to be written by specifying an extra item on the command line, as follows:

  catpair  text=none

There must be one or more spaces between ‘catpair’ and ‘text=none’.

catpair has an option to include in the output catalogue three special columns containing additional details for the paired objects. These columns are described in Section 20.2.1, below. By default these additional columns are not created. To include them in the output catalogue type:

  catpair  spcol=true

You must answer a fairly long series of prompts in order to specify the behaviour of catpair. These prompts are listed below, in the order in which they are issued by the program, together with a corresponding explanation. In this list the prompts are identified by the corresponding ADAM parameter name, which appears at the start of the prompt line.

PRIMARY
Enter the name of the primary input catalogue.
SECOND
Enter the name of the secondary input catalogue. This catalogue must be sorted on the second column to be used in the pairing (usually the y or Declination coordinate).
OUTPUT
Enter the name of the output catalogue to contain the set of paired objects. A catalogue with this name must not already exist. catpair will automatically create the output catalogue in toto.
CRDTYP
(default = ‘S’) Specify the type of coordinates which are to be used for the pairing. The possibilities are either Cartesian coordinates (‘C’) or celestial spherical-polar coordinates (‘S’) such as Right Ascension and Declination.
PCRD1
Enter the name of the column in the primary catalogue containing the first column to be used in the pairing. This column will usually be an x coordinate or a Right Ascension.
PCRD2
Enter the name of the column in the primary catalogue containing the second column to be used in the pairing. This column will usually be a y coordinate or a Declination.
SCRD1
Enter the name of the column in the secondary catalogue containing the first column to be used in the pairing. This column will usually be an x coordinate or a Right Ascension.
SCRD2
Enter the name of the column in the secondary catalogue containing the second column to be used in the pairing. This column will usually be a y coordinate or a Declination. The secondary catalogue must be sorted on this column.
PDIST
Enter the critical distance determining whether two objects, one in each catalogue, are considered pairs or not. If the actual separation of the two objects is less than or equal to this distance then they are considered pairs; if it is greater then they are not. In the simplest case this critical distance is a simple numeric value, such as twenty-three minutes of arc, constant for all the objects in the catalogues. However, it may also be a column in the primary catalogue (but not a column in the secondary) or an expression involving columns in the primary (see Section 20.3, below).

If the pairing coordinates are Cartesian then a constant critical distance would typically be specified as a simple decimal number, for example ‘23.0’. However, if they were celestial coordinates then it could be specified as any of the forms in which an angle can be input: a floating point number in radians, or a sexagesimal value in hours or degrees. In addition a special format is available in catpair in which the separation is given as a floating point number expressed in seconds of arc, immediately followed by the string ‘arcsec’. For example, a separation of twenty-three minutes of arc could be entered as any of the following values:

+00:23:00     (sexagesimal degrees)
1380.0arcsec     (seconds of arc)
00:01:32.00     (sexagesimal hours)
6.6904288E-3     (radians)

Note that the sign is necessary in the value in sexagesimal degrees to ensure that the value is interpreted as degrees, not hours. The examples in sexagesimal hours and radians are not particularly sensible here.
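
The equivalence of these forms can be checked with a little arithmetic. The following Python sketch is not part of CURSA; the function name arcmin_to_forms is purely illustrative. It converts an angle in minutes of arc into seconds of arc, degrees, radians and seconds of time (one second of time corresponds to fifteen seconds of arc).

```python
import math


def arcmin_to_forms(arcmin):
    """Express an angle given in minutes of arc in several other units."""
    arcsec = arcmin * 60.0            # seconds of arc
    degrees = arcmin / 60.0           # decimal degrees
    radians = math.radians(degrees)   # radians
    time_seconds = arcsec / 15.0      # seconds of time (hours-based form)
    return {
        "arcsec": arcsec,
        "degrees": degrees,
        "radians": radians,
        "time_seconds": time_seconds,
    }


forms = arcmin_to_forms(23.0)
for name, value in forms.items():
    print(name, value)
```

For twenty-three minutes of arc this gives 1380 seconds of arc and 92 seconds of time, that is, about one minute thirty-two seconds in the sexagesimal hours form.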

PRTYP
(default = ‘C’) Select the ‘type of pairing’ required, that is specify which set of rows from the two input catalogues are to be retained in the output catalogue. Briefly, the options are:
C
(COMMON) retain only the common or paired rows in the two catalogues,
M
(MOSAIC) retain all the rows in the primary and the unpaired rows in the secondary,
P
(PRIMARY) retain all the rows in the primary (for unpaired objects columns copied from the secondary are set to null),
R
(PRIMREJ) retain only the unpaired rows in the primary,
A
(ALLREJ) retain the unpaired rows in both the primary and the secondary.

These options are described in greater detail in Section 20.4, below.

MULTP
(default = ‘yes’) Specify how multiple matches in the primary are to be handled. The options are either to retain the single closest match or to retain all the matches. The treatment of multiple matches is described in detail in Section 20.5, below.
MULTS
(default = ‘no’) Specify how multiple matches in the secondary are to be handled. The options are either to retain the single closest match or to retain all the matches. The treatment of multiple matches is described in detail in Section 20.5, below.
ALLCOL
(default = ‘yes’) Specify the set of columns to be retained in the output catalogue. The options are to either retain all the columns from both input catalogues or to retain specified columns from either input catalogue. If you are in doubt you should retain all the columns. This alternative is the ‘safest’ and simplest, though it may result in the output catalogue containing columns which you do not need and consequently using more disk space than is strictly necessary.

If you choose to retain all the columns they are simply copied automatically from the input catalogue, without further intervention on your part. However, if you choose to specify the columns to retain you will subsequently be prompted for the names of the columns to be retained (and hence you must be prepared with this information). The details of specifying named input columns are described in Section 20.2.2, below.

If you choose to retain all the columns, the columns created in the output catalogue will have the same names (and other attributes) as the corresponding columns in the input catalogue. However, in the case where identically named columns in the primary and secondary catalogues would cause the output catalogue to contain two identically named columns, the names of the columns in the output catalogue are disambiguated by appending ‘_S’ to the name of the column originating in the secondary.
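
The renaming rule can be illustrated with a short Python sketch. It is not taken from the catpair source: the function name output_column_names is hypothetical, and the ordering of the output columns (primary columns first) is an assumption made only for the illustration.

```python
def output_column_names(primary_cols, secondary_cols):
    """Column names in the output when all columns are retained.

    A secondary column whose name clashes with a primary column
    has '_S' appended; all other names are copied unchanged.
    """
    taken = set(primary_cols)
    names = list(primary_cols)
    for name in secondary_cols:
        names.append(name + "_S" if name in taken else name)
    return names


print(output_column_names(["X", "Y", "MAG"], ["X", "Y", "NAME"]))
```

Here the columns X and Y occur in both catalogues, so the copies originating in the secondary appear as X_S and Y_S.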

PRMPAR
(default = ‘yes’) Specify whether the parameters of the primary are to be copied to the output catalogue.
SECPAR
(default = ‘no’) Specify whether the parameters of the secondary are to be copied to the output catalogue.
PTEXT
(default = ‘C’) Specify what textual information associated with the primary is to be copied to the output catalogue. The options are: ‘A’ - all, ‘C’ - comments and history only and ‘N’ - none.
STEXT
(default = ‘N’) Specify what textual information associated with the secondary is to be copied to the output catalogue. The options are: ‘A’ - all, ‘C’ - comments and history only and ‘N’ - none.

20.2.1 Special columns

If catpair is invoked with the option spcol=true then three special columns giving details of the pairing for each object will be included in the output catalogue. These columns are:

SEPN
the separation between the paired primary and secondary objects,
PMULT
the number of matches in the primary,
SMULT
the number of matches in the secondary.

Usually fields in columns PMULT and SMULT will have a value of one for paired objects. However, in cases where there were multiple matches for the pair the values will be larger. See Section 20.5, below for a discussion of the handling of multiple matches.

20.2.2 Retaining specified columns

If you choose to retain in the output catalogue only some of the columns in the two input catalogues you will be prompted to supply the names of the columns required and hence you must be prepared with this information. If you are not familiar with the details of the columns in your input catalogues you can use catheader (see Section 13, above) to obtain the necessary information.

Once you have indicated that you are to retain only specified columns (by replying ‘NO’ to prompt ALLCOL) you will be prompted to enter the names of columns to be retained from the primary catalogue. Type the name of the first column required then hit return. For example to retain column X simply type:

  X

A corresponding column with the same name and other attributes will be created in the output catalogue. Columns may also be retained with a name in the output catalogue which differs from the name of the corresponding input column. In this case you type: the name of the input column, a right chevron and the name required for the new output column. For example, if the column was called X in the input catalogue and X_PRIM in the output catalogue you would type:

  X > X_PRIM

An arbitrary number of spaces may appear on either side of the right chevron. A column with the specified new name will be created in the output catalogue, and all its other attributes will be the same as those of the corresponding column in the input catalogue.

Continue in this fashion until you have entered all the columns required from the primary. Then type:

  END

Next you will be prompted for the names of the columns required from the secondary. Proceed exactly as for the primary and again type END when you have finished.

If you are retaining a large number of columns it is inconvenient (and, indeed, error-prone) to have to supply all the column names interactively in response to prompts. In this case it is much more convenient to run catpair from a script, and I strongly recommend that you do so. This option is described in Section 20.2.3, below.

The handling of multiple columns with the same name in the output catalogue is rather different when column names are being specified than when all the columns are being copied automatically. A single column with the specified name is created in the output catalogue and values for all the appropriate columns in the input catalogue are written to the field of this column for the current row. This behaviour is adopted because there are cases, particularly in MOSAIC and ALLREJ pairing, where you might want fields for corresponding columns in the two input catalogues to be written to a single column in the output catalogue. In the case where fields are available from both the primary and secondary catalogues it is always the field from the secondary which is retained.

20.2.3 Running from a script

Often it is more convenient to run catpair from a prepared script rather than answering the prompts interactively. This end is simply achieved using Unix’s input redirection mechanism. Simply enter the responses to the various prompts into a text file, in the correct order, using a text editor. Then type:

  catpair  < script_file

where script_file is the name of the file you have created. Figure 8 shows an annotated example catpair script for pairing with Cartesian coordinates. This script is available as file:

  /star/share/cursa/catpair_cart.script

An example script showing pairing with spherical-polar coordinates is available as file:

  /star/share/cursa/catpair_sphplr.script

It may be convenient to use these scripts as starting points for developing your own scripts.


prim              primary catalogue
sec               secondary catalogue
out               output catalogue
C                 the pairing coordinates are Cartesian
X                 column with x-coordinate for pairing in the primary
Y                 column with y-coordinate for pairing in the primary
X                 column with x-coordinate for pairing in the secondary
Y                 column with y-coordinate for pairing in the secondary
10.0              the critical distance
C                 COMMON pairing
Y                 include all the primary multiple matches
Y                 include all the secondary multiple matches
N                 specify the columns to retain
X              }
Y              }  columns retained from the primary
ROW            }
END               end of list of columns from the primary
X > X_SEC      }  columns retained from the secondary
Y > Y_SEC      }  (note the renaming of these columns)
ROW > ROW_SEC  }
END               end of list of columns from the secondary
Y                 include primary parameters
N                 exclude secondary parameters
N                 exclude primary textual information
N                 exclude secondary textual information

The column on the left (in a courier font) shows the entries in a catpair script file. The column on the right (in a roman font) briefly describes the corresponding entry.


Figure 8: An annotated example catpair script
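
If you generate pairing scripts for many catalogues it may be convenient to write the script file from a program. The following Python sketch is an illustration only and is not distributed with CURSA. The names RESPONSES, write_script and run_catpair are hypothetical; the responses are those shown in Figure 8, and running the program assumes that catpair is available on your path and reads its replies from standard input, as described above.

```python
import subprocess

# Responses in the order in which the prompts are issued (values from Figure 8).
RESPONSES = [
    "prim", "sec", "out",          # PRIMARY, SECOND, OUTPUT
    "C",                           # CRDTYP: Cartesian coordinates
    "X", "Y", "X", "Y",            # PCRD1, PCRD2, SCRD1, SCRD2
    "10.0",                        # PDIST: the critical distance
    "C",                           # PRTYP: COMMON pairing
    "Y", "Y",                      # MULTP, MULTS
    "N",                           # ALLCOL: specify the columns to retain
    "X", "Y", "ROW", "END",        # columns retained from the primary
    "X > X_SEC", "Y > Y_SEC", "ROW > ROW_SEC", "END",  # from the secondary
    "Y", "N",                      # PRMPAR, SECPAR
    "N", "N",                      # PTEXT, STEXT
]


def write_script(path, responses=RESPONSES):
    """Write one response per line, in prompt order."""
    with open(path, "w") as script:
        script.write("\n".join(responses) + "\n")


def run_catpair(path):
    """Equivalent of typing 'catpair < script_file' (needs catpair installed)."""
    with open(path) as script:
        return subprocess.run(["catpair"], stdin=script, check=True)
```

Calling write_script("pair.script") followed by run_catpair("pair.script") would then correspond to preparing the file by hand and redirecting it into catpair.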


20.3 Pairing criteria

This section discusses the criteria used to determine whether two objects, one from each of the two input catalogues, ‘correspond’ or pair. The two objects pair if the separation between their two-dimensional positions is less than or equal to some specified critical distance, D. The formulæ differ for Cartesian and celestial coordinates.

20.3.1 Cartesian coordinates

If the two objects have Cartesian coordinates x1, y1 and x2, y2 then the criterion is simply that D should be greater than or equal to the Pythagorean distance between the two points:

$$ D \geq \sqrt{(x_1 - x_2)^2 + (y_1 - y_2)^2} \qquad (7) $$
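
As a minimal illustration, the Cartesian criterion can be written as a short Python function. The name pairs_cartesian is hypothetical and the code is a sketch of the test described above, not an extract from catpair.

```python
import math


def pairs_cartesian(x1, y1, x2, y2, d):
    """True if the two points lie within the critical distance d."""
    return math.hypot(x1 - x2, y1 - y2) <= d


print(pairs_cartesian(0.0, 0.0, 3.0, 4.0, 5.0))   # separation equals d
print(pairs_cartesian(0.0, 0.0, 3.0, 4.0, 4.9))   # separation exceeds d
```

Note that a separation exactly equal to the critical distance counts as a pair.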

20.3.2 Celestial coordinates

If the two objects have celestial spherical-polar coordinates (in practice Right Ascension and Declination) α1, δ1 and α2, δ2 then the criterion is that D should be greater than or equal to the great circle distance between the two positions:

$$ D \geq \arccos\left( \left| \sin\delta_1 \sin\delta_2 + \cos(\alpha_1 - \alpha_2) \cos\delta_1 \cos\delta_2 \right| \right) \qquad (8) $$

Equation 8 is the natural form for the great circle distance, simply derived by applying spherical trigonometry to the two coordinates. In practice it has the disadvantage that because of numerical errors it is inaccurate when the great circle distance is a small angle. There are algebraically equivalent formulations which retain numerical accuracy for small angles. In catpair the great circle distance is calculated using the appropriate SLA routine (see footnote 15), which uses such a formulation in order to ensure accuracy for small angles.
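
One well-known formulation which behaves well for small angles is the haversine form. The Python sketch below uses it purely as an illustration; it is an assumption made for this example and is not necessarily the formulation used by the SLA routine. The function names separation and pairs_celestial are hypothetical, and all angles are in radians.

```python
import math


def separation(a1, d1, a2, d2):
    """Great circle distance in radians, computed with the haversine form."""
    sin_dd = math.sin((d2 - d1) / 2.0)
    sin_da = math.sin((a2 - a1) / 2.0)
    h = sin_dd ** 2 + math.cos(d1) * math.cos(d2) * sin_da ** 2
    # Guard against rounding pushing the argument of asin above one.
    return 2.0 * math.asin(min(1.0, math.sqrt(h)))


def pairs_celestial(a1, d1, a2, d2, d):
    """True if the two positions lie within the critical distance d."""
    return separation(a1, d1, a2, d2) <= d


# Two positions on the same meridian, half of 23 minutes of arc apart.
d_crit = math.radians(23.0 / 60.0)
print(pairs_celestial(1.0, 0.2, 1.0, 0.2 + d_crit / 2.0, d_crit))
```

The arccosine form loses precision for small separations because the cosine of a very small angle is indistinguishable from one in floating point arithmetic; the haversine form works with the sines of half-angles and so avoids that cancellation.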

20.3.3 Cases for the critical distance

The following three cases for the value of the critical distance, D, are supported by catpair.

(1)
It is a constant, for example twenty-three minutes of arc. Any objects in the catalogues correspond if their positions differ by twenty-three minutes of arc or less. Of the various cases this is the simplest.
(2)
It adopts the value of a column in the primary. Typically such a column would be an error associated with the position; objects with a small error would only pair with a nearby object, but objects with a large error would pair with objects further away.
(3)
It adopts a value computed from an expression involving columns in the primary. This case is a generalisation of the preceding one.

A fourth case in which the critical distance is computed from an expression involving columns in both catalogues is not supported in catpair. A special instance of this case which sometimes arises is where both input catalogues have errors in their coordinates which vary with the objects in the catalogues and thus are stored as columns, one in each catalogue. Objects are considered to pair when their error circles overlap. Here the expression for the critical distance, D, would involve columns (containing the errors) from both catalogues and hence this case is not supported.
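
The three supported cases can be pictured as three ways of obtaining D for a given row of the primary. The Python sketch below is illustrative only: the function name critical_distance is hypothetical, a row is represented as a dictionary of column values, and a Python function stands in for a catpair expression (catpair has its own expression syntax, which is not reproduced here).

```python
def critical_distance(row, spec):
    """Critical distance for one primary row.

    spec may be a number (case 1), the name of a column in the
    primary (case 2), or a function of the row (case 3).
    """
    if callable(spec):
        return spec(row)
    if isinstance(spec, str):
        return row[spec]
    return float(spec)


row = {"X": 12.0, "Y": 40.5, "ERR": 2.0}
print(critical_distance(row, 10.0))                       # constant
print(critical_distance(row, "ERR"))                      # column
print(critical_distance(row, lambda r: 3.0 * r["ERR"]))   # expression
```

In every case only the primary row is available when D is evaluated, which is why a critical distance depending on columns in both catalogues cannot be expressed.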

20.4 Rows in the output catalogue


                 Primary           Secondary
            row catalogue          catalogue row
             1   xxxxxxx     +----->XXXXXXX   1
             2   XXXXXXX-----+      xxxxxxx   2
             3   xxxxxxx            xxxxxxx   3
             .   XXXXXXX-----+      xxxxxxx   .
             .   xxxxxxx     +----->XXXXXXX   .
                 xxxxxxx            xxxxxxx
                 XXXXXXX----------->XXXXXXX
                 xxxxxxx     +----->XXXXXXX
                 xxxxxxx     |      xxxxxxx
                 XXXXXXX-----+      xxxxxxx
                 XXXXXXX-----+      xxxxxxx
                 xxxxxxx     |      xxxxxxx
                 xxxxxxx     |      xxxxxxx
                             +----->XXXXXXX
                                    xxxxxxx
                                    xxxxxxx
                                    xxxxxxx

Figure 9: Rows in paired catalogues


Figure 9 illustrates the result of pairing two catalogues, with a set of corresponding rows in the catalogues identified. There are a number of options for the set of rows to be included in an output catalogue generated from such a pairing. The various alternatives available in catpair are described below.

COMMON
(often called the ‘inner join’ in relational database terminology). Only the objects common to both catalogues are retained; that is, only the paired objects are retained. This option might be used when pairing a list of target stars with a standard catalogue.
PRIMARY
(often called the ‘outer join’ in relational database terminology). All the rows in the primary catalogue are retained. For paired objects fields corresponding to the secondary will contain actual values, for unpaired objects they will contain null values. The corresponding case of retaining all the rows in the secondary can be realised by regarding the primary as the secondary and vice versa. This option might also be used when pairing a list of target stars with a standard catalogue.
MOSAIC
The output catalogue contains a row for every paired row in the input catalogues and also a row for every unpaired row in either catalogue. This option is useful for constructing a mosaic of a larger area of sky from several slightly overlapping catalogues.
PRIMREJ
Only the unpaired objects in the primary catalogue are retained. The corresponding case of retaining all the unpaired rows in the secondary can be realised by regarding the primary as the secondary and vice versa. This option might be used in proper motion studies.
ALLREJ
The output catalogue contains a row for all the unpaired objects in either catalogue. This option might also be used in proper motion studies.
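
The five alternatives can be summarised as different selections from the paired and unpaired rows. The following Python sketch is an illustration under stated assumptions rather than a description of the catpair implementation: the function name output_rows is hypothetical, rows are identified by zero-based index, a missing side is represented by None (corresponding to null fields), and the order of the rows in the result is arbitrary.

```python
def output_rows(n_primary, n_secondary, pairs, prtyp):
    """Rows of the output catalogue as (primary_index, secondary_index).

    pairs is a list of (primary_index, secondary_index) tuples;
    prtyp is one of 'C', 'P', 'M', 'R' or 'A'.
    """
    paired_p = {p for p, _ in pairs}
    paired_s = {s for _, s in pairs}
    unpaired_p = [(p, None) for p in range(n_primary) if p not in paired_p]
    unpaired_s = [(None, s) for s in range(n_secondary) if s not in paired_s]

    if prtyp == "C":      # COMMON: paired rows only
        return list(pairs)
    if prtyp == "P":      # PRIMARY: every primary row
        return sorted(list(pairs) + unpaired_p, key=lambda r: r[0])
    if prtyp == "M":      # MOSAIC: pairs plus all unpaired rows
        return list(pairs) + unpaired_p + unpaired_s
    if prtyp == "R":      # PRIMREJ: unpaired primary rows
        return unpaired_p
    if prtyp == "A":      # ALLREJ: unpaired rows from both catalogues
        return unpaired_p + unpaired_s
    raise ValueError("unknown pairing type: " + prtyp)


pairs = [(1, 0), (3, 2)]
for prtyp in "CPMRA":
    print(prtyp, output_rows(4, 5, pairs, prtyp))
```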

20.5 Multiple matches

This section describes how multiple matches are handled by catpair. Multiple matches can arise because the pairing techniques are matching objects with similar rather than identical positions and an object in one catalogue can pair with several in the other catalogue. The terminology used in this section is:

match
a match is any object which lies within the critical distance, D, of an object in the other catalogue,
pair
a pair is any object chosen from amongst the set of matches to correspond to an object in the other catalogue.

That is, any match is potentially a pair and the pairing algorithm must prescribe which matches are considered pairs. There are three cases for multiple matches:

(1)
a single object in the primary matches several objects in the secondary (see Figure 10),
(2)
a single object in the secondary is matched by several objects in the primary (see Figure 11),
(3)
in crowded catalogues more complicated situations can arise, as illustrated in Figure 12. The results of pairing such catalogues are, in general, unpredictable.

catpair is unsuitable for handling the third case, and should not be used with catalogues where it is likely to be important. There are, however, several options for handling the first two cases:

(1)
only accept the closest of the matches as the pair,
(2)
accept all the matches as pairs,
(3)
use further information from the catalogues (such as magnitude or colour) to disambiguate a single pair from amongst the matches.

The third option is not practical in a general purpose program such as catpair because it relies on astronomical knowledge about the catalogues being paired. Either of the first two options may be appropriate, depending on the details of the pairing being performed. catpair provides both options separately for multiple matches in the primary and secondary, and you should choose the alternatives appropriate for your work.
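
The first two options can be sketched for the case of a single primary object with several matches in the secondary. The Python below is illustrative only; the function name select_pairs and the representation of a match as a (secondary index, separation) tuple are assumptions made for the example.

```python
def select_pairs(matches, keep_all):
    """Choose the pairs from the matches of one primary object.

    matches is a list of (secondary_index, separation) tuples.
    If keep_all is true every match becomes a pair; otherwise
    only the closest match is kept.
    """
    if keep_all or not matches:
        return list(matches)
    return [min(matches, key=lambda m: m[1])]


matches = [(4, 0.8), (7, 0.2), (9, 0.5)]
print(select_pairs(matches, keep_all=False))
print(select_pairs(matches, keep_all=True))
```

The same selection applies, with the roles of the catalogues exchanged, when a single secondary object is matched by several primary objects.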


                                          Primary           Secondary
                           o              xxxxxxx    +------>XXXXXXX
              o    +---------+            xxxxxxx    |       xxxxxxx
                   |o     o  |            XXXXXXX----+------>XXXXXXX
                   |    *    |  o         xxxxxxx    |       xxxxxxx
               o   |  o     o|                       +------>XXXXXXX
                   +---------+   o                   |       xxxxxxx
                                                     +------>XXXXXXX
                                                             xxxxxxx
* - Object in primary.
o - Object in secondary.

For secondary objects to match with the primary object they must fall inside the square (strictly speaking the square should be a circle with a radius equal to the critical distance, D).


Figure 10: A single primary object matches several secondary objects



                                          Primary           Secondary
                                 o        xxxxxxx            xxxxxxx
                   +---------+            XXXXXXX----+       xxxxxxx
               o   | +-------|-+          xxxxxxx    +------>XXXXXXX
                   | |  *  o | |          XXXXXXX----+       xxxxxxx
                   | |    *  | |     o                       xxxxxxx
                o  +---------+ |                             xxxxxxx
                     +---------+ o                           xxxxxxx
                                                             xxxxxxx
See Figure 10 for details of the symbols.

Figure 11: A single secondary object is matched by several primary objects



                     o           o
                  o+---------+
              o    | +o------|-+  o
                   | |  *  o | |o
                o  | |o   * o| |     o
               o   +---------+ |
                     +--o------+ o
                  o          o
See Figure 10 for details of the symbols.

Figure 12: A crowded field with multiple matches of both primary and secondary objects


An example might help to illustrate the difference between multiple matches in the primary and secondary. Suppose the primary was a private list of target objects and the secondary was the NGC catalogue. Table 13 shows the equatorial coordinates for the triplet of galaxies NGC 3623, NGC 3627 and NGC 3628 (see footnote 16). Consider the following two cases.


NGC     α (h m)     δ (° ′)

3623    11 18.9     +13 05
3627    11 20.2     +12 59
3628    11 20.3     +13 36

Table 13: Coordinates for a triplet of galaxies

20.6 Pairing algorithm


                 Primary           Secondary
            row catalogue          catalogue row
             1   xxxxxxx            xxxxxxx   1
             2   xxxxxxx            xxxxxxx   2
             3   xxxxxxx    +------>xxxxxxx   3
             .   xxxxxxx    |       xxxxxxx   .
             .   xxxxxxx----|       xxxxxxx   .
                 xxxxxxx    |       xxxxxxx
                 xxxxxxx    +------>xxxxxxx
                 xxxxxxx            xxxxxxx
                 xxxxxxx            xxxxxxx
                 xxxxxxx            xxxxxxx
                                    xxxxxxx
                                    xxxxxxx
                                    xxxxxxx
                                    xxxxxxx

Figure 13: The index join


This section describes the pairing algorithm used by catpair. Strictly speaking you should not need to know the details of the algorithm in order to use catpair, but the information is provided for reference and completeness. catpair uses an index join technique which is illustrated in Figure 13. The secondary catalogue is sorted on the second coordinate to be used in the pairing (see footnote 17). The algorithm is then as follows. Every entry in the primary catalogue is examined sequentially and for each entry the critical distance, D, is used to compute the minimum and maximum values of the sorted coordinate which could pair with the primary row. The rows in the secondary catalogue corresponding to these minimum and maximum values are then identified (remember that the secondary is sorted on this column) to yield a range of rows which might pair. All of these rows are then examined individually to check if they do pair.
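
The index join can be sketched in a few lines of Python. This is an illustration of the technique under simplifying assumptions, not the catpair code: the coordinates are Cartesian, the critical distance is a constant, each catalogue is a list of (x, y) tuples, and the function name index_join is hypothetical. A binary search (the standard bisect module) locates the range of secondary rows whose sorted coordinate lies within D of the primary row.

```python
import bisect
import math


def index_join(primary, secondary, d):
    """Return (primary_row, secondary_row, separation) for every match.

    secondary must be sorted on its second coordinate (y).
    """
    ys = [y for _, y in secondary]
    matches = []
    for i, (px, py) in enumerate(primary):
        # Range of secondary rows whose y could lie within d of this row.
        lo = bisect.bisect_left(ys, py - d)
        hi = bisect.bisect_right(ys, py + d)
        # Examine each candidate individually.
        for j in range(lo, hi):
            sx, sy = secondary[j]
            sep = math.hypot(px - sx, py - sy)
            if sep <= d:
                matches.append((i, j, sep))
    return matches


secondary = [(0.0, 0.0), (5.0, 1.0), (1.0, 1.5), (9.0, 9.0)]  # sorted on y
primary = [(1.0, 1.0), (9.0, 8.5), (50.0, 50.0)]
result = index_join(primary, secondary, 1.0)
print(result)
```

Only the candidate range is read for each primary row, which is why the secondary must be sorted but the primary need not be.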

The advantages of this technique are that it is relatively straightforward and it does not require the primary catalogue to be sorted (though the secondary must). The main disadvantage is that the ranges in the secondary corresponding to subsequent rows in the primary may overlap, thus leading to multiple reads of rows in the secondary. The technique is most appropriate where a small primary is being paired with a large secondary; perhaps a small personal list of target objects is being paired with a large standard catalogue. However, it will certainly work if the primary and secondary are of similar size; it will merely take somewhat longer to execute than is strictly necessary.

14. catpair does not actually check that the units attribute is the same for the various columns holding the coordinates, because in CURSA units are treated purely as comments.

15. See SUN/67[32]. The actual routine used is SLA_DSEP.

16. These data were taken from NGC 2000.0 by R.W. Sinnott[27].

17. Spherical-polar coordinates must be sorted on Declination or latitude in order to avoid problems with the zero to twenty-four hour boundary.