It has taken a long time and a large amount of effort to agree the design which is described here. The solutions to many apparently straightforward problems came only after protracted discussion and thought. Many superficially attractive ideas proved on careful examination to be half-baked and unfortunately had to be dropped. Some of the arguments are subtle, and it must be acknowledged that not all contributors are fully convinced that the rejection of their ideas was justified. This is inevitable given the size and diversity of the Starlink community.
The most fundamental issue (certainly the most fruitful source of controversy) is whether we are trying to design formats which are so comprehensive that any piece of information, however specific to an instrument or processing phase, has a defined location ready to receive it, or whether we are instead trying to design the simplest possible system which can do the job.
The first approach, sometimes called the “we’ve thought of everything” philosophy or TOE, is the one that has been traditionally employed. The designers of such systems have tried to predict everything that might be needed (in their experience as optical spectroscopists, aperture synthesists, X-ray observers, etc.) for the general case of a picture, spectrum, time series, or whatever. These designs generally work well for their inventor, but when others try to use them they find omissions, inconsistencies, ambiguities and limitations, and either have to add new items of their own or use existing items in a non-standard way. Even where a new form of data is expressed in what appears to be the standard way, experience has shown that precise interpretation by different application programs of all the various ancillary items (e.g. exposure time, astrometric parameters, etc.) cannot be relied on, and so these items become little more than comments.
The Starlink designs reject the TOE approach in favour of one where:
The third point—the treatment of extensions—is crucial. Most astronomer/programmers feel drawn to the familiar TOE approach, where there is a place to put the exposure time, polarimeter setting, relative humidity, feed-horn collimation parameters, etc., and are unhappy that many of the items they wish to include have to be “demoted” by being moved into an extension. Alternatively, they are willing to accept the need for extensions, but only for the idiosyncratic data required by other astronomers. It is important to understand that the extensions in the Starlink standard formats are an essential part of the scheme, safe havens where important but specialised items can reside, accessible to programs which understand them, and automatically copied from generation to generation. All extensions should be registered with the Starlink Head of Applications to avoid clashes between different groups of applications. Certain general-purpose extensions will be highly standardised, and will be used by many application packages.
The combination of (i) trying to keep the formats simple and (ii) defining precisely how the different items should be interpreted by application programs has produced a result which has remarkably little evidence of astronomy in it. This should not be regarded as a worry; the astronomical information, relating to astrometry, radiometry, timing, and so on, will reside in standard extensions which will be defined in due course. A byproduct of this conservatism (which came largely from the need to reduce the task to a manageable size) is that the standard structures, and the general-purpose applications which process them, may have uses outside astronomy.
The Starlink standard data structures can be divided into two categories: low-level, and composite. Low-level structures are self-contained; coupled with individual data objects they are used to build the composite formats, and include axis information, title, history, and quality. The composite formats are akin to the Interim Environment’s Bulk Data Frame (see SUN/4).
Since the idea of some completely general data format has been rejected (for reasons already presented), even for astronomical data, a number of composite data formats will be defined for various classes or forms of data as required.
The only current example of a composite data format, the NDF, is presented in Table 1. (NDF is short for Extensible n-Dimensional Data Format and will be described fully in Section 11.) The NDF is based on the n-dimensional data array, which is the most natural way to express many sorts of astronomical data set—notably spectra, pictures and time series.
Within an HDS container file, the NDF structure usually, though not necessarily, resides at the top level. For example, the top-level structure within a container file might be built from several NDFs, each an observation of the same source but through a different filter.
Component Name | TYPE | Brief Description |
[VARIANT] | _CHAR | variant of the NDF |
[TITLE] | _CHAR | title of [DATA_ARRAY] |
[DATA_ARRAY] | various | NAXIS-dimensional data array |
[LABEL] | _CHAR | label describing the data array |
[UNITS] | _CHAR | units of the data array |
[VARIANCE] | s_array | variance of the data array |
[BAD_PIXEL] | _LOGICAL | bad-pixel flag |
[QUALITY] | various | quality of the data array |
[AXIS(NAXIS)] | AXIS | axis values, labels, units and errors |
[HISTORY] | HISTORY | history structure |
[MORE] | EXT | extension structure |
(The typographical marks shown with the TYPE are not actually part of it, nor are the brackets around the NAME. They are just a notational convention to distinguish TYPE from NAME.)
There are several low-level objects within the NDF, each with a NAME for access and identification, and a TYPE to control the general processing. They comprise:
[TITLE], [LABEL], [UNITS], [BAD_PIXEL], [HISTORY], [MORE];
[DATA_ARRAY], [VARIANCE], [QUALITY].
Only the [DATA_ARRAY] is mandatory—so a primitive HDS object containing a 2-D array of numbers (for example) is valid as the only component of an NDF. A full description of the components is given in Section 11 and the meaning of the TYPEs in Section 3.
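For concreteness, here is a purely illustrative Python sketch (not any Starlink interface) that models the component layout of Table 1 as a nested mapping; the dictionary keys simply echo the component names above, and the helper function, array shapes and values are invented for the example.

```python
import numpy as np

# Hypothetical in-memory picture of an NDF, following Table 1.
# Only DATA_ARRAY is mandatory; everything else is optional.
minimal_ndf = {
    "DATA_ARRAY": np.zeros((256, 256)),   # a bare 2-D array is a valid NDF
}

fuller_ndf = {
    "TITLE": "Dark-subtracted CCD frame",
    "DATA_ARRAY": np.zeros((256, 256)),
    "LABEL": "Counts",
    "UNITS": "ADU",
    "VARIANCE": np.ones((256, 256)),
    "BAD_PIXEL": True,                    # magic values may be present
    "MORE": {},                           # extensions live here
}

def is_valid_ndf(ndf):
    """Check the single mandatory component named in the text."""
    return "DATA_ARRAY" in ndf

print(is_valid_ndf(minimal_ndf), is_valid_ndf(fuller_ndf))    # True True
```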
The omission from the NDF of the maximum and minimum values of the data array, and other quantities which can be deduced from the data, is deliberate. This is because experience has shown that sooner or later applications come along which fail to keep these numbers up to date.
In designing the NDF, efforts were made to retain some compatibility with the original Wright-Giddings proposals. A limited but useful degree of compatibility was achieved—the former location of the main data array (i.e. at the top level) is still recognised, and the name has been retained—but it proved impossible to accommodate more of the original proposals without enormously adding to the complexity of the processing rules.
As was mentioned earlier, the NDF is a simple structure devoid of any obviously astronomical items but which can be used to express many different varieties of astronomical data. IPCS spectra, CCD pictures and Taurus datacubes, for example, are all essentially n-dimensional arrays together with some ancillary items, and fit naturally into the form of the NDF, simple though it is. In the next section, we will look at the general question of layering—how the elaborate requirements of any particular data type can be broken down into different levels of generality. We will then go on to consider the topics of substructures (how common building blocks may be identified and exploited), extensibility (how items peculiar to a data type or an application package may be accommodated), and various processing considerations including the propagation of extensions.
There are several well-defined levels of generality in designing data formats, with each level building on those below it (see Table 2). Having designed and implemented HDS, the Starlink Project could have left it at that and, beyond stipulating that applications must use HDS for their data storage, have allowed each software package to be autonomous. This is the HDS level. The second level uses a mathematical representation to provide generality; the rules for processing data objects are mathematical abstractions, though chosen to map well onto the processing of pictures and spectra and other types of astronomical work. The astronomical representation is much more complex than the mathematical level, and contains information relating to astrometry, radiometry, etc. Specialist structures for instrument- or application-specific processing are still more complex.
The HDS-level approach was rejected because it fails to ensure the required degree of compatibility between application packages. The astronomy level, on the other hand, could not have been defined properly in the time available, and the highest level will, of course, only be supplied by the authors of specialist packages. Therefore, Starlink initially selected the mathematical representation. As well as offering the possibility of a useful standard in a reasonable time (it will be much more capable than INTERIM’s BDF, which has nonetheless proved extremely successful and versatile), concentrating on the mathematical level makes it easier to identify a subset of common low-level data objects. Because the processing rules are well defined, software to handle these low-level objects is relatively easy to write, and once written can be used by many different packages.
KAPPA (Kernel Application Package — see SUN/95) is a set of applications which processes these abstracted data objects. KAPPA is intended to be a quick, exploratory data-analysis tool, and its applications will act as paradigms or templates for specialist software packages.
Another important package is FIGARO (SUN/86), which has been influential in defining the Starlink standard data formats and is at present undergoing a refit to bring it fully into line. FIGARO is a large and mature set of picture-processing and spectral applications, and though not as formally correct as KAPPA it will be the dominant general-purpose ADAM application package for some time to come and will have a profound influence on the design of other packages.
Standard software interfaces will be written to access the low-level structures. They are currently being designed and will be described in a future version of this document.
The essence of HDS is the ability to define multiple levels rather than having everything at the top level in a “flat” design. Substructures make it easy to copy or delete parts of a structure, and provide flexibility and extensibility. Some implementors mistrust multiple levels of hierarchy, and have advocated the use of flat arrangements, combined with elaborate naming schemes (with wildcarding) to distinguish between different classes of object. However, this approach has been discredited as inefficient and arcane, and has not been used in the Starlink standard formats. It is also deprecated within NDF extensions.
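To illustrate the contrast, the hypothetical Python sketch below sets a flat arrangement, which must fall back on name patterns and wildcard matching, against a nested arrangement in which a whole substructure can be selected, copied or deleted in one step; all of the names used here are invented.

```python
import fnmatch

# Flat arrangement: one level, with the class encoded in the name and
# selected by wildcard matching (the approach the text deprecates).
flat = {
    "AXIS1_DATA": [0.0, 1.0, 2.0],
    "AXIS1_LABEL": "Wavelength",
    "AXIS2_DATA": [0.0, 1.0],
    "AXIS2_LABEL": "Slit position",
}
axis1_items = {k: v for k, v in flat.items() if fnmatch.fnmatch(k, "AXIS1_*")}

# Hierarchical arrangement: each axis is a substructure, so copying or
# deleting "everything about axis 1" is a single operation.
nested = {
    "AXIS": [
        {"DATA": [0.0, 1.0, 2.0], "LABEL": "Wavelength"},
        {"DATA": [0.0, 1.0], "LABEL": "Slit position"},
    ],
}
axis1 = nested["AXIS"][0]          # one reference, no name parsing

print(axis1_items)
print(axis1)
```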
Low-level structures should be kept as small and simple as possible. They should contain related objects, whose meanings are defined and whose processing rules are specified. If a structure is created which contains unrelated objects all lumped together, it will be unwieldy, the processing rules will be restrictive or highly complex, and it will be a difficult object for programmers to manipulate. Furthermore, modular substructures can be used in different data formats without duplication of software. The conventions for the interpretation of a structure, as well as its components, must be defined if there is to be compatibility between different software packages.
On the other hand, over-elaborate structuring (structure which would demand more work of the applications programmer and reduce the efficiency of applications) should be avoided. For example, it would probably be inefficient to represent a celestial position with separate components for hours, minutes, seconds and degrees, arcminutes, arcseconds, rather than expressing the two angles as floating-point numbers or using a single character string (notwithstanding the example in SSN/27, which is a demonstration rather than a recommendation!).
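A small hypothetical sketch of the cost: with a sexagesimal substructure, every application must first reassemble the two angles it actually wants (and get the sign handling right), whereas a pair of floating-point numbers needs no unpacking at all. The structure and values below are invented for the example.

```python
# Hypothetical over-structured representation of a celestial position,
# with separate components for hours/minutes/seconds and deg/arcmin/arcsec.
position = {
    "RA":  {"HOURS": 5, "MINUTES": 34, "SECONDS": 31.94},
    "DEC": {"DEGREES": 22, "ARCMIN": 0, "ARCSEC": 52.2},
}

# Every application must first reassemble the two angles it really wants ...
ra_deg = 15.0 * (position["RA"]["HOURS"]
                 + position["RA"]["MINUTES"] / 60.0
                 + position["RA"]["SECONDS"] / 3600.0)
dec_sign = -1.0 if position["DEC"]["DEGREES"] < 0 else 1.0
dec_deg = dec_sign * (abs(position["DEC"]["DEGREES"])
                      + position["DEC"]["ARCMIN"] / 60.0
                      + position["DEC"]["ARCSEC"] / 3600.0)
# (note that this naive sign rule already fails for declinations
#  between 0 and -1 degree, which is exactly the kind of trap that
#  over-elaborate structuring invites)

# ... whereas storing the angles directly needs no unpacking.
simple_position = {"RA": ra_deg, "DEC": dec_deg}   # both in degrees
print(simple_position)
```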
Component Name | TYPE | Brief Description |
[VARIANT] | _CHAR | ‘SIMPLE’ |
[ORIGIN(NAXIS)] | integer | origin of the data array |
[DATA] | narray | actual value at every pixel |
Component Name | TYPE | Brief Description |
[VARIANT] | _CHAR | ‘SCALED’ |
[ORIGIN(NAXIS)] | integer | origin of the data array |
[DATA] | narray | scaled numeric value at every pixel |
[SCALE] | numeric | scale factor |
[ZERO] | numeric | zero point |
Example structures are shown in Tables 3–5. The first two are variants of the ARRAY structure, and are different ways of storing an n-dimensional array of numbers. A variant qualifies the TYPE and the processing rules of the structure, and may appear in any structure. The most basic form is specified by [VARIANT] = ‘SIMPLE’, which is the default should [VARIANT] not be present. In this case ARRAY comprises an array of numbers plus the origin in each dimension. [ORIGIN] defines the zero point of the pixel coordinate system. For [VARIANT] = ‘SCALED’, the array of numbers has been scaled, perhaps as 16-bit integers to save disk space for large-format data. [SCALE] and [ZERO] are used to generate the actual array values. In almost all cases, standard subroutines will deal automatically with the different variants and simply present the application with a locator to an array of numbers.
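As a sketch of the idea, assuming the conventional linear relation actual = DATA * SCALE + ZERO (the precise rule is the business of the standard access routines, not of this illustration), unscaling might look like this in Python; the structure contents are invented for the example.

```python
import numpy as np

# Illustrative SCALED variant, after Table 4; values are invented.
scaled_array = {
    "VARIANT": "SCALED",
    "ORIGIN": np.array([1, 1]),          # pixel coordinates of the first element
    "DATA": np.array([[120, 340], [560, 780]], dtype=np.int16),
    "SCALE": 0.25,
    "ZERO": 100.0,
}

def unscale(arr):
    """Present the caller with plain floating-point values, as the
    standard access routines are intended to do automatically."""
    if arr.get("VARIANT", "SIMPLE") == "SCALED":
        return arr["DATA"] * arr["SCALE"] + arr["ZERO"]
    return np.asarray(arr["DATA"], dtype=float)   # SIMPLE: values stored directly

print(unscale(scaled_array))
```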
Component Name | TYPE | Brief Description |
[CREATED] | _CHAR | creation date and time |
[EXTEND_SIZE] | _INTEGER | increment number of records |
[CURRENT_RECORD] | _INTEGER | record number being used |
[RECORDS()] | HIST_REC | array of history records |
All the objects in the first two examples have primitive TYPEs. This need not be the case. The HISTORY structure contains a further structure, [RECORDS], to store the history text and a time tag. History records are brief and are only intended to assist the user. Their contents must not be used by applications to control processing.
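A hypothetical sketch of how the components in Table 5 might interact: [RECORDS] grows in blocks of [EXTEND_SIZE] and [CURRENT_RECORD] tracks the slot in use. The Python names and the exact extension policy below are assumptions made for illustration only.

```python
from datetime import datetime, timezone

# Illustrative HISTORY structure, after Table 5; the record contents
# are brief text plus a time tag, intended only to assist the user.
history = {
    "CREATED": datetime.now(timezone.utc).isoformat(),
    "EXTEND_SIZE": 5,
    "CURRENT_RECORD": 0,
    "RECORDS": [None] * 5,
}

def add_history(hist, text):
    """Append one brief history record, extending the array if it is full."""
    if hist["CURRENT_RECORD"] >= len(hist["RECORDS"]):
        hist["RECORDS"].extend([None] * hist["EXTEND_SIZE"])
    hist["RECORDS"][hist["CURRENT_RECORD"]] = {
        "DATE": datetime.now(timezone.utc).isoformat(),
        "TEXT": text,                # for the user's eyes only,
    }                                # never to control processing
    hist["CURRENT_RECORD"] += 1

add_history(history, "Flat-fielded with nightly calibration frame")
add_history(history, "Smoothed with 3x3 box filter")
print(history["CURRENT_RECORD"], history["RECORDS"][:2])
```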
The rules for designing new structures are presented in Section 13.
No data format design will ever satisfy all requirements, and some provision has to be made for accommodating ancillary information, specific to a group of applications or a particular instrument for example. What should an application do if it encounters objects which are not part of a standard structure?
In most of the ASPIC applications (see SG/1) such objects — which have to be expressed in simple ways akin to FITS descriptors — are simply ignored, and consequently do not appear in any new data frames created by the applications.
A hierarchical data system like HDS enables all these specialist data to be expressed in natural ways and to be attached to the main data structure at appropriate places within the structure. During processing these extension substructures can be copied to output data structures. Of course, as a result of the computations there may then be inconsistencies between the specialist and the general data objects; this is unavoidable. Rules will have to be laid down about what applications can be run on what types of data and in what order, and sometimes it may be necessary to resort to utilities which edit HDS structures to fix up inconsistencies. Specialist packages could implement all this transparently by providing command procedures.
In the NDF, extensibility is provided through the [MORE] structure. Information required by application packages (and under the complete control of those packages) is arranged into structures usually of TYPE EXT and stored within [MORE]. In order to allow applications to recognise their own extension objects without risk of clashes, the names and types of the extension structures must be registered with the Head of Applications. An example of an extension might be one called ASTROMETRY, which would contain information about the relationship between the data and the celestial reference frame.
[MORE] structures can occur once at the top level of the NDF, and once in each [AXIS] element.
As well as processing the extensions they recognise, applications are obliged to propagate all other extensions to any output structures.
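The obligation can be pictured with a small hypothetical sketch: every extension found in [MORE] is copied to the output, and only those the application recognises are touched. The extension names and contents below are examples only, not registered extensions.

```python
import copy

# Illustrative input [MORE] structure with two example extensions.
input_more = {
    "ASTROMETRY": {"CRVAL": [83.63, 22.01], "CDELT": [-0.0003, 0.0003]},
    "IRCAM": {"FILTER": "K", "CHOP_THROW": 30.0},
}

def propagate_extensions(more_in, known=()):
    """Copy all extensions; an application may update those it understands
    but must still pass the rest through unchanged."""
    more_out = copy.deepcopy(more_in)          # everything is propagated
    for name in known:
        if name in more_out:
            pass                               # recognised: update here as required
    return more_out

output_more = propagate_extensions(input_more, known=("ASTROMETRY",))
print(sorted(output_more))                     # ['ASTROMETRY', 'IRCAM']
```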
Details of the defined extensions are given in other documents.
Although the data objects in the NDF are general, not all applications will know how to process all the objects. For example, the [VARIANCE] in the NDF becomes meaningless after thresholding. Therefore, the rule for propagation is as follows.
“Tree walking”, i.e. by one means or another moving to a higher position within a structure and processing data objects there, is forbidden. If such objects are required, their names must be acquired from outside the application, and a new locator obtained.
It will often be most convenient and efficient to delay conversion of instrumental data into standard formats until calibration, reformatting and other preprocessing operations have been completed. The software for performing this conversion is best provided by the instrument builders. However, in some instances it will prove convenient to use a composite structure (e.g. an NDF) for uncalibrated data so that general-purpose applications can be used for display and other quick-look operations.
Two methods are available for dealing with bad data. The magic value method uses special values stored within the data themselves to indicate that a datum is undefined or invalid. The second method is quality. Each data value may have associated with it a data-quality value, an 8-bit code which flags whether the data value is special in some way or combination of ways.
The NDF has a [BAD_PIXEL] flag, which allows applications to find out in advance whether there are any magic-value data within the [DATA_ARRAY]. If not, a version of the algorithm which does not do the checks can be invoked, in order to save time. Note that in this document the term pixel is used in its generic sense, i.e. equivalent to an array element or bin, and therefore does not imply membership of a two-dimensional array.
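A hypothetical sketch of both methods, and of the fast path the [BAD_PIXEL] flag permits; the magic value and the quality-bit assignments here are invented for the example.

```python
import numpy as np

BAD_VALUE = -32768            # hypothetical magic value for this sketch only
QUAL_SATURATED = 0b00000001   # example quality bits; the real assignments
QUAL_COSMIC_RAY = 0b00000010  # are defined elsewhere

data = np.array([10.0, 12.0, BAD_VALUE, 9.0])
quality = np.array([0, QUAL_SATURATED, 0, 0], dtype=np.uint8)
bad_pixel = True              # the [BAD_PIXEL] flag: magic values may be present

def mean_signal(data, bad_pixel, quality=None):
    """Average the good pixels, taking the fast path when the
    [BAD_PIXEL] flag guarantees there are no magic values."""
    good = np.ones(data.shape, dtype=bool)
    if bad_pixel:                       # only check when the flag says we must
        good &= data != BAD_VALUE
    if quality is not None:             # any set quality bit marks the pixel
        good &= quality == 0
    return data[good].mean()

print(mean_signal(data, bad_pixel, quality))    # 9.5
```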
Full details of bad-pixel methods are presented in Section 7.