3 Representation of Data

 3.1 Numeric Types
 3.2 Characters
 3.3 Logical Types
 3.4 Pointer Types
 3.5 Arrays
 3.6 Same Language – Different Compiler

Different languages have differing fundamental data types on which they can operate. FORTRAN has the types INTEGER, REAL, DOUBLE PRECISION, COMPLEX, LOGICAL and CHARACTER. The only aggregate data type that it supports is the array, although a character variable can store many characters.

C supports the fundamental types int, float, double, char and void. It also allows int to be modified by the type specifiers short or long and signed or unsigned, char to be modified by signed or unsigned, and double to be modified by long. However, on any given machine, some of the short, normal and long types may be represented in the same way. C also provides a range of pointer types which may, for purposes of interchange with FORTRAN, all be condensed into the generic pointer type void*. Note that, unlike FORTRAN, a C character variable can only store a single character. Also unlike FORTRAN, a C character variable is treated as a type of integer rather than as a separate type. Finally, ANSI C has the type enum, an enumerated list of values. The aggregate data types are the array, structure and union. New types can be defined in terms of the basic types by means of a typedef statement.

When writing mixed language programs, it is clearly important to know which FORTRAN types map on to which C types; in particular, which similar types use the same amount of storage. This is discussed more fully in the machine dependent sections in appendix A; however, there are some general points to be considered first.

3.1 Numeric Types

If data are to be passed between routines that have been written in different languages, then it is important that those languages represent the data in the same way. The FORTRAN standard makes no statements about how any of the data types should be implemented and there is almost nothing in the C standard either. For example, if a certain bit pattern were interpreted as the integer 2 by FORTRAN, yet the same bit pattern were interpreted by C as 1, then there would be serious problems in communicating between routines written in the two languages. Fortunately, the hardware on which the program is running provides a constraint for those data types that are implemented directly in the hardware. For example, all reasonable computers have instructions for operating on integers and it would be a particularly perverse compiler writer who chose not to use the hardware representation.

Something that is slightly more likely to be a problem is the way that floating point numbers are represented. If the hardware supports floating point arithmetic, then you are in the same situation as for integers and all should be well. However, if the hardware does not support floating point arithmetic, then there could be problems. Some older PCs do not have floating point hardware, although modern PCs either support floating point operations directly in hardware, or there is a recognised way of representing floating point numbers that is generally adhered to. The bottom line on numerical data types is that it is most unlikely that different languages will represent the same number in a different manner on the same hardware.

3.2 Characters

When considering character data, things are a bit more complicated in that the hardware does not impose a meaning on a given bit pattern. It is the operating system that does that. The character codes that are in common use are the ASCII collating sequence and the EBCDIC collating sequence. EBCDIC is only used by IBM mainframe and minicomputers (and their clones), but there are a lot of IBM computers around. (The IBM PC does not use EBCDIC.) Again it would be rather perverse if, on a given computer, FORTRAN and C used a different representation of characters, so that is not really worth worrying about. What certainly is worth paying attention to is the possibility that any given program may be run on several different computers, some using ASCII characters and some using EBCDIC. That is not a concern that is particular to mixed language programming though.

An important point about character data is that they are stored differently in FORTRAN and C. FORTRAN stores character data as a fixed-length string padded with trailing blanks whereas C stores character data as a variable-length, null-terminated string. The difference is standardized, so it does not lead to problems with portability, but it is something that will involve extra work when passing character data between routines written in different languages.
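The extra work mentioned above can be sketched in C as a pair of conversion routines. This is only an illustration: the names fstr_to_cstr and cstr_to_fstr are hypothetical, and the sketch assumes that the length of the FORTRAN string is known to the C code by some other means.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Copy a blank-padded FORTRAN string of length flen into a C buffer,
   dropping the trailing blanks and adding a terminating null.
   dst must have room for at least flen + 1 characters. */
void fstr_to_cstr(const char *fstr, size_t flen, char *dst)
{
    assert(fstr != NULL && dst != NULL);
    while (flen > 0 && fstr[flen - 1] == ' ')
        flen--;
    memcpy(dst, fstr, flen);
    dst[flen] = '\0';
}

/* Copy a null-terminated C string into a FORTRAN string of length flen,
   truncating it if it is too long and padding it with blanks if it is
   too short. No terminating null is stored. */
void cstr_to_fstr(const char *cstr, char *fstr, size_t flen)
{
    size_t n;

    assert(cstr != NULL && fstr != NULL);
    n = strlen(cstr);
    if (n > flen)
        n = flen;
    memcpy(fstr, cstr, n);
    memset(fstr + n, ' ', flen - n);
}
```

Note that a FORTRAN string which legitimately ends in blanks cannot be distinguished from one that has merely been padded, so the trailing blanks are lost in the conversion to C.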

3.3 Logical Types

So far all seems well. However, a place that can certainly cause problems is the representation of logical values. In principle, it is completely up to the compiler writer to choose how logical values are represented. What is even worse as far as C is concerned is that there is no logical type at all! In C, a numerical value of zero represents false and anything else represents true, but these are numeric data types, not logical types. On a VAX/VMS system, FORTRAN represents a logical value of false by an integer zero and true by an integer minus one; however, only the bottom bit is tested, so if an integer value of 2 were to be treated as a logical value, then it would be taken as false. C, on the other hand, would treat it as true.
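The two conventions just described can be compared with a short C sketch. The function names are hypothetical, and the first function merely mimics the bottom-bit test attributed above to VAX/VMS FORTRAN; it is not taken from any compiler.

```c
/* Mimic the VAX/VMS FORTRAN test described in the text:
   only the lowest bit of the value is examined. */
int vax_fortran_is_true(int value)
{
    return ((unsigned int)value & 1u) != 0;
}

/* The C convention: zero is false, anything else is true. */
int c_is_true(int value)
{
    return value != 0;
}
```

With these definitions the values 0 and -1 are treated identically by both conventions, whereas the value 2 is false under the bottom-bit test but true in C.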

3.4 Pointer Types

The main reason for passing pointer information between C and FORTRAN is to pass references to dynamically allocated memory, which is especially useful given FORTRAN 77’s lack of dynamic memory allocation. In addition, the referenced memory may contain data values which are, in effect, also being exchanged and which we must therefore be able to reference from both languages.

C provides a wide range of pointer types which can be constructed to refer to any other C type. Each of these pointer types can, at least in principle, have different storage requirements. Indeed, on some machines and operating systems there are variations in pointer length according to the type of data being referenced, and even variations in the way bit patterns are interpreted according to where the referenced data are stored in memory. Fortunately, the more arcane of these schemes are now regarded as historical anomalies and are unlikely to be met in future.

C provides the generic pointer type void*, to which all pointer types may be cast, and from which the original pointer may later be recovered by casting back to the original type. Since the void* type must therefore cater for the “lowest common denominator” of C pointer types, it is very likely to contain just a simple memory address for the referenced data (or something equivalent) on all machines. Therefore, exchanging the void* type is the key to interchanging pointers between C and FORTRAN.

However, FORTRAN 77 does not have a pointer type, and its INTEGER data type must be pressed into service in order to store pointer values passed from C. Unfortunately, on some platforms, a C pointer is longer than a FORTRAN INTEGER, which means that there is no suitable standard (and therefore portable) FORTRAN data type of sufficient length to store an address in memory. To overcome this limitation, some trickery is required, the upshot of which is that there are some restrictions on the particular pointer values which may be passed from C to FORTRAN.

In practice, this means that pointer exchange between C and FORTRAN is really only safe when referring to dynamically allocated memory (and not, for example, when referring to static memory allocated in C, where you have no control over the address used). It also means that CNF must provide special facilities for allocating dynamic memory from C which will later be passed to FORTRAN, and for “registering” the associated pointers. It also provides functions for converting between the C and FORTRAN representations of these pointers.
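The length mismatch between a C pointer and a FORTRAN INTEGER can be illustrated with the following sketch. It is not the mechanism that CNF uses; it only shows why an arbitrary address may not survive being stored in a shorter integer. The function names are hypothetical, and the sketch assumes a 32-bit FORTRAN INTEGER and a C implementation that provides uintptr_t.

```c
#include <assert.h>
#include <stdint.h>

/* Return 1 if the address held in p can be stored in 32 bits
   without loss, 0 otherwise. */
int pointer_fits_in_32_bits(void *p)
{
    uintptr_t address = (uintptr_t)p;
    return address <= (uintptr_t)UINT32_MAX;
}

/* Store a pointer in a 32-bit integer (standing in for a FORTRAN
   INTEGER). This is only meaningful if the address fits. */
uint32_t pointer_to_integer(void *p)
{
    assert(pointer_fits_in_32_bits(p));
    return (uint32_t)(uintptr_t)p;
}

/* Recover the pointer from the stored 32-bit value. */
void *integer_to_pointer(uint32_t stored)
{
    return (void *)(uintptr_t)stored;
}
```

On a machine with 64-bit pointers, many valid addresses fail the test in pointer_fits_in_32_bits, which is why the addresses passed to FORTRAN have to be restricted in some way.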

3.5 Arrays

Although the representation of a single numerical value is unlikely to cause a problem, the way that arrays of numbers are stored is different between different languages. One dimensional arrays pose the least problem, but even then there are differences. In C, all array subscripts start at zero, and this cannot be changed. In FORTRAN, subscripts start at one by default, but this can be modified so that the lower bound of a dimension of an array can be any integer. What must be remembered is that the array element with the lowest subscript in a FORTRAN array will map on to the array element with a zero subscript when treated as a C array. This is not a serious problem as long as you remember it.
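The mapping between FORTRAN and C subscripts for a one dimensional array amounts to subtracting the FORTRAN lower bound. A minimal sketch, using a hypothetical helper function c_index, is:

```c
#include <assert.h>
#include <stddef.h>

/* Return the zero-based C subscript of the element that FORTRAN
   addresses with `subscript` in an array whose lower bound is `lower`. */
size_t c_index(int subscript, int lower)
{
    assert(subscript >= lower);
    return (size_t)(subscript - lower);
}
```

For example, if the FORTRAN array is declared as A(-5:5), then element A(-5) corresponds to a[0] in C and A(0) corresponds to a[5].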

Multi-dimensional arrays are a well known problem since FORTRAN stores consecutive array elements in column-major order (this is specified in the FORTRAN standard) whereas C, like many other languages, stores them in row-major order. For example, in FORTRAN, the order of elements in a 2 x 2 array called A is A(1,1), A(2,1), A(1,2), A(2,2), whereas in C this would be A[0][0], A[0][1], A[1][0], A[1][1]. In practice this is rarely a serious problem as long as you remember to take account of the reversed order when writing a program. However, when coupled with the difference in default lower bounds (zero in C, one in FORTRAN) it is a fruitful source of bugs.
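The storage order described above can be expressed as a small C function that gives the position of a FORTRAN element within the contiguous block of memory that holds the array. The name fortran_offset is hypothetical, and the sketch assumes a two dimensional array with the default lower bound of one in both dimensions.

```c
#include <assert.h>
#include <stddef.h>

/* Return the zero-based offset in memory of the FORTRAN element A(i,j)
   of an array declared with nrows as its first dimension.
   Both subscripts start at one, as is the default in FORTRAN. */
size_t fortran_offset(size_t i, size_t j, size_t nrows)
{
    assert(i >= 1 && i <= nrows && j >= 1);
    return (i - 1) + (j - 1) * nrows;
}
```

Put another way, if a FORTRAN array A(NROWS,NCOLS) is viewed from C as a[ncols][nrows], then the FORTRAN element A(I,J) is the C element a[J-1][I-1]: the subscripts are reversed and each is reduced by one.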

There are additional problems with FORTRAN character arrays. This is because C handles a one dimensional FORTRAN character array as a two dimensional array of type char, i.e. the FORTRAN statement:

  CHARACTER * ( NCHAR ) NAMES(DIM)

is equivalent to the C statement:

  char names[dim][nchar]
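Given this equivalence, a single element of a FORTRAN character array can be extracted in C by stepping through the storage in units of nchar characters. The following is a sketch only; the function name is hypothetical, and it assumes that each element is blank padded and carries no terminating null.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Copy element `index` of a FORTRAN character array into a C string.
   `index` starts at one, as in FORTRAN. Each element is nchar
   characters long and is blank padded. dst must have room for at
   least nchar + 1 characters. */
void fortran_element_to_cstr(const char *names, size_t nchar,
                             size_t index, char *dst)
{
    const char *element;
    size_t len = nchar;

    assert(names != NULL && dst != NULL && index >= 1);
    element = names + (index - 1) * nchar;

    /* Ignore the trailing blanks used as padding. */
    while (len > 0 && element[len - 1] == ' ')
        len--;
    memcpy(dst, element, len);
    dst[len] = '\0';
}
```

Because the elements are stored one after another with no separators, the C code must be told both nchar and the number of elements; neither can be deduced from the data.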

3.6 Same Language – Different Compiler

In the preceding sections, reference is often made to “the way that FORTRAN does something” or “the way that C does something”. However, even different compilers for the same language can do things in a different way if the standard does not specify how that something should be done. A reasonable example is that one FORTRAN compiler might represent a true logical value by the integer 1, whereas another might just as reasonably use -1. This is not just a hypothetical problem; the FORTRAN for RISC compiler from MIPS and the DEC FORTRAN for RISC compiler both work on the DECstation and interpret the same number as different logical values. I shall continue to refer to “the way that FORTRAN does it”, even though it is more correct to refer to “the way that FORTRAN compiler XYZ implements it”. The distinction is rarely important, but should be borne in mind.