HP OpenVMS MACRO Compiler Porting and User's Guide

HP OpenVMS MACRO Compiler
Porting and User's Guide

Contents

Index

2.10 Using Floating-Point Instructions

All floating-point instructions and directives, with the exception of POLYx, EMODx and all H_floating instructions, are supported.

These instructions are emulated by means of subroutine calls. This support is provided to allow hands-off compatibility for most existing VAX MACRO modules and is not designed for fast floating-point performance.

Besides the overhead of the emulation routine call, on OpenVMS Alpha systems, all floating-point operands must be passed through memory because the Alpha architecture does not have instructions to move values directly from the integer registers to the floating-point registers. In addition, on the first floating-point instruction, the FEN (floating-point enable) bit is set for the process which will cause the entire floating-point register set to be saved and restored on every context switch for the life of the image.

2.10.1 Differences Between the OpenVMS VAX and OpenVMS Alpha/I64 Implementations

The differences between the implementations on OpenVMS VAX and OpenVMS Alpha/I64 systems are noted in the following list:

Floating-point instructions and atomicity
Since all floating-point instructions are emulated by means of subroutine calls, they are not atomic or restartable, and cannot be made atomic by the .PRESERVE ATOMICITY directive or /PRESERVE=ATOMICITY option. Also, there is no guarantee that registers from R16 through R28 are preserved across the instruction.
Use of argument registers
Arguments are passed to the emulation routines by means of the argument registers (R16 through R21 on OpenVMS Alpha systems); attempts to use these registers as arguments in a floating-point instruction will not work correctly.
On OpenVMS I64 systems, the first parameters are passed not in registers R16 through R21, but in registers R32 through R39. Thus, there are eight rather than six argument registers. Also, on OpenVMS I64, the MACRO-32 compiler does not provide a way to refer to these registers by name. Therefore, there is no opportunity to write conflicting uses of these registers.
On OpenVMS Alpha systems, to prevent problems with AP-based references when floating-point instructions are used, it is simplest to specify /HOME_ARGS=TRUE on the entry point of the routine that contains these references. This forces all AP-based references to be read from the procedure frame, rather than the argument registers, so that no changes need to be made to the instructions.
If /HOME_ARGS=TRUE is not used, the source code that contains parameter references based on the AP register as operands to floating-point instructions should be changed. Copy any AP-based operand to a temporary register first, and use that temporary register in the floating-point instruction. For example, change the following code:
MOVF @8(AP), @4(AP)
to:
MOVL 8(AP), R1 MOVL 4(AP), R2 MOVF (R1),(R2)
Note that, on OpenVMS I64 systems, the incoming argument registers are disjoint from the outgoing argument registers.
D_floating format on OpenVMS Alpha systems
D_floating format is not fully supported in the Alpha architecture. All arithmetic operations or conversions must be done by converting D_floating format to G_floating format, doing the operation in G_floating format, and converting back to D_floating format. This results in the loss of the extra 3 bits of precision in the D_floating format mantissa, besides taking the time in conversion. This means that there is nothing to be gained by using D_floating format. It is available for compatibility only, with the caveat that some precision will be lost.
VAX floating-point formats on OpenVMS I64 systems
Since the Itanium architecture only supports IEEE S and T formats, all arithmetic operations or conversions must be done by converting D_floating and G_floating format to T_floating format, doing the operation in T_floating format, and converting back to D_floating or G_floating format. This results in the loss of the extra 3 bits of precision in the D_floating format mantissa, besides taking the time in conversion. This means that there is nothing to be gained by using D_floating format. It is available for compatibility only, with the caveat that some precision will be lost. Likewise, all F_floating operations are done by converting to S_floating, doing the operation, and converting back to F_floating.
Overflow traps
It is not possible to enable traps on integer overflow or floating-point underflow. Alpha systems will always trap on floating-point overflow. The /ENABLE qualifier and the .ENABLE directive have no effect on overflow trapping. Overflow traps that lead to predictable results on OpenVMS VAX systems will give the same results on OpenVMS Alpha and OpenVMS I64 systems.
Dirty zeros
Code may need modification to eliminate dirty zeros. A true zero in all VAX floating-point formats has all bits set to zero. If the exponent bits are all zero but some of the remaining bits are set, this is a dirty zero, and the OpenVMS VAX system will treat this as a zero. An OpenVMS Alpha system will take a reserved operand trap.
OpenVMS I64 does not support VAX format floats. However, the VAX float emulation routines preserve the VAX behavior.
Restriction on format of arguments
Because these instructions are implemented by means of macros, there is one restriction on the format of the arguments. In a macro invocation, an initial circumflex (^) is interpreted to mean that the parameter is a string, and the character immediately following the circumflex is the string delimiter. Because of this, you cannot use arguments that begin with an operand type specification, such as ^x20(SP). Note that immediate mode arguments, such as #^XFF, can use an operand type specification because the circumflex is not the initial character.
Floating-point return values
A MACRO program that calls out to a routine and expects a floating-point return value in R0 may require a "jacket" between the call and the called routines. The purpose of the jacket is to move the returned value from floating-point register 0 to R0.

2.10.2 Impact on Routines in Other Languages

This support does not make the floating-point register set visible to the compiler. It simply allows floating point-operations to be done on the integer registers. This means that routines in other languages that want to interface with a VAX MACRO routine, either calling it or being called by it, must not expect any floating-point values as inputs or outputs. Compilers for other languages will pass these values in the floating-point registers. Floating-point arguments can be passed into or out of a VAX MACRO routine only by pointer.

Calls to run-time library (RTL) routines of other languages fall into this category. For example, a call to MTH$RANDOM returns a floating value in floating-point register F0. The compiler cannot directly read F0. You need to either create a jacket routine (in another language), which makes the call to MTH$RANDOM and then moves the result to R0, or write a separate routine that only does the move.

2.11 Preserving VAX Atomicity and Granularity

The VAX architecture includes instructions that perform a read-modify-write memory operation so that it appears to be a single, noninterruptible operation in a uniprocessing system. Atomicity is the term used to describe the ability to modify memory in one operation. Because the complexity of such instructions severely limits performance, read-modify-write operations on an Alpha or I64 system can be performed only by nonatomic, interruptible instruction sequences.

Furthermore, VAX instructions can address single-aligned or unaligned byte, word, and longword locations in memory without affecting the surrounding memory locations. (A data item is considered aligned if its address is an even multiple of the item's size in bytes.) Granularity is the term used to describe the ability to independently write to portions of aligned longwords.

Because byte, word, and unaligned longword access also severely limits performance, an OpenVMS Alpha system can only access aligned longword or quadword locations. Therefore, a sequence of instructions to write a single byte, word, or unaligned longword causes some of the surrounding bytes to be read and rewritten.

While Itanium has instructions for accessing bytes and words, there is a performance penalty if they are unaligned.

These architectural differences can cause data to become corrupted under certain conditions.

In an OpenVMS Alpha system, atomicity and granularity preservation are not provided by locking out other threads from modifying memory, but by providing a way to determine if a piece of memory may have been modified during the read-modify-write operation. In this case, the read-modify-write operation is retried.

In an OpenVMS I64 system, atomicity is achieved by retrying the operation as for OpenVMS Alpha.

To ensure data integrity, the compiler provides certain qualifiers and directives to be used for the conditions described in the following sections.

2.11.1 Preserving Atomicity

On OpenVMS VAX, OpenVMS Alpha, and OpenVMS I64 multiprocessing systems, an application in which multiple, concurrent threads can modify shared data in a writable global section must have some way of synchronizing their access to that data. On a OpenVMS VAX single processor system, a memory modification instruction is sufficient to provide synchronized access to shared data. However, it is not sufficient on OpenVMS Alpha or OpenVMS I64 systems.

The compiler provides the /PRESERVE=ATOMICITY option to guarantee the integrity of read-modify-write operations for VAX instructions that have a memory modify operand. Alternatively, you can insert the .PRESERVE ATOMICITY and .NOPRESERVE ATOMICITY directives in sections of VAX MACRO source code as required to enable and disable atomicity.

For instance, assume the following instruction, which requires a read, modify, and write sequence on the data pointed to by R1:

INCL (R1)

In a OpenVMS VAX system, the microcode performs these three operations. Therefore, an interrupt cannot occur until the sequence is fully completed.

In an OpenVMS Alpha system, the following three instructions are required to perform the one VAX instruction:

LDL R27, (R1) ADDL R27, 1, R27 STL R27, (R1)

Similarly, in an OpenVMS I64 system, the following four instructions are required:

ld4 r22 = [r9] sxt4 r22 = r22 adds r22 = 1, r22 st4 [r9] = r22

The problem with this Alpha/Itanium code sequence is that an interrupt can occur between any of the instructions. If the interrupt causes an AST routine to execute or causes another process to be scheduled between the LDL and the STL, and the AST or other process updates the data pointed to by R1, the STL will store the result (R1) based on stale data.

When an atomic operation is required, and /PRESERVE=ATOMICITY (or .PRESERVE ATOMICITY) is specified, the compiler generates the following Alpha instruction sequence for INCL (R1):

Retry: LDL_L R28,(R1) ADDL R28,#1,R28 STL_C R28,(R1) BEQ R28, fail . . . fail: BR Retry

and the following Itanium instruction sequence:

$L3: ld4 r23 = [r9] mov.m apccv = r23 mov r22 = r23 sxt4 r23 = r23 adds r23 = 1, r23 cmpxchg4.acq r23, [r9] = r23 cmp.eq pr0, pr6 = r22, r23 (pr6) br.cond.dpnt.few $L3

On the OpenVMS Alpha system, if (R1) is modified by any other code thread on the current or any other processor during this sequence, the Store Longword Conditional instruction (STL_C) will not update (R1), but will indicate an error by writing 0 into R28. In this case, the code branches back and retries the operation until it completes without interference.

The BEQ Fail and BR Retry are done instead of a BEQ Retry because the branch prediction logic of the Alpha architecture assumes that backward conditional branches will be taken. Since this operation will rarely need to be retried, it is more efficient to make a forward conditional branch which is assumed not to be taken.

Because of the way atomicity is preserved on OpenVMS Alpha systems, this guarantee of atomicity applies to both uniprocessor and multiprocessor systems. This guarantee applies only to the actual modify instruction and does not extend interlocking to subsequent or previous memory accesses (see Section 2.11.6).

The OpenVMS I64 version of the code uses the compare-exchange instruction (cmpxchg) to implement the locked access, but the effect is the same: If other code has modified the location being modified here, then this code will loop back and retry the operation.

You should take special care in porting an application to an OpenVMS Alpha or OpenVMS I64 system if it involves multiple processes that modify shared data in a writable global section, even if the application executes only on a single processor. Additionally, you should examine any application in which a mainline process routine modifies data in process space that can also be modified by an asynchronous system trap (AST) routine or condition handler. See Migrating to an OpenVMS AXP System: Recompiling and Relinking Applications¹ for a more complete discussion of the programming issues involved in read-modify-write operations in an Alpha system.

Warning

When preserving atomicity, the compiler generates aligned memory instructions that cannot be handled by the Alpha PALcode unaligned fault handler. They will cause a fatal reserved operand fault on unaligned addresses. Therefore, all memory references for which .PRESERVE ATOMICITY is specified must be to aligned addresses (see Section 2.11.5).

2.11.2 Preserving Granularity

To preserve the granularity of a VAX MACRO memory write instruction on a byte, word, or unaligned longword on Alpha means to guarantee that the instruction executes successfully on the specified data and preserves the integrity of the surrounding data.

The VAX architecture includes instructions that perform independent access to byte, word, and unaligned longword locations in memory so two processes can write simultaneously to different bytes of the same aligned longword without interfering with each other.

The Alpha architecture, as originally implemented, defined instructions that could address only aligned longword and quadword operands. However, byte and word operands for load and store were later added.

On Alpha, code that writes a data field to memory that is less than a longword in length or is not aligned can do so only by using an interruptible instruction sequence that involves a quadword load, an insertion of the modified data into the quadword, and a quadword store. In this case, two processes that intend to write to different bytes in the same quadword will actually load, perform operations on, and store the whole quadword. Depending on the timing of the load and store operations, one of the byte writes could be lost.

The Itanium architecture has byte, word, longword, and quadword addressibility, so granularity of access is easily obtained without having to request it with an option or declaration.

The compiler provides the /PRESERVE=GRANULARITY option to guarantee the integrity of byte, word, and unaligned longword writes. The /PRESERVE=GRANULARITY option causes the compiler to generate Alpha instructions that provide granularity preservation for any VAX instructions that write to bytes, words, or unaligned longwords. Alternatively, you can insert the .PRESERVE GRANULARITY and .NOPRESERVE GRANULARITY directives in sections of VAX MACRO source code as required to enable and disable granularity preservation.

For example, the instruction MOVB R1, (R2) generates the following Alpha code sequence:

LDQ_U R23, (R2) INSBL R1, R2, R22 MSKBL R23, R2, R23 BIS R23, R22, R23 STQ_U R23, (R2)

If any other code thread modifies part of the data pointed to by (R2) between the LDQ_U and the STQ_U instructions, that data will be overwritten and lost.

The following Itanium code sequence is generated:

st1 [r28] = r9

If you have specified that granularity be preserved for the same instruction, by either the command qualifier or the directive, the Alpha code sequence becomes the following:

BIC R2,#^B0111,R24 RETRY: LDQ_L R28,(R24) MSKBL R28,R2,R28 INSBL R1,R2,R25 BIS R25,R28,R25 STQ_C R25,(R24) BEQ R25, FAIL . . . FAIL: BR RETRY

In this case, if the data pointed to by (R2) is modified by another code thread, the operation will be retried.

The Itanium code sequence would be unchanged, because the code is already only writing to the affected memory locations.

For a MOVW R1,(R2) instruction, the Alpha code generated to preserve granularity depends on whether the register R2 is currently assumed to be aligned by the compiler's register alignment tracking. If R2 is assumed to be aligned, the compiler generates essentially the same code as in the preceding MOVB example, except that it uses INSWL and MSKWL instructions instead of INSBL and MSKBL, and it uses #^B0110 in the BIC of the R2 address. If R2 is assumed to be unaligned, the compiler generates two separate LDQ_L/STQ_C pairs to ensure that the word is correctly written even if it crosses a quadword boundary.

Similarly, for Itanium, the compiler will simply generate st2 [r28] = r9 if the address is word-aligned.

Warning

The code generated for an aligned word write, with granularity preservation enabled, will cause a fatal reserved operand fault at run time if the address is not aligned. If the address being written to could ever be unaligned, inform the compiler that it should generate code that can write to an unaligned word by using the compiler directive .SET_REGISTERS UNALIGNED=Rn immediately before the write instruction.

To preserve the granularity of a MOVL R1,(R2) instruction, the compiler always writes whole longwords with a STL instruction, even if the address to which it is writing is assumed to be unaligned. If the address is unaligned, the STL instruction will cause an unaligned memory reference fault. The PALcode unaligned fault handler will then do the loads, masks, and stores necessary to write the unaligned longword. However, since PALcode is noninterruptible, this ensures that the surrounding memory locations are not corrupted.

When porting an application to an OpenVMS Alpha system, you should determine whether the application performs byte, word, or unaligned longword writes to memory that is shared either with processes executing on the local processor, or with processes executing on another processor in the system, or with an AST routine or condition handler. See Migrating to an OpenVMS AXP System: Recompiling and Relinking Applications for a more complete discussion of the programming issues involved in granularity operations in an OpenVMS Alpha system.

Note

INSV instructions do not generate code that correctly preserves granularity when granularity is turned on.

2.11.3 Precedence of Atomicity Over Granularity

If you enable the preservation of both granularity and atomicity, and the compiler encounters VAX code that requires that both be preserved, atomicity takes precedence over granularity.

For example, the instruction INCW 1(R0), when compiled with .PRESERVE=GRANULARITY, retries the write of the new word value, if it is interrupted. However, when compiled with .PRESERVE=ATOMICITY, it will also refetch the initial value and increment it, if interrupted. If both options are specified, it will do the latter.

In addition, while the compiler can successfully generate code for unaligned words and longwords that preserves granularity, it cannot generate code for unaligned words or longwords that preserves atomicity. If both options are specified, all memory references must be to aligned addresses.

2.11.4 When Atomicity Cannot Be Guaranteed

Because compiler atomicity guarantees only affect memory modification operands in VAX instructions, you should take special care in examining VAX MACRO sources for coding problems that /PRESERVE=ATOMICITY cannot resolve on OpenVMS Alpha or OpenVMS I64 systems.

For example, consider the following VAX instruction:

ADDL2 (R1),4(R1)

For this instruction, the compiler generates an Alpha code sequence such as the following, when /PRESERVE=ATOMICITY (or .PRESERVE ATOMICITY) is specified:

LDL R28,(R1) Retry: LDL_L R24,4(R1) ADDL R28,R24,R24 STL_C R24,4(R1) BEQ fail . . . fail: BR Retry

Note that, in this code sequence, when the STL_C fails, only the modify operand is reread before the add. The data (R1) is not reread. This behavior differs slightly from VAX behavior. In an OpenVMS VAX system, the entire instruction would execute without interruption; in an OpenVMS Alpha or OpenVMS I64 system, only the modify operand is updated atomically.

As a result, code that requires the read of the data (R1) to be atomic must use another method, such as a lock, to obtain that level of synchronization.

For this instruction, the compiler generates an Itanium code sequence such as the following:

ld4 r19 = [r9] sxt4 r19 = r19 adds r16 = 4, r9 $L4: ld4 r17 = [r16] mov.m apccv = r17 mov r15 = r17 sxt4 r17 = r17 add r17 = r19, r17 cmpxchg4.acq r17, [r16] = r17 cmp.eq pr0, pr7 = r15, r17 (pr7) br.cond.dpnt.few $L4

Consider another VAX instruction:

MOVL (R1),4(R1)

For this instruction, the compiler generates an Alpha code sequence such as the following whether or not atomicity preservation was turned on:

LDL R28,(R1) STL R28,4(R1)

The VAX instruction in this example is atomic on a single VAX CPU, but the Alpha instruction sequence is not atomic on a single Alpha CPU. Because the 4(R1) operand is a write operand and not a modify operand, the operation is not made atomic by the use of the LDL_L and STL_C.

On OpenVMS I64 systems, the code sequence would be something like the following:

ld4 r14 = [r9] sxt4 r14 = r14 adds r24 = 4, r9 st4 [r24] = r14

Finally, consider a more complex VAX INCL instruction:

INCL @(R1)

For this instruction, the compiler generates an Alpha code sequence such as the following, when /PRESERVE=ATOMICITY (or .PRESERVE ATOMICITY) is specified:

LDL R28,(R1) Retry: LDL_L R24,(R28) ADDL R24,#1,R24 STL_C R24,(R28) BEQ fail . . . fail: BR Retry

Here, only the update of the modify data is atomic. The fetch required to obtain the address of the modify data is not part of the atomic sequence.

On OpenVMS I64 systems, the code sequence would be similar to the following:

ld4 r16 = [r9] sxt4 r16 = r16 $L5: ld4 r14 = [r16] mov.m apccv = r14 mov r24 = r14 sxt4 r14 = r14 adds r14 = 1, r14 cmpxchg4.acq r14, [r16] = r14 cmp.eq pr0, pr8 = r24, r14 (pr8) br.cond.dpnt.few $L5