The primary words of gmIL are called Opcodes. Each Opcode defines a basic operation or value reference performed within the intermediate language. There are a total of 206 opcodes. Each opcode has a type and a role which combine to define the actual type of the binary emission used by the operation and the semantic role that operation performs. Most opcodes also have subcodes associated with them. These subcodes can be a simple member of a fixed enumeration, or they can be a value, or an address into the symbol storage area. The opcode and subcode identifiers are then organized into a reverse-polish language like the pseudo-code produced by the first pass of contemporary compilers. The gmIL is this language. It is needed to organize every phase of the translation process.
The techniques used to produce the target language are based on the work done in the area of recreation of source code. In particular, P.J. Brown -- "More on the Recreation of Source Code from Reverse Polish." Software Practice and Experience, Vol. 7, p545-551 (1977) -- discusses the recreation of source code from reverse polish notation. This discussion reviews that work and shows how it can be easily extended via a "surface-form description language" and via "external library descriptions" to allow not only for the surface form variation needed to accommodate different target languages, but also to give the final user of the translator the ability to manipulate the translation to his own needs.
If the references upon which this approach is based seem old, they are. The original coding of this approach was done in Fortran for a 1972 Linguistics Masters thesis "SPELL, Systematic Procedures for Equating Logic and Language." For SPELL, the term "Logic" referred to a formal notation used to express meaning and "Language" referred to English. It then quickly became FTSQR, Formula Translation Squared, A Fortran program that converted programs written in nonstandard Fortran into machine independent Fortran. Then, in the early 80's, it branched into PROMULA, Processor of Multiple Languages, and PFC, Portable Fortran Compiler. These are both implemented in C on desk top desktop computers and use C as the target language. The primary difference between PROMULA and PFC is that PROMULA can both directly execute and translate its source code; while PFC translates and then compiles the C along with a runtime library to do the execution. Both PFC and PROMULA have been actively in use since the 80's.
The problem with the standard theories of compiler design is that though there is ample discussion of how to do the conversion from the source language to the intermediate language, there is little discussion of how to move from an intermediate language back to a higher level target language. For compilers, the target language is always a low-level machine code. Fortunately, there has been some work done in the area of the re-creation of source code from intermediate code. Many BASIC interpreters need to be able to execute BASIC statements efficiently; while still giving the user access to the source code. The re-creationists solve this problem by compiling the source statements into a reverse polish notation and then rewriting the source code from that internal notation.
The technique of source code recreation from reverse polish notation hinges on the observation that executing such notation via a stack-oriented pseudo-machine is identical to writing and combining the character sequences that perform those operations in some target language via a stack-oriented string manipulation machine. The above can best be understood via a simple example.
Consider the following statement which we wish to both execute via a p-machine and re-create
A = B * C + D.
PUSHADR A PUSHADR B GETVAL PUSHADR C GETVAL MULT PUSHADR D GETVAL ADD PUTVAL EOS
OpCode Meaning ------ ------- PUSHADR means push the address of the variable onto the stack. GETVAL means pop the address from the stack, obtain the value at that address, and push it onto the stack. MULT means pop the top two values from the stack, multiply them together, and push the result onto the stack. ADD means pop the top two values from the stack, add them together, and push the result onto the stack. PUTVAL means pop the value and address from the stack and then store the value at the address. EOS means end the current statement.
OpCode Meaning ------ ------- PUSHADR means enter the notation for a pointer to the variable onto the stack. GETVAL means pop the top string from the stack, convert it to the notation for a value at the indicated address, and push that string back onto the stack. MULT means pop the top two strings from the stack, concatenate them with the symbol '*' in the middle, and push that string back onto the stack. ADD means pop the top two strings from the stack, concatenate them with the symbol '+' in the middle, and push that string back onto the stack. PUTVAL means pop the top two strings from the stack, convert the lower one to a value string, concatenate the two strings with the symbol '=' in the middle, and push the resultant string onto the stack. EOS means write the current string.
Instruction Execution stack String-manipulation stack ----------- --------------- ------------------------- PUSHADR A 45 "&A" PUSHADR B 45, 50 "&A", "&B" GETVAL 45, 2.0 "&A", "B" PUSHADR C 45, 2.0, 55 "&A", "B", "&C" GETVAL 45, 2.0, 3.0 "&A", "B", "C" MULT 45, 6.0 "&A", "B*C" PUSHADR D 45, 6.0, 60 "&A", "B*C", "&D" GETVAL 45, 6.0, 4.0 "&A", "B*C", "D" ADD 45, 10.0 "&A", "B*C+D" PUTVAL "A=B*C+D" EOS
A serious problem faced by the re-creationists, is that the production of the output string is completely independent of the original input made by the user. Though the original and the result mean the same thing, they might look very different. Fortunately, in this application this is not a problem. The user wants semantic and not syntactic identity.
The other important point is that the string machine need not operate against the reverse polish notation as originally formed by the compiler. The string machine can operate on any well-formed representation. This fact is key for the deep-refactoring operations which manipulate the low level notation directly.
A = B * (C + D)
PUSHADR A PUSHADR B GETVAL PUSHADR C GETVAL PUSHADR D GETVAL ADD MULT PUTVAL EOS
Note that the effect of the parentheses in intermediate form was simply to reorder the operations.
Now, using the same execution and string-manipulation rules as before, the following would result.
Instruction Execution stack String-manipulation stack ----------- --------------- -------------------------- PUSHADR A 45 "&A" PUSHADR B 45, 50 "&A", "&B" GETVAL 45, 2.0 "&A", "B" PUSHADR C 45, 2.0, 55 "&A", "B", "&C" GETVAL 45, 2.0, 3.0 "&A", "B", "C" PUSHADR D 45, 2.0, 3.0, 60 "&A", "B", "C", "&D" GETVAL 45, 2.0, 3.0, 4.0 "&A", "B", "C", "D" ADD 45, 2.0, 7.0 "&A", "B", "C+D" MULT 45, 14.0 "&A", "B*C+D" PUTVAL "A=B*C+D" EOS
The solution taken is quite simple and general. Each rule in the string manipulation machine has associated with it a precedence code. Whenever the output of an operation is written to the string-stack, the precedence code of the rule that produced the output is also written. Whenever an element is combined within a rule, if its precedence code is nonzero and lower than the code of the rule, then that element is enclosed in parentheses as it is concatenated into the new output.
Applying this extension to the example above, the precedence codes for the various operations are as follows:
Instruction String-manipulation stack ----------- ------------------------- (0) PUSHADR A (0) "&A" (0) PUSHADR B (0) "&A", (0) "&B" (0) GETVAL (0) "&A", (0) "B" (0) PUSHADR C (0) "&A", (0) "B", (0) "&C" (0) GETVAL (0) "&A", (0) "B", (0) "C" (0) PUSHADR D (0) "&A", (0) "B", (0) "C", (0) "&D" (0) GETVAL (0) "&A", (0) "B", (0) "C", (0) "D" (1) ADD (0) "&A", (0) "B", (1) "C+D" (2) MULT (0) "&A", (2) "B*(C+D)" (0) PUTVAL (0) "A=B*(C+D)" EOS
The critical point in the above occurs when the output of the ADD operation, which has a precedence code of 1, is combined via the MULT operation which has a precedence of 2. Since 1 is nonzero and less than 2, the string "C+D" is enclosed in parentheses as it is entered into the output string of the MULT instruction.
<pattern id="PUSHADR" > <all narg="0" level="0" code="\v"/> </pattern> <pattern id="GETVAL" > <all narg="1" level="0" code="%1i"/> </pattern> <pattern id="MULT" > <all narg="2" level="2" code="%1d * %2d"/> </pattern> <pattern id="ADD" > <all narg="2" level="1" code="%1d + %2d"/> </pattern> <pattern id="PUTVAL" > <all narg="2" level="1" code="%1i = %2d"/> </pattern> <pattern id="EOS" > <all narg="1" level="0" code="%1d\c"/> </pattern>
Within the pattern strings there are three types of specifications:
|Aithmetic||The arithmetic operators perform calculations, which may be actual arithmetic but might well be any other type of calculation like masking or concatenation or comparison. They typically have suboperations for the different binary types that may be involved.|
|Generic||The generic operations are the operations that are performed by any language as opposed to being specific to the particular source languages being processed here.|
|Patterns||The pattern support operations are used to facilitate the transformations of the opcodes produced by to compiler into opcodes that produce the correct translations when they are authored.|
|gmSL||The gmSL support operations are used by the gmSL compiler and runtime support to execute the methods compiled using that language. It also contains a number or dummy opcodes that can be used by gmSL programmers to facilitate migration of codes to specific targets|
|Support||The VB6/ASP support operations are used by the source language compiler and to characterize the VB6/ASP particular features.|
|Enumerations||The VB6 enumerations represent the various enumerations and their entries.|
|Classes||The VB6 classes represent the classes built into VB6 and their individual components.|
|Arithmetic||The arithmetic operators perform calculations, which may be actual arithmetic but might well be any other type of calculation like masking or concatenation or comparison. They typically have suboperations for the different binary types that may be involved.|
|SingleByte||The single byte operators have simple generic roles within the intermediate language like introducing a new code sequence or marking the end of some sequence.|
|DoubleByte||The double byte operators have generic roles within the intermediate language. Their emission consists of the operation emission code followed by the suboperation emission code. Within other statements they are referenced as OPC.Subcode.|
|ShortValue||The short value operator emission is followed by a two-byte integer value used to specify a simple integer constant or number.|
|SymbolAddr||The symbol address operators are followed by a four-byte integer value that specifies the root offset in the current storage area of a symbol defined within the source code.|
|StringAddr||The string address operators are followed by a four-byte integer value the specifies the offset in current storage of a character string. This string might be an actual string or it might be a value constant of some sort.|
|OpcodeAddr||The opcode address operators are followed by a four-byte integer value that marks the offset of the start of an intermediate language code sequence.|
|BinaryType||The binary type operators, not to be confused with Arithmetic binary operators, have a suboperation code that defines a binary type. They are used extensively in the intermediate code to specify the binary result-type of a sequence of operations.|
|LibraryAdr||The library address operators are followed by a four-byte integer value that specifies the root in either current storage or language storage of an external symbol in use by the source code. Positive roots are to current storage; while negative roots are to language storage.|
|LibPattern||The library pattern operators are followed by the offset of a surface pattern description either in the current storage area or in language storage.|
|ObjectType||The object type operators have a suboperation code that specifies the type of object that the some sequence of operations pertain to.|
|NewLev||There are many nested sequences of operators. This role starts a new nesting level.|
|EndLev||There are many nested sequences of operators. This role ends a nesting level.|
|LoadSym||This role loads a source code symbol.|
|LoadLib||This role loads a library symbol.|
|LoadInt||This role loads a short integer value.|
|LoadStr||This role loads a character string.|
|LoadLng||This role loads a string representation of a long integer constant.|
|LoadDbl||This role loads a string representation of a real constant.|
|LoadLog||This role loads a binary representation of a logical (boolean) constant.|
|LoadDat||This role loads a string representation of a date constant.|
|LoadSpv||This role loads a subcode that defines some source language particular special value like Me.|
|LoadTyp||This role loads an enumeration entry value.|
|Unary||This role defines a unary operator.|
|Binary||This role defines a non-relational binary operator.|
|Relate||This role defines a relational binary operator.|
|Member||This role asserts the membership of a code sequence in the preceding code sequence.|
|Clsref||This role asserts that the subcode refers to a class component|
|Convert||This role performs a type-conversion of some sort.|
|Comment||This role introduces a non-executable comment.|