John Musgrave

Compiler Research

In taking a recent course in compiler theory, I decided to write about the implementation details briefly, although the full source code is available at this repository. This compiler does not, however, feature any machine independent optimizations during the code generation phase. I may do another post on compiler optimization, and making use of parallelism in compilers.

The goal of a compiler is to translate

A compiler by definition translates a program in one language into an equivalent program in another language.

Lexical Analysis

So, the lexical analysis phase attempts to break a program in tokens, specifically lexemes separated by a delimiter, which is usually whitespace or curly braces. The goal of the lexer is to generate a token stream to send to the parser.

Contructing an LL(1) Parser

Parsing a program is a rather complex task. The path the parser takes through the program is based on the language specification. There are also various solutions for implementation, including constructing a parse tree of all tokens in a program, such that a program can be expressed by an in order traversal of the Abstract Syntax Tree (AST). Basically, in my implementation I used an LL(1) parser, which is a top-down parser that uses a left-to-right scan, with a leftmost derivation. This parser also uses a one token look ahead. The lexical analyzer and the parser together comprise the semantic analysis component of the compiler.

Symbol Table

Constructing a symbol table is useful in scoping and determining a variable’s type. I found the implementation to be rather straightforward. A symbol table can be implemented using a stack of scopes, each of which is a table of the symbols that are currently in scope. So, the symbol table must be integrated into the parser, in order for symbols to be added to the table. This allows for type checking to be implemented in the parsing phase of the program.

Program Runtime

There were certain runtime components that needed to be implemented, in order for the program to have access to certain language level functions. These functions were simply added to the symbol table on a global scope, so that the resulting program would always have these runtime functions in scope.

Code Generation

As previously mentioned, this compiler project did not feature any machine independent optimizations. Because of this, the resulting code, although correct, was not extremely efficient. For example, a 3 variable addition:

x = y + z

could be broken down into:

load r1, y
add r1, r1, z
store x, r1

A Closer Look at RSA Encryption

2 Million Years

That’s how long it would take to break 2048 bit RSA encryption (using a standard sequential algorithm).

What is significant about RSA encryption in my mind is the number of different pieces that have to come together for this particular cryptosystem to be successful.

Prime Numbers

Probably the most crucial of these pieces is the fact that prime factorization of large numbers is currently computationally infeasible. For example, in order to find the prime factors of a given number n, you must first test all numbers up to the square root of n to see if they are in the set of all factors of n. Which has a computational complexity of n/2. This doesn’t sound too bad, until you have to check the primality of each of those factors. In order to check the primality, you must test n for divisibility by every number from 1 to n-1, in the worst case. There is some space and time trade off here, but the space that would be needed in order to store all of the computed prime numbers is also computationally infeasible, since the distribution of prime numbers is infinite.

Euler’s Totient Function

This is where Euler’s totient function comes in. The totient function, or phi(n), gives a value that represents the number of integers less than n (greater than 1), that are co-prime to n, or, numbers that are not necessarily prime, but do not divide evenly into n. Another key part of Euler’s Totient Function is that it can be deduced that the value of phi(n), when n is prime, is equal to n-1.

Extended Euclid GCD

Euclid’s algorithm can be used to compute the Greatest Common Divisor of a number. One advantage of the extended algorithm, which we will utilize here, is that it computes the modular multiplicative inverse at no extra cost.

Choosing a Public Key

We must choose a modulus n that is composed of wo large primes p and q, and compute the value of phi(n).

given n=pq, phi(n) = phi(p*q)

The Totient function can be distributed using the distributive property of multiplication, so that:

phi(p*q) = phi(p) * phi(q)

Since we know the Totient function, or phi(n), of a prime number n is equal to n-1, computing a private key is relatively easy. Since p and q are both primer, this gives us for phi(p*q):

phi(p) * phi(q) => (p-1) * (q-1)

We will now choose an integer e, that is co-prime, or relatively prime, to phi(n). This can be done easily with the Extended Euclid GCD algorithm. The integer e will be used as the public key in our encryption algorithm.

Computing a Private Key

Our private key, which will be used for decryption, is expressed fairly simply, as the modular multiplicative inverse of the public key e, mod phi(n). The modular multiplicative inverse is computed at no extra cost by the Extended Euclid GCD algorithm.

Diffie-Hellmann Key Exchange

Diffie and Hellmann also developed a method of Public Key cryptography with their public key exchange system. Basically, if two parties agree on a shared public key, and they each have their own personal private key values, they can each compute the message, or a shared value, very easily by adding their private key to the shared combination of the public key and the other party’s private key. An attacker can have the public key and the combintation of private keys and the public key, which has been shared publicly, and cannot compute the original message text in a reasonable amount of time without knowing the private key. This works for two reasons. Both parties know their own private keys, which are kept secret. Since the keys are computed using a one way function, this is difficult for an attacker to computer, due to the complexity of factoring large numbers.


A message, represented in binary, is raised to the value of a public key, e:

c = m ^ e (mod n)

This yeilds a cipher text.


In order to decrypt a cipher text, that cipher is raised to value of the private key, d:

m = c ^ d (mod n)

Which gives us the original message m.

View on code Github.