CS 442: Principles of Programming Languages

Gregor Richards

Estimated study time: 11 hr 58 min

Table of contents

Module 1: Languages

Introduction

Welcome to CS442! This course is titled “Principles of Programming Languages”, and rightly so. This course will discuss the principles underlying the design and implementation of programming languages.

First, let’s try to eliminate some misunderstandings by explaining what this course is not:

  • This course is not “a programming language every week”. Although we will be looking at several programming languages you’re probably not familiar with, our focus is on depth. We will examine programming languages as artifacts themselves.
  • This course is not a compilers course. That’s CS444. Although you will be implementing languages, the implementations won’t be good; the goal is understanding, not efficiency.
  • This course is not a history course, although we will look a bit at the history of programming languages for context. We aim to examine timeless concepts.

So, if this course isn’t any of those things, what is it? We’ll be looking at two major aspects of programming languages: How they are formally defined, in a framework allowing for mathematical rigor and proofs, and the scope of language paradigms that exist.

Formally, a programming language is modeled as a calculus. If your first language is English, you’ve probably only ever encountered “calculus” as the name of one particular subject, but in fact, the system we describe just as “calculus” is “the calculus of differentials and integrals”, or “infinitesimal calculus”. It came to be known simply as “calculus” because the Fundamental Theorem of Calculus unified two calculi which were previously independent. A calculus is simply a mathematical language; a language for describing mathematical ideas. Set theory has a calculus, the calculus of sets; linear algebra has a calculus, the calculus of matrices. Indeed, even arithmetic is a calculus: it could be described simply as the calculus of numbers, but is more properly called, well, arithmetic. We don’t typically describe these things as calculi, because we don’t usually think about the calculi themselves; they’re used as tools.

We will bridge the gap between these mathematical calculi and programming languages by building a particular, simple programming language, called the \(\lambda\)-calculus (Lambda calculus), both as a mathematical calculus and as a (rather impractical) programming language. As a calculus, we will describe it with mathematical rigor, allowing us to prove some properties of \(\lambda\)-calculus expressions. As a language, we will implement it in software, such that \(\lambda\)-calculus expressions are also \(\lambda\) programs. In both, we will extend it to understand how language concepts affect language behavior both formally and practically.

As well as \(\lambda\)-calculus and its derivatives, we will be looking at real programming paradigms. A programming paradigm is simply a way of thinking about software, usually exemplified by a fundamental way that data and code are stored and interacted with. You’re probably familiar with functional programming languages, such as Racket, and object-oriented languages, such as Java and (arguably) C++. In this course, we will be looking at a few more programming paradigms. For each paradigm, we will be looking at an exemplar programming language—i.e., a programming language which exemplifies the paradigm without trying to be “multi-paradigm” and thus muddying the water—and we will look at formal calculi which model the behavior of such languages. The paradigms and exemplars we will be examining are:

  • Functional and Haskell
  • Logic and Prolog
  • Imperative and Pascal
  • Object oriented and Smalltalk
  • Concurrent and Erlang
  • Systems and C

History of Programming Languages

The history of programming languages predates the history of computers, but has closely tracked the power and capability of computers since their inception. The earliest experience of real programming was manually entering CPU-specific code (machine code) in binary. Indeed, this remains true to this day: all CPUs run their own machine code. Advancements in programming languages are possible because programming allows abstraction: writing a program in machine code is excessively annoying, but once the first assembler is written in machine code, the programmer is free to program in assembly instead of raw machine code. Once the first language compiler targeting assembly is written in assembly, the programmer is free to program in its language. Each of these steps allows a higher-level language, albeit not without consequences in terms of raw performance and predictability.

One of the first widely-used “high-level” languages—here, “high-level” really just means higher-level than assembly code—was Fortran, originally developed in the 1950s by IBM. The original version of Fortran predated a feature we now consider so fundamental to programming languages that we rarely even feel the need to name it: structured programming.

The thing that is “structured” about structured programming is control flow. Now-familiar concepts such as if blocks and procedures were yet to be invented. Before structured programming, the program was one giant list of numbered instructions, and the programmer could choose to conditionally jump to a different instruction. If they wanted to jump back again (and thus form a conditional block, like an if statement), they would have to do so manually! And this is setting aside the fact that Fortran programs of the era weren’t entered on a keyboard, but painstakingly encoded and punched into hundreds of cards.

Structured programming was probably first implemented in Algol, a language designed in the late 1950s and into the 1960s. Although Algol itself wasn’t particularly popular, almost all modern procedural languages (C, Java, JavaScript, Python, etc.) can trace at least some history to Algol’s design. Importantly, Algol was designed. Fortran was changed and enhanced as needs arose, with no particular design motivation other than to serve its purpose. Algol, as well as contemporaries such as LISP and COBOL, were, at least at their inception, carefully designed to encourage a particular style of programming. This era was the beginning of programming languages having different paradigms.

Aside: In fact, however, some aspects of programming paradigms predate programming languages. In the 1930’s, the Church-Turing thesis unified two different models of computation: Church’s and Turing’s. Programming languages which followed Church’s philosophy of computing would later be known as functional languages (i.e., languages in the functional paradigm), while languages following Turing’s philosophy are imperative. The equivalence result behind the Church-Turing thesis shows that these paradigms are equally powerful, so the families of languages that they spawned differ not in what they can do, but in how one expresses what they do. We’ll see a bit more about the Church-Turing thesis in Module 2.

Fortran, Algol, COBOL, and many of their successors are imperative programming languages: their fundamental model of computing is a list of instructions which is run in the order that it appears, with any instruction able to change the state of the computer in a way that affects how the following instructions operate. Lisp, also developed in the late 1950’s, followed a different basic design: the basic unit of computation was the function, and data was usually encapsulated so that the same sort of state change, while possible, was not central. It was possibly the first functional programming language. In this context, of course, “functional” doesn’t mean “working”, it means that functions are first-class, in that they are values in the language. Lisp also pioneered the concept of homoiconicity: code and data having the same form. Lisp’s central datatype is the list, and Lisp functions are represented as nested lists. As such, Lisp code can manipulate Lisp code. This tradition continued in languages such as Scheme (and Racket), where the ability of code to manipulate code has led to extremely powerful macro systems.

Imperative and functional languages continued to develop greater sophistication, without changing their fundamental paradigms, until object-oriented programming was developed in the 1970’s and entered the mainstream in the 1980’s, with concurrent languages appearing around the same time.

Programming Paradigms

In introducing the history of programming languages, we’ve discussed imperative and functional programming. These are two of the programming language paradigms we will investigate in this course.

A programming paradigm is a mode of thought, and as such, it’s impossible to formally define. The edges of a programming paradigm are often unclear, and so it’s impossible to answer questions such as “is this object-oriented programming?” Nonetheless, as programming paradigms developed, programming languages developed to support those paradigms, and understanding programming paradigms is crucial to having a broad understanding of programming languages. Teaching the breadth of programming language paradigms is one of the fundamental goals of this course.

Because it’s impossible to define precisely what is or is not in a programming paradigm, we will use exemplars: languages which exemplify the concepts of a particular programming paradigm and, ideally, little else. As such, exemplar languages are not necessarily chosen because they’re especially practical programming languages; indeed, more practical languages tend to be flexible and multi-paradigm, which makes them poorer exemplars. Rather, they’re chosen because to understand an exemplar programming language is to understand the mode of thought behind the paradigm it exemplifies.

Aside: An exemplar is different from an example in that while an example of property X has property X, an exemplar of property X is a template or model for property X. For instance, although C++ is an object-oriented programming language, its history and its goals make it difficult to separate out the concepts of object-oriented programming from systems, imperative, procedural, structural programming, etc. Thus, C++ is an example of object-oriented programming, but not an exemplar.

It should come as no surprise that there is no perfect list of all the programming language paradigms that exist. The paradigms selected for this course are those of enduring, practical importance.

The languages you’re asked to program in, OCaml and Smalltalk, are an example and an exemplar of functional programming and object-oriented programming, respectively. In general, the assignments will be to implement a simple interpreter for an example of some paradigm of programming language, and you will be required to use the less similar of these two languages. For instance, you will be asked to implement a functional language in Smalltalk and an object-oriented language in OCaml. The reason for this isn’t just cruelty: it is often easier to implement an interpreter for a programming language in a similar programming language because you can use the host language’s features in place of the guest language’s features, but taking this easy route does not help you to actually understand the paradigm. By using the “opposite” language, you will gain a more complete understanding of the paradigm. Note that OCaml has object-oriented features (that’s what the “O” stands for), but we won’t be using them in this course.

OCaml in CS442

OCaml is available on linux.student.cs.uwaterloo.ca as ocaml. In this course, you may use the OCaml standard library, and the Base, Core_Kernel and Core libraries if you would like, but you may not use any others unless you are specifically instructed to do so in an assignment.

OCaml has many useful references online. For the purposes of this course, you should read the Guided Tour from Real World OCaml. The rest of the book is also a useful reference, of course, but you’re recommended to use it as a reference rather than reading it through, simply because you shouldn’t need it.

Real World OCaml recommends using the library Base. You can set up OCaml with Base on your own system following the instructions in its installation chapter, or on linux.student.cs.uwaterloo.ca by doing the following:

  1. Set up opam with opam init and follow its directions (the defaults should work)
  2. Include opam in your environment with eval $(opam env)
  3. Install Base and related libraries and tools with opam install core base utop
  4. Add the following to a file .ocamlinit in your home directory, which may be a new file:
#use "topfind";;
#thread;;
#require "base";;

Restrictions

In this course, you may not define classes in OCaml. You may use classes provided by the standard library or allowed libraries, but may not define your own. OCaml is an object-oriented dialect of Caml, which is in turn a dialect of ML, but it is for those ML roots that OCaml was chosen, not for its object orientation. OCaml was chosen over languages without such additions, such as Standard ML, simply because it is more well maintained. Note that records are not classes, and are perfectly fine to use.
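To make the distinction concrete, here is a minimal sketch of an OCaml record: a plain, labeled data container with no methods and no inheritance, which the restriction above explicitly permits. (The type and field names here are our own illustration, not from any assignment.)

```ocaml
(* A record type: just named fields, nothing object-oriented about it. *)
type point = { x : float; y : float }

(* Construct a record and read its fields with dot notation. *)
let origin = { x = 0.0; y = 0.0 }

(* Functional update: build a new record that copies the old one,
   changing only the listed fields. *)
let shifted = { origin with x = 1.5 }

let () =
  Printf.printf "shifted = (%f, %f)\n" shifted.x shifted.y
```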

Aside: Many object oriented languages don’t distinguish records—simple data containers—from classes—object-oriented types—because in these languages, a class is strictly more powerful than a record type. For instance, in C++, although C’s record syntax, struct, is still supported, structs are just classes that are public by default. However, the concepts evolved quite separately, and indeed, “purely” object-oriented languages such as Smalltalk don’t even have record types!

Real World OCaml recommends the use of the Dune build system, and you’re recommended to use it for your own convenience and testing, but we will not be using Dune to build your code. This is because we will be building your code against our own test suites, and we do not wish to require you to learn how to use Dune to build libraries. As a consequence, you must name your files as we specify, and must name your functions as we specify. You may of course have additional functions, variables, etc., beyond what we demand, as helpers, but you may not have additional files for your solution to any assignment question, since we won’t be using Dune, so wouldn’t know what additional files to compile. You can and should have additional files for testing, but they cannot be a requirement of your code; i.e., they cannot be part of your actual code’s functionality, only its testing.

Testing

To test in the simplest way, just build tests into your normal .ml file. However, make sure you remove them or comment them out before submitting!
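For instance, inline tests might look like the following sketch, where `double` is a made-up stand-in for an assignment function:

```ocaml
(* Hypothetical assignment function (the name and behavior are our
   own illustration, not from an actual assignment). *)
let double x = x * 2

(* Quick inline tests: remember to remove or comment these out
   before submitting! *)
let () =
  assert (double 0 = 0);
  assert (double 21 = 42)
```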

For better testing, you should write a separate module, i.e., a separate file. It’s quite easy to use code from another module. For instance, if you are tasked with writing a1q1.ml, and a function named pushNum, then if you create a separate file named, e.g., testa1q1.ml, you can call a1q1.ml’s pushNum function as A1q1.pushNum. Just make sure you build your test code together with a1q1.ml. This is how our own tests will be built, as normal OCaml; you will not normally be expected to write a parser in this course.
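The two-file arrangement above can be sketched as follows. In a real assignment these would be two separate files, a1q1.ml and testa1q1.ml; here we use an inline module so the example is self-contained, and the body of pushNum is a made-up stand-in:

```ocaml
(* Stands in for a1q1.ml: the file a1q1.ml automatically becomes
   the module A1q1, so its contents would be just the let binding. *)
module A1q1 = struct
  (* Hypothetical body; the actual assignment would specify
     pushNum's real behavior. *)
  let pushNum n stack = n :: stack
end

(* Stands in for testa1q1.ml: call a1q1.ml's function as
   A1q1.pushNum, qualified by the module name. *)
let () =
  let stack = A1q1.pushNum 2 (A1q1.pushNum 1 []) in
  assert (stack = [2; 1])
```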

Configuring Dune to build a plethora of different tests all against the same main code can be a bit complicated. If you’d rather use make, or just compile manually, here is the incantation you need for the above example:

ocamlfind ocamlc -package core -linkpkg -thread -o testa1q1 a1q1.ml testa1q1.ml

You can generalize this to anything else simply by replacing the exact .ml files. This produces an executable named testa1q1.

In the above example we use the Core package, and for most anything in this course, that should be sufficient, as it links in Base, Stdio, and most other standard libraries. If you want to be more particular, you can use a comma-separated list of packages:

ocamlfind ocamlc -package base,stdio -linkpkg -o testa1q1 a1q1.ml testa1q1.ml

Note that in the first example we used -thread because Core depends on it, but neither Base nor Stdio do, so it’s fine to leave it out in this case. ocamlc will warn you if you needed -thread but excluded it, and it’s always harmless to include it. We won’t look at using threads in OCaml until the end of the course.

OCaml has C- or C++-like compilation and linking, which is why you need to be specific both about what you’re compiling against (with -package) and what you’re linking against (with -linkpkg).

Smalltalk

Smalltalk is an object-oriented programming language. Indeed, Smalltalk is so object oriented that it’s barely a procedural imperative language: Smalltalk’s syntax doesn’t even have conditionals or loops! How is that possible? Well, you don’t need loops in the syntax of your language if block (lambda) objects have a whileTrue: method which runs a block repeatedly, and you don’t need conditionals if your boolean objects have an ifTrue: method that evaluates its argument only if the boolean was true. If you’re familiar with object-oriented languages that have a less pure approach to object-oriented programming (that is, nearly all of them), then Smalltalk’s style will come as a bit of a shock. But, it doesn’t take much getting used to, and until you’ve grown accustomed to it, you can simply mentally rewrite the conditionals and loops you’re comfortable with into Smalltalk’s style.

Smalltalk’s oddness doesn’t stop there, however. Smalltalk is a class-based object-oriented language, as are C++, Java, Python, etc. You might then assume that a reasonable place to start with such a language is “what is the syntax to define a class?” Except… there isn’t one. The Smalltalk standard does not include syntax for classes. Smalltalk was, at its inception, not just a programming language, but an operating system and development environment, and the way that you create a class is through the graphical workspace. You then add methods to the class through user-interface interactions, and then finally, the body of a method has a defined syntax. In order to reduce the semantic leap from languages that you may be more familiar with, in this course, we’ll be using an unusual variant of Smalltalk that behaves more like a conventional programming language, GNU Smalltalk. GNU Smalltalk accepts Smalltalk files and has an interactive REPL, like other programming languages. If you’re interested in Smalltalk, you’ll probably want to explore some other Smalltalk environments, such as Pharo and Squeak, but bear in mind that only the syntax for the bodies of methods will look familiar.

This module assumes that you’re familiar with some modern object-oriented languages such as C++ and perhaps Java, and that you’re familiar with programming in general, and focuses on what will be unfamiliar in Smalltalk with that background.

My First Smalltalk Program

GNU Smalltalk is available on linux.student.cs.uwaterloo.ca. Just run the command gst:

$ gst
GNU Smalltalk ready
st>

You may also want to install GNU Smalltalk on your own system. In most systems with package managers, it’s in a package named gnu-smalltalk. Or, you can get the source at https://www.gnu.org/software/smalltalk/ and build it yourself. You should get the latest “alpha” version, which at the time of writing this is 3.2.91.

From here, we can write everyone’s favorite first program:

st> 'Hello, world!' displayNl
Hello, world!
'Hello, world!'
st>

That… kind of worked? It actually worked fine: the first line is the behavior of the displayNl method itself, and the second line is the return from the displayNl method. displayNl returns the thing that it displayed.

Let’s take a moment to examine how this worked, though, because it probably looks unfamiliar. Our command, 'Hello, world!' displayNl, has two components: the string and the message. The string should look familiar enough, but it’s worth noting that in Smalltalk, strings can only be delimited by single quotes ('), not double quotes ("). In addition, strings in Smalltalk are objects of the class String. In fact, everything in Smalltalk is an object!

Yes, even the class String itself is an object! (Strictly speaking, its class is a metaclass rather than Class itself, but the principle holds: classes are objects too, all the way up.)

The second part is the message. This command sends the displayNl message to a string object. When an object receives a message (in this case, displayNl), it looks up a corresponding method in its class. String objects are of the class String, so String’s displayNl method is invoked (sometimes written String>>displayNl to clarify which class’s method is meant), and the result of the expression ('Hello, world!' displayNl) is the return value of the method. This whole process should feel familiar if you’ve used any object-oriented language, so rather than describing the whole process of sending a message and responding by invoking a method every time, we will simply describe this process as “calling the displayNl method”.

Smalltalk is quite purely object oriented: there are no procedures that aren’t methods, and there is no control flow but calling methods and returning. So, if methods are all that there is, then the way to display a string is to call the display method on the string—i.e., ask the string to display itself. If C++ supported this style of displaying strings, it might look something like this:

("Hello, world!").displayNl();

The parentheses and dots and other syntactic clutter are there because C++ has so many other ways of expressing yourself. When all you can do is call methods, you don’t need a dot or parentheses to say that you’re calling a method. Of course you’re calling a method, that’s all you can do. So, all that extra syntax falls away, and the simple way to call a method is just to name it: 'Hello, world!' displayNl.

Since single quotes delimit strings, if you want to put a single quote in a string, you need to escape it (specify that this single quote is not the end of the string). To do so, simply use two single quotes instead of one:

st> 'This isn''t SUCH a bad programming language!' displayNl.
This isn't SUCH a bad programming language!
'This isn''t SUCH a bad programming language!'
st>

Note that the string it actually printed had only one single quote. The second line is showing what value was returned by displayNl, and to show that a string was returned, it surrounds it in quotes, and escapes the quote again.

You can use Ctrl+D or the extremely memorable command ObjectMemory quit to leave the GNU Smalltalk interactive shell. We’ll mostly be using Smalltalk files, since Smalltalk was never really meant to be used in a read-eval-print-loop (REPL) like this.

Aside: The Nl in displayNl means “newline”. If you didn’t want to print with a newline at the end, just use display!

Let’s do Math

Create a file named math.st (or whatever you’d like) in your favorite text editor. Let’s write a program to do some basic arithmetic:

30 + 12 displayNl

And run it:

$ gst math.st
12

That result was probably not what you expected. The reason is that everything in Smalltalk is an object, so all behaviors are method calls, even +. It should be clearer if we add some parentheses to show the precedence:

30 + (12 displayNl)

A zero-argument method call like displayNl binds more tightly (has higher precedence) than a binary method call like +, so it came first. Let’s rewrite this in C++-like method-call syntax to show every step:

(30).operator+((12).displayNl());

The displayNl method was applied on the 12, and then its result was passed to the + method on 30. On the interactive shell, we would have also displayed the result of the whole computation, which is 42, but now that we’re writing actual Smalltalk files, the only output that’s displayed is what you explicitly ask for. We can fix this with parentheses (which, contrary to many other languages, are never method calls). We can also clarify our code with a comment, which in Smalltalk is surrounded by double quotes:

" Display the sum of 30 and 12, rather
  than adding the display of 12 to 30 "
(30 + 12) displayNl

Output: 42

OK! By putting (30 + 12) in parentheses, we’re forcing it to be evaluated first. Then, the displayNl method is called on the result, 42. Now, let’s do some very slightly more sophisticated math:

(30 + 6 * 2) displayNl

Output: 72

Once again, a surprising result. If we apply the usual rules of math, then * has higher precedence than +, so we would perform 6 * 2, then 30 + 12. But, this isn’t (just) math, and + and * aren’t just operators, they’re methods. Smalltalk doesn’t actually know the rules of mathematical precedence, it just knows how to call methods on objects. In this code, it just knows to call the + method because you asked it to, and then call the * method because you asked it to. In fact, Smalltalk doesn’t even know what the mathematical operators are; if you want to use @ as an operator for your own class, you just need to name a method in that class @.

Since it knows nothing of the order of operations, Smalltalk treats all binary operators with equal precedence, going left to right. We can now fix our simple math with even more parentheses:

" Multiply first "
(30 + (6 * 2)) displayNl

Output: 42

Since displayNl returns the object being displayed, we can actually observe this left-to-right behavior quite nicely, by printing the result of every intermediate calculation:

(30 displayNl + (6 displayNl * 2 displayNl) displayNl) displayNl

Output:

30
6
2
12
42

Exercise 1. Work through the above Smalltalk program and explain why it outputs exactly what it outputs.

We can perform multiple statements by separating them with a dot (.). In essence, you can think of the dot like a semicolon in C and languages that follow its style:

(30 + (6 * 2)) displayNl.
(10 - 15) displayNl.
(10 * 10) displayNl.

Output:

42
-5
100

It is not necessary to end the last statement in a list of statements with a dot, but it is common to do so.

Booleans, Conditions, and Loops

It should come as no surprise that Smalltalk has two boolean values: true and false. By this point, it should hopefully also not come as a surprise that true and false are objects of the class Boolean, and they have some useful methods.

In most modern programming languages, using a boolean to conditionalize execution looks something like this:

if (condition) {
    statements;
}

In Smalltalk, there are no conditional statements like this in the syntax. Instead, just like everything else, we do things conditionally by calling a method. In this case, the ifTrue: method on Booleans:

true ifTrue: ['true is true!' displayNl]

Output: true is true!

This example has introduced two new elements of Smalltalk syntax. First, the simple one. Note the : on ifTrue:. Methods named with a colon like this expect an argument. Of course, in this case, the interesting part is what that argument is.

Now, the more interesting new element. The expression ['true is true!' displayNl] defines a block. A block is sort of like an anonymous function, in that it is a packaging of code which can be evaluated later, but this analogy is imperfect, so we’ll simply use the standard Smalltalk term, “block”. We pass that block as an argument to the ifTrue: method. Like an anonymous function, ifTrue: can then choose to call it, or not to call it.

You may think that the implementation of ifTrue: has to “cheat” and do a conventional if statement to work. In fact, its implementation is far more object-oriented than that. The object true is actually of the class True, which is a subclass of the class Boolean. Some methods are implemented on Boolean, and those can be used on true: like in other class-based languages, subclasses inherit the behavior of their superclasses. But, the method ifTrue: is implemented on True, and all it does is evaluate the block passed as its argument. It doesn’t need to check if the boolean is true. It knows it’s true because this is a method on the True class, and only true is of the class True. Conversely, the method ifTrue: on the False class does nothing.

A block boxes up code, and it only runs the code if someone asks for it. It gives us a way of making it optional to execute the code at all.

Since both True and False support the ifTrue: method, but ifTrue: only evaluates its argument on True, this allows us to conditionalize the execution of a block, with nothing but objects and methods!

Let’s put this together with our math from before, adding the operators for comparing numbers:

" Mathematical sanity checks "
40 < 42 ifTrue: [
    '40 is less than 42' displayNl].
40 < 24 ifTrue: [
    '40 is less than 24???' displayNl].
10 + 10 > 19 ifTrue: [
    '10 + 10 is greater than 19' displayNl].
50 >= 50 ifTrue: [
    '50 is greater than or equal to 50' displayNl].
10 + 10 = 20 ifTrue: [
    '10 + 10 is 20' displayNl].

Output:

40 is less than 42
10 + 10 is greater than 19
50 is greater than or equal to 50
10 + 10 is 20

Remember when using these operators that they simply evaluate left-to-right. There is still no operator precedence:

10 + 10 > 10 + 9 ifTrue: [
    'This looks like fine math to me' displayNl].

This produces an error:

Object: true error: did not understand #+
MessageNotUnderstood(Exception)>>signal (ExcHandling.st:254)
True(Object)>>doesNotUnderstand: #+ (SysExcept.st:1448)
UndefinedObject>>executeStatements (example.st:1)

These error messages can be a bit dense, but only the first line really matters: true doesn’t know how to +, and because Smalltalk evaluates left-to-right, we tried to perform + 9 on the true that was a result of 10 + 10 > 10. As usual, we can fix this with more parentheses:

10 + 10 > (10 + 9) ifTrue: [
    '20 is greater than 19' displayNl].

Output: 20 is greater than 19

You can build a lot out of ifTrue:, but it doesn’t have an “else” branch. Booleans also have two other methods, ifFalse: and ifTrue:ifFalse:, which cover other cases:

100 > 10 ifFalse: [
    '100 isn''t greater than 10???' displayNl].

" There is no universally-accepted style
  for how to indent ifTrue:ifFalse:.
  Just be consistent in your own code.
"
10 >= 10
    ifTrue: [
        '10 is greater than or equal to 10' displayNl
    ] ifFalse: [
        '10 isn''t greater than or equal to 10???' displayNl
    ].

Output: 10 is greater than or equal to 10

These hopefully behave as you would expect them to, but note how the call to ifTrue:ifFalse: looks. First is the operator which generates the true on which we’ll be calling the method. Then comes ifTrue:, then comes the true argument value (a block), then comes ifFalse:, and then comes the false argument value (another block). This style of method call syntax is Smalltalk’s second greatest difference from more popular languages. Because parameter names and argument values can be intermixed in this way, it’s very important to make sure you put a dot at the end of your statements. If you don’t, the error messages can be extremely confusing.

Generally speaking, Smalltalk programmers try to name their methods such that, with the actual argument values in place, it reads a bit like an English sentence. For instance, on arrays (which we will look at in more detail later), the method to put a value into the array at a given location is at:put:. This would be written my-array at: the-position put: my-value. This reads as “in my array at the position put my value”.

Now we’ve done conditions, but not loops. The simplest way to loop in Smalltalk is over a numerical range:

1 to: 10 do: [:x|
    x displayNl
].

Output:

1
2
3
4
5
6
7
8
9
10

Note our new addition to blocks. Like methods, blocks can take arguments. Block arguments are labeled with : at the beginning of the block, and the list of block arguments is ended with |.

Of course, this can be mixed with math:

1 to: 2*2 do: [:x|
    x displayNl
]

Output:

1
2
3
4

What if you need to do a more sophisticated loop? Like any other programming language, it’s possible to loop while some condition is true (or while some condition is false), but to learn how to do that, first we’ll have to introduce variables, so our condition can actually change.

Variables and Assignment

To declare variables, surround them in pipes (|). To assign to them, use :=.

| x y |
1 to: 4 do: [:z|
    x := z * 2.
    y := x * 2.
    y displayNl.
].

Output:

4
8
12
16

Of course, this isn’t an especially interesting use of variables. Let’s use variables and math to make a more interesting loop:

| x |
x := 1.
[ x < 64 ] whileTrue: [
    x displayNl.
    x := x * 2.
].

Output:

1
2
4
8
16
32

One major difference from our conditionals (ifTrue:ifFalse:) is worth noting: our condition itself is in a block! Why did we need it in a block here, and why didn’t we need it in a block before? The answer is simple, but not obvious: anything that needs to be run repeatedly needs to be in a block. In order to check whether x < 64 multiple times, we need to put it in a block, so that block can be made to run repeatedly, re-checking whether x < 64 every time around the loop. Of course, the block given as an argument to whileTrue: is also evaluated multiple times! There’s also a whileFalse:, which behaves as you’d expect.
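Since whileFalse: loops while its condition block evaluates to false, the loop above can be rewritten (a small sketch) simply by negating the condition:

| x |
x := 1.
[ x >= 64 ] whileFalse: [
    x displayNl.
    x := x * 2.
].

Output:

1
2
4
8
16
32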

Exercise 2. Using the loops we’ve seen here, write a program to output the Fibonacci sequence.

Containers

Smalltalk has various standard containers, including lists, arrays, strings (which are arrays of characters), and dictionaries.

Lists

Although various Smalltalk classes are implemented as lists, the most frequently useful list type is OrderedCollection. Until now, we’ve only created objects implicitly—numbers are objects, blocks are objects—but never explicitly built an object from a class. As with everything, the way to build an object from a class in Smalltalk is to call a method on the class; usually, the new method. Thus, we create an OrderedCollection with OrderedCollection new:

| lst |
lst := OrderedCollection new.
lst displayNl.

Output: OrderedCollection ()

The display format for an OrderedCollection is simply OrderedCollection followed by its elements in parentheses. Add elements to an OrderedCollection with add::

| lst |
lst := OrderedCollection new.
lst add: 42.
lst displayNl.

Output: OrderedCollection (42 )

You can also use addFirst: to add to the beginning of the list instead of the end. If you have another collection (whether an OrderedCollection or not), you can add all of its elements with addAll:. You can concatenate two lists into a new list with the , operator:

| lsta lstb lstc |
lsta := OrderedCollection new.
lsta add: 1.
lstb := OrderedCollection new.
lstb add: 3.
lstc := lsta , lstb.
lstc add: 9.
lsta displayNl.
lstb displayNl.
lstc displayNl.

Output:

OrderedCollection (1 )
OrderedCollection (3 )
OrderedCollection (1 3 9 )

Alternatively, you can use with: (or with:with:, etc.) while building the OrderedCollection to create a list and add elements in one step. Loop over the list with do::

| lst |
lst := OrderedCollection
    with: 42
    with: 12.
lst do: [:v|
    (v < 20) displayNl.
].

Output:

false
true

You can access members of the list directly with at:, and get the number of elements in the list with size:

| lst i |
lst := OrderedCollection with: 100 with: 9.
i := 1.
[ i <= lst size ] whileTrue: [
    (lst at: i) displayNl.
    i := i + 1.
].
lst at: i.

Output:

100
9
Object: OrderedCollection new: 16 "<0x...>" error: Invalid index 3: index out of range
...

Note that lists—and all other collections in Smalltalk—are 1-indexed, not 0-indexed, so the first element is at: 1. In this example, our error was using at: i after the loop, since at that point, i > lst size. As before, the first line tells us what we need to know: “Invalid index 3: index out of range”.

You can also change an element in an OrderedCollection with at:put::

| lst i |
lst := OrderedCollection with: 0 with: 0.
i := 1.
[ i <= lst size ] whileTrue: [
    lst at: i put: i*2.
    i := i + 1.
].
lst displayNl.

Output: OrderedCollection (2 4 )

Finally, you can remove the first or last element of a list with removeFirst or removeLast respectively. These methods return the removed element, so are useful for using the list as a queue or stack.
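For example, here is a sketch of using an OrderedCollection as a queue, removing elements from the front in insertion order:

| q |
q := OrderedCollection with: 1 with: 2 with: 3.
q removeFirst displayNl.
q removeFirst displayNl.
q displayNl.

Output:

1
2
OrderedCollection (3 )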

Lists are lists, so indexing with at: is slow, but adding elements with add: is fast.

Arrays

Arrays are of a fixed size, and their size must be declared when they are created. As such, rather than a simple new method, Arrays have new:, which takes the size as an argument:

| arr i |
arr := Array new: 10.
arr at: 1 put: 10.
arr displayNl.
i := 1.
[ i <= arr size ] whileTrue: [
    arr at: i put: i-1.
    i := i + 1.
].
arr displayNl.

Output:

(10 nil nil nil nil nil nil nil nil nil )
(0 1 2 3 4 5 6 7 8 9 )

Note that before we’ve put a value into a slot of the array, the value is nil. nil is similar to null, nullptr, or NULL in other programming languages. But, nil is a fully-fledged object with methods and a class. You can check if a value is nil with a normal comparison, e.g. x = nil, but it is more common to use the isNil method, e.g. x isNil, which returns true only for nil.
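For example, a short sketch of checking array slots for nil:

| arr |
arr := Array new: 2.
arr at: 1 put: 42.
(arr at: 1) isNil displayNl.
(arr at: 2) isNil displayNl.

Output:

false
true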

Arrays can also be created simply with array literals, which are written with braces, with the elements separated by dots:

| arr |
arr := {2. 4. 6. 8. 10}.
arr do: [:v|
    (v / 2) displayNl.
].

Output:

1
2
3
4
5

Because their size is fixed, Arrays don’t support add: or its variants. However, every other OrderedCollection method shown in the previous section behaves the same as in OrderedCollections, including concatenation. Arrays are arrays, so indexing and mutating with at: is fast, but adding elements is so slow that the language has simply prevented it!

It is common to convert back and forth between OrderedCollections and Arrays to benefit from each kind’s advantages when needed. All collections have an asOrderedCollection method to convert to an OrderedCollection, and an asArray method to convert to an array:

| x |
x := {10. 10. 30} asOrderedCollection.
x add: 40.
x := x asArray.
x at: 2 put: 20.
x displayNl.

Output: (10 20 30 40 )

Aside: Array and OrderedCollection both descend from the generic SequenceableCollection class, which defines the protocol shared by all ordered collections; OrderedCollection is also a fully implemented list class in its own right. This is why OrderedCollection has a strange, generic-sounding name, while Array has a very precise, descriptive name.

Strings and Characters

We’ve already seen strings: they’re delimited by single quotation marks ('). In fact, strings are just a special kind of read-only Array, an Array of Characters. Anything you can do with an Array, you can do with a String, except that you cannot change it:

| s |
s := 'Hello, world!'.
s do: [:c|
    c displayNl.
].

Output:

H
e
l
l
o
,

w
o
r
l
d
!

We can convert a String to an Array and vice-versa, usually in order to make it modifiable. To specify an element of the Character type, prefix the character with a dollar sign ($):

| s |
s := 'Hello, world!' asArray.
s displayNl.
s at: 13 put: $?.
s := s asString.
s displayNl.

Output:

($H $e $l $l $o $,  $w $o $r $l $d $! )
Hello, world?

Any character after a $ will be taken as a character literal, so, for instance, you can get a literal newline with $ followed by a newline character. Using $ like this is usually considered to be in poor form, however, so the Character class also has a constructor that directly creates a newline character, lf:

| c |
c := Character lf.
'There will be two blank lines here:' displayNl.
c displayNl. " One blank line from c, the other from displayNl "
'^^^' displayNl.

Output:

There will be two blank lines here:

^^^

Dictionaries

Smalltalk also supports key-value maps, called Dictionarys. Any value can be used as a key, and a single Dictionary can contain keys of multiple types, though most dictionaries in practice only use one.

Create a Dictionary with new, and add key-value associations with at:put:, like other collections:

| dict |
dict := Dictionary new.
dict at: 'Hello' put: 'world'.
(dict at: 'Hello') displayNl.

Output: world

In addition, you can check whether a key is present with includesKey:, and remove a key-value pair with removeKey::

| dict |
dict := Dictionary new.
dict at: 'Hello' put: 'world'.
[ dict includesKey: 'Hello' ] whileTrue: [
    (dict at: 'Hello') displayNl.
    dict removeKey: 'Hello'.
].

Output: world

You can loop over Dictionarys with do:, like other collections, but the block will only receive the values, not the keys. To loop over both keys and values, use keysAndValuesDo:, which takes a two-argument block:

| dict |
dict := Dictionary new.
dict at: 'Hello' put: 'world'.
dict at: 42 put: 12.
dict keysAndValuesDo: [:key :value|
    'At key ' display.
    key display.
    ' the dictionary has the value ' display.
    value displayNl.
].

Output:

At key Hello the dictionary has the value world
At key 42 the dictionary has the value 12

Note that in two-argument blocks, the space between an argument name and the next : is mandatory.

Creating Your Own Classes

Until this point, our focus has been on learning the syntax of Smalltalk and how to use its built-in types. Now, let’s make our own type. We will make a “reverse Polish notation” calculator.

If you’re not familiar with reverse Polish notation (RPN) calculators, they represent their input as a list of commands that operate on a stack. A number is a command to push that number onto the stack, and an operator is a command to pop two values from the stack, perform the relevant operation, and then push the result onto the stack. For instance, the mathematical expression \(1 + 2 \times 3\) is written in RPN as 1 2 3 * +, and \((1 + 2) \times 3\) is written as 1 2 + 3 *.

Naturally, our RPN calculator will need a stack. It will need a method to push a number onto the stack, and methods for each of the supported operators.

The first question is, how do we create a new class? Naturally, we call a method on an existing class! Namely, the subclass: method. As there’s nothing in particular our RPN calculator should be a subclass of, we’ll just make it a subclass of Object, the base class:

Object subclass: #RPNCalculator.

We can verify that our new class does in fact exist:

Object subclass: #RPNCalculator.
RPNCalculator new displayNl.

Output: a RPNCalculator

"a RPNCalculator" is simply the output of Object>>display, the general-purpose display method that works—albeit not usefully—on all objects. The odd # thing is called a symbol, and is essentially just how you represent a string that’s intended to be used as a name in Smalltalk. Other than this one use and in error messages, you’re unlikely to ever need symbols.

It’s a bit pointless to create a subclass with absolutely nothing in it, of course. You can go one-by-one through the desired methods of the class and add them all, but GNU Smalltalk offers a shorthand for defining a method in a somewhat more familiar style:

Object subclass: RPNCalculator [
    | stack |

    push: aNumber [
        stack add: aNumber.
        ^ aNumber
    ]
].

This program creates a class named RPNCalculator as a subclass of Object (and note that in this shorthand, no # is needed), makes instances of the RPNCalculator class have the field stack (actually called an instance variable in Smalltalk), and adds a method push: which adds its argument to the stack (assuming that the stack is simply an OrderedCollection). This snippet also shows how to return a value from a method: with a caret (^).

We haven’t made our constructor yet, but to do so, we’re going to have to understand one crucial detail about Smalltalk: Smalltalk instance variables (fields) are always private. There is no syntax to access an instance variable of another object. But, the constructor is a method of the class, not a method of objects of the class (think of OrderedCollection new), and so it can’t access the instance variables even of the object it’s creating. Because of this, it’s common for constructors to be in two parts: the class has the constructor, and instances have an initializer. The initializer can access its own instance variables, so the constructor calls the initializer. We’ll demonstrate this by adding a constructor for RPNCalculator:

Object subclass: RPNCalculator [
    | stack |

    RPNCalculator class >> new [
        | r |
        r := super new.
        r init.
        ^ r
    ]

    " Stack implemented as a list "
    init [
        stack := OrderedCollection new.
    ]

    push: aNumber [
        stack add: aNumber.
        ^ aNumber
    ]
].

There are a few things to note about this example:

  • On the new method declaration: new is not a method on instances of RPNCalculator, but on the RPNCalculator class itself. The way we specify the RPNCalculator class is RPNCalculator class. The way we specify that we’re implementing a method on that class rather than the class we’re currently defining is >>. Thus, we define the method with RPNCalculator class >> new.
  • Because we’re overriding new, which is how you create a new object, we’ve actually lost the ability to create the object. But, we can call the method we overrode, Object class >> new, with super, as super new.
  • The new method cannot access the stack variable of the object it just created, so instead it calls init.
  • The init method can access stack, so it creates the needed OrderedCollection.
  • The push: method can now safely assume that stack is an OrderedCollection.

Now, we’d like to be able to do math. Let’s start with addition:

Object subclass: RPNCalculator [
    (...)

    add [
        | x y r |
        y := stack removeLast.
        x := stack removeLast.
        r := x+y.
        stack add: r.
        ^ r
    ]
].

This method pops two elements from the stack with removeLast, then adds them, then pushes the result onto the stack, and returns the result. We can—and will—write methods for each of the usual operators, but they’ll all be doing the same thing: pop two values, do a calculation with them, then push the result. Surely there’s some way we can write this so that we don’t write the same code over and over again? Of course there is: we need to use blocks! Blocks allow us to box up a calculation, like an anonymous function, and boxing up the actual calculation so we can separate it from the stack behavior is exactly what we want to do. We can write a generic “binary operator” method that takes a block as its argument, and expects the block to do the correct calculation. As yet, although we’ve passed blocks to other methods, we haven’t actually evaluated blocks ourselves, so we’ll have to learn how to do that, too.

At this point, you should be able to guess how you evaluate a block. You call a method on it, of course! Specifically, the value method if it’s a zero-argument block, the value: method if it’s a one-argument block, and the value:value: method if it’s a two-argument block. Our block is supposed to perform a binary operator, so it’ll be a two-argument block:

Object subclass: RPNCalculator [
    (...)

    binary: operatorBlock [
        | x y r |
        y := stack removeLast.
        x := stack removeLast.
        r := operatorBlock value: x value: y.
        stack add: r.
        ^ r
    ]

    add [
        ^ self binary: [:x :y| x + y]
    ]
].

Our new implementation of binary: is almost exactly like our old implementation of add, but it takes an operatorBlock argument, and then calls that block with value:value:. Our add method is now simplified to one line: call the binary: method on self (the Smalltalk equivalent of “this”), with a block as an argument which simply adds together its two arguments. The result of self binary: is returned from add.

Now, let’s add the remaining methods for other mathematical operations, and finish our RPN calculator:

Object subclass: RPNCalculator [
    | stack |

    RPNCalculator class >> new [
        | r |
        r := super new.
        r init.
        ^ r
    ]

    " Stack implemented as a list "
    init [
        stack := OrderedCollection new.
    ]

    push: aNumber [
        stack add: aNumber.
        ^ aNumber
    ]

    binary: operatorBlock [
        | x y r |
        y := stack removeLast.
        x := stack removeLast.
        r := operatorBlock value: x value: y.
        stack add: r.
        ^ r
    ]

    add [
        ^ self binary: [:x :y| x + y]
    ]

    sub [
        ^ self binary: [:x :y| x - y]
    ]

    mul [
        ^ self binary: [:x :y| x * y]
    ]

    div [
        ^ self binary: [:x :y| x / y]
    ]
].

Aside: It is generally considered good form in Smalltalk to have many short methods, rather than few long methods. That is, break methods down into their minimal useful components.

Now we have a few options for using and testing RPNCalculator. Assuming we’ve stored this in a file named rpncalc.st, we can use it on the GNU Smalltalk REPL by running gst with the file as an argument:

$ gst rpncalc.st
GNU Smalltalk ready
st> | x |
st> x := RPNCalculator new.
a RPNCalculator
st> x push: 1.
1
st> x push: 2.
2
st> x push: 3.
3
st> x mul.
6
st> x add.
7
st>

As you may guess from that command line, you can also write tests in a separate .st file, and run both with gst rpncalc.st tests.st. Finally, you can write more sophisticated tests using the built-in testing framework SUnit.

Subclasses

Our RPNCalculator is largely agnostic about the kind of number you pass in. GNU Smalltalk actually has several ways of storing numbers: integers, precise fractions, and floating-point numbers. Unfortunately, the different kinds of numbers don’t always behave well when you mix and match them, so if we want our calculator to be robust, we should make sure that incoming numbers are of a predictable type. We can do this by making subclasses of RPNCalculator which override push: to convert to the desired type:

RPNCalculator subclass: FractionalRPNCalculator [
    push: aNumber [
        ^ super push: (aNumber asFraction)
    ]
].

RPNCalculator subclass: FloatRPNCalculator [
    push: aNumber [
        ^ super push: (aNumber asFloat)
    ]
].

Unsurprisingly, asFraction converts a number to a precise fraction, and asFloat converts a number to a float.

We can use one of these subclasses by calling FractionalRPNCalculator new or FloatRPNCalculator new. Note that we didn’t need to re-implement new for each; they inherited their new from their superclasses. And, we didn’t need to implement binary:, add, sub, mul, or div, for the same reason. If we needed to specialize one or all of these for these types, we could have overridden them, but all of these types do math the same, so we didn’t need to.
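For example, here is a sketch of the fractional calculator in use (the exact display format of fractions is determined by GNU Smalltalk, so no output is shown here):

| c |
c := FractionalRPNCalculator new.
c push: 1.
c push: 3.
(c div) displayNl.

This computes the precise fraction one-third, rather than a rounded floating-point value.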

Note that the version of GNU Smalltalk installed on the student Linux environment has an unfortunate bug in its code to display floating point numbers, and because of this, printing floating point numbers will often fail. This is easily worked around by loading in the float printing code from a later version of GNU Smalltalk. This code is available on the course web site’s assignments tab, named floatfix.st. Load it before your own code, such as gst floatfix.st rpncalc.st.

Returning and Blocks

Earlier we said that blocks are similar but not identical to anonymous functions, and the major difference is how they interact with returning. When you return (with ^) inside of a block, the surrounding method returns, not just the block! For instance, consider the output of this code:

Object subclass: Funny [
    t: block [
        'I''m about to run the block!' displayNl.
        block value ifTrue: [
            ^true
        ].
        'The block''s value was false!' displayNl.
        ^false
    ]
].

| x |
x := Funny new.
x t: [true].
x t: [false].

Output:

I'm about to run the block!
I'm about to run the block!
The block's value was false!

This is particularly useful for making chains of comparisons without getting too deeply nested, or for “switch”-like patterns:

runCommand: cmd [
    (cmd = 'push') ifTrue: [
        self push.
        ^self
    ].

    (cmd = 'pop') ifTrue: [
        self pop.
        ^self
    ].

    (cmd = 'rotate') ifTrue: [
        self rotate.
        ^self
    ].
]

Display

The display and displayNl methods use the displayString method to convert a value to a string. So if we want our RPNCalculator to display its stack when displayNl is called on it, we could extend it with a displayString method:

displayString [
    | r |
    r := 'RPNCalculator ('.
    stack do: [:x|
        r := r , (x displayString) , ' '.
    ].
    ^r , ')'
]

Note that this only affects display and displayNl, which are not how values are displayed in the interactive REPL. You can override printString for that.
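For instance, a one-line sketch (assuming the displayString method above is also installed on RPNCalculator):

printString [
    ^self displayString
]

With this in place, the REPL will show the calculator’s stack in the same format as displayNl.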

Debugging

By default, GNU Smalltalk gives stack traces, but nothing else for debugging. However, you can get a debugger (the MiniDebugger) by loading it into your GNU Smalltalk environment along with your own code, and then you will automatically enter the debugger when there is a problem:

$ gst /usr/share/gnu-smalltalk/examples/MiniDebugger.st floatfix.st rpncalc.st
Loading package DebugTools
GNU Smalltalk ready
st> | x |
st> x := RPNCalculator new.
a RPNCalculator
st> x push: 1.
1
st> x push: 0.
0
st> x div.
'1 error: The program attempted to divide a number by zero'
ZeroDivide(Exception)>>signal (ExcHandling.st:254)
SmallInt(Number)>>zeroDivide (SysExcept.st:1426)
SmallInt>>/ (SmallInt.st:277)
optimized [] in RPNCalculator>>div (tmp.st:42)
RPNCalculator>>binary: (tmp.st:24)
RPNCalculator>>div (tmp.st:42)
UndefinedObject>>executeStatements (a String:1)
^self activateHandler: (onDoBlock isNil and: [ self isResumable ])
(debug)

The (debug) prompt acts similarly to gdb: you can go up and down the callgraph, step through code, etc. Use help for a list of commands.

Rather than print, MiniDebugger has i (for “inspect”), which can inspect objects in great detail.

If you want to step through code easily, you’ll have to cause it to break, so that it enters the debugger. This is most easily done by adding the statement self halt to your problematic code.

GNU Smalltalk in this Course

This course will ask you to use Smalltalk to implement interpreters for (very simple) programming languages. We will not ask you to parse code. Abstract syntax trees will be built manually as Smalltalk objects, and examples given. In general, you should be writing the solution into one .st file, and tests in a separate file loaded afterward, and our input will also work in this way. For example, a simple test case for rpncalc.st might look like this:

| x |
x := RPNCalculator new.
x push: 10.
x push: 5.
x div = 2 ifFalse: [
    'Division fails!' displayNl.
].

And, if that’s in the file test.st, would be run as gst floatfix.st rpncalc.st test.st.

You may use only the libraries built into GNU Smalltalk, floatfix.st, and your own code.

More Resources

The content described in this document should be sufficient for everything needed in this course. If you feel that more functionality would help you, you can look at the GNU Smalltalk Library Reference and User’s Guide. If you’re interested in Smalltalk more broadly, you will probably want to look into a more conventional Smalltalk system such as Pharo or Squeak.

Fin

In the next module, we will begin formalizing programming languages, by introducing the \(\lambda\)-calculus.

Module 2: Untyped λ-Calculus

“Programming languages should be designed not by piling feature on top of feature, but by removing the weaknesses and restrictions that make additional features appear necessary.” — The Revised⁵ Report on the Algorithmic Language Scheme

When studying programming languages, the most important item a programming language theorist will be working with is an underlying model. Such models are usually defined in mathematical logic for a few reasons:

  • A mathematical logic provides a succinct and precise representation of the core mechanics, so we need not worry about incidental differences between languages while reasoning about a whole class of programming languages.
  • Using a mathematical logic gives us a framework to prove certain properties about a programming language.

A good model should be as simple as possible, yet powerful enough that we can use it to model a large class of programming languages. For functional languages, such a model does exist; it is known as the λ-calculus (Lambda calculus).

Aside: This is why many functional languages incorporate a λ in their logo.

Our goal in this chapter is to define the λ-calculus and demonstrate its utility in expressing entities you are already familiar with from programming languages, e.g. booleans, lists, and natural numbers. The λ-calculus itself is actually quite simple, and it uses the power of abstraction to represent all these features. We will look at the semantics of the λ-calculus informally in this module. The following module will revisit the concepts in this chapter and introduce formal semantics. The module after that will discuss adding types to the λ-calculus.

What we will show is that even though λ-calculus has a paucity of concepts, it can nonetheless express all interesting computations. This fact gives programming language designers a baseline understanding for when features of their language are computationally powerful.

Section 1: Definitions

The syntax of the λ-calculus is as follows, presented in Backus-Naur Form (BNF):

⟨Expr⟩ ::= ⟨Var⟩ | ⟨Abs⟩ | ⟨App⟩ | (⟨Expr⟩)
⟨Var⟩  ::= a | b | c | ...
⟨Abs⟩  ::= λ ⟨Var⟩ . ⟨Expr⟩
⟨App⟩  ::= ⟨Expr⟩ ⟨Expr⟩

The four rules define the four elements of the syntax of the λ-calculus: expressions, variables, abstractions, and applications.

  • An expression — more precisely, a λ-expression or λ-term — is either a variable, an abstraction, an application, or an expression surrounded by parentheses.
  • A variable is generally a single letter, although we might occasionally use longer identifier names for clarity.
  • An abstraction is indicated by the leading character λ (Greek lower case letter lambda), and has two parts: the variable and an expression (the body), separated by a dot (.).
  • An application is simply a concatenation of two expressions. The first is called the rator and the second is called the rand (short for “operator” and “operand”).

To bridge these concepts with terms you may be more familiar with, “abstractions” are essentially functions, and “applications” are essentially function calls. But, don’t take this equivalence too far: the behavior of abstractions and applications may not match your expectations if you assume they behave exactly as in a programming language you’re familiar with.

Since applications are simply expressions concatenated together, we need precedence and associativity rules to understand how to read them. In the absence of parentheses, λ-terms are parsed as follows:

  • Abstractions extend as far to the right as possible. For example, \( \lambda x.\, xy \) is parsed as \( \lambda x.\, (xy) \) and not as \( (\lambda x.\, x)y \).
  • Applications are left-to-right associative. For example, \( abc \) is \( (ab)c \), and not \( a(bc) \).

Example 1. Here are a few more examples to illustrate the precedence of λ-expressions:

λ-term                                    | Equivalent λ-term, with fewest parentheses necessary
\( \lambda x.\, ((xy)\lambda z.\, z) \)   | \( \lambda x.\, xy\lambda z.\, z \)
\( ((\lambda x.\, x)y) \)                 | \( (\lambda x.\, x)y \)
\( ((xw)(zy)) \)                          | \( xw(zy) \)
\( ((xy)\lambda x.\, z) \)                | \( xy\lambda x.\, z \)

Exercise 1. Verify that, for each term in the right column, removing any of the remaining parentheses would change the meaning of the expression.

We’ve now described the syntax of the λ-calculus, but syntax alone doesn’t tell us an expression’s meaning. We will now discuss the meaning of λ-expressions; more specifically, how to “compute” in the λ-calculus. Intuitively, we can see that a λ-expression consists of functions and calls, but more precisely:

  • An abstraction \( \lambda x.\, E \) denotes the function that takes an argument \( x \) and returns the expression \( E \).
  • An application \( MN \) denotes the function \( M \) applied to the argument \( N \).

Note that all abstractions have exactly one argument. We’ll see soon that this does not limit the expressibility of the λ-calculus.

The only “type” in the λ-calculus is a function. So, all expressions are understood to be functions, and thus expressions like \( xy \) are always legal; in this case, the expression denotes the application of the function \( x \) to the argument \( y \).

Aside: Note that we generally don’t give the functions defined in lambda calculus a name. That’s why many languages use the term “lambda” or “lambda functions” to refer to anonymous functions.

For the following sections, we will be talking about the operational semantics of λ-calculus in an informal way. The formal introduction of operational semantics will be seen in the next module.

Section 2: Free and Bound Variables

First we will start by discussing the simplest entity: variables. To be specific, we shall determine where variables obtain their meaning, and whether two occurrences of the same name refer to the same variable.

Consider the identity function: the simplest function, that just returns its only parameter. In the λ-calculus, it is denoted as \( \lambda x.\, x \). The \( x \) inside the body of the abstraction must refer to the same \( x \) in the variable position (the argument) of the abstraction. In formal terms, the latter is a binding occurrence of the former, and \( x \) is a bound variable. An occurrence of a variable that is not involved in a binding occurrence is called free.

Here are a few more examples to help build your intuition:

Example 2. In the λ-expression \( \lambda x.\, x(\lambda z.\, x)y \), both occurrences of \( x \) are bound to the abstraction having variable \( x \), and \( y \) is free.

Example 3. In the λ-expression \( (\lambda x.\, x)(\lambda z.\, x)y \), the first occurrence of \( x \) is bound to the abstraction \( (\lambda x.\, x) \). The second \( x \) and \( y \) are free variables.

Example 4. In the λ-expression \( abc \), all variables are free.

Informally, we can see that the set of bound variables of an expression \( E \) contains all variables which appear inside the abstractions that define them. This informal definition is fine for our understanding, but we will also define this property formally, as a formal definition can be used as a basis for proofs. Formal definitions tend to leverage the structural and recursive nature of the syntax; specifically, such a definition will structurally and recursively define the property on every kind of expression.

Definition 1. (Bound Variables) The set of bound variables of an expression \( E \), denoted \( BV[E] \), is defined as follows:

\[ \begin{aligned} BV[x] &= \emptyset \\ BV[\lambda x.\, L] &= BV[L] \cup \{x\} \\ BV[MN] &= BV[M] \cup BV[N] \end{aligned} \]

Variable \( x \) is bound in expression \( E \) if \( x \in BV[E] \).

Definition 2. (Free Variables) The set of free variables of an expression \( E \), denoted \( FV[E] \), is defined as follows:

\[ \begin{aligned} FV[x] &= \{x\} \\ FV[\lambda x.\, L] &= FV[L] \setminus \{x\} \\ FV[MN] &= FV[M] \cup FV[N] \end{aligned} \]

Variable \( x \) is free in expression \( E \) if \( x \in FV[E] \). An expression \( E \) is closed if \( FV[E] = \emptyset \). A closed expression is called a combinator.

Note that it is possible for a variable to be both free and bound; however, each occurrence of a variable in an expression is either free or bound, but not both. This is why our definition of “closed” depends on \( FV \) instead of \( BV \).

Example 5. In the expression \( x\lambda x.\, x \), \( x \) is both free and bound. The first occurrence of \( x \) is free, and the second one is bound.
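To see how Definition 2 operates mechanically, here is the computation of the free variables of the expression from Example 3:

\[ \begin{aligned} FV[(\lambda x.\, x)(\lambda z.\, x)y] &= FV[(\lambda x.\, x)(\lambda z.\, x)] \cup FV[y] \\ &= FV[\lambda x.\, x] \cup FV[\lambda z.\, x] \cup \{y\} \\ &= (FV[x] \setminus \{x\}) \cup (FV[x] \setminus \{z\}) \cup \{y\} \\ &= \emptyset \cup \{x\} \cup \{y\} \\ &= \{x, y\} \end{aligned} \]

As expected, this matches Example 3: the second \( x \) and \( y \) are the free variables.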

Exercise 2. Provide a modified definition of bound variables, so one can track which expression a bound variable is bound to. Note: a variable can be bound to multiple expressions!

Bound variables get their meaning from the binding occurrences; on the other hand, free variables do not have a meaning within an expression. For free variables to be meaningful, we would have to rely on an external definition, and of course, if we included that external definition as part of the expression, then the variable would now be bound. Thus, an expression being a combinator means that the computation can proceed without any additional information.

Example 7. For analogy, consider the following C function:

int f(int x) {
    return x + y;
}

In the function f, x is bound as the only parameter and y is free. This fragment compiles only if y is declared elsewhere (for example, as a global variable); the computation can proceed only once y is given a meaning externally.

Section 3: Substitution and Reduction

Computation in the λ-calculus is based on the notion of reduction. An expression is reduced until no further reductions are possible, or some other stopping condition is reached. The primary reduction mechanism in the λ-calculus is known as β-reduction (beta reduction).

Definition 3. (β-redex) An expression of the form \( (\lambda x.\, M)N \) is known as a (β-)redex. (Redex is short for “reducible expression”; the plural is redices.)

Consider the redex \( (\lambda x.\, M)N \). Following our usual analogy, \( \lambda x.\, M \) is a function with parameter \( x \) and argument \( N \). The expectation is that this evaluates to \( M \) with \( N \) substituted for every free occurrence of \( x \).

We use the notation \( M[N/x] \) to mean the substitution of \( N \) for all free occurrences of \( x \) in \( M \).

Definition 4. (β-reduction) Let \( M \) and \( N \) be λ-expressions, \( x \) a variable. The relation \( \to_\beta \) (β-reduction) is defined by the rule:

\[ (\lambda x.\, M)N \to_\beta M[N/x] \]

Further, if \( C[(\lambda x.\, M)N] \) denotes an expression \( C \) in which \( (\lambda x.\, M)N \) appears as a subterm, then:

\[ C[(\lambda x.\, M)N] \to_\beta C[M[N/x]] \]

A few things worth noting:

  • The notation \( C[M] \) refers to a specific occurrence of subterm \( M \) in \( C \), not to all occurrences. Thus if \( E \to_\beta E' \), then \( C[E] \to_\beta C[E'] \) means the reduction of a single occurrence.
  • This definition does not specify which redex to take for reduction; any valid redex inside an expression can be chosen.

We also introduce the following notations:

  • \( \to^n_\beta \) denotes exactly \( n \) steps of β-reduction.
  • \( \to^*_\beta \) denotes 0 or more steps of β-reduction.
  • \( \to^+_\beta \) denotes 1 or more steps of β-reduction.
  • \( \leftarrow_\beta \) denotes β-expansion: \( A \leftarrow_\beta B \) iff \( B \to_\beta A \).
  • \( =_\beta \) denotes β-equivalence: \( A =_\beta B \) iff \( A \) can be converted to \( B \) by some (possibly empty) sequence of applications of \( \to_\beta \) and \( \leftarrow_\beta \).

Now we need to take a step back and look at substitution. The substitution process requires a separate rule for each form of expression:

  • If \( E \) is a variable and the variable is \( x \), replace the variable.
  • If \( E \) is a variable and the variable is not \( x \), \( E \) does not change.
  • If \( E \) is an application, perform substitution on the rator and rand.
  • If \( E \) is an abstraction and the variable for it is \( x \), \( E \) does not change since occurrences of \( x \) must not be free.
  • If \( E \) is an abstraction and the variable for it is not \( x \), perform the substitution on the body.

Definition 5. (Substitution, provisional) Let \( E \) and \( T \) be λ-expressions and \( x \) be a variable. Denote \( E[T/x] \) the substitution of \( T \) for \( x \) in \( E \):

\[ \begin{aligned} x[T/x] &= T \\ y[T/x] &= y \quad (\text{if } y \neq x) \\ (MN)[T/x] &= M[T/x]\,N[T/x] \\ (\lambda x.\, M)[T/x] &= \lambda x.\, M \\ (\lambda y.\, M)[T/x] &= \lambda y.\, M[T/x] \quad (\text{if } y \neq x) \end{aligned} \]

Here are a few examples of β-reduction and substitution in action:

Example 8.

\[ (\lambda x.\, x)a \to_\beta x[a/x] = a \]

Example 9.

\[ \begin{aligned} (\lambda x.\, \lambda y.\, x)ab &\to_\beta (\lambda y.\, x)[a/x]b \\ &= (\lambda y.\, x[a/x])b \\ &= (\lambda y.\, a)b \\ &\to_\beta a[b/y] = a \end{aligned} \]

Example 10.

\[ \begin{aligned} (\lambda x.\, \lambda y.\, y)ab &\to_\beta (\lambda y.\, y)[a/x]b \\ &= (\lambda y.\, y[a/x])b \\ &= (\lambda y.\, y)b \\ &\to_\beta y[b/y] = b \end{aligned} \]

The last two examples illustrate that abstractions like \( \lambda x.\, \lambda y.\, x \) can behave like multi-argument functions (returning the first argument), while \( \lambda x.\, \lambda y.\, y \) returns the second. This style of reduction-by-substitution allows us to build multi-parameter functions as a special case of single-parameter functions.

Aside: You might have heard the term currying, named after Haskell Curry, which is the process that converts a function taking multiple arguments into nested one-parameter functions which return functions accepting the remaining arguments. In λ-calculus, this is the most natural style of passing multiple arguments.
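For a concrete picture of currying, here is a two-argument Python function alongside its curried form (a small illustration of our own, not from the notes):

```python
def add(x, y):
    # Ordinary two-argument function
    return x + y

# Curried form, as in lam x. lam y. x + y: a one-parameter function
# returning a one-parameter function
add_c = lambda x: lambda y: x + y

# Partial application: supplying only the first argument yields a function
add2 = add_c(2)
```

Here `add_c(2)(3)` and `add(2, 3)` compute the same value, but `add_c(2)` on its own is a meaningful function awaiting its second argument.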

Earlier, we said that the identity function is \( (\lambda x.\, x) \), and now it has appeared as \( (\lambda y.\, y) \). Just as renaming a function's parameter does not change its behavior, replacing the identifier in the variable part of an abstraction, together with all of its bound occurrences, with another identifier should leave the abstraction's behavior unchanged. We express this with a principle known as α-conversion (alpha conversion):

Definition 6. (α-conversion) Let \( E \) be a λ-expression. We define the relation \( =_\alpha \) (α-equivalence) by the rule:

\[ \lambda x.\, E =_\alpha \lambda y.\, E[y/x] \]

given that \( y \notin FV[E] \). If \( C[M] \) is an expression \( C \) containing \( M \) as a subterm, and \( M =_\alpha N \), then \( C[M] =_\alpha C[N] \). Further, \( =_\alpha \) is an equivalence relation. Finally, α-conversion is the replacement of a term with an α-equivalent term.

Applying α-conversions to an expression should not change its behavior. However, consider what happens with the provisional definition of substitution when we apply \( (\lambda x.\, \lambda y.\, x) \) to \( y \) and \( z \):

\[ \begin{aligned} (\lambda x.\, \lambda y.\, x)yz &\to_\beta (\lambda y.\, x)[y/x]z \\ &= (\lambda y.\, x[y/x])z \\ &= (\lambda y.\, y)z \\ &\to_\beta y[z/y] = z \end{aligned} \]

The expectation is that \( \lambda x.\, \lambda y.\, x \) returns the first argument, yet we obtained \( z \) (the second). What happened is that the free variable \( y \) was captured by the inner \( \lambda y \) abstraction during substitution. We call this behavior dynamic binding. We want static binding instead: binding occurrences should never change throughout the computation. The fix requires a corrected definition of substitution:

Definition 7. (Substitution, corrected) Let \( E \) and \( T \) be λ-expressions and \( x \) be a variable. Denote \( E[T/x] \) the substitution of \( T \) for \( x \) in \( E \):

\[ \begin{aligned} x[T/x] &= T \\ y[T/x] &= y \quad (\text{if } y \neq x) \\ (MN)[T/x] &= M[T/x]\,N[T/x] \\ (\lambda x.\, M)[T/x] &= \lambda x.\, M \\ (\lambda y.\, M)[T/x] &= \lambda y.\, M[T/x] \quad (\text{if } y \neq x,\, y \notin FV[T]) \\ (\lambda y.\, M)[T/x] &= \lambda z.\, M[z/y][T/x] \quad (\text{if } y \neq x,\, y \in FV[T];\, z \text{ is a "new" variable}) \end{aligned} \]

This definition is mostly the same as the previous one, except for the last case: instead of letting the abstraction capture the variable, we rename the variable in the abstraction and its bound occurrences beforehand to some name that was never used before.

Example 11. Recomputing \( (\lambda x.\, \lambda y.\, x)yz \) with the corrected substitution:

\[ \begin{aligned} (\lambda x.\, \lambda y.\, x)yz &\to_\beta (\lambda y.\, x)[y/x]z \\ &= (\lambda a.\, x[a/y][y/x])z \\ &= (\lambda a.\, x[y/x])z \\ &= (\lambda a.\, y)z \\ &\to_\beta y[z/a] = y \end{aligned} \]
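Aside: Definition 7 can be implemented directly. Below is a sketch in Python; the tuple term representation, the `fv` helper, and the fresh-name generator are our own choices for illustration:

```python
import itertools

_fresh = itertools.count()

def fresh():
    """Generate a variable name that was never used before."""
    return f"_v{next(_fresh)}"

def fv(e):
    """FV[E] for terms ('var', x) | ('lam', x, body) | ('app', m, n)."""
    tag = e[0]
    if tag == 'var':
        return {e[1]}
    if tag == 'lam':
        return fv(e[2]) - {e[1]}
    return fv(e[1]) | fv(e[2])

def subst(e, t, x):
    """E[T/x], following the corrected Definition 7 (capture-avoiding)."""
    tag = e[0]
    if tag == 'var':
        return t if e[1] == x else e
    if tag == 'app':
        return ('app', subst(e[1], t, x), subst(e[2], t, x))
    y, body = e[1], e[2]
    if y == x:
        return e                      # (lam x. M)[T/x] = lam x. M
    if y in fv(t):                    # last case: rename y to avoid capture
        z = fresh()
        body = subst(body, ('var', z), y)
        y = z
    return ('lam', y, subst(body, t, x))

# Example 11's key step: (lam y. x)[y/x] becomes lam a. y for a fresh a,
# rather than the capturing result lam y. y
res = subst(('lam', 'y', ('var', 'x')), ('var', 'y'), 'x')
```

The result `res` is an abstraction whose bound variable is a fresh name and whose body is the free variable `y`, exactly as in Example 11.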

Section 4: Reduction and Normal Forms

Computation in the λ-calculus is a series of reductions. Let’s start with an obvious option: reduce everything. If the expression contains a β-redex, β-reduce it, and repeat the process until no β-redex is found.

Definition 8. (β-normal form) A λ-expression with no β-redex is in β-normal form (β-NF).

However, this process may not terminate. It is possible for reduction of a β-redex to produce a β-redex ad infinitum:

Example 12.

\[ (\lambda x.\, xx)(\lambda x.\, xx) \to_\beta xx[(\lambda x.\, xx)/x] = (\lambda x.\, xx)(\lambda x.\, xx) \]

This expression has no β-normal form since reduction of the only β-redex yields the original expression.

For that reason, it is worthwhile to consider other kinds of reduction rules, or reduce to other normal forms. Recall that in Racket, (map (lambda (x) (f x)) lst) can always be simplified to (map f lst). The intuition here is: if two functions accept the same set of values as argument and produce the same value when supplied the same argument, then these two functions are equal. Put in λ-calculus: \( (\lambda x.\, fx)y = fy \), so \( \lambda x.\, fx = f \).

Aside: This intuition is called function extensionality.

We formalize this in λ-calculus with another kind of reduction, called η-reduction (eta reduction):

Definition 9. (η-reduction) η-reduction is denoted by the following rule:

\[ \lambda x.\, Mx \to_\eta M \quad (\text{if } x \notin FV[M]) \]

If \( C[M] \) denotes an expression in which \( M \) occurs as a subterm, and \( M \to_\eta M' \), then \( C[M] \to_\eta C[M'] \). Analogously to β-expansion and β-equivalence, we can define an η-expansion relation \( \leftarrow_\eta \) and an η-equivalence relation \( =_\eta \). A reduction in which both β- and η-reductions may occur is called βη-reduction.

Our goal is to reach some stopping condition defined syntactically. These conditions are called normal forms.

Exercise 3. Formally define η-normal form and βη-normal form.

Among the various alternative normal forms, the most important for our purpose is weak normal form (WNF):

Definition 10. (Weak Normal Form) An expression \( E \) is in weak normal form (WNF) if every β-redex in \( E \) lies within the body of some abstraction.

The intuition behind WNF is that, in real programming languages, computation does not occur inside a function until it is called. Hence, we do not consider redices that occur inside an abstraction as candidates for reduction until the abstraction itself has been supplied with an argument.

Example 13. The term \( \lambda x.\, (\lambda y.\, y)(\lambda z.\, (\lambda w.\, w)z) \) is in WNF but not in β-normal form. The term contains two β-redices: \( (\lambda y.\, y)(\lambda z.\, (\lambda w.\, w)z) \) and \( (\lambda w.\, w)z \), but both lie within the \( \lambda x \) abstraction (the outer one) or the \( \lambda z \) abstraction (the inner one).

Every term in β-normal form is also in WNF, as WNF is a strictly weaker criterion. Whenever we speak of “normal form” without qualification, we are referring to β-normal form.

Section 5: Order of Evaluation

5.1 The Church-Rosser Theorem

So far, we have not specified which redex to choose at any given time. For example, in the expression \( (\lambda y.\, y)((\lambda x.\, x)b) \), we have two choices:

\[ \begin{aligned} (\lambda y.\, y)((\lambda x.\, x)b) &\to_\beta (\lambda y.\, y)(x[b/x]) = (\lambda y.\, y)b \to_\beta b \\ (\lambda y.\, y)((\lambda x.\, x)b) &\to_\beta y[(\lambda x.\, x)b/y] = (\lambda x.\, x)b \to_\beta b \end{aligned} \]

Both paths lead to the same result. Is this true for all λ-expressions which have a normal form? Luckily, yes:

Theorem 1. (Church-Rosser, 1936) For λ-expressions \( E_1 \), \( E_2 \), and \( E_3 \), if \( E_1 \to^*_\beta E_2 \) and \( E_1 \to^*_\beta E_3 \), then there exists an expression \( E_4 \) such that \( E_2 \to^*_\beta E_4 \) and \( E_3 \to^*_\beta E_4 \) (up to α-equivalence).

Aside: A formal proof of the Church-Rosser Theorem is available in a separate document. You are not required to be familiar with it for this course.

The Church-Rosser Theorem guarantees that when faced with a choice of redex, it is always possible to arrive at the same final expression regardless of your choice. It follows that an expression has at most one β-normal form. Note that the theorem does not guarantee the existence of a β-normal form; it merely states that if the reduction terminates, it will reach a unique normal form.

Corollary 1. A λ-expression can reduce to at most one β-normal form (up to α-equivalence).

Proof. Let \( E_1 \) be an expression that reduces to normal forms \( E_2 \) and \( E_3 \). By the Church-Rosser Theorem, there is an expression \( E_4 \) such that \( E_2 \to^*_\beta E_4 \) and \( E_3 \to^*_\beta E_4 \). However, \( E_2 \) and \( E_3 \) are both normal forms, hence irreducible. Therefore, the only possible reduction to \( E_4 \) from \( E_2 \) and \( E_3 \) must take zero steps, so \( E_2 =_\alpha E_3 =_\alpha E_4 \). \( \square \)

Although the Church-Rosser Theorem guarantees a unique β-normal form, a λ-expression may have several different instances of other kinds of normal forms.

Exercise 4. Prove or disprove the following statement: For λ-expressions \( E_1, E_2, E_3 \), if \( E_1 \to_\beta E_2 \) and \( E_1 \to_\beta E_3 \), then there exists an expression \( E_4 \) such that \( E_2 \to_\beta E_4 \) and \( E_3 \to_\beta E_4 \) (up to α-equivalence). Note the absence of asterisk superscripts.

5.2 Reduction Strategies

Although the Church-Rosser Theorem guarantees that no matter how we choose redices, we can never reach an expression from which we can’t reach the unique β-normal form (given that one exists), most real-world programming languages have much more clearly defined policies regarding order of evaluation. We will examine two such reduction strategies.

Applicative Order Reduction (AOR): We always choose the leftmost, innermost redex at each step. A redex is innermost if it contains no other redices.

Example 14. Reduction of \( (\lambda x.\, fx)((\lambda y.\, gy)z) \) under AOR:

\[ \begin{aligned} (\lambda x.\, fx)((\lambda y.\, gy)z) &\to_\beta (\lambda x.\, fx)((gy)[z/y]) \\ &= (\lambda x.\, fx)(gz) \\ &\to_\beta (fx)[gz/x] \\ &= f(gz) \end{aligned} \]

In this example we can see why AOR is similar to the programming languages you have seen: the argument to an abstraction is reduced to normal form before it is substituted. For this reason, AOR is sometimes dubbed call-by-value, and demonstrates a semantic property called eager evaluation.

Example 15. Failure of AOR to reach normal form:

\[ (\lambda x.\, y)((\lambda x.\, xx)(\lambda x.\, xx)) \to_\beta (\lambda x.\, y)((xx)[(\lambda x.\, xx)/x]) = (\lambda x.\, y)((\lambda x.\, xx)(\lambda x.\, xx)) \]

This expression does have a normal form (\( y \)), yet we cannot reach it using AOR. The strategy loops forever reducing the argument.

Normal Order Reduction (NOR): We always choose the leftmost, outermost redex at each step. A redex is outermost if it is not contained in any other redex.

Example 16. Reduction of \( (\lambda x.\, fx)((\lambda y.\, gy)z) \) under NOR:

\[ \begin{aligned} (\lambda x.\, fx)((\lambda y.\, gy)z) &\to_\beta (fx)[(\lambda y.\, gy)z/x] \\ &= f((\lambda y.\, gy)z) \\ &\to_\beta f((gy)[z/y]) \\ &= f(gz) \end{aligned} \]

One of the properties of NOR is that arguments to a function are not evaluated until they are needed — the formal parameter is replaced verbatim with the argument without first simplifying it. NOR is sometimes called call-by-name, and demonstrates lazy evaluation.

Example 17. Under NOR, \( (\lambda x.\, y)((\lambda x.\, xx)(\lambda x.\, xx)) \) reduces correctly:

\[ (\lambda x.\, y)((\lambda x.\, xx)(\lambda x.\, xx)) \to_\beta y[(\lambda x.\, xx)(\lambda x.\, xx)/x] = y \]

While NOR reduces this expression immediately to \( y \), AOR immediately gets caught in an infinite reduction of the argument.
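Aside: The contrast between eager and lazy argument handling can be mimicked in Python, itself an eager (call-by-value) language. Wrapping an argument in a zero-argument lambda (a thunk) delays its evaluation, much as call-by-name does. This is our own analogy, not part of the notes:

```python
def const_y(arg):
    # Analogue of (lam x. y): ignores its argument entirely
    return 'y'

def omega():
    # Analogue of (lam x. xx)(lam x. xx): never terminates
    while True:
        pass

# const_y(omega()) would loop forever: Python evaluates arguments before
# the call, like Applicative Order Reduction. Passing a thunk instead
# mimics Normal Order Reduction: the argument is never forced.
result = const_y(lambda: omega())
```

Since `const_y` never calls its argument, the thunked version returns `'y'` immediately, just as NOR does above.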

Theorem 2. (Standardization, 1958) If an expression has a normal form, then Normal Order Reduction is guaranteed to reach it.

Aside: In a purely functional setting, the Church-Rosser Theorem guarantees that the reduction will lead to the unique normal form. However, in any programming language with state (such as mutable variables, object fields, or I/O), a difference in the order of evaluation could cause completely different behaviors. A notorious example in C and C++ is the order of evaluation for arguments of functions and operators — the C++ standard does not specify this order, so the choice is entirely up to the particular compiler.

Section 6: Programming With λ-Calculus

So far, we have just introduced λ-calculus as a model of computation. Alonzo Church intended to use λ-calculus to provide a foundation for all of mathematics, but it was shown to be inconsistent for this purpose by Kleene and Rosser in 1935. Nonetheless, Church and Turing proved that their models of computation — the λ-calculus and the Turing Machine — are equivalent in terms of expressive power.

To prove that λ-calculus is useful, we need to show that it is expressive enough to represent the kinds of computation we might want in real languages. In this section, we are going to discuss how to imitate real-world programming using λ-calculus.

We use the double square brackets \( \llbracket \cdot \rrbracket \) to denote the λ-calculus representation of some entity. For example:

Example 18. The shorthand for “the λ-calculus representation for the identity function is \( \lambda x.\, x \)” is:

\[ \llbracket \mathit{id} \rrbracket \;:=\ \lambda x.\, x \]

Note that this shorthand notation is not part of the language of λ-calculus; an expression only becomes a λ-expression after we expand all shorthand notations into their corresponding λ-expressions.

6.1 Booleans and Conditionals

Before introducing primitives for booleans, it helps to picture how to represent conditional expressions. The simplest conditional expression looks like:

\[ \llbracket \text{if } B \text{ then } T \text{ else } F \rrbracket \]

where \( B \) is the boolean value, \( T \) and \( F \) are the expressions to take if \( B \) is true or false, respectively. We let boolean values be selectors which, given two values as arguments, produce one of them:

\[ \llbracket \mathit{if} \rrbracket := (\lambda b.\, \lambda t.\, \lambda f.\, btf) \]

With arguments applied: \( \llbracket \text{if } B \text{ then } T \text{ else } F \rrbracket := (\lambda b.\, \lambda t.\, \lambda f.\, btf)\, \llbracket B \rrbracket\, \llbracket T \rrbracket\, \llbracket F \rrbracket \)

Then our boolean values are functions that, given two values, produce the first or the second:

\[ \llbracket \mathit{true} \rrbracket := \lambda x.\, \lambda y.\, x \]\[ \llbracket \mathit{false} \rrbracket := \lambda x.\, \lambda y.\, y \]

We can simplify the conditional by β-reducing the application of \( \llbracket \mathit{if} \rrbracket \) three times, obtaining:

\[ \llbracket \text{if } B \text{ then } T \text{ else } F \rrbracket := \llbracket B \rrbracket\, \llbracket T \rrbracket\, \llbracket F \rrbracket \]

Verification that it works:

\[ \begin{aligned} \llbracket \text{if true then } T \text{ else } F \rrbracket &= \llbracket \mathit{true} \rrbracket\, \llbracket T \rrbracket\, \llbracket F \rrbracket = (\lambda x.\, \lambda y.\, x)\, \llbracket T \rrbracket\, \llbracket F \rrbracket \to^2_\beta \llbracket T \rrbracket \\ \llbracket \text{if false then } T \text{ else } F \rrbracket &= \llbracket \mathit{false} \rrbracket\, \llbracket T \rrbracket\, \llbracket F \rrbracket = (\lambda x.\, \lambda y.\, y)\, \llbracket T \rrbracket\, \llbracket F \rrbracket \to^2_\beta \llbracket F \rrbracket \end{aligned} \]

Using the construction of boolean values, we can build boolean operators as follows:

\[ \begin{aligned} \llbracket \mathit{and} \rrbracket &= \lambda p.\, \lambda q.\, pq(\lambda x.\, \lambda y.\, y) \\ \llbracket \mathit{or} \rrbracket &= \lambda p.\, \lambda q.\, p(\lambda x.\, \lambda y.\, x)q \\ \llbracket \mathit{not} \rrbracket &= \lambda p.\, p(\lambda x.\, \lambda y.\, y)(\lambda x.\, \lambda y.\, x) \end{aligned} \]

Example 19. Reductions of \( \llbracket \mathit{not\;true} \rrbracket \) and \( \llbracket \mathit{not\;false} \rrbracket \):

\[ \begin{aligned} \llbracket \mathit{not\;true} \rrbracket &= \llbracket \mathit{not} \rrbracket\, \llbracket \mathit{true} \rrbracket = (\lambda b.\, b\, \llbracket \mathit{false} \rrbracket\, \llbracket \mathit{true} \rrbracket)\, \llbracket \mathit{true} \rrbracket \\ &\to_\beta \llbracket \mathit{true} \rrbracket\, \llbracket \mathit{false} \rrbracket\, \llbracket \mathit{true} \rrbracket = (\lambda x.\, \lambda y.\, x)(\lambda x.\, \lambda z.\, z)\, \llbracket \mathit{true} \rrbracket \\ &\to^2_\beta (\lambda x.\, \lambda z.\, z) = \llbracket \mathit{false} \rrbracket \\[6pt] \llbracket \mathit{not\;false} \rrbracket &= \llbracket \mathit{not} \rrbracket\, \llbracket \mathit{false} \rrbracket \to_\beta \llbracket \mathit{false} \rrbracket\, \llbracket \mathit{false} \rrbracket\, \llbracket \mathit{true} \rrbracket \\ &= (\lambda x.\, \lambda y.\, y)(\lambda x.\, \lambda y.\, y)\, \llbracket \mathit{true} \rrbracket \to_\beta (\lambda y.\, y)\, \llbracket \mathit{true} \rrbracket \to_\beta \llbracket \mathit{true} \rrbracket \end{aligned} \]
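Aside: These encodings can be tried out directly using Python's lambdas. The uppercase names and the `decode` helper (which converts a Church boolean into a Python bool for inspection) are our own:

```python
TRUE  = lambda x: lambda y: x            # [[true]]  = lam x. lam y. x
FALSE = lambda x: lambda y: y            # [[false]] = lam x. lam y. y

AND = lambda p: lambda q: p(q)(FALSE)    # [[and]] = lam p. lam q. p q [[false]]
OR  = lambda p: lambda q: p(TRUE)(q)     # [[or]]  = lam p. lam q. p [[true]] q
NOT = lambda p: p(FALSE)(TRUE)           # [[not]] = lam p. p [[false]] [[true]]

def decode(b):
    """Convert a Church boolean to a Python bool for inspection."""
    return b(True)(False)
```

For instance, `decode(NOT(TRUE))` evaluates to `False` and `decode(AND(TRUE)(TRUE))` to `True`, mirroring the reductions in Example 19.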

Exercise 5. Show that \( \llbracket \mathit{and} \rrbracket \) and \( \llbracket \mathit{or} \rrbracket \) work as expected.

Aside: The fact that booleans are represented by their behavior should look familiar if you’ve read through the Smalltalk module. In a way, Smalltalk’s booleans are a lot like λ-calculus booleans!

6.2 Pairs and Lists

Programs generally require storage facilities in order to compute their results. The simplest expandable storage facility is the list; we begin with the fundamental data structure, the pair, which is just a combination of two data values. We will refer to the two elements of a pair as its head and tail, respectively.

The intuition is that a pair is a function that stores the head and tail, and accepts a selector as a parameter that produces one of them. Selectors that produce the first or second of two arguments sound familiar — that’s exactly what \( \llbracket \mathit{true} \rrbracket \) and \( \llbracket \mathit{false} \rrbracket \) do!

\[ \llbracket \langle h, t \rangle \rrbracket := \lambda s.\, sht \]

The functions head and tail that extract the value from a list pass \( \llbracket \mathit{true} \rrbracket \) or \( \llbracket \mathit{false} \rrbracket \) into the list:

\[ \llbracket \mathit{head} \rrbracket := \lambda l.\, l\, \llbracket \mathit{true} \rrbracket = \lambda l.\, l(\lambda x.\, \lambda y.\, x) \]\[ \llbracket \mathit{tail} \rrbracket := \lambda l.\, l\, \llbracket \mathit{false} \rrbracket = \lambda l.\, l(\lambda x.\, \lambda y.\, y) \]

We implement a cons function (from constructor) that takes two arguments and returns the pair:

\[ \llbracket \mathit{cons} \rrbracket := \lambda h.\, \lambda t.\, \lambda s.\, sht \]

We can verify correctness: \( \llbracket \mathit{head\;(cons\;A\;B)} \rrbracket = \llbracket A \rrbracket \) and \( \llbracket \mathit{tail\;(cons\;A\;B)} \rrbracket = \llbracket B \rrbracket \).

For the empty list, we first consider: how can we tell that a list is not empty? The selector approach means that, for a non-empty list (a pair), passing a selector that produces \( \llbracket \mathit{false} \rrbracket \) no matter what would always return \( \llbracket \mathit{false} \rrbracket \):

\[ \llbracket \mathit{null?} \rrbracket := \lambda l.\, l(\lambda h.\, \lambda t.\, \llbracket \mathit{false} \rrbracket) \]

The empty list should make \( \llbracket \mathit{null?} \rrbracket \) return \( \llbracket \mathit{true} \rrbracket \), so the empty list is something that ignores its selector and produces true:

\[ \llbracket \mathit{nil} \rrbracket := \lambda s.\, \llbracket \mathit{true} \rrbracket \]

Verification: \( \llbracket \mathit{null?\;nil} \rrbracket = \llbracket \mathit{true} \rrbracket \) and \( \llbracket \mathit{null?\;(cons\;A\;B)} \rrbracket = \llbracket \mathit{false} \rrbracket \).
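Aside: The pair and list encodings also carry over directly to Python lambdas (again, the uppercase names and `decode_bool` helper are our own):

```python
TRUE  = lambda x: lambda y: x
FALSE = lambda x: lambda y: y

CONS  = lambda h: lambda t: lambda s: s(h)(t)    # [[cons]] = lam h. lam t. lam s. s h t
HEAD  = lambda l: l(TRUE)                        # [[head]] = lam l. l [[true]]
TAIL  = lambda l: l(FALSE)                       # [[tail]] = lam l. l [[false]]
NIL   = lambda s: TRUE                           # [[nil]]  = lam s. [[true]]
NULLP = lambda l: l(lambda h: lambda t: FALSE)   # [[null?]]

def decode_bool(b):
    """Convert a Church boolean to a Python bool for inspection."""
    return b(True)(False)
```

For example, `HEAD(CONS('a')('b'))` yields `'a'`, `TAIL(CONS('a')('b'))` yields `'b'`, and `NULLP` distinguishes `NIL` from any pair built with `CONS`.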

Example 20. Construction of a list (list a b c):

\[ \begin{aligned} \llbracket \mathit{(list\;a\;b\;c)} \rrbracket &= \llbracket \mathit{(cons\;a\;(cons\;b\;(cons\;c\;nil)))} \rrbracket \\ &\to^6_\beta \lambda s.\, sa(\lambda s.\, sb(\lambda s.\, sc\, \llbracket \mathit{nil} \rrbracket)) \end{aligned} \]

Exercise 6. Define the function \( \llbracket \mathit{second} \rrbracket \), which gets the second element from the list.

6.3 Numbers

After introducing lists, it’s easy to represent numbers: just represent them using lists! An empty list would be 0, a list of one element would be 1, etc.

Exercise 7. Define the following entities using this idiom: \( \llbracket 0 \rrbracket, \llbracket 1 \rrbracket, \llbracket \mathit{pred} \rrbracket, \llbracket \mathit{succ} \rrbracket, \llbracket \mathit{isZero?} \rrbracket \). Verify that your solution works by showing that \( \llbracket \mathit{pred\;(succ\;n)} \rrbracket = \llbracket n \rrbracket \) (if \( n \neq 0 \)).

However, there is a cleverer solution introduced by Alonzo Church called Church numerals. In Church’s representation, \( \llbracket n \rrbracket \) is defined as a function that takes a function \( f \) and a value \( x \), and produces the result of \( f \) applied \( n \) times to \( x \):

\[ \begin{aligned} \llbracket 0 \rrbracket &= \lambda f.\, \lambda x.\, x \\ \llbracket 1 \rrbracket &= \lambda f.\, \lambda x.\, fx \\ \llbracket 2 \rrbracket &= \lambda f.\, \lambda x.\, f(fx) \\ \llbracket 3 \rrbracket &= \lambda f.\, \lambda x.\, f(f(fx)) \end{aligned} \]

Addition. Given two Church numerals \( m \) and \( n \), we want to apply \( f \) a total of \( m + n \) times to \( x \). We can do this by applying \( f \) \( n \) times to \( x \), then applying \( f \) \( m \) times to the result:

\[ \llbracket + \rrbracket := \lambda m.\, \lambda n.\, \lambda f.\, \lambda x.\, mf(nfx) \]

A special case is the successor function:

\[ \llbracket \mathit{succ} \rrbracket := \lambda n.\, \lambda f.\, \lambda x.\, nf(fx) \]

so that \( \llbracket \mathit{succ} \rrbracket\, \llbracket n \rrbracket =_\beta \llbracket n + 1 \rrbracket \).

Subtraction. Subtraction is harder: there is no way to “undo” a function application. An alternative arises from the observation that succ is easy. We can use a pair \( \langle a, b \rangle \) to track a number \( a \) and its predecessor: for each pair \( \langle a, b \rangle \), we create a new pair \( \langle a+1, a \rangle \). Starting from \( \langle 0, 0 \rangle \) and applying this procedure \( n \) times yields \( \langle n, n-1 \rangle \). Taking the tail gives the predecessor.

\[ \llbracket \mathit{pred} \rrbracket := \lambda n.\, \llbracket \mathit{tail} \rrbracket\!\left(n\left(\lambda p.\, \llbracket \mathit{cons} \rrbracket\,(\llbracket \mathit{succ} \rrbracket\,(\llbracket \mathit{head} \rrbracket\, p))\,(\llbracket \mathit{head} \rrbracket\, p)\right)(\llbracket \mathit{cons} \rrbracket\, \llbracket 0 \rrbracket\, \llbracket 0 \rrbracket)\right) \]

Exercise 8. Verify that the definition of \( \llbracket \mathit{pred} \rrbracket \) is correct by working through an example of your own.

Subtraction is defined as: given \( m \) and \( n \), apply \( \llbracket \mathit{pred} \rrbracket \) \( n \) times starting from \( m \):

\[ \llbracket - \rrbracket = \lambda m.\, \lambda n.\, n\, \llbracket \mathit{pred} \rrbracket\, m \]

Multiplication. Multiplication of \( m \) and \( n \) is the \( n \)-fold repetition of \( f \), \( m \) times:

\[ \llbracket * \rrbracket = \lambda m.\, \lambda n.\, \lambda f.\, m(nf) \]

Exponentiation. \( m^n \) is the \( n \)-fold repetition of \( m \) itself:

\[ \llbracket \hat{\phantom{x}} \rrbracket = \lambda m.\, \lambda n.\, nm \]
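Aside: All of this arithmetic can be exercised in Python. The sketch below encodes the numerals and operators above; `church` and `decode`, which convert between Python ints and Church numerals, are our own helpers for inspection:

```python
ZERO = lambda f: lambda x: x                               # [[0]]
SUCC = lambda n: lambda f: lambda x: n(f)(f(x))            # [[succ]]
ADD  = lambda m: lambda n: lambda f: lambda x: m(f)(n(f)(x))  # [[+]]
MUL  = lambda m: lambda n: lambda f: m(n(f))               # [[*]]
EXP  = lambda m: lambda n: n(m)                            # [[^]]

# Pairs, needed for pred
CONS = lambda a: lambda b: lambda s: s(a)(b)
HEAD = lambda p: p(lambda a: lambda b: a)
TAIL = lambda p: p(lambda a: lambda b: b)

# [[pred]]: iterate <a, b> -> <a+1, a> starting from <0, 0>, take the tail
PRED = lambda n: TAIL(
    n(lambda p: CONS(SUCC(HEAD(p)))(HEAD(p)))(CONS(ZERO)(ZERO)))
SUB = lambda m: lambda n: n(PRED)(m)                       # [[-]]

def church(k):
    """Build the Church numeral for a Python int k."""
    n = ZERO
    for _ in range(k):
        n = SUCC(n)
    return n

def decode(n):
    """Convert a Church numeral back to a Python int."""
    return n(lambda i: i + 1)(0)
```

For instance, `decode(MUL(church(2))(church(3)))` yields `6`, and `decode(SUB(church(2))(church(5)))` yields `0`: subtraction bottoms out at zero because \( \llbracket \mathit{pred} \rrbracket\, \llbracket 0 \rrbracket = \llbracket 0 \rrbracket \).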

6.4 Recursion

Suppose one wants to find the length of a list in λ-calculus:

\[ \llbracket \mathit{len} \rrbracket := \lambda l.\, (\llbracket \mathit{null?} \rrbracket\, l)\, \llbracket 0 \rrbracket\, (\llbracket \mathit{succ} \rrbracket\, (\llbracket \mathit{len} \rrbracket\, (\llbracket \mathit{tail} \rrbracket\, l))) \]

This looks great, except \( \llbracket \mathit{len} \rrbracket \) is defined in terms of itself. There is no way to replace \( \llbracket \mathit{len} \rrbracket \) since no functions in λ-calculus have names. We can “factor out” the \( \llbracket \mathit{len} \rrbracket \) on the right hand side by β-expansion, obtaining:

\[ \llbracket \mathit{len} \rrbracket = (\lambda f.\, \lambda l.\, (\llbracket \mathit{null?} \rrbracket\, l)\, \llbracket 0 \rrbracket\, (\llbracket \mathit{succ} \rrbracket\, (f\,(\llbracket \mathit{tail} \rrbracket\, l))))\, \llbracket \mathit{len} \rrbracket \]

Let \( F = \lambda f.\, \lambda l.\, (\llbracket \mathit{null?} \rrbracket\, l)\, \llbracket 0 \rrbracket\, (\llbracket \mathit{succ} \rrbracket\, (f\,(\llbracket \mathit{tail} \rrbracket\, l))) \). Our equation becomes \( \llbracket \mathit{len} \rrbracket = F\, \llbracket \mathit{len} \rrbracket \). We need to find a fixed point of \( F \).

In mathematics, a fixed point of a function is a value \( x \) such that \( f(x) = x \). Consider the following expression:

\[ X = (\lambda x.\, f(xx))(\lambda x.\, f(xx)) \to_\beta f((\lambda x.\, f(xx))(\lambda x.\, f(xx))) = fX \]

Since \( X = fX \), \( X \) is a fixed point of \( f \), whatever \( f \) is. We can construct a general fixed-point combinator by modifying \( X \) to accept an argument for specifying \( f \). What we get is the Y combinator (Curry’s Paradoxical Combinator):

\[ Y := \lambda f.\, (\lambda x.\, f(xx))(\lambda x.\, f(xx)) \]

The critical property of the Y combinator is that for any \( g \):

\[ Yg =_\beta g(Yg) \]

Definition 11. (Fixed-point combinator) A fixed-point combinator is any combinator \( C \) such that:

\[ Cf =_\beta f(Cf) \]

for every \( f \).

Solving our defining equation for \( \llbracket \mathit{len} \rrbracket \) is equivalent to finding a fixed point for \( F \). That fixed point is \( YF \):

\[ \llbracket \mathit{len} \rrbracket := YF = Y(\lambda f.\, \lambda l.\, (\llbracket \mathit{null?} \rrbracket\, l)\, \llbracket 0 \rrbracket\, (\llbracket \mathit{succ} \rrbracket\, (f\,(\llbracket \mathit{tail} \rrbracket\, l)))) \]

An important observation: should we use AOR or NOR as our reduction strategy with the Y combinator? The AOR redex \( (\lambda x.\, f(xx))(\lambda x.\, f(xx)) \) does not have a normal form; AOR always chooses the leftmost innermost redex and would fall into an infinite reduction. So we must use NOR here.

Additionally, when we want a reduction strategy similar to AOR (which more closely resembles real-world programming languages), we need several modifications:

  1. Applicative Order Evaluation (AOE): Choose the redex that is leftmost, innermost, and not inside the body of an abstraction. This can still reach WNF, since WNF does not require reduction of redices inside abstraction bodies.

  2. The \( Y' \) combinator (for AOE): Even with AOE, the recursive part of \( Y \) leads to infinite reduction. We wrap the repetition in an abstraction (η-expansion) to prevent AOE from selecting it:

\[ Y' = \lambda f.\, (\lambda x.\, f(\lambda y.\, xxy))(\lambda x.\, f(\lambda y.\, xxy)) \]
  3. Short-circuit conditionals: The plain \( \llbracket \mathit{if} \rrbracket \) does not perform short-circuit evaluation. When one branch has an infinitely reducing expression, we must wrap both branches in abstractions:
\[ \llbracket \text{if } B \text{ then } T \text{ else } F \rrbracket := \llbracket B \rrbracket\, (\lambda x.\, \llbracket T \rrbracket)\, (\lambda x.\, \llbracket F \rrbracket)\, x \]

With these three modifications, the reduction works under AOE.
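Aside: Python behaves much like AOE, so these modifications can be observed there. The plain Y combinator recurses forever in Python, but the η-expanded \( Y' \) works. The factorial below uses Python arithmetic rather than Church numerals, purely for readability; the definitions are our own illustration:

```python
# Eta-expanded fixed-point combinator:
# Y' = lam f. (lam x. f(lam y. x x y)) (lam x. f(lam y. x x y))
Yp = lambda f: (lambda x: f(lambda y: x(x)(y)))(lambda x: f(lambda y: x(x)(y)))

# Recursion without self-reference: the body receives itself as `rec`.
# Python's conditional expression evaluates only the chosen branch,
# playing the role of the short-circuit conditional described above.
FACT = Yp(lambda rec: lambda n: 1 if n == 0 else n * rec(n - 1))
```

Here `FACT(5)` computes `120`; replacing `lambda y: x(x)(y)` with the unwrapped `x(x)` would send Python into infinite recursion, exactly the failure mode \( Y' \) is designed to avoid.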

Section 7: deBruijn Notation

As our search for a proper definition of substitution has taught us, the names of variables can often get in the way of the true meaning of an expression. In the expression \( \lambda y.\, a(\lambda y.\, b(\lambda y.\, cy)y)y \), each \( y \) has a different meaning. We now examine an alternative notation, introduced by deBruijn in 1972.

In deBruijn notation, variables are replaced with integers. The integers indicate the number of function bodies that must be escaped to locate the binding for the variable. For example, \( \lambda x.\, \lambda y.\, x \) in deBruijn notation is \( \lambda.\, \lambda.\, 2 \).

Note that a single binding can be represented by several different integers, depending on how many further function bodies the reference is embedded in.

Example 21. The expression \( \lambda x.\, (\lambda y.\, x)x \) has the deBruijn equivalent \( \lambda.\, ((\lambda.\, 2)1) \), in which \( x \) is replaced by both \( 2 \) and \( 1 \).

Free variables can be left unconverted. Importantly, in deBruijn notation, all α-equivalent expressions have exactly one representation.
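The conversion to deBruijn notation is mechanical: walk the term carrying the stack of enclosing binders, and replace each bound variable with its 1-based depth in that stack. A sketch, with terms as ad hoc tuples `('var', x)`, `('lam', x, body)`, `('app', f, a)` (an encoding assumed for this example, not anything standard):

```python
# Convert a named lambda-term to deBruijn notation. The environment env is
# the stack of binders enclosing the current position; a bound variable's
# index is its 1-based depth in that stack.
def to_debruijn(t, env=()):
    if t[0] == 'var':
        x = t[1]
        if x in env:
            return ('var', env.index(x) + 1)
        return t                       # free variables are left unconverted
    if t[0] == 'lam':                  # push the binder, drop its name
        return ('lam', to_debruijn(t[2], (t[1],) + env))
    return ('app', to_debruijn(t[1], env), to_debruijn(t[2], env))

# Example 21: lambda x. (lambda y. x) x  ==>  lambda. (lambda. 2) 1
print(to_debruijn(('lam', 'x', ('app', ('lam', 'y', ('var', 'x')), ('var', 'x')))))
```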

The definition of β-reduction for deBruijn notation is:

\[ (\lambda.\, N)M \to_\beta N[M/1] \]

The substitution definition for deBruijn notation is:

\[ n[N/m] = \begin{cases} n & \text{if } n < m \\ n - 1 & \text{if } n > m \\ \mathit{rename}_{n,1}(N) & \text{if } n = m \end{cases} \]\[ (M_1 M_2)[N/m] = M_1[N/m]\,M_2[N/m] \]\[ (\lambda.\, M)[N/m] = \lambda.\, M[N/m+1] \]

where the renaming function is defined as:

\[ \mathit{rename}_{m,i}(j) = \begin{cases} j & \text{if } j < i \\ j + m - 1 & \text{if } j \geq i \end{cases} \]\[ \mathit{rename}_{m,i}(N_1 N_2) = \mathit{rename}_{m,i}(N_1)\, \mathit{rename}_{m,i}(N_2) \]\[ \mathit{rename}_{m,i}(\lambda.\, N) = \lambda.\, \mathit{rename}_{m,i+1}(N) \]
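These definitions translate almost line for line into code. A sketch, with deBruijn terms as ad hoc tuples `('var', n)`, `('lam', body)`, `('app', f, a)` (an encoding assumed for this example); `beta` implements \( (\lambda.\, N)M \to_\beta N[M/1] \):

```python
# deBruijn substitution t[N/m] and the renaming (shifting) function,
# following the definitions in the text.
def rename(m, i, t):
    if t[0] == 'var':
        j = t[1]
        return t if j < i else ('var', j + m - 1)   # free indices shift by m-1
    if t[0] == 'app':
        return ('app', rename(m, i, t[1]), rename(m, i, t[2]))
    return ('lam', rename(m, i + 1, t[1]))          # under a binder: bump cutoff

def subst(t, N, m):                                 # computes t[N/m]
    if t[0] == 'var':
        n = t[1]
        if n < m: return t                          # bound more locally: keep
        if n > m: return ('var', n - 1)             # a binder disappeared
        return rename(m, 1, N)                      # the substituted occurrence
    if t[0] == 'app':
        return ('app', subst(t[1], N, m), subst(t[2], N, m))
    return ('lam', subst(t[1], N, m + 1))

def beta(t):                                        # (lam. N) M -> N[M/1]
    return subst(t[1][1], t[2], 1)

# K = lambda. lambda. 2 applied to the free variable 1 yields lambda. 2:
K = ('lam', ('lam', ('var', 2)))
print(beta(('app', K, ('var', 1))))   # ('lam', ('var', 2))
```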

The deBruijn notation is useful for avoiding the cost of α-conversion. Substitution and β-reduction in deBruijn notation are essentially operations on integers, which is often faster than operations on names. Furthermore, deBruijn notation exposes the concept of a “stack frame”: the integers can be regarded as indices into a stack of values, which, after a reduction, maps each index to the expression its bound variable refers to.

However, deBruijn notation is considerably harder to read and understand, and often a program evaluating λ-calculus in deBruijn notation needs to provide facilities to convert one notation to another. Such conversions can be costly, offsetting the performance benefit.

Section 8: Implementation Strategies

Many programming languages are based on λ-calculus to some degree. Generally speaking, a λ-calculus expression itself can be stored as an abstract syntax tree (AST), in which each node is labeled with its type (variable, abstraction, or application). For instance, the λ-calculus expression \( \lambda b.\, \lambda t.\, \lambda f.\, btf \) (i.e., \( \llbracket \mathit{if} \rrbracket \)) could be represented as a tree:

abstraction (b)
└── abstraction (t)
    └── abstraction (f)
        └── application
            ├── application
            │   ├── variable (b)
            │   └── variable (t)
            └── variable (f)

To evaluate it, you walk the tree, following the evaluation strategy of your choice, and replacing applications with their substitutions. You need to carry a list of bindings so you know what to substitute. This can get complicated, since a subtree may reuse a variable name, and you need to be able to generate a fresh name to avoid conflicts.

With deBruijn notation, you instead carry a stack of bindings, and a variable is simply an index into that stack. The equivalent deBruijn tree for \( \llbracket \mathit{if} \rrbracket \) uses indices:

abstraction
└── abstraction
    └── abstraction
        └── application
            ├── application
            │   ├── variable (3)
            │   └── variable (2)
            └── variable (1)

This makes evaluation simpler, but requires an extra conversion step, and makes debugging more difficult. An advantage of deBruijn notation is that you don’t need a final α-equivalence check on the output.


Module 3: Formal Semantics

One of the major endeavours of modern programming languages research is formalizing our understanding of how language constructs behave on their own and in interaction with each other. We are interested in formalizing the meanings of the various elements of a programming language — and ultimately the language itself. This discipline is called formal semantics. In studying formal semantics, our goal is to formulate a model capable of precisely describing the behavior of every program in a given language. Such a model provides us with tools to prove program correctness, program termination, or other critical properties. Furthermore, we can use such a model to prove certain properties of the language itself, or to show equivalence of programs in different programming languages. The knowledge gained could even help build compilers and interpreters that produce more efficient implementations.

In Module 2, we said that a mathematical model for the programming language itself would provide a succinct and precise representation of the core mechanics. However, we had not yet given a definition of the behavior of the program in mathematical logic. For example, we described AOE strategy in plain English as “always choosing the leftmost, innermost redex that is not in an abstraction”. A prosaic definition like this will usually not suffice. We would like to have a small set of (usually) syntax-directed rules that describe the elements of a language’s syntax in a formal, mathematical setting.

A semantic model usually comes with a set of observables, which describes the valid outputs of the model. Such outputs could be the value produced by following a number of rules, the set of all types of the language, or simply whether a program returns an error or not.

In this course’s formal semantics, we are primarily going to study operational semantics — the semantics for specifying how a program executes and how to extract a result from it. More specifically, we are concerned with small-step operational semantics, which builds an imaginary “machine” and succinctly describes how this machine might take individual steps, rather than describing the entire computation in one step.

Section 1: Semantics and Category Theory

We will be describing the reduction steps in our programming languages by formally describing an arrow (\( \to \)) operator, which maps a program state to the “next” program state. This is described within the context of category theory, in which our language is a category, and \( \to \) is a morphism over that category.

In CS courses, you have undoubtedly seen sets and set theory. Group theory extends set theory by describing groups — sets equipped with an operation satisfying certain axioms — and generalizes the language of functions between and within groups. Category theory abstracts beyond this by describing categories, which may not obviously be describable as groups or sets; in particular, one can describe entire mathematical calculi as categories. The functions between objects in a category are called morphisms. For instance, one can describe set theory itself as a category, with expressions in set theory as the objects, and the morphisms being equivalences (or reductions, or expansions, etc.) between them. For example, the resolution of the expression \( \{1, 2\} \cup \{3, 4\} \) to \( \{1, 2, 3, 4\} \) is a morphism.

Where category theory becomes particularly relevant is its abstraction over itself. In the language of category theory, we can describe an entire category as an object within the category of categories, and describe a morphism mapping that category to another category. Mappings between categories like this are called functors.

Aside: We introduced categories as “not obviously describable as groups or sets”. In fact, since category theory describes how categories can be mapped to other categories, and category theory is itself a category, it is perfectly possible to map any category to some kind of set. Hence “not obviously”, rather than “not”.

We describe our own languages in terms of a morphism, which maps program states to program states. Morphisms are usually shown as arrows, often with text to specify exactly which morphism is being described — in fact, we’ve already seen a few morphisms, such as \( \to_\beta \), but didn’t call them such at the time.

When describing morphisms, we are free to use other categories to do so. For instance, we could say that \( (x + y \to z) \) if \( (x + y = z) \), describing our language in terms of the language of arithmetic. It’s important to be clear what language is being described and what language is being used. In fact, a compiler is an example of a functor — it maps one language (a category) to another — and proving compiler correctness involves proving the functor correct.

Aside: Programming language semantics are not an ad hoc invention; they are described in the language of categories.

Section 2: Semantics and Reality

But is there anything to guarantee that the semantics we formally model are the same as the semantics implemented in real programming language implementations? The short answer is “usually not”.

There are systems that make formal semantics executable, but the resulting interpreters are usually unusably slow. The purpose of these systems is to have a ground truth for writing test cases. There are also aspects of real implementations which are usually intentionally ignored in formal semantics. For instance, we won’t discuss what happens when the program state is too large to hold in memory, or garbage collection (even though it’s crucial to a correct implementation of many systems).

In reality, it’s impossible to prove, in the mathematical sense, anything about how a program will behave on a real system. It is precisely because we prove things about abstract calculi, rather than real implementations, that we can prove things with all the rigor of mathematics. It is the job of the designer of a formal semantics to argue that the semantics correctly reflects the design of the language.

Section 3: Review: Post System

The Post system, named after Emil Post, is an example of a deductive formal system, which can be used to reason about programming languages. There are three components to a Post system: a set of signs (which forms the alphabet of the system), a set of variables, and a set of productions. A term is a string of signs and variables, and a production is an expression of the form:

\[ \dfrac{t_1 \quad t_2 \quad \cdots \quad t_n}{t} \]

where \( t, t_1, \ldots, t_n \) (\( n \geq 0 \)) are all terms. The \( t_i \) are called the premises of the production, and \( t \) is the conclusion. A production is read as “if the premises are true, then the conclusion holds”. A production without premises is permitted and is called an axiom.

Post systems are used to prove conclusions, where a proof is constructed from proofs of its premises:

  1. An instance of an axiom is a proof of its conclusion.
  2. If \( P_1, P_2, \ldots, P_n \) are proofs of \( t_1, t_2, \ldots, t_n \) respectively, and there is a production with premises \( t_1, t_2, \ldots, t_n \) and conclusion \( t \), then \( P_1, P_2, \ldots, P_n \) together with that production form a proof of \( t \).

Thus, given a final conclusion, a proof of that conclusion can be formed by proving its premises, until no unproven premises remain. The result of such a proof is an upside-down tree with the root (final conclusion) at the bottom and the leaves (axioms) at the top.

Example 1. As an example of a Post system, we can encode the logical operations ‘and’ (\( \land \)) and ‘or’ (\( \lor \)) using the following three rules:

\[ \dfrac{A}{A \lor B} \qquad \dfrac{B}{A \lor B} \qquad \dfrac{A \quad B}{A \land B} \]

Using this small system, it is possible to show that the proof of \( (A \lor B) \land (A \lor C) \) follows from a proof of \( A \) alone:

\[ \dfrac{\dfrac{A}{A \lor B} \quad \dfrac{A}{A \lor C}}{(A \lor B) \land (A \lor C)} \]

Post systems are used extensively for describing formal semantics. You will see that formal semantics of programming languages, including type systems, are often described in Post systems.
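To make the proof-tree discipline concrete, here is a sketch of a checker for the three rules of Example 1. The tuple encoding of formulas and proofs is ad hoc to this example, not a general Post-system engine:

```python
# Formulas are atoms 'A', 'B', ... or ('or', X, Y) / ('and', X, Y).
# A proof is (conclusion, list_of_subproofs); a leaf (no subproofs) must be
# one of the given assumptions.
def check(proof, assumptions):
    concl, subs = proof
    if not subs:                                   # leaf: must be assumed
        return concl in assumptions
    if not all(check(p, assumptions) for p in subs):
        return False
    prems = [p[0] for p in subs]
    if concl[0] == 'or':                           # A |- A or B,  B |- A or B
        return len(prems) == 1 and prems[0] in (concl[1], concl[2])
    if concl[0] == 'and':                          # A, B |- A and B
        return prems == [concl[1], concl[2]]
    return False

# The derivation of (A or B) and (A or C) from A alone:
a_or_b = (('or', 'A', 'B'), [('A', [])])
a_or_c = (('or', 'A', 'C'), [('A', [])])
proof = (('and', ('or', 'A', 'B'), ('or', 'A', 'C')), [a_or_b, a_or_c])
print(check(proof, {'A'}))   # True
```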

Section 4: Operational Semantics for (Vanilla) λ-Calculus

We have already discussed the semantics of λ-terms in Module 2, when we discussed free and bound variables, substitution, α-conversion, and β-reduction. β-reduction seems to be a suitable candidate for operational semantics, for it specifies a procedure for carrying out computation.

Let’s rewrite β-reduction as a formal set of rules. First, all expressions that have β-redex at the outermost level can be directly reduced, with no premises:

\[ \dfrac{}{(\lambda x.\, M)N \to_\beta M[N/x]} \]

For reduction happening inside abstractions:

\[ \dfrac{M \to_\beta P}{\lambda x.\, M \to_\beta \lambda x.\, P} \]

In order to show that \( \lambda x.\, M \to_\beta \lambda x.\, P \), we must either provide a proof of \( M \to_\beta P \), or there must exist a rule stating \( M \to_\beta P \) is an axiom.

For applications, we can reduce either in the rator or in the rand — hence we need two separate rules:

\[ \dfrac{M \to_\beta P}{MN \to_\beta PN} \qquad \dfrac{N \to_\beta P}{MN \to_\beta MP} \]

We have just described how computation proceeds in λ-calculus. However, because a λ-calculus expression may match more than one of these conditions, our description is non-deterministic — we haven’t described a particular way of computing, but all valid ways of computing.

Aside: The figure here contrasts the two styles: small-step (\( \to \)) performs one reduction at a time — \( (\lambda x.\, x+1)\, 3 \to 3 + 1 \to 4 \) (a value) — while big-step (\( \Downarrow \)) relates an expression directly to its final value: \( (\lambda x.\, x+1)\, 3 \Downarrow 4 \). Both styles define the same final result; they differ in granularity.

Section 5: Defining Evaluation Order

As we mentioned earlier, evaluation order is very important, since many programming languages will not execute code non-deterministically. In this section, we discuss the operational semantics of λ-calculus under Normal Order Reduction (NOR) and Applicative Order Evaluation (AOE).

Definition 1. (Small-Step Operational Semantics of the Untyped λ-Calculus, NOR) Let the metavariable \( M \) range over λ-expressions. Then a semantics of λ-terms in NOR is given by the following rules:

\[ \dfrac{}{(\lambda x.\, M_1)M_2 \to M_1[M_2/x]} \]\[ \dfrac{M \to M'}{\lambda x.\, M \to \lambda x.\, M'} \]\[ \dfrac{M_1 \to M_1' \quad \forall x.\, \forall M_3.\, M_1 \neq \lambda x.\, M_3}{M_1 M_2 \to M_1' M_2} \]\[ \dfrac{M_2 \to M_2' \quad \forall x.\, \forall M_3.\, M_1 \neq \lambda x.\, M_3 \quad \forall M_1'.\, M_1 \not\to M_1'}{M_1 M_2 \to M_1 M_2'} \]

The first and second rules are the same as in β-reduction. Similar to non-deterministic β-reduction, we can reduce an expression that is either a redex, or an abstraction which contains a redex. To enforce NOR, we add additional restrictions to the third and fourth rules:

  • The fourth rule includes \( \forall M_1'.\, M_1 \not\to M_1' \) — if \( M_1 \) can be reduced further, we must reduce \( M_1 \) (via the third rule) first.
  • Both rules include \( \forall x.\, \forall M_3.\, M_1 \neq \lambda x.\, M_3 \) — the third and fourth rules do not apply if the first rule would have applied (i.e., if \( M_1 \) is an abstraction).
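These rules can be animated directly. A sketch of one NOR step on named terms, with an ad hoc tuple encoding `('var', x)`, `('lam', x, body)`, `('app', f, a)`; substitution is the naive, capture-permitting version, which suffices for the example below:

```python
# Naive substitution: replaces free occurrences of x in t by v, without
# capture avoidance (no capture arises in the example below).
def subst(t, x, v):
    if t[0] == 'var':
        return v if t[1] == x else t
    if t[0] == 'lam':
        return t if t[1] == x else ('lam', t[1], subst(t[2], x, v))
    return ('app', subst(t[1], x, v), subst(t[2], x, v))

def step(t):                          # one NOR step, or None if irreducible
    if t[0] == 'app':
        f, a = t[1], t[2]
        if f[0] == 'lam':
            return subst(f[2], f[1], a)               # rule 1: beta at the head
        r = step(f)
        if r is not None:
            return ('app', r, a)                      # rule 3: reduce the rator
        r = step(a)
        return None if r is None else ('app', f, r)   # rule 4: then the rand
    if t[0] == 'lam':
        r = step(t[2])
        return None if r is None else ('lam', t[1], r)  # rule 2: under the binder
    return None

# (lambda x. lambda y. y) Omega x -> (lambda y. y) x -> x under NOR,
# even though Omega = (lambda x. x x)(lambda x. x x) diverges on its own:
w = ('lam', 'x', ('app', ('var', 'x'), ('var', 'x')))
omega = ('app', w, w)
K = ('lam', 'x', ('lam', 'y', ('var', 'y')))
t = ('app', ('app', K, omega), ('var', 'x'))
print(step(step(t)))   # ('var', 'x')
```

This is exactly the behavior proved formally in Section 7's showcase example.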

Definition 2. (Small-Step Operational Semantics of the Untyped λ-Calculus, AOE) Let the metavariable \( M \) range over λ-expressions. Then a semantics of λ-terms in AOE is given by the following rules:

\[ \dfrac{\forall M_2'.\, M_2 \not\to M_2'}{(\lambda x.\, M_1)M_2 \to M_1[M_2/x]} \]\[ \dfrac{M_1 \to M_1'}{M_1 M_2 \to M_1' M_2} \]\[ \dfrac{M_2 \to M_2' \quad \forall M_1'.\, M_1 \not\to M_1'}{M_1 M_2 \to M_1 M_2'} \]

For AOE, the first rule adds the condition that β-reduction may fire only once the rand (the argument) cannot be reduced further. The abstraction rule is removed, since we cannot reduce within an abstraction — similarly, in most programming languages, you cannot evaluate inside the body of a function you haven’t yet called.
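One AOE step can be sketched just as the rules read: reduce the rator, then the rand, and only then β-reduce. Terms are ad hoc tuples `('var', x)`, `('lam', x, body)`, `('app', f, a)`, with naive (capture-permitting) substitution; running one step on \( \Omega = (\lambda x.\, xx)(\lambda x.\, xx) \) shows it reducing to itself forever:

```python
# Naive substitution, sufficient for this example (no capture arises).
def subst(t, x, v):
    if t[0] == 'var':
        return v if t[1] == x else t
    if t[0] == 'lam':
        return t if t[1] == x else ('lam', t[1], subst(t[2], x, v))
    return ('app', subst(t[1], x, v), subst(t[2], x, v))

def step(t):                          # one AOE step, or None if irreducible
    if t[0] != 'app':
        return None                   # variables and abstractions are terminal
    f, a = t[1], t[2]
    r = step(f)
    if r is not None:
        return ('app', r, a)          # reduce the rator first
    r = step(a)
    if r is not None:
        return ('app', f, r)          # then the rand
    return subst(f[2], f[1], a) if f[0] == 'lam' else None  # finally, beta

# Omega reduces to itself, step after step, forever:
w = ('lam', 'x', ('app', ('var', 'x'), ('var', 'x')))
omega = ('app', w, w)
print(step(omega) == omega)   # True
```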

Section 6: Terminal Values

In the previous section, we used the language of predicate logic (specifically, for-all) to conditionalize productions. While this is mathematically valid, it complicates the description of the language and makes it more difficult to prove that a particular production is the right one to use. Generally speaking, the rules and conditions become much clearer if we can instead syntactically define what expressions are terminal, or final — not capable of being reduced further.

There are a few choices for possible sets of terminal values in λ-calculus: β-normal form, weak normal form, or head normal form. If we use anything other than β-normal form, we lose the guarantee given by the Church-Rosser Theorem. Even with β-normal form as the set of terminal values, we still need to answer some important questions. For example:

  • What is the semantics of \( (\lambda x.\, xx)(\lambda x.\, xx) \)? The only response we can give is “no semantics”, since it does not have a normal form.
  • What about \( (\lambda x.\, \lambda y.\, y)((\lambda x.\, xx)(\lambda x.\, xx)) \)? It has a terminal value, but not all possible legal derivations will lead to it. Should this expression be given a final value of \( \lambda y.\, y \) since there is a possible reduction to it, or should we say there is no meaning?

The answer depends on what one needs to achieve by designing the semantics.

For now, we focus on the steps themselves rather than the possible terminal values, and just let our final values be “the set of values our operational semantics would produce”. We will discuss terminal values in greater detail in the next module.

Section 7: Showcase: A Simple Proof

In this section, we will show that \( (\lambda x.\, \lambda y.\, y)((\lambda x.\, xx)(\lambda x.\, xx))x \) indeed terminates and evaluates to \( x \) under NOR.

Formally, we specify this by exhibiting a sequence of single steps from the former expression to the latter:

\[ (\lambda x.\, \lambda y.\, y)((\lambda x.\, xx)(\lambda x.\, xx))x \to (\lambda y.\, y)x \to x \]

For larger examples, it is useful to formally define the \( \twoheadrightarrow \) (i.e., \( \to^* \)) operator so we can show every single step at once:

Definition 3. (Sequencing) Let the metavariable \( M \) range over λ-expressions and \( \to \) be the operator of “one step” in any small-step operational semantics. Then \( \to^* \) is defined as follows:

\[ \dfrac{}{M \to^* M} \]\[ \dfrac{M_1 \to^* M_2 \quad M_2 \to M_3}{M_1 \to^* M_3} \]

Aside: \( \to^* \) is the reflexive and transitive closure of \( \to \).

To keep the proof text short, let \( A = (\lambda x.\, \lambda y.\, y) \) and \( B = (\lambda x.\, xx)(\lambda x.\, xx) \). Now:

\[ \dfrac{\dfrac{}{ABx \to^* ABx} \quad \dfrac{}{ABx \to (\lambda y.\, y)x}}{ABx \to^* (\lambda y.\, y)x} \]\[ \dfrac{ABx \to^* (\lambda y.\, y)x \quad (\lambda y.\, y)x \to x}{(\lambda x.\, \lambda y.\, y)((\lambda x.\, xx)(\lambda x.\, xx))x \to^* x} \]

Aside: There are also many other kinds of semantics. Big-step operational semantics describes the terminal values every expression will evaluate to directly, rather than as the closure of smaller steps. For example, the big-step operational semantics for λ-calculus under AOE:

\[ \dfrac{}{V \Downarrow V} \qquad \dfrac{M[V_1/x] \Downarrow V_2}{(\lambda x.\, M)V_1 \Downarrow V_2} \]\[ \dfrac{M_1 \Downarrow V_1 \quad V_1 M_2 \Downarrow V_2}{M_1 M_2 \Downarrow V_2} \qquad \dfrac{M_1 \Downarrow V_1 \quad V_2 V_1 \Downarrow V_3}{V_2 M_1 \Downarrow V_3} \]

where \( V \) is the metavariable over values.

Denotational semantics are used to show the correspondence from language constructs to familiar mathematical objects. Our definition of functional language constructs in terms of λ-expressions is an example of a denotational semantics, where the map between expressions and observables is generally denoted by double square brackets. A denotational semantics is formally a functor, while an operational semantics is a morphism over the language described.

Section 8: Adding Primitives

Around the end of Module 2, we discussed λ-calculus implementations of commonly seen data types. While those discussions showcase the power of λ-calculus in representing computation, the implementations are not particularly practical. In practice, we tend to model those data types as primitives — intrinsic (built-in) values of our language. In this section, we will describe the semantic rules required if we were to add those built-in entities, since most of the semantics we see in future modules will have such intrinsics.

8.1 Booleans and Conditionals

We introduce syntactic elements in BNF. We use ... to denote the part of the definitions of expressions that was defined in Module 2:

⟨Boolexp⟩ ::= true | false
             | not ⟨Expr⟩
             | and ⟨Expr⟩ ⟨Expr⟩
             | or ⟨Expr⟩ ⟨Expr⟩
⟨Expr⟩    ::= ...
             | ⟨Boolexp⟩
             | if ⟨Expr⟩ then ⟨Expr⟩ else ⟨Expr⟩

These syntactic elements are very similar to the ones from Module 2. However, they are now actually part of the syntax — there is no λ-calculus representation for them. Programs in the λ-calculus with boolean primitives are simply λ-calculus expressions with additional syntax for boolean expressions, like so:

\[ \lambda x.\, \lambda y.\, \text{if } x \text{ then } y \text{ else } \mathit{false} \]

Let the metavariables \( B \) and \( E \) range over all boolean expressions and all λ-expressions, respectively. We will start with not:

\[ \dfrac{}{ \text{not true} \to \text{false}} \qquad \dfrac{}{ \text{not false} \to \text{true}} \qquad \dfrac{B \to B'}{ \text{not } B \to \text{not } B'} \]

For and and or, we want the computation of the first parameter to happen first, and we would like short-circuiting behavior:

\[ \dfrac{}{\text{and false } B \to \text{false}} \qquad \dfrac{}{\text{and true true} \to \text{true}} \qquad \dfrac{}{\text{and true false} \to \text{false}} \]\[ \dfrac{B \to B'}{\text{and } B\, B_1 \to \text{and } B'\, B_1} \qquad \dfrac{B_1 \to B_1'}{\text{and true } B_1 \to \text{and true } B_1'} \]

The first rule describes the short-circuiting behavior: as long as the first argument evaluates to false, the whole and evaluates to false, regardless of the second argument. The last two rules describe “the first argument must be fully evaluated before the second one”.

Exercise 1. Write the semantic rules for or.

Now we add the rules for if statements:

\[ \dfrac{}{\text{if true then } E_1 \text{ else } E_2 \to E_1} \qquad \dfrac{}{\text{if false then } E_1 \text{ else } E_2 \to E_2} \]\[ \dfrac{B \to B'}{\text{if } B \text{ then } E_1 \text{ else } E_2 \to \text{if } B' \text{ then } E_1 \text{ else } E_2} \]
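A minimal sketch of a reducer for these boolean rules, assuming well-formed operands and an ad hoc tuple encoding; Python's `True`/`False` stand in for the literals true and false, and any non-tuple is treated as terminal:

```python
# One small step over boolean expressions, mirroring the rules above.
def step(e):
    if not isinstance(e, tuple):
        return None                                   # terminal
    op = e[0]
    if op == 'not':
        b = e[1]
        if b is True:  return False
        if b is False: return True
        return ('not', step(b))                       # reduce the operand
    if op == 'and':
        b1, b2 = e[1], e[2]
        if b1 is False: return False                  # short-circuit rule
        if b1 is True:
            if b2 is True:  return True
            if b2 is False: return False
            return ('and', True, step(b2))            # then the second argument
        return ('and', step(b1), b2)                  # first argument first
    if op == 'if':
        b, t, f = e[1], e[2], e[3]
        if b is True:  return t
        if b is False: return f
        return ('if', step(b), t, f)                  # reduce the condition

def run(e):                                           # apply -> to a fixpoint
    while (n := step(e)) is not None:
        e = n
    return e

print(run(('if', ('and', True, ('not', False)), 'then', 'else')))  # prints then
```

Note that `run(('and', False, anything))` returns `False` without ever examining the second argument: the short-circuit rule fires first.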

8.2 Numbers

We restrict our definition to natural numbers. We assume we can represent numbers of an infinite range (no overflow). The syntax in BNF:

⟨Num⟩       ::= 0 | 1 | ...
               | ⟨NumBinOps⟩ ⟨Num⟩ ⟨Num⟩
⟨NumBinOps⟩ ::= + | - | * | /
⟨Expr⟩      ::= ...
               | ⟨Num⟩

Let \( M, N \) range over numeric expressions, and \( a, b \) range over natural numbers. Starting from addition:

\[ \dfrac{a + b = c}{(+ \; a \; b) \to c} \qquad \dfrac{M \to M'}{(+ \; M \; N) \to (+ \; M' \; N)} \qquad \dfrac{M \to M'}{(+ \; a \; M) \to (+ \; a \; M')} \]

This set of rules forces the first argument (left-hand side) to be evaluated before the second argument. Note that we are describing our language in terms of the language of arithmetic, with the predicate \( a + b = c \).

For subtraction:

\[ \dfrac{a - b = c \quad c \in \mathbb{N}}{(- \; a \; b) \to c} \qquad \dfrac{M \to M'}{(- \; M \; N) \to (- \; M' \; N)} \qquad \dfrac{M \to M'}{(- \; a \; M) \to (- \; a \; M')} \]

The semantics for subtraction requires that \( a - b \) is a natural number. With this, there is no rule to match expressions like \( (- \; 2 \; 3) \), so such expressions cannot be reduced — we describe this as getting stuck. Another way of handling this is to allow such subtraction, but define the result as something arbitrary such as 0:

\[ \dfrac{a - b = c \quad c \notin \mathbb{N}}{(- \; a \; b) \to 0} \]

Although this definition is unintuitive, it is not incorrect: we are defining our language’s subtraction, and if it doesn’t match perfectly with the subtraction of arithmetic, that is part of the definition.
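The rules for \( + \) and \( - \), including the possibility of getting stuck, can be sketched directly. Numerals are Python ints and expressions are prefix tuples (an ad hoc encoding for this example):

```python
# One small step over numeric expressions. Numerals (plain ints) are terminal;
# step() returns None when no rule matches, which for a non-numeral
# expression such as (- 2 3) means the program is stuck.
def step(e):
    if not isinstance(e, tuple):
        return None                           # numerals are terminal
    op, m, n = e
    if isinstance(m, int) and isinstance(n, int):
        if op == '+':
            return m + n                      # premise a + b = c
        if op == '-' and m >= n:
            return m - n                      # premise: c is a natural number
        return None                           # stuck: no rule for (- 2 3)
    if isinstance(m, int):                    # first argument fully evaluated:
        r = step(n)                           # reduce the second
        return None if r is None else (op, m, r)
    r = step(m)                               # otherwise reduce the first
    return None if r is None else (op, r, n)

print(step(('-', 5, ('+', 1, 2))))   # ('-', 5, 3)
print(step(('-', 2, 3)))             # None: stuck
```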

Exercise 2. Write the semantic rules for \( * \) and \( / \) (use integer division; think about how to handle zero division).

Exercise 3. Propose changes to the syntax rules and add new semantic rules for pred and succ, which are unary functions for getting the predecessor and successor of a number. Note: pred 0 = 0.

8.3 Lists

We use the representation familiar from Racket: a list containing 1, 2, 3 is:

\[ \text{(cons 1 (cons 2 (cons 3 empty)))} = [1, 2, 3] \]

We use the shorthand \( L_1 + L_2 \) for the operator that concatenates \( L_1 \) and \( L_2 \), with the elements of \( L_1 \) coming first. For example: \( [1] + [2] = [1, 2] \). We also assume that \( + \) works for empty, the empty list.

The syntactic elements of lists:

⟨ListExpr⟩ ::= empty | (cons ⟨Expr⟩ ⟨ListExpr⟩)
              | [⟨Expr⟩ ⟨ListRest⟩]
⟨ListRest⟩ ::= ε | , ⟨Expr⟩ ⟨ListRest⟩
⟨Expr⟩     ::= ...
              | ⟨ListExpr⟩
              | first ⟨ListExpr⟩ | rest ⟨ListExpr⟩

Let the metavariables \( L, E \) range over list expressions and λ-expressions respectively. The semantic rules are:

\[ \dfrac{L_2 = [E] + L_1 \quad \forall E_1.\, E \not\to E_1}{(\text{cons}\; E\; L_1) \to L_2} \]\[ \dfrac{L_1 = [E] + L_2}{(\text{first}\; L_1) \to E} \qquad \dfrac{L_1 = [E] + L_2}{(\text{rest}\; L_1) \to L_2} \]

Note that the premise \( L_1 = [E] + L_2 \) implies that \( L_1 \) is not empty.

Reduction inside the built-in functions:

\[ \dfrac{E_1 \to E_1'}{(\text{cons}\; E_1\; E_2) \to (\text{cons}\; E_1'\; E_2)} \qquad \dfrac{\forall E_3.\, E_1 \not\to E_3 \quad E_2 \to E_2'}{(\text{cons}\; E_1\; E_2) \to (\text{cons}\; E_1\; E_2')} \]\[ \dfrac{E_1 \to E_1'}{(\text{first}\; E_1) \to (\text{first}\; E_1')} \qquad \dfrac{E_1 \to E_1'}{(\text{rest}\; E_1) \to (\text{rest}\; E_1')} \]

Note that our definition of lists has been slightly less formal than our previous definitions, as we relied on an informally described mathematical language of lists for our predicates. It is not uncommon for formal semantics to have quasi-formal “holes” like this, though obviously it is preferable to define everything as precisely as possible.
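Since the rules lean on that informal language of lists, a quick sketch can make the shorthand concrete: here Python lists play the role of the mathematical lists, and `[e] + l` is exactly the \( [E] + L \) shorthand (returning `None` signals that no rule matches, i.e., the expression is stuck):

```python
# first and rest on cons-lists, mirroring the rules above.
def cons(e, l):  return [e] + l                # builds [E] + L
def first(l):    return l[0] if l else None    # stuck on empty: no rule matches
def rest(l):     return l[1:] if l else None   # stuck on empty: no rule matches

xs = cons(1, cons(2, cons(3, [])))             # the list [1, 2, 3]
print(first(xs), rest(xs))                     # 1 [2, 3]
```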

8.4 Sets

A set is a mathematical collection of distinct objects. In real programming languages, it is usually implemented by a hash-map. However, when formulating a semantics for sets, we usually do not need to worry about their actual implementation; we can just treat it as a mathematical object.

The syntax of sets:

⟨SetExpr⟩  ::= empty | {⟨Expr⟩ ⟨SetRest⟩}
              | insert ⟨Expr⟩ ⟨SetExpr⟩
              | remove ⟨Expr⟩ ⟨SetExpr⟩
⟨SetRest⟩  ::= ε | , ⟨Expr⟩ ⟨SetRest⟩
⟨Expr⟩     ::= ...
              | ⟨SetExpr⟩
⟨Boolexp⟩  ::= ...
              | contains? ⟨Expr⟩ ⟨SetExpr⟩

Let metavariables \( S, E \) range over set expressions and all λ-expressions respectively:

\[ \dfrac{\forall E_1.\, E \not\to E_1}{(\text{insert}\; E\; \text{empty}) \to \{E\}} \qquad \dfrac{\forall E_1.\, E \not\to E_1 \quad S' = S \cup \{E\}}{(\text{insert}\; E\; S) \to S'} \]\[ \dfrac{\forall E_1.\, E \not\to E_1}{(\text{remove}\; E\; \text{empty}) \to \text{empty}} \qquad \dfrac{\forall E_1.\, E \not\to E_1 \quad S' = S \setminus \{E\}}{(\text{remove}\; E\; S) \to S'} \]\[ \dfrac{\forall E_1.\, E \not\to E_1 \quad E \in S}{(\text{contains?}\; E\; S) \to \text{true}} \qquad \dfrac{\forall E_1.\, E \not\to E_1 \quad E \notin S}{(\text{contains?}\; E\; S) \to \text{false}} \]

Note that again, we have described our own sets in terms of the language of set theory.
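The behaviour on fully reduced arguments can be sketched by borrowing the mathematical object directly — here, Python's `frozenset` stands in for the set-theory object, exactly as the rules do (this is an illustration of the rules, not a claim about implementation):

```python
# insert / remove / contains? on fully reduced arguments.
def insert(e, s):   return s | {e}     # S' = S union {E}
def remove(e, s):   return s - {e}     # S' = S minus {E}
def contains(e, s): return e in s      # E in S / E not in S

s = insert(2, insert(1, frozenset()))  # frozenset() plays the role of empty
print(contains(2, s), contains(3, s))  # True False
```

Inserting an element already present leaves the set unchanged, reflecting that sets contain distinct objects.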

Exercise 4. Write the semantic rules for set where at least one argument is not fully reduced.

Section 9: Fin

In the next module, we will introduce types, which allow us to prove certain properties of languages — including that the semantics do not “get stuck” — by categorizing the kinds of values that may undergo certain operations.

Module 4: Types

Getting Stuck

In Module 2, we studied the λ-calculus as a model for programming. We showed how many of the elements of “real-world” programming languages can be built from λ-expressions. But, because these “real-world” constructs were all modelled as λ-expressions, we can combine them in any way we like without regard for “what makes sense”. When we added semantics, in Module 3, we also added direct (“primitive”) semantic implementations of particular patterns, such as booleans, rather than expressing them directly as λ-expressions. That change introduces a similar problem: what happens when you try to use a primitive in a way that makes no sense? For instance, what happens if we try to add two lists as if they were numbers?

Well, let’s work out an example that tries to do that, using the semantics of AOE, plus numbers and lists, from Module 3, and the expression (+ (cons 1 (cons 2 empty)) (cons 3 (cons 4 empty))):

  • The only semantic rule for + that matches is to reduce the first argument. After several steps, it reduces to [1, 2], so our expression is now (+ [1, 2] (cons 3 (cons 4 empty))).
  • The rule for + to reduce the first argument no longer matches, because the first argument is not reducible. But, the rule to reduce the second argument doesn’t match, because the first argument is not a number. No rules match, so we can reduce no further.

What happens next? The answer is that nothing happens next! There’s no semantic rule that matches the current state of our program, so we can’t take another step. This isn’t a problem with our semantics; after all, it makes no sense to add lists in this way. But, because of this, applying \(\to^*\) to our code reaches a rather unsatisfying result.

We call this phenomenon “getting stuck”, but to understand what it means to get stuck, we first have to have a goal. Applying \(\to\) repeatedly will reach some conclusion for any program that terminates, so how did the conclusion for the above program differ from a “good” one?

In real software, we usually run software for its side effects: it interacts with the user, or transforms some files, etc. In formal semantics, there is nothing external to the program, and all of the steps are syntactic transformations of the program. So, the question is, what is the syntax for a “complete” program?

Of course, it depends on the language. So, we rely on a topic we set aside in the last module: terminal values. We will consider a program to have run to completion if it reaches a terminal value, usually just called a “value”. Values are elements of the syntax that are complete and meaningful on their own, with no further reduction. For instance, in most languages with numbers, a literal number is a value. In the λ-calculus with AOE, an abstraction is a value, as without an application, there’s nothing else to do.

This definition is somewhat circular: How do we know when a program has run to completion? It reaches a terminal value. What’s a terminal value? A program fragment that can’t run any further. It’s up to the definer of the semantics to choose values that are both correct for their semantics and intuitively correct.

“Terminal value” is sometimes abbreviated to “term”. However, “term” is a very overloaded, well, term, so we will usually call them “values”, “terminals”, or “terminal values”.

Aside: The word “term” is actually related to “terminal” or “terminus”. It is the fact that a term has a single, conclusive meaning that makes it a “term”! Yes, even in regular English, the term “term” comes from having a complete meaning on its own.

Common Terminal Values

We already saw that in the λ-calculus with AOE, an abstraction is a terminal value. In fact, in most—but certainly not all—programming language semantics with functions, a function is a value, since there is nothing further to do with a function that has not been called. An implication of this fact is that there is usually an infinite number of possible values, since there is at least an infinite number of possible functions; it is thus important that we describe our values syntactically, as context-free grammars already give us a way of expressing such infinite sets.

Most other values should be fairly obvious. In Module 3, we created extended versions of the λ-calculus with booleans, numbers, lists, and sets. In each case, we introduced syntax for those kinds of values, as well as syntax to do things with them. For instance, with booleans, we introduced true and false, but also if expressions. The new values in that extended λ-calculus are true and false; if expressions are non-terminal.

Aside: Non-terminal, I’ve heard that term before! Indeed, a non-terminal in a grammar is exactly the same: non-terminals can be reduced, terminals cannot.

In our extended λ-calculus with numbers, numbers are values. With lists, lists are values. With sets, sets are values. Hopefully, your intuition about what is a completed calculation should align with the definition of a terminal value.
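To make this concrete, here is a minimal OCaml sketch of a value predicate for a λ-calculus extended with booleans and numbers. The AST constructors are assumptions for illustration, not a fixed syntax; the point is that values are identified purely syntactically, just as a grammar of values would.

```ocaml
(* Hypothetical AST for a λ-calculus extended with booleans and numbers. *)
type expr =
  | Var of string
  | Abs of string * expr          (* λx. e *)
  | App of expr * expr
  | Bool of bool                  (* true and false are values *)
  | Num of int                    (* numeric literals are values *)
  | If of expr * expr * expr      (* if expressions are non-terminal *)
  | Add of expr * expr            (* + is non-terminal *)

(* A terminal value is an expression with no further reduction to perform. *)
let is_value (e : expr) : bool =
  match e with
  | Abs _ | Bool _ | Num _ -> true
  | Var _ | App _ | If _ | Add _ -> false

let () =
  assert (is_value (Abs ("x", Var "x")));
  assert (not (is_value (If (Bool true, Num 1, Num 2))))
```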

Introduction to Types

Now that we’ve established that our semantics can get stuck, and defined what it means to get stuck, and that reaching a terminal value is not getting stuck, what’s the solution? Types.

At a basic level, a type is simply a collection of semantic objects from which the value of a variable or expression must be taken. For example, if a variable x is given type int, then there is a collection \(t_{\text{int}}\) (in this case, perhaps, the set of integers) that contains all possible values x can assume. Our problem was that a particular value could not behave in the way we tried to make it behave; if we can categorize values by how they behave, we can prevent ourselves from getting stuck.

More broadly, one defines types based on semantic meaning. For instance, numbers and lists are different types, because you can do different things with them. You can’t add two lists, and so we can now give meaning to the error we had above: you tried to use addition on a type to which that doesn’t apply. We know, by definition, that cons always produces a list, and that + needs an argument of the number type, so we can reject the program without having to run it until it gets stuck.

Categorizing values in this way can be useful even if we’re not facing getting stuck. For example, the expression \(\lambda x.\ \lambda y.\ x\) might represent the value true when viewed as boolean data, but it might represent the function that takes two arguments and returns the first when viewed as a function. Both of these views of \(\lambda x.\ \lambda y.\ x\) are valid, but presumably, a programmer only intended one of them in any context. The set of booleans—in the λ-calculus not extended with primitive booleans—is much smaller than the set of functions, but it is also a subset of the set of functions. If the programmer had written down that they intended it to be a boolean, then that documents how they intend it to be used, even though in the λ-calculus, functions are largely interchangeable.
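This dual view can be seen directly in OCaml. As a sketch (the names tru and if_then_else are illustrative, not standard), the Church encoding of true is just the two-argument function that returns its first argument:

```ocaml
(* Church-encoded true: λx. λy. x. Viewed as data, it selects the "then"
   branch; viewed as a function, it simply returns its first argument. *)
let tru = fun x -> fun _y -> x

(* An "if" built from Church booleans: apply the boolean to both branches. *)
let if_then_else b t f = b t f

let () =
  assert (if_then_else tru 1 2 = 1);          (* tru viewed as a boolean *)
  assert (tru "first" "second" = "first")     (* tru viewed as a function *)
```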

The first use of types occurred around the year 1900, when mathematicians were formulating theories of the foundations of mathematics. Types were used to augment formal systems to avoid falling into variants of Russell’s paradox. Type theory remained a highly specialized and rather obscure field until the 1970s, when its connection with programming languages became clear. Throughout its history, type theory has remained closely related to logic and deduction systems; the connection between type theory and logic is known as the Curry-Howard Isomorphism, and its nature should become clear in the discussion that follows, but we will first focus on the practical use of type systems in programming languages.

Static vs. Dynamic Typing

In a statically-typed programming language, types can be reasoned from the code itself, without needing to execute it. In many statically typed languages, the types are written in the code, though this is not technically a requirement. Statically typed languages include OCaml, C and C++, and Java.

For instance, if I write the program below in OCaml and attempt to compile it, the compiler will refuse with a type error. Reasoning only from the code, without executing the program—indeed, as OCaml is a compiled language, I could not possibly execute the program, since the compiler never produced an executable—the compiler could tell us that our types were violated.

let x = [1;2] + [3;4]
File "ex.ml", line 1, characters 8-13:
Error: This expression has type 'a list but an expression was expected of type int

As we set out previously, the goal of types in general is to prevent programs that might get stuck. The goal of static typing is to catch this kind of error before the program actually runs. In a statically typed language, every code fragment’s types are checked to be consistent—by that language’s definition of “consistent”—before execution. This also means that, in most cases, code will be checked for correctness even if it’s never used. For instance, if the code above were in if false, it would still be rejected, even though the code is unreachable. In fact, a static type system cannot reject exactly those programs that would actually get stuck at run-time; Rice’s theorem guarantees that any such decidable system must also reject some programs that would never get stuck.

In a dynamically-typed programming language, types are reasoned about only during execution (or not at all). Type errors arise only at run-time, and only careful code authorship can guarantee an absence of type errors. Dynamically typed languages include Smalltalk, JavaScript, Python, and nearly all languages that fall under the umbrella term “scripting language”.

For instance, if I write the program below in GNU Smalltalk and run it, the program will begin execution, but immediately crash with a type error:

{1. 2} + {3. 4}
Object: Array new: 2 error: did not understand #+
MessageNotUnderstood(Exception)>>signal (ExcHandling.st:254)
Array(Object)>>doesNotUnderstand: #+ (SysExcept.st:1448)
UndefinedObject>>executeStatements (ex.st:1)

Because types are only checked during execution, if an expression or statement in the language would definitely cause a type error, but it is never actually reached, then the program may run flawlessly. Furthermore, as variables can generally store values of any type, the same code may or may not evoke type errors depending on what values are passed to it, and this can make it exceedingly difficult to discover the original source of an error.

Consider the following Smalltalk code, which has a definite type error but only in a block that’s never reached, and has a conditional type error in a block that is reached. The error is only produced on the very last invocation of the run:right: method.

Object subclass: Example [
  run: x right: y [
    true ifTrue: [ ^x+y ]
         ifFalse: [ ^{1. 2} + {3. 4} ].
  ]
]

Example new run: 1 right: 2.
Example new run: 1.5 right: (3/4).
Example new run: {49. 50} right: {51. 52}.

In semantics, a language is statically typed if types can be determined without using \(\to^*\)—that is, without “running” a program. We will see in the section on the Simply-Typed λ-Calculus how this determination is formally described.

Statically typed languages usually offer guarantees, while dynamically typed languages offer flexibility. However, the distinction between them is only when types are reasoned about; not all statically typed languages guarantee that type errors will not occur, and not all dynamically typed languages allow type errors. For that distinction, we need strong typing.

Strong vs. Weak Typing

A strongly typed language is a language in which type errors are guaranteed not to happen at run-time. In terms of semantics, a language is strongly typed if we can reject all programs which would get stuck without having to actually take the steps and get stuck. A weakly typed language allows type errors at run-time, or semantically, may get stuck.

As it turns out, this definition is, at best, wishy-washy. Generally speaking, OCaml is considered to be a strongly typed language. After all, there is no way to compile a program that attempts to add two lists with +, so that kind of type error simply cannot arise at run-time. However, OCaml will let you compile a match expression that misses cases, and that can be argued to be a type error that occurs at run-time, so perhaps it would be better to say that the language “OCaml programs which compile without warnings” is strongly typed? Alas, not even that is true: the type of numbers which are valid divisors in division is “all numbers except zero”, but OCaml doesn’t have such a type, and doesn’t force you to check before dividing.
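For instance, the following OCaml compiles (with only a non-exhaustiveness warning) and yet can still fail at run-time. This is a sketch; the function name head is illustrative:

```ocaml
(* This match misses the [] case. OCaml compiles it anyway, issuing only a
   warning, so a "type-correct" program can still fail at run-time. *)
let head lst =
  match lst with
  | x :: _ -> x

let () =
  assert (head [1; 2] = 1)
  (* head [] would raise Match_failure at run-time. *)
```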

This problem descends quickly into philosophy. Is division by zero a type error, or some other kind of error? That just depends on whether you choose to accept “non-zero number” as a type, and almost no programming languages do. This makes the definition of strong typing circular: type errors are guaranteed not to happen, and type errors are those errors that we can prevent from happening.

Similarly, Java is usually considered strongly typed, but it allows any reference type to have the value null, even though trying to use null as that reference type will usually raise a type error.

Conversely, C and C++ are usually considered to be weakly typed, but they do prevent many type errors.

In spite of being quite loosely defined, strong and weak typing are nonetheless useful categories; after all, catching 99% of a certain category of error is better than catching 0%. OCaml and Java are generally considered strongly typed. Not all statically typed languages are strongly typed: C and C++ allow you to cast pointers in any way you please, and more importantly, to attempt to use them after they’ve been deleted or before they’ve been allocated, so they are certainly weakly typed. Most dynamically typed languages are considered to be weakly typed, including all of the dynamically typed languages listed above.

Pathological Cases Collapse Everything (Or: This is All Meaningless)

Is assembly language statically typed or dynamically typed? We don’t ascribe a type to every register, but in fact, we do write types… implicitly, in the operation. In MIPS assembly, for example, using mult is an indication that our two registers both contain 32-bit signed integers, and using multu is an indication that they contain 32-bit unsigned integers. Using lw or sw is an indication that a register contains a pointer. This author likes to use the term “operational types” to distinguish this kind of language where types are on the operations instead of on the values, since he finds the static-dynamic categorization poorly applicable.

Is assembly language strongly typed or weakly typed? If I try to load a word from address 0, then that will crash. Except, that’s not a property of assembly language, it’s a behavior of the memory management unit (MMU); and, if you’re using a standard such as POSIX, there may be a well-defined set of steps to take (raising a segmentation fault), from which your program can recover and continue to run. So, if we formally model assembly language and POSIX, then our semantics wouldn’t even get stuck here. And, on a sufficiently limited system without an MMU, assembly code cannot crash at all, which would seem to trivially classify it as strongly typed. You can’t get type errors at run-time if you can’t get any errors at run-time.

Are shell scripts statically typed or dynamically typed? This may seem like an absurd question, but in fact, variables in the shell can only contain values of one data type: strings. So, it’s easy to reason about types in a shell script statically: \(\forall x.\ x \in \text{strings}\). More broadly, it’s hard to classify “singly-typed” programming languages as statically or dynamically typed, because it’s meaningless to reason about such a paucity of types. But, this pathology can easily be extended to describe all dynamically typed languages as statically typed as well: all values are of the type “all values”.

Are shell scripts strongly typed or weakly typed? There is no such thing as a type error in the shell, since everything is a string. A command may not be found, but there’s a well-defined behavior for what to do when a command isn’t found; nothing “gets stuck”, and a shell script will always run to completion, even if every step along the way fails catastrophically.

The issue we’re circling around is that getting stuck isn’t actually meaningful. More precisely, getting stuck is a property of our formal semantics, and not a property of the language that our semantics models. A programming language cannot simply blow up the Universe, so it has to do something in all of the circumstances that we would define as “getting stuck” in formal semantics. Perhaps it throws an exception (like Java with null), perhaps it raises a signal (like assembly and C on POSIX-compliant systems), or perhaps it simply sets a flag (like a command not found in shell), but it doesn’t “get stuck”.

This is a mismatch that arises naturally from attempting to mathematically model a real system, and is not avoidable or resolvable. It is simply important to sometimes come up for air, so to speak, and determine whether you’ve defined your semantics in a way that getting stuck models something you really care about and want to avoid, or it’s simply a mathematical anomaly.

Aside: You will sometimes see languages divided into “compiled” vs. “interpreted” languages, and often those terms will be associated with static or dynamic types. However, compilation or interpretation is not a property of the language at all, but its implementation. C is usually classified as a compiled language, but C interpreters, such as PicoC, exist as well. Smalltalk is usually classified as an interpreted language, but most Smalltalk “interpreters” are actually Just-in-Time compilers, which are, well, compilers. I would recommend avoiding the terms “compiled language” and “interpreted language” entirely, since they conflate languages and implementations.

We will set aside this discussion by focusing on defined semantics, rather than languages per se. We can argue over whether a particular implementation of assembly with POSIX is strongly typed, but a particular set of formal semantics either can get stuck or it cannot. In that context, in the rest of this module, we will only discuss statically typed, strongly typed languages.

The Simply-Typed λ-Calculus

To begin our study of type systems, we need a typed language. We create one by augmenting the λ-calculus with a simple sublanguage of types. We call the resulting language the simply-typed λ-calculus.

The simply-typed λ-calculus has the following syntax:

⟨Expr⟩     ::= ⟨Var⟩ | ⟨Abs⟩ | ⟨App⟩ | (⟨Expr⟩)
⟨Var⟩      ::= a | b | c | ...
⟨Abs⟩      ::= λ ⟨Var⟩ : ⟨Type⟩ . ⟨Expr⟩
⟨App⟩      ::= ⟨Expr⟩ ⟨Expr⟩
⟨Type⟩     ::= ⟨PrimType⟩ | ⟨Type⟩ → ⟨Type⟩
⟨PrimType⟩ ::= t1 | t2 | ...
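This grammar translates almost directly into an OCaml data type. Here is a sketch (the constructor names are assumptions for illustration):

```ocaml
(* The language of types: primitive types and the function-type constructor. *)
type ty =
  | Prim of string          (* t1, t2, ... *)
  | Arrow of ty * ty        (* τ1 → τ2 *)

(* Expressions: as in the untyped λ-calculus, except that each abstraction
   now annotates its variable with a type. *)
type expr =
  | Var of string
  | Abs of string * ty * expr   (* λx:τ. e *)
  | App of expr * expr

(* The identity function over t1, written λx:t1. x *)
let id_t1 = Abs ("x", Prim "t1", Var "x")
```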

The core syntax of λ-calculus is largely unchanged. The only difference is that we now associate a type with the variable of each abstraction. The intuition is that abstractions will only accept arguments whose type matches the type of the variable—i.e., the formal parameter.

In addition to the core calculus, we now have a small language of types. Types in this language come in two forms: primitive and constructed. A primitive type is a type that is “built into” the language, and not built from any other types. Examples in real programming languages usually include types such as int, float, and char. When we introduce primitive values to our semantics, we will also include their types among the primitive types, but the simply-typed λ-calculus has no primitive values. So, we describe primitive types abstractly, using lowercase t, possibly subscripted. For instance, the identity function over the type \(t_1\) is \(\lambda x : t_1.\ x\).

A constructed type is a type that is built from other types. A constructed type also includes a type constructor, which indicates how it is built from its constituent type(s). For the typed λ-calculus, there is only one kind of constructed type: the function type, whose type constructor is an arrow (\(\to\)). We will consider other kinds of constructed types later in this module, and in later modules. When we need to describe an arbitrary type, without knowing exactly which, we will use \(\tau\) (tau), possibly subscripted. So, given types \(\tau_1\) and \(\tau_2\), the constructed type \(\tau_1 \to \tau_2\) represents the type of functions with parameters of type \(\tau_1\) and results of type \(\tau_2\).

Although there are no actual primitive values in the simply-typed λ-calculus, it is still valid to define an abstraction as having a primitive type for its formal parameter. This abstraction will not be callable, because no value will ever be able to inhabit this variable, but abstractions are values, so it can still be used as a piece of data, even if it can’t really be used as a function. This is important because abstract types such as \(\tau_1 \to \tau_2\) are not part of our language of types itself: every type must be fully resolved to some construction of primitive types, and a value of the type \(t_1 \to t_2\) is an abstraction with a formal parameter type of \(t_1\). Of course, this means that types are essentially a toy in the simply-typed λ-calculus, but this language will act as a baseline on which to build more interesting languages, with primitives. Thus, our previous example of \(\lambda x : t_1.\ x\) is by definition the identity function over \(t_1\), but since no value can inhabit \(t_1\), it’s not a usable function.

The core operations of substitution, α-conversion, β-reduction, and η-reduction from the untyped λ-calculus carry over to the simply-typed λ-calculus unchanged. Instead, we will be defining a new property: we are interested in determining whether an expression is well-typed, and if so, what its type is.

As we are discovering types to avoid the problem of getting stuck, what it means for an expression to be well-typed is that when evaluated (reduced), it will not get stuck. What it means for an expression to have a given type is that, when evaluated, if the evaluation reaches a terminal value, that value will be of the given type. That is, we can predict the type of the result of an expression prior to executing it. Note that it is possible for an expression to be well-typed but not reach a terminal value, by reducing infinitely. Generally, we do not attempt to guarantee termination with types.

To determine whether an expression is well-typed, we will define a set of type rules, formulated as a Post system, and use these rules to derive types syntactically. These rules are independent of the rules to reduce the expression; the goal is that we can determine an expression’s type without reducing it.

To begin, we introduce the notion of a type environment, which is usually denoted as \(\Gamma\) (uppercase gamma), also called a set of type assumptions. Type environments are used to supply types for free variables, which rely on external context for their semantics. For instance, a type environment for the body of an abstraction will associate the variable of the abstraction with the type it’s been ascribed. A type environment is essentially a lookup table that, given a variable name, returns its type. We can model a type environment as a list of \(\langle\text{name}, \text{type}\rangle\) pairs, or as a name → type map. As a list, it is written as an “ordered set”, because the order of two elements relative to each other usually doesn’t matter, but if they have the same name, then the first is preferred; this is because names can be reused, and we need to make sure to bind a variable to its innermost definition.
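As a sketch, an association list in OCaml models this ordered behavior directly: List.assoc returns the first matching binding, so an inner (shadowing) definition naturally takes precedence. The type names here are just strings for illustration.

```ocaml
(* A type environment as an association list of ⟨name, type⟩ pairs. *)
let gamma = [ ("y", "t3") ]

(* Extending the environment puts the new binding first, so for the body of
   λx:t2. ... nested inside λx:t1. ..., the inner x shadows the outer one. *)
let gamma' = ("x", "t2") :: ("x", "t1") :: gamma

let () =
  assert (List.assoc "x" gamma' = "t2");  (* the innermost binding wins *)
  assert (List.assoc "y" gamma' = "t3")
```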

Our goal is to use a type environment and an expression to make a type judgment. A type judgment will take the form \(\Gamma \vdash E : \tau\). This judgment denotes the statement, “Under the type assumptions \(\Gamma\), expression \(E\) has type \(\tau\)”. The symbol \(\vdash\) is called a turnstile, and is used in logic to indicate that the right-hand side can be derived from the left-hand side.

Type derivation tree for \((\lambda x : \text{int}.\ x)\ 3 : \text{int}\), read bottom-up, with each rule’s premises above the line:

\[ \dfrac{\dfrac{\dfrac{\langle x, \text{int}\rangle \in \{\langle x, \text{int}\rangle\} + \Gamma}{\{\langle x, \text{int}\rangle\} + \Gamma \vdash x : \text{int}}\ \text{T\_Var}}{\Gamma \vdash (\lambda x : \text{int}.\ x) : \text{int} \to \text{int}}\ \text{T\_Abs} \qquad \dfrac{}{\Gamma \vdash 3 : \text{int}}\ \text{T\_Lit}}{\Gamma \vdash (\lambda x : \text{int}.\ x)\ 3 : \text{int}}\ \text{T\_App} \]

The following is an example of a type judgment, in a language with at least the primitive type int:

\[ \{\langle x, \text{int}\rangle,\ \langle y, \text{int} \to \text{int}\rangle\} \vdash y\ x : \text{int} \]

This judgment means that applying the function y to the argument x (denoted by y x) under the given type environment (where x is of type int and y is of type int → int) will yield an expression of type int.

Now, let’s define the set of Post rules that we will use to derive type judgments in the simply-typed λ-calculus. The first rule we consider determines the type of a single variable. Since a lone variable is necessarily free, its type information can only come from the type environment \(\Gamma\). Thus, we find the type of a lone variable by looking it up in the environment. We’ll start with an obvious definition, but we will have to slightly correct this definition later:

T_VariablePrelim

\[ \dfrac{\langle x, \tau \rangle \in \Gamma}{\Gamma \vdash x : \tau} \]

Informally, this rule says “if the variable x is associated with the type \(\tau\) in \(\Gamma\), then the expression x has the type \(\tau\) in the type environment \(\Gamma\)”.

Next, we consider the type rule for abstractions. Since an abstraction denotes a function, it must have some type \(\tau_1 \to \tau_2\), where \(\tau_1\) is the type of the parameter and \(\tau_2\) is the type of the result. To determine the type of the result, however, is to perform a type judgment on the body of the abstraction; just like many of our semantic rules required that a subexpression take a step, our type judgment may depend on a type judgment for a subexpression. The rule is as follows:

T_Abstraction

\[ \dfrac{\{\langle x, \tau_1 \rangle\} + \Gamma \vdash E : \tau_2}{\Gamma \vdash (\lambda x : \tau_1.\ E) : \tau_1 \to \tau_2} \]

Informally, this rule says “if expression \(E\) has type \(\tau_2\) in the type environment formed by extending \(\Gamma\) with the pair \(\langle x, \tau_1\rangle\), then the expression \(\lambda x : \tau_1.\ E\) has the type \(\tau_1 \to \tau_2\) in the type environment \(\Gamma\)”. That is, the type of a function is the constructed (\(\to\)) type in which the parameter type is the explicitly defined parameter type, and the result type is the result of a type judgment of the body, with the variable name of the argument having the parameter type. Note that we use \(+\) with ordered sets as an ordered union, with the left-hand side coming first.

Now, let’s reexamine our variable rule. Consider the expression \(\lambda x : t_1.\ \lambda x : t_2.\ x\). Assuming we start with an empty type environment, the inner x will be judged in the type environment \(\{\langle x, t_2\rangle, \langle x, t_1\rangle\}\). The premise of T_VariablePrelim is \(\langle x, \tau\rangle \in \Gamma\). In this example, there are two values of \(\tau\) for which this is true: \(t_1\) and \(t_2\). Thus, both \(\Gamma \vdash x : t_1\) and \(\Gamma \vdash x : t_2\) would be true. In some contexts it can be correct for an expression to be judged to have multiple types, but in most, this is a mistake. Here, it’s certainly a mistake, since the inner x is bound only to the declaration of x with type \(t_2\), and not to the declaration of x with type \(t_1\). We only wanted the first instance of x in our environment. We will write \(\Gamma(x) = \tau\) as a shorthand for “the first entry of x in \(\Gamma\) is paired with \(\tau\)”. Now, we can rewrite our variable rule with this restriction:

T_Variable

\[ \dfrac{\Gamma(x) = \tau}{\Gamma \vdash x : \tau} \]

And, we can now discard the T_VariablePrelim rule in favor of T_Variable.

Exercise 1. Write the formal rules for \(\Gamma(x) = \tau\).

Finally, we consider the rule for applications. A function should only accept arguments of the same type as its formal parameter. In other words, if we have a function with formal parameter x of type \(\tau_1\) and result type \(\tau_2\), and we want to apply this function to an argument y, then y should be required to have type \(\tau_1\) as well. The type of the application itself should then be the type of the result of the function, \(\tau_2\). Note that this means our type judgment is acting both as a way of discovering the type of the whole expression and as a way of restricting the type of certain subexpressions; it is in this way that we prevent programs with incorrect types from compiling, by refusing to give them types. The application rule is formalized as follows:

T_Application

\[ \dfrac{\Gamma \vdash E_1 : \tau_1 \to \tau_2 \qquad \Gamma \vdash E_2 : \tau_1}{\Gamma \vdash E_1\ E_2 : \tau_2} \]

In this case, the rule is read, “If \(E_1\) has type \(\tau_1 \to \tau_2\) in the type environment \(\Gamma\), and \(E_2\) has type \(\tau_1\) in the type environment \(\Gamma\), then the expression \(E_1\ E_2\) has type \(\tau_2\) in the type environment \(\Gamma\)”.

We now have a system for deriving the type of an expression in the simply-typed λ-calculus. Expressions for which no type judgment can be derived from these Post rules are not well-typed—that is, they possess type errors—and are not semantically valid expressions in the simply-typed λ-calculus.
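The three rules together amount to a recursive type-checking function. Below is a minimal sketch in OCaml (the AST, the Type_error exception, and the name typeof are assumptions for illustration, not part of the formal system): T_Variable becomes an environment lookup, T_Abstraction extends the environment and recurses into the body, and T_Application checks that the argument type matches the parameter type.

```ocaml
type ty = Prim of string | Arrow of ty * ty

type expr =
  | Var of string
  | Abs of string * ty * expr
  | App of expr * expr

exception Type_error

(* typeof Γ E computes τ such that Γ ⊢ E : τ, or raises Type_error if no
   type judgment can be derived. *)
let rec typeof (gamma : (string * ty) list) (e : expr) : ty =
  match e with
  | Var x ->                                   (* T_Variable *)
      (try List.assoc x gamma with Not_found -> raise Type_error)
  | Abs (x, t1, body) ->                       (* T_Abstraction *)
      Arrow (t1, typeof ((x, t1) :: gamma) body)
  | App (e1, e2) ->                            (* T_Application *)
      (match typeof gamma e1 with
       | Arrow (t1, t2) when typeof gamma e2 = t1 -> t2
       | _ -> raise Type_error)

let () =
  (* λx:t1. λx:t2. x has type t1 → (t2 → t2): the inner x shadows the outer. *)
  let e = Abs ("x", Prim "t1", Abs ("x", Prim "t2", Var "x")) in
  assert (typeof [] e = Arrow (Prim "t1", Arrow (Prim "t2", Prim "t2")))
```

Note how rejecting an ill-typed program falls out for free: when no rule applies, no type is derived, which the sketch models by raising an exception.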

Although we won’t prove it here, it is also important to note that our type judgment is decidable. By Rice’s theorem, we cannot decide whether an expression will get stuck, because it could take an unlimited number of steps to get there; but, we can decide whether an expression is well-typed. So, we would like to guarantee that no well-typed program gets stuck.

Type Safety

A programming language is said to be type safe (or type sound) if a well-typed program can’t “go wrong” due to types. Specifically, this means that the semantics won’t get stuck for any well-typed program, and the type predicted by the type judgment reflects the actual type of the value produced. These ideas are formalized as progress and preservation. The precise statement of progress and preservation depends on the semantics under scrutiny, so we will describe progress and preservation for the simply-typed λ-calculus.

Theorem 1 (Progress). Let \(E\) be a closed, well-typed expression in the simply-typed λ-calculus. That is, for some \(\tau\), we have \(\{\} \vdash E : \tau\). Then, either \(E\) is a value, or there is an expression \(E'\) such that \(E \to_\beta E'\).

Theorem 2 (Preservation). Let \(\Gamma\) be a type environment and \(P\) and \(Q\) be λ-expressions such that \(P \to^*_{\beta\eta} Q\) (that is, \(P\) reduces to \(Q\) by a sequence of zero or more β-reductions and η-reductions). If \(\Gamma \vdash P : \tau\) for some type \(\tau\), then \(\Gamma \vdash Q : \tau\).

Progress guarantees that well-typed terms do not get stuck: if \(E\) is well-typed, then \(E\) cannot be, for example, (+ (cons 1 empty) (cons 2 empty)), for which there is no reduction rule. Progress for the simply-typed λ-calculus is trivial, since no closed λ-expressions get stuck without our primitive additions, but we will prove it in a moment; it is the addition of these primitives that makes progress interesting. Our original goal was to be able to know that a program will not get stuck, and progress alone isn’t quite sufficient for this. While progress guarantees that any \(E\) that is well-typed is not stuck, it does not guarantee that the \(E'\) it reduces to is well-typed, and thus does not guarantee that \(E'\) is not stuck.

Preservation, also known as the Subject-Reduction Theorem, guarantees that our predicted type is correct and consistent; that is, if we predicted that an expression has type \(\tau\), and we take a step of evaluation, then it still has type \(\tau\).

With preservation and progress together, we can accomplish our original goal: to guarantee that a program will not get stuck. Progress says that a well-typed \(E\) is not stuck, and reduces to some \(E'\). Preservation says that \(E'\) is also well-typed. Progress says that \(E'\) is, therefore, also reducible, and so on. Read together, this means that if we can make a type judgment for an expression, then that expression does not get stuck, and if it reaches a value, that value is of the judged type; that is type safety. Although it’s not obvious, we’ve actually also excluded the possibility that it reduces infinitely, but this is a terrible sacrifice! We will discuss that problem in the next section.

Proof of Progress

Let \(E\) be a closed λ-expression of type \(\tau\). The proof is by induction on the length of the type derivation for \(E\). Since \(E\) is closed, \(E\) cannot be a variable. If \(E\) is an abstraction, then \(E\) is a value, and we are done. Thus, the only interesting case is when \(E\) is an application. Let \(E = E_1\ E_2\). By the T_Application rule, there is a type \(\tau_1\) such that \(\{\} \vdash E_1 : \tau_1 \to \tau\), and \(\{\} \vdash E_2 : \tau_1\). By induction, either \(E_1\) is a value, or \(E_1\) is reducible, and similarly for \(E_2\). If \(E_1\) is reducible, we have \(E_1 \to_\beta E_1'\), and so \(E = E_1\ E_2 \to_\beta E_1'\ E_2\), and thus \(E\) is reducible. If \(E_2\) is reducible, we have \(E_2 \to_\beta E_2'\), and so \(E = E_1\ E_2 \to_\beta E_1\ E_2'\), and thus \(E\) is reducible. Otherwise, both \(E_1\) and \(E_2\) are values, and thus abstractions. Thus, \(E_1\) is an abstraction, and we have previously established that it is well-typed. By the T_Abstraction rule, there is some \(E_3\) such that \(E_1 = \lambda x : \tau_1.\ E_3\). By the unconditional application rule of β-reduction, \(E = (\lambda x : \tau_1.\ E_3)E_2 \to_\beta E_3[E_2/x]\), and thus \(E\) is reducible. Progress now follows by induction.

Proof of Preservation

We prove the result in the case of a single reduction step \(P \to_{\beta\eta} Q\). The stated result then follows by iteration. We prove the result by induction on the structure of \(P\). Note first that \(P\) cannot be a variable, for then \(P\) would not be reducible. There are thus five cases to consider, which arise from the productions of β- and η-reduction:

Case 1. \[ \dfrac{M \to_\beta P}{M\ N \to_\beta P\ N} \]

There exist some \(P_1, P_2, P_1'\) such that \(P = P_1\ P_2\), \(P_1 \to_{\beta\eta} P_1'\), \(Q = P_1'\ P_2\). Then by T_Application there is a type \(\tau_1\) such that \(\Gamma \vdash P_1 : \tau_1 \to \tau\) and \(\Gamma \vdash P_2 : \tau_1\). By induction, since \(P_1 \to_{\beta\eta} P_1'\), we have \(\Gamma \vdash P_1' : \tau_1 \to \tau\). Thus, by T_Application, \(\Gamma \vdash P_1'\ P_2 : \tau\), i.e., \(\Gamma \vdash Q : \tau\).

Case 2. \[ \dfrac{N \to_\beta P}{M\ N \to_\beta M\ P} \]

There exist some \(P_1, P_2, P_2'\) such that \(P = P_1\ P_2\), \(P_2 \to_{\beta\eta} P_2'\), \(Q = P_1\ P_2'\). Similar to case 1, but by induction, since \(P_2 \to_{\beta\eta} P_2'\), we have \(\Gamma \vdash P_2' : \tau_1\). Thus, by T_Application, \(\Gamma \vdash P_1\ P_2' : \tau\), i.e., \(\Gamma \vdash Q : \tau\).

Case 3. \[ \dfrac{M \to_\beta P}{\lambda x.\ M \to_\beta \lambda x.\ P} \]

There exist some \(x, \tau_1, E, E'\) such that \(P = \lambda x : \tau_1.\ E\), \(E \to_{\beta\eta} E'\), \(Q = \lambda x : \tau_1.\ E'\). Then there is some type \(\tau_2\) such that \(\tau = \tau_1 \to \tau_2\). By T_Abstraction, we have \(\langle x, \tau_1\rangle + \Gamma \vdash E : \tau_2\). By induction, since \(E \to_{\beta\eta} E'\), we have \(\langle x, \tau_1\rangle + \Gamma \vdash E' : \tau_2\). Then, by T_Abstraction, we obtain \(\Gamma \vdash (\lambda x : \tau_1.\ E') : \tau_1 \to \tau_2\), i.e., \(\Gamma \vdash Q : \tau\).

Case 4. \(\lambda x.\ M\ x \to_\eta M\) (if \(x \notin FV[M]\))

There exist some \(x, \tau_1, E\) such that \(P = \lambda x : \tau_1.\ E\ x\), \(Q = E\), \(x \notin FV[E]\). Then there is some type \(\tau_2\) such that \(\tau = \tau_1 \to \tau_2\). By T_Abstraction, we have \(\langle x, \tau_1\rangle + \Gamma \vdash (E\ x) : \tau_2\). Since x has type \(\tau_1\) and \(E\ x\) has type \(\tau_2\), by T_Application, it must be true that \(\langle x, \tau_1\rangle + \Gamma \vdash E : \tau_1 \to \tau_2\). Finally, since \(x \notin FV[E]\) (i.e., x does not occur free in \(E\)), the type of \(E\) is not dependent on the type of x, so removing \(\langle x, \tau_1\rangle\) from the type environment cannot affect our type judgment of \(E\). Thus, we obtain \(\Gamma \vdash E : \tau_1 \to \tau_2\), i.e., \(\Gamma \vdash Q : \tau\).

Case 5. \((\lambda x.\ M)N \to_\beta M[N/x]\)

There exist some \(x, \tau_1, M, N\) such that \(P = (\lambda x : \tau_1.\ M)N\), \(Q = M[N/x]\). Then, by T_Application, \(\Gamma \vdash N : \tau_1\), and further, by T_Abstraction, \(\langle x, \tau_1\rangle + \Gamma \vdash M : \tau\). We prove \(\Gamma \vdash Q : \tau\) by induction on the structure of \(M\). A λ-expression can be a variable, an abstraction, or an application, so there are three cases:

(a) \(M\) is a variable. If \(M = x\), then by T_Variable, \(\langle x, \tau_1\rangle + \Gamma \vdash x : \tau_1\). As \(M\) was previously shown to be of type \(\tau\) in this environment, \(\tau = \tau_1\). Then, since \(Q = M[N/x] = N\) and \(\tau_1 = \tau\), \(\Gamma \vdash N : \tau_1\) is equivalent to \(\Gamma \vdash Q : \tau\). If \(M = z \neq x\), then \(\langle x, \tau_1\rangle + \Gamma \vdash M : \tau\) is equivalent to \(\Gamma \vdash z : \tau\), since the type of z is not dependent upon the type of x. Then, since \(z[N/x] = z\), we have \(\Gamma \vdash M[N/x] : \tau\), i.e., \(\Gamma \vdash Q : \tau\).

(b) \(M\) is an abstraction. Then there exist some \(y, \tau_2, E\) such that \(M = \lambda y : \tau_2.\ E\). By performing an α-conversion, we can arrange that \(y \neq x\) and \(y \notin FV[N]\). From \(\langle x, \tau_1\rangle + \Gamma \vdash M : \tau\) and \(M = \lambda y : \tau_2.\ E\), we see that there exists some \(\tau_3\) such that \(\tau = \tau_2 \to \tau_3\), and we have \(\langle y, \tau_2\rangle + \langle x, \tau_1\rangle + \Gamma \vdash E : \tau_3\). Since \(y \notin FV[N]\), we can augment the type judgment for \(N\) and obtain \(\langle y, \tau_2\rangle + \Gamma \vdash N : \tau_1\). We can now apply the induction hypothesis and conclude that \(\langle y, \tau_2\rangle + \Gamma \vdash E[N/x] : \tau_3\). By T_Abstraction, \(\Gamma \vdash (\lambda y : \tau_2.\ E[N/x]) : \tau_2 \to \tau_3\). But since \(y \neq x\), \((\lambda y : \tau_2.\ E[N/x]) = M[N/x]\). Thus, \(\Gamma \vdash (M[N/x]) : \tau_2 \to \tau_3\), and since \(\tau = \tau_2 \to \tau_3\), we obtain \(\Gamma \vdash Q : \tau\).

(c) \(M\) is an application. Then there exist some \(E_1, E_2\) such that \(M = E_1\ E_2\). Then, by T_Application, there exists a type \(\tau_2\) such that \(\langle x, \tau_1\rangle + \Gamma \vdash E_1 : \tau_2 \to \tau\) and \(\langle x, \tau_1\rangle + \Gamma \vdash E_2 : \tau_2\). Then, by induction, we have \(\Gamma \vdash (E_1[N/x]) : \tau_2 \to \tau\) and \(\Gamma \vdash (E_2[N/x]) : \tau_2\). Now, by T_Application, \(\Gamma \vdash ((E_1[N/x])(E_2[N/x])) : \tau\), i.e., \(\Gamma \vdash ((E_1\ E_2)[N/x]) : \tau\). But this is \(\Gamma \vdash (M[N/x]) : \tau\), or simply, \(\Gamma \vdash Q : \tau\).

Thus, \(\Gamma \vdash Q : \tau\). Preservation now follows by induction.

The Strong Normalization Theorem

The untyped λ-calculus is useful because it is simultaneously very simple and very powerful, able to represent any computation in spite of only having lambdas. We observed that some programs in untyped λ-calculus “got stuck”, so we aimed to restrict “acceptable” λ-expressions to those that don’t, and defined types and a type judgment to do so. However, the safety that our type system provides us comes at a severe cost in expressive power, as the following theorem, known as the Strong Normalization Theorem, shows:

Theorem 3 (Strong Normalization). Given any type environment \(\Gamma\), the set of well-typed terms in the simply-typed λ-calculus is strongly normalizing, i.e., given a well-typed expression \(E\), every sequence of reductions starting from \(E\) has a finite number of steps.

In other words, it is impossible for well-typed terms in the simply-typed λ-calculus to reduce infinitely. If we cannot construct infinitely reducing expressions, then we have lost computational power. We can no longer simulate an arbitrary Turing machine in the λ-calculus; the typed λ-calculus is not Turing-complete!

We won’t prove the Strong Normalization Theorem here, as the proof is quite intricate, but we can build some intuition by considering what type we would give to the simplest infinitely-reducing expression, \((\lambda x.\ x\ x)(\lambda x.\ x\ x)\). In fact, we need only consider the problem of assigning a type to x in \(\lambda x.\ x\ x\). In the expression \(x\ x\), the x in the rator position is being treated as a function, so it must have a type of the form \(\tau_1 \to \tau_2\), for some types \(\tau_1\) and \(\tau_2\). Then, for the expression to be well-typed, the argument to which x is applied must have type \(\tau_1\). But, the argument is x itself, and x has type \(\tau_1 \to \tau_2\)! We cannot express any type \(\tau_1\) in the simply-typed λ-calculus such that \(\tau_1 \to \tau_2 = \tau_1\). Thus, the type derivation fails. We conclude that self-application cannot occur in the simply-typed λ-calculus. Other expressions with infinite reduction sequences fail to type-check in similar ways.
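The failing step in that derivation is what implementations call an “occurs check”: a type variable may never be equated with a type that contains it. As a minimal sketch (using a tuple encoding of arrow types of our own devising, not anything from the course), the check looks like this in Python:

```python
# Types are either type-variable names (strings) or arrow types, which we
# represent as a pair (domain, range). This encoding is our own assumption.

def occurs(var, ty):
    """Does type variable `var` occur anywhere inside type `ty`?"""
    if isinstance(ty, str):
        return ty == var
    dom, rng = ty                       # arrow type (domain, range)
    return occurs(var, dom) or occurs(var, rng)

# Typing `x x` forces x : t1 and simultaneously x : (t1 -> t2), i.e., we
# must solve t1 = (t1 -> t2) -- but t1 occurs on the right-hand side:
print(occurs("t1", ("t1", "t2")))       # True: the derivation must fail
```

Because the occurs check succeeds, no finite simple type solves the equation, which is exactly why self-application is untypable.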

Because of the Strong Normalization Theorem, languages that are based on the simply-typed λ-calculus (statically typed functional languages), if they are to be Turing-complete, must include built-in facilities for constructing recursive definitions without violating the type system; directly implementing a recursion combinator (in the absence of additional primitives, at least) in these languages would be impossible. However, remember that we only needed the odd Y-combinator because a function could not refer to itself; we had to “solve for” the recursive call and use the Y-combinator to reach a fixed point. In λ-calculus, the only name binding is the formal parameter to λ functions. The standard workaround for the Strong Normalization Theorem is to allow a special kind of name binding by which a function may refer to itself, but where the binding is not itself the parameter of a λ function, as that would be the problematic combinator. For instance, OCaml’s let rec construction creates a recursive (self-referential) binding. Resolving types with let rec is complicated, because an inner expression can depend on the fully-resolved outer type; we will discuss how it’s done in the next module. The simpler technique is to require functions to explicitly declare both their parameter and result types.

Polymorphism

Let’s consider Church numerals in the simply-typed λ-calculus. Recall that a Church numeral is a two-argument lambda function (that is, a nested abstraction) in which the first argument is a function and the second argument is the value to apply that function to \(n\) times, where \(n\) is the value of the Church numeral. What type can be ascribed to the Church numeral, and to its arguments?

Well, the f argument needs to be of a function type, and because it can be called on x or on the result of f x, it needs to be of some type \(\tau_1 \to \tau_1\), and x needs to be of type \(\tau_1\). We can give this the concrete type \(t_1\), making the Church numeral for two, for example, \(\lambda f : t_1 \to t_1.\ \lambda x : t_1.\ f(f\ x)\), with type \((t_1 \to t_1) \to t_1 \to t_1\).

Now, what types can we give to, for instance, \(\ulcorner{*}\urcorner\)? Recall that \(\ulcorner{*}\urcorner = \lambda m.\ \lambda n.\ \lambda f.\ \lambda x.\ m(nf)x\). Let’s particularly focus on \(nf\). The type of f has to be \(t_1 \to t_1\), because that’s how we’ve just defined Church numerals. That’s fine, since that’s the type that n expects, so the expression \(nf\) is of type \(t_1 \to t_1\). m is also a Church numeral, so \(nf\) is typed correctly, and so is \(m(nf)x\). \(\ulcorner{*}\urcorner\) is typable.

But now, let’s consider \(\ulcorner{\hat{\phantom{x}}}\urcorner\). Recall that \(\ulcorner{\hat{\phantom{x}}}\urcorner = \lambda m.\ \lambda n.\ \lambda f.\ \lambda x.\ nmfx\). Let’s focus on \(nm\). n is a Church numeral, so the argument type it expects is \(t_1 \to t_1\). But, m is also a Church numeral, so it is of type \((t_1 \to t_1) \to t_1 \to t_1\). These types don’t match, so \(\ulcorner{\hat{\phantom{x}}}\urcorner\) is untypable!
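Note that nothing goes wrong operationally, only in typing. As a quick sanity check using untyped Python lambdas (with our own names `mult` and `exp` for \(\ulcorner{*}\urcorner\) and \(\ulcorner{\hat{\phantom{x}}}\urcorner\)), exponentiation computes perfectly well:

```python
# Church numerals and the operators from the text, as untyped lambdas.
two   = lambda f: lambda x: f(f(x))
three = lambda f: lambda x: f(f(f(x)))
mult  = lambda m: lambda n: lambda f: lambda x: m(n(f))(x)
exp   = lambda m: lambda n: lambda f: lambda x: n(m)(f)(x)

# Convert a Church numeral to a Python int by counting applications.
to_int = lambda c: c(lambda k: k + 1)(0)

print(to_int(mult(two)(three)))   # 6
print(to_int(exp(two)(three)))    # 8, i.e., 2^3
```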

The problem occurred when we moved from abstract types (some \(\tau_1\)) to concrete types (specifically \(t_1\)). Everything would have type-checked fine if we could have said that a Church numeral is ambivalent to exactly what types you pass in, so long as f is a function with domain and range equal to the type of x. Let’s work out Church numerals with abstract types.

Now we say that f is of type \(\tau_1 \to \tau_1\), and x is of type \(\tau_1\). What this means is that x can be of any type at all, so long as it is the same type that is both the parameter and result type of f. \(\ulcorner{*}\urcorner\) works exactly as it did before, substituting \(\tau_1\) for \(t_1\). But now, let’s type \(\ulcorner{\hat{\phantom{x}}}\urcorner\).

The type of a Church numeral is \((\tau_1 \to \tau_1) \to \tau_1 \to \tau_1\). But note that \(\tau_1\) need only be consistent within any Church numeral; different Church numerals in the same program can have different concrete versions of \(\tau_1\). So, for clarity, while examining \(\ulcorner{\hat{\phantom{x}}}\urcorner\), we will say that m is of type \((\tau_2 \to \tau_2) \to \tau_2 \to \tau_2\), and n is of type \((\tau_3 \to \tau_3) \to \tau_3 \to \tau_3\).

Let’s start by examining the expression \(nm\). n expects an argument of type \(\tau_3 \to \tau_3\), and m is of type \((\tau_2 \to \tau_2) \to \tau_2 \to \tau_2\). These are both abstract types, so can we somehow relate \(\tau_3\) to \(\tau_2\) to make this possible? Yes! \(\tau_3 = \tau_2 \to \tau_2\). Taking this assumption, the result of \(nm\) is of type \(\tau_3 \to \tau_3\), or equivalently, \((\tau_2 \to \tau_2) \to \tau_2 \to \tau_2\). \(nm\) expects an argument of type \(\tau_3 = \tau_2 \to \tau_2\), and f is of type \(\tau_1 \to \tau_1\). Can we somehow relate \(\tau_3\) to \(\tau_1\) to make this possible? Again, yes. \(\tau_3 = \tau_1 \to \tau_1\), and thus, \(\tau_1 = \tau_2\). Now, \(nmf\) is of type \(\tau_3\), or equivalently, \(\tau_1 \to \tau_1\). x is of type \(\tau_1\), so everything is typable.
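The chain of reasoning above is mechanical enough to automate. Here is a tiny first-order unifier in Python (the tuple-encoded arrow types and explicit substitution map are our own devices, and the occurs check is omitted for brevity), replaying the constraint for \(nm\):

```python
# Types: type-variable strings, or ('arrow', domain, range) tuples.

def arrow(d, r):
    return ('arrow', d, r)

def resolve(ty, subst):
    """Follow variable bindings in the substitution map."""
    while isinstance(ty, str) and ty in subst:
        ty = subst[ty]
    return ty

def unify(a, b, subst):
    """Extend subst so that a and b become equal (no occurs check here)."""
    a, b = resolve(a, subst), resolve(b, subst)
    if a == b:
        return subst
    if isinstance(a, str):
        subst[a] = b
        return subst
    if isinstance(b, str):
        subst[b] = a
        return subst
    unify(a[1], b[1], subst)    # both arrows: unify domains...
    unify(a[2], b[2], subst)    # ...then ranges
    return subst

def church(t):
    """The type (t -> t) -> t -> t of a Church numeral over t."""
    return arrow(arrow(t, t), arrow(t, t))

# n : church(t3) expects an argument of type t3 -> t3; m : church(t2).
subst = unify(arrow('t3', 't3'), church('t2'), {})
print(resolve('t3', subst))     # ('arrow', 't2', 't2'), i.e., t3 = t2 -> t2
```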

We see that even for Church numerals to work correctly, we need abstract types to be a part of our language, and a part of our type judgment. To type expressions, we need to be able to relate abstract types to each other. Congratulations, we’ve just invented polymorphism.

The abstraction of code to work over numerous types of data is known as polymorphism, and code that works on data of several types is called polymorphic. By contrast, code that is monomorphic is not “abstracted over types” and can only work on data of a single type.

We didn’t need to discuss polymorphism in the context of the untyped λ-calculus, because the problem only arose when we tried to check types. It is not that dynamically typed code is all polymorphic; rather, dynamically typed code is neither polymorphic nor monomorphic, as those are terms from the domain of typing. The kind of code reuse afforded by dynamic typing is qualitatively different from polymorphism. Dynamic type systems, like that of Smalltalk, allow all functions to accept arguments of all types, but may fail when the values cannot actually behave as the functions expect them to. A polymorphic function may accept arguments of many different types, but it need not accept all types. Polymorphism does not preclude static type-checking; rather, it generalizes static type checking. Statically typed, polymorphic code can work on a variety of data types, but still offers the guarantee of type safety at run-time; dynamically typed code can produce run-time type errors.

Appropriate to its name, polymorphism comes in several forms. In Cardelli and Wegner’s paper “On understanding types, data abstraction, and polymorphism”, they categorized polymorphism in a hierarchy. There are two major varieties of polymorphism: ad hoc polymorphism, and universal polymorphism.

Ad hoc polymorphism is based on constructing multiple implementations of the entity being coded, one for each specific type that it can be used with. As a result, ad hoc polymorphism only permits code to run on a finite and bounded number of data types. By contrast, universal polymorphism, considered by many to be the only true form of polymorphism, is based on constructing a single implementation that is generalized over types in some way. Universal polymorphism allows code to work on an unbounded number of types, possibly including types which may not even have existed when the polymorphic code was written.

Each of these types of polymorphism has two subvariants. Ad hoc polymorphism is further divided into overloading and coercion. Overloading occurs when the same name is used for different entities (usually functions) inside the same scope. For example, an addition function might be required to work on both integers and real numbers. References to overloaded names are disambiguated at compile-time, via a process known as overload resolution. To determine the exact implementation of an overloaded function to which a given reference is bound, the compiler generally examines the number and type of the function’s arguments (as in C++), and, optionally, the return type required by the function’s context (as in Ada). Function overloading in C++ and Java is a form of overloading polymorphism.
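Python has no compile-time overload resolution, but its `functools.singledispatch` decorator gives a run-time approximation of the same idea: one name, with a separate implementation registered per argument type. A small illustration (function names are our own):

```python
# Ad hoc polymorphism via separate per-type implementations. Unlike C++,
# dispatch happens at run-time on the first argument's class, but the
# essential shape -- one name, several bodies -- is the same.
from functools import singledispatch

@singledispatch
def describe(x):
    raise TypeError(f"no overload for {type(x).__name__}")

@describe.register(int)
def _(x):
    return f"integer {x}"

@describe.register(float)
def _(x):
    return f"real {x}"

print(describe(4))     # integer 4
print(describe(3.5))   # real 3.5
```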

Coercion occurs when values of one type are converted to another type, so that an expression makes sense in context. Coercion can be either explicit or implicit. Explicit coercion, otherwise known as casting, occurs when a programmer forces the type of an expression to change, either through a conversion function or through a built-in casting operator. Explicit coercion is not really a form of polymorphism, since the conversion function or casting operator is equivalent to calling a function with the appropriate argument and return type. Implicit coercion occurs when the compiler changes the type of an expression without any action on the part of the programmer. For example, compilers often coerce integers into floating-point or real numbers so that an addition of the form 3.5 + 4, which attempts to add an integer and a real number, will satisfy the type system. In general, strongly-typed languages restrict implicit coercion to a few safe cases, or exclude it entirely.

Universal polymorphism can be divided into two kinds as well: parametric polymorphism and inclusion polymorphism. Generally considered to be the most powerful and useful form of polymorphism, parametric polymorphism refers to the ability to build abstractions over all types, by constructing objects (generally, functions) that are parameterized (often implicitly) by types. We stumbled upon parametric polymorphism while giving types to Church numerals: \(\tau_1\) is, in essence, a type parameter to a Church numeral, and resolving the relationships between abstract types was how we found the argument for that parameter. A parametrically polymorphic static type system can guarantee that, no matter what types are supplied as type parameters, the result will be type-safe. OCaml has parametric polymorphism, and you’ve probably encountered type errors specifying types such as 'a -> 'a. Those 'as are OCaml’s \(\tau\)s! Parametric polymorphism is closely related to the concept of generic programming, which appears in many languages, including Ada, Modula-3 and EL1, and in a more limited form, in Java. We will discuss parametric polymorphism further in the next module.
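OCaml infers such types; in Python, the `typing` module lets us write the type parameter explicitly (checked by external tools such as mypy, not at run-time). A sketch, where `T` plays the role of our \(\tau_1\) (or OCaml’s `'a`):

```python
# Parametric polymorphism: one body, abstracted over the type T.
from typing import TypeVar, Callable

T = TypeVar('T')

def twice(f: Callable[[T], T], x: T) -> T:
    """Apply f two times -- a single implementation for every T."""
    return f(f(x))

print(twice(lambda n: n + 1, 0))        # 2
print(twice(lambda s: s + "!", "hi"))   # hi!!
```

The same `twice` works at `int` and at `str`; the only constraint, exactly as with Church numerals, is that `f`’s domain and range equal the type of `x`.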

Inclusion polymorphism, also called subtyping, is based on the arrangement of types into a lattice, known as a subtype hierarchy. Type \(\tau_1\) is a subtype of type \(\tau_2\) if values of type \(\tau_1\) are valid in every context in which values of type \(\tau_2\) can occur. Thus, any function that works on values of type \(\tau_2\) can also work on values of type \(\tau_1\). Inclusion polymorphism is universal, rather than ad hoc, because new subtypes of \(\tau_2\) can be created, without code expecting \(\tau_2\) being affected. Inclusion polymorphism forms the basis for the kind of polymorphism observed in object-oriented programming. For instance, subclassing in C++ and Java is a form of inclusion polymorphism. We will discuss inclusion polymorphism further as part of our discussion on object orientation, in Module 8.
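As a small illustration (class names are our own), code written against a base type continues to work on subtypes introduced after it was written:

```python
# Inclusion polymorphism via subclassing: total_area is written against
# Shape only, yet works on any subtype, present or future.
import math

class Shape:
    def area(self):
        raise NotImplementedError

class Square(Shape):
    def __init__(self, side):
        self.side = side
    def area(self):
        return self.side * self.side

def total_area(shapes):
    # knows nothing beyond the Shape interface
    return sum(s.area() for s in shapes)

class Circle(Shape):          # a new subtype; total_area is unaffected
    def __init__(self, r):
        self.r = r
    def area(self):
        return math.pi * self.r ** 2

print(total_area([Square(2), Circle(1)]))   # 4 + pi
```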

Types in Practice

A type judgment is a mathematical formulation of a type checker, which is implemented in a real programming language compiler or interpreter. Generally speaking, type checkers simply do a depth-first search over the code, carrying a stack-like type environment as they go, and determine types from the inside out. Most type judgments work in that way: if the subexpressions have some types, then the whole expression has some type. Parametric polymorphism complicates type checking, and we will investigate the algorithm for type checking in a parametrically polymorphic language in the next module.
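To make the “depth-first search carrying a type environment” concrete, here is a toy checker for the simply-typed λ-calculus, a sketch under a tuple-based AST of our own devising rather than any real compiler’s representation:

```python
# Expressions: ('var', name), ('abs', param, param_type, body),
# ('app', rator, rand). Types: variable strings or ('arrow', dom, rng).

def typecheck(expr, env):
    kind = expr[0]
    if kind == 'var':                        # T_Variable: look up x in Γ
        _, name = expr
        return env[name]
    if kind == 'abs':                        # T_Abstraction: extend Γ, descend
        _, param, ptype, body = expr
        btype = typecheck(body, {**env, param: ptype})
        return ('arrow', ptype, btype)
    if kind == 'app':                        # T_Application: domains must match
        _, rator, rand = expr
        ftype = typecheck(rator, env)
        atype = typecheck(rand, env)
        if ftype[0] != 'arrow' or ftype[1] != atype:
            raise TypeError("ill-typed application")
        return ftype[2]
    raise ValueError(f"unknown expression {kind}")

# λf : t1 -> t1. λx : t1. f (f x) -- the Church numeral two
two = ('abs', 'f', ('arrow', 't1', 't1'),
       ('abs', 'x', 't1',
        ('app', ('var', 'f'), ('app', ('var', 'f'), ('var', 'x')))))
print(typecheck(two, {}))
# ('arrow', ('arrow', 't1', 't1'), ('arrow', 't1', 't1'))
```

Note that the environment behaves exactly like a stack: the `{**env, param: ptype}` extension exists only for the recursive call under the λ, mirroring \(\langle x, \tau_1\rangle + \Gamma\).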

More importantly, types are often manifest in how the code is compiled. For instance, on many systems, a C int and a C double must be stored in different banks of registers to use them, and probably have different sizes as well. In a garbage collected language, you must communicate to the garbage collector where reference-typed values are stored. And, at the most basic level, if C code accesses a certain field of a struct, or code in an object-oriented language calls a particular method of an object, then the compiler must know where and how that is stored. This aspect of types is examined further in CS444, and we will mostly not address it in this course, but we will discuss the peculiarities of type implementations in different paradigms.

Semantics Redux

In the previous module, we introduced semantics using universal quantifiers to guarantee ordering, which was a bit unsatisfactory, since it makes proofs more difficult. With types, we can now have a more concrete sense of terminal values, so let’s redefine our semantics in terms of terminal values. We will denote terminal values syntactically:

⟨Term⟩ ::= ⟨Abs⟩

Of course, we will add to this definition later. Note that this definition is correct for AOE; NOR’s definition would be much more complicated.

We may now rewrite AOE in terms of terminal values:

Definition 1. (Small-Step Operational Semantics of the Simply-Typed λ-Calculus, AOE)

Let the metavariable \(M\) range over λ-expressions, and \(V\) range over terminal values.

\[ \dfrac{M_1 \to M_1'}{M_1\ M_2 \to M_1'\ M_2} \]\[ (\lambda x : \tau.\ M)V \to M[V/x] \]\[ \dfrac{M \to M'}{V\ M \to V\ M'} \]

Note that we can avoid all universal quantifiers simply by restricting subexpressions which need to have been fully evaluated to \(V\). Since ReduceRight only applies if the rator has been fully evaluated, the rator will always be fully evaluated before the rand. In turn, since abstractions are values and Application requires the rand to be a value, both ReduceLeft and ReduceRight will occur before Application.
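These three rules translate almost directly into code. The following sketch (with a tuple AST of our own devising, and a naive substitution that assumes bound-variable names are distinct) tries ReduceLeft, then ReduceRight, then Application, exactly as the side conditions dictate:

```python
# Expressions: ('var', name), ('abs', param, type, body), ('app', f, a).

def is_value(e):
    """Mirrors the <Term> definition: only abstractions are terminal."""
    return e[0] == 'abs'

def subst(e, name, val):
    """Naive capture-unaware substitution e[val/name]."""
    kind = e[0]
    if kind == 'var':
        return val if e[1] == name else e
    if kind == 'abs':
        _, p, t, body = e
        return e if p == name else ('abs', p, t, subst(body, name, val))
    _, f, a = e
    return ('app', subst(f, name, val), subst(a, name, val))

def step(e):
    """One AOE step, or None if e is terminal (stuck terms raise)."""
    if is_value(e):
        return None
    _, f, a = e                       # a non-value must be an application
    if not is_value(f):               # ReduceLeft
        return ('app', step(f), a)
    if not is_value(a):               # ReduceRight
        return ('app', f, step(a))
    _, p, t, body = f                 # Application: both sides are values
    return subst(body, p, a)

# ((λx:t1. x) (λy:t1. y)) takes one Application step to λy:t1. y
ident = ('abs', 'y', 't1', ('var', 'y'))
print(step(('app', ('abs', 'x', 't1', ('var', 'x')), ident)))
```

Determinism is visible in the code: exactly one of the three branches fires for any reducible expression.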

Exercise 2. Prove that these semantics are deterministic. That is, for all λ-expressions \(E_1\) and \(E_2\) such that \(E_1 \to E_2\), and for all λ-expressions \(E_3\), show that either \(E_1 \not\to E_3\) or \(E_2 = E_3\) up to α-renaming. The \(\to\) here is AOE’s \(\to\) above, not \(\to_\beta\) (which is non-deterministic).

Adding More Types

In the remainder of this module, we will extend the simply-typed λ-calculus with some commonly-seen primitives, to give you an idea of the kinds of type rules you are likely to see in other languages. First, we need to contend with the fact that the simply-typed λ-calculus isn’t even Turing-complete, by making recursion possible again.

Let Bindings

We will add two new constructs to our language: the let binding and the let rec binding. A let binding allows us to define a variable without an actual application. A let rec binding allows us to define a variable which is usable within its own definition, thus allowing us to write a recursive function. To simplify typing, our let rec bindings will only allow us to bind abstractions, and will require explicitly specifying the return type of the abstraction.

We extend the syntax of the simply-typed λ-calculus as follows:

⟨Expr⟩ ::= ...
         | let ⟨Var⟩ = ⟨Abs⟩ in ⟨Expr⟩
         | let rec : ⟨Type⟩ ⟨Var⟩ = ⟨Abs⟩ in ⟨Expr⟩

While λ-calculus has always had variables, variables were resolved by substitution. This is not how most real programming languages work, and will not work for recursive functions. Instead, we need to store variables somewhere. Because of this, the way that we do reductions must change. Previously, our “step” morphism related a λ-expression to a λ-expression: a reduction could always be performed on an expression with no further context. We must add some context, in the form of the variables currently defined. We will do this by instead relating pairs, in which a pair is a store and a program.

Our store will be a partial map, denoted \(\sigma\) (sigma). \(\sigma(x)\) is the value of x in the map, and \(\sigma[x \mapsto v]\) denotes a new map in which x maps to v, and all other values map as they did in \(\sigma\). empty denotes the empty map. The store will store only our let bindings, not other variables, as other variables will continue to operate by substitution.

Our first rule, therefore, describes how a step in the whole program relates to a step taken with a store:

\[ \dfrac{\langle \text{empty}, E\rangle \to \langle \sigma, E'\rangle}{E \to E'} \]

That is, we can take a step in \(E\) if we can take a step in \(E\) with an empty store. We will fill the store in subexpressions of \(E\).

The semantic rules of AOE can be used verbatim, adjusting for our new syntax:

\[ \langle \sigma, (\lambda x : \tau.\ M)V\rangle \to \langle \sigma, M[V/x]\rangle \]\[ \dfrac{\langle \sigma, M_1\rangle \to \langle \sigma, M_1'\rangle}{\langle \sigma, M_1\ M_2\rangle \to \langle \sigma, M_1'\ M_2\rangle} \]\[ \dfrac{\langle \sigma, M\rangle \to \langle \sigma, M'\rangle}{\langle \sigma, V\ M\rangle \to \langle \sigma, V\ M'\rangle} \]

Now, let’s define semantics for let.

\[ \dfrac{\sigma' = \sigma[x \mapsto V[z/x]] \quad z \text{ is a fresh variable} \quad \langle \sigma', M\rangle \to \langle \sigma', M'\rangle}{\langle \sigma, \text{let}\ x = V\ \text{in}\ M\rangle \to \langle \sigma, \text{let}\ x = V\ \text{in}\ M'\rangle} \]\[ \dfrac{\sigma[x] = V}{\langle \sigma, x\rangle \to \langle \sigma, V\rangle} \]\[ \langle \sigma, \text{let}\ x = V_1\ \text{in}\ V_2\rangle \to \langle \sigma, V_2[V_1/x]\rangle \]

The most important rule here is LetBody, which defines how a variable enters \(\sigma\), as \(\sigma'\). If a let binding maps x to some value \(V\), then a mapping from x to \(V\) is added to the store of variables during the reduction of \(M\). In the version of \(V\) in \(\sigma'\), x is replaced with a fresh variable z, because x is only bound to \(V\) in \(M\), not in \(V\) itself. let rec is a recursive binding, and so it will bind x in \(V\). \(M\) is evaluated with \(\sigma'\) as its store, so it can find x.

The Variable rule describes using variables from the store. Note that there is no Variable rule in normal AOE, because variables are only resolved via substitution; with a store, we need an explicit rule to replace a variable with its value. The two do not conflict in this case because of how AOE evaluates. The Variable rule will only be reached if it is not inside an abstraction, so there cannot be another same-named variable to contend with.

A let is not itself a value, so we also have a LetResolution rule to allow a fully-resolved let binding to become a value. Note that because of how variables work in the λ-calculus, the final step to resolve a let binding to a value uses substitution; otherwise, any uses of the defined variable in the body would become free. This is just an unusual result of combining substitution with let binding, and not a universal feature of languages, and we’ll have to use a different approach for let rec.

Now, let’s look at the semantics for let rec, which will be mostly similar to let.

\[ \dfrac{\sigma' = \sigma[x \mapsto V] \quad \langle \sigma', M\rangle \to \langle \sigma', M'\rangle}{\langle \sigma, \text{let rec} : \tau\ x = V\ \text{in}\ M\rangle \to \langle \sigma, \text{let rec} : \tau\ x = V\ \text{in}\ M'\rangle} \]\[ \dfrac{x \neq y \quad x \in FV[M]}{\langle \sigma, \text{let rec} : \tau_1\ x = V\ \text{in}\ \lambda y : \tau_2.\ M\rangle \to \langle \sigma, \lambda y : \tau_2.\ \text{let rec} : \tau_1\ x = V\ \text{in}\ M\rangle} \]\[ \dfrac{x \notin FV[V_2]}{\langle \sigma, \text{let rec} : \tau\ x = V_1\ \text{in}\ V_2\rangle \to \langle \sigma, V_2\rangle} \]

The LetRecBody rule is identical to LetBody, except that the version of \(V\) which is added to \(\sigma\) may still refer to x. This is how recursion is achieved. In a future reduction of the let binding, the x in the body of \(V\) can expand to \(V\) again.

There is no Variable rule specific to let rec, and indeed, there couldn’t be, since the let syntax is not part of the rule for variables anyway.

The LetRecResolve rule is similar to LetResolution, but rather than using substitution to deal with any lingering xs in \(V_2\), they are simply disallowed. We cannot simply remove the let rec if the recursive definition is still used.

What we can do in that case is use LetRecInvert, which applies in all the cases that LetRecResolve does not. To keep the abstraction that this let rec resolves to usable in further applications, while still allowing the recursive function to be used, we can move the let rec inside the abstraction. This sort of inversion is a common way to resolve semantic corners like this one.

Now that we’ve created syntax and semantic rules for let bindings, it’s time to add type judgments. We can use all previous type judgments as they are; we only need to add new judgments for our new syntax.

\[ \dfrac{\Gamma \vdash V : \tau_1 \qquad \{\langle x, \tau_1\rangle\} + \Gamma \vdash E : \tau_2}{\Gamma \vdash \text{let}\ x = V\ \text{in}\ E : \tau_2} \]\[ \dfrac{\Gamma_0 = \{\langle x, \tau_1\rangle\} + \Gamma \qquad \Gamma_0 \vdash V : \tau_1 \qquad \Gamma_0 \vdash E : \tau_2}{\Gamma \vdash (\text{let rec} : \tau_1\ x = V\ \text{in}\ E) : \tau_2} \]

The T_Let rule is similar to T_Abstraction, but instead of the type being written, it is judged from the value bound to x.

The T_LetRec rule takes that one step further, by using the environment in which x has already been defined to judge the type of \(V\) itself. Thus, \(V\) can refer to x and still type-check. To do this without an explicit specification of \(\tau_1\) is complex, so for this simple calculus, we simply demanded that \(\tau_1\) be written, and that it match the actual type of \(V\).

We will not prove progress and preservation for our new let-rec-calculus, but the proof follows a similar line of reasoning as for the simply-typed λ-calculus. More importantly, the let-rec-calculus is Turing-complete again, albeit even more cumbersome to use than the untyped λ-calculus.

Booleans and Conditionals

Recall that our boolean and conditional extension in Module 3 added syntax for true and false, not, and, or, and if conditionals. The semantic rules can be used verbatim; we only need to add type and value syntax, and type rules.

First, we extend our definition of terminal values and our type language to add a new boolean type:

⟨Term⟩     ::= ... | true | false
⟨PrimType⟩ ::= ... | boolean

To add a specific type for true and false would require inclusion polymorphism, so we will stick to a single boolean type.

Now, we add type rules. We will start with the booleans themselves, not, and, and or.

\[ \Gamma \vdash \text{true} : \text{boolean} \]\[ \Gamma \vdash \text{false} : \text{boolean} \]\[ \dfrac{\Gamma \vdash E : \text{boolean}}{\Gamma \vdash \text{not}\ E : \text{boolean}} \]\[ \dfrac{\Gamma \vdash E_1 : \text{boolean} \qquad \Gamma \vdash E_2 : \text{boolean}}{\Gamma \vdash \text{and}\ E_1\ E_2 : \text{boolean}} \]\[ \dfrac{\Gamma \vdash E_1 : \text{boolean} \qquad \Gamma \vdash E_2 : \text{boolean}}{\Gamma \vdash \text{or}\ E_1\ E_2 : \text{boolean}} \]

The true and false literals are, of course, booleans. As well, not, and, and or are always of boolean type, but their operands must also be of boolean types for type judgment to succeed.

Now, let’s examine if.

\[ \dfrac{\Gamma \vdash E_1 : \text{boolean} \qquad \Gamma \vdash E_2 : \tau \qquad \Gamma \vdash E_3 : \tau}{\Gamma \vdash \text{if}\ E_1\ \text{then}\ E_2\ \text{else}\ E_3 : \tau} \]

For if to work at all, the condition must be boolean. More importantly, the “then” and “else” branches must have the same type (\(\tau\)). Note that we don’t care what the type of \(\tau\) is: we can simply judge the type of \(E_2\) and \(E_3\), and then make sure they’re the same.
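In checker code, T_If needs nothing more than an equality test on the judged branch types. A sketch (with types represented as plain strings of our own choosing, standing in for the judged types of the three subexpressions):

```python
def typecheck_if(cond_t, then_t, else_t):
    """T_If: condition must be boolean; branches must agree; result is
    the common branch type."""
    if cond_t != 'boolean':
        raise TypeError("if condition must be boolean")
    if then_t != else_t:
        raise TypeError("branches of if must have the same type")
    return then_t

print(typecheck_if('boolean', 'nat', 'nat'))   # nat
```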

Numbers

We saw previously that it’s hard to use Church numerals in the simply-typed λ-calculus without polymorphism. The better solution, of course, is to simply add numbers. Again, the semantics from Module 3 are sufficient, so we only need to add numbers to the terminals and type language, and add type judgments for the new operators.

First, we extend our terminals and type language to add a new natural number type:

⟨Term⟩     ::= ... | 0 | 1 | ...
⟨PrimType⟩ ::= ... | nat

And now, we only need two new type judgments:

\[ \Gamma \vdash a : \text{nat} \]

(Recalling that \(a\) is a metavariable over natural numbers, T_Num quite simply says that all natural numbers are nats; this is an axiom.)

\[ \dfrac{O \in \langle\text{NumBinOps}\rangle \qquad \Gamma \vdash E_1 : \text{nat} \qquad \Gamma \vdash E_2 : \text{nat}}{\Gamma \vdash (O\ E_1\ E_2) : \text{nat}} \]

T_NumOp specifies that the result of a numeric binary operation is always a nat (as we have no other kind of number), and its operands must also be nats.

If we added a type for integers, for example, we would need rules for when we switch from natural numbers to integers. For instance, the sum of two nats is a nat, but the difference between two nats is an int.

Exercise 3. Write the syntax, semantics, and type rules for numeric comparison operations, such as <, <=, >, etc. Then, simple equality of = or ==. What if you want to be able to compare abstractions? What if you want to be able to compare anything to anything?

Lists

The most critical change we will need for lists is a constructed list type: we must be able to distinguish a list of \(t_1\)s from a list of \(t_2\)s. In many languages, there would also be a simple “list” type, where all lists are of that type, but this would require subtyping, a form of polymorphism, so we’ll exclude it. However, this presents a problem: what is the type of empty? Our solution will be a bit dubious, and we’ll explain why after presenting the types.

First, our extended type language:

⟨Type⟩ ::= ... | list ⟨Type⟩

We may now declare something as, for instance, a list t1.

Now, let’s extend the terminals so that fully-resolved lists are terminal:

⟨Term⟩         ::= ... | empty | [⟨Term⟩ ⟨TermListRest⟩]
⟨TermListRest⟩ ::= ε | , ⟨Term⟩ ⟨TermListRest⟩

Now, the new type judgments. Let’s start with cons and empty.

\[ \dfrac{\Gamma \vdash E_1 : \tau \qquad \Gamma \vdash E_2 : \text{list}\ \tau}{\Gamma \vdash (\text{cons}\ E_1\ E_2) : \text{list}\ \tau} \]\[ \Gamma \vdash \text{empty} : \text{list}\ \tau \]

The T_Cons rule should be fairly clear: you can cons an element of type \(\tau\) to a list of type \(\text{list}\ \tau\). But, consider the T_Empty rule carefully: \(\tau\) is unrestricted! This means that for any type \(\tau\), it is true that \(\Gamma \vdash \text{empty} : \text{list}\ \tau\)! This sort of non-deterministic type judgment is a particularly ad hoc form of polymorphism, and is usually not considered acceptable, since type judgments being deterministic makes them far easier to implement in a real language. We’ll have to wait for better polymorphism to have a better option, though.

Exercise 4. Correct the semantics for lists from Module 3 to use terminal values instead of universal quantifiers to enforce ordering.

Sets

The rules for sets are extremely similar to the rules for lists, so we will simply present them:

⟨Term⟩        ::= ... | empty | {⟨Term⟩ ⟨TermSetRest⟩}
⟨TermSetRest⟩ ::= ε | , ⟨Term⟩ ⟨TermSetRest⟩
⟨Type⟩        ::= ... | set ⟨Type⟩
\[ \dfrac{\Gamma \vdash E_1 : \tau \qquad \Gamma \vdash E_2 : \text{set}\ \tau}{\Gamma \vdash (\text{insert}\ E_1\ E_2) : \text{set}\ \tau} \]\[ \dfrac{\Gamma \vdash E_1 : \tau \qquad \Gamma \vdash E_2 : \text{set}\ \tau}{\Gamma \vdash (\text{remove}\ E_1\ E_2) : \text{set}\ \tau} \]\[ \Gamma \vdash \text{empty} : \text{set}\ \tau \]

In the next module, we will begin our examination of programming language paradigms with functional programming.


Module 5: Functional Programming

“Any sufficiently complicated C or Fortran program contains an ad hoc, informally-specified, bug-ridden, slow implementation of half of Common Lisp.” — Philip Greenspun’s tenth rule

“Greenspun’s tenth rule is just an arrogant way to say ‘programs sometimes use abstraction’.” — Gregor Richards


1. Introduction to Functional Programming

You may have learned Racket in your first year (if you are an undergraduate student at Waterloo and followed a normal curriculum), and you have certainly used OCaml for this course. Congratulations, you’ve used a functional programming language! In fact, OCaml’s second antecedent, ML, was a candidate to be our exemplar of functional programming, but Haskell will be used instead, for reasons which will become clear soon.

First, let’s clear one thing up: in common parlance, “functional” is a synonym for “in working order”. Thus, there are many obvious jokes around the premise “if it’s not functional, then it must be dysfunctional!” Functional in this context, of course, means “of or relating to functions”.

Functional languages are based on the construction and evaluation of expressions, and the mathematical equivalence between parameterized expressions and functions. Computations are encapsulated as functions, and functions are combined with each other to express complex computational tasks. Ordering is implicit, and computations are described in terms of the mathematical formula required to achieve their result, rather than an explicit set of steps.

Functional languages are distinguished by the absence (or reduced importance) of commands, particularly assignment statements. Through the elimination of commands, which alter program state, functions in the language behave more like functions in the mathematical sense: the output of any function is completely determined by its input, and does not depend on the sequence of previous calls to the function, on input/output, or on any external state. In a purely functional language, execution of a function produces no side-effects — the entire state of the machine, including the values bound to all variable names, remains unchanged after invocation of the function. Languages with this property are said to exhibit referential transparency:

Definition 1. (Referential Transparency) A language is referentially transparent if every function has the property that, whenever it is given the same input, it always produces the same output.

More generally, we say that a language is referentially transparent if the meaning (i.e., the value) of an entity (variable, expression, etc.) is defined independently of its external context. However, for our purposes, the above definition will suffice.

An important consequence of referential transparency is that if two entities are declared to be equal, such as a variable bound to a value, then either side of the equals sign should be equivalent in all contexts. For example, if the variable z is set to the value f(3), then throughout their collective lifetime, the expressions z and f(3) can be used interchangeably.1 Without referential transparency, we lose this property.

Because of practicality considerations, few languages completely prohibit the use of side-effects. Those that do are called “pure functional” languages, and those that don’t are called “impure”. An obvious example of a pure functional language is the \( \lambda \)-calculus itself. Languages like Scheme/Racket, its cousins Lisp and Common Lisp, ML, and its descendant OCaml, are impure, because they permit side-effects through assignment statements and I/O, but have a pure subset and generally discourage the use of impure features. Other languages, such as Haskell, which will serve as the exemplar of functional programming, prohibit assignment and I/O entirely, instead relegating all change in state to a form of explicit specification, called monads. Such languages are generally considered pure, and this design decision enforces programming in a functional style much more stringently than impure functional languages do, which is why Haskell is being used as an exemplar.

In impure functional languages, z and f(3) from above will usually be equivalent, but might not be. For example, consider this f and z written in OCaml:

open Printf
let x = ref 0
let f q =
  x := !x + 1;
  !x + q
let z = f 3
let () =
  printf "%d\n" z;
  printf "%d\n" (f 3);
  printf "%d\n" z;
  printf "%d\n" (f 3)

Output:

4
5
4
6

Both printed values for z are 4, but f 3 (i.e., f(3)) takes the value 4 when it’s bound to z, 5 when it’s printed on the second line, and 6 when it’s printed on the fourth line. OCaml’s references break referential transparency.
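For contrast, here is a pure Haskell analogue of the same program (our own translation): with no reference cell to mutate, z and f 3 really are interchangeable.

```haskell
-- Pure version: f depends only on its argument, so every occurrence of
-- f 3 is interchangeable with z.
f :: Int -> Int
f q = 1 + q

z :: Int
z = f 3

main :: IO ()
main = do
  print z      -- 4
  print (f 3)  -- 4
  print z      -- 4
  print (f 3)  -- 4
```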

By definition, in all functional languages, functions are first-class values. That is, variables can hold functions, and functions can be passed as values. This is not an uncommon characteristic of other programming languages, but it is certainly a mandatory characteristic of functional languages. Moreover, in functional languages, the difference between a function and a value is often just the presence or absence of parameters. For instance, in OCaml, let f = 42 is a declaration of a variable, and let f x = x + 42 is a declaration of a function; and of course, the declaration of a function is really just a declaration of a variable which happens to be bound to a function, which is a value.

Typically, variables in functional languages aren’t, uh, variable. Actually, they’re variable in the mathematical sense, but not the intuitive sense: what “variable” means mathematically is that in different resolutions of an expression or function, a variable may take different values. Once bound, a variable’s value cannot change; the way that it varies is that it could be bound differently in a different context. This property is immutability. In impure functional languages, variables are immutable by default, and mutation usually requires some extra construct — such as OCaml’s refs and mutable record fields — to express. Pure functional languages have no way to express mutable variables per se.


2. Exemplar: Haskell

“A monad is a monoid in the category of endofunctors, what’s the problem?” — James Iry, parodying Philip Wadler, A Brief, Incomplete, and Mostly Wrong History of Programming Languages

Our exemplar language for functional programming is Haskell. Haskell is a lazy, pure functional language, and is vastly more popular than any other language in that category.

Remember, our goal is not to learn exemplars to the point that we can use them for effective programming, merely to use them to demonstrate some concepts of each paradigm. In this module, we will describe semantics and syntax as they relate to Haskell, so that we have something solid to relate them to; when Haskell is an outlier, we will also informally describe how it’s done differently in other functional languages.

In the 1980s, a committee was formed to design a state-of-the-art pure functional language. The result was Haskell, named for the logician Haskell Curry. The first report on the Haskell language was published in 1990, and the current Haskell standard is Haskell 2010.

Haskell was influenced by ML, and is thus a cousin of OCaml. Some of its syntax will look similar to what you’ve seen in OCaml. However, Haskell differs from ML in several important ways.

We will be introducing Haskell syntax as we introduce the relevant concepts. However, most of Haskell will be left unexplored, and this module is not intended to serve as a guide for learning Haskell.


3. Expressions

All behavior in functional programs is described by expressions. Expressions can be made up of numbers, lists, functions, or other data. We combine expressions together to form more complex expressions.

Expressions typed into a Haskell REPL, such as ghci, are evaluated immediately. If we type 20 into ghci, it will respond with 20.

Expressions can be combined using built-in functions, such as + or *, to form a compound expression that represents the application of that function. Note that + and * are functions, not operators, but they can be used in a way that resembles operators, by placing the function between its operands. The syntax is similar to OCaml and largely derived from normal mathematical notation. For instance:

> 100 + 20
120
> 100 * 20
2000
> 100 + 30 * 5
250
> 5 + 10 + 15 + 20
50

Functions can also be called in a way more similar to the \( \lambda \)-calculus, with the function followed by its arguments:

> max 5 10
10
> min (-12) 600
-12
> abs (-42)
42

Again, operators are actually functions, so they may be called in the \( \lambda \)-calculus style as well, merely requiring a bit of syntactic shuffling:

> (+) 100 20
120
> (*) 100 20
2000
> (+) 100 ((*) 30 5)
250

Just like the \( \lambda \)-calculus, all functions actually take one argument, and multi-argument functions are just repeated application:

> add10 = (+) 10
> add10 20
30
> add10 (-10)
0

We’ve also shown a variable binding here; variable bindings in Haskell are less akin to let bindings in OCaml, which evaluate their expression ahead of time, and more a simple assertion of equivalence. That is, we’ve declared that add10 is a shorthand for writing (+) 10.

Aside: The fact that every function is actually single-argument is why both \( \lambda \)-calculus and Haskell eschew parentheses for arguments to functions, in spite of the fact that mathematical notation has them. There’s no need for that extra syntax when the number of arguments is unambiguous, and it would be a lot of clutter to add it!

All behavior is encapsulated into functions in this way, so all compound expressions either are function calls or can be reinterpreted as function calls (there is some syntactic sugar for common data types such as lists).

Since operators are just functions, if you’re feeling especially cruel, you can redefine them:

> x + y = x - y
> 2 + 2
0

Luckily, these redefinitions are contextual. You cannot convince all of Haskell that addition and subtraction are the same thing; you are only allowed to shoot yourself in the foot.


4. Functions

In functional languages, functions are first class. Specifically, an entity is first-class if it can be:

  • returned from functions,
  • passed to functions as arguments, and
  • stored as data.

This term is not precise, and different definitions will have different requirements. For instance, it is common to additionally require function expressiveness — that functions may be expressed in any context — which would exclude C’s functions from the definition. We won’t split hairs about what it means to be “first class” in this course.

The formal underpinning of a first-class function is a \( \lambda \)-calculus abstraction. The \( \lambda \)-calculus’s abstractions can be expressed in Haskell using a relatively similar syntax, but replacing the hard-to-type \( \lambda \) character with a backslash (\) and the dot with an ASCII arrow (->). For instance, we can represent Church numerals and their mathematical operators:

> churchTwo   = \f -> \x -> f (f x)
> churchThree = \f -> \x -> f (f (f x))
> churchMul   = \m -> \n -> \f -> m (n f)
> churchExp   = \m -> \n -> n m

Using these functions is a bit unsatisfying, only because the result is also a function, and Haskell doesn’t have a built-in way to print a \( \lambda \) function:

> churchMul churchTwo churchThree
<interactive>:1:1: error:
    * No instance for (Show ((t0 -> t0) -> t0 -> t0))
        arising from a use of 'print'
      (maybe you haven't applied a function to enough arguments?)
    * In a stmt of an interactive GHCi command: print it

But, remember how Church numerals work: they are functions which take an f and an x, and apply f repeatedly to x n times, where n is the number that the Church numeral represents. So, we can “translate” the resulting Church numeral into numbers that Haskell can print by making f the function to add 1, and x = 0:

> churchMul churchTwo churchThree ((+) 1) 0
6
> churchExp churchTwo churchThree ((+) 1) 0
8
> churchMul (churchMul churchTwo churchThree) (churchExp churchTwo churchThree) ((+) 1) 0
48

Or, we can recover something that looks a bit more like the \( \lambda \)-expression by building it as a string, with the ++ operator, which concatenates strings in Haskell:

> churchMul churchTwo churchThree (\x -> "(f " ++ x ++ ")") "x"
"(f (f (f (f (f (f x))))))"
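Following the same pattern, we can define Church addition and successor (our own additions; they were not part of the set above):

```haskell
-- Church numerals and two more operators in the same style as churchMul
-- and churchExp. churchAdd applies f n times, then m more times;
-- churchSucc applies f one extra time.
churchTwo, churchThree :: (a -> a) -> a -> a
churchTwo   f x = f (f x)
churchThree f x = f (f (f x))

churchAdd m n = \f -> \x -> m f (n f x)
churchSucc n  = \f -> \x -> f (n f x)

main :: IO ()
main = do
  print (churchAdd churchTwo churchThree ((+) 1) (0 :: Int))  -- 5
  print (churchSucc churchThree ((+) 1) (0 :: Int))           -- 4
```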

Indeed, Haskell is based directly on the \( \lambda \)-calculus, with several of the additional constructs that we added in Modules 3 and 4. Haskell is typed, and as we saw in Module 4, typed languages cannot usually directly represent the Y combinator. Haskell is no exception:

> y = \f -> (\x -> f (x x)) (\x -> f (x x))
<interactive>:1:23: error:
    * Occurs check: cannot construct the infinite type: t0 ~ t0 -> t
        Expected type: t0 -> t
          Actual type: (t0 -> t) -> t
    * In the first argument of 'x', namely 'x'
      In the first argument of 'f', namely '(x x)'
      In the expression: f (x x)
    * Relevant bindings include
        x :: (t0 -> t) -> t (bound at <interactive>:1:13)
        f :: t -> t (bound at <interactive>:1:6)
        y :: (t -> t) -> t (bound at <interactive>:1:1)

Haskell messages are often arcane, but the critical part here is that it can’t find a way for the type t0 to be equivalent to the type t0 -> t. Recursion is possible because bindings work similarly to the let and let rec bindings we introduced in Module 4.2

Haskell also has a shorthand for multi-argument functions. For instance, \x y -> x + y is a function to add its two arguments (\( \eta \)-reducible to the (+) function). This form is always equivalent to nested functions, such as \x -> \y -> x + y, so serves the purpose only of abbreviation.
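A quick check of our own that the shorthand really is just an abbreviation for nesting:

```haskell
-- Multi-argument lambda shorthand versus explicitly nested abstractions:
-- these define the same function.
add3Short :: Int -> Int -> Int -> Int
add3Short = \x y z -> x + y + z

add3Nested :: Int -> Int -> Int -> Int
add3Nested = \x -> \y -> \z -> x + y + z

main :: IO ()
main = print (add3Short 1 2 3 == add3Nested 1 2 3)  -- True
```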


5. Conditionals and Predicates

Suppose we want to construct a function to compute the absolute value of a number x. The absolute value \( |x| \) of x is defined as:

\[ |x| = \begin{cases} x & \text{if } x \geq 0 \\ -x & \text{if } x < 0 \end{cases} \]

There is, of course, a built-in absolute value function, abs, which was demonstrated above; our implementation is unlikely to be as good as the built-in one, but will demonstrate conditions.

We can evaluate this using if conditionals. In languages in the C family, if is a statement, and the condition affects which statements will be executed; in functional languages, if is an expression, and evaluates to either its then or else branch. Because it’s an expression, and because Haskell is typed, we can’t just leave out the else branch: recall how in our if syntax in Module 4, we demanded that each branch have the same type. Well, the else branch can’t have the same type as the then branch if there is no else branch, so the else branch is mandatory. In languages in the C family, the ?: operator behaves the same, and requires an “else” branch for the same reason. Knowing this, let’s write our absolute value function:

> absButTerrible = \x -> if x >= 0 then x else -x
> absButTerrible 42
42
> absButTerrible (-42)
42

An if expression is written (if p then t else e). p is called the predicate, or simply condition, and determines which of t and e will be evaluated. t and e are called the then expression and else expression, respectively.

Aside: You may be wondering why we’ve always put negative numbers inside of parentheses. One unfortunate consequence of Haskell’s and OCaml’s parenthesis-light function call syntax is that the numeric - for negation is almost always ambiguous. absButTerrible -42 may look right to our eyes, but it’s actually trying to subtract 42 from the value absButTerrible. You can’t subtract a number from a function, so this doesn’t work.

Haskell — and most typed functional languages — is picky about the type of the predicate. In C, for instance, we’re allowed to use any number or pointer as the predicate, and it’s considered to be true if it’s not 0 or NULL. In Haskell, the predicate must be boolean:

> if 42 then 42 else -42
<interactive>:1:4: error:
    * Could not deduce (Num Bool) arising from the literal '42'
    [...]

What this error is saying is that it couldn’t find a way to interpret a Num (number) as a Bool (boolean).

Of course, since Haskell is based on the \( \lambda \)-calculus, we could also build our own booleans and conditions, in exactly the style that we built \( \lceil \text{true} \rceil \) and \( \lceil \text{false} \rceil \) in the \( \lambda \)-calculus:

> true     = \x -> \y -> x
> false    = \x -> \y -> y
> lambdaIf = \b -> \t -> \f -> b t f
> lambdaIf true  "true is true"  "true is false"
"true is true"
> lambdaIf false "false is true" "false is false"
"false is false"

This is quite pointless, since our \( \lambda \)-flavored true and false don’t behave like Haskell’s own True and False:

> if true then "true is true" else "true is false"
<interactive>:1:4: error:
    * Couldn't match expected type 'Bool'
      with actual type 'p10 -> p20 -> p10'

Indeed, precisely what we’re doing is passing the \( \lceil \text{true} \rceil \) we defined in Module 2 to the if we defined in Module 3, which expects its own primitive types.


6. Guards

There is an alternative to if for conditionalizing execution: guards. Guards are little more than an abbreviation for nested if expressions, but are common in functional languages, because nested ifs can become difficult to understand. First, let’s start by declaring our functions in a different way:

> add10 x = x + 10

Recall that previously we said this binding was actually asserting equivalence, rather than binding a variable. This style of binding should make that very clear: we are asserting that writing the expression add10 x is equivalent to writing the expression x + 10. As a practical matter, this declares a variable named add10 bound to a function, but mathematically, the declaration is a statement of equivalence.

Recall now our mathematical description of absolute value. The form to the right of = is a description of cases, and Haskell, with a slightly different syntax, allows the same. In a Haskell file, this can be written like so:

absButTerrible x | x >= 0 = x
                 | x <  0 = -x

This can be read as “absButTerrible x is defined conditionally like so: if x >= 0, then absButTerrible x is x; if x < 0, then absButTerrible x is -x”.
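Guards generalize naturally to more than two cases. Here is a three-way sign function of our own; otherwise is simply a standard synonym for True:

```haskell
-- A guarded definition with three cases. Each guard is tried top to
-- bottom, and the first true one selects the result.
sign :: Int -> Int
sign x | x > 0     = 1
       | x < 0     = -1
       | otherwise = 0

main :: IO ()
main = mapM_ (print . sign) [42, -42, 0]
```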

These guards can be written as nested ifs. That brings us to the concept of syntactic sugar.


7. Syntactic Sugar

Until this point, we haven’t needed to introduce any new semantics, because everything has followed either the semantics of the \( \lambda \)-calculus, or one of the additions we proposed in Module 3. But syntactically, the guards we’ve seen don’t really fit either.

Formal semantics are almost always defined for a reduced core language. Quite often, a formal semantics is trying to demonstrate only one language feature, in which case this core language may be sufficient. But, when a formal semantics is trying to accurately model a full language, rather than just a particular feature, it is associated with a translation — usually informal — from the full language into the core represented by the formal model. In this case, we can describe it using the double square brackets introduced in Module 2. We can thus describe patterns of the above sort as syntactic sugar like so:

Let the metavariables \( D \) range over valid function signatures (i.e., a name followed by formal arguments), \( C \) range over case lists, and \( E \) range over expressions. Then:

\[ \begin{aligned} \lbrack\!\lbrack D \; C \rbrack\!\rbrack &= D = \lbrack\!\lbrack C \rbrack\!\rbrack \\ \lbrack\!\lbrack | E_1 = E_2 \; C \rbrack\!\rbrack &= \text{if } E_1 \text{ then } E_2 \text{ else } \lbrack\!\lbrack C \rbrack\!\rbrack \\ \lbrack\!\lbrack \rbrack\!\rbrack &= \text{error "Non-exhaustive pattern"} \end{aligned} \]

We haven’t yet mentioned error, but it does what you probably suspect it does: raises an error if it’s evaluated. In this case, it’s there because case lists might not cover every possible case, and if they don’t, an error is raised at run-time.

The first line in this relation states that a function declaration formed from case statements can be written as a standard function declaration formed with =, by rewriting the case statements. The second line says that a given case \( E_1 = E_2 \) can be rewritten as an if-then-else, where the else expression is the remaining cases. The last line says that the empty case list (i.e., the end of the case list) is an error if it’s reached.

Using this definition, we would rewrite absButTerrible as follows:

absButTerrible x = if x >= 0 then x else if x < 0 then -x else error "Non-exhaustive pattern"

We’ll use this sort of semi-formal rewriting of complex patterns into simple ones frequently to make our formal semantics more compact; in this case, thanks to this rewrite, we’re yet to encounter anything that needs any extension to our formal semantics at all.
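We can confirm the rewrite mechanically (our own check): the guarded definition and its desugared form agree on every input we try.

```haskell
-- The guarded definition and its desugared if-then-else form, side by
-- side; `all` checks that they agree on some sample inputs.
absGuarded :: Int -> Int
absGuarded x | x >= 0 = x
             | x <  0 = -x

absDesugared :: Int -> Int
absDesugared x =
  if x >= 0 then x
  else if x < 0 then -x
  else error "Non-exhaustive pattern"

main :: IO ()
main = print (all (\x -> absGuarded x == absDesugared x) [-3 .. 3])  -- True
```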

Exercise 1. Note that we’ve rewritten the case patterns in terms of the abbreviated syntax described above (e.g. absButTerrible x = ... rather than absButTerrible = \x -> ...). Using a similar style, describe the translation of this syntax. If you’re already familiar with Haskell and know what’s coming in Section 12, then just define the simple case described here, not full pattern matching.


8. Name Binding and Environments

We’ve cavalierly used bindings above, without being very clear what they mean. Most languages have a global scope, in which all declarations are equal. In Scheme, for example, a variable defined with define is usable anywhere in the program, even in code written before the define itself, so long as the define is evaluated before the code using the definition is evaluated. However, there is also a way to bind names contextually.

An environment is essentially a lookup table, mapping identifier names to values. An environment also associates names with types, in which context it’s called a type environment, which we’ve already seen. The environment forms our abstraction of the computer’s memory, so that rather than dealing with pointers and addresses, we have names.

Environments can also nest, and nested environments don’t need to have the same definition of any given name. This nesting causes environments to form a tree, with the global scope’s environment — if any — as the root. The tree is a run-time phenomenon, as multiple calls to the same function create different environments, but its structure is described by the structure of the code: code in one environment can see variables declared in that environment, and in environments that surround it syntactically. We describe a variable in a given environment as scoped to that environment. This is called lexical scoping, and is how all modern languages, functional or otherwise, deal with name binding.

In Haskell, an environment is declared by a let...in expression, which declares a new binding, but only within the context of the expression after in; the new environment is a child of the surrounding environment. For instance, the following two functions both use a variable named x, but x is bound to different values in each context:

fun1 n =
  let x = "A" in n ++ "x is " ++ x

fun2 n =
  let x = "B" in n ++ "x is " ++ x

*Example> fun1 "In fun1, "
"In fun1, x is A"
*Example> fun2 "In fun2, "
"In fun2, x is B"
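Nested lets create nested environments, and an inner binding shadows an outer one only within its own in expression; a small example of our own:

```haskell
-- The inner x shadows the outer x only inside the inner let's `in`
-- expression; the outer binding is untouched afterwards.
main :: IO ()
main = do
  let x = "outer"
  putStrLn (let x = "inner" in "inside: " ++ x)  -- uses the inner x
  putStrLn ("outside: " ++ x)                    -- the outer x again
```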

Aside: Some people like to snidely deride certain modern languages, in particular JavaScript, as not having lexical scoping, but this is a misunderstanding of lexical scoping. Lexical scoping means that the environment in which variables are scoped can be discovered from only the syntax of the code, not that the environment in which a variable is scoped starts exactly from the text declaring the variable. In the case of JavaScript, variables declared with var are scoped to the environment of their surrounding function. Languages with non-lexical scoping include Common Lisp and the Unix shell.

With the exceptions of the let bindings above, all of the declarations we’ve made so far in this module have been in the global scope, which is a single, shared environment for all “global” variables. Because this environment is global and shared, recursion and even mutual recursion are easy to express:

badlyApproachZero x | x > 0 = badlyApproachZero (-x + 1)
                    | x < 0 = badlyApproachZeroPrime x
                    | x == 0 = x

badlyApproachZeroPrime x | x > 0 = badlyApproachZero x
                         | x < 0 = badlyApproachZeroPrime (-x - 1)
                         | x == 0 = x

In this example, badlyApproachZero and badlyApproachZeroPrime are mutually recursive, but can easily refer to each other since they reside in the same environment, the global scope.
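Mutual recursion through the shared global environment is the same mechanism behind the classic even/odd pair (our example):

```haskell
-- evenish and oddish refer to each other freely, because both names live
-- in the same global environment.
evenish, oddish :: Int -> Bool
evenish n | n == 0    = True
          | otherwise = oddish (n - 1)
oddish  n | n == 0    = False
          | otherwise = evenish (n - 1)

main :: IO ()
main = print (evenish 10, oddish 10)  -- (True,False)
```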

We already looked at formally expressing the semantics of environments, with let bindings in Module 4. But, let’s recall the semantic and type rules for one part of let bindings, to discuss them in the context of environments:

\[ \textbf{LetBody} \quad \dfrac{\sigma' = \sigma[x \mapsto V[z/x]] \quad z \text{ is a fresh variable} \quad \langle \sigma', M \rangle \to \langle \sigma', M' \rangle}{\langle \sigma, \text{let } x = V \text{ in } M \rangle \to \langle \sigma, \text{let } x = V \text{ in } M' \rangle} \]\[ \textbf{T_Let} \quad \dfrac{\Gamma \vdash V : \tau_1 \qquad \{\langle x, \tau_1 \rangle\} + \Gamma \vdash E : \tau_2}{\Gamma \vdash \text{let } x = V \text{ in } E : \tau_2} \]

Our store here, \( \sigma \), is part of the formal semantics encoding of our environment. The other part is \( \Gamma \), our type environment. \( \Gamma \) only exists in the type judgment, and \( \sigma \) only exists in the semantics, but they are each other’s dual. Note how the environments are used: every expression is paired with its environment, and evaluated or judged in that environment. Subexpressions in which new variables have been defined are evaluated or judged in a new context, \( \sigma' \) or \( \{\langle x, \tau_1 \rangle\} + \Gamma \), which only exists in that subexpression and its descendents. It is this association between the syntactic nesting of expressions and the nesting of environments that defines lexical scoping.

Aside: Not all stores form environments, or at least not all stores form lexically-scoped environments. “Store” is just a general term for a variable-value map. Don’t be surprised to see things called “store” that don’t behave exactly like the store we’re describing here. We’ll see a very different store in Module 7.

Now, let’s consider the case of the global scope. We cannot rewrite our badlyApproachZero example in terms of lets, or even in terms of let recs,3 because let rec as we described it only let us make one recursive function, not a pair of mutually recursive functions. So, how do we express a global scope in this way?

To be honest, the usual answer is “we don’t”, or we only do so informally. Formal semantics describe the reduction of an expression, and it’s taken as rote that we have some way of pre-populating \( \sigma \) with all of the variables defined globally, rather than the empty \( \sigma \) we started with while defining the let rec semantics in Module 4. But, let’s do it formally anyway.

In most statically-typed languages, even functional languages, there is actually a different syntax for the global scope than there is for expressions. In Haskell, you cannot simply write an expression in a file and expect it to work. In fact, in the global scope, all you can write is declarations. In a pure functional language, those declarations do not, in and of themselves, do anything, except for defining the global environment. You then need a starting expression to evaluate, which becomes the left part of the \( \to \) in the first step of evaluation. In Haskell, in an unusual nod to C, the starting place is main. For a Haskell file to be usable as a Haskell program, it must define a function main, and to run the Haskell program is to evaluate main.

So now, we have two steps: convert our top-level syntax into \( \sigma \), and then evaluate main in \( \sigma \). This conversion is the step that is rarely described formally, so there is no “usual” name for it; we will give it the simple name resolve. Thus, the starting point for a program \( P \) is the pair \( \sigma \) and \( E \) such that \( \sigma = \text{resolve}(P) \), \( E = \sigma(\text{main}) \; z \), and \( z \) is the argument to the main function (such as C’s argv and argc).

The syntax for programs is simply a list of declarations. Haskell is indentation-sensitive, which cannot be expressed in BNF because it does not form a context-free language. In order to make the language of our semantics clearer, and sidestep this problem, we will separate declarations by semicolons, and make a few similar syntactic concessions later.

⟨Program⟩  ::= ⟨DeclList⟩
⟨DeclList⟩ ::= ⟨Decl⟩ ; ⟨DeclList⟩
             | ε
⟨Decl⟩     ::= ⟨Var⟩ = ⟨Term⟩

Now, let’s define resolve in terms of a program \( P \), where \( x \), \( V \), and \( L \) are metavariables ranging over variables, terminal values, and declaration lists, respectively:

\[ \textbf{EmptyProgram} \quad \text{resolve}() = \text{empty} \]\[ \textbf{Declaration} \quad \dfrac{\sigma = \text{resolve}(L) \qquad \sigma' = \sigma[x \mapsto V]}{\text{resolve}(x = V \; ; \; L) = \sigma'} \]

An empty program of course defines no variables, so \( \text{resolve}() = \text{empty} \). A program declaring multiple variables can be separated into a single declaration (\( x = V \)) and “the rest” (\( L \)), and resolve of that program simply extends the resolution of “the rest” by adding \( x \). Technically, this means that our definition of resolve goes right-to-left. It is typically the job of type judgment to assure that a name is not multiply defined, rather than formal semantics, and if a name is guaranteed not to be multiply defined, then a \( \sigma \) built left-to-right is indistinguishable from a \( \sigma \) built right-to-left.
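The two resolve rules amount to a fold over the declaration list. Here is a rough executable sketch of our own, simplifying values to integers and representing \( \sigma \) as a Data.Map:

```haskell
import qualified Data.Map as Map

-- A store mapping names to (simplified) values.
type Store = Map.Map String Int

-- resolve() = empty; resolve(x = V; L) = resolve(L)[x -> V].
-- Because the insert happens after resolving the rest, an earlier
-- declaration overrides a later duplicate, mirroring sigma[x -> V].
resolve :: [(String, Int)] -> Store
resolve []           = Map.empty
resolve ((x, v) : l) = Map.insert x v (resolve l)

main :: IO ()
main = print (Map.lookup "main" (resolve [("main", 1), ("x", 2)]))
```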

A final wrinkle is that in most languages, these declarations may associate names with expressions, rather than just terminal values, and those expressions need to be evaluated. In Haskell, names are bound to expressions, but as we will see in Section 14, those expressions do not need to be evaluated. For the time being, we will assume they are just associated with values.


9. Algebraic Data Types

We imagined booleans as being built-in, primitive types in Haskell. This is true, but only because the built-in if-then-else syntax can only work if it knows what booleans are. Without that, we could write our own Haskell booleans like so:

data Bool = True | False

Remember that a type is just a set of values belonging to that type. Previously, we assumed that the actual definition of types and values were external to the language: we defined a language with numbers, and then said that “number” was a type; we defined a language with abstractions, and then said that “abstraction” was a type. Haskell’s approach to both types and values is different.

This declaration introduces a new type to the language, and introduces two new values to the language. True and False needn’t be keywords, built into the language; they are defined by this type declaration. This kind of type declaration defines what is called an algebraic data type, and is common in typed functional languages. This particular declaration enters two new names into our environment, True and False, but what is the value associated with them?

In fact, it’s not important to have an actual, concrete value. The values associated with True and False are called symbols, which are values invented solely to be unique: what is important is that we can, at run-time, know when we are referring to True, and know when we are referring to False, so all we need is some distinguishable value. You can imagine this value as being a number, if you’d like, with every value declared in this way having a unique number, but that would be a detail of the implementation, not the concept. As far as a user can tell, True is True and False is False, and that’s all that matters. In our formal semantics, we will represent these symbol values as one-tuples in which the only value is the name itself, in braces; that is, the value of True is \( \{True\} \). In an actual implementation, one could simply generate numbers to represent every symbol, or allocate a small space and use the pointer as the symbol value; in either case, that number or pointer is never actually exposed to the user.

Aside: Actually, there are as many different ways of expressing such values in formal semantics as there are formal semantics for languages with algebraic data types, so don’t necessarily expect to see symbols of this form everywhere.
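The claim that all that matters about a symbol is its distinguishability can be sketched in Haskell itself. The `Symbol` wrapper below is a hypothetical model for illustration, not how GHC actually represents constructors:

```haskell
-- A hypothetical model of symbol values: a symbol is nothing but its name,
-- and the only operation we ever need is a test for identity.
newtype Symbol = Symbol String deriving (Eq, Show)

true, false :: Symbol
true  = Symbol "True"
false = Symbol "False"

main :: IO ()
main = do
  print (true == true)   -- True: a symbol is itself...
  print (true == false)  -- False: ...and distinguishable from every other symbol
```

Whether the implementation uses strings, numbers, or pointers underneath is invisible; only the two comparisons above are observable.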

In Haskell, True and False are called data constructors. The reason for this will become clearer momentarily, when we parameterize them.

Let’s look at how we might introduce symbols to resolve:

⟨Decl⟩       ::= ··· | data ⟨Var⟩ = ⟨ValueList⟩
⟨ValueList⟩  ::= ⟨DataConstr⟩ ⟨ValueListRest⟩
⟨ValueListRest⟩ ::= "|" ⟨DataConstr⟩ ⟨ValueListRest⟩
                  | ε
⟨DataConstr⟩ ::= ⟨Var⟩
⟨Term⟩       ::= ··· | {⟨Var⟩}
\[ \textbf{EmptyDataDecl} \quad \text{resolve}(\text{data } n = \varepsilon \; ; \; L) = \text{resolve}(L) \]\[ \textbf{DataDecl} \quad \dfrac{\sigma = \text{resolve}(\text{data } n = M \; ; \; L) \qquad \sigma' = \sigma[N \mapsto \{N\}]}{\text{resolve}(\text{data } n = N \; M \; ; \; L) = \sigma'} \]

Note that in DataDecl, \( n \) names the type we are declaring, while \( N \) names the value we are declaring. Since the symbol value for \( N \) is \( \{N\} \), \( N \) maps to \( \{N\} \). Of course, it is also necessary to remember the name \( n \) during type checking, but we’re only looking at semantics here. Also, as it happens, Haskell requires that type and data constructor names are capitalized, and variable names are not; this is helpful for typing Haskell, but unnecessary for our purposes, so we will assume that all of these can have names of the same form.

The type declaration we just saw declares a finite type. The Bool type has exactly two values: True and False. Finite types aren’t especially useful, so let’s imagine we want a type for lists of booleans. Of course, there are infinitely many possible lists of booleans, so there must be an infinite number of values in this type. How can we express that with algebraic data types?

The answer is that data constructors can have parameters, and those parameters are other types. So, leaning on our classic definition of a list — a list is either empty, or a pair of a value and a list — we can build an algebraic data type which represents lists of booleans:

data BoolList = Empty | Pair Bool BoolList

The Empty constructor behaves exactly as True or False did: it is simply a symbol value, carrying no further information than the fact that Empty is Empty. On the other hand, Pair is not a BoolList in and of itself; instead, it is a two-argument function, where the first argument must be a boolean and the second argument must be another BoolList. The result of this function is a Pair value, associated with the two argument values. That is, Pair is a \( \text{Bool} \to \text{BoolList} \to \text{BoolList} \). All values of algebraic data types are effectively \( n \)-tuples, where \( n - 1 \) is the number of arguments to the data constructor, and the first element in the tuple is a symbol representing the constructor itself. Therefore, our declaration above gives us a way of making a pair of a boolean and a boolean list, and “tagging” that pair so that we can tell later that it was specifically one of our Pairs, and not some other pair.

With parameterized data constructors, we gain the ability to describe new, infinite data types. There are an infinite number of boolean lists.
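To see the tag doing its job, here is a sketch using the BoolList type just declared (with the boolean constructors renamed to avoid clashing with the Prelude) and a hypothetical `len` function. It distinguishes an Empty from a Pair by Haskell’s pattern-matching syntax, which we will treat more carefully later:

```haskell
-- Our own booleans and boolean lists; MyBool avoids clashing with Prelude's Bool
data MyBool = MyTrue | MyFalse

data BoolList = Empty | Pair MyBool BoolList

-- len inspects the constructor tag to tell an Empty from a Pair
len :: BoolList -> Int
len Empty         = 0
len (Pair _ rest) = 1 + len rest

main :: IO ()
main = print (len (Pair MyTrue (Pair MyFalse Empty)))  -- prints 2
```

The tag is what makes the two `len` equations possible: at run-time we can always tell which constructor built a given value.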

Now, let’s look at how we might introduce parameterized data constructors to resolve:

⟨DataConstr⟩ ::= ··· | ⟨Var⟩ ⟨Var⟩⁺
⟨Term⟩       ::= ··· | {⟨Var⟩ (, ⟨Term⟩)⁺}
\[ \textbf{ParamDataDecl} \quad \dfrac{\sigma = \text{resolve}(\text{data } n = M \; ; \; L) \qquad \sigma' = \sigma[N \mapsto \lambda x_1 \to \lambda x_2 \to \cdots \to \lambda x_m \to \{N, x_1, x_2, \ldots, x_m\}]}{\text{resolve}(\text{data } n = N \; T_1 \; T_2 \cdots T_m \; M \; ; \; L) = \sigma'} \]

Here, ParamDataDecl shows that an \( m \)-argument data constructor is stored in \( \sigma' \) as an \( m \)-argument \( \lambda \) function, curried of course, which resolves to the tuple described above. Note that this definition allows any value to occupy each of the slots of our algebraic data type; it is the role of type checking to assure that the values are actually of the types declared by \( T_1 \) through \( T_m \).
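ParamDataDecl’s reading of a constructor as a curried function returning a tagged tuple can be written down directly in Haskell. Here `Value`, `mkEmpty`, and `mkPair` are hypothetical names for illustration:

```haskell
-- A symbol value {N, x1, ..., xm}: the constructor's own name, tagged onto
-- its argument values.
data Value = Tagged String [Value] deriving (Eq, Show)

-- What resolve binds "Empty" and "Pair" to: Empty is a bare symbol, while
-- Pair is a curried two-argument function producing a tagged tuple
mkEmpty :: Value
mkEmpty = Tagged "Empty" []

mkPair :: Value -> Value -> Value
mkPair b l = Tagged "Pair" [b, l]

main :: IO ()
main = print (mkPair (Tagged "True" []) mkEmpty)
```

Note that, as in ParamDataDecl, nothing here stops us from putting any value in either slot; checking that the slots hold the declared types is the type checker’s job, not the semantics’.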

The boolean list containing True then False with this declaration is written Pair True (Pair False Empty). Since data constructors are functions, and functions are curried, we can extract partial applications:

PrependTrue  = Pair True
TrueFalseList = PrependTrue (Pair False Empty)

Finally, as Haskell’s pièce de résistance, algebraic type declarations themselves may be parameterized by type, allowing for this general definition for lists of any type:

data List a = Empty | Pair a (List a)

This only affects types, and not semantics, so we won’t show our resolve extended to make it work: Pair is the same two-argument function.

Aside: Haskell, of course, natively supports lists, so there’s no need to explicitly declare a list type like we just have. However, the native list syntax is nothing more than syntactic sugar for a sequence of pairs like this.


10. Parametric Polymorphism and System F

As yet, we’ve skirted around the issue of polymorphism, but remember that in Module 4, we declared polymorphism as necessary to express even Church numerals. Virtually all typed functional languages, certainly including Haskell, use parametric polymorphism. What makes parametric polymorphism so parametric? The fact that types may be parameterized, like functions.

This is formalized in a calculus called System F, also known as the second-order \( \lambda \)-calculus, which was invented independently by Girard and Reynolds. System F looks much like the simply-typed \( \lambda \)-calculus, but includes syntax for abstractions and applications over types:

⟨Expr⟩     ::= ⟨Var⟩ | ⟨Abs⟩ | ⟨App⟩ | ⟨t-Abs⟩ | ⟨t-App⟩
⟨Var⟩      ::= a | b | c | ···
⟨Abs⟩      ::= λ ⟨Var⟩ : ⟨Type⟩ . ⟨Expr⟩
⟨App⟩      ::= ⟨Expr⟩ ⟨Expr⟩
⟨t-Abs⟩    ::= Λ ⟨TyVar⟩ . ⟨Expr⟩
⟨t-App⟩    ::= ⟨Expr⟩ {⟨Type⟩}
⟨Type⟩     ::= ⟨PrimType⟩ | ⟨TyVar⟩ | ⟨Type⟩ → ⟨Type⟩ | ∀ ⟨TyVar⟩ . ⟨Type⟩
⟨PrimType⟩ ::= t₁ | t₂ | t₃ | ···
⟨TyVar⟩    ::= α | β | γ | ···

\( \Lambda \) is a Greek capital lambda. In System F, \( \Lambda \) behaves like a second level of \( \lambda \), over types, hence the name second-order \( \lambda \)-calculus. To distinguish applications in the first language from applications in the second, the second-order language uses braces (\( \{\} \)) for application, but furthermore, only allows applications of explicitly written types. As a consequence, the second-order language is not Turing-complete, since it is impossible to represent recursion. This is actually a good thing, as it means that the halting problem does not haunt our types, and we know that we can always evaluate the second-order component of System F. Otherwise, the second-order language is just \( \lambda \)-calculus, and uses the same reduction rules.

In order to have a language to express the types described by \( \Lambda \), System F additionally expands \( \langle \text{Type} \rangle \) to include the universal quantifier. For instance, it is now possible for an expression to have the type \( \forall \alpha . \alpha \to \alpha \), which is a function which accepts any type, and returns a value of the same type. Note that \( \alpha \) is not a type itself, but a type variable: any particular resolution of this function will have a concrete type, but it may be a different concrete type in different contexts.

In the previous module, we explored the difficulty in expressing Church numerals in the simply-typed \( \lambda \)-calculus. We attempted to give Church numerals the type \( (t_1 \to t_1) \to t_1 \to t_1 \), but found that our definition of \( \hat{\cdot} \) didn’t work, because it tried to make \( t_1 \to t_1 \) equal to \( (t_1 \to t_1) \to t_1 \to t_1 \). We informally fixed this by making the type \( (\tau_1 \to \tau_1) \to \tau_1 \to \tau_1 \), and allowing different Church numerals to have different \( \tau \)s. Now that we can explicitly make type parameters, rather than informally talking about \( \tau \)s, how would we describe a Church numeral?

A Church numeral needed only one core type, \( t_1 \), so that is the type we must parameterize. We will simply wrap the definition of a Church numeral, in this case two, in a second-order \( \Lambda \) which makes that a parameter:

\[ \lceil 2 \rceil = \Lambda\alpha.\, \lambda f : \alpha \to \alpha.\, \lambda x : \alpha.\, f\,(f\,x) \]

We can recover our original Church numerals over \( t_1 \) by applying \( t_1 \) as our type parameter:

\[ (\Lambda\alpha.\, \lambda f : \alpha \to \alpha.\, \lambda x : \alpha.\, f\,(f\,x))\{t_1\} \;\to\; \lambda f : t_1 \to t_1.\, \lambda x : t_1.\, f\,(f\,x) \]

And now, we can make a version of \( \hat{\cdot} \) that is itself parameterized by type:

\[ \lceil \hat{\cdot} \rceil = \Lambda\alpha.\, \lambda m : (\alpha \to \alpha) \to \alpha \to \alpha.\, \lambda n : ((\alpha \to \alpha) \to \alpha \to \alpha) \to (\alpha \to \alpha) \to \alpha \to \alpha.\, \lambda f : \alpha \to \alpha.\, \lambda x : \alpha.\, n\,m\,f\,x \]

The types in this expression are quite intimidating, so let’s break them down. A Church numeral is a function which takes two arguments. The first is a function over the second, which will be applied as many times as the value of the Church numeral. A two-argument function is a function returning a function, so it’s some \( \tau_1 \to \tau_2 \to \tau_3 \). In this case, \( \tau_1 \) has to be a function over \( \tau_2 \), so the Church numeral is \( (\tau_2 \to \tau_2) \to \tau_2 \to \tau_3 \). And finally, since the Church numeral returns the result of the first argument, \( \tau_3 \) must equal \( \tau_2 \), so the Church numeral is \( (\tau_2 \to \tau_2) \to \tau_2 \to \tau_2 \). We thus need only one type argument, \( \tau_2 \), which we will name \( \alpha \). That explains the type of \( m \).

How about the odd type of \( n \)? In \( \lceil \hat{\cdot} \rceil \), \( n \) takes \( m \) as an argument, but \( n \) must also be a Church numeral. So, \( n \) must be some \( (\tau_4 \to \tau_4) \to \tau_4 \to \tau_4 \), but it also must accept an argument of \( m \)’s type, which is \( (\alpha \to \alpha) \to \alpha \to \alpha \). That is, \( \tau_4 \to \tau_4 = (\alpha \to \alpha) \to \alpha \to \alpha \). We can add some parentheses to the type of \( m \) to get \( (\alpha \to \alpha) \to (\alpha \to \alpha) \), at which point the relationship between \( \alpha \) and \( \tau_4 \) is clear: \( \tau_4 = \alpha \to \alpha \). So, the type of \( n \) is the rather long-winded \( ((\alpha \to \alpha) \to \alpha \to \alpha) \to (\alpha \to \alpha) \to \alpha \to \alpha \).

To actually use the System F variety of Church numerals, we need to fill in all the types in a compatible way, e.g.:

\[ \lceil \hat{\cdot} \rceil \{t_1\}\,(\lceil 2 \rceil \{t_1\})\,(\lceil 2 \rceil \{t_1 \to t_1\}) \]

The power of System F’s second-order language is that we can instead make the above expression itself polymorphic, by replacing \( t_1 \) with some parameterized type:

\[ \Lambda\gamma.\, \lceil \hat{\cdot} \rceil \{\gamma\}\,(\lceil 2 \rceil \{\gamma\})\,(\lceil 2 \rceil \{\gamma \to \gamma\}) \]
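Haskell can express these System F numerals almost verbatim: with the RankNTypes extension, `forall` plays the role of \( \forall \), and GHC inserts the type applications implicitly. A sketch, in which the names `Church`, `two`, `hat`, and `toInt` are ours:

```haskell
{-# LANGUAGE RankNTypes #-}

-- The System F type forall a. (a -> a) -> a -> a, written with Haskell's forall
type Church = forall a. (a -> a) -> a -> a

two :: Church
two f x = f (f x)

-- hat m n = n m, mirroring the body n m f x above; GHC silently instantiates
-- n at the type a -> a, playing the role of System F's explicit
-- {gamma -> gamma} type application
hat :: Church -> Church -> Church
hat m n = n m

-- Read a Church numeral back as an Int by counting applications of (+1)
toInt :: Church -> Int
toInt c = c (+ 1) 0

main :: IO ()
main = print (toInt (hat two two))  -- prints 4
```

Because `Church` is universally quantified, `hat two two` is itself a polymorphic numeral, just like the \( \Lambda\gamma \) expression above.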

A second, less obvious power of System F is that the second-order language of types is erasable. Because type-level substitutions only happen in the type language, and normal \( \lambda \) substitutions only happen in the \( \lambda \)-calculus portion, the actual behavior of any System F expression is dictated by only its \( \lambda \)-calculus component. You can remove all \( \Lambda \)-abstractions and \( \Lambda \)-applications without affecting the meaning of any program. Thus, the only purpose of the type language is for type checking; the concerns of parametric polymorphic types and actual behavior are separated. More precisely:

Definition 2. (Erasure) Given a term \( E \) in System F, we define the erasure of \( E \), denoted \( \text{erase}(E) \), as follows:

\[ \begin{aligned} \text{erase}(x) &= x \\ \text{erase}(\lambda x : \tau . E) &= \lambda x.\, \text{erase}(E) \\ \text{erase}(M\,N) &= \text{erase}(M)\,\text{erase}(N) \\ \text{erase}(\Lambda\alpha . E) &= \text{erase}(E) \\ \text{erase}(E\{\tau\}) &= \text{erase}(E) \end{aligned} \]

Theorem 1. (Erasability) If \( E \to E' \), then either \( \text{erase}(E) = \text{erase}(E') \), or \( \text{erase}(E) \to_\beta \text{erase}(E') \), up to \( \alpha \)-renaming.
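Definition 2 transcribes almost mechanically into Haskell over a minimal pair of term datatypes; all the constructor names here are our own invention:

```haskell
-- Minimal datatypes for System F types and terms, and for untyped lambda-terms
data Ty = TPrim String | TyVar String | Arrow Ty Ty | Forall String Ty
  deriving (Eq, Show)

data TermF
  = FVar String
  | FAbs String Ty TermF   -- \x : tau . E
  | FApp TermF TermF
  | TyAbs String TermF     -- /\alpha . E
  | TyApp TermF Ty         -- E {tau}
  deriving (Eq, Show)

data TermL = LVar String | LAbs String TermL | LApp TermL TermL
  deriving (Eq, Show)

-- erase drops type annotations, /\-abstractions, and type applications
erase :: TermF -> TermL
erase (FVar x)     = LVar x
erase (FAbs x _ e) = LAbs x (erase e)
erase (FApp m n)   = LApp (erase m) (erase n)
erase (TyAbs _ e)  = erase e
erase (TyApp e _)  = erase e

main :: IO ()
main = print (erase (TyAbs "a" (FAbs "x" (TyVar "a") (FVar "x"))))
```

For instance, the polymorphic identity \( \Lambda\alpha.\lambda x:\alpha.x \) erases to the plain untyped identity \( \lambda x.x \).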

The type rules for System F consist of the rules for the simply-typed \( \lambda \)-calculus, augmented with two new rules to treat type abstractions and type applications. The rule for type abstractions is as follows:

\[ \textbf{T_TypeAbs} \quad \dfrac{\Gamma \vdash E : \tau \qquad \forall x.\, \langle x, \alpha \rangle \notin \Gamma}{\Gamma \vdash \Lambda\alpha . E : \forall\alpha.\tau} \]

There are several things to note about this rule:

  • All \( \Lambda \)-abstractions are of universal types.
  • \( \tau \) may contain \( \alpha \), and indeed, the type of \( \Lambda\alpha.E \) could be as simple as \( \forall\alpha.\alpha \). In this example, \( E \) is directly of the parameterized type.
  • \( \alpha \) doesn’t have a type, it is a type, so nothing is added to \( \Gamma \). \( \Gamma \) maps variables to types; it does not map type variables to anything.
  • The second premise demands that nothing in the environment has type \( \alpha \) already. In particular, this requirement prevents us from constructing ill-formed terms like \( \Lambda\alpha.\lambda x : \alpha.\Lambda\alpha.x \), in which the inner “\( \Lambda\alpha \)” attempts to capture the type of \( x \) (i.e., the outer \( \alpha \)) and thereby divorce \( x \)’s type from that of \( \alpha \)’s binding occurrence.

For type language applications, the rule is as follows:

\[ \textbf{T_TypeApp} \quad \dfrac{\Gamma \vdash E : \forall\alpha.\tau_1}{\Gamma \vdash E\{\tau_2\} : \tau_1[\tau_2/\alpha]} \]

The expression \( \tau_1[\tau_2/\alpha] \) is a type substitution, and behaves precisely as substitution does over the \( \lambda \)-calculus. In this case, it denotes the replacement of all (free) occurrences of \( \alpha \) in \( \tau_1 \) with \( \tau_2 \). Its formal definition is analogous to that of substitution for \( \lambda \)-terms. In particular, note that type substitutions are capture-avoiding replacements, in which \( \alpha \)-conversions of type variables are performed as needed to prevent capture.

In fact, the type rule for type application is most similar to the semantic rule for \( \lambda \)-applications. But, because recursion is syntactically disallowed in the type language of System F, type judgment is guaranteed to terminate.

Exercise 2. Using the type rules for System F, derive the type for \( \lceil \hat{\cdot} \rceil \).

In spite of the fact that System F’s type language is erasable, it nonetheless guarantees type soundness, i.e. progress and preservation, which we won’t prove here. It is also more powerful than the simply-typed \( \lambda \)-calculus, as demonstrated by the fact that it can successfully represent \( \lceil \hat{\cdot} \rceil \). However, System F still cannot represent the Y combinator, is still strongly normalizing, and is still not Turing-complete.

Typed functional languages, including Haskell, base their type language on System F, and this is how parameterized types, like List a we defined above, are possible. While Pair is a data constructor, and may be used only in the first-level language, List is a type constructor — i.e., a \( \Lambda \)-abstraction — and may be used only in the type language.

Having to explicitly write type constructors, type applications, and types more generally, is extremely tedious, so most typed functional languages, including Haskell, use some form of type inference.


11. Type Inference

Given that the type language of System F is erasable, one might reasonably ask if this erasure can be undone. That is, is it possible to write untyped \( \lambda \)-calculus, infer what its types should be, and thus gain type judgment without having to write types? The answer is yes! … sort of. Type inference is the reverse of type erasure:

Definition 3. (Polymorphic Type Inference for System F) For a term \( E \) of the untyped \( \lambda \)-calculus, the polymorphic type inference problem (in the context of System F) is to find a well-typed term \( E' \) in System F, such that \( \text{erase}(E') = E \).

As it turns out, in System F, type inference is undecidable. But, we can infer types if we restrict ourselves to inferring a slight restriction of all of the types that System F can represent. Specifically, we will restrict ourselves to inferring type expressions in which the universal quantifier may only be applied to the entire type expression, and not to some subexpression. That is, \( \forall\alpha.\alpha \to \alpha \) is allowed, but \( t_1 \to \forall\alpha.\alpha \) is not, and neither is \( (\forall\alpha.\alpha) \to t_1 \). The consequence of this restriction is that we may make functions themselves polymorphic, but a function cannot take an unresolved polymorphic function as its argument, nor can it return an unresolved polymorphic function. This restricts the set of expressions for which we can infer types, but, as it happens, still allows virtually all useful functions.

Let’s consider what it means to infer types. We are operating over the untyped \( \lambda \)-calculus, which of course has no type judgment. But, we would like to find a well-typed expression in System F. So, let’s start by applying erasure to all subexpressions of our typing rules, and see where that leads us. We’ll start with T_TypeAbs and T_TypeApp:

\[ \text{erase}(\textbf{T_TypeAbs}) \;\to\; \textbf{T_Gen} \quad \dfrac{\Gamma \vdash E : \tau \qquad \forall x.\, \langle x, \alpha \rangle \notin \Gamma}{\Gamma \vdash E : \forall\alpha.\tau} \]\[ \text{erase}(\textbf{T_TypeApp}) \;\to\; \textbf{T_Spec} \quad \dfrac{\Gamma \vdash E : \forall\alpha.\tau_1}{\Gamma \vdash E : \tau_1[\tau_2/\alpha]} \]

Note that once we erase the type language portions of the expressions, the expressions in the premises and conclusions of these two rules become the same! The T_Spec rule, called specialization, says that if \( E \) is of a polymorphic type \( \forall\alpha.\tau_1 \), then we can substitute some \( \tau_2 \) for \( \alpha \). With the goal of undoing erasure, this is certainly true: we could rewrite \( E \) as \( E\{\tau_2\} \), which erases to \( E \). Note that T_Spec doesn’t actually specify \( \tau_2 \) — it doesn’t appear in the premises — and so this judgment is true for any \( \tau_2 \). Of course, \( E \) would never be of a polymorphic type, except that we’ve also created T_Gen.

The T_Gen rule, called generalization, says that if \( E \) is of type \( \tau \), then we can instead view it as a polymorphic type \( \forall\alpha.\tau \). Again, with the goal of undoing erasure, this is certainly true: we could rewrite \( E \) as \( \Lambda\alpha.E \), which erases to \( E \).

These rules allow us to generalize or specialize types, but we still need a way of getting types in the first place. If we apply erasure to the T_Abstraction rule from Module 4, the mystery is solved:

\[ \text{erase}(\textbf{T_Abstraction}) \;\to\; \textbf{T_Ascr} \quad \dfrac{\{\langle x, \tau_1 \rangle\} + \Gamma \vdash E : \tau_2}{\Gamma \vdash (\lambda x.\, E) : \tau_1 \to \tau_2} \]

The T_Ascr rule, type ascription, allows us to “invent” a type for the variable of an abstraction; note how \( \tau_1 \) is not actually restricted, except insofar as \( E \)’s types work with \( x \) assigned the type \( \tau_1 \). Again, we could rewrite \( \lambda x.\, E \) as \( \lambda x : \tau_1.\, E \), which erases to \( \lambda x.\, E \).

Erased versions of the rules for T_Variable and T_Application are uninteresting, as they had no explicit types to erase anyway. We will use them verbatim.

What we have just derived are the type rules for the polymorphic \( \lambda \)-calculus. Syntactically, the polymorphic \( \lambda \)-calculus is the untyped \( \lambda \)-calculus. It is distinguished only in having types, even if they’re not written. Thus, the polymorphic \( \lambda \)-calculus is an example of a statically-typed language in which types are not written.

Our new set of type rules is odd. Our first observation is that the type rules for the simply-typed \( \lambda \)-calculus possess a property that our new set of type rules lacks. This property is known as syntax-directedness. A formal semantic system is syntax-directed if it has the following two properties:

  • There is no more than one rule in the system for each expression in the language.
  • The semantics of an expression is defined completely in terms of the semantics of its immediate subexpressions.

Because the type rules for the simply-typed \( \lambda \)-calculus are syntax-directed, they possess a useful second property: they’re deterministic. That is, for a given expression, the type rules can only be applied in one way. Based on the syntactic structure of an expression \( E \), we can determine the unique rule that matches \( E \)’s structure. We apply this rule and then recurse on the subexpressions. Thus, type derivation in the simply-typed \( \lambda \)-calculus is a completely mechanical process.

The type rules for the polymorphic \( \lambda \)-calculus, on the other hand, are not syntax-directed. For a given expression \( E \), there can be as many as three rules that we can apply: the rule based on the syntactic structure of \( E \), as well as the rules for specialization and generalization. Worse yet, both of those rules, as well as the rule for type ascription, can be applied in infinitely many different ways, for every possible type. Because these type rules are not syntax-directed, using them to derive types is complicated. But, because the types have not been written into the underlying syntax, to derive types with these rules is to infer types. However, there is a well-known algorithm, described and proved by Milner and Damas, that can perform type inference in the polymorphic \( \lambda \)-calculus, known as Hindley-Milner type inference. We now focus our attention on their algorithm.

11.1 Unification

Consider how we informally resolved the types for Church numerals, and more specifically, for \( \lceil \hat{\cdot} \rceil \). We derived that \( m \) and \( n \) must be of some types \( (\tau_2 \to \tau_2) \to \tau_2 \to \tau_2 \) and \( (\tau_3 \to \tau_3) \to \tau_3 \to \tau_3 \), respectively. And, since \( \lceil \hat{\cdot} \rceil \) includes the expression \( n\,m \), \( m \)’s type must be \( n \)’s argument type; i.e., \( \tau_3 \to \tau_3 = (\tau_2 \to \tau_2) \to \tau_2 \to \tau_2 \). This relationship is true if \( \tau_3 = \tau_2 \to \tau_2 \). This process of discovering the relationship between abstract types is called unification, and is the heart of type inference.

Because the type rules for the polymorphic \( \lambda \)-calculus are not syntax-directed, rather than build the type of an expression directly from the types of its immediate subexpressions, Milner’s algorithm uses type variables to accumulate incomplete type information about the immediate subexpressions. The algorithm treats the incomplete type information gathered from the subexpressions as a set of equations to be solved. It then solves the equations, thus obtaining a type for the original expression.

The first step to understanding Milner’s algorithm is to consider the problem of solving type equations. To solve type equations is to relate types, as we did with Church numerals above, and the algorithm to do so is known as Robinson’s Unification Algorithm.

Our adaptation of Robinson’s unification algorithm works on type expressions with unquantified type variables. Solutions of type equations are expressed in terms of substitutions in type expressions. We first mentioned type substitutions in our description of System F, in the type rule for specialization. Because we consider only unquantified type expressions here, the expression \( \tau_1[\tau_2/\alpha] \) simply denotes the type \( \tau_1 \) with all instances of \( \alpha \) replaced by \( \tau_2 \); the problem of capturing in substitution that we had in Module 2 does not occur. A substitution may have any number of steps \( [\tau_n/\alpha_m] \), including zero, so we represent an empty substitution as \( [] \).

The solution of the type equation \( \tau_1 = \tau_2 \) is called a unifier:

Definition 4. (Unifier) Given types \( \tau_1 \) and \( \tau_2 \), a unifier of \( \tau_1 \) and \( \tau_2 \) is a substitution \( S \) such that \( \tau_1 S = \tau_2 S \).

Consider, for example, the types \( \tau_1 = \alpha \to (\beta \to \alpha) \) and \( \tau_2 = \alpha \to (\text{nat} \to \alpha) \), assuming we have the “nat” primitive type. By setting both \( \alpha \) and \( \beta \) to “nat”, we specialize both \( \tau_1 \) and \( \tau_2 \) to \( \text{nat} \to (\text{nat} \to \text{nat}) \). Thus, the substitution \( [\text{nat}/\alpha][\text{nat}/\beta] \) is a unifier of \( \tau_1 \) and \( \tau_2 \). On the other hand, if we only set \( \beta \) to nat, then \( \tau_1 \) and \( \tau_2 \) both specialize to \( \alpha \to (\text{nat} \to \alpha) \). Thus, \( [\text{nat}/\beta] \) is also a unifier of \( \tau_1 \) and \( \tau_2 \).

It is likely that the latter unifier is more useful, as it leaves the type more open to further unification. In general, we prefer unifiers that constrain the specialization of type variables as little as possible, and in particular, we seek a Most General Unifier:

Definition 5. (Most General Unifier) Let \( \tau_1 \) and \( \tau_2 \) be type expressions. A substitution \( S \) is a Most General Unifier (MGU) of \( \tau_1 \) and \( \tau_2 \) if \( S \) is a unifier of \( \tau_1 \) and \( \tau_2 \), and for every unifier \( S' \) of \( \tau_1 \) and \( \tau_2 \), there exists a substitution \( S'' \) such that \( S' = S'' \circ S \).

It is a fact that if \( \tau_1 \) and \( \tau_2 \) can be unified (that is, if they possess a unifier), then they possess an MGU. The MGUs of a given pair of expressions are equivalent to one another, up to renaming of variables. Robinson’s unification algorithm takes two type expressions as parameters, and returns an MGU for them, if one exists.

Before we look at the algorithm, let’s take a moment to recall the space we’re operating in. We have several kinds of types: primitive types, arrow types, universal types, polymorphic types, and abstract types.

  • Primitive types are types such as \( t_1 \) and “nat”. Unification cannot change primitive types, because the code will not function with a different type.
  • Arrow types are constructed types with a \( \to \), i.e., function types. Unification cannot substitute an arrow type for a non-arrow type — functions will still be functions! — but may be able to substitute one or both sides of the arrow.
  • Universal types are constructed types with a \( \forall \), i.e., universally quantified types. If an expression has type \( \forall\alpha.\alpha \to \alpha \), that means that the expression is a function which takes a value of any type and returns a value of the same type. Unification won’t actually interact with universal types; this is the restriction we made at the beginning of this section. Instead, after unification, we will generalize using T_Gen, which produces universal types.
  • Polymorphic types — i.e., type variables — are the parameter of type abstractions, and the operand of universal types, such as \( \alpha \). All polymorphic types are defined by some surrounding universal quantifier such as \( \forall\alpha.\alpha \to \alpha \). Our goal in unification is to discover specializations necessary to unify polymorphic types with any of the above types.
  • Abstract types are types such as \( \tau_1 \), and they do not actually exist in the language. Any given abstract type is some real type in the unification algorithm, and we write the unification algorithm with an abstract type to indicate that a step works (or fails) with any type.

Definition 6. (Unification) The unification algorithm, \( U \), takes as input type expressions \( \tau_1 \) and \( \tau_2 \), and returns a most general unifier for them, if one exists, as follows:

\[ \begin{aligned} &\textbf{U_PrimSelf:} &U(t, t) &= [] \\[4pt] &\textbf{U_PrimErr:} &U(t_1, t_2) &= \text{error}(t_1 \neq t_2) \\[4pt] &\textbf{U_PrimArrow:} &U(t, \tau_1 \to \tau_2) &= \text{error} \\[4pt] &\textbf{U_ArrowArrow:} &U(\tau_1 \to \tau_2,\, \tau_3 \to \tau_4) &= S_1 S_2 \quad \text{where } S_1 = U(\tau_1, \tau_3),\; S_2 = U(\tau_2 S_1, \tau_4 S_1) \\[4pt] &\textbf{U_Spec:} &U(\alpha, \tau) &= \begin{cases} \text{error} & \text{if } \tau \text{ is a constructed type and } \alpha \text{ appears in } \tau \\ [\tau/\alpha] & \text{otherwise} \end{cases} \\[4pt] &\textbf{U_Comm:} &U(\tau_1, \tau_2) &= U(\tau_2, \tau_1) \end{aligned} \]

U_PrimSelf: Unifying a primitive type with itself produces a trivial substitution. Naturally, as \( t = t \) in the first place, no substitution is needed to relate the two.

U_PrimErr: A primitive type cannot be unified with a different primitive type (for example, integers and strings cannot be unified).

U_PrimArrow: A primitive type cannot be unified with a function type.

U_ArrowArrow: To unify two function types, we first unify their parameter types, obtaining a substitution \( S_1 \). We then apply \( S_1 \) to the result types of the two expressions, and then unify the result types, obtaining a substitution \( S_2 \). The result is the composition of \( S_1 \) and \( S_2 \). Note that the application of \( S_1 \) to \( \tau_2 \) and \( \tau_4 \) is necessary to ensure that any type information obtained during unification of \( \tau_1 \) and \( \tau_3 \) is enforced during unification of \( \tau_2 \) and \( \tau_4 \). This process could equivalently be done in the opposite order; it is unnecessary to unify the parameter types before the return types, only to assure that the substitution from one is enforced in the other.

U_Spec: To unify a type variable \( \alpha \) with any type expression \( \tau \), we simply return the substitution that replaces \( \alpha \) with \( \tau \), with one exception: if \( \tau \) contains \( \alpha \), but is not equal to \( \alpha \), then the unification is circular and must fail (for example, the type equation \( \alpha = \alpha \to \alpha \) is circular and has no solution). This safeguard against circularity is known as the occurs-check. Note that U_Spec also implies that \( U(\alpha, \alpha) \), i.e., the unification of a type variable with itself, is \( [\alpha/\alpha] \). This substitution is of course pointless, so practical implementations will instead produce \( [] \).

U_Comm: \( U \) is commutative, so, for instance, \( U(\tau_1 \to \tau_2, t) \) can be reversed to fit U_PrimArrow. As an algorithm, it’s important to check each case before swapping the arguments, to avoid an infinite recursion.
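As a sketch of how \( U \) might be implemented, here is a minimal unifier for a type language with one primitive type, type variables, and arrow types. The representation (the names Type, Subst, and the use of Nothing for error) is invented for illustration and is not part of the course's formal notation:

```haskell
-- A minimal sketch of unification, assuming one primitive type (TInt),
-- string-named type variables, and arrow types.
data Type = TInt                -- a primitive type
          | TVar String         -- a type variable (alpha, beta, ...)
          | TArrow Type Type    -- tau1 -> tau2
          deriving (Eq, Show)

-- A substitution is an association list from variable names to types.
type Subst = [(String, Type)]

-- Apply a substitution to a type expression.
apply :: Subst -> Type -> Type
apply s (TVar a)       = maybe (TVar a) id (lookup a s)
apply s (TArrow t1 t2) = TArrow (apply s t1) (apply s t2)
apply _ t              = t

-- compose s1 s2 behaves as "apply s1, then s2".
compose :: Subst -> Subst -> Subst
compose s1 s2 = [ (a, apply s2 t) | (a, t) <- s1 ] ++ s2

-- The occurs-check of U_Spec: does alpha appear in tau?
occurs :: String -> Type -> Bool
occurs a (TVar b)       = a == b
occurs a (TArrow t1 t2) = occurs a t1 || occurs a t2
occurs _ _              = False

-- U itself; Nothing plays the role of "error".
unify :: Type -> Type -> Maybe Subst
unify (TVar a) t
  | t == TVar a = Just []                          -- trivial case of U_Spec
  | occurs a t  = Nothing                          -- occurs-check failure
  | otherwise   = Just [(a, t)]                    -- U_Spec: [tau/alpha]
unify t v@(TVar _) = unify v t                     -- U_Comm
unify TInt TInt    = Just []                       -- U_PrimSelf
unify (TArrow t1 t2) (TArrow t3 t4) = do           -- U_ArrowArrow
  s1 <- unify t1 t3
  s2 <- unify (apply s1 t2) (apply s1 t4)
  Just (compose s1 s2)                             -- the composition S1 S2
unify _ _ = Nothing                                -- U_PrimErr, U_PrimArrow
```

For example, unifying \( \alpha \to t \) with \( t \to \beta \) yields the substitution replacing both \( \alpha \) and \( \beta \) with \( t \), while unifying \( \alpha \) with \( \alpha \to \alpha \) fails the occurs-check.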

11.2 Algorithm W

With unification, we may finally describe the polymorphic type inference algorithm of Milner and Damas itself: Algorithm W.

Algorithm W, or simply W, forms the basis for the type system of Haskell and many other typed functional languages, including OCaml. W takes as input a type environment and an expression, and returns a substitution and the type of the expression in the context of that substitution. Note that this substitution applies to the type environment; substitution over a type environment is simply substitution over each of the types in the type environment.

Definition 7. (Algorithm W) Algorithm W takes as input a type environment \( \Gamma \) and an expression \( E \), and returns a pair of a substitution and a type expression.

\[ \textbf{W_Var} \quad W(\Gamma,\, x) = \langle [],\, \tau \rangle \quad \text{where } \tau = \Gamma(x) \]\[ \textbf{W_Abs} \quad W(\Gamma,\, (\lambda x.\, E)) = \langle S,\, (\alpha S) \to \tau \rangle \quad \text{where } \alpha \text{ is a new type variable},\; \langle S, \tau \rangle = W(\{\langle x, \alpha \rangle\} + \Gamma,\, E) \]\[ \textbf{W_App} \quad W(\Gamma,\, (E_1\, E_2)) = \langle S_1 S_2 S_3,\, \alpha S_3 \rangle \quad \text{where } \langle S_1, \tau_1 \rangle = W(\Gamma,\, E_1),\; \langle S_2, \tau_2 \rangle = W(\Gamma S_1,\, E_2),\; \alpha \text{ is a new type variable},\; S_3 = U(\tau_1 S_2,\, \tau_2 \to \alpha) \]

Once W has terminated, any type variables introduced by W that remain in the type returned by the algorithm are assumed to be universally quantified by T_Gen.

We see that W works by decomposing the given expression according to its structure, typing the constituent parts, and then using unification and composition to construct a type for the original expression. W is syntax-directed: there is exactly one case for each production of abstract syntax in the polymorphic \( \lambda \)-calculus, and the type of each expression is built directly from the types of its immediate subexpressions. We can therefore examine the operation of W case-by-case.

W_Var: Finding the type of a variable is trivial; we simply look it up in the environment \( \Gamma \). As we did not specialize any type variables, we return the empty substitution.

W_Abs: To find the type of an abstraction, we invent a new type variable \( \alpha \) through T_Ascr, and bind the parameter variable \( x \) to \( \alpha \) in \( \Gamma \) while typing the body \( E \) of the abstraction. During the typing of \( E \), the variable \( \alpha \) may be specialized to a more specific type. If it is, the specialization will form part of the substitution \( S \) that is returned, replacing the invented \( \alpha \). Thus, we complete the typing of \( \lambda x.\, E \) by applying \( S \) to \( \alpha \) and using the resulting type as the parameter component of the type we return. The result component of the type is \( \tau \), the type returned by the typing of \( E \). We also pass up the substitution \( S \) generated by the recursive invocation.

W_App: To find the type of an application, we first type the rator \( E_1 \), obtaining a type \( \tau_1 \) and a substitution \( S_1 \). We then type the rand \( E_2 \) using the updated environment \( \Gamma S_1 \) (to ensure that any type information gathered in the typing of \( E_1 \) is enforced during typing of \( E_2 \)), obtaining the type \( \tau_2 \) and the substitution \( S_2 \). Because we are typing an application, \( E_1 \) must be a function type, the parameter component of which must match the type of \( E_2 \). Thus, we now unify \( \tau_1 S_2 \) with \( \tau_2 \to \alpha \), where \( \alpha \) is a new type variable (we apply \( S_2 \) to \( \tau_1 \) to ensure that both expressions being unified reflect the information obtained during both recursive applications of W). The type we return is \( \alpha \), updated by application of \( S_3 \) to reflect the type constraints enforced by the unification. The substitution we return is the composition of \( S_1 \), \( S_2 \), and \( S_3 \). Note that application is the only step which requires the unification algorithm, \( U \).
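As a worked example, consider running W on \( \lambda f.\, \lambda x.\, f\, x \) in the empty environment (the type variables \( \alpha \), \( \beta \), and \( \gamma \) below are the ones W invents; this trace is illustrative, not part of the definition). W_Abs invents \( \alpha \) for \( f \) and \( \beta \) for \( x \); W_App then types \( f \) as \( \alpha \) and \( x \) as \( \beta \), invents \( \gamma \), and unifies \( \alpha \) with \( \beta \to \gamma \):

\[ \begin{aligned} W(\{\langle f, \alpha \rangle, \langle x, \beta \rangle\},\, f\, x) &= \langle [\beta \to \gamma/\alpha],\, \gamma \rangle &&\text{where } U(\alpha,\, \beta \to \gamma) = [\beta \to \gamma/\alpha] \\ W(\{\langle f, \alpha \rangle\},\, \lambda x.\, f\, x) &= \langle [\beta \to \gamma/\alpha],\, \beta \to \gamma \rangle \\ W(\{\},\, \lambda f.\, \lambda x.\, f\, x) &= \langle [\beta \to \gamma/\alpha],\, (\beta \to \gamma) \to \beta \to \gamma \rangle \end{aligned} \]

With \( \beta \) and \( \gamma \) universally quantified by T_Gen, this is the expected polymorphic type of function application.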

11.3 Type Inference in Practice

Real languages which use type inference, such as Haskell and OCaml, always allow types to be specified explicitly as well. This is both for documentation, and because polymorphic types sometimes do not express the user’s actual intent; they can easily be too polymorphic.

Type inference allows programs to be fully statically typed, even if few or no types are written.

Type Inference (continued)

Type inference is distinct from dynamic typing, because types are reasoned about from the code alone, and distinct from gradual and optional typing (if you happen to be familiar with these concepts), because every expression is given a precise type. Type inference does not weaken type checking, but allows sophisticated types to be used without having to write them, or often, even work them out.

The downside of type inference is that error messages can be incomprehensible, even to those familiar with type inference. However, knowing the concepts of type inference makes reading such type errors easier. With this knowledge in hand, it is worth going back to Section 4 and rereading the error message that Haskell produced when we attempted to type the Y combinator.

Exercise 3. Extend W for let bindings as defined in Module 4.

Exercise 4. Extend W to support optional explicit type specification, i.e., simply-typed \( \lambda \)-calculus abstractions in the polymorphic \( \lambda \)-calculus.


Pattern Matching

Now that we have both algebraic data types and parametric polymorphism, we can discuss one of the fundamental features of typed functional languages: pattern matching.

Consider our List type from Section 9:

data List a = Empty | Pair a (List a)

The types are erasable, so setting aside types, this is a definition of two data constructors, Empty and Pair, of which the latter takes two arguments. But data constructors in our semantics generate symbols, and symbols aren’t actually part of the written language, just part of our semantics. Indeed, the tuples that we generate in our semantics don’t even have any operations to unpack them; they only exist as values. So, how would we actually use a List in practice?

The answer is that there is a construct similar to the guards we saw in Section 6, but much more powerful: pattern matching. All typed languages with ADTs support pattern matching, to avoid exposing symbols as a construct in the language. Untyped functional languages such as Scheme/Racket usually expose symbols.

In Haskell, we can match based on a list like so:

sumList l =
  case l of
    Empty      -> 0
    Pair x y   -> x + sumList y

There is a lot to unpack in the case construction. Each line associates a data constructor with an expression, but it does even more: the data constructor can have unbound variables, and implicit let bindings are built to bind them. So, rewritten as a nested if statement, this would look something like

sumList = \l ->
  if (l[0] = #Empty) then 0
  else if (l[0] = #Pair) then
    let x = l[1] in
    let y = l[2] in
    x + sumList y

but this syntax is purely fictional. Neither l[n], to extract a value from the tuple we are using in our semantics to represent ADTs, nor #, to name symbols, exists in the actual language. That being said, this is how the above pattern would actually be implemented.

Patterns are actually far more powerful than this; in general, any expression composed of data constructors, variables, and values is allowed. For instance, we can write a function that sums the first two elements of a list by nesting Pair:

sumFirstTwo l =
  case l of
    Pair x (Pair y _) -> x + y

This example also demonstrates _, the “don’t care” pattern, which behaves like a variable (in that it matches anything), but does not actually bind a variable. As with guards, the implicit else branch of any pattern match is a call to error.

Values in patterns are an unusual case, but also give us an easy way to write base cases for recursive functions over non-recursive data types:

sumNums x =
  case x of
    0 -> 0
    _ -> x + sumNums (x-1)

Patterns are strictly ordered, so a more general pattern (such as _) must come after a more specific pattern (such as 0).
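The examples above can be collected into a single runnable module; the type signatures are added here for clarity and are not part of the original snippets:

```haskell
data List a = Empty | Pair a (List a)

-- Sums an entire list, matching on each constructor.
sumList :: Num a => List a -> a
sumList l =
  case l of
    Empty    -> 0
    Pair x y -> x + sumList y

-- Nested pattern: sums only the first two elements.
sumFirstTwo :: Num a => List a -> a
sumFirstTwo l =
  case l of
    Pair x (Pair y _) -> x + y

-- Value patterns; the more specific pattern 0 must come before _.
sumNums :: Int -> Int
sumNums x =
  case x of
    0 -> 0
    _ -> x + sumNums (x - 1)
```

For instance, sumList (Pair 1 (Pair 2 (Pair 3 Empty))) evaluates to 6, and sumNums 4 to 10; calling sumFirstTwo on a list of fewer than two elements falls off the end of the case and raises an error.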

Implementing patterns involves reducing the left-hand side of -> using a reduction rule specific to patterns. It is reduced to a value which may have “holes”, labeled by variable names. The value produced by that reduction is compared to the value passed to case, with a comparator that supports nested tuples, and matches a hole to any value. The right-hand side of -> is then evaluated in an environment that associates the variables defined by the holes with the values defined by the case value.

Haskell also supports a shorthand for pattern matching, in a syntax that resembles declaring a function multiple times:

sumNums 0 = 0
sumNums x = x + sumNums (x-1)

Like with case expressions, the order is critical, as this is just syntactic sugar for a case expression.

Semantics of Patterns

Since patterns are strictly ordered, the first step to defining their formal semantics is to rewrite them in a way that is more similar to the familiar if-then-else expression. Neither Haskell nor most other languages with pattern matching have such a “match/else” expression, so we will invent one. In the language of our semantics, and not the source language, we will suppose that the expression \( \text{match } E_1 : P \text{ then } E_2 \text{ else } E_3 \) evaluates \( E_2 \) if the expression \( E_1 \) evaluates to a value which matches the pattern \( P \), and \( E_3 \) otherwise. \( E_2 \) will be evaluated in a context in which the variables in \( P \) have been bound.

We will also simplify the process of matching by asserting that patterns are only one level deep; that is, we may match E1 : Pair a b, but we may not match E1 : Pair a (Pair b c). The latter may be rewritten in terms of the former, albeit with a somewhat more complicated structure to the resulting match-then-else expressions.
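For example, a nested match on \( \text{Pair } a\ (\text{Pair } b\ c) \) can be rewritten using only one-level patterns, at the cost of duplicating the else branch (\( w \) here is a fresh variable):

\[ \text{match } E_1 : \text{Pair } a\; w \text{ then } (\text{match } w : \text{Pair } b\; c \text{ then } E_2 \text{ else } E_3) \text{ else } E_3 \]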

Now, let’s write the rules for match. Remember that if the expression evaluates to an ADT of any kind, the result will be some tuple \( \{Sym, \ldots\} \), where \( Sym \) is the symbol for the data constructor, and the rest is the arguments. To match, we must evaluate both sides. Since we will produce let-bindings, we will evaluate with a store \( \sigma \).

First, the usual rules to evaluate both sides. \( P \) here is a metavariable over patterns, which are syntactically just expressions; the difference comes in type judgment, in that they may have unbound variables.

\[ \textbf{MatchLeft} \quad \dfrac{ \langle \sigma,\, E_1 \rangle \to \langle \sigma,\, E_1' \rangle }{ \langle \sigma,\, \text{match } E_1 : P \text{ then } E_2 \text{ else } E_3 \rangle \to \langle \sigma,\, \text{match } E_1' : P \text{ then } E_2 \text{ else } E_3 \rangle } \]\[ \textbf{MatchRight} \quad \dfrac{ \langle \{\},\, P \rangle \to \langle \{\},\, P' \rangle }{ \langle \sigma,\, \text{match } E_1 : P \text{ then } E_2 \text{ else } E_3 \rangle \to \langle \sigma,\, \text{match } E_1 : P' \text{ then } E_2 \text{ else } E_3 \rangle } \]

Note that \( P \) is reduced without anything in the store. This is because all variables in patterns are supposed to be unbound; they are bound by the matching.

Now, once both sides have been reduced, we can perform our match. First, we need to convert the unbound variables in \( P \) into let-bindings:

\[ \textbf{MatchBind} \quad \dfrac{ \begin{array}{l} P = \{y,\, \_1,\, \_2,\, \cdots,\, \_n,\, x,\, z_1,\, z_2,\, \cdots,\, z_m\} \\ P' = \{y,\, \_1,\, \_2,\, \cdots,\, \_n,\, \_,\, z_1,\, z_2,\, \cdots,\, z_m\} \\ V = \{y,\, M_1,\, M_2,\, \cdots,\, M_n,\, N,\, M_1',\, M_2',\, \cdots,\, M_m'\} \end{array} }{ \langle \sigma,\, \text{match } V : P \text{ then } E_1 \text{ else } E_2 \rangle \to \langle \sigma,\, \text{match } V : P' \text{ then } \mathbf{let}\; x = N\; \mathbf{in}\; E_1 \text{ else } E_2 \rangle } \]

This is a bit wordy, so let’s go part-by-part:

  • The \( P = \) premise demands that the pattern have some number \( n \) (which may be 0) of throw-away _ matches, then a variable name \( x \), then \( m \) more variables.
  • The \( P' = \) line specifies that after this step of matching, our pattern will have one more _, replacing \( x \).
  • The \( V = \) line associates the appropriate element — the one after \( n \) other elements — of the value we are matching against with the name \( N \).
  • The conclusion expands \( E_1 \) into \( \mathbf{let}\; x = N\; \mathbf{in}\; E_1 \), so that \( E_1 \) will be evaluated in a context with \( x \) defined as matched in \( V \).

When all the binding is done, we need to actually evaluate the expression:

\[ \textbf{MatchThen} \quad \dfrac{ \begin{array}{l} P = \{y,\, \_1,\, \_2,\, \ldots,\, \_n\} \\ V = \{y,\, M_1,\, M_2,\, \ldots,\, M_n\} \end{array} }{ \langle \sigma,\, \text{match } V : P \text{ then } E_1 \text{ else } E_2 \rangle \to \langle \sigma,\, E_1 \rangle } \]

This simply demands that \( P \) and \( V \) be of the same form; in particular, their symbol \( y \) must be the same.

Finally, if the symbol is wrong, the match fails:

\[ \textbf{MatchElse} \quad \dfrac{ \begin{array}{l} P = \{y,\, \cdots\} \\ V = \{z,\, \cdots\} \\ y \neq z \end{array} }{ \langle \sigma,\, \text{match } V : P \text{ then } E_1 \text{ else } E_2 \rangle \to \langle \sigma,\, E_2 \rangle } \]

Note that nothing in our rules checked for the case that the pattern and value matched have different sizes (e.g., that \( P \) is an \( n \)-tuple and \( V \) is an \( m \)-tuple with \( n \neq m \)), so the semantics would get stuck if this occurred. Our data constructors only ever produce tuples of a fixed length, so it is the job of type judgment, and the rest of our semantic rules, to assure that the semantics do not get stuck in this way.
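A minimal sketch of this one-level matching process, assuming an invented representation of semantic values as symbol-tagged tuples (Val) and of patterns as a symbol plus a list of holes (Pat); these types are for illustration only:

```haskell
-- {Sym, ...}: a data constructor symbol applied to argument values.
data Val = VCon String [Val]
         | VNum Int
         deriving (Eq, Show)

-- A one-level pattern: a symbol, plus either a variable name (Just x)
-- or the don't-care hole _ (Nothing) for each component.
data Pat = PCon String [Maybe String]

-- Matching either fails (MatchElse) or produces the let-bindings
-- that repeated applications of MatchBind would generate.
match :: Val -> Pat -> Maybe [(String, Val)]
match (VCon y vs) (PCon z ps)
  | y == z && length vs == length ps =
      Just [ (x, v) | (Just x, v) <- zip ps vs ]  -- bind named holes only
  | otherwise = Nothing                           -- wrong symbol: MatchElse
match _ _ = Nothing
```

The explicit length check stands in for the guarantee that, in the full semantics, type judgment provides: a pattern and value with the same symbol always have the same number of components.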


Errors and Exceptions

Haskell uses a special error type to indicate erroneous situations. Errors carry with them an error message, as a string. In most other languages, including OCaml, exceptions behave in a similar way to errors. For example, calling sumFirstTwo with a list of length less than two will raise an error.

Semantically, since errors are, well, errors, virtually all formal semantics simply ignore them. But we don’t shy away from formally defining things in this course, so let’s add errors to the \( \lambda \)-calculus:

⟨Term⟩ ::= … | error
⟨Expr⟩ ::= … | error
\[ \textbf{ErrorLeft} \quad \dfrac{}{\text{error}\; E \to \text{error}} \qquad\qquad \textbf{ErrorRight} \quad \dfrac{}{V\; \text{error} \to \text{error}} \]

In short, if an error occurs in any subexpression during application, then the expression is itself an error. This propagates upwards through all applications, so that if a program encounters an error anywhere, the entire program will be in error. With the \( \lambda \)-calculus, we only needed to explicitly show this propagation with two rules, but adding all the primitives from Module 3 would involve a dozen new, and extremely boring, rules.

More interesting than errors are exceptions, but Haskell has no exceptions per se. (Exceptions in Haskell are a library feature relating to monads, not a language feature.)


Lazy Evaluation

Recall that when we discussed the \( \lambda \)-calculus, we considered three reduction strategies: Applicative Order Reduction (AOR), Applicative Order Evaluation (AOE), and Normal Order Reduction (NOR). AOR proceeds by always reducing the leftmost, innermost redex (in the case of AOE, ignoring redices inside of abstractions), while NOR always reduces the leftmost, outermost redex. As a result, under AOR and AOE, arguments to a function are always reduced completely before they are passed to the function. Conversely, under NOR, arguments are passed to the function first, and then evaluated as needed. This “as needed” form of evaluation is called lazy evaluation, and with lazy evaluation, unused arguments (or even variable bindings!) will never be evaluated at all. AOR and AOE are forms of eager evaluation, which is how almost all programming languages evaluate.

As we learned from the Standardization Theorem, a reduction under NOR will always reach a normal form, if one exists. On the other hand, AOR and AOE can get caught in infinite reduction sequences, even if a normal form exists. From this point of view, it would make sense to base the reduction strategies of functional programming languages on NOR rather than AOR. However, most languages are based on eager evaluation. Why should this be? The answer is that in general, eager strategies tend to be more efficient than lazy strategies. Consider the following expression:

\[ (\lambda x.\; x\,x\,x\,x\,x)((\lambda y.\; y)\,z) \]

This expression reduces to the normal form \( z\,z\,z\,z\,z \). However, it takes six steps to reach the normal form under NOR, but only two to reach it under AOR. This phenomenon manifests itself any time a function makes use of its argument more than once. If the argument is not reduced before being passed to the function, then it is replicated by the function and must then be reduced multiple times.

There are, however, two situations in which NOR can reach a normal form faster than AOR. First, as we have seen, AOR can fall into avoidable infinite reductions, whereas NOR is guaranteed to reach a normal form if one exists. Any finite amount of time is less than infinite time. The second case is when NOR can avoid evaluation steps entirely. Consider the following expression:

\[ (\lambda x.\; z)((\lambda a.\; \lambda b.\; \lambda c.\; \lambda d.\; \lambda e.\; e)\,y\,y\,y\,y\,y) \]

This expression reduces to the normal form \( z \). However, AOR requires six steps of reduction to reach it, while NOR requires only one. NOR does not bother to perform reductions whose results will just be thrown away, while AOR does.
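This work-avoidance is directly observable in Haskell: an argument whose value is never used is never evaluated. In the sketch below (the names are invented for illustration), undefined is the standard Prelude value that raises an error if it is ever forced:

```haskell
-- Ignores its argument entirely, like (\x. z) above.
discard :: a -> Int
discard _ = 42

-- Under lazy evaluation, the erroring argument is never forced,
-- so this evaluates to 42 rather than crashing.
demo :: Int
demo = discard undefined
```

In an eagerly evaluated language, the analogous call would fail before the function body was ever entered, because the argument would be evaluated first.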

Haskell is a lazily evaluated language, which means it mostly follows NOR. Actually, as Haskell has no interest in normal forms per se, but is a language with actual primitive values, it follows NOE: Normal Order Evaluation. NOE is to NOR as AOE is to AOR: it follows the same order, but never reduces inside an abstraction. NOE will also never reduce a rand directly, since it must be passed into an abstraction to be used.

All of the primitive semantic rules we defined in Module 3 work for any of these evaluation orders, but our informal definition of pattern matching does not quite work. For pattern matching to work lazily, the pattern (the left-hand side of ->) is evaluated fully, and the comparator fully evaluates only the parts of the matched value necessary to make the comparison.

In addition, lazy evaluation allows us to change our definitions of let bindings and global declarations: both can allow expressions to be bound, rather than values, because those expressions will only be evaluated when needed anyway.

This lazy evaluation strategy makes certain patterns natural in Haskell, but nonsense in other languages. For instance, consider this function:

allTheNumbersFrom = \x -> Pair x (allTheNumbersFrom (x+1))

This function generates a list of every integer greater than or equal to \( x \). That list is, of course, infinitely long. And yet, this function works fine: we can call it and get a list. If we get the first element of that list, it is \( x \). The second element is \( x+1 \). Of course, if we try to go over the entire list, we will never get to the end, but that does not stop Haskell from generating it.

Let’s see why this works by getting the third element of one of these infinite lists. We start with these additional definitions of third, second, and first:

third  = \x -> case x of Pair _ y -> second y
second = \x -> case x of Pair _ y -> first y
first  = \x -> case x of Pair y _ -> y

Now, let’s do our reduction, using NOE (also simplifying let expressions to use substitution):

third (allTheNumbersFrom 1)
→ (\x -> case x of Pair _ y -> second y) (allTheNumbersFrom 1)
→ case (allTheNumbersFrom 1) of Pair _ y -> second y
→ case ((\x -> Pair x (allTheNumbersFrom (x+1))) 1) of Pair _ y -> second y
→ case (Pair 1 (allTheNumbersFrom (1+1))) of Pair _ y -> second y
→ let y = (allTheNumbersFrom (1+1)) in second y
→ second (allTheNumbersFrom (1+1))
→ (\x -> case x of Pair _ y -> first y) (allTheNumbersFrom (1+1))
→ case (allTheNumbersFrom (1+1)) of Pair _ y -> first y
→ case ((\x -> Pair x (allTheNumbersFrom (x+1))) (1+1)) of Pair _ y -> first y
→ case (Pair (1+1) (allTheNumbersFrom (1+1+1))) of Pair _ y -> first y
→ let y = (allTheNumbersFrom (1+1+1)) in first y
→ first (allTheNumbersFrom (1+1+1))
→ (\x -> case x of Pair y _ -> y) (allTheNumbersFrom (1+1+1))
→ case (allTheNumbersFrom (1+1+1)) of Pair y _ -> y
→ case ((\x -> Pair x (allTheNumbersFrom (x+1))) (1+1+1)) of Pair y _ -> y
→ case (Pair (1+1+1) (allTheNumbersFrom (1+1+1+1))) of Pair y _ -> y
→ let y = (1+1+1) in y
→ 1+1+1
→ 2+1
→ 3

Consider in particular the last few lines: when we evaluated first, its argument still had an infinite recursion, but since we did not use that part of the argument, the fact that it was infinite is irrelevant. And, in the very end, since we had not yet needed to evaluate our addition — remember, + is a function too — all of the addition happens at the end.
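The trace above can be reproduced directly in Haskell using the definitions from this section; because evaluation is lazy, the infinite list is no obstacle:

```haskell
data List a = Empty | Pair a (List a)

-- An infinite list of the integers from x upward.
allTheNumbersFrom :: Int -> List Int
allTheNumbersFrom = \x -> Pair x (allTheNumbersFrom (x + 1))

first, second, third :: List a -> a
first  = \x -> case x of Pair y _ -> y
second = \x -> case x of Pair _ y -> first y
third  = \x -> case x of Pair _ y -> second y
```

Evaluating third (allTheNumbersFrom 1) terminates with 3, exactly as in the reduction above, because only the first three cells of the infinite list are ever constructed.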

This style of evaluation is wildly different from that used by other programming languages. In terms of the meaning of any expression, it is strictly better, since we can define infinite data structures without filling all of memory. It fits nicely with another feature of Haskell: Haskell is pure. In a pure functional language, it does not actually matter if a particular subexpression is evaluated, if it does not contribute to the final value; whether it was evaluated or not is unobservable, except in how long your program runs.

There is a more subtle benefit in how lazy evaluation handles recursion. Most functional programming languages need to handle one case of recursion specially: so-called tail recursion. Tail recursion is when the very last thing a function does is call itself, with a different argument. The obvious implementation of recursion would simply add a new stack frame, and then all stack frames return in sequence when the base case is reached. The problem with this is that for very large data structures, the stack is easily exhausted. Other functional languages handle this case by clearing the current stack frame before making the recursive call, so that the same stack space is used repeatedly. Haskell needs no special handling: it just returns the unevaluated expression.

The compiler magic required to make Haskell perform well in spite of lazy evaluation is nothing short of herculean. The simple fact is that as lazy as Haskell is, computers are neither lazy nor functional. It is the job of the Haskell compiler to determine which expressions are definitely evaluated, put them in an evaluation order, and wrap each expression which is only maybe evaluated in a lightweight function (a thunk). Expressions which cannot be proved to be evaluated exactly once are memoized, which means that memory is allocated to store their value, and then that value is used when the expression is evaluated again, so that if a variable is used multiple times, it does not involve fully evaluating an expression every time. In spite of all this, the major implementation of Haskell, GHC, produces code with performance roughly on par with C compilers.

Laziness can be added ad-hoc to any programming language with first-class functions. For instance, an OCaml function which behaves similarly to the Haskell allTheNumbersFrom above can be constructed by wrapping the problematic infinite recursion in a function, and infinite lists in an algebraic data type:

type 'a lazyList =
  | Empty
  | Pair of 'a * (unit -> 'a lazyList)

let rec allTheNumbersFrom x =
  Pair (x, fun _ -> (allTheNumbersFrom (x+1)))

let first (Pair (x, _)) = x
let second (Pair (_, x)) = first (x())
let third (Pair (_, x)) = second (x())

This version is more difficult to use, since we must explicitly call the function to get more of the list, with x() (the application of x with unit). However, this pattern, in particular with infinite lists, is sufficiently useful that it has made its way into several languages, such as Python, in the form of generators.

Laziness requires purity for the results to be unsurprising, but purity comes at a great cost. In a language like OCaml, if you want to print something, you just call the printf function, which returns a unit. In Haskell, there are no functions that return unit, since that is wholly meaningless in a pure language: expressions have no behavior other than to evaluate a value, so an expression that evaluates to unit could not actually do anything. When an expression does something other than evaluate to a value, such as print, that other behavior is called a side effect. All practical programming languages, including Haskell, allow side effects, but laziness and purity significantly change how they can be expressed. Before we discuss the implications of that, let’s discuss how it is done in an impure language like OCaml.


Impure State and Assignment

Consider OCaml, in which a reference can be created with ref v, where v is its initial value. Because references change their value, the order of evaluation of any function that uses references is extremely important; OCaml follows AOE, as this is a fairly intuitive “left-to-right” order. But that is only half the story. How would we formally describe references and state?

Recall how we described a store, \( \sigma \). The store itself is not useful for storing references, because it is transient: taking a step does not actually change the store; we simply add to the store when we encounter let bindings in subexpressions. What is more important is that it gave us a new way to think about our reduction in general: rather than just reducing an expression, we paired that expression with a store. In order to represent state, we add one more element to that tuple: a heap.

A heap is a map, like the store, which associates references with their values. In formal semantics, heaps are often named \( \Sigma \) (capital sigma). We add \( \Sigma \) to the tuple we reduce over, so instead of just reducing over a pair of a store and an expression, we now reduce over a triple of a heap, store, and expression: \( \langle \Sigma, \sigma, E \rangle \). Unlike the store as we defined it, however, taking a reduction step can actually change the heap. That is, there are rules that look like this:

\[ \dfrac{\cdots}{\langle \Sigma,\, \sigma,\, E \rangle \to \langle \Sigma',\, \sigma,\, E' \rangle} \]

Thus, as reduction proceeds, later expressions are evaluated in a changed heap. We start off our reduction with \( \Sigma = \{\} \), and we cannot forget the state of the heap until the expression part of our triple has reduced to a value; i.e., reduction is complete.

Heaps are called such not by their analogy to the heap data structure, but by their analogy to the heap of a program, i.e., physical memory. As the program runs, the heap is changed, and so that change should be reflected in the reduction steps.

Our store mapped variable names to values, but the heap cannot be indexed so nicely, because references do not actually have names, and even if they did, multiple references may have the same name. Instead, as with symbols in algebraic data types, we need another sort of value created just to be unique. These values are called labels. You can think of them as memory addresses, as they are the indices into our heap, but our formal heap is just a simple map with no particular structure, so it is better just to think of labels as arbitrary values. A reference will take the value of a label, and accessing references will involve using that label to index the heap. Labels are terminal values, insofar as they are not, in and of themselves, reducible.

Now, let’s extend the \( \lambda \)-calculus with let bindings to include references using OCaml’s syntax, and labels:

⟨Term⟩ ::= … | ⟨Label⟩
⟨Expr⟩ ::= …
         | ref ⟨Expr⟩
         | ! ⟨Expr⟩
         | ⟨Expr⟩ := ⟨Expr⟩

Note that \( \langle Label \rangle \) does not form part of the actual user language — a user cannot write a label, they can only write a ref, which may become a label — so all that is important in the definition of \( \langle Label \rangle \) is that there are infinitely many unique labels. The typical metavariable for labels is \( \ell \).

Now, let’s define the semantics for each of our new expressions. Let \( E \) range over expressions, \( V \) over terminal values, \( \ell \) over labels, \( \Sigma \) over heaps, and \( \sigma \) over stores.

\[ \textbf{RefEval} \quad \dfrac{ \langle \Sigma,\, \sigma,\, E \rangle \to \langle \Sigma',\, \sigma,\, E' \rangle }{ \langle \Sigma,\, \sigma,\, \text{ref}\; E \rangle \to \langle \Sigma',\, \sigma,\, \text{ref}\; E' \rangle } \]\[ \textbf{Ref} \quad \dfrac{ \ell \text{ is a fresh label in } \Sigma \quad \Sigma' = \Sigma[\ell \mapsto V] }{ \langle \Sigma,\, \sigma,\, \text{ref}\; V \rangle \to \langle \Sigma',\, \sigma,\, \ell \rangle } \]\[ \textbf{DerefEval} \quad \dfrac{ \langle \Sigma,\, \sigma,\, E \rangle \to \langle \Sigma',\, \sigma,\, E' \rangle }{ \langle \Sigma,\, \sigma,\, {!}E \rangle \to \langle \Sigma',\, \sigma,\, {!}E' \rangle } \qquad \textbf{Deref} \quad \dfrac{ \Sigma(\ell) = V }{ \langle \Sigma,\, \sigma,\, {!}\ell \rangle \to \langle \Sigma,\, \sigma,\, V \rangle } \]\[ \textbf{AssgLeft} \quad \dfrac{ \langle \Sigma,\, \sigma,\, E_1 \rangle \to \langle \Sigma',\, \sigma,\, E_1' \rangle }{ \langle \Sigma,\, \sigma,\, E_1 := E_2 \rangle \to \langle \Sigma',\, \sigma,\, E_1' := E_2 \rangle } \]\[ \textbf{AssgRight} \quad \dfrac{ \langle \Sigma,\, \sigma,\, E \rangle \to \langle \Sigma',\, \sigma,\, E' \rangle }{ \langle \Sigma,\, \sigma,\, V := E \rangle \to \langle \Sigma',\, \sigma,\, V := E' \rangle } \qquad \textbf{Assg} \quad \dfrac{ \Sigma' = \Sigma[\ell \mapsto V] }{ \langle \Sigma,\, \sigma,\, \ell := V \rangle \to \langle \Sigma',\, \sigma,\, V \rangle } \]

Exercise 5. Perform the reduction in AOE for this program: \( (\lambda x.\; \lfloor\text{false}\rfloor\; (x := 2)\; (!x))\; (\text{ref}\; 0) \)

To those more familiar with C and C++, ref is essentially malloc, ! is *, and x := y is *x = y.
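To make the rules concrete, here is a minimal sketch of an evaluator that threads a heap through evaluation, in the style of the \( \langle \Sigma, \sigma, E \rangle \) triples above. The expression language (numbers, variables, let, ref, !, :=, and sequencing) and all names are invented for illustration, and big-step evaluation stands in for the small-step rules; freshness of labels is approximated by the current heap size, which is safe here because labels are never removed:

```haskell
import qualified Data.Map as M

type Label = Int
type Heap  = M.Map Label Val    -- Sigma: labels to values
type Store = M.Map String Val   -- sigma: variable names to values

data Val = VNum Int | VLabel Label deriving (Eq, Show)

data Expr = Num Int
          | Var String
          | Let String Expr Expr   -- let x = E1 in E2
          | Ref Expr               -- ref E
          | Deref Expr             -- ! E
          | Assign Expr Expr       -- E1 := E2
          | Seq Expr Expr          -- evaluate E1 for its effect, then E2

-- Evaluation takes a heap and store and returns the (possibly changed)
-- heap along with the resulting value.
eval :: Heap -> Store -> Expr -> (Heap, Val)
eval h _ (Num n) = (h, VNum n)
eval h s (Var x) = (h, s M.! x)
eval h s (Let x e1 e2) =
  let (h1, v) = eval h s e1
  in  eval h1 (M.insert x v s) e2
eval h s (Ref e) =                        -- Ref: allocate a fresh label
  let (h1, v) = eval h s e
      l       = M.size h1                 -- fresh, since labels are 0, 1, ...
  in  (M.insert l v h1, VLabel l)
eval h s (Deref e) =                      -- Deref: index the heap by the label
  let (h1, VLabel l) = eval h s e
  in  (h1, h1 M.! l)
eval h s (Assign e1 e2) =                 -- AssgLeft, AssgRight, then Assg
  let (h1, VLabel l) = eval h s e1
      (h2, v)        = eval h1 s e2
  in  (M.insert l v h2, v)
eval h s (Seq e1 e2) =
  let (h1, _) = eval h s e1
  in  eval h1 s e2
```

For example, evaluating let x = ref 0 in (x := 2; !x) from an empty heap and store yields the value 2, with the heap recording that x's label now maps to 2.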

Because \( \Sigma \) is actually changed by ref and :=, the order of evaluation can completely change the meaning of a program, and expressions with results which are ultimately discarded — such as the first argument to \( \lfloor\text{false}\rfloor \) — can nonetheless contribute to the final value of the program. Although nothing makes state of this sort technically incompatible with lazy evaluation, the behavior of a program with state and lazy evaluation would be so difficult to predict, it is simply never done.

Aside: You may have noticed that virtually all of our formal semantics for added features have some rather boring steps saying “if you have an expression instead of a value as a subexpression, evaluate that first”. Some formal semantics attempt to abbreviate this, but that is often very unclear; it is still more common to simply include all of these dull ordering steps.

The same basic idea — adding an extra element to the tuple that represents our program state — can also be extended to any other kind of impure behavior, such as standard I/O (with a list of input characters and output characters) and even files (with a tree of files).

Note that you will see many different ways of expressing the operands to \( \to \) when state is included. It is not uncommon to write the triple as \( \Sigma;\, \sigma;\, E \) instead of \( \langle \Sigma, \sigma, E \rangle \). Some works prefer to group both state components, something like \( \langle \Sigma, \sigma \rangle;\, E \). These differences are purely superficial; the critical thing is that our reduction is now over not just the expression, but state.


Monads

At long last: monads!

When confronted with the question of how to deal with side effects, Haskell’s creators were quite aware of formal semantics, and how side effects are modeled in formal semantics, and they made a surprising observation: while an expression \( E \) in a language like OCaml might not be pure, \( \to \) is pure. The side effects are not “side” effects if we describe the pair — such as the pair of heap and expression — instead of the expression alone. Everything that is a side effect to \( E \) is represented directly in \( \Sigma' \), so the entire morphism has no side effects. With that observation, monads were born. (Actually, monads are a concept within category theory — they are just monoids in the category of endofunctors! — but what monads mean to us is a way to bring the pair of “side-effecting expression” and “target of side effects” together into our description of a function.)

Monads bundle the side-effecting behavior into a sort of black box, and allow us to link those black boxes into explicitly specified chains. Those chains form the specific ordering that makes stateful behavior predictable.

The concept of a monad comes from category theory. Monads were first applied to computer science by Eugenio Moggi, and their use was popularized by the Haskell community, in particular by Philip Wadler and Simon Peyton Jones. Monads are a rather abstract concept — this is why they are considered so difficult to understand — so we shall introduce them through the motivating example of input and output, and then generalize.

The IO Monad

The monad for performing I/O is called, appropriately, IO. IO is a polymorphic type, as an I/O action may return something as well as performing a side effect; for instance, getLine is a value of type IO String. The fact that getLine is not a function may seem strange, and is strange, but remember, functions are pure, so it could not have been a function. Inside of getLine’s value is some code that allows Haskell to read a line of input. But that code is not a Haskell function. Similarly, there is putStrLn, which is of type String -> IO (). Calling putStrLn does not write a string to standard output. It is a function so that it can make an I/O black box specific to the string you use as an argument; that I/O black box itself contains the code that allows Haskell to write the string out. The result of that I/O action is (), a unit value carrying no information.

Conceptually, we may think of the type IO a as follows:

IO a = World -> (a, World)

That is, given a distinguished type World that represents the state of the entire world, we see that an IO action may be thought of as a function that takes the entire world as input, and produces a value of type a and a new world as output. Indeed, that is exactly how our formal semantics modeled side effects above: \( \Sigma \) is the world — all the state outside of our code — and a is the type that \( E \) actually evaluates to. Since Haskell is a real programming language, the “entire world” includes the computer screen and keyboard. This is purely conceptual, since we cannot write functions over the entire world, so instead, an IO is just a black box.
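As a sketch of this conceptual model, we can build a toy IO-like type in which the “world” is just pending input and accumulated output. Every name below (World, MiniIO, runMiniIO, getLine', putStrLn') is invented for the illustration; Haskell’s real IO is an opaque black box that cannot be opened up like this.

```haskell
-- A toy model of the IO-as-world-passing reading.
type World = ([String], [String])   -- (pending input lines, output so far)

newtype MiniIO a = MiniIO (World -> (a, World))

runMiniIO :: MiniIO a -> World -> (a, World)
runMiniIO (MiniIO f) w = f w

-- read one line from the pending input (partial: assumes input remains)
getLine' :: MiniIO String
getLine' = MiniIO (\(i : is, os) -> (i, (is, os)))

-- append one line to the output
putStrLn' :: String -> MiniIO ()
putStrLn' s = MiniIO (\(is, os) -> ((), (is, os ++ [s])))

-- threading the world through getLine' and then putStrLn' by hand
main :: IO ()
main =
  let w0      = (["hello"], [])
      (s, w1) = runMiniIO getLine' w0
      (_, w2) = runMiniIO (putStrLn' s) w1
  in print w2   -- prints ([],["hello"])
```

The manual threading of w0, w1, and w2 is exactly the plumbing that a combining operator must abstract away.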

Let’s imagine that we want to write a very simple program using getLine and putStrLn that simply reads in one line and writes it back out again. We want to compose getLine and putStrLn into a single program that performs both actions. The state of the world “flows” through getLine, into putStrLn, and out of our combined program.

The combinator used to combine getLine and putStrLn in this way is written >>=, and called “bind”. In the case of IO, >>= has the following type:

(>>=) :: IO a -> (a -> IO b) -> IO b

In this particular case, the type variable a will be substituted for String, so that IO String is the type of getLine, and the type variable b will be substituted for (), so that a -> IO b is the type of putStrLn. Our combined action, which we call echo, is then written as follows:

echo = getLine >>= putStrLn

echo is a non-function value, of type IO (). It specifies, but does not actually perform, the I/O. Thus, nothing we have done so far is impure. All we have done is describe the side-effecting behavior; we have not actually done anything with side effects.

In general, the notation f >>= g specifies (but does not perform!) the following sequence of actions:

  • the action f is performed first;
  • the result produced by f is passed as an argument to the function g, yielding another action;
  • the resulting action is performed, and its result returned as the result of evaluating the entire expression.

We have done all this work to produce a value that describes the side effects to be performed, but how do we actually make it happen? Recall that in Section 8, we said that in a language in which the global syntax only allows declarations, we need to know some starting point, and that in Haskell, that starting point is main. But main is not strictly a function; main is an IO action! The behavior of the entire Haskell program is to run the I/O described by the value of the expression declared as main. So, we can make a Haskell program which performs echo like so:

main = echo

A Haskell program is ultimately a specification of the I/O behavior which must be performed throughout the program, and it is some entity outside the language which actually causes that I/O to occur. It is in this convoluted way that Haskell remains a pure functional language, but is capable of I/O. I/O can never be pure, but Haskell cannot perform I/O; it can only describe it.

Aside: There is unsafePerformIO, which converts an IO a into an a by performing the I/O whenever it is evaluated. unsafePerformIO is named so for a reason. The reason it is called “unsafe” rather than just “impure” is mainly laziness: in a lazy language, it is really hard to predict when — or if — the expression will be evaluated, so it is unsafe to assume that this unsafePerformIO will actually perform its I/O, and very unsafe to assume it will perform it at a given time.

Additionally, in the Haskell REPL ghci, if an expression given by the user evaluates to an IO, then the action specified by that IO is performed immediately:

> getLine >>= putStrLn
Input line
Input line

(The second line is the user input, and the third line is ghci’s output.)

Suppose now that we would like to combine our echo program with itself, to produce a program that requests a line, prints it, and then requests and prints a second line. We cannot simply write echo >>= echo, because echo is not a function, so its type does not match the type expected by the second parameter of >>=, and therefore the expression contains a type error.

On the other hand, the output of the first echo, namely (), conveys no information, and is not useful in further computation. So, we can solve our problem by chaining the first echo to a function that ignores its input and then performs the second echo. We can capture this behavior in a new combinator, >>, called “then”:

(>>) = \f -> \g -> f >>= (\_ -> g)

We can then implement our desired program as follows:

echoecho = echo >> echo

As another example of >>, the following is an IO action to read a line and print it twice:

echoTwice = getLine >>= (\l -> (putStrLn l >> putStrLn l))

Consider now the case where we would like to return a value to the user, rather than just print to the screen. Suppose we would like to read three lines from standard input and return the first and third, as a pair. Such a program must have type IO (String, String):

myIOAction = getLine >>= \l1 ->
             getLine >>
             getLine >>= \l3 ->
             return (l1, l3)

Here, after the third getLine, there are no I/O actions left to perform; however, we still need a way to return the pair (l1, l3) to the user. Using (l1, l3) here on its own would cause a type error, because monadic binding (>>=) always produces some IO b, and (l1, l3) has type (String, String), rather than the required IO (String, String). For this reason, we need a new primitive, return. The return primitive has no effect on World; its purpose is to take a value and wrap it inside the IO monad, so that it can be used in IO actions. It allows pure values — expressions from the \( \lambda \)-calculus side of the language — to appear within our IO specification.

We now have the machinery necessary to consider a more complex monadic computation. The following action retrieves lines of text until an empty line is found, and returns the entire block (note that ++ is used to concatenate strings in Haskell):

getBlock = getLine >>= \l ->
           if l == "" then
             return ""
           else
             getBlock >>= \ls ->
             return (l ++ "\n" ++ ls)

Generalizing Monads

Now that we have some experience programming with monadic I/O, we consider in the abstract what it means to be a monad.

Definition 8. (Monad) A monad is a triple \( \langle M,\, {>>=_M},\, \text{return}_M \rangle \), where \( M \) is a type constructor and

  • \( \text{return}_M : \alpha \to M\{\alpha\} \)
  • \( {>>=_M} : M\{\alpha\} \to (\alpha \to M\{\beta\}) \to M\{\beta\} \)

such that the following monad laws are satisfied:

  • \( (\text{return}_M\; a) >>= k = k\; a \)
  • \( m >>= (\text{return}_M) = m \)
  • \( m >>= (\lambda a.\; (k\; a >>= (\lambda b.\; h\; b))) = (m >>= (\lambda a.\; k\; a)) >>= (\lambda b.\; h\; b) \)
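Though not a proof, we can spot-check these laws at concrete values. The sketch below uses Haskell’s built-in Maybe monad (introduced later in this section); the names k, h, and m are arbitrary choices for the check.

```haskell
-- Spot-checking the three monad laws for Maybe at sample values.
k :: Int -> Maybe Int
k a = Just (a * 10)

h :: Int -> Maybe Int
h b = Just (b + 1)

m :: Maybe Int
m = Just 2

leftIdentity, rightIdentity, associativity :: Bool
leftIdentity  = (return 3 >>= k) == k 3
rightIdentity = (m >>= return) == m
associativity = (m >>= (\a -> k a >>= h)) == ((m >>= k) >>= h)

main :: IO ()
main = print (leftIdentity && rightIdentity && associativity)  -- True
```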

By now, we have seen several examples of monadic programming, and it has become apparent that certain programming patterns emerge. In particular, the second argument of >>= is often an explicit \( \lambda \)-abstraction. To hide some of the complexity and focus attention on the underlying structure of the computation, Haskell supports a syntactic sugar for these common monad patterns: do notation. We use do notation to remove explicit invocations of >>=. For example, our previous myIOAction can be rewritten using do notation as follows:

myIOAction =
  do l1 <- getLine
     getLine
     l3 <- getLine
     return (l1, l3)

When monads are generalized in this way, they become useful not only for I/O computations but also as a general programming tool. Many constructions we use in imperative programming may be modeled as monads.

Recall that in Section 13, we discussed how errors propagate in Haskell. That was a rather brutish way to handle errors, so let’s look instead to handling errors with monads.

We will use monads to handle errors in much the same way that we might use exceptions in another language. Rather than raise an exception in the pure functional setting, a computation that encounters an error will return a special token to indicate the error, rather than a normal value. We can model this “normal value or error token” behavior with a data declaration:

data Maybe a = Nothing | Just a

This Maybe type is functionally identical to OCaml’s option type. For example, a division function which wishes to carefully handle error cases would return Nothing if the divisor is 0 (to avoid a division by zero error), and Just (the quotient) if the divisor is non-zero. A typical caller would need to pattern match to actually use this division function, but we can use monadic composition to make it simpler.
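To see why the pattern matching becomes tiresome, here is a sketch of such a caller written without any monadic machinery. The name clumsy is invented for the illustration, and safeDiv is written in the shape described above.

```haskell
-- safeDiv returns Nothing on division by zero, Just the quotient otherwise
safeDiv :: Int -> Int -> Maybe Int
safeDiv _ 0 = Nothing
safeDiv x y = Just (div x y)

-- a/b + b/c, with explicit pattern matching at every step
clumsy :: Int -> Int -> Int -> Maybe Int
clumsy a b c =
  case safeDiv a b of
    Nothing -> Nothing
    Just r1 ->
      case safeDiv b c of
        Nothing -> Nothing
        Just r2 -> Just (r1 + r2)

main :: IO ()
main = print (clumsy 4 2 1, clumsy 1 0 2)  -- (Just 4,Nothing)
```

Every step must repeat the same error-propagation boilerplate; monadic composition factors that boilerplate out once and for all.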

We can make a monad of Maybe by defining the monadic >>= and return functions for Maybe’s patterns:

(>>=) Nothing  _ = Nothing
(>>=) (Just x) g = g x
return = Just

Remember that data constructors are still functions, so we can simply rebind return as Just.

The combinator >>=, which combines monadic computations, first examines the value of its first argument. If the first argument is Nothing, then there was a program error, and the Nothing is propagated without executing the second action. This is similar to how error is propagated in an application without concern for the rest of the expression. If the first argument is Just x, then there was no program error, and computation may proceed: the function g may be applied to the result of the previous computation. The combinator return, which just wraps its argument into the Maybe monad, is equivalent to Just.

We may now implement a safe integer division operation, using the Maybe monad:

safeDiv x y =
  if y == 0 then
    Nothing
  else
    Just (div x y)

Suppose now we wish to define a function f that takes inputs a, b, and c, and computes a/b + b/c. We may create a safe version of f using the Maybe monad as follows:

safef a b c =
  do
    r1 <- safeDiv a b
    r2 <- safeDiv b c
    return (r1 + r2)

Notice the imperative feel of safef, even though it is both functional and pure; monadic composition gives ordering to pure functions. Also notice that there is no explicit error-handling inside of safef; all of the mechanics of propagating Nothing have been confined to the definition of the Maybe monad itself. Nevertheless, if b is 0, so that safeDiv a b fails, the error (Nothing) will propagate through the remainder of safef, so that the remaining computations, safeDiv b c and return (r1 + r2), will be skipped.
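To see the propagation concretely, here is a small runnable sketch of safef with sample inputs (the inputs are chosen arbitrarily for illustration):

```haskell
safeDiv :: Int -> Int -> Maybe Int
safeDiv _ 0 = Nothing
safeDiv x y = Just (div x y)

safef :: Int -> Int -> Int -> Maybe Int
safef a b c =
  do
    r1 <- safeDiv a b
    r2 <- safeDiv b c
    return (r1 + r2)

main :: IO ()
main = do
  print (safef 12 4 2)  -- Just 5  (12/4 + 4/2 = 3 + 2)
  print (safef 12 0 2)  -- Nothing (first division fails)
  print (safef 12 4 0)  -- Nothing (second division fails)
```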

The IO monad supports variables and state, but in fact, we do not need to involve the entire World just to manage state; we may also use monads to model mutable state, and assignment, in a purely functional setting. We consider a “state transformer” on a given state type s and result type a to be a function that takes an old state (of type s) as a parameter, and returns a result (of type a) and a new state (of type s) as results:

data State s a = ST (s -> (a, s))

Remember that monads box an action, not a value; the action in this case is a transformation over the state, together with the production of a result of type a.

The datatype State is the type constructor for our state transformers, and ST is the data constructor for a given transformer. We observe that, for any state type s, the type constructor State s is a monad, given definitions of >>= and return:

(>>=) (ST f) g = ST (\s0 ->
                   let (a, s1) = f s0 in
                   let (ST h) = g a in
                   let (b, s2) = h s1 in
                   (b, s2))

return x = ST (\s -> (x, s))

Our return combinator maps a value to a function that returns that value, while leaving the state unchanged. Our >>= produces a function that threads the initial state s0 through the computations f and g. First f, a function on states, is applied to the initial state s0, producing a value a and a new state s1. The function g, of type a -> State s b, is then applied to a, producing a state computation h, which is then applied to the “current” state s1, to produce a new value b and a final state s2. Thus, the initial state s0 is mapped to s1 after f is applied to it, and then to s2 after g a is applied to it. The result is the pair (b, s2), consisting of the final value b and the final state s2.

The state monad moves the changes to \( \Sigma \) into the explicit domain of Haskell code: a function in the State monad operates over the space of \( E \), and returns a value describing the operation over the state of \( \Sigma \). If we make \( \Sigma \) a single variable, instead of a whole map, then we can get or put the value of that variable with pure state functions:

get   = \a -> ST (\s -> (s, s))
put v = \a -> ST (\s -> ((), v))

The get function ignores its input and returns the current state as both its result and the new state. The put function ignores its input and the old state, installing the given value as the new state and returning ().

Although IO’s behavior is essentially magic — you cannot “open” an IO value and retrieve the underlying code to perform I/O — the concepts of monadic composition apply exactly the same when the monad is decomposable and even purely functional, as in State.
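As a runnable sketch of the State monad in action, the following threads a counter through two steps. The names runState and tick are invented helpers; bind and return are written as in the text, with the Prelude versions hidden so the definitions can stand alone.

```haskell
import Prelude hiding ((>>=), return)

data State s a = ST (s -> (a, s))

(>>=) :: State s a -> (a -> State s b) -> State s b
(ST f) >>= g = ST (\s0 ->
                 let (a, s1) = f s0 in
                 let (ST h) = g a in
                 h s1)

return :: a -> State s a
return x = ST (\s -> (x, s))

-- run a state computation from an initial state
runState :: State s a -> s -> (a, s)
runState (ST f) s = f s

-- return the current counter value, then increment it
tick :: State Int Int
tick = ST (\n -> (n, n + 1))

main :: IO ()
main = print (runState (tick >>= \a -> tick >>= \b -> return (a, b)) 0)
-- prints ((0,1),2): the two tick results, and the final state 2
```

Unlike IO, this monad is fully decomposable: runState lets us open the box and apply the transformer to any initial state we like.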

Monads Elsewhere

Although monads are only strictly necessary in a pure functional language, their ability to explicitly specify behavior that is usually only implicitly specified — such as state change and I/O — makes them useful even in an impure setting. Although monads are not necessary in OCaml, they are now supported, and can be used to make I/O and stateful behavior part of the checked type of functions. Similarly, Scala, an object-oriented language with functional features that runs on the Java platform, supports monads, even though its surrounding environment is anything but pure.

However, any language which both allows impure behavior and allows monads must necessarily sacrifice some assurance: you can use monads to explicitly specify impure behavior, but you do not have to. Haskell is the largest-scale language to require monads for all impure behavior.


Implementation and Continuation-Passing Style

Implementing a functional language is complicated by the fact that, no matter how much functional programmers may want it, computers are not functional; they operate in an explicit sequence of steps. However, our formal semantics always described steps, through the \( \to \) morphism, so clearly it is not impossible to describe functional languages in this fashion. In this section, we discuss continuations, an idea borrowed from formal semantics, and Continuation Passing Style (CPS), an intermediate representation based on them.

Evaluation of expressions (and execution of programs) proceeds in several steps, each corresponding roughly to one step of \( \beta \)-reduction in the \( \lambda \)-calculus. With each step of the evaluation, we associate a continuation. A continuation is an entity (we may think of continuations as functions) that represents the remainder of the computation to be performed. If \( E \to E' \), then the continuation is a version of \( E \) in which the part reduced by \( E \to E' \) is factored out, and made an argument. A continuation takes as input the value currently being computed and applies the remainder of the computation to this value. The result is the final result computed by the entire program.

Example. Consider the expression \( (3 + 5) \times 4 \). The continuation of the subexpression \( 3 \) is \( \lambda x.\; (x + 5) \times 4 \). The continuation of the subexpression \( 3 + 5 \) is \( \lambda x.\; x \times 4 \). The continuation of the entire expression, by convention, is simply the identity function, \( \lambda x.\; x \).

Typically, during the compilation process, a compiler will transform source code into a convenient intermediate representation, which is typically based on a much simpler language, and upon which optimizing transformations are more easily performed. One of these intermediate representations, known as Continuation Passing Style (CPS), is particularly popular in compilers for functional languages.

CPS is based on making the notion of a continuation of an expression explicit. Each function is transformed so that it accepts an additional parameter, representing the function’s continuation. Then, rather than return a value, the function passes that value to its continuation. Because CPS is more easily described with AOE than NOR, we will discuss it in the context of OCaml.

Consider the following simple OCaml function:

let f x = 3

This function could be rewritten to accept continuations as follows:

let f x k = k 3

In this way, functions in CPS never return; they simply pass the values they compute to their continuation. When the computation is finished, the continuation is the identity function, and this continuation is passed the final value of the computation.

Consider now the following, more complex, example:

let rec filter l p =
  match l with
  | []      -> []
  | x :: xs ->
      if p x then
        x :: (filter xs p)
      else
        (filter xs p)

This function takes a list and a predicate, and keeps only those values from the list which satisfy the predicate. This function could be rewritten in CPS as follows:

let rec filter l p k1 =
  match l with
  | [] -> k1 []
  | x :: xs ->
      let k2 b = (
        if b then
          let k3 l2 = (
            let r = x :: l2
            in k1 r
          ) in filter xs p k3
        else
          let k4 l2 = (
            k1 l2
          ) in filter xs p k4
      ) in p x k2

From this example, we can observe the major characteristics of CPS. First, we note that CPS names all intermediate expressions (in this example, the only instance of this behavior is the binding of the name r to the expression x :: l2). In this way, CPS makes dataflow explicit. Second, CPS names all points of control flow: each recursive call to filter is made with a separate continuation, and these continuations (k3 and k4) are distinct from the continuation k2 passed to the predicate p, and the top-level continuation k1. In this way, CPS makes control flow explicit.

Tracing through the CPS code, we see that, upon invocation of filter, the first step of the computation is the invocation of p. The result (either true or false) is passed to the continuation k2, which tests the value computed by p and invokes filter recursively with either the continuation k3 or the continuation k4, both of which eventually pass control back to the caller via invocation of the continuation k1.

Explicit control flow and dataflow make optimizations based on control flow analysis and dataflow analysis particularly easy. Consider, for example, the second recursive call to filter in the CPS version:

let k4 l2 = (
  k1 l2
) in filter xs p k4

In general, CPS formulations of a tail call — a recursive call which is the last behavior in a function — will have this form. The definition of k4 is trivial — it is \( \eta \)-reducible to k1 — and so we could easily replace the above code with simply the following:

filter xs p k1

That is, we simply reuse the top-level continuation k1. As simple as this transformation is, what we have actually done is optimize away tail recursion: the continuation requires no context from this function call, so the context from this function call can be discarded. Tail-call elimination is one of the fundamental optimizations of functional languages, and CPS makes it trivial.

In Haskell, this style is needed less often: under NOR, we evaluate the outermost expression instead of the innermost, so tail calls naturally melt away, because what we return is not the result of a recursive call, but the expression of the recursive call. Thus, while there are areas of Haskell to which CPS is applicable, different techniques are generally used in lazy languages.
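Where CPS does appear in Haskell code, the transformation takes the same shape as the OCaml examples above. A minimal sketch (sumCPS and sumList are invented examples, not from the text):

```haskell
-- direct-style sum over a list
sumList :: [Int] -> Int
sumList []       = 0
sumList (x : xs) = x + sumList xs

-- CPS sum: instead of returning, pass the result to the continuation k
sumCPS :: [Int] -> (Int -> r) -> r
sumCPS []       k = k 0
sumCPS (x : xs) k = sumCPS xs (\s -> k (x + s))

main :: IO ()
main = print (sumCPS [1, 2, 3] id, sumList [1, 2, 3])  -- (6,6)
```

As in the filter example, the top-level continuation is the identity function, and each recursive call extends the continuation rather than building up a return path.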


Appendix: Full Semantics

We have introduced many features of functional languages, each independently. This appendix serves as a unified reference. This is the mostly-complete syntax, semantics, and resolution for what we might call “untyped \( \lambda \)-Haskell”, i.e., an untyped language with Haskell-like features (excluding some semantics that were left as exercises).

Syntax

⟨Program⟩    ::= ⟨DeclList⟩
⟨DeclList⟩   ::= ⟨Decl⟩ ; ⟨DeclList⟩
               |
⟨Decl⟩       ::= ⟨Var⟩ = ⟨Expr⟩  |  data ⟨Var⟩ = ⟨ValueList⟩
⟨ValueList⟩  ::= ⟨DataConstr⟩ ⟨ValueListRest⟩
⟨ValueListRest⟩ ::= "|" ⟨DataConstr⟩ ⟨ValueListRest⟩
               |
⟨DataConstr⟩ ::= ⟨Var⟩ ⟨Var⟩*

⟨Expr⟩ ::= ⟨Var⟩
         | ⟨Abs⟩
         | ⟨App⟩
         | true
         | false
         | if ⟨Expr⟩ then ⟨Expr⟩ else ⟨Expr⟩
         | match ⟨Expr⟩ : ⟨Expr⟩ then ⟨Expr⟩ else ⟨Expr⟩
         | ⟨Num⟩
         | ⟨NumExp⟩
         | let ⟨Var⟩ = ⟨Expr⟩ in ⟨Expr⟩
         | error
         | ( ⟨Expr⟩ )

⟨Term⟩    ::= ⟨Var⟩ | ⟨Abs⟩ | true | false | ⟨Num⟩ | error
⟨Var⟩     ::= (any valid ID)
⟨Abs⟩     ::= λ ⟨Var⟩ . ⟨Expr⟩
⟨App⟩     ::= ⟨Expr⟩ ⟨Expr⟩
⟨Num⟩     ::= 0 | 1 | …
⟨NumExp⟩  ::= ⟨NumBinOps⟩ ⟨Expr⟩ ⟨Expr⟩
⟨NumBinOps⟩ ::= + | − | ∗ | /

Semantics

Application

\[ \langle \sigma,\, (\lambda x.\; M_1)\, M_2 \rangle \to \langle \sigma,\, M_1[M_2/x] \rangle \]

ReduceLeft

\[ \dfrac{ \langle \sigma,\, M_1 \rangle \to \langle \sigma,\, M_1' \rangle }{ \langle \sigma,\, M_1\, M_2 \rangle \to \langle \sigma,\, M_1'\, M_2 \rangle } \]

Add / AddLeft / AddRight

\[ \dfrac{a + b = c}{\langle \sigma,\, (+\; a\; b) \rangle \to \langle \sigma,\, c \rangle} \qquad \dfrac{ \langle \sigma,\, M \rangle \to \langle \sigma,\, M' \rangle }{ \langle \sigma,\, (+\; M\; N) \rangle \to \langle \sigma,\, (+\; M'\; N) \rangle } \qquad \dfrac{ \langle \sigma,\, M \rangle \to \langle \sigma,\, M' \rangle }{ \langle \sigma,\, (+\; a\; M) \rangle \to \langle \sigma,\, (+\; a\; M') \rangle } \]

Sub / SubLeft / SubRight

\[ \dfrac{a - b = c \quad c \in \mathbb{N}}{\langle \sigma,\, (-\; a\; b) \rangle \to \langle \sigma,\, c \rangle} \qquad \dfrac{ \langle \sigma,\, M \rangle \to \langle \sigma,\, M' \rangle }{ \langle \sigma,\, (-\; M\; N) \rangle \to \langle \sigma,\, (-\; M'\; N) \rangle } \qquad \dfrac{ \langle \sigma,\, M \rangle \to \langle \sigma,\, M' \rangle }{ \langle \sigma,\, (-\; a\; M) \rangle \to \langle \sigma,\, (-\; a\; M') \rangle } \]

IfTrue / IfFalse / IfExpr

\[ \langle \sigma,\, \text{if true then } E_1 \text{ else } E_2 \rangle \to \langle \sigma,\, E_1 \rangle \qquad \langle \sigma,\, \text{if false then } E_1 \text{ else } E_2 \rangle \to \langle \sigma,\, E_2 \rangle \]\[ \dfrac{ \langle \sigma,\, E_1 \rangle \to \langle \sigma,\, E_1' \rangle }{ \langle \sigma,\, \text{if } E_1 \text{ then } E_2 \text{ else } E_3 \rangle \to \langle \sigma,\, \text{if } E_1' \text{ then } E_2 \text{ else } E_3 \rangle } \]

Variable

\[ \dfrac{ \sigma[x] = E }{ \langle \sigma,\, x \rangle \to \langle \sigma,\, E \rangle } \]

LetBody / LetResolution

\[ \dfrac{ \sigma' = \sigma[x \mapsto E[z/x]] \quad z \text{ fresh} \quad \langle \sigma',\, M \rangle \to \langle \sigma',\, M' \rangle }{ \langle \sigma,\, \text{let } x = E \text{ in } M \rangle \to \langle \sigma,\, \text{let } x = E \text{ in } M' \rangle } \]\[ \langle \sigma,\, \text{let } x = E \text{ in } V \rangle \to \langle \sigma,\, V[E/x] \rangle \]

MatchLeft / MatchRight / MatchBind / MatchThen / MatchElse

(As given in Section 12.1 above.)

ErrorLeft / ErrorRight

\[ \langle \sigma,\, \text{error}\; E \rangle \to \langle \sigma,\, \text{error} \rangle \qquad \langle \sigma,\, V\; \text{error} \rangle \to \langle \sigma,\, \text{error} \rangle \]

Global Name Resolution

\[ \textbf{EmptyProgram} \quad \text{resolve}(\,) = \text{empty} \]\[ \textbf{Declaration} \quad \dfrac{ \sigma = \text{resolve}(L) \quad \sigma' = \sigma[x \mapsto V] }{ \text{resolve}(x = V;\; L) = \sigma' } \]\[ \textbf{EmptyDataDecl} \quad \text{resolve}(\text{data}\; n = \varepsilon;\; L) = \text{resolve}(L) \]\[ \textbf{DataDecl} \quad \dfrac{ \sigma = \text{resolve}(\text{data}\; n = M;\; L) \quad \sigma' = \sigma[N \mapsto \{N\}] }{ \text{resolve}(\text{data}\; n = N\; M;\; L) = \sigma' } \]\[ \textbf{ParamDataDecl} \quad \dfrac{ \sigma = \text{resolve}(\text{data}\; n = M;\; L) \quad \sigma' = \sigma[N \mapsto \lambda x_1 \to \lambda x_2 \to \cdots \lambda x_m \to \{N,\, x_1,\, x_2,\, \cdots,\, x_m\}] }{ \text{resolve}(\text{data}\; n = N\; T_1\; T_2\; \cdots\; T_m\; M;\; L) = \sigma' } \]

Module 6: Logic Programming

“There are no witty quotes about logic programming.” — Gregor Richards

Programming languages are often classified as either declarative or imperative. These two rather broad terms refer to the degree to which the task of programming is abstracted from the details of the underlying machine, and are often associated with the terms high-level and low-level, respectively. A programming language is declarative if programs in the language describe the tasks to be performed without outlining the specific steps to be taken. Conversely, imperative languages prescribe more of the details of how computations are to be carried out.

Aside: Of course, technically, a declarative language is simply a language in which declarations are primary, and an imperative language is a language in which imperatives (commands) are primary, but this distinction in level of abstraction is a natural consequence of this structural difference.

In reality, the terms declarative and imperative are most appropriately treated as relative terms: one language is “more declarative” or “more imperative” than another. In this way, the notions of “declarative” and “imperative” suggest the possibility of constructing a “spectrum” of programming languages arranged according to the level of abstraction at which they operate. While an actual construction of such a spectrum is not really a well-defined task, we may certainly ask ourselves what languages would lie at its boundaries — what are the lowest- and highest-level programming languages?

At the lowest level, we can program a computer by specifying the sequence of electrical impulses that passes through the processor. Programming of this nature amounts essentially to specifying the sequence of 1’s and 0’s that make up an executable program.

At the highest level, we might imagine a language that allows us to simply describe the characteristics of the program we seek, and then allow the computer to perform the task of meeting the criteria we laid out. The language we use to describe our programs should ideally be completely devoid of any references to computers or implementation details; it should simply outline the constraints to be satisfied in order to solve the problem at hand.

An ideal candidate for such a language is predicate logic. By defining an appropriate problem domain and set of predicates, it is possible to use predicate logic to describe any computable problem. Moreover, predicate logic is a language of mathematics, and its existence predates computers. Indeed, a system in which we may program purely in predicate logic is an ideal to which many in the declarative programming community aspire. In this chapter, we discuss the logic programming paradigm, and its best-known representative language, Prolog.

Prolog is our exemplar, but it’s actually more than that for logic programming; virtually all languages in the logic programming paradigm are based on Prolog. In a very real sense, logic programming is Prolog.

Logic programming forms one branch of a family of query languages, and we will explore query programming more broadly at the end of this module.


1. Programming in Logic

We begin with a summary of the language of predicate logic. Strings in this language (called formulas) are composed of the following elements:

  • constants, denoted by \( a, b, c, \ldots \), possibly subscripted;
  • function letters, denoted by \( f, f_1, f_2, \ldots, g, g_1, g_2, \ldots \), etc.;
  • predicate symbols, denoted by (possibly subscripted) capital letters;
  • variables, denoted by \( x, x_1, x_2, \ldots, y, y_1, y_2, \ldots \), etc.;
  • grouping and delimiting symbols: \( (, ) \) and \( , \);
  • logical connectives: \( \neg, \to, \lor, \land \) (negation, implication, disjunction, and conjunction, respectively);
  • quantifiers: \( \forall \) and \( \exists \).

With each function symbol and each predicate symbol, we associate a number \( n \geq 0 \), known as its arity. The arity of a function or predicate denotes the number of arguments it will accept. The particular language under consideration depends on our choice of constants, function letters, and predicate symbols. For example, suppose we introduce the unary function symbol \( s \) and the binary predicate \( g \). Then the following is a formula in our predicate language:

\[ \forall x.\, g(s(x), x) \]

We might choose to interpret this formula as follows: we take as our domain of consideration the set of natural numbers, so that the variable \( x \) may range over the values \( 1, 2, \ldots \). We then associate the function \( s \) with the successor function \( \lambda x.\, x + 1 \), and the predicate \( g \) with the binary predicate \( \lambda(x,y).\, x > y \). Then the formula reads: “For every natural number \( x \), the successor of \( x \) is greater than \( x \),” which, in this domain of interpretation, happens to be true. We assume at this point that you have sufficient background in logic to distinguish between well-formed and ill-formed formulas in predicate languages.

The logical connectives are those used in logic in general:

  • \( \neg X \) is true if \( X \) is false (logical negation).
  • \( X \to Y \) means “\( X \) implies \( Y \)”, and so is true if, in all cases that \( X \) is true, \( Y \) is also true. If \( X \) is always false, then the implication is said to be “vacuously true”.
  • \( X \lor Y \) is true if either \( X \) or \( Y \) is true (logical or).
  • \( X \land Y \) is true if both \( X \) and \( Y \) are true.

A sequence of clauses joined by \( \lor \) is called a disjunction, while a sequence of clauses joined by \( \land \) is called a conjunction.

Aside: The only way this author can ever remember the difference between \( \lor \) and \( \land \) is to observe that \( \land \) looks a bit like an ‘A’, as in “AND”. Perhaps the same mnemonic will help you.

The logic programming paradigm proceeds as follows: we first assert the truth of a set of formulas in our predicate language. Our set of assertions forms a database. We then either query the database about the truth of a particular formula, or we present a formula containing variables and ask the system to determine which instantiations of the variables satisfy the formula, i.e., make it evaluate to true.

This task, as stated above, is quite difficult, as predicate languages admit a wide variety of qualitatively different formulas. To make the task more feasible, we will show how formulas in predicate languages may be placed in a more uniform format known as clausal form. The exact constraints of clausal form are best understood by walking through the process of transforming an arbitrary formula into it.

We first note that we can get rid of the logical implication symbol, \( \to \), by replacing all occurrences of \( A \to B \) by \( \neg A \lor B \). For example:

\[ \forall x.\,(A(x) \to B(x, f(y))) \quad\text{becomes}\quad \forall x.\,(\neg A(x) \lor B(x, f(y))) \]

Next, we can move all quantifiers to the left. Doing so will produce a formula of the form \( \langle\textit{quantifiers}\rangle\; \langle\textit{body}\rangle \). To do this we observe the following two rules:

\[ \neg\forall x.\,A \quad\text{becomes}\quad \exists x.\,\neg A \]\[ \neg\exists x.\,A \quad\text{becomes}\quad \forall x.\,\neg A \]

Quantifiers may be pulled outside of conjunctions and disjunctions unchanged. Note, however, that if moving a quantifier to the left introduces a variable capture, then we must perform an α-conversion and rename the quantified variable to a fresh one. That is, to simplify an expression like this one:

\[ (\forall x.\,A(x)) \land (\forall x.\,B(x)) \]

we must first rename one or both of the \( x \)’s:

\[ (\forall x.\,A(x)) \land (\forall x'.\,B(x')) \]

Then we may move the quantifiers out:

\[ \forall x.\,\forall x'.\,A(x) \land B(x') \]

Note also that we may not reorder quantifiers with respect to each other. That is, when moving quantifiers to the left, we must respect the left-to-right order in which they occurred in the original formula. For example, the formula \( \neg(\exists x.\,A(x) \land \forall x.\,B(x)) \) becomes \( \forall x.\,\neg(A(x) \land \forall x.\,B(x)) \), and then \( \forall x.\,\exists y.\,\neg(A(x) \land B(y)) \).

Next, we can eliminate existentially quantified variables via a process known as Skolemization: essentially, we replace them with freshly-invented constants, known as Skolem constants. For example, \( \exists x.\,\exists y.\,(A(x,y) \lor B(x)) \) becomes \( A(s_1, s_2) \lor B(s_1) \), where \( s_1 \) and \( s_2 \) are Skolem constants.

Aside: Actually, we’ve been Skolemizing throughout the course! Every time we’ve had a metavariable in the Post system, that’s actually a Skolemized existential quantifier.

Skolemization becomes more difficult in the presence of universal quantifiers. Consider the expression \( \exists y.\,\forall x.\,P(x,y) \). This expression asserts that there is a value of \( y \) such that \( P(x,y) \) holds of every \( x \). Hence we may replace \( y \) with a Skolem constant \( s_1 \) that represents the particular value of \( y \) for which the assertion holds. However, consider the expression \( \forall x.\,\exists y.\,P(x,y) \). In this case, the assertion does not guarantee that there is a single choice of \( y \) for which \( P(x,y) \) will hold for all \( x \); rather it asserts that a suitable value of \( y \) can be found for each value of \( x \). Thus, the value of \( y \) is dependent on the value of \( x \), and therefore it would not be appropriate to replace \( y \) with a Skolem constant. Rather, we should replace \( y \) with a newly-invented Skolem function parameterized by \( x \). Thus, we Skolemize \( \forall x.\,\exists y.\,P(x,y) \) to \( \forall x.\,P(x, s_1(x)) \), where \( s_1 \) is a Skolem function.

In general, an existentially quantified variable \( x \) is replaced with a Skolem function parameterized by all of the universally quantified variables whose quantifiers occur to the left of \( x \)’s quantifier. If there are no universal quantifiers to the left of \( x \)’s quantifier, then \( x \) may be replaced with a Skolem constant.
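As a concrete illustration, the Skolemization step can be sketched in Python. The term encoding and the names (`skolemize`, Skolem functions `s1`, `s2`, …) are our own, invented for this sketch; they are not part of any Prolog system. The sketch walks a prenex quantifier list left to right, replacing each existentially quantified variable with a Skolem function of the universal variables quantified to its left (a Skolem constant if there are none):

```python
from itertools import count

def skolemize(prefix, matrix):
    """Skolemize a prenex formula.

    `prefix` is a list of ("forall" | "exists", var) pairs, left to right;
    `matrix` is a term: a variable name (str) or a (functor, [args]) tuple.
    Each existential variable is replaced by a Skolem function applied to
    the universal variables quantified to its left (a constant if none).
    Returns the remaining universal prefix and the rewritten matrix.
    """
    fresh = count(1)
    subst, universals = {}, []
    for quant, var in prefix:
        if quant == "forall":
            universals.append(var)
        else:  # existential: invent a Skolem function s_i(universals...)
            subst[var] = ("s%d" % next(fresh), [u for u in universals])

    def apply(t):
        if isinstance(t, str):                      # a variable
            return subst.get(t, t)
        functor, args = t
        return (functor, [apply(a) for a in args])

    return universals, apply(matrix)

# forall x. exists y. P(x, y)   ==>   forall x. P(x, s1(x))
print(skolemize([("forall", "x"), ("exists", "y")], ("P", ["x", "y"])))
```

On the example above, the sketch yields the universal prefix containing only x and the matrix P(x, s1(x)), exactly as described in the text.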

In this way, we may convert any formula in our predicate language to one without implications, in which the only quantifiers that appear are universal quantifiers, and in which all of the quantifiers occur leftmost in the formula.

Free variables in our predicate language are variables not associated with a quantifier; hence their meaning is assumed to be externally provided. Since we seek self-contained formulas defining the characteristics of the problems we wish to solve, we shall assume that there are no free variables. Hence, after the transformations we have performed so far, all of the variables in the formula are universally quantified. Thus, the presence of the quantifiers provides no additional information and for convenience we shall omit them. Thus, for example, we shall write the Skolemized formula \( \forall x.\,P(x, s_1(x)) \) as \( P(x, s_1(x)) \), and understand that the former is meant when we use the latter.

What remains is a quantifier-free expression involving primitive predicates and the logical connectives \( \neg \), \( \lor \), and \( \land \). Next, we will push the logical negations inward, so that negation may only be applied to lone predicates. To do this, we use the following transformation rules:

\[ \neg(A \land B) = \neg A \lor \neg B \]\[ \neg(A \lor B) = \neg A \land \neg B \]\[ \neg\neg A = A \]

Since we have pushed the negations all the way down to individual predicates, we may consider them to be part of the predicates and ignore them for the moment. We shall call an occurrence of a predicate negated if it has a logical negation operator applied to it.

We now have a formula consisting of (possibly negated) predicates that have been combined using \( \land \) and \( \lor \). We may distribute \( \lor \) over \( \land \) using the following rule:

\[ A \lor (B \land C) = (A \lor B) \land (A \lor C) \]

Since \( \lor \) is commutative, the rule applies equally when the conjunction appears on the left. Putting this all together, we can convert a formula to what’s called conjunctive normal form (CNF), in which the formula is written as a set of clauses, connected by \( \land \), and each clause is a set of (possibly negated) predicates, connected by \( \lor \). For example, the formula

\[ (P(x) \lor Q(x) \lor R(y,z)) \land P(y) \land (S(z) \lor Q(z)) \]

is in CNF.
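The three transformations described so far (eliminating \( \to \), pushing negations inward, and distributing \( \lor \) over \( \land \)) can be sketched in Python on the propositional skeleton of a formula. The tuple encoding and the name `cnf` are ours, invented for this sketch; connectives are binary, and predicates are plain strings:

```python
def cnf(f):
    """Convert a quantifier-free formula to conjunctive normal form.
    Formulas are strings (predicates) or tuples:
    ("not", f), ("and", f, g), ("or", f, g), ("implies", f, g)."""
    def elim_implies(f):
        if isinstance(f, str):
            return f
        if f[0] == "implies":                  # A -> B  becomes  ~A \/ B
            return ("or", ("not", elim_implies(f[1])), elim_implies(f[2]))
        return (f[0],) + tuple(elim_implies(g) for g in f[1:])

    def push_not(f, neg=False):
        if isinstance(f, str):
            return ("not", f) if neg else f
        if f[0] == "not":
            return push_not(f[1], not neg)
        if f[0] == "and":
            op = "or" if neg else "and"        # De Morgan's laws
        else:
            op = "and" if neg else "or"
        return (op, push_not(f[1], neg), push_not(f[2], neg))

    def distribute(f):
        if isinstance(f, str) or f[0] == "not":
            return f
        a, b = distribute(f[1]), distribute(f[2])
        if f[0] == "and":
            return ("and", a, b)
        # A \/ (B /\ C)  becomes  (A \/ B) /\ (A \/ C), and symmetrically
        if not isinstance(b, str) and b[0] == "and":
            return ("and", distribute(("or", a, b[1])), distribute(("or", a, b[2])))
        if not isinstance(a, str) and a[0] == "and":
            return ("and", distribute(("or", a[1], b)), distribute(("or", a[2], b)))
        return ("or", a, b)

    return distribute(push_not(elim_implies(f)))
```

For instance, `cnf(("or", "P", ("and", "Q", "R")))` distributes \( P \lor (Q \land R) \) into \( (P \lor Q) \land (P \lor R) \), as the rule above prescribes.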

Since conjunction is commutative and associative, we may dispense with the conjunctions and simply view a formula as a set of clauses; for example, we may view the example from the previous paragraph as the set:

\[ \{ P(x) \lor Q(x) \lor R(y,z),\; P(y),\; S(z) \lor Q(z) \} \]

Each clause is a disjunction of (possibly negated) predicates. Since disjunction is commutative and associative, we may reorder the predicates so that the unnegated predicates occur first and the negated predicates occur last. Thus, a clause has the form:

\[ P_1 \lor \cdots \lor P_j \lor \neg P_{j+1} \lor \cdots \lor \neg P_n \]

After applying De Morgan’s law (\( \neg A \lor \neg B = \neg(A \land B) \)), we can collect the negations and convert the clause to the form:

\[ P_1 \lor \cdots \lor P_j \lor \neg(P_{j+1} \land \cdots \land P_n) \]

We can also reintroduce implication and convert the clause to the form:

\[ (P_{j+1} \land \cdots \land P_n) \to (P_1 \lor \cdots \lor P_j) \]

Finally, we reverse the formula and introduce the symbol :- to replace (and reverse) \( \to \), obtaining:

\[ (P_1 \lor \cdots \lor P_j) \mathbin{:-} (P_{j+1} \land \cdots \land P_n) \]

Thus, we can express any formula in our predicate language as a single implication relating a conjunction to a disjunction, which we call clausal form. The disjunction \( P_1 \lor \cdots \lor P_j \) is known as the head of the clause, and the conjunction \( P_{j+1} \land \cdots \land P_n \) is called the body of the clause. The meaning of the clause is that if the body of the clause is true, then the head of the clause is true.

The major advantage of clausal form is that it admits a particularly simple method of inference known as resolution, which we may express by the following inference rule:

\[ \dfrac{ (P_1 \lor \cdots \lor P_i \lor R) \mathbin{:-} (P_{i+1} \land \cdots \land P_m) \qquad (Q_1 \lor \cdots \lor Q_j) \mathbin{:-} (R \land Q_{j+1} \land \cdots \land Q_n) }{ (P_1 \lor \cdots \lor P_i \lor Q_1 \lor \cdots \lor Q_j) \mathbin{:-} (P_{i+1} \land \cdots \land P_m \land Q_{j+1} \land \cdots \land Q_n) } \]

That is, you can split an arbitrarily long clause in two around a new predicate symbol \( R \): if the first part of the body is true, then either the first part of the head is true, or something from the rest of the head is true, represented by \( R \). If the second part of the body is true or, by way of \( R \), the first part of the body was true but the first part of the head was not, then something from the second part of the head is true. We can repeat this process as often as we like until we’re left with single queries.

Resolution has the important property that if a set of clauses is logically inconsistent (i.e. cannot be satisfied), then resolution will be able to derive a contradiction, denoted by the empty clause, :-. Conversely, if resolution is able to derive :-, then the set of clauses is inconsistent. Thus, to determine whether the set of clauses \( \{C_1, \ldots, C_n\} \) implies clause \( D \), we need only show, by resolution, that the set of clauses \( \{C_1, \ldots, C_n, \neg D\} \) is inconsistent (i.e. that it can derive :-).
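To make the refutation procedure concrete, here is a purely propositional sketch in Python (predicate arguments and unification are omitted; the clause encoding and the names `resolve` and `inconsistent` are ours, invented for this sketch). A clause is a set of literals, a literal is negated with a leading `~`, and deriving the empty clause signals inconsistency:

```python
def neg(lit):
    """Negate a literal: P <-> ~P."""
    return lit[1:] if lit.startswith("~") else "~" + lit

def resolve(c1, c2):
    """All resolvents of two clauses (clauses are frozensets of literals)."""
    out = []
    for lit in c1:
        if neg(lit) in c2:
            out.append((c1 - {lit}) | (c2 - {neg(lit)}))
    return out

def inconsistent(clauses):
    """Saturate the clause set under resolution; report whether the
    empty clause (a contradiction) is derivable."""
    clauses = {frozenset(c) for c in clauses}
    while True:
        new = set()
        for c1 in clauses:
            for c2 in clauses:
                for r in resolve(c1, c2):
                    if not r:
                        return True      # derived the empty clause
                    new.add(frozenset(r))
        if new <= clauses:
            return False                 # saturated without contradiction
        clauses |= new

# {P :- Q,  Q,  :- P}: asserting the negation of P contradicts the rest.
print(inconsistent([{"P", "~Q"}, {"Q"}, {"~P"}]))   # prints True
```

This mirrors the strategy described above: to show that a set of clauses implies \( D \), add \( \neg D \) to the set and use resolution to derive the empty clause.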

We shall concern ourselves primarily with a particular kind of clause, known as a Horn clause, defined as follows:

Definition 1. (Horn Clause) A Horn clause is a clause whose head contains at most one predicate. A headed Horn clause is a Horn clause whose head contains exactly one predicate. A headless Horn clause is a Horn clause whose head contains no predicates.

Since the head is a disjunction of results of an implication, and the body is a conjunction of premises of an implication, a headed Horn clause means “if all of these facts are true, then this one fact is true”. A headless Horn clause implies nothing, so it’s simply a query: it either is true or it is not. Horn clauses may also have no body, in which case they are simply a declaration of a fact. A body-less Horn clause is written without :-.

As it happens, Horn clauses are sufficient to express any computable function. Further, we may express any computable function as a set of Horn clauses with exactly one headless Horn clause. The headless Horn clause represents the “query”, and has the form \( {:}{\text{-}}\; P_1 \land \cdots \land P_n \). By using resolution, coupled with unification for matching, we may decide whether the query represented by the headless Horn clause may be satisfied, and if it can, which substitutions must be applied to the variables in the query in order to satisfy the clauses. This is the essence of logic programming.


2. Introduction to Prolog

Prolog is by far the most famous logic programming language. It was invented in the early 1970s by Alain Colmerauer and his colleagues at the University of Marseilles. The name ‘Prolog’ comes from Programmation en Logique.

Prolog programs have three major components: facts, rules, and queries. Facts and rules are entered into a database (in the form of a Prolog file) before computation. Queries are entered interactively, and are the means by which Prolog programs are “run”.

Example 1. The following is an example of a fact in Prolog:

cogito(descartes).

This fact establishes a predicate “cogito” of arity 1 (that is, the predicate “cogito” takes a single argument), usually denoted cogito/1. The token cogito is known as the functor. Notice that facts in Prolog end with a period (.). The token descartes represents a symbolic constant that satisfies the predicate cogito/1. That is, when the predicate cogito/1 is supplied with the argument descartes, the result is true.

Interpretation of the meaning of a predicate and its arguments is left to the programmer. In the example above, a reasonable interpretation might be “Descartes thinks” (Descartes cogitat).

We may use queries to retrieve from the database the information stored in facts. Consider the following queries:

?- cogito(descartes).
Yes
?- cogito(aristotle).
No

In the first query, we ask the database whether the assertion cogito(descartes) can be verified. Prolog answers with Yes, since the fact cogito(descartes) occurs in the database. When we query cogito(aristotle) (presumably asking whether Aristotle thinks), Prolog responds with No, as this assertion cannot be deduced from the database, which currently contains only the fact cogito(descartes).

A database consisting only of facts is generally of little interest, as the amount of information it can provide is limited by the number of facts present. A Prolog database will usually also contain a number of rules, by which new facts may be deduced from existing facts. The following is an example of a rule:

sum(X) :- cogito(X).

Reversing this to more conventional predicate logic, it can be read as:

\[ \forall X.\,(cogito(X) \to sum(X)) \]

This rule establishes a predicate sum/1. The token X is a variable, and is universally quantified. Variables in Prolog must begin with a capital letter. All other identifiers must begin with a lowercase letter. A rule is essentially a Horn clause: the rule sum(X) :- cogito(X) essentially states that for all X, sum(X) is satisfied whenever cogito(X) is satisfied. That is, cogito, ergo sum: I think, therefore I am.

We may now issue queries containing our new predicate, sum/1, to the database:

?- sum(descartes).
Yes
?- sum(aristotle).
No

As expected, the query sum(descartes) succeeds and the query sum(aristotle) fails. The former can be deduced from the fact cogito(descartes); the latter cannot be deduced.

Queries may also contain variables. If the query succeeds, then Prolog reports a satisfying assignment for each variable in the query. For example, we may query the predicate sum/1 as follows:

?- sum(X).
X = descartes

When a query contains variables, it is possible for more than one instantiation of the variables to satisfy the query. For example, if we add the fact cogito(aristotle) to the end of our database, then the query ?- sum(X). will have two solutions; namely, X = descartes and X = aristotle. Prolog returns the first solution it finds in the database (X = descartes) and then waits for the user to press a key. If we press ;, then Prolog will return the next solution it finds in the database, or No, if none remain. If we press Enter, Prolog abandons the computation and returns to the interactive prompt.

A Prolog query is essentially a headless Horn clause. Hence, just as a Horn clause may have a conjunction of several predicates in its body, so may a Prolog query. For example, suppose we add the following fact to the end of our database:

philosopher(descartes).

Suppose now that we issue the following query:

?- cogito(X), philosopher(X).

This query is satisfied by only those values of X satisfying both cogito(X) and philosopher(X). Thus, in this case, the only answer returned by Prolog is:

X = descartes

Of course, everyone who thinks is a philosopher in their own way, so we may add the following clause to our database:

philosopher(X) :- cogito(X).

And if we query philosopher(X) now, we get that both descartes and aristotle are philosophers. Interestingly, the fact philosopher(descartes) is established in two ways: he is a philosopher because he is a thinker, but we have also stated as a plain fact that he is a philosopher. These two ways of stating clauses do not conflict with each other, even if they imply the same result.


3. Prolog Data Structures

The Prolog programming language shares a feature common in untyped functional languages, in particular Lisp and Scheme, known as uniformity (often called homoiconicity): Prolog code and Prolog data are identical in appearance. Data structures in Prolog are known simply as structures, and consist of a functor, followed by zero or more components. The following is an example of a Prolog structure:

book(greatExpectations, dickens)

The functor is book, and the components are greatExpectations and dickens, presumably representing the title and author. If we instead wrote this same term on its own, terminated by a period, it would be a declaration that book(greatExpectations, dickens) is a fact. We will see momentarily how context distinguishes structures from facts.

Prolog is an untyped language; hence there is no need to “declare” structures or to use them in a uniform way. In addition to the structure illustrated above, we may also use the following structures in the same program:

book(greatExpectations)
book(great,expectations,charles,dickens)
book(greatExpectations,author(dickens,charles))
book

Note that a nullary structure (e.g. book above) is just an atom, and, in fact, a symbol. Also notice from the third example above that structures may be nested.

We see, then, that structures have exactly the same general appearance as predicates. We distinguish one from the other by context. Structures may only appear inside predicates; predicates cannot usually appear inside other predicates. Consider the following Prolog fact:

owns(jean, book(greatExpectations, dickens)).

Here, owns/2 is not nested inside of anything else; hence it is a predicate. On the other hand, book/2 is nested inside of owns/2; hence it is a structure. jean, greatExpectations, dickens, and, from before, descartes and aristotle are also structures, namely nullary structures (atoms). Given this fact, we may now issue the following queries:

?- owns(jean,X).
X = book(greatExpectations, dickens)
?- owns(X, book(greatExpectations, dickens)).
X = jean
?- owns(jean, book(greatExpectations, X)).
X = dickens
?- owns(X, book(Y, dickens)).
X = jean, Y = greatExpectations

The following query, in which we ask for the name of the functor, is not valid, since a variable may not appear in functor position:

?- owns(jean, X(greatExpectations, dickens)).

In addition to structures, Prolog supports lists. Lists in Prolog are enclosed in square brackets, and their elements separated by commas. For example,

[10, 20, 30]

represents the list consisting of 10, followed by 20, followed by 30. The empty list is written as []. Prolog’s “cons” operator is the vertical bar, |. The expression [X|Y] denotes the list whose elements consist of X, followed by the elements of Y (Y is assumed to be a list). For example, the list [10, 20, 30] may also be written [10 | [20, 30]]. We may also write any finite number of list elements before the |. For example, [10,20,30] can also be written as [10, 20 | [30]] and as [10, 20, 30|[]]. This flexible cons syntax allows us to specify queries such as [10, 20 | X], assigning variables to only part of a list.


4. How Prolog Answers Queries

In general, logic programming is an exercise in theorem-proving. Given a set of clauses, we use resolution to prove them consistent or inconsistent, and use the resulting witness of consistency, or proof of inconsistency, to answer a query. Hence, logic programming language interpreters behave much like theorem-proving software.

In the abstract, given a set of clauses, we can use resolution to derive an inconsistency by combining clauses as we see fit. In other words, there is no fixed sequence of alternatives to try, and we are free to rely on our intuition in our search for a derivation of the empty clause.

However, in real programming tasks, we often prize predictable behavior, so that we may make assertions about the properties of programs. Hence logic programming languages take a deterministic approach to the resolution problem, and in the case of Prolog, the deterministic order is defined as part of the language.

Effective Prolog programming requires that we understand the mechanisms by which Prolog answers queries. Execution of Prolog programs is based on two principles: unification, and depth-first search. Here, unification refers to Robinson’s unification algorithm, which we saw in the previous module. Depth-first search refers to the strategy that Prolog employs when searching through the database. Essentially, Prolog attempts to satisfy a query by searching the database from top to bottom, attempting to unify each fact, and the head of each rule, with the query. If the query unifies with a fact, then Prolog returns the MGU, or simply “Yes” if the MGU is the empty substitution (i.e. if there are no variables in the query). If a query unifies with the head of a rule, then Prolog applies the MGU to the body of the rule and recursively attempts to satisfy the body of the rule. If the search terminates without finding a match to the query, Prolog returns “No”.

This style of execution is so far removed from the step-wise execution we’ve seen before (the \( \to \) morphism) that it’s rarely useful to describe it in that way. Instead, we simply describe the unification and search algorithms; the surprising result is that by implementing these algorithms, we have implemented a programming language, even though no component seems to behave like an interpreter or compiler.

In the context of Prolog, unification is somewhat more complex than what we presented in Module 5. We present the details here.

Definition 2. (Unification Algorithm for Prolog Predicates)

Let the name \( R \), possibly subscripted, range over relations (that is, predicates and structures). Let \( O \), possibly subscripted, range over objects (atoms). Let \( X \), possibly subscripted, range over variables. Let \( T \), possibly subscripted, range over terms (relations, objects, and variables). Let \( S \), possibly subscripted, range over substitutions.

Then the unification algorithm for Prolog predicates is as follows:

\[ U(O_1, O_2) = \begin{cases} [\,] & \text{if } O_1 = O_2 \\ \mathit{fail} & \text{otherwise} \end{cases} \]\[ U(O, R(T_1, \ldots, T_n)) = \mathit{fail}, \quad n \geq 1 \]\[ U(X, T) = [T/X] \]\[ U(R_1(T_{1,1}, \ldots, T_{1,n}),\; R_2(T_{2,1}, \ldots, T_{2,m})) = \begin{cases} \mathit{fail} & \text{if } R_1 \neq R_2 \\ \mathit{fail} & \text{if } m \neq n \\ S_1 S_2 \cdots S_n & \text{otherwise, where:} \\ \quad S_1 = U(T_{1,1},\, T_{2,1}) \\ \quad S_2 = U(T_{1,2} S_1,\, T_{2,2} S_1) \\ \quad \vdots \\ \quad S_n = U(T_{1,n}(S_1 S_2 \cdots S_{n-1}),\, T_{2,n}(S_1 S_2 \cdots S_{n-1})) \end{cases} \]

The astute reader will notice that this formulation of unification omits the occurs-check. Recall that the occurs-check would place a condition on the rule for \( U(X, T) \), which would require that \( X \) not occur in \( T \) if \( T \) is not equal to \( X \). Indeed, omitting the occurs-check carries the same potential for trouble here as it does in the theory of type assignment. Nevertheless, Prolog omits the occurs-check because it is expensive to perform. As a result, it is possible to enter facts and queries into Prolog that would result in circular unification and cause Prolog to return answers of infinite size. For example, suppose we place the following fact in the database:

equal(X, X).

Then suppose that we issue the following query:

?- equal(X, f(X)).
X = f(f(f(f(f(f(f(f(f(f(...))))))))))

The result is a nonsensical substitution in which X is assigned an expression of infinite length.

This is less of a problem for Prolog than it is for type-checking, because while it’s critical for type checking to terminate (i.e., the type judgment must be decidable), Prolog is a Turing-complete language, so there exist programs that won’t terminate anyway. Nonetheless, in some cases, we might like Prolog to perform the occurs-check despite its cost. In such cases, we may use the built-in predicate unify_with_occurs_check/2. Using this predicate, we would rewrite our equality predicate as follows:

equal(X,Y) :- unify_with_occurs_check(X,Y).
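The algorithm of Definition 2, extended with the optional occurs-check just discussed, can be sketched in Python. The term encoding (capitalized strings for variables, lowercase strings for atoms, (functor, [args]) pairs for compound terms) and all function names are our own, invented for this sketch:

```python
def is_var(t):
    """Variables are capitalized strings (as in Prolog)."""
    return isinstance(t, str) and t[:1].isupper()

def subst(s, t):
    """Apply substitution s (a dict from variables to terms) to term t,
    following chains of bindings."""
    if is_var(t):
        return subst(s, s[t]) if t in s else t
    if isinstance(t, tuple):
        functor, args = t
        return (functor, [subst(s, a) for a in args])
    return t

def occurs(x, t):
    """Does variable x occur anywhere in term t?"""
    if is_var(t):
        return x == t
    if isinstance(t, tuple):
        return any(occurs(x, a) for a in t[1])
    return False

def unify(t1, t2, s=None, occurs_check=False):
    """Return a most general unifier (as a dict) of t1 and t2, or None.
    Like Prolog, the occurs-check is off by default; without it, a call
    such as unify("X", ("f", ["X"])) builds a circular binding."""
    if s is None:
        s = {}
    t1, t2 = subst(s, t1), subst(s, t2)
    if is_var(t1):
        if t1 == t2:
            return s
        if occurs_check and occurs(t1, t2):
            return None
        return {**s, t1: t2}
    if is_var(t2):
        return unify(t2, t1, s, occurs_check)
    if isinstance(t1, str) or isinstance(t2, str):   # at least one atom
        return s if t1 == t2 else None
    if t1[0] != t2[0] or len(t1[1]) != len(t2[1]):   # functor/arity clash
        return None
    for a1, a2 in zip(t1[1], t2[1]):                 # unify argument-wise,
        s = unify(a1, a2, s, occurs_check)           # threading the MGU so far
        if s is None:
            return None
    return s

# owns(jean, X) unifies with owns(jean, book(ge, dickens)):
print(unify(("owns", ["jean", "X"]),
            ("owns", ["jean", ("book", ["ge", "dickens"])])))
# with the occurs-check on, X cannot unify with f(X):
print(unify("X", ("f", ["X"]), occurs_check=True))   # prints None
```

Note how the substitution computed for each argument is applied before unifying the next, exactly as in the final rule of Definition 2.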

Unification is the mechanism by which Prolog matches queries against facts and rules in the database. To complete our understanding of Prolog’s execution model, we now discuss the strategy by which Prolog searches the database for matches.

When attempting to answer a query, Prolog always searches the database from the top down. Prolog attempts to satisfy the predicates in a query one at a time, starting from the left and working to the right. If a predicate in the query matches the head of a rule, then the body of the rule is added to the front of the list of predicates to be satisfied (remember that :- is implication in reverse!), and Prolog recursively tries to satisfy the new query. In this way, Prolog’s search strategy is based on depth-first search. When adding predicates in this way, it may be necessary to rename variables, much as we did when substituting in λ-applications: otherwise, unrelated variables that happen to share a name could be interpreted as the same variable.

Example 2. Consider the following database:

p(X) :- q(X), r(X).    % 1
q(X) :- s(X), t(X).    % 2
s(a).                   % 3
s(b).                   % 4
s(c).                   % 5
t(b).                   % 6
t(c).                   % 7
r(c).                   % 8

Suppose that we then enter the following query:

?- p(X).

To satisfy this query, Prolog searches the database for a fact, or the head of a rule, that unifies with p(X). The first (indeed, the only) match is the first rule, p(X) :- q(X), r(X); the MGU is the empty substitution. Thus, Prolog replaces p(X) in the query with q(X), r(X) and attempts to answer the new query. Prolog searches the database for a match for q(X), and finds the rule q(X) :- s(X), t(X). Thus, it replaces q(X) with s(X), t(X) in the query; the query becomes s(X), t(X), r(X), and Prolog attempts to satisfy s(X). The first match is the fact s(a); the MGU is [a/X]. Prolog applies the MGU to the remainder of the query and attempts to satisfy the query t(a), r(a). As there is no match for t(a) in the database, the current answer fails, and Prolog searches for another answer to s(X). The second match is s(b) with MGU [b/X]. Prolog applies the MGU and then attempts to satisfy t(b), r(b). There is a match for t(b) in the database, but not for r(b), and so this answer fails as well. Thus, Prolog backtracks and tries the third and final answer for s(X), namely s(c), with MGU [c/X]. Prolog applies the MGU and tries to satisfy t(c), r(c). As both of these facts occur in the database, Prolog has found a successful match and returns the MGU, X = c.

Alternatively, this search with backtracking can be shown with indentation:

p(X)
  q(X), r(X)
    s(X), t(X), r(X)
      s(a), t(X), r(X)  [a/X]
        t(a), r(a)
          fail
      s(b), t(X), r(X)  [b/X]
        t(b), r(b)
          r(b)
            fail
      s(c), t(X), r(X)  [c/X]
        t(c), r(c)
          r(c)
            solution: X=c
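The whole search procedure — unification plus depth-first, top-to-bottom traversal of the database, with variable renaming at each clause use — fits in a short Python sketch. The encoding and names are our own, invented for illustration; this is a teaching sketch, not an efficient implementation:

```python
from itertools import count

fresh = count(1)

def is_var(t):
    # variables are capitalized strings; renamed variables begin with "_"
    return isinstance(t, str) and (t[:1].isupper() or t[:1] == "_")

def walk(t, s):
    """Follow variable bindings in substitution s."""
    while is_var(t) and t in s:
        t = s[t]
    return t

def unify(t1, t2, s):
    """Return an extended substitution unifying t1 and t2, or None."""
    t1, t2 = walk(t1, s), walk(t2, s)
    if t1 == t2:
        return s
    if is_var(t1):
        return {**s, t1: t2}
    if is_var(t2):
        return {**s, t2: t1}
    if (isinstance(t1, tuple) and isinstance(t2, tuple)
            and t1[0] == t2[0] and len(t1[1]) == len(t2[1])):
        for a1, a2 in zip(t1[1], t2[1]):
            s = unify(a1, a2, s)
            if s is None:
                return None
        return s
    return None

def rename(t, mapping):
    """Give each clause a fresh copy of its variables before use."""
    if is_var(t):
        return mapping.setdefault(t, "_V%d" % next(fresh))
    if isinstance(t, tuple):
        return (t[0], [rename(a, mapping) for a in t[1]])
    return t

def solve(goals, db, s):
    """Depth-first, top-to-bottom search, as Prolog performs it: try each
    clause whose renamed head unifies with the first goal, then recursively
    solve the clause body followed by the remaining goals."""
    if not goals:
        yield s
        return
    first, rest = goals[0], goals[1:]
    for head, body in db:
        mapping = {}
        head2 = rename(head, mapping)
        body2 = [rename(g, mapping) for g in body]
        s2 = unify(first, head2, s)
        if s2 is not None:
            yield from solve(body2 + rest, db, s2)

# the database of Example 2: (head, body) pairs; facts have empty bodies
db = [
    (("p", ["X"]), [("q", ["X"]), ("r", ["X"])]),
    (("q", ["X"]), [("s", ["X"]), ("t", ["X"])]),
    (("s", ["a"]), []),
    (("s", ["b"]), []),
    (("s", ["c"]), []),
    (("t", ["b"]), []),
    (("t", ["c"]), []),
    (("r", ["c"]), []),
]

solutions = [walk("X", s) for s in solve([("p", ["X"])], db, {})]
print(solutions)   # prints ['c']
```

Running the sketch on the query p(X) reproduces the backtracking trace above: the bindings a and b fail, and the single solution is X = c.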

We can best illustrate Prolog’s backtracking search strategy using a structure known as a tree of choices. The following three examples were originally presented by Cohen.

Example 3. Suppose we have the following simple database:

p :- q, r, s.   % 1
p :- t, u.      % 2
q :- u.         % 3
t.              % 4
u.              % 5
p :- u.         % 6

Given this database, we may draw a tree of choices for the query ?- p, t.: each internal node represents the current query being evaluated, and has exactly six children, one for each entry in the database. If an entry in the database matches the first predicate in the query, then the query is updated and evaluated in the subtree with the same number as the position of the matching entry in the database. If the query is exhausted, we have found a successful match, and output “Yes”. Thus, Prolog’s search strategy is essentially a depth-first traversal of this tree.

Prolog’s search is not guaranteed to terminate. Indeed, since Horn clauses can express any computable function, determining whether a Prolog query terminates is equivalent to the halting problem.

Example 4. Consider the following database, which is identical to the previous database, except for a change in the last entry:

p :- q, r, s.   % 1
p :- t, u.      % 2
q :- u.         % 3
t.              % 4
u.              % 5
p :- u, p.      % 6

The addition of the predicate p to the end of the last rule introduces recursion into the search. Now, when Prolog matches the query against the rule p :- u, p, after elimination of u by rule 5, we obtain our original query again. Hence, Prolog recursively answers the query ?- p, t again; since a success had previously been found, the result is an infinite sequence of “Yes” answers.

The behaviour of the query ?- p, t is dependent upon the ordering of entries in the database. Suppose we reorder the database as follows:

p :- u, p.      % 1
p :- q, r, s.   % 2
p :- t, u.      % 3
q :- u.         % 4
t.              % 5
u.              % 6

Here, we have simply moved the final entry to the top of the database. In this tree, the infinite recursion appears first. Hence, when attempting to evaluate the query ?- p, t, Prolog attempts to evaluate ?- u, p, t and then ?- p, t again. Thus, Prolog immediately gets caught in an infinite evaluation sequence, before ever generating a successful match for the query. In this case, Prolog falls into an “infinite loop” and never reaches the successful match that lies further down the tree.


5. Values and Operators in Prolog

The structures and lists we’ve already discussed are values in Prolog, but Prolog also supports numbers. Prolog clauses may contain numbers, and operators over those numbers. Prolog’s arithmetic comparison operators are >, <, >=, =<, is, =:=, and =\=. The last two operators represent numeric equality and inequality, respectively. Note that less-than-or-equal-to is =<, not the more common <=.

For instance, the following clause makes greaterThanTen(X) a fact when X is greater than ten:

greaterThanTen(X) :- X >= 10.

However, Prolog cannot search for facts about numbers in the same way that it can for structures. For instance, the query ?- greaterThanTen(X). does not succeed (most implementations report an instantiation error), because >= does not appear as a rule in the database to search; X must already have a fixed value for X >= 10 to be verifiable.

is and =:= are both numeric equality operators, differing only in how they treat integers as compared to floating-point numbers. 1 =:= 1.0 is true, but 1 is 1.0 is false. is is more common, simply because floating-point arithmetic is rare in Prolog in the first place. Prolog can unify using is, but only if one side has been fully resolved, and the other side is just a variable. For instance, X is Y + Z is unifiable only if Y and Z have already been resolved; if X and Y have already been resolved, it is of course mathematically trivial to discover the value of Z, but Prolog cannot unify in this way. That is, Prolog allows \( U(X, n) \), where \( n \) is an arithmetic expression with no variables after substitution.

In addition, Prolog has an = operator, which simply forces two values to unify. For instance, we could rewrite equal above to:

equal(X, Y) :- X = Y.

Note that this is unification, which is distinct from numeric equality. If one of X or Y is a number, then X = Y succeeds only if they are the same number, so in that case = and is agree. But, if either is not a number, then X = Y is true if they’re structurally identical, while X is Y is never true.
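
For instance, the following query shows unification at work on structures (the functor pair/2 is used purely for illustration):

?- pair(1, 2) = pair(X, Y).
X = 1
Y = 2

Here = unifies the two structures, binding X and Y; the corresponding query with is would never succeed, since neither side is a number.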

Most versions of Prolog also support strings, but we will not discuss them here, as they’re no more powerful than lists of character codes, which we can already build with lists and numbers. Strings are, of course, a bit more practical than lists of character codes.


6. Programming in Prolog

In this section, we consider the task of writing real programs in Prolog. Whereas the languages we have considered so far have been based on writing and combining functions, the only entities we may define in Prolog are predicates, which are based on facts and rules.

To mimic the behavior of a function with a predicate, we create a predicate with arguments for each of the arguments of the function, as well as the result(s). We then use facts and rules to describe the characteristics of the output with respect to the input. For example, suppose we wish to mimic the behaviour of the function \( f \) with arguments \( a \) and \( b \) and result \( c \). Then we would write a predicate f(a, b, c) and proceed to write facts and rules relating c to a and b. So, rather than saying that \( f(a, b) = c \), which is meaningless since all predicates are boolean, we say that f relates a, b, and c.

We first consider the problem of appending two lists. To append two lists, we will invent a ternary predicate append/3, which we interpret as follows: append(X, Y, Z) means that Z is the result of appending Y to X. It remains to relate X, Y, and Z, as follows:

append([], Y, Y).
append([X|T], Y, [X|Z]) :- append(T, Y, Z).

From the first rule, we see that the result of appending any list after the empty list is the list itself. From the second rule, we see that, if Z is the result of appending Y to T, then [X|Z] is the result of appending Y to [X|T]. We may then invoke append as follows:

?- append([1,2,3], [4,5,6], R).
R = [1,2,3,4,5,6]

It’s worth taking a moment to consider the style of programming here. We have not stated an algorithm for appending lists per se; rather, we have stated two facts:

  • That the concatenation of an empty list and any list is the second list, and
  • that if Z is the concatenation of T and Y, then [X|Z] is the concatenation of [X|T] and Y.

By reversing that, we can discover an algorithm for concatenating lists, by prepending one element at a time, but that algorithm is just the implied combination of these two rules and Prolog’s search algorithm.
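
Indeed, because append/3 states facts rather than an algorithm, the same two clauses can be run “backwards”. If we supply the result and leave the inputs as variables, Prolog’s search enumerates every way of splitting the list:

?- append(X, Y, [1,2]).
X = []
Y = [1,2] ;
X = [1]
Y = [2] ;
X = [1,2]
Y = [] ;
No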

Next, we will write a Prolog predicate to reverse a list. Using append/3 above, we may implement a predicate reverse/2 as follows:

reverse([], []).
reverse([X|Y], R) :- reverse(Y, Z), append(Z, [X], R).

Here, the reversal of the empty list is the empty list, and the reversal of any other list is the reversal of the tail of the list, followed by the head of the list. We may also implement reverse using an accumulator, as follows:

reverse(X, Y) :- reverse2(X, [], Y).
reverse2([], Y, Y).
reverse2([X|T], Y, R) :- reverse2(T, [X|Y], R).

Here, we repeatedly push list elements from the first argument of reverse2 onto the second argument of reverse2. We continue until the first argument is empty, at which point the second argument contains the answer.
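
Either formulation may be invoked in the usual way:

?- reverse([1,2,3], R).
R = [3,2,1]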

Again, what we’ve actually done is state some facts about lists, and it is Prolog’s search algorithm that actually reverses lists. In the case of reverse2, it’s unclear what fact reverse2 even corresponds to, since one would not usually describe reversing a list in terms of an accumulator; the hallmark of an effective logic programmer is the ability to discover or invent these clauses that cause computation to proceed, though they don’t correspond directly to any particular real fact.


7. Control Flow and the Cut

Consider now the problem of determining whether an object is a member of a list. To answer this question, we might implement a predicate member/2, as follows:

member(X, [X|_]).
member(X, [_|Y]) :- member(X,Y).

Note that the underscore, _, is a “wildcard” placeholder, indicating that the value occurring at the underscore is not being used.

A related problem is that of lookup in association lists. Under the assumption that an association list stores pairs of the form pair(X,Y), we might implement lookup/3 as follows:

lookup(X, [pair(X,R)|_], R).
lookup(X, [_|Y], R) :- lookup(X, Y, R).

Notice that, in the cases of both member and lookup, we do not need to explicitly handle the empty list. Since the empty list cannot match a pattern of the form [X|Y], invoking member or lookup with a second argument equal to the empty list will simply cause Prolog to fall through both clauses and return “No”.

While both member and lookup work as advertised, their behavior becomes somewhat dubious if the list passed as the second argument contains duplicate entries. If the element X occurs in the list Y multiple times, then the query ?- member(X,Y). will return “Yes” once for each time it occurs. Similarly, if the pair pair(X, _) occurs multiple times in the list, then lookup will return multiple results. Depending on its intended use, this behavior of lookup may be appropriate, but if our association list is meant to be a map, it is incorrect. Similarly, in the case of member, we are generally interested only in whether the element X occurs in the list Y; if it does, a single “Yes” will suffice.
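
For instance, with the definition of member/2 above (the list contents are illustrative):

?- member(2, [1,2,2,3]).
Yes ;
Yes ;
No

The query succeeds once for each occurrence of 2, even though a single “Yes” would tell us everything we want to know.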

In fact, we’ve seen this problem before: the list of pairs we’ve made here is exactly equivalent in structure to a type environment, \( \Gamma \), and if we were using Prolog to check types, then we would want lookup to have the same ordering constraint as we discussed for type environments. Ideally, lookup(Y, Γ, Z) should be exactly the same relation as \( \Gamma(Y) = Z \).

To produce the expected behavior, we need some facility by which we can affect Prolog’s search; that is, we need a control flow facility. However, given the tree search paradigm by which Prolog operates, conventional control flow constructs (loops, conditionals, exceptions, etc.) do not seem appropriate, or even natural.

Aside: It would also be possible to accomplish this with a “not equal” predicate, but as it turns out, the solution we’re implementing here is more powerful!

In Prolog, we alter control flow via a construct known as the cut, which is denoted by the predicate !. The effect of the cut is to commit Prolog to choices made so far in the search, thus pruning the search tree. Its behavior is best illustrated by example.

Example 5. Suppose we have the following database:

a :- b, !, c.   % 1
b :- d.         % 2
b.              % 3
d.              % 4
c :- e.         % 5
c.              % 6
e.              % 7
a :- d.         % 8
f :- a.         % 9
f.              % 10

Then consider the query ?- f.

When Prolog attempts to satisfy a rule containing the cut, it first attempts to satisfy all of the predicates occurring before the cut. If all the predicates before the cut are satisfied when it reaches the cut, Prolog commits to all choices made to satisfy the predicates to the left of the cut. Prolog also commits to the choice of rule made to satisfy the predicate on the left hand side of the rule. For example, when Prolog attempts to satisfy the rule a :- b, !, c, it first attempts to satisfy b. If it succeeds, then Prolog commits to the first answer for b. Further, Prolog does not consider any other rules that might produce a solution for a.

Visually, Prolog prunes the search tree at each subtree in which the query to be satisfied contains a cut. Further, if the query to be satisfied in a subtree contains a cut, then Prolog prunes its parent tree as well. Portions of the search tree that are pruned by the cut are not explored. Notice that three potential “Yes” answers are eliminated by the cut. Also note that, since the rule a :- b, !, c contains a cut, both the subtrees for b, !, c and for a are pruned. Since d, !, c contains a cut, both the subtrees for d, !, c and for b, !, c (which was already pruned) are pruned.

In general, there are three main uses of the cut:

  1. to confirm that the “correct” match has been found;
  2. to indicate “exceptions to the rule”;
  3. to prevent needless extra computation.

7.1 Use 1: Confirming the Correct Match

To illustrate the first use of the cut, let us consider the creation of a predicate sum_to/2, in which sum_to(N, R) binds R to the sum \( 1 + 2 + \cdots + N \). We might define this predicate as follows:

sum_to(1, 1).
sum_to(N, R) :- M is N - 1, sum_to(M, P), R is P + N.

This formulation of sum_to will compute the correct value of R given N, but if the user presses ; after Prolog returns the value of R, Prolog goes into an infinite loop, searching for other solutions. The reason for this behavior is that the constant 1 matches both the pattern 1 in the first rule and the pattern N in the second rule. Hence, when evaluating ?- sum_to(1, R), Prolog actually matches the query against both rules, even though only the first was intended. When Prolog matches ?- sum_to(1, R) to the second rule, the result is an infinite recursion. Further, since the recursion ultimately reduces every query of sum_to to a query of ?- sum_to(1, R), every query of sum_to produces an infinite loop after the first answer.

To remedy this problem, we may use the cut to tell Prolog that the first rule is the “correct” rule to use when the first argument is 1:

sum_to(1, 1) :- !.
sum_to(N, R) :- M is N - 1, sum_to(M, P), R is P + N.

In this case, when a query ?- sum_to(1, R) matches the first rule, the cut prevents it from also matching the second rule. In this way, we prevent infinite recursion, and sum_to only returns a single, correct answer.
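
With the cut in place, a query produces exactly one answer, and pressing ; simply returns “No” rather than looping:

?- sum_to(3, R).
R = 6 ;
No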

Our motivating example of lookup/3 also fits this case, if we wish to look up only the first match:

lookup(X, [pair(X,Y)|_], Y) :- !.
lookup(X, [_|Z], R) :- lookup(X, Z, R).

If the first clause matches, meaning a pair is found, then lookup returns only that value; the cut prevents it from trying to discover further solutions.

7.2 Use 2: Exceptions to the Rule

To illustrate the second use of the cut (exceptions), suppose that we have the following database:

smallMammal(cat).
smallMammal(rabbit).
goodPet(X) :- smallMammal(X).

In this database, we assert that cats and rabbits are small mammals, and that small mammals make good pets. Suppose now that we add the following fact to the database:

smallMammal(porcupine).

Working under the assumption that porcupines do not make good pets, we now have an exception to the rule that small mammals make good pets. To indicate this exceptional condition, we make use of a special predicate, fail, which always fails. On its own, fail is of limited use, as no rule containing fail can ever be satisfied. However, when combined with the cut, fail may be used to cause Prolog to return “No” when exceptional cases arise:

smallMammal(cat).           % 1
smallMammal(rabbit).        % 2
smallMammal(porcupine).     % 3
goodPet(porcupine) :- !, fail.  % 4
goodPet(X) :- smallMammal(X).  % 5

If we now issue the query ?- goodPet(porcupine)., Prolog will match the query against the first rule for goodPet/1. Since it matches, Prolog then processes the cut, which commits it to choosing the first rule. Finally Prolog processes the predicate fail, which makes the rule fail. Here, Prolog is not permitted to backtrack and try the second rule for goodPet/1, since it has processed the cut. Hence, the query ?- goodPet(porcupine). results in an answer of “No”. Note that it’s critical that the cut come before the fail, because Prolog will not proceed to attempting to satisfy any other conditions once one has failed, so a cut (or anything else) after a fail will never be reached.

On the other hand, if we issue either the query ?- goodPet(cat). or ?- goodPet(rabbit)., then Prolog will not be able to match the query to the first rule for goodPet/1, because neither cat nor rabbit unifies with porcupine. Thus, Prolog does not process the cut; instead, it backtracks and tries the second rule, attempting to satisfy ?- smallMammal(cat). and ?- smallMammal(rabbit)., respectively. In both cases, Prolog succeeds and returns “Yes”.

7.3 Use 3: Preventing Needless Computation

The cut may be used to prevent needless computation. Our motivating example of member/2 demonstrates this use of the cut: we are interested only in whether an object occurs in a list at all, and don’t care if it appears multiple times.

Using the cut, we may reimplement member/2 as follows:

member(X, [X|_]) :- !.
member(X, [_|Y]) :- member(X, Y).

Under this reformulated definition, Prolog aborts the search after the first match it finds, so the needless extra computation required to find subsequent matches is eliminated.


8. Negation

Even though negation is not required for programming with Horn clauses to be Turing-complete, Prolog provides a negation operator as a matter of convenience. Prolog’s negation operator is the meta-predicate not/1. Given a predicate P, the goal not(P) attempts to satisfy P, and returns success if and only if the attempt to satisfy P failed. Any bindings of variables to values that occurred during the evaluation of P are erased by the application of not.

A simple example of the use of not is a rule stating that a mineral is a thing that is not animal or vegetable:

mineral(X) :- thing(X), not(animal(X)), not(vegetable(X)).

Here, the rule defining mineral/1 succeeds when thing/1 succeeds and both animal(X) and vegetable(X) fail.
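
For instance, suppose we extend the database with the following facts (invented purely for illustration):

thing(rock).
thing(daisy).
vegetable(daisy).

Then ?- mineral(rock). returns “Yes”, since rock is a thing that is neither an animal nor a vegetable, while ?- mineral(daisy). returns “No”, since vegetable(daisy) succeeds and so not(vegetable(daisy)) fails.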

Prolog programmers may make other meta-predicates by use of the call meta-predicate, which simply treats its argument as a goal and attempts to solve it. For instance, the id meta-predicate is:

id(X) :- call(X).

With this definition, if sum(descartes) is true, then id(sum(descartes)) is also true, as is id(id(sum(descartes))), etc.

call is rarely used, but with it, the cut, and fail, we can define not directly in Prolog as follows:

not(X) :- call(X), !, fail.
not(X).

not attempts to satisfy its argument. If it succeeds, it forces Prolog to return failure; otherwise, if the argument cannot be satisfied, then the first rule fails, and Prolog tries the second rule. As the second rule always succeeds, Prolog returns “Yes”.

Even though not tends to capture the behavior we generally associate with logical negation, it is not a true logical negation operator. Consider the following database:

animal(dog).
vegetable(carrot).
both(X) :- not(not(animal(X))), vegetable(X).

If not were a true logical negation, then we could equivalently define both/1 above as:

both(X) :- animal(X), vegetable(X).

and expect the query ?- both(X). to return “No”, as no symbol in the database is both an animal and a vegetable. However, when we use the original formulation of both/1, with the double negation, the query ?- both(X). returns the answer X = carrot. To see this, consider how Prolog attempts to evaluate the query. It first needs to satisfy the query ?- animal(X)., which it does by setting X equal to dog. Thus, the goal not(animal(X)) fails, since an answer was found. At this point, the variable X becomes uninstantiated. Since the goal not(animal(X)) fails, the goal not(not(animal(X))) succeeds, but X remains uninstantiated. Hence, Prolog now attempts to satisfy the query vegetable(X). Since X is uninstantiated, Prolog is free to set X equal to carrot, thus satisfying the query.

Because not does not have true logical negation semantics, some newer Prolog implementations prefer to use the name \+ (called “cannot be proven”) for this operator instead.


9. Example: Sorting a List

In this section we present more examples of Prolog programming through the extended example of sorting a list. We shall focus our efforts on a naïve, purely declarative notion of sorting:

sort(X, Y) :- permute(X, Y), sorted(Y), !.

That is, Y is a sorted reordering of X if Y is a reordering (or permutation) of X, and Y is sorted.

This implementation of sort/2 is based on the notion of generators and tests. The generator, permute/2, generates all permutations of the list X, binding each in turn to Y (albeit, again, simply by stating the permutations as facts; we never write algorithms directly in Prolog). The test, sorted/1, determines whether a particular permutation is sorted. In this way, we reject all unsorted permutations of X. Because there may be several sorted permutations of X, and only one is relevant, we use a cut to terminate our search once we have found one sorted permutation of X.

The implementation of sorted/1 is quite easy, and is outlined below:

sorted([]).
sorted([_]).
sorted([X,Y|Z]) :- X =< Y, sorted([Y|Z]).

In words, empty and singleton lists are sorted; any other list is sorted if its first element is less than or equal to its second, and its tail is sorted.
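
For instance (the lists here are illustrative):

?- sorted([1,2,2,5]).
Yes
?- sorted([2,1]).
No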

We give two implementations of permute/2. In our first implementation, we create a generator to_X/2, where to_X(N, R) is taken to mean “R is an element of the set \( \{0, 1, \ldots, N-1\} \)”. The implementation of to_X is outlined below:

to_X(X, _) :- X =< 0, !, fail.
to_X(_, 0).
to_X(X, R) :- X1 is X - 1, to_X(X1, M), R is M + 1.

Invocation of to_X yields the expected result:

?- to_X(2, R).
R = 0 ;
R = 1 ;
No

Mathematically, this is equivalent to stating that X >= 1 and that R is an integer with 0 =< R and R =< X - 1, but Prolog cannot discover resolutions for X and R with only these bounds.

Next we give a predicate for finding the length of a list:

length([], 0).
length([_|T], R) :- length(T, Y), R is Y + 1.

Our intent is to use to_X to choose an element to remove from the list we wish to permute. To accomplish this feat, we define a predicate select/4, where select(L, N, A, B) binds A to the \( N \)-th element of L, and binds B to L with A removed:

select([H|T], 0, H, T) :- !.
select([H|T], N, A, [H|B]) :- N1 is N - 1, select(T, N1, A, B).
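
For instance, the following query extracts the element at index 1 (the list contents are illustrative):

?- select([a,b,c], 1, A, B).
A = b
B = [a,c]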

With select in place, we may now define permute as follows:

permute([], []) :- !.
permute(L, R) :-
    length(L, Len), to_X(Len, I), select(L, I, A, B),
    permute(B, BP), append([A], BP, R).

That is, to permute a list, we find its length, Len, and bind I to each integer in \( \{0, \ldots, Len-1\} \). For each I, we take the I-th element out of L and permute the remainder. Note how this style of stating facts has made the “for each” completely implicit: since it is a fact that to_X(Len, I) for each I < Len, Prolog can simply backtrack to another resolution of to_X as many times as it needs to, behaving like a loop with no explicit loops or even recursion specified. Finally, we put the I-th element at the front of the permuted remainder.

We can also write permute in a much shorter way:

permute([], []).
permute(L, [H|T]) :- append(U, [H|V], L), append(U, V, W), permute(W, T).

Here, we are using append as a generator. Rather than supply the arguments and retrieve a result, we supply the result, L, to append. append then generates all pairs of lists whose concatenation is equal to L. We then capture an element H from the middle of L, and permute the remainder, placing H at the front of the list.

With sorted/1 and permute/2 implemented, sort/2 now works, but as you can doubtless imagine, it is hilariously inefficient. This naïve implementation of sort/2 will find a sorted list by checking every possible permutation, so it takes \( O(n!) \) time where \( n \) is the length of the list. To sort things efficiently takes far more thought.
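
Inefficient though it is, we can try it out. In a Prolog system that permits a user-defined sort/2 (some systems reserve the name for a built-in predicate), the invocation is as expected:

?- sort([2,3,1], Y).
Y = [1,2,3]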


10. Example: Merge-Sorting a List

To make our sort more efficient, let’s implement the classic merge-sort algorithm. Merge sort is a recursive algorithm: to merge-sort a list, we split it in half, recursively merge-sort each half, and then merge the result. This requires three sub-parts: dividing a list in half, merging two sorted lists, and merge sort itself.

10.1 Dividing a List

When implementing merge-sort, one typically divides a list into the first \( n \) elements and the last \( m \) elements, where \( n + m \) is the length of the list. However, any fair division works, and a style of division that is much more easily expressed in Prolog is dividing a list into the elements with even and odd indices. We express this with two mutually-recursive predicates, divide/3 and divide2/3:

divide([], [], []).                                      % 1
divide([X|LI], [X|L1], L2) :- divide2(LI, L1, L2).     % 2

divide2([], [], []).                                     % 3
divide2([X|LI], L1, [X|L2]) :- divide(LI, L1, L2).     % 4

divide/3 declares that its first argument is divided into its second and third arguments if the first element of its first argument is the first element of its second argument (L1), and the remainder are divided by divide2/3. divide2/3 is similar, with the only distinction being that it requires the first element to be in the last list, L2. In either case, the division of an empty list is simply two empty lists.

Consider the division of a simple list [1,2,3]:

divide([1, 2, 3], A, B)
  divide2(LI, L1, L2)  [1/X] [[2,3]/LI] [[X|L1]/A] [L2/B]
    divide2([2,3], L1, L2)
      divide(LI_, L1_, L2_)  [2/X] [[3]/LI_] [L1_/L1] [[X|L2_]/L2]
        divide([3], L1_, L2_)
          divide2(LI_2, L1_2, L2_2)  [3/X] [[]/LI_2] [[X|L1_2]/L1_] [L2_2/L2_]
            divide2([], L1_2, L2_2)
              L1_2 = [], L2_2 = []

By following the substitutions backwards, we can get that A = [1,3] and B = [2].

10.2 Merging Sorted Lists

Next, we need to be able to merge sorted lists. First, let’s handle the base case: if we merge an empty list with any list L, the result is L:

merge([],L,L) :- !.
merge(L,[],L).

We used the cut simply to avoid processing both cases when merging two empty lists. Now, we simply need cases to merge two non-empty lists, by choosing the lesser element:

merge([X|T1],[Y|T2],[X|T]) :- X=<Y, merge(T1,[Y|T2],T).
merge([X|T1],[Y|T2],[Y|T]) :- X>Y, merge([X|T1],T2,T).
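
For instance, merging two sorted lists (contents illustrative):

?- merge([1,3,5], [2,4], M).
M = [1,2,3,4,5]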

These rules say that the lists [X|T1] and [Y|T2] merge to [X|T] if X=<Y, and [Y|T] otherwise. T is the merger of the remainder of the lists.

10.3 Merge Sort

Finally, we use these two algorithms to implement merge sort. Merge sort’s base case is when the list is empty or of length one, and in either of these cases, the list is trivially sorted:

merge_sort([], []) :- !.
merge_sort([X], [X]) :- !.

In both cases, we use the cut to make sure that the recursive case cannot apply. The merge sort algorithm itself is the last case, and it is written as a division, two recursive merge sorts, and a merge:

merge_sort(List, Sorted) :-
    divide(List, L1, L2),
    merge_sort(L1, Sorted1), merge_sort(L2, Sorted2),
    merge(Sorted1, Sorted2, Sorted),
    !.

Once again, we use the cut to free Prolog of any need to search for alternative solutions; there is only one sorting of a list.
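
Invoking merge_sort yields the expected result:

?- merge_sort([3,1,2], S).
S = [1,2,3]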

Note that merge_sort always generates the same result as sort, but when Prolog’s search algorithm is applied to it, it will usually find that result more quickly. This is why effective Prolog programming requires an understanding of the search algorithm.


11. Manipulating the Database

Prolog provides facilities by which we may programmatically alter the contents of the database. This is done primarily via the predicates asserta/1, assertz/1, and retract/1.

The predicates asserta and assertz add a clause to the database. Given a clause X, asserta(X) adds X to the beginning of the database, while assertz(X) adds X to the end of the database. Where a clause appears in the database is, of course, important because of the well-specified resolution order of Prolog.

The predicate retract removes a clause from the database. Given a clause X, retract(X) removes the first clause matching X from the database. If no clause matching X is found, then retract(X) fails.
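
For instance, the following session illustrates asserting and then retracting a fact (the predicate likes/2 is invented for the example):

?- assertz(likes(mary, wine)).
Yes
?- likes(mary, X).
X = wine
?- retract(likes(mary, wine)).
Yes
?- likes(mary, X).
No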

Note that the effects of asserta, assertz, and retract are not undone by backtracking. Thus, if we wish to undo the effects of asserta or assertz, we must do so explicitly, via a call to retract, and vice versa. Undoing retract can be essentially impossible, since we can only add a clause to the beginning or the end of the database, not the middle.

Excessive use of asserta, assertz, and retract tends to lead to highly unreadable and difficult to understand programs, and is discouraged.


12. Difference Lists

In Section 6, we created a predicate to reverse a list, as follows:

reverse([], []).
reverse([X|Y], R) :- reverse(Y, Z), append(Z, [X], R).

We also introduced a version of reverse based on accumulators, as follows:

reverse(X, Y) :- reverse2(X, [], Y).
reverse2([], Y, Y).
reverse2([X|T], Y, R) :- reverse2(T, [X|Y], R).

Even though, as declarative programmers, we would prefer not to worry about efficiency, when we compare these two formulations of reverse/2, we cannot help but notice that the second formulation is more efficient. Assuming a linear-time implementation of append/3, we see that the second formulation of reverse/2 requires only a linear number of query evaluations, while the first formulation requires a quadratic number of query evaluations.

In general, the use of accumulators is closely related to the notion of tail recursion in functional programming, and predicates based on accumulators are often more efficient than their counterparts without accumulators. However, accumulators can pose problems of their own. Consider the following predicate, which is based on accumulators, and whose purpose is to increment each element of a list:

incall(X, Y) :- incall2(X, [], Y).
incall2([], Y, Y).
incall2([X|T], Y, R) :- Z is X + 1, incall2(T, [Z|Y], R).

When we invoke incall/2, the result is as follows:

?- incall([1,2,3], Y).
Y = [4,3,2]

We see that incall/2 does indeed increment all of the elements of the list; however, the list returned by incall/2 is the reverse of what we intended! It is reasonably clear why this happened: at each step of the execution, we simply removed an element from the head of the input, incremented it, and placed it at the head of the accumulator. Thus, elements occurring later in the input are pushed onto the accumulator later, and therefore occur earlier in the accumulator list than elements which appeared before them in the input. The result is a list reversal.

To reap the benefits of programming with accumulators without the inconvenience imposed by the list reversals, Prolog programmers use a technique known as difference lists.

We may think of a difference list as a list with an uninstantiated tail. For example, by using difference lists we would represent the list [1,2,3] as [1,2,3|X]. We may then use X as an argument to other computation. Effectively, what we obtain through X is a “pointer to the tail” of the list, often called “the hole”. By instantiating X, we may add elements to the tail of the list. Further, if we instantiate X to a difference list, the result is still a difference list. This fact suggests the following implementation of append:

appendDL(List1, Hole1, List2, Hole2, List3, Hole3) :-
    Hole1 = List2, List3 = List1, Hole3 = Hole2.

Here, each difference list is represented by two parameters, List and Hole, representing the list and the hole at its tail, respectively. The result is that the list represented by List3 and Hole3 is bound to the result of appending the second list to the first. We may implement appendDL more succinctly as follows:

appendDL(List1, List2, List2, Hole2, List1, Hole2).

Our immediate observation is striking: this implementation of append runs in constant time! Since a caller had to retain a reference, so to speak, to both the beginning and tail of the list, all Prolog needs to do is unify them, with no extra step of walking through the list. We can then finalize the list by unifying the final hole with the empty list, again in constant time. Appending linked lists can be made constant time in imperative languages by maintaining a pointer to the tail, but this is the first constant-time append operation that we have encountered in this course. This fact alone suggests that difference lists are indeed a valuable tool.
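
To see appendDL in action, we can supply two difference lists and finalize the result by unifying its hole with the empty list (the variable names here are illustrative):

?- appendDL([1,2|H1], H1, [3,4|H2], H2, L, H), H = [].
L = [1,2,3,4]

Both unification steps take constant time, regardless of the lengths of the lists involved.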

Consider now our motivating example of incrementing the elements of a list via accumulators. To keep our code clean, we will represent a difference list by a structure with functor dl and components for the list and its uninstantiated tail. Using difference lists, we implement incall as follows:

incall(X, Y) :- incall2(X, dl(Z, Z), dl(Y, [])).
incall2([], Y, Y).
incall2([X|T], dl(AL, [X1|AH]), R) :- X1 is X + 1, incall2(T, dl(AL, AH), R).

Here, we use a difference list as an accumulator and as a placeholder for the result. When we invoke incall, it calls incall2 and sets the accumulator equal to the empty difference list (denoted dl(Z, Z)). As incall2 traverses the list, it repeatedly replaces the hole in the accumulator with a singleton difference list, representing the incremented head of the current list. When the input list is exhausted, incall2 simply returns the accumulator. incall then fills the hole in the result with the empty list ([]), thus converting the result to a normal list.

Thus, when we invoke incall, as in the following:

?- incall([1,2,3], Y).

we get the desired result:

Y = [2,3,4]

Let’s go over that example in greater detail, by showing the steps of resolution:

incall([1, 2, 3], Y)
  incall2(X, dl(Z, Z), dl(Y, []))  [[1, 2, 3]/X]
    incall2([1, 2, 3], dl(Z, Z), dl(Y, []))
      X1 is X + 1, incall2(T, dl(AL, AH), R)
        [1/X] [[2, 3]/T] [Z/AL] [[X1|AH]/Z] [dl(Y, [])/R]
      X1 is 1 + 1, incall2([2, 3], dl([X1|AH], AH), dl(Y, []))
        incall2([2, 3], dl([X1|AH], AH), dl(Y, []))  [2/X1]
          incall2([2, 3], dl([2|AH], AH), dl(Y, []))
            X1 is X + 1, incall2(T, dl(AL, AH_), R)
              [2/X] [[3]/T] [[2|AH]/AL] [[X1|AH_]/AH] [dl(Y, [])/R]
            X1 is 2 + 1, incall2([3], dl([2,X1|AH_], AH_), dl(Y, []))
              incall2([3], dl([2,X1|AH_], AH_), dl(Y, []))  [3/X1]
                incall2([3], dl([2,3|AH_], AH_), dl(Y, []))
                  X1 is X + 1, incall2(T, dl(AL, AH), R)
                    [3/X] [[]/T] [[2,3|AH_]/AL] [[X1|AH]/AH_] [dl(Y, [])/R]
                  X1 is 3 + 1, incall2([], dl([2,3,X1|AH], AH), dl(Y, []))
                    incall2([], dl([2,3,X1|AH], AH), dl(Y, []))  [4/X1]
                      incall2([], dl([2,3,4|AH], AH), dl(Y, []))
                        (unify dl([2,3,4|AH], AH) and dl(Y, []))
                          Y = [2,3,4]

In this way, we see that we may use difference lists to write programs with accumulators, in which the accumulators grow at the tail rather than at the head. For this reason, the result is not, as before, the reverse of what we wanted.


13. Miscellany

To load a source file into Prolog, we use the predicate consult/1. A call to consult(X) loads the file X.pl into the database. A convenient shorthand for the query ?- consult(X). is simply ?- [X].

To access Prolog’s help facility, we issue the query ?- help(T)., where T is our desired help topic. To search the help files, we issue the query ?- apropos(T)., where T is our desired search term.

Comments in Prolog begin with % and extend to the end of the line. C-style comments, beginning with /* and ending with */, are also available.

To display data to the screen, we use Prolog’s display/1 predicate. For character data, we may also use put/1. To print a newline to the screen, we use nl/0. In this way, Prolog programs can be made to behave like more traditional programs.


14. Query Programming and Datalog

A few of the words we used in this module may be familiar to you if you’ve studied databases. In particular, we call the set of clauses a database, and call our predicates and structures relations. These terms are the same as are used in relational databases, and are, in fact, used in exactly the same way. While meaning only to discuss programming in predicate logic, we have accidentally stumbled into a larger paradigm: query programming.

Although logic programming and relational databases were developed independently, when their relationship was discovered, attempts were made to describe a subset of Prolog which fit the use case of databases. In particular, the goal was to find a large and useful subset of Prolog which is not Turing-complete, since it is desirable when simply asking questions about data to ensure that the questions will be answered. These subsets are broadly called Datalog. Datalog is syntactically a subset of Prolog, and semantically mostly a subset of Prolog, except that certain programs which cannot halt in Prolog must instead halt in Datalog. All queries which halt in Prolog and are valid in Datalog yield the same set of results.

Exactly what restrictions Datalog has isn’t precisely defined — Datalog isn’t an independent programming language, but a family of non-Turing-complete subsets of Prolog — but at least the following restrictions are common:

  1. The cut operator, fail, and call are removed.
  2. Queries are always run to completion and give all results, rather than interactively requesting further results, and the results are not guaranteed to be in any order (although it’s not uncommon to explicitly sort them as a final step after querying).
  3. Structures other than atoms are removed.
  4. Fully recursive queries (e.g., a predicate A(X) reached while trying to satisfy the predicate A(X)) are rejected.
  5. Every variable appearing in the head of a rule must appear in the argument of a non-negated predicate in the body of the rule.
  6. Negation is stratified: If the rules for predicate x include a negation of predicate y, then there must be no path through the rules by which predicate x refers to predicate y.
  7. Numeric operators are also stratified: If the rules for predicate x include a numeric operator over variable Z, then either the result must be guaranteed to be closer to zero, or its use in predicates must stratify as with negation.
  8. Lists, if allowed at all, must be stratified like numbers, either by reducing towards the empty list or by stratified use of predicates.

In short, with the exception of rules 1 and 2, these rules guarantee that taking a step in resolving a query always yields a lesser query, in the sense that it has fewer steps until facts are reached. This renders the language non-Turing-complete, but still useful. All queries in Datalog terminate.
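Rule 6’s stratification requirement can be checked mechanically: build a dependency graph between predicates, mark which dependencies pass through negation, and reject the program if any cycle contains a negated edge. The following is a minimal sketch in Python; the rule encoding is our own hypothetical simplification, not real Datalog syntax.

```python
# A minimal sketch of checking rule 6 (stratified negation). We build a
# dependency graph between predicates and reject the program if any cycle
# passes through a negated dependency. The rule encoding is our own
# hypothetical simplification: (head, [(body_predicate, is_negated), ...]).

def is_stratified(rules):
    edges = {}
    for head, body in rules:
        for pred, negated in body:
            edges.setdefault(head, []).append((pred, negated))

    def negative_cycle_from(start):
        # Depth-first search over states (predicate, passed_a_negation).
        stack, seen = [(start, False)], set()
        while stack:
            node, neg = stack.pop()
            if (node, neg) in seen:
                continue
            seen.add((node, neg))
            for succ, negated in edges.get(node, []):
                neg2 = neg or negated
                if succ == start and neg2:
                    return True          # cycle back to start through "not"
                stack.append((succ, neg2))
        return False

    return not any(negative_cycle_from(head) for head, _ in rules)

# p(X) :- q(X), not r(X).    r(X) :- s(X).     -- stratified
ok = [('p', [('q', False), ('r', True)]), ('r', [('s', False)])]
# p(X) :- not q(X).          q(X) :- p(X).     -- not stratified
bad = [('p', [('q', True)]), ('q', [('p', False)])]
print(is_stratified(ok), is_stratified(bad))  # True False
```

An engine that accepts such a program can then evaluate it stratum by stratum, computing each predicate to a fixpoint before any predicate that negates it.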

Rule 3 removes the ability to create recursive structures (by removing all structures), and thus removes that path to infinite recursion. Note that we can still express interesting structures, but in a different way. For instance, to express:

owns(jean, book(greatExpectations, dickens)).

we may instead give both people and books some kind of identifier, and then express the ownership relation over those identifiers, like so:

person(24601, jean).
book(9564, greatExpectations, dickens).
owns(24601, 9564).

Since the number of atoms in any given database is finite, rules 3 and 4 together guarantee that a query with only positive predicates over atoms (i.e., no numbers or lists) will eventually terminate. Rule 5 guarantees that no predicates are themselves infinite, since variables must be resolved. Rules 6, 7, and 8 allow us to add negation, numbers, and lists, respectively, by guaranteeing that each approaches a base case.
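To see this termination guarantee concretely, it helps to evaluate bottom-up rather than top-down: start from the facts and repeatedly apply every rule until no new facts appear. Since only finitely many facts can be built from a database’s finitely many atoms, the loop must reach a fixpoint. Below is a minimal sketch in Python, using our own hypothetical tuple encoding of facts and rules (uppercase strings are variables); it illustrates the idea, and is not a real Datalog engine.

```python
# A minimal sketch of bottom-up ("naive") evaluation: start from the facts
# and apply every rule until no new facts appear. The tuple encoding of
# facts and rules is our own hypothetical simplification (uppercase strings
# are variables), not real Datalog syntax.

from itertools import product

def match(pattern, fact, env):
    # Match one body pattern against one fact, extending env with bindings.
    if pattern[0] != fact[0] or len(pattern) != len(fact):
        return False
    for p, f in zip(pattern[1:], fact[1:]):
        if isinstance(p, str) and p.isupper():   # a variable
            if env.setdefault(p, f) != f:
                return False
        elif p != f:                             # an atom: must be equal
            return False
    return True

def substitute(head, env):
    return tuple(env.get(t, t) if isinstance(t, str) else t for t in head)

def naive_eval(facts, rules):
    known = set(facts)
    while True:
        added = False
        for head, body in rules:
            for combo in product(known, repeat=len(body)):
                env = {}
                if all(match(p, f, env) for p, f in zip(body, combo)):
                    new = substitute(head, env)
                    if new not in known:
                        known.add(new)
                        added = True
        if not added:
            # Fixpoint reached. Since only finitely many facts can be built
            # from finitely many atoms, this is guaranteed to happen.
            return known

facts = {('edge', 'a', 'b'), ('edge', 'b', 'c')}
rules = [
    (('path', 'X', 'Y'), [('edge', 'X', 'Y')]),
    (('path', 'X', 'Z'), [('edge', 'X', 'Y'), ('path', 'Y', 'Z')]),
]
print(sorted(f for f in naive_eval(facts, rules) if f[0] == 'path'))
# [('path', 'a', 'b'), ('path', 'a', 'c'), ('path', 'b', 'c')]
```

Note that top-down Prolog resolution of a recursive path/2 could loop forever, while this bottom-up evaluation of the same rules cannot.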

The implications of rules 1 and 2 are more interesting: without the cut, and without any particular guarantee on the order of results, the exact order in which we resolve a query is no longer constrained. In Prolog, it was necessary to follow the search algorithm precisely, because the result was only correct if we did; thus, we couldn’t optimize a query, as only one implementation of the algorithm was correct. Moreover, some queries and databases which in Prolog would expand infinitely are allowed and must terminate in Datalog, and the restrictions to Datalog’s structure guarantee that this is possible. In spite of being a simpler language, Datalog is much more complicated to implement, because we are allowed to optimize the query, and such optimizations are required to guarantee termination.


15. Relational Databases

Relational databases were created completely independently of Prolog, and have their own foundational logic, called the relational algebra. In addition to their own foundational logic, they have their own defining language, the Structured Query Language, or SQL, sometimes pronounced as a homophone with “sequel”. SQL is less a programming language than an entire family of related programming languages, as each relational database engine has its own customized version of SQL; the core, however, is standardized.

The data in relational algebra is sets of relations, which are precisely the kinds of relations we’ve defined, i.e., predicates. In relational database terminology, relations which are only defined as facts are called tables, and relations which are defined (at least in part) as rules (Horn clauses with bodies) are called views. These are sometimes also called extensional database predicates and intensional database predicates, respectively. A single fact within a table is called a row, and may not contain variables, only atoms or primitive data.

Consider a database which keeps track of employees and their supervisory relationship. This database might have two tables: one which tracks employee information, such as the employee’s identification number and name, and a second which tracks who supervises whom. For instance, consider the following database:

employee(34, dantes).    % 1
employee(5, villefort).  % 2
employee(1, napoleon).   % 3

supervisor(1, 5).        % 4
supervisor(5, 34).       % 5

The employee/2 table states that the employee dantes has employee number 34, the employee villefort has employee number 5, and the employee napoleon has, of course, employee number 1. The supervisor/2 table states that napoleon supervises villefort (supervisor(1, 5)), and villefort supervises dantes (supervisor(5, 34)).

Aside: This database is not a useful summary of The Count of Monte Cristo.

A useful view may be the supervisory relationship by name, instead of by number. In relational algebra terms, this is the join of the two tables:

supervisor_by_name(X, Y) :- employee(N, X), employee(M, Y), supervisor(N, M).

In a relational database, each argument to a predicate is called a field of the relation, and has a name and usually a restricted type. Joining two (or more) tables creates a new relation which is defined where certain fields of the original relations are equal. In this case, supervisor_by_name/2 joins twice: once between the first field of supervisor/2 and the first field of employee/2, and once between the second field of supervisor/2 and the first field of employee/2. The natural join of two tables is simply the join in which the fields which must match are those with the same names; this is not meaningful to us, as Datalog fields do not have names.
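To connect this with SQL: the same database and join can be written directly in a relational engine. The sketch below uses SQLite through Python; the table and column names (num, name, boss, worker) are our own invention, since our Datalog fields have no names.

```python
import sqlite3

# A sketch of the employee/supervisor database and the supervisor_by_name
# join, written in SQL and run in SQLite via Python. The table and column
# names (num, name, boss, worker) are our own invention.
db = sqlite3.connect(':memory:')
db.executescript('''
    CREATE TABLE employee (num INTEGER, name TEXT);
    CREATE TABLE supervisor (boss INTEGER, worker INTEGER);
    INSERT INTO employee VALUES (34, 'dantes'), (5, 'villefort'), (1, 'napoleon');
    INSERT INTO supervisor VALUES (1, 5), (5, 34);
''')

# supervisor_by_name(X, Y) :- employee(N, X), employee(M, Y), supervisor(N, M).
# A view is a relation defined by a rule; the joins become equalities
# between fields in the WHERE clause.
db.execute('''
    CREATE VIEW supervisor_by_name AS
    SELECT sup.name AS super, sub.name AS sub
    FROM supervisor s, employee sup, employee sub
    WHERE s.boss = sup.num AND s.worker = sub.num
''')
print(db.execute('SELECT * FROM supervisor_by_name ORDER BY super').fetchall())
# [('napoleon', 'villefort'), ('villefort', 'dantes')]
```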

In addition, relational databases usually have unique and key fields. For our purposes, there is no distinction between the two concepts; a key field is simply a unique field. This restricts how rows are added to the table, preventing two rows from having the same value for that field. For instance, in our above table, nothing prevents us from adding this row:

employee(34, faria).

Doing so, however, would yield strange results, since two employees now have the same employee number. It was presumably our intention that no two employees have the same employee number — this is, of course, the point of using such numbers instead of employee names — so a relational database engine would prevent this row from being added to the database. We can instead add faria like so:

employee(27, faria).

and with that row, we can additionally add a supervisor relation:

supervisor(5, 27).

Queries with variables are called selections, because they select values for the variables which satisfy the query. For instance, we can find every employee who is supervised by villefort (employee number 5) like so:

?- supervisor_by_name(villefort, X).
X = dantes
X = faria

In the original theory of relational algebra, recursive views and queries were not allowed. In practice, however, there are as many extensions of relational algebra (and of SQL) as there are relational database systems, and most of them allow recursive views in some form. For instance, many would allow us to define a view for whether a named employee is higher in the supervisory hierarchy than another. We can split this into its two components: the relation of the name to the employee number, and the recursive relation of two employee numbers via supervisor:

higher(X, Y) :- employee(NX, X), employee(NY, Y), nhigher(NX, NY).
nhigher(M, N) :- nhigher(P, N), supervisor(M, P).
nhigher(M, N) :- supervisor(M, N).

We can now query all employees under napoleon at any level of the hierarchy:

?- higher(napoleon, X).
X = dantes
X = faria
X = villefort

Note that in Prolog, this query would not complete, because we’ve structured its recursion before its base case. Some relational database systems support Datalog directly, but more frequently, they support a modified version of SQL inspired by Datalog.
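In standard SQL, recursion of this sort is usually expressed as a recursive common table expression (WITH RECURSIVE) rather than a recursive view. As a sketch, here is the nhigher/higher query run in SQLite through Python; the schema and names are our own.

```python
import sqlite3

# A sketch of the recursive nhigher/higher view as a SQL recursive common
# table expression, run in SQLite via Python. The schema and column names
# (num, name, boss, worker) are our own invention.
db = sqlite3.connect(':memory:')
db.executescript('''
    CREATE TABLE employee (num INTEGER, name TEXT);
    CREATE TABLE supervisor (boss INTEGER, worker INTEGER);
    INSERT INTO employee VALUES (34, 'dantes'), (5, 'villefort'),
                                (1, 'napoleon'), (27, 'faria');
    INSERT INTO supervisor VALUES (1, 5), (5, 34), (5, 27);
''')

rows = db.execute('''
    WITH RECURSIVE nhigher(m, n) AS (
        SELECT boss, worker FROM supervisor       -- base case
        UNION                                     -- UNION deduplicates,
        SELECT s.boss, h.n                        -- so we reach a fixpoint
        FROM supervisor s, nhigher h
        WHERE s.worker = h.m                      -- recursive case
    )
    SELECT sub.name
    FROM nhigher, employee sup, employee sub
    WHERE nhigher.m = sup.num AND nhigher.n = sub.num
      AND sup.name = 'napoleon'
    ORDER BY sub.name
''').fetchall()
print([name for (name,) in rows])  # ['dantas' and friends: ['dantes', 'faria', 'villefort']
```

Note that the recursive step is written in either order here; unlike Prolog, the engine is free to evaluate it bottom-up, which is why it terminates.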


16. Query Optimization

Consider the query ?- nhigher(1, X). in Prolog, shown partially here:

1:  nhigher(1, X)
2:  nhigher(P, N), supervisor(M, P)  [1/M] [X/N]
3:    nhigher(P, X), supervisor(1, P)
4:    nhigher(P_, N), supervisor(M, P_)  [P/M] [X/N], supervisor(1, P)
5:      nhigher(P_, X), supervisor(P, P_), supervisor(1, P)
6:      nhigher(P_2, N), supervisor(M, P_2)  [P_/M] [X/N], supervisor(P, P_), supervisor(1, P)
7:        nhigher(P_2, X), supervisor(P_, P_2), supervisor(P, P_), supervisor(1, P)
8:        ...

This query will never resolve, because nhigher(P, N) is listed before supervisor(M, P) in a clause defining nhigher/2, so it will expand to an infinite series of supervisor/2 relations. If we had simply put supervisor(M, P) before nhigher(P, N), then Prolog would have resolved this easily. It is the role of a query optimizer to reorder predicates in Datalog, making such queries not just faster, but in cases like this one, possible at all.

Aside: You will see query optimization discussed in very different terms than this section uses, simply because it was originally developed for SQL and relational databases, not Datalog. The concept is the same, but the details vary greatly.

Query optimization is an enormous field, and we will only touch on the basic concepts here. Generally speaking, at any time, a query optimizer is presented with a list of predicates it must resolve, and must choose one to resolve first. The structure of Datalog guarantees that any ordering is valid (except with respect to halting), but one may be faster (in this case, infinitely faster).

Given a list of predicates to be resolved, a query optimizer orders them based first on the structure of the predicates, and second on a cost model. The cost model is specific to each query optimizer and database engine, so we won’t discuss it here. Structurally, observe that nhigher/2 consists of a recursive join over supervisor/2: we join its first argument to its second. Join optimization consists of choosing joins which reduce the size of our query before joins which increase the size of our query. The “size” in this case is the number of unrestricted variables or, if equal, the number of predicates in the resulting query. Consider the partially-evaluated query nhigher(P, X), supervisor(1, P). Since supervisor/2 is a table (it has only facts), if we resolve it, we will not introduce any new predicates, and will resolve the value of P. On the other hand, if we resolve nhigher(P, X) first, its recursive rule will introduce one new variable, as well as a new predicate. Thus, we choose to resolve supervisor(1, P) before nhigher(P, X). When options are equal in these terms, it’s necessary to use a heuristic cost model.
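The ordering heuristic described above can be sketched as a greedy choice: among the pending predicates, prefer the one whose resolution grows the query least, with table lookups (facts only) preferred over rule expansions. The toy scoring function below is our own simplification of this idea, not a real optimizer.

```python
# A toy sketch of greedy subgoal ordering, our own simplification of the
# idea in the text. Each pending subgoal is scored by how much resolving
# it would grow the query: a table lookup (facts only) adds no predicates
# and binds variables, while expanding a rule may add new predicates.

def score(subgoal, tables, rules):
    pred, args = subgoal
    unbound = sum(1 for a in args if a.isupper())  # unbound variables
    if pred in tables:
        # Resolving against a table removes this predicate entirely.
        return (0, unbound)
    # Otherwise, the worst case over the rules defining this predicate:
    # how many new predicates an expansion would introduce.
    growth = max(len(body) for body in rules[pred])
    return (1, growth + unbound)

def choose_next(subgoals, tables, rules):
    # Resolve the subgoal with the lowest score first.
    return min(subgoals, key=lambda g: score(g, tables, rules))

tables = {'supervisor'}
rules = {'nhigher': [[('nhigher', ('P', 'N')), ('supervisor', ('M', 'P'))],
                     [('supervisor', ('M', 'N'))]]}
# The partially-evaluated query from the text: nhigher(P, X), supervisor(1, P)
pending = [('nhigher', ('P', 'X')), ('supervisor', ('1', 'P'))]
print(choose_next(pending, tables, rules))  # ('supervisor', ('1', 'P'))
```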

A similar model can also be used to select from multiple resolutions of a predicate. For instance, we would choose to expand nhigher(1, X) to supervisor(M, N) [1/M] [X/N] before nhigher(P, N), supervisor(M, P) [1/M] [X/N], because the latter introduces more variables and predicates. This is less important, since ultimately, in Datalog, every resolution must be explored, but SQL allows, for instance, requesting just one matching result, and this optimization can be useful for doing so.

Now, let’s consider the same query as above, but with a reordering step to bring the best query by the above model to the front:

1:  nhigher(1, X)
2:  /* OPTIMIZATION: Choose supervisor(M, N) [1/M] [X/N] first */
3:  supervisor(M, N)  [1/M] [X/N]
4:    supervisor(1, X)
5:    supervisor(1, 5)
6:    solution 1: X=5
7:  nhigher(P, N), supervisor(M, P)  [1/M] [X/N]
8:    nhigher(P, X), supervisor(1, P)
9:    /* OPTIMIZATION: Choose supervisor(1, P) first */
10:   supervisor(1, P), nhigher(P, X)
11:   supervisor(1, 5), nhigher(5, X)
12:     nhigher(5, X)
13:     /* OPTIMIZATION: Choose supervisor(M, N) [5/M] [X/N] first */
14:     supervisor(M, N)  [5/M] [X/N]
15:       supervisor(5, X)
16:       supervisor(5, 34)
17:       solution 2: X=34
18:       supervisor(5, 27)
19:       solution 3: X=27
20:     nhigher(P, N), supervisor(M, P)  [5/M] [X/N]
21:       nhigher(P, X), supervisor(5, P)
22:       /* OPTIMIZATION: Choose supervisor(5, P) first */
23:       supervisor(5, P), nhigher(P, X)
24:       supervisor(5, 34), nhigher(34, X)
25:         nhigher(34, X)
26:         /* OPTIMIZATION: Choose supervisor(M, N) [34/M] [X/N] first */
27:         supervisor(M, N)  [34/M] [X/N]
28:           supervisor(34, X)
29:           fail
30:         nhigher(P, N), supervisor(M, P)  [34/M] [X/N]
31:           nhigher(P, X), supervisor(34, P)
32:           /* OPTIMIZATION: Choose supervisor(34, P) first */
33:           supervisor(34, P), nhigher(P, X)
34:           fail
35:       supervisor(5, 27), nhigher(27, X)
36:       /* Similar to the case of supervisor(5, 34), nhigher(34, X) */

This query is not just faster than the Prolog version, but infinitely faster than the Prolog version, in that it terminates, resolving to X=5, X=34, and X=27.


17. Everything In Between

Most implementations of relational databases actually implement extensions to SQL which render them Turing-complete, though typically without reintroducing the Prolog cut. In some cases, this is simply by cheating and allowing a Turing-complete language to be embedded into SQL, but in other cases, it is by allowing less restrictive databases, for instance by removing the stratification requirements on numbers or lists, or allowing unbound variables in the heads of Horn clauses.

The cut operator is rarely reintroduced, because the goal of such a system is usually a Turing-complete database in which query optimization is still allowed. Modern database systems have to contend with Turing-complete programs over which optimization can turn a non-halting computation into a halting computation!

Datalog has also found some use in machine learning, where certain ML structures, such as neural networks, fit within the constraints of Datalog.


18. Fin

In the next module, we will look at imperative programming languages, which account for most languages used in practice.


References

  1. Roman Barták. On-line guide to Prolog programming, 1998. http://kti.mff.cuni.cz/~bartak/prolog/
  2. Edgar F Codd. A relational model of data for large shared data banks. In Software pioneers, pages 263–294. Springer, 2002.
  3. Jacques Cohen. Non-deterministic algorithms. ACM Computing Surveys (CSUR), 11(2):79–94, 1979.
  4. Jacques Cohen. Describing Prolog by its interpretation and compilation. Communications of the ACM, 28(12):1311–1324, 1985.
  5. Todd J Green, Shan Shan Huang, Boon Thau Loo, Wenchao Zhou, et al. Datalog and recursive query processing. Now Publishers, 2013.
  6. Hongyuan Mei, Guanghui Qin, Minjie Xu, and Jason Eisner. Neural datalog through time: Informed temporal modeling via logical specification. In International Conference on Machine Learning, pages 6808–6819. PMLR, 2020.

Module 7: Imperative Programming

“Gotos aren’t damnable to begin with. If you aren’t smart enough to distinguish what’s bad about some gotos from all gotos, goto hell.” — Erik Naggum

1. Imperative Programming

You are almost certainly familiar with imperative programming languages. Most of the most popular programming languages in the world are imperative: C, C++, Java, JavaScript, Python, etc. Smalltalk, which you’ve been using in this course, is also imperative, although it’s an unusual example. This module is fairly brief — really, it is, it just has several long figures to make it seem longer than it is! — both because you’re expected to have familiarity with imperative programming languages and because imperative programming is fairly conceptually simple, particularly after logic programming.

In English, “imperative” is a synonym of “command” — it comes from the same root as “emperor” — and that’s the core of its meaning in programming languages as well. An imperative language is a language in which the fundamental behavior is described by imperatives, i.e., commands, in sequence. In a functional language, expressions are the basic unit of behavior, and the ordering of evaluation is largely implicit. In an imperative language, statements are the basic unit of behavior, and a program is built from lists of statements. One statement is always completed before the next statement is executed, so the ordering is explicit in the code: the order of execution is the order of statements. Formally, an imperative is a single “step”, which is usually a single statement, but some statements have nested statements within them, in which case one statement is formed from many imperatives. In practice, we will use the terms “statement”, “imperative”, and “command” interchangeably.

Aside: We have, of course, seen that the order of evaluation is perfectly well defined in functional languages. Furthermore, OCaml allows imperatives directly, with its ; operator, and Haskell’s do syntax allows imperative-like behavior as well. None of these categories are precise.

In order for commands to be able to communicate information from one to the next, all imperative languages have mutable variables. In functional languages, variables — let bindings and function arguments — are immutable, and mutability is encapsulated in the structure of a reference or monad. In an imperative language, statements may change variables, and the meaning of any statement can vary based on the values of the variables it uses at the time that the statement is executed.

Virtually all modern imperative languages have procedures. “Procedure” is, again, a normal English word, with the same meaning in programming languages as it has in English: a procedure is a list of steps, possibly parameterized. For instance, the procedure for walking a dog is parameterized by the dog, and the procedure for sorting an array is parameterized by the array. Pedantically, a procedure is distinct from a function because:

  • functions, as mathematical entities, are defined by expressions, not imperatives, and
  • functions, as mathematical entities, are referentially transparent.

In practice, only the most pure functional languages restrict functions to this definition, so we will use the terms “function” and “procedure” interchangeably. Using these terms interchangeably also annoys pedants, which is always a laudable goal. Procedural languages are imperative languages with procedures. This accounts for virtually all imperative languages, so this module will mostly talk about procedural and imperative languages interchangeably. Note that although we will use “function” and “procedure” interchangeably, it is never correct to use “functional language” and “procedural language” interchangeably, as the former is a sub-paradigm of declarative languages, and the latter is a sub-paradigm of imperative languages.

There is yet another set of terms, routine and subroutine, which are also synonyms of “procedure”, although some languages have more specific meanings for all of these terms within the context of the language. In English, “routine” is a synonym of “procedure” (albeit with different connotation), and “subroutine” is simply “routine” with the prefix “sub-” to imply the possibility of nesting. We will use the term “subroutine” when discussing Pascal, and define all of these terms more precisely for that language, but in all other contexts, we consider them all to be equivalent.

A language doesn’t need procedures to be useful or Turing-complete, although it is impossible for a language to be Turing-complete if it has no way of repeating, so a non-procedural imperative language must have loops. Modern imperative programming languages without functions of any sort are rare. Assembly language is imperative and doesn’t have functions on any real architectures, but all other examples are extremely special-purpose. Early imperative languages, in particular early versions of BASIC, lacked functions as well.

Another sub-paradigm within imperative programming is structured programming. Virtually all modern imperative languages are structured languages, so we will also mostly use these terms interchangeably. In a structured programming language, control flow — i.e., conditions and loops — is explicitly represented in the structure of the code. For instance, in our exemplar language, Pascal, to execute some statements a, b, and c only if a condition x is true, you write:

if x then
begin
  a;
  b;
  c
end

The structure of the condition — that a, b, and c are only to be executed if x is true — is reflected in the structure of the code: the a, b, and c statements are nested inside of the if statement. If you’re more familiar with C syntax, the equivalent is:

if (x) {
  a;
  b;
  c;
}

In an unstructured language, all control flow is done with jumps, such as goto in C, or jmp in many assembly languages. For instance, we could rewrite the above C example (quite badly) like so:

if (x) goto xTrue;
afterCondition:
[...]
xTrue:
  a;
  b;
  c;
  goto afterCondition;

In an unstructured language, the order of statements is less clear, since gotos can cause the order of execution to differ substantially from the order in which statements appear. gotos are generally considered harmful [1], so we will look almost exclusively at structured languages.

So, when we look at imperative languages, we’re really going to be looking at structured, procedural programming languages. It just so happens that that accounts for nearly all modern imperative languages.

2. Examples

It is more difficult to select a single exemplar for imperative languages than any other paradigm. Most programming languages are imperative, and most imperative programming languages before object-oriented programming became popular were uniformly imperative, with little pollution from other paradigms, so there are dozens of programming languages which are, or at least were, purely imperative. As such, we will briefly discuss some other potential exemplars, before describing the exemplar we will be using, Pascal.

The obvious exemplar for imperative programming is C, and for the most part, it’s purely imperative. Unlike many other examples — in fact, unlike even Pascal — C has remained purely imperative, and has no object-oriented or functional features to speak of. This is largely because C’s object-oriented descendants have become just that, descendants, rather than new versions of the language. If you want to use objects in C, you should use C++ or Objective-C. If you want to use objects in Pascal, you should use a recent implementation of Pascal. Perversely, even if you want to use objects in COBOL, you need only use a recent version of COBOL. So, if C is such a good exemplar, why wasn’t it chosen? Simple: it’s not. C’s pointers are pervasive and persistent, and require manual memory management as a constant fact of life. This is fine, but places it into an intersecting paradigm, systems programming languages, which we will discuss in Module 10, for which C is the exemplar. Further, C has some unusual syntactic anomalies — in particular, the C preprocessor — which make it difficult to relate to other languages.

Another possible exemplar was Forth, a stack-based language which is perhaps the most strictly imperative language in existence. Forth has no expressions, only commands. It accomplishes this by operating on a stack, precisely like the RPN calculator we developed in the Smalltalk component of Module 1. For instance, this Forth program computes \( 1 + 2 \times 3 \) and prints the result:

1 2 3 * + .

The behavior of a number command in Forth is to push that number to the stack. The behavior of a mathematical operator is to pop two numbers, perform the specified operator, and push the result to the stack. The behavior of . is to pop a number from the stack and print it. You may also define new commands in Forth from sequences of existing commands, forming procedures. Forth was not selected because its style is so foreign that it’s hard to connect lessons learned about Forth to other imperative languages. Also, no two people can agree on how to capitalize Forth (FORTH? Forth? forth?), and the compromise of small-caps is hard to type.
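Forth’s evaluation model is simple enough to mimic directly: each word is a command that manipulates a shared stack, executed strictly in sequence. Here is a minimal sketch in Python of a hypothetical tiny subset of Forth’s words.

```python
# A minimal sketch of Forth-style evaluation: every word is an imperative
# that manipulates a shared stack, executed strictly in order. Only a tiny
# hypothetical subset of words is modeled here.

def run(program):
    stack, output = [], []
    for word in program.split():
        if word == '.':                       # pop and print
            output.append(stack.pop())
        elif word in ('+', '-', '*'):         # pop two, push the result
            b, a = stack.pop(), stack.pop()
            stack.append({'+': a + b, '-': a - b, '*': a * b}[word])
        else:                                 # a number: push it
            stack.append(int(word))
    return output

print(run('1 2 3 * + .'))  # [7]
```

Note that there are no expressions anywhere in this model: `1 + 2 * 3` is computed purely as a sequence of commands against the stack, which is what makes Forth so strictly imperative.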

Another option was BASIC, the programming language that defined the 80’s. BASIC is a fairly straightforward, dynamically typed, imperative language. At this point, it would be reasonable to provide a short example of BASIC, but that’s the problem: BASIC was pre-loaded on many 80’s microcomputers, but each implementation was subtly incompatible, and the language evolved considerably over the course of the 80’s, 90’s, and early 2000’s. So, the simple reason not to choose BASIC as an exemplar is that there is no BASIC language; it’s more like a family of superficially related languages.

Arguably, Turing machines are imperative, and could serve as an exemplar as well. But, quite simply, very few languages derive their syntax or core structure from Turing machines. The Turing machine is a good theoretical basis, but not a good programming language.

3. Exemplar: Pascal

Pascal is an imperative, structured, procedural programming language originally designed by Niklaus Wirth around 1970 [2]. Pascal is statically typed but weakly typed. We won’t focus much attention on its types, and the formal semantics we define will not have types. Pascal is an excellent exemplar, in particular, of procedural programming, because a Pascal program is fundamentally a tree of procedures. Many imperative programming languages do not allow procedures to nest, in the same way that functions can nest in functional languages, but Pascal does; it allows procedures to be defined anywhere where variables may be defined, and scopes everything lexically.

There are three kinds of procedures in Pascal: programs, procedures, and functions. By the common definition of the term “procedure”, and the definition we’re using in this module, they are all procedures; Pascal simply has its own definitions for these terms. To avoid confusion, we will therefore use the term “subroutine” to refer to all of them, and the more specific terms to refer to the more specific structures in Pascal. When not discussing Pascal, we will continue to use the terms interchangeably. In Pascal, a program is the outer subroutine for, predictably, a program, and is technically just a “routine” since it’s not sub- to anything; a procedure is a subroutine which does not return anything, so is only used for its side effects; and a function is a subroutine which has a return value.

A Pascal subroutine consists of a subroutine header, which defines which of the three types of subroutines it is; a declaration list, which defines the environment for the subroutine; and a subroutine body. For instance, this is the classic Pascal “Hello, world!” program, with the addition of a variable to hold the string 'Hello, world!' simply to demonstrate variable declarations:

program Hello;
var greeting: string;
begin
  greeting := 'Hello, world!';
  writeln(greeting)
end.

The line program Hello; is the subroutine header. A program subroutine has no properties other than its name (no arguments), and this subroutine’s name is Hello. Naming a program in Pascal is mostly just for documentation, but becomes important with linking libraries; we will not be discussing libraries in this module, so for us, the name is just documentation.

The line var greeting: string; is a variable declaration, in this case declaring the variable greeting of the type string. There could be any number of variable declarations between program and begin, but variable declarations are only allowed there, not in the body. This separation of variable declarations from subroutine code is rare in modern imperative languages, but was the norm for much of imperative language history. C only allowed mixing of declarations and statements in the C99 standard in 1999 (although it was commonly allowed in compilers regardless)!

The begin and end lines define a block, which is a list of statements, and this block forms the subroutine body. They are akin to C’s { and }, which takes some getting used to, since they don’t visually match in the same way. Statements in Pascal are separated by ;. Note that they are separated by ;, not terminated by ;, so the final statement does not need one; the same is true of Smalltalk’s dot. As in Smalltalk, a trailing separator is allowed, so we could have written writeln(greeting); if we wished. Technically, this is allowed because Pascal allows an empty statement, so if we end a statement list with ;, then we’re ending it with an empty statement, which does nothing. Like Smalltalk and OCaml, the assignment operator is :=.

Note that “block” is the common name for a statement list, but Smalltalk uses the term differently, to refer to (essentially) an anonymous function. Do not confuse the two! A block in this context is nothing more than a statement list. Blocks in Pascal (and most imperative languages) are not values, they’re just a feature of the syntax.

One kind of declaration is a subroutine declaration, so we can have subroutines as part of the environment of other subroutines. For instance, if we wanted to use a subroutine to generate the greeting, we could define it like so:

program Hello;
var greeting: string;

function genGreeting(): string;
begin
  genGreeting := 'Hello, world!'
end;

begin
  greeting := genGreeting();
  writeln(greeting)
end.

The line function genGreeting(): string; is a function declaration: it consists of the keyword function, the name of the function (in this case, genGreeting), a list of arguments surrounded by parentheses (in this case, there are no arguments), a colon, the return type of the function, and then a semicolon. So, genGreeting is a zero-argument function which returns a string; a function is, of course, a kind of subroutine. To return a value in Pascal, you simply assign the return value to the name of the function, as in genGreeting := 'Hello, world!'. This doesn’t end the subroutine, just sets its return value, so we could have added more statements after that line, and they would have run before genGreeting returned. This is unusual, but fits the goals of structured programming: one cannot simply jump out of a block (in this case a subroutine body) whenever one chooses; blocks run to completion. All declarations in a declaration list are terminated by semicolons, so there is a semicolon after the end on the last line of genGreeting to indicate the end of the declaration of genGreeting. The program declaration is not part of a declaration list, but the entire program must end with a dot.

Declaration lists form environments with lexical scoping, so subroutines declared there can see and interact with variables defined there or in any enclosing scope. For instance, we could rewrite genGreeting as a procedure (subroutine without a return value) by making it directly modify greeting, like so:

program Hello;
var greeting: string;

procedure genGreeting();
begin
  greeting := 'Hello, world!'
end;

begin
  genGreeting();
  writeln(greeting)
end.

A procedure declaration is similar to a function declaration, but with the word procedure and no return type (since there is no return).

All subroutines have declaration lists, but a declaration list may be empty. We could add declarations between the header and the begin of our genGreeting example to create local variables of genGreeting, and even nested subroutines. Just like in functional languages, each call to a subroutine will have its own set of local variables, but if, for instance, two instances of genGreeting are called, that will not create two instances of greeting, because that’s a variable of the program, not the subroutine.

Although programs are subroutines, their name does not form part of their own scope, so a program cannot call itself, and so cannot be directly recursive. Other subroutines are defined in an environment visible within the subroutine (the surrounding scope), so can be recursive. For instance, here’s a program with a recursive implementation of a factorial function, in which the main subroutine simply outputs five factorial:

program FiveFac;

function fac(x: integer): integer;
begin
  if x = 1 then
    fac := 1
  else
    fac := fac(x-1) * x
end;

begin
  writeln(fac(5))
end.

if statements behave similarly to those in other imperative languages, and have the same form: if condition then statement else statement. The else and its statement are optional. If you wish to perform multiple steps in a branch, you can use a block:

program FiveFac;

function fac(x: integer): integer;
var y: integer;
begin
  if x = 1 then
    fac := 1
  else
  begin
    y := fac(x-1);
    fac := y * x
  end
end;

begin
  writeln(fac(5))
end.

Technically speaking, because multiple statements can be nested inside an if, if is a statement, but not an imperative; an imperative is a single step.

All imperative languages have loops, rather than just recursion, and indeed, some imperative languages do not support recursion. Most imperative languages have some form of compound data type — i.e., a way for a single variable to store or reference many values — either through arrays, records, or both. We will discuss both of these features later in this module.

Pascal also has many features we will not discuss at all in this module. In particular, call-by-reference, Object Pascal, and units (libraries) are simply not in scope.

4. The Simple Imperative Language

The \( \lambda \)-calculus was a good starting point for functional languages, because it’s fundamentally built out of functions (abstractions). But, it doesn’t fit imperative languages well. Although our small-step semantics (the \( \to \) morphism) does describe its reduction in terms of steps, there is no explicit statement of an order; no imperatives. Instead, we’ll create a new fundamental language on which to build imperative concepts.

Like the \( \lambda \)-calculus, we want a language that is reasonably simple, yet captures most of our ideas about imperative programming. The language commonly used for these purposes is known as the Simple Imperative Language. There are many slight variants of the Simple Imperative Language, and the variant we will use is presented below:

Definition 1. The Simple Imperative Language consists of the strings derivable from \( \langle \mathit{stmtlist} \rangle \) below:

⟨stmtlist⟩ ::= ⟨stmt⟩
             |  ⟨stmt⟩ ; ⟨stmtlist⟩

⟨stmt⟩ ::= skip
          | begin ⟨stmtlist⟩ end
          | while ⟨boolexp⟩ do ⟨stmt⟩
          | if ⟨boolexp⟩ then ⟨stmt⟩ else ⟨stmt⟩
          | ⟨var⟩ := ⟨intexp⟩

⟨boolexp⟩ ::= true
            | false
            | not ⟨boolexp⟩
            | ⟨boolexp⟩ and ⟨boolexp⟩
            | ⟨boolexp⟩ or ⟨boolexp⟩
            | ⟨intexp⟩ > ⟨intexp⟩
            | ⟨intexp⟩ < ⟨intexp⟩
            | ⟨intexp⟩ = ⟨intexp⟩

⟨intexp⟩ ::= 0 | 1 | ···
           | ⟨var⟩
           | ⟨intexp⟩ + ⟨intexp⟩
           | ⟨intexp⟩ * ⟨intexp⟩
           | - ⟨intexp⟩

⟨var⟩ ::= a | b | c | ···

In this definition, we see familiar constructions of integer and boolean expressions, as well as a looping construction and a conditional construction. The skip statement does nothing, and is thus fairly pointless; its purpose will become clear when we discuss the semantics of the Simple Imperative Language. Notice the exclusion of parentheses and other grouping constructions. As usual, the syntax in our definition is assumed to be an abstract syntax, in which parsing has already been done, so we will use parentheses to disambiguate, but don’t need them in the described syntax. A particular concrete syntax for the Simple Imperative Language would likely include grouping constructions in some fashion. Notice also that many familiar constructions, including subtraction, have been excluded from this language. Many of these can be introduced later as additional syntax or syntactic sugar. For example, the expression \( a - b \) can be viewed as syntactic sugar for \( a + {-b} \).
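To make the desugaring idea concrete, here is a minimal sketch in Python, using a hypothetical tuple-based abstract syntax (the representation is an illustration, not part of the language definition): \( a - b \) is rewritten into \( a + {-b} \) before any semantic rules are applied.

```python
# Desugaring subtraction in a hypothetical tuple AST:
# ("sub", M1, M2) is rewritten to ("add", M1, ("neg", M2)).
# Integers and variable-name strings are left untouched.

def desugar(e):
    """Recursively rewrite a - b into a + (-b)."""
    if isinstance(e, tuple):
        if e[0] == "sub":
            return ("add", desugar(e[1]), ("neg", desugar(e[2])))
        return (e[0],) + tuple(desugar(sub) for sub in e[1:])
    return e

# x - 1 becomes x + (-1)
print(desugar(("sub", "x", 1)))  # → ('add', 'x', ('neg', 1))
```

Because desugaring happens purely on the syntax tree, the semantics never needs rules for subtraction at all.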

Also notable about the Simple Imperative Language is the absence of a goto statement. Although simply called “imperative”, the Simple Imperative Language is, more precisely, a structured language.

Formulating an operational semantics for an imperative language requires some care; imperative languages are fundamentally different from functional languages, in that they are based on the execution of commands, rather than the evaluation of expressions. Thus, we cannot formulate a semantics for an imperative language based on some “final” computed value, because there is no such value. Imperative languages are run for their side-effects.

Our formulation of an operational semantics for imperative languages must begin with an understanding of what a program in an imperative language “does”. Computation is performed by assigning values to mutable variables; the result is obtained by reading off the values of one or more of the variables. Variables are nothing more or less than named locations in a mutable store (the “memory”). Hence, an operational semantics for an imperative language should include some notion of a “store” of values.

We’ve seen a store before, \( \sigma \). But, the term “store” and the concept of a store are more general than we’ve used before: a store is simply a place to store things. In fact, our heap, \( \Sigma \), is also a kind of store, but we usually only use the term “store” to refer to stores indexed by variable names.

In functional languages, variables are always immutable (invariable?), so mutable values were relegated to a separate structure, the heap. In the semantics we defined for functional languages, \( \sigma \) would never change across the \( \to \) morphism; changes to \( \sigma \) represented the lexical scope of a particular expression, as variables were added in let bindings. So, to represent a mutable variable, we needed two links: one immutable link in \( \sigma \) to a label, and one mutable link in \( \Sigma \) from the label to a value. This is also represented in the user language: in OCaml, for instance, a user must explicitly get (with !) the value of a reference, using its label. In the Simple Imperative Language, we will simply allow \( \sigma \) to mutate. Note that as we get more complicated, however, it’s not uncommon to reintroduce this two-way link (variable in \( \sigma \) to label in \( \Sigma \)), simply because it’s difficult to make \( \sigma \) simultaneously represent both the environment, which changes as you go down the code’s syntax tree, and the mutable store, which changes across steps of reduction. The Simple Imperative Language has only a single global scope, so we can use \( \sigma \) for both, and when we introduce procedures, we’ll use a different trick to stick to just \( \sigma \).

In terms of the actual structure and operations allowed, \( \sigma \) in the Simple Imperative Language behaves the same as \( \sigma \) in previous modules.

The important values for our semantics of the Simple Imperative Language will be stores, and terminal values will be stores with no further statements to execute. That is, our program has terminated when it has run every statement, and the value it produces is simply the store. In addition to returning a store as our final answer, we must keep in mind that the meaning of a command like x := x + 1 can only be determined in the context of a store as well; we need to know the value bound to x! Thus, just like with let bindings, our semantics will operate over pairs of the form \( \langle S, \sigma \rangle \), so we will be defining the morphism \( \langle S, \sigma \rangle \to \langle S', \sigma' \rangle \), where \( S \) is a program (a \( \langle \mathit{stmtlist} \rangle \)) and \( \sigma \) is a store. As \( \sigma \) will change over \( \to \), it represents mutable state, so we can also describe this as a change in both program and state. Only the := statement actually modifies \( \sigma \), so for all other statements and expressions, if \( \langle X, \sigma \rangle \to \langle X', \sigma' \rangle \), then \( \sigma = \sigma' \). Our program will start with an empty store.

Aside (Errata): This pairing is confusingly backwards from how it was written in earlier modules, and worse yet, it is reversed again later in this module. Hopefully you can read around the reversing order of \( S \) and \( \sigma \).

Before we begin a construction of a semantics for the full language, we observe that the Simple Imperative Language possesses two expression subsets: the sublanguage of integer expressions and the sublanguage of boolean expressions.

To formulate a semantics for these subsets, we can use our semantics for numbers in the \( \lambda \)-calculus as a model:

Definition 2. (Small-Step Operational Semantics for Integer Expressions)

We identify the set of terminal values in the semantics of the sublanguage of integer expressions within the Simple Imperative Language as the set of integers. Let the metavariables \( M \), \( x \), \( \sigma \), and \( N \) range over integer expressions, integer variables, stores, and integers, respectively. A small-step semantics for integer expressions is as follows:

\[ \textbf{IntOpLeft} \quad \dfrac{op \in \{+,\,*\} \qquad \langle M_1,\,\sigma \rangle \to \langle M_1',\,\sigma \rangle} {\langle M_1\ op\ M_2,\,\sigma \rangle \to \langle M_1'\ op\ M_2,\,\sigma \rangle} \]\[ \textbf{IntOpRight} \quad \dfrac{op \in \{+,\,*\} \qquad \langle M,\,\sigma \rangle \to \langle M',\,\sigma \rangle} {\langle N\ op\ M,\,\sigma \rangle \to \langle N\ op\ M',\,\sigma \rangle} \]\[ \textbf{Add} \quad \dfrac{N_1 + N_2 = N_3} {\langle N_1 + N_2,\,\sigma \rangle \to \langle N_3,\,\sigma \rangle} \]\[ \textbf{Mul} \quad \dfrac{N_1 \cdot N_2 = N_3} {\langle N_1 * N_2,\,\sigma \rangle \to \langle N_3,\,\sigma \rangle} \]\[ \textbf{NegStep} \quad \dfrac{\langle M,\,\sigma \rangle \to \langle M',\,\sigma \rangle} {\langle {-M},\,\sigma \rangle \to \langle {-M'},\,\sigma \rangle} \]\[ \textbf{Neg} \quad \dfrac{N' = -N} {\langle {-N},\,\sigma \rangle \to \langle N',\,\sigma \rangle} \]\[ \textbf{Var} \quad \dfrac{N = \sigma(x)} {\langle x,\,\sigma \rangle \to \langle N,\,\sigma \rangle} \]

The semantics are very similar to those for natural numbers from Module 3, with a few exceptions:

  • Operators in the Simple Imperative Language are infix (\( a + b \)) instead of prefix (\( +\ a\ b \)).
  • Numbers in the Simple Imperative Language are integers, not naturals, so there are no special cases.
  • We’ve abbreviated the repetitive rules for reducing the left before the right into IntOpLeft and IntOpRight, which match any binary operation.

The integer expression sublanguage is purely functional, and so \( \sigma \) is never changed. We could have defined \( \to \) without the \( \sigma \) on the right at all, since it’s guaranteed not to change, but this makes the definition of \( \to^* \) awkward, so we’ve defined it with both sides having the same form. Ultimately, what we care about for a Simple Imperative Language program is how it changes the store, but as the integer sublanguage does not change the store, for it, we care about the value generated, just like in functional languages. Generally speaking, expression sublanguages of all imperative languages behave in this way.
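These rules are mechanical enough to transcribe directly. Here is a sketch in Python, over a hypothetical tuple AST (integers, variable-name strings, and ("add", M1, M2), ("mul", M1, M2), ("neg", M) nodes); the store \( \sigma \) is a dict, and it is never modified, matching the purely functional character of the sublanguage.

```python
# One small step of the integer-expression rules. A KeyError on an
# undefined variable models the semantics "getting stuck".

def step_int(m, sigma):
    """One application of IntOpLeft/IntOpRight/Add/Mul/NegStep/Neg/Var."""
    if isinstance(m, str):                        # Var
        return sigma[m]
    op, *args = m
    if op in ("add", "mul"):
        m1, m2 = args
        if not isinstance(m1, int):               # IntOpLeft
            return (op, step_int(m1, sigma), m2)
        if not isinstance(m2, int):               # IntOpRight
            return (op, m1, step_int(m2, sigma))
        return m1 + m2 if op == "add" else m1 * m2    # Add / Mul
    if op == "neg":
        (m1,) = args
        if not isinstance(m1, int):               # NegStep
            return ("neg", step_int(m1, sigma))
        return -m1                                # Neg

def eval_int(m, sigma):
    """The reflexive-transitive closure: step until a terminal integer."""
    while not isinstance(m, int):
        m = step_int(m, sigma)
    return m

print(eval_int(("add", "x", ("mul", 2, 3)), {"x": 1}))  # → 7
```

Note how IntOpLeft fires before IntOpRight: the left operand must be fully reduced to an integer before the right operand is touched, exactly as the metavariable \( N \) in IntOpRight requires.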

Note that if \( \sigma(x) \) is not defined for a particular variable \( x \) in an application of the semantic rules, these rules get stuck. That is the only circumstance under which these rules can get stuck: as we will see when we get to :=, all variables store integers in the Simple Imperative Language, so there’s no possibility of any other type error.

The formulation of a semantics for boolean expressions is similar:

Definition 3. (Small-Step Semantics for Boolean Expressions)

We identify the set of terminal values in the semantics of the sublanguage of boolean expressions within the Simple Imperative Language as the two-element set \( \mathbb{B} = \{\mathtt{true},\,\mathtt{false}\} \), which is the set of boolean values. Let the metavariables \( B \), \( M \), \( \sigma \), \( N \), and \( V \) range over boolean expressions, integer expressions, stores, integer values, and boolean values, respectively. Then a small-step semantics for boolean expressions is as follows:

\[ \textbf{BoolOpLeft} \quad \dfrac{op \in \{>, <, =\} \qquad \langle M_1,\,\sigma \rangle \to \langle M_1',\,\sigma \rangle} {\langle M_1\ op\ M_2,\,\sigma \rangle \to \langle M_1'\ op\ M_2,\,\sigma \rangle} \]\[ \textbf{BoolOpRight} \quad \dfrac{op \in \{>, <, =\} \qquad \langle M,\,\sigma \rangle \to \langle M',\,\sigma \rangle} {\langle N\ op\ M,\,\sigma \rangle \to \langle N\ op\ M',\,\sigma \rangle} \]\[ \textbf{GtTrue} \quad \dfrac{N_1 > N_2}{\langle N_1 > N_2,\,\sigma \rangle \to \langle \mathtt{true},\,\sigma \rangle} \qquad \textbf{GtFalse} \quad \dfrac{N_1 \leq N_2}{\langle N_1 > N_2,\,\sigma \rangle \to \langle \mathtt{false},\,\sigma \rangle} \]\[ \textbf{LtTrue} \quad \dfrac{N_1 < N_2}{\langle N_1 < N_2,\,\sigma \rangle \to \langle \mathtt{true},\,\sigma \rangle} \qquad \textbf{LtFalse} \quad \dfrac{N_1 \geq N_2}{\langle N_1 < N_2,\,\sigma \rangle \to \langle \mathtt{false},\,\sigma \rangle} \]\[ \textbf{EqTrue} \quad \dfrac{N_1 = N_2}{\langle N_1 = N_2,\,\sigma \rangle \to \langle \mathtt{true},\,\sigma \rangle} \qquad \textbf{EqFalse} \quad \dfrac{N_1 \neq N_2}{\langle N_1 = N_2,\,\sigma \rangle \to \langle \mathtt{false},\,\sigma \rangle} \]\[ \textbf{NotSub} \quad \dfrac{\langle B,\,\sigma \rangle \to \langle B',\,\sigma \rangle}{\langle \mathtt{not}\ B,\,\sigma \rangle \to \langle \mathtt{not}\ B',\,\sigma \rangle} \]\[ \textbf{NotTrue} \quad \langle \mathtt{not}\ \mathtt{true},\,\sigma \rangle \to \langle \mathtt{false},\,\sigma \rangle \qquad \textbf{NotFalse} \quad \langle \mathtt{not}\ \mathtt{false},\,\sigma \rangle \to \langle \mathtt{true},\,\sigma \rangle \]\[ \textbf{AndLeft} \quad \dfrac{\langle B_1,\,\sigma \rangle \to \langle B_1',\,\sigma \rangle} {\langle B_1\ \mathtt{and}\ B_2,\,\sigma \rangle \to \langle B_1'\ \mathtt{and}\ B_2,\,\sigma \rangle} \]\[ \textbf{AndSC} \quad \langle \mathtt{false}\ \mathtt{and}\ B,\,\sigma \rangle \to \langle \mathtt{false},\,\sigma \rangle \qquad \textbf{AndRight} \quad \langle \mathtt{true}\ 
\mathtt{and}\ B,\,\sigma \rangle \to \langle B,\,\sigma \rangle \]\[ \textbf{OrLeft} \quad \dfrac{\langle B_1,\,\sigma \rangle \to \langle B_1',\,\sigma \rangle} {\langle B_1\ \mathtt{or}\ B_2,\,\sigma \rangle \to \langle B_1'\ \mathtt{or}\ B_2,\,\sigma \rangle} \]\[ \textbf{OrSC} \quad \langle \mathtt{true}\ \mathtt{or}\ B,\,\sigma \rangle \to \langle \mathtt{true},\,\sigma \rangle \qquad \textbf{OrRight} \quad \langle \mathtt{false}\ \mathtt{or}\ B,\,\sigma \rangle \to \langle B,\,\sigma \rangle \]

The rules for and, or, and not should be familiar from Module 3, with the following changes:

  • Like integer operations, boolean binary operations are infix (\( a\ \mathtt{or}\ b \)), not prefix (\( \mathtt{or}\ a\ b \)).
  • Short-circuiting (only evaluating the right operand of and/or if necessary) is described slightly differently, though with the same results.

In addition, these rules add integer comparisons. Note that the subexpressions of a comparison operator (\( <, >, = \)) must, by the syntax of the Simple Imperative Language, be integer expressions. We will see later that our rules do not allow variables to hold booleans, so integer expressions can only evaluate to integers. This creates a strict stratification of expressions: boolean expressions may contain integer expressions, but integer expressions may not contain boolean expressions.
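The stratification is visible if we transcribe the rules into code. The sketch below (Python, hypothetical tuple AST) collapses the small-step rules into one recursive evaluation for brevity: comparisons evaluate integer subexpressions, and Python's own and/or conveniently mirrors the short-circuit rules AndSC/OrSC, never touching the right operand unless needed.

```python
# Big-step transcription of the expression sublanguages. eval_bool calls
# eval_int for comparisons, but never the other way around: boolean
# expressions may contain integer expressions, not vice versa.

def eval_int(m, sigma):
    if isinstance(m, int): return m
    if isinstance(m, str): return sigma[m]        # Var (stuck if undefined)
    op, *a = m
    if op == "add": return eval_int(a[0], sigma) + eval_int(a[1], sigma)
    if op == "mul": return eval_int(a[0], sigma) * eval_int(a[1], sigma)
    if op == "neg": return -eval_int(a[0], sigma)

def eval_bool(b, sigma):
    if isinstance(b, bool): return b
    op, *a = b
    if op == "not":                               # NotTrue / NotFalse
        return not eval_bool(a[0], sigma)
    if op == "and":                               # AndSC / AndRight
        return eval_bool(a[0], sigma) and eval_bool(a[1], sigma)
    if op == "or":                                # OrSC / OrRight
        return eval_bool(a[0], sigma) or eval_bool(a[1], sigma)
    n1, n2 = eval_int(a[0], sigma), eval_int(a[1], sigma)
    return {"gt": n1 > n2, "lt": n1 < n2, "eq": n1 == n2}[op]

print(eval_bool(("and", ("lt", "x", 2), ("gt", 5, 1)), {"x": 0}))  # → True
```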

The rules in the above definitions provide us with a complete small-step semantics for the functional subset of the Simple Imperative Language. We may now consider the formulation of a semantics for the language as a whole:

Definition 4. (Small-Step Semantics for the Simple Imperative Language)

The set of terminal values in the Simple Imperative Language is the set of pairs of the skip statement and a store, i.e., \( \langle \mathtt{skip},\,\sigma \rangle \). We will define reductions over both statement lists and individual statements. Let the metavariables \( B \), \( M \), \( \sigma \), \( V \), \( N \), \( Q \), and \( L \) range over boolean expressions, integer expressions, stores, boolean values, integer values, statements, and statement lists, respectively. Then a small-step semantics for the Simple Imperative Language is as follows:

\[ \textbf{BlockOnly} \quad \langle \mathtt{begin}\ L\ \mathtt{end},\,\sigma \rangle \to \langle L,\,\sigma \rangle \]\[ \textbf{BlockRest} \quad \langle \mathtt{begin}\ L_1\ \mathtt{end};\ L_2,\,\sigma \rangle \to \langle L_1;\ L_2,\,\sigma \rangle \]\[ \textbf{Skip} \quad \langle \mathtt{skip};\ L,\,\sigma \rangle \to \langle L,\,\sigma \rangle \]\[ \textbf{StmtList} \quad \dfrac{\langle Q,\,\sigma \rangle \to \langle Q',\,\sigma' \rangle} {\langle Q;\ L,\,\sigma \rangle \to \langle Q';\ L,\,\sigma' \rangle} \]\[ \textbf{AssignStep} \quad \dfrac{\langle M,\,\sigma \rangle \to \langle M',\,\sigma \rangle} {\langle x := M,\,\sigma \rangle \to \langle x := M',\,\sigma \rangle} \]\[ \textbf{Assign} \quad \dfrac{\sigma' = \sigma[x \mapsto N]} {\langle x := N,\,\sigma \rangle \to \langle \mathtt{skip},\,\sigma' \rangle} \]\[ \textbf{IfCond} \quad \dfrac{\langle B,\,\sigma \rangle \to \langle B',\,\sigma \rangle} {\langle \mathtt{if}\ B\ \mathtt{then}\ Q_1\ \mathtt{else}\ Q_2,\,\sigma \rangle \to \langle \mathtt{if}\ B'\ \mathtt{then}\ Q_1\ \mathtt{else}\ Q_2,\,\sigma \rangle} \]\[ \textbf{IfTrue} \quad \langle \mathtt{if}\ \mathtt{true}\ \mathtt{then}\ Q_1\ \mathtt{else}\ Q_2,\,\sigma \rangle \to \langle Q_1,\,\sigma \rangle \]\[ \textbf{IfFalse} \quad \langle \mathtt{if}\ \mathtt{false}\ \mathtt{then}\ Q_1\ \mathtt{else}\ Q_2,\,\sigma \rangle \to \langle Q_2,\,\sigma \rangle \]\[ \textbf{While} \quad \langle \mathtt{while}\ B\ \mathtt{do}\ Q,\,\sigma \rangle \;\to\; \langle \mathtt{if}\ B\ \mathtt{then}\ \mathtt{begin}\ Q;\ \mathtt{while}\ B\ \mathtt{do}\ Q\ \mathtt{end}\ \mathtt{else}\ \mathtt{skip},\,\sigma \rangle \]

We’ve used \( Q \) instead of \( S \) for statements because \( S \) has previously been used for substitutions, and we will soon be using substitutions in the Simple Imperative Language, though they aren’t needed yet.

The BlockRest, BlockOnly, Skip, and StmtList rules are over statement lists; the rest are over individual statements. BlockRest and BlockOnly simply describe the destructuring of a block: if we have a block as a statement, we simply replace it with its constituent statements. There are two such cases because there are two forms for \( \langle \mathit{stmtlist} \rangle \). The Skip rule tells us that the skip statement does nothing, so if it is the first statement, it will simply be removed. The StmtList rule specifies that if there is a reduction rule for the first statement in a statement list, then we can use that to reduce the first statement in the list. The individual statement reduction rules are designed to eventually reduce to skip, so that multiple steps of the StmtList (and sometimes Block*) rules will apply, and then a final Skip when the statement is done.

The AssignStep rule reduces the right-hand side of an assignment statement. Syntactically, the right-hand side of an assignment statement can only be an integer expression, so assuming the reduction doesn’t get stuck, it will always reduce to an integer. The Assign rule then assigns that integer to a variable, by replacing \( \sigma \) with a modified \( \sigma' \), which is \( \sigma \) with the mapping for the variable to its value added.

The IfCond rule takes a single step over the condition of an if statement. Multiple steps of IfCond will be taken until the condition is either true or false. At that point, the IfTrue or IfFalse rule, respectively, will replace the if statement with the sub-statement which should actually be executed; the first for true, the second for false. Other rules may then continue to reduce the statement.

4.1 Loops

The While rule is unusual, in that it “reduces” a while statement into a longer if statement, which actually contains the original while statement! To understand why this works, and examine the rest of the reductions while we’re at it, let’s run a simple program that increments the variable x from 0 to 2 in a loop. The resulting reduction steps are shown in Figure 1. Since a while statement reduces to an if, its body is only run conditionally. Since the whole while statement is at the end of the then case of the if statement, after the loop body finishes running, the while statement simply runs again. This cycle continues until the generated if statement’s condition is false, at which point it reduces to skip.

Of course, a real implementation of an imperative language doesn’t mutate the statement list in this way, but this is a reasonable description of the steps to performing a loop. Each iteration is a simple conditional, and it is the fact that the condition ends by repeating the loop that defines while’s behavior.

Figure 1: Reduction steps for x := 0; while x < 2 do x := x + 1

\[ \begin{aligned} &\langle x := 0;\ \mathtt{while}\ x < 2\ \mathtt{do}\ x := x + 1,\ \{\} \rangle \\ \xrightarrow{\text{Assign}} \;&\langle \mathtt{skip};\ \mathtt{while}\ x < 2\ \mathtt{do}\ x := x + 1,\ \{x \mapsto 0\} \rangle \\ \xrightarrow{\text{Skip}} \;&\langle \mathtt{while}\ x < 2\ \mathtt{do}\ x := x + 1,\ \{x \mapsto 0\} \rangle \\ \xrightarrow{\text{While}} \;&\langle \mathtt{if}\ x < 2\ \mathtt{then}\ \mathtt{begin}\ x := x+1;\ \mathtt{while}\ x < 2\ \mathtt{do}\ x := x+1\ \mathtt{end}\ \mathtt{else}\ \mathtt{skip},\ \{x \mapsto 0\} \rangle \\ \xrightarrow{\text{Var, LtTrue}}^* \;&\langle \mathtt{if}\ \mathtt{true}\ \mathtt{then}\ \mathtt{begin}\ x := x+1;\ \mathtt{while}\ x < 2\ \mathtt{do}\ x := x+1\ \mathtt{end}\ \mathtt{else}\ \mathtt{skip},\ \{x \mapsto 0\} \rangle \\ \xrightarrow{\text{IfTrue}} \;&\langle \mathtt{begin}\ x := x+1;\ \mathtt{while}\ x < 2\ \mathtt{do}\ x := x+1\ \mathtt{end},\ \{x \mapsto 0\} \rangle \\ \xrightarrow{\text{BlockOnly}} \;&\langle x := x+1;\ \mathtt{while}\ x < 2\ \mathtt{do}\ x := x+1,\ \{x \mapsto 0\} \rangle \\ \xrightarrow{\text{Var}} \;&\langle x := 0+1;\ \mathtt{while}\ x < 2\ \mathtt{do}\ x := x+1,\ \{x \mapsto 0\} \rangle \\ \xrightarrow{\text{Add}} \;&\langle x := 1;\ \mathtt{while}\ x < 2\ \mathtt{do}\ x := x+1,\ \{x \mapsto 0\} \rangle \\ \xrightarrow{\text{Assign, Skip}}^* \;&\langle \mathtt{while}\ x < 2\ \mathtt{do}\ x := x+1,\ \{x \mapsto 1\} \rangle \\ \xrightarrow{\text{While}} \;&\langle \mathtt{if}\ x < 2\ \mathtt{then}\ \mathtt{begin}\ x := x+1;\ \mathtt{while}\ x < 2\ \mathtt{do}\ x := x+1\ \mathtt{end}\ \mathtt{else}\ \mathtt{skip},\ \{x \mapsto 1\} \rangle \\ \xrightarrow{\text{Var, LtTrue}}^* \;&\langle \mathtt{if}\ \mathtt{true}\ \mathtt{then}\ \mathtt{begin}\ x := x+1;\ \mathtt{while}\ x < 2\ \mathtt{do}\ x := x+1\ \mathtt{end}\ \mathtt{else}\ \mathtt{skip},\ \{x \mapsto 1\} \rangle \\ \xrightarrow{\text{IfTrue}} \;&\langle \mathtt{begin}\ x := x+1;\ \mathtt{while}\ x < 2\ \mathtt{do}\ x := 
x+1\ \mathtt{end},\ \{x \mapsto 1\} \rangle \\ \xrightarrow{\text{BlockOnly}} \;&\langle x := x+1;\ \mathtt{while}\ x < 2\ \mathtt{do}\ x := x+1,\ \{x \mapsto 1\} \rangle \\ \xrightarrow{\text{Var, Add, Assign, Skip}}^* \;&\langle \mathtt{while}\ x < 2\ \mathtt{do}\ x := x+1,\ \{x \mapsto 2\} \rangle \\ \xrightarrow{\text{While}} \;&\langle \mathtt{if}\ x < 2\ \mathtt{then}\ \mathtt{begin}\ x := x+1;\ \mathtt{while}\ x < 2\ \mathtt{do}\ x := x+1\ \mathtt{end}\ \mathtt{else}\ \mathtt{skip},\ \{x \mapsto 2\} \rangle \\ \xrightarrow{\text{Var}} \;&\langle \mathtt{if}\ 2 < 2\ \mathtt{then}\ \mathtt{begin}\ x := x+1;\ \mathtt{while}\ x < 2\ \mathtt{do}\ x := x+1\ \mathtt{end}\ \mathtt{else}\ \mathtt{skip},\ \{x \mapsto 2\} \rangle \\ \xrightarrow{\text{LtFalse}} \;&\langle \mathtt{if}\ \mathtt{false}\ \mathtt{then}\ \mathtt{begin}\ x := x+1;\ \mathtt{while}\ x < 2\ \mathtt{do}\ x := x+1\ \mathtt{end}\ \mathtt{else}\ \mathtt{skip},\ \{x \mapsto 2\} \rangle \\ \xrightarrow{\text{IfFalse}} \;&\langle \mathtt{skip},\ \{x \mapsto 2\} \rangle \end{aligned} \]
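The reduction in Figure 1 can be mechanized. Below is a sketch in Python over a hypothetical tuple AST for statements: ("skip",), ("seq", Q, L), ("assign", x, M), ("if", B, Q1, Q2), and ("while", B, Q). For brevity, expressions are evaluated in one go rather than reduced stepwise, and blocks are represented directly by nested "seq" nodes, but the statement-level steps follow the rules above, including the While-to-if expansion.

```python
# A small-step interpreter for (a fragment of) the Simple Imperative
# Language. Only enough expression forms for the Figure 1 example are
# included; the store sigma is a dict, copied on assignment.

def ev(e, s):
    """Evaluate an expression in one go (collapsing the small steps)."""
    if isinstance(e, (int, bool)): return e
    if isinstance(e, str): return s[e]
    op, *a = e
    if op == "add": return ev(a[0], s) + ev(a[1], s)
    if op == "lt":  return ev(a[0], s) < ev(a[1], s)

def step(q, s):
    """One small step <Q, sigma> -> <Q', sigma'>."""
    op = q[0]
    if op == "seq":
        q1, rest = q[1], q[2]
        if q1 == ("skip",):                       # Skip
            return rest, s
        q1, s = step(q1, s)                       # StmtList
        return ("seq", q1, rest), s
    if op == "assign":                            # AssignStep + Assign
        s2 = dict(s); s2[q[1]] = ev(q[2], s)
        return ("skip",), s2
    if op == "if":                                # IfCond + IfTrue/IfFalse
        return (q[2] if ev(q[1], s) else q[3]), s
    if op == "while":                             # While
        b, body = q[1], q[2]
        return ("if", b, ("seq", body, q), ("skip",)), s

def run(q, s):
    """Step until the terminal <skip, sigma>; the result is the store."""
    while q != ("skip",):
        q, s = step(q, s)
    return s

# x := 0; while x < 2 do x := x + 1
prog = ("seq", ("assign", "x", 0),
        ("while", ("lt", "x", 2), ("assign", "x", ("add", "x", 1))))
print(run(prog, {}))  # → {'x': 2}
```

Starting from the empty store, the program terminates with the store mapping x to 2, exactly the terminal pair reached at the bottom of Figure 1.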

while loops in Pascal have the same form as while loops in the Simple Imperative Language. Pascal also has two other kinds of loops (for-do loops and repeat-until loops), but both can easily be viewed as syntactic sugar for while loops, so we will not discuss them here.
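The repeat-until case of that desugaring can be sketched concretely (Python, hypothetical tuple AST): repeat Q until B runs the body once, then keeps running it while the condition is still false, i.e., it becomes begin Q; while not B do Q end.

```python
# Viewing Pascal's repeat-until as syntactic sugar over while, in a
# hypothetical tuple AST: ("seq", Q1, Q2), ("while", B, Q), ("not", B).

def desugar_repeat(body, cond):
    """repeat body until cond  ==>  begin body; while not cond do body end"""
    return ("seq", body, ("while", ("not", cond), body))

print(desugar_repeat(("assign", "x", 0), ("eq", "x", 0)))
```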

4.2 The Initialization Problem

The Simple Imperative Language is quite nearly type safe, in spite of not having any explicit types. This is because of our expression stratification; although there exist boolean and integer values in the language, variables can only contain integers, and it is syntactically impossible to write the wrong type. We will lose this almost-type-safety when we introduce other primitives into the language, but we can also examine why we don’t have true type safety now.

The Simple Imperative Language is not quite type safe, as demonstrated by this simple program which gets stuck: x := y. Since we never defined y, the premise for the Var reduction doesn’t match, and so the reduction is stuck. Not all variables which are used are guaranteed to be in the store. As with all cases where semantics get stuck, “getting stuck” isn’t meaningful for a real language implementation, but it is an indication of a real problem: we haven’t guaranteed that variables are initialized before they are used. This is called the initialization problem, and is a surprisingly pervasive problem in the definition of imperative languages.

There are four solutions (of which the first is a non-solution) to the initialization problem: garbage values, default values, forced initialization, and static analysis.

Garbage values. A language that uses garbage values leaves the value of an uninitialized variable unpredictable and has no guarantees whatsoever, because its value is simply whatever was in a slot of memory before the variable was assigned to that space. Pascal and, famously, C and C++ use garbage values, so if you fail to initialize a variable, it could have any unpredictable value. With a data type such as integers, garbage values are harmless in terms of type safety, since all sequences of bits can be interpreted as valid integers. This is not true of other data types, however, so garbage values are generally not safe. Formal semantics usually do not model garbage values, because formal semantics usually do not actually model RAM, but it is possible with a rule like this one:

\[ \textbf{GarbageVar} \quad \dfrac{x \notin \sigma \qquad N \in \mathbb{Z} \qquad \sigma' = \sigma[x \mapsto N]} {\langle x,\,\sigma \rangle \to \langle N,\,\sigma' \rangle} \]

It is exceedingly rare for a semantics to have a rule like this one, because it’s non-deterministic! This allows us to take the step \( \langle x,\,\sigma \rangle \to \langle N,\,\sigma' \rangle \) for any integer \( N \), and any such step is valid. Of course, that accurately models the actual behavior of such a language, so as rare as it is, it is correct; formal semantics authors just usually don’t want non-deterministic semantics.
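One way to get a feel for the non-determinism is to implement GarbageVar with an actual source of arbitrary values. The sketch below (Python; a model for illustration, since real implementations just read whatever bits happen to occupy the variable's memory) picks a random integer on the first read and records it in the store, matching the rule's \( \sigma' = \sigma[x \mapsto N] \).

```python
import random

# A model of the GarbageVar rule: reading an undefined variable yields an
# arbitrary integer N, and the store is updated so later reads agree.

def read_var(x, sigma):
    if x not in sigma:
        # any integer N makes this a valid step; the choice is arbitrary
        sigma[x] = random.randint(-2**31, 2**31 - 1)
    return sigma[x]

sigma = {}
n = read_var("x", sigma)
print(n == read_var("x", sigma))  # → True: garbage, but stable once read
```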

Aside: Another way to semantically define garbage values is to semantically model memory. We’ll look at this option again in Module 10.

Default values. In a language with default values, every type has a default value (or, if there are no types, there is a single default value), and every variable is assigned that default value until it is initialized. For instance, we could represent default values in the Simple Imperative Language by adding a rule to handle uninitialized variables, like so:

\[ \textbf{UninitVar} \quad \dfrac{x \notin \sigma} {\langle x,\,\sigma \rangle \to \langle 0,\,\sigma \rangle} \]

Languages like Java and JavaScript use default values.
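In implementation terms, the UninitVar rule amounts to a defaulted lookup. A minimal sketch (Python, with the store as a dict): a variable missing from the store simply reads as 0, and the store itself is left unchanged, just as the rule's conclusion keeps \( \sigma \) on both sides.

```python
# The UninitVar rule as a defaulted store lookup: undefined variables
# read as the default value 0, without modifying the store.

def read_var(x, sigma):
    return sigma.get(x, 0)

print(read_var("y", {}))  # → 0
```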

Forced initialization. An example of forced initialization is the let bindings we saw in functional languages. The syntax of a let binding forces the programmer to give a value to the variable, and the variable is only defined in the scope of the let binding, so the initialization problem simply never arises. Forced initialization has its drawbacks, however: it makes recursive data types (such as a circular linked list) extremely hard or impossible to define. Forced initialization is uncommon among imperative languages, and common among functional languages.

Static analysis. Static analysis is a general name for analyzing code. In some cases, it’s possible to inspect code and reject programs which might access variables before they’ve been initialized. It is rare for a language to make such static analysis part of the definition of the language, because that means the language is defined by a particular algorithm, but it is common to use static analysis to provide warnings for the possibility of an uninitialized variable. Most C, C++, and Pascal compilers warn programmers about possibly-uninitialized variables using static analysis.

You may wonder why we don’t simply demand that for any variable x, an assignment x := M appear before any use of x. The problem is the definition of “before”. With conditionals and loops, it’s not always obvious whether a given statement will definitely happen before another, and when we add procedures, arrays, and records, this problem will only get worse. In the Simple Imperative Language, we could probably become safe by demanding that x be initialized in a non-conditional statement before its first use, which is actually an extremely simplistic form of static analysis.
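That extremely simplistic analysis can be sketched directly. The code below (Python, over a hypothetical tuple AST, and restricted to flat lists of assignment statements to sidestep the "before" problem entirely) accepts a program only if every variable read was previously the target of an unconditional assignment.

```python
# A deliberately simplistic initialization check: in a flat list of
# ("assign", x, M) statements, every variable used in M must already
# have been assigned by an earlier statement.

def vars_used(e):
    """Collect the variable names appearing in an expression."""
    if isinstance(e, str):
        return {e}
    if isinstance(e, tuple):
        out = set()
        for sub in e[1:]:
            out |= vars_used(sub)
        return out
    return set()

def check_init(stmts):
    """Reject any use of a variable before its first assignment."""
    defined = set()
    for _, x, m in stmts:
        if not vars_used(m) <= defined:
            return False
        defined.add(x)          # x is definitely initialized from here on
    return True

print(check_init([("assign", "x", 0), ("assign", "y", ("add", "x", 1))]))  # → True
print(check_init([("assign", "y", ("add", "x", 1))]))                      # → False
```

Extending this to conditionals and loops is exactly where the difficulty lies: an assignment inside an if branch cannot be counted as definite initialization, which is why realistic analyses are conservative and only warn.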

4.3 Types

As has been discussed, the Simple Imperative Language has few type errors, but does have an initialization problem. We can nonetheless describe a type judgment for it, and to make our type judgment sound, we will write it with something similar to the simplistic static analysis that ended the last subsection. Note that we have no explicit statement of types in the Simple Imperative Language, but explicit types are not actually necessary to perform type judgment.

Recall that our type judgments have previously been of the form \( \Gamma \vdash E : \tau \), where \( \Gamma \) is a type environment, \( E \) is an expression, and \( \tau \) is the determined type of that expression. But, what if instead of an expression, we had a statement? Statements do not actually have a type; statements have an action. In this case, we simply write a type judgment that does not yield a type; we can judge a statement to be well-typed, but cannot actually give a type to the statement, since it does not yield a value. We therefore write type judgments of statements as \( \Gamma \vdash Q \). In our type judgment for the Simple Imperative Language, we will see both forms, since we have both expressions and statements.

As an additional simplification, we will assume that all programs consist of a statement list in which the last statement is explicitly specified as skip. All programs can be rewritten like this simply by adding ; skip to the end. The reason for this is that our type environment will carry information from the first statement in a statement list to the rest of the statement list, which wasn’t necessary for our semantics, so structurally we want to guarantee that the rest of the statement list will always be there.

First, we’ll need a type language. We only have two types, so our type language is insultingly simple:

⟨type⟩ ::= int | bool

Every value is either an integer (int) or a boolean (bool).

Now, let’s type integer expressions.

Definition 5. (Type Judgment for Integer Expressions)

Let the metavariables \( \Gamma \), \( \tau \), \( M \), \( x \), and \( N \) range over type environments, types, integer expressions, variables, and integers. A type judgment for integer expressions is as follows:

\[ \textbf{T_IntBinOp} \quad \dfrac{op \in \{+,\,*\} \qquad \Gamma \vdash M_1 : \mathtt{int} \qquad \Gamma \vdash M_2 : \mathtt{int}} {\Gamma \vdash M_1\ op\ M_2 : \mathtt{int}} \]\[ \textbf{T_Neg} \quad \dfrac{\Gamma \vdash M : \mathtt{int}} {\Gamma \vdash {-M} : \mathtt{int}} \qquad \textbf{T_Int} \quad \Gamma \vdash N : \mathtt{int} \]\[ \textbf{T_Var} \quad \dfrac{\Gamma(x) = \tau} {\Gamma \vdash x : \tau} \]

Again, quite simple: since integer expressions are syntactically isolated, all we need to do is check that all subexpressions type correctly. Even that is only necessary because T_Var requires that the variable actually be defined.

Next, let’s type boolean expressions:

Definition 6. (Type Judgment for Boolean Expressions)

Let the metavariables \( \Gamma \), \( \tau \), \( B \), and \( M \) range over type environments, types, boolean expressions, and integer expressions. A type judgment for boolean expressions is as follows:

\[ \textbf{T_CmpOp} \quad \dfrac{op \in \{>, <, =\} \qquad \Gamma \vdash M_1 : \mathtt{int} \qquad \Gamma \vdash M_2 : \mathtt{int}} {\Gamma \vdash M_1\ op\ M_2 : \mathtt{bool}} \]\[ \textbf{T_BoolBinOp} \quad \dfrac{op \in \{\mathtt{and},\,\mathtt{or}\} \qquad \Gamma \vdash B_1 : \mathtt{bool} \qquad \Gamma \vdash B_2 : \mathtt{bool}} {\Gamma \vdash B_1\ op\ B_2 : \mathtt{bool}} \]\[ \textbf{T_Not} \quad \dfrac{\Gamma \vdash B : \mathtt{bool}} {\Gamma \vdash \mathtt{not}\ B : \mathtt{bool}} \qquad \textbf{T_True} \quad \Gamma \vdash \mathtt{true} : \mathtt{bool} \qquad \textbf{T_False} \quad \Gamma \vdash \mathtt{false} : \mathtt{bool} \]

Again, the rules are straightforward. Comparisons are well-typed if their operands are both integers. Other boolean operations are well-typed if their subexpressions are booleans.
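These rules translate almost mechanically into a checker. Below is a hedged Python sketch (the tuple encoding of expressions is our own, not part of the course’s definitions) that computes the type of an expression under an environment Γ, raising a TypeError wherever no rule applies:

```python
# Type checker for SIL expressions, following T_IntBinOp, T_Neg, T_Int,
# T_Var, T_CmpOp, T_BoolBinOp, T_Not, T_True and T_False.
# Hypothetical encoding: integer and boolean constants are Python
# int/bool literals, variables are strings, operations are tuples.

def type_of(gamma, e):
    """Return "int" or "bool", or raise TypeError if e is ill-typed."""
    if isinstance(e, bool):                      # T_True / T_False
        return "bool"                            # (bool before int: in
    if isinstance(e, int):                       #  Python, True is an int)
        return "int"                             # T_Int
    if isinstance(e, str):                       # T_Var
        if e not in gamma:
            raise TypeError(f"undefined variable {e}")
        return gamma[e]
    op = e[0]
    def expect(sub, tau):
        if type_of(gamma, sub) != tau:
            raise TypeError(f"expected {tau} in {e}")
    if op in ("+", "*"):                         # T_IntBinOp
        expect(e[1], "int"); expect(e[2], "int")
        return "int"
    if op == "neg":                              # T_Neg
        expect(e[1], "int")
        return "int"
    if op in (">", "<", "="):                    # T_CmpOp
        expect(e[1], "int"); expect(e[2], "int")
        return "bool"
    if op in ("and", "or"):                      # T_BoolBinOp
        expect(e[1], "bool"); expect(e[2], "bool")
        return "bool"
    if op == "not":                              # T_Not
        expect(e[1], "bool")
        return "bool"
    raise TypeError(f"unknown expression {e}")
```

For example, `type_of({"x": "int"}, ("and", (">", "x", 0), True))` judges the expression well-typed at bool, while `type_of({}, "y")` fails, mirroring how T_Var requires the variable to be in Γ.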

Finally, the type judgment for the whole program:

Definition 7. (Type Judgment for the Simple Imperative Language)

Let the metavariables \( \Gamma \), \( \tau \), \( B \), \( M \), \( Q \), and \( L \) range over type environments, types, boolean expressions, integer expressions, statements, and statement lists. A type judgment for Simple Imperative Language programs suffixed with ; skip follows:

\[ \textbf{T_SkipOnly} \quad \Gamma \vdash \mathtt{skip} \qquad \textbf{T_SkipRest} \quad \dfrac{\Gamma \vdash L} {\Gamma \vdash \mathtt{skip};\ L} \]\[ \textbf{T_Assign} \quad \dfrac{\Gamma \vdash M : \tau \qquad \forall \tau_1.\, \langle x, \tau_1 \rangle \notin \Gamma \qquad \langle x, \tau \rangle + \Gamma \vdash L} {\Gamma \vdash x := M;\ L} \]\[ \textbf{T_Reassign} \quad \dfrac{\Gamma \vdash M : \tau \qquad \Gamma(x) = \tau \qquad \Gamma \vdash L} {\Gamma \vdash x := M;\ L} \]\[ \textbf{T_If} \quad \dfrac{\Gamma \vdash B : \mathtt{bool} \qquad \Gamma \vdash Q_1;\ \mathtt{skip} \qquad \Gamma \vdash Q_2;\ \mathtt{skip} \qquad \Gamma \vdash L} {\Gamma \vdash \mathtt{if}\ B\ \mathtt{then}\ Q_1\ \mathtt{else}\ Q_2;\ L} \]\[ \textbf{T_While} \quad \dfrac{\Gamma \vdash B : \mathtt{bool} \qquad \Gamma \vdash Q;\ \mathtt{skip} \qquad \Gamma \vdash L} {\Gamma \vdash \mathtt{while}\ B\ \mathtt{do}\ Q;\ L} \]\[ \textbf{T_Block} \quad \dfrac{\Gamma \vdash L_1 \qquad \Gamma \vdash L_2} {\Gamma \vdash \mathtt{begin}\ L_1\ \mathtt{end};\ L_2} \]

Most of these rules are fairly trivial: a statement types if its sub-statements and subexpressions type. T_Assign and T_Reassign are the interesting cases, but let’s quickly look at the other cases. Our statement list is guaranteed to be terminated by skip, so only T_SkipOnly and T_SkipRest are concerned with the possibility of not having a “rest” of the statement list. Everything else assumes that all statement lists are of the form \( Q;\ L \). To make this work with the sub-statements in if and while statements, we explicitly append ; skip, turning a statement into a statement list.

Now, let’s examine T_Assign and T_Reassign. These are the only type judgments that affect \( \Gamma \). T_Assign is for initial assignments, and so checks that the assigned variable \( x \) is not present in \( \Gamma \) (i.e., \( \forall \tau_1.\, \langle x, \tau_1 \rangle \notin \Gamma \): there is no type associated with \( x \) in \( \Gamma \)). It judges the remaining statements \( L \) in an environment that includes the variable \( x \). Once a variable has been given a type, we don’t allow it to change, since if it changed conditionally, it would be possible for a variable to have values of different types in different circumstances. Thus, the T_Reassign rule rejects assignments where \( x \) is present but has a different type. Of course, in the Simple Imperative Language, syntactically, only one type is possible, int. This rule will be more interesting when we introduce other types.

Consider how T_Assign and T_Reassign relate to nested statements and nested blocks. If we first assign a variable inside of an if statement, then its type will not be visible outside the if statement. But, if we first assign it before the if statement, its type will be visible within the if statement. In this way, we’ve actually given a light form of lexical scoping to the Simple Imperative Language, which otherwise has only a single global scope. This is possible and correct only because of the strict ordering of statements.

Most typed imperative languages would instead require explicit variable declarations, and extend \( \Gamma \) with declared types. If those variable declarations have lexical scopes, then the semantics must be modified to nest \( \sigma \) correctly as well.
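To see how Γ is threaded through a statement list, here is a minimal Python sketch of the statement judgment (the encoding is our own; expression typing is reduced to literals and variables for brevity, where a full checker would recurse as in T_IntBinOp and friends). Note how T_Assign extends Γ only for the rest of the current list, so assignments inside an if or while body do not escape:

```python
# Checker for statement lists, following T_SkipOnly, T_SkipRest, T_Assign,
# T_Reassign, T_If, T_While and T_Block.  Hypothetical encoding:
#   ("skip",), ("assign", x, e), ("if", b, then, else_),
#   ("while", b, body), ("block", stmts)

def exp_type(gamma, e):
    """Simplified expression typing: literals and variables only."""
    if isinstance(e, bool):
        return "bool"
    if isinstance(e, int):
        return "int"
    if e not in gamma:
        raise TypeError(f"undefined variable {e}")
    return gamma[e]

def check_stmts(gamma, stmts):
    """Raise TypeError if the statement list is ill-typed (Γ ⊢ L)."""
    for s in stmts:
        if s[0] == "skip":                          # T_SkipOnly / T_SkipRest
            continue
        if s[0] == "assign":
            x, e = s[1], s[2]
            tau = exp_type(gamma, e)
            if x in gamma:                          # T_Reassign
                if gamma[x] != tau:
                    raise TypeError(f"{x} changes type to {tau}")
            else:                                   # T_Assign: extend Γ
                gamma = {**gamma, x: tau}           # ...for the rest only
        elif s[0] == "if":                          # T_If
            if exp_type(gamma, s[1]) != "bool":
                raise TypeError("if condition must be bool")
            check_stmts(gamma, s[2])                # Γ unchanged afterwards
            check_stmts(gamma, s[3])
        elif s[0] == "while":                       # T_While
            if exp_type(gamma, s[1]) != "bool":
                raise TypeError("while condition must be bool")
            check_stmts(gamma, s[2])
        elif s[0] == "block":                       # T_Block
            check_stmts(gamma, s[1])
```

A program that first assigns y inside an if and then uses y afterwards is rejected, exactly as discussed above: the extension of Γ made inside the branch is discarded when the branch’s judgment ends.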

5. Procedures

The Simple Imperative Language has no procedures, and no nested scopes. There is only a global scope and global variables. As we’ve discussed, procedures are not necessary for an imperative language to be Turing-complete — and indeed, as long as our integers have unlimited range, the Simple Imperative Language is Turing-complete, albeit quite awkward to use — but procedures are extremely common. There are several ways to represent procedures, but the simplest uses substitution and “freshening” (creating new, distinct variable names) like we saw in Module 2. To do this, we will need a syntax for procedures. We will put procedures and variables in the same store — in Pascal, they’re in the same environment. Our procedures will have a similar form to Pascal procedures: a subroutine header, a declaration list, and a body. For simplicity, we will implement procedures, and not functions. The declaration list will only consist of names, since the Simple Imperative Language is not typed. And, our procedures will be statements, not declarations; essentially, they are a form of assignment statement in which the assignment itself is implicit (we are assigning a procedure to a variable).

Aside: Just like typed functional languages tend to have a file syntax distinct from their expression syntax, most typed imperative languages allow procedure and variable declarations at the global scope, but no expressions, and many do not allow procedure definitions as statements in other procedures. This creates several syntaxes within the same language which partially overlap. We’ve somewhat sidestepped this issue by using Pascal as our exemplar, because its global scope is a procedure!

This introduces another use of semicolon, since procedure declarations use semicolons after the header and after each declaration. This is unambiguous with the semicolons which separate statements only because a procedure must end with a begin-end block. Unfortunately, this will create some hard-to-read syntax; remember to look for the corresponding begin-end block whenever you see a procedure.

We will call our extended version of Simple Imperative Language with procedures SIL-P. This is our Pascal factorial program from above, rewritten in SIL-P, and storing the result of fac(5) in the variable x:

procedure fac(n); begin if n = 1 then x := 1 else begin fac(n+-1); x := x * n end end; fac(5)

Rewriting this with some indentation for clarity:

procedure fac(n);
begin
  if n = 1 then
    x := 1
  else
  begin
    fac(n + -1);
    x := x * n
  end
end;
fac(5)

We extend the Simple Imperative Language as follows:

⟨stmt⟩ ::= ···
          | ⟨procdecl⟩
          | ⟨var⟩ (⟨arglist⟩)

⟨procdecl⟩ ::= procedure ⟨var⟩ (⟨paramlist⟩) ; ⟨decllist⟩ begin ⟨stmtlist⟩ end

⟨arglist⟩    ::= ε  |  ⟨intexp⟩ ⟨arglistrest⟩
⟨arglistrest⟩ ::= ε  |  , ⟨intexp⟩ ⟨arglistrest⟩

⟨paramlist⟩    ::= ε  |  ⟨var⟩ ⟨paramlistrest⟩
⟨paramlistrest⟩ ::= ε  |  , ⟨var⟩ ⟨paramlistrest⟩

⟨decllist⟩ ::= ε  |  ⟨var⟩ ; ⟨decllist⟩

Now, we need semantic rules for procedures. Let the metavariables \( A \), \( P \), and \( D \) range over argument lists, parameter lists, and declaration lists, respectively. We’ll start with the procedure declaration itself, which adds it to \( \sigma \):

\[ \textbf{ProcDecl} \quad \dfrac{Q = \mathtt{procedure}\ x(P);\ D\ \mathtt{begin}\ L\ \mathtt{end} \qquad \sigma' = \sigma[x \mapsto Q]} {\langle Q,\,\sigma \rangle \to \langle \mathtt{skip},\,\sigma' \rangle} \]

Now, we need procedure calls. Procedure declaration lists form environments — that is, each procedure can see its own declared variables, and can see surrounding variables, but cannot see or interfere with other procedure calls’ variables — so we need some way of distinguishing the variables within a procedure call from the variables outside of it. The Simple Imperative Language’s version of \( \sigma \) does not form a tree, just a single map, and there’s no easy way to make it serve both roles. Our solution to this will be substitution: when we call a procedure, we create fresh new variable names for all of the variables it defines, and substitute the variables in the procedure body for their new names. In this way, the procedure’s variables are isolated from other variables.

We will not formally define the function to generate new variable names for all variables in a procedure. Suffice it to say, for each variable \( x \) in the parameter list or declaration list of a procedure \( Q \), \( \mathit{freshen}(Q) \) will create a fresh variable name \( x' \) and substitution \( [x'/x] \). \( \mathit{freshen}(Q) \) returns this substitution list. Since we now have substitutions, we will let the metavariable \( S \) range over substitutions and substitution lists.

Now, we have the necessary framework to define the semantics of a procedure call. In addition, we need to resolve each argument of a procedure call, and we can do that as well:

\[ \textbf{CallArg} \quad \dfrac{\langle M_1,\,\sigma \rangle \to \langle M_1',\,\sigma \rangle} {\langle x(N_1, N_2, \ldots, N_n, M_1, M_2, \ldots, M_m),\,\sigma \rangle \to \langle x(N_1, N_2, \ldots, N_n, M_1', M_2, \ldots, M_m),\,\sigma \rangle} \]\[ \textbf{ProcCall} \quad \dfrac{ \sigma(x_1) = Q = \mathtt{procedure}\ x_2(x_{a,1}, x_{a,2}, \ldots, x_{a,n});\ D\ \mathtt{begin}\ L\ \mathtt{end} \qquad S = \mathit{freshen}(Q) }{ \langle x_1(N_1, N_2, \ldots, N_n),\,\sigma \rangle \to \langle \mathtt{begin}\ (x_{a,1} S) := N_1;\ (x_{a,2} S) := N_2;\ \cdots;\ (x_{a,n} S) := N_n;\ (L\, S)\ \mathtt{end},\,\sigma \rangle } \]

Let’s take these one at a time.

CallArg says that if you have a call with arguments \( N_1, N_2, \ldots, N_n, M_1, M_2, \ldots, M_m \) — that is, the first \( n \) arguments have been reduced — then the \( (n+1) \)th argument can be reduced.

ProcCall is the actual call. It replaces a call \( (x_1(N_1, N_2, \ldots, N_n)) \) with a begin-end block. That block starts with \( n \) assignment commands to each corresponding parameter \( (x_{a,1} \cdots x_{a,n}) \), then has the procedure’s body \( (L) \) with its variables freshened. From this point, each statement in the procedure will be executed, and since the declared variable names have been replaced, it effectively has its own scope.
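As a concrete illustration, here is a small Python sketch of the freshen-and-substitute step of ProcCall (the term encoding and helper names are our own; variables are leaf strings, and we assume tuple tags like "if" never collide with variable names). Freshening builds the substitution \( [x'/x] \) for each parameter and declared variable, and the call is replaced by a begin-end block:

```python
import itertools

# A sketch of ProcCall.  A procedure is (params, decls, body), where the
# body is a nested tuple/str structure using variable names as strings
# (a hypothetical encoding).  Calling a procedure inlines its body as a
# block after renaming every declared variable to a fresh name.

_counter = itertools.count(1)

def freshen(proc):
    """Build the substitution [x'/x] for each parameter/declared variable."""
    params, decls, _body = proc
    return {x: f"{x}_{next(_counter)}" for x in params + decls}

def substitute(term, subst):
    """Apply a substitution to every variable name in a term."""
    if isinstance(term, str):
        return subst.get(term, term)
    if isinstance(term, tuple):
        return tuple(substitute(t, subst) for t in term)
    return term        # integers, booleans, ...

def proc_call(store, name, args):
    """Reduce name(args) to a begin-end block, per ProcCall."""
    proc = store[name]
    params, decls, body = proc
    s = freshen(proc)
    # One assignment per parameter, then the freshened body.
    assigns = tuple(("assign", s[p], a) for p, a in zip(params, args))
    return ("block", assigns + (substitute(body, s),))

# procedure fac(n); begin if n = 1 then x := 1 else begin ... end end
fac = (["n"], [], ("if", ("=", "n", 1),
                   ("assign", "x", 1),
                   ("block", (("call", "fac", ("+", "n", -1)),
                              ("assign", "x", ("*", "x", "n"))))))
```

Calling `proc_call({"fac": fac}, "fac", [2])` yields a block whose first statement assigns 2 to a fresh name such as n_1, and whose body mentions only that fresh name; the global x is left untouched, just as in the reduction of Figure 2.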

To demonstrate, let’s follow through the reduction of our SIL-P program with the fac procedure, but to keep it reasonable, only to fac(2). There are many correct ways to implement \( \mathit{freshen} \), so we will assume that variables are suffixed with a counter. The reduction steps are shown in Figure 2. Since we freshened our variables, the recursion can simply be flattened into a sequence of statements. Note that the \( n_1 \) and \( n_2 \) variables are from separate, recursive calls to fac, but both are present in some program states. x, on the other hand, since it was not defined within the procedure, refers to the same (global) x.

Another aspect of this definition worth noting is that it pollutes \( \sigma \) with an only-increasing number of entries; \( \sigma \) never shrinks. If we implemented this directly on a real computer, we could easily chew through all of memory! Real implementations need techniques to clear out the store: for the stack, popping stack frames, and for the heap, either explicit memory management (malloc/free) or implicit memory management (garbage collection). Luckily, we operate in the world of mathematical logic, so our response to this pollution is: \( \sigma \) will simply get polluted, and we don’t care.

Figure 2: Reduction steps for procedure fac(n); begin if n = 1 then x := 1 else begin fac(n+-1); x := x * n end end; fac(2)

\[ \begin{aligned} &\langle \mathtt{procedure\ fac}(n);\ \mathtt{begin}\ \mathtt{if}\ n = 1\ \mathtt{then}\ x := 1\ \mathtt{else}\ \mathtt{begin}\ \mathtt{fac}(n{+}{-}1);\ x := x * n\ \mathtt{end}\ \mathtt{end};\ \mathtt{fac}(2),\ \{\} \rangle \\ \xrightarrow{\text{ProcDecl}} \;&\langle \mathtt{skip};\ \mathtt{fac}(2),\ \{\mathtt{fac} \mapsto \mathtt{procedure\ fac}(n)\ldots\} \rangle \\ \xrightarrow{\text{Skip}} \;&\langle \mathtt{fac}(2),\ \{\mathtt{fac} \mapsto \mathtt{procedure\ fac}(n)\ldots\} \rangle \\ \xrightarrow{\text{ProcCall}} \;&\langle \mathtt{begin}\ n_1 := 2;\ \mathtt{if}\ n_1 = 1\ \mathtt{then}\ x := 1\ \mathtt{else}\ \mathtt{begin}\ \mathtt{fac}(n_1{+}{-}1);\ x := x * n_1\ \mathtt{end}\ \mathtt{end},\\ &\quad \{\mathtt{fac} \mapsto \mathtt{procedure\ fac}(n)\ldots\} \rangle \\ \xrightarrow{\text{BlockOnly}} \;&\langle n_1 := 2;\ \mathtt{if}\ n_1 = 1\ \mathtt{then}\ x := 1\ \mathtt{else}\ \mathtt{begin}\ \mathtt{fac}(n_1{+}{-}1);\ x := x * n_1\ \mathtt{end},\ \{\mathtt{fac} \mapsto \ldots\} \rangle \\ \xrightarrow{\text{Assign, Skip}}^* \;&\langle \mathtt{if}\ n_1 = 1\ \mathtt{then}\ x := 1\ \mathtt{else}\ \mathtt{begin}\ \mathtt{fac}(n_1{+}{-}1);\ x := x * n_1\ \mathtt{end},\ \{\mathtt{fac} \mapsto \ldots,\ n_1 \mapsto 2\} \rangle \\ \xrightarrow{\text{Var, EqFalse}}^* \;&\langle \mathtt{if}\ \mathtt{false}\ \mathtt{then}\ x := 1\ \mathtt{else}\ \mathtt{begin}\ \mathtt{fac}(n_1{+}{-}1);\ x := x * n_1\ \mathtt{end},\ \{\mathtt{fac} \mapsto \ldots,\ n_1 \mapsto 2\} \rangle \\ \xrightarrow{\text{IfFalse}} \;&\langle \mathtt{begin}\ \mathtt{fac}(n_1{+}{-}1);\ x := x * n_1\ \mathtt{end},\ \{\mathtt{fac} \mapsto \ldots,\ n_1 \mapsto 2\} \rangle \\ \xrightarrow{\text{BlockOnly}} \;&\langle \mathtt{fac}(n_1{+}{-}1);\ x := x * n_1,\ \{\mathtt{fac} \mapsto \ldots,\ n_1 \mapsto 2\} \rangle \\ \xrightarrow{\text{Var, Neg, Add}}^* \;&\langle \mathtt{fac}(1);\ x := x * n_1,\ \{\mathtt{fac} \mapsto \ldots,\ n_1 \mapsto 2\} \rangle \\ 
\xrightarrow{\text{ProcCall}} \;&\langle \mathtt{begin}\ n_2 := 1;\ \mathtt{if}\ n_2 = 1\ \mathtt{then}\ x := 1\ \mathtt{else}\ \mathtt{begin}\ \mathtt{fac}(n_2{+}{-}1);\ x := x * n_2\ \mathtt{end}\ \mathtt{end};\ x := x * n_1,\\ &\quad \{\mathtt{fac} \mapsto \ldots,\ n_1 \mapsto 2\} \rangle \\ \xrightarrow{\text{BlockRest}} \;&\langle n_2 := 1;\ \mathtt{if}\ n_2 = 1\ \mathtt{then}\ x := 1\ \mathtt{else}\ \mathtt{begin}\ \mathtt{fac}(n_2{+}{-}1);\ x := x * n_2\ \mathtt{end};\ x := x * n_1,\\ &\quad \{\mathtt{fac} \mapsto \ldots,\ n_1 \mapsto 2\} \rangle \\ \xrightarrow{\text{Assign, Skip}}^* \;&\langle \mathtt{if}\ n_2 = 1\ \mathtt{then}\ x := 1\ \mathtt{else}\ \mathtt{begin}\ \mathtt{fac}(n_2{+}{-}1);\ x := x * n_2\ \mathtt{end};\ x := x * n_1,\\ &\quad \{\mathtt{fac} \mapsto \ldots,\ n_1 \mapsto 2,\ n_2 \mapsto 1\} \rangle \\ \xrightarrow{\text{Var, EqTrue}}^* \;&\langle \mathtt{if}\ \mathtt{true}\ \mathtt{then}\ x := 1\ \mathtt{else}\ \mathtt{begin}\ \mathtt{fac}(n_2{+}{-}1);\ x := x * n_2\ \mathtt{end};\ x := x * n_1,\\ &\quad \{\mathtt{fac} \mapsto \ldots,\ n_1 \mapsto 2,\ n_2 \mapsto 1\} \rangle \\ \xrightarrow{\text{IfTrue}} \;&\langle x := 1;\ x := x * n_1,\ \{\mathtt{fac} \mapsto \ldots,\ n_1 \mapsto 2,\ n_2 \mapsto 1\} \rangle \\ \xrightarrow{\text{Assign, Skip}}^* \;&\langle x := x * n_1,\ \{\mathtt{fac} \mapsto \ldots,\ n_1 \mapsto 2,\ n_2 \mapsto 1,\ x \mapsto 1\} \rangle \\ \xrightarrow{\text{Var, Var, Mul, Assign}}^* \;&\langle \mathtt{skip},\ \{\mathtt{fac} \mapsto \ldots,\ n_1 \mapsto 2,\ n_2 \mapsto 1,\ x \mapsto 2\} \rangle \end{aligned} \]

5.1 Procedures and Types

Procedures complicate types for two reasons: first, there is the question of whether our procedures should be first-class values, and second, the ordering of statements becomes more complex with procedures. We will of course need a procedure type as well.

In most procedural languages, procedures are not first-class. SIL-P is no exception. By our semantics, procedures cannot be assigned to variables or arguments, because our semantics get stuck if they reduce to anything but an integer (remember, \( N \) is an integer). This means that any time we call a procedure, we know with certainty what procedure we’re calling.

Second, consider the ordering of statements, in particular with respect to T_Assign and T_Reassign. If a procedure assigns to a global variable, but is defined before the first definition of that global variable, we need to ensure that its assignment to that global is consistent with the global’s definition. This is back to the initialization problem: with unpredictable ordering, it’s unclear who’s in charge of initialization. The simple solution to this, and the solution that Pascal uses, is explicit type declarations. If we had bothered to declare our global variables, it would be trivial to ensure that they’re used correctly.

For a procedure type, we need a constructed type defined by the argument types to the procedure:

⟨type⟩ ::= ··· | procedure(⟨typelist⟩)

⟨typelist⟩    ::= ε  |  ⟨type⟩ ⟨typelistrest⟩
⟨typelistrest⟩ ::= ε  |  , ⟨type⟩ ⟨typelistrest⟩

Note that unlike functions in functional languages, procedures may have zero arguments, and have no return type. Of course, procedures with returns are common in procedural languages (e.g. functions in Pascal), so a return type is also possible.

Some languages, such as C, define a void type for the return type from procedures which don’t return values, but this complicates typing, since there are no values of type void. It also opens a whole range of bizarre behaviors. Consider, for instance, the following C snippet (strictly speaking, ISO C forbids returning an expression from a void function, but many compilers accept it, and the equivalent code is valid C++):

void a() {
  /* perform some task... */
}
void b() {
  return a();
}

The b function has a return statement in spite of not returning a value, and this is accepted because a() is of type void, and so isn’t a value. This adds confusion to the semantics of C, since a void function can have a return statement with an expression to return, but the step to evaluate that expression should not produce a value. The easiest way to avoid this problem is Pascal’s solution: separate subroutines which do return values (functions) from subroutines which don’t (procedures).

Exercise 1. Write the type judgments for procedures. Consider how to deal with the initialization problem.

6. Arrays and References

Arrays are as fundamental to most imperative languages as lists are to most functional languages. An array is — and I apologize for pedantically defining this when you’ve undoubtedly been using arrays for years — a mapping from integer indices to values, in which the integer indices are all part of a contiguous domain, typically either \( (0, n) \) or \( (1, n) \), where \( n \) is either one less than the size of the array or the size of the array, respectively. The decision of whether to start arrays from 0 or 1 has ended friendships, ruined lives, and sparked several small-scale wars [citation needed], but ultimately doesn’t matter. Either definition works fine, and neither has reliably proved to be any easier to use than the other; programmers frequently make off-by-one errors in any language.

Arrays generally have a fixed size, and are usually implemented such that access to a field within an array is quite fast. Pascal has two kinds of arrays, static and dynamic arrays, but we will only focus on the latter, as the former can easily be rewritten in terms of the latter.

Arrays in Pascal are declared with a specific element type, e.g. x : array of integer. Arrays are thus a constructed type. SIL-P has no types, of course, so we will not need any such declaration.

Before using an array in Pascal, you must first allocate it. This is done with the built-in function setLength. For instance, to allocate space for 15 integers in x declared above, one calls setLength(x, 15). You can then access elements of the array with square brackets, from 0 to the size of the array minus one (in this case, 14). For instance, the following (pointless) procedure allocates an array and fills it with the squares of its indices:

procedure foo();
var x: array of integer;
var i: integer;
begin
  setLength(x, 15);
  i := 0;
  while i < 15 do
  begin
    x[i] := i*i;
    i := i + 1
  end
end

Pascal does not perform any bounds-checking, so if you attempt to access the array at an index below zero or greater than or equal to the length of the array, unpredictable behavior will occur. Quite precisely, it will access memory at an address outside the range of the array, but since our formal semantics don’t model memory, that’s unpredictable to us; we will revisit this in Module 10.

Consider a program to generate the Fibonacci sequence. The dynamic programming version of this algorithm involves carrying the last two Fibonacci numbers in variables, but if we’re saving all of the Fibonacci numbers in an array, then we always have the previous two available. If we put this in a procedure, we need to do one of three things:

  • Allocate the array ourselves and return it.
  • Take the array as an argument and fill it.
  • Share an array in a variable in the global scope.

The first option creates the additional confusion of what it means to return an array; should we duplicate it, or return a reference? The second option creates the same confusion, with respect to arguments. The third option is impractical, so we will discard it. Ultimately, we would like to be able to pass an array to a procedure, have the procedure change it, and see those changes in the calling procedure. We already saw a solution to this problem in functional languages: references. Arrays are a reference type.

A reference type is a type that is always referential; i.e., values of a reference type are always stored by way of a link to the heap (\( \Sigma \)), and all that is ever present in an expression or the store (\( \sigma \)) is a label. Thus, we need to reintroduce \( \Sigma \) to our reduction.

While setLength works well for Pascal, it is not the natural way to describe arrays in our semantics. Instead, we will invent a new syntax for allocating arrays, which is an expression: \( \mathtt{array}[M] \). Previously, we stratified our expressions into boolean and integer expressions, but with the introduction of arrays, we will need to broaden integer expressions to simply “expressions”, and so we will rename \( \langle \mathit{intexp} \rangle \) to \( \langle \mathit{exp} \rangle \). Note that this still excludes boolean expressions.

We extend SIL-P to SIL-PA (Simple Imperative Language with Procedures and Arrays) with our new expressions for allocating and accessing arrays, plus a new statement for writing to them, and create a new syntax for arrays in the heap:

⟨exp⟩ ::= ···
         | array[⟨exp⟩]
         | ⟨exp⟩[⟨exp⟩]

⟨stmt⟩ ::= ···  |  ⟨var⟩[⟨exp⟩] := ⟨exp⟩

⟨array⟩ ::= []
           | [⟨arglist⟩]

Note that \( \langle \mathit{array} \rangle \) is not referred to by any other production. We will store arrays in our heap, but you cannot write an array in SIL-PA; it is not part of the language’s syntax. Also, the array syntax we’ve described technically lets arbitrary expressions be part of an array, but in practice, they will always be reduced to values. Additionally, we need labels for our reductions, but as with labels previously, they will not have any defined syntax. Labels just need to be unique. Labels are values, but arrays are not, since no expression can evaluate to an array anyway.

Now, let’s add semantics. Because we’ve reintroduced a heap, we will need to expand what we’re reducing over to a triple again, \( \langle \Sigma, \sigma, M \rangle \). You may assume that all previously-defined reductions don’t touch the heap. Let \( \ell \) range over labels, \( M \) range over expressions, and \( N \) range over values. All other metavariables will have the same range as they previously had.

\[ \textbf{AllocStep} \quad \dfrac{\langle \Sigma, \sigma, M \rangle \to \langle \Sigma', \sigma, M' \rangle} {\langle \Sigma, \sigma, \mathtt{array}[M] \rangle \to \langle \Sigma', \sigma, \mathtt{array}[M'] \rangle} \]\[ \textbf{Alloc} \quad \dfrac{\ell\ \text{is a fresh label} \qquad N \in \mathbb{N} \qquad \Sigma' = \Sigma[\ell \mapsto [0_1, 0_2, \ldots, 0_N]]} {\langle \Sigma, \sigma, \mathtt{array}[N] \rangle \to \langle \Sigma', \sigma, \ell \rangle} \]\[ \textbf{IndexLeft} \quad \dfrac{\langle \Sigma, \sigma, M_1 \rangle \to \langle \Sigma', \sigma, M_1' \rangle} {\langle \Sigma, \sigma, M_1[M_2] \rangle \to \langle \Sigma', \sigma, M_1'[M_2] \rangle} \]\[ \textbf{IndexRight} \quad \dfrac{\langle \Sigma, \sigma, M \rangle \to \langle \Sigma', \sigma, M' \rangle} {\langle \Sigma, \sigma, N[M] \rangle \to \langle \Sigma', \sigma, N[M'] \rangle} \]\[ \textbf{Index} \quad \dfrac{\Sigma(\ell) = [N_{a,0}, N_{a,1}, \ldots, N_{a,N}, \ldots, N_{a,n}]} {\langle \Sigma, \sigma, \ell[N] \rangle \to \langle \Sigma, \sigma, N_{a,N} \rangle} \]\[ \textbf{ArrAssignLeft} \quad \dfrac{\langle \Sigma, \sigma, M_1 \rangle \to \langle \Sigma', \sigma, M_1' \rangle} {\langle \Sigma, \sigma, x[M_1] := M_2 \rangle \to \langle \Sigma', \sigma, x[M_1'] := M_2 \rangle} \]\[ \textbf{ArrAssignRight} \quad \dfrac{\langle \Sigma, \sigma, M \rangle \to \langle \Sigma', \sigma, M' \rangle} {\langle \Sigma, \sigma, x[N] := M \rangle \to \langle \Sigma', \sigma, x[N] := M' \rangle} \]\[ \textbf{ArrAssign} \quad \dfrac{ \sigma(x) = \ell \qquad v = N_1 \qquad \Sigma(\ell) = [N_{a,0}, \ldots, N_{a,v-1}, N_{a,v}, N_{a,v+1}, \ldots, N_{a,n}] \qquad \Sigma' = \Sigma[\ell \mapsto [N_{a,0}, \ldots, N_{a,v-1}, N_2, N_{a,v+1}, \ldots, N_{a,n}]] }{ \langle \Sigma, \sigma, x[N_1] := N_2 \rangle \to \langle \Sigma', \sigma, \mathtt{skip} \rangle } \]

The AllocStep, IndexLeft, IndexRight, ArrAssignLeft, and ArrAssignRight rules simply reduce a subexpression.

Alloc defines the allocation of an array. Note that an array is allocated on the heap, so Alloc reduces to a label. In our definition, the array starts filled with 0s, hence \( 0_1, 0_2, \ldots, 0_N \). This label can then be used to access the array with Index. Index requires a label \( \ell \) as its target, and that \( \Sigma(\ell) \) maps to an array, and, implicitly, that its index \( N \) is a number, and that the array has at least \( N \) elements. The elements are labeled \( N_{a,0} \) through \( N_{a,n} \), and so we extract the \( N \)th element, reducing to \( N_{a,N} \). Our semantics will get stuck if we try to access an element outside the bounds of the array, or if we try to index an array with something other than an integer.

ArrAssign is the most sophisticated reduction in this set, and perhaps the most sophisticated reduction we’ve seen in this course. Let’s take it one condition at a time:

  • To match the left-hand side of \( \to \), it must be an assignment to a variable \( x \) indexed by a value \( N_1 \). ArrAssignLeft assures that \( N_1 \) will be a value (or that the reduction will get stuck before reaching this point).
  • \( \sigma(x) = \ell \) specifies that the variable \( x \) must be in the store, and furthermore, that \( x \) must refer to a label \( \ell \).
  • \( v = N_1 \) simply renames \( N_1 \) as \( v \), since otherwise it will be difficult to read in the next step.
  • \( \Sigma(\ell) = [\cdots] \) specifies that \( \ell \) must be in the heap, that it must reference an array, and that that array must have an element \( v \) (\( N_{a,v} \)).
  • \( \Sigma' = \Sigma[\ell \mapsto [\cdots]] \) defines a new array value in a new heap \( \Sigma' \), identical to \( \Sigma(\ell) \) except that \( N_{a,v} \) has been replaced by the value we were assigning, \( N_2 \). \( N_2 \) is guaranteed to be a value by ArrAssignRight.

Note that we’ve modified \( \Sigma \), but not \( \sigma \). As a consequence, if we reassign a different variable — or an argument to a function — to refer to the same array, changes made to the array will be visible in both, because they both share the same label. This definition of array assignment demands that the target be a variable, but most languages allow an expression, so long as that expression evaluates to a label. Our semantics for assignment will get stuck in the same situations that our semantics for indexing would get stuck. Note that nothing in our semantics has required us to store integers in our array — SIL-PA is a dynamically-typed language — so we can store nested arrays, or even, if we’re feeling perverse, procedures in our array.
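The sharing behavior is easy to demonstrate with a toy Python model of \( \Sigma \) and \( \sigma \) (all names here are our own; for brevity we mutate the heap in place rather than constructing a fresh \( \Sigma' \) as the rules do):

```python
import itertools

# A sketch of the SIL-PA heap (Σ) and store (σ).  Arrays live in the
# heap under fresh labels; variables only ever hold labels, so copying
# a variable copies the label, not the array.  This mirrors Alloc,
# Index and ArrAssign, except that we update the heap in place.

_labels = itertools.count()

def alloc(heap, n):
    """Alloc: bind a fresh label to [0]*n in Σ and return the label."""
    label = f"l{next(_labels)}"
    heap[label] = [0] * n
    return label

def index(heap, label, i):
    """Index: read element i; 'stuck' (an error) if out of bounds."""
    arr = heap[label]
    if not (0 <= i < len(arr)):
        raise RuntimeError("stuck: out-of-bounds access")
    return arr[i]

def arr_assign(heap, store, x, i, v):
    """ArrAssign: write v through the label held by variable x."""
    label = store[x]
    if not (0 <= i < len(heap[label])):
        raise RuntimeError("stuck: out-of-bounds assignment")
    heap[label][i] = v

heap, store = {}, {}
store["a"] = alloc(heap, 3)   # a := array[3]
store["b"] = store["a"]       # b := a  (copies only the label)
arr_assign(heap, store, "a", 1, 42)
```

After the final assignment, reading index 1 through b sees 42 as well, because a and b hold the same label; and any access at index 3 or beyond gets stuck, just as the Index and ArrAssign rules demand.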

Now, let’s return to our example. We have the infrastructure in SIL-PA required to make a procedure which generates the Fibonacci sequence into an array. We will define a procedure fibarr which is called with an array and a number, where the number specifies the length of the Fibonacci sequence to generate into the array.

procedure fibarr(arr, ct);
idx;
begin
  idx := 0;
  while idx < ct do
  begin
    if idx = 0 then
      arr[0] := 1
    else if idx = 1 then
      arr[1] := 1
    else
      arr[idx] := arr[idx + -1] + arr[idx + -2];
    idx := idx + 1
  end
end

Exercise 2. Work out the values of \( \sigma \) and \( \Sigma \) every time they change in the evaluation of the statement fibarr(array[4], 4).

Extending our type system to work with arrays would be complicated. Most type systems in languages with arrays abandon one aspect of type safety: a well-typed program can still get stuck (or, in practice, fail at run time) on an out-of-bounds access. Integrating array bounds into the type system has proved to be a continuing difficulty in language design, and it and similar problems spawned the area of dependent type systems, in which types can be defined by values (such as the size of an array). Dependent type systems are beyond the scope of this course, but are the right area to study if you find this problem interesting.

7. Records

An alternative to arrays for storing compound data is records. If we were to implement records in an extension to the Simple Imperative Language, we would simply take them as syntactic sugar for arrays — they are no more powerful, just more usable — so we will not discuss their semantics, just their form in Pascal.

Record types, familiar to C programmers as structs, define their own stores, mapping a specific set of names to values. These name-value pairs are called fields, and every value of a given record type has the same field names, but not (necessarily) the same field values. The order of the names theoretically shouldn’t matter for the type, but is usually important for how the record is stored. If we were to convert records into arrays, for example, then the first field would become index 0, and the second index 1, etc.
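
As a sketch of that conversion (with invented helper names; this is not Pascal, just an illustration), we can map each field name to a fixed index in declaration order:

```python
# Hypothetical sketch of desugaring records into arrays: each field name
# of a record type is assigned a fixed array index, in declaration order.

point_fields = {"x": 0, "y": 1}  # field name -> index

def record_new(fields):
    """Allocate an array with one slot per field."""
    return [None] * len(fields)

def record_get(rec, fields, name):
    """Field access desugars to indexing at the field's fixed position."""
    return rec[fields[name]]

def record_set(rec, fields, name, value):
    rec[fields[name]] = value

p = record_new(point_fields)
record_set(p, point_fields, "x", 3)
record_set(p, point_fields, "y", 4)
print(record_get(p, point_fields, "x"))  # 3
```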

Record types are constructed types, so we need a syntax for constructing them. In Pascal, a record with two fields, x and y, both of type integer, is written as follows:

record
  x: integer;
  y: integer;
end

It gets cumbersome to write such a long type everywhere you need it, so Pascal allows you to write type aliases, which are declarations that give abbreviations for types. We can give this record the name point as follows:

type point = record
  x: integer;
  y: integer;
end

We can then write point in place of the long record type.

In a language with types, records are particularly important because arrays are always composed of elements of a single type, while records may be of multiple types. For instance, we can associate an array of integers with a single integer like so:

record
  samples: array of integer;
  median: integer;
end

In Pascal, declaring a variable with a record type is sufficient to make space for it. As in C, the record is allocated for the particular subroutine (on the stack), and it becomes unusable as soon as the subroutine returns. So, we can define a subroutine with several points like so:

function foo(bx, by, ex, ey: integer): integer;
var
  b, e: point;
begin
  ...
end

Accessing fields of records is similar to accessing elements of arrays, but instead of brackets, a dot followed by a name is used. This name is not a variable in the surrounding scope, but the name of the field, so the field name must be written explicitly into the code. You cannot access an arbitrary field named by an expression, only a specific field. Let’s turn our function into one that calculates the Manhattan distance between two points, rather pointlessly copying the coordinates into records to do so:

function manhattan(bx, by, ex, ey: integer): integer;
var
  b, e: point;
begin
  b.x := bx;
  b.y := by;
  e.x := ex;
  e.y := ey;
  manhattan := abs(e.x - b.x) + abs(e.y - b.y)
end

Records don’t actually add any power to our language: everything records can do, arrays can (awkwardly) do as well. But, this grouping of values by named fields became a fundamental building block for object-oriented programming, which is the next module.

8. Fin

In the next module, we will look at arguably the most popular and successful programming paradigm in existence: object-oriented programming. Assignment 5 will focus on implementing imperative programming like you saw in this module.

References

[1] Edsger W. Dijkstra. Go To statement considered harmful. Communications of the ACM, 11(3):147–148, 1968.

[2] Niklaus Wirth. The programming language Pascal. Acta informatica, 1(1):35–63, 1971.

Module 8: Object-Oriented Programming

“Object-oriented design is the roman numerals of computing.” — Rob Pike

In this module, we discuss Object-Oriented Programming (OOP), a paradigm that has enjoyed considerable popularity in recent decades. Proponents of OOP cite its potential for information-hiding and code-reuse; opponents argue that the hierarchical type systems imposed by object-oriented languages are not always an accurate reflection of reality, and can lead to compromised and unintuitive type hierarchies.

Objects were first introduced in 1967, as part of the programming language Simula, a descendant of Algol. While Simula had many of the features we now attribute to OOP, they arose almost by accident from its goal of simulation (hence the name). Later languages, in particular our exemplar, Smalltalk, are responsible for expanding on and refining them into modern OOP.

However, the widespread adoption of OOP into the programming mainstream did not happen until decades later. OOP grew from fairly niche to major importance in the 1990s. Currently, OOP is one of the most popular (indeed, perhaps the most popular) paradigms among programmers; most widely-used modern programming languages have some kind of support for OOP. However, there are relatively few languages that conform strictly to the object-oriented mentality and may thus be legitimately considered purely object-oriented. Smalltalk is one such language, but it is difficult even to name a second, setting aside research languages and languages with no modern maintenance.

Instead, OOP acts more as a “meta-paradigm” that may be combined with other paradigms. For example, Ada95 and C++ are fundamentally structured, procedural languages that include support for objects. Java conforms more strictly to the object-oriented mentality than do C++ and Ada (as Java forces programs to be organized into classes), but the language in which Java methods are written remains fundamentally structured programming. On the other hand, languages like CLOS and OCaml add object-orientation to functional languages.

In Module 1, when we introduced Smalltalk, it was described as being “so object oriented that it’s barely a procedural imperative language”. It is the complete absence of traditional procedures, the fact that even simple numbers are objects, and the encapsulation of conditions and loops—structured programming—into objects that makes Smalltalk purely object-oriented. Mostly-OOP languages like Java eschew this level of OOP purity in favor of familiarity and predictability.

Section 1: What OOP Is(n’t)

Because of its current widespread popularity, object-oriented programming is a particularly difficult paradigm to study in the abstract. Part of the reason for the difficulty is that there is no widespread consensus among programmers and language designers about what the defining features of an object-oriented language should be. Here we will present the most common language features possessed by object-oriented languages. When we discuss semantics, we will discuss it in the context of Smalltalk, with occasional sidebars to discuss how other languages differ.

At the very least, OOP has objects: the encapsulation of data and behavior into a single abstraction, which can be viewed equivalently as records with associated code, or code with associated records. Some other language features commonly associated with object-oriented programming are outlined below:

  • Data Abstraction: objects often provide facilities by which we may separate interface from implementation, often hiding parts of the implementation behind abstraction barriers;
  • Inclusion Polymorphism: object types tend to be arranged in hierarchies by a subtyping relationship that allows some types to be used in place of others;
  • Inheritance: objects often share some of the details of their implementation with other objects;
  • Dynamic Dispatch: the actual code associated with a particular method invocation may not be possible to determine statically. Dynamic dispatch is a form of dynamic binding, which we briefly mentioned in Module 2.

Most object-oriented languages possess at least some of the above characteristics.

Section 2: Exemplar: Smalltalk

You’ve already learned and used Smalltalk in this course, so there’s no need to introduce its syntax here. Instead, we’ll discuss Smalltalk’s place in language history.

Smalltalk was developed at Xerox PARC by Alan Kay, Dan Ingalls, Adele Goldberg, and many others. Xerox PARC is a research group famous for inventing everything before anyone else and then failing to monetize any of it. Among their various inventions are:

  • the graphical user interface, later monetized by Apple and Microsoft;
  • What-You-See-Is-What-You-Get (WYSIWYG) editing, later monetized by numerous corporations;
  • “fully-fledged” object-oriented programming (in Smalltalk), later monetized by Sun and later still by numerous corporations;
  • prototype-based object orientation (which we will briefly look at in this module), later monetized by Netscape as JavaScript.

Xerox is known for printers.

In fact, the first and third items in that list are one and the same: Smalltalk! The graphical user interface that Xerox PARC was famous for inventing was Smalltalk. Indeed, a course on the history of human-computer interaction (HCI) would probably discuss Smalltalk as a major leap forward in HCI, with only a brief mention of the fact that it’s also a programming language. It’s nearly impossible to separate the two concepts, because of the experience of programming in Smalltalk: there is no such thing as a Smalltalk file, and to write Smalltalk code, one interacts with a Smalltalk environment in which one can graphically create and define new classes. The Smalltalk language and implementation were described thoroughly in the so-called “blue book”, Smalltalk-80: The Language and its Implementation. The Smalltalk programming environment was described in the so-called “red book”, Smalltalk-80: The Interactive Programming Environment.

Smalltalk’s design throws out the idea of a main function or starting point, and instead opts for the program and interface to be a uniform medley of objects. This is useful for keeping Smalltalk quite pure in its OOP design: in Smalltalk, everything is an object, but we need many objects to already exist to even perform basic functions (think of Smalltalk’s true and false), so it’s hard to reconcile this design with a “starting point”. Instead, Smalltalk software is distributed as images, which are essentially a frozen state of one of these object medleys; one loads an image and is then experiencing the same environment as that in which the programmer wrote their software.

Ultimately, much of that design, while central to Smalltalk’s philosophy, isn’t relevant to our interest in object-oriented programming. GNU Smalltalk gets around it by having a special file syntax and semantics. We’ll get around it by borrowing a concept from Java: a main class with a main method.

Smalltalk is untyped, so when discussing types, we’ll use an extended syntax borrowed from Strongtalk, a typed variant of Smalltalk. In Strongtalk, the types of fields, variables, arguments to methods, and returns from methods are annotated with explicit types, like so:

hypotenuseWithSide: x <Number> andSide: y <Number> ^<Number> [
    | x2 <Number> y2 <Number> |
    x2 := x*x.
    y2 := y*y.
    ^(x2 + y2) sqrt
]

The <Number> next to each argument indicates that that argument is of type Number, and the ^<Number> at the end indicates that this method returns a Number as well. Like in the Simply Typed λ-calculus, the return type can be discovered from the return statements in the method, so it doesn’t need to be explicitly specified. But, as methods may have multiple return statements in Smalltalk, we’ll specify it explicitly.

Aside: Strongtalk is technically optionally typed: you may specify types if you desire, but are not required to, and there is no type inference, so code without types behaves as in Smalltalk.

Section 3: Classes

Most—but not all—object-oriented languages use classes to describe objects, and can be described as class-based languages. A class is a description of a set of objects with identical behavior and identical form, but (presumably) different internal state. This definition is similar to the definition of a type, and indeed, classes form the types of an object-oriented type system.

In a purely object-oriented, class-based language, the global scope contains only classes, and all imperative code must be boxed into classes. A class contains fields and methods. The fields are conceptually the same as fields of a record from imperative languages, and in that aspect, classes can be considered to be an extension of records: every object of a given class has values for each field declared in the class.

Methods are conceptually similar to procedures, but they are associated with classes, and a given method is called on an object of the class. The object that a method is called on is the receiver of the message in Smalltalk terms, and the target of the call otherwise. It is sometimes conceptually convenient to describe the target as a “hidden parameter”, and that is indeed how implementations of object-oriented languages work, but taking this concept too far will make typing of class-based languages inconsistent. Within a method, the target object can be accessed, typically with the name self or this, depending on the language.
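
Python happens to make this “hidden parameter” explicit, which makes it a convenient illustration (the class and method here are invented for the example, not drawn from Smalltalk):

```python
# In Python, a method is just a function whose first parameter is the
# target object, conventionally named self. This makes the "hidden
# parameter" visible: the two calls below are equivalent.

class Rectangle:
    def set_width(self, v):   # self is the receiver/target
        self.width = v

r = Rectangle()
Rectangle.set_width(r, 5)     # calling the method as a plain function...
r.set_width(5)                # ...is equivalent to the usual call syntax
print(r.width)  # 5
```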

We’ll use GNU Smalltalk’s syntax for classes. In GNU Smalltalk, we declare a class like so:

Rectangle subclass: Square [
    " content of the Square class... "
]

This introduces the name Square into the global scope, referencing the class. The class can be used to create objects with Square new.

Let’s create a syntax for class declarations:

⟨prog⟩        ::= ⟨classdecl⟩ ⟨prog⟩
               |  ε
⟨classdecl⟩   ::= ⟨var⟩ subclass: ⟨var⟩ [ ⟨fieldsdecl⟩ ⟨methodslist⟩ ]
⟨fieldsdecl⟩  ::= ε
               |  "|" ⟨varlist⟩ "|"
⟨varlist⟩     ::= ε
               |  ⟨var⟩ ⟨varlist⟩
⟨methodslist⟩ ::= ...

Thus, a program is a list of class declarations, and a class declaration has four parts: a superclass, a name, field declarations, and method declarations.

Aside: In this sense, purely object-oriented languages are declarative: declarations are primary, not behavior. Indeed, the same thing tends to happen with typed procedural languages. For historical reasons, the “declarative” paradigm is considered opposite to imperative, and so has less to do with declarations than with referential transparency.

When a class is declared, it is declared a subclass of some existing class, which is in turn its superclass. In our Square example, Square was declared as a subclass of the Rectangle class. This means that it inherits all of the fields and methods from Rectangle. A subclass may be defined with additional fields and methods as well as those defined by the superclass.

Finally, a subclass may re-define (override) methods that were defined in the superclass. In an overridden method, you may call the superclass’s original implementation using a special syntax that looks like calling the method on the object super. For instance, consider this partial implementation of Rectangle and Square:

Object subclass: Rectangle [
    | width height |
    " ... constructor, etc... "

    setWidth: v [
        width := v.
    ]

    setHeight: v [
        height := v.
    ]

    setWidth: w setHeight: h [
        self setWidth: w.
        self setHeight: h.
    ]

    area [
        ^width * height
    ]
]

Rectangle subclass: Square [
    setWidth: v [
        super setWidth: v.
        super setHeight: v.
    ]

    setHeight: v [
        self setWidth: v.
    ]
]

By overriding the setWidth: and setHeight: methods, a Square guarantees that it will always be square: the setWidth: method calls Rectangle’s setWidth: and setHeight: methods on the same value, and the setHeight: method calls Square’s setWidth: method. There’s no need to override the setWidth:setHeight: method, since it calls setWidth: and setHeight:, but it’s important to note the behavior of self setWidth: and self setHeight:: if self is a Square, then this will call Square’s method even though it’s actually part of the implementation of Rectangle!
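
Since Python uses the same dynamic dispatch of self, we can sketch the Rectangle/Square example in it to observe this behavior directly (the method names are direct translations of the Smalltalk methods, invented for this illustration):

```python
# Sketch of the Rectangle/Square example: self.set_width inside
# Rectangle's code dispatches on the dynamic class of self, so it can
# reach Square's override.

class Rectangle:
    def set_width(self, v):
        self.width = v

    def set_height(self, v):
        self.height = v

    def set_width_height(self, w, h):
        self.set_width(w)    # dispatches on the dynamic class of self
        self.set_height(h)

    def area(self):
        return self.width * self.height

class Square(Rectangle):
    def set_width(self, v):
        super().set_width(v)    # Rectangle's original implementations
        super().set_height(v)

    def set_height(self, v):
        self.set_width(v)

s = Square()
s.set_width_height(3, 4)  # Rectangle's code, but Square's overrides run
print(s.area())  # 16: the final set_height(4) made both dimensions 4
```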

Any method which is not overridden is inherited, so regardless, the subclass is guaranteed to have at least all the same methods as the superclass. Since classes define objects with the same fields and methods, the implication of this inheritance is that an object of the class Square is an object of the class Rectangle: it has at least all the same fields and methods, so anything you can do with a Rectangle, you can do with a Square. But it may have more than Rectangle has. So, all Squares are Rectangles, but not all Rectangles are Squares.

Section 4: Semantics

At the basic level, the semantics of an object-oriented language follow naturally from the semantics of an imperative language. However, as basic mathematical operators are—or at least, can be—methods, we don’t need most of the weight from even the Simple Imperative Language. Instead, we will define a simple(ish) semantics, with some parts based very loosely on Featherweight Java, for an object-oriented language in the style of Smalltalk.

We make the following restrictions to make the language tractable for formal semantics:

  1. Field access is syntactically distinct from local variable access, using an arrow, e.g. self->width.
  2. All user-defined methods are of the colon-separated form (e.g. setWidth:setHeight:); two built-in nullary methods are defined separately.
  3. Every method has exactly one return statement, and that return statement is the last statement in the method.
  4. Every statement must be of one of the following forms:
    • x := M, where x is a variable name and M is a method call (both target and arguments must be variable names);
    • x := [ ... ], an assignment of a block to a local variable;
    • x := y, an assignment of a variable to another variable;
    • x := self->y, the transfer of a field of self to a local variable;
    • self->y := x, the transfer of a local variable to a field of self;
    • ^x, a return statement returning a local variable;
    • x, just a variable name.

For instance, we would rewrite

x := r setWidth: (s area).

as

x1 := s area.
x := r setWidth: x1.

Since even mathematical operators are methods in Smalltalk, any expression can be broken up into individual steps in this way.

The store, \( \sigma \), will contain global class declarations and “freshened” local variables. We also need a heap, \( \Sigma \), to store our objects, and labels to reference them. An object is defined by its class and the values of all of its fields:

⟨object⟩         ::= ⟨var⟩ [ ⟨fieldvaluelist⟩ ]
⟨fieldvaluelist⟩ ::= ε
                  |  ⟨var⟩ ":=" ⟨value⟩ . ⟨fieldvaluelist⟩

The \( \langle \text{var} \rangle \) in the definition of \( \langle \text{object} \rangle \) is the object’s class. Each field is written \( \langle \text{var} \rangle := \langle \text{value} \rangle \). The values in \( \Sigma \) will all be objects, and the values in \( \sigma \), along with the globally-defined classes, will be labels for objects in \( \Sigma \).

We also need blocks in \( \sigma \). We restrict them similarly to methods: the last statement in the block must just be a variable name, which will be the value that the block evaluates to.

Like with Haskell, our global scope contains only declarations, so we need a \( \text{resolve} \) function to populate \( \sigma \) with all of the defined classes.

Exercise 1. Define \( \text{resolve} \) for Smalltalk, given the restriction that files contain only class declarations.

We will sully our object-oriented purity by defining the program as a list of class declarations and one statement.

Let the metavariables \( L, Q, O, M, v\text{–}w, x\text{–}z \), and \( \ell \) range over statement lists, statements, objects, method declarations, values, variable names, and labels, respectively.

There are only a small number of possible statements, so we simply define a semantics for each. A few method calls require special implementations. First, the built-in method for creating a new object, new. Because of inheritance, new needs a way of gathering all of the fields in all of the superclasses of the named class. We define this as \( \text{fields}(\sigma, x) \), which returns a list of field names for all the fields in \( x \) and all of its superclasses.

\[ \dfrac{ \text{fields}(\sigma, x_2) = y_1, y_2, \ldots, y_n \qquad O = x_2[y_1 {:=} \text{nil}.\ y_2 {:=} \text{nil}.\ \cdots\ y_n {:=} \text{nil}] \qquad \ell \text{ is fresh in } \Sigma \qquad \sigma' = \sigma[x_1 \mapsto \ell] \qquad \Sigma' = \Sigma[\ell \mapsto O] }{ \langle \Sigma,\ \sigma,\ x_1 {:=}\ x_2\ \text{new}.\ L \rangle \to \langle \Sigma',\ \sigma',\ L \rangle } \quad \text{(New)} \]

The first premise says that the fields for \( x_2 \) are \( y_1 \) through \( y_n \). The second premise defines an object with an appropriate shape for the class: it has all of the \( y_i \) fields, each given the value nil. The remaining premises add a link from the variable to a label and from that label to the object in the store and heap.
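
As a sketch of \( \text{fields} \) and the New rule (using an invented class table, and dictionaries standing in for \( \sigma \) and \( \Sigma \)):

```python
# Sketch of fields() and the New rule: collect field names up the
# superclass chain, then allocate a nil-initialized object at a fresh
# label. The class table and label scheme are hypothetical.

classes = {
    # class name -> (superclass name, its own field names)
    "Object":    (None, []),
    "Rectangle": ("Object", ["width", "height"]),
    "Square":    ("Rectangle", []),
}

def fields(classes, name):
    """All field names of a class and all of its superclasses."""
    if name is None:
        return []
    sup, own = classes[name]
    return fields(classes, sup) + own

heap = {}  # Sigma: label -> object

def new(classes, heap, cls):
    label = f"l{len(heap)}"  # a label fresh in the heap
    heap[label] = {"class": cls,
                   "fields": {y: None for y in fields(classes, cls)}}
    return label

l = new(classes, heap, "Square")
print(heap[l]["fields"])  # both inherited fields, initialized to nil/None
```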

Next, the value method of blocks:

\[ \dfrac{ \sigma(y) = [L_1.z] }{ \langle \Sigma,\ \sigma,\ x {:=}\ y\ \text{value}.\ L_2 \rangle \to \langle \Sigma,\ \sigma,\ L_1.\ x {:=} z.\ L_2 \rangle } \quad \text{(Block)} \]

To evaluate a block, we run its statements, and assign the value of its last statement—which we’ve restricted to being a single variable name—to the target of our assignment.
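
Blocks restricted in this way behave like zero-argument closures: sending value runs the statements and yields the final variable. A rough analogue (an illustration, not Smalltalk semantics):

```python
# A block [ ... ^z ] behaves like a thunk; sending it value runs its
# statements and produces its final variable's value.

block = lambda: 2 * 21   # the block body, ending in a single value
x = block()              # x := block value
print(x)  # 42
```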

Simple assignments of a variable to another variable:

\[ \dfrac{ \sigma' = \sigma[x \mapsto \sigma(y)] }{ \langle \Sigma,\ \sigma,\ x {:=} y.\ L \rangle \to \langle \Sigma,\ \sigma',\ L \rangle } \quad \text{(VarAssg)} \]

Field reading and writing:

\[ \dfrac{ \sigma(x_2) = \ell \qquad \Sigma(\ell) = x_3[\cdots\ y {:=} w\ \cdots] \qquad \sigma' = \sigma[x_1 \mapsto w] }{ \langle \Sigma,\ \sigma,\ x_1 {:=}\ x_2 \mathord{\to} y.\ L \rangle \to \langle \Sigma,\ \sigma',\ L \rangle } \quad \text{(FieldRead)} \]\[ \dfrac{ \sigma(x_1) = \ell \qquad \sigma(x_2) = w \qquad \Sigma(\ell) = x_3[z_1 {:=} v_1.\ \cdots\ y {:=} v_m\ \cdots\ z_n {:=} v_n] \qquad \Sigma' = \Sigma[\ell \mapsto x_3[z_1 {:=} v_1.\ \cdots\ y {:=} w\ \cdots\ z_n {:=} v_n]] }{ \langle \Sigma,\ \sigma,\ x_1 \mathord{\to} y {:=} x_2.\ L \rangle \to \langle \Sigma',\ \sigma,\ L \rangle } \quad \text{(FieldWrite)} \]

To read a field, we look up the object by name in \( \sigma \), look up that label in \( \Sigma \), and find the field-value pair matching the name \( y \). To write a field, we similarly look up the object and value in \( \sigma \), and update \( \Sigma \) to have an identical object, but with the field-value pair for \( y \) replaced with the new value.

All that remains is method calls. Calling a method is similar to calling a procedure from Module 7, with one major difference: we need to find the method. We will define a function \( \text{method}(\sigma, x, y) \), which finds a method of class \( x \) with name \( y \):

\[ \dfrac{ \sigma(x) = z\ \text{subclass:}\ x[\cdots\ y[Q]\ \cdots] }{ \text{method}(\sigma, x, y) = y[Q] } \]\[ \dfrac{ \sigma(x) = z\ \text{subclass:}\ x[\cdots] \qquad y \notin \sigma(x) }{ \text{method}(\sigma, x, y) = \text{method}(\sigma, z, y) } \]

The first rule extracts a method named \( y \) from class \( x \) if it is present. The second rule says that if \( y \) is not found in the class, then the superclass is searched. If the method is not defined at all, our semantics will get stuck.
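
A sketch of this lookup in Python, with an invented class table (method bodies are stand-in strings, and an undefined method raises where the semantics would get stuck):

```python
# Sketch of the method() function: look for the method in the class
# itself, and fall back to the superclass if it is absent.

classes = {
    # class name -> (superclass name, {method name: body})
    "Object":  (None, {}),
    "Greeter": ("Object", {"greeting": "^'hello'"}),
    "Parter":  ("Greeter", {"valediction": "^'goodbye'"}),
}

def method(classes, cls, name):
    sup, methods = classes[cls]
    if name in methods:
        return methods[name]           # found in this class
    if sup is None:
        raise LookupError(name)        # not defined anywhere: stuck
    return method(classes, sup, name)  # search the superclass

print(method(classes, "Parter", "greeting"))  # found via the superclass
```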

With method defined, we can call methods similarly to the Simple Imperative Language, freshening variable names to avoid conflict:

\[ \dfrac{ \sigma(x_2) = \ell \quad \Sigma(\ell) = x_3[\cdots] \quad M = \text{method}(\sigma, x_3, y_1{:}y_2{:}\cdots y_n{:}) \quad S = \text{freshen}(M) \quad x_4 = \text{self}_S \quad M_S = y_1{:}w_1\ y_2{:}w_2\ \cdots\ y_n{:}w_n\ [L_1.\ \hat{}w_r] }{ \langle \Sigma,\ \sigma,\ x_1 {:=}\ x_2\ y_1{:}z_1\ y_2{:}z_2\ \cdots\ y_n{:}z_n.\ L_2 \rangle \to \langle \Sigma,\ \sigma,\ x_4 {:=} \ell.\ w_1 {:=} z_1.\ \cdots\ w_n {:=} z_n.\ L_1.\ x_1 {:=} w_r.\ L_2 \rangle } \quad \text{(Call)} \]

Breaking this down: the first premise says that \( x_2 \) must refer to a label; the second says that label refers to an object of class \( x_3 \); the third uses that class name to look up the method; the fourth generates a substitution to freshen the names in the method; the fifth freshens self; and the sixth defines all the names in the freshened method. In the conclusion, we rewrite the method as writes to the freshened variable names corresponding to its arguments, then the method body, then a write of the return value to the target variable.

It may be surprising that our semantics have no conditionals or loops, but in fact, they’re not needed: we can build them out of True and False classes exactly as Smalltalk does.

Section 5: Inclusion Polymorphism

Inclusion polymorphism is based on the arrangement of types into a hierarchy of subtypes. A language may define subtypes however it wishes, but as a minimum restriction, a type \( a \) is a subtype of a type \( b \) if a value of type \( a \) can be used anywhere that a value of type \( b \) can be used.

We define subtyping formally with a relation \( <: \), which is reflexive and transitive, and is typically a partial order. That is, every type is a subtype of itself (\( \forall \tau_1.\ \tau_1 <: \tau_1 \)); if \( \tau_1 <: \tau_2 \) and \( \tau_2 <: \tau_3 \), then \( \tau_1 <: \tau_3 \); and there exist pairs of types \( \tau_1 \) and \( \tau_2 \) for which neither \( \tau_1 <: \tau_2 \) nor \( \tau_2 <: \tau_1 \) holds.

The key type rule associated with inclusion polymorphism is known as subsumption:

\[ \dfrac{ \Gamma \vdash e : \tau_1 \qquad \tau_1 <: \tau_2 }{ \Gamma \vdash e : \tau_2 } \quad \text{(T_Subsumption)} \]

In words, an expression of type \( \tau_1 \) may be given type \( \tau_2 \) whenever \( \tau_1 <: \tau_2 \). Typically, T_Subsumption isn’t written quite so generally, because this version is not syntax-directed. Instead, subsumption is often written implicitly with respect to particular rules. For instance, consider a subsumptive version of a rule for adding integers:

\[ \dfrac{ \Gamma \vdash e_1 : \tau_1 \qquad \tau_1 <: \text{int} \qquad \Gamma \vdash e_2 : \tau_2 \qquad \tau_2 <: \text{int} }{ \Gamma \vdash e_1 + e_2 : \text{int} } \quad \text{(T_AddSubs)} \]

There are, broadly, two answers to the question of which types may be related by \( <: \): structural subtyping and nominal subtyping.

5.1 Structural Subtyping

Consider these two Smalltalk classes, defined without any explicit supertype:

HelloEN [
    greeting ^<String> [ ... ]
]

HelloFR [
    valediction ^<String> [ ... ]
    greeting ^<String> [ ... ]
]

Aside: “Valediction” is to “greeting” as “goodbye” is to “hello”: it’s a word for words and phrases of parting.

Both of these classes have no fields, and both have a method named greeting. The two greeting methods have a lot in common; what they share is their signature. The signature of a method is its name, the number and types of its arguments, and its return type. Since greeting is the only method of HelloEN, anywhere a HelloEN can be used, a HelloFR can also be used.

A language in which that minimum bar is the only bar is said to employ structural subtyping. Assuming that the \( \text{signature} \) function in our formal model extracts the signature of a method, we can write a rule for subtyping. Let the metavariables \( F \) and \( M \) range over field lists and methods, respectively:

\[ \dfrac{ \tau_1 = x[F_1\ M_{1,1}\ M_{1,2}\ \cdots\ M_{1,n}] \qquad \tau_2 = y[F_2\ M_{2,1}\ M_{2,2}\ \cdots\ M_{2,m}] \qquad \forall v \in (1,n).\ \exists w \in (1,m).\ \text{signature}(M_{1,v}) = \text{signature}(M_{2,w}) }{ \tau_2 <: \tau_1 } \quad \text{(T_Structural)} \]

If, for every method \( M_{1,*} \) in \( \tau_1 \), there exists a method \( M_{2,*} \) in \( \tau_2 \) with the same signature, then \( \tau_2 <: \tau_1 \). The particular quantifications are very important: every method in \( \tau_1 \) must have a corresponding method in \( \tau_2 \), but the reverse is not true. Note that we haven’t discussed fields; we’ll get to why when we discuss encapsulation, in Section 8.
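
Reading signatures as (name, argument types, return type) tuples, T_Structural amounts to a subset check on signature sets. A sketch under that simplification (the classes are the HelloEN/HelloFR example, with hypothetical signature encodings):

```python
# Sketch of T_Structural: tau2 <: tau1 exactly when every method
# signature of tau1 also appears in tau2. Signatures are simplified to
# (name, argument types, return type) tuples.

hello_en = {("greeting", (), "String")}
hello_fr = {("greeting", (), "String"),
            ("valediction", (), "String")}

def structural_subtype(tau2, tau1):
    # every signature of tau1 must occur in tau2; the reverse need not hold
    return tau1 <= tau2

print(structural_subtype(hello_fr, hello_en))  # True:  HelloFR <: HelloEN
print(structural_subtype(hello_en, hello_fr))  # False: not vice versa
```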

Structural subtyping is uncommon in practice, for pragmatic reasons of implementation. Structural subtyping is sometimes also called duck typing—if it looks like a duck and quacks like a duck, it’s a duck—but “duck typing” is more frequently used to describe an informal sense of types in a dynamically-typed language than a static type system.

Structural subtyping is distinct from parametric polymorphism because of the presence of a hierarchy. In parametric polymorphism, a function with type parameter \( \alpha \) must work for any substitution of \( \alpha \). In a structurally-subtyped, inclusion-polymorphic language, if a method’s parameter has type \( \tau \), it may receive a value of any subtype of \( \tau \), but it can still count on that value supporting everything that \( \tau \) defines.

5.2 Nominal Subtyping

In nominal subtyping, the explicit subclass specification is the subtyping relationship. That is, \( \tau_1 <: \tau_2 \) if \( \tau_1 \) was explicitly specified to be a subclass of \( \tau_2 \).

Smalltalk already has this: every class is declared as a subclass of another class, and if class \( a \) is a subclass of class \( b \), then the type represented by \( a \) is a subtype of the type represented by \( b \). We don’t define the superclass relationship as reflexive, however: a class is not a subclass of itself. For instance:

Object subclass: Greeter [
    greeting ^<String> [ ... ]
]

Greeter subclass: Parter [
    valediction ^<String> [ ... ]
]

Parter implicitly has the structure of Greeter because of inheritance, so the basic requirement of subtyping is satisfied. Inheritance automatically gives us a sufficient relationship for subtyping, so nominal subtyping is simply using the inheritance relationship as the subtyping relationship.

Object-oriented languages have two answers to the question of how to define a “first” class (one with no superclass):

  1. In many languages in which classes are “bolted on” to an existing imperative language (such as C++), classes may be defined without any superclass. These classes inherit nothing.
  2. In most languages which were designed to be object-oriented from the beginning (such as Smalltalk and Java), there is a single class, typically called Object, defined a priori by the language implementation. This allows the language designer to ensure that some methods are available on every object.

Nominal subtyping is usually justified by implementation concerns, rather than theory. When we compile Smalltalk code, we compile classes into virtual tables (also called vtables), which are simply arrays of pointers to machine code. Every method gets an index in this array. For instance, the Greeter class above would have a virtual table with one element, which points to the machine code for greeting.

To support subclasses, all we need to do is make a subclass’s virtual table compatible with its immediate superclass’s virtual table. The virtual table for Parter has two elements: the first points to greeting, and the second points to valediction. Every Parter behaves like a Greeter, because the first element of its virtual table is a greeting method.
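
A sketch of this dispatch scheme in Python, using lists of functions as virtual tables (the names are translated from the Smalltalk example; a real implementation would use arrays of machine-code pointers):

```python
# Sketch of virtual tables: a subclass's vtable extends its immediate
# superclass's, so an index assigned in the superclass remains valid in
# every subclass, and dispatch is a single array lookup.

def greeting(self):    return "hello"
def valediction(self): return "goodbye"

GREETING, VALEDICTION = 0, 1               # method name -> vtable index

greeter_vtable = [greeting]
parter_vtable  = [greeting, valediction]   # superclass's entries first

def send(vtable, index, receiver):
    return vtable[index](receiver)         # dispatch through the table

# Code compiled against Greeter uses index 0, which works for both:
print(send(greeter_vtable, GREETING, None))  # hello
print(send(parter_vtable, GREETING, None))   # hello
```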

This virtual table design works well, but makes structural subtyping essentially impossible. The compatibility between two virtual tables depends not just on the types of all the methods, but the order in which they happen to be declared. For instance, HelloEN and HelloFR would have incompatible virtual tables. Thus, the desire for well-performing implementations spurred on a desire for nominal subtyping.

It is not impossible to implement similar optimizations in a structurally-typed language—this is done for modern dynamic languages such as JavaScript, and for structural subsets of languages like Java—thanks to just-in-time compilation (JIT) and type profiling, but these are beyond the scope of this course.

Section 6: Smalltalk’s Weird Methods

Smalltalk has a very unusual method syntax. However, the unusualness is superficial: nothing about it makes methods behave any differently in Smalltalk than they do in any other programming language.

Methods in Smalltalk take three forms:

  • Nullary methods (taking no arguments): simply named by an identifier, such as area, and called by placing the name next to the target object, such as x area.
  • Operators: named by symbols, but are otherwise just methods. In Smalltalk, 2 * 2 is a call to the * method on the Number object 2, with argument 2.
  • n-ary methods: the signature is given with a sequence of pairs of a partial method name and a parameter name. For instance, the setWidth: w setHeight: h signature is for a method with name setWidth:setHeight:, and parameters w and h. In a language like Java, this method would probably be written setWidthAndHeight(w, h), but the difference is superficial.

As specified in Module 1, we can also write Rectangle>>setWidth:setHeight: to indicate specifically the setWidth:setHeight: method of Rectangle.

Section 7: Subtyping and Methods

Method types are exactly the same as procedure types: a list of parameter types, plus a return type. We write a method type as \( (\tau_1, \tau_2, \ldots, \tau_n) \to \tau_r \), where \( \tau_1 \) through \( \tau_n \) are the parameter types and \( \tau_r \) is the return type.

Methods are not values in class-based languages. A method is called on an object. That object cannot then return a method, because that method would be naked—it would have no object to call it on. Thus, method types are not complete types: they only exist as part of object types. No expression can have a method type, but method types nonetheless exist, bound up within object types.

For instance, the Rectangle>>setWidth: method would be written in Strongtalk as:

setWidth: v <Number> ^<Rectangle> [ ... ]

Smalltalk has no unit type, so the typical return type from a method that doesn’t need to return anything is the type of the surrounding class, and the default return value is self. This method’s type is \( (\text{Number}) \to \text{Rectangle} \). Note that the hidden parameter self does not represent any part of this type: method types are part of their surrounding object types, so the type of the hidden parameter is implied.

Now, Square>>setWidth: would be written in Strongtalk as:

setWidth: v <Number> ^<Square> [ ... ]

This method’s type is \( (\text{Number}) \to \text{Square} \). But Rectangle>>setWidth:’s type was \( (\text{Number}) \to \text{Rectangle} \)—not quite the same. We need a concept of method compatibility and method subtyping.

Covariant return types. Square>>setWidth: returns a Square, but Rectangle>>setWidth: was expected to return a Rectangle. Squares are Rectangles, so the return type is compatible so long as it’s a subtype:

\[ \dfrac{ \cdots \qquad \tau_{1,r} <: \tau_{2,r} }{ (\tau_{1,1}, \ldots, \tau_{1,n}) \to \tau_{1,r}\ <:\ (\tau_{2,1}, \ldots, \tau_{2,n}) \to \tau_{2,r} } \]

Contravariant parameter types. Consider overriding Square>>setWidth: to take an Object as its argument instead of a Number. Since Number <: Object, this wouldn’t break anything: we can still accept all argument values that Rectangle>>setWidth: can accept, and more. What’s surprising is that it’s in reverse: it was safe for Square>>setWidth:, in a subtype of Rectangle, to take a supertype of Number as its argument.

With this, we may fill in the complete method subtyping rule:

\[ \dfrac{ \forall i \in \{1, \ldots, n\}.\ \tau_{2,i} <: \tau_{1,i} \qquad \tau_{1,r} <: \tau_{2,r} }{ (\tau_{1,1}, \tau_{1,2}, \ldots, \tau_{1,n}) \to \tau_{1,r}\ <:\ (\tau_{2,1}, \tau_{2,2}, \ldots, \tau_{2,n}) \to \tau_{2,r} } \]

Note how the type requirement for the parameters is the inverse of the requirement for the return: the parameter types on the right must be subtypes of the parameter types on the left, while the return type on the left must be a subtype of the return type on the right. This reversal of the subtyping relationship is called contravariance, and when the relationship is not reversed, we call it covariance. An additional possibility is invariance, which we discuss in Section 8.
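The rule can be rendered directly as a checker. The sketch below models types as strings over a small hypothetical hierarchy; `is_method_subtype` applies contravariance to the parameters and covariance to the return, exactly as in the rule above.

```python
# A checker for the method-subtyping rule: parameters are contravariant,
# the return is covariant. Types are strings with an explicit supertype
# map (a hypothetical mini-hierarchy matching the text's examples).

SUPERTYPES = {
    "Integer": "Number",
    "Number": "Object",
    "Square": "Rectangle",
    "Rectangle": "Object",
}

def is_subtype(a, b):
    """True if a <: b (reflexive; walks the supertype chain)."""
    while a is not None:
        if a == b:
            return True
        a = SUPERTYPES.get(a)
    return False

def is_method_subtype(params1, ret1, params2, ret2):
    """(params1) -> ret1  <:  (params2) -> ret2 ?"""
    return (len(params1) == len(params2)
            # contravariant: each right-hand parameter <: left-hand one
            and all(is_subtype(p2, p1) for p1, p2 in zip(params1, params2))
            # covariant: left-hand return <: right-hand return
            and is_subtype(ret1, ret2))
```

For instance, a method taking an Object and returning a Square can stand in where one taking a Number and returning a Rectangle is expected, but not vice versa.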

The hidden parameter self. The type of self in Square>>setWidth: is Square, since that method will always be called on Squares. Similarly, the type of self in Rectangle>>setWidth: is Rectangle. So, the hidden parameter type is covariant, even though the other parameter types are contravariant! This is precisely why it’s often confusing to think of the target as a hidden parameter: the very fact that we found this method means self must have been a Square.

Section 8: Fields and Encapsulation

One of the principles of object-oriented programming is that the details of how a particular object works should be hidden, so that those details can be changed without needing to change the interface.

If we allow all fields to be fully accessible, objects truly behave like records with methods. Consider an expanded version of Rectangle with explicit field types:

Object subclass: Rectangle [
    | width <Number> height <Number> |
    ...
]

Now consider a new class IntSquare, where Integer <: Number:

Square subclass: IntSquare [
    | width <Integer> height <Integer> |
    ...
]

And an unrelated method that expands a rectangle by a factor of 1.5:

expandRectangle: r <Rectangle> ^<Rectangle> [
    r->width := r->width * 1.5.
    r->height := r->height * 1.5.
    ^r
]

If we call expandRectangle: with an IntSquare, we attempt to set width to a floating-point number, but IntSquare’s width field is of type Integer! So, we cannot let field overrides be covariant.

What about contravariant field overrides? Consider an AbstractSquare that allows the sides to be of any object type:

Square subclass: AbstractSquare [
    | width <Object> height <Object> |
]

This still doesn’t work. If we pass an AbstractSquare to expandRectangle:, it attempts to call the * method on an Object, and so fails.

In fact, the only way we can define a width field in the subclass is with exactly the same type as in the superclass. We call this requirement type invariance, and because of it, there’s no reason to allow field overrides at all.

Invariance also comes up with OCaml-style references in a language with method (or function) subtyping: \( \text{ref}\ \tau_1 \) is only a subtype of \( \text{ref}\ \tau_2 \) if \( \tau_1 = \tau_2 \).

Exercise 2. Work out why references require invariant typing. To do this, imagine a reference class with methods put and get.
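As a nudge toward Exercise 2, the shape of the argument can be sketched: get places \( \tau \) in return position (demanding covariance), put places it in parameter position (demanding contravariance), and the only way to satisfy both at once is equality. The function below is a hypothetical helper, parameterized over whatever subtype relation you like.

```python
# ref tau1 <: ref tau2 requires BOTH directions of the subtype relation:
#   get : () -> tau    (tau in return position: needs tau1 <: tau2)
#   put : (tau) -> ()  (tau in parameter position: needs tau2 <: tau1)

def ref_is_subtype(t1, t2, is_subtype):
    covariant_ok = is_subtype(t1, t2)      # required by get
    contravariant_ok = is_subtype(t2, t1)  # required by put
    return covariant_ok and contravariant_ok

# With any partial order, both directions hold only when t1 = t2.
```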

Smalltalk takes the opposite extreme: fields are completely private; you can’t even access the fields of a superclass in a subclass! Because fields are private to particular classes, you can even define fields with the same name in a subclass. The field doesn’t override, because it’s a different field: Rectangle>>width is a different field from IntSquare>>width, and when a method in Rectangle reads width, it will only find its own.
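Incidentally, Python's double-underscore name mangling produces much the same effect as Smalltalk's per-class fields. In this sketch (class names borrowed from the text's example, behavior otherwise hypothetical), two same-named fields coexist, and each class's methods see only their own:

```python
# Python mangles __width in class Rectangle to _Rectangle__width, so a
# subclass's __width is a different slot: per-class private fields.

class Rectangle:
    def __init__(self):
        self.__width = 1.5          # stored as _Rectangle__width

    def rect_width(self):
        return self.__width         # only ever sees Rectangle's field

class IntSquare(Rectangle):
    def __init__(self):
        super().__init__()
        self.__width = 2            # stored as _IntSquare__width

    def square_width(self):
        return self.__width
```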

Weirdly, although Smalltalk fields are as private as possible, Smalltalk doesn’t even support private methods. All methods in Smalltalk are usable by anyone with a reference to the object.

Most programming languages choose a compromise somewhere between Smalltalk and our hypothetical all-open language, allowing individual fields and methods to be declared private or public.

Because the purpose of the subtyping relationship \( \tau_1 <: \tau_2 \) is to ensure that a \( \tau_1 \) can be used anywhere where a \( \tau_2 \) is expected, field and method privacy also affects subtyping. The effect is quite simple: we can simply disregard all private fields and methods, since they don’t affect the public interface of objects. This is also why we disregarded fields entirely while describing \( <: \): fields are completely private in Smalltalk.

Section 9: Overloading

In Smalltalk, an object may have two fields with the same name if the fields are defined on different classes. This is one kind of overloading. Most statically-typed object-oriented languages also allow method overloading.

Broadly, overloading is a facility by which different entities in the same context can share the same name. Overloading is one (very limited) form of polymorphism. References to the shared name are disambiguated by information from the context of the reference. In the case of overloaded methods, the compiler examines the number and type of the arguments passed to the method, and in some languages, also the return type of the method. It then selects the instance of the shared name that best fits. The process of disambiguating references to overloaded names is known as overload resolution.

Consider a Strongtalk class for handling money in dollars and cents:

Object subclass: Money [
    | dollars <Integer> cents <Integer> |

    Money class >> withDollars: d cents: c [
        " ... constructor ... "
    ]

    dollars ^<Integer> [ ^dollars ]
    cents ^<Integer> [ ^cents ]

    + second <Money> ^<Money> [
        | d c |
        d := dollars + second dollars.
        c := cents + second cents.
        c >= 100 ifTrue: [
            d := d + 1.
            c := c - 100.
        ].
        ^Money withDollars: d cents: c
    ]
]

With method overloading, we could define a second + method taking a Number:

+ second <Number> ^<Money> [
    " ... round second's cents, perform the addition, etc... "
]

Now consider the expression m + 42. The compiler knows the types of m (Money) and 42 (Integer). Looking into the Money class, it finds two + methods: one expecting a Money argument, and one expecting a Number. 42 is an Integer, not a Money, so we need a method whose parameter type is a supertype of Integer. Number is such a supertype (indeed, by contravariance and covariance, \( (\text{Number}) \to \text{Money} <: (\text{Integer}) \to \text{Object} \)), so we select that version of the method.

Overload resolution can get vastly more complicated than this. In some languages, ambiguity is simply disallowed. In other languages, such as C++, there is an algorithm by which a “closest match” is chosen, but that algorithm is not always intuitive or obvious.
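A toy overload resolver makes the "closest match" idea concrete. This sketch (types as strings over a small hypothetical hierarchy) collects the applicable parameter types, then picks the one that is a subtype of all the others; if no single candidate wins, the call is ambiguous.

```python
# A toy single-argument overload resolver in the spirit of the Money
# example: pick the most specific declared parameter type that is a
# supertype of the argument's type.

SUPERTYPES = {"Integer": "Number", "Number": "Object", "Money": "Object"}

def is_subtype(a, b):
    while a is not None:
        if a == b:
            return True
        a = SUPERTYPES.get(a)
    return False

def resolve(arg_type, overloads):
    """overloads: one declared parameter type per overloaded method."""
    applicable = [t for t in overloads if is_subtype(arg_type, t)]
    if not applicable:
        raise TypeError("no applicable overload")
    # most specific: a subtype of every other applicable candidate
    for t in applicable:
        if all(is_subtype(t, u) for u in applicable):
            return t
    raise TypeError("ambiguous overload")
```

For m + 42, resolution over the candidates Money and Number selects the Number version, since Integer <: Number but not Integer <: Money.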

Method overloading introduces several complications. Almost without realizing it, we’ve lost erasability: with method overloading, types actually affect compiled code—we need to know which overloaded method to call, and the only way to decide is with the types. This causes difficulty both for our formal semantics and for a language implementation, which often needs to have a unique name for a method simply to generate code. Erasing all type declarations from Strongtalk yields Smalltalk, and that fact is part of why Strongtalk doesn’t allow method overloading.

In implementations, this is resolved by having a type-directed compilation step called name mangling. Name mangling is simply a renaming stage in which methods are renamed to system-generated names containing their surrounding class’s name, the types of all of the arguments, and (if used in overloading) the return type. A Strongtalk-with-overloading compiler could rename + on numbers as Money>>+<Number>. Name mangling may also involve reducing the character set of mangled names for platform-specific reasons; for example, it could produce something like _ZN5MoneyplE6Number.
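A mangler in this style is essentially string concatenation. The function below is a hypothetical sketch producing names of the Money>>+<Number> form described above; real manglers (such as the one behind the _ZN… example) additionally restrict the character set.

```python
# A hypothetical name mangler: class name, selector, and parameter types
# joined into one globally unique method name.

def mangle(class_name, selector, param_types):
    return (f"{class_name}>>{selector}"
            + "".join(f"<{t}>" for t in param_types))
```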

Section 10: Completing The Type Hierarchy

Now that our types form subtypes in a hierarchy, we can discuss the extreme points of that hierarchy.

We’ve already talked about Object, the root of the hierarchy. In Smalltalk, every class is, unavoidably, a subclass (directly or indirectly) of Object, and thus every value is a subtype of Object. Object forms the top of our type hierarchy, as \( \forall C.\ C <: \text{Object} \). The top of a type hierarchy is written as \( \top \).

Many object-oriented languages do not have Object as the root of their hierarchy:

  • C++ has no root of the class hierarchy; it implements a form of parametric polymorphism called templates to work around this.
  • Java uses coercion—an ad hoc form of polymorphism in which values of some types are allowed to take the place of values of another type by explicitly converting them at run-time—to paper over the problem. For instance, Object o = 42; in Java inserts code to create an instance of the Integer class, which is a subclass of Object.

Aside: Java also implements overloading, so Java has inclusion polymorphism, overloading polymorphism, and coercion polymorphism! That’s three of the four forms of polymorphism from the Cardelli-Wegner polymorphism hierarchy discussed in Module 4.

We’ve found the top of the type hierarchy. Is there a bottom? That is, is there some type \( \tau \) which is the subtype of every possible type? The bottom type is written \( \bot \).

  1. Haskell’s answer: \( \bot \) is the type of errors, exceptions, failures, and infinite loops—expressions that will not produce a “normal” value.
  2. No bottom type: with inclusion polymorphism, a bottom type would need to have every conceivable method, which no value can have, so the language simply goes without one. The main cost of this choice is the initialization problem from Module 7: what are the values of fields before you’ve first assigned them?
  3. Null (the most controversial and most popular answer).

“I call it my billion-dollar mistake. It was the invention of the null reference in 1965.” — Tony Hoare

The null reference is written in different ways in different languages: nullptr, null, nil, etc. We’ll use nil. nil is both a type and a value, and there is only one value of type nil: nil. nil is an object with no methods and no fields. And \( \forall \tau.\ \text{nil} <: \tau \). This doesn’t arise from the typing rules; it makes no sense structurally. It’s simply an axiom:

\[ \dfrac{ }{ \text{nil} <: \tau } \quad \text{(T-Nil)} \]

We simply allow nil to be taken as any type, by fiat.

This, of course, breaks the entire concept of typing. If we try to call a method on a value that has no methods, the semantics get stuck, and the implementation usually crashes or throws an exception.

“At that time, I was designing the first comprehensive type system for references in an object oriented language (ALGOL W). My goal was to ensure that all use of references should be absolutely safe, with checking performed automatically by the compiler. But I couldn’t resist the temptation to put in a null reference, simply because it was so easy to implement. This has led to innumerable errors, vulnerabilities, and system crashes, which have probably caused a billion dollars of pain and damage in the last forty years.” — Tony Hoare

The less self-flagellating answer is the initialization problem: nil gives us a powerful aid in initialization—a default value. Since nil is a subtype of every type, we can set every field to nil to start with, and leave the problem of correct initialization to the programmer.

An amusing second consequence of nil is that every proof of type safety of an object-oriented language with nil has an extra caveat: an expression will yield a value of the judged type, reduce forever, or attempt to dereference null.

Section 11: Casting and Recovering Types

In our semantics, every object in the heap carries with it the name of its class. This is a form of run-time type information, and it allows us to recover type information. It is another way in which types are usually not erased in object-oriented languages.

For instance, consider the following snippet of Strongtalk code:

Object subclass: IntHider [
    | value <Integer> |

    add: x <Number> [
        value := value + x.
    ]
]

This contains a type error: adding an Integer to a Number yields a Number, not an Integer! Most object-oriented languages provide some kind of run-time type checking and casting:

Object subclass: IntHider [
    | value <Integer> |

    add: x <Number> [
        | v <Number> |
        v := value + x.
        v is<Integer> ifTrue: [
            value := v <Integer>.
        ] ifFalse: [
            self halt. " unrecoverable, crash "
        ].
    ]
]

v is<Integer> is an imagined syntax for checking if v is of the class Integer, and v <Integer> is a similarly imagined syntax for casting.

Type checking involves looking in the run-time type information to validate the type. Because of inheritance, this means looking through an entire chain of classes. In most object-oriented languages, Class is a class, and classes are elements of that class, so that the run-time system can look information up. Thus, the Class class contains a method something like this:

isSubclassOf: query <Class> [
    self = query ifTrue: [ ^true ].
    superclass isNil ifTrue: [ ^false ]. " we've reached the root of the object hierarchy "
    ^superclass isSubclassOf: query
]

Casting, which we’ve written as v <Integer>, gives the expression v <Integer> the type Integer, but at runtime performs a check so that the type is guaranteed to be correct. If the check succeeds, this expression is the same as v. If the check fails, the behavior depends on the language: some languages evaluate to nil, others raise an error.
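The check-then-cast pattern can be sketched in Python: every value carries its class, the check walks the superclass chain (as in isSubclassOf: above), and a successful cast simply returns the same value, now known at the narrower type. This is an illustrative model, not any particular language's machinery; here a failed cast raises an error, as in the "some languages raise an error" option.

```python
# Run-time type checking and casting, modeled with Python's own run-time
# class information (type(v) and __base__ play the role of the Class
# objects described in the text).

class CastError(Exception):
    pass

def is_instance_of(value, cls):
    # walk the run-time superclass chain, like isSubclassOf:
    c = type(value)
    while c is not object:
        if c is cls:
            return True
        c = c.__base__
    return cls is object

def cast(value, cls):
    if is_instance_of(value, cls):
        return value          # the same value, at the narrower type
    raise CastError(f"{type(value).__name__} is not a {cls.__name__}")
```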

Yet another solution is flow typing, which allows us to eschew casting entirely. If we examine code with knowledge of how is<Integer> works, it’s clear that in the ifTrue: block, v must be an integer. A flow-typing-based type checker would give v the type Integer in that block, and the type Number everywhere else.

Section 12: Generics

Consider a class defining a linear linked list:

Object subclass: List [
    | el next |
    " constructor... "

    setEl: to [
        el := to.
    ]

    el [ ^el ]

    setNext: to [
        next := to.
    ]

    next [ ^next ]
]

If we want to give this class types, next is relatively easy (it’s a List), but what about el? Making el of type Object would force every user of the class to cast every time they retrieve an element from the list. If we make el a more specific type, our list would only be useful for that type.

The solution is parametric polymorphism: in object-orientation, the same concept exists over classes as generics.

A generic is a family of classes, defined by a function over types. In Strongtalk, we declare and call a type function with square brackets:

Object subclass: List[A] [
    | el <A> next <List[A]> |
    " constructor... "

    setEl: to <A> [
        el := to.
    ]

    el ^<A> [ ^el ]

    setNext: to <List[A]> [
        next := to.
    ]

    next ^<List[A]> [ ^next ]
]

A here is not a specific class, but any class. List is not a class, but a family of classes. To get a specific class, we instantiate List with a specific type argument, such as List[Integer]. Like \( \Lambda \)-abstractions in System F, this is simply done by substitution, substituting Integer for A:

Object subclass: List[Integer] [
    | el <Integer> next <List[Integer]> |
    ...
]

Generics raise some type-theoretic issues. First, there’s the question of when to type-check a generic class. Second and more important: is \( \text{List}[\text{Integer}] <: \text{List}[\text{Number}] \)? It can’t be, because that would allow the setEl: method to be used wrongly. And is \( \text{List}[\text{Number}] <: \text{List}[\text{Integer}] \)? Again, it can’t be; el would return the wrong type. In most languages, generic types are invariant over their parameter types for this reason.
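The unsoundness of covariant generics can be demonstrated directly. In this sketch, a run-time-checked container stands in for List[A]: if code typed against "List[Number]" is handed a List[Integer], its perfectly well-typed insertion of a float blows up.

```python
# A run-time-checked list standing in for List[A]. Treating a
# TypedList(int) as a "list of numbers" lets a float sneak toward an
# int-only slot, which the check catches.

class TypedList:
    def __init__(self, el_type):
        self.el_type = el_type
        self.items = []

    def set_el(self, value):           # setEl: -- argument position
        if not isinstance(value, self.el_type):
            raise TypeError(f"expected {self.el_type.__name__}")
        self.items.append(value)

    def el(self):                      # el -- return position
        return self.items[-1]

def expand_all(nums):
    """Typed (in our heads) against List[Number]: inserts a float."""
    nums.set_el(1.5)

int_list = TypedList(int)              # a List[Integer]
```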

Some languages instead allow parameter types to be declared as explicitly covariant or contravariant:

  • An explicitly covariant parameter type cannot be used in the return from a method (because its actual type may be a supertype of its specified type).
  • An explicitly contravariant parameter type cannot be used in a method argument.

Generics are erasable, in that a generic family of classes can be replaced with a single class if all types are removed. Java’s generic types are erased for backwards compatibility, but as a consequence, casts between different List types are allowed, causing errors when they’re used.

Aside: If generics are a form of parametric polymorphism, and Java has generics… yes, Java has inclusion polymorphism, overloading polymorphism, coercion polymorphism, and parametric polymorphism! That’s all four forms in the Cardelli-Wegner polymorphism hierarchy, bingo!

Templates in C++ extend the concept of generics to any number of classes or procedures. Although this makes templates very powerful, they’re not especially interesting from our perspective. We can simply consider “template” as another name for “generic”.

Section 13: Multiple Inheritance

Some object-oriented languages permit multiple inheritance, by which a class may inherit code from more than one other class. For instance:

(Money, Rectangle) subclass: Wallet [ ... ]

Multiple inheritance poses semantic problems when a class inherits from two or more classes that export methods with the same name. Consider adding + to Rectangle. Given v1 of type Wallet, what is the meaning of v1 + v2? It could mean either. Overloading could provide a solution, but even that falls flat if v2 is also a Wallet. There is no general answer to this conundrum.

Most languages ban multiple inheritance entirely because of this conundrum. Some make the conflicting method unusable on the problematic class, requiring an explicit cast to differentiate which method is meant. Some have explicit ways of stating preferences. Some, including Strongtalk, have a specific ordering to multiple inheritance, with later superclasses having priority over earlier ones.

Another consequence of multiple inheritance is repeated inheritance. Suppose class A inherits from classes B and C, each of which also inherits from a class D, which has a method g. Then A inherits g twice: once through B and once through C. If g is not overridden in either B or C, the two gs can be merged into one. But if one of the classes, say C, overrides g, we once again have a name clash.
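Python happens to permit multiple inheritance, so the repeated-inheritance scenario can be written out directly. Here D's g is overridden in C; Python resolves the clash with a fixed linearization order (its method resolution order), an instance of the "specific ordering" solution mentioned above.

```python
# The diamond: D defines g; B and C inherit from D; A inherits from both.

class D:
    def g(self):
        return "D's g"

class B(D):
    pass

class C(D):
    def g(self):                 # C overrides g: the name-clash case
        return "C's g"

class A(B, C):
    pass

# Python's MRO for A is A, B, C, D: B contributes no g, so C's wins.
```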

As bad as multiple inheritance is for our semantics, it’s much worse for implementation. Virtual tables worked because subclasses could look like their parent classes in the first elements of their virtual table, then add their own methods afterwards. But the virtual table for Money will want dollars in the first element, and the virtual table for Rectangle will want setWidth: in the first element, so there is no obvious way to define a virtual table for Wallet. There are many solutions, and none are very good:

  • Disallow multiple inheritance. (The simplest solution.)
  • Multiple virtual tables. Java classes have one virtual table for their class, and an extra virtual table for each interface they implement. A Java interface is a list of method signatures with no bodies, so a Java class doesn’t inherit from an interface at all, which solves the name clash issue. The difficulty is looking them up; classes in Java contain a hash-map mapping interfaces to their virtual tables.
  • Sparse virtual tables, used by many C++ compilers. If the compiler can see all of the classes it needs to compile, it can intentionally define the virtual tables for Money and Rectangle such that they don’t conflict.

Section 14: Blocks

In order for a language to be Turing-complete, it must have a way to represent decisions, and a way to repeat. Smalltalk, in an effort to be maximally object-oriented, instead opted for blocks.

Blocks always need to be specially handled, because they contain statements in the context of the surrounding method. Thus, a block is not like a method—it is nested within another method. In Smalltalk, all blocks are objects of the Block class, which has a field containing the internal representation of the block’s code. Thanks to encapsulation, the details of how the statements are actually stored do not need to be exposed to the end user.

Blocks may also define their own local variables. Most modern Smalltalk implementations treat blocks like procedures, creating multiple instances of the variables the block contains (multiple stack frames). But classic Smalltalk implementations have only one instance of a block’s variable per call to the surrounding method. Thus, this method will have different effects depending on the version of Smalltalk:

| block |
block := [:then :x |
    then value: [] value: 5.
    x
].
(block value: block value: 42) displayNl.

If only one x is allocated, this method will print 5, because the recursive call has replaced the value of x. If an x is allocated for each call, this method will print 42.
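Python allocates a fresh frame for every call, so an analogous snippet (a loose translation, not a faithful Smalltalk model) behaves like the modern implementations: the recursive call gets its own x, and the outer call still sees 42.

```python
# A per-call-frame analogue of the block example: "then" is called with a
# do-nothing continuation and 5, then the original x is returned.

def block(then, x):
    then(lambda *_: None, 5)     # recursive call with its own x = 5
    return x                     # the outer frame's x is untouched

result = block(block, 42)
```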

An even more unusual behavior of blocks is return statements. A block may contain return statements, and if one is encountered, it returns from the surrounding method. Consider this example, in which we’ve explicitly defined True for context:

Boolean subclass: True [
    ifTrue: block [
        ^ block value
    ]
]

Object subclass: Foo [
    foo [
        true ifTrue: [
            ^ 42
        ]
    ]
]

When we call Foo>>foo, it calls True>>ifTrue: with a block as an argument. True>>ifTrue:, in turn, calls value, evaluating the block. The call stack now looks something like this:

Foo>>foo
True>>ifTrue:
block from Foo>>foo

When that block evaluates its return statement, Foo>>foo returns, bypassing True>>ifTrue: entirely! So, Smalltalk implementations need to be able to break out of multiple layers of the stack.

Aside: For those curious about implementation, this is done by having block objects contain a stack location, and then (fairly brutishly) forcing the stack pointer to that location when they return. This technique wouldn’t work in all languages; the alternative is to explicitly read and “unwind” the stack, which is what’s done by exceptions in most languages that implement them.
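The "unwind" strategy from the aside can be sketched with exceptions: the block's return raises, and the surrounding method catches its own returns while letting foreign ones propagate. The names below are hypothetical stand-ins for the Foo/True example.

```python
# Non-local return via stack unwinding: the block's ^ 42 is an exception
# tagged with its home activation; foo catches it, bypassing if_true.

class NonLocalReturn(Exception):
    def __init__(self, home, value):
        self.home, self.value = home, value

def if_true(cond, block):            # stands in for True>>ifTrue:
    if cond:
        return block()

def foo():                           # stands in for Foo>>foo
    home = object()                  # identifies this activation
    def block():
        raise NonLocalReturn(home, 42)   # the block's "^ 42"
    try:
        if_true(True, block)
    except NonLocalReturn as e:
        if e.home is home:
            return e.value           # unwound past if_true entirely
        raise                        # a dead or foreign context
    return None
```

A block stripped from its method would raise a NonLocalReturn whose home no other frame recognizes, which is morally the "return from a dead method context" error below.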

Weirder still, since blocks are values, we can actually strip a block from its surrounding method, even if it contains a return statement:

Object subclass: Bar [
    bar [
        ^[ ^42 ]
    ]
]
Object subclass: Baf [
    baf [
        | x |
        x := Bar new.
        x := x bar.
        x value. " Where does this block return from??? "
    ]
]

There is no good answer to the question of where a block removed from its surrounding method returns. Most just raise an error. For instance, GNU Smalltalk will report:

Object: 42 error: return from a dead method context

Section 15: Object-Based and Prototype-Based Languages

We’ve focused our attention on class-based languages, but a language does not need to be class-based to be object-oriented. Object-based languages differ from class-based languages in that they lack an explicit “class” construction for defining groups of similar objects. Many of the characteristics typical of object-based languages are outlined in a document known as The Treaty of Orlando.

Proponents of object-based languages argue that a class is a rigid structure that establishes a template for all future code reuse, thereby constraining the ways in which software may evolve. Proper use of classes requires some degree of foresight—a “vision” of the structure of the overall system. In the early stages of development, classes may get in the way; a design change may require that the entire class hierarchy be redesigned. Further, classes are not well-suited to capturing idiosyncratic behaviour.

On the other hand, as software projects mature, wholesale design changes are less likely to occur, and the rigidity, strong typing, and stability provided by classes become more valuable. For these reasons, object-based languages are often used for prototyping software systems in the early stage of their development (hence, prototyping languages). Some languages, such as TypeScript, support both styles for this reason.

We will use mostly the syntax of Self in this section. Self’s syntax is similar to, and derived from, Smalltalk’s.

15.1 Code Sharing in Object-Based Languages

In the absence of classes, we must have a syntax to define objects directly, and field and method declarations must take place within the objects themselves:

x = (|
    n = 15.
    f = (
        self n := self n + 1
    )
|).

This defines an object with fields n and f, in which the field f is bound to a method. That object is bound to the name x. Note that we don’t distinguish fields from methods in object-based languages: to put a method on an object, we assign it to a field. As a consequence, methods are values in object-based languages.

Cloning. Most object-based languages allow objects to be cloned (shallowly duplicated):

y = x clone.

The object y is an exact copy of x, having fields n and f. Using cloning, we may create any number of objects with identical behavior. Since methods are just fields, we can specialize a cloned object’s behavior by changing that field in a clone.

Prototyping. While cloning works well, it’s fairly restrictive. Self also allows prototyping, by which an object may delegate behavior to another object:

z = (| prototype* = x |).

When a field is looked up in z, if it’s not directly defined in the object, it instead looks for it in the prototype* field. If it finds a method in the prototype, in this case x, the self parameter will still be z, so that method can access z’s fields. In this way, we can build classes fairly directly with prototypes: we build all the methods into one prototype, then create objects that have the correct fields and that prototype. Self allows an object to have multiple prototypes—any field named with a * is a prototype—and thus supports multiple inheritance. It uses priority to disambiguate, with alphabetically earlier names prioritized over alphabetically later names.
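Prototype lookup with receiver rebinding can be sketched in Python (a single-prototype model; Self's multiple starred prototypes are omitted). Field lookup falls back through the prototype chain, but a method found on a prototype still runs with the original receiver, so it can read the receiver's own fields.

```python
# A minimal prototype-delegation model: fields live in a dict, lookup
# walks the prototype chain, and send keeps the original receiver.

class Obj:
    def __init__(self, prototype=None, **fields):
        self.prototype = prototype
        self.fields = dict(fields)

    def lookup(self, name):
        o = self
        while o is not None:
            if name in o.fields:
                return o.fields[name]
            o = o.prototype
        raise AttributeError(name)

    def send(self, name, *args):
        # the receiver stays bound to the original object, not the
        # prototype the method was found on
        return self.lookup(name)(self, *args)

# the shared prototype holds behavior; derived objects hold state
proto = Obj(area=lambda self: self.lookup("width") * self.lookup("height"))
rect = Obj(prototype=proto, width=3, height=4)
```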

The constructor pattern. In Self, it’s common to use prototypes and a constructor method:

Rectangle = (|
    withWidth: x height: y = (
        | r |
        r := (|
            prototype* = self.
            width = x.
            height = y
        |).
        ^r
    ).

    area = (
        ^self width * self height
    )
|)

The withWidth:height: method is meant to be called on Rectangle itself, and creates an object with Rectangle (as self) as a prototype. The area method won’t work if called on Rectangle itself, because it will look for fields named width and height, but no such fields exist on Rectangle. Instead, objects created by withWidth:height: will have these fields, and through their prototype, the area method. A much more popular prototype-based language, JavaScript, does not support cloning, and directly supports the constructor pattern.

Section 16: Miscellany

An object-oriented system usually requires some initial configuration by the implementation, which cannot be implemented in the language itself. For instance, Smalltalk requires every class to be a subclass of some other class, except for Object. There is no way to create a class without a superclass in Smalltalk, so Object must be defined by the implementation. The same is true of blocks, and the Class class, as well as the special values true, false, and nil. On the other hand, the classes for those special values—True, False, and Undefined, respectively—can actually be defined in Smalltalk, and are.

Because Smalltalk systems are usually environments, rather than files, they have images. An image is a file-based representation of the state of all objects in a Smalltalk system—bearing in mind that classes are objects of the Class class—which can then be loaded to recreate the environment. Essentially, an image is a file representing \( \sigma \) and \( \Sigma \). GNU Smalltalk was chosen for this course because it’s incredibly difficult to grade an image.

Although neither Smalltalk nor Self are themselves particularly popular, their implementations have had profound effects on computing. The concept of Just-in-Time Compilation (JIT)—compiling code as the program runs—was invented for Smalltalk, refined in Self, and then popularized in Java and JavaScript. This is the usual fate of a research language: its concepts are used, but the language is not.


References

  1. Gilad Bracha and David Griswold. Strongtalk: Typechecking Smalltalk in a Production Environment. ACM SIGPLAN Notices, 28(10):215–230, 1993.
  2. Adele Goldberg. Smalltalk-80: The Interactive Programming Environment. Addison-Wesley Longman Publishing Co., Inc., 1984.
  3. Adele Goldberg and David Robson. Smalltalk-80: The Language and its Implementation. Addison-Wesley Longman Publishing Co., Inc., 1983.
  4. Atsushi Igarashi, Benjamin C. Pierce, and Philip Wadler. Featherweight Java: A Minimal Core Calculus for Java and GJ. ACM Transactions on Programming Languages and Systems (TOPLAS), 23(3):396–450, 2001.
  5. Lynn Andrea Stein, Henry Lieberman, and David Michael Ungar. A Shared View of Sharing: The Treaty of Orlando. Brown University, Department of Computer Science, 1988.
  6. David Ungar and Randall B. Smith. Self: The Power of Simplicity. In Conference Proceedings on Object-Oriented Programming Systems, Languages and Applications, pages 227–242, 1987.

Module 9: Concurrent Programming

“Concurrency is the most extreme form of programming. It’s like white-water rafting without the raft or sky diving without the parachute.” — Peter Buhr

Concurrent Programming

We’ve developed two core calculi and looked at four exemplar programming languages, but we’re as yet missing one of the most important characteristics of modern computing: concurrency. First, let’s clarify some terminology, because the concurrent programming community is quite particular about it:

  • Parallelism is the phenomenon of two or more things—presumably, two computations—actually happening at the same time.
  • Concurrency is the experience of two or more things—such as applications, tasks, or threads—appearing to happen at the same time, whether or not they actually are.

For instance, an average computer in the 1990s had only one CPU and one core, and so had no parallelism, but could still run multiple applications “at the same time”, and so still had concurrency. In this case, concurrency is achieved by quickly switching which task the CPU is concerned with between the various running programs. It is also possible for a system to be parallel without being concurrent: for instance, a compiler may optimize an imperative loop into a specialized parallel operation on a particular CPU, but this is only visible to the programmer as a speed boost, so the programmer’s experience still lacks concurrency. And, of course, a system can have both: on a modern, multi-core CPU, one uses concurrency to take advantage of the parallelism, by running multiple programs or threads on its multiple cores. The concurrent tasks appear to happen at the same time because they do happen at the same time.

The domain of concurrent programming has changed dramatically over time, in response to growing parallelism. Early exploration into concurrency was theoretical, and then became practical with networks. When multi-core CPUs started becoming the norm, concurrency rose in importance again, because it is often impossible to take advantage of real parallelism without concurrency.

The core idea of a concurrent system is that there are multiple tasks which can communicate with each other, and computation can proceed in any of them at any rate. This is contrary to the behavior of the \( \lambda \)-calculus or the Simple Imperative Language, and contrary to all of our exemplar languages (at least, as far as we’ve investigated them), so we will need both a new formal language and a new exemplar language for concurrency. In practice, of course, in the same way that many languages which are not fundamentally object oriented have picked up object-oriented features, most languages which are not fundamentally concurrent have picked up at least some kind of concurrent features.

In a formal model of concurrency, we need a way of expressing multiple simultaneous tasks, and, as usual, a way of taking a step. Unlike our previous formal semantics, we want this semantics to be non-deterministic: the fact that any of multiple tasks may proceed is the essence of concurrency. However, we will not define our semantics as truly parallel: a step will involve only one task taking a step.

Another issue to be addressed is models of concurrency, i.e., how a concurrent language presents multiple tasks. Models of concurrency are defined across two dimensions: how one specifies multiple tasks, and how those multiple tasks communicate. Specification comes down to what structures a language provides for a program to create tasks, and often, to specify real parallelism as well. Options include threads of various sorts, actors, processes, and many others, but all of these terms are imprecise and ambiguous. Except for demonstrating the mechanisms for specifying concurrency in our formal model and exemplar language, we will put little focus on specification.

The other dimension is communication, which falls largely into two forms: shared-memory concurrency and message-passing concurrency. In shared-memory concurrency, either the entire heap or some portion of the heap is shared between multiple tasks. In our terms, all tasks have the same \( \Sigma \), and if they share any labels, they can see changes made by other tasks. However, because multiple tasks can compute at the same time, there may not be any guarantee that one task’s write to a location in \( \Sigma \) occurs before another task’s read. Other mechanisms, such as locks, are required to guarantee ordering.
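To make the ordering problem concrete, here is a minimal Python sketch of shared-memory concurrency (Python is used purely for illustration; the names are mine, not part of any formal model): four threads update one shared counter, and a lock provides the ordering guarantee described above.

```python
import threading

counter = 0                      # shared state: one location visible to all tasks
lock = threading.Lock()

def increment(n):
    global counter
    for _ in range(n):
        # Without the lock, two tasks could both read the same old value,
        # and one of the two updates would be lost; the lock forces each
        # read-modify-write to complete before another task's begins.
        with lock:
            counter += 1

threads = [threading.Thread(target=increment, args=(10000,)) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(counter)  # 40000: every update survives because the lock orders them
```

Removing the `with lock:` line makes the program a demonstration of the problem instead: the final count can come up short, and by a different amount on each run.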

In message-passing concurrency, two new primitive operations are introduced: sending a message and waiting for a message. A task which is waiting for a message cannot proceed until another task sends it a message. A task may send a message at any time, but to do so, must have a way of communicating with the target task. Thus, message-passing concurrency requires some form of message channels, by which two tasks can arrange to exchange messages. Ordering is guaranteed by the directionality of message passing; a waiting task will not proceed until a sending task sends a message.

Aside: Shared-memory and message-passing concurrency are equally powerful, and in fact, either can be rewritten in terms of the other. However, this rewriting is fairly unsatisfying: shared memory can be rewritten in terms of message passing by imagining \( \Sigma \) itself as a “task” that expects messages instructing it to read and write certain memory locations. Equivalently, we can consider a modern CPU as sending messages to the memory bus, rather than simply reading and writing memory. As a practical matter, message-passing implementations tend to be slower than shared-memory implementations of the same algorithm.

This course is not intended to compete with CS343 (Concurrent and Parallel Programming), so will take a very language-focused view of concurrency. That course uses concurrency to solve problems; in this course, concurrency is only a cause of problems.

\( \pi \)-Calculus

In \( \lambda \)-calculus, we built a surprising amount of functionality around abstractions: with only three constructs in the language (abstractions, applications, and variables), we could represent numbers, conditionals, and ultimately, anything computable. \( \pi \)-calculus (The Pi Calculus) takes a similar approach to message-passing concurrency. The only structures are concurrency, communication, replication, and names, but these will be sufficient to build any computation. Notably lacking are functions (or abstractions). \( \pi \)-calculus itself was developed by Robin Milner, Joachim Parrow, and David Walker in A Calculus of Mobile Processes, but it was the culmination of a long line of development of calculi of communicating processes, to which Uffe Engberg and Mogens Nielsen also made significant contributions.

We will look at the behavior of \( \pi \)-calculus, but will not discuss how to encode complex computation into \( \pi \)-calculus concurrency. \( \pi \)-calculus is Turing-complete, but like \( \lambda \)-calculus, it’s more common to layer other semantics on top of it than to take advantage of its own computational power.

A \( \pi \)-calculus program consists of any number of concurrent tasks, called processes, separated by pipes (\( | \)). Those tasks can be grouped. Each process can create a channel, receive a message on a channel, send a message on a channel, replicate processes, or terminate. The only values in \( \pi \)-calculus are channels, so any encoding of useful information also needs to be done with channels, and the only value you can send over a channel is another channel. Channels are also called “names”, because they are simply named by variables; two processes must agree on a name in order to communicate (subject to a further restriction which we will discuss soon).

The syntax of \( \pi \)-calculus is as follows, presented in BNF, with \( \langle \textit{program} \rangle \) as the starting non-terminal:

⟨program⟩  ::=  ⟨program⟩ "|" ⟨program⟩
             |   ⟨receive⟩
             |   ⟨send⟩
             |   ⟨restrict⟩
             |   ⟨replicate⟩
             |   ⟨terminate⟩

⟨receive⟩  ::=  ⟨var⟩ ( ⟨var⟩ ) . ⟨program⟩
⟨send⟩     ::=  ⟨var⟩ "⟨" ⟨var⟩ "⟩" . ⟨program⟩
⟨restrict⟩ ::=  ( ν ⟨var⟩ ) ⟨program⟩
⟨replicate⟩::=  ! ⟨program⟩
⟨terminate⟩::=  0
⟨var⟩      ::=  a | b | c | ···

Note that \( \nu \) is the Greek letter nu, not the Latin/English letter ‘v’, because somebody decided that using confusing, ambiguous Greek letters was acceptable; we will avoid using v as a variable for this reason. Like in \( \lambda \)-calculus, we will actually be more lax in our use of variable names than this BNF suggests, for clarity. Like in the Simple Imperative Language, this is assumed to be an abstract syntax, and we will add parentheses to disambiguate as necessary.

The pipe (\( | \)) in \( \langle \textit{program} \rangle \) has the lowest precedence, so, for instance, \( x(y).0 \mid z(a).0 \) is read as \( (x(y).0) \mid (z(a).0) \), not \( x(y).(0 \mid z(a).0) \).

Unfortunately, \( \pi \)-calculus uses several of the symbols that we also use in BNF; we’ve surrounded those in quotes to separate them from the BNF metasyntax. Sends are additionally written with an overline over the channel name. Here are two small examples to clarify the syntax. The following snippet receives a message on the channel \( x \), into the variable \( y \), before proceeding with the process \( P \):

\[ x(y).P \]

The following snippet sends \( y \) on the channel \( x \), before proceeding with the process \( P \):

\[ \bar{x}\langle y \rangle.P \]

As discussed, a program consists of a number of processes, separated by pipes. Each process is itself a program, so the distinction is just usage. We will use the term “process” to refer to any construction other than the composition of multiple programs with a pipe, so that a program can be read as a list of processes.

A program proceeds through its processes sending and receiving messages on channels until they terminate. For instance, this program consists of two processes, of which the first sends the message \( h \) to the second, and the second then attempts to pass that \( h \) along on another channel:

\[ \bar{x}\langle h \rangle.0 \mid x(y).\bar{z}\langle y \rangle.0 \]

The first process is \( \bar{x}\langle h \rangle.0 \), which consists of a send of \( h \) over the channel \( x \), and then termination of the process (\( 0 \)). The second process is \( x(y).\bar{z}\langle y \rangle.0 \), which consists of a receive of \( y \) from the channel \( x \), then a send of \( y \) over the channel \( z \), then termination. Programs in \( \pi \)-calculus proceed by sending and receiving messages; in this case, the program can proceed, because a process is trying to send on \( x \), and another process is trying to receive on \( x \). After sending that message, the program looks like this:

\[ 0 \mid \bar{z}\langle h \rangle.0 \]

Like in \( \lambda \)-calculus applications, message receipt works by substitution, so the \( y \) was substituted for the received message, \( h \). A terminating process can simply be removed, so the next step is as follows:

\[ \bar{z}\langle h \rangle.0 \]

This program cannot proceed, because no process is prepared to receive a message on the channel \( z \).
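The step just shown can be carried out mechanically. Below is a minimal Python sketch (the tuple encoding is my own invention, not standard notation) that represents each process as nested tuples and performs one send/receive step by substitution; restriction and replication are deliberately omitted.

```python
# Processes as tuples: ("send", chan, msg, cont), ("recv", chan, var, cont),
# and ("stop",) for 0. A program is a list of processes.

def subst(p, var, name):
    """Replace free occurrences of var with name in process p
    (no restrictions in this tiny sketch, so only receives bind)."""
    if p == ("stop",):
        return p
    tag, chan, x, cont = p
    chan = name if chan == var else chan   # the channel position is free
    if tag == "send":
        x = name if x == var else x        # the message position is free too
        return (tag, chan, x, subst(cont, var, name))
    if x == var:                           # a receive binds x, shadowing var
        return (tag, chan, x, cont)
    return (tag, chan, x, subst(cont, var, name))

def message_step(program):
    """Perform the first available Message reduction, if any."""
    for i, p in enumerate(program):
        for j, q in enumerate(program):
            if i != j and p[0] == "send" and q[0] == "recv" and p[1] == q[1]:
                new = list(program)
                new[i] = p[3]                       # sender's continuation
                new[j] = subst(q[3], q[2], p[2])    # receiver's, substituted
                return [r for r in new if r != ("stop",)]  # drop terminated
    return None

# x̄⟨h⟩.0 | x(y).z̄⟨y⟩.0
prog = [("send", "x", "h", ("stop",)),
        ("recv", "x", "y", ("send", "z", "y", ("stop",)))]
print(message_step(prog))  # → [('send', 'z', 'h', ('stop',))], i.e. z̄⟨h⟩.0
```

A second call on the result would return `None`: just as in the text, nothing is prepared to receive on \( z \), so the program is stuck.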

Consider this similar program:

\[ \bar{x}\langle h \rangle.z(a).0 \mid x(y).\bar{z}\langle y \rangle.0 \mid x(y).0 \]

This time, we have three processes. The first sends the message \( h \) over the channel \( x \), then receives a message on the channel \( z \), then terminates. The second is identical to our original second process: it receives a message on \( x \), then sends it back on \( z \). The third receives a message on \( x \), then immediately terminates. This program can proceed, because a process is sending on \( x \) and a process is prepared to receive on \( x \). But, which process receives the message? \( \pi \)-calculus is non-deterministic, so the answer is that either process may receive the message. Both ways for the program to proceed are valid. In this case, these two reductions are both valid:

\[ \Rightarrow\; z(a).0 \mid \bar{z}\langle h \rangle.0 \mid x(y).0 \qquad\qquad \Rightarrow\; z(a).0 \mid x(y).\bar{z}\langle y \rangle.0 \mid 0 \]\[ \Rightarrow\; 0 \mid 0 \mid x(y).0 \qquad\qquad\qquad\qquad \Rightarrow\; z(a).0 \mid x(y).\bar{z}\langle y \rangle.0 \]\[ \Rightarrow\; 0 \mid x(y).0 \]\[ \Rightarrow\; x(y).0 \]

The first sequence of reductions occurs if the message is received by the second process, and the second sequence occurs if the message is received by the third process. The first sequence may seem more complete, or valid, since two of the three processes terminated, but both are valid. We are using \( \Rightarrow \) informally for “takes a step”, because we haven’t yet formally defined \( \to \), and we will discover that there’s an extra complication to the definition of steps in \( \pi \)-calculus in the section on Structural Congruence.

Because name uniqueness is so important to communication, \( \pi \)-calculus also has a mechanism to restrict names. For instance, consider this rewrite of the above program:

\[ (\nu x)(\bar{x}\langle h \rangle.z(a).0 \mid x(y).\bar{z}\langle y \rangle.0) \mid x(y).0 \]

It is identical to the previous program, except that the first two processes are nested inside of a restriction: \( (\nu x) \). A restriction is like a variable binding, in that it scopes the variable to its context: the \( x \) inside of the restriction is not the same as the \( x \) outside of the restriction. Unlike variable binding, however, it doesn’t bind it to any particular value—remember, names are all there are in \( \pi \)-calculus, so there’s nothing else it could be bound to—it merely makes for two distinct interpretations of the name. That’s why it’s called a restriction; it restricts the meaning of, in this case, \( x \), within the restriction expression. Now, this program can only proceed like so:

\[ \Rightarrow\; (\nu x)(z(a).0 \mid \bar{z}\langle h \rangle.0) \mid x(y).0 \]\[ \Rightarrow\; (\nu x)(0 \mid 0) \mid x(y).0 \]\[ \Rightarrow\; (\nu x)(0) \mid x(y).0 \]\[ \Rightarrow\; x(y).0 \]

In the last step, we can remove a restriction when it is no longer restricting anything (i.e., when its contained process terminates). When a restriction is at the top level like this, it can always be rewritten by renaming variables to new, fresh names, so the above program is equivalent to the following program:

\[ \bar{x'}\langle h \rangle.z(a).0 \mid x'(y).\bar{z}\langle y \rangle.0 \mid x(y).0 \]

Aside: If your eyes are glazing over from \( \pi \)-calculus syntax, don’t worry, you’re not the only one. Something about \( \pi \)-calculus’s use of overlines and \( \nu \) and pipes makes it semantically dense and difficult to read. Just be careful of the pipes and parentheses.

A restriction only applies to the variable it names, so processes within restrictions are still allowed to communicate with processes outside of restrictions:

\[ (\nu x)\bar{z}\langle h \rangle.0 \mid z(y).\bar{x}\langle y \rangle.0 \;\Rightarrow\; (\nu x)\,0 \mid \bar{x}\langle h \rangle.0 \;\Rightarrow\; \bar{x}\langle h \rangle.0 \]

Substitution must be aware of restriction, because a restricted variable is distinct from the same name in the surrounding code. For instance:

\[ (x(y).(\nu x)x(y).0)[z/x] \;=\; z(y).(\nu x)x(y).0 \]

This exception is the same as is introduced by \( \lambda \)-abstractions in substitution for \( \lambda \)-calculus.

The only messages that processes can send are names. Names are also the channels by which processes send messages. As a consequence, processes can send channels over channels. For instance, consider this program:

\[ \bar{x}\langle z \rangle.0 \mid x(y).\bar{y}\langle h \rangle.0 \mid z(a).0 \]

The first process will send the name \( z \) over the channel \( x \). The second process is waiting to receive a message on the channel \( x \). The third process is waiting to receive a message on the channel \( z \), but there is no send on the channel \( z \) in the entire program. The third process cannot possibly proceed, but the first and second can:

\[ \Rightarrow\; 0 \mid \bar{z}\langle h \rangle.0 \mid z(a).0 \;\Rightarrow\; \bar{z}\langle h \rangle.0 \mid z(a).0 \]

The second process received a \( z \) on the channel \( x \), as the variable \( y \). But, it then proceeds to send on the channel \( y \). Because message receipt works by substitution, that \( y \) has been substituted for \( z \). This program can now proceed, by sending \( h \) to the third process.

Restriction has an unusual interaction with sending channels. For instance, consider this program:

\[ (\nu x)\bar{z}\langle x \rangle.x(y).0 \mid z(a).\bar{a}\langle x \rangle.0 \]

The first process is under a restriction for \( x \), and the second process is not. But, the behavior of the first process is to send \( x \) over the channel \( z \), and the second process is waiting to receive a message on the channel \( z \). It’s then going to send \( x \) back, but that \( x \) isn’t the same \( x \) as the first process’s \( x \), because of the restriction. So which \( x \) ends up where? The answer is made clear by our statement that a restriction can always be rewritten by simply using a new name. In this case, this program can be rewritten like so:

\[ \bar{z}\langle x' \rangle.x'(y).0 \mid z(a).\bar{a}\langle x \rangle.0 \]

From this state, the steps are clear:

\[ \Rightarrow\; x'(y).0 \mid \bar{x'}\langle x \rangle.0 \;\Rightarrow\; 0 \mid 0 \;\Rightarrow\; 0 \]

Finally, \( \pi \)-calculus supports process creation: a process may create more processes. There are actually two mechanisms of process creation. First, processes may simply be nested. For instance, consider this program:

\[ x(y).(\bar{y}\langle h \rangle.0 \mid \bar{y}\langle m \rangle.0) \mid f(a).\bar{z}\langle a \rangle.0 \mid f(b).\bar{z}\langle b \rangle.0 \mid \bar{x}\langle f \rangle.0 \]

Note in particular the position of the parentheses: this program has four processes, not five! The first process is \( x(y).(\bar{y}\langle h \rangle.0 \mid \bar{y}\langle m \rangle.0) \). Although this process has the pipe which separates multiple processes within it, those are not two independent processes until this process has received a message on the channel \( x \). Essentially, as soon as this process receives a message, it will split into two processes. This program can proceed as follows (this is not the only possible sequence):

\[ \Rightarrow\; (\bar{f}\langle h \rangle.0 \mid \bar{f}\langle m \rangle.0) \mid f(a).\bar{z}\langle a \rangle.0 \mid f(b).\bar{z}\langle b \rangle.0 \]\[ \Rightarrow\; \bar{f}\langle h \rangle.0 \mid \bar{f}\langle m \rangle.0 \mid f(a).\bar{z}\langle a \rangle.0 \mid f(b).\bar{z}\langle b \rangle.0 \]\[ \Rightarrow\; 0 \mid \bar{f}\langle m \rangle.0 \mid \bar{z}\langle h \rangle.0 \mid f(b).\bar{z}\langle b \rangle.0 \]\[ \Rightarrow\; \bar{f}\langle m \rangle.0 \mid \bar{z}\langle h \rangle.0 \mid f(b).\bar{z}\langle b \rangle.0 \]\[ \Rightarrow\; 0 \mid \bar{z}\langle h \rangle.0 \mid \bar{z}\langle m \rangle.0 \]\[ \Rightarrow\; \bar{z}\langle h \rangle.0 \mid \bar{z}\langle m \rangle.0 \]

Exercise 1. Give another possible sequence for this example.

The other mechanism of process creation is replication. The \( ! \) operator creates an endless sequence of identical processes. For instance, consider the following program:

\[ !x(a).0 \mid \bar{x}\langle b \rangle.0 \mid \bar{x}\langle c \rangle.0 \mid \bar{x}\langle d \rangle.0 \]

There are three processes trying to send on the channel \( x \), but only one process with a receive on the channel \( x \). However, the \( ! \) creates any number of the same process, so all three of the sending processes can proceed, in any order. This is one possible sequence:

\[ \Rightarrow\; x(a).0 \mid !x(a).0 \mid \bar{x}\langle b \rangle.0 \mid \bar{x}\langle c \rangle.0 \mid \bar{x}\langle d \rangle.0 \]\[ \Rightarrow\; 0 \mid !x(a).0 \mid \bar{x}\langle b \rangle.0 \mid 0 \mid \bar{x}\langle d \rangle.0 \]\[ \Rightarrow\; !x(a).0 \mid \bar{x}\langle b \rangle.0 \mid 0 \mid \bar{x}\langle d \rangle.0 \]\[ \Rightarrow\; !x(a).0 \mid \bar{x}\langle b \rangle.0 \mid \bar{x}\langle d \rangle.0 \]\[ \Rightarrow\; x(a).0 \mid !x(a).0 \mid \bar{x}\langle b \rangle.0 \mid \bar{x}\langle d \rangle.0 \]\[ \Rightarrow\; 0 \mid !x(a).0 \mid 0 \mid \bar{x}\langle d \rangle.0 \]\[ \Rightarrow\; !x(a).0 \mid 0 \mid \bar{x}\langle d \rangle.0 \]\[ \Rightarrow\; !x(a).0 \mid \bar{x}\langle d \rangle.0 \]\[ \Rightarrow\; x(a).0 \mid !x(a).0 \mid \bar{x}\langle d \rangle.0 \]\[ \Rightarrow\; 0 \mid !x(a).0 \mid 0 \]\[ \Rightarrow\; !x(a).0 \mid 0 \]\[ \Rightarrow\; !x(a).0 \]

Concurrency vs. Parallelism

The astute reader may have noticed that we have described sequences of steps, with no true parallelism. For instance, consider the following program:

\[ \bar{x}\langle a \rangle.0 \mid \bar{y}\langle b \rangle.0 \mid x(z).0 \mid y(z).0 \]

There are two possible steps this program can take—it can send a message on \( x \) or \( y \)—but we describe it as taking one or the other, not both at the same time. The concurrency comes from the lack of prioritization, and non-determinism: each of these two options is equally valid, and to consider how this program proceeds, we need to consider both possibilities. But, the concurrency is restricted by the nature of messages: only pairs of matching sends and receives can actually proceed.

Because concurrency models the appearance of multiple tasks happening simultaneously, in most cases, it is not necessary to model true parallelism. The most complex formal models in the domain of concurrency and parallelism are models of parallel shared-memory architectures, and even they are formally descriptions of concurrency rather than parallelism, in that they model parallel action as a non-deterministic ordering.

Structural Congruence

Because processes may proceed in any order in \( \pi \)-calculus, \( P \mid Q \) is not meaningfully distinct from \( Q \mid P \). Similarly, \( (\nu x)P \mid Q \) is not meaningfully distinct from \( P[y/x] \mid Q \), where \( y \) is a new name, and \( \alpha \)-equivalent programs are also indistinct.

In \( \lambda \)-calculus, these equivalences mostly gave us a baseline for comparing things. In \( \pi \)-calculus, it would be difficult or impossible to define reduction without this equivalence, because of the non-deterministic ordering of steps.

This equivalence is defined formally as structural congruence, written as \( \equiv \). That is, \( P \equiv Q \) means that \( P \) is structurally congruent to \( Q \). Structural congruence is reflexive, symmetric, and transitive.

The formal rules for structural congruence follow. Note that different presentations of \( \pi \)-calculus present slightly different but equivalent rules of structural congruence, so this may not exactly match other materials on the same topic.

Definition 1. (Structural congruence)

Let the metavariables \( P \), \( Q \), and \( R \) range over programs, and \( x \) and \( y \) range over names. Then the following rules describe structural congruence of \( \pi \)-calculus programs:

\[ \textbf{C_Alpha} \quad \dfrac{P =_\alpha Q}{P \equiv Q} \qquad\qquad \textbf{C_Nest} \quad \dfrac{P \equiv P'}{P \mid Q \equiv P' \mid Q} \]\[ \textbf{C_Order} \quad P \mid Q \equiv Q \mid P \qquad\qquad \textbf{C_Paren} \quad (P \mid Q) \mid R \equiv P \mid (Q \mid R) \equiv P \mid Q \mid R \]\[ \textbf{C_Termination} \quad 0 \mid P \equiv P \]\[ \textbf{C_Restriction} \quad \frac{y \text{ is a fresh variable}}{(\nu x)P \equiv P[y/x]} \]\[ \textbf{C_Replication} \quad {!P} \equiv P \mid {!P} \]

The C_Alpha rule specifies that \( \alpha \)-equivalence implies structural congruence, i.e., two \( \alpha \)-equivalent programs are also structurally congruent. The C_Order and C_Nest rules allow us to reorder programs and apply structural congruence to subprograms. The C_Paren rule specifies that all concurrent processes are equivalent, and different placements of parentheses do not affect their composition, so we can remove parentheses at the top level of a program. The C_Termination rule describes termination in terms of equivalence: rather than termination being a step, we can describe a program with a terminating process as equivalent to a program without that process. The C_Restriction rule makes explicit our description of restriction as creating a fresh variable. Finally, the C_Replication rule makes replication a property of structural congruence, rather than a step: a replicating process is simply equivalent to a version with a replica, and thus, by the transitive property, equivalent to a version with any number of replicas.

Because of C_Termination, C_Restriction, and C_Replication, only sending and receiving messages is described as an actual step of computation. Everything else is structural congruence.

Note that C_Restriction does not allow us to remove all restrictions from a program, because restrictions may be nested inside of other constructs, and structural congruence does not allow us to enter any other constructs. For instance, the program \( x(y).(\nu z)0 \) has no structural equivalent (except for \( \alpha \)-renaming), because the restriction of \( z \) is nested inside of a receipt on the channel \( x \).

Formal Semantics

With structural congruence, we may now describe the formal semantics of \( \pi \)-calculus.

Definition 2. (Formal semantics of \( \pi \)-calculus)

Let the metavariables \( P \), \( Q \), and \( R \) range over programs, and \( x \), \( y \), and \( z \) range over names. Then the following rules describe the formal semantics of \( \pi \)-calculus:

\[ \textbf{Congruence} \quad \dfrac{P \equiv Q \quad Q \to Q' \quad Q' \equiv R}{P \to R} \]\[ \textbf{Message} \quad \bar{x}\langle y \rangle.P \mid x(z).Q \mid R \;\to\; P \mid Q[y/z] \mid R \]

Because of structural congruence, these two rules are all that is needed to define the semantics of \( \pi \)-calculus. By Congruence, \( P \) can reduce to \( R \) if \( P \) is equivalent to some \( Q \) which can reduce to some \( Q' \), and \( Q' \) is equivalent to \( R \). That is, reduction is insensitive to structural congruence in both its “from” and its “to” state. Message describes the only actual reduction step in our semantics: if there is a process to send a message, and a process to receive a message on the same channel, then we may take a step in both, by removing the send from the sending process, removing the receipt from the receiving process, and substituting the variable in the receiving process with the value sent by the sending process.

Message itself is deterministic. The non-determinism in \( \pi \)-calculus is introduced by Congruence. Every program \( P \) has infinitely many structurally congruent equivalents. Some number of those structurally congruent programs are able to take steps with Message. Each of those is equivalently correct, and none has priority; all are valid reduction steps.
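The non-determinism can be made concrete with a small Python sketch (my own encoding: processes as lists of actions, with a naive substitution that ignores restriction): enumerating every matching send/receive pair enumerates every available Message step, up to reordering.

```python
def steps(program):
    """All programs reachable in one Message step: every matching
    send/receive pair on the same channel gives a distinct reduct."""
    out = []
    for i, sender in enumerate(program):
        for j, receiver in enumerate(program):
            if i == j or not sender or not receiver:
                continue
            stag, schan, sval = sender[0]
            rtag, rchan, rvar = receiver[0]
            if stag == "send" and rtag == "recv" and schan == rchan:
                new = list(program)
                new[i] = sender[1:]   # the sender's continuation
                # naive substitution of the received name (ignores binding)
                new[j] = [(t, c if c != rvar else sval, v if v != rvar else sval)
                          for (t, c, v) in receiver[1:]]
                out.append([p for p in new if p])  # drop terminated processes
    return out

# x̄⟨h⟩.z(a).0 | x(y).z̄⟨y⟩.0 | x(y).0 — two processes can receive on x
prog = [[("send", "x", "h"), ("recv", "z", "a")],
        [("recv", "x", "y"), ("send", "z", "y")],
        [("recv", "x", "y")]]
for reduct in steps(prog):
    print(reduct)  # two distinct reducts, one per possible receiver
```

The two printed reducts correspond exactly to the two reduction sequences shown earlier for this program; neither has priority over the other.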

Consider our previous example:

\[ !x(a).0 \mid \bar{x}\langle b \rangle.0 \mid \bar{x}\langle c \rangle.0 \mid \bar{x}\langle d \rangle.0 \]

We may now define one possible sequence formally, with structural congruence and reduction:

\[ !x(a).0 \mid \bar{x}\langle b \rangle.0 \mid \bar{x}\langle c \rangle.0 \mid \bar{x}\langle d \rangle.0 \]\[ \equiv\; x(a).0 \mid !x(a).0 \mid \bar{x}\langle b \rangle.0 \mid \bar{x}\langle c \rangle.0 \mid \bar{x}\langle d \rangle.0 \qquad \textbf{(C_Replication)} \]\[ \equiv\; \bar{x}\langle c \rangle.0 \mid x(a).0 \mid !x(a).0 \mid \bar{x}\langle b \rangle.0 \mid \bar{x}\langle d \rangle.0 \qquad \textbf{(C_Order)} \]\[ \to\; 0 \mid 0 \mid !x(a).0 \mid \bar{x}\langle b \rangle.0 \mid \bar{x}\langle d \rangle.0 \qquad \textbf{(Message)} \]\[ \equiv\; !x(a).0 \mid \bar{x}\langle b \rangle.0 \mid \bar{x}\langle d \rangle.0 \qquad \textbf{(C_Termination)} \]\[ \equiv\; x(a).0 \mid !x(a).0 \mid \bar{x}\langle b \rangle.0 \mid \bar{x}\langle d \rangle.0 \qquad \textbf{(C_Replication)} \]\[ \equiv\; \bar{x}\langle b \rangle.0 \mid x(a).0 \mid !x(a).0 \mid \bar{x}\langle d \rangle.0 \qquad \textbf{(C_Order)} \]\[ \to\; 0 \mid 0 \mid !x(a).0 \mid \bar{x}\langle d \rangle.0 \qquad \textbf{(Message)} \]\[ \equiv\; !x(a).0 \mid \bar{x}\langle d \rangle.0 \qquad \textbf{(C_Termination)} \]\[ \equiv\; x(a).0 \mid !x(a).0 \mid \bar{x}\langle d \rangle.0 \qquad \textbf{(C_Replication)} \]\[ \equiv\; \bar{x}\langle d \rangle.0 \mid x(a).0 \mid !x(a).0 \qquad \textbf{(C_Order)} \]\[ \to\; 0 \mid 0 \mid !x(a).0 \qquad \textbf{(Message)} \]\[ \equiv\; !x(a).0 \qquad \textbf{(C_Termination)} \]

The Use of \( \pi \)-Calculus

In concurrent programming, most problems can be simplified to happens-before relationships. That is, with multiple processes able to perform tasks concurrently, you want to guarantee that some task happens before some other task. Concurrent systems are modeled in terms of \( \pi \)-calculus to prove these kinds of happens-before relationships.

For instance, let’s say we want to verify that a given program always sends a message on channel \( x \) before sending a message on channel \( y \). Here is a program that fails to guarantee such a relationship:

\[ \bar{a}\langle x \rangle.\bar{b}\langle y \rangle.0 \mid a(m).\bar{m}\langle h \rangle.0 \mid b(n).\bar{n}\langle h \rangle.0 \mid x(q).0 \mid y(q).0 \]

We can demonstrate this by showing a reduction that sends on \( y \) before sending on \( x \):

\[ \to\; \bar{b}\langle y \rangle.0 \mid \bar{x}\langle h \rangle.0 \mid b(n).\bar{n}\langle h \rangle.0 \mid x(q).0 \mid y(q).0 \]\[ \equiv\; \bar{b}\langle y \rangle.0 \mid b(n).\bar{n}\langle h \rangle.0 \mid \bar{x}\langle h \rangle.0 \mid x(q).0 \mid y(q).0 \]\[ \to\; 0 \mid \bar{y}\langle h \rangle.0 \mid \bar{x}\langle h \rangle.0 \mid x(q).0 \mid y(q).0 \]\[ \equiv\; \bar{y}\langle h \rangle.0 \mid y(q).0 \mid \bar{x}\langle h \rangle.0 \mid x(q).0 \]\[ \to\; 0 \mid 0 \mid \bar{x}\langle h \rangle.0 \mid x(q).0 \qquad \text{(Premise violated)} \]

Proving that happens-before relationships hold is, of course, far more complicated, since it is impossible to enumerate the infinitely many possible structurally congruent programs. Luckily, C_Replication is the only case that can introduce infinite reducible programs, and the difference between them is uninteresting (only how many times the replicated subprogram has been expanded). So, in many cases, it is possible to enumerate all interesting reductions. If the program can reduce forever, then it is instead necessary to use inductive proofs for most interesting properties.
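For replication-free programs, that enumeration can be automated. The following Python sketch (same caveats as before: my own encoding, with a naive substitution that ignores restriction) enumerates every maximal trace of channels on which messages are exchanged, and finds an execution of the program above that sends on \( y \) before \( x \):

```python
def reducts(program):
    """Yield (channel, successor) for every possible Message step."""
    for i, sender in enumerate(program):
        for j, receiver in enumerate(program):
            if i == j or not sender or not receiver:
                continue
            stag, schan, sval = sender[0]
            rtag, rchan, rvar = receiver[0]
            if stag == "send" and rtag == "recv" and schan == rchan:
                new = list(program)
                new[i] = sender[1:]
                new[j] = [(t, c if c != rvar else sval, v if v != rvar else sval)
                          for (t, c, v) in receiver[1:]]  # naive substitution
                yield schan, [p for p in new if p]

def traces(program, prefix=()):
    """Every maximal sequence of channels on which messages are exchanged."""
    successors = list(reducts(program))
    if not successors:
        return [prefix]
    out = []
    for chan, nxt in successors:
        out += traces(nxt, prefix + (chan,))
    return out

# ā⟨x⟩.b̄⟨y⟩.0 | a(m).m̄⟨h⟩.0 | b(n).n̄⟨h⟩.0 | x(q).0 | y(q).0
prog = [[("send", "a", "x"), ("send", "b", "y")],
        [("recv", "a", "m"), ("send", "m", "h")],
        [("recv", "b", "n"), ("send", "n", "h")],
        [("recv", "x", "q")],
        [("recv", "y", "q")]]
violations = [t for t in traces(prog) if t.index("y") < t.index("x")]
print(violations)  # nonempty: some execution sends on y before x
```

This exhaustive search is exactly the enumeration described above, and it terminates only because the program is replication-free: each step strictly shrinks the program, so the tree of traces is finite.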

Generally, \( \pi \)-calculus is extended with other features to represent the actual computation that each process performs, rather than performing computation through message passing. For instance, \( \lambda \)-calculus and \( \pi \)-calculus can be overlain directly by allowing processes which contain \( \lambda \)-applications to proceed as in the \( \lambda \)-calculus, while process pairs containing a matching send and receive can proceed as in \( \pi \)-calculus. Such combinations are often used for proving type soundness of concurrent languages.

Exemplar: Erlang

In \( \pi \)-calculus, we have found a formal semantics for message-passing concurrency. Although there are many programming languages with support for concurrency, and even many programming languages with support for message-passing concurrency, there is a stand-out example which is to message-passing concurrency as Smalltalk is to object orientation: Erlang.

Erlang is a language built on the principle that “everything is a process”. It was created in the late 1980s at Ericsson by Joe Armstrong, Robert Virding, and Mike Williams, to manage telecom systems. There were three primary goals in that context:

  • that the system scale from single systems (where many processes would run on one computer) to distributed systems (where processes could be distributed across many computers) with little or no rewriting,
  • that processes would be sufficiently isolated that faults in one process would not (necessarily) affect the rest of the system, and
  • that individual processes could be replaced live in a running system, allowing for smooth upgrades without any downtime.

These goals led Erlang to a quite extreme design, whereby Erlang programs use processes in the same way as Smalltalk programs use objects. Nearly all compound data is bound in processes, and one interacts with processes by sending and receiving messages. Just like in \( \pi \)-calculus, one can create processes and send channels in messages, allowing sophisticated interactions.

Unlike Smalltalk’s objects, however, Erlang does support some primitive data types which are not processes. Integers, floating point numbers, tuples, lists, key-value maps, and Prolog-like atoms are all supported, and ports—Erlang’s name for one end of a communication channel—are not themselves processes. So, it’s not quite true that everything is a process, but nearly everything is. In fact, it’s perfectly possible to treat Erlang as a mostly-pure functional language and write totally non-concurrent code. However, we’ll look only at its concurrency features.

Erlang has its own unique syntax. Like Prolog, it names variables with capital letters and atoms and functions with lower-case letters, but its behavior is otherwise more similar to a functional language than a logic language.

We will take only a very cursory glance at Erlang, to discuss how processes and concurrency can be used to build more familiar data structures.

Modules

Erlang divides code into modules. An Erlang file is a module, and must start with a declaration of the module’s name. For instance, a module named sorter begins as follows:

-module(sorter).

Most modules define some public functions and some private functions. Any functions which should be usable from other modules must be exported. For instance, if we define a function merge taking two arguments, we make that function visible like so:

-export([merge/2]).

Note that functions in this context are named with their arity, in this case 2, in the same fashion as Prolog, so merge/2 is a function named merge which takes two arguments.

Functions in the same module can be called with only their name:

merge([1, 2, 3], [2, 2, 4])

Functions in other modules need to be prefixed with the target module:

sorter:merge([1, 2, 3], [2, 2, 4])

Processes

A process is created in Erlang with the built-in spawn function. spawn is called with a module and function name, and the arguments for that function, and the newly created process starts running that function. spawn returns a process reference, which can then be used to communicate with the process.

For instance, the following function spawns two processes, passing the process reference of the first to the second. The first process runs the pong function in the pingpong module with no arguments, and the second runs the ping function in the pingpong module with the arguments 5 and the reference to the pong process:

start() ->
    Pong = spawn(pingpong, pong, []),
    spawn(pingpong, ping, [5, Pong]).

Generally, functions perform a list of comma-separated actions like this.

Messages are sent in Erlang with the ! operator, as Target ! Message. The message can be any Erlang value, but in practice, it is either an atom or a tuple in which the first element is an atom. The atom specifies the kind of message, and any arguments that the message has fill the rest of the tuple. In our case, the “pong” process does not have a reference to the “ping” process, but the “ping” process does have a reference to the “pong” process, so the “ping” message will need to send a reference along in order for the “pong” process to know how to reply. In \( \pi \)-calculus terms, “ping” must send the channel on which “pong” is to reply.

Messages are received in Erlang with a receive expression, which resembles a pattern match, in that it matches the shape of the message received. For a message to be successfully sent, the target process must be running a receive with a matching pattern, in the same way that for a \( \pi \)-calculus program to make progress, a sending process must have a matching receiving process. Because receive matches particular shapes of messages, a process can receive messages in any order, but process them in the order it chooses, simply by performing receives in sequence that match only the kinds of messages it wishes to process.

Now, let’s write the ping and pong functions. ping will send the given number of “ping” messages to the given process, and expect an equal number of “pong” messages in response. pong will expect a sequence of “ping” messages, and send a “pong” to each.

ping(0, _) -> io:format("Ping finished~n", []);
ping(N, Pong) ->
    Pong ! {ping, self()},
    receive
        pong -> io:format("Pong received~n", [])
    end,
    ping(N - 1, Pong).

Like in Haskell, functions can be declared in multiple parts with implicit patterns. In this case, the ping function simply outputs “Ping finished” to standard out and terminates if the first argument (the number of times to ping) is 0. If N is not 0, it sends a message to the Pong process, and then awaits a pong message back. The sent message is a tuple containing the atom ping, to indicate that this is a ping message, and a reference to the current process, obtained with the built-in self function. Once a ping has been sent and a pong received, it prints “Pong received”, and then recurses with one fewer ping left to send.

Now, let’s write pong:

pong() ->
    receive
        {ping, Ping} ->
            Ping ! pong,
            io:format("Ping received~n", [])
    end,
    pong().

Where ping starts with a send, pong instead starts with a receive. Once pong has received a message matching the pattern {ping, Ping} (remember, Ping is a variable because it starts with a capital letter), it sends a pong message back (pong is an atom in this case, not the function), and then prints “Ping received”. We’ve intentionally written this with the print after the send, to demonstrate concurrency.

If the start function we wrote above is run, one possible output is:

Ping received
Pong received
Ping received
Pong received
Ping received
Pong received
Ping received
Pong received
Ping received
Pong received
Ping finished

However, because the pong process sends its pong before printing that the ping was received, other orders are possible, such as this one:

Pong received
Ping received
Ping received
Pong received
Ping received
Pong received
Pong received
Ping received
Pong received
Ping finished
Ping received

Exercise 2. What orders are not possible?

In this example, the pong process never actually terminates: only ping knew how many times to ping, so pong is left waiting endlessly for another ping that will never arrive. For cleanliness, we could instead add a terminate message like so:

ping(0, Pong) ->
    io:format("Ping finished~n", []),
    Pong ! terminate;
ping(N, Pong) ->
    Pong ! {ping, self()},
    receive
        pong -> io:format("Pong received~n", [])
    end,
    ping(N - 1, Pong).

pong() ->
    receive
        terminate ->
            io:format("Pong finished~n", []);
        {ping, Ping} ->
            Ping ! pong,
            io:format("Ping received~n", []),
            pong()
    end.

Since pong does not recurse if terminate is received, it instead simply ends, terminating the process.

Processes as References

Erlang does not have mutable variables. But, surprisingly, they can be built with nothing but processes!

When representing mutable data in functional languages, we needed a way to put that data aside, separate from the program, in the heap (\( \Sigma \)). But, concurrent processes are already “aside” and separate from one another, so all we actually need is a way for a process to store a piece of data, like a single mapping in the heap, similarly to Haskell’s monads.

To achieve this, we will make a module which exports three functions: ref/1, get/1, and put/2. The ref function will generate a reference, like OCaml’s ref. The get function will retrieve the value stored in a reference, like OCaml’s !. The put function will store a value in a reference, like OCaml’s :=. The actual value used for the reference will be a process reference, to a process carefully designed to work this way.

The complete solution follows:

-module(refs).
-export([ref/1, refproc/1, get/1, put/2]).

ref(V) ->
    spawn(refs, refproc, [V]).

refproc(V) ->
    receive
        {get, Return} ->
            Return ! {refval, V},
            refproc(V);
        {put, Return, NV} ->
            Return ! {refval, NV},
            refproc(NV)
    end.

get(Ref) ->
    Ref ! {get, self()},
    receive
        {refval, V} -> V
    end.

put(Ref, V) ->
    Ref ! {put, self(), V},
    receive
        {refval, _} -> Ref
    end.

The ref function spawns a new process, returning the process reference. The new process represents the reference, and runs the function refproc, with the initial value as its argument. The get function sends a get message to the given process (which must be a reference process for this to work), and expects a refval message in response with the value stored in the reference. The put function sends a put message to the given process, containing the value to put in the reference, and also waits for a refval message. put doesn’t actually care about the value returned by refval; it’s only used to make sure that the message has been received and acted on before the current process continues.

The refproc function contains all of the interesting behavior, as it is the function used by the actual reference process. refproc must be exported because of how spawn works—there are ways to get around “polluting” the exported names in this way, but they’re not important for our purposes. refproc’s behavior is quite similar regardless of whether it receives a get message or a put message: it returns a value and then recursively calls refproc again. The difference is in which value. The value stored in the reference is in the (immutable) V variable. With get, it returns that value, and then recurses with the same value. With put, it instead expects a new value, NV, and returns and recurses with NV instead of V. In this way, although no variables are mutable, the reference itself is, since if it receives a put message, then it will respond to future get messages with the new value, until another put is received.

In shared-memory concurrency, two tasks may access the same mutable memory, and each may mutate it. With these references, we have in fact implemented shared-memory concurrency on top of message-passing concurrency: if two processes each have a reference to the reference process, then either may mutate its value, and both can see the other’s mutations. The fact that it is then extremely difficult to guarantee that the processes mutate things in the correct order is the usual argument for using message-passing concurrency instead of shared-memory concurrency in the first place, but this form of references demonstrates that neither is more powerful than the other.

Processes as Objects

If you’re accustomed to object-oriented programming, you’ve probably noticed that references from the reference module above act a lot like objects with a get and put method. Indeed, we can extend this metaphor to implement objects using only processes and immutable variables. For instance, this module implements a reverse polish notation calculator object very similar to the one written for the Smalltalk segment of Module 1:

-module(rpncalc).
-export([newrpn/0, rpn/1, push/2, binary/2, add/1, sub/1, mul/1, divide/1]).

newrpn() ->
    spawn(rpncalc, rpn, [[]]).

rpn(Stack) ->
    receive
        {push, Return, V} ->
            Return ! {rpnval, V},
            rpn([V | Stack]);
        {op, Return, F} ->
            rpnop(Stack, F, Return)
    end.

rpnop([R, L | Rest], F, Return) ->
    V = F(L, R),
    Return ! {rpnval, V},
    rpn([V | Rest]).

push(RPN, V) ->
    RPN ! {push, self(), V},
    receive
        {rpnval, _} -> V
    end.

binary(RPN, F) ->
    RPN ! {op, self(), F},
    receive
        {rpnval, V} -> V
    end.

add(RPN) ->
    binary(RPN, fun(L, R) -> L + R end).

sub(RPN) ->
    binary(RPN, fun(L, R) -> L - R end).

mul(RPN) ->
    binary(RPN, fun(L, R) -> L * R end).

divide(RPN) ->
    binary(RPN, fun(L, R) -> L / R end).

An RPN’s sole field, the stack, is represented by the Stack variable of the rpn function, which is the function run by an RPN process. The push and binary functions send an RPN process messages corresponding to one of its two supported “methods”: push and op. The op message carries a function, representing the binary operation to perform, so like in the Smalltalk version, specific operations can be implemented in terms of it.

Again, there are only immutable variables and processes, but the fact that sending and receiving messages establishes a sequence allows us to emulate more sophisticated features. Erlang has several libraries implementing more elegant object orientation, but still using processes as objects. Erlang is designed for programs to have thousands of processes, so it’s common to mix these styles as well; for instance, fields can be implemented as references which are in turn implemented as processes as in the previous section.

Implementation

Operating systems implement processes and threads: in OS terms, two processes do not share memory, but two threads within the same process do share memory. Erlang uses the term “process” because Erlang’s processes do not share memory. However, using operating system processes to implement Erlang processes would result in catastrophically poor performance! Indeed, even using operating system threads to implement Erlang processes would be similarly fraught. The reason is simply that switching between threads or processes in an operating system—that is, context switching—is an expensive operation.

Instead, highly-concurrent languages such as Erlang use so-called green threads. Basically, the Erlang interpreter must implement its own form of thread switching, and maintain the stack for each thread as a native data structure in the host language. When an Erlang process executes a receive and it does not immediately have a matching message available to act upon, Erlang instead sets aside the thread for that process and loads a stack for another process; in effect, it performs the same kind of context switch that an operating system performs, but with much less context to switch. It runs as many operating system threads as there are CPU cores available, but each one can switch between many green threads. It is not uncommon for an Erlang program to have tens of thousands of processes, so keeping green threads light is extremely important.

To know which processes are available to run, an Erlang implementation must also be able to pattern match very quickly. Typically, a process that is waiting for a message has its pattern stored, and when another process sends it a message, it performs a pattern match immediately, to determine if the other process can be awoken.

The other major implementation roadblock to message-passing concurrency is the actual message passing. In languages with mutable values, it is necessary to copy the message being passed, so that two processes cannot see the same mutable memory (which would be shared-memory concurrency, not message-passing concurrency). Erlang largely sidesteps this issue by being fundamentally immutable, so that it’s harmless to pass around values in any form. Two processes can share a pointer to a value if neither will actually mutate that value.

Miscellany

Erlang does actually support some (very limited) mutable data structures, but they may not be sent in a message.

The message-reply style used in all of our examples was fairly brittle, in that we used a specific atom as the expected reply, but there’s nothing to stop another process from sending the same atom. Erlang supports creating unique values, called “references” (to create needless confusion), with the built-in make_ref function. Usually, two processes which communicate would exchange such unique references to make sure that they’re receiving messages from the process they thought they were speaking to.

Most Erlang programs are written in a so-called “let it crash” style. That is, instead of trying to anticipate all forms of errors, code is written to simply re-spawn processes that fail unexpectedly. Since processes are mostly independent of each other, large systems can operate even with major bugs. In the Erlang shell, you can re-compile modules, and swap out processes using the old module for processes using the new version, and it is thus often possible to fix bugs in a running system with no downtime. Many Erlang proponents cite this style as the major advantage of Erlang.

Fin

In the next (and final) module, we will very briefly look at how our mathematical model of programming languages interfaces with the real world, through models of systems programming.

References

[1] Robin Milner, Joachim Parrow, and David Walker. A calculus of mobile processes, I. Information and computation, 100(1):1–40, 1992.

Module 10: Systems Programming

“Give a man a program, frustrate him for a day. Teach a man to program, frustrate him for a lifetime.” — Muhammad Waseem

Aside: Module 10 is not assessed anywhere in CS442. This module is quite brief, and more of an introduction to the problems of systems programming than the solutions. At the very least, read the final section.

Systems Programming

This module focuses on systems programming. However, it is an outlier, in that we won’t formally model anything. Instead, we will briefly overview how a real system, CompCert, did so. The goal of this module is to help you think about concrete applications of everything we’ve done in previous modules, but it will not go into much depth about any of those applications. In essence, we use CompCert as both an exemplar of systems programming and as a success story for using formal semantics to solve real problems, and then look at some other theoretical computer science problems through the lens of systems programming. This module is, thus, more of a discussion than a lesson.

The Trouble with Systems Programming

Through this entire course, we’ve focused on formal semantics for reasoning about programming languages. But, whenever we try to relate those formal systems to reality, cracks start to form. What does it mean to get stuck? Just how much junk can I put in \(\sigma\) and \(\Sigma\)? What are labels? What happens with integer overflow, or division by zero?

Usually, formal semantics are used as a tool to prove particular properties. For instance, we may want to prove type safety, or prove that certain programs will always halt. To accomplish this, we eliminate irrelevant language features and focus on just those that interest us. A real language implementation has to handle type errors, but we only have to prove whether they can occur or not, so we don’t need to model anything more sophisticated than getting stuck.

As a consequence, formal semantics usually stops well short of systems programming.

“Systems” is a very broad term, and “systems programming” is barely definable as a paradigm. In essence, a language is a systems programming language if it lets us interact with the hardware on a fairly low level. This is ill-defined, because exactly what “low-level interaction” is is unclear, and anyway, most languages could be given low-level access through libraries. And, of course, such access can be very different on different kinds of machines, and one language may be suitable for one kind of machine but not another.

Generally, we restrict the term to languages with a long history of being used to make core systems components, such as operating system kernels, memory managers, and certain language implementations which interact with machine code as both code and data. With such a narrow view, we can reduce the range of potential languages to be considered “systems languages” to a handful: Fortran, Pascal, C, C++, and the assembly languages. There are some rare exceptions where other languages are used in very special ways, and one upstart which may eventually be added to the list (Rust), but otherwise, systems programming is nearly as sparse a programming paradigm as logic programming.

Our exemplar is C, and more specifically a (slightly) reduced version of C implemented by an unusual compiler, CompCert. Indeed, CompCert C itself is our exemplar; CompCert is a real compiler, but it is also a formally-defined programming language! It is assumed that you’re already familiar with C, so we won’t do our usual introduction to the language syntax.

The minimum unifying feature among all systems programming languages is pointers. We certainly cannot write an operating system kernel or a memory manager without first-class memory addresses. And, that’s where systems languages come into conflict with formally modeling languages as mathematical logic: we have never bothered to actually model memory.

Modeling memory is hard. To understand why, consider this C program:

int foo(int x) {
    return (&x)[-1];
}

This brutish code is syntactically valid C. In this example it’s impossible to know if foo actually does anything useful, but similar “out of bounds” behavior isn’t at all rare in systems programming. If you’ve taken CS350 or an equivalent operating systems course, you’ve undoubtedly written similarly perplexing code. In fact, this code is fairly mundane, since it only reads a seemingly out-of-bounds value. We can cause much more mischief like so:

int foo(int x) {
    (&x)[-1] = 42;
    return (&x)[-1];
}

In these examples, I’ve also kept things fairly mundane by keeping to data pointers, but certain language implementations routinely cast data pointers to function pointers. How is formal semantics to cope?

A simple answer to this problem is “it’s undefined behavior, so we don’t have to bother with it”. And, if your definition of C is the C specification, you’re even right: the C specification leaves the behavior of the above program undefined. But, “undefined behavior” actually has a specific meaning:

undefined behavior: behavior, upon use of a nonportable or erroneous program construct or of erroneous data, for which this document imposes no requirements. Note 1 to entry: Possible undefined behavior ranges from ignoring the situation completely with unpredictable results, to behaving during translation or program execution in a documented manner characteristic of the environment (with or without the issuance of a diagnostic message), to terminating a translation or execution (with the issuance of a diagnostic message). — ISO/IEC 9899:2018 (C17 standard)

In particular, note the phrase “in a documented manner”: one correct interpretation of undefined behavior is… documented, and therefore defined, behavior. In fact, undefined behavior as defined by the C standard is simply behavior which is not defined by the C standard. In a real system, there are several standards sitting above C. For instance, if you’re on a Unix-like system, on the AMD64 architecture, then the way memory is laid out, and to a certain extent the way that system calls are performed and the executable format, are defined by the System V Application Binary Interface AMD64 Architecture Processor Supplement. With knowledge of how memory is laid out, the correct behavior of the above program can actually be defined. Unfortunately, the above example is still undefined because the memory it accesses is on the stack, the arrangement of which is left to the compiler to define. But, you guessed it, the compiler has a specification too. And, of course, the processor manufacturers publish specifications for their processors. With an exhaustive list of specifications, there is no such thing as undefined behavior. All programs’ behaviors are well-defined, although they may still be non-deterministic or dependent on external factors.

This leaves us with only two problems:

  1. All of these specifications are informal, and don’t leave room to prove anything about systems programming.
  2. The implementation of any of the myriad systems between your program and physical reality could have bugs.

The second problem can be solved by the first: if we can prove that every layer is correct, by some definition of “correct”, and we do so in a way that involves modeling every layer formally, then we can formally reason about the behavior of real programs, and not just theoretical languages with theoretical programs.

Aside: Of course, at the lowest level, the actual physical implementation of a processor comes down to physics, not math. There is always some layer at which we must leave the comforting (?) embrace of formal proof, so the goal of formal systems programming is to minimize the number of uncertain layers, not to reduce the number to zero.

Formally specifying every layer is a monumental task, but not an impossible one. As this is a language course, we will focus on everything from the programming language down to the CPU (i.e., not the operating system, and not the physical implementation of the CPU). Enter CompCert.

Exemplar: CompCert C

CompCert is a C compiler, with an unusual claim to fame: the full compilation process is formally proved correct. Precisely, to use terminology from Formal Verification of a Realistic Compiler (Leroy, 2009), this means that if we define \(S\) as a source program and \(C\) as a compiled program for some specific pair of languages and a compiler between them, and we define \(X \Downarrow B\) to mean “executing the program \(X\) has observable behavior \(B\)”, then \(S \Downarrow B \Rightarrow C \Downarrow B\) for all source programs \(S\) which are “safe”. That is, if \(S\) compiles to \(C\) and \(S\) has behavior \(B\), then \(C\) has behavior \(B\).

To define this, we need a formal definition for both the language of \(S\) and the language of \(C\), and a formal definition of compilation from one to the other. At the extremes, the language of \(S\) is C (confusingly), and the language of \(C\) is machine code for some architecture. In practice, however, C isn’t compiled directly to machine code, but through several compilation phases, and so there are many steps \(C_1, C_2, \ldots, C_n\), eventually reaching machine code, and each of these compilation phases must be proved.

We have a lot of experience formally defining languages in terms of their behavior, but not a lot of experience formally defining compilation. In fact, all the mechanisms are the same, just used for a different purpose: we formally define a compiler by defining a function \(\text{Comp}\), where \(\text{Comp}(S) = \text{OK}(C)\), or \(\text{Comp}(S) = \text{Error}\) if the program cannot compile. That function can be described through (many) formal rules for each specific case, precisely like the \(\to\) morphism.

Thus, to formally prove a compiler, we must do the following:

  • Formally define the semantics of a source language, e.g. C.
  • Formally define the semantics of a target language, e.g. machine code.
  • Formally define the process of compiling, \(\text{Comp}\).
  • Prove that, for any program \(S\) in the source language, the behavior of \(\text{Comp}(S)\) as defined by the formal semantics of the target language is the same as the behavior of \(S\) as defined by the formal semantics of the source language.

It is impractical (although not impossible) to write out the formal semantics for a language as large as C, with many non-local effects, without some tool assistance.

Recall that logic programming languages allow us to define relations. The \(\to\) morphism is one kind of relation, so we could represent a language semantics in a logic programming language, by defining the clauses for this morphism. If done carefully, a logic programming language could thereby “run” a program in another language, by querying the relation which describes the language’s small-step semantics.

In practice, Prolog is not up to the task, because the range of predicates it can prove is quite limited. Instead, CompCert’s semantics are implemented in Coq, a theorem-proving language with a much more sophisticated theorem-proving algorithm than Prolog’s, which allows for a much richer language of predicates, including predicates which are defined themselves as functional programs. The upside of Coq over Prolog is that much more can be proved; the downsides are (a) that while Prolog’s algorithm is straightforward and easy to define, so that many independent interpreters can implement Prolog, Coq’s algorithm is defined by a particular implementation and is subject to change, (b) that Coq’s algorithm has unpredictable time bounds, and (c) that almost no anglophone can say “Coq” without snickering.

While we will come nowhere near sharing the entire semantics for C, or converting them into more familiar Post syntax, here is one small example, the (partial) semantics for C addition, as defined by CompCert in Coq, to give you an idea of what Coq definitions look like:

Definition sem_add (cenv: composite_env) (v1:val) (t1:type) (v2: val) (t2:type) (m: mem): option val :=
  match classify_add t1 t2 with
  | add_case_pi ty si =>  (* pointer plus integer *)
      sem_add_ptr_int cenv ty si v1 v2
  | add_case_pl ty =>     (* pointer plus long *)
      sem_add_ptr_long cenv ty v1 v2
  | add_case_ip si ty =>  (* integer plus pointer *)
      sem_add_ptr_int cenv ty si v2 v1
  | add_case_lp ty =>     (* long plus pointer *)
      sem_add_ptr_long cenv ty v2 v1
  | add_default =>
      sem_binarith
        (fun sg n1 n2 => Some(Vint(Int.add n1 n2)))
        (fun sg n1 n2 => Some(Vlong(Int64.add n1 n2)))
        (fun n1 n2 => Some(Vfloat(Float.add n1 n2)))
        (fun n1 n2 => Some(Vsingle(Float32.add n1 n2)))
        v1 t1 v2 t2 m
  end.

These semantics are queryable, like Prolog relations, so by defining the semantics, CompCert naturally includes an (absurdly slow) interpreter for C as well.

CompCert includes a semantics for C, for its various intermediate languages, for an idealized but non-specific CPU architecture called “Mach”, and for several CPU architectures: PowerPC, ARM, x86, AMD-64 (x86_64), and RISC-V. Of course, it is always possible that some bug was introduced in the transcription of these languages from their informal specification to the formal specification, or that Coq itself has bugs; such bugs can only be dealt with in the usual way, through years of use and bug hunting.

Proving that an individual compilation step is correct involves proving \(S \Downarrow B \Rightarrow C \Downarrow B\) for every structure in the language \(S\), inductively for subexpressions.

Memory Models Redux

CompCert necessarily models C’s pointer-based memory, defining steps for allocating memory, freeing memory, and reading and writing of memory by pointers. The goal in this context is to verify the compiler, and so to verify that C performs the same interaction with memory as \(S\). This model is sufficient for CompCert’s goals, but only because CompCert is intended for non-concurrent programs.

Consider the following C snippet:

int shared = 0;

void *foo(void *ign) {
    for (int i = 0; i < 1024; i++) {
        shared++;
    }
    return NULL;
}

The foo function simply increments shared 1024 times.

C’s implementation of concurrency is shared-memory concurrency. So, if two concurrent threads of execution both run foo, they can both read and write to shared at the same time. If foo is run twice concurrently, what can be the final value of shared?

The answer is that it depends on how the particular CPU implements memory. An int is, on all modern systems, 32 bits. Are those 32 bits written all at once? In order from most significant to least significant? One byte at a time? The same questions apply to reading. And, of course, incrementing may look like one operation, but it’s actually both a read and a write.

In fact, it’s even worse than it sounds. On a modern, multi-core system, if a thread happens to be moved from one core to another, it’s possible for shared to end up with a value less than 1024, as if neither instance of foo ran to completion! This is because not all memory accesses go to real memory: there is at least a hierarchy of caches, and caches are only guaranteed to be consistent for a single thread of execution.

Every CPU that allows true parallelism also provides a mechanism (or several mechanisms) for guaranteeing the order of two threads’ memory accesses; the most familiar is locking. But lock-free concurrent algorithms also exist. Writing a lock-free concurrent algorithm requires a deep understanding of exactly which memory orderings are possible in a concurrent program, and that is CPU-specific.

Thus, an array of memory models exists, each describing the behavior of particular hardware. Broadly, they can be categorized into strongly consistent models and weakly consistent models, based on whether memory accesses across all concurrent threads appear to happen in a single consistent order, or only in a consistent order within each thread. Proving that a particular algorithm works with a particular memory model is of similar complexity to CompCert; creating a single lock-free algorithm can be a Ph.D.’s worth of work.

Unfortunately, there is a trend for languages to simply give up on memory models in their own specifications. C’s specification only defines the behavior of programs that use locks to guarantee ordering. Java’s specification assumes strong consistency, so Java implementations must somehow force strong consistency on weakly consistent architectures.

Software Verification

“It’s not a systems language if I can’t implement a garbage collector in it.” — Gregor Richards

CompCert’s goal is to prove the correctness of a compiler, not of the programs it compiles.

Consider, for instance, this snippet of code from a garbage collector, simplified slightly:

(struct GGGGC_Pool *) ((size_t) (ptr) & ((size_t) -1 << 24))

In short, this takes an address (ptr), masks off its low 24 bits with -1 << 24, and casts the resulting number to a different pointer type. Strange as it looks, this code is fairly typical of a memory manager.

Consider trying to type-check this code. The type of every expression is clear enough, but could a type-checker prove that the computed address actually points to a struct GGGGC_Pool? Although some work has been done on type systems that are sufficient to verify some properties of a memory manager, most type systems are woefully insufficient for interesting systems code.

In fact, type safety is a very weak form of program correctness, and there is a much wider range of correctness properties one might want to prove about a program. Usually, these are specific to a particular piece of software or kind of software; for instance, we would want to prove that the above snippet always yields a pointer to a struct GGGGC_Pool. More practically, we may want to prove that an elevator is never inoperable when someone is inside, or that a radiation therapy machine never exposes a patient to a dose of radiation above its specification.

This brings us to the area of software model checking. Software model checking is (one method for) the verification of particular properties of a piece of software. One can consider type checking to be one very limited form of model checking, and CompCert’s proofs of correctness as well, although the model checking community usually defines the term more restrictively.

Programs in safety-critical systems, such as aviation and medical devices, usually use (or at least ought to use!) some combination of software model checking to verify the correctness of their original code, and verified compilation to verify that that code is correctly compiled. Increasingly, international standards bodies actually require this.

Just Use Better Languages!

One may reasonably ask why so much focus is put on the verification of systems languages. Surely, if programmers would just use better languages in the first place, these problems wouldn’t arise!

There is a degree of truth to this, and it is an unsurprising conceit of many programming language researchers to believe that all software problems can be solved through better languages, but what research there is in the area just doesn’t support the idea. Programmers make mistakes with and without type systems, with and without mutation, with and without concurrency.

I have no optimistic conclusion to this topic. Perhaps the lesson to learn from successes such as CompCert is not in using better languages, but in using the right language for the job.

Fin

This concludes CS442, in this very weird term.

The goal of this course isn’t to teach you half a dozen programming languages, although it is to teach you two. The goal isn’t even that you use OCaml or Smalltalk after this, although you may want to, of course.

“A language that doesn’t affect the way you think about programming, is not worth knowing.” — Alan Perlis

I hope that exploring the range of programming paradigms has broadened your horizons in the way you think about programming. Just as the goal of this course wasn’t to teach you programming languages, the goal of the assignments wasn’t to prepare you for implementing languages, since most of you won’t. But, being able to think about programming paradigms mechanically allows you to use them as tools in other programming. The power of programming languages is the power of abstraction, and I hope that after having written six language implementations, you feel more comfortable harnessing that power.

References

  1. ISO/IEC 9899:2018: Information technology — Programming languages — C. Standard, International Organization for Standardization, Geneva, Switzerland, June 2018.
  2. Xavier Leroy. Formal verification of a realistic compiler. Communications of the ACM, 52(7):107–115, 2009.
  3. Michael Matz, Jan Hubicka, Andreas Jaeger, and Mark Mitchell. System V application binary interface: AMD64 architecture processor supplement. 2014.

  1. This definition does not consider efficiency, and many — but not all! — language implementations will still require extra computational cost to compute f(3) if it’s used in place of z.↩︎

  2. There are also other fixed-point combinators besides the Y combinator, and there are fixed-point combinators that Haskell can represent, but sensible binding is a better solution, so we won’t discuss them here. ↩︎

  3. OCaml has yet another special syntax for declaring mutually recursive functions like this, but it’s not relevant here. ↩︎

  4. An algorithm described by Milner and Damas is called Hindley-Milner type inference because the world is cruel. It is sometimes, but rarely, instead called Damas–Hindley–Milner type inference. ↩︎
