One of the most useful features of ML-family languages (OCaml,
Standard ML, F#, etc.) is type inference. This post contains the
notes I took while trying to understand the foundational
principles of ML type inference.

Damas-Milner Algorithm for Type Inference

Let us take a look at the original type inference algorithm proposed
by Damas and Milner. Their language is the lambda calculus extended
with a let expression:

e ::= x | λx.e | e e | let x=e in e

The type language is composed of a set of type variables (α) and
primitive types (ι):

τ = α | ι | τ → τ
σ = τ | ∀α.σ

A type substitution [τ1/α1, τ2/α2, ... , τn/αn] replaces type
variables with types. Note the distinction between τᵢ, a type, and
αᵢ, a type variable.

We define a relation (≥) that relates a more polymorphic type to a less
polymorphic type. For eg, ∀α.α→α ≥ ∀β.β list → β list.

The typing judgment is Γ⊢e:σ.

The instantiation rule is also a subtyping rule, where subtyping is
defined in terms of “more polymorphic”:

Γ⊢e:σ₁ σ₁≥σ₂
--------------
Γ⊢e:σ₂

Since there is no explicit big-lambda expression, the generalization
rule is important:
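
Γ⊢e:σ    α not free in Γ
-------------------------
Γ⊢e:∀α.σ

In ML, generalization happens at let bindings. For example, in the
OCaml expression

let id = fun x -> x in (id 1, id true)

the type inferred for id is generalized to ∀α.α→α before the body is
checked, which is what allows id to be applied at both int and bool.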

A few months ago, Cheryl’s birthday puzzle was
an internet phenomenon. If you found the puzzle tricky to solve,
we are in the same boat. However, observe that the puzzle only
requires us to apply simple logic; not number theory or complex
arithmetic, just simple logic. What, then, makes the puzzle tricky?

I attribute its trickiness to the chain of reasoning involved -
the repeated application of simple logic to existing knowledge to make
new deductions until we arrive at the solution. Since the reasoning is
quite involved, it would be great if we could enlist the services of a
computer to help us in the process. But how can a computer assist us
in making deductions? The answer lies in the well-known SAT problem.

Knights and Knaves Puzzle

Consider the following simple instance of the famous Knights and
Knaves class of
puzzles:

There is an island in which certain inhabitants, called knights,
always tell the truth, and others, called knaves, always lie. It is
assumed that every inhabitant of this island is either a knight or
a knave.
Suppose A says, “Either I am a knave or B is a knight.”
What are A and B?

The puzzle is easy to solve mentally. Nonetheless, if you have a 2-SAT
solver handy, you can also let the solver do the work. Here is how:

Let AisKnight and BisKnight be boolean variables, denoting the
propositions “A is a Knight” and “B is a Knight”, respectively. Since
an inhabitant can be either a Knight or a Knave, but not both, the
boolean formulas ¬ AisKnight and ¬ BisKnight denote the propositions
“A is a Knave” and “B is a Knave”, respectively.

Now, consider the statement made by A: “Either I am a knave or B is a
knight”. Symbolically, it is equivalent to: (¬ AisKnight) ∨
BisKnight. Since Knights always speak the truth, and Knaves always lie,
the statement is true if A is a Knight, and false if A is a Knave. In
other words, the statement holds if and only if A is a Knight.
Symbolically, it means:

AisKnight ⇔ (¬ AisKnight) ∨ BisKnight

Observe that the above boolean formula completely describes the
constraints imposed by the puzzle, and does so quite succinctly. If
the above formula is satisfiable, then the constraints imposed by the
puzzle are also satisfiable, which means that the puzzle has a solution.
Furthermore, the satisfying assignment to literals in the formula
(i.e., their truth values) gives us the solution to the puzzle. Getting
those satisfying assignments is what the 2-SAT solver is for. If we
feed the above formula to a 2-SAT solver, we get AisKnight = true and
BisKnight = true as the satisfying assignment, telling us that A and B
are both knights.

Note that the satisfying assignment only tells us that “if A is a knight,
then the puzzle is solved”. Maybe the puzzle is also solved if A is a
knave. How do we check for this possibility? Again, using the solver:
we explicitly encode the constraint that A is a Knave, and ask if our
constraints are satisfiable. The boolean formula is shown below:
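
(AisKnight ⇔ (¬ AisKnight) ∨ BisKnight) ∧ (¬ AisKnight)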

A 2-SAT solver cannot find any satisfying assignment to the above
formula, indicating that our constraints are unsatisfiable. Hence, A
can never be a Knave.
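
To see this end to end, here is a minimal brute-force check in Python -
a stand-in for a real 2-SAT solver, with the two formulas above encoded
as boolean functions:

from itertools import product

# Brute-force stand-in for a 2-SAT solver: try all assignments to
# (AisKnight, BisKnight) and return the first satisfying one, if any.
def solve(formula):
    for a, b in product([False, True], repeat=2):
        if formula(a, b):
            return {"AisKnight": a, "BisKnight": b}
    return None

# A's statement holds iff A is a knight:
puzzle = lambda a, b: a == ((not a) or b)
print(solve(puzzle))                               # both are knights

# Additionally constrain A to be a knave: unsatisfiable.
print(solve(lambda a, b: puzzle(a, b) and not a))  # None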

Now, let’s take a step back and summarize what we have done. We were
given a puzzle that is not difficult, but not obvious either. Instead
of doing the reasoning ourselves, we simply described the puzzle as a
boolean formula, and let a SAT solver do the reasoning for us. In
other words, we have outsourced the reasoning to a solver. This
facility to outsource some or all of the reasoning to a solver can be
quite handy when we are solving logical puzzles that require more
complex reasoning, like Cheryl’s birthday puzzle.

This post contains my notes on Rahul Sharma et al’s ESOP’13 paper: A
Data-Driven Approach for Algebraic Loop Invariants.

The paper proposes a neat approach to inferring algebraic loop
invariants by observing concrete program states, and making use of
techniques from linear algebra to gain insights.

Following is an example taken from the paper:

assume (x=0 && y=0);
writelog (x, y);
while (nondet()) do
  y := y+1;
  x := x+y;
  writelog (x, y);

For the above example, we are interested in finding the loop invariant
for the while loop. Assume we already know that the invariant can be
non-linear, but no more complex than degree 2. Let \(f(x,y)\) denote a
polynomial of degree less than or equal to 2. The equation
\(f(x,y)=0\) is a polynomial equation or algebraic equation in
variables x and y. Since x and y are the only two variables in
scope at the head of the loop, we know that the loop invariant will be
of the form:
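
\(f_1(x,y)=0 \;\wedge\; f_2(x,y)=0 \;\wedge\; \cdots \;\wedge\; f_n(x,y)=0\)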

where each \(f_i\) is a unique polynomial of degree at most 2.
A polynomial of degree 2 is simply a linear combination of monomials
of degree at most 2. In this case, we have 6 unique monomials:
\(\{1,x,y,x^2,y^2,xy\}\). Therefore, each polynomial will be of the
following form:
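
\(f(x,y) \,=\, w_1 + w_2\,x + w_3\,y + w_4\,x^2 + w_5\,y^2 + w_6\,xy\)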

The invariant inference problem then is to infer the rational
constants \(w_1-w_6\) such that \(f(x,y)=0\).

The paper proposes a data-driven approach to solve this problem.
Informally, a loop invariant over-approximates the set of all
program states that are possible at the loop head. For the above
example program, the program state consists of the values of x and y.
We instrument the code to observe the program state (i.e., values of
x and y) at the beginning of every iteration of the loop, when the
program is executed over test inputs (user-provided or random). We
record these observations in a log file.

Next, for every observed program state, we calculate the values of all
6 monomials and record them in a matrix, where a row denotes a program
state (i.e., fixed values for x and y), and a column denotes the
values of a fixed monomial across the observed program states. A sample
datamatrix for the above example is shown below:
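
(Assuming, for illustration, that one run logs the states
\((x,y) = (0,0), (1,1), (3,2), (6,3)\); columns are ordered as
\(1, x, y, x^2, y^2, xy\).)

\[ \begin{pmatrix}
1 & 0 & 0 & 0 & 0 & 0 \\
1 & 1 & 1 & 1 & 1 & 1 \\
1 & 3 & 2 & 9 & 4 & 6 \\
1 & 6 & 3 & 36 & 9 & 18
\end{pmatrix} \]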

Now, any valid solution for \(w_1-w_6\) (and hence for the loop invariant
problem) should explain the observed program states. Therefore, the
following has to hold:
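
\[ A \, W \;=\; 0 \]

where \(A\) is the datamatrix and \(W\) is the column vector of
coefficients \((w_1,\ldots,w_6)^T\).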

Observe that any solution to the above equation gives us an invariant
that holds over observed program states. However, it may not hold over
all possible program states. Therefore, solving above equation gives
us candidate invariants, which we can then verify for validity. But
how do we solve the above equation? This is where techniques from
linear algebra help.

Let us denote the datamatrix with \(A\), and the matrix of
coefficients (\(w_1-w_6\)) with \(W\). The matrix \(W\) has only one
column, and is referred to as a column vector, or simply a vector. The
set of all possible solutions for \(W\) (each a vector) is called the null
space of \(A\). Note that NullSpace(A) is a set of vectors. The span
of any set of vectors \(\{x_1,\ldots,x_n\}\) is the set of all vectors
that can be expressed as a linear combination of \(\{x_1,\ldots,x_n\}\).
In other words, the span of a set of vectors is the entire vector space
represented by those vectors. Observe that the span of NullSpace(A) is
NullSpace(A), as every linear combination of vectors in
NullSpace(A) results in a vector \(x\) that satisfies \(Ax=0\).
Hence, NullSpace(A) is a vector space. A basis of a vector space is
a set \(B\) of linearly independent vectors such that every other
vector in the space can be expressed as a linear combination of
vectors in \(B\). For the 3-D cartesian vector space, the set
\(\{(3,0,0), (0,2,0), (0,0,1)\}\) can be considered a basis. The
cardinality of the basis is the dimension of the vector space, as is
evident from the basis set of 3-D cartesian space.

For the matrix \(A\), let \(B_w\) denote the basis of NullSpace(A).
What does the set \(B_w\) represent? Each vector in the set is a unique
solution for \(W\) that cannot be expressed as a linear combination of
the other solutions. Further, any solution for \(W\) can be expressed
as a linear combination of vectors in \(B_w\). Effectively, this means
that by calculating the basis set of NullSpace(A), we derive all
unique algebraic invariants of degree at most 2 that hold on the observed
program states. The invariants are unique in that any other invariant
can be expressed as a linear combination of the polynomials denoted by
the basis set.
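
As an illustration of the guess step, here is a small Python sketch
using sympy over the hypothetical sample datamatrix from above:

from sympy import Matrix

# Sample datamatrix: rows are the observed states (0,0), (1,1), (3,2),
# (6,3); columns are the monomials 1, x, y, x^2, y^2, x*y.
A = Matrix([
    [1, 0, 0,  0, 0,  0],
    [1, 1, 1,  1, 1,  1],
    [1, 3, 2,  9, 4,  6],
    [1, 6, 3, 36, 9, 18],
])

# Each basis vector w of NullSpace(A) encodes a candidate invariant
# w1 + w2*x + w3*y + w4*x^2 + w5*y^2 + w6*x*y = 0.
for w in A.nullspace():
    print(w.T)

# With only four observations the null space is 2-dimensional, so the
# candidates include a spurious invariant alongside the true one,
# 2x - y - y^2 = 0; counterexamples add rows and shrink this space.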

Once candidate solutions are generated by calculating the basis of the
null space of \(A\), they are verified via symbolic execution and SMT
solvers. A verification failure results in a counterexample, which is
simply a valuation of x and y for which the candidate invariant
fails. In such a case, the loop is rerun with these values of x and y,
and more observations are recorded, thereby growing the datamatrix
\(A\). Invariants generated by calculating the basis of the null space
of the larger datamatrix will now definitely account for the counterexample
generated by the solver in the previous step. This counterexample
guided iterative refinement (which they call \({\sf
Guess-And-Check}\)) is similar in flavour to CEGAR, but there is one
fundamental difference. Recall that CEGAR starts with a formula that
over-approximates program behaviours, and gradually refines the
formula until it is valid, or a bug is found in the program. In
contrast, \({\sf Guess-And-Check}\) starts with an under-approximation
of program behaviours (represented by the datamatrix), and adds more
behaviours until we have a sufficient representation of all possible
program behaviours. In this sense, \({\sf Guess-And-Check}\) is more
similar to CEGIS. The only difference is that, while CEGIS relies on
an SMT oracle to generate positive samples, \({\sf
Guess-And-Check}\) relies on tests.

In practice, the number of possible monomials at the loop head could
be large. The paper uses a heuristic to discard monomials that are
unlikely to be part of the final invariant. The heuristic is that
monomials which grow at a rate much larger than the loop counter \(i\)
(i.e., at a rate \(\omega(i^d)\)) are most often red herrings.

A natural view of the execution of a multi-threaded program is as follows:

exn = []
while (there is an unfinished thread) {
  t = select an unfinished thread;
  instr = first(t);
  exn := exn ++ [instr];
  t := rest(t);
}
return exn;

Observe that exn preserves the program order of every thread. This is
sequential consistency. However, there is still non-determinism on
line 3. When there are shared variables, this is the source of
data-races. For example, consider the following program (assume that
initially x is 0):

[
[STORE (x,3)],
[STORE (x,4); PRINT (LOAD x)]
]

Since there is non-determinism, either of the STOREs can be executed
first. Possible sequentially consistent (SC) executions are:
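
1. STORE(x,3); STORE(x,4); PRINT(LOAD x)
2. STORE(x,4); STORE(x,3); PRINT(LOAD x)
3. STORE(x,4); PRINT(LOAD x); STORE(x,3)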

Notice that STORE(x,3) conflicts with STORE(x,4) and also with
LOAD x, as each pair accesses the same memory location, and STORE
writes to that location. Further, notice that there exists at
least one pair of SC executions where one execution is obtained from
the other by just reordering an adjacent pair of conflicting operations.
Hence the program has a data-race. The above SC executions print 4, 3,
and 4, respectively. Therefore, the set of acceptable outputs of a
sequentially consistent execution (even in the presence of data-races)
is {3,4} (note: the set does not include 0).

Let us assume we have locks in our language. Then, the execution of a
multi-threaded program can be constructed as follows:

exn = []
while (there is an unfinished thread) {
  do {
    t = select an unfinished thread;
  } while (first(t) = LOCK(c) and
           exn = exn' ++ [LOCK(c)] ++ exn'' where
           UNLOCK(c) notin exn'')
  instr = first(t);
  exn := exn ++ [instr];
  t := rest(t);
}
return exn;

Locks help us control the amount of non-determinism in the program.
For example, for the following program:

[
[LOCK(c); STORE (x,3); UNLOCK(c)],
[LOCK(c); STORE (x,4); PRINT (LOAD x); UNLOCK(c)]
]

The only acceptable output of the program now is 4. Further, there are
no adjacent conflicting operations in any SC execution of this program.
So, the program is data-race-free.

A data-race-free model guarantees SC only for data-race-free programs.
A program is DRF if and only if it is impossible to construct an SC
execution with concurrent conflicting memory accesses, where one of
them is a STORE. (When do we say two memory accesses are concurrent?)

Sequential consistency requires that all data operations appear to
have executed atomically in some sequential order that is consistent
with the order seen at every individual process.

If instead of individual data operations, we apply sequential
consistency to transactions, the resultant condition is called
serializability in database theory.

Linearizability imposes more constraints on an order decided by
sequential consistency - the order needs to make sense to an external
observer. Let us say that an external observer is running two threads,
A and B: thread A performs action a0, then syncs with
thread B, following which B performs b0. A sequentially
consistent execution can order b0 before a0, as long as both A
and B perceive the same order. On the other hand, such an execution
is not valid under linearizability condition. Linearizability dictates
that an operation should take effect instantaneously before the
perceived end of the operation.

Evidently, linearizability is a stronger constraint than sequential
consistency (or serializability). Nevertheless, it is said that
linearizability is a local property (i.e., composable), while SC is not.
(I don’t understand this. If you know what this means, please drop a
comment.)

This paper by John McCarthy introduces the Lisp programming language
and describes its implementation on IBM 704 machines. Lisp stands out
among the programming languages of its day (eg: Algol 60) as the first
high-level programming language that abstracts the underlying machine
completely. Instead, Lisp allows symbolic computation by letting
programmers define (recursive) functions over symbolic expressions
(called S-expressions). An S-expression is either an atomic symbol
(hence “symbolic”), or an ordered pair of S-expressions. Reserving an
atomic symbol NIL to denote an empty list, we can construct a list of
S-expressions:

[s1,s2,s3]

using ordered pairs as:

(s1,(s2,(s3,NIL)))

Observe that above is also an S-Expression. Hence, a list of
S-expressions is also an S-expression. Since most functions operate on
a list of symbols, programming such functions in Lisp is effectively
LISt Programming (hence, the name “Lisp”).

Lisp defines five primitive S-functions to construct & destruct lists,
and to check equality among atomic symbols:

atom - tests whether an S-expression is an atomic symbol

eq - tests the equality of two atomic symbols

car - returns the first component of an ordered pair

cdr - returns the second component of an ordered pair

cons - constructs an ordered pair from its two arguments

Lisp, as defined in the paper, includes a class of expressions called
Meta-expressions (written M-expressions) to let us define and apply
S-functions over S-expressions. An M-expression is either

A function application. For eg, f[2; EMPTY] is an M-expression that
denotes the application of function f to the S-expressions (atomic
symbols) 2 and EMPTY,

A function definition with grammar: fname[x0; …; xn] = expr, and
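
A conditional expression with grammar: [p1 -> e1; … ; pn -> en], which
denotes the value of the first ei whose guard pi holds (as used in the
subst example below).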

Using M-expressions and primitive S-functions, we can define a subst
function that substitutes an S-expression (x) for an atomic symbol (y)
in another expression (z) as:

subst[x;y;z] = [
  atom[z] -> [
    eq[z;y] -> x;
    T -> z
  ];
  T -> cons[subst[x; y; car[z]];
            subst[x; y; cdr[z]]]
]

This example demonstrates that while M-expressions provide the control
structure (functions & conditional expressions), S-expressions
represent the data to be manipulated by M-expressions.

The definition (semantics) of Lisp is given as an interpreter
program written in Lisp itself. McCarthy’s Lisp interpreter, implemented
as mutually recursive eval and apply functions, is often cited as
an example of metacircular evaluation - defining the evaluation
semantics of the embedded language in terms of the evaluation semantics
of the host language, which happens to be the same as the embedded
language. Interestingly, McCarthy defines Lisp as a dynamically scoped
language (Paul Graham
says so,
but is it really? Reynolds’s 2nd interpreter, which he says is similar
to McCarthy’s, is lexically scoped.).

In order to use the metacircular Lisp interpreter to interpret a Lisp
program, the program first has to be converted to data. In other
words, all M-expressions need to be translated to S-expressions (note:
this means that M-expressions are only syntactic sugar). A simple
algorithm to perform this translation is described in the paper. The
algorithm converts all function names and their arguments in the user’s
Lisp program (which are meta-variables) to atomic symbols. The
interpreter (the eval function, to be specific) maintains an environment
which defines bindings for these atomic symbols. However, note that
the original program may also contain atomic symbols, and these
symbols have no attached interpretation. To distinguish between the
atomic symbols that are present in the user program and the atomic
symbols that are introduced while converting M-expressions to
S-expressions, the former set of atomic symbols are quoted.
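
For example, the M-expression cons[A; B] would be translated to
something like the S-expression (CONS (QUOTE A) (QUOTE B)): CONS names
the function, while the user’s atomic symbols A and B appear quoted.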

As mentioned previously, Lisp is the first language that abstracted
the underlying machine completely. In order to be able to do that, the
run-time should be able to manage scarce machine resources, such as
memory. Towards this end, Lisp introduced automated memory
reclamation, which is now commonly known as “Garbage Collection”.

2. Definitional Interpreters Paper

The Lisp paper defines the semantics of Lisp using a metacircular
interpreter written in Lisp itself. The definitional interpreters
paper by Reynolds exposes subtle aspects of such definitional
interpreters. In particular, it shows how the semantics of the embedded
language (called the defined language) can depend on the semantics of
the host language (called the defining language), such as its treatment
of higher-order functions and the order of function application. To
demonstrate this point, the paper first constructs a metacircular
interpreter for a higher-order (defined) language in another
higher-order (defining) language, by simply translating an expression
(eg: a function application) in the defined language to the
corresponding expression in the defining language. The paper makes two
observations about this interpreter:

The nature of higher-order functions is not clear.

The order of application is not explicit; the order of evaluating
function applications in the defined language depends on
whether the defining language is call-by-value or call-by-name.

Subsequently, Reynolds proposes two transformations to make the
treatment of higher-order functions, and the order of function
applications, explicit in the semantics of the defined language (small
sketches of both follow the list):

Defunctionalization: A function in the defined language is not a
function in the defining language. Instead, functions and function
closures are represented explicitly as data structures (for eg, as
S-expressions in the Lisp interpreter), and the semantics of a function
application is defined explicitly, as a transformation over data.
Defunctionalization effectively lets us write a definitional
interpreter for a higher-order language in a first-order language.

CPS transformation: A continuation is a function that explicitly
represents the “rest of the computation”. At every atomic step of
evaluation in a definitional interpreter, if we explicitly describe
what the “rest of the evaluation” is, then the semantics of the defined
language no longer depends on the order of evaluation of the defining
language; the order of evaluation is now explicit in the
interpreter. Such an interpreter is said to be written in
continuation passing style (CPS).
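
To make these transformations concrete, here are two tiny sketches in
Python (hypothetical mini-examples, not Reynolds’s interpreters). First,
defunctionalization: the closure returned by make_adder becomes a data
record, and application becomes an explicit apply function over that
data:

# Higher-order version: a function of the defining language.
def make_adder(n):
    return lambda x: x + n

# Defunctionalized version: the closure is represented as data...
def make_adder_d(n):
    return ("ADDER", n)

# ...and application is an explicit transformation over that data.
def apply_d(f, arg):
    tag, n = f
    if tag == "ADDER":
        return arg + n
    raise ValueError("unknown function tag")

print(make_adder(3)(4))             # 7, via the host's closures
print(apply_d(make_adder_d(3), 4))  # 7, via first-order data

Second, CPS: the continuation k pins down the evaluation order
explicitly, instead of inheriting it from the defining language:

def add_cps(x, y, k):
    return k(x + y)

def mul_cps(x, y, k):
    return k(x * y)

# (2 * 3) + 4: the multiplication is explicitly sequenced first.
mul_cps(2, 3, lambda p: add_cps(p, 4, print))   # prints 10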

The paper also describes how imperative features, such as assignments,
can be encoded explicitly in a definitional interpreter written in
applicative style.

My intention in writing this note is to understand the relation
between the conventional model of distributed systems that is usually
taught in a distributed systems course and the distributed web
services hosting replicated datatypes. Fault tolerance is a concern in
the former, and it is studied separately from communication link
failures. On the other hand, partition tolerance is a concern in the
latter, and it looks like this means tolerance to communication
failures. Consistency and availability are major concerns in
distributed web services, whereas they don’t figure anywhere in the
formal study of distributed systems. What is the connection? How can
we express the CAP theorem in conventional models?

The System Model

The conventional model of a distributed system is that of a replicated
state machine (automaton), where the automaton models a local
computation that reacts to external messages. This makes it clear
that the focus in the conventional model is on how to perform a
replicated computation, rather than on how to maintain replicated data.
The above image is taken from the Reliable Distributed Systems textbook.

This view is also corroborated by Eric Brewer in his PODC’2000 talk,
where he conjectured CAP. Here are some excerpts from his
slides:

Inktomi builds distributed systems ... but very little use of
classic DS research.
"Distributed Systems" don't work
Persistent state is hard. Classic DS focuses on computation, not
data. This is WRONG; computation is the easy part.

The semantics of a distributed system that is intended to maintain
replicated data is best captured in Burckhardt et al’s POPL’14 paper.
Since we are familiar with the model, I am not going to reproduce it
here.

Communication Links and Failures

In the conventional model, a process is the unit of communication and
the unit of failure. It is assumed that there is a one-to-one
communication link between every pair of processes in the system.
The conventional model allows for processes to fail. The widely used
model of process failure is crash-stop, where a process crashes and
stops, and therefore becomes inaccessible. A communication link may also
fail, but this failure shows up as a process failure - processes at
both ends of a link perceive each other as failed processes. However,
other processes may still view them as correct processes, and the
conventional model allows for this disparity in processes’ views of
which other processes have failed.

Replicated data stores are usually geo-distributed, so there do not
exist one-to-one links between processes. Processes are organized into
subnets, which are often connected through a single communication
link. Therefore, in the distributed web services model, a communication
link is considered a unit of failure. Failure of a link can cause the
network to be partitioned into disconnected subnets, and web services
are often required to tolerate these partitions. For most purposes, a
process failure can also be modeled as network partitioning by simply
assuming that the failed process is in its own partition. In summary,
in the distributed web services model, faults occur in communication
links, thereby leading to network partitions, and fault tolerance
effectively means partition tolerance.

Timing Assumptions

Both conventional and current models of a distributed system are
asynchronous models - they do not make any assumptions about time
bounds on communication delays or (relative) process speeds. In both
models, we rely on logical clocks (eg: vector clocks) and causal
ordering to understand the behaviour of the system with respect to the
passage of time.

Consistency & Availability

Ideally, a distributed web-service has to be consistent in the sense
that it should behave the same as if it were operating on an atomic
data object. Any operation performed on the data object should see the
effects of all previous operations. For example, consider a
web-service for a simple read/write register. The operations allowed
are write and read, which write a value to the register and read
the current value in the register, respectively. Let us assume that
the client performs two successful writes of values 20 and 40 to the
register. Then a subsequent read should return 40, failing which the
system is deemed inconsistent. This kind of consistency guarantee is
called strong consistency or sequential consistency. Under this
consistency guarantee, there must exist a total order on all
operations such that each operation looks as if it were completed at a
single instant.

Along with being consistent, a distributed web-service needs to be
available - every request made to the web-service should be met
with a response, given that the web-service remains accessible on the
network. To be available, no non-failing node in the distributed
system implementing the web-service should wait infinitely for an
event to occur before responding to a client request.

CAP theorem says that it is impossible to guarantee (strong)
consistency and availability in a system that needs to be partition
tolerant. First, let us see if this makes sense intuitively. Consider
the read/write register web-service described previously. Assume that
it is being implemented by a distributed system with two
geo-distributed nodes, each holding a replica of the register.
Consider a client which makes the following requests to the web-service
in the order shown:

1. write 20
2. write 40
3. read

This order is called the session order. Let us tag each request with its
serial number in the session order. Assume that write 20 goes to the
first node, which immediately writes 20 to the local register and
simultaneously forwards the request to the second node. Now, should the
first node wait for the acknowledgement from the second node before
responding to the client? Given that we are operating in an
asynchronous environment and there is no time bound on the delivery of
the acknowledgement, the wait time for the first node could potentially
be infinite. This leads to a violation of the availability guarantee.
Therefore, the node should respond immediately to the client, and use
timeouts and retransmissions to eventually propagate the client request
to the second node.

Consider a scenario where the second write request (write 40) also goes
to the first node, but the read request goes to the second node. Assume
that by the time the read request was made to the second node, it had
already received the write 20 request forwarded by the first node.
Looking at their serial numbers (1 for write 20 and 3 for read), the
second node knows that there is a request made by the client before
the read that it has not yet received. This missing request
could be a write (it is indeed a write in this case). Therefore,
if the node has to respond with a correct value for the read request,
it has to wait until it receives the missing request, in which case
it might have to wait infinitely, as the network may get partitioned in
the meantime. This violates availability. The other option is to respond
with the current value of the register without waiting for the second
request to be delivered. In this case, the client reads 20 instead of
the expected 40, which means that the system is no longer (strongly)
consistent. Therefore, it is impossible for our service to be both
available and consistent in the presence of network partitions.

Consistency & Availability in the Conventional Model

The conventional model accounts for network partitioning through
process failures - a faulty process that stops communicating with the
rest of the processes in the system effectively leads to network
partitioning. Fault tolerance is the ability of a system to behave in a
well-defined manner once faults occur. There has been extensive
research on fault tolerance in conventional models, but does this
research include a study of consistency and availability properties in
the presence of faults?

The answer is yes. They were studied under the broad categories of
safety and liveness properties, respectively. A safety property of a
system is usually expressed as a set of legal system configurations,
commonly referred to as an invariant. To be safe, a system has to
always remain in the set of safe states as defined by the safety
property.
Consistency is a safety property as it restricts the set of observable
states of the system. For the read/write register example, consistency
dictates that the observable state of the register after the first two
writes is the value 40. Any other state is inconsistent or unsafe.

On the other hand, a liveness property claims that some good thing will
eventually happen during system execution. Liveness properties are
eventuality properties - a traffic signal should eventually allow
every car waiting at an intersection to pass through. Availability is
a liveness property, as it requires every request to be eventually met
with a response. In the literature on fault tolerance, the availability
property is more commonly referred to as guaranteed service.

Recall that fault tolerance is the ability of a system to behave in a
well-defined manner once faults occur. In his landmark survey paper on
fault tolerance in distributed systems, Gärtner identifies four
forms of fault tolerance based on which of the safety (S) and liveness
(L) properties hold when faults occur in the system. The four are
listed below:

Masking fault tolerance: when both S and L hold in presence of
faults

Fail-safe: when only S holds

Non-masking: when only L holds

None (or fault intolerant): when none of them hold.

Gärtner’s paper was published in 1999, before CAP was conjectured, and
before eventually consistent data stores proliferated. The paper
reflects the kind of applications that researchers had in mind when
studying distributed systems. These applications are quite different
from replicated data stores, which explains the disparity between
conventional models and current systems. Below, I reproduce some
relevant excerpts verbatim from the paper, which are suggestive of this
gulf:

fail-safe fault tolerance and is preferable to non-masking
whenever safety is much more important than liveness. An example
is the ground control system of the Ariane 5 space missile project
[Dega 1996]. The system was masking fault tolerance for a single
component failure, but was also designed to stop in a safe state
whenever two successive component failures occurred [Dega 1996].
For the latter type of faults, the launch of the missile
(liveness) was less important than the protection of its pre-
cious cargo and launch site (safety).
....
In effect (of non-masking fault tolerance), the user may
experience a certain amount of incorrect system behavior (i.e.,
failures). For example, a calculation result will be wrong or a
replication variable may not be up to date
....
Research has traditionally focused on forms of fault tolerance
that continuously ensure safety. This can be attributed to the
fact that in most fault-tolerance applications, safety is much
more important than liveness.
...
For a long time nonmasking fault tolerance has been the “ugly
duckling” in the field, as application scenarios for this type of
fault tolerance are not readily visible. However, the potential of
non-masking fault tolerance lies in the fact that it is strictly
weaker than masking fault tolerance, and can therefore be used in
cases where masking fault tolerance is too costly to implement or
even provably impossible.
...
(Talking about self-stabilizing programs, which are pre-cursors of
eventually consistent programs) Examples show that such programs
are quite difficult to construct and verify [Theel and Gärtner
1998]. Also, their nonmasking nature has inhibited them from yet
becoming practically relevant.
...

In summary, the paper says that masking fault tolerance, where both
safety and liveness are preserved in the presence of faults, is the
“strictest and most costly” form of fault tolerance, and that ensuring
such tolerance is a “major area of research”. Instead, fail-safe fault
tolerance is preferable for most practical applications.

The CAP Theorem

Now that we are aware of Gärtner’s categorization of fault tolerance,
we can state the CAP theorem simply as:

It is impossible to have masking fault tolerance in an unreliable
distributed system

In the words of Gilbert and Lynch, who gave the first proof of the
theorem:

The CAP Theorem, in this light, is simply one example of the
fundamental fact that you cannot achieve both safety
and liveness in an unreliable distributed system
- From "Perspectives on the CAP Theorem"

It should therefore be noted that the oft-reproduced formulation of the
CAP theorem as “pick any two among Consistency, Availability and
Partition Tolerance” is misleading at best. A better formulation is:

A distributed system that is network partition tolerant cannot be
consistent and available at the same time.

So, it is more about picking one among the two rather than picking two
among the three.

Eventual Consistency & Its Impact on Availability

Bailis et al’s paper on Highly Available Transactions: Virtues and
Limitations classifies operations on a replicated data store as
unavailable, sticky-available, and available. An operation, such as a
write on a read/write register, that has no requirements on
consistency is classified as available. This is expected, as a write
can be applied to any replica without any need to wait for an event,
and the client can be informed of the success/failure of the write.
Consider the following definition of a Nat-indexed vector GADT in
OCaml and Haskell.
OCaml:
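
(* one standard OCaml encoding, using empty types as Peano indices *)
type zero
type 'n succ

type ('a, 'n) vec =
  | Nil  : ('a, zero) vec
  | Cons : 'a * ('a, 'n) vec -> ('a, 'n succ) vec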

Haskell:
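
{-# LANGUAGE GADTs, EmptyDataDecls #-}

-- the corresponding Haskell encoding (the kind of Vec is * -> * -> *)
data Z
data S n

data Vec a n where
  VNil  :: Vec a Z
  VCons :: a -> Vec a n -> Vec a (S n)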

Observe that the quantified type variables in the types of the
constructors are generalized by the type-constructor signature. In fact,
this is not always true. For example, consider the GADT for higher-order
abstract syntax:
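
-- a representative HOAS encoding in Haskell
data Exp b where
  Lam :: (Exp a -> Exp b) -> Exp (a -> b)
  App :: Exp (a -> b) -> Exp a -> Exp b
  Con :: b -> Exp b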

The type scheme of App contains a type variable (a) that is not
generalized by the type-constructor signature.

This type variable is treated as an existential type when an expression
of type Exp b is destructed and matched against App.

Therefore, the correct observation is this: all type variables
generalized by the signature of the type constructor are also
generalized in the type schemes of its value constructors.

Here is the nat-indexed vector inductive datatype in Coq:
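
Inductive vec : Type -> nat -> Type :=
| vnil : forall A : Type, vec A O
| vcons : forall (A : Type) (n : nat), A -> vec A n -> vec A (S n).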

The above definition should be read as follows: vec is a Type
indexed by a Type and a nat. Here are the only ways to construct
values of vec A N, where A could be any Type and N could be any
nat:

For any Type A, vnil A is a value of vec A O

For any Type A and for any nat n, vcons A n accepts a
value of type A and a value of type vec A n, and generates a value
of type vec A (S n)

Notice that it looks quite similar to the Haskell definition, with the
type variables explicated. An empty list in explicitly typed Haskell is
VNil Int, whereas in Coq it is vnil int. However, unlike Haskell,
where there is a term-type-kind hierarchy, Coq has full dependent types
with an infinite hierarchy of universes. One consequence of this is
that we will frequently see arrows mapping members of one universe to
members of a different universe. For example, if we Print vec., we see
something like:
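
Inductive vec : Type (* Top.59 *) -> nat -> Type (* max(Top.59+1, 1) *) :=
    vnil : forall A : Type, vec A 0
  | vcons : forall (A : Type) (n : nat), A -> vec A n -> vec A (S n)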

Evidently, vec maps a member of the Type(59) universe and a member of
the Set universe to a type universe at a higher level. Compare this to
the kind of the Vec type-constructor in Haskell, * -> * -> *, which
simply says that when a is a type and n is a type (i.e., a and n
have kind *), then Vec a n is a type (i.e., it has kind *).

In the inductive definition of vec, observe that both constructors
construct values of type vec A _. Therefore, we can as well
generalize A at the type of the type-constructor (here, this is a mere
syntactic convenience):
The above definition should be read as follows: vec is a Type
indexed by a Type A and a nat. Here are the only ways to construct
values of vec A N, where N could be any nat …

Another consequence of having full dependent types in Coq is that the
type of an inductive type-constructor can be dependent as well. For eg,
here is the inductive definition of the sig datatype:
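
Inductive sig (T : Type) (P : T -> Prop) : Type :=
  exist : forall x : T, P x -> sig T P.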

sig T P is the type of values of type T that satisfy the proposition
P. sig is a function of a type, and a proposition on that type.
Observe that the type of the second argument to the type-constructor
sig depends on the value of its first argument.

How did we arrive at that type definition? Let us say we want to
define a refinement type for values of nat satisfying a predicate
P. Effectively, we would like to index nat with a predicate P
over natural numbers. Since P is a predicate, P : nat -> Prop. We
define a new Type natST (read “nat, such that”) that is indexed by
P : nat -> Prop:
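
Inductive natST (P : nat -> Prop) : Type :=
(* the constructor is discussed below; its name here is illustrative *)
| natEx : forall n : nat, P n -> natST P.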

What should be its constructors? There has to be only one constructor
that, given a nat (n), and a proof that it satisfies the predicate
(i.e., a proof of P n), returns evidence for natST P. In other
words, we have evidence for natST P if and only if we have a nat
and the evidence that it satisfies the predicate. Destructing a natST
P value would produce a nat and a proof that it satisfies P.

The Type natST helps us easily write the type for a division operation:
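
(* a possible type for division, demanding a provably nonzero divisor *)
Parameter div : nat -> natST (fun n => n <> 0) -> nat.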

Implicit Arguments

The following command turns on automatic declaration of implicit
arguments in Coq:
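
Set Implicit Arguments.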

When the mode for automatic declaration of implicit arguments is on,
the default is to automatically set implicit only the strict implicit
arguments (eg: tyarg of Cons is strict implicit, whereas that of
Nil is contextual implicit).

By default, Coq does not automatically set implicit the contextual
implicit arguments. To tell Coq to also infer contextual implicit
arguments, use the command:
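
Set Contextual Implicit.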

Setting contextual implicit too may not infer the type annotation for
Nil. Why? How does Coq know whether we want to partially apply Nil or
fully apply it? Setting Maximal Implicit works, but that would prevent
us from writing certain partial function applications. Therefore, we
explicitly set implicit args for Nil:
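
Arguments Nil {A}. (* make Nil's type argument maximally implicit *)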

Notation, Associativity, and Precedence

The level is for precedence. Assume, for instance, that we want
conjunction to bind tighter than disjunction. This is expressed by
assigning a precedence level to each notation, knowing that a lower
level binds tighter than a higher level. Hence the level for
disjunction must be higher than the level for conjunction. 100 is the
highest level. There is also a special level, 200, higher than 100.

Since connectives are the least tight articulation points of a text, it
is reasonable to choose levels not so far from the highest level, which
is 100; for example, 85 for disjunction and 80 for conjunction.
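
For example:

Notation "A \/ B" := (or A B) (at level 85, right associativity).
Notation "A /\ B" := (and A B) (at level 80, right associativity).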

The Omega Tactic

Omega is a decision procedure for linear arithmetic (also
known as Presburger arithmetic). It can be used to discharge any silly
goals on nats and integers like:
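
(* after Require Import Omega: *)
Goal forall n m : nat, n < m -> n + 1 <= m.
Proof. intros. omega. Qed.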

Omega can deal with nats. It converts nats to integers. The following
operations on nat are recognized by omega:

Predicates, such as =, le, lt, gt, ge

Functions, such as +, minus, mult, pred, S, O

Omega needs to be imported:
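
Require Import Omega.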

apply and eapply tactics

apply H tries to unify the current goal with conclusion of H. If it
succeeds, it generates a sub-goal for each assumption (argument) of H.
For instance, in the following context:
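
H : In A s1 x
============================
In A (Union A s1 (Union A s2 s3)) x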

using apply Union_introl, where
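
Union_introl : forall (U : Type) (B C : Ensemble U) (x : U),
               In U B x -> In U (Union U B C) x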

unifies U with A, B with s1, C with (Union A s2 s3), and x with x, to
generate the following sub-goal:
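
In A s1 x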

which is already an assumption.

The tactic eapply does the same, except that instead of failing due
to an inability to instantiate variables, it replaces them with
existentials. When you have an existential in your context, destruct it
to skolemize it. The ?6201-style variables in the goal when you use
eapply are actually existential variables.

The Auto Tactic

The auto tactic solves goals that are solvable by any combination of

intros, and

apply.

By default, auto only does apply on local hypotheses. To add to the
global hint database of auto:
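
Hint Resolve Union_introl.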

One can optionally declare a hint database using the command Create
HintDb.

To see hints applicable in the current context, and list of all hints,
respectively:
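
Print Hint.
Print Hint *.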

Using auto tactic with database of hints named sets:
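
auto with sets.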

You can extend the hint DB for one application of auto with the using
keyword. To extend with Union_introl:
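
auto using Union_introl.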

To extend with all constructors of Union:
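
auto using Union.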

More avatars of auto:

trivial tactic - non-recursive auto, with search depth 0.

autounfold and autorewrite to use Hint Unfold and Hint
Rewrite hints in the hint database.

Auto Examples

Completely manual proof for theorem union_is_associative:

Note that proof search tactics never perform any rewriting step
(tactics rewrite, subst), nor any case analysis on an arbitrary data
structure or predicate (tactics destruct and inversion), nor any
proof by induction (tactic induction). So, proof search is really
intended to automate the final steps from the various branches of a
proof. It is not able to discover the overall structure of a proof.

The Solve Tactical

The solve tactical takes a list of tactics and tries them left to
right, stopping when a tactic is successful. If none of the tactics are
successful, solve fails.

The difference between try solve [e1 | e2] and try e1; try e2 is
the short-cut behaviour of solve.

The Intuition Tactical

Intuition t applies a propositional decision procedure to simplify the
goal, and then applies the tactic t to the result.

intuition fail succeeds if the goal is a propositional tautology, and
fails otherwise. This is equivalent to the tauto tactic.

Simply using intuition is equivalent to using intuition auto with *.

Notes:

To instantiate a universally quantified hypothesis H on variable n,
just do (H n). Eg: destruct (H n).

When there is ambiguity about which term rewrite rewrites (i.e., when
there are multiple rewritable terms in the goal), assert the required
hypothesis as a separate assertion, and prove it (using apply).

It is futile to define strong lemmas with a conjunction in the
consequent. Not only is applying them difficult, but auto only
tries such lemmas when the current goal is a conjunction at the top
level. Instead, define two separate lemmas.