On Distributed Object Checkpointing and Recovery

Manhoi Choy



Department of Computer Science
Hong Kong University of Science and Technology
Hong Kong
choy@cs.ust.hk
Hong V. Leong
Department of Computing
The Hong Kong Polytechnic University
Hong Kong
cshleong@comp.polyu.edu.hk
Man Hon Wong

y

Department of Computer Science
The Chinese University of Hong Kong
Hong Kong
mhwong@cs.cuhk.hk

Abstract

Recovery by checkpointing on distributed shared
memory systems is investigated in this paper. The notion
of consistent global states on a sequentially consistent
shared memory system is defined. We investigate
how consistent checkpoints can be obtained in these systems.
In addition, a novel lazy checkpointing approach
is proposed. It allows a controlled degree of concurrency
and, at the same time, limits the amount of rollback
propagation during recovery. Correctness requirements
for efficient checkpointing are explored first and algorithms
satisfying the requirements are developed subsequently.
Several interesting properties of checkpointing
on distributed shared memory systems are discovered.
In particular, we show that for low levels of laziness,
one can achieve better concurrency with more stable
storage.

1 Introduction

Among various programming paradigms, shared
memory programming is relatively easier because of its
good abstraction of communication and synchronization.
Low cost shared memory systems can be implemented
in a distributed manner using clusters of standalone
machines and a proper shared memory access interface.
These systems are commonly called distributed
shared memory (DSM) systems or shared virtual memory
(SVM) systems. A number of such systems have
been built [16, 21, 28] since the concept was grounded
by Li and Hudak [19]. We consider the important issues
of maintaining a robust shared memory system in
this paper. As the size of a parallel system increases, it
becomes more important for the system to be reliable.



Partially supported by grant D4157 DAG94/95.EG13.

y

The author is partially supported by Earmarked Grant
CUHK262/94E.

In particular, the failure of a single component should
not corrupt an entire parallel computation, which may
have been running for days on tens or hundreds of processors.
To maintain high throughput and reasonable
response time and speedup, it is therefore necessary to
design systems that can tolerate various kinds of errors
and failures. No system can tolerate any arbitrary kind
of failures without imposing stringent assumptions [8].
In practice, it is usually assumed that some stable storage
exists that always survives failures. On these systems,
checkpointing is perhaps the most popular technique
for fault-tolerant computing. Periodically, processes
involved in a computation save their local states
as a checkpoint on stable storage. On recovering from
failures, the processes use the saved information to restore
their computation to a consistent state from which
the computation resumes.
Checkpoints must be taken frequently enough during
a computation so that a failure would not lead to
too much rollback of the computation. However, since
checkpointing involves accessing local state information
and stable storage, a non-negligible cost is incurred on
each checkpoint. Therefore, too frequent checkpointing
will slow down normal computation significantly. Using
the well-known idea of overlapping CPU bounded and
I/O bounded jobs, it is sometimes possible to reduce
the impact of checkpointing on the normal computation.
This technique is sometimes referred to as lazy
checkpointing or asynchronous checkpointing. Unfortunately,
the possibility of low impact on normal computation
does not come for free. Due to the requirement
of consistent recovery, more old checkpoints must be
kept on stable storage if new checkpoints are not saved
in a timely manner. Since a checkpoint is the image
of an entire local computation, its size should not be
under-estimated. The efficient use of stable storage is
therefore another issue that we have to consider. When
multiple checkpoints are kept for each process, finding
a consistent global state during recovery may become
more difficult. A phenomenon known as cascading rollback
 may occur [22]. A prolonged recovery process is
expected. Dependency tracking information may be
kept with checkpoints in addition to the local state of
a computation so as to accelerate the recovery process.
However, since dependency tracking information must
be collected during normal computation, this strategy

In Proceedings of the 14th ACM Annual Symposium on Principles
of39326-707
Computing
Copyright (C) 1995 by the Association for Computing Machinery, Inc.
Permission to make digital or hard copies of part or all of this work for
personal or classroom use is granted without fee provided that copies are
not made or distributed for profit or commercial advantage and that copies
bear this notice and the full citation on the first page. Copyrights for
components of this work owned by others than ACM must be honored.
Abstracting with credit is permitted. To copy otherwise, to republish,
to post on servers, or to redistribute to lists, requires prior specific
permission and/or a fee. Request permissions from Publications Dept,
ACM Inc., fax +1 (212) 689-0481, or .

may have an adverse effect on normal computations and
communication requirements. From these observations,
it becomes a challenge to design good checkpointing algorithms
that minimize the amount of rollback propagation,
the cost of tracking communicating dependency
information, and the stable storage requirement, and
yet have minimal impacts on normal computations.
Although the systems we consider for checkpointing
are of distributed shared memory types, at a lower level
these systems are implemented by message passing protocols.
Formal recovery models for message passing
based computations are well established [14]. Although
the same technique can be applied on a shared memory
system, a number of authors have pointed out that
more efficient checkpointing is possible if the content of
the messages implementing the shared memory system
can be used [13]. However, as far as we know, models
for recoverable shared memory based computation have
not yet been clearly and formally defined. Without a
carefully defined model, it becomes difficult to justify
the correctness of existing protocols. For the sake of designing
correct protocols that take into account of the
previously mentioned issues, we develop a formal model
for distributed shared memory systems and define several
dependency relations in Section 3. Hopefully, this
will provide insights into verifying the correctness of
other checkpointing protocols as well. In Section 4, we
investigate the relation between correct checkpointing,
the amount of rollback propagation, and the requirements
on stable storage. We conclude from our results
that for low levels of laziness, one can achieve better
concurrency with more stable storage. In Section 5,
we present a lazy checkpointing algorithm with such a
property. Finally, we conclude briefly in Section 6.

2 Related Work

Checkpointing is a useful fault-tolerant technique in
parallel and distributed computing. In the context of
message passing based computations, a wide variety of
algorithms have been proposed for maintaining recoverable
global states [5, 6, 15, 22]. Chandy and Lamport
define the notion of a consistent global state, on which a
distributed computation should be based [6]. They propose
taking checkpoints on processes in a coordinated
manner. Recovery after a failure is merely a restoration
to the nearest checkpoint. Koo and Toueg extend
Chandy and Lamport's approach into a two phase approach
[15]. Strom and Yemini suggest taking checkpoints
asynchronously and performing rollback recovery
after failures [22]. Message logging is adopted as
an alternative to checkpointing by Borg et al. [5]. In
their approach, messages are logged onto stable storage
before they are processed. Johnson and Zwaenepoel
formalize recovery in message passing systems based
on consistency and recoverability of system states [14].
Leong and Agrawal observe that some messages, such as
state inquiry messages and state re-initialization messages,
do not affect the state of a system [18]. Thus,
the real dependency induced by the messages is reduced.
As a result, their scheme allows recovery to
some global states that are not recoverable in Johnson
and Zwaenepoel's scheme.
To impose a limit on the number of checkpoints required,
many protocols make use of loosely synchronized
local clocks [24] to guarantee the existence of a
recoverable global state. Independent checkpoints are
taken frequently enough on each process so that it is
always possible to form a consistent global state among
the set of available checkpoints. Checkpoint dependency
is also exploited to reduce the number of checkpoints
required [25]. To limit the amount of excessive
dependency in the event of a failure, semi-synchronous
checkpointing strategies are also studied in the literature.
Ahamad and Lin force checkpoints to be taken
when harmful dependency can develop to cause cascading
rollback [2]. Wang and Fuchs suggest artificially delaying
the delivery of a message to control the amount
of rollback propagation [26].
A number of authors discuss about providing faulttolerance
in distributed shared memory (DSM) systems
[11, 12, 23, 27]. Wu and Fuchs propose the twin
page approach in which a page is used to store the
valid data and its twin acts as a checkpoint or represents
an obsolete version of the data [27]. Stumm
and Zhou [23] extend four DSM algorithms to tolerate
single host failures and argue that this kind of fault
tolerance is sufficient for most applications. Janakiraman
and Tamir [11] adapt the coordinated checkpointing
schemes of [15] on message passing systems
to the context of distributed shared memory. Pending
shared memory operations are allowed to complete
and a consistent checkpoint is taken. While most work
on recovery on DSM assume memory is sequential consistent
[17], Janssens and Fuchs [12] consider efficient
checkpointing on DSM with relaxed consistency.
All the above mentioned approaches are considered
as communication-induced [13]. They assume shared
memory systems are implemented on a message passing
platform. Consequently, a consistent global state of a
shared memory system can be inferred from a consistent
global state of the underlying message passing platform.
Recently, researchers has begun to investigate into the
possibility of incorporating the semantics of read and
write shared memory operations to reduce system state
dependency.Janssens and Fuchs argue that several messages
are irrelevant to the consistency of global state in
certain shared memory systems [13]. Checkpoint recovery
can be made more efficient by ignoring the effects of
these messages. So far, there is no well-defined model
purely based on the consistency requirements and dependency
of shared memory systems.

3 System Model

3.1 State Transition Systems

Our distributed system model is based on [4]. A system
consists of a set of processes (or processors), a set
of events, and a set of read/write memory objects. Every
object is assumed to have a serial specification [9].
We do not distinguish between shared memory and local
memory. Instead, memory appears to be local if
one and only one process accesses it in any execution.
A process is an automaton with states and a transition
function that is applied when events occur. The inputs
to the transition function are the current state, and an

event. The transition function produces a new state
and a set of events. The transformation of a system
state into another system state through the transition
function is called a step. Processes interact with each
other by executing read and write operations on the
memory. A read operation to object x returning value

v is denoted by r(x)(v). Similarly, a write operation
that assigns v to data object x does not return any
value and is denoted by w(x; v)(). The execution of an
operation is modeled by two events: an invocation event
and a response event. A process performs an operation
on an object by issuing an invocation. It then waits
until a response is returned. Each process may have
at most one outstanding invocation at any time, but
several outstanding invocations on an object may exist
simultaneously. The history of a process is the sequence
of steps taken by the process. An execution of a set of
processes is a set of histories, one from each process.
Given an execution oe, we define a total order on events
in oe. Let events(oe) denote the sequence of invocation
and response events appearing in oe in real-time order,
breaking ties by using process identifiers. An operation
sequence  for a collection of processes and objects is

legal if, for each object X, the restriction of  to operations
of X satisfies the serial specification of X. A

complete execution is an execution in which for each
invocation event, there exists a corresponding response
event in the execution. We will consider only complete
executions unless otherwise stated.
It is assumed that operations of a process i are totally
ordered by the individual program order of the
process, denoted by ) i . The union of these individual
program orders is denoted by ). The notion of writeinto
 defined in [3] is adopted in this paper. The writeinto
 relation relates operations according to the semantics
of read and write operations. If a write operation

o 1 = w(x; v)() writes into a read operation o 2 = r(x)(v)

then o 2 returns the value written by o 1 , denoted by

o 1 !o 2
1

. The causal dependency ; of an execution is
defined as the transitive closure of the union of the program
order relation and the write-into relation.

3.2 Consistent Checkpoint

A local checkpoint, or simply a checkpoint, is taken
by dumping the current local state of a process onto stable
storage. It is assumed that checkpoints are taken
atomically with respect to local read/write operations.
At any moment, a process can have a number of checkpoints
taken at several points during its execution. Let

CP (s) denote the set of checkpoints currently saved by
the processes at global state s. Since the amount of
stable storage is finite, there is a maximum number C i

of checkpoints that process i can save. We call these
saved checkpoints the active checkpoints of a process.
The jth active checkpoint of process i at global state s,

is denoted by ch ij (s), where 1  i  N and 1  j  C i .
For simplicity, we omit the state when it is clear from
context to which state we are referring. A cut is a set

1

For simplicity, we have adopted the assumption that writes
to the same variable are associated with different values [3]. An
alternative definition of write-into without making such an assumption
is also possible.

of local checkpoints, which contains exactly one checkpoint
from each process

2

. A minimal cut of a set of cuts

S is a cut in S such that no cut in S precedes it. It is
necessary to impose a certain order, checkpoint order,
on the set of cuts.

Definition 1 (Checkpoint Order) A cut P1 precedes
another cut P2 in checkpoint order if

9 x 2 P1  y 2 P2 : x precedes y in program
order, and

8 x 2 P1 : 9 y 2 P2 : x = y  x precedes y

in program order.

The notion of consistent global states for message
passing systems has been formally defined by Chandy
and Lamport [6]. Surprisingly, the notion of consistent
global states for shared memory systems has never been

formally defined in the literature. In general, consistent
global states for shared memory systems can only be
defined based on the consistency requirement imposed
on the system. We adopt sequential consistency defined
by Lamport [17] as our correctness requirement, since
this is the most widely accepted correctness criterion.
The definition of sequential consistency formally stated
by Attiya and Welch [4] is as follows:

Definition 2 (Sequential Consistency) An execution
 oe is sequentially consistent if there exists a legal
sequence  of operations such that, for each process p,

the restriction of events(oe) to operations of p is equal
to the restriction of  to operations of p.

The sequence  is called a serialization of the execution.
We assume a model in which each object is owned
by an owner at any time. The local state of a process
is the values of the objects owned by the process. The
global state of a system is the set of local states of the
processes in the system.

Definition 3 (Consistent Global State) A global
state s is consistent if the execution oe that leads the
system from the initial state to s is sequentially consistent
and the final state of the serialization of oe is the
same as s.

We assume that the owner of an object is the last
process that wrote into it. Since the checkpointing of
local states is atomic with respect to local read and
write operations, each cut corresponds to a global state.
A cut that corresponds to a consistent global state is
called a consistent cut. We say that CP (s) is consistent
if it contains a consistent cut. It is not difficult to show
the following lemma from the definitions of consistent
global states and consistent cuts.

Lemma 1 A cut is consistent if every read preceding
the cut is written into by a write preceding the cut.

2

Our definition of cuts should not be confused with the definition
of cuts of distributed computations [6].

Proof: Let oe be an execution and oe

0

ae oe be an execution
that leads the system to the global state corresponding
to a cut. Suppose every read preceding the cut
is written into by a write preceding the cut. We show
that the cut is consistent. In other words, we show that

oe

0

is sequentially consistent and the final state of the
serialization of oe

0

is the same as the state of the cut.
Since the system is sequentially consistent, there exists
a legal sequence  of operations such that, for each process
 p, the restriction of events(oe) to operations of p is
equal to the restriction of  to operations of p. Let 

0

be the operation sequence obtained from  by removing
all the operations after the cut. Obviously, for each
process p, the restriction of events(oe

0

) to operations of

p is equal to the restriction of 

0

to operations of p. The
proof is complete, if we can show that 

0

is legal and
the state resulted from applying oe

0

on the initial state
is equal to the state of the cut. Let  = h 1 \Delta o \Delta h 2 , where

h 1 and h 2 are operation sequences (may be empty sequences)
, o is the last operation in the sequence that
is after the cut, and "\Delta" denotes the concatenation of
operation sequences. By definition, no operation in h 2

is before the cut. Suppose o is executed by process i.

Then o must be the last operation executed by process

i in  . There are two cases. In case 1, o is a read operation,
then no operation in h 2 will be affected after o is
removed, i.e., h 1 \Delta h 2 is legal. In case 2, o is a write operation.
By assumption, every read preceding the cut
is written into by a write preceding the cut. Therefore,

o does not write into any read in h 2 , since all the read
operations in h 2 must precede the cut. Hence, removing
 o does not affect any operation in h 2 , i.e., h 1 \Delta h 2 is
legal. By repeating the above argument, we can remove
all the operations after the cut from  and the resulting
operation sequence, 

0

, is legal. Since the last writer
of a variable in a cut is also the owner of the variable
at the time the checkpoints of the cut are taken, the
final state obtained by applying the sequence 

0

on the
initial state is the same as that of the cut. The proof
follows. 2

The correctness criterion derived from Lemma 1 is
weaker than traditional models based on message passing
[6]. As an example, the checkpoints in Figure 1
form a consistent global state when the semantics of
read/write operations is considered. The reason is that
the read request is sent at a and the response is received
at b. With the semantics of read operations, the read
operation can be considered to occur at a point between

a and b, i.e., before checkpoint c 1 . Furthermore, the
write operation occurs before checkpoint c 2 . By virtue
of Lemma 1, checkpoints c 1 and c 2 form a consistent
global state. However, under Chandy and Lamport's
model [6], the response message is an orphan message
that has been received but not yet been sent in the cut
formed by the checkpoints. Hence c 1 and c 2 do not form
a consistent global state under their model.
Lemma 1 can be used to determine whether a cut
is consistent or not if processes coordinate with each
other when taking checkpoints. However, if processes
take checkpoints individually, it may be difficult to apply
the lemma directly. This is because the informa-

c1
c2
read request read response

[

a b

]
] write
read
Figure 1: Consistent Global State

b
a
d
c

checkpoint
checkpoint
Figure 2: Commit Dependency
tion needed to apply the lemma may not be available
locally. To facilitate local determination of consistent
cuts, we define the notion of commit dependency relation
 7!. Intuitively, we say that an operation a precedes
another operation b in a commit dependency relation

7!, if whenever the process that executes b rolls back to
a checkpoint earlier than b, the process that executes

a also has to roll back to a checkpoint earlier than a.

For example, if b is a write operation that writes into a,

then a precedes b in the commit dependency relation.
Moreover, when the process that executes a rolls back
to a checkpoint, say C, the effect of all the write operations
after C will be removed. Note that these write
operations may be executed before or after a. Therefore,
the read operations that are written into by these
write operations also precede b in the commit dependency
order. For example, in Figure 2, a and d precede

b and d precedes c in the commit dependency relation.
(Note that in Figure 2, an arrow across processes represents
a write-into relation.) The formal definition of the
commit dependency relation is given in Definition 4.

Definition 4 (Commit Dependency) An operation

a precedes another operation b in the commit dependency
relation, (or a 7! b) if

ffl b writes into a, or

ffl there exist operations e and f such that

-- e writes into a,

-- f 7! b, and

-- f precedes (in program order) the earliest
checkpoint, if any, that follows (in program
order) e.

Intuitively, an operation is committed if its effect persists
after a failure. In other words, the state recovered

after a failure will always include the effect of the operation.
Obviously, if there is no checkpoint taken after
a write operation, the operation is not committed
(i.e. uncommitted). Moreover, all the operations that
precede an uncommitted operation in a commit dependency
relation are also uncommitted. Hence, we can
define formally the notion of uncommit as follows.

Definition 5 (Uncommit) An operation is uncommitted
if

ffl it is a write and no checkpoint is taken after it, or

ffl it precedes an uncommitted (write) operation in a
commit dependency relation.

From the definition of uncommit, we observe that
an operation will become committed when a number of
checkpoints are taken properly. Therefore, there exists
a minimal set of checkpoints which causes an operation
to become committed. A systematic notation for this
set of checkpoints is a minimal cut, which is defined as
a cover as below.

Definition 6 (Cover) The cover of an operation is
the minimal cut that causes the operation to be committed.


As an example, in Figure 3 operation d precedes operation
 c as well as operation a in the commit dependency
relation. Furthermore, operation a precedes operation

b, and a can become committed if b becomes committed.
Taking checkpoint c 1 causes b to commit, and hence a

can commit. After committing a and taking checkpoint

c 2 , c can commit. Since d is not a write operation, it
can commit if both b and c have committed. Thus the
set of checkpoints c 1 , c 2 and c 3 forms the cover of d.

b
a
d
c

c2
c1
c3
Figure 3: Cover

4 Designing a Lazy Checkpointing Algorithm

As one may expect, it is possible to design a lazy
checkpointing algorithm by properly sending background
messages to keep track of dependency relations.
It is more advantageous, however, if these background
messages can be piggybacked with the messages due to
the underlying DSM protocol. We therefore focus our
attention on this class of algorithms. Under our DSM
model, the messages that a writer sends to serve a remote
read establish the write-into relation between the
writer and the reader. If the value written by a write
operation has not been saved by the writer, then checkpointing
the reader after the remote read operation may
be a waste. This is because the failure of the writer requires
the writer to roll back to a state before the write
operation is executed. Consequently, the reader has to
roll back to a state before the read operation in order
to maintain a consistent global state. Intuitively, in
order to reduce the amount of cascading rollback, one
might want to limit the extent that the write-into relation
can penetrate across processes. We shall use this
as an invariant to our to-be-derived algorithm.

Invariant 1 Let w 0 ! r 1 ) w 1 ! r 2 :::) w j \Gamma1 ! r j

be a chain of alternations of write-into relation and program
order such that w i\Gamma1 and r i , 1  i  j, are from
different processes. If w 0 follows (in program order) the
first active checkpoint, then j  some constant G.

Henceforth, we call this chain the write-penetration
chain. The value of j is the length of the chain. We
call the constant G in Invariant 1 the write-penetration
bound. Another invariant that one might maintain in a
checkpointing algorithm is the following:

Invariant 2 Given any two consecutive checkpoints of
a process, there is a write that writes into some remote
read.

The reason of maintaining such an invariant is obvious.
A checkpoint can simply replace the previous
checkpoint, if there does not exist any write operation
that writes into a remote read after the previous checkpoint.
This is because any rollback of the process to
the earlier checkpoint can be replaced by a rollback of
the process to the later checkpoint.
Despite the fundamental differences between the
causal dependency relation and the commit dependency
relation, we show that for small values of G, i.e., G ! 3,
bounding the lengths of causal dependency chain also
bounds the number of local checkpoints needed per process.
In particular, the number of checkpoints for each
process is G+1. This result holds as long as invariants 1
and 2 are satisfied. We further demonstrate in the Appendix
that such a result does not hold for G  3, if
real time constraint is not imposed.
Since the number of checkpoints for each process is
to be bounded above by the write-penetration bound,
a new checkpoint always replaces some old checkpoint.
We prove in our main theorem that it is sufficient to
replace the oldest checkpoint when Invariant 2 is maintained.
Before proving the theorem, we prove the following
lemma.

Lemma 2 If CP (s) is consistent and there exists a
process x such that x has at least two saved checkpoints
and it does not read any uncommitted value in the interval
 (ch x1 ; ch x2 ) then CP (s) \Gamma fch x1 g is also consistent.
Proof: Suppose the assumptions of the lemma hold.
The proof is obvious if there is a consistent cut in

CP (s) that does not contain ch x1 . So assume ch x1

is part of any consistent cut in CP (s). Let A be
a consistent cut in CP (s). If x does not read any

value in the interval (ch x1 ; ch x2 ) then, by Lemma 1,
(A \Gamma fch x1 g)[fch x2 g ` CP (s) is a consistent cut. Consequently,
 CP (s) \Gamma fch x1 g is consistent and the lemma
follows in this case. Otherwise, if x reads in the interval
 (ch x1 ; ch x2 ) then consider the cut B that is the
maximum of A and all the covers of these reads. By assumption,
 x does not read any uncommitted values in
the interval (ch x1 ; ch x2 ). Therefore each of these mentioned
covers is in CP (s). Consequently, B ` CP (s).
In addition, from the definition of B, any read preceding

B reads from a write preceding B. Therefore, according
to Lemma 1, B is consistent. Furthermore, according
to the definition of cover, the covers of x's reads in the
interval (ch x1 ; ch x2 ) contain ch x2 or a checkpoint of x

after ch x2 . We conclude that B is a consistent cut of

CP (s) that does not contain ch x1 . The proof follows.

2

Theorem 1 At any global state s, if C i ? G 2 f0; 1; 2g

for any process i and CP (s) contains a consistent cut,
then there exists a process x such that at the state s

0

,
which is the resulting state of taking a new checkpoint
at process x and discarding ch x1 from state s, CP (s

0

)

contains a consistent cut.

Proof: Suppose the assumptions in the theorem hold.
Consider the case G = 0 first. Let LCP (s) be the cut
that consists of the last checkpoint of each process at
state s. By Invariant 1, no process reads uncommitted
values. Therefore, by Lemma 1, LCP (s) is consistent
and the theorem follows trivially in this case.
Next we consider the cases of G = 1 and G = 2. We
show that there exists a process x such that CP (s

0

) contains
a consistent cut. Since C i ? 1, every process has
at least two checkpoints. Consequently, by Lemma 2, it
is sufficient to show that there exists a process x such
that each of its reads in the interval (ch x1 ; ch x2 ) will
become committed at the state s

0

.
Case 1: G = 1,
For the sake of contradiction, suppose no such process
as stated in the theorem exists. Then, by Lemma 2,
every process m has read uncommitted values in the interval
 (ch m1 ; chm2 ) at state s. By the definition of uncommit,
there exists a process y that writes into some
reads after y's last checkpoint. Since y also reads some
(uncommitted) value in the interval (ch y1 ; ch y2 ), there
exists a write-penetration chain of length 2 ? G, contradicting
our assumption.
Case 2: G = 2,
For the sake of contradiction, suppose no such process
exists. As in Case 1, by Lemma 2, there exists
a process m with a read operation r in the interval
(ch m1 ; chm2 ) and a write w of y occurring after y

0

s last
checkpoint such that r precedes w in the commit dependency
relation. For notational convenience, let r = r 0 ;

and w = w j+1 . Also assume a hypothetical read operation
 r j+1 of y that follows w and a hypothetical write
operation w 0 of m that precedes r. According to the
definition of commit dependency , there is a sequence of
processes p 0 (= m); p 1 ; p 2 ; ::::; p j ; p j+1 (= y) such that,
for 0  i  j + 1,

ffl process p i has a read r i and a write w i

ffl r i precedes the checkpoint immediately after w i , if
any, and
ffl w i+1 writes into r i .
It is sufficient to show that ch p i 2 ) p i w i ) p i r i , for
0  i  j+1. In particular, for i = 0, this asserts that r 0

occurs after chm2 , giving us the desired contradiction.
The remaining proof is by induction on i from j + 1
to 0. The base case, where i = j + 1, holds trivially.
Now consider the induction steps for i, 0  i  j. By
induction hypothesis, w i+1 occurs after ch p i+12 . By assumption,
 p i+1 reads some uncommitted value in the interval
 (ch p i+11 ; ch p i+12 ). Consequently, if r i is followed
by a write of p i that writes into some process then there
exists a write-penetration chain of length 3 ? G, contradicting
our assumption. Therefore, no write of p i

that writes into some remote read follows r i . From this
and Invariant 2, r i must occur after ch p i 2 . Furthermore,
from the definition of r i and w i , w i occurs between ch p i 2

and r i . This completes the induction step and the proof
of the case G = 2. 2

Theorem 1 gives us a hint on how to derive a lazy
checkpointing algorithm. The process which can take a
new checkpoint and discard the oldest checkpoint without
rendering the system inconsistent may go ahead and
do so. It is also possible to show that even if more than
one process are concurrently taking new checkpoints
and discarding their oldest checkpoints, some consistent
cut still remains. The remaining problem is to
identify such a process and is addressed in the next
section where we derive a lazy checkpointing algorithm
based on Theorem 1. Furthermore, a system that takes
no new checkpoint will easily satisfy both invariants and
the theorem. However, such a system will definitely not
be acceptable, due to the lack of progress. In Section 5,
an algorithm that guarantees progress, as well as both
invariants, will be derived.
By varying the value of G, algorithms with different
degrees of laziness are resulted. In particular, when

G = 0, our algorithm behaves in the same way as a synchronous
checkpointing algorithm [15]. In most implementations
of synchronous checkpointing algorithms,
two checkpoints are required per process. However,
from Theorem 1, we show that only one checkpoint per
process is sufficient to guarantee the existence of a consistent
global state. This simplifies the reconstruction
of the global state after a failure. When G = 1 or G = 2,
a higher degree of laziness is exhibited. This results in
higher concurrency but requires more stable storage.
Note that Theorem 1 does not hold for larger values of

G, i.e., for G  3. We have constructed executions that
violate the theorem for G  3. One such execution is
depicted in the Appendix. However, these executions
are constructed based on the assumption that read operations
may return future values, a property that is
not forbidden by sequential consistency. It is an open
question whether Theorem 1 will hold or not for G  3
if sequential consistency is strengthened, e.g. to realizable
sequential consistency [10] or linearizability [9].

5 Realizing a Lazy Checkpointing Algorithm

In this section, we present a lazy checkpointing algorithm
based on Theorem 1. To ensure correctness, our
algorithm needs to maintain the safety requirements of
invariants 1 and 2 and to guarantee progress. For simplified
presentation, we have not pursued for obvious
optimizations in our algorithm, such as forcing all processes
to take a checkpoint after a failure before executing
the recovery protocol, as in [14].
To uphold Invariant 1, the lengths of writepenetration
chains should be bounded. According to
the invariant, only causal dependencies occurring after
the first active checkpoints of each process need to be
kept track of. We shall refer to these causal dependencies
active causal dependencies. Since the values of G

are small, we choose to maintain a dependency graph
to keep track of active causal dependencies.
It may not be difficult to maintain Invariant 2: when
a process takes a new checkpoint, the new checkpoint
can replace the previous checkpoint, if the process does
not serve any remote read between the two checkpoints.
This results in collapsing the two checkpoints into one.
A small difficulty is encountered, however, to maintain
Invariant 2 at the initial state. Theorem 1 assumes
that, at initial state s, there are at least G + 1 checkpoints
at each process. Therefore, for the theorem to
be applicable, we shall maintain G+ 1 identical initial
checkpoints. To satisfy Invariant 2 at such an initial
configuration, we assume that there are default committed
remote reads in each initial checkpoint interval.
Next, to ensure progress, we need to set up the conditions
under which new checkpoints may be taken and
Theorem 1 may be applied to remove outdated checkpoints.
We do so, in our algorithm, by keeping track
of the status of the commitment of remote reads. In
fact, if the first active checkpoint interval of a process
does not contain any uncommitted read and there are
at least G+ 1 active checkpoints, a new checkpoint can
be taken. (Subsequently, the oldest active checkpoint
may be removed or the new checkpoint may replace its
immediately preceding checkpoint.)
Data structure at process i:
SeqNo i : sequence number for the latest checkpoint
on process i;
CausalDepend i : dependency graph after the latest
checkpoint on process i;
remote i : true if a remote read has been served since
the last checkpoint;
Figure 4: Local Variables at Process i

Figure 4 shows the variables used by a process i.

Checkpoints of each process are assigned with monotonically
increasing sequence numbers. The sequence
number of the last checkpoint of process i is stored in

SeqNo i . The variable CausalDepend i keeps track of
the active causal dependency that is related to operations
performed after the last checkpoint of process i.

We assume that it is represented in the form of a de-

When a checkpoint is taken:

begin

save local state of computation and current value of

CausalDepend i ;

CausalDepend i / null;
remote i / false;
SeqNo i / SeqNo i + 1;
arrange for sending the value SeqNo i in a "checkpointed"
message asynchronously;

end
Figure 5: Checkpointing Procedure at Process i

pendency graph. As shown in Figure 5, a checkpoint of
process i is saved together with the the current value of

CausalDepend i , which is then reset to empty. Invariant
2 is also checked to see if the immediately preceding
checkpoint is useful or not. It is replaced by the new
checkpoint if process i has not served any remote read
since then. The checkpoint sequence number is then
incremented and piggybacked or propagated to other
processes. The completion of a remote checkpoint may
cause some uncommitted operations to commit. Consequently,
some active causal dependency chains may be
shortened, and conditions for taking checkpoints may
be enabled.
Process i writing to data object x:

begin
if x:state = exclusive then

perform the write operation;

return;
else send a message to the manager for a copy;

endif;
wait until the copy arrives;

x:state / exclusive;
x:copyset / fig;

perform the write operation;

end
Figure 6: Performing a Write Operation
Depending on the nature of the DSM protocol that
we want to implement, the actions taken by a write operation
may differ. In Figure 6, we highlight the steps to
serve a write request of process i in an ownership based
protocol. Write operations are easier to handle than
read operations because write-into relations are established
when remote reads are served. As a result, the
processing of a read operation is more complicated since
it involves the tracking of causal dependency. Figures 7
and 8 outline the handling of remote read requests. To
serve a remote read, a process changes its state to reflect
that the data object is shared. The data object
is sent to the requester together with the current dependency
graph of the serving process. Then local flag

remote i is set to true. At the reader side, the dependency
graph received will be merged to the current de-

pendency graph.
Serving a remote read on data object x from process j:

begin
while depth(CausalDepend i )  G do wait;

/* maintain Invariant 1 */

x:copyset / x:copyset [ fjg;

x:state / shared;

send (x; CausalDepend i \Delta hi; SeqNo i i) to j;

/* "\Delta" denotes appending an item to a graph */

remote i / true;

end
Figure 7: Process i Serving a Remote Read
When (x; dep) is received from process j:

begin

CausalDepend i / merge(CausalDepend i ; dep);

end
Figure 8: Process i Receiving a Read Reply
Process i receives the message "checkpointed" from process j

with content seqno:

begin
if hj; ci 2 CausalDepend i and c ! seqno then

change the entry into hj; ?i;

remove all items hk; ?i in CausalDepend i which
do not have any ancestor;

endif;
end
Figure 9: Remote Operation Committed
Owing to the asynchrony of checkpointing and information
dispersal, processes may occasionally receive
information that a remote process has taken a checkpoint.
This will cause uncommitted operations to become
committed. Furthermore, it can reduce the length
of the write-penetration chain. Items in the variable

CausalDepend i reflect the set of checkpoints that the
current process i is waiting to complete. Upon receiving
the committed values, the items are removed from
the graph, and the length of the chain may change as
well, as indicated in Figure 9.
We adopt the approach of rollback recovery, by assuming
that the set of the latest checkpoints can form
a consistent global state, as depicted in Figure 10. If it
really is a consistent global state, all dependency graphs
are empty. Otherwise, rollback starts to occur, but the
cascading effect cannot go beyond the write-penetration
bound G. By controlling the value of G, protocols with
different rollback cost, processing overhead, and latency
in waiting for remote checkpoints to complete can be resulted.
Note that in the distributed recovery scheme,

After a failure, process i executes the following:

begin
if CausalDepend i is null then

restore to the most recent checkpoint;

else find the best checkpoint c

such that c:CausalDepend ! c:SeqNo;

restore to the checkpoint c;
SeqNo i / c:SeqNo;

broadcast SeqNo i ;

endif;
end
Figure 10: Recovery Protocol
the broadcasting mechanism is adopted, and this results
in a larger number of messages. However, a special
process can be elected the coordinator for the recovery
procedure. The coordinator serves as the gateway to
propagate the rollback information among the involved
processes. This can reduce the number of messages at
the expense of a longer recovery time, due to sequential
message exchanges at the coordinator.

6 Conclusion and Discussion

In this paper, we investigate the correctness requirements
on a shared memory system and define a formal
model to reason about the correctness. To do this, we
first define the notion of consistent global state in the
context of the read/write memory systems, under the
widely accepted correctness criterion of sequential consistency.
As far as we know, this is the first formal
model based on the semantics of read and write memory
operations. With this model, we demonstrate that
the synchronization constraints are less restricted than
those based on traditional message passing models such
as Johnson and Zwaenepoel's model [14].
A set of correctness conditions that a family of lazy
checkpointing algorithms should observe are defined
and proved. From the conditions, a lazy checkpointing
algorithm based on the ownership protocol [19]
is derived. The algorithm is a generalization of several
existing algorithms. For example, when the writepenetration
bound G is set to 0, the algorithm behaves
as if it were a synchronous checkpointing algorithm [15].
Moreover, we show that only one checkpoint is required
for each process even though most existing algorithms
require two checkpoints per process. When G increases,
the degree of laziness increases as well. This results in
an improved degree of concurrency at the expense of
stable storage.
We assume that the DSM protocol used is ownershipbased,
so that only the owner can update an object.
This implies that the owner will continue to hold its
value and is responsible for checkpointing the object.
Other processes will not have this value without reading
from the owner, and establishing a write-into relation.
Whilst this is in general true in a majority of
DSM protocols, it is not true in remote procedure call
based DSM protocols, as in Split-C [7]. One way to incorporate
these protocols into our model is to let writer

processes checkpoint the variables they updated even
though they are no longer the owners. Another way is
to model a remote write by process i to a server process
 j (which owns the object) as a write of process i,

followed by a read of the value from process j (this establishes
the write-into relation), and then a write of
the same value by j (this establishes the program order)
. As a result, i will no longer be the owner of the
object and it will need to request j for the most updated
value if a subsequent read to the object is issued
by i. This does not impose an additional constraint,
since reading the value again from the server is necessary
in a remote procedure based protocol to guarantee
sequential consistency.
The present work suggests some continuations. We
are studying the performance of the algorithm derived
in this paper under different degrees of laziness by simulation.
Furthermore, the generalization of our model
towards weaker memories, such as causal memory [3],
pipelined random access memory [20], and weak ordering
[1] may be an interesting topic to study.
References

[1] S.V. Adve and M.D. Hill. Weak ordering - A new
definition. In Proceedings of the 17th Annual International
Symposium on Computer Architecture,

pages 2--14. IEEE, May 1990.
[2] Mustaque Ahamad and Luke Lin. Using checkpoints
to localize the effects of faults in distributed
systems. In Proceedings of the 9th International
Conference on Distributed Computing
Systems, pages 2--11. IEEE, 1989.
[3] Mustaque Ahamad, Gil Neiger, Prince Kohli,
James E. Burns, and Phillip W. Hutto. Causal
memory: Definitions, implementation, and programming.
Technical Report 93/55, College
of Computing, Georgia Institute of Technology,
September 1993. Submitted for Publication.
[4] H. Attiya and J.L. Welch. Sequential consistency
versus linearizability. ACM Transactions on Computer
 Systems, 12(2):91--122, 1994.
[5] A. Borg, J. Baumbach, and S. Glazer. A message
system supporting fault tolerance. In Proceedings
of the Ninth ACM Symposium on Operating System
 Principles, pages 90--99, October 1983.
[6] K. Mani Chandy and Leslie Lamport. Distributed
snapshots: Determining global states of
distributed systems. ACM Transactions on Computer
 Systems, 3(1):63--75, February 1985.
[7] D.E. Culler, A. Dusseau, S.C. Goldstein, A. Krishnamurthy,
S. Lumetta, T. Von Eicken, and
K. Yelick. Parallel programming in Split-C. Technical
report, Computer Science Division, University
of California, Berkeley, 1993.
[8] Michael J. Fisher, Nancy A. Lynch, and Michael S.
Paterson. Impossibility of distributed consensus
with one faulty process. Journal of the ACM,

32(2):374--382, April 1985.
[9] Maurice P. Herlihy and Jeannette M. Wing. Linearizability:
a correctness condition for concurrent
objects. ACM Transactions on Programming Languages
and Systems, 12(3):463--492, July 1990.
[10] Phillip W. Hutto and Mustaque Ahamad. Slow
memory: Weakening consistency to enhance concurrency
in distributed shared memories. In Proceedings
of the 10th International Conference on
Distributed Computing Systems, pages 302--309.
IEEE, 1990.
[11] G. Janakiraman and Y. Tamir. Coordinated
checkpointing-rollback error recovery for distributed
shared memory multicomputers. In Proceedings
of the 13th IEEE Symposium on Reliable
Distributed Systems, pages 42--51. IEEE, 1994.
[12] B. Janssens and W.K. Fuchs. Relaxing consistency
in recoverable distributed shared memory. In Proceedings
of the 23rd International Symposium on
Fault-Tolerant Computing. IEEE, 1993.
[13] B. Janssens and W.K. Fuchs. Reducing interprocessor
dependence in recoverable distributed
shared memory. In Proceedings of the 13th IEEE
Symposium on Reliable Distributed Systems, pages
34--41. IEEE, 1994.
[14] D. B. Johnson and W. Zwaenepoel. Recovery
in distributed systems. Journal of Algorithms,

11(3):462--491, September 1990. Preliminary version
in Proceedings of the 7th Symposium on Principles
of Distributed Computing, 1988.
[15] R. Koo and S. Toueg. Checkpointing and rollbackrecovery
for distributed systems. IEEE Transactions
on Software Engineering, SE-13(1):23--31,
January 1987.
[16] Z. Lahjomri and T.H. Priol. KOAN: A shared virtual
memory for the iPSC/2 hypercube. In Parallel
Processing: CONPAR 92-VAPP V. Second Joint
International Conference on Vector and Parallel
Processing, pages 441--452, 1992.
[17] Leslie Lamport. How to make a multiprocessor
computer that correctly executes multiprocess
programs. IEEE Transactions on Computers,

28(9):690--691, September 1979.
[18] H.V. Leong and D. Agrawal. Using message semantics
to reduce rollback in optimistic message logging
recovery schemes. In Proceedings of the 14th
International Conference on Distributed Computing
 Systems, pages 227--234. IEEE, June 1994.
[19] Kai Li and Paul Hudak. Memory coherence in
shared virtual memory systems. ACM Transactions
on Computer Systems, 7(4):321--359, November
1989. First appeared in 1986 PODC.
[20] Richard J. Lipton and Jonathan S. Sandberg.
PRAM: A scalable shared memory. Technical Report
CS-TR-180-88, Princeton University, Department
of Computer Science, September 1988.
[21] Bill Nitzberg and Virginia Lo. Distributed shared
memory: A survey of issues and algorithms. IEEE
Computer, 24(8):52--60, August 1991.

w r write-into relation
checkpoints
w11
c11 c12
c31 c32 c33
P1
P2
P3
w12 w13
w21 w22
w31 w32 w33 r31
r32 to r36
r32 r33 r34
r36
r21
r12 r13
r35
c21 c22
w23
w24
a write-into relation ending at the of p3
Figure 11: Four Checkpoints are Insufficient for G = 3
[22] R. Strom and S. Yemini. Optimistic recovery in
distributed systems. ACM Transactions on Computer
 Systems, 3(3):205--226, August 1985.
[23] M. Stumm and S. Zhou. Fault tolerant distributed
shared memory. In Proceedings of the 2nd IEEE
Symposium on Parallel and Distributed Processing,

pages 719--724. IEEE, 1990.
[24] Z. Tong, R. Y. Kain, and W. T. Tsai. A Low
Overhead Checkpointing and Rollback Recovery
Scheme for Distributed Systems. In Proceedings
of the Eighth Symposium on Reliable Distributed
Systems, pages 12--20, October 1989.
[25] Yi-Min Wang and W. Kent Fuchs. Optimistic
message logging for independent checkpointing in
message-passing systems. In Proceedings of the
11th IEEE Symposium on Reliable Distributed Systems,
pages 147--154. IEEE, 1992.
[26] Yi-Min Wang and W. Kent Fuchs. Scheduling message
processing for reducing rollback propagation.
In Proceedings of the 22nd International Symposium
on Fault-Tolerant Computing, pages 204--211.
IEEE, 1992.
[27] K.L. Wu and W.K. Fuch. Recoverable distributed
shared virtual memory. IEEE Transactions on
Computers, 39(4):460--469, April 1990.
[28] Songnian Zhou, Michael Stumm, Kai Li, and
David Wortman. Heterogeneous distributed shared
memory. IEEE Transactions on Parallel and Distributed
 Systems, 3(5):540--554, September 1992.

Appendix

Figure 11 shows an execution for which Theorem 1
does not hold for G = 3. Invariant 1 is satisfied since
the length of any write-penetration chain is not greater
than the bound G = 3. The execution also satisfies
Invariant 2 because there exists a write operation that
writes into a remote read between every two consecutive
checkpoints. It can be verified that the set of active
checkpoints c 11 , c 21 and c 31 is the only consistent cut in
the execution. However, it is impossible to choose a process
such that adding a new checkpoint and removing
the oldest checkpoint from the process preserves some
consistent cut in the system. For example, when c 21

is removed, checkpoints c 11 , c 22 and c 31 do not form
a consistent cut, since the write-into relation w 31 !r 21

violates Lemma 1. Therefore c 31 can no longer be a
checkpoint of a consistent cut after the removal of c 21

because c 31 is before w 31 . Consider the checkpoints c 11 ,

c 22 and c 32 . They do not form a consistent cut, since
the write-into relation w 23 !r 31 violates Lemma 1. After
considering all the possible combinations of checkpoints,
it can be verified that no consistent cut exists.
Therefore, Theorem 1 does not hold for G = 3.
However, the execution shown in the figure may not
be possible in some systems. In most practical systems,
executions satisfy the property that the order induced
by the happens-before relation is consistent with
the real-time order of the operations. Therefore, the
happens-before relation is acyclic. We refer these executions
to as realizable, similar to the notion of realizable
sequential consistency [10]. It can be shown that
the execution of Figure 11 is not realizable. In fact,
from Invariant 2, there is always a write that writes
into some remote read between any two checkpoints.
Therefore, r 32 must happen before c 12 . Otherwise, the
checkpoint c 12 would have replaced c 11 . Since c 12 happens
before r 12 , by the transitivity of happens-before,

r 32 happens before r 12 . By a similar argument, r 12 happens
before c 33 and hence happens before r 32 . These
two facts together lead to a cycle in the happens-before
relation.

