﻿Detecting Application-Level Failures in Component-based Internet Services
Abstract
Pinpoint is an application-generic framework for using
statistical learning techniques to detect and localize likely
application-level failures in component-based Internet services.
Assuming that most of the system is working most
of the time, Pinpoint looks for anomalies in low-level behaviors
that are likely to reflect high-level application faults, and
correlates these anomalies to their potential causes within the
system. In our experiments, Pinpoint correctly detected and
localized over 70-88% of the faults, depending on the type
of fault, we injected into our testbed system, as compared
to the 50-70% detected by current techniques. By demonstrating
the applicability of statistical learning and providing
an application-generic platform on which additional machine
learning techniques can be applied to the problem of fast failure
detection, we hope to hasten the adoption of statistical
approaches to dependability for complex software systems.
1 Introduction
A significant part of recovery time (and therefore availability)
is the time required to detect and localize service failures.
A 2003 study by Business Internet Group of San Francisco
(BIG-SF)[7] found that of the 40 top-performing web sites
72% had suffered user-visible failures in common functionality,
such as items not being added to a shopping cart or an
error message being displayed. These application-level failures,
or brown-outs, include failures where only part of the
functionality of a site goes down, and failures where functionality
is only intermittently unavailable to end-users. Figure
1 shows one brown-out the authors have encountered. Our
conversations with Internet service operators confirm that detecting
and localizing these failures is a significant problem:
one large site estimates that about 93% of the time they spend
recovering from application-level failures is spent detecting
(75%) and diagnosing them (18%) [12, 2]. Other sites agreed
that brown-outs can sometimes take days to detect, though
they are usually repaired quickly once found. This situation
Emre Kıcıman and Armando Fox
{emrek, fox}@cs.stanford.edu
1
has a serious effect on the overall reliability of Internet services:
a study of three sites found that earlier detection might
have mitigated or avoided 65% of reported user-visible failures
[31].
We present Pinpoint, an application-generic framework for
monitoring component-based Internet services, and detecting
and localizing application-level failures without requiring
a priori knowledge about the application. Pinpoint’s insight
is that failures that affect user-visible behavior are also
likely to effect a system’s internally visible behaviors. By
monitoring low-level behaviors that are closely tied to application
functionality, Pinpoint develops a model of the normal
patterns of behavior inside a system. When these patterns
change, Pinpoint has high confidence that the application’s
functionality has also changed, indicating a potential failure.
Specifically, Pinpoint monitors inter-component interactions
and the shapes of paths (traces of client requests that
traverse several components) to quickly build a dynamic and
self-adapting model of the “normal” behavior of the system.
Once it notices an anomaly, Pinpoint correlates it to its probable
cause in the system, a set of likely faulty components.
We recognize that the most rigorous test of Pinpoint’s usefulness
is through validation in a real Internet service environment
and are currently deploying Pinpoint in such an environment
for exactly this reason. In this paper, however, we
take advantage of a controlled environment to test Pinpoint
with a completeness that would be difficult to reproduce in a
live service. By methodically injecting a wide-variety of failures,
including source-code bugs, Java exceptions and others
into all the components of several applications running on our
testbed cluster, we evaluate how well Pinpoint discovers failures
and characterize Pinpoint’s strengths and weaknesses.
We introduced path-analysis for localizing failures in Internet
services in [14], and in [13, 12] broadened our pathanalysis
techniques to show how this visibility into a system
can aid fault management, fault impact analysis, and evolution
management. The primary differentiator between Pinpoint
and Chen et al.’s fault diagnosis in [11] is that Chen
assumes the pre-existence of labeled data (e.g., failed or suc-
Figure 1: Instead of a complete itinerary, this screenshot of an airline
site shows no flight details at all. A naive fault monitor, however,
would find no problems: the site responds to pings, HTTP requests,
and returns valid HTML.
cessful). In contrast Pinpoint assumes no prior knowledge of
faults, and starts with fault detection. While we briefly discussed
fault detection using path-shape analysis in Section
4.1.2 of [12], this paper provides the details of the behavior
modeling and anomaly detection algorithms for both pathshape
analysis and component-interaction analysis, and evaluates
when and how these fault detection algorithms work.
Request paths have also been used for black-box monitoring
for performance troubleshooting [1] and performance prediction
and debugging [4]. In contrast to those efforts, we
focus on the initial detection of failures and also focus on
failures in application functionality rather than performance
problems. The contributions of this paper are as follows:
1. We identify two kinds of system behaviors—pathshapes
and component interactions—that are easy to observe
with application-generic instrumentation and serve
as reliable indicators of changes in high-level application
behavior.
2. We show how existing statistical analysis and machine
learning techniques for anomaly detection and classification
can be used to monitor these behaviors to discover
failures without a priori knowledge of the correct behavior
or configuration of the application.
3. We evaluate these techniques by integrating Pinpoint
with a popular middleware framework (J2EE) for building
Internet services. We inject a wide variety of failures
into several J2EE applications, and measure how well
Pinpoint detects and localizes the faults. We also test
Pinpoint’s susceptibility to false positives by performing
a set of "normal" changes, such as a software upgrade.
Pinpoint does not attempt to detect problems before they
happen. Rather, we focus on detecting a failure as quickly as
possible after it occurs, to keep it from affecting more users
2
and to prevent cascading faults. Pinpoint does not provide
a detailed root-cause diagnosis, but rather localization of the
fault within the system. Combined with a simple generic recovery
mechanism such as microreboots [8], simply knowing
the location of a fault is often sufficient for fast recovery.
Section 2 describes Pinpoint’s approach to detecting and
localizing anomalies, and the two types of low-level behaviors
that are monitored—path shapes and component interactions.
Section 3 explains in detail the algorithms and data
structures used to detect anomalies in each of these behaviors.
In Section 4, we describe an implementation of Pinpoint
and our testbed environment, and present experimental evaluation
in Section 5, including both an evaluation of Pinpoint’s
effectiveness at discovering failures and its resilience to falsepositives
in the face of normal changes in behavior. We conclude
by discussing related work and future directions.
1.1 Current Monitoring Techniques
To be most useful in a real Internet service, a monitoring technique
must be:
Accurate: A monitoring service should correctly detect
and localize a broad range of failuree. Ideally, it would catch
never-before-seen failures anywhere in the system.
Few false-alarms: The effort and cost to respond to falsealarms
should be much less than the benefit provided by early
detection of true faults. This cost depends on both the falsepositive
rate of the monitoring system and its interaction with
the human operators and/or an automatic recovery system.
Deployable: A monitoring service must be easy to develop
and maintain, even as the monitored application evolves.
From our discussions with Internet service operators,
we find that existing detection methods fall into three
categories[2]. First, low-level monitors are machine and protocol
tests, such as heartbeats, pings, and HTTP error code
monitors. They are easily deployed and require few modifications
as the service develops; but these low-level monitors
miss high-level failures, such as broken application logic or
interface problems.
Secondly, application-specific monitors, such as automatic
test suites, can catch high-level failures in tested functionality.
However, these monitors usually cannot exercise all interesting
combinations of functionality (consider, for example, all
the kinds coupons, sales and other discounts at a typical ecommerce
site). More importantly, these monitors must be
custom-built and kept up-to-date as the application changes,
otherwise the monitor will both miss real failures and cause
false-alarms. For these reasons, neither the sites that we have
spoken with, nor those studied in [31] make extensive use of
these monitors.
Thirdly, user-activity monitors watch simple statistics
about the gross behavior of users and compare them to historical
trends. Such a monitor might track the searches per
second or orders per minute at a site. These monitors are
generally easy to deploy and maintain, and at a site with many
users, can detect a broad range of failures that affect the given
statistic, though they do not often give much more than an indication
that something might have gone wrong. Since these
monitors are watching user behavior, they can generate false
alarms due to external events, such as holidays or disasters.
Finally, customer service complaints from users are the catchall
fault detectors.
2 Pinpoint Approach
With Pinpoint, we attempt to combine the easy deployability
of low-level monitors with the ability of the higher-level monitors
to detect application-level faults. This section describes
our assumptions and our approach.
2.1 Target System
Pinpoint makes the following assumptions about the system
under observation and its workload:
1. Component-based: the software is composed of interconnected
modules (components) with well-defined narrow
interfaces. These may be software objects, subsystems
(e.g., a relational database can be thought of as
a single large black-box component), or physical node
boundaries (e.g., a single machine running one of the
Web server front-ends to an Internet service).
2. Request-reply: a single interaction with the system is
relatively short-lived, and its processing can be broken
down as a path, a tree of the names of components that
participate in the servicing of that request.
3. High volume of largely independent requests (e.g., from
different users): combining these allows us to appeal to
“law of large numbers” arguments justifying the application
of statistical techniques. The high request volume
ensures that most of the system’s common code paths
are exercised in a relatively short time.
In a typical large Internet service, (1) arises from the service
being written using one of several standard component
frameworks, such as .NET or J2EE, and from the threetier
structure (Web servers, application logic, persistent store)
[25] of many such services. (2) arises from the combination
of using a component-based framework and HTTP’s requestreply
nature. (3) arises because of the combination of large
numbers of (presumably independent) end users and highconcurrency
design within the servers themselves.
2.2 Observation, Detection, Localization
The Pinpoint approach to detecting and localizing anomalies
is a three-stage process of observing the system, detecting
3
anomalies in its behavior, and correlating these anomalies to
a probable cause.
1. Observation: We capture the runtime path of each request
served by the system: an ordered set of coarsegrained
components, resources, and control-flow used
to service the request. From these paths, we extract two
specific low-level behaviors likely to reflect high-level
functionality: component interactions and path shapes.
2. Analysis: We build a dynamic model of the normal behavior
of an application with respect to component interactions
and path shapes, under the assumption that most
of the system is working correctly most of the time. Behaviors
that are anomalous with respect to the dynamic
model are flagged as possible failures.
3. Localization: We attempt to identify what features of
anomalous paths (components in the path) are correlated
with (predictive of) the observed anomaly; these
are marked as suspected-faulty components. A separate
policy agent is responsible for deciding how to respond
to detected anomalies.
To capture the runtime paths of requests in the observation
phase, we concentrate on instrumenting standard middleware
frameworks such as J2EE, so that any componentized application
using the middleware is automatically instrumented.
The middleware layer has the semantic context to directly
identify and track request paths. In contrast, Magpie instruments
the operating system, and uses a separate middlewareor
application-specific schema to add the necessary semantic
context to generate complete request paths[4].
To improve resilience against changes in the workload mix
presented to an Internet service, we “bin” our request paths by
request type, and separately extract normal models for each
context. The degree of resilience is determined by the quality
of the binning function. In our testbed we use the URL of a
request; a more sophisticated classifier might also use URL
arguments, cookies, etc.
In the analysis phase, Pinpoint first builds a dynamic model
of “normal” behavior under the assumption that “the system
is mostly working correctly most of the time,” and then looks
for deviations from this model. Note that this assumption
does not require the training model to be fault-free. Dynamically
building accurate models quickly is practical because of
the high traffic and large number of independent requests: a
large fraction of the service’s code base and functionality is
exercised in a relatively short amount of time. This is not true
for some other applications of anomaly detection, such as intrusion
detection in multi-purpose or lightly-used servers, in
which it is not reasonable to assume that we can observe a
large volume of requests.
We use historical analysis to look for anomalies in components
relative to their past behavior, and peer analysis to
look for anomalies relative to the current behaviors of their
replicated peers. Historical analysis can detect acute failures,
but not those that have always existed; peer analysis, which
only works for components that are replicated, is resilient to
external variations that affect all peers equally (such as workload
changes), but a correlated failure that affects all peers
equally will be missed. Steady-state failure conditions affecting
the whole system would not be detected by either type of
analysis.
In the localization phase, Pinpoint uses decision tree analysis
to discover the components that are most highly correlated
with a particular anomaly [11].
3 Algorithms and Data Structures
From our observations, we extract two behaviors: path shapes
and component interactions. The techniques we use to model
these behaviors and use these models to detect anomalies are
detailed in the rest of this section.
3.1 PCFGs Model Path Shapes
The shape of a path is the ordered set of logical software components
(as opposed to instances of components on specific
machines) used to service a client request [12]. We represent
the shape of a path in a call-tree-like structure, except that
each node in the tree is a component rather than a call site
(i.e., calls that do not cross component boundaries are hidden).
We represent the normal behavior of these paths as a
probabilistic context-free grammar (PCFG) [29], a structure
used in natural language to calculate the probabilities of different
sentences being generated by a particular language and
the probabilities of different parses of a sentence. A PCFG
consists of a set of grammar rules, Rij : N i → ζj , where
N i is a symbol in the grammar and ζ j is a sequence of zero
or more symbols in the grammar. Each grammar rule is annotated
with a probability P (Rij), such that ∀i �
j Rij = 1.
The probability of a sentence occurring in the language represented
by that grammar is the sum of the probabilities of
all the legal parsings of that sentence. We collect a number
of request paths, extract their path shapes, and treat each path
shape as the parse tree of a sentence in a hypothetical grammar,
using the component calls made in the path shapes to
assign the probabilities to each production rule in the PCFG.
Figure 2 shows an example of a trivial set of path shapes and
the corresponding PCFG.
To build a historical model of behavior, we build a PCFG
based on the system’s behavior in the past—for example, we
might use a model based on the system’s behavior during the
same time last week. For peer analysis, we wish to detect
paths that are anomalous in comparison to the other paths we
are seeing and build a PCFG from current path shapes. In
both cases, we want to be sure that our models are based on
4
R1,1 : S → A p = 1.0 R3,2 : B → C p = 0.2
R2,1 : A → B p = 0.66 R3,3 : B → CB p = 0.2
R2,2 : A → BB p = 0.33 R3,4 : B → $ p = 0.2
R3,1 : B → CC p = 0.4 R4,1 : C → $ p = 1.0
Figure 2: Top left: a set of three inter-component call paths through
a system consisting of three component types (A, B, C). Top right:
Call paths 1 and 2 have the same shape, while 3 is different. Bottom:
PCFG corresponding to this collection of paths. S is the start
symbol, $ is the end symbol, and A, B, C are the symbols of the
grammar.
enough observations that we capture most of the different behaviors
in the system.
To determine whether a subsequently-observed path shape
is anomalous, we start at the root of the tree of component
calls in a path shape and compare each transition to its corresponding
rule Rij in our PCFG. The score of the path shape
is
�
∀Rij∈t
min(0, P (Rij) − 1/ni) (1)
where t is the path shape being tested, and n is the total
number of rules in our PCFG beginning at N i . This function
scores the deviation of a path shape’s transitions from the
expected probability of a transition beginning at each symbol
N i . The simple intuition behind this scoring function is
that low probability transitions are not necessarily anomalous.
Consider a PCFG with 100 equally probable rules beginning
with N i and probability 0.005 each, and 1 rule with probability
0.5. Rather than penalize the low-probability 0.005 transitions,
this scoring mechanism will calculate that they deviate
very little from the expected probability of 1/101. Figure 3
shows that this scoring function does a good job of separating
normal paths from faulty paths.
Once we have scored our path shapes, if more than αn
paths score above the (1−n) th percentile of our normal distribution,
we mark these paths as anomalous. For example, any
path with a score higher than any we have seen before (i.e.,
above the 100th percentile) will be marked as anomalous.
Similarly, if α is 5 and 1% of paths suddenly score higher
than our historical 99.9th percentile, we will mark these 1%
of paths as anomalous, since 0.01 > 5 ∗ (1 − 0.999). The α
coefficient allows us some degree of control over our the ratio
of false-positives to true-positives. All other things being
equal, we would expect to have less than 1
α of our anomalous
paths be false-positives when a failure occurs.
Num requests
100
80
60
40
20
Successful requests
0
0 0.05 0.1
Score
0.15 0.2
Num requests
100
80
60
40
20
Successful requests
Failed requests
0
0 0.05 0.1
Score
0.15 0.2
Figure 3: The left graph is a histogram of request scores during
normal behavior. The right shows the scores when we cause part
of the system to misbehave. Some requests fail as a result, and are
clearly separated from the successful requests by our scoring algorithm.
This data is taken from our experiments with the Petstore 1.3,
described in Section 4.
3.2 Modeling Component Interactions
The second low-level behavior that we analyze is component
interactions. We model the behavior of a component A as a
set of weighted links between A and other components. Each
link is weighted by the proportion of runtime paths that enter
or leave A along the link. The intuition is that if these
weights change, the functional behavior of A is also likely to
be changing.
Component interaction analysis gives a different view of
the system than path shape analysis. Rather than inspecting
individual paths and slicing across components, looking at individual
components’ behaviors slices across all the requests
flowing through it. For example, it is normal for a passwordverification
component to occasionally reject login attempts
and path-shape analysis of a request that ended in a login
failure would not be anomalous. However, component interaction
analysis would be able to detect an anomaly if the
component was rejecting all login attempts. Conversely, an
occasional error that affects only a few requests might not significantly
change the proportions of component interactions,
but path-shape analysis would detect changes in the individual
paths.
We generate our historical model of the behavior of component
instances in the system by averaging the weights of
links through them over time. Our peer model is generated
by averaging the current behaviors of replicated peers in the
system.
We detect anomalies by measuring the deviation between a
single component’s current behavior and our model of normal
behavior using the χ 2 test of goodness-of-fit:
Q =
k� (Ni − wi) 2
i=1
where Ni is the number of times link i is traversed in our
component instance’s behavior; and wi is the expected number
of traversals of the link according to the weights in our
wi
(2)
5
1 9.41 CatalogEJB
2 1.09 ShoppingCartEJB
3 0.34 ShoppingControllerEJB
4 0.12 JspServlet
5 0.02 MainServlet
Figure 4: This list shows the top five χ 2 goodness-of-fit scores of
components after injecting an anomaly into CatalogEJB. The scores
are normalized 1 is the threshold for statistical significance. CatalogEJB,
the most anomalous, is significantly more anomalous than
other components.
model of normal behavior. Q is our confidence that the normal
behavior and observed behavior are based on the same
underlying probability distribution, regardless of what that
distribution may be. The higher the value of Q, the less likely
it is that the same process generated both the normal behavior
and the component instance’s behavior.
We use the χ 2 distribution with k − 1 degrees of freedom,
where k is the number of links in and out of a component,
and we compare Q to an anomaly threshold based on our desired
level of significance α, where higher values of α are
more sensitive to failures but more prone to false-positives,
and the . In our experiments, we use a level of significance
α = 0.005. Figure 4 shows an example output from one of
our fault detection runs.
3.3 Decision Trees Localize Anomalies
Once we have detected anomalous path shapes or component
behaviors, we localize these anomalies to their potential
causes by looking for features of the paths and components
that are highly correlated with the anomalies. We use the ID3
algorithm [32] for learning a decision tree, a data structure
that represents a discrete-valued function, where each branch
of the tree is a test of some attribute of the input, and where
the leaves of the tree hold the result of the function. For
anomalous paths, the attributes correspond to the path information
that Pinpoint collects, such as the names of EJB’s, IP
addresses of server replicas in the cluster, etc. Decision tree
learning is the process of deciding what tests to use at each
node of the tree in order to build the most accurate classification
function. The general approach to building a tree is to
calculate the entropy of the data at each node of the tree, and
choose a test that will split the data in a way that minimizes
the entropy of the child nodes.
Once we have built a decision tree, we convert it to an
equivalent set of rules, by generating a rule for each path from
the root of the tree to a leaf. We rank each of these rules based
on the number of paths that they correctly classify as anomalous.
From these rules, we extract the hardware and software
components that are correlated with failures.
The training set for our decision-tree learning is the set of
Figure 5: Left: A decision-tree learned for classifying faulty paths
in one of our experiments. Here we injected a fault into TheInventory
component on one machine in a cluster. The decision tree learning
algorithm chose the “name=TheInventory” as the most important
classifier, and the machine’s ip address as the the second. Right:
J2EE provides a three-tiered architecture for building applications.
paths classified as normal or anomalous by our PCFG detector.
The input to our target function is a path, and the output
of the function is whether or not the path is anomalous. Our
goal is to build a decision tree that approximates our observed
anomalies based on the components and resources used by the
path.
Note that decision trees can represent both disjunctive hypotheses,
meaning that we can learn hypotheses that describe
multiple independent faults, as well as conjunctive hypotheses,
meaning that we can localize failures based on multiple
attributes of a path rather than just one, i.e. caused by interactions
of sets of components rather than by individual components.
More interestingly, it allows us to avoid specifying
a priori the exact fault boundaries in the system. Rather than
assuming that we want to localize to a particular instance of
a component, we can allow the decision tree to choose to localize
to a class of components, a particular version of a component,
all components running on a specific machine, etc.
We use the same approach to construct decision trees for
component-interaction anomalies; in that case, the decision
tree uses attributes of the software and hardware stack each
component is built from or depends on.
4 Experimental Setup and Methodology
The goal of our experiments is to characterize Pinpoint’s fault
detection, faulty request identification, and localization capabilities,
and compare them to existing application-generic
techniques. To measure this, we connect Pinpoint to a widelyused
middleware system, inject various faults into applications
running on top of this middleware, and evaluate how
well Pinpoint detects and localizes our injected fault. Because
we inject faults into a live application running on enterpriseready
middleware, we have confidence that the application’s
behavior following an error is realistic, even though the fault
itself is artificially injected. To measure Pinpoint’s resilience
to marking day-to-day normal changes as false positives, we
6
applied common benign changes such as workload variations
and minor software upgrades to our testbed.
The core of Pinpoint is a plugin-based analysis engine. We
have created plugins corresponding to the anomaly-detection
and localization algorithms described in Section 3. For our
experiments we collected the application traces and analyzed
them off-line separately, due to the large number of runs; in
practice we would expect Pinpoint to be deployed as a nearreal-time
monitor. The rest of this section describes our experimental
setup.
4.1 Instrumentation
We instrumented the JBoss (www.jboss.org) open-source implementation
of the J2EE (www.javasoft.com/j2ee) middleware
standard which provides a standard runtime environment
for three-tier enterprise applications (Figure 5. In the
“presentation” or Web server tier, our instrumentation captures
the request (URL) details and invocation and returns
for used JSP pages, JSP tags and servlets. In the application
logic (J2EE server) tier, which manages and runs the
Enterprise Java Bean modules that make up the core of the
application, we capture calls to the naming directory (used
by components to find each other) and the invocation and return
data for each call to an EJB. Finally, we capture all the
SQL queries sent to the third (database) tier by instrumenting
JBoss’s Java Database Connection (JDBC) wrappers. Each
observation includes details such as the component information,
the IP address of the current machine, a timestamp, an
ordered observation number, and a globally-unique request
ID, used to correlate observations from across the cluster and
reconstruct the request’s complete runtime path.
Our instrumentation adds 1150 new lines of code spread
over 17 Java source files. If reporting observations synchronously,
our tracing increases request latency by 2 −
−40ms depending on the length of the path, and decreases
system throughput by 17%, from 35 to 29 requests/sec. When
the machine is overloaded, we drop observations rather than
hurt performance; the deployment of commercial instrumentation
packages such as IntegriTea (www.tealeaf.com) on
large sites such as Priceline.com suggests that fine-grained
instrumentation is practical if some engineering effort is focused
on the implementation.
4.2 Applications and workloads
We deployed three different applications in our testbed cluster.
PetStore 1.1 is Sun’s sample J2EE application that simulates
an e-commerce web site (storefront, shopping cart, purchase,
tracking, etc.). It consists of 12 application components
(EJBs and servlets), 233 Java files, and about 11K lines
of code, and stores its data in a Cloudscape database. Its primary
advantage is that we have been able to modify the application
to distribute its presentation and business logic across
a cluster of machines.
Petstore 1.3.1 is a significantly re-architected version that
includes a suite of applications for order processing, administration,
and supply-chain management: 47 components, 310
files, 10K lines of code. Because of the new architecture, we
were unable to cluster Petstore 1.3, but its increased functionality
makes it a useful test application.
RUBiS is an auction website, developed at Rice University
for experimenting with different design patterns for
J2EE [10]. RUBiS contains over 500 Java files and over 25k
lines of code. More relevant for our purposes, RUBiS has 6
EJBs and several servlets.
While RUBiS comes with a custom load generator, we built
our own HTTP load generator for use with our Petstore applications.
The Petstore workloads presented by our load generator
simulates traces of several parallel, distinct user sessions
with each session running in its own client thread. A session
consists of a user entering a site, performing various operations
such as browsing or purchasing items, and then leaving
the site. We choose session traces so that the overall load on
the service fully exercises the components and functionality
of the site. If a client thread detects an HTTP error, it retries
the request. If the request continues to return errors, the
client quits the trace and begins the session again. The traces
are designed to take different routes through the web service,
such that a failure in a single part of the web service will not
artificially block all the traces early in their life cycle. While
our synthetic workload is very regular our discussions with
multiple large Internet services indicate that this regularity is
realistic. One site’s aggregate user behavior at any time is
generally within 1-2% of the behavior at the same time the
previous week, with the exception of major events such as
holidays or disasters.
When we deploy these applications, we run our observation
collector, the application, the database, and the load generator
on separate machines. Our clustered version of Petstore
1.1 runs with one front-end node running the code to
handle presentation, and three application-logic nodes.
4.3 Fault and Error Load
The goal of Pinpoint is to detect and localize application-level
failures, defined as those that affect application-level behavior
but do not cause obvious failures at lower-levels of the system
stack. We surveyed studies of faults in deployed systems
and the faults injected by other researchers in their experiments
[31, 23, 16].
Java exceptions: Because Java coerces many different
kinds of failures, from I/O errors to programmer errors, to
manifest as exceptions, injecting exceptions is an appropriate
method of testing an application’s response to real faults. To
test an application’s response to both anticipated and possibly
unexpected faults, we inject both exceptions that are declared
in component interfaces and undeclared runtime exceptions.
7
Fault
Type
Description NUM MAJ CAS
Exception Force a component to
97 63% 66%
Omission
throw a declared or runtime
exception
Intercept and ignore com-
47 67% 62%
faults
ponent calls.
Source
Insert common errors into
41 88% 52%
code bugs application code.
Table 1: Injected fault types including the number(NUM) of faults
we injected in this category, the percentage of these faults that
caused major(MAJ) failures, and the average degree of fault cascading(CAS).
For example, if we inject a fault into 3 requests, and
see a fourth request fail as well, we say there is a 33% cascade.
Omission Faults: We intercept a method call and simply
omit it. If the function should have returned a value, we return
0 or a null value.
Source code bug injection: Even simple programming
bugs remain uncaught and cause problems in real
software[35, 33], so introducing them can be a useful method
of simulating faults due to software bugs [15].
We do not inject low-level hardware or OS faults, such as
CPU register bit-flips, memory corruptions and IO errors because,
empirically, these faults do not manifest as applicationlevel
failures that would otherwise go unnoticed [9].
We expect that exceptions and omissions are extremely
likely to effect the structural behavior of an application, while
source code bugs may or may not cause change application
structure. By injecting this range of faults, we test both
whether our algorithms detect anomalies when the application’s
structural behavior changes, and whether more subtle
faults that cause user-visible errors are also likely to change
the application’s structural behavior.
Together, these injected faults cause a variety of errors. As
an example of a mild failure, faults injected into a the InventoryEJB
component of Petstore 1.1 are masked by the application
such that the only user-visible effect is that items are
perpetually “out of stock”. At the other end of the spectrum,
injecting a fault into the ShoppingController component in
Petstore 1.3 prevents the user from seeing the website at all,
instead displaying an internal server error for all requests.
In our experiments, we concentrate on faults that cause
user-visible errors. We verify this in our Petstore 1.1 and
1.3 experiments by modifying our load generator verify all
the HTML output with MD5 hashes from fault-free runs. To
make this verification feasible, we force our dynamic applications
to produce mostly deterministic output by resetting all
application state between experiments and running a deterministic
workload. Any remaining non-deterministic output,
such as order numbers, are canonicalized before hashing.
Even though we are in control of the fault injection it is
not trivial to determine which client requests in an experi-
ment failed. Component interactions and corrupted state or
resources can all lead to cascading failures. As our groundtruth
comparison, we mark a request as having failed if we
directly inject a fault into it, if it causes an HTTP error, or if
its HTML output fails our MD5 hash. When a fault affects
more than 1% of the requests in an experiment, we call it a
major fault.
No fault injection scheme can accurately mimic the variety
and effects of failures that occur in the real world. However,
given the number and breadth of faults we have injected into
each of our three applications—185 distinct faults in total—
we are confident that our experiments capture a wide range of
realistic fault behaviors.
4.4 Comparison Monitors
To better evaluate Pinpoint’s ability to detect failures, we
compare it to several low-level monitors, as introduced in
Section 1.1. Our own load generator doubles as an HTTP
monitor as well as an HTML monitor. It scans the HTML
content being returned to users for obvious signs of errors
and exceptions; in our case, whether the keywords “error”
and “exception” appear in the HTML text.
We considered detecting faults by monitoring for exceptions
being thrown. However, exceptions are used for exceptional
circumstances, not simply for failures, and when
we instrumented the Java virtual machine we found that a 5min
failure-free run of Petstore 1.3.1 on JBoss generates over
62,000 Java exceptions—an unacceptable false-positive rate.
We do not include application-specific monitors in our
comparison, since deciding what application functionality to
test would have been the determining factor in detecting many
of these failures, since an application-specific monitor can be
engineered to test for almost any expected fault. Additionally,
Pinpoint’s main improvement in comparison to these monitors
is not its ability to detect failures, but the fact that Pinpoint
is a low-maintenance, application-generic solution to
high-level fault-detection. Since we do not have real users
generating workload on our site, we do not include monitors
which watch for changes in user behavior.
Log monitoring is a common error detection mechanism
in deployed systems. Though not as involved as applicationlevel
monitors, log monitors are still system- and applicationspecific,
usually requiring operators to write regular expression
searches to match potential errors in server log files. We
wrote a simple log monitor (searching for “ERROR” messages)
for our testbed and found that it detected “failures”
in almost all our experiments, including false-alarms in our
fault-free control experiments! After some study, we concluded
that distinguishing these false alarms from the true
failures was non-trivial.
8
4.5 Metrics
To measure the efficacy of Pinpoint’s fault detection and localization,
we borrow the metrics of recall and precision from
information retrieval research. When searching for items belonging
to some target population, recall measures the proportion
of the target population that is correctly returned. Precision
measures the proportion of returned items that actually
match the target population. Precision is also equal to (1-
False Positive Rate). An ideal system has recall=1 and precision=1.
The following summarizes three different aspects
of Pinpoint’s detection and localization ability, and what the
metrics recall and precision mean in each.
Fault Detection: Recall and precision measure, respectively,
how well failures in the system are detected as anomalies,
and how many of the detected anomalies are real failures.
Faulty Request Identification: Path-shape analysis, as
well as our HTTP and HTML monitors, detect whether individual
requests are faulty. Our experiments measure the recall
and precision of request identification given the existence
of an anomaly in the system.
Fault Localization: Once we notice a failure, fault localization
gives us a suspect set of locations where the fault
might be occurring. Recall and precision measure how well
this suspect set matches the true location of the fault.
Two very different kinds of false positives can affect precision.
Algorithmic false positives are due to the inherent
statistical behavior of statistical learning algorithms, models,
and thresholds. Semantic false positives occur when an algorithm
correctly (from the algorithm’s point of view) detects
an anomaly, but the anomaly does not correspond to a failure.
For example, a major system upgrade, or sudden application
mode change could trigger an anomaly yet not be an indication
of a fault. The majority of our experiments measure
recall and precision due to algorithmic false positives. Section
5.4 presents experiments to measure Pinpoint’s susceptibility
to semantic false positives.
5 Experimental Results
In this section we present a set of experiments to test Pinpoint’s
fault detection, faulty request identification, and fault
localization ability in realistic, though small, Internet services.
We use historical path-shape and component interaction
analyses to detect failures. We build our historical models
based on a 5-minute, fault-free run of the application under
our normal workload, about 800-1000 requests total.
In addition, we study Pinpoint’s ability to detect common
source code bugs, and test how Pinpoint’s semantic false positive
rate may fare in the face of “normal” anomalies, such as
workload changes and software upgrades.
Figure 6: An example of a normal path and an anomalous path
discovered by our PCFG detection technique.
5.1 Does Pinpoint Detect Failures?
As an example, Figure 6 shows one normal and one anomalous
path for the commitorder request type. These two paths
are obviously different, but it is not immediately obvious that
the one on the right, asking for the user to sign in, is wrong.
By inspecting the user’s session trace, however, it becomes
clear that it is in fact a failure. Prior to coming to this page,
the user has already created a new account, logged in, and is
on the last stage of purchasing when he is asked again to sign
in. This is a cascaded fault, where a fault we injected during
the account creation process, caused the system loses track of
this user’s session information. Even though this request is
not the primary fault, it is useful to recognize to analyze the
impact of faults.
The results of our systematic fault detection experiments
are summarized in Table 2 1 . Both path-shape analysis and
component-interaction analysis performed well when detecting
89-100% of major faults. The exception was source code
bug injections, which we discuss in detail in Section 5.2.
The detection rates for Pinpoint’s analyses are quite stable
across different applications and different fault types. Source
code bug injection is the only bug that is not detected as well
as the others. Even most minor faults are detected by our
combined algorirthms. Together, they detect 107 of the 122
exceptions and omissions we injected. In contrast, the efficacy
of checking the HTTP error codes and HTML output for
errors varies significantly by application. What works well
for Petstore 1.1 does not appear to work well for Petstore 1.3.
Even used together, the HTTP and HTML monitors detect
only 88 of the 122 injected exceptions and omissions. All of
these faults are also detected by our path-shape and component
interaction analyses.
While our analyses do not do as well detecting failures that
affect less than 1% of the system, it is possible that this is an
1 The experimental traces for RUBiS were generated with its own appspecific
load generator, which does not include an HTTP or HTML monitor.
We are currently porting our monitoring to RUBiS’s load generator
9
Figure 7: The recall and precision rates for detecting failing requests
for the request-oriented fault monitors, sorted by injected
fault type. Each point represents the precision and recall of one
fault-injection experiment. When points overlap, the background
gradient shows their density. The dark spots at the origins of the
HTTP and HTML monitor plots indicates that in many experiments,
none of the faulty requests are being detected.
Figure 8: The precision and recall of discovering faulty requests
with path-shape analysis as we vary α.
artifact of our experimental setup. With only 1000 requests
per experiment, a minor failure will affect less than 10% of
requests, and may not be noticeable by our dynamic thresholding
algorithm, or statistically significant for our χ 2 testing.
We are planning to investigate this further by running longer
experiments for our minor faults.
Overall, our path-shape analysis does a good job of detecting
faulty requests without detecting false positives. It is
worth noting that the faulty request identification precision
and recall values in this section are during anomalous periods.
Because of our dynamic thresholding, we can catch
most faulty requests during these times, (even if their PCFG
scores are individually acceptable), and avoid detecting false
positives when the system is behaving normally.
In Figure 8, we investigate how adjusting the α parameter
affects the recall and precision of our path-shape analysis. As
we raise α, we lower recall and improve precision, from a
median of p = 14%,r = 68%|α = 1 to p = 93%,r =
34%|α = 8. Note that we do not have to recall all the faulty
requests to detect a fault in the system. Not including our
source code bug expts, our path-shape analysis detects 83%
of the faults when α = 8.
Monitor Declared Exc. Runtime Exc. Omissions Src-PS 1.3 PS 1.1 PS 1.3 RUBiS
Path Shape, α = 2 78% / 93% 90% / 96% 90% / 100% 12% / 50% 61% / 92% 90% / 91% -
Path Shape, α = 4 75% / 89% 90% / 96% 85% / 96% 7% / 37% 59% / 89% 84% / 88% 68%/-
Comp. Interaction 56% / 89% 68% / 96% 70% / 96% 12% / 37% 44% / 89% 84% / 88% 83/-%
HTTP Errors 48% / 64% 53% / 70% 44% / 65% 10% / 37% 43% / 83% 28% / 32% -
HTML Monitoring 28% / 40% 24% / 35% 17% / 20% 2% / 13% 2% / 6% 66% / 72% -
Table 2: For each monitor, we show how well it detected each type of fault across all our applications, and how well it detected all the faults
in each application. In each cell, the first number indicates how well we detected all faults in the category, the second rate is how well we
detected major faults in the category. Since we only injected source code bugs into Petstore 1.3, we report those faults separately, and do not
include them in the overall Petstore 1.3 report.
5.2 Source Code Bugs
Talking with Internet service operators, we find that software
upgrades occur often in large sites[2]. While a major software
upgrade will change functionality significantly and require
Pinpoint to retrain its models of system behavior, these
major upgrades occur on the order of months, while minor
software upgrades, such as bug fixes, occur on the order of
days or weeks.
To see whether Pinpoint would be more likely to detect
faults introduced by minor bugs in source code, we wrote a
source-code bug injector for the Java language. We use the
Polyglot extensible compiler framework [30] as a base, and
can inject several different simple bugs, summarized in Table
3. While these are all minor bugs, evidence suggests that
no bug is so trivial that it does not occur in real software [22].
For our experiments, we first generate an exhaustive list of
the spots where a bug can be injected within a component,
after eliminating locations in unused code. Then, we iterate
over these “bug spots” and inject one bug per run of the application.
At runtime, we record when this modified code is
exercised to track what requests may be tainted by the fault.
After running these experiments, we found that while pathshape
analysis and component interaction analysis did not do
significantly better, overall, than the http monitors. Most of
the software bugs that were injected, even though they caused
user visible changes to the output of the application, did not
cause component-level changes internally. Pinpoint’s analyses
were able to detect bugs when they did cause changes,
however. For example, one of the source code bugs we injected
kept the shopping cart from correctly iterating over its
contents. While this behavior is internal to the shopping cart
and not noticeable by itself, it also causes some JSP tags in
the “view shopping cart” page to stop iterating—since the cart
never sends its contents to be displayed. Together, the three
top monitors detected 15% of the bugs we injected, as opposed
to 7-12% individually.
5.3 Does Pinpoint Localize Failures?
The overall results of our localization tests comparing Pinpoint’s
detection and localization techniques to each other are
10
shown in Figure 9. In this figure, we show how our well our
decision-tree based localization and our component interaction
based localization fare in our experiments. We show the
results for three variants of our decision tree, each showing
how well the decision tree fares as the requests classified as
faulty become more and more “noisy”. First, we apply our
decision tree to only the faults that we injected failures into.
These results are competetive with our component-interaction
analysis—the only misdiagnoses or false positives that occur
are due the structure of the application itself. For example,
the decision tree cannot distinguish between two components
that are always used together. Also, there exist components
which appear to correlate very well with some failures, hiding
the true cause of a fault.
Second are the results for our decision tree applied to the
requests known to have been injected with a fault or affected
by a cascading fault. These results are noisier, and introduce
what we measure as misdiagnoses when we actually diagnose
the cascaded fault. Finally, the results for our end-toend
PCFG and decision tree show the highest miss rate, as
we contend with noise both from the PCFG selection mechanism
and the cascaded faults. Not represented on this graph,
but worth noting is that in our clustered experiments with Petstore
1.1, when the decision tree was not able to pinpoint the
exact instance of a component that was causing the problem,
it was still able to narrow down the problem, either to the correct
class of components, or to the machine the faulty component
was running on.
From this we conclude that a decision tree’s ability to localize
failures depends heavily on the noise in the traces. Here,
the decision tree’s localization capability drops when we add
cascaded failures and false positives from runs of the pathanalysis
algorithm. This indicates that heuristics for separating
primary from cascaded faulty requests, such as picking
the first anomalous request from a user’s session as the primary
fault, are likely to improve the performance of decisiontree
localization in real deployments.
5.4 Semantic False positives
To test Pinpoint’s resilience against erroneously marking
these “normal changes” as anomalies, we ran two experi-
Bug Type Description Good code Bad code Num
Loop errors Inverts loop conditions; loop{ while(b){stmt;} loop{ while(!b){stmt;} 15
Mis-
Replaces the left-hand-side of an assign- i = f(x); j = f(x); 1
assignmentment
with a different variable
Misinitialization
Clear a variable initialization. int i = 20; int i = 0; 2
Mis-reference Replaces variables in expressions with a
different but correctly typed variable
Avail = InStock − Ordered; Avail = InStock − OnOrder; 6
Off-by-one E.g., replaces < with <= or >= with > for(i = 0;i<count;i++) for(i = 0;i<=count;i++) 17
Table 3: Overview of source code bug types and how many we injected into our applications. None are detected by the Java compiler.
Figure 9: Localization recall of our techniques per fault type.
ments. In one, we signifcantly changed the load offered by
our workload generator—we stopped sending any ordering
or checkout related requests. In our second experiment, we
upgraded the Petstore v1.3.1 to a bug-fix release, Petstore
v1.3.2. For both our path-shape and component interaction
analyses, we used a historical analysis based on the behavior
of Petstore 1.3.1 under our normal workload.
In both of these experiments, neither our path-shape analysis
nor our component-interaction analysis triggered falsepositives.
In the workload-change experiment, none of the
paths were anomalous—as to be expected, as they are all
valid paths. And though the gross behavior of our components
did change with the workload, the fact that we analyze
component interactions in the context of different types of requests
compensated for this, and we detected no significant
changes in behavior. In the upgrade to Petstore 1.3.2, we also
did not detect any new path shapes; our component behaviors
did change noticeably, but still did not pass our threshold according
to the χ 2 test. Though not comprehensive, these two
experiments suggest that our fault detection techniques are
robust against reporting spurious anomalies when application
functionality has not changed.
6 Discussion
6.1 The Base-Rate Fallacy
The base-rate fallacy declares that, when looking for rare
events, any non-zero false positive rate will overwhelm a de-
11
tector with even perfect recall. [3] argues that this makes most
existing intrusion detection systems unusable.
In the context of our broader project, Recovery Oriented
Computing, we argue that false positives are only a problem
when dealing with them is expensive. We advocate making
the cost of online repair for failures sufficiently low, such that
a reasonable degree of “superfluous recovery” will not incur
significant overhead. This has been successfully demonstrated
in the context of a storage subsystem for Internet services
[27] where, in response to another set of Pinpoint’s
anomaly detection algorithms, any replica can be rapidly rebooted
at any time without impacting performance or correctness,
and in a modified J2EE middleware [9] in which
microreboots make it almost free to restart individual components
and clear transient failures such as resource leaks and
deadlocks. Cheap recovery has another benefit for fault detection
as well: when false positives are inexpensive, its reasonable
to lower detection thresholds to catch more faults (and
more false positives), and potentially catch problems earlier.
6.2 Limitations of Pinpoint
Here, we describe two limitations of Pinpoint that we have
observed. The first is a reconfirmation of one of Pinpoint’s assumptions,
the second is an interesting dependency between
our component interaction model, and component naming
schemes.
We decided to test Pinpoint’s assumption of monitoring a
request-reply based system in which each request is a shortlived,
largely independent unit of work. We applied Pinpoint
to a remote method invocation (RMI) based application.
While RMI is a request-reply system, a single interaction between
the client and server can encompass several RMI calls,
and the unit of work is not always well defined.
We chose to apply Pinpoint to monitor ECPerf 1.1, an
industry-standard benchmark for measuring the performance
of J2EE implementations. Because running unmodified applications
is one of Pinpoint’s goals, we decided to use the
simplest definition of a unit of work, a single RMI call from
the client to the server. Unfortunately, most of the paths we
captured were single component calls, with no structure behind
them. When we injected faults into ECPerf, there was
no observable change in the path, since there was very little
behavior in the path in the first place. While path-analysis did
detect some anomalies in the system when a fault occurred,
they were not in the “requests” we injected with faults (presumably
some state in the application was corrupted and affected
later RMI calls). To generalize Pinpoint to this kind
of system, we should expand our definition of a path to encompass
multiple interactions, though this will likely entail
application-specific tailoring.
We found another limitation when we were monitoring the
clustered version of Petstore 1.1. One of the components we
monitored was the middleware-level naming service. This
service has one behavior mode in the web tier (where it
mostly initiates name lookups) and another in the application
tier, where it mostly replies to lookups. When a component
exhibits multiple modes of behavior depending on its physical
location, our component interaction model attempts to capture
a non-existent average of the multiple modes, and subsequently
detects all the components as deviating from this
average!
One possible solution is to use a model that captures
multiple modes of behavior, though this has the danger of
mis-classifying truly anomalous behavior as simply “another
mode.” Another, possibly more attractive, option is to extend
the component-identification scheme to differentiate between
components placed, for example, in the web tier vs the backend,
thus allowing Pinpoint’s analysis to build separate models
for each. The difficulty is to ensure the naming scheme
captures the right details, without splitting valid component
groups apart to the extent that the models lose their ability to
validate behaviors across many peers.
6.3 Other Statistical Learning Techniques
There exist many more statistical learning algorithms that
may appropriate for analyzing the structural behavior of a
system. Here is a brief description of some alternatives we
have tried.
As an alternative to our current PCFG scoring function,
based on measuring the deviation from expected probabilities,
we also tried and discarded two scoring functions used
in natural language processing: first, taking the product of the
probabilities of all the transition rules, and second taking their
geometric mean. While the former had an unacceptable bias
against long paths, the latter masked problems when only a
small number of improbable transitions were exercised.
Recently, together with Peter Bodik, we have used standard
support vector machine (SVM) to detect anomalies in the path
shapes and component interaction we captured in our experiments.
While SVM worked as well as our χ 2 test of goodness
of fit for detecting anomalies in component behaviors, standard
SVM techniques did not work well when analyzing pathshape
behaviors. A core step in an SVM analysis is to compute
the similarity score (a dot-product-like function called a
12
kernel method) between two paths. However, in the case of
path-shapes, a good anomaly detector must not only detect
differences in a path, but also estimate how significant those
differences are. While our PCFG scoring method does this
with its calculation of the deviation from expection, the standard
SVM tree-comparison techniques do not. We are currently
investigating whether any feature-weighting additions
to SVMs might improve the SVM analysis of path-shapes.
As an alternative to decision-trees, we have used data clustering
to correlate components with a failure[14]. While this
worked reasonably well for single-component faults, we believe
the decision-tree model can more naturally expresses
localizations in the presence of multiple independent failures
and latent faults.
7 Related Work
Magpie [4] captures path-like dynamic control flow of a
program to localize performance bottlenecks. Magpie uses
stochastic context free grammars to model system behavior
at a lower level than we do, focusing on resource consumption
and performance modelling. Several commercial products
provide request tracing facilities for J2EE systems; PerformaSure
(www.sitraka.com/software/performasure), AppAssure
(www.alignmentsoftware.com) are applied in predeployment
testing and IntegriTea (www.tealeaf.com) can be
applied to a live system, validating our position that it is practical
to record paths at a fine level of detail. As far as we
know, none of these tools performs application-level failure
detection or failure diagnosis.
Localizing failures has also been challenging. Eventcorrelation
systems for network management [34, 5] and
commercial problem-determination systems such as Open-
View (www.hp.com/openview) and Tivoli (www.tivoli.com)
typically rely on either expert systems with human-generated
rules or on the use of dependency models to assist in fault
localization [39, 17, 20]. Aguilera et al. [1] have used packetsniffing
and statistical analysis to derive the call-graphs between
black-boxes. In contrast to the direct tracing done by
Pinpoint, this only produces a view of the majority behavior
of the system, and thus hides any anomalies existant within
the system. Brown et al. [6] have also used dynamic observation
to automatically build such dependency models. This
approach can produce a rank-ordered list of potential causes,
but they are intrusive and require a human to first identify the
components among which dependencies are to be discovered.
In contrast, Pinpoint can identify the root cause (modulo the
coverage of the workload) non-intrusively and without requiring
human identification of components.
Anomaly detection has gained currency as a tool for detecting
“bad” behaviors in systems where many assumedgood
behaviors can be observed, including intrusion detection
[18, 36], Windows Registry debugging [37], finding bugs
in system code [19], and detecting possible violation of run-
time program invariants regarding variable assignment [21] or
or assertions [26]. Although Ward et al. previously proposed
anomaly detection as a way to identify possible failures for
Internet sites [38], they start with a statistical model based
on 24 hours of observing the system, whereas Pinpoint builds
and adjusts its model dynamically.
8 Future Directions
Pinpoint detects injected and secondary faults in realistic but
small-scale test applications. The value of path-based analysis
for failure management has been demonstrated in one
production Internet service already [12], and we are currently
working to deploy with a large Internet service to apply Pinpoint
monitoring to their systems. In addition, versions of
Pinpoint have been integrated as fault monitors for two systems
at Stanford [27, 24]. Pinpoint has also been distributed
to outside researchers as a basis for new experimentation, and
is available upon request.
In addition to path shapes and component interactions, we
are investigating additional lower-level behaviors that could
be analyzed to reveal information about different high-level
behaviors, such as database access patterns.
We believe Pinpoint’s techniques may be applicable to
other types of systems where building models of application
behaviors and policies to detect anomalies and inconsistencies.
Applications running on overlay networks, or peer-topeer
applications, are potential targets, though in a widelydistributed
application, centralizing the information for analysis
would be more challenging. Sensor networks are a useful
subclass of peer-to-peer systems whose applications are
often data-analysis-centered by nature and in which dataaggregation
machinery is already being put in place [28],
making sensor nets a potential appealing target as well.
9 Conclusions
Pinpoint’s key insight is that aggregating low-level behavior
over a large collection of requests, using it to establish
a baseline for “normal” operation, and then detecting anomalies
with respect to that baseline is an effective way to detect
a variety of faults. Indeed, the key strength of machine
learning techniques is to identify patterns and deviations from
those patterns, from large collections of undifferentiated data.
A large number of independent observations is necessary for
this approach to work (and in particular to allow the baseline
model to be built), but we argue that in today’s componentized
applications this is a safe assumption. As long as the
system is working mostly correctly most of the time, the baseline
model can be extracted from this very behavior. Furthermore,
even complex services provide a fairly limited number
of discrete interactions to the user, so the number of “legitimate”
code paths tends to be much smaller than the number
13
of possible code paths, which gives further intuition behind
the success of the approach.
We showed that the specific techniques of PCFG’s and
decision tree learning to analyze component interaction and
path shapes can yield information about a variety of realistic
transient faults, with no a priori application-specific knowledge.
This approach combines the generality and deployability
of low-level monitoring with the sophisticated failuredetection
abilities usually exhibited only by applicationspecific
high-level monitoring.
Finally, by exploiting the fact that many interactive Internet
services are being built on standard middleware platforms,
we can apply these techniques without modifying the applications
themselves, with minimal impact on the applications’
normal throughput.
We believe Pinpoint represents a useful addition to the roster
of dependability-related uses of statistical anomaly detection,
and hope to more deeply explore its potential with our
ongoing work.
References
[1] M. K. Aguilera, J. C. Mogul, J. L. Wiener, P. Reynolds, and
A. Muthitacharoen. Performance debugging for distributed
systems of black boxes. In Proc. 19th ACM Symposium on Operating
Systems Principles, Bolton Landing, New York, 2003.
[2] Anonymous. Personal Comm. We are in the process of obtaining
disclosure from several individuals., 2004.
[3] S. Axelsson. The base-rate fallacy and its implications for the
difficulty of intrusion detection. In Proceedings of the 6th ACM
conference on Computer and communications security, pages
1–7. ACM Press, 1999.
[4] P. Barham, R. Isaacs, R. Mortier, and D. Narayanan. Magpie:
real-time modelling and performance-aware systems. In 9th
Workshop on Hot Topics in Operatings Systems (HotOS IX),
May 2003.
[5] A. Bouloutas, S. Calo, and A. Finkel. Alarm correlation and
fault identification in communication networks. IEEE Transactions
on Communications, 1994.
[6] A. Brown, G. Kar, and A. Keller. An active approach to characterizing
dynamic dependencies for problem determination in
a distributed environment. In Seventh IFIP/IEEE International
Symposium on Integrated Network Management, Seattle, WA,
2001.
[7] Business Internet Group San Francisco. The black friday report
on web application integrity, 2003.
[8] G. Candea, J. Cutler, A. Fox, R. Doshi, P. Garg, and R. Gowda.
Reducing recovery time in a small recursively restartable system.
In The International Conference on Dependable Systems
and Networks (IPDS Track), Washington, D.C., 2002.
[9] G. Candea, S. Kawamoto, Y. Fujiki, G. Friedman, and A. Fox.
A microrebootable system: Design, implementation, and evaluation.
In submission to Usenix Symposium on Operating Systems
Design and Implementation 2004.
[10] E. Cecchet, J. Marguerite, and W. Zwaenepoel. Performance
and scalability of EJB applications. In Proc. 17th Conference
on Object-Oriented Programming, Systems, Languages, and
Applications, Seattle, WA, 2002.
[11] M. Chen, A. Zheng, J. Lloyd, M. Jordan, and E. Brewer. A
statistical learning approach to failure diagnosis. In In Proceedings
of the International Conference on Autonomic Computing,
May 2004.
[12] M. Y. Chen, A. Accardi, E. Kiciman, D. Patterson, A. Fox,
and E. Brewer. Path-based failure and evolution management.
In The 1st USENIX/ACM Symposium on Networked Systems
Design and Implementation (NSDI ’04), San Francisco, CA,
March 2004.
[13] M. Y. Chen, E. Kiciman, A. Accardi, A. Fox, and E. Brewer.
Using runtime paths for macro analysis. In 9th Workshop on
Hot Topics in Operating Systems, Kauai, HI, 2002.
[14] M. Y. Chen, E. Kiciman, E. Fratkin, A. Fox, and E. Brewer.
Pinpoint: Problem determination in large, dynamic internet
services. In The International Conference on Dependable Systems
and Networks (IPDS Track), Washington, D.C., 2000.
[15] P. M. Chen, W. T. Ng, S. Chandra, C. Aycock, C. Rajamani,
and D. Lowell. The Rio File Cache: Surviving Operating System
Crashes. In Proc. 7th International Conference on Architectural
Support for Programming Languages and Operating
Systems, Cambridge, MA, 1996.
[16] R. Chillarege and N. S. Bowen. Understanding large system
failures - a fault injection experiment". In In Proceedings of the
International Symposium on Fault Tolerant Computing, June
1989.
[17] J. Choi, M. Choi, and S. Lee. An alarm correlation and fault
identification scheme based on osi managed object classes. In
Proc. of IEEE Conference on Communications, 1999.
[18] D. E. Denning. An intrusion-detection model. IEEE Transactions
on Software Engineering, 13(2):222–232, February 1987.
[19] D. Engler, D. Y. Chen, S. Hallem, A. Chou, and B. Chelf. Bugs
as deviant behavior: A general approach to inferring errors in
systems code. In Symposium on Operating Systems Principles,
2001.
[20] B. Gruschke. A new approach for event correlation based on
dependency graphs. In Proc. of the 5th Workshop of the Open-
View University Association: OVUA ’98, 1998.
[21] S. Hangal and M. Lam. Tracking down software bugs using
automatic anomaly detection. In Proceedings of the International
Conference on Software Engineering, May 2002.
[22] D. Hovemeyer and W. Pugh. Finding bugs is easy. In submission.
http://findbugs.sf.net/.
14
[23] M.-C. Hsueh, T. K. Tsai, and R. K. Iyer. Fault injection techniques
and tools. Computer, 30(4):75–82, 1997.
[24] A. C. Huang and A. Fox. A persistent hash table with cheap
recovery: A step towards self-managing state stores. In preparation.
[25] D. Jacobs. Distributed computing with BEA WebLogic server.
In Proceedings of the Conference on Innovative Data Systems
Research, Asilomar, CA, Jan. 2003.
[26] B. Liblit, A. Aiken, A. X. Zheng, and M. I. Jordan. Sampling
user executions for bug isolation. In The Workshop on Remote
Analysis and Measurement of Software Systems, Portland, OR,
May 2003.
[27] B. Ling, E. Kiciman, and A. Fox. Session state: Beyond soft
state, March 2004.
[28] S. R. Madden, M. J. Franklin, J. M. Hellerstein, and W. Hong.
The design of an acquisitional query processor for sensor networks.
In SIGMOD, San Diego, CA, June 2003.
[29] C. D. Manning and H. Shutze. Foundations of Statistical Natural
Language Processing. The MIT Press, Cambridge, MA,
1999.
[30] N. Nystrom, M. R. Clarkson, and A. C. Myers. Polyglot: An
Extensible Compiler Framework for Java. In Proc. of the 12th
International Conference on Compiler Construction, Warsaw,
Poland, Apr. 2003.
[31] D. Oppenheimer, A. Ganapathi, and D. Patterson. Why do
internet services fail, and what can be done about it? In
4th USENIX Symposium on Internet Technologies and Systems
(USITS ’03), 2003.
[32] J. R. Quinlan. Induction of decision trees. Machine Learning,
1(1):81–106, 1986.
[33] D. Reimer, E. Schonberg, K. Srinivas, H. Srinivasan,
B. Alpern, R. D. Johnson, A. Kershenbaum, and L. Koved.
Saber: Smart analysis based error reduction. In Proceedings
of the International Symposium on Software Testing and Analysis,
Boston, MA, July 2004.
[34] I. Rouvellou and G. W. Hart. Automatic alarm correlation for
fault identification. In INFOCOM, 1995.
[35] M. Sullivan and R. Chillarege. Software defects and their impact
on system availability – a study of field failures in operating
systems. In Proc. 21st International Symposium on
Fault-Tolerant Computing, Montréal, Canada, 1991.
[36] G. Vigna and R. A. Kemmerer. Netstat: A network-based intrusion
detection approach. In Proc. of the 14th Annual Computer
Security Conference, 1998.
[37] Y.-M. Wang, C. Verbowski, and D. R. Simon. Persistent-state
checkpoint comparison for troubleshooting configuration failures.
In Proc. of the IEEE Conference on Dependable Systems
and Networks, 2003.
[38] A. Ward, P. Glynn, and K. Richardson. Internet service performance
failure detection. In Web Server Performance Workshop
held in conjunction with SIGMETRICS’98, 1998.
[39] A. Yemeni and S. Kliger. High speed and robust event correlation.
IEEE Communications Magazine, 34(5), May 1996.
15
