Performance Optimization

Chapter 25: Performace Optimzation

Optimization and Evolution
Algorithmic Optimization
Performance Transformations
Optimization in C++
Summary

In this chapter, we describe some techniques for improving the performance of object-oriented systems. While most of the ideas are fairly general, making good on them can be environment, language, compiler, and tool dependent. However, for most of this chapter, we maintain the illusion of implementation independence. As a prelude, Table 1 lists without comment a set of trade-offs that may be applied in performance optimization.

Usually Faster	Usually Slower	Usually Faster	Usually Slower
Internal	External	Hardware	Software
Storage	Computation	Direct	Indirect
Unmediated	Mediated	Fixed	Variable
Implicit	Explicit	Special Purpose	General Purpose
One Layer	Two Layers	One Message	Two Messages
Unguarded	Guarded	Immediate	Queued
Unconditional	Conditional	Computed	Symbolic
Approximate	Exact	Compile-Time	Run-Time
Optimistic	Pessimistic	Transient	Persistent
Control	Coordination	Event-Driven	Polled
Cooperation	Coordination	Point-to-Point	Routed
Dispatching	Forwarding	Local Call	Remote Call
Inline expansion	Call	Reply	Exception
Signaling	Blocking	Update	Construction
Invocation	Construction	Recycling	Construction
Chunked	One-at-a-time	Binding	Evaluation
Indexing	Searching	Lazy	Eager
Reference	Copy	Better Algorithm	Better Code

Optimization and Evolution

Optimization techniques discover special cases and make associated trade-offs in order to improve performance. Some methods trade off other software quality criteria (especially coupling) for the sake of efficiency. Since performance requirements are usually ``hard'' and quality requirements ``soft'', such trade-offs must sometimes be made.

But one would like to optimize systems without completely removing the possibility for them to evolve, be repaired, and be reused. As OO systems become larger and run longer, these concerns become increasingly important, especially considering that one of the reasons that OO systems are replacing non-OO ones is because the old ones could not be made to adapt.

General OO design methods usually ensure that performance tuning measures are at least feasible. They leave room for as wide a variation in lower-level design and implementation decisions as logically possible. Any concrete implementation class that obeys the required interface of its abstract superclass can be plugged in at any time in development. The best optimization techniques are those that continue to allow for at least some kinds of evolutionary changes without completely sacrificing performance. But if everything can change, then you cannot optimize anything. While there is a fine line here, some good compromises are available.

The stance most compatible with OO design methods is to support explicitly only those changes representing extension by addition. This preserves the ability to adapt to changes resulting from the addition of new objects, operations, and classes, but not destructive modifications of existing ones. Thus, some modifications may break optimized designs, but extensions and improvements will not. Generally, this implies avoidance of optimizations that take advantage of constraints that just so happen to hold, without being declaratively guaranteed to hold. These incidental constraints are the ones that may change. Thus, the declarative constraints, conditions, and effects listed for abstract classes are at least as useful for guiding optimization as are concrete-level considerations. In this sense, optimization is simply ``more design''.

Except where noted, the techniques described in the remainder of this chapter maintain at least limited opportunities for extensibility. However, even though these tactics are locally safe and permit extension, they are not without cost to overall design integrity. They may accentuate the extent to which future destructive changes propagate throughout a system.

Algorithmic Optimization

OOA descriptions may contain information useful for helping to locate plausible spots where clever algorithms and data structures might make a big difference. Probabilistic descriptions of ranges, use cases, and other qualitative information can direct attention to expected execution profiles. These may be used in a qualitative way during optimization efforts, directing attention to concrete components that may need redesign before being committed to. Analytic models and execution profiles of prototypes serve similar roles.

The usual route to algorithmic optimization in OO designs is subclassing. Any concrete implementation class that obeys the interface of its abstract superclass can be plugged in at any time in development. Thus, slow list-based concrete collection classes can be replaced with ones based on fast balanced trees, classes with naively constructed numerical operations may be replaced by ones based on more serious algorithms, and so on.

This strategy must be used with care for components geared toward reuse across many applications or subsystems. Different applications will have different invocation profiles. Fast processing times in one operation cannot always be traded for slow execution in another. However, subclassing may be employed to create different concrete classes that make different trade-offs. Each application may then select the one that is best suited to its profile.

Sometimes more efficient representations and algorithms may be employed only after defining superclasses that define fewer and/or weaker operations. For example, very efficient algorithms exist for implementing UnionFindSETs defining only has(elem) and destructivelyUnion(otherSet) operations [1]. This is not a pure superclass of our SET class, but is a subclass of a superclass containing just has. There are many similar twists and variations, especially for data structure and numerical classes. These kinds of structural modifications enable exploitation of the many known efficient representations and algorithms developed for such purposes.

Performance Transformations

Caching

In designs full of relays, name servers, and other mediators that perform multistage routing, the identities of ultimate recipients of routed messages may be reported to and cached by senders to streamline future communication. For example, in a model-view design that is mediated by a relay, the viewer can send back an identity-revealing message to the model on receiving its first change notice. From that point onward, the model may send notices directly to the viewer, without mediation.

Caching is used to construct proxy objects. Proxies are designed as relays that transform local requests into interprocess messages. However, they may be implemented in a way that locally holds at least some state information, allowing them to reply quickly to internal fns probing simple attributes.

Caching always requires backup strategies to recover and adapt when properties of external entities change. This normally requires notification protocols so that cached entities may inform their cachers when they change. Some OODBs provide a great deal of infrastructure supporting local caching of persistently held objects. When available, this infrastructure may be used to exploit other caching opportunities.

Embedding

An extreme version of these tactics is to force an otherwise nonlocal passive object to become local by embedding it in another. For example, in ODL, concrete passive helper objects are accessed through links from their hosts. In some cases, these objects can be directly embedded within their enclosures, in a manner similar to ``records'' in procedural programming languages. Passive components declared as own are always safe for embedding, thus may be relabeled as packed. This saves a level of indirection on access and sometimes simplifies storage management.

Most design methods actually preclude thoughtless application of this measure. Since links always refer to abstract types, the storage space needed to embed corresponding concrete objects is not knowable, so embedding is impossible. However special subclasses may be designed in order to enable embedding. Any class declaring fixed links to other abstract components may be subclassed with versions specifying concrete links for each kind of concrete class actually used. These may then be packed safely. While the new subclasses are not very extensible, their parents remain so.

Replication.

Transient embedding or caching may be obtained through replication of immutable objects across processes or even across objects. As is the case for proxies of any sort, no client processing may visibly depend on the identities of replicas, just their states and behaviors.

Homogeneous collections.

Embedding strategies may be extended to collections, in which case they are normally labeled as homogeneous. The particular case of homogeneous ARRAYs of built-in types (e.g., ARRAY[REAL]) is generally well worth special attention. These not only improve performance, but are also useful in constructing interfaces to numerical routines implemented in non-OO languages. However, except in the special case of primitives, homogeneous structures cannot be made to operate in safe ways. They run against standard policies stating that every class may be subclassed. The reason they work for primitives is that we happen to secretly know that built-ins such as REAL are not subclassable.

Protocol Strength Reduction

Communication constructs come in many ``strengths''. For example, futures are more expensive than bidirectional operations, which are in turn more expensive than one-way sends.¹ Splitting bidirectional interactions into further-optimizable callback and continuation protocols has been found to provide substantial speed improvements [6].

¹Footnote:
This is not always true. In some systems, one-way sends are actually built out of wrappers around bidirectional ones.

Similarly, exceptions are usually more expensive than sentinels. Messages that require resolution, dispatching, and/or multicast are more expensive than fixed point-to-point messages. The cheapest workable protocol should always be employed. The cheapest communication is no communication at all.

Locking

Locks and similar interference control measures inhibit concurrency, generate overhead, and sometimes lead to deadlock. These problems may be minimized through static analysis. For example, if an object is used in only two joint operations that cannot ever interfere with each other (i.e., if all possible interleavings are known to be either safe or impossible), then locking is unnecessary. More generally, operations may be divided into conflict sets that describe those pairs or sets of operations that may interfere. Locking schemes may be adjusted accordingly. The literature on such techniques is extensive; see, for example [5].

Specialization

Program implementations are at the mercy of native dispatchers and dispatching rules for most execution support. Different languages even have different rules (as well as different subclassing rules, argument formats, and so on). They do not always map precisely to design constructs. However, languages supporting dispatching are amenable to optimizations based on relatively high-level design information. These optimizations eliminate run-time uncertainty about the particular operations that need to be performed given some message. At least some messages can be resolved at design-time. These cases may be determined through static analysis of the information sitting in a given design.

This is something that a clever compiler or static analysis tool might be able to do. In fact, most of these techniques were originally reported by researchers implementing experimental compilers for the OO language SELF [9] . Except for such experimental systems, contemporary design and implementation tools do not undertake such analysis, so it is useful to know how to do this manually.

In an ODL framework, the required analyses may be seen as sets of logical inferences based on declarative class information. To illustrate, suppose there is an abstract class that is implemented by only one concrete class. In this situation, any client of the abstract class must actually be using the concrete version:

class AbstractA
  v: int;
end

class ConcreteA is AbstractA
  v: int { ... }
  inv v = 1
end

op useA(a: AbstractA) { if a.v = 1 then something else otherthing end }

It is clear that in useA, something will always be invoked. This fact can be hard-wired at design time. Surprisingly, this can be done in a relatively nondisruptive manner, by overloading another customized version of useA:

op useA(a: ConcreteA) { something }

Resolution-based dispatching is still required to route a call of useA to the concrete version. (This may require further conversions to selection-based dispatching; see Chapter 21.) But once it is invoked, the internals are streamlined away. This applies even if there is another concrete class. For example:

class AnotherConcreteA is AbstractA
  v: int { ... }
  inv v = 0
end

op useA(a: AnotherConcreteA) { otherthing; }

These optimizations have no other execution cost except storage of additional customized versions of operations. Some kind of dispatch is needed anyway to invoke the right version of an operation with multiple specializations. The heart of the trick is to replace conditionals with class membership tests and dispatches. Another way of viewing this is that customization synthesizes finer-granularity operations and classes than are otherwise present in a design, and adjusts dispatching requirements accordingly.

These improvements require far less global analysis than do other optimization techniques. You do not have to be sure that only one path is taken; you just have to notice that if it is taken, further simplifications are possible since they are all conditional on the same properties that lead to it being taken in the first place.

These forms of specialization only work when there are no subclassing errors. For example, if there were a class SubConcreteA that redefined v to return 3, the strategy would fail. But this subclass would also break the v=1 invariant. Of course, it would have been much better to create an intervening abstract class, say, AbstractAWithVEQ3. This would reflect and advertise the subclass constraints rather than adding them directly to a concrete class, or worse (but temptingly) leaving the constraints entirely implicit. However, this is the right stance only when the subclass constraints must hold, and not when they ``just happen'' to be true, and are thus overridable in subclasses.

There are several further variations on the basic idea. For example, consider the collection operation applyC, which takes an operation and applies it to all elements. This can be customized by defining individual operations that directly apply the operation to all members. Rather than calling aSet.applyC(WRAP(print)), a printAll operation can be added to a concrete SET subclass, and invoked from clients.

Encapsulation

At least at the within-process programming level, caching, embedding, strength reduction, and customization techniques are (only) sometimes more effective when the ``insides'' of target objects can be opened up to their clients. When operations are made more complex and inefficient than they would be if internal components were not hidden (e.g., due to extra forwarding and dispatching), encapsulation may be broken for the sake of performance. However, blind, random removal of encapsulation barriers may cause irreparable harm to designs. Less disruptive mechanisms exist.

Open subclassing.

As discussed in Chapter 17, constructs such as opens can be effective in removing certain encapsulation barriers in a structured fashion. These mechanisms embed (or share) concrete helper class features in ways that allow direct access by hosts, thus avoiding forwarding and indirection. Most OO programming languages support some variant of this construct. For example, in C++, private subclassing (sometimes in conjunction with friend constructs) may be used to this effect.

Inlining.

Inlining is among the most extreme forms of structured encapsulation breakdown. Inside passive components, it is sometimes possible to obtain desired effects without the overhead of a message send.

Inspection of the basic structure of an OO kernel (e.g., in Chapter 15) shows that most of the real execution of internal actions occurs only when operating on primitives. All composite operations just coordinate and relay further requests, usually ultimately reducing to primitive operations. While the notion of ``primitive'' is somewhat arbitrary, it may be safely assumed that primitive operations are executed relatively quickly.

Any program should run faster when paths to primitives are shorter. In the best case, these primitives can be directly computed by their clients through inlining. This avoids all resolution and procedure call overhead, and causes the code to run at ``machine speed'', modulo language-specific factors. In fact, inlining often results in OO programs running faster than predicted on the basis of reduced indirection overhead, since it creates code patterns that are further optimizable through standard compilation techniques. Inlining may be available in the native language or may be manually encoded.

Of course, full inlining is only possible and/or desirable in limited contexts. However, inlining opportunities often occur inside the bottlenecks representing the bulk of the ``90-10'' rule of thumb for efficiency: 90% of the time is spent in 10% of the code in most programs. (Or maybe it's 80-20 or 70-30, but still ...) Selective application of inlining techniques has been found to provide order-of-magnitude speed improvements in some programs and none at all in others. The best means of detecting inlinable calls is customization analysis.

Other Measures

Hiding Computation.

It is almost always possible to obtain better user-visible interactive response times by arranging for slower processing to occur at less visible times. Time-outs and other related mechanisms may be used to control this.

Rebinding.

It is usually faster to change logical state by rebinding a link than by mutating a set of components. Isolating sets of related values in state-classes can clear up bottlenecks filled with excessive modification of value attributes.

Rolling and unrolling operations.

Transformations from top-level operations to relational classes may be run either backward or forward, depending on the relative speeds of object construction versus invocation of multiparameter operations. The backward transformation forces all participants to be listed as arguments. Care is needed to ensure that these sets of participants are always sent in the right way.

Precondition screening.

Overhead for argument error protocols may be bypassed via transformations that provide screening functions that must be invoked before calling the operation of interest. This often opens up further opportunities for simplification and optimization on the client side.

Component factoring.

Components that are redundantly held in many objects may be coalesced. For example, collections and repositories may hold single copies of components accessed by all elements.

Recycling objects.

A constructor may re-initialize and recycle an otherwise unused object rather than creating a fresh one via new. When objects are maintained in a repository, this is both natural and effective. The repository may place removed objects on a special recycling list for use in later constructors. This works well even in a garbage-collection environment. Classes must be fitted with special reInit methods to support this.

Sharing stateless objects.

Sometimes it suffices to use only one instance of truly stateless immutable classes. A generator class may simply keep passing out the same object as a constructor result rather than making fresh ones. However, this technique is only safe when clients do not test links for object identity. If two clients hold two conceptually distinct objects, but they are actually the same and the clients are able to determine this fact through identity tests, then the technique cannot be used.

Sharing stateful objects.

Copy-on-write and related techniques may be used to share objects while they are in the same state. The same concerns about identity apply.

Finite structures.

Most collection classes are designed to be boundless. (This means, of course, that insertions only fail when the system fails, not that they have truly infinite capacity.) Many applications only require predeterminable finite capacities. Implementations of finite capacity structures are usually faster than those for unbounded ones.

Storage management.

Replacing the storage management infrastructure (GC, system memory allocators, etc.) with faster algorithms and implementations speeds up the construction of all objects.

Preconstruction.

Many object-oriented programs spend significant time initializing objects before getting around to using them. Examples include programs making extensive use of collection classes that build up complex data structures at system initialization time. This can be avoided by writing a version of the program that initializes these objects and persistently stores their states. The normal, usable version of the program may then more quickly initialize objects by directly reconstructing them from the saved state information.

Reclustering.

As discussed in Chapter 23, repacking objects into different clusters on the basis of monitoring and experience is very likely to improve performance. Intercluster messages are normally orders of magnitude slower than within-cluster messages, so there is a lot of room for improvement. Dynamic reclustering strategies that move heavily communicating objects ``nearer'' to each other for the duration of an interaction sequence may have dramatic effect.

Optimization in C++

Of course, individual programming languages may offer special optimization opportunities. We illustrate a few such techniques for C++. C++ is a highly tunable language. Most of the optimization strategies described earlier in this chapter may be applied.

Many bottlenecks in C++ programs occur in operations on simple ``lightweight'' objects. Lightweightness is in the eye of the beholder, and differs across applications. However, generally, classes that may be transformed to possess the following properties may be segregated for special treatment:

only embedded components
mostly inlinable methods
mostly transient, local client usage
no need for representation independence.
no need for subclassability (either upwards or downwards).

Common examples include complex numbers, graphics primitives, and fixed-length strings. (Variable-length strings almost fall in this category, but cannot have their representations completely embedded.)

Here, several good design criteria are traded off for the sake of performance. Killing off representation independence and subclassability means that no methods need be declared as virtual, at the likely expense of padding class interfaces with operations normally introduced in subclasses. This then enables compile-time message resolution and inlining. The emphasis on embedded components and mostly local use allows the optimization of copy construction. Local copy-construction is usually already optimized by C++ compilers, so need not be specifically coded.

Summary

Some object-oriented systems are reputed to be slow. They need not be. A number of purely technical manipulations are available to improve performance. However, it is good practice to limit optimizations to those that maintain other software quality criteria. Most OO optimization techniques are specialization-driven. They first discover predeterminable message resolutions, representation strategies, and usage patterns. They then hard-wire these special cases before execution.

Exercises

Many ``intrinsically efficient'' designs are also good when measured by other criteria. Why?
Give an example demonstrating the application of each entry in the table of trade-offs.
One way to categorize optimization measures is whether they trade more space for less time, or vice versa. Categorize the measures described in this chapter.
There is a slogan among some compiler writers that a compiler should never postpone to run-time anything that may be done at compile time. Explain how this slogan has more consequences and requires a great deal more effort in OOPL compilers than those for procedural programming languages.
One reason that some OO languages limit representation options for built-in types is to guarantee that corresponding attributes may be embedded in their hosts. Use a language supporting both embedding and nonembedding options to see what difference this really makes.
Conversion of abstract links to embeddable concrete form replaces instance generation with subclassing. Why should this form of subclassing be postponed as long as possible?
Why would optimization be more difficult if objects were allowed to dynamically rebind concrete code associated with operations?
Would optimization be easier if classes could be specifically denoted as nonsubclassable?

References

1: A. Aho, J. Hopcroft, and J. Ullman. The Design and Analysis of Computer Algorithms. Addison-Wesley, 1974.
2: A. Aho, R. Sethi, and J. Ullman. Compilers: Principles, Techniques AND Tools. Addison-Wesley, 1986.
3: A. Appel. Compiling with Continuations. Cambridge University Press, 1992.
4: J. Bentley. Writing Efficient Programs. Prentice Hall, 1982.
5: W. Cellary, E. Gelenbe, and T. Morzy. Concurrency Control in Distributed Database Systems. North-Holland, 1988.
6: R. Draves, B. Bershad, R. Rashid, and R. Dean. Using continuation structures to implement thread management and communication in operating systems. In 13th ACM Symposium on Operating Systems Principles. ACM, 1991.
7: P. Lee, editor. Advanced Language Implementation. MIT Press, 1991.
8: R. Sedgewick. Algorithms. Addison-Wesley, 1990.
9: D. Ungar. The self papers. Lisp and Symbolic Computation, 1991.

Next: Chapter 26

Doug Lea
Wed May 10 08:01:13 EDT 1995