Chapter 25: Performance Optimization
In this chapter, we describe some techniques for improving the performance of object-oriented systems. While most of the ideas are fairly general, making good on them can be environment, language, compiler, and tool dependent. However, for most of this chapter, we maintain the illusion of implementation independence. As a prelude, Table 1 lists without comment a set of trade-offs that may be applied in performance optimization.
| Usually Faster | Usually Slower | Usually Faster | Usually Slower |
|---|---|---|---|
| Internal | External | Hardware | Software |
| Storage | Computation | Direct | Indirect |
| Unmediated | Mediated | Fixed | Variable |
| Implicit | Explicit | Special Purpose | General Purpose |
| One Layer | Two Layers | One Message | Two Messages |
| Unguarded | Guarded | Immediate | Queued |
| Unconditional | Conditional | Computed | Symbolic |
| Approximate | Exact | Compile-Time | Run-Time |
| Optimistic | Pessimistic | Transient | Persistent |
| Control | Coordination | Event-Driven | Polled |
| Cooperation | Coordination | Point-to-Point | Routed |
| Dispatching | Forwarding | Local Call | Remote Call |
| Inline expansion | Call | Reply | Exception |
| Signaling | Blocking | Update | Construction |
| Invocation | Construction | Recycling | Construction |
| Chunked | One-at-a-time | Binding | Evaluation |
| Indexing | Searching | Lazy | Eager |
| Reference | Copy | Better Algorithm | Better Code |
Optimization techniques discover special cases and make associated trade-offs in order to improve performance. Some methods trade off other software quality criteria (especially coupling) for the sake of efficiency. Since performance requirements are usually ``hard'' and quality requirements ``soft'', such trade-offs must sometimes be made.
But one would like to optimize systems without completely removing the possibility for them to evolve, be repaired, and be reused. As OO systems become larger and run longer, these concerns become increasingly important, especially considering that one of the reasons that OO systems are replacing non-OO ones is because the old ones could not be made to adapt.
General OO design methods usually ensure that performance tuning measures are at least feasible. They leave room for as wide a variation in lower-level design and implementation decisions as logically possible. Any concrete implementation class that obeys the required interface of its abstract superclass can be plugged in at any time in development. The best optimization techniques are those that continue to allow for at least some kinds of evolutionary changes without completely sacrificing performance. But if everything can change, then you cannot optimize anything. While there is a fine line here, some good compromises are available.
The stance most compatible with OO design methods is to support explicitly only those changes representing extension by addition. This preserves the ability to adapt to changes resulting from the addition of new objects, operations, and classes, but not destructive modifications of existing ones. Thus, some modifications may break optimized designs, but extensions and improvements will not. Generally, this implies avoidance of optimizations that take advantage of constraints that just so happen to hold, without being declaratively guaranteed to hold. These incidental constraints are the ones that may change. Thus, the declarative constraints, conditions, and effects listed for abstract classes are at least as useful for guiding optimization as are concrete-level considerations. In this sense, optimization is simply ``more design''.
Except where noted, the techniques described in the remainder of this chapter maintain at least limited opportunities for extensibility. However, even though these tactics are locally safe and permit extension, they are not without cost to overall design integrity. They may accentuate the extent to which future destructive changes propagate throughout a system.
OOA descriptions may contain information useful for helping to locate plausible spots where clever algorithms and data structures might make a big difference. Probabilistic descriptions of ranges, use cases, and other qualitative information can direct attention to expected execution profiles. These may be used in a qualitative way during optimization efforts, directing attention to concrete components that may need redesign before being committed to. Analytic models and execution profiles of prototypes serve similar roles.
The usual route to algorithmic optimization in OO designs is subclassing. Any concrete implementation class that obeys the interface of its abstract superclass can be plugged in at any time in development. Thus, slow list-based concrete collection classes can be replaced with ones based on fast balanced trees, classes with naively constructed numerical operations may be replaced by ones based on more serious algorithms, and so on.
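A minimal C++ sketch of this subclassing route (all class names hypothetical): clients depend only on an abstract collection, so a naive list-based concrete class can later be replaced by one based on a balanced tree without any client changes.

```cpp
#include <algorithm>
#include <set>
#include <vector>

// Abstract collection: clients see only this interface.
class IntSet {
public:
    virtual ~IntSet() = default;
    virtual void add(int x) = 0;
    virtual bool has(int x) const = 0;
};

// Naive list-based version: O(n) membership test.
class ListIntSet : public IntSet {
    std::vector<int> elems;
public:
    void add(int x) override {
        if (!has(x)) elems.push_back(x);
    }
    bool has(int x) const override {
        return std::find(elems.begin(), elems.end(), x) != elems.end();
    }
};

// Balanced-tree version: O(log n) membership test, same interface.
class TreeIntSet : public IntSet {
    std::set<int> elems;
public:
    void add(int x) override { elems.insert(x); }
    bool has(int x) const override { return elems.count(x) != 0; }
};

// A client written against the abstract class works with either
// concrete class, so the faster one may be plugged in at any time.
bool demoHasAll(IntSet& s) {
    for (int i = 0; i < 100; ++i) s.add(i);
    for (int i = 0; i < 100; ++i)
        if (!s.has(i)) return false;
    return !s.has(100);
}
```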
This strategy must be used with care for components geared toward reuse across many applications or subsystems. Different applications will have different invocation profiles. Fast processing times in one operation cannot always be traded for slow execution in another. However, subclassing may be employed to create different concrete classes that make different trade-offs. Each application may then select the one that is best suited to its profile.
Sometimes more efficient representations and algorithms may be employed only after defining superclasses that define fewer and/or weaker operations. For example, very efficient algorithms exist for implementing UnionFindSETs defining only has(elem) and destructivelyUnion(otherSet) operations [1]. This is not a pure superclass of our SET class, but is a subclass of a superclass containing just has. There are many similar twists and variations, especially for data structure and numerical classes. These kinds of structural modifications enable exploitation of the many known efficient representations and algorithms developed for such purposes.
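To make the UnionFindSET point concrete, here is a sketch of the classic representation that the weakened interface admits (names and the int-element universe are illustrative): supporting only membership tests and destructive unions allows the near-constant-time union-find forest with path compression and union by rank.

```cpp
#include <numeric>
#include <utility>
#include <vector>

// Elements are small integers; each set is named by a representative.
class UnionFindUniverse {
    std::vector<int> parent, rank_;
public:
    explicit UnionFindUniverse(int n) : parent(n), rank_(n, 0) {
        std::iota(parent.begin(), parent.end(), 0); // each element alone
    }
    int find(int e) {                   // with path compression
        while (parent[e] != e) {
            parent[e] = parent[parent[e]];
            e = parent[e];
        }
        return e;
    }
    // has(elem): is e a member of the set represented by setRep?
    bool has(int setRep, int e) { return find(setRep) == find(e); }
    // destructivelyUnion: merge both sets; returns the new representative.
    int destructivelyUnion(int a, int b) {
        a = find(a); b = find(b);
        if (a == b) return a;
        if (rank_[a] < rank_[b]) std::swap(a, b); // union by rank
        parent[b] = a;
        if (rank_[a] == rank_[b]) ++rank_[a];
        return a;
    }
};
```

Note that nothing here supports iteration or nondestructive union; that is exactly the weakening that buys the speed.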
In designs full of relays, name servers, and other mediators that perform multistage routing, the identities of ultimate recipients of routed messages may be reported to and cached by senders to streamline future communication. For example, in a model-view design that is mediated by a relay, the viewer can send back an identity-revealing message to the model on receiving its first change notice. From that point onward, the model may send notices directly to the viewer, without mediation.
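The model-view case can be sketched as follows (class names hypothetical): only the first change notice travels through the relay; the viewer's identity-revealing reply lets the model cache a direct route for all later notices.

```cpp
struct Model;  // forward declaration

struct Viewer {
    int noticesReceived = 0;
    void changed(Model& m);          // defined after Model
};

// Mediator: forwards change notices to the viewer.
struct Relay {
    Viewer* target;
    int forwards = 0;
    explicit Relay(Viewer* v) : target(v) {}
    void changed(Model& m) { ++forwards; target->changed(m); }
};

struct Model {
    Relay* relay;
    Viewer* direct = nullptr;        // cached identity, initially unknown
    explicit Model(Relay* r) : relay(r) {}
    void notifyChange() {
        if (direct) direct->changed(*this);  // streamlined path
        else relay->changed(*this);          // mediated path
    }
    void revealViewer(Viewer* v) { direct = v; }
};

void Viewer::changed(Model& m) {
    ++noticesReceived;
    m.revealViewer(this);            // identity-revealing reply
}
```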
Caching is used to construct proxy objects. Proxies are designed as relays that transform local requests into interprocess messages. However, they may be implemented in a way that locally holds at least some state information, allowing them to reply quickly to internal functions probing simple attributes.
Caching always requires backup strategies to recover and adapt when properties of external entities change. This normally requires notification protocols so that cached entities may inform their cachers when they change. Some OODBs provide a great deal of infrastructure supporting local caching of persistently held objects. When available, this infrastructure may be used to exploit other caching opportunities.
An extreme version of these tactics is to force an otherwise nonlocal passive object to become local by embedding it in another. For example, in ODL, concrete passive helper objects are accessed through links from their hosts. In some cases, these objects can be directly embedded within their enclosures, in a manner similar to ``records'' in procedural programming languages. Passive components declared as own are always safe for embedding, thus may be relabeled as packed. This saves a level of indirection on access and sometimes simplifies storage management.
Most design methods actually preclude thoughtless application of this measure. Since links always refer to abstract types, the storage space needed to embed corresponding concrete objects is not knowable, so embedding is impossible. However special subclasses may be designed in order to enable embedding. Any class declaring fixed links to other abstract components may be subclassed with versions specifying concrete links for each kind of concrete class actually used. These may then be packed safely. While the new subclasses are not very extensible, their parents remain so.
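A C++ rendering of this subclass-to-embed tactic (names hypothetical): the general host links to an abstract helper, while a packed subclass commits to one concrete helper and embeds it by value, saving an indirection. The parent class remains open to other helpers.

```cpp
#include <string>

struct Printer {
    virtual ~Printer() = default;
    virtual std::string render(int x) const = 0;
};

struct DecimalPrinter : Printer {
    std::string render(int x) const override { return std::to_string(x); }
};

// General host: a link to any Printer, paid for with an indirection.
class Host {
protected:
    const Printer* p;
public:
    explicit Host(const Printer* pr) : p(pr) {}
    std::string show(int x) const { return p->render(x); }
};

// Packed variant: the concrete helper is embedded record-style. Only
// this subclass commits to DecimalPrinter; Host itself stays abstract
// with respect to the helper.
class PackedHost : public Host {
    DecimalPrinter embedded;         // no separate heap object
public:
    PackedHost() : Host(&embedded) {}
};
```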
Communication constructs come in many ``strengths''. For example, futures are more expensive than bidirectional operations, which are in turn more expensive than one-way sends. (This is not always true; in some systems, one-way sends are actually built out of wrappers around bidirectional ones.) Splitting bidirectional interactions into further-optimizable callback and continuation protocols has been found to provide substantial speed improvements [6].
Similarly, exceptions are usually more expensive than sentinels. Messages that require resolution, dispatching, and/or multicast are more expensive than fixed point-to-point messages. The cheapest workable protocol should always be employed. The cheapest communication is no communication at all.
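The exception-versus-sentinel trade-off can be illustrated with two lookups (function names hypothetical): both report the same outcome, but the sentinel version does so through an ordinary return value rather than the unwinding machinery an exception engages on every miss.

```cpp
#include <cstddef>
#include <stdexcept>
#include <vector>

// Expensive reply: a miss raises an exception.
int indexOfThrowing(const std::vector<int>& v, int x) {
    for (std::size_t i = 0; i < v.size(); ++i)
        if (v[i] == x) return static_cast<int>(i);
    throw std::runtime_error("not found");
}

// Cheap reply: a miss returns the sentinel -1.
int indexOfSentinel(const std::vector<int>& v, int x) {
    for (std::size_t i = 0; i < v.size(); ++i)
        if (v[i] == x) return static_cast<int>(i);
    return -1;
}
```

The sentinel form is only appropriate when some value (here -1) can never be a legitimate result.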
Locks and similar interference control measures inhibit concurrency, generate overhead, and sometimes lead to deadlock. These problems may be minimized through static analysis. For example, if an object is used in only two joint operations that cannot ever interfere with each other (i.e., if all possible interleavings are known to be either safe or impossible), then locking is unnecessary. More generally, operations may be divided into conflict sets that describe those pairs or sets of operations that may interfere. Locking schemes may be adjusted accordingly. The literature on such techniques is extensive; see, for example [5].
Program implementations are at the mercy of native dispatchers and dispatching rules for most execution support. Different languages even have different rules (as well as different subclassing rules, argument formats, and so on). They do not always map precisely to design constructs. However, languages supporting dispatching are amenable to optimizations based on relatively high-level design information. These optimizations eliminate run-time uncertainty about the particular operations that need to be performed given some message. At least some messages can be resolved at design-time. These cases may be determined through static analysis of the information sitting in a given design.
This is something that a clever compiler or static analysis tool might be able to do. In fact, most of these techniques were originally reported by researchers implementing experimental compilers for the OO language SELF [9]. Except for such experimental systems, contemporary design and implementation tools do not undertake such analysis, so it is useful to know how to do this manually.
In an ODL framework, the required analyses may be seen as sets of logical inferences based on declarative class information. To illustrate, suppose there is an abstract class that is implemented by only one concrete class. In this situation, any client of the abstract class must actually be using the concrete version:
class AbstractA
  v: int;
end

class ConcreteA is AbstractA
  v: int { ... }
  inv v = 1
end

op useA(a: AbstractA) {
  if a.v = 1 then something else otherthing end
}
It is clear that in useA, something will always be invoked. This fact can be hard-wired at design time. Surprisingly, this can be done in a relatively nondisruptive manner, by overloading another customized version of useA:
op useA(a: ConcreteA) { something }
Resolution-based dispatching is still required to route a call of useA to the concrete version. (This may require further conversions to selection-based dispatching; see Chapter 21.) But once it is invoked, the internals are streamlined away. This applies even if there is another concrete class. For example:
class AnotherConcreteA is AbstractA
  v: int { ... }
  inv v = 0
end

op useA(a: AnotherConcreteA) { otherthing }
These optimizations have no other execution cost except storage of additional customized versions of operations. Some kind of dispatch is needed anyway to invoke the right version of an operation with multiple specializations. The heart of the trick is to replace conditionals with class membership tests and dispatches. Another way of viewing this is that customization synthesizes finer-granularity operations and classes than are otherwise present in a design, and adjusts dispatching requirements accordingly.
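One way to render the ODL example in C++ (a sketch; the attribute v is modeled as a virtual function, and something/otherthing are stand-ins that return their names so the effect is observable): the generic useA keeps the run-time conditional, while the customized overloads hard-wire the branch implied by each concrete class's invariant.

```cpp
#include <string>

struct AbstractA {
    virtual ~AbstractA() = default;
    virtual int v() const = 0;
};
struct ConcreteA : AbstractA {            // inv v = 1
    int v() const override { return 1; }
};
struct AnotherConcreteA : AbstractA {     // inv v = 0
    int v() const override { return 0; }
};

std::string something()  { return "something"; }
std::string otherthing() { return "otherthing"; }

// Generic version: conditional resolved at run time.
std::string useA(const AbstractA& a) {
    return a.v() == 1 ? something() : otherthing();
}
// Customized versions: the conditional is resolved at design time,
// relying only on the declared invariants, not incidental facts.
std::string useA(const ConcreteA&)        { return something(); }
std::string useA(const AnotherConcreteA&) { return otherthing(); }
```

When the static type is a concrete class, overload resolution selects the streamlined version; clients holding only an AbstractA reference still get the correct generic behavior.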
These improvements require far less global analysis than do other optimization techniques. You do not have to be sure that only one path is taken; you just have to notice that if it is taken, further simplifications are possible since they are all conditional on the same properties that lead to it being taken in the first place.
These forms of specialization only work when there are no subclassing errors. For example, if there were a class SubConcreteA that redefined v to return 3, the strategy would fail. But this subclass would also break the v=1 invariant. Of course, it would have been much better to create an intervening abstract class, say, AbstractAWithVEQ3. This would reflect and advertise the subclass constraints rather than adding them directly to a concrete class, or worse (but temptingly) leaving the constraints entirely implicit. However, this is the right stance only when the subclass constraints must hold, and not when they ``just happen'' to be true, and are thus overridable in subclasses.
There are several further variations on the basic idea. For example, consider the collection operation applyC, which takes an operation and applies it to all elements. This can be customized by defining individual operations that directly apply the operation to all members. Rather than calling aSet.applyC(WRAP(print)), a printAll operation can be added to a concrete SET subclass, and invoked from clients.
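A sketch of this applyC customization (names hypothetical; summing is used instead of printing so the effect is checkable): the generic version pays one indirect call per element, while the customized subclass fuses the operation into a direct loop.

```cpp
#include <functional>
#include <vector>

class IntBag {
protected:
    std::vector<int> elems;
public:
    void add(int x) { elems.push_back(x); }
    // Generic applyC: one indirect call through op per element.
    void applyC(const std::function<void(int)>& op) const {
        for (int e : elems) op(e);
    }
};

// Customized subclass: the common operation is applied directly,
// with no wrapped operation object and no per-element indirection.
class SummableIntBag : public IntBag {
public:
    long sumAll() const {
        long s = 0;
        for (int e : elems) s += e;
        return s;
    }
};
```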
At least at the within-process programming level, caching, embedding, strength reduction, and customization techniques are (only) sometimes more effective when the ``insides'' of target objects can be opened up to their clients. When operations are made more complex and inefficient than they would be if internal components were not hidden (e.g., due to extra forwarding and dispatching), encapsulation may be broken for the sake of performance. However, blind, random removal of encapsulation barriers may cause irreparable harm to designs. Less disruptive mechanisms exist.
As discussed in Chapter 17, constructs such as opens can be effective in removing certain encapsulation barriers in a structured fashion. These mechanisms embed (or share) concrete helper class features in ways that allow direct access by hosts, thus avoiding forwarding and indirection. Most OO programming languages support some variant of this construct. For example, in C++, private subclassing (sometimes in conjunction with friend constructs) may be used to this effect.
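As a small illustration of the C++ variant (the Buffer class is hypothetical): private subclassing opens the helper's representation to the host alone, so the host operates on it directly instead of forwarding through a separately linked helper object, while clients still cannot see the helper at all.

```cpp
#include <cstddef>
#include <vector>

// The host privately inherits its concrete helper. Inside Buffer,
// the vector's operations are directly accessible; outside, the
// inheritance (and hence the representation) is invisible.
class Buffer : private std::vector<int> {
public:
    void put(int x) { push_back(x); }   // direct access, no forwarding
    std::size_t taken() const { return size(); }
};
```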
Inspection of the basic structure of an OO kernel (e.g., in Chapter 15) shows that most of the real execution of internal actions occurs only when operating on primitives. All composite operations just coordinate and relay further requests, usually ultimately reducing to primitive operations. While the notion of ``primitive'' is somewhat arbitrary, it may be safely assumed that primitive operations are executed relatively quickly.
Any program should run faster when paths to primitives are shorter. In the best case, these primitives can be directly computed by their clients through inlining. This avoids all resolution and procedure call overhead, and causes the code to run at ``machine speed'', modulo language-specific factors. In fact, inlining often results in OO programs running faster than predicted on the basis of reduced indirection overhead, since it creates code patterns that are further optimizable through standard compilation techniques. Inlining may be available in the native language or may be manually encoded.
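A minimal illustration (names hypothetical): in C++, member functions defined inside the class body are implicitly inline, so a short path down to a primitive can be expanded at the call site.

```cpp
class Counter {
    int n = 0;                     // the underlying primitive
public:
    void bump() { ++n; }           // short path to the primitive; inlinable
    int value() const { return n; }
};

// After inlining, this loop typically compiles down to plain integer
// arithmetic, and standard optimizations may fold it away entirely.
long countTo(int limit) {
    Counter c;
    for (int i = 0; i < limit; ++i) c.bump();
    return c.value();
}
```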
Of course, full inlining is only possible and/or desirable in limited contexts. However, inlining opportunities often occur inside the bottlenecks representing the bulk of the ``90-10'' rule of thumb for efficiency: 90% of the time is spent in 10% of the code in most programs. (Or maybe it's 80-20 or 70-30, but still ...) Selective application of inlining techniques has been found to provide order-of-magnitude speed improvements in some programs and none at all in others. The best means of detecting inlinable calls is customization analysis.
It is almost always possible to obtain better user-visible interactive response times by arranging for slower processing to occur at less visible times. Time-outs and other related mechanisms may be used to control this.
It is usually faster to change logical state by rebinding a link than by mutating a set of components. Isolating sets of related values in state-classes can clear up bottlenecks filled with excessive modification of value attributes.
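A sketch of the state-class idea (names hypothetical): the logical state of a connection is changed by rebinding a single link to a shared immutable state object, rather than mutating several value attributes in step.

```cpp
// One immutable object per logical state.
struct ConnState {
    const bool open;
    const bool canSend;
};
const ConnState OPEN_STATE{true, true};
const ConnState CLOSED_STATE{false, false};

class Connection {
    const ConnState* state = &CLOSED_STATE;
public:
    void open()  { state = &OPEN_STATE; }    // one rebinding, not N writes
    void close() { state = &CLOSED_STATE; }
    bool canSend() const { return state->canSend; }
};
```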
Transformations from top-level operations to relational classes may be run either backward or forward, depending on the relative speeds of object construction versus invocation of multiparameter operations. The backward transformation forces all participants to be listed as arguments. Care is needed to ensure that these sets of participants are always sent in the right way.
Components that are redundantly held in many objects may be coalesced. For example, collections and repositories may hold single copies of components accessed by all elements.
Sometimes it suffices to use only one instance of truly stateless immutable classes. A generator class may simply keep passing out the same object as a constructor result rather than making fresh ones. However, this technique is only safe when clients do not test links for object identity. If two clients hold two conceptually distinct objects, but they are actually the same and the clients are able to determine this fact through identity tests, then the technique cannot be used.
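A sketch of such a generator (names hypothetical): a stateless immutable class whose constructor-like operation keeps passing out one recycled instance instead of constructing fresh ones.

```cpp
// Immutable, stateless value: the origin point (0, 0).
class Origin {
public:
    int x() const { return 0; }
    int y() const { return 0; }
};

// Generator that recycles a single shared instance. This is safe only
// because clients never test these links for object identity.
class OriginGenerator {
    static Origin shared;
public:
    static const Origin* make() { return &shared; }
};
Origin OriginGenerator::shared;
```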
Most collection classes are designed to be boundless. (This means, of course, that insertions only fail when the system fails, not that they have truly infinite capacity.) Many applications only require predeterminable finite capacities. Implementations of finite capacity structures are usually faster than those for unbounded ones.
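A sketch of a finite-capacity structure (names hypothetical): pushes never allocate; each either succeeds in O(1) or reports fullness, avoiding the resizing machinery of a boundless representation.

```cpp
#include <array>
#include <cstddef>

template <std::size_t Cap>
class BoundedStack {
    std::array<int, Cap> elems{};
    std::size_t top = 0;
public:
    bool push(int x) {
        if (top == Cap) return false;  // full: refuse rather than grow
        elems[top++] = x;
        return true;
    }
    bool pop(int& out) {
        if (top == 0) return false;
        out = elems[--top];
        return true;
    }
    std::size_t size() const { return top; }
};
```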
Replacing the storage management infrastructure (GC, system memory allocators, etc.) with faster algorithms and implementations speeds up the construction of all objects.
Many bottlenecks in C++ programs occur in operations on simple ``lightweight'' objects. Lightweightness is in the eye of the beholder, and differs across applications. However, generally, classes that may be transformed to possess the following properties may be segregated for special treatment: a fixed concrete representation, no subclasses (and thus no virtual operations), fully embedded components, and mostly local, copy-by-value usage.
Common examples include complex numbers, graphics primitives, and fixed-length strings. (Variable-length strings almost fall in this category, but cannot have their representations completely embedded.)
Here, several good design criteria are traded off for the sake of performance. Killing off representation independence and subclassability means that no methods need be declared as virtual, at the likely expense of padding class interfaces with operations normally introduced in subclasses. This then enables compile-time message resolution and inlining. The emphasis on embedded components and mostly local use allows the optimization of copy construction. Local copy-construction is usually already optimized by C++ compilers, so need not be specifically coded.
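A sketch of one such lightweight class, the complex-number example mentioned above: a fixed, fully embedded representation, no virtual functions, and cheap copy construction, so every call can be resolved at compile time and inlined.

```cpp
class Complex {
    double re_, im_;                     // embedded, never indirected
public:
    Complex(double re, double im) : re_(re), im_(im) {}
    double re() const { return re_; }
    double im() const { return im_; }
    // Nonvirtual, inlinable value operations returning fresh copies.
    Complex operator+(const Complex& o) const {
        return Complex(re_ + o.re_, im_ + o.im_);
    }
    Complex operator*(const Complex& o) const {
        return Complex(re_ * o.re_ - im_ * o.im_,
                       re_ * o.im_ + im_ * o.re_);
    }
};
```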
Some object-oriented systems are reputed to be slow. They need not be. A number of purely technical manipulations are available to improve performance. However, it is good practice to limit optimizations to those that maintain other software quality criteria. Most OO optimization techniques are specialization-driven. They first discover predeterminable message resolutions, representation strategies, and usage patterns. They then hard-wire these special cases before execution.
Bentley [4], Sedgewick [8], and related texts should be consulted for more thorough treatments of the basic techniques available for writing efficient algorithms and programs. The performance enhancements attempted by some compilers have analogs at coarser design and programming levels. See, for example, Aho et al [2], Lee [7], and Appel [3]. The SELF papers [9] present several OO program optimization techniques beyond those described here.