Article information

Title: Performance analysis of personal computer workstations
Author: David W. Blevins
Journal: Hewlett-Packard Journal
Print ISSN: 0018-1153
Publication year: 1991
Volume: Oct 1991
Publisher: Hewlett-Packard Co.

Performance analysis of personal computer workstations

David W. Blevins

The ability to analyze the performance of personal computers via noninvasive monitoring and simulation allows designers to make critical design trade-offs before committing to hardware.

Today's high-performance personal computers are being used as file servers, engineering workstations, and business transaction processors, areas previously dominated by large, costly mainframes or minicomputers. In this market, performance is of paramount importance in differentiating one product from another. Our objective at HP's Personal Computer Group's performance analysis laboratory is to ensure that performance is designed into HP's offering of personal computers. To achieve this, we analyze a personal computer's subsystem workloads and build predictive system models to identify bottlenecks and guide architectural design decisions. This article describes the tools and methodologies used by HP engineers to accomplish performance analysis for personal computers.

The toolset currently being used at the performance analysis laboratory consists of specialized hardware and software. Typically, the hardware gathers data from a system under test and then the data is postprocessed by the software to create reports (see Fig. 1). This data can also be used to drive software models of personal computer subsystems.

Hardware-Based Tools

The two hardware-based performance analysis tools shown in Fig. 1 are the processor activity monitor (PMON) and the backplane I/O activity monitor (BIOMON). Both tools are noninvasive in that they collect data without interfering with the normal activity of the system under test.

Processor Activity Monitor

The processor activity monitor is a hardware device that monitors a personal computer's microprocessor to track low-level CPU activity. The PMON is sandwiched between the computer's CPU and the CPU socket (see Fig. 2). The PMON monitors the processor's address and control pins. For each CPU operation, the PMON will track the duration and address of the operation and output the results to the data capture device.

Gathering statistics on the activities of a personal computer's microprocessor can be very useful in making design decisions about the arrangement of the support circuitry (e.g., cache and main memory, I/O bus interface, and bus lines). In addition, trace files that detail the CPU's requests to the memory system can be used to drive software simulations of various cache memory arrangements as well as more comprehensive CPU and memory or system simulations.

Two data capture devices are commonly used in conjunction with the PMON. The first, an HP 16500 logic analyzer configured with optional system performance analysis software, generates two main types of data. One is a histogram that shows the occurrence mix of a user-defined subset of the possible CPU cycle types (Fig. 3a). The performance analysis software averages 1000 samples of cycles from the PMON on the fly, giving a randomly sampled profile of processor activity throughout the duration of a performance benchmark. The second type of data provided by the HP 16500 is real-time calculation of the minimum, maximum, and average time intervals between the beginning and end of user-defined events (Fig. 3b). The performance analysis software averages the interval calculations on the fly over a large number of samples to give, for example, the average interarrival time of writes to video memory in a CAD application.
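The two kinds of summary described above can be illustrated with a minimal sketch. This is not the HP 16500 performance analysis software; the function names, the cycle-type labels, and the time units are assumptions chosen only for illustration.

```python
from collections import Counter

def cycle_mix(cycle_types):
    """Return the percentage of each cycle type in a list of sampled cycles."""
    counts = Counter(cycle_types)
    total = len(cycle_types)
    return {kind: 100.0 * n / total for kind, n in counts.items()}

def interval_stats(start_times, end_times):
    """Return (min, max, average) of end - start for paired events."""
    intervals = [end - start for start, end in zip(start_times, end_times)]
    return min(intervals), max(intervals), sum(intervals) / len(intervals)
```

For example, a sample of two memory reads, one memory write, and one I/O write yields a 50/25/25 percent mix, which is the information a cycle-type histogram conveys.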

The other data capture device used with the PMON is a less intelligent but higher-capacity logic analyzer. This instrument has a 16-megasample-deep trace buffer (as opposed to the 1000-sample-deep buffer in the HP 16500). Cycles from the PMON are captured in this buffer in real time and the data is later archived to a host computer's hard disk. The buffer typically holds four to five seconds of continuous bus cycle activity generated by a 25-MHz Intel486 microprocessor running an MS-DOS[R] application. The data can then be used to drive software simulations or processed to create summary reports, such as an address range summary of how the processor's address space is used by operating systems and application software (see Fig. 4).
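An address range summary of this kind can be sketched as a simple binning pass over the captured addresses. The range boundaries and labels below follow the conventional PC memory map and are assumptions for illustration; the ranges actually reported in Fig. 4 are not given in the text.

```python
import bisect

# Hypothetical ranges as (start address, label), sorted by start address.
RANGES = [
    (0x000000, "conventional memory"),
    (0x0A0000, "video memory"),
    (0x0C0000, "adapter ROM and BIOS"),
    (0x100000, "extended memory"),
]
STARTS = [start for start, _ in RANGES]

def address_range_summary(addresses):
    """Count how many accesses fall in each range (addresses must be >= 0)."""
    counts = {label: 0 for _, label in RANGES}
    for addr in addresses:
        # Index of the last range whose start is <= addr.
        idx = bisect.bisect_right(STARTS, addr) - 1
        counts[RANGES[idx][1]] += 1
    return counts
```

Dividing each count by the total number of accesses gives the share of bus activity directed at each region.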

Backplane I/O Activity Monitor

The backplane I/O activity monitor, or BIOMON, also captures information from a personal computer's hardware, but instead of the CPU activity, the I/O activity on the ISA (Industry Standard Architecture) or EISA (Extended Industry Standard Architecture) backplane is monitored (Fig. 5). The BIOMON consists of two backplane I/O cards: the qualify and capture card and the monitor card. The qualify and capture card resides noninvasively in the SUT (system under test) and is connected via a ribbon cable to the monitor card, which is located in another personal computer called the monitor system. The monitor system receives, stores, and processes the I/O events captured on the SUT's backplane.

During operation the qualify and capture card is loaded with capture enable flags for each of the I/O addresses whose activity is to be monitored on the SUT backplane.

Once the qualify and capture card is set up, I/O address accesses on the SUT's backplane cause an event information packet (address, data, etc.) to be transferred to a first-in, first-out (FIFO) holding buffer, allowing for asynchronous operation of the SUT and the monitor system. The FIFO is unloaded by transferring each event information packet to the monitor system's extended memory. At the end of event capture, this trace of I/O events can be either stored to hard disk or immediately postprocessed for analysis.

One very powerful use of BIOMON is the performance analysis of marked code, which is code that has been modified to perform I/O writes at the beginning and end of specific events within a software routine. The frequency of occurrence and execution time for each marked software event can then be analyzed under different configurations to find existing or potential bottlenecks and the optimum operating environment.

As an example, the performance analysis laboratory has developed a special installable software filter that writes to specific I/O addresses at the beginning and end of DOS and BIOS (Basic I/O System) interrupts. For our purposes, a write to I/O port 200 (hexadecimal) denotes the beginning of an interrupt, and a write to port 202 denotes the end of an interrupt. The trigger address comparator is told to capture data for I/O addresses 200 and 202, and any normal application using DOS or BIOS functions is run on the SUT. The resulting trace can be postprocessed to show which DOS and BIOS routines were used by the application, how many times each one was called, and how long each one executed (Fig. 6a). Other information such as the interarrival time between events, exclusive versus inclusive service time for nested events, and total time spent in various application areas can also be extracted (Fig. 6b). Analysis of this information can assist the software engineer in optimizing frequently used functions in DOS and the BIOS.
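The postprocessing step can be sketched as a stack-based pass over the captured events. This is a minimal illustration, not the laboratory's postprocessing program. It assumes each event is a (timestamp, port, data) tuple, and that the data value written to port 200 hex identifies the routine being entered; the text does not say what data the filter writes, so that encoding is hypothetical.

```python
from collections import defaultdict

BEGIN_PORT = 0x200  # write marks the start of an interrupt (from the text)
END_PORT = 0x202    # write marks the end of an interrupt (from the text)

def summarize_marked_trace(events):
    """Summarize (timestamp, port, data) events given in time order.

    Assumes the data written to BEGIN_PORT identifies the routine.
    Returns call count, inclusive time, and exclusive time per routine.
    """
    stats = defaultdict(lambda: {"calls": 0, "inclusive": 0, "exclusive": 0})
    stack = []  # each entry: [routine_id, begin_time, time_in_nested_events]
    for timestamp, port, data in events:
        if port == BEGIN_PORT:
            stack.append([data, timestamp, 0])
        elif port == END_PORT and stack:
            routine, begin, nested_time = stack.pop()
            inclusive = timestamp - begin
            entry = stats[routine]
            entry["calls"] += 1
            entry["inclusive"] += inclusive
            # Exclusive time leaves out time spent in nested events.
            entry["exclusive"] += inclusive - nested_time
            if stack:
                # Charge this event's full duration to its parent's nested time.
                stack[-1][2] += inclusive
    return dict(stats)
```

Inclusive time counts everything between a routine's begin and end markers, while exclusive time subtracts the time spent in events nested inside it, which is the distinction drawn above for nested events.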

This technique can also be used to analyze protected-mode operating systems such as OS/2 and UNIX[R]. However, because of their nature, these environments must have tags embedded into the operating system code. (Protected-mode operating systems do not allow a user to arbitrarily write to specific I/O locations.)

Another use of the BIOMON is to trigger on reads and/or writes to I/O locations associated with accessory cards such as disk controllers, serial and parallel interfaces, video cards, and so on. For instance, the interarrival rates of data read from a disk controller could be examined to determine the actual data transfer rate attained by the disk mechanism or drive controller subsystem. Additionally, by monitoring the disk controller's command registers, an application's disk I/O can be fully characterized.

Software-Based Tools

The software-based tools used by the performance laboratory allow simulation of different memory architectures.

Cache Simulator

The cache simulator is a trace-driven simulation based on the Dinero cache simulator from the University of California at Berkeley.(1) The simulator takes as its input a list of memory accesses (trace file) and parameters describing the cache to be simulated. These parameters include cache size, line size, associativity, write policy, and replacement algorithm. The cache simulator reads the memory accesses from the trace file and keeps statistics on the cache hit rate and the total bus traffic to and from main memory. When the entire trace file has been read, the simulator generates a report of the cache statistics.
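The core loop of a trace-driven cache simulation can be sketched in a few lines. The class below is not Dinero and not the laboratory's simulator. It is a minimal illustration that assumes LRU replacement and tracks only the hit rate, leaving out write policy and bus traffic accounting; all names are hypothetical.

```python
from collections import OrderedDict

class CacheSim:
    """Set-associative cache model with LRU replacement (hit rate only)."""

    def __init__(self, cache_bytes, line_bytes, ways):
        self.line_bytes = line_bytes
        self.ways = ways
        self.num_sets = cache_bytes // (line_bytes * ways)
        # One ordered dict per set; order runs from least to most recently used.
        self.sets = [OrderedDict() for _ in range(self.num_sets)]
        self.hits = 0
        self.accesses = 0

    def access(self, address):
        """Simulate one memory access; return True on a hit."""
        line = address // self.line_bytes
        index = line % self.num_sets
        tag = line // self.num_sets
        lines = self.sets[index]
        self.accesses += 1
        if tag in lines:
            self.hits += 1
            lines.move_to_end(tag)      # mark as most recently used
            return True
        if len(lines) >= self.ways:
            lines.popitem(last=False)   # evict the least recently used line
        lines[tag] = True
        return False

    def hit_rate(self):
        return self.hits / self.accesses if self.accesses else 0.0
```

Feeding every address in a trace file to access() and then reading hit_rate() mirrors the read-then-report flow described above, for one cache configuration at a time.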

A trace file is generated by connecting the PMON to a CPU and storing all the memory accesses on the CPU bus in the high-capacity logic analyzer described above. The data collected from the analyzer can later be downloaded to a host personal computer and archived to hard disk. To get useful data from the simulator, however, the input trace file must be long enough to "prime" the simulated cache. The first several thousand memory accesses in the trace file will be misses that fill up the initially empty cache. As a result, the simulator will report artificially low hit rates, because in a real system the cache is never completely empty. If the trace file is significantly longer than N_p (the number of memory accesses in the trace file needed to fill the cache), priming effects are minimized. When simulating a 128K-byte, 2-way associative cache external to the Intel486, N_p is approximately 40,000.(2) The high-capacity logic analyzer mentioned above is able to store 16 million memory accesses from the Intel486 via the PMON. A trace file containing 16 million accesses results in a priming error of less than 1% in the hit rate calculation (assuming a hit rate of approximately 90% for the 128K-byte, 2-way cache).
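The size of the priming error can be checked with a short calculation. The article does not give its exact error formula, so the sketch below uses a worst-case assumption: every one of the roughly 40,000 accesses needed to fill the cache is counted as a miss that would have been a hit in a warm cache. The function name is hypothetical.

```python
def priming_error_bound(prime_accesses, trace_accesses):
    """Worst-case drop in measured hit rate caused by cold-start misses.

    Assumes every priming access is a miss that a warm cache would have hit.
    """
    return prime_accesses / trace_accesses

# Figures from the text: about 40,000 priming accesses, 16 million in the trace.
absolute_error = priming_error_bound(40_000, 16_000_000)
# Error relative to the assumed true hit rate of about 90%.
relative_error = absolute_error / 0.90
```

Under this assumption the measured hit rate is depressed by at most a quarter of a percentage point, which is consistent with the stated priming error of less than 1%.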

Memory Subsystem Simulator

The memory subsystem simulator, a program written in C++, is a true event-driven simulation that keeps track of time rather than just statistics. It builds on the cache simulator by integrating it into a more comprehensive model that simulates access time to memory. It accepts a parameter file that includes cache parameters, DRAM and SRAM access times, and other memory architecture parameters. It also reads in a PMON trace file, although this one must contain all accesses (not just memory), and their durations so that the simulator can keep track of time. The result is essentially a running time for the input trace file, along with statistics on all aspects of the memory subsystem.
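The idea of turning a trace into a running time can be sketched as follows. This is a greatly simplified stand-in written in Python, not the C++ simulator and not a full event-driven model. It assumes a direct-mapped cache, fixed hit and miss latencies, and trace records of the form (kind, address, recorded duration), and it assumes that non-memory cycles simply keep their measured duration. All of these choices are illustrative assumptions.

```python
def simulate_running_time(trace, cache_lines, line_bytes, hit_ns, miss_ns):
    """Estimate total time for a trace under a simple memory model.

    trace: iterable of (kind, address, recorded_ns); kind "mem" marks a
    memory access, and anything else is treated as a non-memory cycle.
    Returns (total_ns, hits, misses).
    """
    tags = [None] * cache_lines  # direct-mapped cache: one tag per line slot
    total_ns = 0
    hits = misses = 0
    for kind, address, recorded_ns in trace:
        if kind != "mem":
            # Assumption: non-memory cycles keep their measured duration.
            total_ns += recorded_ns
            continue
        line = address // line_bytes
        index = line % cache_lines
        tag = line // cache_lines
        if tags[index] == tag:
            hits += 1
            total_ns += hit_ns    # served from the cache
        else:
            misses += 1
            tags[index] = tag     # fill the line from main memory
            total_ns += miss_ns
    return total_ns, hits, misses
```

Running the same trace with different cache sizes or latencies and comparing the total times is the kind of design trade-off comparison the simulator supports, although a real model would also account for effects such as interleaving, page mode, and write buffering.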

This simulator can be used for making design trade-offs within a memory subsystem, such as cache size and organization, DRAM speed, interleave, page size, and write buffer depth. Fig. 7 shows the sample results of a memory subsystem simulation of relative memory performance versus external cache size for a 33-MHz Intel486 running a typical DOS application. By simulating various design alternatives in advance, the design engineer can arrive at a memory architecture that is tuned for optimum performance before committing to hardware.

Conclusion

The performance analysis laboratory of HP's Personal Computer Group has developed a suite of hardware and software tools to aid in the design process. The hardware tools give design engineers insight into the low-level performance of existing systems, and the software tools use the data produced by the hardware tools to predict the performance of future architectures.

The performance tools were used extensively in designing the HP Vectra 486, and more recently the Vectra 486/33T. The tools helped show that a burst memory controller (described on page 78) was a better price/performance solution than an external memory cache for the 25-MHz Vectra 486, and that an external cache was a necessity for the 33-MHz Vectra 486/33T. The tools also helped predict the performance gain of memory write buffers in a Vectra 486 system. This resulted in the addition of write buffers to the Vectra 486/33T memory architecture.

Acknowledgments

The original PMON was designed by Carol Bassett and Mark Brown. Later versions were implemented by Steve Jurvetson. BIOMON's design owes thanks to several people. Its predecessor was designed by Bob Campbell and Greg Woods. Subsequent help came from Ali Ezzet, Chris Anderson, and John Wiese. Greg Woods coded the installable filter and the original postprocessing program. Jim Christy provided additional software help.

References

1. M. D. Hill, "Test Driving Your Next Cache," MIPS, Vol. 1, no. 8, August 1989, pp. 84-92.

2. H. S. Stone, High Performance Computer Architecture, Second Edition, Addison-Wesley, 1990.