OpenVMS Performance Management

Document revision date: 30 March 2001

4.4 Creating, Maintaining, and Interpreting MONITOR Summaries

Consider the following guidelines when using MONITOR:

Before capturing data, have a specific plan for how you will analyze and apply it.
Avoid an interval value so long that you require unnecessary disk storage for the data.
Do not select an interval so short that you miss significant events.

See the OpenVMS System Manager's Manual, Volume 2: Tuning, Monitoring, and Complex Systems and the OpenVMS System Management Utilities Reference Manual: M-Z for information about using MONITOR.

4.4.1 Types of Output

MONITOR generates the following types of output:

ASCII screen images of statistics from a running system (/DISPLAY qualifier)
Binary recording files containing data collected from a running system (/RECORD qualifier)
Formatted ASCII summary files of statistics extracted from binary recording files (/SUMMARY qualifier)
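
As a minimal sketch, the three output types correspond to commands such as the following. The class names, interval values, and file names (MYNODE.DAT, MYNODE.SUM) are illustrative assumptions, not values prescribed by this manual.

```dcl
$ ! Screen images: display the SYSTEM class, refreshed every 5 seconds
$ MONITOR /INTERVAL=5 SYSTEM
$ ! Binary recording file: record all classes without a screen display
$ MONITOR /NODISPLAY /RECORD=MYNODE.DAT /INTERVAL=60 ALL_CLASSES
$ ! ASCII summary file: summarize the recording made above
$ MONITOR /INPUT=MYNODE.DAT /NODISPLAY /SUMMARY=MYNODE.SUM ALL_CLASSES
```

A live request of this kind runs until it is interrupted or until the time given with the /ENDING qualifier is reached.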

4.4.2 MONITOR Modes of Operation

MONITOR provides two input modes of operation for collecting data: live and playback.

Live Mode

Use live mode to collect data on a running system and to generate one or more of the following types of MONITOR output: ASCII screen images, binary recording files, or formatted ASCII summary files.

Use live mode to display data about a remote system connected to your system with DECnet for OpenVMS.

Playback Mode

Use playback mode to read a binary recording file and produce one or more of the following types of MONITOR output: ASCII screen images, binary recording files, or formatted ASCII summary files.
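
The following sketch shows two playback requests against a hypothetical recording file, MYNODE.DAT. The file names and the choice of classes are assumptions made for illustration.

```dcl
$ ! Play back the PAGE class from the recording as screen images
$ MONITOR /INPUT=MYNODE.DAT PAGE
$ ! Play back and rerecord a subset of classes into a smaller file
$ MONITOR /INPUT=MYNODE.DAT /NODISPLAY /RECORD=SUBSET.DAT PAGE,IO
```

The presence of /INPUT is what selects playback mode; without it, MONITOR collects data from the running system.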

4.4.3 Creating a Performance Information Database

As a foundation for the strategy discussed in this chapter, you must develop a database of performance information for your system by running MONITOR continuously as a background process.

The SYS$EXAMPLES directory provides three command procedures you can use to establish the database. The following table describes the procedures:

Procedure	Description
SUBMON.COM	Starts MONITOR.COM as a detached process.
MONITOR.COM	Creates a summary file from the binary recording file of the previous boot, then begins recording for this boot. The recording interval is 10 minutes.
MONSUM.COM (VAX only)	Generates two OpenVMS Cluster multifile summary reports: one for the previous 24 hours and one for the previous day's prime-time period (9 a.m. to 6 p.m.). These reports are mailed to the system manager, and then the procedure resubmits itself to run each day at midnight.

When MONITOR data is recorded continuously, a summary report can cover any contiguous time segment.
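
One way to start the procedures at boot time is sketched below, as commands that could be placed in the site-specific startup procedure. The logical names SYS$MONITOR and MON$ARCHIVE and the directory they point to are assumptions based on how the supplied procedures are commonly set up; check the comments in the procedures themselves before relying on them.

```dcl
$ ! Assumed logical names for the recording and archive directories
$ DEFINE /SYSTEM /EXECUTIVE_MODE SYS$MONITOR SYS$MANAGER
$ DEFINE /SYSTEM /EXECUTIVE_MODE MON$ARCHIVE SYS$MANAGER
$ ! Start continuous recording through SUBMON.COM
$ @SYS$EXAMPLES:SUBMON
$ ! Queue the daily summary procedure (VAX only, per the table above)
$ SUBMIT SYS$EXAMPLES:MONSUM.COM
```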

4.4.4 Saving Your Summary Reports

The two multifile summary reports are not saved as files. To keep them, you must do either of the following:

Extract them from your mail file.
Alter the MONSUM.COM command procedure to save them.

4.4.5 Customizing Your Reports

The report you require for the evaluation procedure is one that covers a period that best represents the typical operation of your system. You might want, for example, to evaluate your system only during hours of peak activity.

To generate a summary of the appropriate time segment, edit the MONSUM.COM command procedure and change the beginning and ending times on one of the two MONITOR commands that produce the summary reports.
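
The MONITOR commands inside MONSUM.COM are not reproduced here. The following is only a sketch of the kind of multifile summary command involved, with the time window moved to yesterday's 9 a.m. to 6 p.m. period. The input file names and the output file name are hypothetical.

```dcl
$ ! Multifile summary: one column per input file (node)
$ MONITOR /INPUT=(NODEA.DAT,NODEB.DAT) /NODISPLAY -
          /SUMMARY=PRIME.SUM -
          /BEGINNING="YESTERDAY+9:00" /ENDING="YESTERDAY+18:00" -
          ALL_CLASSES
```

Changing only the two quoted time values selects a different segment of the same recorded data.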

4.4.6 Report Formats

The summary reports produced by MONSUM.COM are in the multifile summary format: there is one column of averages for each node in an OpenVMS Cluster, as well as some overall row statistics. For noncluster systems, the row statistics can be ignored.

If you prefer to use a report in the standard summary format (which includes current, minimum, and maximum statistics), execute a MONITOR playback summary command referencing the input data file of interest as the only file in the /INPUT list. Note that a new data file is created for each system whenever it reboots. Remember to use the /BEGINNING and /ENDING qualifiers to select the desired time period.
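
A standard summary request of this kind might look like the following sketch. NODEA.DAT stands in for the data file of interest, and the classes and times are illustrative assumptions.

```dcl
$ ! Single input file, so the report uses the standard summary format
$ MONITOR /INPUT=NODEA.DAT /NODISPLAY /SUMMARY=NODEA.SUM -
          /BEGINNING="YESTERDAY+9:00" /ENDING="YESTERDAY+18:00" -
          PAGE/ALL,IO/ALL
```

The /ALL class qualifier requests the table of current, average, minimum, and maximum statistics for each class.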

4.4.7 Using MONITOR in Live Mode

You are encouraged to observe current system activity regularly by running MONITOR in live mode. In live mode, always begin an analysis with the MONITOR CLUSTER and MONITOR SYSTEM classes to obtain an overview of system performance.

Then, monitor other classes to examine components of particular interest.
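
A live-mode session following this order could be sketched as below. The third class is only an example of a component of interest.

```dcl
$ MONITOR CLUSTER      ! overview of all nodes in the cluster
$ MONITOR SYSTEM       ! overview of the local node
$ MONITOR DISK         ! then examine a class of particular interest
```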

Note

All references to MONITOR items in this chapter are assumed to be for the average statistic, unless otherwise noted.

4.4.8 More About Multifile Reports

In multifile reports, a page or more is devoted to each MONITOR class. Each column represents one node, and is headed by the node name and beginning and ending times of the segment requested. In most cases, time segments for all nodes will be roughly the same. Differences of a few minutes are typical, because data collection on the various nodes is not synchronized.

In some cases, one or more time segments will be shorter than others; in these cases, some of the requested data was not recorded (probably because the nodes were unavailable). Note that if data is unavailable for some period within the bounds of a request, that fact is not explicitly specified.

However, such a gap can occur only when the column of data uses more than one input file; and if multiple files contributed to the column, the number is shown in parentheses to the right of the node name. In cases where a time segment is missing, this number must be greater than 1. If no number appears, there is only one input data file for that column, and the column includes no missing time segments.

To summarize, if all beginning and ending times are not roughly the same or if a parenthesized number appears, some data may be unavailable, and you may want to base your evaluation on a different time segment that includes more complete data. Whenever the multifile report is based on incomplete data, the Row Average statistic can be weighted unfairly in favor of one or more nodes.

4.4.9 Interpreting MONITOR Statistics

While interpreting MONITOR statistics, keep in mind that the collection interval has no effect on the accuracy of MONITOR rates. It does, however, affect levels, because they represent sampled data. In other words, the smaller the collection interval, the more accurate MONITOR level statistics will be. (For more information on MONITOR rates and levels, refer to the OpenVMS System Manager's Manual, Volume 2: Tuning, Monitoring, and Complex Systems.)

Although the interval value supplied with MONITOR.COM is adequate for most purposes, it does represent a trade-off between statistical accuracy and the consumption of disk space. Thus, before you base major decisions on MONITOR level statistics, be sure to verify them by running MONITOR for a time with a much smaller collection interval while carefully observing disk space usage.
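
A verification run of this kind can be sketched as follows. The 5-second interval, the one-hour duration, the STATES class, and the file name VERIFY.DAT are all assumptions chosen for illustration.

```dcl
$ ! Record level statistics at a much smaller interval for one hour
$ MONITOR /NODISPLAY /RECORD=VERIFY.DAT /INTERVAL=5 -
          /ENDING="+1:00:00" STATES
$ ! Check how much disk space the recording consumed
$ DIRECTORY /SIZE=ALL VERIFY.DAT
```

Recording only the classes under investigation, rather than ALL_CLASSES, helps limit the extra disk space a short interval requires.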

Chapter 5
Diagnosing Resource Limitations

This chapter describes how to track down system resources that can limit performance. When you suspect that your system performance is suffering from a limited resource, you can begin to investigate which resource is most likely responsible. In a correctly behaving system that becomes fully loaded, one of the three resources (memory, I/O, or CPU) becomes the limiting resource. Which resource assumes that role depends on the kind of load your system is supporting.

5.1 Diagnostic Strategy

Appendix A contains a number of decision trees for diagnosing limiting resources. Note that the diagrams include command recommendations to help you obtain required information. The recommended commands appear in parentheses below the description of the information required.

The procedures use the process of elimination to determine the source of performance problems. There are fairly simple tests you can use to rule out certain classes of problems.

Use the following guidelines when conducting your preliminary investigation:

You must be able to observe the undesirable behavior while you are running these tests. You can determine nothing with these methods unless your system is exhibiting the problem.
Be aware that it is possible to have overlapping limitations; that is, you could find a memory limitation and an I/O limitation occurring simultaneously.
You should be able to detect all major limitations for further resolution using the methods outlined in this section, repeating them as necessary.
Your final investigations might lead you to conclude that the real source of the problem is human error, possibly misuse of the resources by one or more users.

5.2 Investigating Resource Limitations

Your preliminary investigation can proceed by checking for the possibility of memory limitations, then I/O limitations, and finally a CPU limitation.

5.2.1 Memory Limitations

Memory limitations are manifestations of such diverse problems as too little physical memory for the work attempted, inappropriate use of the memory management features, improper assignments of memory resources to users, and so forth.

To determine if you may have memory limitations, use the DCL commands MONITOR IO or MONITOR PAGE as shown in the following table:

If you observe...	Then you...
A substantial amount of free memory ¹; little or no paging ²; little or no swapping ³	Can rule out memory limitations.
Significant inswapping; little free memory; significant paging	Should investigate memory limitations further. (See Chapter 7.)

¹See the entries for Free List Size and Modified List Size.
²See the Page Fault Rate.
³See the Inswap Rate.

You can also detect memory limitations by using SHOW SYSTEM to check for processes in the RW_FPG and RW_MPG wait states. If either state is displayed consistently, there is a serious shortage of memory, and tuning the system will yield very little improvement. Compaq recommends buying more memory.
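
The memory checks in this section can be sketched as the following commands. The comments name the items to watch; what counts as "significant" is left to the reader's judgment, as in the table.

```dcl
$ MONITOR PAGE     ! Page Fault Rate, Free List Size, Modified List Size
$ MONITOR IO       ! Inswap Rate, plus the free and modified list sizes
$ SHOW SYSTEM      ! watch for RW_FPG or RW_MPG appearing consistently
```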

5.2.2 I/O Limitations

I/O limitations occur when the number or speed of devices is insufficient. You will also find an I/O limitation when application design errors either place inappropriate demand on particular devices or do not employ sufficiently large blocking factors or numbers of buffers.

To determine if you may have an I/O limitation, enter the DCL command MONITOR IO or MONITOR SYSTEM and observe the rates for direct I/O and buffered I/O.

If...	Then you...
Your system is not performing any direct I/O	Do not have a disk I/O limitation.
You observe that there is no buffered I/O	Do not have a terminal I/O limitation.
Either or both operations are occurring	Cannot rule out the possibility of an I/O limitation. (See Chapter 8.)
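
As a brief sketch, either of the following commands shows the two rates used in the table:

```dcl
$ MONITOR IO       ! Direct I/O Rate and Buffered I/O Rate
$ MONITOR SYSTEM   ! overview display that includes both I/O rates
```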

5.2.3 CPU Limitations

The CPU can become the binding resource when the work load places extensive demand on it. Perhaps all the work becomes heavily computational, or there is some condition that gives unfair advantages to certain users.

To determine if there is a CPU limitation, use the DCL command MONITOR STATES.

You might also use the DCL command MONITOR MODES to observe the amount of user mode time. The MONITOR MODES display also reveals the amount of idle time, which is sometimes called the null time.

If...	Then...
Many of your processes are in the computable state	There is a CPU limitation.
Many of your processes are in the computable outswapped state	Be sure to address the issue of a memory limitation first. (See Section 9.2.4.)
The user mode time is high	A CPU limitation is likely.
There is almost no idle time	The CPU is being heavily used.

A final indicator of a CPU limitation that the MONITOR MODES display provides is the amount of kernel mode time. A high percentage of time in kernel mode can indicate excessive consumption of the CPU resource by the operating system. This problem is more likely the result of a memory limitation but could indicate a CPU limitation as well. If you decide to investigate the CPU limitation further, proceed through the steps in Chapter 9.
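
The CPU checks in this section can be sketched as follows. The third command is an addition not mentioned in the text above; it is included on the assumption that identifying the top CPU consumers helps when certain users appear to have an unfair advantage.

```dcl
$ MONITOR STATES           ! processes in the computable (COM, COMO) states
$ MONITOR MODES            ! user mode, kernel mode, and idle time
$ MONITOR PROCESSES /TOPCPU   ! which processes are using the most CPU
```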

5.3 After the Preliminary Investigation

When you have completed your preliminary investigation, you are ready to:

Isolate the cause of the observed behavior.
Conclude, in general terms, what remedies are available to you.
Apply one or more of the specific corrective procedures outlined in this chapter or in Chapter 10.

5.3.1 Observing the Tuned System

Once you take the appropriate remedial action, monitor the effectiveness of the changes and, if you do not obtain sufficient improvement, try again. In some cases, you will need to repeat the same steps, but either increase or decrease the magnitude of the changes you made. In other cases, you will proceed further in the investigation and uncover some other underlying cause of the problem and take corrective steps.

The diagrams and text do not attempt to depict this looping. Rather, repetition is always implied, pending the outcome of the changes. Therefore, tuning is frequently an iterative process. The approach to tuning presented by this chapter and Chapter 10 assumes that you can uncover multiple causes of performance problems by repeating the steps shown until you achieve satisfactory performance.

Note

Effective tuning requires that you can observe the undesirable performance behavior while you test.

5.3.2 Obtaining a Listing of System Current Values

You will find it especially helpful to keep a listing of the current values of all your system parameters nearby as you conduct the following investigations. Running SYSGEN and specifying a file name is one method for obtaining this listing. (See the OpenVMS System Manager's Manual, Volume 2: Tuning, Monitoring, and Complex Systems.)

$ RUN SYS$SYSTEM:SYSGEN
SYSGEN> SET/OUTPUT=filename
SYSGEN> SHOW/ALL
SYSGEN> SHOW/SPECIAL
SYSGEN> EXIT
$ PRINT/DELETE filename

Chapter 6
Managing System Resources

Overall responsiveness of a system depends largely on the responsiveness of its CPU, memory, and disk I/O resources. If each resource responds satisfactorily, then so will the entire system.

6.1 Understanding System Responsiveness

Each resource must operate efficiently by itself and it must also interact with other resources.

An important aspect of your evaluation is to distinguish between resources that might be performing poorly because they are overcommitted and those that might be doing so because one or both of the following conditions has occurred:

They are blocked by the overcommitted resource.
They are incurring additional overhead operations caused by the overcommitted resource.

6.1.1 Detecting Bottlenecks

A binding resource or bottleneck is an overcommitted resource that causes the others to be blocked or burdened with overhead operations. Proper identification of such a resource is critical to correction of a performance problem. Upgrading a nonbinding resource will do nothing to improve a bottlenecked system.

Detecting bottlenecks is particularly important for analyzing interactions of the CPU with each of the other resources.

Example

For example, CPU blockage occurs when CPU capacity, though it appears sufficient to meet demand, cannot be used because the CPU must wait for disk I/O to complete or memory to be allocated.

6.1.2 Balancing Resource Capacities

Because of the potential for bottlenecks, it is especially important to maintain balance among the capacities of your system's resources.

Example

For example, when upgrading to a faster CPU, consider the effect the additional CPU power will have on the other primary resources. Because the faster CPU can initiate more I/O requests per unit of time, you must ensure that the disk I/O subsystem has sufficient capacity to handle the increased traffic.

6.2 Evaluating Responsiveness of System Resources

For each resource, key MONITOR statistics help you answer such questions as:

How well is the resource responding to requests for service?
How well is the capacity of the resource meeting demand?
Does the resource have any excess capacity, and if so, can that capacity be attributed to blockage by another, overcommitted resource?

The two prime measures of resource responsiveness are:

The size of the queue of requests for service (compute queue)
The amount of time it takes the system to respond to those requests (response time)

For each resource, you can use MONITOR summaries to examine or estimate one or both of these quantities.

6.3 Improving Responsiveness of System Resources

You can investigate four main ways to improve responsiveness:

Provide equitable sharing: Is the resource shared equitably among processes?
Reduce resource consumption by the system: Can the system's consumption of a resource be reduced, thereby making more of that resource available to users? The effective amount of a resource available to users is that remaining after the operating system has used its portion.
Ensure load balancing: How well distributed is the demand for a resource? Can overall system responsiveness be improved, either by reconfiguring hardware or by better distributing the demand for it?
Initiate offloading: Can overall system responsiveness be improved by offloading some of the activity on a resource to other, less heavily used resource types?

Example

Excess memory capacity is often used to reduce the demand on an overworked disk I/O subsystem by increasing the size of each I/O transfer, thereby reducing the total number of I/O operations. The CPU benefits as well, because it needs to do less work executing system services and device driver software. The primary means of offloading I/O to memory is the extensive use of caches (page caches, XQP caches, virtual I/O or extended file caching, RMS blocking) to reduce the number of I/O operations.

If the responsiveness of a poorly performing resource cannot be improved by these methods, you should consider augmenting its capacity with additional or upgraded hardware.
