prof_intro(1)

NAME
  prof_intro - Introduction to application profilers, profiling,
  optimization, and performance analysis

DESCRIPTION
  Tru64 UNIX supports four approaches to performance improvement:

    ·  Automatic and profile-directed optimizations. For example:
	    pixie -update a.out data/*
	    cc -non_shared -O3 -spike -feedback a.out *.c

    ·  Manual design and code optimizations. For example:
	    hiprof -all -display program data/* | more
	    hiprof -flat -all -display program data/* | more
	    uprofile -heavy program data/* | more

    ·  Minimizing system-resource usage. For example:
	    third -display program data/* | more

    ·  Verifying significance of test cases. For example:
	    pixie -testcoverage program data/* | more

  One approach might be enough, but combining several can help when no single
  approach addresses all aspects of a program's performance. The following
  sections describe each approach and the tools that Tru64 UNIX provides to
  support it.

AUTOMATIC AND PROFILE-DIRECTED OPTIMIZATIONS
  Techniques

  Automatic and profile-directed optimizations are the simplest approaches to
  improving application performance.

  Some degree of automatic optimization can be achieved by using the
  compiler's and linker's optimization options. These can help in the
  generation of minimal instruction sequences that make best use of the CPU
  architecture and cache memory.

  However, the compiler and linker can improve their optimizations if they
  are given information on which instructions are executed most often when
  the program is run with its normal input data and environment. While the
  default optimizations give improved performance for most common situations,
  the optimizers can do even better if they can tune the program in favor of
  the heavily used instruction sequences as determined from a sample run.

  Tru64 UNIX helps you provide the optimizers with this information on
  processing hot-spots by allowing a profiler's results to be fed back into a
  recompilation. This customized, profile-directed optimization can be used
  in conjunction with automatic optimization.

  Tools and Examples

  The cc compiler command's automatic optimization options are selected with
  -O, -fast, -inline, -spike, and other related options. See cc(1) for
  details and Chapter 10 of the Programmer's Guide for more information on
  the many options and tradeoffs available.

  For example, this command selects a high degree of optimization in both the
  compiler and the linker:

       cc -non_shared -O3 -spike *.c

  The pixie profiler provides profile information that the cc command's
  -feedback and -spike options can use to tune the generated instruction
  sequences to the demands placed on the program by particular sets of input
  data.

  The steps, shown in the following example, consist of (1) preparing the
  program for profile-directed optimization, (2) creating an instrumented
  version of the program and running it to collect profiling statistics, and
  (3) feeding that information back to the compiler and linker to help them
  optimize the executable code:

       rm -f program
       cc -non_shared -feedback program -o program -O3 *.c
       pixie -update program
       cc -non_shared -feedback program -o program -O3 -spike *.c

  To apply profile-directed optimizations to shared libraries, generate
  profile data with an exerciser program, and store it in the shared library
  prior to recompiling with that feedback. For example:

       rm -f libexample.so
       cc -feedback libexample.so -o libexample.so -shared -O3 lib*.c
       cc -o exerciser exerciser.c -L. -lexample
       pixie -L. -incobj libexample.so -run exerciser
       prof -pixie -update libexample.so exerciser.Counts
       cc -spike -feedback libexample.so -o libexample.so -shared -O3 lib*.c

MANUAL DESIGN AND CODE OPTIMIZATIONS
  Techniques

  The effectiveness of the automatic optimizations described previously is
  limited by the efficiency of the algorithms that the program uses. A
  program's performance can be further improved by manually optimizing its
  algorithms and data structures. Such optimizations may include reducing
  complexity from N-squared to log-N, avoiding copying of data, and reducing
  the amount of data used. It may also extend to tuning the algorithm to the
  architecture of the particular machine it will be run on - for example,
  processing large arrays in small blocks such that each block remains in the
  data cache for all processing, instead of the whole array being read into
  the cache for each processing phase.
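
  The blocking technique mentioned above can be sketched in C as follows.
  This is a minimal illustration, not code from Tru64 UNIX: the function
  names, the two processing phases, and the BLOCK_ELEMS value are assumptions
  chosen for the example, and a suitable block size depends on the data cache
  of the target machine.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical block size; tune it so one block fits in the data cache. */
#define BLOCK_ELEMS 1024

/* Phase 1 (illustrative): scale each element. */
static void scale_phase(double *a, size_t n, double factor)
{
    size_t i;
    for (i = 0; i < n; i++)
        a[i] *= factor;
}

/* Phase 2 (illustrative): add an offset to each element. */
static void offset_phase(double *a, size_t n, double delta)
{
    size_t i;
    for (i = 0; i < n; i++)
        a[i] += delta;
}

/* Unblocked: each phase sweeps the whole array, so a large array is
   read into the cache once per phase. */
void process_whole(double *a, size_t n, double factor, double delta)
{
    scale_phase(a, n, factor);
    offset_phase(a, n, delta);
}

/* Blocked: both phases run on one small block before moving on, so the
   block can stay in the cache for all of its processing. */
void process_blocked(double *a, size_t n, double factor, double delta)
{
    size_t start, len;

    assert(a != NULL || n == 0);
    for (start = 0; start < n; start += BLOCK_ELEMS) {
        len = n - start;
        if (len > BLOCK_ELEMS)
            len = BLOCK_ELEMS;
        scale_phase(a + start, len, factor);
        offset_phase(a + start, len, delta);
    }
}
```

  Both functions compute the same result. Whether the blocked form is
  actually faster on a given machine is something a profile should confirm.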

  Tru64 UNIX supports manual optimization with its profiling tools, which
  identify the parts of the application that use the most CPU resources (CPU
  cycles, cache misses, and so on). By evaluating different profiles of a
  program, you can find those parts and then redesign or recode their
  algorithms to use fewer resources. The profiles also make this exercise
  more cost-effective by helping you to focus on the most demanding code
  instead of the least demanding code.

  Tools and Examples

  (a) CPU-Time Profiling with Call-Graph

  A call-graph profile shows how much CPU time is used by each procedure, and
  how much is used by all of the other procedures that it calls. This can
  show which phases or subsystems in a program spend most of the total CPU
  time, which can help in gaining a general understanding of the program's
  performance.
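
  The two figures in such a profile (time spent in a procedure itself, and
  time spent in it plus everything it calls) are related as in the following
  C sketch. The struct proc_node type and the total_seconds function are
  hypothetical names invented for this illustration, and the sketch assumes a
  simple call tree with no recursion and no shared callees. It does not show
  how any profiler stores its data.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_CALLEES 4   /* arbitrary limit for this sketch */

/* Hypothetical node in a call tree. */
struct proc_node {
    const char *name;                        /* procedure name */
    double self_seconds;                     /* time in the procedure itself */
    struct proc_node *callees[MAX_CALLEES];  /* procedures it calls */
    size_t ncallees;
};

/* Total time for a procedure: its own time plus the total time of every
   procedure it calls. Assumes a tree (no cycles). */
double total_seconds(const struct proc_node *p)
{
    double total;
    size_t i;

    assert(p != NULL);
    assert(p->ncallees <= MAX_CALLEES);
    total = p->self_seconds;
    for (i = 0; i < p->ncallees; i++)
        total += total_seconds(p->callees[i]);
    return total;
}
```

  A procedure with a small self time but a large total is a sign that the
  cost lies in the phase or subsystem it drives, not in its own code.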

  The hiprof profiler instruments the program and records a call graph while
  the instrumented program executes. The hiprof profiler does not require
  that the program be compiled in any particular way, but the names of local
  (for example, static) procedures will be hidden if the cc command's default
  -g0 option was used, and procedures will be hidden if they are inlined. For
  example:

       cc -g1 -O2 -o program *.c
       hiprof -all -display program data/* | more

  By default, hiprof uses a low-frequency sampling technique. It can profile
  all of the code executed by the program, including all selected libraries,
  though its call graph excludes procedures in threads-related system
  libraries. It can also provide detailed profiles at the level of source
  lines or machine instructions.

  For non-threaded programs, hiprof can alternatively count the number of
  machine cycles used or page faults that occur during program execution. In
  these modes, the CPU time or page-faults count reported for the
  instrumented routines includes that for the uninstrumented routines that
  they call. This can summarize the costs and reduce the run-time overhead,
  but note that the machine-cycle counter wraps if no instrumented procedure
  is called at least every few seconds.

  The cc compiler's -pg option uses the same sampling technique as hiprof.
  This technique is supported in a very similar way on different vendors'
  UNIX systems. For example:

       cc -g1 -O2 -pg -o program *.c
       ./program data/*
       gprof program gmon.out | more

  However, hiprof may be preferred because the -pg option has some
  disadvantages:

    ·  The program needs to be specially compiled with the -pg option.

    ·  Only a few of the archive libraries that are provided with the
       operating system were compiled to generate a call-graph profile.

    ·  Only the executable is profiled. Shared libraries are not.

  The optional dxprof command provides a graphical display of various call-
  graph profiles.

  (b) CPU-Time/Event Profiles for Source Lines/Instructions

  A good performance-improvement strategy may start with a procedure-level
  profile of the whole program (perhaps with a call graph too, to give the
  big picture), but it will often progress to detailed profiling of
  individual source-lines and instructions.

  The uprofile profiler uses a sampling technique to generate a profile of
  the CPU time or events such as cache misses associated with each procedure
  or source-line or instruction. The sampling frequency depends on the
  processor type and the statistic being sampled, but for CPU time it is on
  the order of a millisecond.  The profiler achieves this without modifying
  the target program at all by using hardware counters that are built into
  the Alpha CPU.  Running the uprofile command with no arguments yields a
  list of all the kinds of events that a particular machine can profile,
  depending on the nature of its architecture. The default is to profile
  machine cycles, resulting in a CPU-time profile. The following example
  shows how to display a profile of the source lines that experienced the top
  90% of data cache misses on an EV56 Alpha:

       cc -g1 -O2 -o program *.c
       uprofile -h -q 90cum% dcacheldmisses program data/* | more

  This technique has the advantage of very low run-time overhead. Also, the
  detailed information it can provide on the costs of executing individual
  instructions or source lines is essential in identifying exactly which
  operation in a procedure is slowing down the program.

  The disadvantages of uprofile are that only executables can be profiled,
  the results can be skewed unless all processors have the same cycle speed,
  only one program can be profiled with the hardware counters at one time,
  threads cannot be profiled individually, and the Alpha EV6 architecture's
  execution of instructions out of sequence can significantly reduce the
  accuracy of fine-grained profiles.

  If hiprof's -flat option is used, its default sampling technique can
  provide the same fine-grain profiles (CPU time only) and low intrusiveness
  as uprofile. Also, it is accurate even with mixed processor cycle speeds,
  and it can profile all of a program's shared libraries as well as its
  individual threads. For example:

       hiprof -flat -h -all program data/* | more

  The cc compiler's -p option uses the same low-frequency sampling technique
  as hiprof. It is common to many UNIX systems, and (on Tru64 UNIX) it is
  able to profile all the shared libraries used by a program. The program
  needs to be relinked with the -p option, but it does not need to be
  recompiled from source, so long as the original compilation used an
  acceptable debug level, such as the -g1 compiler option. For example, to
  profile individual instructions of a program:

       cc -p -o program *.o
       setenv PROFFLAGS '-all -stride 1'
       ./program data/*
       prof -all -asm -quit 5% program mon.out | more

  The pixie tool can also profile source lines and instructions (including
  shared libraries), but note that when it displays counts of "Cycles", it is
  actually reporting counts of instructions executed, not machine cycles. For
  example:

       cc -g1 -O2 -o program *.c
       pixie -all -lines -quit 20 program data/* | more

  The optional dxprof command provides a graphical display of profiles
  collected by either pixie or the cc command's -p option.

MINIMIZING SYSTEM RESOURCE USAGE
  Techniques

  The preceding techniques can improve an application's use of just the CPU.
  Further performance improvements can be made by improving the efficiency
  with which the application uses the other components of the computer
  system: heap memory, disk files, network connections, and so on.

  As with CPU profiling, the first phase of a resource usage improvement
  process is to monitor how much memory, data I/O and disk space, elapsed
  time, and so on, is used. Then the throughput of the computer can be
  increased or tuned in ways that help the program, or the program's design
  can be tuned to make better use of the computer resources that are
  available. For example:

    ·  Reduce the size of the data files that the program reads and writes.

    ·  Use memory-mapped files instead of regular I/O.

    ·  Allocate memory incrementally on demand instead of allocating at
       start-up the maximum that could be required.

    ·  Fix heap leaks, and do not leave allocated memory unused.

  See the System Configuration and Tuning manual for a broader discussion of
  analyzing and tuning a Tru64 UNIX system.
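
  As one illustration of allocating memory incrementally on demand, the
  following C sketch grows a buffer only when it fills up, instead of
  reserving a worst-case size at start-up. The struct int_buf type, its
  functions, the initial capacity of 16, and the doubling policy are all
  assumptions made for this example.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical growable array of ints. */
struct int_buf {
    int *data;
    size_t len;   /* elements in use */
    size_t cap;   /* elements allocated */
};

void int_buf_init(struct int_buf *b)
{
    assert(b != NULL);
    b->data = NULL;   /* nothing is allocated until the first push */
    b->len = 0;
    b->cap = 0;
}

/* Append one value. Returns 0 on success, -1 if memory is unavailable
   (the existing contents remain valid in that case). */
int int_buf_push(struct int_buf *b, int value)
{
    assert(b != NULL);
    if (b->len == b->cap) {
        size_t new_cap = b->cap ? b->cap * 2 : 16;
        int *p = realloc(b->data, new_cap * sizeof *p);
        if (p == NULL)
            return -1;
        b->data = p;
        b->cap = new_cap;
    }
    b->data[b->len++] = value;
    return 0;
}

/* Release the storage so that it is not left allocated but unused. */
void int_buf_free(struct int_buf *b)
{
    assert(b != NULL);
    free(b->data);
    int_buf_init(b);
}
```

  With this policy, storing 100 values allocates room for 128, whereas a
  fixed start-up allocation would have to cover the largest input expected.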

  Tools and Examples

  (a) System Monitors

  The Tru64 UNIX base system commands ps u, swapon -s, and vmstat 3 can show
  the currently active processes' usage of system resources such as CPU time,
  physical and virtual memory, swap space, page faults, and so on.

  The optional pview command provides a graphical display of similar
  information for the processes that comprise an application.

  The time commands provided by the Tru64 UNIX system and command shells
  provide an easy way to measure the total elapsed time and CPU time for a
  program and its descendants.

  The collect tool is an optional, low-overhead system performance monitor.

  Many other related commands are described in the System Configuration and
  Tuning manual.

  (b) Heap Memory Analyzers

  The third command reports heap memory leaks in a program, by instrumenting
  it with the Third Degree memory-usage checker, running it, and displaying a
  log of leaks detected at program exit. For example:

       third -display program data/* | more

  If you are interested only in leaks occurring during the normal operation
  of the program, not during startup or shutdown, you can specify additional
  places to check for previously unreported leaks. For example, the pre-
  shutdown leak report will give this information:

       third -display -after startup -before shutdown program data/* | more

  Third Degree can also detect various kinds of bugs that may be affecting
  the correctness or performance of a program. See the Programmer's Guide for
  further details on debugging and leak-detection.
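
  The following C sketch shows the kind of defect that a heap leak report
  typically points at: an early return that skips a call to free. The
  functions are hypothetical and written only for this illustration. No tool
  output is shown because none is reproduced here.

```c
#include <assert.h>
#include <ctype.h>
#include <stdlib.h>
#include <string.h>

/* Counts vowels in s using a lowercased working copy.
   Returns -1 if memory is unavailable.
   LEAKS the copy when s is the empty string. */
int count_vowels_leaky(const char *s)
{
    size_t i;
    int n = 0;
    char *copy;

    assert(s != NULL);
    copy = malloc(strlen(s) + 1);
    if (copy == NULL)
        return -1;
    strcpy(copy, s);
    if (copy[0] == '\0')
        return 0;               /* BUG: copy is never freed on this path */
    for (i = 0; copy[i] != '\0'; i++) {
        copy[i] = (char)tolower((unsigned char)copy[i]);
        if (strchr("aeiou", copy[i]) != NULL)
            n++;
    }
    free(copy);
    return n;
}

/* Same result, but every path that allocated the copy also frees it. */
int count_vowels_fixed(const char *s)
{
    size_t i;
    int n = 0;
    char *copy;

    assert(s != NULL);
    copy = malloc(strlen(s) + 1);
    if (copy == NULL)
        return -1;
    strcpy(copy, s);
    for (i = 0; copy[i] != '\0'; i++) {
        copy[i] = (char)tolower((unsigned char)copy[i]);
        if (strchr("aeiou", copy[i]) != NULL)
            n++;
    }
    free(copy);                 /* single exit: the empty string is covered */
    return n;
}
```

  A leak like this one appears only when the test data include the unusual
  input, which is one reason to check what the test runs actually exercise.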

  The optional dxheap command provides a graphical display of Third Degree's
  heap and bug reports.

  The optional mview command provides a graphical analysis of heap usage over
  time. This view of a program's heap can clearly show the presence (if not
  the cause) of significant leaks or other undesirable trends such as wasted
  memory.

VERIFYING SIGNIFICANCE OF TEST CASES
  Techniques

  Most of the preceding profiling techniques are effective only if you
  profile and optimize or tune the parts of the program that are executed in
  the scenarios whose performance is important. Careful selection of the data
  used for the profiled test runs is often sufficient, but you may want a
  quantitative analysis of which code was and was not executed in a given set
  of tests.
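
  The idea behind such an analysis can be sketched by hand in C: count how
  often each block of code runs, then list the blocks whose count is still
  zero after the tests. The HIT macro, the block identifiers, and the
  functions below are hypothetical stand-ins written for this illustration.
  They are not how any Tru64 UNIX tool is implemented, and a real tool adds
  its counters automatically.

```c
#include <assert.h>
#include <stddef.h>

/* One identifier per block of code that is being tracked. */
enum { BLK_ENTRY, BLK_NEGATIVE, BLK_ZERO, BLK_POSITIVE, BLK_COUNT };

static unsigned long hits[BLK_COUNT];   /* execution count per block */

#define HIT(id) (hits[(id)]++)

/* Example function with hand-placed counters. */
int sign(int x)
{
    HIT(BLK_ENTRY);
    if (x < 0) {
        HIT(BLK_NEGATIVE);
        return -1;
    }
    if (x == 0) {
        HIT(BLK_ZERO);
        return 0;
    }
    HIT(BLK_POSITIVE);
    return 1;
}

/* Execution count recorded so far for one block. */
unsigned long block_hits(int id)
{
    assert(id >= 0 && id < BLK_COUNT);
    return hits[id];
}

/* Number of tracked blocks that no test has executed yet. */
size_t unexecuted_blocks(void)
{
    size_t i, n = 0;

    for (i = 0; i < BLK_COUNT; i++)
        if (hits[i] == 0)
            n++;
    return n;
}
```

  If the test data contain only positive and negative values, the block for
  zero is never counted, so any profile taken from those runs says nothing
  about its cost.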

  Tools and Examples

  The pixie command's -t[estcoverage] option reports lines of code that were
  not executed in a given test run. For example:

       pixie -t program data/* | more

  Conversely, pixie's -p[rocedure], -h[eavy], and -a[sm] options show which
  procedures, source lines, and instructions were executed.

  If multiple test runs are needed to build up a typical scenario, the prof
  command can be run separately on a set of profile data files:

       pixie -pids program
       ./program.pixie data1/*
       ./program.pixie data2/*
       prof -pixie -t program program.Counts.*

SEE ALSO
  Optimizing:   cc(1), spike(1)

  Profiling:   hiprof(1), pixie(1), third(1), uprofile(1)

  System Monitoring:   collect(8), ps(1), swapon(1), vmstat(1)

  Graphical tools, available from the Graphical Program Analysis subset of
  the Tru64 UNIX Associated Products installation media, or as part of the
  Enterprise Toolkit for Windows/NT desktops with Microsoft's Visual Studio
  97: dxheap(1), dxprof(1), mview(1), pview(1)

  Programmer's Guide

  System Configuration and Tuning