This chapter discusses the following topics:
The Tru64 UNIX kernel (Section 5.1)
Symmetric multiprocessing (Section 5.2)
Nonuniform memory access (NUMA) (Section 5.3)
The virtual memory subsystem (Section 5.4)
Tru64 UNIX support for various devices (Section 5.5)
5.1 The Tru64 UNIX Kernel

The kernel manages the Tru64 UNIX system resources. It can be adjusted for maximum performance by setting system attributes. Kernel tuning and debugging tools allow the examination of these attributes.
5.1.1 Kernel Tuning
Tru64 UNIX includes various subsystems that are used to define or extend the kernel. Kernel variables control subsystem behavior or track subsystem statistics after boot time.
Kernel variables are assigned default values at boot time. For certain configurations and workloads, especially memory-intensive or network-intensive systems, the default values of some attributes may not be appropriate; in such cases, you can modify these values to provide optimal performance.
Although you can use a debugger to directly change kernel variable values on a running kernel, HP recommends that you use kernel subsystem attributes to access the kernel variables. See sys_attrs(5).
Subsystem attributes are managed by the configuration manager server, cfgmgr. You can display and modify attributes by using the sysconfig and sysconfigdb commands and by using the Kernel Tuner, dxkerneltuner. In some cases, you can use the sysconfig command to modify attributes while the system is running. You use sysconfigdb to make changes that will be preserved when the system is rebooted.
For more information, see the System Configuration and Tuning manual and the sysconfig(8) and sysconfigdb(8) reference pages.

5.1.2 Enhanced Kernel Debugging
The dbx debugger is a symbolic debugger that enables you to examine, modify, and display a kernel's variables and data structures. With dbx you can perform the following tasks:
Debug stripped images
Examine memory contents
Display the values of kernel variables and the value and format of kernel data structures
Debug multiple threads
Debug kernel core files using the -k option
Perform breakpoint debugging of a running kernel across a serial line using the -remote option
A front end to dbx, called kdbx, supports the entire suite of dbx commands, in addition to a C library API that allows programmers to write C programs to extract and format kernel data more easily than they can with just dbx -k or dbx -remote.
The dbx debugger is a command-line program. The ladebug debugger, an alternative debugger, provides both command-line and graphical user interfaces.
For more information, see the Kernel Debugging manual, the Programmer's Guide, the Ladebug Debugger Manual, and the ladebug(1), dbx(1), and kdbx(8) reference pages.

5.2 Symmetric Multiprocessing
Symmetric multiprocessing (SMP) is the ability of two or more processes (or multiple threads of a threaded application) to execute simultaneously on two or more CPUs. This concurrency of execution greatly improves performance. Additionally, it provides the opportunity to extend the life and increase the cost-effectiveness of multiprocessor systems by adding CPU cards (and their compute power) to multiprocessors rather than buying more systems.
Tru64 UNIX supports an implementation of SMP that is designed to optimize the performance of compute servers (systems dedicated to compute-bound, multithreaded applications) and data servers (file servers, DBMS servers, TP systems, and mail routers that serve a large number of network clients). The operating system also supports multithreaded application development in an SMP environment. Note that SMP does not adversely affect using a multiprocessor as a timesharing system.
Tru64 UNIX SMP uses the following:
Simple locks (also called spin locks, because they "spin" for a specified period of time waiting for held locks to be freed before timing out).
Complex locks (read/write locks that can block waiting for a lock to be freed).
Funneling, which is used in rare cases where locks would not be of benefit. Funneling forces a process to execute on a specific CPU; it is typically used for legacy programs that are not thread and multiprocessor-safe.
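The difference between simple (spinning) and complex (blocking) locks can be sketched in a few lines. This is a conceptual illustration in Python, not Tru64 kernel code; the class names, the timeout value, and the use of a `threading.Lock` as a stand-in for an atomic word are all assumptions made for the sketch.

```python
import threading
import time

class SimpleLock:
    """Conceptual spin lock: busy-waits for a bounded time (illustrative only)."""
    def __init__(self):
        self._flag = threading.Lock()  # stand-in for an atomic test-and-set word

    def acquire(self, spin_timeout=0.01):
        deadline = time.monotonic() + spin_timeout
        while time.monotonic() < deadline:
            if self._flag.acquire(blocking=False):  # test-and-set attempt
                return True
        return False  # timed out while spinning

    def release(self):
        self._flag.release()

class ComplexLock:
    """Conceptual read/write lock: the caller blocks (sleeps) instead of spinning."""
    def __init__(self):
        self._cond = threading.Condition()
        self._readers = 0
        self._writer = False

    def acquire_read(self):
        with self._cond:
            while self._writer:
                self._cond.wait()       # block until the writer releases
            self._readers += 1

    def release_read(self):
        with self._cond:
            self._readers -= 1
            self._cond.notify_all()

    def acquire_write(self):
        with self._cond:
            while self._writer or self._readers:
                self._cond.wait()       # block until all readers/writer release
            self._writer = True

    def release_write(self):
        with self._cond:
            self._writer = False
            self._cond.notify_all()
```

The key distinction the sketch shows: a simple lock burns CPU cycles and gives up after a deadline, while a complex lock puts the waiter to sleep until the lock is free.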
Tru64 UNIX SMP achieves as much concurrency as possible by reducing the size of the system state that must be protected by locks, thereby reducing the necessity for locks and their overhead.
Tru64 UNIX, including its kernel, is fully parallel, so multiple processes or multiple threads can run simultaneously on multiple CPUs. The operating system uses its locking strategy to ensure the integrity of shared kernel data structures. Multiple processes and multiple threads can access the same kernel data structures, but Tru64 UNIX makes sure that this access is performed in a logical order, so that processes and threads cannot hold and request each other's locks in a way that deadlocks the system.
Tru64 UNIX SMP also supports processor binding, the ability to bind a particular process to a specified CPU, and load balancing, whereby the scheduler attempts to distribute all runnable processes across all available CPUs. (Note that load balancing will not override processor binding.)
To improve performance, the scheduler also attempts to execute each process on the last CPU where it ran to take advantage of any state that may be left in that CPU's cache.
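The interplay of binding, load balancing, and cache affinity described above can be modeled roughly as a selection function. This is a simplified illustration, not the actual Tru64 scheduler; the data layout and the "within one runnable process" affinity tolerance are assumptions of the sketch.

```python
def pick_cpu(process, cpus, load):
    """Pick a CPU for a runnable process (simplified model).

    process: dict with 'bound_cpu' (or None) and 'last_cpu'
    cpus:    list of available CPU ids
    load:    dict mapping CPU id -> number of runnable processes
    """
    # Processor binding always wins over load balancing.
    if process.get("bound_cpu") is not None:
        return process["bound_cpu"]
    # Prefer the CPU the process last ran on, to reuse warm cache state,
    # unless it is noticeably busier than the least-loaded CPU.
    least = min(cpus, key=lambda c: load[c])
    last = process.get("last_cpu")
    if last in cpus and load[last] <= load[least] + 1:
        return last
    return least  # otherwise balance the load
```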
SMP is configurable and any of the following five modes can be configured at system boot time:
Uniprocessing
Optimized real-time preemption
Optimized SMP
Optimized real-time preemption and SMP
Lock debug mode
When uniprocessing is set, only those locks required to support multiple threads are initialized into the kernel at system boot time.
When lock debug mode is set, the system:
Checks the lock hierarchy and minimum system priority level (SPL)
Stores debugging information by classes and maintains lock statistics
Records the simple locks that are held by each CPU in CPU-specific arrays
Records all of the complex locks that a thread is holding in the thread structure
You can use the dbx debugger to access this information for debugging.
In addition, the development environment supports multithreaded application development. The dbx, profile, and pixie utilities support multiple threads, and the system includes thread-safe libraries.

For information on the Tru64 UNIX development environment and the threads package that Tru64 UNIX supports, see the Programmer's Guide and the Guide to the POSIX Threads Library. For information on configuring SMP, see the System Administration and System Configuration and Tuning manuals.
5.3 Nonuniform Memory Access (NUMA)
Symmetric multiprocessor (SMP) systems typically provide one interconnect, either a bus or a switch, that links all system resources. This means that all CPUs in the system are subject to the same latency and bandwidth restrictions when accessing the system's memory and I/O channels. The drawback of the architecture of traditional SMP systems is that scaling the system to large numbers of CPUs causes the system bus to become a performance bottleneck.
One way to address this bottleneck is to build a system from SMP blocks (each with a limited number of CPUs, memory arrays, and I/O ports) and add a second-level bus or switch to connect the blocks. Nonuniform memory access (NUMA) is the term used to describe this type of system architecture, because it results in bandwidth and latency differences, depending on whether a particular CPU accesses memory and I/O resources locally (in the same building block where the CPU resides) or remotely (in another building block).
For a complete discussion of NUMA support, including a program example, see the NUMA Overview, which is available from the Tru64 UNIX Documentation Web site with the Version 5.1B documentation:
http://www.tru64unix.compaq.com/docs/base_doc/DOCUMENTATION/V51_HTML/NUMA/TITLE.HTM
5.4 The Virtual Memory Subsystem

The virtual memory subsystem performs the following functions:
Allocates memory to processes
Tracks and manages all the pages in the system
Uses paging and swapping to ensure that there is enough memory for processes to run and to cache file system I/O
The total amount of physical memory is determined by the capacity of the memory boards installed in your system. The system distributes this memory in 8 KB units called pages. The system distributes pages of physical memory among three areas:
Wired memory

Memory is wired statically at boot time and dynamically at run time.
At boot time, the operating system and the Privileged Architecture Library (PAL) code wire a contiguous portion of physical memory to perform basic system operations. Static wired memory is reserved for operating system data and text, system tables, the metadata buffer cache, which temporarily holds recently accessed UNIX File System (UFS) and CD-ROM File System (CDFS) metadata, and the Advanced File System (AdvFS) buffer cache. Static wired memory cannot be reclaimed through paging.
Virtual memory

The virtual memory subsystem uses a portion of physical memory to cache processes' most recently accessed anonymous memory (modifiable virtual address space) and file-backed memory. The subsystem allocates memory to competing processes and tracks the distribution of all the physical pages. This memory can be reclaimed through paging and swapping.
Unified Buffer Cache
The Unified Buffer Cache (UBC) uses a portion of physical memory to cache most recently accessed file system data. The UBC contains actual file data for reads and writes and for page faults from mapped file regions and also AdvFS metadata. By functioning as a layer between the operating system and the storage subsystem, the UBC can decrease the number of disk operations. This memory can be reclaimed through paging.
The virtual memory subsystem and the UBC compete for the physical pages that are not wired. Pages are allocated to processes and to the UBC, as needed. When the demand for memory increases, the oldest (least recently used) pages are reclaimed from the virtual memory subsystem and the UBC, are moved to swap space, and are then reused. Various attributes control the amount of memory available to the virtual memory subsystem and the UBC and the rate of page reclamation.
5.4.1 Managing and Tracking Pages
The virtual memory subsystem allocates physical pages to processes and the UBC, as needed. Because physical memory is limited, these pages must be periodically reclaimed so that they can be reused.
The virtual memory subsystem uses page lists to track the location and age of all the physical memory pages. At any one time, each physical page can be found on one of the following lists:
Wired list
Pages that are wired and cannot be reclaimed.
Free list
Pages that are clean and are not being used. The size of this list controls when page reclamation occurs.
Active list
Pages that are being used by the virtual memory subsystem or the UBC.
To determine which pages should be reclaimed first, the page stealer daemon identifies the oldest pages on the active list and designates the least recently used pages as follows:
Inactive pages are the oldest pages that are being used by the virtual memory subsystem.
UBC least recently used pages are the oldest pages that are being used by the UBC.
Tru64 UNIX virtual memory is NUMA aware. It maintains a separate set of page lists and worker threads per RAD (Resource Affinity Domain).
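The page lists and the aging of active pages can be pictured with a toy model. This is an illustrative sketch only, not kernel code; the class, its methods, and the "vm"/"ubc" owner tags are invented for the example.

```python
from collections import OrderedDict

class PageLists:
    """Toy model of the wired, free, and active page lists (illustrative only)."""
    def __init__(self, total_pages, wired_pages):
        self.wired = set(range(wired_pages))              # never reclaimed
        self.free = list(range(wired_pages, total_pages)) # clean, unused pages
        self.active = OrderedDict()                       # page -> owner, oldest first

    def allocate(self, owner):
        page = self.free.pop()        # take a clean page from the free list
        self.active[page] = owner     # newest pages go to the end
        return page

    def touch(self, page):
        self.active.move_to_end(page) # mark as most recently used

    def oldest(self, owner):
        # Oldest pages belonging to 'owner': "vm" gives the inactive pages,
        # "ubc" gives the UBC least recently used pages.
        return [p for p, o in self.active.items() if o == owner]

    def reclaim(self, page):
        del self.active[page]         # page stealer takes the page...
        self.free.append(page)        # ...and returns it to the free list
```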
5.4.2 Prewriting Modified Pages
The virtual memory subsystem attempts to keep memory pages clean to ease recovery from memory shortages. When the virtual memory subsystem anticipates that the pages on the free list will soon be depleted, it prewrites to swap space the oldest modified (dirty) inactive pages. In addition, when the number of modified UBC least recently used pages exceeds 10 percent of the total UBC least recently used pages, the virtual memory subsystem prewrites to swap space the oldest modified UBC least recently used pages.
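The 10 percent trigger amounts to a simple check. The function below is an illustrative sketch; the parameter names are made up, and only the 10 percent default comes from the text above.

```python
def should_prewrite_ubc(ubc_lru_total, ubc_lru_dirty, threshold_pct=10):
    """Return True when dirty UBC LRU pages exceed the prewrite threshold.

    ubc_lru_total: number of UBC least recently used pages
    ubc_lru_dirty: how many of those are modified (dirty)
    threshold_pct: hypothetical knob; 10 percent is the documented default
    """
    if ubc_lru_total == 0:
        return False
    # Integer comparison avoids floating-point rounding at the boundary.
    return ubc_lru_dirty * 100 > ubc_lru_total * threshold_pct
```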
5.4.3 Using Attributes to Control Paging and Swapping
When the demand for memory significantly depletes the free list, paging begins. The virtual memory subsystem takes the oldest inactive pages and UBC least recently used pages, moves the contents of the modified pages to swap space, and puts the clean pages on the free list, where they can be reused.
If the free page list cannot be replenished by reclaiming individual pages, swapping begins. Swapping temporarily suspends processes and moves entire resident sets to swap space, which frees large amounts of physical memory.
The point at which paging and swapping start and stop depends on the values of various tunable virtual memory subsystem kernel attributes.
Because the UBC competes with the virtual memory subsystem for the physical pages that are not wired by the kernel, the allocation of memory to the UBC can affect file system performance and paging and swapping activity. The UBC is dynamic and consumes varying amounts of memory while responding to changing file system demands.
By default, the UBC can consume up to 100 percent of memory. However, part of the memory allocated to the UBC is only borrowed from the virtual memory subsystem. When paging starts, borrowed UBC pages are the first to be reclaimed. The amount of memory allocated to the UBC can be controlled by various virtual memory subsystem kernel attributes.
5.4.4 Paging Operation
When the memory demand is high and the number of pages on the free page list falls below the paging threshold, the virtual memory subsystem uses paging to replenish the free page list. The page reclamation code controls paging and swapping. The page out daemon and task swapper daemon are extensions of the page reclamation code.
The page reclamation code activates the page stealer daemon, which first reclaims the pages that the UBC has borrowed from the virtual memory subsystem, until the size of the UBC reaches the borrowing threshold. (The default is 20 percent.) If the reclaimed pages are dirty (modified), their contents must be written to disk before the pages can be moved to the free page list. Freeing borrowed UBC pages is a fast way to reclaim pages, because UBC pages are usually unmodified.
If freeing UBC borrowed memory does not sufficiently replenish the free list, a page out occurs. The page stealer daemon reclaims the oldest inactive pages and UBC least recently used pages.
Paging becomes increasingly aggressive if the number of free pages continues to decrease. If the number of pages on the free page list falls below 20 pages (the default), a page must be reclaimed for each page taken from the list. To prevent deadlocks, when the number of pages on the free page list falls below 10 pages (the default), only privileged tasks can get memory until the free page list is replenished. Both these limits are controlled by tunable attributes.
Page out stops when the number of pages on the free list rises above the paging threshold. If paging individual pages does not sufficiently replenish the free list, swapping is used to free a large amount of memory.
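The escalation of paging described above can be summarized as a small decision function. This is a conceptual model: the 20-page and 10-page defaults come from the text, while the 128-page paging threshold and the state names are placeholders invented for the sketch (the real thresholds are tunable attributes).

```python
def memory_state(free_pages, paging_threshold=128,
                 reserve_threshold=20, privileged_threshold=10):
    """Classify memory pressure from the free-list size (toy model)."""
    if free_pages < privileged_threshold:
        return "privileged-only"   # only privileged tasks can get memory
    if free_pages < reserve_threshold:
        return "reclaim-per-page"  # one page reclaimed for each page handed out
    if free_pages < paging_threshold:
        return "paging"            # page stealer replenishes the free list
    return "normal"                # page out stops above the paging threshold
```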
5.4.5 Swapping Operation
If there is a high demand for memory, the virtual memory subsystem may be unable to replenish the free list by reclaiming pages. Swapping reduces the demand for physical memory by suspending processes, which dramatically increases the number of pages on the free list. To swap out a process, the task swapper suspends the process, writes its resident set to swap space, and moves the clean pages to the free list. Swapping can have a serious impact on system performance.
Idle task swapping begins when the number of pages on the free list falls below the swapping threshold (the default is 74 pages) for a period of time. The task swapper then suspends all tasks that have been idle for 30 seconds or more.
If the number of pages on the free list continues to decrease, hard swapping begins. The task swapper suspends, one at a time, the tasks with the lowest priority and the largest resident set size.
Swapping of an individual task stops when the number of pages on the free list reaches the high water swapping threshold. (The default is 1280.)
A swap in occurs when the number of pages on the free list has been sufficiently replenished for a period of time. The task's working set is paged in from swap space and it can now execute. By default, a task must remain in the swapped-in state for one second before it can be swapped out.
Increasing the rate of swapping (swapping earlier during page reclamation) increases throughput: as more processes are swapped out, fewer processes compete for memory, so the processes that remain can accomplish more work. Although swapping earlier moves long-sleeping threads out of memory and frees memory, it degrades interactive response time, because swapped-out processes have a long latency.
Decreasing the rate of swapping (swapping later during page reclamation) improves interactive response time, but at the cost of throughput.
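The idle-swap and swap-in behavior described in this section follows the same threshold pattern. The sketch below is illustrative, not kernel code: the 74-page swapping threshold, the 1280-page high water mark, and the 30-second idle limit are the documented defaults, but the function and its return values are invented for the example.

```python
def swap_decision(free_pages, idle_seconds,
                  swap_threshold=74, high_water=1280, idle_limit=30):
    """Decide the task swapper's action for one idle task (toy model)."""
    if free_pages < swap_threshold and idle_seconds >= idle_limit:
        return "swap-out"          # idle task swapping begins
    if free_pages >= high_water:
        return "eligible-swap-in"  # free list replenished; task may be paged back in
    return "leave"
```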
5.4.6 Swap Space Allocation Mode
You can use two modes to allocate swap space. The modes differ in how the virtual memory subsystem reserves swap space for anonymous memory (modifiable virtual address space). Anonymous memory is memory that is not backed by a file, but is backed by swap space (for example, stack space, heap space, and memory allocated by the malloc function). Neither mode has an inherent performance benefit.
Immediate mode
This mode reserves swap space when a process first allocates anonymous memory. Immediate mode is the default swap space allocation mode and is also called eager mode.
This mode may cause the system to reserve an unnecessarily large amount of swap space for processes. However, it ensures that swap space will be available to processes if it is needed. If swap space cannot be reserved, the requesting process is terminated.
Deferred mode
This mode reserves swap space only if the virtual memory subsystem needs to write a modified virtual page to swap space. It postpones the reservation of swap space for anonymous memory until it is actually needed. Deferred mode is also called lazy mode.
This mode requires less swap space than immediate mode and may cause the system to run faster, because it requires less swap space bookkeeping. However, because deferred mode does not reserve swap space in advance, the swap space may not be available when a process needs it, and processes may be killed asynchronously.
The process killed may not be the process requesting memory; in most cases, it will be a process that has been idle, such as a system daemon.
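The difference between the two modes is essentially when the reservation check happens. The sketch below is a conceptual model, not the Tru64 implementation; the class and method names are invented, and process termination is represented by raising an exception.

```python
class SwapAccount:
    """Toy model of eager (immediate) vs. lazy (deferred) swap reservation."""
    def __init__(self, swap_pages, mode="immediate"):
        self.available = swap_pages
        self.mode = mode  # "immediate" (eager) or "deferred" (lazy)

    def allocate_anonymous(self, pages):
        # Immediate mode reserves swap space when anonymous memory is allocated.
        if self.mode == "immediate":
            if pages > self.available:
                raise MemoryError("reservation fails up front; requester terminated")
            self.available -= pages
        return pages  # deferred mode just hands out the memory, reserving nothing

    def page_out(self, pages):
        # Deferred mode discovers a shortage only when dirty pages must be written.
        if self.mode == "deferred":
            if pages > self.available:
                raise MemoryError("no swap at page-out time; some process is killed")
            self.available -= pages
```

Note that in the deferred case the failure surfaces asynchronously, at page-out time, which is why the victim may not be the process that requested the memory.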
In addition, you can override the system-wide swap space allocation mode for a specific command or application by using the swapon command. For more information, see swapon(8).

5.4.7 Using Swap Buffers
To facilitate the movement of data between memory and disk, the virtual memory subsystem uses synchronous and asynchronous swap buffers. The virtual memory subsystem uses these two types of buffers to immediately satisfy a page in request without having to wait for the completion of a page out request, which is a relatively slow process.
Synchronous swap buffers are used for page-in page faults and for swap outs. Asynchronous swap buffers are used for asynchronous page outs and for prewriting modified pages.
5.4.8 Unified Buffer Cache
The Unified Buffer Cache (UBC) is a Tru64 UNIX virtual memory feature. The UBC uses a portion of the machine's physical memory to cache the most recently accessed file system data. The UBC contains actual file data, which includes reads and writes from conventional file activity, page faults from mapped file sections, and AdvFS metadata.
The UBC shares (contends for) physical memory pages with the virtual memory subsystem, but not pages that are wired by the kernel. The UBC is dynamic, consuming varying amounts of memory in response to changes in file system demands for its service.
For information about the UBC, see the
System Configuration and Tuning
manual.
5.5 Device Support
The kernel supports hot-swap I/O devices, providing the capability to automatically fault in a device driver when a hot-swappable I/O device comes online.
When the hardware code detects a new device and determines that the device driver is not present in the kernel, it can make a kernel function call to automatically load the device's driver into the kernel. Additionally, hot-swapping provides the flexibility of not having to prebuild the kernel subsystem or driver into the kernel. Instead, it can be faulted in when the device is first accessed.
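The fault-in behavior can be pictured as a lazy-loading table: load a driver the first time a device that needs it is accessed, and reuse it afterward. This is an illustrative sketch; the names are invented and do not reflect the Tru64 kernel interface.

```python
class DriverTable:
    """Toy model of faulting in a device driver on first access."""
    def __init__(self, loader):
        self.loaded = {}      # driver name -> loaded driver object
        self.loader = loader  # callable that loads a driver subsystem by name

    def access(self, device, driver_name):
        if driver_name not in self.loaded:                       # not in the kernel yet
            self.loaded[driver_name] = self.loader(driver_name)  # fault the driver in
        return self.loaded[driver_name]                          # already resident
```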