In a hot standby configuration, the AIX processor node that is the takeover node is not running any other workload. In a mutual takeover configuration, the AIX processor node that is the takeover node is running other workloads. Generally, DB2 UDB EEE runs in mutual takeover mode, with database partitions on each node. One exception is a scenario in which the catalog node is part of a hot standby configuration.
When planning a large DB2 installation on a RS/6000 SP using HACMP ES, you need to consider how to divide the nodes of the cluster within or between the RS/6000 SP frames. Having a node and its backup in different SP frames can allow takeover in the event one frame goes down (that is, the frame power/switch board fails). However, such failures are expected to be exceedingly rare because there are N+1 power supplies in each SP frame and each SP switch has redundant paths along with N+1 fans and power. In the case of a frame failure, manual intervention may be required to recover the remaining frames. This recovery procedure is documented in the SP Administration Guide. HACMP ES provides for recovery of SP node failures; recovery of frame failures is dependent on proper layout of clusters within the SP frame(s).
Another planning consideration involves how to manage big clusters: It is easier to manage a small cluster than a big one; however, it is also easier to manage one big cluster than many smaller ones. When planning, consider how your applications will be used in your cluster environment. If there is a single, large, homogeneous application running on, for example, 16 nodes then it is probably easier to manage as a single cluster rather than as eight (8) two-node clusters. If the same 16 nodes contain many different applications with different networks, disks, and node relationships then it is probably better to group the nodes into smaller clusters. Keep in mind that nodes integrate into an HACMP cluster one at a time; it will be faster to start a configuration of multiple clusters rather than one large cluster. HACMP ES supports both single and multiple clusters as long as a node and its backup are in the same cluster.
HACMP ES failover recovery allows predefined (also known as "cascading") assignment of a resource group to a physical node. The failover recovery procedure also allows floating (also known as "rotating") assignment of a resource group to a physical node. The IP addresses, external disk volume groups, filesystems, NFS filesystems, and application servers within each resource group specify either an application or an application component, which HACMP ES can move between physical nodes through failover and reintegration. Failover and reintegration behavior is specified by the type of resource group created and by the number of nodes placed in the resource group.
As an example, consider a DB2 database partition (logical node): if its log and table space containers were placed on external disks, and other nodes were linked to those disks, those other nodes could access the disks and restart the database partition on a takeover node. It is this type of operation that HACMP ES automates. HACMP ES can also be used to recover the NFS filesystems used by DB2 instance main user directories.
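As a rough illustration of what HACMP ES and the rc.db2pe script (described later in this section) automate during such a takeover, the sketch below shows the equivalent manual steps on the takeover node, using the volume group, filesystem, instance, and partition number from the mutual takeover example further on; the db2start RESTART clause is abbreviated, so check the db2start entry in the Command Reference for the exact syntax and logical port for your release.

# Manual equivalent of an HACMP ES takeover of database partition 1
# (illustration only; names come from the example configuration later in this section)
varyonvg DB2vg1                      # acquire the shared volume group
mount /database/powertp/NODE0001     # mount the partition's log/container filesystem
# Restart the failed partition on this host (abbreviated; logical port 1 assumes
# another partition already occupies port 0 on this takeover host)
su - powertp -c "db2start nodenum 1 restart hostname $(hostname) port 1"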
Read the HACMP ES documentation thoroughly as part of your planning for recovery with DB2 UDB EEE. You should read the Concepts, Planning, Installation, and Administration guides; then you can lay out the recovery architecture for your environment. For the subsystems you have identified for recovery, based on the identified points of failure, identify the HACMP clusters you need and the recovery nodes for each (either hot standby or mutual takeover). This architecture and planning are the starting point for completing the HACMP worksheets found in the documentation mentioned above.
It is strongly recommended that both disks and adapters be mirrored in your external disk configuration. For DB2 physical nodes that are configured for HACMP, care is required to ensure that each node can vary on the volume groups on the shared external disks. In a mutual takeover configuration, this arrangement requires some additional planning so that the paired nodes can access each other's volume groups without conflicts. Within DB2 UDB EEE, this means that all container names must be unique across all databases.
One way to achieve uniqueness in the names is to include the partition number as part of the name. You can specify a node expression in the container string syntax when creating either SMS or DMS containers. When you specify the expression, either the node number becomes part of the container name or, if you specify additional arguments, the result of evaluating those arguments becomes part of the container name. You use the argument " $N" ([blank]$N) to indicate the node expression. The argument must occur at the end of the container string and can be used only in one of the following forms. In the table below, the node number is assumed to be five (5):
Table 50. Arguments for Creating Containers
Syntax | Example | Value |
---|---|---|
[blank]$N | " $N" | 5 |
[blank]$N+[number] | " $N+1011" | 1016 |
[blank]$N%[number] | " $N%3" | 2 |
[blank]$N+[number]%[number] | " $N+12%13" | 4 |
[blank]$N%[number]+[number] | " $N%3+20" | 22 |
Notes: | In these expressions, % is the modulus operator (the remainder after division), and the operators are evaluated from left to right; for node number 5, " $N+12%13" therefore evaluates as (5+12)%13 = 4 rather than 5+(12%13) = 17. |
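The left-to-right evaluation can be confirmed against the table with a small shell check (an illustration only, not something DB2 provides); note that ordinary arithmetic precedence would give 17, not 4, for the " $N+12%13" case.

# Reproduce the table values for node number 5, applying the operators left to right
N=5
echo $(( N + 1011 ))        # 1016  -- " $N+1011"
echo $(( N % 3 ))           # 2     -- " $N%3"
echo $(( (N + 12) % 13 ))   # 4     -- " $N+12%13" (left to right; 5+(12%13) would be 17)
echo $(( (N % 3) + 20 ))    # 22    -- " $N%3+20"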
Following are some examples of creating containers using this special argument:
CREATE TABLESPACE TS1 MANAGED BY DATABASE USING (device '/dev/rcont $N' 20000)

The following containers would be used:

/dev/rcont0 - on Node 0
/dev/rcont1 - on Node 1
CREATE TABLESPACE TS2 MANAGED BY DATABASE USING (file '/DB2/containers/TS2/container $N+100' 10000)

The following containers would be used:

/DB2/containers/TS2/container100 - on Node 0
/DB2/containers/TS2/container101 - on Node 1
/DB2/containers/TS2/container102 - on Node 2
/DB2/containers/TS2/container103 - on Node 3
CREATE TABLESPACE TS3 MANAGED BY SYSTEM USING ('/TS3/cont $N%2', '/TS3/cont $N%2+2')

The following containers would be used:

/TS3/cont0 - on Node 0
/TS3/cont2 - on Node 0
/TS3/cont1 - on Node 1
/TS3/cont3 - on Node 1
The following figures show some of the planning involved in ensuring a highly available external disk configuration and the ability to access all volume groups without conflict.
Figure 69. No Single Point of Failure
Figure 70. Volume Group and Logical Volume Setup
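As a hedged sketch of the access check implied by these figures, the commands below, run on the takeover node of a mutual takeover pair, import and vary on the partner's shared volume group and then release it again; the hdisk name is a placeholder, and automatic varyon is disabled so that HACMP ES, rather than the boot sequence, controls acquisition.

importvg -y DB2vg1 hdisk3    # make the partner's shared volume group known to this node
chvg -a n DB2vg1             # disable automatic varyon at boot; HACMP ES acquires it
varyonvg DB2vg1              # verify this node can vary on the volume group
lsvg -l DB2vg1               # confirm the logical volumes and filesystems are visible
varyoffvg DB2vg1             # release the volume group for the owning node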
Once configured, each database partition in an instance is started by HACMP ES one physical node at a time. Using multiple clusters is recommended for starting parallel DB2 configurations that are larger than four (4) nodes.
Note: | Each HACMP node in a cluster is started one at a time. For a 64-node parallel DB2 configuration, it is faster to start 32, two-node HACMP clusters in parallel rather than four (4), sixteen-node clusters. |
A script file, rc.db2pe, is packaged with DB2 UDB EEE to assist in configuring HACMP ES failover and recovery for either "hot standby" or "mutual takeover" nodes. In addition, DB2 buffer pool sizes can be customized from within rc.db2pe during failover in mutual takeover configurations. (Buffer pool size modification is needed to ensure proper performance when two database partitions run on one physical node; see the next section for additional information.) The rc.db2pe script file is installed in /usr/bin on each node.
When you create an application server in an HACMP configuration for a DB2 database partition, specify rc.db2pe as the start and stop script in the following way:
/usr/bin/rc.db2pe <instance> <dpn> <secondary dpn> start <use switch>
/usr/bin/rc.db2pe <instance> <dpn> <secondary dpn> stop <use switch>
where:
<instance> is the instance name.
<dpn> is the database partition number.
<secondary dpn> is the 'companion' database partition number in 'mutual takeover' configurations only; in 'hot standby' configurations it is the same as <dpn>.
<use switch> is usually blank. When blank, this indicates by default that the SP Switch network is used for the hostname field in the db2nodes.cfg file (all DB2 traffic is routed over the SP switch); if it is not blank, the name used is the hostname of the SP node to be used.
Note: | The DB2 command LIST DATABASE DIRECTORY is used from within rc.db2pe to find all databases configured for this database partition. The rc.db2pe script file then looks for /usr/bin/reg.parms.DATABASE and /usr/bin/failover.parms.DATABASE files, where DATABASE is each of the databases configured for this database partition. In a "mutual takeover" configuration, it is recommended you create these parameter files (reg.parms.xxx and failover.parms.xxx). In the failover.parms.xxx file, the settings for BUFFPAGE, DBHEAP, and any others affecting buffer pools should be adjusted to account for the possibility of more than one buffer pool. Buffer pool size modification is needed to ensure proper performance when two or more database partitions run on one physical node. Sample files reg.parms.SAMPLE and failover.parms.SAMPLE are provided for your use. |
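The exact layout of the reg.parms and failover.parms files is defined by the samples shipped in /usr/bin, so the fragment below is only a sketch of the kind of adjustment involved: when two partitions of TESTDATA share one physical node after a failover, per-partition buffer pool memory is roughly halved (the specific values are hypothetical).

# Illustration only -- not the shipped parms file format
# Normal operation (the reg.parms.TESTDATA settings restore values such as these):
db2 update db cfg for TESTDATA using BUFFPAGE 40000 DBHEAP 2400
# Failover operation (the failover.parms.TESTDATA settings reduce memory so that
# two database partitions can share one physical node):
db2 update db cfg for TESTDATA using BUFFPAGE 20000 DBHEAP 1200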
One of the important parameters in this environment is START_STOP_TIME. This database manager configuration parameter has a default value of ten (10) minutes, but rc.db2pe sets it to two (2) minutes. You should modify this parameter within rc.db2pe so that it is set to ten (10) minutes or slightly longer. In the context of a failed database partition, this value bounds the time between the failure of the partition and its recovery. If the applications running on a partition commit frequently, ten minutes following a failure should be enough time to roll back uncommitted transactions and reach a point of consistency for the database on that partition. If your workload is heavy or you have many partitions, increase the value until the timeout no longer adds a problem beyond the original partition failure (that is, until you no longer receive a timeout message caused by exceeding START_STOP_TIME while waiting for the rollback to complete at the failed database partition).
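Outside of rc.db2pe, the same parameter can be inspected and raised with the standard configuration commands; the value of ten minutes here simply mirrors the default discussed above.

# Display the current setting of START_STOP_TIME (in minutes)
db2 get dbm cfg | grep -i start_stop_time
# Set it back to ten minutes (or slightly longer for heavy workloads or many partitions)
db2 update dbm cfg using START_STOP_TIME 10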
The assumption in this example is that the mutual takeover configuration will exist between physical nodes one and two with a DB2 instance name of "POWERTP". The database partitions are one and two, and the database name is "TESTDATA" on filesystem /database.
Resource group name:              db2_dp_1
Node Relationship:                cascading
Participating nodenames:          node1_eth, node2_eth
Service_IP_label:                 nfs_switch_1   (<<< this is the switch alias address)
Filesystems:                      /database/powertp/NODE0001
Volume Groups:                    DB2vg1
Application Servers:              db2_dp1_app
Application Server Start Script:  /usr/bin/rc.db2pe powertp 1 2 start
Application Server Stop Script:   /usr/bin/rc.db2pe powertp 1 2 stop
Resource group name:              db2_dp_2
Node Relationship:                cascading
Participating nodenames:          node2_eth, node1_eth
Service_IP_label:                 nfs_switch_2   (<<< this is the switch alias address)
Filesystems:                      /database/powertp/NODE0002
Volume Groups:                    DB2vg2
Application Servers:              db2_dp2_app
Application Server Start Script:  /usr/bin/rc.db2pe powertp 2 1 start
Application Server Stop Script:   /usr/bin/rc.db2pe powertp 2 1 stop
The assumption in this example is that the hot standby takeover configuration will exist between physical nodes one and two with a DB2 instance name of "POWERTP". The database partition is one, and the database name is "TESTDATA" on filesystem /database.
Resource group name:              db2_dp_1
Node Relationship:                cascading
Participating nodenames:          node1_eth, node2_eth
Service_IP_label:                 nfs_switch_1   (<<< this is the switch alias address)
Filesystems:                      /database/powertp/NODE0001
Volume Groups:                    DB2vg1
Application Servers:              db2_dp1_app
Application Server Start Script:  /usr/bin/rc.db2pe powertp 1 1 start
Application Server Stop Script:   /usr/bin/rc.db2pe powertp 1 1 stop
Just as with the configuration of a DB2 database partition presented above, the rc.db2pe script can be used to make the NFS-mounted user directories of a DB2 parallel instance highly available. This is accomplished by setting the MOUNT_NFS parameter to "YES" in rc.db2pe and configuring the NFS failover server pair as follows:
For example, an /nfshome JFS filesystem can be exported to all nodes and mounted on each node as /dbhome. Each node creates an NFS filesystem /dbhome that mounts nfs_server:/nfshome. Therefore, the home directory of the DB2 instance owner would be /dbhome/powertp when the instance name is "powertp".
Ensure that the NFS mount options for this filesystem in /etc/filesystems are "hard", "bg", "intr", and "rw".
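A minimal sketch of the corresponding /etc/filesystems stanza on each client node follows; the nodename and paths come from the example above, and mount is shown as false on the assumption that rc.db2pe (with MOUNT_NFS set to "YES") performs the mount rather than the boot sequence.

/dbhome:
        dev             = "/nfshome"
        vfs             = nfs
        nodename        = nfs_server
        mount           = false
        options         = rw,bg,hard,intr
        account         = false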
The user definitions in an SP environment are typically created on the Control Workstation and "supper" or "pcp" is used to distribute /etc/passwd, /etc/security/passwd, /etc/security/user, and /etc/security/group to all nodes.
The assumptions in this example are that there is an NFS server filesystem /nfshome in the volume group nfsvg over the IP address "nfs_server". The DB2 instance name is "POWERTP" and the home directory is /dbhome/powertp.
Resource group name:              nfs_server
Node Relationship:                cascading
Participating nodenames:          node1_eth, node2_eth
Service_IP_label:                 nfs_server   (<<< this is the switch alias address)
Filesystems:                      /nfshome
Volume Groups:                    nfsvg
Application Servers:              nfs_server_app
Application Server Start Script:  /usr/bin/rc.db2pe powertp NFS SERVER start
Application Server Stop Script:   /usr/bin/rc.db2pe powertp NFS SERVER stop
Note: | In this example:
|
When implementing HACMP ES with the SP switch, note that the switch alias addresses used as service IP labels (such as nfs_switch_1 and nfs_server in the examples above) are defined as IP aliases on the css0 switch interface, for example:

ifconfig css0 inet alias sw_alias_1 up
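To confirm that the alias is active on the switch interface, the standard interface displays can be used:

ifconfig css0     # the alias address should appear on the css0 interface
netstat -in       # lists all configured addresses, including aliases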
The following examples show different possible failover support configurations and what happens when failure occurs.
Figure 71. Mutual Takeover with NFS Failover - Normal
The previous figure and the next two figures each have the following notes associated with them:
Figure 72. Mutual Takeover with NFS Failover - NFS Failover
Figure 73. Mutual Takeover with NFS Failover - DB2 Failover
Figure 74. Hot Standby with NFS Failover - Normal
The previous figure and the next figure each have the following notes associated with them:
Figure 75. Hot Standby with NFS Failover - DB2 Failover
Figure 76. Mutual Takeover without NFS Failover - Normal
The previous figure and the next figure each have the following notes associated with them:
Figure 77. Mutual Takeover without NFS Failover - DB2 Failover
It is recommended that you do not specify HACMP to be started at boot time in /etc/inittab. HACMP should be started manually after the nodes are booted. This allows for non-disruptive maintenance of a failed node.
As an example of "disruptive maintenance", consider the case where a node has a hardware failure and crashed. At such a time, service needs to be performed. Failover would be automatically initiated by HACMP and recovery completed successfully. However, the failed node needs to be fixed. If HACMP was configured to be started on reboot in /etc/inittab, then this node would attempt to reintegrate after boot completion which is not desirable in this situation.
As an example of "non-disruptive maintenance", consider manually starting HACMP on each node. This allows for non-disruptive service of failed nodes since they can be fixed and reintegrated without affecting the other nodes. The ha_cmd script is provided for controlling HACMP start and stop commands from the control workstation.