DB2 Replication Guide and Reference

Data Transformation Requirements

Source data most likely has to be subsetted, transformed, or enhanced as part of the replication process, particularly in the support of decision support or data warehousing. This section sorts these requirements into those that are easily fulfilled by using the Control Center and those that require direct manipulation of the control tables.

Basic Data Enhancement

This section explains the following basic capabilities that the Control Center supports.

Choices of target table types
Subsetting targets
Source views
Before and after images
Column renaming
Computed columns
Before and after run-time processing

Choices of Target Table Types

You can use the following types of target tables:

User copy target tables

These are the most common target tables, and are copies of the replication source without an overhead timestamp column. These tables require a primary key.

Point-in-time target tables

These are plain copies of source tables with timestamp columns added to specify when updates were made. These tables require a primary key.

Aggregate tables

These tables are used to track data and changes to data. These tables are built when rows are added or appended to the table over time.

Queries with column functions (AVG, MAX, MIN, SUM, COUNT) can range over many thousands of qualifying rows, yet they can return very compact, easy-to-understand results. By adding a timestamp and aggregating the results of these calculations, you can track broad trends, while still retaining the benefits of data reduction and compact storage.

There are two kinds of aggregate tables:

Base aggregate tables: In these tables, calculations are made against the source table. This type of history is useful for tracking broad indicators that have relatively low volatility. Base aggregate tables can include calculations against stable base data; calculations are made according to the subscription frequency criteria. Use base aggregate tables if you need to summarize your source table contents on a regular basis. For example, you can use a base aggregate table to track a daily inventory of items from a source table.
Change aggregate tables: In these tables, calculations are made against change data, not base data. Each calculation ranges over the recent changes since the time of the last calculation. You can track insert, update, and delete operations collectively or individually, determined by the filtering predicates that you specify. If there are no recent updates, calculation is deferred, creating very compact storage. Use these tables if you need to summarize the results of the changes made between each Apply program refresh operation. You can use change aggregate tables on their own, where daily summaries of change events are required, or as input to an application that must apply the summaries to a cumulative table.

Noncondensed CCD target tables

These tables are used for history tables or audit trail tables. They can contain changes only, or they can include the initial state of the source table.

Condensed CCD target tables

These target tables can be used as an alternative to a point-in-time copy when it is important to keep the last state of any record, (that is, no records are to be deleted). A delete is processed as an update, with the IBMSNAP_OPERATION column acting as the logical delete flag.

Updateable replica target tables

These target tables can be used as sources for copying data back to the replication source table and are used in update-anywhere scenarios

Subsetting Targets

IBM Replication supports both vertical and horizontal subsetting of the source table. This means that you can specify that only a subset of the replication source table columns and rows are replicated to the target table, rather than all of the columns and rows:

Column subsetting

In some replication scenarios, you might not want to replicate all columns into the CD table. With vertical subsetting, you can define the CD table to have fewer columns than your base table. However, CD tables must contain sufficient key data for point-in-time copies, which are maintained with key-qualified predicates. Omitting columns from the target is appropriate if the column data types are not supported. Vertical subsetting is available for all tables except replica tables.

You can define vertical subsetting either:

When you define a replication source table for differential refresh: Deselect the columns that you do not want to make available for replication to a target table.
When you define a replication subscription: Use the advanced subscription options to deselect source table columns that you do not want to replicate to the target.

Row subsetting

You can split a single source table into various target tables based on the source column contents (for example, department number or region code) or you can specify that only certain source rows be replicated. Use the advanced subscription options to define a WHERE clause when you define the subscripton. All target table types support row subsetting. You might need to define a view replication source to bring together all the columns needed to define the row subset.

Target table primary keys are assumed to be invariable. If one or more of the columns defined in the predicate are updated, you must specify replication logical partitioning key support when you define the replication source. Replication logical partitioning key support defines updates as DELETE and INSERT statements.

Source Views

Using the Control Center, you can define a join as a replication source. The joins can include only tables defined as replication sources. If the replication sources defined in the join have CD (or CCD) tables, a CD view is created from the replication sources' CD tables. The Capture program (or the Apply program, for CCD sources) maintains the control information for the joined replication sources and the CD views in the source server control tables.

Join views fill many requirements, both for denormalizing (restructuring) copies in data warehouse scenarios, enabling easier querying of copied data, and also for addressing the routing problem, sometimes called the database partitioning problem in distributed computing scenarios. For example, knowing where to send a bank account update may require a join of the account table with the customer table, in order to know which branch of the bank the customer deals with. Typically, production databases are normalized so that the geographic details, such as branch-number, are not stored redundantly throughout the production database. Views are also useful when you need to specify predicates that exceed the capacity of the ASN.IBMSNAP_SUBS_MEMBR.PREDICATES control table column. You can choose to manage your subsetting predicates through views rather than in the subscription meta data.

IBM Replication supports the following types of view definitions:

Simple inner-joins over one or more defined replication sources
Simple inner-joins over CCD staging tables that are defined as replication sources and maintained by an Apply program or an application other than an IBM Replication component and an external data source, such as DataPropagator NonRelational and IMS source data

Before and After Images

Both before and after images can be defined in replication sources and subscriptions. A before-image column is a copy of a column before it is updated. DB2 logs both the before-image and after-image columns of the table for each change to the table. Before images do not make sense with a base aggregate target table type. All other target table types can make use of before-image columns.

Column Renaming

Source site column names can be renamed for point-in-time and user-copy target table types.

Computed Columns

New columns can be derived from the existing source columns on the basis of SQL expressions. New columns can be aggregate functions such as COUNT or SUM in the case of aggregate target table types, or simple derivations in the case of all other target table types.

User-defined functions for DB2 Universal Database can be specified by using the computed column facility in the Control Center.

Before and After Run-Time Processing

You can define run-time processing statements using SQL statements and stored procedures before or after the Apply program processes the replication subscription. This feature is useful for pruning CCD tables and controlling the sequence in which replication subscriptions are processed. The run-time processing statements can be run at the source server before the replication subscription is processed, and at both the source and target servers after the replication subscription is processed. For example, you can transform source data before replicating the data or target data after it is copied.

The stored procedures use the SQL CALL statement, newly supported by IBM Replication, without parameters. The run-time procedures are executed together in a single unit-of-work. You can also define acceptable SQLSTATEs for each processing statement.

Depending on the DB2 platform, logic can be invoked through the SQL before and after processing:

Calls to stored procedures can be made through a before or after processing statement, or triggers can be used to do more complex transformations.
On DB2 Universal Database, before or after SQL statements can invoke user-defined functions through the use of triggers.

Advanced Data Enhancement

The transformations listed below require manipulation of the IBM Replication control tables outside of the Control Center. The techniques required to implement these transformations are not discussed in this book. For planning purposes, when you find that you will use many of these techniques in the replication scenario, you should allow additional implementation time.

Data consolidation - Union scenarios can be used to consolidate identically structured data into a common consolidated target table.
Outer join of source tables - Join scenarios can be used when one or more of the sources are subject to change. An outer join can be implemented with multiple subscriptions to a single table.
Target views - You use the target views only when consolidating data from multiple sources, such as multisite union scenarios. Multiple source tables are updated and consolidated into a join at the target server. These types of views are not supported in the Control Center.
When the target table is maintained by different servers or different replication subscriptions, a subset view ensures that the DELETE statement generated before a full-refresh is copied to the appropriate horizontal fragment. The Apply program is almost always driven by SQL operations at the source server and in addition, generates an unqualified DELETE statement at the target server before applying a full-refresh answer set. By defining a subset view over the target table, and defining the replication subscription target to be the view, you restrict the DELETE statement to the appropriate horizontal fragment. That way, a full refresh of data from the source does not affect the information from the other sources.
Poorly structured data - Many older applications often have poorly structured data; for example, conflicting sources for common data items, no primary keys defined, poorly normalized tables, and heavily encoded data values.
Although IBM Replication might be able to handle these individually, there are some extreme situations where it is not useful to attempt replication before some manual data cleansing is done.
General rule: If the data transformation can be expressed by using SQL, IBM Replication will almost certainly be able to execute it. The SQL is introduced either as part of the SELECT statement that does the copying or as part of the user-specified SQL before and after statements. This SQL might include procedural logic by including database triggers, user-defined functions, and stored procedure calls as described in Basic Data Enhancement.
IBM Replication can also be supplemented with procedural code in user-written programs. For example, a program could be scheduled to manipulate data in the CD or CCD tables before execution of the Apply program. Or, if the changes are so radical that the data cannot be changed in place, you could go as far as writing a program to read the data changes from the CD or CCD tables and then applying them to the target tables yourself. (See Defining SQL Statements or CALL Procedures for the Replication Subscription to learn about running SQL and Call procedures with replication subscriptions.)

[ Top of Page | Previous Page | Next Page | Table of Contents | Index ]

[ DB2 List of Books | Search the DB2 Books ]