Chapter 52. Common Key Data Buffering

In Find More Detailed Events, I described Doug creating the read exit for SAFR to read the Stacked Data Element file. The Stacked Data Element or SDE file contained many records of many different record types. In other words, one file needed multiple LRs to describe its records. Some of these records were “P” elements, for Policy level business events, like the ones Mark Kimmell used. Others were “L” elements for Ledger level business events. These are the ones Jim used in “show me the money.” Other record structures were reference data, describing policy, company, or ledger account attributes. Doug had to write the exit because the file didn’t follow typical file structure rules. His read exit made each element its own logical record, so SAFR views could decode them.

Multiple Record Type Files

Thus far in our discussion, we have painted the picture that a single file can be described by a single LR. With the SDEs, though, this wasn’t the case. Each SDE had a field in it that described the record type for that record, whether a P110 or something for the policy level transactions, or L103 for the ledger reference record. To read this file, each view had to have a record filter—the first record filter—which selected only the records of the particular type that view wanted to use as its event file.1

Thus in one SAFR thread, reading one SDE file, there could be different views each selecting a different type of SDE as its event file, and producing outputs from selected records of that file.2
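The record-type filtering described above can be sketched as follows. The record layouts, type codes, and field names here are invented for illustration; they are not SAFR’s actual structures.

```python
# Hypothetical records from a multi-record-type file like the SDE file.
# Each view's first record filter keeps only the element type it treats
# as its event file; all layouts and values here are assumptions.

RECORDS = [
    {"type": "P110", "policy": "A1", "amount": 100},
    {"type": "L103", "account": "4000", "desc": "Cash"},
    {"type": "P110", "policy": "A2", "amount": 250},
]

def view_event_records(records, record_type):
    """First record filter: select only the element type this view wants."""
    return [r for r in records if r["type"] == record_type]

policy_events = view_event_records(RECORDS, "P110")  # policy-level view
ledger_refs = view_event_records(RECORDS, "L103")    # ledger reference view
```

Several such views can share one pass over the file, each seeing only its own record type as its event file.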

Paired Look-up Exit

None of this was all that new to the tool. The breakthrough that started the feature now known as Common Key Data Buffering came when Doug recognized a business need to join from the business event SDEs to the other reference data elements in the same SDE file. A two-pass architecture could have been used, whereby views in the first pass would simply select and dump the desired reference data elements. These could then be put into the RED in the second pass and looked up as the SDE event elements were processed.

This approach was a loser when it came to performance. The volume of these reference elements was very high, because they could be on each policy transaction. This meant the IO was substantial in the first pass just to get the records out. Loaded as reference files, they would consume a great deal of memory, and searching such large tables would waste CPU time.

Instead, Doug knew that the P element reference data would be in memory—in the read exit input buffer—at the same time the P business event came in. So Doug had an idea: he created a lookup exit that could find the read exit’s input buffer, locate any record in it of a particular type, and return that record as if it were the result of a lookup.

This approach worked because the elements in the SDE file that were related came at the same point in the file, so they would be in the read exit buffer at the same time. Doug recognized that this approach could be used to solve other problems.

For example, for the high tech chip manufacturer, we had to create views which joined from the event file to a reference file that was very large, larger than would fit in memory. The event file and the reference file were both in sorted order by the key used to do the join. Doug created a lookup exit which would read the reference file one record at a time as the event file was processed. This accomplished the join without loading everything into memory.
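A minimal sketch of that one-record-at-a-time join, assuming both inputs are sorted lists of dictionaries sharing a join key; the field names are invented:

```python
# Merge-style join: because both files are sorted by the join key, the
# reference file cursor only moves forward as events stream by, and
# nothing is loaded wholly into memory.

def merge_join(events, reference):
    """Yield (event, ref_or_None) pairs; both inputs sorted by 'key'."""
    ref_iter = iter(reference)
    ref = next(ref_iter, None)
    for ev in events:
        # Advance the reference cursor until it reaches the event's key.
        while ref is not None and ref["key"] < ev["key"]:
            ref = next(ref_iter, None)
        match = ref if ref is not None and ref["key"] == ev["key"] else None
        yield ev, match

events = [{"key": 1, "amt": 10}, {"key": 2, "amt": 20}, {"key": 2, "amt": 5}]
reference = [{"key": 1, "name": "Ann"}, {"key": 2, "name": "Bob"}]
joined = list(merge_join(events, reference))
```

In production the "reference" would be a file read lazily from disk rather than an in-memory list; the forward-only cursor is what keeps the memory footprint flat.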

Then another problem came up which showed the limitations of this approach. A view needed to read the reference data file as an event file, to count records or something. The file had to be read a second time just to produce this view: yet another pass of the data beyond its use as a reference file.

Normalization

In Model the Repository, we talked about modeling the data repository. It may be very useful to review the data structure example and description in that chapter. A common trade-off in designing reporting applications is what level of data normalization is appropriate. For on-line systems, storing a piece of information in only one place is important so that update processes are efficient. Don’t store the customer’s name on every business event, because if it changes you’ll have a lot of records to update. Normalization also saves storage space because data is stored only once.

The problem with normalization is that it requires joining data back together to make sense of it. If we want to report on many business events and summarize by customer name, we have to put the data back together again. Thus I have heard Rick talk about “scatter and gather,” whereby we spread everything out but then have to gather it all back together again for reporting. Joining takes CPU time.

As I said, for on-line systems, normalization is important. But once we have “cool data” as Rick calls it, records that aren’t going to be updated any longer, we can “denormalize” it, putting certain well used fields on the business events that we would often need in reporting. Repeat the field on each record, so we don’t have to do the join to report. This uses storage space to save CPU time.

Alternatively, the exit Doug wrote makes dynamic denormalization possible. Instead of storing all the data in one denormalized file, Greg and Doug worked together: Greg normalized all the tables, storing each field only once. Then Doug made sure that the tables or files were in the form of a hierarchy, where the key to the top table, like Policy, was the high order key on all the other tables. The key for the next table down, Term, was repeated as the second key on all lower tables. Then Doug specified that the files be kept in sorted order by these keys.

Doug created a read exit to open all these files at once, the Policy, Term, and Transaction files, and read the first record from each. It would continue reading records from these files until the Policy key changed. Thus the data for one policy exists in memory at any one point in time, as shown in the Sample Insurance Input Buffer Figure 86. This data looked just like the results of reading the Stacked Data Element file, with many related records in memory at the same time. He then created a companion lookup exit to search these records in the same way.

Now the data can be stored in a normalized manner, saving IO when reading it because there are fewer bytes (values aren’t repeated on each transaction if they describe the same policy, for example), but the joins are very efficient because all the data needed for the joins exists in memory at one time. There is no additional IO to go find a reference data record and memory is not used for unrelated reference data.
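The buffering idea above might be sketched like this, under assumed record shapes (each record is a tuple whose first element is the policy key, and each entity file is already sorted by that key): records from all files sharing the current policy key are gathered into one in-memory buffer before any view sees them.

```python
# Common key data buffering sketch: merge several sorted entity files and
# group all records for one policy key into a single in-memory buffer.
import heapq

def common_key_buffers(*files):
    """files: lists of records sorted by record[0] (the policy key).
    Yield (policy_key, [all records from all files for that key])."""
    merged = heapq.merge(*files, key=lambda r: r[0])  # stable merge of sorted inputs
    buffer, current = [], None
    for rec in merged:
        if current is not None and rec[0] != current:
            yield current, buffer   # policy key changed: release the buffer
            buffer = []
        current = rec[0]
        buffer.append(rec)
    if buffer:
        yield current, buffer       # last policy's buffer

# Invented sample data: a tiny Policy / Term / Transaction hierarchy.
policy = [("P1", "policy-rec"), ("P2", "policy-rec")]
term = [("P1", "term-rec"), ("P2", "term-rec")]
txn = [("P1", "txn-1"), ("P1", "txn-2"), ("P2", "txn-1")]
buffers = dict(common_key_buffers(policy, term, txn))
```

A lookup exit against such a buffer only ever searches the handful of records for the current policy, which is why the joins cost so little CPU and no extra IO.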

Master File Update

What the project team did next was to turn SAFR from a reporting system into a master file update process, as described in Define Processes. The team created views, one for each record type read by the exit, to write out a new, updated version of that specific record type file. Each view contained a general select statement choosing only the appropriate entity records.

Because the data was presented by the read exit to the view, it appeared that all records for that entity came from one specific file. But in addition to merging data from multiple entities, the read exit also merged data from multiple time partitions of the same entity.

For example, suppose we have a file that contains all the policy records the repository has ever seen. On a daily basis, we might receive a few updates to these policies from the source system. These come to the repository in their own file. The read exit can be configured to open both the history, and this new daily file, at the same time. It can merge records from these two entities into the same common key buffer, so that the policy record from yesterday (from the history file) and the updated policy record (from the daily file) for the same policy end up in memory at the same time.

The view which writes the new version of the repository history policy file can select both of these records. Thus at the end of the process, we have not only used the new policy record for any view which required it, we have also performed the update process to the repository by writing out a new history file.
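A hedged sketch of that update idea, with invented field names: history and daily files, both sorted by policy key, are merged so that a daily record supersedes yesterday’s record and brand-new policies slot in at their key position, leaving the new history file still in sorted order for the next execution.

```python
# Master file update sketch: merge a sorted history file with a sorted
# daily file, letting daily records supersede their history counterparts.

def update_history(history, daily):
    """Both inputs sorted by 'policy'; a daily record supersedes history."""
    out, h, d = [], 0, 0
    while h < len(history) or d < len(daily):
        if d == len(daily) or (
            h < len(history) and history[h]["policy"] < daily[d]["policy"]
        ):
            out.append(history[h]); h += 1          # unchanged policy
        elif h == len(history) or daily[d]["policy"] < history[h]["policy"]:
            out.append(daily[d]); d += 1            # brand-new policy
        else:
            out.append(daily[d]); h += 1; d += 1    # daily supersedes history
    return out

history = [{"policy": "P1", "name": "Ann"}, {"policy": "P2", "name": "Bob"}]
daily = [{"policy": "P2", "name": "Bobby"}, {"policy": "P3", "name": "Cy"}]
new_history = update_history(history, daily)
```

Note the output stays in policy order without any sort step, which is the property the chapter insists on preserving between runs.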

This same approach was used to roll the daily records into the weekly file on a daily basis, the weekly file into the monthly file on a weekly basis, the monthly file into the annual file monthly, and the annual file into the history partition annually. Doing so meant that on a daily basis we don’t have to write the entire history, annual, or monthly files; only the weekly files. This increases processing speed3.

It is critical that all these files be created, and kept, in sorted order.

Let me make that point again. It is critical, for performance, that the process be defined to avoid ever sorting the files except perhaps the low volume daily input files. All other files must be written by SAFR in sorted order, preserving merge capabilities for the next execution.

Large Reference Files

The heart of the matter is the size of the reference files, and the processes to update them. At times, the reference data to be processed are larger in total than the event file containing the amounts to be summed.

Another common problem over the years was detecting changes in the reference data. Often systems present snapshots of the reference data; how a policy looks at a specific point in time. It is the responsibility of the repository to detect if any change in that policy has occurred by comparing it to the existing policy record.

Doug and the team recognized that they could also perform this function in the read exit, because the exit has both records available to it in the merge process. If the exit detected that the record in the daily file was exactly the same as the record in the repository, the exit could throw away the new record with no impact on any report. When a record was received in the new file that was different, the exit would place the processing effective date on the record. Thus the view writing the updated policy entity would write out both records, in sorted order by policy and effective date.
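The change-detection step might look like the following sketch, assuming each record carries a comparable set of fields; the record layout and the date stamping are illustrative, not the exit’s actual format.

```python
# Change detection sketch: during the merge, compare the daily snapshot to
# the existing repository record for the same policy.

def detect_changes(history_rec, daily_rec, effective_date):
    """Return what the view should write for this policy, in sort order."""
    if daily_rec["fields"] == history_rec["fields"]:
        return [history_rec]              # identical snapshot: discard it
    stamped = dict(daily_rec, effective=effective_date)
    return [history_rec, stamped]         # keep both versions, dated

old = {"policy": "P1", "fields": {"addr": "Elm St"}, "effective": "2001-01-01"}
unchanged = {"policy": "P1", "fields": {"addr": "Elm St"}}
changed = {"policy": "P1", "fields": {"addr": "Oak St"}}
```

Discarding the unchanged snapshot inside the exit means no view downstream ever pays IO or CPU for a record that carries no new information.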


We found this model of processing so useful that Doug found himself repeatedly writing these kinds of paired exits. He finally decided to generalize them, so they accepted parameters rather than having hard-coded structures within them. The parameters specify which files should be opened and the start position and length of the merge keys in each file. This program and its related lookup exit are now part of the standard SAFR suite.

Paired Write Exit

The next development came with the creation of a companion write exit.

The first step in this development was when a project needed to use logic to generate additional event records. Instead of putting the logic into the SAFR module, Doug created the ability for the read exit to call other exits (programs which would emulate an access method for the read exit), to deliver records to it. These new records were simply inserted into the bottom of the common key input buffer, as if they had been read from disk.

On a later project, Doug recognized that data is often written by views in multiple pass processes and then needed by other views. Piping can be used for these purposes, but the pipe reader loses visibility of, and the ability to join to, the other records in the read buffer. Rather than writing the data to disk, Doug created a companion write exit capable of inserting rows into the common key input buffer. These records looked as if they were read from disk, but they weren’t; they were created by views downstream from the exit but in the same process.

Thus additional input records for a “virtual entity” could be created either by an exit or a view. An exit could be used if they were too complex for views, or views could do the work if control of the parameters in SAFR was easier for maintenance.
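A toy sketch of the virtual-entity idea: while the buffer for the current key is scanned, views may emit records that are appended to that same buffer and processed later in the scan, as if they had been read from disk. The view logic here is invented.

```python
# Virtual record insertion sketch: the buffer can grow while it is being
# scanned, so records emitted by views are seen by later views in the
# same pass.

def process_buffer(buffer, views):
    """views: functions mapping a record to zero or more new records."""
    i = 0
    while i < len(buffer):                  # buffer may grow as we scan it
        for view in views:
            buffer.extend(view(buffer[i]))  # virtual rows join the buffer
        i += 1
    return buffer

def derive_fee(rec):
    # An invented view: emit one virtual fee record per transaction record.
    if rec["type"] == "TXN":
        return [{"type": "FEE", "policy": rec["policy"]}]
    return []

result = process_buffer([{"type": "TXN", "policy": "P1"}], [derive_fee])
```

Note that `derive_fee` ignores the "FEE" records it creates; a view that reacted to its own output, or to another view's output that reacted back, would keep growing the buffer without bound.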

Results

This creates a unique processing paradigm, one for which I cannot think of an analogy. It is almost as if a program, while reading a file, appends records onto the end of the input file, records it will see later as it continues processing. In truth, the records aren’t appended at the end of the input file but in the middle of it: the input files are sorted by policy, that specific policy is being processed at the current moment, and the records are inserted at the end of the input buffer.

As I have described this to people, they have wondered about the possibility of an infinite loop. One is possible if two views which create virtual output in the buffer depend upon each other for their input. The output from one view would create output from the other view, which would then create output from the first view again. This would likely show up as the job running out of memory as the input buffer fills and overflows. Understanding the dependencies between views is important to avoid creating this situation.


Most everything we have talked about came together in a most remarkable way in the spring of 2002.

1 Failure to include these record filters results in views producing nonsensical outputs or, at worst, data exceptions.
2 In fact, the view described in Find More Detailed Events under “sort merge” was a multiple source view, but not in the typical sense we have described so far. Rather, it used this concept of each view selecting a particular type of record as its event file from the input file. Thus both “sources” were from the same file, just different LRs. One source read perhaps the “L110” elements to produce the ledger columns; the other read something like the P110 elements to produce the policy level columns. Both elements had the account key on them so that when the extract records were summarized, the results were collapsed into a single row.
3 We used an alternative approach at the cookie manufacturer, where we established seven daily files and each week created empty versions of them. We then replaced an empty file with a real daily file each day. On a weekly basis, we accumulated all the daily files and rewrote the monthly files, I believe, eliminating the need to maintain weekly files at the cost of rewriting some data.