When Eric showed us the simple picture of a business event-based system, it gave the impression that the repository of business events contains a single record type, where all the attributes describing the event are recorded along with any amounts. SAFR views are typically created against a single logical record, a single record structure. All views reading that LR and file see the same kinds of records.
Of course, the repository also contains reference data. But for the most part these files are relatively small, with keys and answers measured in the tens of thousands to hundreds of thousands of rows. Because the summary structures were in the same format as the transaction records, the Cookie Manufacturer’s repository was no more complex than this.
As we approached new problems, we found some did not fit into the simple view of a single record structure. For example, we used SAFR against the stacked data element file that had multiple record structures in it. The views Jim and Mark created had to first test a common field in each record to select only records that were appropriate for that view. Some records could be interpreted by one LR, others by another. This use of SAFR is typical of ETL processes that work around the complexities of the data to be extracted.
At the insurance company we found a problem with the size of the reference files. Normalization of data elements is more efficient for storage, since the attributes of a customer account exist in only one place, on the customer account record. An account number from the event record can be used to access them. Modern businesses have millions of accounts, so the size of these “reference” files starts to approach that of the event files.
Data Modeling
I remember Greg Forsythe, the most practical data modeler I have ever met (a truly rare breed), working with Doug at the Insurance Company to devise an approach whereby SAFR could process these files effectively. Greg modeled the attributes needed for the reports in third normal form. In other words, he placed the policy attributes on the policy table and represented each policy as a single row. Because policies are renewed over multiple terms, he created a separate table for the term attributes, with a term key to identify each row.
He continued to add tables with additional attributes until he got down to the policy and loss transactions, those that Mark Kimmell had discovered from the Stacked Data Elements. All these other attributes were also in the Stacked Data Elements as separate records. It is likely the source system stored them in a similar manner. On the transactions – the true business events containing monetary amounts – he placed the keys for all the higher level entities: the policy, term, etc.
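To make the shape of this model concrete, here is a minimal sketch in Python of the five entities as third normal form record layouts. Only the keys and the attributes that appear later in this chapter (Policy ID, Term ID, Term Length, Start Date, Policy Event Type, Amount, and a policy type) come from the text; the remaining field names are assumed for illustration and are not the actual table designs.

```python
from dataclasses import dataclass

# Illustrative third-normal-form layouts: each attribute lives once, on the
# entity that owns it, and each transaction record carries the keys of the
# higher-level entities it belongs to.  Field names beyond the keys are
# hypothetical.

@dataclass
class Policy:                  # one row per policy
    policy_id: int
    policy_type: str           # assumed attribute

@dataclass
class Term:                    # one row per term (renewal period) of a policy
    policy_id: int
    term_id: int
    term_length: str           # e.g. "6 months"
    start_date: str

@dataclass
class PolicyEvent:             # the true business events on the policy
    policy_id: int             # keys of the higher-level entities
    term_id: int
    policy_event_id: int
    policy_event_type: str     # e.g. "Prem. Payment"
    amount: float

@dataclass
class Loss:                    # one row per loss (claim) against a term
    policy_id: int
    term_id: int
    loss_id: int

@dataclass
class LossEvent:               # monetary activity recorded against a loss
    policy_id: int
    term_id: int
    loss_id: int
    loss_event_id: int
    loss_event_type: str
    amount: float
```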
Having modeled the data in this way, the tables he designed could have been loaded into any relational database. SQL could have been used to join the tables together to produce the outputs as long as the amount of data in them was sufficiently small. The computer would have used indexed access methods explained in Select, Sort, Summarize to locate the appropriate records and combine them into the outputs desired.
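As a rough illustration of that indexed-access pattern (not anything SAFR itself does), the sketch below builds an in-memory index on the Term table’s key and probes it once per Policy Event, using the illustrative layouts above.

```python
# Indexed-access join, sketched in Python: index the reference table by its
# key, then probe the index for each event row.  This is random access per
# event, which works well while the reference data is small enough to index.

def premium_by_term_length_indexed(policy_events, terms):
    # Build the "index" on the Term reference table.
    term_index = {(t.policy_id, t.term_id): t for t in terms}

    totals = {}
    for ev in policy_events:
        if ev.policy_event_type != "Prem. Payment":
            continue
        term = term_index[(ev.policy_id, ev.term_id)]  # indexed lookup
        totals[term.term_length] = totals.get(term.term_length, 0.0) + ev.amount
    return totals
```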
However, given the issues of scale, a better processing pattern needed to be adopted, one that more realistically reflected what the machine had to do to combine the data and that got more out of each stage in that process.
The team chose to use sequential files to exploit the Scan Engine’s power and to eliminate the CPU time involved in decoding relational data structures. They chose to maintain the files as z/OS Generation Data Groups, a simple mechanism for managing access to different versions of sequential files. The files would be maintained in sorted order by their keys.
The five files above – Policy, Term, Policy Event, Loss, and Loss Event – are maintained as separate files in the order shown. The Policy, Term, and Loss files are reference, or static, data files. The Policy Event and Loss Event files are event files. Doug added a new feature to the Scan Engine: the Common Key Data Buffer feature.1
Common Key Data Buffer Event Files
This feature is a little like a prefetch function in some databases, whereby data that is expected to be used in the process is brought into memory before it is needed. The common key buffer technique brings all the records for a particular key, in our example one policy, into memory. It does this through a merge process, whereby multiple partitioned files are brought together. The data buffer from our example for the first policy would appear as follows:
Each of these records, as long as they have unique keys, can be used either as an event file or a reference file. The data for this policy is held in memory until all the records in it are read as event file records. When they have been processed as event files, all the data for the next policy is brought into memory.
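A minimal sketch of the buffering idea, assuming each entity is a sequential file already sorted by the common key (here the Policy ID) and represented as an iterable of (policy_id, entity_id, fields) tuples; the function name and record shape are illustrative, not the Scan Engine’s actual interfaces.

```python
import heapq
from itertools import groupby

# Merge the sorted entity files and hand back one policy's worth of records
# at a time -- the common key data buffer for that policy.  Each input is an
# iterable of (policy_id, entity_id, fields) tuples sorted by policy_id.

def common_key_buffers(*sorted_entity_files):
    merged = heapq.merge(*sorted_entity_files, key=lambda rec: rec[0])
    for policy_id, records in groupby(merged, key=lambda rec: rec[0]):
        # Everything for this policy is now in memory; it is processed as
        # event records and is also available for lookups, then released
        # when the next policy's buffer is built.
        yield policy_id, list(records)

# Hypothetical usage, with one sorted iterable per entity file:
# for policy_id, buffer in common_key_buffers(policies, terms, policy_events,
#                                             losses, loss_events):
#     resolve_views(buffer)
```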
SAFR’s common key buffering feature appends three fields to the front of each record to indicate where the record came from. The first field identifies which LR describes the record. The next two fields describe which partition of that entity the record came from; they are discussed in the next chapter.
The Entity ID field is used so that the views can filter which business events they are reporting on. Because each of the records is passed to the views, reference data views can be resolved at the same time as event file views. Thus a view which counts policies of a particular type can select records with the Policy entity ID, while views summarizing premium payments can summarize Policy Events. This is similar to having views read an event file with multiple record types within it.
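As a sketch of how two such views might resolve in one pass over a single policy’s buffer, keeping the assumed (policy_id, entity_id, fields) record shape and hypothetical field names:

```python
# One pass over a policy's buffer resolves reference data views and event
# file views together; each view filters on the prefixed Entity ID.

def resolve_views(buffer, policy_counts, premium_totals):
    for policy_id, entity_id, fields in buffer:
        if entity_id == "Policy":
            # A view counting policies of a particular type (reference data).
            ptype = fields["policy_type"]
            policy_counts[ptype] = policy_counts.get(ptype, 0) + 1
        elif entity_id == "Policy Event":
            # A view summarizing premium payments (business events).
            if fields["policy_event_type"] == "Prem. Payment":
                premium_totals["premium"] = (
                    premium_totals.get("premium", 0.0) + fields["amount"])
```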
Common Key Data Buffering Lookups
Lookups to other records are accomplished by defining a join that uses the linkage between these records. For example, if my view is summarizing premium payments from the Policy Event record by Term Length, I would define my view to select records where the Entity ID is Policy Event. The first record selected will be the following:
| Entity ID | Partition | Category | Policy ID | Term ID | Policy Event ID | Policy Event Type | Amount |
|---|---|---|---|---|---|---|---|
| Policy Event | 00 | A | 1 | 1 | 1 | Prem. Payment | 100.00 |
I’d define a join which uses the Policy ID and Term ID to find the Term record. The join will return the following record.
| Entity ID | Partition | Category | Policy ID | Term ID | Term Length | Start Date |
|---|---|---|---|---|---|---|
| Term | 00 | A | 1 | 1 | 6 months | 11 Months ago |
Note that this record has already been processed as an event file record (it is the second record in Figure 86), but because of the Common Key Data Buffering it is still in memory and available for lookup. The view can now use any of these fields for any purpose: selection, sorting, placing in columns, or joining to other data. Thus I can produce a view of premium payments by term length if desired. This join technique supports date effective joins as well.2
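Under the same assumed record shape, the buffered join amounts to a probe into a small dictionary built from the Term records already in memory for the current policy; it produces the same premium-payments-by-term-length result as the earlier indexed-access sketch, but without random I/O against a large reference file.

```python
# The lookup against the common key data buffer: index this policy's Term
# records by (Policy ID, Term ID) and probe that index from each Policy
# Event.  Record shape (policy_id, entity_id, fields) is assumed as before.

def premium_by_term_length_buffered(buffer, totals):
    # The Term records for this policy are already in memory.
    terms = {(f["policy_id"], f["term_id"]): f
             for _, e, f in buffer if e == "Term"}

    for _, entity_id, fields in buffer:
        if entity_id != "Policy Event":
            continue
        if fields["policy_event_type"] != "Prem. Payment":
            continue
        term = terms[(fields["policy_id"], fields["term_id"])]  # in-memory join
        length = term["term_length"]                            # e.g. "6 months"
        totals[length] = totals.get(length, 0.0) + fields["amount"]
```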