Remember in Extract Files, we outlined that the Format Phase is a control break program which performs the “group by” or “sum” function. Our focus in this chapter is upon the functions that can be performed in this phase.
In early 2003, as part of developing Version 4 of SAFR, I volunteered to sit in front of the computer with Doug and to recreate each defect in the debugger to speed up Doug’s time coding. Over the course of a couple of months we had closed out most of the significant defects we had in the extract engine, GVBMR95. However, there were a pile of defects for the Format engine, GVBMR88, that we hadn’t seriously attacked. I remember telling Doug one night, “I don’t think we have any other defects right now for GVBMR95 that we can productively work.” Doug reluctantly responded, “OK, pull up GVBMR88 in the debugger, and let’s get started.” I could tell Doug loved MR95, and tolerated MR88.
Format Time Logic
Our focus on the extract thus far has focused on performing functions against business events, for the most part. As we noted in Find More Detailed Events, the syntax of building logic text which is turned into logic table functions uses fields from the event file. Fields on the event file can be used for selected values and making keys for joins, etc. that are not passed into the extract file in any way, or included in the final output.
We can’t ignore the need, though, to perform some of the same functions after summarization. These needs are much less frequent, but they exist nonetheless. For example, certain reports would be extremely long if all the zero values were shown in the report. Many accounts may not have any activity against them, or after all the activity is summarized, it nets to zero. Extract time logic can be used to eliminate event records with zero balances, but excluding zero balances that are the result of netting has to ultimately be performed after summarization is complete.
The format phase logic uses columns in the workbench, not field names. This is because data going into the format phase no longer matches an LR. For example, the extract program may have “created” a new field by inserting a constant in a column, similar to what was done with words “Working Capital” or “Other” sort constants were used above. In this case, there is no “field” to use in the logic text during the format phase. As another example1, the input to the format phase may not come from just one event file.2 Views can combine data from multiple event files into the same extract file. Thus GVBMR88 is limited to working with data in the extract file, no matter where that data came from.
To repeat, GVBMR88 has a number of the same functions as GVBMR95. The difference is, the logic text executed in GVBMR88 specifies columns, view columns, and it only operates against numeric fields. In other words, the ‘Event File’ for GVBMR88 is the extract file, and its ‘fields’ are the columns of data in the extract record.
Although both selection and calculations can be performed in GVBMR95 or GVBMR88, originally calculations were only available at Format time in GVBMR88, and selection was only available in GVBMR95. Even now, more functionality for each of those functions is available in the original respective engine.
The format time logic text for GVBMR88 record selection looks like the following:
SELECTIF (COL.4>200 and COL.5 > 200.00)
This will only select summarized rows where column 4 (record count) and column 5 (amount) are greater than 200.
As another example the following performs a calculation.
If COL.5 > 200 Then COLUMN = COL.5 /100 EndIf
It tests column 5 (amount) and if it is greater than 200, it makes the column equal to the value in column 5 / 100.
All format time logic text causes the numbers involved to be CT columns in the extract record. This is because CT format, packed with a standard number of decimal points, is the most efficient format for doing calculations, including summarization.3
The SAFR Scan Engine is neither solely an ETL tool, nor truly a reporting tool, but it began life, with the Alaska implementation, as more of a reporting tool. The Format program, GVBMR88 began as a print formatting engine.
The following is a sample of a SAFR hardcopy report still produced by GVBMR88.
In recent years, there are few customers using this type of SAFR output. But it is instructive to investigate this format a bit, not only because the tool still produces these outputs, but also to understand the functions they performed. Such functions are done in different ways now, but the principles involved are the same.
One feature of this report is that the report is sorted by the code values for Legal Entity, Cost Centre and Account, but next to each of these code values is a description. The report is not sorted by these descriptions. These descriptions are called sort titles. In many reporting applications today code descriptions are embedded in or referenced by dimension attributes.
When creating SAFR Version 4, I argued that the process of creating these titles should be called aggregation phase look-ups or joins. I didn’t win the day, but thinking of them that way may help understand how they are produced.
The following is the logic table for this view:
In the middle there are a number of build Look-up Keys from Event file, LKE functions, but these aren’t preceded by JOINS. They are followed by KSLK, which stand for Key Save, Look-up Key.
At the end, the SK area of the extract record is a section of the record for Sort Title Keys. GVBMR95 performs all the functions to prepare the keys for doing the Look-up to Summary Structure, or LUSM function, but it doesn’t do that last step in the lookup. The actual look-up is done in the Format phase, by GVBMR88.
The Reference Phase actually prepares two sets of files. It creates the RED and REH for Reference Extract Data and Header. It also prepares a similar set of data called the RTD and RTH, for Reference Title Data and Header. The RTD and RTH are used by GVBMR88 to do this last stage of look-up.
So GVBMR88 only does one step look-ups, the last step of the lookup. If a multiple step look-up is required to get the title, the beginning joins will all be performed by GVBMR95. Only when a LUSM function is about to be performed to get the actual sort title data will GVBMR95 be instructed to save the key it was about to use, for use in GVBMR88.
GVBMR88 performs the join after summarization. So this process can perform many less look-ups (less CPU time) if there is a high degree of summarization in the extract file.
Also, while developing version 4, I suggested to Doug we should create an option to simply show the sort fields as columns of data because people seem to be more comfortable with this nowadays through familiarity with spreadsheets. The following output is what I had in mind, and which Doug agreed we could do.
Although I was more successful in arguing this feature, Doug protested mildly saying that the indented format was much more elegant. It caused me to pause for a moment. I hadn’t recognized the function the indenting did in the report until then. But the indenting was a form of indexing, indexing for human eyes to find the appropriate row on the report. One can scan the report quickly to find changes in Legal Entity, whereas in this second report one has to look for repeating lines or subtotals.
Note that what we typically experience in a spreadsheet is fundamentally different than what is shown in this report. Most spreadsheets do not have multiple levels of summarization within them. People like to copy formulas across a lot of rows, and placing the report above into a spreadsheet requires different formulas when sub-totaling the cost center or legal entity. Typically, the cost center summary would be on a separate tab, and not include the account column, and the legal entity summary would be on another spreadsheet tab as well, and not include cost center or account.4
Each of these tabs could be produced as a separate view; they would only have the sort keys of interest for that tab. The problem is, the extract file that must be sorted to produce these three views would have more data in it than the data needed to produce this one report which has all the same information on it. Again we see the proliferation of outputs needed from one set of business events.5
Alternatively, a pivot table can be created from one spreadsheet with all the columns in it, and by including or excluding columns, the subtotals change. But very large spreadsheets can take a long time to update the pivot tables. Again, the balancing act is apparent between function and performance for reporting processes, whether on a large machine or a PC.
Subtotaling and Calculations
There are times when the level at which calculations are performed is important. In some cases, calculations must be performed on the detailed records. For example, it isn’t possible to summarize the cost of different items and the number of items selected and then multiply the summarized numbers to produce a total cost. Math dictates performing multiplication first, before addition; summarization is a form of addition. Before the advent of extract phase calculations, a different set of codes was available in GVBMR88 to tell MR88 to do the calculations at the detailed level. These functions can be used, or extract phase calculations.
Additionally, in some cases, like in calculating percentages, the calculation must be re-performed at each level of the subtotal break. There is no value to calculate a percentage at the account level (for example record count / amount for a percentage of amount per record) and then add up these percentages for the cost center level. The division must be performed at each level of summarization. A different type of subtotal code is available for this function.
GVBMR88 also has the ability to add the value of the prior line to the current line, thus creating a running total on the report.
All these features are still available in GVBMR88, but used much less frequently than they were when it was much closer to an end user reporting tool. With the advent of spreadsheets and proliferation of database reporting tools, such functions tend to be performed there, after the hard work of summarization of business events is performed by SAFR.
File Format Output
These hardcopy and spreadsheet type outputs were the only outputs from GVBMR88 in the earliest days. It was likely Jay who suggested the format phase should also be able to simply produce sequential files, sometimes called flat files. These types of files have no index structure to them; they are basically a dump of data. Adding this feature moved SAFR from being purely a reporting tool to having more characteristics of an ETL tool. It could now put out data that can be used in subsequent processes. And often those subsequent processes ended up being additional SAFR passes of the data to do more complex functions.
Later, the ability was created to create the copy and extract only views we first discussed was created, reducing the number of Format time File Format Output views that are created. If summarization is needed though, they have to go through sort and GVBMR88.
With respect to summarization, there is a key distinction that should be remembered about file format output though. In the same way spreadsheets don’t deal well with having different levels of summarization in one tab—account subtotals, cost center subtotals, and legal entity subtotals—most other tools that would act upon file output do not as well. Therefore it was decided that GVBMR88 would simply produce the lowest summarized level outputs. For example, from the report above, it would only write out the Account level rows, not the Cost Center or Legal Entity subtotals. This means many of the sub-totaling codes aren’t applicable for file format output.6
Executive Information File Structure
The last type of output to be created by GVBMR88 actually demanded a new program be written. These are the SAFR Executive Information files or cubes. And although produced by another program, they are part of the Format Phase as well.
We introduced the SAFR Executive Information File output or XIF in Iteratively View Results. Each XIF file is like one cut of a reporting cube. It contains all the data for one hardcopy report in it; in other words, subtotals for multiple levels of sort breaks are contained in the file. The first records in the file contain information that allows a program to interpret the rest of the file, including the view title (record type 0), the sort/subtotal fields (record type 1), descriptions of each column (record type 2), and an index as to where in the file the different subtotal breaks start (record type 3). The following is a representation of the version 1 XIF file, modified in some ways to make it more readable.7
The file format allows a program to read the top part of the file, format the page layout for view title, sort fields and columns, and display the first grand total record. Then, when the user drills down on the grand total record, the RBA pointer and record count on type 4, 5, and 6 records tells the program to do a “seek” for an RBA, which moves the read-write head of the disk to the specific point where the next record exists, reads the number of records in the record counter, and display them to the user. This is incredibly efficient indexing method; exactly the right data is transferred into memory for display. Each of these new displayed records also has RBA pointers to allow for additional drill down to the next level.
GVBMR88 writes the different record types and levels to individual files. The type 0, 1, 2 and 3 records are written to the XIH, executive information Header file. The grand total record is written to the XIS00, summary level 00 file in our example. The Legal Entity subtotals are written to the XIS01, file, the Cost Center subtotals to the XIS02, and so on. The type 5, 6, 7 and 8 records are written to the XID, for detail, file.
After GVBMR88 runs, one additional major program, GVBMR49 runs. It simply assembles the XIF file from these different files, in order.
The Executive Information file replaced some of the earlier backend SAFR components, like CRP, Consolidated Report Production which accumulated the hardcopy reports, and reproduced sections of them and routed them to users. These relied on people to send printed papers to mail stops; we now rely upon computers and IP addresses to do the same.
The Executive Information File structure, and reporting tools that are built upon them, allow users to get to the data they need in a very efficient manner. The format is an open format; any tool can be used to create an XIF file for use by the SAFR Insight Viewer.
Parent Topic: Part 5. The Programmer