We performed a benchmark for an airline, and in that process Jay (I suspect) outlined a way of reducing the amount of data in the extract files that turned into a feature called extract phase summarization. We’ll discuss that below. But before that feature was available, I remember many hours of babysitting SAFR processes only to see the sort portion of the process fail.
Remember that in Select, Sort, Summarize we reviewed the challenges of sorting data. Sorting can be the longest process in general business computing, particularly in reporting. High volume sort processes can strain, and even exceed, two of the three parts of our compute environment, memory and disk, even on the largest machines. Memory is challenged because that is where all the work takes place. Disk is challenged because the file must, at a minimum, be copied, and in most cases the space consumed is nearly tripled before the process is through. Only CPU capacity is typically not challenged in sort processes.
Sorting data is the one part of the SAFR batch process that uses other companies’ software products. Sort utilities have existed on mainframes for years and are highly tuned through extensive development and testing. But because sorting is such an important part of the reporting process, it is worth understanding some basics about sorting large files.
Sort and Merge Basics
Before describing how large files are sorted, let’s first understand what a merge process looks like. Sort utilities can merge multiple files that are already in sorted order. Merging requires reading the first record from each of the files and comparing those records to determine which one should be written next to the output file; the process repeats until every input file is exhausted. Only one record from each file needs to be in memory at any point in time.
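As a rough illustration of this merge step, here is a minimal Python sketch. It assumes line-oriented text files whose lines sort by their full content; it is only the general pattern, not the mainframe sort utility itself.

import heapq

def merge_sorted_files(paths, out_path):
    # Each input file is assumed to already be in sorted order.
    files = [open(p) for p in paths]
    try:
        with open(out_path, "w") as out:
            # heapq.merge keeps only one record from each input in memory,
            # always emitting the lowest-keyed record next.
            out.writelines(heapq.merge(*files))
    finally:
        for f in files:
            f.close()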
Back to sort. Like any other program, sorting utilities only work with data in memory. That doesn’t mean files too large to fit into memory can’t be sorted. If the data is too large to fit into memory, sort utilities bring in as much data as they can, sort those records, and then write that set of records to disk as a temporary file. They then bring in another section of the file and do the same thing. After the utility has read the last record of the input file and sorted it into a temporary file (that last record might end up needing to be the first record written to the output file), it performs a merge process: it opens each of the temporary files, reads the first record from each, compares them, and writes the winning record to the output file, repeating until the temporary files are exhausted.
Note here, though, that each record was read from disk (1st IO), sorted and written to a temporary file on disk (2nd IO), read back from disk during the merge (3rd IO), and finally written to the output file (4th IO). For this reason alone, sorting large files can be very, very, very slow.
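Putting the two pieces together, the following Python sketch shows the general external sort pattern: sorted runs written to temporary files, then merged. It is only an illustration of the IO pattern described above, under the assumption of line-oriented text records; it is not how the vendor sort utilities actually work.

import heapq
import itertools
import os
import tempfile

def external_sort(in_path, out_path, max_records_in_memory=100_000):
    run_paths = []
    # Pass 1: read the input (1st IO), sort what fits in memory,
    # and write each sorted run to a temporary file (2nd IO).
    with open(in_path) as infile:
        while True:
            chunk = list(itertools.islice(infile, max_records_in_memory))
            if not chunk:
                break
            chunk.sort()
            fd, path = tempfile.mkstemp(text=True)
            with os.fdopen(fd, "w") as run:
                run.writelines(chunk)
            run_paths.append(path)
    # Pass 2: re-read every run (3rd IO) and merge them into the
    # final sorted output file (4th IO).
    runs = [open(p) for p in run_paths]
    try:
        with open(out_path, "w") as out:
            out.writelines(heapq.merge(*runs))
    finally:
        for f in runs:
            f.close()
        for p in run_paths:
            os.remove(p)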
Business Event Based Sorts
The need for extreme care is even more acute when attempting to implement a business event based reporting architecture. Consider why by examining the views we have built and run thus far. Suppose all of the views we have constructed were Format Phase views, that none of them had any general selection criteria (so every event record read would be written to the extract file1), and that they all wrote to EXTR001. If that were the case, then the GVBMR95 control report would look like this.
Extract File Sorts
Think for a moment about the order of the records in the extract file if we had executed the above set of views. The GVBMR95 Logic Table Trace would show event file record 1 being read by view 3261 and written to the extract file. Then that same record would be processed by view 3262 and written to the extract file. Then on to the next view, and the next view. At the end of the process, each event file record would have been written four times into the extract file. Except for putting the sort keys on the front of each extract record, extracting the data does nothing to put the records into the right order2.
When GVBMR88 runs, it needs all the records for view 3261 to be next to each other, in sorted order by the sort key, so it can perform sub-totaling and calculations. Because SAFR shares extract files among views, to reduce the operational complexity of having every view write to a separate file, SAFR precedes the Sort Key Data with the View ID as the last field of the control area. The Extract Program creates the sort control cards, the commands to the sort utility, for each extract file. The sort start position is always the view ID in the control area, and the length of the data to be sorted covers the longest SK area of any view writing to that extract file.
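As a rough sketch of how such a sort control card might be generated, assuming DFSORT-style syntax and a hypothetical record layout (the offsets and lengths here are illustrative, not SAFR’s actual control area layout):

def build_sort_card(view_id_start, view_id_length, longest_sk_length):
    # Sort ascending on the character range that begins at the view ID in the
    # control area and extends through the longest sort key (SK) area of any
    # view writing to this extract file. Positions are 1-based, DFSORT style.
    start = view_id_start
    length = view_id_length + longest_sk_length
    return f" SORT FIELDS=({start},{length},CH,A)"

# For example, a hypothetical extract file whose view ID begins in column 11,
# is 8 bytes long, and whose longest SK area is 24 bytes:
print(build_sort_card(11, 8, 24))   # " SORT FIELDS=(11,32,CH,A)"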
The control record and header records are constructed in such a way that they sort to the top of the file or the view, respectively. In this way, sort places the control record at the top of the file, then the header for view 3261, then all the extract records for view 3261 in order by their sort criteria, followed by the header record for view 3262, and so on.
Typically, one instructs the sort utility which specific fields, by position and length, are to be sorted in descending order. But because all the data is in one file, sorting data in descending order requires a bit of a trick. The Extract program converts the data into a form called “two’s complement,” which is similar to creating a reverse image of the data. The data can then be sorted ascending, and the Format program converts it back into its original form, thus giving it a descending order. We’ll show an example below.
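The general idea can be sketched in Python as follows, assuming fixed-length byte keys; this is only an illustration of a complement-style transform, not SAFR’s actual encoding.

def to_sortable_descending(key: bytes) -> bytes:
    # Invert every bit of the key so that sorting the transformed keys in
    # ascending order is equivalent to sorting the originals in descending order.
    return bytes(0xFF - b for b in key)

def restore_original(key: bytes) -> bytes:
    # The transform is its own inverse, so the Format step just applies it again.
    return bytes(0xFF - b for b in key)

keys = [b"APPLE", b"PEAR ", b"MANGO"]
transformed = sorted(to_sortable_descending(k) for k in keys)
print([restore_original(k) for k in transformed])   # [b'PEAR ', b'MANGO', b'APPLE']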
Extract Phase Summarization
Jay recognized in the airline benchmark that we had to create a single record which summarized the entire file, to be used in another pass of the data as a reference file key. Perhaps the problem was one where we had to calculate the percentage of total for the individual rows. He asked Doug to create a special process at the end of the Extract engine that, instead of writing the record to the extract file, kept the record in memory until the next extract record came along. When the process saw the next extract record, it would compare the sort keys; if they were different, the held record was written to the extract file; if they were the same, the process would immediately accumulate the CT columns into the held record, and no record was written to the extract file.
The impact of this process was that a very large file in which every record carried the same collapse value was reduced, by the end of the extract phase, to a single row containing the total of all the records in the file. Instead of having to replicate every record into the extract file, sort them, and then reduce the answer to one row, SAFR arrived at the answer at the end of the extract phase.
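A minimal sketch of that held-record comparison might look like the following, assuming each extract record is reduced to a sort key and a list of CT column values (the names and record shapes here are assumptions for illustration):

def summarize_adjacent(extract_records, write):
    # Hold one record in memory; collapse consecutive records whose sort keys match.
    held_key, held_cts = None, None
    for key, ct_columns in extract_records:
        if held_key is None:
            held_key, held_cts = key, list(ct_columns)
        elif key == held_key:
            # Same sort key: accumulate the CT columns, write nothing.
            for i, value in enumerate(ct_columns):
                held_cts[i] += value
        else:
            # Different sort key: release the held record and hold the new one.
            write((held_key, held_cts))
            held_key, held_cts = key, list(ct_columns)
    if held_key is not None:
        write((held_key, held_cts))     # flush the last held record

When every record shares one collapse value, nothing is written until the very end, and the extract file contains a single row.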
This is great, but what if the event records had multiple values that needed to be collapsed? One approach considered was sorting the event file before executing the Extract. But requiring the input event file to be in sorted order would have kept the technique from being applied to a broader set of views. The idea behind the event based architecture was the ability to produce many views of the business events with access to all the business event attributes. Requiring the event file to be in sorted order was the same as requiring indexes to be defined on the file, which restricted the kinds of access that could be gained. And if the records came in completely random order, the extract file might end up with just as many records, with no extract phase summarization at all.
Doug realized this, and could see that with a small change the technique could be broadened to apply to many more views. What was needed was to keep a stack of output rows, a thousand or so for each view rather than just one row, keeping the rows with the most activity against them in memory and allowing the least used rows to drop into the extract file. Thus, as an extract record was ready to be written to the extract file, a simple index on the view’s sort keys, pointing to the proper held record, would allow collapsing that record into it. Let’s use an example to show how this works.
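Before the worked example, here is a rough Python sketch of the broadened technique: a bounded, per-view stack keyed by sort key, with the least recently used rows dropping into the extract file when the stack is full. The stack size, names, eviction policy, and sample values are assumptions for illustration, not SAFR’s actual implementation.

from collections import OrderedDict

class SummarizationStack:
    # A per-view stack of held extract rows, keyed by sort key.
    def __init__(self, extract_file, limit=1000):
        self.rows = OrderedDict()      # sort key -> list of CT column accumulators
        self.extract_file = extract_file
        self.limit = limit

    def add(self, sort_key, ct_columns):
        if sort_key in self.rows:
            # Collapse: accumulate the CT columns into the held row.
            held = self.rows[sort_key]
            for i, value in enumerate(ct_columns):
                held[i] += value
            self.rows.move_to_end(sort_key)     # this row is active; keep it in memory
        else:
            if len(self.rows) >= self.limit:
                # Drop the least recently used row into the extract file.
                old_key, old_row = self.rows.popitem(last=False)
                self.extract_file.append((old_key, old_row))
            self.rows[sort_key] = list(ct_columns)

    def flush(self):
        # At the end of the event file, write everything still held in memory.
        for key, row in self.rows.items():
            self.extract_file.append((key, row))
        self.rows.clear()

extract_file = []                       # stands in for the physical extract file
stack = SummarizationStack(extract_file)
stack.add(("3265", "K & N jones", "Working Capital"), [100.00])
stack.add(("3265", "K & N jones", "Other"), [25.00])
stack.add(("3265", "K & N jones", "Working Capital"), [50.00])   # collapses into the first row
stack.flush()
print(extract_file)                     # two summarized rows rather than three extract records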
Extract Phase Summarization Example
I have created view 3265. This view reduces our 18 journal entries, by family name, into two categories, working capital and other. I have used the following logic text to convert the accounts into constants that are then used as the second sort key in collapsing the data.
If {ACCOUNT} = 111 Or {ACCOUNT} = 211 Then
    COLUMN = "Working Capital"
Else
    COLUMN = "Other"
EndIf
The view generated the following logic table.
The following is the Logic Table trace for records 1 and 2. Note that record two failed the “If” test above, so the “Else” condition was executed. After these two rows, there were two records in the stack, both for the “K & N jones” family: one “Working Capital” and one “Other”.
Here is the final output from the Format phase.
In this way the memory used for the stack can be used efficiently; we don’t need to know exactly what results we will get at the end of summarization, but we can still gain the benefit of reducing the sort file size. Typically, a few thousand records are kept in the stack by default for each view which uses extract phase summarization. If, perchance, one knows that the extract file will not be significantly reduced (it has a lot of sort keys, or a very large number of possible values per sort key), then it is best not to use extract phase summarization. The CPU cycles spent testing whether records can be summarized may cost more than simply allowing sort to do its job.
It is difficult to overstate the importance of the extract phase summarization innovation, and I don’t think anyone saw how far reaching it was when it was created. Overnight, the long-running sort jobs went away because the extract files were now measured in hundreds and thousands of rows, not millions. Beyond that, it highlighted exactly the right place to perform the function of collapsing records.
The act of summarization is the act of building an index, and extract phase summarization builds an index based upon the intersection of the event data and the view definition at exactly the right point in the process. As we noted in Reporting, reporting by and large is the act of accumulating business events to an understandable level, and that level almost always carries a high degree of summarization with it. Thus millions and billions of business events are reduced to thousands, or hundreds, or sometimes even tens or single digits of rows. The computer has sufficient memory to maintain hundreds of these small summaries while passing through the event file.