As noted earlier, if you don’t pay attention to performance throughout the preceding steps, this step won’t save you. It is not possible to turn a jalopy into a race car. Performance can be tuned up, but it can’t be created after the fact. Having said this, there are certain aspects of performance that complicate building the initial components. These are better added after the components are functionally working.
Although SAFR is designed to resolve all processes in a single pass of the event file, when SAFR is used to generate events, additional passes of the data can be required to actually produce reports. At lunch one day, at the Insurance Company I think it was, I was thinking about the architecture of the allocation process. As noted, the allocation process generated a great many event records. These records then needed to be allocated in additional processes, generating more output records. This happened a few more times, and then reporting against all those event records needed to occur.
I mused to Doug that I wished we could have the outputs from one view feed into another view, rather than having to write them to disk and read them again in another execution of the Scan Engine. Doug thought for a moment and then smiled and said, “Well, that would be possible.” He went on to describe how that feature could be added to SAFR. Over the course of the next few months, Doug added that capability, which we called piping.
Piping allows the output from a view, which would otherwise be written to a file, to simply be passed in memory between SAFR processes. Remember that the CPUs only work against data on the white board. All data has to go through memory after coming from or before going to disk. Thus to programs, data read from a file and data being written to a file look the same; it is just data in memory.
SAFR uses this fact to, in a sense, trick itself: the data writing processes create data in memory, and the data reading processes read that data from memory; the data is never actually written to disk. Because the “white board” space is smaller than what is available in the “binder”, the switching back and forth between writing and reading must be repeated hundreds, thousands, or even millions of times. A small section of data is created and then read. Then the next section of data is created and read.
It could be called a virtual file, portions of which only exist in memory for a very short period of time. Piping is used when the views are able to run asynchronously; in other words, they operate without knowledge of, or dependence upon, the timing between them. This is discussed in Piping, Tokens, and the Write Verb.
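The idea of a virtual file can be sketched in a few lines of Python. This is only an illustration of the concept, not how SAFR itself is built: the view functions, the record layout, and the `buffer_records` parameter are all hypothetical. A small bounded queue stands in for the limited white board space, so the writer and reader must take turns many times.

```python
import queue
import threading

_END = object()  # sentinel marking the end of the virtual file


def writer_view(events, pipe):
    """Hypothetical writing view: emits one output record per event."""
    for event in events:
        # put() blocks while the buffer is full, so the writer pauses
        # until the reader has consumed some records.
        pipe.put({"account": event["account"], "amount": event["amount"] * 2})
    pipe.put(_END)


def reader_view(pipe, totals):
    """Hypothetical reading view: summarizes records as they arrive."""
    while True:
        record = pipe.get()
        if record is _END:
            break
        account = record["account"]
        totals[account] = totals.get(account, 0) + record["amount"]


def run_piped(events, buffer_records=4):
    """Run both views with only a small in-memory buffer between them."""
    pipe = queue.Queue(maxsize=buffer_records)
    totals = {}
    writer = threading.Thread(target=writer_view, args=(events, pipe))
    reader = threading.Thread(target=reader_view, args=(pipe, totals))
    writer.start()
    reader.start()
    writer.join()
    reader.join()
    return totals
```

Neither view knows when the other runs; the only coupling is the buffer, which is what allows the intermediate records to skip the disk entirely.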
Debugging or even examining data which exists in memory is difficult to do; the pace of changes on the white board is simply too fast for people to actually work with. Because of this fact, it is usually better to not pipe between views until the views and processes have been adequately tested and the results of each step verified by looking at disk output. Thus designing for piping is important, but actually adding it to the process becomes one of the later stages of the SAFR method.
Common Key Data Buffering Writes
A similar capability is available through the Common Key Data Buffering feature discussed in the prior chapter. A new record can be written into the data buffer as if it had come in from a file. This record can be used as an event record for other views, or it can be looked up by other views. In practice this is a bit like adding records onto the end of the input event file, although actually it is in the middle of the file as only one policy or arrangement is read into memory at a time.
The Common Key Data Buffering Write approach is used when there are relationships between views or between the records created, or when large cardinality reference files are involved. In other words, processing must be synchronous. However, the common key data buffering process requires calls to subroutines, and calling subroutines has overhead associated with it. Piping does not. Thus piping is more efficient, and should be used if there is no relationship between the records being created and the processes can operate asynchronously.
Care must be taken in analyzing dependencies between the views that read the common key data buffer. If the output from one view can create another input record to that same view, an infinite loop can be created. See Common Key Data Buffering.
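The looping risk can be shown with a small sketch. It assumes a much simplified model in which the buffer for one key is a queue of dictionaries and each view is a function returning the new records it wants written back into the buffer. The function names and the `max_records` guard are hypothetical and are not SAFR features.

```python
from collections import deque


def process_key_buffer(records, views, max_records=1000):
    """Run every view over each record buffered for one common key.

    Each view takes a record and returns a list of new records to write
    back into the buffer (possibly empty).
    """
    buffer = deque(records)
    processed = []
    while buffer:
        if len(processed) >= max_records:
            # A view that keeps feeding itself would otherwise never stop.
            raise RuntimeError("views keep generating input for themselves")
        record = buffer.popleft()
        processed.append(record)
        for view in views:
            # Written records look as if they had come in from the file.
            buffer.extend(view(record))
    return processed


def allocate(record):
    """Well-behaved view: splits premiums, ignores its own output type."""
    if record["type"] != "premium":
        return []
    half = record["amount"] / 2
    return [{"type": "allocation", "amount": half},
            {"type": "allocation", "amount": half}]


def echo(record):
    """Badly designed view: every record it sees produces another."""
    return [dict(record)]
```

With `allocate`, one premium record yields three processed records and the loop ends. With `echo`, the guard raises, which is the behavior the dependency analysis is meant to prevent in the first place.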
A third approach to eliminating writing and rereading records is to use tokens. Tokens allow views to write records that are immediately used by other views within the same process. They might be thought of as preprocess views. Token writing views are placed at the front of the string of views to be executed within a thread.
Token processing allows views to run synchronously. It also uses generated machine code, so there is no overhead for calling subroutines. Thus it is very efficient. It can allow for some level of relationship between records to be managed, but only at fairly low levels of complexity. If there are significant relationships between records, the common key buffering approach should be used. See Piping, Tokens, and the Write Verb.
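As a rough sketch of the ordering that tokens imply, the loop below runs hypothetical token writing views first for each event and hands any token straight to the token reading views before the remaining views see the event. The function and parameter names are illustrative assumptions; SAFR generates machine code for this rather than calling functions.

```python
def run_thread(events, token_writers, token_readers, other_views):
    """Process events one at a time within a single thread."""
    for event in events:
        # Token writing views sit at the front of the string of views.
        for write_token in token_writers:
            token = write_token(event)
            if token is None:
                continue
            # The token is consumed immediately, before the next event
            # is read, so writer and readers stay in step.
            for read_token in token_readers:
                read_token(token)
        # Ordinary views then process the event itself.
        for view in other_views:
            view(event)
```

Because everything happens in one pass through one loop, there is no buffer and no waiting, which is why this style only suits fairly simple relationships between records.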
Another aspect of optimizing performance is adding parallelism to processes. Parallelism must be designed into the application from the beginning, but again, parallelism creates complexity in debugging processes. The system should be tested with a single process first, to ensure that the outputs are correct, before parallel processes are added.
Parallelism comes in a couple of different forms. One form often used is dividing the input event files by some key or portion of a key which has no business purpose. For example, often the event files are segregated by two digits of a key ID. This simply creates more threads to use the CPUs available on the machine. There would never be a business reason for not processing all the partitions.
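A minimal sketch of this kind of technical partitioning follows. It assumes the last two digits of a string key ID are used (the text does not say which two), and the field names `key_id` and `amount` are made up for the example. The threads only illustrate the structure of the work; they are not a claim about how SAFR schedules it.

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor


def partition_events(events, digits=2):
    """Split events by the last `digits` characters of the key ID."""
    partitions = defaultdict(list)
    for event in events:
        partitions[event["key_id"][-digits:]].append(event)
    return partitions


def summarize(events):
    """Stand-in for the views: total the amounts in one partition."""
    return sum(event["amount"] for event in events)


def run_parallel(events, workers=4):
    """Process every partition in its own task and combine the results."""
    partitions = partition_events(events)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partial_totals = list(pool.map(summarize, partitions.values()))
    return sum(partial_totals)
```

Comparing `run_parallel(events)` with `summarize(events)` on the same input is a simple version of the advice above: verify the single-process result first, then confirm the partitioned run matches it.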
Another type of parallelism often comes from maintaining time partitions. This is done so that views which create new time partitions of files only have to write a limited number of records, and if perchance a SAFR execution does not require historical data, some time partitions do not need to be read.
Thus our example of the policy repository had this type of a structure to it:
Each box in the figure is an individual file. Time partitions may or may not be read, depending upon the views’ needs for historical data. However, all Technical Partitions would likely be read. The Entity Partitions contain the differing record types (different LRs).
The partition number added by the Common Key Data Buffering feature indicates the technical partition from which the data was retrieved. The category field indicates the time partition from which the data was retrieved. Thus the Common Key Data Buffering processes merge all these files together into memory, one policy or arrangement or what-have-you at a time.
Views can use these fields to select records from the appropriate partitions and to write out selected data. For example, a view might on a daily basis write a single merged output file containing both the week’s data and the new records for that day. Then once a week the weekly file is copied into a new version of the monthly file.
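The merge and the two added fields can be sketched as follows, assuming each file is already sorted by a `policy` key. The field names `partition` and `category`, and the shape of the `sources` argument, are assumptions made for the example rather than the actual SAFR record layout.

```python
import heapq
from itertools import groupby


def tag(records, partition, category):
    """Add the source partition number and time category to each record."""
    for record in records:
        yield {**record, "partition": partition, "category": category}


def merge_by_key(sources):
    """Merge sorted files and deliver one policy's records at a time.

    sources: list of (partition, category, records) tuples, where each
    records iterable is sorted by the 'policy' field.
    """
    streams = [tag(records, part, cat) for part, cat, records in sources]
    merged = heapq.merge(*streams, key=lambda r: r["policy"])
    for policy, group in groupby(merged, key=lambda r: r["policy"]):
        # Only this one policy's records need to be in memory at once.
        yield policy, list(group)
```

A view in this model could then select on `category` to, say, write only the records that came from the daily file, while still seeing the weekly records for the same policy alongside them.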
These steps of the SAFR method are intended to give a sense of the approach to solving reporting problems. Of course nothing compensates for the flexibility and adaptability of the human mind. A fool with a tool is still a fool, and a fool with a method might just mean it takes longer to find out he is a fool.
Following these steps has proven to create very flexible, cost-effective reporting environments.