In the fall of 1995, the cookie manufacturer needed a solution to its reporting problems. Doug developed a solution called the Executive Information File, or XIF. In later years a similar capability in other tools would be termed a cube. The structure of this file had the most summarized information at the top, with lower and lower levels of detail farther down.
For example, from our sample hardcopy report, below:
The XIF file would have the subtotals for the two families listed at the top of the file, as shown on the top portion of the next figure.
When the user double clicked on one of the families, the subtotals for the next level down would appear, as shown in the bottom portion. The process could then be repeated, showing the individual account subtotals if the user selected Dad, Mom, or Charles.
This approach simply automated the manual process of searching sort fields on a hardcopy report to find the desired information at lower and lower levels.
There were a couple of unusual aspects to this file. One was that instead of using SQL to find the next level down, it used a very simple addressing scheme called relative byte addressing. Because SAFR had built the report, and then turned it upside down, it knew how big the report was and where the subtotal records for the next level down would be. It could simply prefix each record with the address of where in the file the next level of records began. Computers can open and read a file at a particular point rather than starting at the top, much like opening a binder or book to a particular page and beginning to read. This approach meant that no unnecessary data was transferred into memory for each user click, which made access to the file very, very efficient.
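To make the relative byte addressing idea concrete, here is a minimal Python sketch under assumed conventions: fixed-length records, each prefixed with the byte offset and count of its next-level records. The field layout, lengths, and names are hypothetical, not the actual XIF format.

```python
# A minimal sketch of relative byte addressing, not SAFR's actual XIF layout.
import struct

RECORD_LEN = 64                     # assumed fixed record length
HEADER_FMT = ">QI"                  # assumed prefix: 8-byte child offset, 4-byte child count
HEADER_LEN = struct.calcsize(HEADER_FMT)

def read_children(path, parent_record):
    """Return the next-level-down records addressed by a parent record."""
    child_offset, child_count = struct.unpack(
        HEADER_FMT, parent_record[:HEADER_LEN])
    with open(path, "rb") as f:
        f.seek(child_offset)        # jump straight to the children;
                                    # nothing above them is read
        return [f.read(RECORD_LEN) for _ in range(child_count)]
```

Each click then costs one seek and a handful of record reads, regardless of how large the file is.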
The other unique aspect of the file was that the entire definition of the report was contained in one place: the view. In many systems, portions of what a report needs to contain are in the ETL definition, portions in the database, and portions in the reporting tool, so changes to a report might mean changes to each of these layers. The Executive Information File, though, contained the definition of the view in its top records. Thus the Scan Performance Engine could change the report however the user requested as it read the data, and the web page that displayed the results could simply adjust what it displayed based upon this data in the file. No heavy-duty programming was needed to make report changes.
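To show the self-describing idea, here is a minimal sketch that assumes the view definition is stored as a simple header of column names at the top of a delimited file; the real XIF carries the view definition in its top records in its own format, so this is only an analogy.

```python
# A minimal sketch of a self-describing output file: the display layer
# reads the column definitions from the file itself rather than having
# them hard-coded. The tab-delimited layout here is an assumption.
def render(path):
    with open(path, "r", encoding="utf-8") as f:
        columns = f.readline().rstrip("\n").split("\t")       # view definition
        print(" | ".join(columns))                            # headings come from the file
        for line in f:
            print(" | ".join(line.rstrip("\n").split("\t")))  # data rows follow
```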
This characteristic is important for the work we have been describing. Our process has involved many changes: defining and reading new files, adding new columns for record counts, changing selection criteria, adding new views and outputs that investigate areas in more detail, adding new reference data to describe fields, and changing the sort criteria to identify a subset of records of interest.
Output Types
The Insight Viewer output from the Scan Performance Engine provides the summary totals first, allowing a user to select additional detail they wish to see by drilling down on the file. The output is a simple sequential file; no database is involved.
The limitation of this format is that it creates a copy of the lowest level detail included in the view in each output file. It is an effective format for investigating summary results, or results of samples of data from very large systems, but is not a mechanism to maintain access to all the detail needed for enterprise reporting.1
The SAFR Hardcopy reports are traditional print-ready, character-based reports. Sort data can be presented as columns or, more elegantly, as values indented according to the sort order. The view builder can control report headings, column headings, subtotal descriptions, sort titles, and the masking or display format for numbers. Additional formatting can be done via user exits to meet more specific shop standards.
The limitation is that the basic format is fixed: columnar, character-based reports up to 255 characters in width. Tools for on-line investigation of these reports tend to be mainframe-based products, which are not very glamorous in today’s reporting world.
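To make the hardcopy style concrete, here is a minimal sketch that prints a small character-based report with indented sort values and subtotals; the family data and column widths are invented for the example.

```python
# A minimal sketch of an indented, subtotaled character report.
from itertools import groupby

rows = [("Jones", "Dad", 120.00), ("Jones", "Mom", 80.50),
        ("Smith", "Dad", 45.25), ("Smith", "Charles", 10.00)]

print(f"{'Family/Member':<20}{'Amount':>12}")
for family, members in groupby(rows, key=lambda r: r[0]):
    members = list(members)
    print(f"{family:<20}")                        # level-1 sort value
    for _, member, amount in members:
        print(f"  {member:<18}{amount:>12,.2f}")  # indented level-2 value
    subtotal = sum(r[2] for r in members)
    print(f"  {'Subtotal':<18}{subtotal:>12,.2f}")
```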
SAFR has the ability to put out two further types of files: delimited files, which can be downloaded into common tools such as spreadsheets, and sequential files, which can be used either as interface files or loaded into databases and other reporting structures.
The limitation of these outputs is that they are not immediately viewable; they have to be placed in other tools to see the results.
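As a rough illustration of the two file types, the sketch below writes the same records as a delimited file and as a fixed-length sequential file; the field names and widths are assumed for the example.

```python
# A minimal sketch of the two output styles: delimited for spreadsheets,
# fixed-length sequential for interfaces or database loads.
import csv

records = [{"account": "1001", "amount": 120.00},
           {"account": "1002", "amount": 80.50}]

with open("extract.csv", "w", newline="") as f:        # delimited output
    writer = csv.DictWriter(f, fieldnames=["account", "amount"])
    writer.writeheader()
    writer.writerows(records)

with open("extract.dat", "w") as f:                    # fixed-length output
    for r in records:
        f.write(f"{r['account']:<10}{r['amount']:012.2f}\n")
```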
ETL Functions
SAFR’s heritage was as a reporting tool, but it transitioned from the early years of being a reporting product to being more ETL focused. This occurred because SAFR was a mainframe tool whose batch characteristics were critical to its performance, and because of the advent of graphical user interfaces and the predicted “death of the mainframe.” Thus, while retaining certain reporting capabilities, it grew in its ability to transform data from raw structures into those reports, to such an extent that for a number of years in the late 1990s we described it in marketing literature as an ETL tool.
Some of the ETL-like features added to the tool over this time included the following.
SAFR can read, write, or convert any of the following numeric data types, a number of which are not even supported by COBOL:
- Binary Coded Decimal
- Binary, 2, 4 or 8 byte
- Sortable Binary
- Edited Numeric
- Masked Numeric
- Zoned Decimal
- Packed
- Sortable Packed
It can convert and interpret data in more than two score different date formats; it can convert numeric data into four score masks (comma and decimal format combinations); and it can handle ASCII or EBCDIC code sets. The UNIX version can translate between big-endian and little-endian structures.
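As an illustration of two of the formats listed above, here is a minimal sketch of decoding packed decimal (COMP-3) and EBCDIC zoned decimal fields; it shows the formats themselves, not SAFR’s own conversion routines.

```python
# A minimal sketch of two mainframe numeric formats.

def unpack_packed(data: bytes) -> int:
    """Packed decimal: two digits per byte, sign in the low nibble of
    the last byte (0xD means negative)."""
    digits = ""
    for b in data[:-1]:
        digits += f"{b >> 4}{b & 0x0F}"
    digits += str(data[-1] >> 4)
    sign = -1 if (data[-1] & 0x0F) == 0x0D else 1
    return sign * int(digits)

def unpack_zoned(data: bytes) -> int:
    """EBCDIC zoned decimal: one digit per byte in the low nibble,
    sign carried in the zone of the last byte (0xD means negative)."""
    digits = "".join(str(b & 0x0F) for b in data)
    sign = -1 if (data[-1] >> 4) == 0x0D else 1
    return sign * int(digits)

print(unpack_packed(b"\x12\x34\x5D"))   # -> -12345
print(unpack_zoned(b"\xF1\xF2\xD3"))    # -> -123
```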
These functions allow it to do many things an ETL tool can do. But real ETL tools are defined by the types of file formats they can read. The UNIX version of SAFR only reads sequential files. The mainframe version can read data in the following formats and ways:
- Sequential access of sequential files using the high-performance BSAM access method
- Sequential access using SAFR channel programs
- VSAM files in sequential access
- DB2 data by reading and interpreting the underlying VSAM data structures
- DB2 data via SQL
I don’t think this relatively limited set of supported formats qualifies SAFR as an ETL tool; rather, SAFR breaks the boundary between traditional tool categories.
SAFR also has a host of keywords, including tests of field values for SPACES, NUMERIC, HIGH-VALUES, LOW-VALUES, and others, as well as tests of join results for FOUND and NOT-FOUND. These keywords allow it to help with an inevitable but unenviable reporting system task called data cleansing.
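As a rough illustration of what these keyword tests do, the sketch below gives plain Python equivalents, assuming EBCDIC fixed-length fields; the reference table and key are hypothetical.

```python
# Minimal Python equivalents of the field and join tests named above.

def is_spaces(field: bytes) -> bool:          # SPACES
    return all(b == 0x40 for b in field)      # 0x40 is an EBCDIC space

def is_high_values(field: bytes) -> bool:     # HIGH-VALUES
    return all(b == 0xFF for b in field)

def is_low_values(field: bytes) -> bool:      # LOW-VALUES
    return all(b == 0x00 for b in field)

def is_numeric(field: bytes) -> bool:         # NUMERIC (EBCDIC digits)
    return all(0xF0 <= b <= 0xF9 for b in field)

# FOUND / NOT-FOUND: did a join to a reference table succeed?
reference = {"CA": "California"}              # hypothetical reference data
found = "ZZ" in reference                     # NOT-FOUND when False
```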
Extensibility
One of the things recognized immediately after the retailer benchmark was the need to be able to extend SAFR with non-native functions. On mainframes this capability was termed the ability to call an exit, a custom program invoked at a specific point in the process. A more common term now is an application program interface, or API. SAFR has multiple points where a custom program can be called to perform logic that SAFR itself cannot do. This extensibility has been used to add new features to SAFR over time, as most new features are first written as APIs and then folded into the base product once they prove durable for other implementations. The following are the common API points in the Scan Engine; a minimal sketch of how these exit points fit into the processing flow follows the list.
- VDP XML API. Although not traditionally called an exit, this API allows a completely different way of defining SAFR metadata and views. It is discussed more in Define Processes.
- Read Exit. This API point emulates an access method, allowing for custom code before a SAFR thread sees data from the event file. It can be used to manipulate an event file, combine multiple physical files into one event file for the SAFR views or even completely generate an event file. From a SAFR perspective, the views read the output from the exit rather than a physical file.
- Lookup or Join Exit. This API point is called whenever a join to a reference file table is to be performed. The parameters of the join are passed to the program, and a pointer to a resulting record is returned. From a SAFR perspective the result of the join is the output from the called program. This feature is often used to create function calls which accept parameters and return results.
- Write Exit. This function is called when a view is to write a detailed record to the extract file. It can be used to manipulate the output, evaluate if the record should be written at all, or write multiple records if desired.
- Sort Exits. The standard sort utilities used to sort extract files have the ability to call exits. A commonly used SAFR sort exit creates permutations of extract records to simulate the creation of multiple views from a single SAFR extract file. This reduces the total data extracted and sorted. This is used to create SAFR cubes which are composed of multiple executive information or XIF files.
- Format Exit. This function is called at the end of the format program, after summarization has occurred. Similar to write exits, it can change the record, determine not to write it at all, or write multiple records if desired. See Exits for more details.
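To show where these exit points sit in the flow, here is a minimal sketch that models them as optional callbacks around a single pass over an event file; the function shape and parameter names are hypothetical and are not SAFR’s actual interfaces.

```python
# A minimal sketch of exit points as optional callbacks around one pass
# over an event file; an illustration only, not SAFR code.

def scan(event_records, read_exit=None, lookup_exit=None,
         write_exit=None, format_exit=None, reference=None):
    extract = []
    # Read exit: can manipulate, combine, or even generate the event data.
    source = read_exit(event_records) if read_exit else event_records
    for record in source:                          # one pass over the data
        if lookup_exit:                            # join performed by the exit
            record = {**record, **lookup_exit(record, reference or {})}
        # Write exit: may suppress the record or emit several records.
        rows = write_exit(record) if write_exit else [record]
        extract.extend(rows)
    if format_exit:                                # after summarization
        formatted = []
        for r in extract:
            out = format_exit(r)                   # may change or drop a row
            if out is not None:
                formatted.append(out)
        extract = formatted
    return extract
```

In this shape, a read exit could merge several physical files into one stream, and a write exit could return an empty list to keep a record out of the extract entirely.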
Data Cleansing
Data cleansing is a process of making data conform to a standard, like the values in the airline database. Data cleansing can be a very long process on some projects. In some approaches to reporting problems, data has to be clean before any reporting can be done on it. Therefore, it must be cleaned in ETL processes before it can be loaded into the database.
The approach we have suggested can find value in the data immediately, without first having to make sure all values are consistent. It is much like any type of cleaning: whatever is visible is what gets cleaned up. The first step to cleaning is understanding the data. Think back to the airline example above, when I handed the guy the list of values that actually existed in the data. Until you know what’s in the data, you can’t clean it up. Visibility will drive more data cleansing than any other activity could.
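As a small illustration of that first step, the sketch below profiles the distinct values of one field in a fixed-width file, producing the kind of value list described above; the file path and field positions are hypothetical.

```python
# A minimal sketch of profiling the values that actually exist in a field.
from collections import Counter

def profile_field(path, start, length):
    counts = Counter()
    with open(path, "r", encoding="utf-8") as f:
        for line in f:
            counts[line[start:start + length].strip()] += 1
    for value, count in counts.most_common():      # most frequent first
        print(f"{value!r:>20}  {count:>10,}")
```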
Having a tool that facilitates both ETL and reporting functions allows us to work closely with the operational systems’ data and assess its quality. The next step in our process is to assess the reporting needs and how they measure up against the data samples we have taken.
For the first time in our method we are going to look at what we need to build: what the new system must do. Consider this carefully; all the steps we have performed thus far have been focused on the data that exists in the organization today. When undertaking reporting projects, so many organizations become distracted by what needs to be built, the structure of the new skyscraper. They spend years working on it, when the hardest part of the project is dealing with the data they already have.
So care must be taken as we begin to focus forward. To keep from losing focus, our analysis of the reporting needs of the new system is best informed by the reporting already being done in the organization; the existing reports help uncover the real reporting needs.