Chapter 28. Define Reference Data

In the year 2000, Jim assisted me with a prototype of SAFR to detect ticketing fraud for a large US air carrier. Again, the data we were to read was not in a very simple format, so Doug wrote a read exit that would read the IDMS database and present the records to the views. Jay made enhancements to SAFR so it would automatically detect the number of database partitions that existed and create that many parallel threads.

Jim and I worked to define views that, as we scanned the production database, detected when ticketing fraud had occurred. Unlike transaction system tests which audit each individual transaction, these tests could only be done after the ticketing process and travel was complete, and all the data from all the independent ticketing systems were brought together.

Airline Project Results

SAFR scanned the production database containing over a year's worth of airline tickets, 587 million records, in 2 hours and 46 minutes of elapsed time, using 6 1/3 hours of CPU time. It extracted 23.5 million records, comprising about 100,000 potentially fraudulent tickets. The data wouldn't tell us whether fraud had actually occurred; that required people looking at the data and making judgment calls. But the potential savings were more than enough to pay for the project itself.

Reference Data Keys

You’ll remember, from Reporting, that reporting is a process of classifying. We said that we assigned attributes to business events to describe them. Looking at the tables defined in REAL Analysis Method, you’ll also note we did not store the descriptions or titles of each of these values on every row. This not only saves space, because duplicated values aren’t stored on every row, but also allows the descriptions to change over time if needed. These descriptions of business units, cost centers, and accounts can generically be referred to as reference data.

One day during this airline project the project manager brought someone to my desk who was on another project attempting to use some data from the production database. He said they had created programs to deal with the documented values in a particular field. But as they were testing, things were not working right and they weren’t sure why. He asked if it would be possible for us to scan the database to tell them all the values in a particular field. I said that would be pretty easy.

I created a summary view sorted by the single field he was interested in, accumulating a constant of one in a column to count the number of occurrences. I ran it and printed the small file. His reaction was interesting. It was something like, “Well I knew about these fifteen values, but what are these other four values no one told us about?” We found a support person who had long experience with the database. She looked at them and said, “Oh, yeah, I remember now, there was a conversion done on the database four or five years ago and we used that value to mean …” She went on and could name a few others. She didn’t know what one was for.
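The summary view described above amounts to counting the occurrences of each distinct value in one field. A minimal sketch of the same idea, with a hypothetical `ticket_type` field standing in for the real one:

```python
from collections import Counter

def summarize_field(records, field):
    """Count the occurrences of each distinct value in one field,
    mimicking a summary view sorted by the field that accumulates
    a constant of one per record."""
    counts = Counter(r[field] for r in records)
    # Return sorted by value, like the sorted view output
    return sorted(counts.items())

# Hypothetical sample: documented values plus one undocumented surprise
records = [
    {"ticket_type": "A"}, {"ticket_type": "A"},
    {"ticket_type": "B"}, {"ticket_type": "Z"},  # no one told us about "Z"
]
print(summarize_field(records, "ticket_type"))
# [('A', 2), ('B', 1), ('Z', 1)]
```

Run against the full production file, a listing like this immediately surfaces values the documentation never mentioned.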


This is a common tale in IT. And there are a couple of lessons to be learned here. First, the moment documentation is finished it is out of date because something on the system has changed; documentation is no substitute for using the data. Second, understanding the keys, the identifiers in the data, is fundamental to developing reference data to describe those keys. Getting a complete list of the descriptions of business events as captured by the source system is critical to understanding the data.

Remember also, in REAL Analysis Method, we noted that the longest process is understanding the rows of data. Determining the event file is relatively short. Defining the fields on the event logical record takes a bit more time. Determining the values for each field takes even more time.

But the real value in reporting comes only as the rows of data are understood. There is no purpose to all of the work if the rows of data don’t provide answers to questions that inform specific actions. We can make this point again without overemphasizing it: Real value comes from understanding the rows of data. For example, with people, it isn’t the list of possible hair colors or potential heights that makes a difference. The actual people are what matters, their names, sizes and shapes and colors. Until we get to the rows of data we have nothing really to report.

Joins or Lookups

SAFR supports joining data together to combine the results on a view output. For example, if we built views against our sample event file for Legal Entity, Cost Center, and Account, the outputs would give a list of all values in those fields, shown in the left column of each of the three areas below.
Figure 53. Sample Reference Data Tables



Once armed with all of these possible key values, names or descriptions can be assigned to all of them by making simple editable files that describe them.


The following screens show how to build a join in SAFR. Let’s suppose this Event LR has been used in producing our reports thus far.


Event Logical Record
Figure 64. Event Logical Record
The LR below describes the file that contains the Account Titles, the file shown in the Sample Reference Data Tables figure.
Account Titles Reference Table LR
Figure 65. Account Titles Reference Table LR


Look up or Join Path Creation
Figure 66. Look up or Join Path Creation



The next step is to define the actual join. Joins are considered part of the metadata, set up by an administrator, someone who understands how to connect SAFR Scan Engine to the files that contain the data, and how to interpret the data in the files.


To build the join, we select the source event file and the target table we want to join to. We then select those fields from the source that will be used to populate the key to the target table, the account field. When the length of the fields provided from the source matches the length of the key in the target, the join is valid. Having defined the join, we can now use any of the fields on the join table in our view.


This join has only one step: we go directly from the source to the target. If needed, however, we could have repeatedly selected targets and sources to go through interim tables and find additional values; those are called multistep joins.
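The single-step and multistep cases can be sketched as simple lookups. This is an illustration only, not SAFR's implementation; the table and field names (`account_titles`, `cost_center_to_dept`, `dept_titles`) are hypothetical:

```python
# Single-step join: the event record's key leads directly to the target table.
account_titles = {"111": "Cash", "123": "Home"}

def join_one(event, ref_table, key_field):
    """Look up the reference row for one event record (many-to-one)."""
    return ref_table.get(event[key_field])

# Multistep join: pass through an interim table to reach the final target.
cost_center_to_dept = {"CC111": "D01"}   # interim table
dept_titles = {"D01": "Household"}       # final target

def join_multistep(event):
    dept = cost_center_to_dept.get(event["cost_center"])  # step 1
    return dept_titles.get(dept)                          # step 2

event = {"account": "123", "cost_center": "CC111"}
assert join_one(event, account_titles, "account") == "Home"
assert join_multistep(event) == "Household"
```

In both cases the fields supplied from the source must populate the full key of each target, just as the screen requires the source field lengths to match the target key length.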

New View Wizard Join Field Select
Figure 67. New View Wizard Join Field Selection



As we build our view, we can see all the fields on the account table in addition to those on our event LR. Note that this screen shows that joins have also been defined for cost center descriptions and legal entity titles. Also note that these reference tables aren’t simply for displaying results on the reports; the values in them can be used for filtering and calculation functions.1

Sample output from the view is shown below. The reports become more useful because the titles are more informative than the code values by themselves.


Figure 68. Sample Report with Reference Data Descriptions


A couple of important notes:

        • All physical reference files must be presented to the SAFR process in sorted order by the key specified on the LR if the file will be used in a join.


        • All reference file keys must be unique; duplicates are not allowed in any case.
        • SAFR has to be able to locate one specific reference file record as the result of SAFR join processing.


SAFR join processing thus requires a many-to-one join: many event file records can look up to one reference file record, or, in a multistep join, many lower level reference file records can look up to one higher level reference file record2. Approaches to performing union, full outer, and right outer joins are discussed in Consider Complex Joins.
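The two rules in the notes above, sorted by key and no duplicate keys, can be checked before a join is attempted. A minimal sketch with a hypothetical key field name:

```python
def validate_reference_file(rows, key):
    """Check the two reference-file rules: rows must be sorted by key,
    and keys must be unique, so that join processing can always locate
    exactly one reference record."""
    keys = [r[key] for r in rows]
    if keys != sorted(keys):
        raise ValueError("reference file is not sorted by key")
    if len(keys) != len(set(keys)):
        raise ValueError("reference file contains duplicate keys")
    return True

# A well-formed reference file passes:
validate_reference_file([{"acct": "111"}, {"acct": "123"}], "acct")
```

Failing either check up front is far cheaper than discovering ambiguous lookups in the middle of a production scan.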

Static Data

Reference data isn’t all created by finance users; some very important reference, or static, data comes from the operational systems. For example, it is unlikely finance will update customer names, like the K & N Jones Family on the report above. Static data is data residing in the operational, or close to operational, systems that describes the business events: customers, contracts, products, offices, etc. The process of gathering this data is no different than finding the event files in the legacy systems. It is critical to have it available in reporting processes to make sense of the business events, and it is sampled and refined just like reference data.

To do standard SAFR join processing, SAFR loads the reference data structures into memory. At times the volume of this static data may exceed the memory available on the machine. The Common Key Data Buffer feature, discussed in Model the Repository and Common Key Data Buffering, may be necessary. At this stage consider using a limited set of this data in producing the initial reports.


Thus far in these chapters, we’ve had the simple view that attributes of business events are assigned when business events are first captured in the operational systems. However, as noted in Rick’s subsystem lecture, in reality there can be layers and layers of classifications established before anyone attempts to understand the data. Legacy accounts in operational systems may be a couple of decades old, and not used for reporting or analysis any longer. Somewhere within the accumulated systems constructed over the years, those accounts are translated into other accounts that are now used for reporting. That can be true for any reporting attribute.

The translation process is not always one field to another field. At times, multiple fields are used to determine one new value in a new system, or a single field produces multiple attributes in the new system. For example, accounting systems from 30 to 40 years ago had only the three fields of legal entity, account, and perhaps cost center. Newer systems usually have numerous other fields, like product, customer type, and intercompany affiliate. So when the legacy account, which sometimes also meant the product for that customer type, is translated into the new ledger, it may result in three field values, account, product, and customer type.

The insurance company had a company code structure that had grown up over 30 or more years. When they decided to break down all the embedded meaning in it, like mail stop and cost center, legal entity, and HR department, they found it took over thirty tables, each with many fields, to describe the one ten character field they had used for so many years. This field is likely still used 10 years after starting the process of replacing it in a host of systems, and will likely be used in some way for years to come.

Mapping tables don’t describe the values in one system, as much as translate them to the values for another system. The keys to the tables are those from the source system, and the targets or answers are those for the new system. Mapping tables are simply reference data files to SAFR.

How To Begin

Making up reference data, categorization schemes from scratch if you will, takes a long time. A faster way is to leverage the thought already put into legacy reporting systems. New accounts are never added to the chart of accounts because someone hoped someone else would want them. Most are added as some piece of information is needed to address some specific action. So start with and modify the existing reference data. If two ledgers are to be combined into a single chart of accounts, then start by either identifying the one to keep, or building the new chart by referencing both.

In addition to using the legacy chart of accounts, reference data in other systems should be leveraged when working with lower levels of data. For example, reference data is used to classify arrangement level attributes in financial data warehouses for regulatory purposes. These types of classifications can be used when exploring the event files from lower level data in other systems.

Reference data can’t fix lack of detail. There is no way to translate from a summarized record and create a lower level detailed record for a receiving system. Summarization destroys visibility to the details. Translation processes are made easier when working with the details, as long as the details are preserved out of the translation process. A single value in a source system can be translated into a single value in the target system. Again, keeping the detail provides greater flexibility.

If we look at all the summarized results of financial data for all financial reporting systems, in other words if the set of reference data for all finance systems is combined into one set, we will find the lowest common denominator is pretty close to the customer contract level. Maintaining the detail will provide economies of scale in the long run.

There is really no shortcut for this work of creating reference data. It will take longer than anyone wants it to, and thus it must be started early, perhaps even before finding the event files. Finding and balancing the event files is a way IT can help. Perhaps every project I have been on has missed nearly every deadline for when the reference data would be known. In some cases it was missed by nine months, or 50% of the time. In others it was missed by 2 years, or 90% of the estimate. That’s because it is an iterative process of adding values to reference descriptions or mapping tables, and looking at the outputs to see if they properly characterize the results of the business events and inform potential actions.

The more iterations of viewing the results, the better the results will be. The system can’t be built until the reference data is understood, but the reference data can’t be built without looking at the results of the system. Thus, again, a tool that supports a prototyping approach to the problem, including going clear to producing usable outputs, is nearly mandatory.

Next:  Chapter 29. Iteratively View Results

Previous:  Chapter 27. Find More Detailed Events

Parent Topic:  Part 4. The Projects

Table of Contents



1 Note that in a previous chapter we spoke of creating a view which selected only records for one particular account using the following selection criteria:

If  ( {Legal_Entity} = "522339999"
	And {Cost_Centre} = "CC111"
	And {Account} = "123" )
	Then Select

With these reference file tables, the selection criteria could have been written as

If  ( {Legal_Entity_Desc.Event_Journal_To_Legal_Entity_Title_Join.Legal_Entity_Title} = "K & N Jones Family"
	And {CC_Desc.Event_Journal_To_CC_Desc_Join.Cost_Center_Title} = "Mum"
	And {Account_Titles.Event_Journal_to_Account_Titles_Join.Account_Titles} = "Home" )
	Then Select

The values tested for are more readable, but the logic text is more verbose. This selection also causes lookups to be performed on each record, which requires additional CPU instructions over simply testing values in the event file, so it is less efficient. But there are cases when it is useful to join to perform selection and calculation logic. For example, testing to see if the reference data is complete can be done by using the keywords FOUND or NOT-FOUND. Records that either have or do not have corresponding reference data can be selected for reporting or more analysis.
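The NOT-FOUND style of completeness test can be illustrated outside SAFR as well. A sketch with hypothetical table and field names, not SAFR syntax:

```python
account_titles = {"123": "Home", "456": "Auto"}

def not_found(events, ref_table, key_field):
    """Select event records whose key has no corresponding reference
    row -- analogous to selecting on NOT-FOUND to test whether the
    reference data is complete."""
    return [e for e in events if e[key_field] not in ref_table]

events = [{"account": "123"}, {"account": "999"}]
print(not_found(events, account_titles, "account"))
# [{'account': '999'}]
```

The records this returns are exactly the ones that would print blank titles on the report: the list of reference values still to be defined.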