In June of 1996, a few years after my training, I was asked to staff a Price Waterhouse (PW) booth at a data warehousing conference. There I think I met Mike Schroeck, a PW partner, for the first time. Mike was head of PW’s Data Warehousing (DW) practice, a mentor, and a respected expert on reporting. Over a number of years, Rick and I worked closely with Mike as the firm attacked reporting problems. Let’s set aside the operational aspects of the system and, assuming all the data needed to produce the reports has been created, focus on how data warehousing architecture attacks the problems of report production.
ETL, Databases, and Report Package
Perhaps the simplest architecture layout of a data warehouse would be this:
The Extract, Transform, and Load (ETL) process accesses either the detailed data or the master file in the operational system and loads it into the data warehouse database. Users interact with the reporting tool to request a report; the reporting tool accesses the data warehouse to retrieve the data and create the report.
The ETL tool tends to be more closely associated with the business event capture and operational system posting process. In other words, it most often is a batch process that occurs at a scheduled time after the business events have been captured and recorded in the operational system. The reporting package tends to be more closely associated with user report demands; its work is performed as users request reports. The database works in both worlds, accepting, loading, and reorganizing data as the ETL tool presents it, and responding to the reporting tool’s requests for reports.
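To make the division of labor concrete, here is a minimal, hypothetical sketch in Python, with SQLite standing in for the warehouse database; the table and field names are illustrative, not drawn from any particular system.

```python
import sqlite3

# Hypothetical business events captured by the operational system.
operational_detail = [
    {"date": "1996-06-01", "account": "4000", "amount": 125.00},
    {"date": "1996-06-01", "account": "5000", "amount": -40.00},
    {"date": "1996-06-02", "account": "4000", "amount": 210.00},
]

warehouse = sqlite3.connect(":memory:")
warehouse.execute(
    "CREATE TABLE fact_transactions (txn_date TEXT, account TEXT, amount REAL)")

# ETL role: a scheduled batch step that extracts the detail, applies any
# transformations, and loads it into the warehouse database.
warehouse.executemany(
    "INSERT INTO fact_transactions VALUES (:date, :account, :amount)",
    operational_detail)
warehouse.commit()

# Reporting-tool role: runs only when a user asks for a report,
# querying the warehouse for the answer.
report = warehouse.execute(
    "SELECT account, SUM(amount) FROM fact_transactions "
    "GROUP BY account ORDER BY account").fetchall()
print(report)  # [('4000', 335.0), ('5000', -40.0)]
```

The database sits in both worlds here: it accepts the batch load from the ETL step and answers the report query on demand.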
We’ll discuss these components more in later chapters.
Let’s get back to the question of where in this architecture to perform summarization.
In the DW model, summarization can occur in any number of places. For example, if a master file is used in the source system, then the posting process in the operational system is still doing the summarization. The ETL process may summarize the master file data further, or, working from the detailed transaction files, may load the data warehouse with summaries, detailed data, or both. All modern databases have the ability to summarize data, either as requested by the reporting package (a SUM function in SQL, for example) or internally through other means.1
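Two of these places can be sketched side by side: the database summarizing on request via SQL’s SUM, and a summary table kept current by a posting-style process, which is what a materialized view automates internally. This is a hypothetical illustration in Python with SQLite; the tables and accounts are invented.

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE detail (account TEXT, amount REAL)")
db.execute("CREATE TABLE summary (account TEXT PRIMARY KEY, balance REAL)")

def post(account, amount):
    """Record the business event AND maintain the summary, as a posting
    process in the operational system or an ETL step might."""
    db.execute("INSERT INTO detail VALUES (?, ?)", (account, amount))
    cur = db.execute(
        "UPDATE summary SET balance = balance + ? WHERE account = ?",
        (amount, account))
    if cur.rowcount == 0:  # first event for this account
        db.execute("INSERT INTO summary VALUES (?, ?)", (account, amount))

for acct, amt in [("4000", 100.0), ("4000", 50.0), ("5000", 25.0)]:
    post(acct, amt)

# Option 1: summarize on request -- the database does the work at report time.
on_request = db.execute(
    "SELECT account, SUM(amount) FROM detail "
    "GROUP BY account ORDER BY account").fetchall()

# Option 2: read the pre-maintained summary -- one ready record per answer.
pre_built = db.execute(
    "SELECT account, balance FROM summary ORDER BY account").fetchall()

assert on_request == pre_built  # both routes yield the same answers
```

The answers are identical; the difference is only when the summarization work is paid for, which is exactly the trade-off the rest of this section examines.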
Each of these approaches, though, really relies on the principles of the subsystem architecture to facilitate reporting processes. Each step of the process summarizes data into specific database target tables, ultimately attempting to provide what may be single records, in an efficient manner, for the end user’s reporting needs.2
Perhaps a useful way of thinking about this problem is to liken it to the traditional approach to inventory management in the PC business compared to Dell’s just-in-time manufacturing process, which was developing in the 1990s.
The traditional manufacturing process was to guess what type and how many items the buyers might demand, produce that many items, store them until needed, and then ship them to the user when ordered. What Dell realized was that assembled PCs are a lot like groceries; the longer they sit on the shelf, the less valuable they become. Instead of making machines and then hoping to sell them, Dell changed the manufacturing process to assemble the machines quickly, in some cases in just four hours after a customer orders the machine. To do this, Dell worked to organize the factory floor to keep all the materials arranged in the most efficient manner for manufacturing.3
Instead of a PC, let’s think of the answer a user wants as the thing the system needs to produce. The traditional reporting architecture demanded that answers be defined up front, and thus that a master file or database table be created to answer each question. These master files, whether in the operational system or later in the data warehouse, are updated so answers are always ready when the user asks the question.
Dell realized that giving customers tremendous choice in their PC configurations would drive greater sales. In the world of reporting, consider what proliferates faster, the number of source systems or the number of reports? The answer is always the number of reports. The business events are relatively stable, and thus the source systems are too. But the number, types, and permutations of the attributes gathered in each of those systems, as used in reporting processes, are almost limitless. The demand for reports will always exceed the capacity to produce them if we continue to demand that all questions be anticipated.
Another problem is that the demand for non-financial summaries gets short shrift. The traditional accounting answers tend to take precedence over other views of the data; this is like setting up the assembly line to favor word-processing configurations, precluding gaming, multimedia, laptop, and other possible needs. The accounting answer manufacturing process destroys the data it should be providing.
Supply and Demand
And it isn’t simply demand for reports exceeding capacity: ultimately, using summarization as a means to overcome volume overwhelms the system with the number of summaries that must be maintained. As that number grows, the overall workload of the system experiences diseconomies of scale, as shown here:
This shows that each summary structure updated to answer a specific question reduces the response time of the entire system to provide that answer, because the single record to provide the answer is readily available from the database to be displayed to the user. This reduction in user response time continues as more and more summaries are created to answer more and more questions.
Correspondingly, the system has to perform a posting process for each transaction received against each of these summary structures. As volumes increase, this is normally done as part of transaction processing, not at report time. At some point, the reduction in response time for a user is less than the additional update time for posting. Not all answers in the summaries are used; people don’t need answers to some questions. However, once the summaries are established, the update cycle is constant for every transaction record received.
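Made-up numbers can illustrate the trade-off. Suppose every transaction must post to every summary structure, but users actually consult only a handful of the answers. A small sketch of the daily arithmetic, with all figures hypothetical and chosen only to show the shape of the curve:

```python
# All figures are hypothetical, chosen only to illustrate the diseconomy.
transactions_per_day = 100_000
posting_cost_ms = 1          # extra update time per transaction per summary
queries_per_day = 500
query_savings_ms = 1_800     # response time saved when a ready summary exists
questions_asked = 5          # distinct questions users actually ask

def daily_net_ms(num_summaries):
    # Cost grows with every summary: each transaction posts to all of them,
    # whether or not anyone ever reads the answer.
    cost = transactions_per_day * posting_cost_ms * num_summaries
    # Benefit is capped: only summaries matching real questions help.
    useful = min(num_summaries, questions_asked)
    benefit = queries_per_day * query_savings_ms * useful / questions_asked
    return benefit - cost

print([(n, daily_net_ms(n)) for n in (1, 5, 20)])
```

The net benefit rises while summaries match real questions, then turns sharply negative as posting cost keeps growing linearly against a capped benefit: the diseconomy of scale the figure above describes.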
Thus there is a computer cost to maintaining the reporting inventory, and at a certain scale the computer cannot support all possible predefined answers. And because the summarization process, whether in the operational system, the ETL layer, or the database, is tied more closely to the end-of-day business event processing, the problem shows up as elongated batch processing times.
Working within the constraints of the traditional report inventory system, the IT community supplies only a limited set of summaries, answering a limited set of questions, that can be updated within the batch processing window while still providing end users with adequate response time. In other words, users can choose only from the configurations the manufacturing line was set up to produce when the line was established.
The alternative is to organize the manufacturing of answers such that the answer needed is produced as close to the time the question is asked as possible. Certainly people ask questions that need to be answered by reports much more often than they buy PCs; thus four hours per report would not be acceptable for most reporting problems. But if it were possible to produce any report in four seconds, rather than the usual two-second on-line response time against a limited report inventory, people might be willing to buy that solution.
Manufacturing PCs to custom specifications for each individual user requires organizing the buyer’s configurable components in the proper order so they can be assembled rapidly. What would a just-in-time manufacturing line for reports look like? The answer from McCarthy is quite clear: all reports are made from detailed business events. The record of business events is the stuff of which reports are made.
Dell used suppliers for many of its components; it did not try to start with silicon and make the actual CPU processor on its assembly line. That would have taken too long to do. In the same way, as we’ll see later, it would take too long to accumulate raw detailed business events for the prior two years into a common format and then summarize them to produce a financial statement in any time acceptable to a user.
But every time we build a component part, we build inventory we must manage. If, at every choice we come to, we compare the cost of maintaining the inventory to the cost of engineering the fabrication process, we can continually drive down the cost of inventory and increase the flexibility of the reporting environment.
There is another reason why the detail is needed in reporting processes. Dell would like the PCs’ designs to be consistent to drive down maintenance costs when the machines need repair. Similarly, report answers have relationships to each other. Reconciliation is the maintenance process of proving the relationship between answers is correct. Reconciliation, a common finance function, requires finding a common set of attributes between two summaries and showing that the accumulated amounts in both summaries are the same. Any difference is an indication that one summary or the other, and its corresponding answers, is wrong.
This process of reconciliation, the process of finding a common set of attributes between different summaries, is facilitated by having ready access to the detail transactions—the business events—that were used to make the summaries. The shorter the distance between the business events and the answers, the easier the reconciliation, and the easier the subsequent correction if problems are identified. Reconciliation in finance is so pervasive that, in some cases, eliminating its cost in the finance process alone can pay for the construction of the new system.
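That reconciliation step can be sketched in a few lines of Python. Two summaries built from the same (hypothetical) business events are each re-accumulated from the detail on their common attribute; any amount that disagrees points to the summary that is wrong.

```python
from collections import defaultdict

# Hypothetical detail business events: (account, region, amount).
detail = [
    ("4000", "east", 100.0),
    ("4000", "west", 50.0),
    ("5000", "east", 25.0),
]

# Two summaries built independently; the second was mis-posted.
summary_by_account = {"4000": 150.0, "5000": 25.0}
summary_by_region = {"east": 125.0, "west": 60.0}  # west should be 50.0

def reconcile(summary, key_index):
    """Re-accumulate the detail on the summary's common attribute and
    return any keys whose totals disagree, as (summary, detail) pairs."""
    totals = defaultdict(float)
    for row in detail:
        totals[row[key_index]] += row[2]
    return {k: (summary.get(k), totals[k])
            for k in totals if summary.get(k) != totals[k]}

print(reconcile(summary_by_account, 0))  # {} -- the account view reconciles
print(reconcile(summary_by_region, 1))   # {'west': (60.0, 50.0)}
```

With the detail at hand the discrepancy is located immediately; without it, one can only observe that the two summaries disagree, not which is wrong.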
Thus, if we are to attempt to streamline the answer production process, it will require nearly manic focus on a single goal: efficiency of report production. The assembly line must be as efficient as possible in delivering the answers.
1 Materialized views would be an example where the database creates and stores summaries.
2 “Once the data ages [note passage of time], it passes from current detail to older detail. As the data is summarized, it passes from current detail to lightly summarized data, then from lightly summarized data to highly summarized data.” Inmon, p. 37.
3 “Dell is able to achieve a four-hour production cycle time using an Internet-based supply-chain management system, Figueroa says. After getting an order, Dell notifies its suppliers about what components are needed, and they’re delivered within an hour and a half.
“With our pull-to-order system, we’ve been able to eliminate warehouses in our factories and have improved factory output by double by adding production lines where warehouses used to be,” says Figueroa….
“The just-in-time method demands a very disciplined assembly-line process,” says David Dobrin, an analyst at Surgency Inc. in Cambridge, Mass. Marc L. Songini, “Just-in-Time Manufacturing,” Computerworld, November 20, 2000. Accessed October 2018.