The spreadsheet template is discussed in detail in Balancing Act Chapter 30. Assess Reporting Needs, Chapter 31. Estimate the Data Basis and Chapter 32. Define Summary Structures. This page contains a link to download a spreadsheet template (with disabled macros due to security), a completed sample, and instructions for completion. Using them will allow one to perform the steps outlined in the process flow below.
Additionally, video and blog segments also walk through the steps in this process.
Process Flow
Download Spreadsheets
Download the spreadsheet template, and the completed example here:
Estimation Process Instructions
The instructions have not been significantly modified from the original issued with the spreadsheet. The macros once embedded in the spreadsheet have been disabled. Buttons do not perform any functions at this point.
These are the instructions for completing the Data Warehouse Size Estimation Microsoft Excel Workbook. The workbook assists in estimating the size of the data warehouse. Based on these sizes, determinations can be made about the best hardware and software solutions. This workbook can be used in any phase of the warehouse project to guide designers in selection or trouble-shooting of the hardware and software solution but is most beneficial when used in the initial analysis phase. These instructions are divided into four sections:
- Objective and Overview
- Step-by-Step Instructions Part I (Detail Files)
- Step-by-Step Instructions Part II (Summary Files)
- Interpreting Results
Tip: Read all instructions before filling out the workbook. This will walk you through the completion of each sheet and address detailed questions on specific fields. After review, use the Step-by-Step sections in conjunction with the help buttons in the workbook to complete each worksheet. Note that throughout the instructions, terms are cross-referenced to those used in Data Warehousing and SAP where differences exist. These terms are designated by DW or SAP and are provided in parenthesis.
Objective and Overview
As explained above, the primary purpose of this workbook is to assist in sizing the data warehouse (providing number of bytes per report, files used by each report and performance requirements for hardware and software). The additional benefit is the amount of information that will be available for each report and an ongoing tool to track expectations for transaction processing (number of records, field extraction, etc.) and additional report creation.
After gathering the required data and completing the workbook you will know the following:
- Estimated number and types of reports
- Data sources for each report
- Records read to produce reports
- Records extracted to produce reports
- Table joins and other processing requirements for reports
- The workbook has the following worksheets:
- Detail Temp – Detail Source File Data Template (DW – fact table)
- Sum Temp – Summary File Data Template (DW – aggregate fact table)
- Reference File Template (DW – dimension table)
- Freq. – Reporting Frequency Listing
- Report – Listing of Reports to be Produced
- Sum Norm – Summary Adjusted to a Common Frequency(i.e., records read and extracted annually)
- Sum Not Norm – Summary Not Adjusted to a Common Frequency
Note: All worksheets will be required except Sum Temp. Determination of sizing after completion of Part I may render the creation and use of summary files unnecessary. Make sure to review the detail file findings before pursuing summary file structures and compare the findings against the processing performance standards of your hardware and software environment.
General Worksheet Conventions
Each worksheet is designed for ease of use. Remember, sizing a data warehouse is a complex process and will require knowledge of report and file specifics and some estimations based on the business operation and data usage. The workbook is constructed to ask as few questions as possible and use the input in calculations and sizing methods – the spreadsheet does all calculations and roll-ups.
Tip: Go to the Required Input Items Areas in the Step-by-Step Instructions section for the information necessary to complete the workbook.
Some instances will require customization for number and types of files but this can be achieved by copying worksheets to add detail and summary file types, copying fields and duplicating columns. By following the instructions provided below and on the worksheets, this should be easy to accomplish.
On each sheet, only the items in cells highlighted in Yellow must be changed or completed. You can change cells highlighted in Green if necessary. (For those not using color monitors, entry cells in Yellow are displayed in Arial Bold Italic font. Entry cells in Green are in Times Roman Italic font.) These are default formulas and file titles. Be very careful about deleting rows or columns from the workbook. Also, do not sort the Report worksheet. All references by formula to this workbook would be incorrect unless also sorted. Follow all instruction buttons on worksheets to duplicate columns, add additional detail and summary files or delete records. You will not duplicate calculations correctly if you do not follow those instructions. Help buttons are shown in normal font on each worksheet, macros for ‘Copy Column’, ‘Copy Sheet’ and ‘Update Reference Items’ are shown in Bold Italic within their individual macro buttons.
Steps for Completion
Perform the seven steps outlined below to complete the workbook. Detailed instructions for each step appear in the next section.
Complete the Detail Temp Worksheet Section 1
Complete the first half of the Detail Source File Template. This section estimates the size of the detail source file. (DW – Fact Table)
Complete the Freq. Worksheet
Determine when and how often reports will be produced. Complete the Frequency Worksheet. The “Base Frequency” may be changed later in the analysis.
Complete the Report Worksheet
Determine how many and what types of reports will be produced. Enter either a sampling of typical reports or classes of reports or queries.
Complete Detail Temp Worksheet Section II
Enter what reports will use which files to determine the number of records read to produce the reports.
Complete the Reference File Worksheet
Define a reference file worksheet for each reference file (DW – dimension table). The various reports and frequencies are duplicated from the reports page. Estimate the number of foreign/primary key joins each report requires from this table. Also called look-ups.
Analyze Summary Worksheets
Recalculate the tables to determine the amount of processing required to produce the reports. If the processing requirements can be met within the project hardware and software constraints, no additional work is necessary. If not, continue with the following steps.
Complete the Summary Temp Worksheet
Determine what summary files to create to reduce the reporting requirements. Update the Detail Temp Worksheet for reports now using this file.
Analyze the Summary Worksheets
Recalculate the summary tables again to determine if processing can now be performed within the software and hardware constraints.
Step-by-Step Instructions – Part I
STEP 1 - Complete the Detail Temp Worksheet Area 1
Using the top portion of the Detail Template worksheet, estimate the number and size of the detail records in the warehouse. Enter only the estimated daily transactions and length (in bytes) of each record. Other volumes are calculated by using the Frequency to be entered in the next step. Retrieve daily volume estimates from current transaction posting systems or other warehouse design deliverables. If only an estimation is available, use known statistics about the company to estimate the number of detail records. For example, current invoice line items may be an appropriate estimate for a sales system. Duplicate this worksheet for all detail files in the warehouse.
Required input items:
- File Name
- Number of records estimated per day
- Bytes per record
STEP 2 - Complete the Frequency Worksheet
Enter each frequency for which reports are produced. Determine the base frequency that will allow you to ‘normalize’ the data. For example, to see all data reported by year (megabytes and records per year), “year” will be the base frequency and you should place a ‘1′ in the Normalizing Occurrence on that line. You would then report 12 and 52 in the Normalizing Occurrence for Period and Week respectively (e.g., 12 months in a year and 48 weeks in a year). This information is used to create a summary that is normalized to a yearly reporting period. You may change the frequency at any time during the completion of the workbook but it will change the numbers reported in the summary normalized worksheet.
Required input items:
- Frequency
- Base Frequency Indicator (X)
- Normalizing Occurrence for each Frequency
STEP 3 - Complete the Reports Worksheet
Using the Report worksheet, project the reports or queries to be executed. Enter each report to be produced with the required information. When all worksheets are complete, this sheet will provide a list of each report, the megabytes estimated to read and extract for each and the number of table joins. Information is joined with the lower section of each detail and summary file worksheet to determine the number of records read and extracted based on extract percent and report file usage retrieved from the other worksheets. Remember to only complete the cells in yellow. This worksheet contains many calculations and determined values that are necessary for completion of the workbook.
Note: This workbook comes ready to accommodate one detail file, one reference file and one summary file. If the warehouse includes more files, use the Copy Worksheet button to duplicate the worksheet. This macro will ask for the new file name and will duplicate the “records read” column on the Report worksheet. To perform this task manually, update this worksheet to reference any additional detail file worksheets created. To do this, select the detail records read column. This column references the detail file worksheet. Copy the column. With the column still selected, replace (edit replace) the name of the first worksheet (“Detail Temp” on the template) with the name of the new detail worksheet.
If the system is a data warehouse with ad hoc queries, enter a typical query as one report. Where several sub-reports must be used together to produce the results of one report, use the “Report Count” column and frequency to adjust the report to the anticipated volume. For example, if a daily Sales Report by Region and Brand will actually by comprised of 3 reports or views, enter ‘3′ in the Number of Reports’ column. This will commonly be done where a report needs to be sorted and presented in various ways but all reports select the same data and are essentially the same report.
The estimated extract percent is used to determine the number of records extracted from the file(s). Knowledge of the business process and reporting requirements will be necessary to provide an approximate extract percent. For example, if a report of customers from Nebraska is produced from a file containing all customers in the United States, the report might extract on 1/50 of the total records in the file. Further calculations use this field to compute total bytes and records extracted based on the Total Records Read (total size of the input file times the report count) multiplied by the Extract Percentage.
Required input items:
- Frequency Produced
- Report Count (see above)
- Report Name
- Report Description
- Number of Table Joins (see above)
- Width of the Report (# bytes)
- Estimated Extract Percent (see above)
STEP 4 - Complete the Detail Temp Worksheet
After completing the Reports Worksheet, each report and its frequency will be displayed in the lower section of the Detail Temp Worksheet. Make sure to add the correct frequencies across the top columns for the reports. You should highlight the default frequency column, press the ‘Copy Column’ macro button, and change the frequency and time multiplier for each new frequency. These are not automatically populated from the frequency worksheet due to differences in each reporting environment and different file frequencies. Add all frequencies for the list of reports.
Go down the list of each report and specify how many files of each frequency that report will use. For example, a daily report might have columns for the last 3 business days and a month-to-date column. This report would use a maximum of 4 daily files and 4 weekly files. Record the maximums here to reach the most accurate sizing estimations. In this example, record 4 under daily and 4 under weekly.
Required input items:
- # of Files Read by Each Report
STEP 5 - Complete Reference File Worksheet
Create one Reference File Worksheet for each known reference file (DW – dimension table). Enter the reference file name and abbreviation, each key field (primary key) name and number of possible occurrences. This will approximate the number of records expected in the completed file. If the exact number of records is known, override the estimated records field but remember that you will need to re-create the calculation if required later. Enter the bytes per record to attain the total size of the file. The lower section of this worksheet allows the entry of number of table joins per report. Each report entered on the Reports Worksheet will be duplicated here and listed by frequency. The number of table joins should reflect the number of times the report will require a foreign to primary key join (look-up) for use as constraints (selection parameters) or descriptive titles. After this is completed, the results will be used on the Reports Worksheet and subsequent summary sheets for table join information. The number of table joins must be considered in sizing or tuning a system.
Required input items:
- Reference File Name
- File Name Abbreviation
- Key Fields
- Number of Occurrences of Key Fields
- Number of Table Joins to This File
STEP 6 - Analyze Summary Worksheets
Go to the Summary Normalized Worksheet and press the ‘Update Table‘ button to update the Pivot tables on the Summary Worksheets. Pivot tables are a Microsoft Excel tool. The menu commands are found under the Data, Pivot Table menu. Use the Microsoft Excel help file to manipulate the tables to determine other statistics. The summaries indicate the number of detail or summary records being processed to produce the results.
These numbers are normalized based on entries from the Frequency Worksheet. If month or period is the base frequency, totals at the bottom of the Sum Normalized worksheet will be the totals for all report frequencies per month. Make sure that the base frequency and normalization occurrence is correct and that all entries were made based on that input. This summary sheet reports the processing requirements by all the frequencies of reports and computes the totals. Compare the required number of table joins, records read and the number of records and megabytes extracted with the standards for your hardware and software platforms. Refer to the Interpretation Section below for more information on this topic.
Step-by-Step Instructions – Part II
STEP 7 - Complete the Summary Temp Worksheet
– OVERVIEW –
If the report volume requires summary files, do the following:
- Make a backup – Copy the current workbook for future reference.
- Complete summary file template using the steps below. Duplicate and complete the Summary File Template worksheet for each summary file to be maintained. See details below.
- Add a new report to the Report worksheet to produce the summary file.
- Update detail file template to include creating this new Summary File and deletion of any reports now using the summary file.
- Recalculate Summary worksheets and compare to the first workbook to determine the marginal benefit of the summary file.
?Help? To Create a new Summary File Sheet:
- Press ‘Copy Sheet’ macro button
- Enter the desired new name of this summary file sheet
- Enter the name of the detail sheet that is being summarized
- Copy Columns for new frequencies (using ‘Copy Column’ macro)
- Press ‘Update References’ macro button and provide reference cells for # records, frequency title and bytes.
- Complete the upper and lower yellow highlighted areas.
STEP 7.1. DETERMINE CRITERIA FIELDS (Summarizing Field Name)
Summary files are created by summarizing criteria fields and accumulating the value fields. Consider the following example:
- Criteria Field 1
- Customer Criteria Field 2
- Product Value Field 1
- Sales Price
- Company 1 Product X 1.00
- Company 2 Product X 2.00
- Company 1 Product Y 2.00
- Company 1 1.00
- Company 2 Product Z 3.00
If the above table were summarized by Customer, the following summary file would result:
- Criteria Field 1
- Customer Criteria Field 2
- Product Value Field 1
- Sales Price
- Company 1 x 4.00
- Company 2×5.00
Because the table was summarized at a level greater than Product, the product field is no long meaningful. It could be removed from the file, because we no longer know how much was paid for each product. This file would not be useful to produce a product report.
In determining how to create summary files, first determine which fields to summarize. Any criteria fields used in the report must be included in the summarization. If multiple reports use the same criteria fields, a summary file collapsing to these criteria fields may reduce the amount of data processed.
STEP 7.2. DETERMINE LOCALITY
“Locality means that a large number of transactions creates a much smaller set of combinations when summarized by a set of cross-structural dimensions.” (Roth, “SAP Reporting/Data Warehousing (Draft)”, 1996). From the example above, assume that the company has two customers and three products. The summary file could theoretically contain six records (2 *3).
However, assume that Company 2 for some reason will never purchase Product Y, and Company 1 will never purchase Product Z. The number of summary records now is four (2 * 3 – 2). Locality can occur because of the definition of data elements. For example, a hierarchical structure will result in very high locality. If there are 25 sales districts contained within 3 sales regions, the theoretical limit is 75 (3 * 25). However, if all sales districts are unique across regions, the limit is 25.
Enter the number of unique occurrences adjusted for locality in the Number of Occurrences cells.
STEP 7.3. DETERMINE TIME IMPACT
The greater the number of records due to passage of time, the greater the potential collapse when summarizing. For example, assume that within a single day a customer will probably not buy the same product twice. Thus summarizing the day ‘stransactions to the customer product level will result in very little collapse. However, assume that over the course of a year the customer purchases the same product at least once a month. The estimated collapse will then be 12 times higher for a yearly file.
Adjust the Number of Occurrences for the effect of time.
Required input items:
- File Name
- File Abbreviation
- Detail File Worksheet Name
- Summarizing Field Name (What file is sorted by)
- Number of Occurrences/Locality
- # of Files Read by Each Report
STEP 8 - Analyze the Summary Worksheet (Perform a test)
If possible, perform a test summarization to see the estimated collapse using different combinations of criteria fields. If the percentage is significantly different from the percentage derived from the above calculations, override the Estimated Collapse with the test results. Go back to Step 5 and re-calculate the summary tables again to determine the incremental difference and if processing can now be performed on the hardware and software solution. Refer to the Interpret Results section for more information on best solutions and processing performance guidelines.
Interpreting Results
The summary pages accumulate statistics from the Report worksheet by Frequency. Any field on the Report worksheet may be added or deleted from these tables. Using these numbers, team members should be able to determine what resources will be necessary to produce the reports. Because of the ever-changing nature of hardware and software products, benchmark guidelines should be updated continuously. Please refer to the white-paper titled “SAP Reporting/Data Warehousing” (Roth, 1996) for more information regarding analytical reporting issues, data bandwidth problems and using an Operational Data Store for high-volume reporting.
Tip: For processing performance benchmarks, please contact your hardware and/or software representative for current throughput performance. Before making benchmark decisions using these numbers, contact your appropriate partner to confirm accuracy and timeliness of the information.