The first interaction I remember with Doug was soon after meeting Jay when I started as a system tester. I found a problem in one of his programs. Renae Bell, the other system tester with whom I shared an office in Sacramento, told me to call him. I underwent the same initiation many others did through the years when I wondered if it was right to call and tell a partner—in the very real sense an owner of the firm—I had found a bug and he needed to fix it.
Doug was perhaps the most unassuming partner anyone had ever met. I remember him taking assignments to fix problems from people brand new to a project or even the company. His response was even more unexpected for some because of the stories about him they heard before having to call him.
One of the great stories was about Doug finding a bug in the COBOL compiler while working in Alaska. I have heard Rick repeat the story a number of times. It seems the system was in construction, but a new version of the compiler was needed to overcome a technical limit of the old compiler. When the new compiler was installed, a new problem showed up. Rick remembers going to the office on the weekend and seeing Doug surrounded by stacks and stacks of printouts from a computer dump, a listing of all the ones and zeros in computer memory at some particular point in time. Doug showed him that he had found a series of them that were causing the problem, traced them to the place in the compiler that had put them there, modified the IBM compiler using a utility called Zap that allows inserting a different sequence of ones and zeros into any file, including the IBM COBOL compiler, and sent a printout of the code in error and his correction to IBM.
In the early days of SAFR most of the defects showed up as system abends, short for abnormal end. If you have experience with the blue screen of death on a Windows platform, you have seen an abend. To many programmers an abend is an extremely bad thing, something that means things went catastrophically wrong.
The truth of the matter is abends in SAFR processes on mainframes rarely affect anything adversely. The blue screen of death is an abend in the operating system, and can mean one has lost data and rebooting can be annoying. That is very different than an abend in a single application, like Excel. I have not seen an abend in the operating system on the mainframe. And because the vast majority of SAFR processes do not perform any updates, the effect of an abend is that Scan Engine simply has to be restarted.
When dealing in the world of high performance computing, the cost of the niceties of controlling all the potential problems becomes simply too expensive. Allowing the operating system to trap certain types of errors is a very efficient approach to the problem. CPU cycles are not spent trying to prevent errors when in fact if an error occurs the process has to be restarted anyway.
But first, let’s review our analogy between computers and a business meeting. Each execution of a program is like a meeting in a conference room. Memory is like the whiteboard, and disk storage is like a meeting minute binder. The processors or CPUs are like the people in the room. Now, we’ll need a bit more detail, so let’s enhance this analogy a bit to explain registers and the operating system.
The eyes of the people in the meeting, our processors, need to be looking at the data on the white board before doing something with that data. The “eyes” of the computer are called registers. Just like people have two eyes, computers have multiple registers. In fact, just as some people in the room are better at math than others, some registers have special capabilities for particular types of functions. But in all cases the registers must be “looking” at the data to be acted upon by the computer in order for the computer to work.
There is typically a person responsible for the meeting, the chair. This person follows an agenda or, perhaps better, a very detailed set of procedures for conducting the meeting: the computer program. In our example here the procedures are the computer program GVBMR95. The procedures (computer programs) are kept in a separate binder from the minutes of the meeting. Similar to data from the “meeting minutes” binder, the procedures must also be transferred onto the white board in order to be read by the chair of the meeting. The “eyes” the processor uses to read this program are a special register called the Program Status Word, or PSW.
Now let’s see what happens when something goes wrong in the meeting.
Over the course of those few years I got so I could follow the pattern in Doug’s analysis of a dump. To give some flavor of what’s involved, to demystify the whole thing, let’s analyze a dump. A fairly common problem when working with raw data is a 0C7 (often written S0C7), a data exception. This occurs when SAFR (or any mainframe program) is asked to perform an arithmetic operation against non-numeric data. The first indication something has gone wrong is the following message printed in the log file, the top portion of the output containing primary messages from the operating system:
661 SYSTEM COMPLETION CODE=0C7 REASON CODE=00000000
The completion code 0C7 is a data exception, as I have stated. Other common codes include 0C4, which means the program tried to use memory that was out of bounds; in other words, it tried to read or write data on a white board used in another meeting. Each meeting is considered confidential. A 0C1 means that we asked the computer to perform an invalid instruction.
661 PSW AT TIME OF ERROR 078D2000 920340D6 ILC 6 INTC 07
661 NO ACTIVE MODULE FOUND
661 NAME=UNKNOWN
The next row says where the PSW was looking, the instruction that should have been performed. Then two rows tell what program—which detailed agenda—was at that location. The data here is a bit peculiar to the SAFR extract program, GVBMR95. “No Active Module Found” means that the operating system doesn’t recognize what program was at this location. That’s because the operating system didn’t load a program to that location; it didn’t put an agenda there. Rather, the only program/agenda it loaded was GVBMR95. To understand this, we need to discuss code generation and parallel processing.
Our SAFR meeting room actually has more than one meeting happening inside of it. The meeting starts as one meeting, but the first part of the agenda for the meeting is to create other agendas. After those agendas are created, the participants (CPUs) in the meeting are assigned to work on specific agendas for specific periods of time.
Doug had to rule out the possibility that something else had gone wrong and the 0C7 was merely a symptom, not the cause, of the problem. For example, a 0C1 can look similar to the 0C7 problem, but can happen in the main meeting, not the sub-meeting. To distinguish between these two types of problems, Doug would ask me to read him a few values from other places in the dump. If those values were in the range he expected from knowing the program, he would feel comfortable that the problem had occurred in the generated code.
Doug would have me read the series of characters after the dash in the next line.
661 DATA AT PSW 120340D0 - F8B48002 6019F0B5 80020006
These values, F8B48002 6019F0B5 80020006, are the machine instructions. They are contained at memory location 120340D0, where the eyes of the person conducting the sub-meeting, the PSW, are looking.
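The relationship between the two addresses in the dump can be checked with a little arithmetic. The following is a minimal sketch, assuming 31-bit addressing (where the leftmost bit of the PSW address word is a mode flag rather than part of the address) and assuming the PSW has already moved past the failing instruction, as it appears to have in this dump. The function name is hypothetical and is not part of SAFR.

```python
def failing_instruction_address(psw_address_word: int, ilc: int) -> int:
    """Estimate where the failing instruction starts.

    psw_address_word: second word of the PSW as printed in the dump.
    ilc: instruction length code in bytes, as printed in the dump.
    """
    # Assumption: 31-bit mode, so the top bit is a flag, not an address bit.
    address = psw_address_word & 0x7FFFFFFF
    # Assumption: the PSW already points past the instruction, so step back.
    return address - ilc


# Values taken from the "PSW AT TIME OF ERROR" line above.
start = failing_instruction_address(0x920340D6, 6)
print(f"{start:08X}")  # 120340D0, matching the "DATA AT PSW" line
```

Some kinds of program checks leave the PSW pointing at the failing instruction itself, so the subtraction is a rule of thumb for this example rather than a general law.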
I remember Doug often repeating the actual machine code, the F8B48002 and so on, over and over to fix it clearly in his mind, particularly if he was driving and couldn’t write it down. He would then tell me that the “F8” is the hexadecimal representation of the machine code for a Zero and Add Packed (ZAP) instruction. He knew the hex values of most of the instructions he used in his programs by heart. The ZAP instruction moves a value from one location to another, and adds zeros on the left if needed to increase the length. The digits of the F8 instruction have the following meanings:
- F8 = ZAP instruction
- B = The target field length minus 1 is 11, or a B in hexadecimal in the dump
- 4 = The source field length minus 1 is 4
- 8 and 002 = The target memory address is 2 bytes beyond the address in register 8
- 6 and 019 = The source memory address is hex 19 (decimal 25) bytes beyond the address in register 6.
The F0 instruction after the 019 begins the next machine instruction.
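The breakdown in the list above can be reproduced mechanically. Here is a minimal sketch in Python that splits the first six bytes of the dump data into the same fields. The class and function names are hypothetical, and the layout handled is only the two-length storage-to-storage form described above, not every instruction format.

```python
from typing import NamedTuple


class SSInstruction(NamedTuple):
    opcode: int
    target_length: int   # in bytes (the encoded value plus 1)
    source_length: int   # in bytes (the encoded value plus 1)
    target_base: int     # register number
    target_disp: int     # displacement added to the target base register
    source_base: int     # register number
    source_disp: int     # displacement added to the source base register


def decode_ss(hex_text: str) -> SSInstruction:
    """Decode a 6-byte instruction with two length nibbles, such as F8 (ZAP)."""
    raw = bytes.fromhex(hex_text.replace(" ", ""))[:6]
    if len(raw) < 6:
        raise ValueError("need at least 6 bytes of instruction text")
    opcode = raw[0]
    len1, len2 = raw[1] >> 4, raw[1] & 0xF       # the "B" and the "4"
    op1 = int.from_bytes(raw[2:4], "big")        # 8002
    op2 = int.from_bytes(raw[4:6], "big")        # 6019
    return SSInstruction(
        opcode, len1 + 1, len2 + 1,
        op1 >> 12, op1 & 0xFFF,
        op2 >> 12, op2 & 0xFFF,
    )


zap = decode_ss("F8B48002 6019F0B5 80020006")
print(f"opcode {zap.opcode:02X}")                                   # F8
print(zap.target_length, zap.source_length)                         # 12 5
print(zap.target_base, zap.target_disp)                             # 8 2
print(zap.source_base, f"{zap.source_disp:X}", zap.source_disp)     # 6 19 25
```

The bytes left over after the first six (F0B5 8002 0006) belong to the next instruction, which is why the decoder only looks at the first six.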
The next portion of the snap dump shows the values in the registers.
661 AR/GR 0: 90A63EDE/00000000 1: 00000000/FFFC6FE4
661 2: 00000000/12034008 3: 00000000/00038F10
661 4: 00000000/12084004 5: 00000000/11F056FF
661 6: 00000000/11F4805A 7: 00000000/1203A008
661 8: 00000000/1203A03B 9: 00000000/120570C0
661 A: 00000000/920340D0 B: 00000000/00010000
661 C: 00000000/00011000 D: 00000000/00038F10
661 E: 00000000/00000000 F: 00000002/00018FA0
661 END OF SYMPTOM DUMP
The AR/GR says that the first set of numbers is the Access Registers; after the slash is the General Purpose Register values. Doug was only interested in the GR numbers. The registers, the “eyes” of the processors in our meeting, contain the addresses of data on the white board, or at times, numbers. Numbers can be added to registers. Displacements are values that are added to register values without changing the value in the register.
Often based upon what Doug saw in the machine code above he would have me take the value in register 6, the 11FA805A, add the hexadecimal 19 displacement from the machine instruction to get address 11FA8073, and go to a different part of the system output, the sys dump, to find what was at that location.
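The base-plus-displacement arithmetic is simple enough to sketch in a few lines of Python. This is an illustration only: the function name is hypothetical, the register value is the one quoted in the paragraph above, and 31-bit addressing is assumed.

```python
def effective_address(registers: dict, base: int, displacement: int) -> int:
    """Add a displacement to a base register without changing the register."""
    # Assumption: 31-bit addressing, so only the low 31 bits form the address.
    return (registers[base] + displacement) & 0x7FFFFFFF


# Register 6 as quoted in the paragraph above; other registers are omitted.
registers = {6: 0x11FA805A}

source = effective_address(registers, base=6, displacement=0x19)
print(f"{source:08X}")          # 11FA8073
print(f"{registers[6]:08X}")    # 11FA805A, the register itself is unchanged
```

The point of the second print is that the displacement is used only to compute the address; the register still holds its original value afterward.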
We are getting close, so now I start adding hex 1 for every two characters in the display. The value at 11FA8073 begins with 81. I could translate from the hex value into the display value using an EBCDIC translation card, or I could look at the display portion of the dump. When I look there, I see the word “alpha”, which is obviously not numeric. I now know what needs to be fixed for SAFR to complete successfully: either the LR had been defined wrong, or the event data was bad.
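To see why text like this triggers a data exception, it helps to look at the bytes the way a packed decimal instruction does. The sketch below makes several assumptions: it uses Python's `cp037` codec as a stand-in for the EBCDIC code page on the system, it reconstructs the five source bytes from the word “alpha” (only the leading 81 is actually quoted from the dump), and the validity check is a simplified model of packed decimal rules, not SAFR or operating system code.

```python
def is_valid_packed(field: bytes) -> bool:
    """Simplified packed decimal check.

    Every half-byte must be a digit 0-9, except the last half-byte,
    which must be a sign code in the range A-F.
    """
    if not field:
        return False
    nibbles = []
    for byte in field:
        nibbles.append(byte >> 4)
        nibbles.append(byte & 0xF)
    *digits, sign = nibbles
    return all(d <= 9 for d in digits) and sign >= 0xA


# Assumption: cp037 approximates the EBCDIC code page in use.
field = "alpha".encode("cp037")
print(field.hex().upper())                           # 8193978881
print(is_valid_packed(field))                        # False: last half-byte is 1
print(is_valid_packed(bytes.fromhex("000012345C")))  # True: digits, then sign C
```

The five bytes match the source field length of 5 in the ZAP instruction, and they begin with the 81 seen in the dump. Because the final half-byte is not a valid sign, an arithmetic instruction cannot treat the field as a number.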
Going through these steps I learned that everything that was happening could be understood; that it didn’t really take a special mind of some kind to comprehend it; it just took time and attention to details. Building up of these small steps in careful ways results in something quite extraordinary.
A number of years later, as we worked on a problem in an area of the code he hadn’t been in for quite a while, I remember Doug commenting that he was going to have to study the program for a bit to remember how it was structured. He then paused and told me a little story.
He said that when working towards his master’s degree in computer science he was required to write a teleprocessing monitor. He built different parts of the system over time, and one day he had to go back to a completed part to either fix it or add something to it. He had the same experience: he had to go back and relearn how things worked. He said, “I realized then that I had created something beyond my ability to keep it all in my head; it was bigger than I was.” There are a lot of people who would agree he has constructed something bigger than any one person.
498 USER COMPLETION CODE=0999
498 PSW AT TIME OF ERROR 078D1000 80010418 ILC 2 INTC 0D
498 ACTIVE LOAD MODULE ADDRESS=00010000 OFFSET=0000
498 NAME=GVBMR95G
This snap dump is from the main program GVBMR95. When the sub-meeting had the 0C7, the main meeting, GVBMR95, called for help by telling the operating system it wanted to stop as well. Thus instead of a system completion code like 0C7, we have a user completion code of 999. Note that the instruction address in this PSW, 80010418, lies within GVBMR95G, which the operating system loaded at 00010000. This is the white board location of the GVBMR95G agenda, not the sub-meeting agenda at 920340D6 shown with the “No Active Module Found” message.