In Computers Explained: The Meeting Analogy, I gave a basic description of how a computer works. I then added to that in Computers Explained: Big Data Implications how analytical processes impact computers.
Today, I expand upon the analogy, covering the impact of both languages and parallel processing.
At the bottom of all computer languages and procedures used by the processors is a series of ones and zeros. Each one or zero is called a bit. A grouping of usually eight bits makes a byte. For example, on an IBM mainframe the letters “Y E S” would be represented as “11101000 11000101 11100010”. The eight digits of a byte, each either a zero or a one, can be combined in 256 possible ways. Only 26 of those combinations are used for capital letters, another 26 for lower case letters, and 10 more for the digits. The others are used for punctuation and other characters. In a sense, extending our people-in-a-meeting analogy, these combinations of ones and zeros are the alphabet of the languages used in every computer “meeting”, even though the language itself might be different.
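The byte values quoted above can be reproduced with a short Python sketch. It assumes the `cp037` codec, which is one EBCDIC code page; the text does not say which mainframe code page it has in mind, so treat the choice as illustrative.

```python
# Encode "YES" with Python's cp037 codec, one EBCDIC code page.
# (Assumption: the article does not name a specific EBCDIC variant.)
text = "YES"
encoded = text.encode("cp037")

# Show each byte as eight bits.
bits = [format(byte, "08b") for byte in encoded]
print(" ".join(bits))  # 11101000 11000101 11100010

# Eight bits, each a 0 or a 1, give 2**8 distinct combinations.
combinations = 2 ** 8
print(combinations)  # 256
```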
Considering our analogy further, take note that language is used in two different ways in our meeting: (1) It records the information we write in the notebook and on the white board; (2) it is also used by people as they think and communicate. These are very different things. If the processors—people—are determining how much to pay employees, the whiteboard might be filled with numbers like hours worked and pay rate, and then ultimately the pay for the current period. All these items would be written in the notebook when the meeting is over. This is generally referred to as data.
The second category of communication, telling a person in the meeting to multiply the hours worked by the pay rate, is called the program. It is written in a different binder, in this case the binder of payroll procedures, only when the program is created. These programs change much less frequently than the data changes, and are much more time-consuming to develop. For the most part, creating these procedures is the work of IT projects.
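The split between data and program can be sketched in a few lines of Python. The employee labels and figures below are invented for illustration; only the idea of multiplying hours worked by pay rate comes from the text.

```python
# Data: what would be written on the whiteboard and later in the notebook.
# These records are made up for illustration and change every pay period.
timesheet = [
    {"employee": "A", "hours_worked": 40, "pay_rate": 20.0},
    {"employee": "B", "hours_worked": 35, "pay_rate": 30.0},
]

# Program: the procedure kept in the "payroll binder". It is written once
# and changes far less often than the timesheet does.
def compute_pay(record):
    """Multiply hours worked by pay rate for one employee record."""
    return record["hours_worked"] * record["pay_rate"]

pay_by_employee = {r["employee"]: compute_pay(r) for r in timesheet}
print(pay_by_employee)  # {'A': 800.0, 'B': 1050.0}
```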
Although both categories of language used in our meeting are composed of zeros and ones, they aren’t really the same language. It is a bit like the fact that Hawaiian and Japanese are both composed of very similar sounds, but the combinations of those sounds mean completely different things. The language of a computer program is almost never the same as the language of the data.
For example, the 11101000 that represents a “Y” in the data means something different in a program. That set of zeros and ones means a Move Character Inverse (MVCIN) instruction on a mainframe. When the processors see that instruction, it tells them to copy data on the white board to another part of the white board, but reverse the order of the letters. Thus when the computer executes an instruction a person would read as “Y”, it might make a copy of the white board word “Hello” as “olleH”.
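As a rough illustration of one bit pattern carrying two meanings, the sketch below reads the same byte first as data and then as an instruction. The `OPCODES` table and the `move_inverse` helper are hypothetical stand-ins that hold only what the text describes; the `cp037` code page is again an assumed EBCDIC variant.

```python
BYTE = 0b11101000  # the bit pattern discussed in the text

# Read as data: decode with cp037, one EBCDIC code page (assumed variant).
as_character = bytes([BYTE]).decode("cp037")

# Read as an instruction: a hypothetical one-entry opcode table containing
# only the mapping the article mentions.
OPCODES = {0b11101000: "MVCIN"}
as_instruction = OPCODES[BYTE]

def move_inverse(field: str) -> str:
    """Toy stand-in for the effect described in the text: copy a field
    with its characters in reverse order."""
    return field[::-1]

print(as_character, as_instruction, move_inverse("Hello"))  # Y MVCIN olleH
```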
Almost no one writes computer programs by typing in ones and zeros. That is far too time-consuming. Instead, people instruct computers to do things through various computer languages. In a simplified way, they write “MVCIN” and a computer program called a compiler translates it into the instruction 11101000. There are compilers for each computer language that perform this function.
Over the course of computer development, there have been more and more languages created—each one a slightly different tool. Each language has its particular strengths, similar to the way vocabulary is developed in specialized fields (like accounting and computers) to describe different concepts. Some computer languages are better suited to expression of scientific functions, others for graphics and others for a host of other types of computer problems.
Languages have also been developed to work at different levels of specificity. The most specific language is called “assembler.” Assembler is specific to a particular computer, and translates directly into the ones and zeros that computer recognizes. The MVCIN instruction above is an assembler instruction for an IBM mainframe. Because someone has to specify every specific action the computer should take in assembler, it is called a low level language.
Higher level languages don’t require programmers to specify every single action. For example, COBOL is considered a third generation computer language, that is, a higher level language. Binary, the ones and zeros, is the first generation; assembler is the second. In COBOL someone can write the following phrase:
IF HOURS-WORKED GREATER THAN STANDARD-HOURS (EMPLOYEE-GRADE)
And the COBOL compiler will translate this into the following assembler instructions:
LH 2,240(0,10)
L 3,16(0,12)
MH 2,96(0,3)
L 4,308(0,9)
PACK 568(2,13),8(2,4)
OI 569(13),X'0F'
AR 2,10
PACK 576(2,13),174(2,2)
OI 577(13),X'0F'
CLC 568(2,13),576(13)
BC 13,2348(0,11)
One COBOL phrase translates into 11 assembler instructions; 11 assembler instructions translate into 11 machine instructions of ones and zeros.
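The same one-to-many expansion can be observed in other languages. As a loose analogy only, the Python sketch below writes a rough counterpart of the COBOL phrase and uses the standard `dis` module to list the lower-level bytecode instructions Python generates for it. Python bytecode is not mainframe machine code, and the exact instructions and their count vary between Python versions, so no specific count is claimed here. The function name `overtime_check` is invented for illustration.

```python
import dis

def overtime_check(hours_worked, standard_hours, employee_grade):
    # Rough Python counterpart of the COBOL IF phrase in the text.
    return hours_worked > standard_hours[employee_grade]

# List the lower-level bytecode instructions generated for the function.
instructions = [ins.opname for ins in dis.get_instructions(overtime_check)]
print(len(instructions), instructions)
```

One line of source expands into several lower-level steps, which is the pattern the COBOL example illustrates.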
Using higher level languages makes programmers more efficient; it takes less time to write the COBOL statement above than the assembler instructions. Because programmer time is a significant element of automating functions, this is a reasonable approach; making programmers more efficient means automation takes less time. But this efficiency can come at a cost if the language and compiler used—the tool—doesn’t automate the correct pattern.
The person who designed the COBOL language and wrote the COBOL compiler had certain processing patterns in mind. He or she designed templates of assembler code for each COBOL statement. If that pattern matches the work that needs to be done, then that code will be efficient. If not, it will not be. It may be as inefficient as using a screwdriver to pound in a nail.
For instance, suppose the person designing the compiler knew that, for some reason, the HOURS-WORKED and the STANDARD-HOURS for the specific EMPLOYEE-GRADE were already loaded into computer registers (something like the eyes of those in the meeting reading information from the white board). They could then have designed the pattern of assembler instructions to be different. The first ten rows of assembler instruct meeting participants where on the white board to look for the data, and how to get the data into a format that can be compared; if the values were the words “twenty-two” in one case and “25” in the other, the comparison won’t work. But perhaps, because of the way the language is structured, the person designing the compiler knows that the eyes of the processor will be looking at the right data to be compared when they come to this point, and that the data is in the right format. If that were the case, then instead of the 11 instructions above, the program would only need the following two:
CR 2,10
BC 13,2348(0,11)
The COBOL compiler doesn’t have this pattern; it chose the pattern shown above. For the most part, when it sees an IF statement that compares two numbers, it will assume the computer will need to find the data, and change the format of the data to be comparable. One experienced assembler programmer has found he can remove one third of the computer instructions generated by COBOL if he writes the program in assembler.
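Why the data has to be brought into a comparable format before the comparison can be seen in a small Python sketch. The values are invented for illustration, and Python's `int()` conversion merely stands in for the role the text assigns to the format-changing instructions.

```python
# Two values as they might sit "on the whiteboard": text, not numbers.
hours_worked = "9"
standard_hours = "10"

# Comparing the raw text works character by character, so "9" sorts
# after "10" and the answer is misleading.
raw_comparison = hours_worked > standard_hours

# Converting both values to a numeric format first gives the intended answer.
converted_comparison = int(hours_worked) > int(standard_hours)

print(raw_comparison, converted_comparison)  # True False
```

If the values are known to be in a comparable format already, the conversion step is wasted work, which is the saving the two-instruction version relies on.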
But of course, writing compilers is very time-consuming and labor-intensive. At times it may be more efficient to use an existing compiler and have the program perform the 11 instructions rather than taking the time to build a compiler or have the programmer write it in assembler. This is often the case because of what is called Moore’s law, the observation that since the mid-1960s the number of transistors on a chip, and with it roughly the speed of computers, has doubled about every two years. Thus as computers become faster, the cost of the more efficient programmer (and less efficient program) is made up for by the faster computer.
However, when the goal is to produce answers as close as possible to the time someone asks the question, 11 machine instructions might matter, particularly if it is 11 machine instructions per business event record summarized. Remember, some questions require accumulation of millions of business events.
Computer processors run at constant speeds. In a CPU “cycle” a processor can basically execute one machine instruction. If our computer runs at 1 gigahertz, which we will treat as 1 billion machine instructions a second, and our answer requires summarizing 1 billion business events, then an inefficient tool that uses 13 machine instructions per event will give us our results in 13 seconds. A different language or compiler that uses two machine instructions per event will give us our answer in 2 seconds.
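The arithmetic behind those two figures is simple enough to write down. The sketch below uses the same simplification as the text, one machine instruction per cycle; the function name `seconds_to_answer` is invented for illustration.

```python
def seconds_to_answer(events, instructions_per_event, instructions_per_second):
    """Elapsed seconds, assuming one machine instruction per CPU cycle."""
    return events * instructions_per_event / instructions_per_second

ONE_BILLION = 1_000_000_000

slow = seconds_to_answer(ONE_BILLION, 13, ONE_BILLION)
fast = seconds_to_answer(ONE_BILLION, 2, ONE_BILLION)
print(slow, fast)  # 13.0 2.0
```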
Meetings that use very flowery language like congressional debates take more time to communicate affirmation than someone saying, “Yep.”
Now remember, if the existing system today is providing the answer to this question, and that answer really takes summarizing 1 billion business events, the existing system today must be doing that work. It might be doing that work in pieces; a portion in the operational system in posting the transactions, a portion in the ETL layer loading the data warehouse, and a portion in the database as it does a sum function for the last level of summarization for the actual report request, but the work is being done today. It is simply that the work is done all along the data supply chain, a portion of which is hidden during nightly or transaction system processing. Doing this existing work more efficiently will either (1) reduce the time to produce the reports if 13 seconds is unacceptable, or (2) reduce the compute cost if 13 seconds is acceptable.
In one example of the power of parallelism, a major US retailer had a customer marketing database containing all of this retailer’s point of sale credit transactions during the last two years, organized by household. A series of COBOL programs, the standard tool for this type of work in 1993, would scan this data and score each household’s buying activity according to various criteria as to attractiveness for marketing mailings. The output of this weekly process was used to drive mailing list applications.
The following are the results from the runs from November 1993. The data was stored on 80 tapes.
The results were astounding to everyone involved: A 96% reduction in wall time from 28 hours to just over one hour, CPU time from 22 minutes to less than a minute, and CPU (computing) costs from US $18 thousand to less than US $1000.
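The percentages can be checked with a short calculation. Because the text gives the "after" figures only as "just over one hour" and "less than US $1000", the sketch plugs in 1 hour and 1,000 dollars as round stand-ins, so the results are approximate bounds, not exact values.

```python
def percent_reduction(before, after):
    """Percentage by which `after` is lower than `before`."""
    return (before - after) / before * 100

# Round stand-ins for "just over one hour" and "less than US $1000".
wall_time_reduction = percent_reduction(28, 1)
cost_reduction = percent_reduction(18_000, 1_000)

print(round(wall_time_reduction, 1), round(cost_reduction, 1))  # 96.4 94.4
```

With the true wall time slightly above one hour, the reduction falls a little below 96.4%, consistent with the 96% quoted; with the true cost below 1,000 dollars, the cost reduction is somewhat above 94.4%.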
The reduction in wall or elapsed time is due to parallelism. Parallelism is designing a computer system to use more than one CPU at a time. Think of it as using multiple computers at the same time, although in large computers each computer actually contains multiple CPUs. Suppose instead of doing road construction from one end of the project to the other, a city deployed multiple work crews along the road to work on segments of the road at the same time. The total duration of the project might be as low as the estimated time for the project divided by the number of work crews deployed.
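The road-crew idea maps onto a common programming pattern: cut the work into segments, hand each segment to a worker, then combine the partial results. The sketch below shows only that structure, using Python's standard `concurrent.futures` module. The function names are invented for illustration, and in standard CPython, threads do not speed up pure computation like this; real CPU parallelism would need separate processes or machines.

```python
from concurrent.futures import ThreadPoolExecutor

def split_into_segments(records, crews):
    """Cut the record list into at most `crews` contiguous segments."""
    if not records:
        return []
    size = -(-len(records) // crews)  # ceiling division
    return [records[i:i + size] for i in range(0, len(records), size)]

def parallel_total(records, crews):
    """Sum each segment on its own worker, then combine the partial sums."""
    segments = split_into_segments(records, crews)
    with ThreadPoolExecutor(max_workers=crews) as pool:
        partial_sums = list(pool.map(sum, segments))
    return sum(partial_sums)

amounts = list(range(1, 1001))  # made-up "business events"
print(parallel_total(amounts, crews=4))  # 500500
```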
CPU time is the accumulation of time used by all processors or computers. In our road work example, the total work time spent on the road project is the accumulation of each of the work crews’ time. One would have to pay each work crew for the time spent on the project, even though they are working all at the same time. So although using parallelism takes less time to complete the project, the total time worked and cost may not be reduced at all. In fact, if the plan for parallelism is poor, it may actually be slower and cost more: having one crew responsible for paving and the other responsible for grading means they would be working over the top of each other and saving no time or money.
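The difference between elapsed time and total paid time can be made concrete with a deliberately simple model. Everything in it is an assumption for illustration: the work divides evenly among crews, and a poor plan is represented by extra hours each crew loses to getting in the others' way (`overlap_hours_per_crew`, a hypothetical parameter).

```python
def project_estimate(total_crew_hours, crews, overlap_hours_per_crew=0.0):
    """Return (elapsed_hours, paid_hours) under a simple assumed model:
    work splits evenly, and each crew loses `overlap_hours_per_crew`
    to interference with the other crews."""
    paid_hours = total_crew_hours + crews * overlap_hours_per_crew
    elapsed_hours = paid_hours / crews
    return elapsed_hours, paid_hours

one_crew = project_estimate(100, crews=1)
good_plan = project_estimate(100, crews=4)
poor_plan = project_estimate(100, crews=2, overlap_hours_per_crew=50)

print(one_crew, good_plan, poor_plan)
# (100.0, 100.0) (25.0, 100.0) (100.0, 200.0)
```

In the good plan the elapsed time drops to a quarter while the paid hours stay the same; in the poor plan no elapsed time is saved and the paid hours double.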