Lessons Learned from 072.sc
Reflections of an Active Participant

By Alexander Carlton
Published March 1995; see disclaimer.

Recently, SPEC has moved to sanction the use of a "SPEC Common Curses Package" for the 072.sc benchmark. This action was taken to address some of the problems that have been found in the original benchmark. In short, it has recently been recognized that not all systems are doing the same amount of work when running the 072.sc benchmark, and hence accurate comparisons are becoming difficult. While there have been problems of perception with this change, I believe that a technical evaluation will show that the "SPEC Common Curses Package" was the most effective solution.

A Little History

When somebody first suggested the public domain spreadsheet sc as a candidate for the CINT92 suite, the idea was very well received. Sc in particular seemed to have a behavior profile consistent with the kind of integer code we were very interested in: system-style code where the code flow is driven by data and pointers rather than the loops that are so prevalent in scientific applications. However, transforming the application sc into a benchmark proved to be a non-trivial task.

The first challenge was creating a workload for the benchmark. By nature, sc, like all spreadsheets, is an interactive program. In order to be a useful CPU test, we needed to develop a workload where the application would be CPU bound, running calculations for the whole measurement interval. It is difficult to measure CPU performance if the application is blocked, waiting for user input. The solution was to capture into a script file all of the keyboard commands during interactive sessions, and then to feed these files as inputs to the benchmark. Thus the benchmark would not block waiting for input and would run consistently on the CPU.

Having solved the input problem, we quickly ran into a challenge in how to handle the output.
During evaluation of the benchmark candidate we recognized that some terminal types (for example, a basic VT52 console) were much easier to update than others (for example, an xterm process emulating a 45-line-tall VT220 on top of X Windows). Further, the effective baud rate of the terminal (9600 for the console, nearly unlimited for the local xterm process) would greatly impact the elapsed time to complete the benchmark. Even with identical terminal types, one could still influence the measurement of this CPU test by how the terminal was attached to the system. Thus, it was determined that we had to enforce a particular terminal type to run against so that all systems were tested with the same level of terminal complexity. Additionally, it was determined that we had to redirect the benchmark output away from any tty device and to a disk file so that baud rates and other output device settings would not impact this CPU measurement.

Finally, we ran into a third challenge when we were working on the validation processes for the CINT92 suite. We discovered that each implementation of the curses(3C) library (which the sc application used to format its output) formatted the output in a slightly different way. Some versions of curses(3C) would update the fields on the screen in different orders, or they might update only the few characters which changed rather than all the characters in a word or number. In short, it was not possible to validate the redirected screen output because that output only made sense when displayed on the terminal for which curses(3C) was configured. Thus, rather than attempting to make sense out of an effectively random screen output file, we set up the sc application to dump all the values in the spreadsheets to a disk file, and we then validated that dump of values against a reference set to verify that all the correct assignments and computations were properly made.
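
The difference between comparing screen output and comparing dumped values can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the dump format (`A1 = 100`), the function names `parse_dump` and `validate`, and the two sample escape-sequence strings are hypothetical and are not taken from sc, from curses(3C), or from the SPEC tools.

```python
def parse_dump(text):
    """Parse lines such as 'A1 = 100' into {cell: value}.

    The 'cell = value' layout is an assumed format for this sketch only.
    """
    values = {}
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        cell, _, value = line.partition("=")
        values[cell.strip()] = value.strip()
    return values


def validate(dump_text, reference_text):
    """Return a list of (cell, expected, actual) mismatches.

    An empty list means the dump agrees with the reference set.
    A cell missing on one side is reported with None for that side.
    """
    actual = parse_dump(dump_text)
    expected = parse_dump(reference_text)
    mismatches = []
    for cell in sorted(set(expected) | set(actual)):
        if expected.get(cell) != actual.get(cell):
            mismatches.append((cell, expected.get(cell), actual.get(cell)))
    return mismatches


if __name__ == "__main__":
    # Two made-up screen streams that would leave the same picture on a
    # terminal: one rewrites whole fields, the other touches single
    # characters in a different order. A byte comparison calls them different.
    screen_a = "\x1b[2;1H100\x1b[3;1H250"
    screen_b = "\x1b[3;2H5\x1b[2;1H1"
    print("screen streams identical:", screen_a == screen_b)

    # The value dump does not depend on update order or screen strategy.
    reference = "A1 = 100\nA2 = 250\n"
    dump = "A2 = 250\nA1 = 100\n"
    print("value mismatches:", validate(dump, reference))
```

Comparing parsed values rather than raw bytes makes the check independent of how any particular library chose to paint the screen. As the rest of this article explains, it also leaves the screen output itself entirely unchecked.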

Therefore, the final form of the 072.sc benchmark was the sc application configured for a VT220, reading input from a script file, redirecting the screen output to an unchecked disk file, and validating an internal dump of all spreadsheet values to verify correct behavior. This was the benchmark as defined in the CINT92 benchmark suite.

The Problem

Over time, as people analyzed the CINT92 benchmarks for tuning opportunities, it was discovered that it was more profitable to change the curses(3C) library than to improve the compilation of the sc application code. Further analysis revealed that there is potential for optimizations for the 072.sc benchmark in the curses(3C) library which would have no impact (either good or bad) on most users.

The reason 072.sc was so easy to take advantage of was that it used the curses(3C) library in a manner that no interactive user ever would. As a matter of fact, while researching this, a bug was found in the original reference distribution of SVR4 curses(3C), which was then replicated by many system vendors; this bug resulted in greatly reduced output when curses(3C) read its input from a script file rather than a tty device. The bug involved a value that was not properly initialized when not connected to a tty device, and it went undiscovered for so long because such a configuration is so uncommon.

Over time, as different vendors implemented different optimizations in their curses(3C) libraries, the amount of work being performed in the 072.sc benchmark became quite inconsistent. One survey of the screen output file created by different curses(3C) implementations found that the number of characters output ranged from almost two megabytes down to virtually zero bytes. While 072.sc was never meant to be a curses(3C) test, the variations in curses(3C) implementations were beginning to impact the relevance of the benchmark results.

Potential Solutions

The question then is how to fix this situation.
It is necessary to find a solution that levels the playing field so that comparisons are fair. It is necessary to act quickly before more radical implementations are released and this situation becomes serious. It is also necessary to find a relatively simple solution because most of SPEC's available resources are already committed to work on the impending release of the CPU'95 benchmark suites.

The obvious first impulse is to just decree a ban on all over-optimized implementations. Unfortunately, it is not that simple. As it turns out, it is not possible to define such a ban. Right from the very early days of curses(3C) in the early Berkeley releases, the definitions of curses(3C) behavior were purposely left vague to encourage optimizing implementations. It is impossible to define a "baseline" curses(3C) implementation without ruling out virtually every existing implementation since BSD4.2.

A commonly proposed solution is to attempt to define a set of behavior that must be supported by a curses(3C) implementation before it can be used in a SPEC benchmark. This suffers from a fault similar to the last: there is no supportable standardized behavior for an interactive facility utilized in a non-interactive manner. How can one define what must be displayed when there is no user to display it to? How can one define what prompts must be displayed when it is known that the input is already set and is not dependent upon the output? On a modern machine, the sc application in the 072.sc configuration is capable of updating thousands of cell values each second; anything over a dozen updates a second is a blur to the human eye. How many screen refreshes a second can a specification require, and would such a specification lead to slower machines doing more updates than fast ones?

The usual next idea is to try to prohibit the use of implementations that take specific advantage of the SPEC configuration.
This has problems stemming from the fact that even the most recent reference implementation does not handle the SPEC configuration properly. Furthermore, it becomes problematic, given the curses(3C) definitions, to distinguish between benchmark-specific optimizations and more general-purpose enhancements without having to make subjective and arbitrary decisions.

The cleanest solution would be for SPEC to provide the source code for its own implementation of curses(3C), and require all vendors to use the code as it is supplied. This approach ran into two problems. The first was that it was difficult to find or create an implementation that would be portable and performance neutral. We looked at several different implementations, including the Linux distribution, several PC implementations, and even one vendor-supplied implementation. None of these implementations would suffice without significant additional work to bring them up to SPEC standards. The second problem was that porting and verifying any new implementation was going to cause a significant delay in the release of the CPU'95 benchmarks.

Our final solution was for SPEC to write its own minimal implementation of curses(3C). We recognized the need to provide one common implementation so that all results are for the same amount of work. Unfortunately, realities of schedules and resources forced us to make this SPEC common curses package a minimal implementation. So, we have leveled the playing field to the best of our abilities, though at a moderately lower level. In a perfect world, we would have done better, but instead we have learned some important lessons.

The Lessons

The first lesson is that it is not wise to make any assumptions about any part of the benchmark not supplied by SPEC. If there is some particular feature or behavior that is desired or expected, then that had better be explicitly stated and defined.
The second lesson is not to use a configuration for a benchmark that is not typical of the users' expected use. Any component utilized in an uncommon configuration may be optimized in a way that improves benchmark results without improving what real users experience. The third lesson is to ensure that all desired features are properly defined and checked for. Any behavior or characteristic that is not carefully checked for or explicitly required by the rules may be optimized away.

This has been a difficult learning experience, but the real world is full of surprises. At least now, these mistakes will be remedied in CPU'95.