June 30, 1987

Dr. Robert Moon
Assistant to the President

Dear Dr. Moon,

After assisting a site experiencing over two weeks of unscheduled computer down-time, I felt it best to summarize some aspects for your review.

The site was NASA Goddard in Greenbelt, Maryland. Evidently their Sigma 9 system was affected by a "power hit" about June 5. They attempted a reboot from PO tape, but had 96 GENMD errors in the process. The resulting operating system did not run properly. All available hardware diagnostics were run, but they did not show any errors. NASA is currently maintained by Western Computer Systems, but was maintained by Honeywell through December 31, 1986.

I was contacted the evening of June 18, and was on-site early the morning of June 20. Because of the TeleXchange conference in Washington D.C. it was convenient for me to go to NASA. I found the problem and suggested a correct repair after four and a half hours. I was on-site less than six hours.

Following are some of my observations and further comments:

First, since I was not helping them as part of my work for Andrews University or for Telefile, I will cover the travel expenses to the area, and several phone calls, if you see fit. I realize that, whereas it had been Telefile's policy to provide my travel expenses to attend TeleXchange, and whereas this policy was being "bent" concerning this meeting, that portion of the expense of attending this conference was unplanned for Andrews University. You are aware that I feel very strongly about our other staff having the opportunity to occasionally attend such functions and am willing to help make certain this is a reality.

It is not yet clear how much I will be paid for this assistance. Since they had been down for so long, I did not feel it was right to expect payment if I could not help them, especially since little money was required to cover expenses.
Jimmy Nishry of Telefile had made contact with Michael Sutch of Lockheed (primary contractor with NASA for this project) to assure that if I did help them, I would be paid "what it was worth".

Second, although the timing was quite convenient as far as my trip was concerned, the fact that I arrived in the area Friday evening and they were still down presented a dilemma. Applying the statement in Luke 14:5 about "an ox fallen into a pit", I felt it necessary to assist them even on the Sabbath day. In so doing I was careful to not misrepresent my church's and primary employer's fundamental beliefs concerning proper keeping of the seventh-day Sabbath.

Even at our site, some computerized functions must now be maintained on the Sabbath day. Although manual backup facilities have been provided, if a simple repair can restore service we have been performing it. In my search for Luke 14:5 I came across Luke 13:15 concerning watering cattle on the Sabbath and felt much better about occasionally stopping in the computer room on the Sabbath to look, feel, smell, and listen for any abnormalities. Although I personally had no problems with this, I really did not have any good answers if someone questioned this activity.

Third, anytime a computer site has third-party, or even vendor maintenance, they are very vulnerable to the finger-pointing syndrome. With vendor maintenance the vendor must keep the customer happy if they expect a continued relationship. The original vendor will have not only hardware experts, but also software experts. They probably also have the best contacts with experts in the field. In their situation, Honeywell was really also a third-party maintenance firm since Xerox took the plunge in 1975. Honeywell also seems to have chosen to let these sites fade away. The only choices other than third-party maintenance available to such Xerox/Sigma users (while avoiding costly software conversion) are new, compatible equipment or self-maintenance. With third-party maintenance, you are much less likely to have software expertise available, and there is little incentive to bring in costly expert trouble-shooters.
It is not clear to me how much you are at the vendor's mercy for maintenance, but it seems that with most standard contracts, if they can demonstrate that they are trying to solve the problem, it could take forever.

Although it was before you were director here, you may recall the situation we were in with Honeywell. They would not provide experienced customer engineers, so many problems were resolved after they left for the day. We were not supposed to touch the hardware, so the failing module would be reinstalled in the morning and we would "help" them find it.

Self-maintenance may have its own set of problems, but finger-pointing really isn't one of them. A little for jest does occur, but it is a healthy way to discuss the situation and resolve the problem. It usually provides learning for all involved as reasons must be given and hypotheses tested.

One area self-maintenance shines in, above all other types, is in dealing with intermittent problems. This, of course, is only true when detailed logs and continuity of personnel are maintained.

Fourth, at our site we try to use multiple equipment of the same type. This has not always been the case, but risk analysis has been performed, sparing levels evaluated, and contingencies planned when we deviate from this rule. Fault isolation is greatly aided by this arrangement. Doing our own maintenance has greatly assisted in this since we have been able to buy used equipment at low prices and hence upgrade our capacity while maintaining multiplicity.

To assist in such fault isolation, you will recall I sought and obtained your permission to take a disk pack with a disk-swapping, single-pack operating system CP-V version C00. Having tested it at Andrews University, I knew that it could help eliminate the 7212 RADs, Honeywell MPC tape drives, PO tape and CP-V version F00 as possible problems. This pack was not used in the fault isolation process.

I also took with me a copy of our Telefile diagnostic tape with Sigma 9 snap-data. Since they had already had a Telefile field service engineer help the weekend before, I expected no problem securing permission from Telefile for its use, perhaps after the fact. The tape was unused and remained in the car. However, it seemed that no one had even thought to run the CPU diagnostics forcing a comparison with the snap-data (a standard option). In fact, it seems that they did not have the data available. Although this is very helpful in finding problems where the tests fail, I think I have seen this compare fail even when the answers agreed! (The diagnostic tape was written at 800 bpi since the MPC tape controllers do not need their firmware loaded to read such.)

Fifth, we maintain multiple copies of critical tapes such as PO tapes, source tapes, diagnostic tapes, backup tapes, etc. These copies are not only maintained on-site, but also at various locations off-site. Since NASA only had one copy of their PO tape, much time had been spent before it was checked elsewhere to determine that it was not the problem.
Since few, if any, similar hardware configurations exist, it was not tested completely. A system source tape was also located just prior to my arrival (we have a copy as well).

After they had been down a week, they called in their former systems programmer. He had been gone for four years and thus was very rusty. It is also not clear to me that he ever got very involved with many operating system internals.

Sixth, we maintain operating systems listings on paper and on microfiche. In addition we have several copies of technical manuals which describe the internals of our operating system's predecessor (UTS). This documentation is invaluable in a situation like this, and appropriate protection is used to be certain that it remains usable (i.e. does not wander off and is not otherwise destroyed). I took our microfiche copy of the operating system version they are running, but it was not needed since they had already obtained a copy from elsewhere the day before.

Given all my preparation, what I used to solve their problem, I carried there in my head. We are still very dependent on a few individuals who understand how everything fits together and works. This is much more than just systems programmers who can code in assembly, or board-swapping customer engineers. Since most of our routine maintenance procedures are not documented, this information is "fragile". These individuals cannot learn this in classrooms or by doing only routine day-to-day maintenance. It must be learned by trying to expand features and develop new capabilities. This takes years. George was ready to accept the responsibility of self-maintenance after six long, hard years of preparation. I had to accept the responsibility after five such years, but during much of this time we were experiencing rapid expansion and had a critical mass of development talent.

Our current maintenance staff all have three years or less of full-time experience. This means that development and documentation are essential to our continued internal training. With the current amount of computer and communications hardware as well as software to be maintained and the current staffing, this is often and continually squeezed out. Everything seems to run at a fevered pitch and the pressure means that everything must succeed. There is little, if any, time to develop our diagnostic tools (experiments), but whose gains outweigh such risks. It is also hard to schedule time to take classes since we only have single coverage of our major areas. Unless an environment can be created where regular hours and a normal workload exist, we will continue to lose valuable personnel.

Well, last I knew NASA's Sigma 9 was busily processing data from the Dynamics Explorer satellite. Although it has been a while since we had any significant downtime, one really never knows when a really nasty gremlin will strike.
Meanwhile, we will continue to solve the problems as they come along and hopefully we will be ready for the hard ones!

A transcription (with minor editing) of my sitelog entry is also enclosed.

Best regards,

s/Keith G. Calkins

Keith G. Calkins
AU Computer Systems Manager
and XDX Consultant

cc: Jimmy Nishry
    Michael Sutch

6/20/87
08:15  Back on System
       Checked memory after PCL read of tape
       Contents intact. Checking transfer to user area.
11:45  Move Byte String failure (with certain conditions) [Greg Molenaar]
12:14  Replaced module (fixed) - Jack Muff

6/20/87 NASA-Goddard sitelog entry by Keith G. Calkins.

I arrived on-site with Jack Muff (of Western) about 7:45 a.m. We talked about the problem and I understood that CP-V F00 "ran", but had 96 GENMD errors when booted from tape. The GENMD errors were the result of "holes" in the REF/DEF record of load modules read off of tape. They (a Honeywell Bull CE had also arrived) assured me that the tape was verified good on the Sigma 5 "downstairs". "Holes" were 30-35 words long. I asked if they were 32 words long, but they were not certain. They indicated that copying the LMN's off of tape resulted in holes when copied to UC(X). They had already discovered that it made a difference whether or not the HEAD record was first. With the HEAD record first, it failed. They were certain it was coming in off tape ok, but didn't know yet where the problem was.

My first step was to look at SBUF1 in memory to find out if the data was ok there, since it was bad in the user area. It was ok so we knew it was likely a CPU problem.

The next step was to try to find the actual monitor code which moved the data into the user page. To do this we had to find the physical page receiving the data by getting it out of JX:CMAP. With this [and the PCP stop address] we established that RBLK6 in the RDF module of CP-V actually moved the data with an MBS of 252 bytes, except for the remainder. We established that it moved the 252 byte blocks ok, but that the remainder of 192 (.C0) was to be moved. We established that the MBS,12 0 at .6FA9 with 12/.00025740/ and 13/.C00316E4/ only moved 64 (.40) bytes instead of 192 (.C0).

We then analyzed the MBS/CBS phase sequencing charts and single-stepped the instruction. In PH44 the E register had .40 instead of .C0.
This came from E+1 in PH43, where E was ok at .BF. We looked at the logic equations for the register frame and decided that the FG18 in 22K contained most of this logic, and was thus suspect. It was swapped with 21K which had bits 4-7 of the E register and the floating-point diagnostic failed solid. We replaced the FG18 from Western spares and the problem was resolved.

We did not resolve which component had failed, but I suspect one of the many diodes present. The problem was emphasized by the format of records on labeled tape, and the way data is moved into the user's data area. It thus could only be found with an extensive knowledge of CP-V.
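
The byte counts recorded in the entry above can be checked with a few lines of arithmetic. The following Python sketch is not part of the original sitelog. It assumes 4-byte (32-bit) Sigma words, an 8-bit E register, and the Sigma convention of numbering bit 0 as the most significant bit. All variable and function names are hypothetical and chosen only for illustration.

```python
# Worked check of the byte counts quoted in the sitelog entry.
# Assumptions (not stated in the log): 4 bytes per word, an 8-bit
# E register, and bit 0 numbered as the most significant bit.

BYTES_PER_WORD = 4

expected_count = 0xC0  # remainder the MBS should have moved (192 bytes)
observed_count = 0x40  # value seen in the E register in PH44 (64 bytes)
pre_increment = 0xBF   # E register value in PH43, before E+1


def dropped_bits(expected: int, observed: int) -> int:
    """Return a mask of the bits set in expected but clear in observed."""
    return expected & ~observed


def msb_first_bit_number(mask: int, width: int = 8) -> int:
    """Convert a single-bit mask to MSB-first numbering (bit 0 = MSB)."""
    return width - mask.bit_length()


# E+1 in PH43 should have produced the expected count.
incremented = pre_increment + 1

# How much data went missing, in bytes and in words.
missing_bytes = expected_count - observed_count
missing_words = missing_bytes // BYTES_PER_WORD

# Which bit of the count was lost.
mask = dropped_bits(expected_count, observed_count)
lost_bit = msb_first_bit_number(mask)

print(f"E+1 = .{incremented:02X}")
print(f"missing: {missing_bytes} bytes = {missing_words} words")
print(f"dropped mask .{mask:02X}, bit {lost_bit} (MSB-first)")
```

Under these assumptions the shortfall is 128 bytes, or 32 words, which falls within the 30-35 word hole size reported on-site. The lost bit is bit 0 of the count, which is not among bits 4-7 carried by the module in 21K.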