Appendix A. Reliability, availability, and serviceability functions 243
How to repair permanent errors: concurrent repair/maintenance
The z900 provides the capability to make many changes to the platform nondisruptively. For
example, you can do the following while the system is in production:
– Replace hardware components
– Install microcode fixes
– Make software fixes to your systems
The design of z900 includes concurrent repair of the hardware and microcode. This enables
faulty hardware or microcode to be replaced while the server is up and running. There is no
customer involvement, other than approval for the action, and no impact on running
applications. Field data shows that more than 80% of the repairs are performed concurrently.
The concurrently maintainable hardware on z900 is:
– Cryptographic Co-Processors
– External Timer Reference (ETR)/Oscillator ports
– All channels
– Hardware Management Consoles
– Support Elements with the auto switchover function
– Power supplies, cooling units, AC inputs, internal batteries
Licensed Internal Code
The z900 servers can be maintained at the latest LIC level to provide you with the most
current set of problem corrections and the newest functions. Most LIC repairs are
designed for concurrent installation and activation.
A.2 RAS functions of the processor
Transparent CP, ICF, IFL and SAP sparing
The z900 servers have implemented full transparent sparing for PUs. This function enables
the hardware to activate a spare PU to replace a failed PU with no involvement from the
operating system or the customer, while preserving the application that was running at the
time of the error.
Transparent CP/ICF/IFL sparing
For all z900 servers, CP/ICF/IFL sparing is transparent in all modes of operation and requires
no operator intervention to invoke a new CP/ICF/IFL. The Coupling Facility Model 100 also
supports sparing for all ICF features.
1. Not all parts can be repaired concurrently (while the system is powered on). For
example, a MultiChip Module (MCM), memory chips, CAP/STI cards, I/O cages, Fast
Internal Bus Buffer (FIBB) cards, and Channel Driver (CHA) cards are not concurrently repairable.
2. Major LIC releases (drivers), which contain not only the latest corrections but also new
function, require new activation.
3. Partial Restart: The system may be powered off and re-activated via an operator
command with one PU cluster, half memory, or partial I/O. Planning of proper fencing is
necessary to remove failed components concurrently from the configuration.
244 IBM eServer zSeries 900 Technical Guide
This feature enables you to bring a spare PU online with the same CP number and without
Example of the transparent CP sparing
Figure A-1 shows the failure of PU03 (CP04): CP04 is recovered on spare PU0E, which takes
over the CP04 assignment, and PU03 is no longer available.
Figure A-1 Example of transparent CP sparing: z900 Model 107 CP failure
Dynamic SAP sparing/reassignment
Dynamic recovery is provided for failure of the System Assist Processor (SAP). If a SAP fails
and a spare PU is available, the spare PU will be dynamically activated as a new SAP in most
cases. If no spare PU is available and a master SAP fails, an active CP is reassigned
as a SAP.
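The decision described above can be sketched in a few lines of Python. This is an illustrative model only, not the z900 implementation: the `Server` class, the `recover_sap` function, and the PU names are assumptions introduced for this example. The text does not say what happens when a non-master SAP fails with no spare available, so the sketch simply returns `None` in that case.

```python
from dataclasses import dataclass, field


@dataclass
class Server:
    """Toy model of PU roles; every name here is hypothetical."""
    spares: list = field(default_factory=list)  # idle spare PUs
    cps: list = field(default_factory=list)     # PUs currently acting as CPs
    saps: list = field(default_factory=list)    # PUs currently acting as SAPs
    master_sap: str = ""


def recover_sap(server, failed_pu):
    """Return the PU that takes over the failed SAP's role, or None."""
    was_master = failed_pu == server.master_sap
    server.saps.remove(failed_pu)
    if server.spares:
        replacement = server.spares.pop(0)      # preferred: activate a spare PU
    elif was_master and server.cps:
        replacement = server.cps.pop()          # no spare: reassign an active CP
    else:
        return None                             # case not described in the text
    server.saps.append(replacement)
    if was_master:
        server.master_sap = replacement
    return replacement
```

In this model a spare is always preferred, and an active CP is given up only when the failed SAP was the master SAP and no spare remains.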
The flow of error detection and recovery
If a CP error is detected, the operation will be retried. If it continues to fail, the CP will be
checkstopped, and the z900 instruction environment will be saved. If a spare PU is available,
the spare will get the CP number of the failed one and be taken online. The instruction
environment will be restored on the new CP. Processing continues without operator
intervention. This flow is shown in Figure A-2.
CP Sparing Flow
1. PU03 (CP04) fails
2. Error Detection
3. Spare PU0E assigned
4. Error Recovery: restart application on PU0E
Figure A-2 The flow of error detection and recovery
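The retry, checkstop, and sparing sequence can be modeled as a small Python sketch. It is a simplified illustration under stated assumptions, not the hardware logic: a hardware error is represented by a `RuntimeError`, the number of retries (`MAX_RETRIES`) is an assumption because the text says only that the operation is retried, and the names `run_with_sparing` and `increment_x` are hypothetical.

```python
MAX_RETRIES = 1  # assumption: the text does not state how many retries occur


def run_with_sparing(operation, cp_number, pu_map, spares, environment):
    """Run `operation` on the PU backing `cp_number`, sparing on persistent failure.

    pu_map maps CP numbers to PU ids; spares is a list of idle PU ids.
    Returns (result, id_of_the_pu_that_produced_it).
    """
    pu = pu_map[cp_number]
    for _ in range(1 + MAX_RETRIES):
        try:
            return operation(pu, environment), pu
        except RuntimeError:
            continue                       # error detected: retry on the same PU
    saved = dict(environment)              # checkstop the CP, save its environment
    if not spares:
        raise RuntimeError(f"{cp_number} checkstopped and no spare PU is available")
    spare = spares.pop(0)
    pu_map[cp_number] = spare              # the spare takes the failed CP's number
    return operation(spare, saved), spare  # restore the environment and continue


def increment_x(pu, env):
    """Hypothetical workload that fails whenever it runs on the broken PU03."""
    if pu == "PU03":
        raise RuntimeError("hardware error")
    return env["x"] + 1
```

Calling `run_with_sparing(increment_x, "CP04", {"CP04": "PU03"}, ["PU0E"], {"x": 41})` mirrors the Figure A-1 scenario: the work fails on PU03, CP04 is remapped to PU0E, and the operation completes there without the caller being involved.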
Error detection: Dual execution with compare
The PU consists of two completely duplicated Instruction/Execution (I/E) units, a Level 1 (L1)
cache, and a register unit (R-unit) (see Figure A-3).
Figure A-3 Dual execution with Compare
The R-unit contains the compare circuitry and the ECC-protected checkpoint arrays
containing all of the critical architectural facilities, including register contents and instruction
addresses. At the completion of every instruction, the results produced by both I/E units are
compared and, if equal, the results of the instruction are checkpointed for recovery in case the
next instruction fails.
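The compare-and-checkpoint principle can be illustrated with a short Python sketch. It is a conceptual model only: real hardware compares results in the R-unit circuitry, whereas here the two I/E units are ordinary functions, and the names `DualExecutionPU`, `good_unit`, and `faulty_unit` are assumptions made for this example.

```python
import copy


class DualExecutionPU:
    """Toy model: two execution units, a comparator, and one checkpoint."""

    def __init__(self, registers):
        self.checkpoint = copy.deepcopy(registers)  # last known-good state

    def step(self, instruction, unit_a, unit_b):
        """Execute one instruction on both units; commit only if results match."""
        result_a = unit_a(instruction, copy.deepcopy(self.checkpoint))
        result_b = unit_b(instruction, copy.deepcopy(self.checkpoint))
        if result_a == result_b:
            self.checkpoint = result_a              # checkpoint the new state
            return True
        return False                                # mismatch: checkpoint untouched


def good_unit(instruction, registers):
    """Hypothetical I/E unit: adds a value to the named register."""
    name, value = instruction
    registers[name] += value
    return registers


def faulty_unit(instruction, registers):
    """Hypothetical failing I/E unit: flips the low bit of register r1."""
    registers = good_unit(instruction, registers)
    registers["r1"] ^= 1
    return registers
```

Because the checkpoint is replaced only when both results agree, a mismatch leaves the last good state intact, and recovery can restart from it.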
Figure A-2 labels: z900 Hardware Actions; Save the CP's z900; Restore z900 Environment
Figure A-3 labels: I/E Unit-A; I/E Unit-B
I Unit: component to fetch and decode instructions
E Unit: instruction-execution element
R Unit: ECC-protected