Appendix A. Reliability, availability, and serviceability functions 243
How to repair permanent errors: concurrent repair/maintenance
The z900 provides the capability to make many changes to the platform nondisruptively. For
example, you can do the following while the system is in production:
• Replace hardware components
• Install microcode fixes
• Make software fixes to your systems
The z900 design includes concurrent repair of hardware and microcode, which
enables faulty hardware or microcode to be replaced while the server is up and
running. There is no customer involvement, other than approval for the action,
and no impact on running applications. Field data shows that more than 80% of
repairs are performed concurrently.
Hardware components
The following hardware on the z900 is concurrently maintainable:
• Cryptographic Co-Processors
• External Timer Reference (ETR)/Oscillator ports
• All channels
• Hardware Management Consoles
• Support Elements with the auto switchover function
• Power supplies, cooling units, AC inputs, internal batteries
Licensed Internal Code
The z900 servers can be maintained at the latest LIC level to provide you with the most
current set of problem corrections and the newest functions. Most LIC repairs are
designed for concurrent installation and activation.
A.2 RAS functions of the processor
Transparent CP, ICF, IFL and SAP sparing
The z900 servers have implemented full transparent sparing for PUs. This function enables
the hardware to activate a spare PU to replace a failed PU with no involvement from the
operating system or the customer, while preserving the application that was running at the
time of the error.
Transparent CP/ICF/IFL sparing
For all z900 servers, CP/ICF/IFL sparing is transparent in all modes of operation and requires
no operator intervention to invoke a new CP/ICF/IFL. The Coupling Facility Model 100 also
supports sparing for all ICF features.
Note:
1. Not all parts can be repaired concurrently (while the system is powered on). For
example, a MultiChip Module (MCM), memory chips, CAP/STI cards, I/O cages, Fast
Internal Bus Buffer (FIBB) cards, and Channel Driver (CHA) cards are not
hot-pluggable.
2. Major LIC releases (drivers), which contain new function as well as the latest
corrections, require a new activation.
3. Partial Restart: The system may be powered off and re-activated via an operator
command with one PU cluster, half memory, or partial I/O. Planning of proper fencing is
necessary to remove failed components concurrently from the configuration.
244 IBM eServer zSeries 900 Technical Guide
This feature enables you to bring a spare PU online with the same CP number and without
operator intervention.
Example of the transparent CP sparing
Figure A-1 shows the failure of PU03 (CP04). CP04 is recovered on spare PU0E, which
is then assigned as CP04, and PU03 is no longer available.
Figure A-1 Example of transparent CP sparing: z900 Model 107 CP failure
Dynamic SAP sparing/reassignment
Dynamic recovery is provided for failure of a System Assist Processor (SAP). If a SAP fails
and a spare PU is available, the spare PU is dynamically activated as a new SAP in most
cases. If no spare PU is available and a master SAP fails, an active CP is reassigned
as a SAP.
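The SAP recovery decision described above can be sketched in a few lines of Python. This is an illustrative model only, not the z900 Licensed Internal Code: the function name `recover_sap`, the role strings, and the dictionary layout are assumptions made for the example. The choice of which spare or which CP is taken is arbitrary here (lowest PU id first), and the case the text does not describe (a non-master SAP failing with no spare) is left without action.

```python
def recover_sap(roles, failed_pu, is_master):
    """Return the PU id that takes over the SAP role, or None.

    roles maps a PU id (for example "PU05") to one of the hypothetical
    role strings "CP", "SAP", "SPARE" or "FAILED". The dict is updated
    in place.
    """
    if roles.get(failed_pu) != "SAP":
        raise ValueError(f"{failed_pu} is not a SAP")
    roles[failed_pu] = "FAILED"

    # Preferred path: activate a spare PU as the new SAP.
    for pu in sorted(roles):
        if roles[pu] == "SPARE":
            roles[pu] = "SAP"
            return pu

    # No spare left: a master SAP failure takes over an active CP.
    if is_master:
        for pu in sorted(roles):
            if roles[pu] == "CP":
                roles[pu] = "SAP"
                return pu

    # Not covered by the text; this sketch takes no action.
    return None


# With a spare available, the spare becomes the new SAP.
roles = {"PU00": "SPARE", "PU01": "CP", "PU05": "SAP"}
new_sap = recover_sap(roles, "PU05", is_master=True)
```

In this usage, `new_sap` is the spare PU and the failed PU is marked as failed. Removing the spare from `roles` would instead cause the active CP to be reassigned, but only when `is_master` is true.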
The flow of error detection and recovery
If a CP error is detected, the operation will be retried. If it continues to fail, the CP will be
checkstopped, and the z900 instruction environment will be saved. If a spare PU is available,
the spare will get the CP number of the failed one and be taken online. The instruction
environment will be restored on the new CP. Processing continues without operator
intervention. This flow is shown in Figure A-2.
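The sequence above (retry, checkstop, save, spare, restore) can be modeled as a small state machine. The following Python sketch is a simplified illustration under stated assumptions, not the actual hardware or microcode logic: the `Server` class, the `handle_cp_error` function, the single-retry limit, the order in which spares are chosen, and the contents of the instruction environment are all hypothetical.

```python
from dataclasses import dataclass, field

MAX_RETRIES = 1  # assumption: the text says only that the operation is retried


@dataclass
class Server:
    cp_on_pu: dict   # logical CP number -> physical PU id
    env_on_pu: dict  # PU id -> instruction environment (placeholder dict)
    spares: list     # spare PU ids, in an assumed order of preference
    failed: set = field(default_factory=set)


def handle_cp_error(server, cp, retry_succeeds):
    """Return the PU id the CP runs on after recovery, or None.

    retry_succeeds is a callable standing in for re-running the failed
    operation; it returns True if the retry works.
    """
    pu = server.cp_on_pu[cp]

    # 1. Retry the failing operation.
    for _ in range(MAX_RETRIES):
        if retry_succeeds():
            return pu  # transient error, nothing else to do

    # 2. Persistent error: checkstop the CP and save its environment.
    saved_env = server.env_on_pu.pop(pu)
    server.failed.add(pu)
    del server.cp_on_pu[cp]

    # 3. Without a spare PU, this sketch stops here.
    if not server.spares:
        return None

    # 4. The spare takes the failed CP's number, comes online, and
    #    continues from the restored environment.
    spare = server.spares.pop(0)
    server.cp_on_pu[cp] = spare
    server.env_on_pu[spare] = saved_env
    return spare


# Mirrors Figure A-1: CP04 runs on PU03, and PU0E is the spare chosen.
server = Server(
    cp_on_pu={"CP04": "PU03"},
    env_on_pu={"PU03": {"psw": 0x1000, "regs": [0] * 16}},  # placeholder values
    spares=["PU0E", "PU00", "PU0B"],
)
new_pu = handle_cp_error(server, "CP04", retry_succeeds=lambda: False)
```

After the call, CP04 is mapped to the spare PU and the saved environment has moved with it, so the logical CP number seen by the operating system does not change.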
[Figure A-1 content, recovered from the diagram]
CP sparing flow:
1. PU03 (CP04) fails
2. Error detection
3. Spare PU0E assigned as CP04
4. Error recovery: restart application on PU0E

PU assignments before and after the failure:
PU     Before   After
PU00   Spare    Spare
PU01   CP01     CP01
PU02   CP02     CP02
PU03   CP04     - (failed)
PU04   CP06     CP06
PU05   SAP      SAP
PU0A   CP00     CP00
PU0B   Spare    Spare
PU0C   CP03     CP03
PU0D   CP05     CP05
PU0E   Spare    CP04
PU0F   XSAP     XSAP

Legend: CPyy = assigned CP, SAP = assigned (X)SAP, Spare = spare PU, - = failed PU,
CEx = Crypto Element (CE0, CE1).
Figure A-2 The flow of error detection and recovery
Error detection: Dual execution with compare
The PU consists of two completely duplicated Instruction/Execution (I/E) units, a Level 1 (L1)
cache, and a register unit (R-unit) (see Figure A-3).
Figure A-3 Dual execution with Compare
The R-unit contains the compare circuitry and the ECC-protected checkpoint arrays
containing all of the critical architectural facilities, including register contents and instruction
addresses. At the completion of every instruction, the results produced by both I/E units are
compared and, if equal, the results of the instruction are checkpointed for recovery in case the
next instruction fails.
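The compare-and-checkpoint idea can be illustrated in software. The Python sketch below is an analogy under stated assumptions, not a description of the actual circuitry: the class `DualExecutionPU`, the callables standing in for the two I/E units, the tuple-based instruction format, and the injected bit flip are all hypothetical. It shows only the principle that a result is committed and checkpointed when both units agree, and that a mismatch falls back to the last checkpoint so the instruction can be retried.

```python
import copy


class DualExecutionPU:
    def __init__(self, state):
        self.state = state
        # Stand-in for the R-unit checkpoint arrays.
        self.checkpoint = copy.deepcopy(state)

    def execute(self, instruction, unit_a, unit_b):
        """Run the instruction on both units; commit only if they agree."""
        result_a = unit_a(instruction, copy.deepcopy(self.state))
        result_b = unit_b(instruction, copy.deepcopy(self.state))
        if result_a == result_b:
            self.state = result_a
            self.checkpoint = copy.deepcopy(result_a)
            return True
        # Mismatch: discard both results and restore the last checkpoint.
        self.state = copy.deepcopy(self.checkpoint)
        return False


def good_unit(instruction, state):
    """A correctly working unit for a toy ("add", register, value) instruction."""
    op, reg, value = instruction
    if op == "add":
        state[reg] += value
    return state


def faulty_unit(instruction, state):
    """Same as good_unit, but with an injected single-bit error in r1."""
    state = good_unit(instruction, state)
    state["r1"] ^= 1
    return state


pu = DualExecutionPU({"r1": 0})
first = pu.execute(("add", "r1", 5), good_unit, good_unit)     # units agree
second = pu.execute(("add", "r1", 2), good_unit, faulty_unit)  # units differ
retry = pu.execute(("add", "r1", 2), good_unit, good_unit)     # retry succeeds
```

In this run the second instruction is rejected because the two results differ, the state stays at the checkpoint taken after the first instruction, and the retry then completes normally.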
[Figure A-2 content, recovered from the diagram: z900 hardware actions after a failure of PU03 (CP04)]
1. Error detection on PU03 (CP04)
2. Error correction: checkpoint retry
3. CP checkstop and logout
4. Application Preservation: save the CP's z900 instruction environment
5. Sparing: PU0E becomes CP04 (configure online, restore z900 environment, dispatch processor)
Diagram labels: Enhanced Application Preservation, Recovery.
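[Figure A-3 content, recovered from the diagram: a z900 PU with I/E Unit A, I/E Unit B, an L1 cache,
and an R unit containing the comparator, connected to the L2 cache. The comparator checks the two
results: if they are the same, the result is OK; if they differ, it is not OK and the operation is retried.]
I Unit: component to fetch and decode instructions
E Unit: instruction-execution element
R Unit: ECC-protected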