IBM System Blue Gene Solution: Blue Gene/Q System Administration

Book description

This IBM® Redbooks® publication is one in a series of books that are written specifically for the IBM System Blue Gene® supercomputer, Blue Gene/Q®, which is the third generation of massively parallel supercomputers from IBM in the Blue Gene series. This book provides an overview of the system administration environment for Blue Gene/Q. It is intended to help administrators understand the tools that are available to maintain this system.

This book details Blue Gene Navigator, which has grown to be a full-featured, web-based system administration tool on Blue Gene/Q. The book also describes many of the day-to-day administrative functions, such as running diagnostics, performing service actions, and monitoring hardware. Other sections cover BGmaster and the Control System processes that it monitors.

This book is intended for Blue Gene/Q system administrators. It helps them use the tools that are available to maintain the Blue Gene/Q system.

Table of contents

  1. Front cover
  2. Notices
    1. Trademarks
  3. Preface
    1. The team who wrote this book
    2. Now you can become a published author, too!
    3. Comments welcome
    4. Stay connected to IBM Redbooks
  4. Chapter 1. IBM Blue Gene/Q system overview
    1. 1.1 Blue Gene/Q hardware overview
    2. 1.2 Blue Gene/Q software overview
      1. 1.2.1 Compute Node Kernel and services
      2. 1.2.2 I/O node kernel and services
      3. 1.2.3 Application development and debugging tools
      4. 1.2.4 System administration and management
  5. Chapter 2. Navigator
    1. 2.1 System setup
    2. 2.2 Using the Navigator
      1. 2.2.1 The Navigator window
      2. 2.2.2 Security integration
      3. 2.2.3 Result tables
      4. 2.2.4 System summary
      5. 2.2.5 Alerts
      6. 2.2.6 Jobs and job history
      7. 2.2.7 Blocks and I/O blocks
      8. 2.2.8 RAS
      9. 2.2.9 Environmentals
      10. 2.2.10 Hardware
      11. 2.2.11 Hardware replacements
      12. 2.2.12 Diagnostics
      13. 2.2.13 Service actions
      14. 2.2.14 Performance monitoring
      15. 2.2.15 Knowledge Center
      16. 2.2.16 Creating blocks with the Block Builder
  6. Chapter 3. Compute and I/O blocks
    1. 3.1 Compute blocks
      1. 3.1.1 Large blocks
      2. 3.1.2 Small blocks
    2. 3.2 Creating compute blocks
    3. 3.3 I/O blocks
    4. 3.4 Creating I/O blocks
    5. 3.5 I/O links and ratios
    6. 3.6 Deleting blocks
    7. 3.7 Block status transitions
  7. Chapter 4. Configuring I/O nodes
    1. 4.1 I/O node kernel
      1. 4.1.1 Building new Linux kernels
    2. 4.2 I/O node run time
      1. 4.2.1 Installing Red Hat Enterprise Linux
      2. 4.2.2 Installing additional packages
      3. 4.2.3 Creating Blue Gene/Q Linux distribution sandboxes
    3. 4.3 I/O adapters
      1. 4.3.1 Supported adapters
      2. 4.3.2 Firmware updates
    4. 4.4 I/O configuration
      1. 4.4.1 Defining interfaces and IP addresses
      2. 4.4.2 Interface bonding configuration
      3. 4.4.3 Environment variables using /dev/bgpers
      4. 4.4.4 Site customization
      5. 4.4.5 Customizing ramdisk
      6. 4.4.6 The bgras command
    5. 4.5 Boot sequence
      1. 4.5.1 Persistent file systems
    6. 4.6 Maintenance
      1. 4.6.1 Installing RHEL updates
      2. 4.6.2 Installing a new RHEL dot release
      3. 4.6.3 Node Health Monitor
      4. 4.6.4 Core files
    7. 4.7 Common I/O services
      1. 4.7.1 Configuration
      2. 4.7.2 Jobs directory
    8. 4.8 NFS configuration
    9. 4.9 Troubleshooting
      1. 4.9.1 NFS errors
  8. Chapter 5. Control System console
    1. 5.1 Overview
      1. 5.1.1 Accessing help
      2. 5.1.2 Command history
    2. 5.2 Commands
    3. 5.3 Special commands
    4. 5.4 Command parameters
    5. 5.5 Scripting
  9. Chapter 6. Submitting jobs
    1. 6.1 The runjob architecture
    2. 6.2 The runjob command
      1. 6.2.1 The runjob options
      2. 6.2.2 Scheduler
    3. 6.3 Sub-block jobs
    4. 6.4 Signal handling
      1. 6.4.1 Exit status
      2. 6.4.2 Normal job termination
      3. 6.4.3 Abnormal job termination
    5. 6.5 Job authority
      1. 6.5.1 The grant_job_authority command
      2. 6.5.2 The revoke_job_authority command
      3. 6.5.3 The list_job_authority command
    6. 6.6 The kill_job command
    7. 6.7 Job status
      1. 6.7.1 The list_jobs command
      2. 6.7.2 The job_status command
    8. 6.8 Historical perspective of job submission on Blue Gene
  10. Chapter 7. Reliability, availability, and serviceability
    1. 7.1 Elements of a RAS message
      1. 7.1.1 Message ID and component
      2. 7.1.2 Category
      3. 7.1.3 Severity
      4. 7.1.4 Message
      5. 7.1.5 Description
      6. 7.1.6 Service action
      7. 7.1.7 Diagnostics suites
      8. 7.1.8 Control Action
    2. 7.2 Tailoring RAS messages
    3. 7.3 Viewing Blue Gene/Q defined RAS events
    4. 7.4 Viewing sent RAS events
  11. Chapter 8. Toolkit for Event Analysis and Logging
    1. 8.1 TEAL framework
      1. 8.1.1 Connector
      2. 8.1.2 Event analyzers
      3. 8.1.3 Plug-ins
    2. 8.2 Location reporting
    3. 8.3 Integration with BGmaster
    4. 8.4 Alerts and service actions
    5. 8.5 Installation
  12. Chapter 9. Service actions
    1. 9.1 Service actions overview
    2. 9.2 Service action commands
      1. 9.2.1 InstallServiceAction
      2. 9.2.2 VerifyCables
    3. 9.3 Hardware maintenance
      1. 9.3.1 ServiceNodeDCA
      2. 9.3.2 ServiceNodeBoard
      3. 9.3.3 ServiceIoDrawer
      4. 9.3.4 ServiceMidplane
      5. 9.3.5 ServiceBulkPowerModule
      6. 9.3.6 ServiceRack
      7. 9.3.7 ServiceClockCard
    4. 9.4 Deciding which service action command to use
    5. 9.5 Preparing a service action
    6. 9.6 Ending a service action
    7. 9.7 Service action logs
    8. 9.8 Cycling power
  13. Chapter 10. Diagnostics
    1. 10.1 System diagnostics
      1. 10.1.1 Requirements
    2. 10.2 Diagnostic test cases
    3. 10.3 Using the Navigator diagnostics interface
    4. 10.4 Running diagnostics
    5. 10.5 Preventive maintenance
    6. 10.6 Running diagnostics through a scheduler
      1. 10.6.1 Running diagnostics from a LoadLeveler job
    7. 10.7 Diagnostics performance considerations
  14. Chapter 11. BGmaster
    1. 11.1 BGmaster
    2. 11.2 Running bgmaster_server
      1. 11.2.1 The master_start command
      2. 11.2.2 The master_status command
      3. 11.2.3 The binary_status command
      4. 11.2.4 The bgmaster_server_refresh_config command
      5. 11.2.5 The binary_wait command
      6. 11.2.6 The alias_wait command
      7. 11.2.7 The list_agents command
      8. 11.2.8 The list_clients command
      9. 11.2.9 The master_stop command
      10. 11.2.10 The get_errors command
      11. 11.2.11 The get_history command
      12. 11.2.12 The monitor_master command
    3. 11.3 The bgagentd system process watcher
      1. 11.3.1 Starting bgagentd
      2. 11.3.2 Stopping bgagentd
    4. 11.4 Configuration
      1. 11.4.1 Sample configuration
      2. 11.4.2 BGmaster network configuration
    5. 11.5 Troubleshooting
      1. 11.5.1 Network issues
      2. 11.5.2 RAS messages
  15. Chapter 12. Midplane Management Control System server
    1. 12.1 MMCS components
      1. 12.1.1 MMCS server handles requests from clients
      2. 12.1.2 Blocks are defined in the database
      3. 12.1.3 MC server communicates with the Blue Gene/Q hardware
    2. 12.2 Starting and stopping MMCS server
    3. 12.3 Configuration
      1. 12.3.1 Configuration file options
      2. 12.3.2 Command-line options
    4. 12.4 Mailbox output
    5. 12.5 Environmental polling
      1. 12.5.1 How the environmental monitor works
      2. 12.5.2 Polling intervals
      3. 12.5.3 Data collection and RAS
      4. 12.5.4 Location-specific monitoring
      5. 12.5.5 RAS events
  16. Chapter 13. Machine controller server
    1. 13.1 Machine controller server
    2. 13.2 Starting and stopping MC server
    3. 13.3 Configuration
      1. 13.3.1 Configuration file options
      2. 13.3.2 Command-line options
    4. 13.4 The mc_server_log_level command
  17. Chapter 14. The runjob server and runjob mux
    1. 14.1 The runjob architecture
    2. 14.2 The runjob_server
      1. 14.2.1 Configuration file options
      2. 14.2.2 Command-line options
    3. 14.3 The runjob_mux
      1. 14.3.1 Configuration file options
      2. 14.3.2 Command-line options
    4. 14.4 Utilities
      1. 14.4.1 Configuration options
      2. 14.4.2 The runjob_server_status command
      3. 14.4.3 The runjob_server_log_level command
      4. 14.4.4 The runjob_mux_status command
      5. 14.4.5 The runjob_mux_log_level command
  18. Chapter 15. The Blue Gene Web Services server
    1. 15.1 Configuring and running the BGWS server
    2. 15.2 Configuration
      1. 15.2.1 Configuration file options
      2. 15.2.2 Command-line options
    3. 15.3 Utilities
      1. 15.3.1 Configuration
      2. 15.3.2 The bgws_server_status command
      3. 15.3.3 The bgws_server_refresh_config command
      4. 15.3.4 The bgws_server_log_level command
  19. Chapter 16. Real-time server
    1. 16.1 Real-time server
    2. 16.2 Configuration
      1. 16.2.1 Configuration file options
      2. 16.2.2 Command-line options
    3. 16.3 Security
    4. 16.4 Utilities
      1. 16.4.1 Configuration
      2. 16.4.2 The realtime_server_status command
      3. 16.4.3 The realtime_server_log_level command
    5. 16.5 Requirements
      1. 16.5.1 Access to the database transaction logs
      2. 16.5.2 Database configuration
  20. Chapter 17. Distributed Control System
    1. 17.1 Overview
    2. 17.2 Distributed Control System software
    3. 17.3 Hardware configurations
      1. 17.3.1 Multiple SubnetMc processes on a single service node
      2. 17.3.2 Multiple subnet service nodes
    4. 17.4 SubnetMc configuration
  21. Chapter 18. Security
    1. 18.1 Object based model
    2. 18.2 Network communication security
      1. 18.2.1 Certificates and keys
      2. 18.2.2 Handshake
      3. 18.2.3 Configuration
      4. 18.2.4 Certificate file permissions
      5. 18.2.5 Generating security keys and certificates
    3. 18.3 Database security
  22. Chapter 19. Database
    1. 19.1 Database overview
    2. 19.2 Database configuration
    3. 19.3 Database maintenance
      1. 19.3.1 Purging old data from tables
      2. 19.3.2 Optimizing with indexes
  23. Chapter 20. Logging
    1. 20.1 Apache log4j and log4cxx
    2. 20.2 Log file format
    3. 20.3 Log merge utility
      1. 20.3.1 Configuration
      2. 20.3.2 Usage
      3. 20.3.3 Time stamp format
    4. 20.4 Configuration
    5. 20.5 Log rotation
      1. 20.5.1 Control System logs
      2. 20.5.2 I/O node logs
      3. 20.5.3 Subnet service node logs
      4. 20.5.4 Log maintenance
  24. Chapter 21. The bg.properties file
    1. 21.1 The bg.properties file overview
    2. 21.2 Property file validation
  25. Chapter 22. The Coreprocessor tool
    1. 22.1 Usage and dependencies
    2. 22.2 Starting the Coreprocessor tool
    3. 22.3 Debugging live compute node problems
    4. 22.4 Saving your information
    5. 22.5 Debugging live I/O node problems
    6. 22.6 Debugging core files
  26. Appendix A. Hardware location naming conventions
    1. Letter designation reference
    2. Hardware location naming convention
  27. Appendix B. Control System simulator
    1. Setting up
    2. Starting and stopping servers
    3. Creating and booting I/O blocks and compute blocks
    4. Running jobs
    5. Simulator cleanup
  28. Related publications
    1. IBM Redbooks
    2. Other publications
    3. Online resources
    4. How to get Redbooks
    5. Help from IBM
  29. Back cover

Product information

  • Title: IBM System Blue Gene Solution: Blue Gene/Q System Administration
  • Author(s): Gary Lakner, Brant Knudson
  • Release date: May 2013
  • Publisher(s): IBM Redbooks
  • ISBN: None