IBM Journal of Research and Development
Volume 51, Number 1/2, Page 131 (2007)
IBM System z9

Enhanced I/O subsystem recovery and availability on the IBM System z9

by K. J. Oakes, U. Helmich, A. Kohler, A. W. Piechowski, M. Taubert, J. S. Trotter, J. von Buttlar, R. M. Whalen, Jr.
Although part of the IBM System z™ strategy is to improve design and development processes so that errors do not escape to the field, improving recovery is another element of that strategy: it keeps a machine up and running should an error occur. The z9™ continues on an evolutionary path of enhancing I/O subsystem (IOSS) recovery to further advance the reliability, availability, and serviceability (RAS) of System z platforms. This paper first presents an overview of recovery as it existed before the z9 and of how it interacts with other RAS functions, such as error-detection mechanisms in hardware, including automatic identification and recovery of failing elements. It then presents the innovations to IOSS recovery and error detection in the z9 that further improve machine availability. It describes the recovery infrastructure, which significantly reduces recovery time and makes recovery much less dependent on machine scaling for this and future generations of System z servers. It also describes innovative uses of this new infrastructure: improved detection of elusive firmware problems seen in prior machines, the ability to detect and recover from firmware hangs or lockups related to control blocks inadvertently left locked, and the capability for multiple system-assist processors to perform recovery in parallel.
Related Subjects: Computer system availability; Error control and recovery; IBM System z9; Reliability