Fallout from August 2013 Power Outage


Problem Impact Long term plans Short term actions
management switch plugged into wrong subnet
cluster members unable to fence each other to regain quorum, and admins unable to access server consoles
move IPMI interface to CESR SAN Subnet to reduce external dependencies
clear labels on switch ports
port reconfigured for correct subnet
name services dependent on functioning cluster
name resolution delays or failures
deploy standalone DNS server in PSB
NIS dependent on functioning cluster
users, groups, and file systems unavailable
replace NIS services with LDAP for SL6 clients
deploy standalone DNS server in PSB
NIS client functionality depends on functioning NIS server at boot
users, groups, and file systems unavailable until ypbind is restarted
replace NIS services with LDAP for SL6 clients
investigate options for improving ypbind startup procedures and behavior
Cluster recovery procedure is labor-intensive and intellectually demanding
long recovery time from cluster failures
upgrade cluster protocols to make full use of the virtual synchrony guarantees of the underlying protocol stack (possibly only in SL7)
test and produce procedure for quick cluster recovery
reproduce failures and submit reports to cluster developers and upstream vendor
CLASSE Kerberos tickets from offsite (Red Rover) depend on clustered VM
unable to log in to CLASSE systems from offsite in certain cases
open up ports to allow Kerberos tickets from any domain controller
Console access to servers over IPMI depends on a web browser, Java, and the network management subnet
blocked or slow access to server consoles
enable console redirection on servers to access console over serial in addition to IPMI
Even number of CLASSE cluster members increases likelihood of split-brain (dissolved quorum) in certain situations
cluster services and protocols blocked
add server to cluster to bring to odd number of members
Legacy name servers came up with corrupt zones
name resolution delays or failures
replace legacy servers
consider graceful failover of name services
DEC Alphas display Windows NT BIOS instead of VMS/Unix BIOS
can't boot that Alpha
replace VMS Alphas with Linux cluster members
replace motherboard battery, then use graphics BIOS to restore BIOS setting
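
The even-member split-brain item in the table above can be illustrated with a little arithmetic. The following is a minimal sketch, assuming one vote per cluster member and a simple strict-majority quorum rule; the actual CLASSE cluster may weight votes differently or use a tie-breaker, and the function names are hypothetical, chosen for illustration only.

```python
# Sketch only: assumes one vote per member and a strict-majority quorum rule.
# Function names are hypothetical and not part of any cluster software.

def quorum_threshold(total_votes: int) -> int:
    """Smallest number of votes that forms a strict majority."""
    return total_votes // 2 + 1


def worst_case_partition_has_quorum(members: int) -> bool:
    """Return True if, after the most even two-way network split,
    the larger partition still holds a strict majority."""
    larger_side = members - members // 2
    return larger_side >= quorum_threshold(members)


if __name__ == "__main__":
    for n in range(2, 8):
        status = "keeps quorum" if worst_case_partition_has_quorum(n) else "no quorum on either side"
        print(f"{n} members: need {quorum_threshold(n)} votes; even split -> {status}")
```

Under these assumptions, a cluster with an even member count can split into two equal halves, neither of which reaches the majority threshold, while an odd member count always leaves one side with quorum after a two-way split. That is the reasoning behind adding a server to reach an odd number of members.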


Problem Impact Long term plans Short term actions
DEC Console broken during W221 reorganization
unable to access DEC terminals
continue migration to Linux services and console servers
repair DEC terminal (removed failing 10 Mbit hub; cabled DECservers directly to network switch)
main console server inaccessible from KVM
unable to access consoles on servers
fix KVM cable run

