Fallout from August 2013 Power Outage

Tasks

| *Problem* | *Impact* | *Long term plans* | *Short term actions* |
| Cluster recovery procedure is labor-intensive and intellectually demanding | long recovery time from cluster failures | upgrade cluster protocols to make full use of the virtual synchrony guarantees of the underlying protocol stack (possibly only in SL7) | test and produce a procedure for quick cluster recovery; reproduce failures and submit reports to cluster developers and upstream vendor |
| DEC Alphas display Windows NT BIOS instead of VMS/Unix BIOS | affected Alpha cannot boot | replace VMS Alphas with Linux cluster members | replace motherboard battery, then use graphics BIOS to restore BIOS settings |
| NIS dependent on functioning cluster | users, groups, and file systems unavailable | replace NIS services with LDAP for SL6 clients | deploy standalone DNS server in PSB |
| NIS client functionality depends on functioning NIS server at boot | users, groups, and file systems unavailable until ypbind restarted | replace NIS services with LDAP for SL6 clients | investigate options for improving ypbind startup procedures and behavior |
| Legacy name servers came up with corrupt zones | name resolution delays or failures | replace legacy servers; consider graceful failover of name services |  |
| CLASSE Kerberos tickets from offsite (Red Rover) depend on clustered VM | unable to log in to CLASSE systems from offsite in certain cases | open up ports to allow Kerberos tickets from any domain controller |  |
| management switch plugged into wrong subnet | cluster members unable to fence each other to regain quorum, and admins unable to access server consoles | move IPMI interface to CESR SAN Subnet to reduce external dependencies; clear labels on switch ports | port reconfigured for correct subnet |
| Console access to servers over IPMI depends on web browser, Java, and network management subnet | blocked or slow access to server consoles | enable console redirection on servers to access console over serial in addition to IPMI |  |
| name services dependent on functioning cluster | name resolution delays or failures | deploy standalone DNS server in PSB |  |
| Even number of CLASSE cluster members increases likelihood of split-brain (dissolved quorum) in certain situations | cluster services and protocols blocked | add server to cluster to bring to odd number of members |  |
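The split-brain item above comes down to simple vote arithmetic, which the following minimal sketch illustrates. It assumes one vote per cluster member, a strict-majority quorum rule, and no tie-breaker such as a quorum disk; the function names are hypothetical and are not taken from the cluster software in use.

```python
from __future__ import annotations


def quorum_threshold(total_votes: int) -> int:
    """Smallest number of votes that forms a strict majority."""
    return total_votes // 2 + 1


def has_quorum(partition_votes: int, total_votes: int) -> bool:
    """True if a partition holding partition_votes keeps quorum."""
    return partition_votes >= quorum_threshold(total_votes)


def tied_splits(total_votes: int) -> list[tuple[int, int]]:
    """Two-way splits in which neither partition keeps quorum."""
    splits = []
    for a in range(1, total_votes // 2 + 1):
        b = total_votes - a
        if not has_quorum(a, total_votes) and not has_quorum(b, total_votes):
            splits.append((a, b))
    return splits


if __name__ == "__main__":
    # Assumed member counts, for illustration only.
    for members in (4, 5):
        print(members, quorum_threshold(members), tied_splits(members))
```

Under these assumptions, an even member count always has one two-way split (half and half) in which both sides lose quorum, while an odd count has none, which is the motivation for adding one server.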

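For the Kerberos item above, once ports have been opened it may help to confirm from an offsite host that each domain controller is actually reachable. The sketch below is one possible check, not an established procedure: the host names are placeholders, and it only tests TCP port 88 (the standard Kerberos port), so UDP reachability would need a separate test.

```python
import socket


def tcp_port_open(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name resolution failures.
        return False


if __name__ == "__main__":
    # Hypothetical domain controller names; replace with the real ones.
    kdc_hosts = ["dc1.example.org", "dc2.example.org"]
    for host in kdc_hosts:
        status = "reachable" if tcp_port_open(host, 88) else "NOT reachable"
        print(f"{host}:88 {status}")
```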
Complete

| *Problem* | *Impact* | *Long term plans* | *Short term actions* |

Topic revision: 07 Feb 2019, AdminDevinBougie