Fallout from August 2013 Power Outage


| *Problem* | *Impact* | *Long term plans* | *Short term actions* |
| Even number of CLASSE cluster members increases likelihood of split-brain (dissolved quorum) in certain situations | cluster services and protocols blocked | add server to cluster to bring to odd number of members | |
| Name services dependent on functioning cluster | name resolution delays or failures | deploy standalone DNS server in PSB | |
| Console access to servers over IPMI depends on web browser, Java, and network management subnet | blocked or slow access to server consoles | enable console redirection on servers to access console over serial in addition to IPMI | |
| Management switch plugged into wrong subnet | cluster members unable to fence each other to regain quorum, and admins unable to access server consoles | move IPMI interface to CESR SAN Subnet to reduce external dependencies | clear labels on switch ports; port reconfigured for correct subnet |
| CLASSE Kerberos tickets from offsite (Red Rover) depend on clustered VM | unable to log in to CLASSE systems from offsite in certain cases | open up ports to allow Kerberos tickets from any domain controller | |
| Legacy name servers came up with corrupt zones | name resolution delays or failures | replace legacy servers | consider graceful failover of name services |
| NIS dependent on functioning cluster | users, groups, and file systems unavailable | replace NIS services with LDAP for SL6 clients | deploy standalone DNS server in PSB |
| NIS client functionality depends on functioning NIS server at boot | users, groups, and file systems unavailable until ypbind restarted | replace NIS services with LDAP for SL6 clients | investigate options for improving ypbind startup procedures and behavior |
| DEC Alphas display Windows NT BIOS instead of VMS/Unix BIOS | unable to boot the affected Alpha | replace VMS Alphas with Linux cluster members | replace motherboard battery, then use graphics BIOS to restore BIOS setting |
| Cluster recovery procedure is labor-intensive and intellectually demanding | long recovery time from cluster failures | upgrade cluster protocols to make full use of the virtual synchrony guarantees of the underlying protocol stack (possibly only in SL7) | test and produce procedure for quick cluster recovery; reproduce failures and submit reports to cluster developers and upstream vendor |
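
The split-brain entry above (even number of cluster members) comes down to simple vote arithmetic. The following is a minimal sketch of that arithmetic, assuming one vote per member, a strict-majority quorum rule, and no quorum disk or other tie-breaker; none of these details are stated on this page, and the function names and the member counts 4, 5, and 6 are purely illustrative.

```python
def quorum_threshold(members: int) -> int:
    """Smallest vote count that is a strict majority of `members`.

    Assumption: one vote per member and no tie-breaker device.
    """
    return members // 2 + 1


def deadlocked_splits(members: int) -> list[tuple[int, int]]:
    """Return the two-way partitions (a, b), a <= b, where neither side has quorum."""
    need = quorum_threshold(members)
    splits = []
    for a in range(1, members // 2 + 1):
        b = members - a
        # The cluster keeps running only if one side still holds a strict majority.
        if a < need and b < need:
            splits.append((a, b))
    return splits


if __name__ == "__main__":
    for n in (4, 5, 6):
        print(n, "members, quorum needs", quorum_threshold(n),
              "votes, deadlocked splits:", deadlocked_splits(n))
```

Under these assumptions an even member count always has one partition (an even split) in which neither half can reach quorum, while an odd member count has none, which is the motivation for adding a server to reach an odd number of members.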


| *Problem* | *Impact* | *Long term plans* | *Short term actions* |
| Main console server inaccessible from KVM | unable to access consoles on servers | fix KVM cable run | |
| DEC Console broken during W221 reorganization | unable to access DEC terminals | continue migration to Linux services and console servers | repair DEC terminal (removed failing 10 Mbit hub; cabled DECservers directly to network switch) |

Topic revision: r4 - 07 Feb 2019, AdminDevinBougie