Computing Equipment and Staffing for CLEO Continuation
This page was used for initial discussions around CLEO Continuation. For a definitive list of nodes critical for CLEO Completion, please see CriticalSystems
cleocont.xls: System list initially based on the contents of this page.
EQUIPMENT
- Data in Eventstore RAID (hotstore only; includes constants in pds format)
- Database Servers
- CLEO-III
- Data16-29
- Some off-4S running for background evaluation
- CLEO-c
- MC in Eventstore RAID
- MC in RAID
- CLEO-III as pds files to be accessed via chains (it would be desirable to inject this into Eventstore if manpower can be found)
- Random triggers for all datasets as .bin files
- Tape backup (in robot) for all the above to guard against catastrophic RAID failure
- Linux batch farm
- Existing compute nodes maintained
- Existing tem disk space maintained
- Existing (and possible future) group RAID disk space maintained
- Library
- Lnx134 and lnx135 maintained
- CLEO-c and CLEO-III active library maintenance on Linux
- Access maintained to historical Solaris CLEO-III libraries in the "green zone"
- Constants (necessary for doing MC)
- Lnxcon slave constants database node maintained
- Solon2 master constants database node maintained in the "green zone" (This is needed if we ever want to update any constants)
- solcon - pass2 constants -> CLEO s/w coordinator
- CLEO-III signal MC generation nodes
- Several Solaris nodes in the "green zone"
- solfa1, solfa2, solfa3, solfa4 (solaris queue node)
- Document database
- Access to CBX notes and paper drafts, both historical and those in progress
- Tape robot storage of raw, warm store, and cold store data
- Retain access to CLEO-III raw data via Solaris/Objectivity if possible (but probably not)
- Miscellaneous nodes
- lnx768 - timeline server -> Ji Li (CLEO run management)
- lnx122 - hypernews server
- Solaris build systems
- sol300 (tem disk), sol518 (objectivity catalog - /cdat/cleo), sol303 ("testing phase")
Online
This is a preliminary list and proposal covering which CLEO online nodes need to be kept running after March 4th, and which nodes can be turned off or reused for other purposes.
See also this page for the post-CLEO-c DAQ setup:
https://wiki.lepp.cornell.edu/lepp/bin/view/CLEO/Private/RunMan/CLEODAQPostSetUp
Note:
- CesrTA folks might be interested in using PCs in the CLEO counting room for future machine studies. However, all such nodes would have to be re-imaged, re-configured, and relocated from the W215 rack into the counting room. Please consult with Mark Palmer.
- Once data taking has stopped, the constants server consolidation will happen immediately (this does not prevent future data taking or any use of the DAQ), freeing up solon1.
- All build systems and other cron-job tasks for CLEO online will stop after data taking ends.
CLEO nodes that need to be kept running after data taking stops, for as long as CLEO data will be analyzed:
- solon2 (solon1/2 operations will be consolidated and run off solon2 entirely) - W221
- lnxon14 (Web/E-Log, Linux build node for offline constants database support) - W221
- lnx196 - for reading raw AIT-2 and AIT-3 tapes (tape copying, one-time backups, data restores)
- solsda - (development for CLEO and ERL/CesrTA) - W221
- lnx768 - timeline server -> Ji Li (CLEO run management)
- Online switch hps05-p2 in W215 could be replaced by a smaller one and needs to move to W221. It is required for CLEO online subnet (192.168.2.0) support, i.e. for:
- solon2
- lnxon14
- c3pc{104,106,110b} - for CLEO cooling and gas systems
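As a sanity check when reconfiguring the replacement switch, membership in the 192.168.2.0/24 online subnet can be verified programmatically. This is only a sketch; the host addresses below are made up for illustration and are not the real CLEO node assignments.

```python
# Sketch: check whether node addresses fall inside the CLEO online subnet
# (192.168.2.0/24) that the replacement switch must continue to carry.
# The individual host addresses here are hypothetical.
import ipaddress

cleo_online = ipaddress.ip_network("192.168.2.0/24")
nodes = {
    "solon2": "192.168.2.10",    # hypothetical address
    "lnxon14": "192.168.2.14",   # hypothetical address
    "elsewhere": "10.0.0.5",     # clearly outside the online subnet
}
for name, addr in nodes.items():
    on_subnet = ipaddress.ip_address(addr) in cleo_online
    print(f"{name}: {on_subnet}")
```

Any node whose address tests False here would not be reachable through the consolidated online switch.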
Nodes that need to be kept running initially but can be retired during summer 2008:
- c3pc{104,106,110b} until ~July - can be retired when the CLEO drift chambers and the RICH are removed in July 2008. Check with Steve Gray in April 2008.
- c3pc102
- c3pc103
- c3pc107
- c3pc112
- pretty much all LCD monitors in the counting room (see note above)
- lnxon1 (until early summer 2008 at least) - W221
- solon1 (until constants management has been consolidated in March 2008)
- DR laptop
- lnxon12
- lnxon13 (note: has multiple Cyclades serial-line h/w, which we desperately need for Linux infrastructure servers, or possibly for CesrTA/ERL)
- lnxon15
- solon3
- solon4
- solon5
- solon6
- sol198
- sol201
- UPS system in W215
- solgc2 (ultra5)
Green zone
Some CLEO-III work requires the use of Solaris nodes. We would
retain five nodes, presumably the most reliable and fastest, in a
non-public area. This avoids having to make security updates to
the Solaris 8 OS, thus minimizing software maintenance. The five
nodes would provide enough hardware redundancy that emergency
repairs could be made. The library, constants, and MC signal
generation nodes would be in this restricted area, with only
authorized local personnel granted access. We anticipate that
setting up the five nodes will be a fair amount of work but hope
that once it is running, keeping it up will not take too much
effort. The day will come, however, that essential components
will fail, and we will be forced to abandon support, possibly
with little warning.
We also want to consider moving nodes running unsupported Linux operating systems (RH9 on lnx134, lnx135, and lnxcon) into the Green Zone.
Other systems to consider:
- solfa5 (spare for solcon, solfa4)
- solfa6 (spare for solfa1,2,3)
Servers
- lns131 - needed for CLEO 2.5 from axp (hopefully) only until December 2008
- sol191
- sol197
- sol201 (ultra5)
- sol202 (ultra5)
- sol300
- sol301
- sol302
- sol303
- sol401
- sol402
- sol403
- sol404
- sol405
- sol406
- sol407 (pass2logs, cronjobs, besides servers other things)
- sol408
- sol409
- sol501
- sol502
- sol503
- sol504
- sol505
- sol506
- sol507
- sol508
- sol509
- sol510
- sol511
- sol512
- sol514
- sol515
- sol516
- sol531
- sol532
- sol570
Solaris 2.6 machines (for compiling against old Solaris libraries: Jul07_03_MC or older)
- solssa
- solssb
- sol210
- sol211
Interactive
- sol410 - interactive node for running pass2
- sol333 (alias for sol566/sol567) - interactive node for users
- sol199 (ultra5) ???
- solgc1 (ultra5) ???
Batch
- sol22x
- sol5xx
- sol6xx
- solcm7
- solpi1
- solsy2
Turned off
PERSONNEL
- Librarian to repair buggy code, recover from hardware failures. Supplied by CLEO
- Constantsmaster to make changes and repairs to working libraries. Supplied by CLEO
- CLEO-III MC generator: On request from CLEOns, this individual would run the Solaris MC generation node(s) to make signal MC, much as Minnesota now makes CLEO-c signal MC on request. Supplied by CLEO.
- Green zone access: A member of the computer group who retrieves raw data, warm and cold store data, etc. on request from CLEOns, much as Jastremsky does now.
ACCOUNTS
Regarding the cleo31 and daqiii accounts: we have local ones on solon2, the UNIX ones, Windows, VMS, and one cleo31 account on the CESR VMS control system cluster.

The UNIX accounts for "cleo31" and "daqiii", both the local ones on solon2 and the sol105 ones, are still required and critical to the CLEO master database, which runs on solon2. Specifically, the "daqiii" account owns most of the online Objectivity and Visibroker installations, as well as the VxWorks installations (the last is now also available on CesrTA and ERL systems). The Windows and VMS accounts, however, are no longer needed (including their contents).
Contact persons for the CLEO online systems will be (besides me via email): Dave Kreinick and Laurel Bartnik (lty2@cornell.edu). Laurel is able to restart the CLEO online system needed for the database in case of disk failures or power outages.
Just a couple of generic comments, which should be modified
appropriately by people with more knowledge:
1. Generally speaking, Solaris library and data files need not be stored on a Solaris server. Migrating the files and associated environment variables and softlinks to mount points with different names would doubtless be painful, however.
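One common low-effort mitigation for that pain is a compatibility symlink at the old mount-point path pointing to the new location, so existing environment variables and scripts keep resolving. The sketch below demonstrates the idea with made-up paths under a temporary directory; the real CLEO mount layout may look nothing like this.

```python
# Sketch: preserve an old mount-point path after migrating files to a server
# mounted under a different name. All paths here are hypothetical stand-ins
# built inside a temp directory for demonstration.
import os
import tempfile

root = tempfile.mkdtemp()

# Stand-in for the new mount point where the data actually lives now.
new_mount = os.path.join(root, "nfs", "cleo", "cdat")
os.makedirs(new_mount)
with open(os.path.join(new_mount, "catalog.txt"), "w") as f:
    f.write("catalog")

# Recreate the parent of the old path and symlink the old name to the new data.
old_parent = os.path.join(root, "cdat")
os.makedirs(old_parent)
os.symlink(new_mount, os.path.join(old_parent, "cleo"))  # old path -> new data

# Anything still referencing the old path continues to work:
with open(os.path.join(old_parent, "cleo", "catalog.txt")) as f:
    print(f.read())
```

This avoids rewriting environment variables and scripts at the cost of one extra indirection per legacy path.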
2. Migrating some of the current Solaris services to the "most reliable and fastest" servers may be a lot of work. Some of the services are running on 1U systems, not on enterprise-class redundant servers. It may be more maintainable to keep them as they are and make sure we have several spare systems that can be swapped in as needed. I don't know how many identical high-end servers we have. The number of different system types should be minimized.
--
SeldenBall - 20 Feb 2008
Just for the record, I'll state the obvious.
The robotic tape libraries above are currently controlled by
SunFire 280R machines, namely sol100 and sol103. If/when everything is consolidated into one robot, you can use one machine for spare parts for the other until such time as everything dies.
We currently have 5 Veritas Storage Migrator licenses. Once data taking stops, we can reduce that to 2 licenses. If/when all data is moved to one robot, you can reduce that to a single license.
We currently run VSM 4.5. For more than a year, I have wanted to upgrade to a more recent version (say 6.5), but time has been against us, and Bill has not been able to get the intermediate versions required. We appear to need to first upgrade to 5.0, then 6.0 and then 6.5.
As part of this upgrade, we will also need to upgrade Veritas File System to version 4.0. We currently have 3.4.
VSM is NOT available on Linux. You must keep a Solaris 7, 8, or 9 system around to run it.
So, if sol100 and sol103 both fail, you will have to find another Solaris 8/9 machine to run the robot, and you will have to get more aggressive about obtaining the software for the VSM upgrades. The longer you wait, the less chance you have of getting the intermediate versions. So you might want to actually do something about obtaining the software upgrades right away (which is what I said last year...)
--
GregorySharp - 20 Feb 2008