
Computing Equipment and Staffing for CLEO Continuation

This page was used for initial discussions around CLEO Continuation. For a definitive list of nodes critical for CLEO Completion, please see CriticalSystems.

cleocont.xls: System list initially based on the contents of this page.

EQUIPMENT

  1. Data in Eventstore RAID (hotstore only; includes constants in pds format)
    • Database Servers
      • lnx150
      • lnx151
    • CLEO-III
      • Data16-29
      • Some off-4S running for background evaluation
    • CLEO-c
      • Data31-48
      • Dskims
  2. MC in Eventstore RAID
    • CLEO-c
  3. MC in RAID
    • CLEO-III as pds files to be accessed via chains (it would be desirable to inject this into eventstore if manpower can be found)
    • Random triggers for all datasets as .bin files
  4. Tape backup (in robot) for all the above to guard against catastrophic RAID failure
    • sol100, sol103
  5. Linux batch farm
    • Existing compute nodes maintained
    • Existing tem disk space maintained
    • Existing (and possible future) group RAID disk space maintained
  6. Library
    • lnx134 and lnx135 maintained
    • CLEO-c and CLEO-III active library maintenance on Linux
    • Access maintained to historical Solaris CLEO-III libraries in the "green zone"
  7. Constants (necessary for doing MC)
    • lnxcon slave constants database node maintained
    • solon2 master constants database node maintained in the "green zone" (this is needed if we ever want to update any constants)
    • solcon - pass2 constants -> CLEO s/w coordinator
  8. CLEO-III signal MC generation nodes
    • Several Solaris nodes in the "green zone"
    • solfa1, solfa2, solfa3, solfa4 (Solaris queue node)
  9. Document database
    • Access to CBX notes and paper drafts, both historical and those in progress
  10. Tape robot storage of raw, warm store, and cold store data
  11. Retain access to CLEO-III raw data via Solaris/Objectivity if possible (but probably not)
  12. Miscellaneous nodes
    • lnx768 - timeline server -> Ji Li (CLEO run management)
    • lnx122 - hypernews server
  13. Solaris build systems
    • sol300 (tem disk), sol518 (objectivity catalog - /cdat/cleo), sol303 ("testing phase")

Online

This is a preliminary list and proposal for which CLEO online nodes need to be kept running after March 4th, and which nodes can be turned off or reused for other purposes.

See also this page for the post-CLEO-c DAQ setup: https://wiki.lepp.cornell.edu/lepp/bin/view/CLEO/Private/RunMan/CLEODAQPostSetUp

Note:
  1. CesrTA folks might be interested in using PCs in the CLEO counting room for future machine studies by CesrTA collaborators. However, all nodes would have to be re-imaged, re-configured, and relocated from the W215 rack into the counting room. Please consult with Mark Palmer.
  2. Once we have stopped data taking, the constants server consolidation will happen immediately (this does not prevent future data taking or any use of the DAQ). This would free up solon1.
  3. All build systems and other cronjob tasks for CLEO online will stop after data taking ends.

CLEO nodes that need to be kept running after data taking stops, for as long as CLEO data is analyzed:

  • solon2 (solon1/2 operations will be consolidated and run off solon2 entirely) - W221
  • lnxon14 (Web/E-Log, Linux build node for offline constants database support) - W221
  • lnx196 - for reading raw AIT-2 and AIT-3 tapes (tape copying, one-time backups, data restores)
  • solsda - (development for CLEO and ERL/CesrTA) - W221
  • lnx768 - timeline server -> Ji Li (CLEO run management)

  • Online switch hps05-p2 in W215 could be replaced by a smaller one and needs to move to W221. It is required for CLEO online subnet (192.168.2.0) support, i.e. for:
    • solon2
    • lnxon14
    • c3pc{104,106,110b} - for CLEO cooling and gas systems

Nodes that need to be kept running initially but can be retired during summer 2008:

  • c3pc{104,106,110b} until ~ July - Can be retired when the CLEO drift chambers and the RICH are removed in July 2008. Check with Steve Gray in April 2008.

NOT required anymore and can be retired or reused pretty much immediately after March 4th:

  • c3pc102
  • c3pc103
  • c3pc107
  • c3pc112
  • pretty much all LCD monitors in the counting room (see note above)
  • lnxon1 (until early summer 2008 at least) - W221
  • solon1 (until constants management has been consolidated in March 2008)
  • DR laptop
  • lnxon12
  • lnxon13 (note: has multiple Cyclades serial line h/w, which we desperately need for Linux infrastructure servers (or possibly for CesrTA/ERL))
  • lnxon15
  • solon3
  • solon4
  • solon5
  • solon6
  • sol198
  • sol201
  • UPS system in W215
  • solgc2 (ultra5)

Green zone

Some CLEO-III work requires the use of Solaris nodes. We would retain five nodes, presumably the most reliable and fastest, in a non-public area. This avoids having to make security updates to the Solaris 8 OS, thus minimizing software maintenance. The five nodes would provide enough hardware redundancy that emergency repairs could be made. The library, constants, and MC signal generation nodes would be in this restricted area, with only authorized local personnel granted access. We anticipate that setting up the five nodes will be a fair amount of work, but we hope that once the setup is running, keeping it up will not take too much effort. The day will come, however, when essential components fail, and we will be forced to abandon support, possibly with little warning.

We also want to consider moving unsupported Linux operating systems (RH9 on lnx134, lnx135, and lnxcon) into the Green Zone.

Other systems to consider:

  • solfa5 (spare for solcon, solfa4)
  • solfa6 (spare for solfa1,2,3)
Servers
  • lns131 - needed for CLEO 2.5 from axp (hopefully) only until December 2008
  • sol191
  • sol197
  • sol201 (ultra5)
  • sol202 (ultra5)
  • sol300
  • sol301
  • sol302
  • sol303
  • sol401
  • sol402
  • sol403
  • sol404
  • sol405
  • sol406
  • sol407 (pass2 logs, cronjobs, and other tasks besides serving)
  • sol408
  • sol409
  • sol501
  • sol502
  • sol503
  • sol504
  • sol505
  • sol506
  • sol507
  • sol508
  • sol509
  • sol510
  • sol511
  • sol512
  • sol514
  • sol515
  • sol516
  • sol531
  • sol532
  • sol570
Sol2.6 machines (for compiling with old Solaris libraries: Jul07_03_MC or older)
  • solssa
  • solssb
  • sol210
  • sol211
Interactive
  • sol410 - interactive node for running pass2
  • sol333 (alias for sol566/sol567) - interactive node for users
  • sol199 (ultra5) ???
  • solgc1 (ultra5) ???
Batch
  • sol22x
  • sol5xx
  • sol6xx
  • solcm7
  • solpi1
  • solsy2
Turned off

PERSONNEL

  • Librarian: to repair buggy code and recover from hardware failures. Supplied by CLEO.
  • Constants master: to make changes and repairs to working libraries. Supplied by CLEO.
  • CLEO-III MC generator: On request from CLEOns, this individual would run the Solaris MC generation node(s) to make signal MC, much as Minnesota now makes CLEO-c signal MC on request. Supplied by CLEO.
  • Green zone access: A member of the computer group who retrieves raw data, warm and cold store data, etc. on request from CLEOns, much as Jastremsky does now.

ACCOUNTS

Regarding the cleo31 and daqiii accounts: we have local accounts on solon2, the UNIX ones, Windows and VMS accounts, and one cleo31 account on the CESR VMS control system cluster.

The UNIX accounts for "cleo31" and "daqiii", both the local ones on solon2 and the sol105 ones, are still required and critical to the CLEO master database, which runs on solon2. Specifically, the "daqiii" account owns most of the online Objectivity and Visibroker installations as well as the VxWorks installations (the last of which is now also available on CesrTA and ERL systems). The Windows and VMS accounts, however, are not needed anymore (including their contents).

Contact persons for the CLEO online systems will be (besides me via email): Dave Kreinick and Laurel Bartnik (lty2@cornell.edu). Laurel is able to restart the CLEO online system needed for the database in case of disk failures or power outages.

COMMENTS

Just a couple of generic comments, which should be modified appropriately by people with more knowledge:

1. Generally speaking, Solaris library and data files need not be stored on a Solaris server. Migrating the files and associated environment variables and softlinks to mount-points with different names would doubtless be painful, however.
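
As a rough illustration of what such a migration involves, one low-tech approach is to recreate the old mount-point names as symlinks pointing at the files' new locations, so that existing environment variables and hard-coded paths keep resolving. The sketch below is only a hypothetical example: /cdat/cleo is taken from the list above, but the target paths and the mapping itself are placeholders that would have to be replaced with the real ones, and the script would need to run with sufficient privileges on the new file server.

    #!/usr/bin/env python
    # Minimal sketch (hypothetical paths): recreate old Solaris mount-point
    # names as symlinks to the new locations so existing paths keep working.
    import os

    # Placeholder mapping: old mount point -> new location of the same files.
    PATH_MAP = {
        "/cdat/cleo": "/nfs/newraid/cdat/cleo",        # e.g. Objectivity catalog area
        "/nfs/sol300/tem": "/nfs/newraid/sol300/tem",  # e.g. tem disk
    }

    def make_compat_links(path_map, dry_run=True):
        """Create a symlink at each old path pointing to the new location."""
        for old, new in sorted(path_map.items()):
            if os.path.islink(old) and os.readlink(old) == new:
                continue  # already points at the new location
            if os.path.lexists(old):
                print("SKIP %s: something already exists there" % old)
                continue
            print("LINK %s -> %s" % (old, new))
            if not dry_run:
                os.symlink(new, old)

    if __name__ == "__main__":
        # Review the dry-run output, then rerun with dry_run=False.
        make_compat_links(PATH_MAP)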

2. Migrating some of the current Solaris services to the "most reliable and fastest" servers may be a lot of work.

Some of the services are running on 1U systems, not on enterprise-class redundant servers. It may be more maintainable to keep them as they are and make sure we have several spare systems that can be swapped in as needed. I don't know how many identical high-end servers we have. The number of different system types should be minimized.

-- SeldenBall - 20 Feb 2008

Just for the record, I'll state the obvious.

The robotic tape libraries above are currently controlled by SunFire 280R machines, namely sol100 and sol103. If/when everything is consolidated into one robot, you can use one machine as a source of spare parts for the other until such time as everything dies.

We currently have 5 Veritas Storage Migrator licenses. Once data taking stops, we can reduce that to 2 licenses. If/when all data is moved to one robot, you can reduce that to a single license.

We currently run VSM 4.5. For more than a year, I have wanted to upgrade to a more recent version (say 6.5), but time has been against us, and Bill has not been able to get the intermediate versions required. We appear to need to first upgrade to 5.0, then 6.0 and then 6.5.

As part of this upgrade, we will also need to upgrade Veritas File System to version 4.0. We currently have 3.4.

VSM is NOT available on Linux. You must keep a Solaris 7, 8, or 9 system around to run it.

So, if sol100/sol103 both fail, you will have to get another Solaris 8/9 machine to run the robot, and you will have to get more aggressive about obtaining the software for the VSM upgrades. The longer you wait, the less chance you have of getting the intermediate versions. So you might want to actually do something about obtaining the software upgrades right away (which is what I said last year...).

-- GregorySharp - 20 Feb 2008
 