Computing Equipment and Staffing for CLEO Continuation
This page was used for initial discussions around CLEO Continuation. For a definitive list of nodes critical for CLEO Completion, please see CriticalSystems
cleocont.xls: System list initially based on the contents of this page.
EQUIPMENT
- Data in Eventstore RAID (hotstore only; includes constants in pds format)
- Database Servers
- CLEO-III
- Data16-29
- Some off-4S running for background evaluation
- CLEO-c
- MC in Eventstore RAID
- MC in RAID
- CLEO-III as pds files to be accessed via chains (it would be desirable to inject this into Eventstore if manpower can be found)
- Random triggers for all datasets as .bin files
- Tape backup (in robot) for all the above to guard against catastrophic RAID failure
- Linux batch farm
- Existing compute nodes maintained
- Existing tem disk space maintained
- Existing (and possible future) group RAID disk space maintained
- Library
- Lnx134 and lnx135 maintained
- CLEO-c and CLEO-III active library maintenance on Linux
- Access maintained to historical Solaris CLEO-III libraries in the "green zone"
- Constants (necessary for doing MC)
- Lnxcon slave constants database node maintained
- Solon2 master constants database node maintained in the "green zone" (This is needed if we ever want to update any constants)
- solcon - pass2 constants -> CLEO s/w coordinator
- CLEO-III signal MC generation nodes
- Several Solaris nodes in the "green zone"
- solfa1, solfa2, solfa3, solfa4 (solaris queue node)
- Document database
- Access to CBX notes and paper drafts, both historical and those in progress
- Tape robot storage of raw, warm store, and cold store data
- Retain access to CLEO-III raw data via Solaris/Objectivity if possible (but probably not)
- Miscellaneous nodes
- lnx768 - timeline server -> Ji Li (CLEO run management)
- lnx122 - hypernews server
- Solaris build systems
- sol300 (tem disk), sol518 (objectivity catalog - /cdat/cleo), sol303 ("testing phase")
Online
This is a preliminary list and proposal covering which CLEO online nodes need to be kept running after March 4th, and which nodes can be turned off or reused for other purposes.
See also this page for the post-CLEO-c DAQ setup:
https://wiki.lepp.cornell.edu/lepp/bin/view/CLEO/Private/RunMan/CLEODAQPostSetUp
Note:
- CesrTA folks might be interested in using PCs in the CLEO counting room for future machine studies. However, all such nodes would have to be re-imaged, re-configured, and relocated from the W215 rack into the counting room. Please consult with Mark Palmer.
- Once data taking has stopped, the constants server consolidation will happen immediately (this does not prevent future data taking or any use of the DAQ), freeing up solon1.
- All build systems and other cron-job tasks for CLEO online will stop after data taking ends.
CLEO nodes that need to be kept running after data taking stops, for as long as CLEO data will be analyzed:
- solon2 (solon1/2 operations will be consolidated and run off solon2 entirely) - W221
- lnxon14 (Web/E-Log, Linux build node for offline constants database support) - W221
- lnx196 - for reading raw AIT-2 and AIT-3 tapes (tape copying, one-time backups, data restores)
- solsda - (development for CLEO and ERL/CesrTA) - W221
- lnx768 - timeline server -> Ji Li (CLEO run management)
- Online switch hps05-p2 in W215 could be replaced by a smaller one and needs to move to W221. It is required for CLEO online subnet (192.168.2.0) support, i.e. for:
- solon2
- lnxon14
- c3pc{104,106,110b} - for CLEO cooling and gas systems
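As a sanity check when reconfiguring the replacement switch, membership in the 192.168.2.0/24 online subnet can be verified programmatically. This is only a sketch; the host addresses below are made up for illustration and are not the real CLEO node assignments.

```python
# Sketch: check whether node addresses fall inside the CLEO online subnet
# (192.168.2.0/24) that the replacement switch must continue to carry.
# The individual host addresses here are hypothetical.
import ipaddress

cleo_online = ipaddress.ip_network("192.168.2.0/24")
nodes = {
    "solon2": "192.168.2.10",    # hypothetical address
    "lnxon14": "192.168.2.14",   # hypothetical address
    "elsewhere": "10.0.0.5",     # clearly outside the online subnet
}
for name, addr in nodes.items():
    on_subnet = ipaddress.ip_address(addr) in cleo_online
    print(f"{name}: {on_subnet}")
```

Any node whose address tests False here would not be reachable through the consolidated online switch.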
Nodes that need to be kept running initially but can be retired during summer 2008:
- c3pc{104,106,110b} until ~July - can be retired when the CLEO drift chambers and the RICH are removed in July 2008. Check with Steve Gray in April 2008.
- c3pc102
- c3pc103
- c3pc107
- c3pc112
- pretty much all LCD monitors in the counting room (see note above)
- lnxon1 (until early summer 2008 at least) - W221
- solon1 (until constants management has been consolidated in March 2008)
- DR laptop
- lnxon12
- lnxon13 (note: has multiple Cyclades serial-line h/w, which we desperately need for Linux infrastructure servers, or possibly for CesrTA/ERL)
- lnxon15
- solon3
- solon4
- solon5
- solon6
- sol198
- sol201
- UPS system in W215
- solgc2 (ultra5)
Green zone
Some CLEO-III work requires the use of Solaris nodes. We would
retain five nodes, presumably the most reliable and fastest, in a
non-public area. This avoids having to make security updates to
the Solaris 8 OS, thus minimizing software maintenance. The five
nodes would provide enough hardware redundancy that emergency
repairs could be made. The library, constants, and MC signal
generation nodes would be in this restricted area, with only
authorized local personnel granted access. We anticipate that
setting up the five nodes will be a fair amount of work but hope
that once it is running, keeping it up will not take too much
effort. The day will come, however, that essential components
will fail, and we will be forced to abandon support, possibly
with little warning.
We also want to consider moving nodes running unsupported Linux operating systems (RH9 on lnx134, lnx135, and lnxcon) into the Green Zone.
Other systems to consider:
- solfa5 (spare for solcon, solfa4)
- solfa6 (spare for solfa1,2,3)
Servers
- lns131 - needed for CLEO 2.5 from axp (hopefully) only until December 2008
- sol191
- sol197
- sol201 (ultra5)
- sol202 (ultra5)
- sol300
- sol301
- sol302
- sol303
- sol401
- sol402
- sol403
- sol404
- sol405
- sol406
- sol407 (pass2logs, cronjobs, besides servers other things)
- sol408
- sol409
- sol501
- sol502
- sol503
- sol504
- sol505
- sol506
- sol507
- sol508
- sol509
- sol510
- sol511
- sol512
- sol514
- sol515
- sol516
- sol531
- sol532
- sol570
Solaris 2.6 machines (for compiling against old Solaris libraries: Jul07_03_MC or older)
- solssa
- solssb
- sol210
- sol211
Interactive
- sol410 - interactive node for running pass2
- sol333 (alias for sol566/sol567) - interactive node for users
- sol199 (ultra5) ???
- solgc1 (ultra5) ???
Batch
- sol22x
- sol5xx
- sol6xx
- solcm7
- solpi1
- solsy2
Turned off
PERSONNEL
- Librarian to repair buggy code, recover from hardware failures. Supplied by CLEO
- Constantsmaster to make changes and repairs to working libraries. Supplied by CLEO
- CLEO-III MC generator: On request from CLEOns, this individual would run the Solaris MC generation node(s) to make signal MC, much as Minnesota now makes CLEO-c signal MC on request. Supplied by CLEO.
- Green zone access: A member of the computer group who retrieves raw data, warm and cold store data, etc. on request from CLEOns, much as Jastremsky does now.
ACCOUNTS
Regarding the cleo31 and daqiii accounts: we have local ones on solon2, the UNIX ones, Windows, VMS, and one cleo31 account on the CESR VMS control system cluster.

The UNIX accounts for "cleo31" and "daqiii", both the local ones on solon2 and the sol105 ones, are still required and critical to the CLEO master database, which runs on solon2. Specifically, the "daqiii" account owns most of the online Objectivity and Visibroker installations, as well as the VxWorks installations (the last is now also available on CesrTA and ERL systems). The Windows and VMS accounts, however, are no longer needed (including their contents).
Contact persons for the CLEO online systems will be (besides me via email): Dave Kreinick and Laurel Bartnik (lty2@cornell.edu). Laurel is able to restart the CLEO online system needed for the database in case of disk failures or power outages.
Just a couple of generic comments, which should be modified
appropriately by people with more knowledge:
1. Generally speaking, Solaris library and data files need not be stored on a Solaris server. Migrating the files and associated environment variables and softlinks to mount points with different names would doubtless be painful, however.
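One common low-effort mitigation for that pain is a compatibility symlink at the old mount-point path pointing to the new location, so existing environment variables and scripts keep resolving. The sketch below demonstrates the idea with made-up paths under a temporary directory; the real CLEO mount layout may look nothing like this.

```python
# Sketch: preserve an old mount-point path after migrating files to a server
# mounted under a different name. All paths here are hypothetical stand-ins
# built inside a temp directory for demonstration.
import os
import tempfile

root = tempfile.mkdtemp()

# Stand-in for the new mount point where the data actually lives now.
new_mount = os.path.join(root, "nfs", "cleo", "cdat")
os.makedirs(new_mount)
with open(os.path.join(new_mount, "catalog.txt"), "w") as f:
    f.write("catalog")

# Recreate the parent of the old path and symlink the old name to the new data.
old_parent = os.path.join(root, "cdat")
os.makedirs(old_parent)
os.symlink(new_mount, os.path.join(old_parent, "cleo"))  # old path -> new data

# Anything still referencing the old path continues to work:
with open(os.path.join(old_parent, "cleo", "catalog.txt")) as f:
    print(f.read())
```

This avoids rewriting environment variables and scripts at the cost of one extra indirection per legacy path.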
2. Migrating some of the current Solaris services to the "most reliable and fastest" servers may be a lot of work. Some of the services are running on 1U systems, not on enterprise-class redundant servers. It may be more maintainable to keep them as they are and make sure we have several spare systems that can be swapped in as needed. I don't know how many identical high-end servers we have. The number of different system types should be minimized.
--
SeldenBall - 20 Feb 2008
Just for the record, I'll state the obvious.
The robotic tape libraries above are currently controlled by
SunFire 280R machines, namely sol100 and sol103. If/when everything is consolidated into one robot, you can use one machine for spare parts for the other until such time as everything dies.
We currently have 5 Veritas Storage Migrator licenses. Once data taking stops, we can reduce that to 2 licenses. If/when all data is moved to one robot, you can reduce that to a single license.
We currently run VSM 4.5. For more than a year, I have wanted to upgrade to a more recent version (say 6.5), but time has been against us, and Bill has not been able to get the intermediate versions required. We appear to need to first upgrade to 5.0, then 6.0 and then 6.5.
As part of this upgrade, we will also need to upgrade Veritas File System to version 4.0. We currently have 3.4.
VSM is NOT available on Linux. You must keep a Solaris 7, 8, or 9 system around to run it.
So, if sol100 and sol103 both fail, you will have to find another Solaris 8/9 machine to run the robot, and you will have to get more aggressive about obtaining the software for the VSM upgrades. The longer you wait, the less chance you have of getting the intermediate versions. So you might want to actually do something about obtaining the software upgrades right away (which is what I said last year...)
--
GregorySharp - 20 Feb 2008