CLASSE Compute Farm
The CLASSE Compute Farm is a central resource of 60+ enterprise-class Linux nodes (with around 400 cores) with a front-end queueing system that distributes jobs across the Compute Farm nodes. Our queueing system supports interactive, batch, parallel, and GPU jobs, and it ensures equal access to the Compute Farm for all users.
Note that Compute Farm nodes are configured identically to our Linux desktop systems, and they all have direct access to the same central file systems. Therefore, jobs that are developed interactively on any Linux system can easily be transferred to the Compute Farm to run in batch mode.
In addition to conventional CPUs, we are also developing GPU capabilities in the Compute Farm.
Getting started
See the GridEngine page for detailed instructions on using the CLASSE Compute Farm.
For current limits on running and queued jobs, please see the Maximum Running and Queued Job Limits section of the GridEngine page.
Farm Grid Engine "Best Practices" Topics
GridengineUseTopics - prepared for "the-more-you-know" CHESS talk on 13-Jun-2019
Slideshow presentation (updated June 2019)
Slide 1: Compute Farms @ CLASSE
- CLASSE has 20+ years of experience with batch queuing systems.
- Initially used for high-energy physics:
- CLASSE was host laboratory for the CLEO Collaboration:
- 1979-2012, 200+ individuals / 20+ institutions (peak).
- Early 1990s: 200+ node Solaris farm for simulations and data analysis.
- Decommissioned in 2013-14.
- Currently, a 60-node Linux compute farm with approximately 400 cores, supporting:
- Single threaded jobs
- MPI parallel jobs
- Multi-process or multi-node jobs
- Interactive graphical jobs
- GPU jobs (CUDA)
- Used for:
- Electron cloud, photocathode, SRF simulations
- Theorists: parallel Mathematica jobs, etc.
Slide 2: Batch Queuing Basics
- Cluster of high-performance compute nodes:
- In general, farm nodes are faster and have more memory than contemporaneous desktops.
- All nodes run 64-bit Scientific Linux 7 (SL7).
- Job scheduling software (Son of Grid Engine) provides equitable access.
- Avoids resource contention.
- Ensures jobs are executed on nodes with adequate resources.
- Compute nodes are logically identical to all other CLASSE SL7 desktops/servers:
- Same operating system and software stack.
- Same access to all centralized resources (file systems, users/groups, environments, etc.).
- Code developed on any CLASSE SL7 system can run on all other CLASSE SL7 systems.
- Documentation: https://wiki.classe.cornell.edu/Computing/GridEngine.
- A powerful tool, especially when coupled with 500+ TB of central disk storage.
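To see how the scheduler views the farm at any moment, the standard Grid Engine status commands can be run from the login node lnx201 (a brief sketch; exact output depends on the current configuration):
# List all execution hosts with their core counts, load, and memory
qhost
# Show every queue instance and the jobs currently running in it
qstat -f
# Show all running and pending jobs for all users
qstat -u '*'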
Slide 3: Configuration
- Queues, projects, and limits created and tuned as necessary.
- Current settings:
- Maximum of 60 simultaneously running jobs per user
- Unlimited number of queued jobs
- 48-hour wall clock time limit
- Maximum of 24GB memory per batch job
- Maximum of 64GB memory per interactive job
- Numerous options for job submission (see the example after this list), such as:
- Memory requirements
- Output locations
- Email notifications
- Nodes to use
- Etc.
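As an illustration of combining these options, a submission could look like the following (a sketch only; my_job.sh is a placeholder script, and the mem_free memory resource is a common Grid Engine convention that may differ from CLASSE's actual configuration):
# Request 8GB of free memory, keep the .o/.e files in the current directory,
# and send an email when the job ends
qsub -l mem_free=8G -cwd -m e -M defalco@cornell.edu my_job.sh
# Direct a job to a particular node by naming its queue instance
qsub -q all.q@lnx326 my_job.sh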
Slide 4: Current Hardware
- Most recent deployments in compute farm:
- Four IBM x3550 M4s, each with two 6-core 2.30GHz Xeon E5-2630s and 128GB of DDR3 memory.
- The IBM Flex System Enterprise chassis:
- Very flexible node configuration: up to four processors per node, flexible memory configuration, GPUs, 40Gb upgrades, etc.
Slide 5: Grid Engine Demonstration
See GridEngine.
- How to submit standard shell scripts (qsub).
- How to create custom grid engine scripts that specify memory and CPU requirements, specify output directory, etc.
- How to submit parallel jobs, and explanation of what parallel jobs are.
- How to submit a simple batch job, see it in the queue, and then receive the results by email.
- How to submit an interactive job (qrsh), for example a MATLAB benchmark (see the sketch below).
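For reference, the interactive workflow mentioned in the last item might look like this (a sketch; it assumes the application, here MATLAB, is already on your path):
# Open an interactive shell on whichever farm node the scheduler selects
qrsh
# ...then start the application, for example a command-line MATLAB session
matlab -nodisplay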
Slide 6: Sample qsub Script
# Set script linux shell - "bash" is recommended
#$ -S /bin/bash
# Name of queued job and output files
#$ -N regression_tests_demo
# Send successful job completion message to this email address
#$ -m e -M defalco@cornell.edu
# To make sure that the .e and .o file arrive in the working directory
#$ -cwd
# Put farm node name and start timestamp in log file
echo -e "\nOn $HOSTNAME, Starting at: " `date` "\n"
# Initialize your runtime environment
. /nfs/acc/libs/cesr/cesr_online.bashrc
# Move into directory to run the executable, if necessary
cd /nfs/acc/user/amd275/sge_demo/regression_tests
# Executable to run
./scripts/run_tests.py
# Put farm node name and end timestamp in log file
echo "On $HOSTNAME, Done at: " `date`
Slide 7: Job Submission Tips & Guidelines
- General purpose login node: lnx201.classe.cornell.edu
- Log in with your CLASSE credentials and submit jobs.
- Output and error logs are written to your home directory.
- These logs include the name of the node where the job ran.
- Do not SSH into farm nodes directly.
- Diverts resources from legitimately queued jobs.
- Instead, launch an interactive session through the queuing system (qrsh).
- For a specific node (e.g. to check CPU/memory usage):
qrsh -q all.q@lnx326
- For I/O-bound processes (see the sample script after this list):
- Write temporary files to /tmp (local to each compute node) to avoid network latency.
- At the end of the job, copy or rsync files to central storage.
- Files in /tmp are automatically cleaned up periodically.
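A job script following the /tmp guideline above might be structured like this (a sketch; my_analysis and the results directory are placeholders, while $USER and $JOB_ID are set by Grid Engine in the job environment):
#$ -S /bin/bash
#$ -cwd
# Create a per-job scratch directory on the node's local disk
SCRATCH=/tmp/$USER/$JOB_ID
mkdir -p "$SCRATCH"
# Run the I/O-heavy step with its temporary output on local scratch
./my_analysis --output "$SCRATCH"
# Copy the results back to central storage (the submission directory), then clean up
rsync -a "$SCRATCH/" results/
rm -rf "$SCRATCH"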
Slide 8: Other Recent Improvements (2018)
- Updated to latest Son of Grid Engine scheduler.
- Improved checkpointing capabilities.
- Improved intelligence in job scheduling (CPU speed, etc.) and prioritization.
- Enabling scheduling of GPUs.
- Enabling full-desktop interactive jobs (X2Go).
- New compute nodes.
- Upgrading the trailer (farm subnet) connection to 40Gb.
- Upgrading to 10Gb low-latency interconnects.
- Always something new!