CLASSE Compute Farm
The CLASSE Compute Farm is a central resource of 60+ enterprise-class Linux nodes (with around 400 cores) with a front-end queueing system that distributes jobs across the Compute Farm nodes. Our queueing system supports interactive, batch, parallel, and GPU jobs, and it ensures equal access to the Compute Farm for all users.
Note that Compute Farm nodes are configured identically to our Linux desktop systems, and they all have direct access to the same central file systems. Therefore, jobs that are developed interactively on any Linux system can easily be transferred to the Compute Farm to run in batch mode.
In addition to conventional CPUs, we are also developing GPU capabilities in the Compute Farm.
Getting started
See the GridEngine page for detailed instructions on using the CLASSE Compute Farm.
For current limits on running and queued jobs, please see the Maximum Running and Queued Job Limits section of the GridEngine page.
Farm Grid Engine "Best Practices" Topics
GridengineUseTopics - prepared for "the-more-you-know" CHESS talk on 13-Jun-2019
Slideshow presentation (updated June 2019)
Slide 1: Compute Farms @ CLASSE
- CLASSE has 20+ years of experience with batch queuing systems.
- Initially used for high-energy physics:
- CLASSE was host laboratory for the CLEO Collaboration:
- 1979-2012, 200+ individuals / 20+ institutions (peak).
- Early 1990s: 200+ node Solaris farm for simulations and data analysis.
- Decommissioned in 2013-14.
- Currently, a 60-node Linux compute farm with approximately 400 cores, supporting:
- Single threaded jobs
- MPI parallel jobs
- Multi-process or multi-node jobs
- Interactive graphical jobs
- GPU jobs (CUDA)
- Used for:
- Electron cloud, photocathode, SRF simulations
- Theorists: parallel Mathematica jobs, etc.
Slide 2: Batch Queuing Basics
- Cluster of high-performance compute nodes:
- In general, farm nodes are faster and have more memory than contemporaneous desktops.
- All nodes run 64-bit Scientific Linux 7 (SL7).
- Job scheduling software (Son of Grid Engine) provides equitable access.
- Avoids resource contention.
- Ensures jobs are executed on nodes with adequate resources.
- Compute nodes are logically identical to all other CLASSE SL7 desktops/servers:
- Same operating system and software stack.
- Same access to all centralized resources (file systems, users/groups, environments, etc.).
- Code developed on any CLASSE SL7 system can run on all other CLASSE SL7 systems.
- Documentation: https://wiki.classe.cornell.edu/Computing/GridEngine.
- A powerful tool, especially when coupled with 500+ TB of central disk storage.
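To see how the scheduler views the farm at any moment, the standard Grid Engine status commands can be run from the login node lnx201 (a brief sketch; exact output depends on the current configuration):
# List all execution hosts with their core counts, load, and memory
qhost
# Show every queue instance and the jobs currently running in it
qstat -f
# Show all running and pending jobs for all users
qstat -u '*'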
Slide 3: Configuration
- Queues, projects, and limits created and tuned as necessary.
- Current settings:
- Maximum of 60 simultaneously running jobs per user
- Unlimited number of queued jobs
- 48-hour wall clock time limit
- Maximum of 24GB memory per batch job
- Maximum of 64GB memory per interactive job
- Numerous options for job submission (see the example after this list), such as:
- Memory requirements
- Output locations
- Email notifications
- Nodes to use
- Etc.
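As an illustration of combining these options, a submission could look like the following (a sketch only; my_job.sh is a placeholder script, and the mem_free memory resource is a common Grid Engine convention that may differ from CLASSE's actual configuration):
# Request 8GB of free memory, keep the .o/.e files in the current directory,
# and send an email when the job ends
qsub -l mem_free=8G -cwd -m e -M defalco@cornell.edu my_job.sh
# Direct a job to a particular node by naming its queue instance
qsub -q all.q@lnx326 my_job.sh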
Slide 4: Current Hardware
- Most recent deployments in compute farm:
- Four IBM x3550 M4s, each with two 6-core 2.30GHz Xeon E5-2630s and 128GB of DDR3 memory.
- The IBM Flex System Enterprise chassis:
- Very flexible node configuration: up to four processors per node, flexible memory configuration, GPUs, 40Gb upgrades, etc.
Slide 5: Grid Engine Demonstration
See GridEngine.
- How to submit standard shell scripts (qsub).
- How to create custom grid engine scripts that specify memory and CPU requirements, specify output directory, etc.
- How to submit parallel jobs, and explanation of what parallel jobs are.
- How to submit a simple batch job, see it in the queue, and then receive the results by email.
- How to submit an interactive job (qrsh), for example a MATLAB benchmark (see the sketch below).
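For reference, the interactive workflow mentioned in the last item might look like this (a sketch; it assumes the application, here MATLAB, is already on your path):
# Open an interactive shell on whichever farm node the scheduler selects
qrsh
# ...then start the application, for example a command-line MATLAB session
matlab -nodisplay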
Slide 6: Sample qsub Script
# Set script linux shell - "bash" is recommended
#$ -S /bin/bash
# Name of queued job and output files
#$ -N regression_tests_demo
# Send successful job completion message to this email address
#$ -m e -M defalco@cornell.edu
# To make sure that the .e and .o file arrive in the working directory
#$ -cwd
# Put farm node name and start timestamp in log file
echo -e "\nOn $HOSTNAME, Starting at: " `date` "\n"
# Initialize your runtime environment
. /nfs/acc/libs/cesr/cesr_online.bashrc
# Move into directory to run the executable, if necessary
cd /nfs/acc/user/amd275/sge_demo/regression_tests
# Executable to run
./scripts/run_tests.py
# Put farm node name and end timestamp in log file
echo "On $HOSTNAME, Done at: " `date`
Slide 7: Job Submission Tips & Guidelines
- General purpose login node: lnx201.classe.cornell.edu
- Log in with your CLASSE credentials and submit jobs.
- Output and error logs are written to your home directory.
- These logs include the name of the node where the job ran.
- Do not SSH into farm nodes directly.
- Diverts resources from legitimately queued jobs.
- Instead, launch an interactive session through the queuing system (qrsh).
- For a specific node (e.g. to check CPU/memory usage):
qrsh -q all.q@lnx326
- For I/O-bound processes (see the sample script after this list):
- Write temporary files to /tmp (local to each compute node) to avoid network latency.
- At the end of the job, copy or rsync files to central storage.
- Files in /tmp are automatically cleaned up periodically.
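A job script following the /tmp guideline above might be structured like this (a sketch; my_analysis and the results directory are placeholders, while $USER and $JOB_ID are set by Grid Engine in the job environment):
#$ -S /bin/bash
#$ -cwd
# Create a per-job scratch directory on the node's local disk
SCRATCH=/tmp/$USER/$JOB_ID
mkdir -p "$SCRATCH"
# Run the I/O-heavy step with its temporary output on local scratch
./my_analysis --output "$SCRATCH"
# Copy the results back to central storage (the submission directory), then clean up
rsync -a "$SCRATCH/" results/
rm -rf "$SCRATCH"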
Slide 8: Other Recent Improvements (2018)
- Updated to latest Son of Grid Engine scheduler.
- Improved checkpointing capabilities.
- Improved intelligence in job scheduling (CPU speed, etc.) and prioritization.
- Enabling scheduling of GPUs.
- Enabling full-desktop interactive jobs (X2Go).
- New compute nodes.
- Upgrading the trailer (farm subnet) connection to 40Gb.
- Upgrading to 10Gb low-latency interconnects.
- Always something new!