Son of Grid Engine (a.k.a. SGE) "Best Practices" Topics

  • Please see ComputeFarmIntro for a brief introduction for newcomers to the CLASSE Compute Farm.

  • Please see the main GridEngine wiki page for detailed instructions on using the CLASSE Compute Farm.

  • This wiki page was prepared for the "the-more-you-know" CHESS talk on 13-Jun-2019.

Maximum Running and Queued Job Limits

Please see the Maximum Running and Queued Job Limits section of the GridEngine page.

REMINDER: How to open a terminal session on a CLASSE Linux System (e.g. lnx201).

Please use any of the following to initiate an lnx201 terminal session:

Complete SGE Manuals

The SGE Manuals are available for our installation of Son of Grid Engine.

Checking on Farm and Job Status

To see the complete Farm load using qstat, please type:
qstat -f -u "*"

This command shows all jobs submitted by all users.
  • Note that SGE itself interprets the " * " character as a wildcard, much like the Linux shell wildcard; the quotes keep the shell from expanding it before SGE sees it. The wildcard can be used with all SGE commands.
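The effect of quoting a wildcard can be seen without SGE at all. The following is a minimal sketch using echo in a scratch directory; the file name my_interactive_notes.txt is made up for the demonstration.

```shell
# Work in a throwaway directory that contains one file matching the pattern.
demo_dir=$(mktemp -d)
cd "$demo_dir" || exit 1
touch my_interactive_notes.txt   # hypothetical file name

# Unquoted: the shell expands the pattern to the matching file name.
unquoted=$(echo *interactive*)
# Quoted: the pattern is passed through unchanged, as SGE would need it.
quoted=$(echo "*interactive*")

echo "unquoted: $unquoted"
echo "quoted:   $quoted"

# Clean up the scratch directory.
cd - >/dev/null
rm -rf "$demo_dir"
```

If a matching file happens to exist in the current directory, an unquoted pattern never reaches SGE as typed, so quoting wildcard arguments is the safer habit.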

To see all jobs submitted by all users to all the interactive.q nodes:
qstat -f -u "*" -q "*interactive*"

Either au or adu appearing under the states heading in the qstat output denotes that a node/queue is DOWN, as in this example:
[amd275@lnx201 ~]$ qstat -f -u "*" -q *interactive*
queuename                      qtype resv/used/tot. load_avg arch          states
---------------------------------------------------------------------------------
cesrta_ilc_interactive.q@ilc20 IP    0/0/64         3.04     lx-amd64      
---------------------------------------------------------------------------------
chess_interactive.q@lnx335.cla IP    0/0/24         -NA-     lx-amd64      adu

It's safe to assume that we already know about DOWN nodes, but if you notice a node/queue status change right in front of you, please send email to service-classe@cornell.edu.
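To pick the DOWN entries out of a long listing, the output of qstat -f can be filtered. This is only a sketch: the function name down_queues is made up here, and it assumes the six-column layout shown in the example above, with the states value in the last column.

```shell
# Print "queue@host state" for queue instances whose states column
# is exactly "au" or "adu" (the two values described above).
# Reads `qstat -f` output on standard input.
down_queues() {
    awk 'NF == 6 && $1 ~ /@/ && ($6 == "au" || $6 == "adu") { print $1, $6 }'
}
```

It would be used as qstat -f -u "*" | down_queues. Queue lines with an empty states column have only five fields, so they are skipped, as are the header and any job lines.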

To see a table of the cores and memory, both available and in use, on all Farm nodes, please type:
qhost

The " - " character in any column of the qhost output means that the node is NOT available to the queueing system. Some of the DOWN nodes listed are past failed nodes that are now just hostname placeholders for future node purchases.
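The same check can be done mechanically. The sketch below uses a made-up function name, unavailable_hosts, and assumes that the qhost output starts with two header lines and includes a "global" pseudo-host row made up entirely of dashes; adjust it if the local output differs.

```shell
# Print the names of hosts that have "-" in any column of `qhost` output.
# Reads `qhost` output on standard input.
unavailable_hosts() {
    awk 'NR > 2 && $1 != "global" {
        for (i = 2; i <= NF; i++) {
            if ($i == "-") { print $1; next }   # one dash is enough
        }
    }'
}
```

It would be used as qhost | unavailable_hosts.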

To see the initial submission information about any job, please pass the " -j " flag, followed by the JOB_ID number, to qstat, e.g.:
qstat -j 3088984

This information can be very helpful when diagnosing "why is my job not running?" inquiries.
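The full qstat -j listing is long, so it can help to trim it down to a few fields. This is a rough sketch: the function name job_summary is made up, and the field labels in the pattern are assumed from typical qstat -j output, so they may need adjusting for our installation.

```shell
# Keep only a handful of fields from `qstat -j JOB_ID` output.
# Reads the listing on standard input; the field labels are assumptions.
job_summary() {
    grep -E '^(job_number|owner|submission_time|hard resource_list|parallel environment|scheduling info):'
}
```

It would be used as qstat -j 3088984 | job_summary.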

If you need to see how your batch job is running (using top, ps, pidstat, strace, etc.), please first type qstat, then use qrsh (NOT SSH) to log in to the node(s) running your job(s). So if your job is running in all.q@lnx326, then log in to lnx326 using:
qrsh -q all.q@lnx326
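The two steps can be combined by turning plain qstat output into ready-to-type qrsh commands. This is a sketch under assumptions: the function name qrsh_targets is made up, and it assumes the default qstat column order (job-ID, prior, name, user, state, date, time, queue, slots) with two header lines. Because qstat may truncate long host names, the host part is cut back to its short name.

```shell
# Print one "qrsh -q queue@host" line per queue instance that is
# running a job. Reads plain `qstat` output on standard input.
qrsh_targets() {
    awk 'NR > 2 && $5 == "r" && $8 ~ /@/ {
        split($8, part, "@")        # part[1] = queue, part[2] = host
        sub(/\..*/, "", part[2])    # drop the (possibly truncated) domain
        print "qrsh -q " part[1] "@" part[2]
    }' | sort -u
}
```

It would be used as qstat | qrsh_targets. Jobs that are still waiting have no queue column and are skipped.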

Other useful Farm commands can be found on the main GridEngine wiki page; please see Useful Commands.

Notes on running jobs and use of the Farm

  • SGE is a resource RESERVATION system, not a complete ENFORCEMENT system. At this time, users can still submit jobs that use more cores and memory than were RESERVED at the time of job submission. We are continually working on improvements to ENFORCEMENT of user-requested resource limits. We will be moving to the "Slot" terminology for a group of resources (cores and memory).
  • However, jobs requiring multiple cores ARE limited to a single core if the " -pe sge_pe " flag is NOT used. Please see the Job Execution Time wiki page for an explanation of how the 48 hour runtime limit affects multi-core jobs.
  • Using the CLASSE GRID Script (which automatically lands the user's session onto an interactive Farm node), a user can set the number of cores needed: click the box to the left of "More Power", then select a value from the "Slots" dialog. The numerical value corresponds to the number of cores requested.
  • Increasing Node availability: We are eager to purchase more farm nodes as funding allows. Any projects or groups with available funds can purchase nodes and be given priority over those nodes to ensure they're available when needed.
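As an illustration of the " -pe sge_pe " flag mentioned above, here is a minimal sketch of a batch script that reserves several cores. The core count of 4, the use of OMP_NUM_THREADS, and the program name are assumptions for the example; NSLOTS is the environment variable SGE sets to the number of slots granted to the job.

```shell
#!/bin/bash
# Reserve 4 cores through the sge_pe parallel environment (4 is an example value).
#$ -pe sge_pe 4
# Run the job from the directory it was submitted from.
#$ -cwd

# NSLOTS is set by SGE inside a running job; fall back to 1 elsewhere.
CORES="${NSLOTS:-1}"

# Tell an OpenMP-style program how many threads to use (an assumption
# about the program; other software has its own thread settings).
export OMP_NUM_THREADS="$CORES"
echo "Running with $CORES core(s)"

# ./my_program   # hypothetical executable, left commented out
```

Keeping the thread count tied to NSLOTS means the job uses the number of cores it reserved, which matters because SGE does not fully enforce the reservation.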

Topic revision: 14 Aug 2019, AttilioDeFalco