6 CHTC
6.1 CHTC Office Hours & Live Help
See the Get Help Page. Rachel and Christine are our two main support people at CHTC. You can email them at chtc@cs.wisc.edu
The Zoom link for CHTC office hours is go.wisc.edu/chtc-officehours. Office hours are currently scheduled at:
- Tuesday morning: 10:30 am - 12:00 pm
- Thursday afternoon: 3:00 - 4:30 pm
6.2 Logging In
- Open PuTTY. Enter Hostname: `submit-1.chtc.wisc.edu`, Port: 22, Connection type: SSH.
- Log in with your NetID. You will have to use Duo for MFA.
- Open WinSCP. Use the same credentials as PuTTY. You can drag files between your computer and CHTC. You can also edit documents with this program.
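If you are working from a terminal with OpenSSH (e.g., macOS or Linux), a command-line alternative to PuTTY/WinSCP is sketched below. This is a convenience only, not part of the documented lab workflow.

```
# Command-line alternative to PuTTY (assumes OpenSSH is installed locally)
ssh your_netid@submit-1.chtc.wisc.edu     # replace your_netid; complete Duo MFA when prompted

# Files can be moved with scp instead of WinSCP, e.g.:
scp local_file.csv your_netid@submit-1.chtc.wisc.edu:~/
```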
6.3 Hello World and General Help
For substantial help, you can start with the CHTC home page and their list of how-to docs for HTC.
CHTC provides a Hello World overview of how to use CHTC. It is worth a review for new users.
6.4 Software
On Windows, use PuTTY for a secure shell (SSH) connection to the CHTC submit server and WinSCP for file transfer.
6.5 General CHTC Workflow
6.5.1 Making a Batch of Jobs
- Create or edit the training controls file for your batch of jobs. You should start with the demo `training_controls.R` file in the `lab_support` repo.
- Run (and create if necessary) `mak_jobs.R` for your study (see the sketch after this list). To create it, start with the demo in the `lab_support` repo. You will need to update the path to your training controls file when creating the file. `make_jobs()` is called by the `mak_jobs.R` script. It will make a new folder for your batch of jobs on the server in the `chtc` folder for your study.
- FTP all of the files in the `input` folder for this batch of jobs to the CHTC submit server.
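A minimal sketch of the `mak_jobs.R` step from a shell. The use of `Rscript` and the paths below are assumptions for illustration; the demo in the `lab_support` repo shows the actual interface.

```
# Hypothetical example only: run the job-making script for your study
cd /path/to/your_study       # hypothetical study folder
Rscript mak_jobs.R           # calls make_jobs(); creates a new batch folder in the study's chtc folder
ls chtc/                     # confirm the new batch folder (and its input files) was created
```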
6.5.2 Testing Jobs
- Make sure you have a `results` folder and an `error` folder in your home directory.
- Make sure you have all files for the batch of jobs you will train (see above) and `train.sif`.
- Edit `job_nums.csv` on the CHTC server to contain some (e.g., 3) jobs with only several configs (e.g., 5) in each bin (job/row). For example, you can create the content below by typing `printf "1,1,5\n2,6,10\n3,11,15" > job_nums.csv`:

```
1,1,5
2,6,10
3,11,15
```
- Type `condor_submit train.sub` to submit these jobs (the full test sequence is collected in the sketch after this list).
- Monitor the jobs using `condor_q` and `condor_q -hold` if needed.
- Confirm that there are data in the results files (file size > 0 for all files): `ls results/results* -lS`. This will sort files with size = 0 to the bottom of the list. There should not be any!
- Confirm that the correct number of results files were created by counting them: `ls results/results* | wc -l`
- Check for error files with error messages (file size > 0): `ls error/error* -lSr`. This will sort the non-zero files to the bottom of the list.
- If there are non-zero files, look at the error messages using the nano editor. For example, `nano error/error_1.err`
- Determine the run time for your test jobs: `condor_history $USER -limit 3` (where limit is the number of test jobs you ran). You want to bin enough configurations together in a job to have it last for 2-3 hours. You may need to run `mak_jobs.R` again with a different value for `configs_per_job` to make a new `job_nums.csv` file.
- Review the memory and disk usage for these test jobs: `condor_history $USER -limit 3 -af RequestMemory MemoryUsage RequestDisk DiskUsage`. You may need to edit `train.sub` and `training_controls.R` to increase or decrease these values. You should make the values match across both files on both the CHTC server and our server to avoid confusion later. Or just delete the batch and re-run `mak_jobs.R`.
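Putting the test run above together, the full sequence on the submit server looks roughly like this (the `-limit 3` assumes you submitted 3 test jobs):

```
condor_submit train.sub        # submit the small test batch
condor_q                       # monitor progress
condor_q -hold                 # check for held jobs

# After the test jobs finish:
ls results/results* -lS        # zero-byte results files sort to the bottom; there should be none
ls results/results* | wc -l    # count of results files should match the number of test jobs
ls error/error* -lSr           # non-zero error files sort to the bottom
nano error/error_1.err         # inspect a specific error file if needed

# Run time and resource usage for the test jobs (limit = number of test jobs)
condor_history $USER -limit 3
condor_history $USER -limit 3 -af RequestMemory MemoryUsage RequestDisk DiskUsage
```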
6.5.3 Running jobs
- Make sure you have FTPed the full `job_nums.csv` to the CHTC server to replace the test batch.
- Type `condor_submit train.sub`
- Type `condor_q` to confirm that the number of jobs that were submitted matches your expectations (see the example after this list).
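For example (the row-count check is a suggested sanity check, not part of the documented workflow; `job_nums.csv` has no header row, so its line count equals the number of jobs):

```
wc -l < job_nums.csv      # expected number of jobs
condor_submit train.sub   # submit the full batch
condor_q                  # total jobs should match the line count above
```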
6.5.4 Monitoring Running Jobs
- Run `condor_q` to monitor progress on your jobs.
- Run `condor_q -hold` to explore the reason for held jobs. If necessary, you may need to change the requested memory or disk space to get those jobs to complete (see below for more detail). `condor_q -af HoldReason` will provide more detail on why a job is held if needed.
- Run `condor_q -run` to explore running jobs.
- Run `condor_q -idle` to explore idle jobs.
- Run `condor_q -analyze` or `condor_q JobId -better-analyze` to determine why certain jobs are not running by performing an analysis on a per-machine basis for each machine in the pool. The latter command produces a more thorough analysis of complex requirements and shows the values of relevant job ClassAd attributes.
- `condor_release $USER` will release all held jobs for the user. You can substitute a batch or job number to release a subset of held jobs if needed.
- `condor_rm $USER` will remove/cancel all jobs for the user.
- You can review the error files associated with the batch of jobs to detect and understand errors as well. Sort the files so that the files with errors (i.e., file size > 0) display last by typing `ls error/error* -lSr`. If there are MANY (> 60K) files, you may need to pipe the file names through `xargs` by typing `find error -name "error_*" | xargs ls -lSr`. You can view the contents of a non-zero error file using nano, e.g., `nano error/error_1.err`
- Flock/Glide jobs will sometimes not exit properly. These jobs are put on hold (in addition to jobs with insufficient memory/disk space). You can simply release these jobs and try again: `condor_release $USER`. Of course, you should first address any problems with memory or disk usage that resulted in holds. A worked example of this diagnose-and-release loop follows this list.
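A worked example of the diagnose-and-release loop using the commands above:

```
condor_q                       # overall progress
condor_q -hold                 # which jobs are held
condor_q -af HoldReason        # more detail on why jobs are held

# Inspect error files for the batch (non-zero files sort to the bottom):
ls error/error* -lSr
nano error/error_1.err

# If the holds came from flocked/glidein jobs that did not exit cleanly
# (and not from memory/disk problems), release them so they retry:
condor_release $USER
```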
6.5.5 Removing a subset of jobs
It is possible to remove a subset of jobs using `condor_rm` with a constraint. ClusterId refers to the batch of jobs submitted using one submit file; when you type `condor_q`, ClusterId is equivalent to the Batch_Name value. To remove jobs individually, use `condor_rm JobID`. To remove a series of clusters, you can use `condor_rm -constraint 'ClusterId > 1 && ClusterId < 5'`.

To remove a subset of jobs that all fall under one ClusterId, look at the values after the period in the Job_IDs column of `condor_q`; the value after the period is the process ID (ProcId). For example, if a batch of jobs was assigned a cluster/batch name of 15307921 and you want to remove jobs 15307921.1, 15307921.2, and 15307921.3, you could use `condor_rm -constraint 'ProcId >= 1 && ProcId <= 3 && ClusterId == 15307921'` (see the combined example below).
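A minimal shell example combining the two forms above (the cluster and process IDs are illustrative):

```
# Remove whole clusters with ClusterId strictly between 1 and 5
condor_rm -constraint 'ClusterId > 1 && ClusterId < 5'

# Remove only processes 1-3 within cluster 15307921
condor_rm -constraint 'ProcId >= 1 && ProcId <= 3 && ClusterId == 15307921'
```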
6.5.6 Editing Running Jobs
At times some (or all) jobs may be held if they require more memory or disk space than was requested in the submit file (e.g., `train.sub`). You can update these parameters and then release the held jobs.
- To update memory, type `condor_qedit [batch_name or job_id] RequestMemory [memory]`. You can see the batch_name and job_ids by typing `condor_q -hold`. Memory is quantified in MB.
- To update disk space, type `condor_qedit [batch_name or job_id] RequestDisk [space]`. You can see the batch_name and job_ids by typing `condor_q -hold`. Space is quantified in MB.
- Type `condor_release $USER` to release all held jobs from the user.
- If you expect that the currently running jobs (not yet held) will also need the higher memory or disk space, you can hold them by typing `condor_hold $USER` and then release them with `condor_release $USER` so that they will now find machines with the higher levels for these parameters. A concrete example follows this list.
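For example (the cluster ID and values below are illustrative; the `condor_qedit` syntax is as described above):

```
condor_q -hold                              # find the batch_name / job ids of the held jobs
condor_qedit 16892087 RequestMemory 20000   # raise the requested memory (MB) for that cluster
condor_qedit 16892087 RequestDisk 20000     # raise the requested disk space for that cluster
condor_release $USER                        # release the held jobs so they can match again
```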
6.5.7 Transferring Jobs back to Our Server
- Confirm that you have the correct number of results files: `ls results/results_* | wc -l`
- Review the non-zero error files: `ls error/error_* -lSr` and `nano error/error_1.err`
- Concatenate all the results files: `head -n 1 results/results_1.csv > batch_results.csv; ls results/results_*.csv | xargs awk 'FNR>1' >> batch_results.csv`. See a tutorial on awk to understand its use for simple programming. See a tutorial on the use of xargs. (The full sequence is collected after this list.)
- Now FTP `batch_results.csv` and the log file to the output folder for this batch on our server.
- Do NOT delete results and error files until you have confirmed they look good by processing them in R.
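Collected into one sequence on the submit server (the final `wc -l` is a suggested sanity check on the combined file):

```
ls results/results_* | wc -l     # confirm the number of results files
ls error/error_* -lSr            # review non-zero error files (they sort to the bottom)

# Build batch_results.csv: header row from the first results file,
# then the data rows (FNR > 1 skips each file's header) from every results file
head -n 1 results/results_1.csv > batch_results.csv
ls results/results_*.csv | xargs awk 'FNR>1' >> batch_results.csv
wc -l batch_results.csv          # sanity check on the combined row count
```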
6.6 Using and Creating Containers for R and Packages
6.6.1 Full documentation from CHTC
CHTC has full documentation of these steps. We are currently using this Docker image for base R. Choose the current numbered version (e.g., 4.3.0) among the tags.
6.6.2 Use Existing Container
We have saved existing containers for training and for feature creation on our server in the CHTC folder (/CHTC/containers/train and /CHTC/containers/features). These folders include the sub and def files that were used to create these containers. The current/up-to-date container is named without an underscore. The `_#` versions are older, with the highest number being the most recent old version; these are likely not needed!
6.6.3 Using Container with your Jobs
NOTE: For now, CHTC says not to use the instructions from the above guide in your jobs' sub file (e.g., "universe = container" and "container_image ="). That is the way of the future; however, right now, to run on both the OSPool and CHTC, you'll want to use this instead:

```
universe = vanilla
+SingularityImage = "container.sif"
```

Then include the container.sif file in the `transfer_input_files` line. That should work seamlessly across the OSPool/CHTC. To be safe, you may also want to add: `requirements = (PoolName == "CHTC") || (SINGULARITY_CAN_USE_SIF)`
6.6.4 Brief Steps for Creating a New Container
Here are the steps in brief:
- Create or use an existing `build.sub` and `.def` file. See examples for train and features in the CHTC folder on the server.
- FTP these files to the CHTC submit server.
- Run an interactive session: `condor_submit -i build.sub`, where you use your sub file.
- Build the apptainer: `apptainer build train.sif train.def`, where you name your .sif file and use your def file.
- After the container is built, you can clear the cache with `apptainer cache clean -f` and then `exit` the interactive session. Your container should be returned to the root folder on the CHTC submit server (along with the log file). Consider whether to archive it on our server or if it was a one-time use for you. The full sequence is shown after this list.
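As a single sequence, using the train example file names above:

```
condor_submit -i build.sub             # request an interactive build session
# The following commands run inside the interactive session once it starts:
apptainer build train.sif train.def    # build the container from the def file
apptainer cache clean -f               # clear the apptainer cache
exit                                   # leave the session; the container is transferred back (see note above)
```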
6.7 Making package TARS
We are no longer using tars (we use containers for R instead). However, if needed in the future, CHTC offers a detailed step-by-step walkthrough for making package tars.
6.8 Open Science Pool
JJC plans to get an account directly for the OS Pool. We can access this pool through flock/glide, BUT with an account we can run twice as many jobs (using each account separately). It may still just be easier to run using multiple lab accounts on CHTC, but this is worth considering.
6.9 Common CONDOR commands
- `condor_submit train.sub` to submit a batch of jobs
- `condor_q` for quick review of the queue for submitted jobs
- `condor_q -hold` for quick review of held jobs
- `condor_q -af HoldReason` for more detailed review of held jobs
- `condor_qedit [cluster id number] RequestMemory [memory]` to increase memory. Example: `condor_qedit 16892087 RequestMemory 20000`. Can specify a specific job with .# added to the ID. Memory is specified in MB in our training control files.
- `condor_release $USER` to release all held jobs for the user (e.g., after increasing memory)
- `condor_rm [jobid]` or `condor_rm $USER` to remove a job or all jobs
- `condor_history $USER -limit 10 -af requestmemory memoryusage` is an example of reviewing recent job history (the past 10 jobs in this example) for a subset of the history parameters. Can list all the parameters (long format) for a single job using `condor_history 16892169.1 -l`, where the job id is listed explicitly.
- `condor_hold $USER` to hold all running jobs (e.g., to increase memory).
6.10 Citing CHTC
Guidance for citing CHTC in grants and papers.