Introduction

This page explains how to log in to the analysis facility via SSH, where to start working (the analysis facility filesystems and the role of each), and how to set up the ATLAS environment and other tools.

Find the UChicago main page here

Logging in to the UChicago Analysis Facility

First, sign up on the Analysis Facility website. You can use your institutional or CERN identity (lxplus username) when signing up; using the CERN identity will make the approval process smoother. Please enter your full name, your home institution, and your institutional email address. Account requests from services like Gmail, Outlook, iCloud, etc. won't be accepted.
If you don't have an ATLAS membership yet, send us an email explaining the reason for your account request and name a US-ATLAS member you are connected to.

Once your account is accepted, you need to generate an SSH key pair consisting of an SSH public key and an SSH private key, paste the content of your SSH public key into your profile, and add your identity on your local machine. Follow the instructions below:

Create your SSH key pair with the following commands (Mac/Linux):

cd ~/.ssh
# The next command prompts you for a passphrase; choose one to protect your private key against unauthorized use.
ssh-keygen -t rsa -f idrsa_uc 
cd - # go back to previous directory

This generates two files: an SSH private key named idrsa_uc and an SSH public key named idrsa_uc.pub. Upload the public key to your profile on the Analysis Facility website by pasting its content into the "SSH public key" text box.

Important: Do not copy the contents of a file that does not end in .pub. You must upload only the public (.pub) part of the key.

To print its content, run:

cat ~/.ssh/idrsa_uc.pub
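
As a quick sanity check before pasting, a small shell helper can confirm that a file looks like a public key. This is only a sketch: the function name looks_like_public_key is hypothetical, and it assumes OpenSSH's one-line public key format, in which the first field names the key type (for example ssh-rsa).

```shell
# looks_like_public_key FILE
# Succeeds only if FILE ends in .pub and its first line starts with an
# OpenSSH key type such as ssh-rsa or ssh-ed25519.
looks_like_public_key() {
    case "$1" in
        *.pub) ;;
        *) return 1 ;;
    esac
    [ -r "$1" ] || return 1
    # Read the first whitespace-separated field of the first line.
    key_type=""
    read -r key_type _rest < "$1" || [ -n "$key_type" ] || return 1
    case "$key_type" in
        ssh-*|ecdsa-sha2-*) return 0 ;;
        *) return 1 ;;
    esac
}
```

For example, `looks_like_public_key ~/.ssh/idrsa_uc.pub && cat ~/.ssh/idrsa_uc.pub` prints the key only when the check passes.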

Now, add your identity on your local machine. First, open your SSH config file (~/.ssh/config); if the file doesn't exist, create it.

# use the next line only if the file doesn't exist
touch ~/.ssh/config
# open the file and add the following lines:
ForwardAgent yes
IdentityFile ~/.ssh/idrsa_uc
# save and close the file

Finally, add your identity to the SSH agent on your local machine using the following command:

# ssh-add  path-to-private-key 
ssh-add ~/.ssh/idrsa_uc

Tip: If you get this error message while following the previous steps:

Could not open a connection to your authentication agent.

you may need to start the SSH agent with this command:

eval "$(ssh-agent -s)"

Once you have uploaded the public key and added your identity locally, it will take a little time to process your profile and add your account to the system. After ~15 minutes, you should be able to log in via SSH:

ssh -Y <username>@login.af.uchicago.edu

If it does not work, please double check that you have been approved, have uploaded your public key and have waited at least 15 minutes. If you still have an issue, feel free to reach out to us for help.
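
To avoid retyping the login command, the same settings can be stored as a Host entry in the SSH config file. The helper below is a sketch under stated assumptions: the function name and the alias uc-af are hypothetical, and ForwardX11 with ForwardX11Trusted is used as the config-file counterpart of the -Y flag.

```shell
# write_af_host_block CONFIG_FILE USERNAME
# Appends a Host entry for the analysis facility to CONFIG_FILE.
# "uc-af" is a hypothetical alias; pick any name you like.
write_af_host_block() {
    cat >> "$1" <<EOF
Host uc-af
    HostName login.af.uchicago.edu
    User $2
    IdentityFile ~/.ssh/idrsa_uc
    ForwardAgent yes
    ForwardX11 yes
    ForwardX11Trusted yes
EOF
}
```

After running `write_af_host_block ~/.ssh/config <username>`, the command `ssh uc-af` should behave like the full login command above.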

Using the batch system at UChicago

The UChicago Analysis Facility uses HTCondor for batch workloads. In a nutshell, to submit a job you will need to create an executable script and a submit file that describes your job.

Before submitting to the analysis facility

Before submitting jobs to the batch system, consider the following points so that your analysis runs as efficiently as possible and makes good use of the available resources. Make sure you understand them, and remember that you can always ask.

Checklist before submitting:

  • Which filesystem should I use to submit my jobs?
  • How much memory do my jobs need?
  • Should my jobs be submitted to the short or the long queue?
  • Check all of my jobs' requirements
  • Always check my jobs' status

  • Which filesystem to use

    By default, all jobs start in the $scratch/ directory on the worker nodes. Design your job workflow with this in mind: the job must carry out every step that you would follow when running locally, including copying the data samples you will process into the $scratch directory. Your output data must also be staged back to the shared filesystem or it will be lost, because the $scratch/ area is ephemeral, is not meant for storage, and is not backed up. When submitting jobs, use the local scratch filesystem whenever possible. This makes you a "good neighbor" to other users on the system and reduces overall stress on the shared filesystems, which can otherwise lead to slowness, downtime, etc.

We added some examples to make these ideas clearer.

Example

In the following example, data is read from Rucio, we pretend to process it, and then a small output file is copied back to the $HOME filesystem. The example assumes your X509 proxy certificate is valid and in your home directory.

We create our executable file:

#!/bin/bash
export ATLAS_LOCAL_ROOT_BASE=/cvmfs/atlas.cern.ch/repo/ATLASLocalRootBase
export ALRB_localConfigDir=$HOME/localConfig
source ${ATLAS_LOCAL_ROOT_BASE}/user/atlasLocalSetup.sh
lsetup rucio
rucio --verbose download --rse MWT2_DATADISK data16_13TeV:AOD.11071822._001488.pool.root.1

# You can run things like asetup as well
asetup AnalysisBase,21.2.81

# This is where you would do your data analysis via AnalysisBase, etc. We will
# just pretend to do that, and truncate the file to simulate generating an
# output file. This is definitely not what you want to do in a real analysis!
cd data16_13TeV
truncate --size 10MB AOD.11071822._001488.pool.root.1
cp AOD.11071822._001488.pool.root.1 $HOME/myjob.output

It gets submitted in the usual way:

Universe = vanilla

Output = myjob.$(Cluster).$(Process).out
Error = myjob.$(Cluster).$(Process).err
Log = myjob.log

Executable = myjob.sh

use_x509userproxy = true
x509userproxy = /home/lincolnb/x509proxy

request_memory = 1GB
request_cpus = 1

Queue 1

$ condor_submit myjob.sub
Submitting job(s).
1 job(s) submitted to cluster 17.
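
The stage-out step at the end of a job script can be made more defensive, so that a failed copy makes the job exit non-zero instead of silently losing output from scratch. The following is a minimal sketch; the function name stage_outputs and the destination directory used in the usage note are hypothetical.

```shell
# stage_outputs DEST_DIR FILE...
# Copies each FILE from the job's (scratch) working directory to DEST_DIR
# on the shared filesystem. Stops at the first failure.
stage_outputs() {
    dest="$1"
    shift
    mkdir -p "$dest" || return 1
    for f in "$@"; do
        cp -- "$f" "$dest/" || return 1
    done
}
```

At the end of a job script this could be called as `stage_outputs "$HOME/results" myjob.output || exit 1`.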

  • The Short and the Long queues.

Before submitting jobs, you should have a rough idea of how long each job will take to finish (an approximate time, not an exact one). In HTCondor we added a feature called the short queue, with dedicated workers that ONLY service jobs that run for less than 4 hours. To use the short queue, add the following configuration parameter to your job submit file:

+queue="short"

If your job runs longer than 4 hours, leave this parameter out of your submit file and the job will be sent to the long queue automatically. A job that keeps the short-queue parameter but runs longer than 4 hours will be placed on hold until you remove it or release it (see the HTCondor commands below).

Important: Using the short queue for short jobs whenever possible is essential for efficient use of the available resources, especially to keep the long queue available for long jobs.
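
The queue choice can be scripted when submit files are generated automatically. The helper below is a sketch: the function name queue_line is hypothetical, and it assumes the runtime estimate is given in whole hours.

```shell
# queue_line HOURS
# Prints the submit-file line for the short queue when the estimated
# runtime (whole hours) is under 4. Prints nothing otherwise, which
# leaves the job on the default long queue.
queue_line() {
    if [ "$1" -lt 4 ]; then
        printf '%s\n' '+queue="short"'
    fi
}
```

For example, `queue_line 2 >> myjob.sub` appends the short-queue parameter, while `queue_line 6 >> myjob.sub` appends nothing. The line must be added before the Queue statement of the submit file.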

  • Job memory request

    How much memory does your job need to run?
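
One possible way to answer this question is to run the job once locally and measure its peak memory, for instance with GNU time (`/usr/bin/time -v ./myjob.sh`), which reports a "Maximum resident set size (kbytes)" line. The sketch below turns such a measurement into a request in megabytes. The function name and the 20% headroom are assumptions, not facility policy.

```shell
# memory_request_mb PEAK_KB
# Converts a measured peak memory in kilobytes to a request in megabytes,
# adding 20% headroom (an assumed safety margin) and rounding up.
memory_request_mb() {
    echo $(( ($1 * 12 + 10239) / 10240 ))
}
```

A job that peaked at 1048576 kB (1 GB) yields 1229, which could be written in the submit file as request_memory = 1229M.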

How to submit jobs

You will need to define an executable script and a "submit" file that describes your job. A simple job that loads the ATLAS environment looks something like this:

Job script, called myjob.sh:

#!/bin/bash
export ATLAS_LOCAL_ROOT_BASE=/cvmfs/atlas.cern.ch/repo/ATLASLocalRootBase
export ALRB_localConfigDir=$HOME/localConfig
source ${ATLAS_LOCAL_ROOT_BASE}/user/atlasLocalSetup.sh
# at this point, you can lsetup root, rucio, athena, etc..

Submit file, called myjob.sub:

Universe = vanilla

Output = myjob.$(Cluster).$(Process).out
Error = myjob.$(Cluster).$(Process).err
Log = myjob.log

Executable = myjob.sh

request_memory = 1GB
request_cpus = 1

Queue 1

The condor_submit command is used to queue jobs:

$ condor_submit myjob.sub
Submitting job(s).
1 job(s) submitted to cluster 17.

And the condor_q command is used to view the queue:

[lincolnb@login01 ~]$ condor_q


-- Schedd: head01.af.uchicago.edu : <192.170.240.14:9618?... @ 07/22/21 11:28:26
OWNER    BATCH_NAME    SUBMITTED   DONE   RUN    IDLE  TOTAL JOB_IDS
lincolnb ID: 17       7/22 11:27      _      1      _      1 17.0

Total for query: 1 jobs; 0 completed, 0 removed, 0 idle, 1 running, 0 held, 0 suspended
Total for lincolnb: 1 jobs; 0 completed, 0 removed, 0 idle, 1 running, 0 held, 0 suspended
Total for all users: 1 jobs; 0 completed, 0 removed, 0 idle, 1 running, 0 held, 0 suspended
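
When submission is scripted, the cluster number can be captured from the condor_submit output and reused in later commands. This is a sketch that assumes the output format shown above ("1 job(s) submitted to cluster 17."); the function name cluster_id is hypothetical.

```shell
# cluster_id
# Reads condor_submit output on stdin and prints the cluster number taken
# from the "N job(s) submitted to cluster M." line.
cluster_id() {
    sed -n 's/.*submitted to cluster \([0-9][0-9]*\)\..*/\1/p'
}
```

For example, `cluster=$(condor_submit myjob.sub | cluster_id)` followed by `condor_q "$cluster"` shows only the jobs of that submission.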

Configuring your jobs to use an X509 Proxy Certificate

If you need to use an X509 proxy, e.g. to access ATLAS data, copy your X509 certificate to the Analysis Facility.

Store your certificate in $HOME/.globus and create an ATLAS VOMS proxy in the usual way:

export ATLAS_LOCAL_ROOT_BASE=/cvmfs/atlas.cern.ch/repo/ATLASLocalRootBase
export ALRB_localConfigDir=$HOME/localConfig
source ${ATLAS_LOCAL_ROOT_BASE}/user/atlasLocalSetup.sh
lsetup emi
voms-proxy-init -voms atlas -out $HOME/x509proxy

Generate the proxy on, or copy it to, the shared $HOME filesystem so that the HTCondor scheduler can find and read it. With the following additions to your submit file, HTCondor will configure the job environment automatically for X509-authenticated data access:

use_x509userproxy = true
x509userproxy = /home/YOURUSERNAME/x509proxy

E.g., in the job above for the user lincolnb:

Universe = vanilla

Output = myjob.$(Cluster).$(Process).out
Error = myjob.$(Cluster).$(Process).err
Log = myjob.log

Executable = myjob.sh

use_x509userproxy = true
x509userproxy = /home/lincolnb/x509proxy

request_memory = 1GB
request_cpus = 1

Queue 1
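
Since the scheduler must be able to read the proxy file, a simple check before submitting can catch a missing or empty file early. This is a sketch with a hypothetical function name; it does not verify that the proxy is still valid.

```shell
# proxy_is_usable FILE
# Succeeds if FILE exists, is readable, and is not empty. It does not
# check the remaining lifetime of the proxy.
proxy_is_usable() {
    [ -r "$1" ] && [ -s "$1" ]
}
```

For example: `proxy_is_usable "$HOME/x509proxy" && condor_submit myjob.sub`.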

Check these points before using the Analysis Facility System

| Topic | Warning | Tips |
|---|---|---|
| $HOME quota | Your quota at $HOME is 100GB. Be careful not to exceed it, because issues may arise, for example not being able to log in next time. | Use the command 'du -sh' to find out the actual size of your current directory. Check the table displayed at the start of your session, which shows the usage of your /home and /data directories. |
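
The du-based tip can be turned into a rough percentage of the quota. The helper below is a sketch: the function name usage_percent is hypothetical, and it assumes 1GB = 1024 * 1024 kB. Running du over a large home directory may take a while.

```shell
# usage_percent DIR QUOTA_GB
# Prints how much of QUOTA_GB the directory DIR occupies, as a whole
# percentage (rounded down).
usage_percent() {
    used_kb=$(du -sk "$1" | awk '{print $1}')
    echo $(( used_kb * 100 / ($2 * 1024 * 1024) ))
}
```

For example, `usage_percent "$HOME" 100` prints the share of the 100GB quota currently in use.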

HTCondor user's guide

Using Condor within EventLoop

If you are using EventLoop to submit your code to the Condor batch system you should replace your submission driver line with something like the following:

EL::CondorDriver driver;
job.options()->setString(EL::Job::optCondorConf, "getenv = true\naccounting_group = group_atlas.<institute>");
driver.submitOnly( job, "yourJobName");

Useful attributes for the job submission file

| Option | What is it for? |
|---|---|
| transfer_output_files | Lists the files to transfer back. When it isn't specified, HTCondor automatically transfers back all files that have been created or modified in the job's temporary working directory. |
| transfer_input_files | HTCondor transfers these input files from the machine where the job is submitted to the machine chosen to execute the job. |
| when_to_transfer_output | on_exit (default): when the job ends on its own. on_exit_or_evict: also if the job is evicted from a machine. |
| should_transfer_files | Specifies whether HTCondor should assume the existence of a file system shared by the submit machine and the execute machine. yes: always transfers files to the remote working directory. if_needed (default): accesses the files from a shared file system if possible, otherwise transfers them. no: disables file transfer. |
| arguments | Options passed to the executable from the command line. |
| periodic_remove | Removes a job when the given expression becomes true. To remove a job that has been in the queue for more than 100 hours: periodic_remove = (time() - QDate) > (100 * 3600). To remove jobs that have been running for more than two hours: periodic_remove = (JobStatus == 2) && (time() - EnteredCurrentStatus) > (2 * 3600) |
| queue | Tells HTCondor to create a job. |
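
To show how several of these attributes fit together, the sketch below writes a complete example submit file. The function name and the file names input.dat and result.root are hypothetical placeholders; the attribute names and the periodic_remove expression are taken from the table above.

```shell
# write_example_submit FILE
# Writes an example HTCondor submit file to FILE. The quoted 'EOF' keeps
# $(Cluster) and $(Process) literal so that HTCondor expands them.
write_example_submit() {
    cat > "$1" <<'EOF'
Universe = vanilla
Executable = myjob.sh
arguments = input.dat

Output = myjob.$(Cluster).$(Process).out
Error = myjob.$(Cluster).$(Process).err
Log = myjob.log

should_transfer_files = yes
when_to_transfer_output = on_exit
transfer_input_files = input.dat
transfer_output_files = result.root

# Remove the job if it has been running for more than two hours.
periodic_remove = (JobStatus == 2) && (time() - EnteredCurrentStatus) > (2 * 3600)

request_memory = 1GB
request_cpus = 1

Queue 1
EOF
}
```

Running `write_example_submit myjob.sub` produces a file that can be adapted and then passed to condor_submit.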

Useful commands to manage jobs and check their status

| Command | Description | Example |
|---|---|---|
| condor_hold | Puts jobs in the queue into the hold state. | - |
| condor_dagman | Meta-scheduler for the HTCondor jobs within a DAG (directed acyclic graph) or multiple DAGs. | - |
| condor_release | Releases jobs from the HTCondor job queue that were previously placed in the hold state. | - |
| condor_ssh_to_job JobId | Creates an ssh session to a running job. | - |
| condor_submit -interactive | Sets up the job environment and input files. It gives a command prompt where you can then start the job manually to see what happens. | - |
| condor_q | Displays information about your jobs in the queue. | - |
| condor_qedit | Modifies job attributes (check them with condor_q -long), e.g. condor_qedit Cmd = path_to_executable changes the executable. | - |
| condor_q -long | Shows the job's ClassAd attributes, for example before editing them. | - |
| condor_q -analyze 27497829 | Determines why certain jobs are not running. | - |
| condor_q -hold 16.0 | Shows the reason job 16.0 is in the hold state. | - |
| condor_q -hold user | Retrieves: ID, OWNER, HELD_SINCE, HOLD_REASON. | - |
| condor_q -nobatch | Retrieves: ID, OWNER, SUBMITTED, RUN_TIME, ST, PRI, SIZE, CMD. | - |
| condor_q -run | Retrieves: ID, OWNER, SUBMITTED, RUN_TIME, HOST(S). | - |
| condor_q -factory -long | Shows factory information and the jobMaterializationPauseReason attribute. | - |
| condor_tail xx.yy | Displays the last bytes of a file in the sandbox of a running job. | - |

Using Docker / Singularity containers (Advanced)

Some users may want to bring their own container-based workloads to the Analysis Facility. We support both Docker-based jobs as well as Singularity-based jobs. Additionally, the CVMFS repository unpacked.cern.ch is mounted on all nodes.

If, for whatever reason, you wanted to run a Debian Linux-based container on the Analysis Facility, it would be as simple as the following job file:

universe                = docker
docker_image            = debian
executable              = /bin/cat
arguments               = /etc/hosts
should_transfer_files   = YES
when_to_transfer_output = ON_EXIT
output                  = out.$(Process)
error                   = err.$(Process)
log                     = log.$(Process)
request_memory          = 1000M
queue 1

Similarly, if you would like to run a Singularity container, such as the ones provided in the unpacked.cern.ch CVMFS repo, you can submit a normal vanilla universe job with a job executable that looks something like the following:

#!/bin/bash
singularity run -B /cvmfs -B /home /cvmfs/unpacked.cern.ch/registry.hub.docker.com/atlas/rucio-clients:default rucio --version

Replace the rucio-clients:default container and the rucio --version executable with your preferred software.
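
Swapping in another image and command can be sketched as a small helper that builds the same command line. The function name container_command is hypothetical, and the helper assumes the image lives under the registry.hub.docker.com tree of unpacked.cern.ch, as in the example above.

```shell
# container_command IMAGE COMMAND
# Prints a singularity command line that runs COMMAND inside IMAGE from
# the unpacked.cern.ch Docker Hub tree, binding /cvmfs and /home.
container_command() {
    printf 'singularity run -B /cvmfs -B /home %s %s\n' \
        "/cvmfs/unpacked.cern.ch/registry.hub.docker.com/$1" "$2"
}
```

For example, `container_command atlas/rucio-clients:default 'rucio --version'` prints the command used in the job executable above, which can then be pasted into a job script.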