Introduction
Here you will find instructions on how to log in to the analysis facility via SSH, where to start working (the analysis facility filesystems and the role of each), how to set up the ATLAS environment, and other tools.
Find the UChicago main page here
Logging in to the UChicago Analysis Facility
First, you will need to sign up on the Analysis Facility website.
You can use your institutional or CERN identity (lxplus username) when signing up; using the CERN identity will make the approval process smoother. Please enter your full name, your home institution name, and your institutional email. Account requests from services like Gmail, Outlook, iCloud, etc. won't be accepted.
If you don't have an ATLAS membership yet, just send us an email explaining the reasons for your account request and mention any connection to a US-ATLAS member.
Once your account is accepted, you need to generate an SSH key pair, consisting of an SSH public key and an SSH private key. Then paste the content of your SSH public key into your profile and add your local machine identification to the site. Follow the instructions below.
Create your SSH key pair via the following commands (Mac/Linux):
mkdir -p ~/.ssh  # create the directory if it doesn't exist yet
cd ~/.ssh
# the next command prompts you to enter a passphrase, specify a passphrase of your choice to protect your private key against unauthorized use.
ssh-keygen -t rsa -f idrsa_uc
cd - # go back to previous directory
This generates 2 files: an SSH private key named idrsa_uc and an SSH public key named idrsa_uc.pub. Upload the resulting SSH public key to your profile on the Analysis Facility website by pasting its content into the "SSH public key" text box.
Important!: Do not copy the contents of a file that does not end in .pub. You must only upload the public (.pub) part of the key.
To print its content do:
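A minimal way to do that, assuming the key file name used above:

```shell
# Print the public key; copy the full output into the "SSH public key" box
cat ~/.ssh/idrsa_uc.pub
```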
Now, add your identification from your local machine to the site:
First, open your ~/.ssh/config file; if the file doesn't exist, just create it.
# use the next line only if the file doesn't exist
touch ~/.ssh/config
# open the file and add the following lines:
ForwardAgent yes
IdentityFile ~/.ssh/idrsa_uc
# save and close the file
Finally, add your identification from your local machine using the following command:
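The command here is presumably ssh-add (a sketch, assuming the key name used above):

```shell
# Register the private key with your local SSH agent
ssh-add ~/.ssh/idrsa_uc
```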
`Tip`: If, while following the previous steps, you get this error message:
Could not open a connection to your authentication agent.
You may need to start the `SSH-agent`; you can use this command:
eval "$(ssh-agent -s)"
Once you have uploaded the public key and added your local identification to the site, it will take a little time to process your profile and add your account to the system. After ~15 minutes, you should be able to log in via SSH:
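For example (a sketch: the hostname below is an assumption inferred from the condor_q output later on this page; check the Analysis Facility website for the actual login address, and replace <username> with your own):

```shell
# NOTE: login01.af.uchicago.edu is an assumed hostname; verify it on the site
ssh -i ~/.ssh/idrsa_uc <username>@login01.af.uchicago.edu
```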
If it does not work, please double-check that you have been approved, have uploaded your public key, and have waited at least 15 minutes. If you still have an issue, feel free to reach out to us for help.
Using the batch system at UChicago
The UChicago Analysis Facility uses HTCondor for batch workloads.
In a nutshell, to submit a job you will need to create an executable script and a submit file that describes your job.
Before submitting to the analysis facility
Before going mad submitting jobs to the batch system, check the following points so that you perform your analysis/work as efficiently as possible and make the best use of the available resources. Be sure to understand them, and remember that you can always ask.
Checklist before submitting:
- Which filesystem should be used to submit my jobs?
- How much memory do my job(s) need?
- Should my job(s) be submitted to the short or the long queue?
- Check all my jobs' requirements
- Always check my jobs' status
Which filesystem to use
By default, all jobs start in the $scratch/ directory on the worker nodes. This means you have to design the workflow for your jobs keeping in mind that they will start in the $scratch directory as soon as you submit them, and spell out every step that you yourself follow when running your jobs locally (that also includes copying the data samples you will run on into the $scratch directory). Also, notice that your output data will need to be staged out to the shared filesystem or it will be lost, since the $scratch/ area is ephemeral, not meant for storage, and not backed up. When submitting jobs, you should try to use the local scratch filesystem whenever possible. This will help you be a "good neighbor" to other users on the system and reduce overall stress on the shared filesystems, which can otherwise lead to slowness, downtimes, etc.
We added some examples to make these ideas clearer.
Check the following example: data is read from Rucio, we pretend to process it, and then a small output is copied back to the $HOME filesystem. It assumes your X509 proxy certificate is valid and in your home directory.
We create our executable file:
#!/bin/bash
export ATLAS_LOCAL_ROOT_BASE=/cvmfs/atlas.cern.ch/repo/ATLASLocalRootBase
export ALRB_localConfigDir=$HOME/localConfig
source ${ATLAS_LOCAL_ROOT_BASE}/user/atlasLocalSetup.sh
lsetup rucio
rucio --verbose download --rse MWT2_DATADISK data16_13TeV:AOD.11071822._001488.pool.root.1
# You can run things like asetup as well
asetup AnalysisBase,21.2.81
# This is where you would do your data analysis via AnalysisBase, etc. We will
# just pretend to do that, and truncate the file to simulate generating an
# output file. This is definitely not what you want to do in a real analysis!
cd data16_13TeV
truncate --size 10MB AOD.11071822._001488.pool.root.1
cp AOD.11071822._001488.pool.root.1 $HOME/myjob.output
It gets submitted in the usual way:
Universe = vanilla
Output = myjob.$(Cluster).$(Process).out
Error = myjob.$(Cluster).$(Process).err
Log = myjob.log
Executable = myjob.sh
use_x509userproxy = true
x509userproxy = /home/lincolnb/x509proxy
request_memory = 1GB
request_cpus = 1
Queue 1
The Short and the Long queues
Before submitting jobs you should have an idea of how long each job will take to finish (not the exact time, but an approximation).
In HTCondor we added a feature called shortqueue, with dedicated workers that will ONLY service jobs that run for less than 4 hours.
To make use of the shortqueue you just have to add the corresponding configuration parameter to your job submit file. Jobs are not moved to the long queue automatically; if your job does not fit the short queue, it will be placed on hold until you remove or release it (check the HTCondor commands below).
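As an illustration only (the attribute name below is an assumption, not confirmed by this page; check the facility documentation for the exact parameter), a short-queue submit file would add a line such as:

```
# Hypothetical short-queue selector; verify the exact attribute name
+queue = "short"
```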
Important: using the short queue for short jobs whenever possible is essential for making good use of the available resources, especially to keep the long queue free for long jobs.
Job memory request
How much memory does your job need to run?
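One practical way to answer this (a sketch; the cluster ID 17 is illustrative) is to run a single test job, then inspect its measured memory usage with condor_history and set request_memory in later submissions to the reported value plus some headroom:

```shell
# Print the measured memory usage (MB) and the requested memory
# for a completed test job, e.g. cluster 17
condor_history 17 -af MemoryUsage RequestMemory
```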
How to submit jobs
You will need to define an executable script and a "submit" file that describes your job. A simple job that loads the ATLAS environment looks something like this:
Job script, called myjob.sh:
#!/bin/bash
export ATLAS_LOCAL_ROOT_BASE=/cvmfs/atlas.cern.ch/repo/ATLASLocalRootBase
export ALRB_localConfigDir=$HOME/localConfig
source ${ATLAS_LOCAL_ROOT_BASE}/user/atlasLocalSetup.sh
# at this point, you can lsetup root, rucio, athena, etc..
The corresponding submit file:
Universe = vanilla
Output = myjob.$(Cluster).$(Process).out
Error = myjob.$(Cluster).$(Process).err
Log = myjob.log
Executable = myjob.sh
request_memory = 1GB
request_cpus = 1
Queue 1
After submitting the job with condor_submit, you can check its status with condor_q:
[lincolnb@login01 ~]$ condor_q
-- Schedd: head01.af.uchicago.edu : <192.170.240.14:9618?... @ 07/22/21 11:28:26
OWNER BATCH_NAME SUBMITTED DONE RUN IDLE TOTAL JOB_IDS
lincolnb ID: 17 7/22 11:27 _ 1 _ 1 17.0
Total for query: 1 jobs; 0 completed, 0 removed, 0 idle, 1 running, 0 held, 0 suspended
Total for lincolnb: 1 jobs; 0 completed, 0 removed, 0 idle, 1 running, 0 held, 0 suspended
Total for all users: 1 jobs; 0 completed, 0 removed, 0 idle, 1 running, 0 held, 0 suspended
Configuring your jobs to use an X509 Proxy Certificate
If you need to use an X509 Proxy, e.g. to access ATLAS Data, you will want to copy your X509 certificate to the Analysis Facility.
Store your certificate in $HOME/.globus
and create an ATLAS VOMS proxy in the usual way:
export ATLAS_LOCAL_ROOT_BASE=/cvmfs/atlas.cern.ch/repo/ATLASLocalRootBase
export ALRB_localConfigDir=$HOME/localConfig
source ${ATLAS_LOCAL_ROOT_BASE}/user/atlasLocalSetup.sh
lsetup emi
voms-proxy-init -voms atlas -out $HOME/x509proxy
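Before submitting, you can sanity-check the proxy created above (the -file and -timeleft options are standard voms-proxy-info options):

```shell
# Show how many seconds remain before the proxy expires
voms-proxy-info -file $HOME/x509proxy -timeleft
```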
Then reference the proxy in your submit file:
Universe = vanilla
Output = myjob.$(Cluster).$(Process).out
Error = myjob.$(Cluster).$(Process).err
Log = myjob.log
Executable = myjob.sh
use_x509userproxy = true
x509userproxy = /home/lincolnb/x509proxy
request_memory = 1GB
request_cpus = 1
Queue 1
Check these points before using the Analysis Facility System
Topic | Warning | Tips |
---|---|---|
$HOME quota | Your quota at $HOME is 100GB. Be careful not to exceed it, because issues may arise, for example not being able to log in next time. | Use the command 'du -sh' to check the size of your current directory. Check the table displayed at the start of your session, which shows the usage of your /home and /data directories. |
HTCondor user's guide
Using Condor within EventLoop
If you are using EventLoop to submit your code to the Condor batch system you should replace your submission driver line with something like the following:
EL::CondorDriver driver;
job.options()->setString(EL::Job::optCondorConf, "getenv = true\naccounting_group = group_atlas.<institute>");
driver.submitOnly( job, "yourJobName" );
Useful attributes for the job submission file
option | What is it for? |
---|---|
transfer_output_files | When it isn't specified, HTCondor automatically transfers back all files that were created or modified in the job's temporary working directory. |
transfer_input_files | HTCondor transfers the listed input files from the machine where the job is submitted to the machine chosen to execute the job. |
when_to_transfer_output | ON_EXIT (default): transfer when the job exits on its own. ON_EXIT_OR_EVICT: also transfer if the job is evicted from a machine. |
should_transfer_files | Specifies whether HTCondor should assume the existence of a file system shared by the submit and execute machines. YES: always transfer files to the remote working directory. IF_NEEDED (default): access files via a shared file system if possible, otherwise transfer them. NO: disable file transfer. |
arguments | Command-line options passed to the executable. |
periodic_remove | Expression evaluated periodically; the job is removed when it becomes true. E.g. remove a job that has been in the queue for more than 100 hours: periodic_remove = (time() - QDate) > (100 * 3600). Or remove jobs that have been running for more than two hours: periodic_remove = (JobStatus == 2) && (time() - EnteredCurrentStatus) > (2 * 3600). |
queue | Tells HTCondor to create a job. |
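Putting several of these attributes together, a submit file might look like the following sketch (the file names, input file, and 100-hour limit are illustrative, not prescribed):

```
Universe                = vanilla
Executable              = myjob.sh
Arguments               = --input data.root
transfer_input_files    = data.root
should_transfer_files   = IF_NEEDED
when_to_transfer_output = ON_EXIT
# remove the job if it has been in the queue for more than 100 hours
periodic_remove         = (time() - QDate) > (100 * 3600)
request_memory          = 1GB
request_cpus            = 1
Queue 1
```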
Useful commands to manage and check job status
Command | Description | Example |
---|---|---|
condor_hold | Puts jobs in the queue into the hold state. | - |
condor_dagman | Meta-scheduler for HTCondor jobs within a DAG (directed acyclic graph) or multiple DAGs. | - |
condor_release | Releases jobs from the HTCondor job queue that were previously placed in the hold state. | - |
condor_ssh_to_job JobId | Creates an ssh session to a running job. | - |
condor_submit -interactive | Sets up the job environment and input files, and gives a command prompt where you can start the job manually to see what happens. | - |
condor_q | Displays information about your jobs in the queue. | - |
condor_qedit | Modifies job attributes. | condor_q -long to inspect, then condor_qedit Cmd = path_to_executable to change it |
condor_q -long | Shows a job's ClassAd attributes (useful before editing them). | - |
condor_q -analyze 27497829 | Determines why certain jobs are not running. | - |
condor_q -hold 16.0 | Shows the reason job 16.0 is in the hold state. | - |
condor_q -hold user | Retrieves: ID, OWNER, HELD_SINCE, HOLD_REASON | - |
condor_q -nobatch | Retrieves: ID, OWNER, SUBMITTED, RUN_TIME, ST, PRI, SIZE, CMD | - |
condor_q -run | Retrieves: ID, OWNER, SUBMITTED, RUN_TIME, HOST(S) | - |
condor_q -factory -long | Shows factory information and the JobMaterializationPauseReason attribute. | - |
condor_tail xx.yy | Displays the last bytes of a file in the sandbox of a running job. | - |
Using Docker / Singularity containers (Advanced)
Some users may want to bring their own container-based workloads to the Analysis Facility. We support both Docker-based jobs as well as Singularity-based jobs. Additionally, the CVMFS repository unpacked.cern.ch is mounted on all nodes.
If, for whatever reason, you wanted to run a Debian Linux-based container on the Analysis Facility, it would be as simple as the following job file:
universe = docker
docker_image = debian
executable = /bin/cat
arguments = /etc/hosts
should_transfer_files = YES
when_to_transfer_output = ON_EXIT
output = out.$(Process)
error = err.$(Process)
log = log.$(Process)
request_memory = 1000M
queue 1
Similarly, you can run Singularity containers from the unpacked.cern.ch repository. For example, the following executable script prints the Rucio client version from inside a container:
#!/bin/bash
singularity run -B /cvmfs -B /home /cvmfs/unpacked.cern.ch/registry.hub.docker.com/atlas/rucio-clients:default rucio --version
You can replace the rucio-clients:default container and the rucio --version executable with your preferred software.
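To run the Singularity example above as a batch job, a minimal submit file could look like this (a sketch; it assumes the wrapper script is saved as rucio_singularity.sh and made executable):

```
Universe       = vanilla
Executable     = rucio_singularity.sh
Output         = singularity.$(Cluster).$(Process).out
Error          = singularity.$(Cluster).$(Process).err
Log            = singularity.log
request_memory = 1GB
request_cpus   = 1
Queue 1
```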