Conduct & Best Practice
- Be kind to other users: don’t run too many jobs at once or fill up the scratch drive.
- Don’t run jobs on the login node; submit them to the cluster with Slurm instead. See the Login Node Usage Policy and Limits section below.
- Don’t run serial jobs in the compute partition on Hawk; use the ‘htc’ (high throughput) partition instead. The system will attempt to auto-redirect such jobs if you do submit them, but please submit directly to the most appropriate partition.
- Sunbird has no HTC partition and its compute partition is for both serial and parallel jobs.
- Only run jobs which need a GPU on the ‘gpu’ partition.
- Only run jobs that need high memory on the ‘highmem’ partition.
- Sunbird has no separate ‘highmem’ partition; all nodes have 8GB per core.
- Use the test/dev partition(s) for testing short runs of your software.
- If you are a member of multiple projects, use the -A option to sbatch to ensure your job is accounted against the right project (there is an example job script after this list).
- Try to match your job specification to its actual needs. Don’t request more memory, CPU cores or time than you need.
- If you request exclusive use of a node, make sure you are using all the cores on it: 40 on all Intel machines, 64 on all AMD machines. We have relatively few nodes, each with a lot of cores.
- Make sure that jobs last at least a few minutes. Small jobs add a lot of overhead to the scheduler.
- Use GNU Parallel to combine multiple small jobs into a single larger one. See Batch Submission of Serial Tasks, and the sketch after this list.
- Don’t store data on scratch; it may be deleted after 60 days of non-access.
- Don’t use SCW as a long-term data store; home directories aren’t backed up. If you need long-term storage, ask your institution.
- Publish your data and somebody else will store it for you!
- Data sets up to 50 gigabytes can be published via Zenodo.
- Data, code and protocols can be submitted for peer review and publication via GigaScience.
- Credit SCW in your publications. See this FAQ item.
- Tell SCW staff if you publish a paper from work that used the system, or if you are awarded a grant that will use it.
- Be aware of the Message of the Day (displayed at SSH login) and email notifications from SCW regarding maintenance outages, at-risk periods, and any system problems.
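As a rough sketch of several of the points above (the project code scw1234, module name, partition and resource figures are illustrative placeholders, not values to copy), a job script that requests only what it needs and charges the right project might look like:

```bash
#!/bin/bash
#SBATCH --job-name=my_analysis
#SBATCH --account=scw1234   # charge the job to the right project (same as -A)
#SBATCH --partition=htc     # serial work on Hawk belongs on 'htc'
#SBATCH --ntasks=1          # request only the cores you need
#SBATCH --mem=4G            # only the memory you need
#SBATCH --time=02:00:00     # and a realistic wall-time limit

module load my_application  # placeholder module name
srun ./my_analysis input.dat
```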
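The full recipe for combining serial tasks is in Batch Submission of Serial Tasks; as a rough sketch only (the project code, script and input file names are illustrative), GNU Parallel can run many short serial tasks inside a single job rather than flooding the scheduler with tiny jobs:

```bash
#!/bin/bash
#SBATCH --account=scw1234   # placeholder project code
#SBATCH --partition=htc
#SBATCH --ntasks=8          # run up to eight tasks concurrently in one job
#SBATCH --time=01:00:00

module load parallel        # the module name may differ on your system

# One short serial task per input file, at most $SLURM_NTASKS at a time.
parallel -j "$SLURM_NTASKS" ./process_one.sh {} ::: inputs/*.dat
```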
Login Node Usage Policy
The login nodes of SCW systems are the machines used interactively via login to:
- hawklogin.cf.ac.uk
- hawkloginamd.cf.ac.uk
- sunbird.swan.ac.uk
These machines act as the interaction point for all users, from which data & script preparation, job submission, post-processing, etc. are performed.
In contrast to the compute nodes, they are few in number and are shared between all users. As such, one user can negatively impact the experience of many if they use the login nodes inappropriately.
Starting in September 2021, SCW will be implementing additional controls to help prevent login node abuse and ensure fair sharing of available resources, by enforcing user limits with the Arbiter2 system developed at the University of Utah’s Center for High Performance Computing.
Login Node User Limits
Each user will be limited to the concurrent use of at most 8 processor cores and 200GB of memory by default.
However, should heavy usage be made of this allocation, the user’s access will be reduced for a penalty period.
‘Heavy usage’ is defined as using more than 75% of the allocation continuously for 3 minutes.
When the penalty time expires, the user’s allocation will revert to its prior state. However, repeated abuse will result in a harsher penalty being applied for a longer period. There are three levels of penalty that an excessive user will be moved up and down through:
| Penalty Level | CPU & Memory Limit | Penalty Timeout |
|---|---|---|
| None | 8 processor cores and 200GB memory | N/A |
| 1 | 50%, i.e. 4 processor cores and 100GB memory | 3 minutes |
| 2 | 30%, i.e. 2.4 processor cores and 60GB memory | 10 minutes |
| 3 | 10%, i.e. 0.8 processor cores and 20GB memory | 30 minutes |
Should a user re-offend within 3 hours of a penalty expiring, they will be returned to the penalty level they were previously at. Should they not re-offend within 3 hours, their penalty level will be reduced by one.
Note that some tasks that are intended to run on the login nodes, such as compilers, are whitelisted and do not count towards usage measurement.
When a user is moved into a penalty state, the system will email the user to say so. This email will be sent to the on-file email address (typically the user’s institutional email) and will include details of the abuse and penalty being applied. A further email will follow when the penalty is removed.