July 2015 Technical Update
As previously announced, Phase 1 of HPC Wales was formally funded until the end of May 2015. The HEIs are now heavily involved in planning Phase 2, and the HPC Wales service is currently in a transition phase, offering continued access & support to researchers through a slimmed-down organisation.
There are a number of technical changes that will occur over the coming months which directly affect users of the service. These changes are being made to increase capability & efficiency, reduce costs and prepare for Phase 2.
We will confirm exact details in future communications (including the service Message Of The Day), but the following pieces of work will interrupt services at various times:
1. We will be replacing the home file system on our systems. The current product has now reached the end of its life and we will deploy a current generation replacement. This will enable us to make additional storage space available to end users and to better manage that space. At the same time, we will restructure storage hardware in order to reduce the potential impact of failures. This will require a couple of short service outages at each site.
2. All our compute systems will be upgraded to the current release of the operating system, CentOS 6.6, offering a slight performance improvement, up-to-date software libraries and improved reliability. As part of this, all systems will be rebuilt using a new cluster management system that will also provide increased capabilities for the future. This will require a single longer service outage at each site.
3. We will replace the job scheduler, LSF, with an alternative (most likely SLURM). The scheduler is the software that accepts job submissions into queues and then executes them on available resources. The new system will offer a simpler, more manageable, higher-performing and fairer service. This work will be undertaken at the same time as the operating system upgrades mentioned above. All users will need to take action at this point: every job script will need to be modified, and a new set of commands will be needed. The large majority of jobs will be very easy to transition (we are hoping to provide some sample scripts to assist with this), but a few will be more demanding. To aid in this process we will make documentation & guidance available, along with a test system, as soon as possible.
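To give a feel for the kind of job script change involved, here is a minimal sketch in Python that rewrites a handful of common LSF `#BSUB` directives as SLURM `#SBATCH` directives. It assumes SLURM is the scheduler eventually chosen, which is not yet confirmed, and it is not an official HPC Wales tool. The function names and the selection of options are illustrative assumptions; anything it does not recognise is flagged for manual attention rather than guessed at.

```python
import re

# Assumed mapping of a few common LSF options to their SLURM equivalents.
# This is an illustrative subset, not a complete translation table.
BSUB_TO_SBATCH = {
    "-J": "--job-name",
    "-n": "--ntasks",
    "-o": "--output",
    "-e": "--error",
    "-q": "--partition",
    "-W": "--time",
}


def convert_walltime(value):
    """LSF -W takes [hours:]minutes, whereas SLURM would read H:M as
    minutes:seconds, so an hours:minutes value is expanded to H:M:S."""
    if ":" in value:
        hours, minutes = value.split(":")
        return f"{int(hours):02d}:{int(minutes):02d}:00"
    return value  # a bare number means minutes to both schedulers


def convert_line(line):
    """Convert one '#BSUB <option> <value>' line; leave other lines alone."""
    match = re.match(r"^#BSUB\s+(-\w+)\s+(.+?)\s*$", line)
    if not match:
        return line
    option, value = match.groups()
    if option not in BSUB_TO_SBATCH:
        # Resource strings and similar need a human decision.
        return f"# TODO (no automatic mapping): {line}"
    if option == "-W":
        value = convert_walltime(value)
    elif option in ("-o", "-e"):
        # The job ID placeholder is %J in LSF and %j in SLURM.
        value = value.replace("%J", "%j")
    return f"#SBATCH {BSUB_TO_SBATCH[option]}={value}"


def convert_script(text):
    """Convert every line of a job script held in a string."""
    return "\n".join(convert_line(line) for line in text.splitlines())
```

For example, `convert_line("#BSUB -W 2:30")` gives `#SBATCH --time=02:30:00`. The commands change in a similar way: under SLURM, `sbatch`, `squeue` and `scancel` take the place of `bsub`, `bjobs` and `bkill`, and `sbatch` is given the script file name directly rather than reading it from standard input. Queue names are site-specific, so any `-q` value would need checking against the new configuration.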
Before any of the three pieces of work listed above begins, however, a number of site systems will be decommissioned. Although this will reduce available computational capacity, we will be deploying additional hub capacity where necessary. The Aberystwyth and Glamorgan compute nodes will become unavailable from 1st August 2015, and we will close and drain all queues in advance of this. To preserve access to user data stored at either site, the login nodes will remain available for one further month, until the end of August 2015.
We will work to minimise the impact of these changes on the availability of the service by keeping concurrent interruptions to a minimum. We will also, of course, work with any user to ease the burden of any of the changes, so don’t hesitate to contact us via the Support Desk (firstname.lastname@example.org) with any requests.
We will be following up with regular emails over the coming weeks and months, so please watch this space.