Scheduled Maintenance
General Policy
- Scheduled maintenance always starts on a Tuesday in a normal working week.
The Monday is then a regular workday that can be used to prepare the maintenance, and the remainder of the week consists of regular workdays to resolve potential problems.
The only exception is maintenance at the end of August, which may overlap with the last week of the summer break.
- Maintenance is scheduled in two rounds with one month in between.
In the first round one cluster with related infra is serviced, followed by 3 weeks of acceptance testing.
If the acceptance tests are successful, the second cluster with related infra is serviced in the second round.
If the infra serviced in the first round fails the acceptance tests, the second round is delayed and rescheduled.
- Which parts of a redundant setup are serviced in the first round and which in the second round is determined about 1 month before the first round.
- All clusters have maintenance scheduled twice a year.
- Maintenance is executed using the checklist below.
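The Tuesday rule and the gap between the two rounds can be checked mechanically when proposing new dates. The following is a minimal sketch, not part of the official procedure: the function names are hypothetical, and it does not check for holidays or otherwise verify that a week is a "normal working week".

```python
from datetime import date

TUESDAY = 1  # date.weekday() counts Monday as 0, so Tuesday is 1


def starts_on_tuesday(day: date) -> bool:
    """Return True if the proposed maintenance date is a Tuesday."""
    return day.weekday() == TUESDAY


def days_between_rounds(first: date, second: date) -> int:
    """Return the number of days between round one and round two."""
    return (second - first).days


# Example using the Winter 2019 dates from the schedule on this page.
rounds = {"One": date(2019, 2, 12), "Two": date(2019, 3, 5)}
for name, day in rounds.items():
    print(name, day.isoformat(), "Tuesday:", starts_on_tuesday(day))
print("Days between rounds:", days_between_rounds(rounds["One"], rounds["Two"]))
```

A proposed date that fails the Tuesday check is not necessarily wrong (a round may be postponed by a day), but it should be a deliberate exception.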
UMCG/LifeLines Research Clusters
Infra involved
- User accounts from 'umcg' and 'll' IDVault entitlements.
- Calculon and Boxy clusters including UIs, nodes, shared storage and OpenStack cloud servers hosting, among others, scheduler and proxy VMs.
- Separate interactive servers: Flexo & Bender.
Schedule
Year | Season | Round | Date | Infra |
---|---|---|---|---|
2016 | Summer | One | Aug 23 | Calculon |
2016 | Summer | Two | Sep 20 | Boxy |
2016 | Summer | Three | Sep 30 | Calculon: no maintenance, but downtime due to backup-power test @ DUO data center |
2017 | Winter | One | Feb 07 | Calculon |
2017 | Winter | Two | Mar 07 | Boxy |
2017 | Winter | Three | Mar 09 | Calculon, Flexo, Bender: network maintenance @ DUO data center |
2017 | Summer | One | Aug 22 | Calculon, Flexo, Bender and Lobby |
2017 | Summer | Two | Sep 20 | Boxy, Flexo, Bender and Foyer; originally planned for Sep 19, but postponed by one day |
2018 | Winter | One | Feb 06 | Cancelled |
2018 | Winter | Two | Mar 06 | Cancelled |
2018 | Summer | One | Aug 21 | Calculon, Flexo, Bender and Lobby |
2018 | Summer | Two | Sep 18 | Boxy and Foyer |
2019 | Winter | One | Feb 12 | Boxy |
2019 | Winter | Two | Mar 05 | Calculon + Flexo + Bender |
2019 | Summer | One | Aug 27 | T.B.A. |
2019 | Summer | Two | Sep 24 | T.B.A. |
2020 | Winter | One | Feb 04 | T.B.A. |
2020 | Winter | Two | Mar 03 | T.B.A. |
2020 | Summer | One | Aug 25 | T.B.A. |
2020 | Summer | Two | Sep 22 | T.B.A. |
2021 | Winter | One | Feb 02 | T.B.A. |
2021 | Winter | Two | Mar 02 | T.B.A. |
2021 | Summer | One | Aug 31 | T.B.A. |
2021 | Summer | Two | Sep 21 | T.B.A. |
2022 | Winter | One | Feb 08 | T.B.A. |
2022 | Winter | Two | Mar 08 | T.B.A. |
2022 | Summer | One | Aug 30 | T.B.A. |
2022 | Summer | Two | Sep 20 | T.B.A. |
Genome Diagnostics Clusters
Infra involved
- User accounts from 'GD' IDVault entitlement.
- Zinc-Finger and Leucine-Zipper clusters including UIs, nodes, schedulers, data sharing servers and shared storage.
- Separate pre-processing servers: Gattaca*.
Schedule
Note: first round of scheduled maintenance always coincides with emergency power tests @ UMCG.
Year | Season | Round | Date | Infra |
---|---|---|---|---|
2016 | Fall | Zero | Sep 30 | Gattaca01 + Leucine-Zipper: no maintenance, but downtime due to backup-power test @ DUO data center |
2016 | Fall | One | Oct 04 | Gattaca02 + Zinc-Finger |
2016 | Fall | Two | Nov 08 | Gattaca01 + Leucine-Zipper (originally scheduled for Nov 01, delayed by one week) |
2017 | Winter | Extra | Mar 09 | Gattaca01 + Leucine-Zipper: network maintenance @ DUO data center |
2017 | Spring | One | Apr 04 | Cancelled |
2017 | Spring | Two | May 02 | Cancelled |
2017 | Fall | One | Oct 03 | Gattaca01 + Leucine-Zipper |
2017 | Fall | Two | Oct 31 | Gattaca02 + Zinc-Finger |
2018 | Spring | One | Jun 05 | Gattaca01 + Leucine-Zipper (delayed maintenance, originally scheduled for Apr 10) |
2018 | Spring | Two | Jul 04 | Gattaca02 + Zinc-Finger (delayed maintenance, originally scheduled for May 08) |
2018 | Fall | One | Oct 02 | Gattaca01 + Leucine-Zipper |
2018 | Fall | Two | Oct 30 | Gattaca02 + Zinc-Finger |
2019 | Spring | One | Apr 02 | T.B.A. |
2019 | Spring | Two | May 07 | T.B.A. |
2019 | Fall | One | Oct 01 | T.B.A. |
2019 | Fall | Two | Oct 29 | T.B.A. |
2020 | Spring | One | Apr 07 | T.B.A. |
2020 | Spring | Two | May 12 | T.B.A. |
2020 | Fall | One | Oct 06 | T.B.A. |
2020 | Fall | Two | Nov 03 | T.B.A. |
2021 | Spring | One | Apr 13 | T.B.A. |
2021 | Spring | Two | May 11 | T.B.A. |
2021 | Fall | One | Oct 05 | T.B.A. |
2021 | Fall | Two | Nov 02 | T.B.A. |
2022 | Spring | One | Apr 05 | T.B.A. |
2022 | Spring | Two | May 10 | T.B.A. |
2022 | Fall | One | Oct 04 | T.B.A. |
2022 | Fall | Two | Nov 01 | T.B.A. |
Checklist
- Create list of all machines that will be serviced during this round of maintenance. Use the infra catalogue to make sure the list is complete.
- Create checklist of what needs to be performed for which machines on the RUG CIT Redmine Wiki.
- Determine which analysis pipelines may be affected by the maintenance and will require a verification or validation experiment.
Add the required verification/validation experiments to the checklist on the RUG CIT Redmine Wiki.
- Announce maintenance on the mailing list.
- Perform maintenance
- Check if all items of the checklist on the RUG CIT Redmine Wiki have been executed.
- Check if all machines that can process Slurm jobs work as expected by submitting a CheckEnvironment.sh test job to each compute node that was affected by the maintenance.
See Analysis SOP -> FAQs -> Q: How do I know what environment is available to my job on an execution host?
- Perform verification/validation experiments as listed in the checklist on the RUG CIT Redmine Wiki, using the corresponding SOP from docportal for this round of maintenance. Fill out the corresponding forms and send them to the product owners.
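Submitting the CheckEnvironment.sh test job to each serviced compute node can be scripted. The sketch below is an illustration only: the node names, job names and output file names are hypothetical, and it assumes CheckEnvironment.sh is in the current directory and can be submitted as a batch script without further arguments. It uses the standard sbatch options --nodelist, --job-name and --output to pin one job to each node, and by default only prints the commands instead of submitting them.

```python
import subprocess


def build_sbatch_command(node: str, script: str = "CheckEnvironment.sh") -> list:
    """Build an sbatch command that pins the test job to a single node."""
    return [
        "sbatch",
        f"--nodelist={node}",               # run on exactly this node
        f"--job-name=check-env-{node}",     # hypothetical naming scheme
        f"--output=check-env-{node}.out",   # one output file per node
        script,
    ]


def submit_checks(nodes, dry_run: bool = True) -> list:
    """Print (dry run) or submit one test job per node; return the commands."""
    commands = [build_sbatch_command(node) for node in nodes]
    for command in commands:
        if dry_run:
            print(" ".join(command))
        else:
            subprocess.run(command, check=True)  # raises if sbatch fails
    return commands


# Hypothetical node names; replace with the nodes serviced in this round.
serviced_nodes = ["node01", "node02"]
submit_checks(serviced_nodes)
```

Calling submit_checks(serviced_nodes, dry_run=False) on a machine where sbatch is available would submit the jobs; the per-node output files can then be compared to confirm that every node reports the expected environment.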
Last modified on 2019-02-28T16:45:55+01:00