
Cloud Scheduler Admin Guide

Note on the cloud_scheduler init script usage

  • To avoid having the managed VMs shut down, always use quickrestart or quickstop instead of the normal restart or stop

The cloud_admin tool

The cloud_admin tool is not installed on the PATH by default; it lives in the directory cloud_scheduler was installed from.
  • on condor.heprc.uvic.ca it's in /root/cs_update/cloud-scheduler-dev/cloud_admin
  • Run cloud_admin --help to see the available options

Common Tasks

Removing or Retiring VMs

  • To remove a VM from management by CS: cloud_admin -m -c cloudname -n VMID
    • This causes CS to forget about the VM and leave it alone. The VM will stay up and Condor will keep running jobs on it.
  • To gracefully retire a VM: cloud_admin -o -c cloudname -n VMID

Enabling and Disabling Clouds

  • To enable a cloud: cloud_admin -e cloudname
  • To disable a cloud: cloud_admin -d cloudname
  • To make these changes persist after a restart, modify the enabled property in /etc/cloudscheduler/cloud_resources.conf
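A fragment like the following shows where the enabled property sits in cloud_resources.conf. The section name and surrounding keys here are illustrative only; check your existing file for the exact layout used on your host:

```
[example_cloud]        # hypothetical cloud section name
enabled: false         # set true/false to control whether CS uses this cloud
```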

Adding Resources

  • To add or update resources, edit /etc/cloudscheduler/cloud_resources.conf, then run service cloud_scheduler quickrestart

Draining a cloud (completely or partially)

  • Disable the cloud:
    • cloud_admin -d cloudname
  • Either:
    • force retire all the VMs: cloud_admin -o -c cloudname -a
    • or force retire some number of VMs: cloud_admin -o -c cloudname -b [number]
    • or force retire as many VMs as needed by giving the VMID of each one: cloud_admin -o -c cloudname -n VMID
  • Optionally:
    • edit /etc/cloudscheduler/cloud_resources.conf to set a reduced resource usage limit
    • when the VMs have finished retiring, do service cloud_scheduler quickrestart
  • Illustrated example Here
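
The drain steps above can be sketched as a short script. This is a dry run: RUN is set to echo so each command is only printed, and example_cloud is a placeholder name. Set RUN="" to actually execute the commands on the Cloud Scheduler host.

```shell
#!/bin/sh
# Dry-run sketch of draining a cloud completely.
# RUN="echo" prints each command instead of executing it;
# set RUN="" on a real Cloud Scheduler host.
CLOUD="example_cloud"   # placeholder cloud name
RUN="echo"

$RUN cloud_admin -d "$CLOUD"         # 1. disable the cloud
$RUN cloud_admin -o -c "$CLOUD" -a   # 2. force retire all of its VMs
# ...wait for the VMs to finish retiring, then:
$RUN service cloud_scheduler quickrestart
```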

Cloud Aliases

The /etc/cloudscheduler/cloud_alias.json file can be used to define aliases for clouds, like this:
"CERNClouds": ["atlas_test","atlas_wigner","victoria_test"],
"GridPPClouds": ["gridpp-imperial","gridpp-oxford"]
Use cloud_admin -y to show the currently loaded aliases, and cloud_admin -t to reload them from the file.
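
Since cloud_admin -t reloads the aliases straight from the file, it can help to validate the JSON before reloading. The snippet below writes a sample alias file to /tmp (same shape as /etc/cloudscheduler/cloud_alias.json) and checks it with Python's standard json.tool module, assuming a python or python3 interpreter is on the host:

```shell
# Write a sample alias file with the same shape as cloud_alias.json
cat > /tmp/cloud_alias_sample.json <<'EOF'
{
  "CERNClouds": ["atlas_test", "atlas_wigner", "victoria_test"],
  "GridPPClouds": ["gridpp-imperial", "gridpp-oxford"]
}
EOF

# Validate the JSON before reloading with cloud_admin -t
PY=$(command -v python3 || command -v python)
"$PY" -m json.tool /tmp/cloud_alias_sample.json > /dev/null && echo "valid JSON"
```

The same json.tool check can be pointed at /etc/cloudscheduler/cloud_alias.json itself before a reload.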

In the event of a crash

  • Check /tmp/cloudscheduler.crash.log and/or post a new issue on GitHub
  • Patch or quick-fix the error based on the message in the crash log, then start cloud_scheduler back up: service cloud_scheduler start

Find out why a job is not booting a VM

  • Turn on verbose logging
    • cloud_admin -l VERBOSE
  • Watch a full cycle of the Scheduler thread's logging
    • tail -f /var/log/cloudscheduler.log | grep Scheduler
  • Look for messages from get_fitting_resources that indicate a resource mismatch or shortage
  • See what error responses come back from clouds that try to boot the VM but fail
  • Set the logging back to DEBUG
    • cloud_admin -l DEBUG
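
To illustrate the kind of filtering involved, the snippet below greps a fabricated log fragment. The message text here is invented for demonstration and will not match the real log format exactly; on a live host, point the same grep at /var/log/cloudscheduler.log instead:

```shell
# Fabricated sample of scheduler log lines (real messages will differ)
cat > /tmp/cs_log_sample.log <<'EOF'
2015-05-15 12:00:01 - Scheduler - get_fitting_resources - job 1234: no cloud has enough memory
2015-05-15 12:00:02 - Scheduler - VM boot attempt on atlas_test failed
EOF

# Filter for the resource-matching messages
grep get_fitting_resources /tmp/cs_log_sample.log
```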

Upgrading Cloud Scheduler

  • It's safer to shut down all the VMs before doing an upgrade, in case the class definitions have changed and break the persistence file
  • Set the Panda queues brokeroff so jobs stop coming in
    • This needs to be done manually since the switcher doesn't know about Cloud Schedulers
  • Drain all the VMs from all the clouds
    • See above task and repeat for each cloud
  • Stubborn VMs can be killed with cloud_admin
    • cloud_admin -k -c cloudname -n VMID
  • When all VMs have shut down, stop cloud_scheduler
    • service cloud_scheduler stop
  • Get the new release from GitHub (most likely the dev branch)
  • python2.7 setup.py install
  • service cloud_scheduler start
  • Run some short test jobs to make sure VMs are booting up and shutting down normally
  • Set the Panda queues back online
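
The command portion of the upgrade checklist can be sketched as a dry run, once the VMs are drained and the Panda queues are brokeroff. RUN="echo" prints each command instead of executing it, and the repository URL and branch are assumptions to be replaced with the actual release location:

```shell
#!/bin/sh
# Dry-run sketch of the upgrade sequence; set RUN="" to execute for real.
RUN="echo"

$RUN service cloud_scheduler stop                  # after all VMs are gone
$RUN git clone -b dev https://github.com/hep-gc/cloud-scheduler.git  # URL/branch assumed
$RUN python2.7 setup.py install                    # run inside the checkout
$RUN service cloud_scheduler start
```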

-- MichealPaterson - 2013-05-28

Topic revision: r9 - 2015-05-15 - mhp