-- ColinLeavettBrown - 2013-04-25

NEP52 Batch Services for RPI

Introduction

NEP52 provides a powerful, scalable batch processing capability to the Research Platform Initiative (RPI). This capability is delivered through the following Virtual Machine (VM) images:

  • NEP52-cloud-scheduler - This VM hosts Cloud Scheduler, a service that auto-provisions VMs, together with its HTCondor batch job scheduling environment. In addition, this node provides user login capabilities, allowing users to submit their workload to the scheduler for execution, monitor its progress, and retrieve their results.
  • NEP52-cvmfs-server - Provides a software distribution appliance VM that can host and distribute software for multiple VMs and VM types. The VM provided contains only one simple demonstration application and should be considered a template for building efficient, project specific software repositories. Using this server in a project can greatly reduce image sizes and improve image and software propagation efficiency.
  • NEP52-batch-cvmfs-client - This VM provides a minimal Scientific Linux 6.3 installation for both interactive and batch processing. It has been configured to access software from a CVMFS server and, if instantiated by Cloud Scheduler, to register with an HTCondor batch scheduler and run batch jobs.

Procedures:

The following procedures are designed to demonstrate the capabilities of the NEP52 batch processing services. By executing these procedures, you should learn how to utilize these services to process your own batch workload. The procedures will accomplish the following:

  1. Interactively run the demonstration application, allowing you to become familiar with OpenStack services, OpenStack networking, and the CVMFS software distribution appliance (procedure #3).
  2. Customize the NEP52-cvmfs-server image to create a CVMFS server of your own (procedure #4).
  3. Customize the NEP52-batch-cvmfs-client image to create a VM to run a demonstration job using your CVMFS server (procedure #5).
  4. Customize the NEP52-cloud-scheduler image to create a user login ("head") node and scheduling environment to run your batch workloads on the DAIR clouds (procedure #6).
  5. Run a demonstration batch job, monitor its progress, and check its output (procedure #7).

1. Launching and logging in to a NEP52 public image instance

  1. Using the OpenStack dashboard, launch the desired NEP52 public image, assigning a "keypair" to the instance so that you can log in as root. The OpenStack dialog requires the specification of a unique instance name. The name you choose can be used for communication between instances and will be used in the following procedures.
  2. Associate a floating IP with the instance just launched.
  3. Log in to the instance as "root" via the floating IP using your keypair; ie. ssh -i <keypair.pem> root@<floating.IP>.
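
As a concrete illustration, the login step might look like the following from a workstation. The keypair file name and floating IP shown are hypothetical placeholders; substitute your own values.

    # Hypothetical values: replace mykey.pem and 208.75.74.10 with your own
    # keypair file and the floating IP associated with the instance.
    chmod 600 mykey.pem                    # ssh requires restrictive key permissions
    ssh -i mykey.pem root@208.75.74.10     # log in as root
    hostname                               # confirm you are on the intended instance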

2. Taking a snapshot of a NEP52 public image instance

  1. Log in to the instance as root (procedure #1).
  2. Delete the UDEV rules associated with the network configuration, ie. "sed -i '/ATTR{address}/ D' /etc/udev/rules.d/70-persistent-net.rules".
  3. Using the OpenStack dashboard, snapshot the image.
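
For example, the UDEV cleanup in step 2 might be performed and verified as follows on the running instance; the snapshot itself is then taken from the dashboard as in step 3.

    # Remove the MAC-address-specific UDEV rules so that the snapshot boots
    # cleanly with the new virtual NIC it receives on its next launch.
    sed -i '/ATTR{address}/ D' /etc/udev/rules.d/70-persistent-net.rules
    # Verify that no ATTR{address} lines remain before snapshotting.
    grep -c 'ATTR{address}' /etc/udev/rules.d/70-persistent-net.rules    # expect 0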

3. Running the "Hello" demonstration application from the software repository

  1. Launch an instance of NEP52-cvmfs-server (procedure #1). Set the instance name to "cvmfs-server"; the NEP52-batch-cvmfs-client is configured to communicate with a server of this name. If you are unable to choose this name because it is in use by someone else, choose a different name and follow #5.3 and #5.4 on the client. Since we do not need to log in to the server to run the application, there is no need to associate a floating IP with the instance.
  2. Launch and log in to an instance of NEP52-batch-cvmfs-client (procedure #1).
  3. Issue the command "/cvmfs/dair.cvmfs.server/Hello".
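
A sample session on the client might look like the following. The exact text printed depends on the "Hello" script shipped in the repository, so treat the output as indicative only.

    # The first access triggers autofs/CVMFS to mount the repository served
    # by the instance named "cvmfs-server".
    ls -l /cvmfs/dair.cvmfs.server/        # shows the "empty" placeholder and "Hello"
    /cvmfs/dair.cvmfs.server/Hello         # prints the demonstration greeting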

4. Creating a customized software repository

  1. Launch and log in to the NEP52-cvmfs-server public image (procedure #1).
  2. Modify the content of the "/cvmfs/dair.cvmfs.server/" directory. As distributed, this directory contains the "empty" placeholder and the "Hello" script. The content of this directory, including any directory subtree, will be distributed to requesting clients once the repository has been signed and published. As a demonstration, copy "Hello" to "Goodbye" and modify the echoed text in "Goodbye" as desired.
  3. Sign and publish the repository by issuing the following commands:
    • chown -R cvmfs.cvmfs /cvmfs/dair.cvmfs.server
    • cvmfs-sync
    • cvmfs_server publish
  4. Take a snapshot of your repository (procedure #2).
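
Putting steps 2 and 3 together, a session on the repository server might look like the following. The sed edit assumes the script echoes a line containing the word "Hello"; any editor can be used instead.

    # Step 2: add a "Goodbye" application based on the "Hello" script.
    cd /cvmfs/dair.cvmfs.server
    cp Hello Goodbye
    sed -i 's/Hello/Goodbye/g' Goodbye     # adjust the echoed text
    # Step 3: sign and publish the repository so clients can see the change.
    chown -R cvmfs.cvmfs /cvmfs/dair.cvmfs.server
    cvmfs-sync
    cvmfs_server publish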

5. Creating a customized client to use a customized software repository

  1. Launch an instance of your custom software repository (procedure #1). Choose an instance name, say "my-cvmfs-server". Since we do not need to log in to the server to run the application, there is no need to associate a floating IP with the instance.
  2. Launch and log in to an instance of NEP52-batch-cvmfs-client (procedure #1).
  3. Modify the CVMFS client configuration as follows:
    • In the domain configuration file "/etc/cvmfs/domain.d/cvmfs.server", change the server name from "cvmfs-server" to "my-cvmfs-server" (to match the name chosen in #5.1) in the URL on line #18.
    • In the configuration file "/etc/cvmfs/config.d/dair.cvmfs.server.conf", change the server name from "cvmfs-server" to "my-cvmfs-server" (to match the name chosen in #5.1) in the URL on line #1.
  4. Activate the new configuration with the command "service cvmfs restartautofs".
  5. You may use standard Linux commands to view/execute the content of your software repository, eg. "ls -l /cvmfs/dair.cvmfs.server/*".
  6. Modify the image name within the "/.image.metadata" file.
  7. Take a snapshot of your client (procedure #2).
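
Steps 3 and 4 can be performed with two sed edits followed by a restart of the CVMFS automount, for example as below. This assumes the hostname "cvmfs-server" appears only in the URLs to be changed; verify both files afterwards.

    # Point the client at the instance named "my-cvmfs-server" (step 3).
    sed -i 's/cvmfs-server/my-cvmfs-server/' /etc/cvmfs/domain.d/cvmfs.server
    sed -i 's/cvmfs-server/my-cvmfs-server/' /etc/cvmfs/config.d/dair.cvmfs.server.conf
    # Activate the new configuration (step 4) and confirm the repository mounts (step 5).
    service cvmfs restartautofs
    ls -l /cvmfs/dair.cvmfs.server/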

6. Configuring and starting the Cloud Scheduler service

  1. Launch and log in to the NEP52-cloud-scheduler public image (procedure #1).
  2. Configure the cloud resources that you want to use. The configuration file, located at "/etc/cloudscheduler/cloud_resources.conf", documents all cloud resource parameters and contains template definitions for the Alberta and Quebec DAIR clouds at the bottom of the file. The following items within each template should be modified (an illustrative sketch follows this procedure):
    • Copy your ec2 credentials, "key_name", "access_key_id", and "secret_key_id", into the places indicated.
    • Review/adjust the resources, "vm_slots", "cpu_cores", "storage", and "memory" to be used on the cloud.
    • Be sure to set "enabled: true" for the appropriate clouds.
  3. Start the Cloud Scheduler service, ie. "service cloud_scheduler start".
  4. Add Cloud Scheduler to the list of services to be started automatically at boot, ie. "chkconfig cloud_scheduler on".
  5. To retain your customizations, take a snapshot (procedure #2). You may wish to review procedure #7 before taking your snapshot, since it also makes customizations (specifically #7.3 and #7.4) to the Cloud Scheduler image.
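
As an illustration only, an enabled DAIR cloud template in "/etc/cloudscheduler/cloud_resources.conf" might end up looking roughly like the fragment below. The section name and resource values are hypothetical, and the file's own comments remain the authoritative reference for the parameters.

    # Hypothetical values for illustration; copy in your own ec2 credentials
    # and size the resources to match your DAIR allocation.
    [dair-alberta]
    key_name: my-keypair
    access_key_id: <your ec2 access key id>
    secret_key_id: <your ec2 secret key id>
    vm_slots: 4
    cpu_cores: 2
    storage: 20
    memory: 4096
    enabled: true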

7. Running a batch job

For this procedure, we will assume you have a running Cloud Scheduler (procedure #6), a running custom software repository (procedures #4 and #1), and a customized client image (procedure #5).

  1. Log in to the Cloud Scheduler (procedure #1.3).
  2. Switch to the sysadmin account, ie. "su - sysadmin". This normal user account has password-less sudo access and comes with a template job description file and a simple job script.
  3. Determine the "ami" of your customized client:
    • Copy your ec2 credentials (access key id and secret key id) into the places indicated within the "/home/sysadmin/.ec2/ec2_credentials" personal configuration file.
    • Issue "list_ami" to list all the images/ami IDs that you may access; ami IDs have the format "ami-xxxxxxxx" where "xxxxxxxx" are hexadecimal digits.
  4. Finalize the demonstration "/home/sysadmin/demo.job" job description file (an illustrative sketch follows this procedure):
    • Change "<vm-name>" to the client image name (see #5.6).
    • Change "<ami-id>" to the ami ID of your client image.
  5. Submit the job for execution, ie. "condor_submit demo.job". The results of the job will be returned in the files "demo.log", "demo.out", and "demo.error" upon job completion.
  6. The progress of the job can be monitored with the "condor_q" and "cloud_status" commands, eg. watch "cloud_status -m; condor_q". Cloud Scheduler has several polling cycles, so it can take up to two minutes before the VM starts and one or two more before the job runs.
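
For reference, a finalized "demo.job" might look roughly like the sketch below, followed by the submit and monitoring commands. The HTCondor/Cloud Scheduler attribute names and the job script name "demo.sh" are assumptions typical of deployments of this era, not taken from the supplied template, so treat the template on the head node as authoritative; "<vm-name>" and "<ami-id>" stand for the values determined in steps 3 and 4.

    # Illustrative job description only; attribute names and the script name
    # are assumed and may differ from the supplied template.
    Universe                = vanilla
    Executable              = demo.sh
    Log                     = demo.log
    Output                  = demo.out
    Error                   = demo.error
    should_transfer_files   = YES
    when_to_transfer_output = ON_EXIT
    Requirements            = VMType =?= "<vm-name>"
    +VMType                 = "<vm-name>"
    +VMAMI                  = "<ami-id>"
    Queue

    condor_submit demo.job                 # submit (step 5)
    watch "cloud_status -m; condor_q"      # monitor the VM and job queue (step 6)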