-- ColinLeavettBrown - 2013-04-25

NEP52 Batch Services for RPI

Introduction

NEP52 provides a powerful, scalable batch processing capability to the Research Platform Initiative (RPI). This function is enabled through the following Virtual Machine (VM) images:

  • NEP52-cloud-scheduler - This VM hosts Cloud Scheduler, a service that auto-provisions VMs, together with its HTCondor batch job scheduling environment. In addition, this node provides user login capabilities, allowing users to submit their workload to the scheduler for execution, monitor its progress, and retrieve their results.
  • NEP52-cvmfs-server - Provides a software distribution appliance VM that can host and distribute software for multiple VMs and VM types. The VM provided contains only one simple demonstration application and should be considered a template for building efficient, project specific software repositories. Using this server in a project can greatly reduce image sizes and improve image and software propagation efficiency.
  • NEP52-batch-cvmfs-client - This VM provides a minimal Scientific Linux 6.3 kernel installation for both interactive and batch processing. It has been configured to access software from a CVMFS server and, if instantiated by Cloud Scheduler, to register with an HTCondor batch scheduler and run batch jobs.

Procedures:

The following procedures are designed to demonstrate the capabilities of the NEP52 batch processing services. By executing these procedures, you should learn how to utilize these services to process your own batch workload. The procedures will accomplish the following:

  1. Interactively run the demonstration application allowing you to become familiar with OpenStack services, OpenStack networking, and the CVMFS software distribution appliance (procedure #1).
  2. Customize the NEP52-cvmfs-server image to create a CVMFS server of your own (procedure #2).
  3. Customize the NEP52-batch-cvmfs-client image to create a VM to run a demonstration job using your CVMFS server (procedure #3).
  4. Customize the NEP52-cloud-scheduler image to create a user login ("head") node and scheduling environment to run your batch workloads on the DAIR clouds (procedure #4).
  5. Run a demonstration batch job, monitor its progress, and check its output (procedure #5).

These procedures use the DAIR OpenStack dashboard available at:

https://nova-ab.dair-atir.canarie.ca
and a terminal with an SSH client such as the "ssh" command or PuTTY.

The screen-shots in the following documentation were developed using the HEPnet account/user name. Whenever a screen-shot contains the word "HEPnet" or the instructions contain the text "{account}", you should substitute your own account/user name.

1. Getting started

    i. Open a browser and log in to the DAIR OpenStack dashboard:

2. Running the "Hello" demonstration application from the software repository

    i. Launch an instance of NEP52-cvmfs-server:


       

        setting the instance name to "{account}-cvmfs":


       

        assigning your own key pair:


       

        and associating a floating IP:


       

    ii. Using the same methodology, launch an instance of NEP52-batch-cvmfs-client, setting the name to "{account}-client", assigning a key pair and a floating IP. When your instance of "{account}-client" becomes "active":


       

    iii. Log in as root:
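
        For example (a sketch; substitute the private key file of the key pair you assigned and the floating IP you associated with the instance):

ssh -i <private-key-file> root@<floating-ip>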


       

    iv. Reconfigure the client to use your server name:

        * In the domain configuration file "/etc/cvmfs/domain.d/cvmfs.server", change the string "{account}" in the URL on line #18 to your account name:


       

        * In the configuration file "/etc/cvmfs/config.d/dair.cvmfs.server.conf", change the string "{account}" in the URL on line #1 to your account name:


       

        * Activate the new configuration:


       
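        As a sketch, the two URL edits in this step can also be made from the command line, where "myaccount" is your own account name (this assumes the placeholder in both files is literally the string "{account}"; verify the resulting URLs with a text editor):

sed -i 's/{account}/myaccount/' /etc/cvmfs/domain.d/cvmfs.server
sed -i 's/{account}/myaccount/' /etc/cvmfs/config.d/dair.cvmfs.server.conf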

    v. Run the demonstration application from the CVMFS server:
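
        For example, the demonstration script can be run directly from the CVMFS mount point (assuming the client mounts the repository at "/cvmfs/dair.cvmfs.server", as distributed):

/cvmfs/dair.cvmfs.server/Hello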
       

    vi. Create a snapshot of your working, customized client:

        * Change the image name within the "/.image.metadata" file:


       

        * Delete the network-related UDEV rules created by this kernel for each unique MAC address, ie. "sed -i '/ATTR{address}/ D' /etc/udev/rules.d/70-persistent-net.rules".


       

        * Use the OpenStack dashboard to take a snapshot of your batch client.


       

3. Creating a customized software repository

        i. Log out of your client and log in to your CVMFS server:


       

        ii. Modify the content of the "/cvmfs/dair.cvmfs.server/" directory. As distributed, this directory contains the "empty" placeholder and the "Hello" script. The content of this directory, including any directory subtree, will be distributed to requesting clients once the repository has been signed and published. As a demonstration, copy "Hello" to "Goodbye" and modify the echoed text in "Goodbye" as desired.
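
        For example (a sketch assuming "Hello" is a simple shell script that echoes a greeting):

cd /cvmfs/dair.cvmfs.server
cp Hello Goodbye
sed -i 's/Hello/Goodbye/' Goodbye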


       

        iii. Sign and publish the repository by issuing the following three commands:

chown -R cvmfs.cvmfs /cvmfs/dair.cvmfs.server
cvmfs-sync
cvmfs_server publish

        iv. Use the OpenStack dashboard to take a snapshot of your CVMFS server.

We will be exercising your new application in procedure #5, "Running a batch job", but if you want to check your changes now, you can log in to your running client and try the application.
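
For example, from the client (assuming the repository is mounted at "/cvmfs/dair.cvmfs.server" as in procedure #2; a freshly published revision may take a few minutes to appear on the client, depending on the repository's cache and time-to-live settings):

/cvmfs/dair.cvmfs.server/Goodbye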

4. Configuring and starting the Cloud Scheduler service

  1. Launch an instance of NEP52-cloud-scheduler, assigning a "keypair" and a floating IP.
  2. When your instance of the NEP52-cloud-scheduler becomes "active", log in as root (ie. "ssh -i <private-key> root@<floating-ip>").
  3. Configure the cloud resources that you want to use. The configuration file, located at "/etc/cloudscheduler/cloud_resources.conf", documents all cloud resource parameters and contains template definitions for the Alberta and Quebec DAIR clouds at the bottom of the file. The following items within each template should be modified (an illustrative sketch follows this list):
    • Copy your ec2 credentials, "key_name", "access_key_id", and "secret_key_id", into the places indicated.
    • Review/adjust the resources, "vm_slots", "cpu_cores", "storage", and "memory" to be used on the cloud.
    • Be sure to set "enabled: true" for the appropriate clouds.
  4. Start the Cloud Scheduler service, ie. "service cloud_scheduler start".
  5. Add Cloud Scheduler to the list of services to be started automatically at boot, ie. "chkconfig cloud_scheduler on".
  6. Save a copy of your Cloud Scheduler:
    • You may wish to review procedure #5 prior to taking your snapshot, since steps #5.3 and #5.4 also make customizations to the Cloud Scheduler image.
    • Delete network related UDEV rules created by this kernel for each unique MAC address, ie. "sed -i '/ATTR{address}/ D' /etc/udev/rules.d/70-persistent-net.rules".
    • Use the OpenStack dashboard to take a snapshot of your Cloud Scheduler node.
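
An illustrative sketch of the items to edit within one of the supplied cloud templates in "/etc/cloudscheduler/cloud_resources.conf" (the section name and any parameters not shown here are assumptions or are left as distributed in the file; only values of the kind below need changing):

[dair-alberta]
key_name: <your-keypair-name>
access_key_id: <your-ec2-access-key-id>
secret_key_id: <your-ec2-secret-key-id>
vm_slots: 2
cpu_cores: 1
storage: 20
memory: 2048
enabled: true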

5. Running a batch job

For this procedure, we will assume you have a running Cloud Scheduler (procedure #4), a running custom software repository (procedure #2), and that you have a customized client (procedure #3).
  1. Log in to the Cloud Scheduler, ie. "ssh -i <private-key> root@<floating-ip>".
  2. Switch to the sysadmin account, ie. "su - sysadmin". This unprivileged account has password-less sudo access, a template job description file, and a simple job script.
  3. Determine the "ami" of your customized client:
    • Copy your ec2 credentials, access key id, and secret key id, into the places indicated within the /home/sysadmin/.ec2/ec2_credentials personal configuration file.
    • Issue "list_ami" to list all the images/ami IDs that you may access; ami IDs have the format "ami-xxxxxxxx" where "xxxxxxxx" are hexadecimal digits.
  4. Finalize the demonstration "/home/sysadmin/demo.job" job description file (an illustrative sketch follows this list):
    • Change "<image-name>" to match the unique name you set within your batch client's /.image.metadata file (see procedure #3.5).
    • Change "<ami-id>" to the ami ID of your client image.
  5. Submit the job for execution, ie. "condor_submit demo.job". The results of the job will be returned in the files "demo.log", "demo.out", and "demo.error" upon job completion.
  6. The progress of the job can be monitored with the "condor_q" and "cloud_status" commands, eg. try "watch 'cloud_status -t; cloud_status -m; condor_q'" - use "Control-C" to exit the "watch" command. Cloud Scheduler has several polling cycles, so it can take several minutes before the VM starts and one or two minutes more before the job runs. If your job won't run, the Cloud Scheduler log at /var/log/cloudscheduler.log probably contains the reason why.
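
For reference, a Cloud Scheduler job description has roughly the following shape (a sketch only; "/home/sysadmin/demo.job" is the authoritative template, and its attribute names and executable, shown here as "demo.sh", may differ):

Universe     = vanilla
Executable   = demo.sh
Log          = demo.log
Output       = demo.out
Error        = demo.error
Requirements = VMType =?= "<image-name>"
+VMType      = "<image-name>"
+VMAMI       = "<ami-id>"
Queue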