NEP52 Batch Services

Overview

If you are already familiar with batch computing and just want to get your jobs running on VMs, skip to the Cloud Scheduler Test Drive section below.

Cloud Scheduler is a tool that spawns Virtual Machines (VMs) on Infrastructure-as-a-Service (IaaS) clouds in order to run batch computing jobs. With Cloud Scheduler, you clone or provide an existing VM batch worker node image and point your HTCondor batch jobs to that image; Cloud Scheduler handles the rest. For people with classic batch computing experience, the process will feel very familiar.

Here's how it works from the perspective of a user, Jane:

  1. Jane prepares a VM image loaded with the software she needs for processing, then uploads it to an image repository. This step may also have been done previously by one of her colleagues, or she may pick a pre-cooked image (as is the case in the following tutorial).
  2. Jane submits a batch of processing jobs to a Condor pool. In the job descriptions, she specifies regular Condor parameters, but also the VM image that she would like her jobs to run on (see the example sketch after this list).
  3. Jane then waits for her jobs to complete.
  4. Jane gets her results.
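
To make step 2 concrete, here is a minimal sketch of what such a job description might look like. The names used here (the executable "process.sh" and the image name "my-worker-image") are illustrative, and the exact Cloud Scheduler attribute names can vary between versions; the demo jobs used later in this tutorial are already prepared for you.

%STARTCONSOLE%
# Sketch of a Cloud Scheduler job description (illustrative names only).
Universe     = vanilla
Executable   = process.sh
Output       = process.out
Error        = process.err
Log          = process.log
# Cloud Scheduler extension: request that this job run on a worker VM
# booted from a particular image.
Requirements = VMType =?= "my-worker-image"
Queue
%ENDCONSOLE%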

rpi_cloud_architecture.png

The following tutorial will walk you through the steps necessary to run your first job on the cloud using the DAIR cloud. Access to additional clouds is all that is required to run your jobs distributed across multiple cloud sites. We start by assuming that you have access to the DAIR cloud.

Cloud Scheduler Test Drive

You can use the software provided in this RPI project to easily run your batch jobs on the DAIR cloud. The following examples will allow you to test drive this functionality quickly. In summary:

  • "Running your first batch job" will have you launch an instance of the Cloud Scheduler image (NEP52-cloud-scheduler), configure and start Cloud Scheduler, and submit a batch job. The batch job will trigger Cloud Scheduler to boot a VM on the DAIR OpenStack Cloud automatically, the job will run, and you can monitor its progress and check the job output. At the end of the job, when there are no more jobs in the queue, Cloud Scheduler will automatically remove idle batch VMs.
  • " Running a batch job which uses the Shared Software Repository service" will have you launch an instance of the CVMFS server image (NEP52-cvmfs-server), submit a batch job, and check the output of the distributed application.

In order to try the Cloud Scheduler Test Drive, you will need the following:

  • A DAIR login ID with a large enough quota to run the three concurrent demonstration instances.
  • To create your own keypair and save the .pem file locally (see the OpenStack dashboard/documentation).
  • To retrieve your EC2_ACCESS_KEY and EC2_SECRET_KEY from the OpenStack dashboard (a sketch of one way to do this follows this list).
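
If you have never retrieved these keys before: OpenStack dashboards of this vintage typically offer a downloadable "EC2 Credentials" bundle under "Access & Security". A sketch of one way to inspect the keys on your workstation (the bundle and file names on DAIR may differ):

%STARTCONSOLE%
unzip ec2rc.zip
grep -E 'EC2_ACCESS_KEY|EC2_SECRET_KEY' ec2rc.sh
%ENDCONSOLE%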

Preparation

The NEP52 Batch Services and the related NEP52 Shared Software Repository services make use of the network and require specific ports to be open. These ports must be added to your OpenStack default security group: log into the OpenStack dashboard, select the "Access & Security" tab, and click "Edit Rules" beside the default security group. Use the "Add Rule" dialog at the bottom of the form and ensure that all the ports shown in the figure below are included before proceeding with the test drive:

DefaultSG.png
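
If you prefer the command line to the dashboard, each rule in the figure above can also be added with the nova client. A sketch for a single rule, using HTCondor's default collector port 9618 purely as an illustration (substitute each port shown in the figure):

%STARTCONSOLE%
nova secgroup-add-rule default tcp 9618 9618 0.0.0.0/0
%ENDCONSOLE%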

Running your first batch job

Step 1: Log into DAIR and boot a Cloud Scheduler instance

Log into the DAIR OpenStack Dashboard at https://nova-ab.dair-atir.canarie.ca and select the Alberta region. Refer to the OpenStack docs for the details of booting and managing VMs via the dashboard.

Go to the 'Images & Snapshots' tab on the left of the page, then click the 'Launch' button next to the NEP52-cloud-scheduler image.

Fill in the form to match the screenshot below, substituting your username where you see the string "hepnet".

launch.png

Now select the SSH key to associate with the instance so that you can log into it. Click the "Access & Security" tab, pick your key, click "Launch" (see the screenshot below), and wait for the instance to become active.

select_key.png

Step 2: Log into the Cloud Scheduler instance and configure it

Now associate a floating IP with the machine. Click the "Instances" tab on the left. From the "Actions" column beside your newly started Cloud Scheduler instance, choose "Associate Floating IP", complete the dialog, and click "Associate".
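
If you have the nova client configured, the same association can be sketched from the command line (the instance name "hepnet-cloud-scheduler" and the IP address are illustrative):

%STARTCONSOLE%
nova floating-ip-create
nova add-floating-ip hepnet-cloud-scheduler 208.75.74.80
%ENDCONSOLE%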

Now ssh into the box as root (you can find the IP of the machine from the dashboard):

%STARTCONSOLE%
ssh -i ~/.ssh/MyKey.pem root@208.75.74.80
%ENDCONSOLE%

Use your favourite editor (e.g., nano, vi, or vim) to edit the Cloud Scheduler configuration file so that it contains your DAIR EC2 credentials, specifically "{keypair_name}", "{EC2_ACCESS_KEY}", and "{EC2_SECRET_KEY}", for both the Alberta and Quebec DAIR clouds. Then start the Cloud Scheduler service:

%STARTCONSOLE%
vi /etc/cloudscheduler/cloud_resources.conf
service cloud_scheduler start
%ENDCONSOLE%
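
For orientation, each cloud in cloud_resources.conf is an INI-style section. Here is a sketch of roughly what the Alberta entry looks like; the exact option names may differ between Cloud Scheduler versions, and the brace-enclosed placeholders are the values you must fill in:

%STARTCONSOLE%
[alberta]
cloud_type: AmazonEC2
key_name: {keypair_name}
access_key_id: {EC2_ACCESS_KEY}
secret_access_key: {EC2_SECRET_KEY}
%ENDCONSOLE%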

If you don't have your credentials, follow this video to see how to retrieve them. Your credentials will be used by Cloud Scheduler to boot VMs on your behalf.

Step 3: Run a job and be amazed

Switch to the guest user on the VM, then submit the first demonstration job, which calculates pi to 1000 decimal places. You can see what's happening with cloud_status and condor_q, or you can issue these commands periodically through "watch" to monitor the job's progress:

%STARTCONSOLE%
su - guest
condor_submit demo-1.job
watch 'cloud_status -m; condor_status; condor_q'
%ENDCONSOLE%

When the job completes, it disappears from the queue. The primary output for the job will be contained in the file 'demo-1.out', errors will be reported in 'demo-1.err', and the HTCondor job log is saved in 'demo-1.log'. All these file names are user defined in the job description file 'demo-1.job'.

%STARTCONSOLE%
cat demo-1.out
%ENDCONSOLE%

You have just run a demonstration job on a dynamically created Virtual Machine.

Running a batch job which uses the Shared Software Repository service

We provide a VM appliance preconfigured with CVMFS that allows you to share your software with multiple running VMs.

CVMFS is a read-only network file system designed for distributing software to VMs. It is a secure and very fast way to mount a POSIX network file system that can be shared with hundreds of running VMs.
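
On a worker VM the repository appears as an ordinary directory tree, mounted on demand under /cvmfs. A quick sanity check looks like this (dair.cvmfs.server is the repository name used by this tutorial's appliance):

%STARTCONSOLE%
ls -l /cvmfs/dair.cvmfs.server
cvmfs_config probe
%ENDCONSOLE%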

Step 1:

Using the OpenStack dashboard and the same launch procedure as for the Cloud Scheduler image, launch an instance of NEP52-cvmfs-server. You must set the instance name to "{username}-cvmfs" (replacing "{username}" with your own username). It is always a good idea to assign your keypair to the instance so that you can log into it if the need arises.

Step 2:

If you are not already logged into the Cloud Scheduler VM, login and switch to the guest account:

%STARTCONSOLE%
ssh -i ~/.ssh/MyKey.pem root@208.75.74.80
su - guest
%ENDCONSOLE%

Edit the second demonstration job description file, and replace the string "{username}" with your username.

%STARTCONSOLE%
vi demo-2.job
%ENDCONSOLE%

The line you must change looks like this:

%STARTCONSOLE%
Arguments = {username}
%ENDCONSOLE%
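
Alternatively, a one-line substitution does the same edit; "janedoe" is a hypothetical username, so substitute your own:

%STARTCONSOLE%
sed -i 's/{username}/janedoe/' demo-2.job
%ENDCONSOLE%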

Now submit the job and watch it like we did before:

%STARTCONSOLE%
condor_submit demo-2.job
watch 'cloud_status -m; condor_q'
%ENDCONSOLE%

Once the job finishes you should see something like this in the file "demo-2.out":

%STARTCONSOLE%
cat demo-2.out
Job started at Tue May 28 15:40:22 PDT 2013
=> demo-2.sh <=
Simple script for testing the default CVMFS appliance.

Shutting down CernVM-FS:  [ OK ]
Stopping automount:       [ OK ]
Starting automount:       [ OK ]
Starting CernVM-FS:       [ OK ]

-rwxr-xr-x 1 cvmfs cvmfs 110 Mar 28 16:00 /cvmfs/dair.cvmfs.server/Hello
-rw-r--r-- 1 cvmfs cvmfs  47 Mar 28 16:00 /cvmfs/dair.cvmfs.server/empty

Hello! You have successfully connected to the skeleton CVMFS server and run its software.

Job finished at Tue May 28 15:40:27 PDT 2013
%ENDCONSOLE%

Customizing the Shared Software Repository server to host your applications

To make the CVMFS server really useful, you will need to install your application software within its repository. Modifying the content of the software repository is outside the scope of this document; it is covered by the documentation for the "NEP52 Shared Software Repository" service.

Running a batch job to exercise the modifications to your Shared Software Repository

If you have followed the documentation for the "NEP52 Shared Software Repository" and have created the sample "Goodbye" application, you may want to run the third demonstration job to exercise your CVMFS modifications in the batch environment. The procedure is identical to the one given for the second demonstration job above, substituting "demo-3" for "demo-2" wherever it appears.
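
In summary, on the Cloud Scheduler VM as the guest user (remember to replace "{username}" in demo-3.job, as before):

%STARTCONSOLE%
vi demo-3.job
condor_submit demo-3.job
watch 'cloud_status -m; condor_q'
%ENDCONSOLE%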

Take snapshots of your customized images

If you followed all the steps above, you now have a customized version of the Cloud Scheduler appliance running. You can use the OpenStack dashboard to snapshot this server to save yourself the work of customizing it again.
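
The snapshot can be taken from the "Instances" tab ("Create Snapshot" in the Actions column) or, as a sketch, from the command line; the instance and snapshot names here are illustrative:

%STARTCONSOLE%
nova image-create hepnet-cloud-scheduler NEP52-cloud-scheduler-configured
%ENDCONSOLE%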
