This page contains information on GSI support in the cloud scheduler.

GSI support in the cloud scheduler allows a user to authenticate using his/her grid certificate when submitting a job to the cloud scheduler. The cloud scheduler will then use these credentials for authenticating the Nimbus workspace creation and workspace deletion.

Enabling GSI support in the cloud scheduler also restricts each VM so that only jobs from the owner of that VM can be started on it. In other words, jobs owned by user B will not be started on a VM owned by user A. This prevents other users from gaining access to a delegated proxy on the VM.

Requirements

  • A cloud scheduler codebase with GSI support
    • Warning: if you want to renew your certificate via CDS, use the cloud scheduler codebase from the CDS branch
    • merged into dev branch on Sep 13, 2010
  • A working CA is required to sign the dummy VM host certificate.
    • Done. Running on alto.cloud.nrc.ca
  • A working Globus Toolkit is required on the host running the cloud scheduler.
  • The user requires a valid grid certificate (x509)
  • The VM images must have a recent version of the condor startup scripts (with generic local condor config support)

A note about CA root cert hash values

Note that the newest openssl libraries (1.0+) use a different algorithm to compute x509 cert hash values. This can cause hard-to-diagnose authentication failures if you have systems that use different versions of openssl, or applications that are linked to different versions of openssl (e.g., condor statically linked to openssl 0.9 on a system with openssl 1.0 installed).

You can manually extract the old and new hash values from a CA root cert by using the openssl command as shown below. (You need to have openssl 1.0+ for this to work.)

$ openssl version
OpenSSL 1.0.0c 2 Dec 2010

$ openssl x509 -hash -noout < /etc/grid-security/certificates/5d674a88.0
5d674a88

$ openssl x509 -subject_hash_old -noout < /etc/grid-security/certificates/5d674a88.0
bffbd7d0

Install CAs and fetch-crl

If you haven't done so yet, you will need to install the GridCanada CA root package on your Cloud Scheduler system. Follow these instructions to install the standard EGI trust anchors, which include the GridCanada root CA.

Now install fetch-crl:

  • Install the latest v3 RPM from https://dist.eugridpma.info/distribution/util/fetch-crl3/ or EPEL
  • Apply the following settings in /etc/fetch-crl.conf:
    warnings
    noquiet
    verbosity = 1
    http_proxy=<optional squid server>
    logmode=syslog
    
  • /sbin/chkconfig fetch-crl-boot on
  • /sbin/chkconfig fetch-crl-cron on
  • /etc/init.d/fetch-crl-cron start

Install NEP-52 root CA package

Tip, idea If you already have your own CA that you can use to sign your own X509 certificates, you can install your CA package instead.

On the cloud scheduler host:

$ wget --no-check-certificate https://wiki.heprc.uvic.ca/twiki/pub/Main/CsGsiSupport/globus_simple_ca_08b380b1_setup-0.20.tar.gz
$ gpt-build ./globus_simple_ca_08b380b1_setup-0.20.tar.gz
$ gpt-postinstall
# $GLOBUS_LOCATION/setup/globus_simple_ca_08b380b1_setup/setup-gsi

IMPORTANT: As noted above, the newest openssl libraries (1.0+) use a different algorithm to compute x509 cert hash values, which can cause authentication failures when systems or applications are linked to different versions of openssl. To avoid errors caused by these CA root cert hash inconsistencies, it is recommended that you create symlinks for the new hash of our CA root cert, as shown below:

cd /etc/grid-security/certificates
ln -s 08b380b1.0 63bbbd3b.0
ln -s 08b380b1.signing_policy 63bbbd3b.signing_policy
ln -s globus-host-ssl.conf.08b380b1 globus-host-ssl.conf.63bbbd3b
ln -s globus-user-ssl.conf.08b380b1 globus-user-ssl.conf.63bbbd3b
ln -s grid-security.conf.08b380b1 grid-security.conf.63bbbd3b
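
The same aliasing can be scripted when more than one CA needs it. The following is only a minimal Python sketch, not part of the cloud scheduler: the function name link_hash_aliases is made up for illustration, and it assumes the hash appears as a dot-separated component of each file name, as in the listing above.

```python
import os

def link_hash_aliases(cert_dir, old_hash, new_hash):
    """Create new_hash symlinks next to every file named after old_hash.

    Returns the list of link names that were created.
    """
    created = []
    for name in sorted(os.listdir(cert_dir)):
        parts = name.split(".")
        if old_hash not in parts:
            continue
        alias = ".".join(new_hash if p == old_hash else p for p in parts)
        alias_path = os.path.join(cert_dir, alias)
        if os.path.lexists(alias_path):
            continue  # never overwrite an existing file or link
        # Relative target, equivalent to: ln -s <name> <alias>
        os.symlink(name, alias_path)
        created.append(alias)
    return created
```

Calling link_hash_aliases("/etc/grid-security/certificates", "08b380b1", "63bbbd3b") as root should produce the same five links as the ln commands. Running it a second time creates nothing.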

Request dummy host certificate for VM instances

For credential delegation to the worker nodes to work, they need to have a host certificate. Here we simply reuse a dummy host certificate on every VM that will be booted by the cloud scheduler.

On the cloud scheduler host:

# mkdir /etc/grid-security/VM-host-cert
# cd /etc/grid-security/VM-host-cert
# grid-cert-request -dir . -host NEP-52_VM_instance -ca
For the time being, the NEP-52 CA is hosted on myproxy.cloud.nrc.ca. Send the above certificate request to Andre.Charbonneau@nrc-cnrc.gc.ca. If you have your own CA, simply send this request to your CA to get it signed.

Install the signed certificate in the VM-host-cert directory created above.

Configure GSI Authentication in Condor

GSI needs to be enabled at the Condor level. This is required in order to be able to authenticate users via their X509 certificate (proxies).

In the condor config on the cloud scheduler host:

SEC_DEFAULT_AUTHENTICATION = REQUIRED
SEC_DEFAULT_AUTHENTICATION_METHODS = GSI
SEC_DEFAULT_ENCRYPTION = REQUIRED
SEC_DEFAULT_ENCRYPTION_METHODS = 3DES
GRIDMAP = /etc/grid-security/grid-mapfile.condor
This will enable both authentication (GSI) and encryption for clients connecting to this condor server. Do not forget to create the grid mapfile specified in the above configuration.

The cloud scheduler host itself must be listed in the grid mapfile so that the condor daemons can update the queue when the job status changes (i.e., when jobs finish). Add a line such as:

"/C=CA/O=Grid/CN=host/vm129.cloud.nrc.ca" condor@vm129.cloud.nrc.ca
to your /etc/grid-security/grid-mapfile and replace vm129.cloud.nrc.ca with the hostname of your cloud scheduler host.

Note that users who are not listed in the grid mapfile will still be authenticated and authorized to look at the condor info (READ operations), but will not be allowed to submit any jobs.
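
To check ahead of time whether a given DN would be allowed to submit, the mapfile can be parsed with a few lines of Python. This is a rough sketch only: parse_gridmap and can_submit are hypothetical helper names, and the parsing assumes the simple format shown above (a quoted DN, then one account name), with '#' comment lines skipped.

```python
import shlex

def parse_gridmap(text):
    """Map each quoted DN to its local account name."""
    mapping = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # assumption: blank and comment lines carry no mapping
        fields = shlex.split(line)  # honours the double quotes around the DN
        if len(fields) != 2:
            raise ValueError("unexpected grid-mapfile line: %r" % line)
        dn, account = fields
        mapping[dn] = account
    return mapping

def can_submit(mapping, dn):
    """A DN may submit jobs only if the mapfile has an entry for it."""
    return dn in mapping
```

For example, can_submit(parse_gridmap(open("/etc/grid-security/grid-mapfile.condor").read()), dn) tells you whether dn has a mapping. It says nothing about READ access, which is granted to any authenticated user as noted above.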

For the changes to take effect, you need to restart condor:

# /etc/init.d/condor stop
# /etc/init.d/condor start

To test that GSI authentication is actually enabled, try to run the condor_q command without a user proxy. You should get an error message such as the one shown below:

$ grid-proxy-destroy
$ condor_q

-- Failed to fetch ads from: <132.246.148.29:8080> : vm129.cloud.nrc.ca
AUTHENTICATE:1003:Failed to authenticate with any method
AUTHENTICATE:1004:Failed to authenticate using GSI
GSI:5003:Failed to authenticate.  Globus is reporting error (851968:28).  There is probably a problem with your credentials.  (Did you run grid-proxy-init?)

Then create a proxy and try the command again. The condor_q command should work now.

Configure CA roots in cloud scheduler

We need to specify which CA root certificates and signing policies are needed on our VMs. In our case there are two: the GridCanada CA root, and our simple CA which is used to sign dummy VM host certs. Note that to avoid conflicts, we list both hash values (old and new) for each CA root cert.

This is done by adding the following to the cloud scheduler config file:

ca_root_certs: /etc/grid-security/certificates/bffbd7d0.0,/etc/grid-security/certificates/5d674a88.0,/etc/grid-security/certificates/08b380b1.0,/etc/grid-security/certificates/63bbbd3b.0

ca_signing_policies: /etc/grid-security/certificates/bffbd7d0.signing_policy,/etc/grid-security/certificates/5d674a88.signing_policy,/etc/grid-security/certificates/08b380b1.signing_policy,/etc/grid-security/certificates/63bbbd3b.signing_policy
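
Because both lists must stay in step (one .0 file and one .signing_policy file per hash), it can help to generate them from a single list of hashes. Here is a small Python sketch of that idea. ca_config_lines is a hypothetical helper, not something provided by the cloud scheduler.

```python
def ca_config_lines(hashes, cert_dir="/etc/grid-security/certificates"):
    """Build the two cloud scheduler config lines from a list of CA hashes."""
    certs = ",".join("%s/%s.0" % (cert_dir, h) for h in hashes)
    policies = ",".join("%s/%s.signing_policy" % (cert_dir, h) for h in hashes)
    return ["ca_root_certs: " + certs, "ca_signing_policies: " + policies]

if __name__ == "__main__":
    # Old and new hash for GridCanada, then old and new hash for the NEP-52 CA.
    for line in ca_config_lines(["bffbd7d0", "5d674a88", "08b380b1", "63bbbd3b"]):
        print(line)
```

With the four hashes used on this page, the output matches the two configuration lines above.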

Configure VM dummy host certificate in cloud scheduler

cert_file: /home/andre/work/cloud-scheduler/VM_host_cert/hostcert.pem
cert_file_on_vm: /etc/grid-security/hostcert.pem

key_file: /home/andre/work/cloud-scheduler/VM_host_cert/hostkey.pem
key_file_on_vm: /etc/grid-security/hostkey.pem

Configure the nimbus grid-mapfile on the cloud servers

Make sure that the DNs of authorized users are added to the Nimbus grid mapfile on each cloud server those users are allowed to use. For example, on alto.cloud.nrc.ca, this is the following file:
/usr/local/nimbus/services/etc/nimbus/nimbus-grid-mapfile

Testing

Restart the cloud scheduler

Create a user proxy (full legacy). Make sure its lifetime will cover the duration of the job.

$ grid-proxy-init -old [-valid HH:MM]

Add x509 proxy info in your job description

In order to use GSI authentication, you need to specify your user proxy in your job description. This is done using the x509userproxy classad attribute. For example:
x509userproxy = /tmp/x509up_u20200
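
The path in the example follows the usual Globus convention: the proxy lives in /tmp/x509up_u<uid> unless the X509_USER_PROXY environment variable points elsewhere. The Python sketch below derives the job description line from that convention. The two function names are made up for illustration.

```python
import os

def default_proxy_path(environ=None, uid=None):
    """Return X509_USER_PROXY if set, else the conventional /tmp location."""
    environ = os.environ if environ is None else environ
    uid = os.getuid() if uid is None else uid
    return environ.get("X509_USER_PROXY") or "/tmp/x509up_u%d" % uid

def proxy_submit_line(path):
    """Format the x509userproxy line for a job description."""
    return "x509userproxy = %s" % path

if __name__ == "__main__":
    print(proxy_submit_line(default_proxy_path()))
```

Run as the user who called grid-proxy-init, this prints the line to paste into the job description.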

Submit the job

$ condor_submit <job-description-file>

Credential renewal

The cloud scheduler implements job credential renewal via a MyProxy server. The idea is simple: the user first puts a long-lived proxy on a MyProxy server prior to submitting a job and then puts the proxy information in the job description. Periodically, the cloud scheduler will scan all job proxy certificates and attempt to renew those which are about to expire.

Note that this proxy renewal feature will only renew proxies that reside on the cloud scheduler. User proxies delegated to the worker nodes by Condor will not be automatically renewed.

To use automatic credential renewal, follow the instructions below:

Configure cloud scheduler to enable credential renewal

Edit the following cloud scheduler configuration parameters in your cloud_scheduler.conf file:
# job_proxy_refresher_interval specifies the amount of time, in seconds, between each job proxy
# credential expiry checks.  To disable proxy refreshing altogether, simply set this
# value to -1
#
# The default value is -1
#job_proxy_refresher_interval: -1

# job_proxy_renewal_threshold determines the amount of time, in seconds, 
# prior to proxy expiry date at which a proxy will be refreshed
#
# The default value is 900 (15 minutes)
#job_proxy_renewal_threshold: 900
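
To make the two parameters concrete, here is a rough Python sketch of the check they describe. This is not the cloud scheduler's actual code: the function names are hypothetical, times are plain Unix timestamps, and already-expired proxies are deliberately left out on the assumption that they must be handled separately (see "Renewal of expired proxy" below).

```python
def refresher_enabled(interval):
    """A job_proxy_refresher_interval of -1 disables proxy refreshing."""
    return interval != -1

def needs_renewal(expiry, now, threshold=900):
    """True when a still-valid proxy expires within `threshold` seconds."""
    remaining = expiry - now
    return 0 < remaining <= threshold

def proxies_to_renew(expiries, now, threshold=900):
    """Return the ids of jobs whose proxy should be renewed on this pass.

    `expiries` maps a job id to the expiry timestamp of its proxy.
    """
    return sorted(job for job, expiry in expiries.items()
                  if needs_renewal(expiry, now, threshold))
```

With the default threshold, a proxy with 10 minutes left is selected, while one with an hour left is skipped until a later pass.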

Put a long-lived proxy to a MyProxy server

Prior to submitting one or more long-lived jobs, the user should run a command like the following:
myproxy-init -R 'host/<cloud_scheduler_host>' -k <creds_name> -s alto.cloud.nrc.ca -d
In the above command, replace <cloud_scheduler_host> with the fully qualified hostname of your cloud scheduler and <creds_name> with a unique name for your credentials in the MyProxy server. Also, if needed, change alto.cloud.nrc.ca in the above command to the name of the MyProxy server for your cloud scheduler (contact your system administrator if you are not sure what value to use for the MyProxy server).

The default lifetime of the delegated credentials on the MyProxy server is one week. If you want a different lifetime, specify it using the -c command line argument to the myproxy-init command shown above.
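
If you wrap this step in a script, the command line can be assembled as in the Python sketch below. It only uses the options shown above, and it assumes that -c takes a lifetime in hours, as described in the myproxy-init manual. The function name and the example host cs.example.org are placeholders.

```python
import shlex

def myproxy_init_argv(cs_host, creds_name, server, cred_hours=None):
    """Build the myproxy-init command line as an argument list."""
    argv = ["myproxy-init", "-R", "host/" + cs_host,
            "-k", creds_name, "-s", server, "-d"]
    if cred_hours is not None:
        # Assumption: -c sets the stored credential lifetime, in hours.
        argv += ["-c", str(cred_hours)]
    return argv

if __name__ == "__main__":
    argv = myproxy_init_argv("cs.example.org", "mycreds", "alto.cloud.nrc.ca", 336)
    # Print a copy-and-paste friendly form; pass argv to subprocess to run it.
    print(" ".join(shlex.quote(a) for a in argv))
```

Passing the list to subprocess instead of a shell string avoids quoting problems with the 'host/...' argument.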

Add MyProxy info to job description

The user must put the following information in his/her job description:
+CSMyProxyServer     = alto.cloud.nrc.ca
+CSMyProxyCredsName  = <creds_name>
In the above job description attributes, replace <creds_name> with the unique name for your credentials in the MyProxy server. Also, if needed, change alto.cloud.nrc.ca in the above attributes to the name of the MyProxy server for your cloud scheduler.

Refreshing user proxy on worker node

Condor already syncs the files between the submit machine and the worker automatically, so no additional step is required to have the proxy on the worker refreshed (see http://www.cs.wisc.edu/condor/manual/v6.8.0/8_4Development_Release.html). If the user's job has proxy renewal via MyProxy properly configured as per the instructions above, then the renewals should propagate automatically to the worker.

If for some reason this does not work for you, then there is a way to do this using the condor_chirp mechanism, as shown below:

Refreshing a user proxy can be done in the job's script. An example of a job script that pulls a fresh proxy is shown below:

# Let's pull a fresh proxy from the submit machine.
PROXY_ON_SUBMIT_MACHINE=$(basename "$X509_USER_PROXY")
/usr/libexec/condor/condor_chirp fetch "$PROXY_ON_SUBMIT_MACHINE" "$X509_USER_PROXY"

# Calls that need a fresh grid proxy, such as gridftp, go here...

In order to be able to use condor_chirp from your job script, you need to put the following in your job classad:

+WantIOProxy = true

Otherwise you will get an error like:

Can't open /var/lib/condor/execute/dir_1441/chirp.config file
cannot chirp_connect to shadow

Note: the refreshed proxy on the execute side will be slightly different from the original one put there by condor when the job started. This can be seen by looking at the output of the openssl x509 command, as shown below:

  • before calling condor_chirp:
            Subject: C=CA, O=Grid, OU=sao.nrc.ca, CN=Andre Charbonneau, CN=proxy, CN=limited proxy, CN=limited proxy
       
  • after calling condor_chirp:
            Subject: C=CA, O=Grid, OU=sao.nrc.ca, CN=Andre Charbonneau, CN=proxy, CN=limited proxy
       

The reason is that when condor puts the proxy there at job start, it actually 'delegates' it, which adds one more proxy level to the subject. When we run condor_chirp to fetch the proxy directly from the submit machine, we get the proxy as-is from the submit machine, without any delegation. This is a technicality and should be transparent to the end user.
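
The difference can also be checked programmatically by counting the trailing proxy components of the subject. The Python sketch below does this for legacy-style proxies like the ones shown above. delegation_depth is a made-up name, and splitting on commas is a simplification that would break on DNs containing commas.

```python
# CN values that legacy (grid-proxy-init -old) proxies append to the subject.
PROXY_CNS = ("proxy", "limited proxy")

def delegation_depth(subject):
    """Count the trailing CN=proxy / CN=limited proxy components."""
    parts = [p.strip() for p in subject.split(",")]
    depth = 0
    for part in reversed(parts):
        key, _, value = part.partition("=")
        if key == "CN" and value in PROXY_CNS:
            depth += 1
        else:
            break  # reached the end-entity part of the subject
    return depth
```

For the two subjects above, the proxy delegated by condor has one more level than the one fetched with condor_chirp.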

Warning: It is unclear whether the data transfers done by chirp are encrypted. Removing the user proxy before doing the chirp call does not affect chirp's behavior, so we can conclude that with the default condor configuration, the user's proxy is not used for authenticating chirp's data transfers. So far, no information about data encryption could be found. It is important to determine this because the user proxy is unencrypted. Still investigating... (Andre)

Interactive VM Configuration

IMPORTANT: This section is not complete and is work in progress...

Condor configuration on VM

  • SCHEDD_HOST = mycloudscheduler.cloud.nrc.ca
  • COLLECTOR_HOST = mycloudscheduler.cloud.nrc.ca
  • DAEMON_LIST = MASTER

GSI configuration on VM

  • A valid user proxy in /tmp/x509up_uXXXX is required (tested without one, and GSI auth fails)
  • A valid host cert in /etc/grid-security is not required (tested without one, and it makes no difference)
  • GridCanada root CA files in /etc/grid-security/certificates

GSI configuration on CS

  • Condor's grid mapfile must contain an entry for the user's DN. For example:
    vm129:/usr/local/condor # cat /etc/grid-security/grid-mapfile.condor 
    "/C=CA/O=Grid/OU=sao.nrc.ca/CN=Andre Charbonneau" andre@cloud.nrc.ca
       

Job submission on VM

  • Since the submit machine is not the same as the CS, we need to spool the files (including the user's creds). This is done by giving the -spool command line argument to the condor_submit command.
  • Since we cannot assume that the user account on the VM will match the user in the condor grid-mapfile on the CS, we need to add the following to the job classads on the interactive VM:
    +Owner = UNDEFINED
          
    else you will get an error similar to:
    [babar@vm135 test]$ condor_submit -spool ./job.andre.txt 
    Submitting job(s)
    ERROR: Failed to set Owner="babar" for job 6916.0 (13)
    
    ERROR: Failed to queue job.
          
  • Once the job is complete, it should reach the 'C' state, as shown below:
    [babar@vm135 test]$ condor_q
    
    
    -- Submitter: vm129.cloud.nrc.ca : <132.246.148.29:8080> : vm129.cloud.nrc.ca
     ID      OWNER            SUBMITTED     RUN_TIME ST PRI SIZE CMD               
    6924.0   andre          11/18 09:58   0+00:01:02 C  0   4.2  script.sh         
    
    0 jobs; 0 idle, 0 running, 0 held
       
  • Then you can use the condor tools to fetch the job output files, as shown below:
    [babar@vm135 test]$ condor_transfer_data -all
    Fetching data files...
    
    [babar@vm135 test]$ cat test.out
    vm140.cloud.nrc.ca
       
    This was tested without a proxy, and the call fails (which is what we want).
  • Then you can remove the finished job:
    [babar@vm135 test]$ condor_rm 6924.0
    Job 6924.0 marked for removal
       

Note:

For some reason, I get an error if I run cloud_status to see if my image is indeed shut down:
andre@vm129:~> cloud_status -m
ID          HOSTNAME                VMTYPE     STATUS   CLUSTER                
17007                               ur.vm.type Error    alto.cloud.nrc.ca      

Total VMs: 1. Total Clouds: 1

I suspect this has something to do with the CS cleaning up the job's spooled files (including the user's delegated creds!) when the job is not in the Running state anymore. Probably the CS will have to be updated to recognize the 'C' state and not touch the job's files until it is actually removed from the queue.

Renewal of expired proxy

NOTE: This feature is still experimental and available in the following git branch: proxy-replace-feature

The idea is to replace an expired proxy via another condor job. Let's call this condor job a 'Proxy Replace' job. This Proxy Replace job contains the id of the job which has an expired proxy. The Cloud Scheduler will detect this Proxy Replace job, extract the id of the job which has an expired proxy and then copy the Proxy Replace job's proxy over the expired one. The Proxy Replace job is then immediately removed from the condor queue.

Scenario:

  • User submits a job and its proxy on the CS expires
  • User does condor_q -l and extracts the GlobalJobId of the job with the expired proxy
  • User constructs a new condor job, that looks like the following:
    Universe   = vanilla
    Executable = /bin/true
    x509userproxy = /tmp/x509up_u20121
    +Owner = UNDEFINED
    Requirements = VMType =?= "xxxx"
    
    +userProxyOverwriteTargetJob = "vm129.cloud.nrc.ca#91.0#1305226680"
    
    Queue
       
  • User submits this new Proxy Replace job to condor
  • Cloud Scheduler replaces the expired job proxy
  • No more expired proxy; User and Cloud Scheduler are happy.

Currently, this only works for running jobs. Jobs that are done (completed) will not be affected. Also, support for renewing expired proxies for VMs is not yet implemented.
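
A tool that builds the Proxy Replace job would first want to validate the GlobalJobId copied from condor_q -l. The following Python sketch shows one way to do that. Both function names are hypothetical, and the three '#'-separated fields are assumed to be the schedd name, the cluster.proc job id and a timestamp, as in the example above.

```python
def parse_global_job_id(global_job_id):
    """Split a GlobalJobId into (schedd, cluster, proc, timestamp).

    Raises ValueError if the id does not have the expected shape.
    """
    schedd, job, stamp = global_job_id.split("#")
    cluster, proc = job.split(".")
    return schedd, int(cluster), int(proc), int(stamp)

def proxy_replace_attr(global_job_id):
    """Format the attribute that names the job whose proxy is replaced."""
    parse_global_job_id(global_job_id)  # reject malformed ids early
    return '+userProxyOverwriteTargetJob = "%s"' % global_job_id
```

For the id in the scenario above, proxy_replace_attr returns the same +userProxyOverwriteTargetJob line as in the example job.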

Note that the creation and submission of the Proxy Replace condor job described above can easily be done in a user-friendly command-line tool or via the web portal. -- AndreCharbonneau - 2010-08-30

Topic attachments

  • globus_simple_ca_08b380b1_setup-0.20.tar.gz (211.1 K, 2011-03-18 13:55, AndreCharbonneau): NEP-52 CA package (root cert, signing policy, etc.)
Topic revision: r24 - 2014-01-25 - FrankBerghaus
 