CERN OpenStack Ibex Cloud Testing

Outline

We are adding some CERN OpenStack resources to our existing ATLAS IaaS setup. We boot the VMs using Cloud Scheduler and contextualize them to join a Condor pool.

Issues

DNS reverse lookups incorrect

To use these VMs with Condor, we need them to have usable DNS entries. We boot the VMs with the widely used boto library. Right now the boto call returns the instance data with the public DNS name set to the VM's IP address. This is a common problem, which we work around by doing reverse DNS lookups on the IP addresses. However, when we do reverse DNS lookups for the Ibex IP addresses, we occasionally get what appears to be a garbage entry for the record.

For example, a correct entry is:

dig -x 188.184.133.67 -> server-2574ef60-7e2f-4b5d-a41c-bd582d935090.cern.ch

and a garbage entry is:

dig -x 188.184.139.127 -> zjkcktlhvnf.cern.ch
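As a rough sketch of how the reverse-lookup workaround could reject garbage records, the snippet below only accepts names of the form server-{uuid}.cern.ch. That pattern is an assumption inferred from the single correct example above, and the function names (looks_valid, reverse_lookup) are hypothetical, not part of Cloud Scheduler or boto.

```python
import re
import socket

# Assumed pattern, inferred from the one correct example on this page:
# server-<uuid>.cern.ch
EXPECTED_NAME = re.compile(
    r"^server-[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}"
    r"\.cern\.ch$"
)


def looks_valid(hostname):
    """Return True if hostname matches the assumed server-<uuid>.cern.ch form."""
    return EXPECTED_NAME.match(hostname.lower().rstrip(".")) is not None


def reverse_lookup(ip, resolver=socket.gethostbyaddr):
    """Reverse-resolve ip; return the name only if it looks valid, else None.

    resolver is injectable so the check can be exercised without network
    access; by default the system resolver is used.
    """
    try:
        hostname = resolver(ip)[0]
    except (socket.herror, socket.gaierror):
        return None
    return hostname if looks_valid(hostname) else None
```

A caller could retry later or skip the VM when reverse_lookup returns None, rather than handing a garbage name to Condor.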

Update

  • On Tue Apr 2 16:18:57 PDT 2013, no VMs with crazy names were running. If we see any again, we know the problem is still there.
  • On Tue Apr 2 18:21:20 PDT 2013 we saw it. The problem is still happening.
i-00000737 z544tjsrdnj.cern.ch       cernvm-batch-node-2.6.1-1-1-x86_64-dh.v5.gz apf-test   Running      ibex           

gridftp problems

A lot of lcg-cp transfers failed. However, other network operations seem okay, e.g. lcg-cr, lcg-del, lcg-lr. It was only downloads that failed, not uploads. I further isolated the problem to only multi-stream transfers. It seems to be fairly consistently reproducible so far. Not sure yet why single-stream downloads work but multi-stream ones fail. Still investigating.

Example commands that reproduce the failure:

# These fail:
lcg-cp -n 10 -D srmv2 -s ATLASSCRATCHDISK -b --connect-timeout=10 --sendreceive-timeout=10 --srm-timeout=10 srm://gorgon01.westgrid.ca:8443/srm/managerv2?SFN=/pnfs/westgrid-test.uvic.ca/data/atlas/atlasscratchdisk/user.rptaylor/testfile file://dev/null
lcg-cp -n 10 -D srmv2 -s ATLASSCRATCHDISK -b --connect-timeout=10 --sendreceive-timeout=10 --srm-timeout=10 srm://gorgon01.westgrid.ca:8443/srm/managerv2?SFN=/pnfs/westgrid-test.uvic.ca/data/atlas/atlasscratchdisk/user.rptaylor/cond10_data.000019.gen.COND._0003.pool.root file://dev/null

# But these work:
lcg-cp -D srmv2 -s ATLASSCRATCHDISK -b --connect-timeout=10 --sendreceive-timeout=10 --srm-timeout=10 srm://gorgon01.westgrid.ca:8443/srm/managerv2?SFN=/pnfs/westgrid-test.uvic.ca/data/atlas/atlasscratchdisk/user.rptaylor/testfile file://dev/null
lcg-cp -D srmv2 -s ATLASSCRATCHDISK -b --connect-timeout=10 --sendreceive-timeout=10 --srm-timeout=10 srm://gorgon01.westgrid.ca:8443/srm/managerv2?SFN=/pnfs/westgrid-test.uvic.ca/data/atlas/atlasscratchdisk/user.rptaylor/cond10_data.000019.gen.COND._0003.pool.root file://dev/null

Allowing ports 20000-25000 in the security group will probably fix this.
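One plausible explanation, not confirmed here, is that multi-stream GridFTP downloads need the storage server to open data connections back to the VM, which the security group would block. A minimal sketch of opening the port range through boto follows. It assumes conn is a boto (version 2) EC2 connection to Ibex, that the group is named "default", and that any source address should be allowed; all three are assumptions, and we have not verified that the Ibex EC2 endpoint accepts this call.

```python
def open_gridftp_data_ports(conn, group_name="default",
                            from_port=20000, to_port=25000,
                            cidr_ip="0.0.0.0/0"):
    """Ask the cloud to allow inbound TCP on the given port range.

    conn is expected to be a boto (version 2) EC2 connection, for example
    the one already used to boot the VMs. The group name and CIDR defaults
    are placeholders and should be adjusted to the real setup.
    """
    return conn.authorize_security_group(
        group_name=group_name,
        ip_protocol="tcp",
        from_port=from_port,
        to_port=to_port,
        cidr_ip=cidr_ip,
    )
```

Restricting cidr_ip to the storage element's address range would be tighter than the wide-open default used here.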

EC2 Metadata incorrect

There are two problems with the EC2 metadata:

Incorrect value of instance.public_dns_name in the boto Python EC2 library

When talking to Ibex through boto, we would expect instance.public_dns_name to hold the VM's fully qualified domain name; instead it holds the VM's IP address. The EC2 metadata server does have the correct value: for example, http://169.254.169.254/latest/meta-data/public-hostname returns server-{uuid}.cern.ch, the FQDN of the VM. On Amazon EC2, by contrast, the public_dns_name field of the boto instance data matches the FQDN reported by the metadata server. If the reverse DNS issue is resolved, we can ignore this for now.
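To illustrate the fallback, here is a small sketch that treats public_dns_name as usable only when it is not a bare IP address, and otherwise tries a reverse lookup. The helper names (is_ip_literal, usable_hostname) are hypothetical and this is not Cloud Scheduler's actual code; the value passed in is assumed to be what boto reports in instance.public_dns_name.

```python
import ipaddress
import socket


def is_ip_literal(value):
    """Return True if value parses as an IPv4 or IPv6 address."""
    try:
        ipaddress.ip_address(value)
        return True
    except ValueError:
        return False


def usable_hostname(public_dns_name, resolver=socket.gethostbyaddr):
    """Return a hostname for an instance, or None if none can be found.

    If the cloud already reports a real name, it is returned unchanged.
    If it reports an IP address (as Ibex does), a reverse DNS lookup is
    attempted. resolver is injectable for testing without network access.
    """
    if not public_dns_name:
        return None
    if not is_ip_literal(public_dns_name):
        return public_dns_name
    try:
        return resolver(public_dns_name)[0]
    except (socket.herror, socket.gaierror):
        return None
```

With this shape, clouds that already fill in a real name are left alone, and only IP-valued fields trigger the reverse lookup.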

VMs booting into Error state

VMs have been having problems booting since 2013-04-06. They go into the Error state and Cloud Scheduler (CS) shuts them down again.

Requests

Whole node VM type

We would like a VM instance type that has the following properties:

  • 8 cores
  • 16 GB memory
  • 20 GB root partition
  • 160 GB ephemeral storage

We have found this to be a good size for ATLAS jobs, and it keeps the total number of VMs down.

Topic revision: r12 - 2013-04-08 - rptaylor