Difference: CERNOpenStackTesting (1 vs. 12)

Revision 12 (2013-04-08) - rptaylor

Line: 1 to 1
 
META TOPICPARENT name="AtlasIaaSCloud"

CERN OpenStack Ibex Cloud Testing

Line: 11 to 11
 

Issues

Changed:
<
<

DNS reverse lookups incorrect

>
>

DNS reverse lookups incorrect

  In order to use these VMs with Condor we need them to have usable DNS entries. We use the widely used boto library to boot the VMs. Right now the boto call returns the instance data with the public DNS name set to the VM IP address. This is a common problem that we normally resolve by doing reverse DNS lookups on the IP addresses. However, when we do reverse DNS lookups for the Ibex IP addresses we occasionally get what appears to be a garbage entry for the record.
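A minimal sketch of the lookup we do, assuming a boto EC2 connection to the Ibex endpoint (the credentials, endpoint details, and instance selection below are placeholders, not our actual configuration):

from boto.ec2.connection import EC2Connection
import socket

# Placeholder credentials and endpoint for the Ibex EC2 interface.
conn = EC2Connection("EC2_ACCESS_KEY", "EC2_SECRET_KEY",
                     host="ibex.cern.ch", is_secure=False,
                     port=8773, path="/services/Cloud")

# Pick one of our instances; on Ibex, public_dns_name currently holds the IP address.
instance = conn.get_all_instances()[0].instances[0]
ip = instance.public_dns_name

# Fall back to a reverse DNS lookup so Condor gets a usable hostname.
try:
    hostname = socket.gethostbyaddr(ip)[0]
except socket.herror:
    hostname = None

# Occasionally the PTR record is garbage (e.g. zjkcktlhvnf.cern.ch) instead of
# the expected server-<uuid>.cern.ch name.
print("%s -> %s" % (ip, hostname))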
Line: 57 to 57
 
Changed:
<
<

Incorrect value of Boto python EC2 Library instance.public_dns_name

>
>

Incorrect value of Boto python EC2 Library instance.public_dns_name

  We would expect instance.public_dns_name to return the fully qualified domain name when interacting with Ibex using boto; however, it returns the IP address of the machine. The EC2 metadata server does have the correct value. For example, http://169.254.169.254/latest/meta-data/public_hostname returns server-{uuid}.cern.ch, the FQDN of the VM. The same FQDN is in the public_dns_name field of the boto instance data for Amazon EC2 instances. If a solution can be found for the reverse DNS issue we can ignore this for now.
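For reference, the metadata value can be checked from inside the VM with a couple of lines of Python (Python 2 era, urllib2):

import urllib2

# Query the EC2 metadata service from inside the VM; it reports the correct FQDN
# even though boto's public_dns_name field holds only the IP address.
url = "http://169.254.169.254/latest/meta-data/public_hostname"
fqdn = urllib2.urlopen(url, timeout=5).read().strip()
print(fqdn)   # expected: server-{uuid}.cern.ch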
Added:
>
>

VMs booting into Error state

VMs have had problems booting since 2013-04-06. They go into the Error state and Cloud Scheduler shuts them down again.
 

Requests

Revision 11 (2013-04-04) - mhp

Line: 1 to 1
 
META TOPICPARENT name="AtlasIaaSCloud"

CERN OpenStack Ibex Cloud Testing

Line: 59 to 59
 

Incorrect value of Boto python EC2 Library instance.public_dns_name

Changed:
<
<
We would expect instance.public_dns_name to return the fully qualified domain name when interacting with Ibex using boto; however, it returns the IP address of the machine. The EC2 metadata server does have the correct value. For example, http://169.254.169.254/latest/meta-data/public_hostname returns server-{uuid}.cern.ch. If a solution can be found for the reverse DNS issue we can ignore this for now.
>
>
We would expect instance.public_dns_name to return the fully qualified domain name when interacting with Ibex using boto; however, it returns the IP address of the machine. The EC2 metadata server does have the correct value. For example, http://169.254.169.254/latest/meta-data/public_hostname returns server-{uuid}.cern.ch, the FQDN of the VM. The same FQDN is in the public_dns_name field of the boto instance data for Amazon EC2 instances. If a solution can be found for the reverse DNS issue we can ignore this for now.
 

Requests

Revision 10 (2013-04-03) - rptaylor

Line: 1 to 1
 
META TOPICPARENT name="AtlasIaaSCloud"

CERN OpenStack Ibex Cloud Testing

Line: 48 to 48
 lcg-cp -D srmv2 -s ATLASSCRATCHDISK -b --connect-timeout=10 --sendreceive-timeout=10 --srm-timeout=10 srm://gorgon01.westgrid.ca:8443/srm/managerv2?SFN=/pnfs/westgrid-test.uvic.ca/data/atlas/atlasscratchdisk/user.rptaylor/cond10_data.000019.gen.COND._0003.pool.root file://dev/null
Added:
>
>
Allowing ports 20000-25000 in the security group will probably fix this.
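A hedged sketch of adding that rule with boto (the security group name "default" and the connection details are assumptions, not the actual setup):

from boto.ec2.connection import EC2Connection

# Placeholder credentials and endpoint for the Ibex EC2 interface.
conn = EC2Connection("EC2_ACCESS_KEY", "EC2_SECRET_KEY",
                     host="ibex.cern.ch", is_secure=False,
                     port=8773, path="/services/Cloud")

# Open the GridFTP data-channel range used by multi-stream lcg-cp transfers.
# The group name "default" is an assumption; use whichever group the VMs run in.
conn.authorize_security_group(group_name="default",
                              ip_protocol="tcp",
                              from_port=20000,
                              to_port=25000,
                              cidr_ip="0.0.0.0/0")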
 

EC2 Metadata incorrect

Revision 9 (2013-04-03) - rptaylor

Line: 1 to 1
 
META TOPICPARENT name="AtlasIaaSCloud"

CERN OpenStack Ibex Cloud Testing

Line: 41 to 41
 
# This fails
lcg-cp -n 10 -D srmv2 -s ATLASSCRATCHDISK -b --connect-timeout=10 --sendreceive-timeout=10 --srm-timeout=10 srm://gorgon01.westgrid.ca:8443/srm/managerv2?SFN=/pnfs/westgrid-test.uvic.ca/data/atlas/atlasscratchdisk/user.rptaylor/testfile file://dev/null
Added:
>
>
lcg-cp -n 10 -D srmv2 -s ATLASSCRATCHDISK -b --connect-timeout=10 --sendreceive-timeout=10 --srm-timeout=10 srm://gorgon01.westgrid.ca:8443/srm/managerv2?SFN=/pnfs/westgrid-test.uvic.ca/data/atlas/atlasscratchdisk/user.rptaylor/cond10_data.000019.gen.COND._0003.pool.root file://dev/null
# But this works:
lcg-cp -D srmv2 -s ATLASSCRATCHDISK -b --connect-timeout=10 --sendreceive-timeout=10 --srm-timeout=10 srm://gorgon01.westgrid.ca:8443/srm/managerv2?SFN=/pnfs/westgrid-test.uvic.ca/data/atlas/atlasscratchdisk/user.rptaylor/testfile file://dev/null
Added:
>
>
lcg-cp -D srmv2 -s ATLASSCRATCHDISK -b --connect-timeout=10 --sendreceive-timeout=10 --srm-timeout=10 srm://gorgon01.westgrid.ca:8443/srm/managerv2?SFN=/pnfs/westgrid-test.uvic.ca/data/atlas/atlasscratchdisk/user.rptaylor/cond10_data.000019.gen.COND._0003.pool.root file://dev/null
 

Revision 8 (2013-04-03) - igable

Line: 1 to 1
 
META TOPICPARENT name="AtlasIaaSCloud"

CERN OpenStack Ibex Cloud Testing

Line: 56 to 56
 

Incorrect value of Boto python EC2 Library instance.public_dns_name

Changed:
<
<
We would expect instance.public_dns_name to return the fully qualified domain name when interacting with Ibex using boto; however, it returns the IP address of the machine. The EC2 metadata server does have the correct value. For example, http://169.254.169.254/latest/meta-data/public_hostname returns server-{uuid}.cern.ch.
>
>
We would expect instance.public_dns_name to return the fully qualified domain name when interacting with Ibex using boto; however, it returns the IP address of the machine. The EC2 metadata server does have the correct value. For example, http://169.254.169.254/latest/meta-data/public_hostname returns server-{uuid}.cern.ch. If a solution can be found for the reverse DNS issue we can ignore this for now.
 

Requests

Revision 7 (2013-04-03) - rptaylor

Line: 1 to 1
 
META TOPICPARENT name="AtlasIaaSCloud"

CERN OpenStack Ibex Cloud Testing

Line: 23 to 23
  dig -x 188.184.139.127 -> zjkcktlhvnf.cern.ch
Changed:
<
<
On Tue Apr 2 16:18:57 PDT 2013, no VMs with crazy names were running. If we see any again, we know the problem is still there.
>
>
Update
  • On Tue Apr 2 16:18:57 PDT 2013, no VMs with crazy names were running. If we see any again, we know the problem is still there.
  • On Tue Apr 2 18:21:20 PDT 2013 we saw it. The problem is still happening.
i-00000737 z544tjsrdnj.cern.ch       cernvm-batch-node-2.6.1-1-1-x86_64-dh.v5.gz apf-test   Running      ibex           

gridftp problems

A lot of lcg-cp transfers failed. However, other network operations seem okay, e.g. lcg-cr, lcg-del, lcg-lr. It was only downloads that failed, not uploads. I further isolated the problem to only multi-stream transfers. It seems to be fairly consistently reproducible so far. Not sure yet why single-stream downloads work but multi-stream ones fail. Still investigating.

The exact commands that can reproduce the failure are, e.g.:

# This fails
lcg-cp -n 10 -D srmv2 -s ATLASSCRATCHDISK -b --connect-timeout=10 --sendreceive-timeout=10 --srm-timeout=10 srm://gorgon01.westgrid.ca:8443/srm/managerv2?SFN=/pnfs/westgrid-test.uvic.ca/data/atlas/atlasscratchdisk/user.rptaylor/testfile file://dev/null

# But this works:
lcg-cp -D srmv2 -s ATLASSCRATCHDISK -b --connect-timeout=10 --sendreceive-timeout=10 --srm-timeout=10 srm://gorgon01.westgrid.ca:8443/srm/managerv2?SFN=/pnfs/westgrid-test.uvic.ca/data/atlas/atlasscratchdisk/user.rptaylor/testfile file://dev/null
 

EC2 Metadata incorrect

Revision 6 (2013-04-02) - rptaylor

Line: 1 to 1
 
META TOPICPARENT name="AtlasIaaSCloud"

CERN OpenStack Ibex Cloud Testing

Line: 23 to 23
  dig -x 188.184.139.127 -> zjkcktlhvnf.cern.ch
Changed:
<
<
>
>
On Tue Apr 2 16:18:57 PDT 2013, no VMs with crazy names were running. If we see any again, we know the problem is still there.
 

EC2 Metadata incorrect

Revision 5 (2013-04-02) - igable

Line: 1 to 1
 
META TOPICPARENT name="AtlasIaaSCloud"

CERN OpenStack Ibex Cloud Testing

Line: 23 to 23
  dig -x 188.184.139.127 -> zjkcktlhvnf.cern.ch
Added:
>
>

EC2 Metadata incorrect

There are two problems with the EC2 metadata:

Incorrect value of Boto python EC2 Library instance.public_dns_name

We would expect instance.public_dns_name to return the fully qualified domain name when interacting with Ibex using boto; however, it returns the IP address of the machine. The EC2 metadata server does have the correct value. For example, http://169.254.169.254/latest/meta-data/public_hostname returns server-{uuid}.cern.ch.

Requests

 

Whole node VM type

We would like a VM instance type that has the following properties:

Line: 32 to 49
 
  • 160 GB ephemeral storage

We have found this to be a good size for ATLAS jobs, and it keeps the total number of VMs down.

Deleted:
<
<

EC2 Metadata

Boto Python EC2 Library

Revision 4 (2013-04-02) - mhp

Line: 1 to 1
 
META TOPICPARENT name="AtlasIaaSCloud"

CERN OpenStack Ibex Cloud Testing

Line: 32 to 32
 
  • 160 GB ephemeral storage

We have found this to be a good size for ATLAS jobs, and it keeps the total number of VMs down.

Added:
>
>

EC2 Metadata

Boto Python EC2 Library

Revision 3 (2013-03-28) - igable

Line: 1 to 1
 
META TOPICPARENT name="AtlasIaaSCloud"

CERN OpenStack Ibex Cloud Testing

Added:
>
>

Outline

We are adding some CERN OpenStack resources to our currently operating ATLAS IaaS system. We boot the VMs using Cloud Scheduler and contextualize them to join a Condor pool.
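As a rough illustration of the boot step that Cloud Scheduler performs for us (the image ID, instance type, keypair, endpoint, and contextualization script are placeholders, not our actual configuration):

from boto.ec2.connection import EC2Connection

# Placeholder credentials and endpoint for the Ibex EC2 interface.
conn = EC2Connection("EC2_ACCESS_KEY", "EC2_SECRET_KEY",
                     host="ibex.cern.ch", is_secure=False,
                     port=8773, path="/services/Cloud")

# Placeholder contextualization script: have the VM join our Condor pool on boot.
user_data = """#!/bin/bash
echo "CONDOR_HOST = condor.example.uvic.ca" >> /etc/condor/condor_config.local
service condor start
"""

reservation = conn.run_instances(image_id="ami-00000042",      # placeholder CernVM batch-node image
                                 instance_type="m1.xlarge",    # placeholder instance type
                                 key_name="cloudscheduler",    # placeholder keypair
                                 user_data=user_data)
print(reservation.instances[0].id)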

 

Issues

Changed:
<
<
  • Using Boto
    • instance.public_dns_name is set to the IP address
    • since the hostname cannot be determined through the public DNS name, we use a reverse lookup
  • dig -x and host reverse lookups do not always return the correct values
  • Successful lookup - matched the VM hostname
[mhp@heplw31 ~]$ dig -x 188.184.133.67

; <<>> DiG 9.3.6-P1-RedHat-9.3.6-20.P1.el5_8.5 <<>> -x 188.184.133.67
;; global options:  printcmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 46773
;; flags: qr rd ra; QUERY: 1, ANSWER: 1, AUTHORITY: 3, ADDITIONAL: 4

;; QUESTION SECTION:
;67.133.184.188.in-addr.arpa.   IN      PTR

;; ANSWER SECTION:
67.133.184.188.in-addr.arpa. 1162 IN    PTR     server-2574ef60-7e2f-4b5d-a41c-bd582d935090.cern.ch.

;; AUTHORITY SECTION:
184.188.in-addr.arpa.   1162    IN      NS      ext-dns-1.cern.ch.
184.188.in-addr.arpa.   1162    IN      NS      ext-dns-2.cern.ch.
184.188.in-addr.arpa.   1162    IN      NS      ns.ripe.net.

;; ADDITIONAL SECTION:
ext-dns-2.cern.ch.      7       IN      A       192.91.245.85
ext-dns-2.cern.ch.      2761    IN      AAAA    2001:1458:1:2::100:85
ext-dns-1.cern.ch.      7       IN      A       192.65.187.5
ns.ripe.net.            1162    IN      A       193.0.9.6

;; Query time: 0 msec
;; SERVER: 142.104.61.2#53(142.104.61.2)
;; WHEN: Wed Mar 27 16:15:56 2013
;; MSG SIZE  rcvd: 259 
  • Failed lookup - VM hostname was server-68eafb52-7372-425c-9a07-8206739d4c61.cern.ch
[mhp@heplw31 ~]$ dig -x 188.184.139.127

; <<>> DiG 9.3.6-P1-RedHat-9.3.6-20.P1.el5_8.5 <<>> -x 188.184.139.127
;; global options:  printcmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 48437
;; flags: qr rd ra; QUERY: 1, ANSWER: 1, AUTHORITY: 3, ADDITIONAL: 4

;; QUESTION SECTION:
;127.139.184.188.in-addr.arpa.  IN      PTR

;; ANSWER SECTION:
127.139.184.188.in-addr.arpa. 1868 IN   PTR     zjkcktlhvnf.cern.ch.

;; AUTHORITY SECTION:
184.188.in-addr.arpa.   1190    IN      NS      ext-dns-2.cern.ch.
184.188.in-addr.arpa.   1190    IN      NS      ns.ripe.net.
184.188.in-addr.arpa.   1190    IN      NS      ext-dns-1.cern.ch.

;; ADDITIONAL SECTION:
ext-dns-2.cern.ch.      35      IN      A       192.91.245.85
ext-dns-2.cern.ch.      2789    IN      AAAA    2001:1458:1:2::100:85
ext-dns-1.cern.ch.      35      IN      A       192.65.187.5
ns.ripe.net.            1190    IN      A       193.0.9.6

;; Query time: 0 msec
;; SERVER: 142.104.61.2#53(142.104.61.2)
;; WHEN: Wed Mar 27 16:15:28 2013
;; MSG SIZE  rcvd: 228 
>
>

DNS reverse lookups incorrect

In order to use these VMs with Condor we need them to have usable DNS entries. We use the widely used boto library to boot the VMs. Right now the boto call returns the instance data with the public DNS name set to the VM IP address. This is a common problem that we normally resolve by doing reverse DNS lookups on the IP addresses. However, when we do reverse DNS lookups for the Ibex IP addresses we occasionally get what appears to be a garbage entry for the record.

For example a correct one is:

dig -x 188.184.133.67 -> server-2574ef60-7e2f-4b5d-a41c-bd582d935090.cern.ch

and a garbage entry here:

dig -x 188.184.139.127 -> zjkcktlhvnf.cern.ch

Whole node VM type

We would like a VM instance type that has the following properties:

  • 8 cores
  • 16 GB Memory
  • 20 GB root partition,
  • 160 GB ephemeral storage

We have found this to be a good size for ATLAS jobs, and it keeps the total number of VMs down.
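For clarity, the requested flavor expressed as a python-novaclient sketch (the admin credentials, Keystone endpoint, and flavor name are placeholders; only the sizes come from the list above):

from novaclient.v1_1 import client

# Placeholder admin credentials and Keystone endpoint.
nova = client.Client("USERNAME", "PASSWORD", "PROJECT",
                     "https://keystone.example.cern.ch:5000/v2.0")

# Whole-node flavor: 8 cores, 16 GB RAM, 20 GB root disk, 160 GB ephemeral.
nova.flavors.create(name="atlas.wholenode",   # placeholder name
                    ram=16384,                # MB
                    vcpus=8,
                    disk=20,                  # GB root partition
                    flavorid="auto",
                    ephemeral=160)            # GB ephemeral storage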

Revision 2 (2013-03-28) - mhp

Line: 1 to 1
 
META TOPICPARENT name="AtlasIaaSCloud"

CERN OpenStack Ibex Cloud Testing

Issues

\ No newline at end of file
Added:
>
>
  • Using Boto
    • instance.public_dns_name is set to the IP address
    • since the hostname cannot be determined through the public DNS name, we use a reverse lookup
  • dig -x and host reverse lookups do not always return the correct values
  • Successful lookup - matched the VM hostname
[mhp@heplw31 ~]$ dig -x 188.184.133.67

; <<>> DiG 9.3.6-P1-RedHat-9.3.6-20.P1.el5_8.5 <<>> -x 188.184.133.67
;; global options:  printcmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 46773
;; flags: qr rd ra; QUERY: 1, ANSWER: 1, AUTHORITY: 3, ADDITIONAL: 4

;; QUESTION SECTION:
;67.133.184.188.in-addr.arpa.   IN      PTR

;; ANSWER SECTION:
67.133.184.188.in-addr.arpa. 1162 IN    PTR     server-2574ef60-7e2f-4b5d-a41c-bd582d935090.cern.ch.

;; AUTHORITY SECTION:
184.188.in-addr.arpa.   1162    IN      NS      ext-dns-1.cern.ch.
184.188.in-addr.arpa.   1162    IN      NS      ext-dns-2.cern.ch.
184.188.in-addr.arpa.   1162    IN      NS      ns.ripe.net.

;; ADDITIONAL SECTION:
ext-dns-2.cern.ch.      7       IN      A       192.91.245.85
ext-dns-2.cern.ch.      2761    IN      AAAA    2001:1458:1:2::100:85
ext-dns-1.cern.ch.      7       IN      A       192.65.187.5
ns.ripe.net.            1162    IN      A       193.0.9.6

;; Query time: 0 msec
;; SERVER: 142.104.61.2#53(142.104.61.2)
;; WHEN: Wed Mar 27 16:15:56 2013
;; MSG SIZE  rcvd: 259 
  • Failed lookup - VM hostname was server-68eafb52-7372-425c-9a07-8206739d4c61.cern.ch
[mhp@heplw31 ~]$ dig -x 188.184.139.127

; <<>> DiG 9.3.6-P1-RedHat-9.3.6-20.P1.el5_8.5 <<>> -x 188.184.139.127
;; global options:  printcmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 48437
;; flags: qr rd ra; QUERY: 1, ANSWER: 1, AUTHORITY: 3, ADDITIONAL: 4

;; QUESTION SECTION:
;127.139.184.188.in-addr.arpa.  IN      PTR

;; ANSWER SECTION:
127.139.184.188.in-addr.arpa. 1868 IN   PTR     zjkcktlhvnf.cern.ch.

;; AUTHORITY SECTION:
184.188.in-addr.arpa.   1190    IN      NS      ext-dns-2.cern.ch.
184.188.in-addr.arpa.   1190    IN      NS      ns.ripe.net.
184.188.in-addr.arpa.   1190    IN      NS      ext-dns-1.cern.ch.

;; ADDITIONAL SECTION:
ext-dns-2.cern.ch.      35      IN      A       192.91.245.85
ext-dns-2.cern.ch.      2789    IN      AAAA    2001:1458:1:2::100:85
ext-dns-1.cern.ch.      35      IN      A       192.65.187.5
ns.ripe.net.            1190    IN      A       193.0.9.6

;; Query time: 0 msec
;; SERVER: 142.104.61.2#53(142.104.61.2)
;; WHEN: Wed Mar 27 16:15:28 2013
;; MSG SIZE  rcvd: 228 

Revision 1 (2013-03-28) - igable

Line: 1 to 1
Added:
>
>
META TOPICPARENT name="AtlasIaaSCloud"

CERN OpenStack Ibex Cloud Testing

Issues

 