Documentation

Dan's recipe for CloudSigma may be a useful reference.

Colin's recipe for dual-hypervisor configuration may also be useful.

The old CernVM configuration recipe is at ConfigureCernVMOld.

CernVM v2.6 Batch Node Configuration

Note: this recipe is now superseded by BuildingDHAtlasVM

Xen Image Configuration

Download the Xen image and mount it locally to modify it:

wget http://cernvm.cern.ch/releases/17/cernvm-batch-node-2.6.0-4-1-x86_64.ext3.gz
gunzip cernvm-batch-node-2.6.0-4-1-x86_64.ext3.gz
mount -o loop cernvm-batch-node-2.6.0-4-1-x86_64.ext3 /mnt
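
A quick sanity check that the loop mount succeeded (the image's filesystem tree should be visible):
ls /mnt/etc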

Network tuning

Add the following to /mnt/etc/sysctl.conf
# Network tuning: http://fasterdata.es.net/fasterdata/host-tuning/linux/
net.core.rmem_max = 16777216
net.core.wmem_max = 16777216 
net.ipv4.tcp_rmem = 4096 87380 16777216
net.ipv4.tcp_wmem = 4096 65536 16777216
net.core.netdev_max_backlog = 30000
net.ipv4.tcp_timestamps = 1
net.ipv4.tcp_sack = 1
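
These settings take effect when the VM boots; they cannot be applied inside the chrooted image. In a running VM they can be verified with:
sysctl net.core.rmem_max net.core.wmem_max net.ipv4.tcp_rmem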

Add the following to /mnt/etc/rc.local

# increase txqueuelen for 10G NICs
/sbin/ifconfig eth0 txqueuelen 10000
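
Likewise, the queue length can be confirmed in a booted VM (assuming the classic net-tools output format):
/sbin/ifconfig eth0 | grep txqueuelen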

SSH keys

Optionally, add ssh keys for debugging:
mkdir /mnt/root/.ssh
chmod 700 /mnt/root/.ssh
vi /mnt/root/.ssh/authorized_keys #add ssh public keys
chmod 600 /mnt/root/.ssh/authorized_keys
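
Alternatively, skip the interactive edit and append the build host's own public key directly (assuming it is at ~/.ssh/id_rsa.pub):
cat ~/.ssh/id_rsa.pub >> /mnt/root/.ssh/authorized_keys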

User accounts

Add the condor user:

echo "condor:x:102:102:Owner of Condor Daemons:/var/lib/condor:/sbin/nologin" >> /mnt/etc/passwd
echo "condor:x:102:" >> /mnt/etc/group

Add the atlas users to /etc/passwd and /etc/group and create their home directories. (There should be at least as many atlas accounts as cores in the VM.)

for i in `seq -w 1 32`; do
  uid=$((499+${i##0}))   # strip the leading zero so $((...)) does not parse it as octal
  echo "atlas$i:x:$uid:$uid::/home/atlas$i:/bin/bash" >> /mnt/etc/passwd
  echo "atlas$i:x:$uid:" >> /mnt/etc/group
  mkdir /mnt/home/atlas$i
  chown $uid:$uid /mnt/home/atlas$i
done
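
A quick spot-check of the result:
tail -2 /mnt/etc/passwd   # should show atlas31 and atlas32
ls /mnt/home | wc -l      # 32, assuming /mnt/home was empty beforehand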

Grid Environment

Create a file /mnt/etc/profile.d/grid-setup.sh containing:

# Keep grid setup out of root's environment; it causes a problem when starting condor.
if [[ $UID -eq 0 ]]; then
  return 0
fi

export GLOBUS_FTP_CLIENT_GRIDFTP2=true

# Workaround for condor not setting $HOME for atlas users.
# voms-proxy-info requires this.
if [[ -z "$HOME" ]] ; then
  export HOME=`eval echo ~$USER`
fi

## Set up grid environment:
## Option 1: gLite 3.2 in /cvmfs/grid.cern.ch
## Currently this doesn't work because 32-bit lfc libraries are configured instead of 64-bit
#. /cvmfs/grid.cern.ch/3.2.11-1/etc/profile.d/grid-env.sh
## Option 2: gLite 3.2 in AtlasLocalRootBase
shopt -s expand_aliases
export ATLAS_LOCAL_ROOT_BASE=/cvmfs/atlas.cern.ch/repo/ATLASLocalRootBase
alias setupATLAS='source ${ATLAS_LOCAL_ROOT_BASE}/user/atlasLocalSetup.sh'
setupATLAS --quiet
localSetupGLite
## Fix for using AtlasLocalRootBase with a kit
unset  AtlasSetupSite
rm ~/.asetup

# Site-specific variables (e.g. Frontier and Squid servers) are set based on ATLAS_SITE_NAME (from JDL).
# This auto-setup is only temporarily needed; it will soon become automatic.
. /cvmfs/atlas.cern.ch/repo/sw/local/bin/auto-setup
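
Once a VM is booted, the grid setup can be sanity-checked as one of the atlas users (a post-boot check; the command should resolve via the gLite setup above):
su - atlas01 -c 'which voms-proxy-init'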

CVMFS

In /mnt/etc/cvmfs/default.local add the following lines:
CVMFS_REPOSITORIES=atlas.cern.ch,atlas-condb.cern.ch,grid.cern.ch
CVMFS_QUOTA_LIMIT=3500
CVMFS_HTTP_PROXY="http://chrysaor.westgrid.ca:3128;http://cernvm-webfs.atlas-canada.ca:3128;DIRECT"

NOTE: there seems to be a CVMFS bug that prevents the ${CERNVM_SERVER_URL:=...} form used below from working. It works if the ${CERNVM_SERVER_URL:= part (and the closing brace) are left out.

Create /mnt/etc/cvmfs/domain.d/cern.ch.local containing:

# For Europe:
#CVMFS_SERVER_URL=${CERNVM_SERVER_URL:="http://cvmfs-stratum-one.cern.ch:8000/opt/@org@;http://cernvmfs.gridpp.rl.ac.uk:8000/opt/@org@;http://cvmfs.racf.bnl.gov:8000/opt/@org@;http://cvmfs.fnal.gov:8000/opt/@org@;http://cvmfs02.grid.sinica.edu.tw:8000/opt/@org@"}
# For North America:
CVMFS_SERVER_URL=${CERNVM_SERVER_URL:="http://cvmfs.racf.bnl.gov:8000/opt/@org@;http://cvmfs.fnal.gov:8000/opt/@org@;http://cvmfs-stratum-one.cern.ch:8000/opt/@org@;http://cernvmfs.gridpp.rl.ac.uk:8000/opt/@org@;http://cvmfs02.grid.sinica.edu.tw:8000/opt/@org@"}
# For Australia:
#CVMFS_SERVER_URL=${CERNVM_SERVER_URL:="http://cvmfs.fnal.gov:8000/opt/@org@;http://cvmfs.racf.bnl.gov:8000/opt/@org@;http://cernvmfs.gridpp.rl.ac.uk:8000/opt/@org@;http://cvmfs-stratum-one.cern.ch:8000/opt/@org@;http://cvmfs02.grid.sinica.edu.tw:8000/opt/@org@"}
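
After boot, repository access can be verified (a post-boot check; depending on the CVMFS version in the image the command is cvmfs_config probe, or service cvmfs probe with older packaging):
cvmfs_config probe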

CernVM settings

Add the following to /mnt/etc/cernvm/site.conf
CERNVM_CVMFS2=on
CERNVM_EDITION=Basic
CERNVM_ORGANISATION=atlas
CERNVM_USER_SHELL=/bin/bash
CVMFS_REPOSITORIES=atlas,atlas-condb,grid

Filesystem

Set up the mount point for the blankspace partition:
mkdir /mnt/scratch
In /mnt/etc/fstab add the /scratch filesystem:
LABEL=blankpartition0   /scratch                    ext2    noatime         0 0
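
In a booted VM, confirm that the scratch space mounted (assuming the provisioning system labels the blank partition as in the fstab entry above):
df -h /scratch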

ext4 support exists in the e4fsprogs package in CernVM v2.6. We should try using ext4 with no journaling; the performance should be better. Journaling can be disabled on an existing filesystem, verified, or simply omitted at creation time:

tune4fs -O ^has_journal /dev/sdb     # disable journaling
/sbin/dumpe4fs /dev/sdb              # verify that has_journal is no longer listed
mkfs.ext4 -O ^has_journal /dev/sdb   # or create the filesystem without journaling in the first place

The nobarrier and noatime mount options may also be worth trying, although they might not be as significant. However, in the case of Nimbus, some new development will be needed to use ext4 partitions.

Set up Condor

Install the condor configuration and init.d files from the Cloud Scheduler GitHub repo:

mv /mnt/etc/init.d/condor /mnt/root/
cd /mnt/etc/init.d/
wget https://raw.github.com/hep-gc/cloud-scheduler/master/scripts/condor/worker/condor --no-check-certificate
chmod 755 condor
mv /mnt/etc/condor/condor_config /mnt/root/
cd /mnt/etc/condor
wget https://raw.github.com/hep-gc/cloud-scheduler/master/scripts/condor/worker/condor_config --no-check-certificate
wget https://raw.github.com/hep-gc/cloud-scheduler/master/scripts/condor/worker/condor_config.local --no-check-certificate

Modify /mnt/etc/init.d/condor to point at the Condor installation:

CONDOR_CONFIG_VAL=/opt/condor/bin/condor_config_val

Modify /mnt/etc/condor/condor_config as follows:

#NUM_SLOTS = 1
ALLOW_WRITE = $(FULL_HOSTNAME), $(ALLOW_ADMINISTRATOR), $(CONDOR_HOST)
ALLOW_DAEMON = $(FULL_HOSTNAME), $(ALLOW_ADMINISTRATOR), $(CONDOR_HOST)
INCLUDE         = $(RELEASE_DIR)/include
LIBEXEC         = $(RELEASE_DIR)/libexec
JAVA = /usr/lib/jvm/jre-1.6.0-openjdk.x86_64/bin/java
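
After the VM boots and Condor starts, the effective values can be double-checked with condor_config_val (same path as CONDOR_CONFIG_VAL above):
/opt/condor/bin/condor_config_val ALLOW_WRITE JAVA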

TODO: may need to adjust ALLOW_DAEMON ... actually does it need to be set at all? Currently the VMs have ALLOW_DAEMON = $(FULL_HOSTNAME), $(CONDOR_HOST)

Modify /mnt/etc/condor/condor_config.local by adding the following lines. (Note: MaxJobRetirementTime and SHUTDOWN_GRACEFUL_TIMEOUT will soon be set by default in the script from GitHub.)

# How long to wait for jobs to retire before killing them
MaxJobRetirementTime = 3600 * 24 * 2
# How long to wait for daemons to retire before killing them
SHUTDOWN_GRACEFUL_TIMEOUT = 3600 * 25 * 2

SLOT1_USER = atlas01
SLOT2_USER = atlas02
SLOT3_USER = atlas03
SLOT4_USER = atlas04
SLOT5_USER = atlas05
SLOT6_USER = atlas06
SLOT7_USER = atlas07
SLOT8_USER = atlas08
SLOT9_USER = atlas09
SLOT10_USER = atlas10
SLOT11_USER = atlas11
SLOT12_USER = atlas12
SLOT13_USER = atlas13
SLOT14_USER = atlas14
SLOT15_USER = atlas15
SLOT16_USER = atlas16
SLOT17_USER = atlas17
SLOT18_USER = atlas18
SLOT19_USER = atlas19
SLOT20_USER = atlas20
SLOT21_USER = atlas21
SLOT22_USER = atlas22
SLOT23_USER = atlas23
SLOT24_USER = atlas24
SLOT25_USER = atlas25
SLOT26_USER = atlas26
SLOT27_USER = atlas27
SLOT28_USER = atlas28
SLOT29_USER = atlas29
SLOT30_USER = atlas30
SLOT31_USER = atlas31
SLOT32_USER = atlas32
DEDICATED_EXECUTE_ACCOUNT_REGEXP = atlas[0-9]+
STARTER_ALLOW_RUNAS_OWNER = False
USER_JOB_WRAPPER=/usr/local/bin/condor-job-wrapper
EXECUTE=/scratch/condor
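
The 32 SLOTn_USER lines can be generated rather than typed by hand (a convenience sketch that appends to the same file):

for i in `seq 1 32`; do
  printf "SLOT${i}_USER = atlas%02d\n" $i   # slot numbers are unpadded, user names zero-padded
done >> /mnt/etc/condor/condor_config.local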

Create the wrapper script /mnt/usr/local/bin/condor-job-wrapper:

#!/bin/bash -l
exec "$@"

The -l flag makes bash act as a login shell, so /etc/profile.d/grid-setup.sh (above) is sourced before each job runs. Then make the wrapper executable:

chmod 755 /mnt/usr/local/bin/condor-job-wrapper

By default, Condor is configured to use directories in /var but they are missing in the CernVM image, so create the missing directories. Also, some directories in /etc/grid-security must exist when Condor uses GSI authentication.

mkdir /mnt/var/log/condor
mkdir /mnt/var/run/condor
mkdir /mnt/var/lib/condor
mkdir /mnt/var/lib/condor/spool

chown 102:102 /mnt/var/log/condor
chown 102:102 /mnt/var/run/condor
chown 102:102 /mnt/var/lib/condor
chown 102:102 /mnt/var/lib/condor/spool

mkdir /mnt/etc/grid-security
mkdir /mnt/etc/grid-security/certificates
touch /mnt/etc/grid-security/hostkey.pem
chmod 600 /mnt/etc/grid-security/hostkey.pem
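
When the image modifications are complete, unmount it; the image can then be recompressed for transfer (the packaging details depend on how the image is deployed, e.g. with Repoman below):
umount /mnt
gzip cernvm-batch-node-2.6.0-4-1-x86_64.ext3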

TODO

Disable puppet in chkconfig?

Set /etc/sysconfig/clock to use UTC?

Something to keep in mind about a federation: the DQ2_LOCAL_SITE_ID is set via AGIS based on ATLAS_SITE_NAME. This variable can influence the choice of DQ2 endpoint to be used for analysis output, and is also used to choose the nearest replicas in the cloud for the job's input files.

Repoman

Check that /mnt/.image.metadata exists and contains the name (VMType) of the image.

TODO ... stuff about Repoman and the dual-hypervisor image.

TODO ... check that the ssh keys don't get removed

Submitting jobs

Get a recent pilot wrapper script, saving it under the name the submit file below expects:

wget -O runpilot3-wrapper.sh http://walkerr.web.cern.ch/walkerr/runpilot3-wrapper-sep19.sh
Launch the condor job like this:
Executable = runpilot3-wrapper.sh
Arguments  = -s ANALY_IAAS -h ANALY_IAAS -p 25443 -w https://pandaserver.cern.ch -j false -k 0 -u user
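
For reference, a minimal complete submit description file around those two lines might look like this (a sketch; the Output/Error/Log names are arbitrary, and any site-specific Cloud Scheduler attributes such as +VMType are omitted):

Universe   = vanilla
Executable = runpilot3-wrapper.sh
Arguments  = -s ANALY_IAAS -h ANALY_IAAS -p 25443 -w https://pandaserver.cern.ch -j false -k 0 -u user
Output     = pilot.$(Cluster).$(Process).out
Error      = pilot.$(Cluster).$(Process).err
Log        = pilot.log
Queue

Submit it with condor_submit.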

We set up automated proxy renewal using MyProxy as described at CsGsiSupport#Credential_renewal.

-- RyanTaylor - 2012-11-04
