-- ColinLeavettBrown - 2010-07-20

NEP-52 Lustre file system notes:

Overview

The elephant cluster was used in developing these notes. Much of the information comes from the Lustre Operations Manual.
  1. Installing Lustre on the cluster.
  2. Define the Lustre file system.
  3. Starting and stopping the Lustre file system.
  4. Changing the Lustre network.
  5. Installing a patchless client on a workstation.
  6. Determine which OSS is serving an OST.
  7. Temporarily deactivate an OST.
  8. Re-activating an OST.
  9. Determining which files have objects on a particular OST.
  10. Restoring a corrupted PV after a motherboard replacement.

Index Up Down Note 1: Installing Lustre on the cluster.

The elephant cluster needs to run a Xen kernel and the Lustre file system. It was therefore necessary to obtain a kernel with both the Xen and Lustre patches applied. The source was available on the Oracle website. The instructions for building the required RPMs can be found here. When the build is complete, install the following RPMs on each of the nodes in the cluster:

  • e2fsprogs

  • kernel-lustre
  • kernel-lustre-xen
  • kernel-lustre-xen-devel

  • lustre
  • lustre-ldiskfs
  • lustre-modules

Reboot each node into one of the kernels installed above: kernel-lustre-xen (required for Xen) or kernel-lustre. Note: we found that the single user mode kernel boot parameter ("S", including all tested synonyms) is ineffective with any Xen kernel. If you need to boot into single user mode, use a non-Xen kernel.
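
The install and post-reboot check look roughly as follows (a minimal sketch; the exact RPM file names depend on the build and are assumptions here):

  rpm -ivh e2fsprogs-*.rpm kernel-lustre*.rpm lustre*.rpm   # file names vary with the build
  reboot
  # after the reboot:
  uname -r                 # should report the lustre (and xen, if selected) kernel
  modprobe -n -v lustre    # dry run: confirms the lustre modules resolve for this kernel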

Index Up Down Note 2: Define the Lustre file system.

The Lustre file system consists of an MGS, an MDT, and one or more OSTs. These components can exist on one or more nodes. In the case of elephant, e1 hosts a combined MGS/MDT, and e2 through e6 each host multiple OSTs. They were created as follows:

  • As root, create the MGS/MDT on e1:
    • pvcreate /dev/sdb
    • vgcreate vg00 /dev/sdb
    • lvcreate -L 100M -n MDT0000 vg00
    • mkfs.lustre --mgs --mdt --fsname=lustre /dev/vg00/MDT0000 - NB: mkfs.lustre must be run directly as root; "sudo mkfs.lustre" does not produce the correct results.
    • Create mount points: mkdir /lustreMDT /lustreFS

  • As root, create OSTs on e2 through e6:
    • pvcreate /dev/sdb
    • vgcreate vg00 /dev/sdb
    • lvcreate -L 600G -n OSTnnnn vg00
    • mkfs.lustre --ost --mgsnode=10.200.201.1@tcp --fsname=lustre /dev/vg00/OSTnnnn - NB: mkfs.lustre must be run directly as root; "sudo mkfs.lustre" does not produce the correct results.
    • Create mount points: mkdir -p /lustreOST/{OSTnnnn,OSTnnnn} /lustreFS

The "mkfs.lustre" command will assign a filesystem name to each OST created. The name takes the form "OSTnnnn", where "nnnn" is a sequentially assigned hexadecimal number starting at "0000". On elephant, the filesystem name, the logical volume name, and mount point are made consistent, ie. logical volume /dev/vg00/OST000a contains OST filesystem OST000a which is mounted on /lustreOSTs/OST000a. This is not difficult to ensure (more difficult to fix after the fact) and the shell script /usr/local/sbin/mountOSTs depend on this arrangement (at least it depends on the LV name matching the mount point; the filesystem name is not so important). To display the filesystem name, issue the command "tunefs.lustre /dev/vg00/OSTnnnn". Currently, there are 20 OSTs allocated, 0000 through 0013. The next OST filesystem name that will be assigned by mkfs.lustre will be OST0014.

  • Define the Lustre network (LNET):
    • LNET is defined by options within the modprobe configuration. In elephant's case, the definition lives in the /etc/modprobe.d/lustre-lnet.conf file, which contains the following line:
      • options lnet networks=tcp(bond0)
    • This configuration instructs Lustre to use only the bond0 (10.200.201.0/24) interface; a quick way to confirm the active setting is shown below.
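
With that file in place and the Lustre modules (re)loaded, the active NIDs can be confirmed on each node (lctl list_nids is the same command referenced in Note 4; "lctl network up" is only needed if LNET is not already up, e.g. nothing is mounted yet):

  modprobe lnet
  lctl network up      # bring LNET up if no Lustre target/client is mounted yet
  lctl list_nids       # should list only 10.200.201.x@tcp addresses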

  • Update iptables to allow access to port 988 on all server nodes (e1 through e6). Remember, all clients mounting the file system will have read/write access. The following iptables entries were created for the elephant cluster (they are applied as shown immediately below):
    • # lustre filesystem access - elephant cluster:
    • -A RH-Firewall-1-INPUT -s 10.200.201.0/24 -m tcp -p tcp --dport 988 -j ACCEPT
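
On a stock RHEL/CentOS node (assumed here), the rule goes into /etc/sysconfig/iptables and can be applied and checked with:

  vi /etc/sysconfig/iptables                     # add the ACCEPT rule ahead of the final REJECT
  service iptables restart
  iptables -nL RH-Firewall-1-INPUT | grep 988    # confirm the rule is active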

  • The following iptables rules were used during testing and are now obsolete:
    • -A RH-Firewall-1-INPUT -s 206.12.154.0/26 -m tcp -p tcp --dport 988 -j ACCEPT
    • # lustre filesystem access - CLB's workstation:
    • -A RH-Firewall-1-INPUT -s 142.104.60.69 -m tcp -p tcp --dport 988 -j ACCEPT
    • # lustre filesystem access - Kyle's testnode, NRC Ottawa:
    • -A RH-Firewall-1-INPUT -s 132.246.148.31 -m tcp -p tcp --dport 988 -j ACCEPT

Index Up Down Note 3: Starting and stopping the Lustre file system

Starting and stopping the Lustre file system is performed by mounting and unmounting the file system components. To start the file system, the mounts should be issued as root in the following order:

  • Mount the MGS/MDT on e1: mount -t lustre /dev/vg00/MDT0000 /lustreMDT
  • Mount the OSTs on e2 through e6: /usr/local/sbin/mountOSTs
  • Mount the Lustre filesystem on all nodes: mount -t lustre 10.200.201.1@tcp:/lustre /lustreFS (a quick check is shown below)
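
Once everything is mounted, a quick sanity check from any node is:

  mount -t lustre    # lists the Lustre mounts on the local node
  lfs df -h          # every OST should appear in the listing (see Note 9 for sample output)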

To stop the file system, the unmounts should be issued as root in the following order:

  • On all nodes, unmount the lustre filesystem: umount /lustreFS
  • On e1: umount /lustreMDT
  • On e2 through e6: umount /lustreOST/*

Index Up Down Note 4: Changing the Lustre network.

In order to change the network, all network references must be updated or the file system may be rendered unusable. Though the Lustre network configuration is defined in only two places ("mkfs.lustre --ost --mgsnode=10.200.201.1@tcp --fsname=lustre /dev/vg00/lv00" and "options lnet networks=tcp(bond0)"), references are held in the following three places on each server node:

  • The active LNET module parameters viewable with the command "lctl list_nids".
  • The active MDT/OST filesystem parameters viewable with the command "tunefs.lustre /dev/vg00/OSTnnnn".
  • The MDT/OST log files, which are not directly viewable. However, the following messages indicate that one or more log files hold invalid references:
    • Lustre: 6455:0:(socklnd_cb.c:421:ksocknal_txlist_done()) Deleting packet type 1 len 368 206.12.154.1@tcp->10.200.201.4@tcp
    • Lustre: 6460:0:(socklnd_cb.c:915:ksocknal_launch_packet()) No usable routes to 12345-10.200.201.3@tcp

To change the network (in this example we change from eth3 (206.12.154.0/26) to bond0 (10.200.201.0/24)), the following procedure should be adopted:

  • Stop the file system (see Note 3).
  • On e1 through e7: Update /etc/modprobe.d/lustre-lnet.conf, e.g. options lnet networks=tcp(bond0)
  • On e1:
    • Request log file regeneration for the MDT with the command "tunefs.lustre --writeconf /dev/vg00/MDT0000".
    • Update filesystem parameters for the MDT with the command "tunefs.lustre --erase-params --mgsnode=10.200.201.1@tcp /dev/vg00/MDT0000"
  • On e2 through e7:
    • Request log file regeneration for each OST with the command "/usr/local/sbin/OSTsWriteconf".
    • Update the filesystem parameters for each OST with the command "/usr/local/sbin/OSTsMsgnode 10.200.201.1"
  • On e1 through e7: Update iptables to reflect the new network configuration.
  • On e1 through e7: Unload the lnet module, thereby purging the active module parameters. Issuing the "rmmod lnet" command generally results in "module in use by ....". The "rmmod" command can be issued repeatedly for each of the identified modules until the "lnet" module is successfully unloaded. Alternatively, each node can be rebooted.
  • Restart/remount the file system (see Note 3). The checks below can be used to confirm that the new network is in effect.
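
After the restart, the references listed at the top of this note can be rechecked on each server node; this is simply the commands already mentioned above, gathered in one place:

  lctl list_nids                      # active LNET NIDs: should show only 10.200.201.x@tcp
  tunefs.lustre /dev/vg00/MDT0000     # on e1: stored parameters should reference 10.200.201.1@tcp
  tunefs.lustre /dev/vg00/OSTnnnn     # on each OSS: likewise for every OST
  dmesg | grep -i lustre              # no "No usable routes" or "Deleting packet" messages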

Index Up Down Note 5: Installing a patchless client on a workstation.

  • Download the Lustre source.
  • tar -xzvf lustre-1.8.2.tar.gz
  • cd lustre-1.8.2
  • ./configure --disable-server --enable-client --with-linux=/usr/src/kernels/2.6.18-164.15.1.el5-xen-i686
  • make
  • sudo make install
  • sudo depmod -a
  • sudo modprobe lustre
  • sudo mount -t lustre 206.12.154.1@tcp:/lustre /lustreFS/
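
Two quick checks after the client mount (nothing beyond standard tools is assumed):

  lsmod | grep lustre    # the client modules should be loaded
  df -h /lustreFS        # shows the mounted Lustre filesystem and its capacity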

Index Up Down Note 6: Determine which OSS is serving an OST.

On the MGS/MDT server:

[crlb@elephant01 ~]$ lctl get_param osc.*.ost_conn_uuid
osc.lustre-OST0000-osc-ffff8803bd5ab400.ost_conn_uuid=206.12.154.2@tcp
osc.lustre-OST0000-osc.ost_conn_uuid=206.12.154.2@tcp
osc.lustre-OST0001-osc-ffff8803bd5ab400.ost_conn_uuid=206.12.154.3@tcp
osc.lustre-OST0001-osc.ost_conn_uuid=206.12.154.3@tcp
osc.lustre-OST0002-osc-ffff8803bd5ab400.ost_conn_uuid=206.12.154.4@tcp
osc.lustre-OST0002-osc.ost_conn_uuid=206.12.154.4@tcp
osc.lustre-OST0003-osc-ffff8803bd5ab400.ost_conn_uuid=206.12.154.5@tcp
osc.lustre-OST0003-osc.ost_conn_uuid=206.12.154.5@tcp
[crlb@elephant01 ~]$ :

The IP address identifies which node is serving which OST.
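
The same mapping can also be read from the OSS side: on each of e2 through e6, "lctl dl" lists the OSTs that node serves as obdfilter devices (this relies on the Lustre 1.8 device naming seen elsewhere in these notes):

  lctl dl | grep obdfilter    # run on an OSS; each line is an OST served by this node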

Index Up Down Note 7: Temporarily deactivate an OST.

On the MGS/MDT server:

  • Determine the device number for the MDT's OSC corresponding to the OST to be deactivated (a device is identified by its endpoints, e.g. lustre-OSTnnnn-osc and lustre-mdtlov_UUID):

[crlb@elephant01 ~]$ lctl dl | grep osc
  5 UP osc lustre-OST0000-osc lustre-mdtlov_UUID 5
  6 UP osc lustre-OST0001-osc lustre-mdtlov_UUID 5
  7 UP osc lustre-OST0002-osc lustre-mdtlov_UUID 5
  8 UP osc lustre-OST0003-osc lustre-mdtlov_UUID 5
 11 UP osc lustre-OST0000-osc-ffff8803bd5ab400 a91b4601-8f1d-5061-2175-7ac02693cc0f 5
 12 UP osc lustre-OST0001-osc-ffff8803bd5ab400 a91b4601-8f1d-5061-2175-7ac02693cc0f 5
 13 UP osc lustre-OST0002-osc-ffff8803bd5ab400 a91b4601-8f1d-5061-2175-7ac02693cc0f 5
 14 UP osc lustre-OST0003-osc-ffff8803bd5ab400 a91b4601-8f1d-5061-2175-7ac02693cc0f 5
[crlb@elephant01 ~]$ 

  • To deactivate OST0003 from the list above, issue:

[crlb@elephant01 ~]$ sudo lctl --device 8 deactivate
[sudo] password for crlb: 
[crlb@elephant01 ~]$

  • The "lctl dl | grep osc" command can be used to check the change in status.

Index Up Down Note 8: Re-activating an OST.

On the MGS/MDT server:

  • Determine the device number for the MDT's OSC corresponding to the OST to be re-activated (a device is identified by its endpoints, e.g. lustre-OSTnnnn-osc and lustre-mdtlov_UUID):

[crlb@elephant01 ~]$ lctl dl | grep osc
  5 UP osc lustre-OST0000-osc lustre-mdtlov_UUID 5
  6 UP osc lustre-OST0001-osc lustre-mdtlov_UUID 5
  7 UP osc lustre-OST0002-osc lustre-mdtlov_UUID 5
  8 IN osc lustre-OST0003-osc lustre-mdtlov_UUID 5
 11 UP osc lustre-OST0000-osc-ffff8803bd5ab400 a91b4601-8f1d-5061-2175-7ac02693cc0f 5
 12 UP osc lustre-OST0001-osc-ffff8803bd5ab400 a91b4601-8f1d-5061-2175-7ac02693cc0f 5
 13 UP osc lustre-OST0002-osc-ffff8803bd5ab400 a91b4601-8f1d-5061-2175-7ac02693cc0f 5
 14 UP osc lustre-OST0003-osc-ffff8803bd5ab400 a91b4601-8f1d-5061-2175-7ac02693cc0f 5
[crlb@elephant01 ~]$ 

  • To re-activate OST0003 from the list above, issue:

[crlb@elephant01 ~]$ sudo lctl --device 8 activate
[sudo] password for crlb: 
[crlb@elephant01 ~]$

  • The "lctl dl | grep osc" command can be used to check the change in status.

Index Up Down Note 9: Determining which files have objects on a particular OST.

This procedure can be performed on any lustre node:

  • Determine the UUID for the OST of interest:
[crlb@elephant01 ~]$ lfs df
UUID                 1K-blocks      Used Available  Use% Mounted on
lustre-MDT0000_UUID   91743520    496624  86004016    0% /lustreFS[MDT:0]
lustre-OST0000_UUID  928910792 717422828 164301980   77% /lustreFS[OST:0]
lustre-OST0001_UUID  928910792 720414360 161310444   77% /lustreFS[OST:1]
lustre-OST0002_UUID  928910792 730323340 151401464   78% /lustreFS[OST:2]
lustre-OST0003_UUID  928910792 348690392 533034416   37% /lustreFS[OST:3]

filesystem summary:  3715643168 2516850920 1010048304   67% /lustreFS

[crlb@elephant01 ~]$ 

  • To list the files with objects on OST0003:
[crlb@elephant01 ~]$ lfs find --obd lustre-OST0003_UUID /lustreFS/
   .
   .
/lustreFS//BaBar/work/allruns_backup/17320691/A26.0.0V01x57F/config.tcl
/lustreFS//BaBar/work/allruns_backup/17320691/A26.0.0V01x57F/17320691.moose.01.root
/lustreFS//BaBar/work/allruns_backup/17320691/status.txt
/lustreFS//BaBar/work/allruns_backup/17320697/A26.0.0V01x57F/B+B-_generic.dec
   .
   .
[crlb@elephant01 ~]$ 


  • If the OST of interest turns out to hold no file objects, it can be deactivated permanently from the MGS with "lctl conf_param" (unlike the temporary "lctl --device N deactivate" of Note 7). For example, OST0004 was found to be empty and was deactivated as follows:

[root@elephant bin]# lfs find --obd lustre-OST0004_UUID /lustreFS/  | wc
      0       0       0
[root@elephant bin]# lctl conf_param lustre-OST0004.osc.active=0
[root@elephant bin]#

Index Up Down Note 10: Restoring a corrupted PV after a motherboard replacement.

  • On e4, capture the first 512 bytes (the MBR/partition table) of /dev/sdb to use as a template:
    • dd if=/dev/sdb of=/tmp/boot.txt bs=512 count=1

  • On e3 (the node with the corrupted PV), copy that image over, write it to /dev/sdb, and confirm the partition table:
    • scp crlb@e4:/tmp/boot.txt boot-e4.txt
    • dd if=boot-e4.txt of=/dev/sdb bs=512 count=1
    • fdisk -l /dev/sdb

  • Still on e3, re-create the PV with its original UUID (taken from /etc/lvm/backup/vg00), restore the volume group metadata, activate the volume group, and remount:
    • pvcreate --restorefile /etc/lvm/backup/vg00 --uuid f26AZq-ycTI-7QKf-3yn9-3VCe-w1V3-dOaKlk /dev/sdb
    • vgcfgrestore vg00
    • pvscan
    • vgchange -ay vg00
    • mount -a
    • mountOSTs
    • df