-- ColinLeavettBrown - 2010-07-20

NEP-52 Lustre file system notes:

 
Overview

The elephant cluster was used in developing these notes. Much of the information comes from the Lustre Operations Manual.
  1. Installing Lustre on the cluster.
  2. Defining the Lustre file system.
  3. Starting and stopping the Lustre file system.
  4. Changing the Lustre network.
  5. Installing a patchless client on a workstation.
  6. Determining which OSS is serving an OST.
  7. Temporarily deactivating an OST.
  8. Re-activating an OST.
  9. Determining which files have objects on a particular OST.
  10. Restoring a corrupted PV after a motherboard replacement.
 
Note 1: Installing Lustre on the cluster.

The elephant cluster needs to run a Xen kernel and the Lustre file system. It was therefore necessary to obtain a kernel with both the Xen and Lustre patches applied. The source was available on the Oracle website. The instructions for building the required RPMs can be found here. When the build is complete, install the following RPMs on each of the nodes in the cluster:

  • e2fsprogs

  • kernel-lustre
  • kernel-lustre-xen
  • kernel-lustre-xen-devel

  • lustre
  • lustre-ldiskfs
  • lustre-modules

Reboot each node into one of the kernels installed above (kernel-lustre-xen, required for Xen, or kernel-lustre). Note: we found that the single-user mode kernel boot parameter ("S", including all tested synonyms) is ineffective with any Xen kernel. If you need to boot into single-user mode, use a non-Xen kernel.
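The installs can be scripted; the loop below is a minimal sketch only, assuming passwordless root ssh to e1 through e6 and that the built RPMs were copied to /root/lustre-rpms on each node (that staging path is an assumption, not the documented build location):

    # Sketch: install the freshly built RPMs on every node in the cluster.
    # -U upgrades the user-space packages; -i keeps any existing kernels installed
    # alongside the new Lustre kernels.
    for node in e1 e2 e3 e4 e5 e6; do
        ssh root@${node} 'cd /root/lustre-rpms && \
            rpm -Uvh e2fsprogs-*.rpm lustre-*.rpm && \
            rpm -ivh kernel-lustre*.rpm'
    done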

Note 2: Defining the Lustre file system.

The Lustre file system consists of an MGS (management server), an MDT (metadata target), and one or more OSTs (object storage targets). These components can exist on one or more nodes. In the case of elephant, e1 hosts a combined MGS/MDT, and e2 through e6 each host multiple OSTs. They were created as follows:

  • As root, create the MGS/MDT on e1:
    • pvcreate /dev/sdb
    • vgcreate vg00 /dev/sdb
    • lvcreate -L 100M -n MDT0000 vg00
    • mkfs.lustre --mgs --mdt --fsname=lustre /dev/vg00/MDT0000 - NB: The mkfs.lustre must be performed as root; sudo mkfs.lustre does not produce the correct results.
    • Create mount points: mkdir /lustreMDT /lustreFS

  • As root, create OSTs on e2 through e6:
    • pvcreate /dev/sdb
    • vgcreate vg00 /dev/sdb
    • lvcreate -L 600G -n OSTnnnn vg00
    • mkfs.lustre --ost --mgsnode=10.200.201.1@tcp --fsname=lustre /dev/vg00/OSTnnnn - NB: The mkfs.lustre must be performed as root; sudo mkfs.lustre does not produce the correct results.
    • Create mount points: mkdir -p /lustreOST/{OSTnnnn,OSTnnnn} /lustreFS

The "mkfs.lustre" command will assign a filesystem name to each OST created. The name takes the form "OSTnnnn", where "nnnn" is a sequentially assigned hexadecimal number starting at "0000". On elephant, the filesystem name, the logical volume name, and mount point are made consistent, ie. logical volume /dev/vg00/OST000a contains OST filesystem OST000a which is mounted on /lustreOSTs/OST000a. This is not difficult to ensure (more difficult to fix after the fact) and the shell script /usr/local/sbin/mountOSTs depend on this arrangement (at least it depends on the LV name matching the mount point; the filesystem name is not so important). To display the filesystem name, issue the command "tunefs.lustre /dev/vg00/OSTnnnn". Currently, there are 20 OSTs allocated, 0000 through 0013. The next OST filesystem name that will be assigned by mkfs.lustre will be OST0014.

  • Define the Lustre network (LNET):
    • LNET is defined by options within the modprobe configuration. In elephant's case, the definition is effected by the /etc/modprobe.d/lustre-lnet.conf file containing the following definition:
      • options lnet networks=tcp(bond0)
    • This configuration instructs Lustre to use only the bond0 (10.200.201.0/24) interface. A quick way to verify the active setting is shown below.
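Once the lnet module has been loaded with this option, the active NID can be checked on each node. The commands below are a sanity check added to these notes, not part of the original procedure:

    # Bring LNET up and list the network identifiers this node will use.
    modprobe lnet
    lctl network up
    lctl list_nids          # should report a single NID of the form 10.200.201.x@tcp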

  • Update iptables to allow access to port 988 on all server nodes (e1 through e6). Remember, all clients mounting the file system will have read/write access. The following iptables entries were created for the elephant cluster:
    • # lustre filesystem access - elephant cluster:
    • -A RH-Firewall-1-INPUT -s 10.200.201.0/24 -m tcp -p tcp --dport 988 -j ACCEPT

  • The following iptables rules were used during testing and are now obsolete:
    • -A RH-Firewall-1-INPUT -s 206.12.154.0/26 -m tcp -p tcp --dport 988 -j ACCEPT
    • # lustre filesystem access - CLB's workstation:
    • -A RH-Firewall-1-INPUT -s 142.104.60.69 -m tcp -p tcp --dport 988 -j ACCEPT
    • # lustre filesystem access - Kyle's testnode, NRC Ottawa:
    • -A RH-Firewall-1-INPUT -s 132.246.148.31 -m tcp -p tcp --dport 988 -j ACCEPT

Note 3: Starting and stopping the Lustre file system.

Starting and stopping the Lustre file system is performed by mounting and unmounting the file system components. To start the file system, the mounts should be issued as root in the following order (a quick verification follows the list):

  • Mount the MGS/MDT on e1: mount -t lustre /dev/vg00/MDT0000 /lustreMDT
  • Mount the OSTs on e2 through e6: /usr/local/sbin/mountOSTs
  • Mount the lustre filesystem on all nodes: mount -t lustre 10.200.201.1@tcp:/lustre /lustreFS
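Once all three mounts succeed, the health of the file system can be confirmed from any node that has /lustreFS mounted (lfs is part of the standard Lustre client tools):

    # Every OST and the MDT should appear with its capacity and usage.
    lfs df -h /lustreFS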

To stop the file system, the unmounts should be issued as root in the following order:

  • On all nodes, unmount the lustre filesystem: umount /lustreFS
  • On e1: umount /lustreMDT
  • On e2 through e6: umount /lustreOST/*

Note 4: Changing the Lustre network.

In order to change the network, all network references must be updated or the file system may be rendered unusable. Though the Lustre network configuration is defined in only two places ("mkfs.lustre --ost --mgsnode=10.200.201.1@tcp --fsname=lustre /dev/vg00/lv00" and "options lnet networks=tcp(bond0)"), references are held in the following three places on each server node:

  • The active LNET module parameters viewable with the command "lctl list_nids".
  • The active MDT/OST filesystem parameters viewable with the command "tunefs.lustre /dev/vg00/OSTnnnn".
  • The MDT/OST log files, which are not directly viewable. However, the following messages indicate that one or more log files contain invalid references:
    • Lustre: 6455:0:(socklnd_cb.c:421:ksocknal_txlist_done()) Deleting packet type 1 len 368 206.12.154.1@tcp->10.200.201.4@tcp
    • Lustre: 6460:0:(socklnd_cb.c:915:ksocknal_launch_packet()) No usable routes to 12345-10.200.201.3@tcp

To change the network (in this example we will change from bond0 (10.200.201.0/24) to eth3 (206.12.154.0/26)), the following procedure should be adopted; a consolidated per-node sketch follows the list:

  • Stop the file system (see Note 3).
  • On e1 through e6: Update /etc/modprobe.d/lustre-lnet.conf, eg. options lnet networks=tcp(eth3)
  • On e1 through e6: Request log file regeneration via the command "tunefs.lustre --writeconf /dev/vg00/lv00".
  • On e1 through e6: Update MDT/OST filesystem parameters via the command "tunefs.lustre --erase-params --mgsnode=206.12.154.1@tcp /dev/vg00/lv00"
  • On e1 through e6: Update iptables to reflect the new network configuration.
  • On e1 through e6: Unload the lnet module, thereby purging the active module parameters. Issuing the "rmmod lnet" command generally results in "module in use by ....". The "rmmod" command can be issued repeatedly for each of the identified modules until the "lnet" module is successfully unloaded. Alternatively, each node can be rebooted.
  • Restart/remount the file system (see Note 3).
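The per-node steps above can be scripted; the loop below is a sketch only, assuming passwordless root ssh and that every Lustre target LV lives in vg00 under the MDTnnnn/OSTnnnn naming described in Note 2 (adjust the device paths if the lv00 naming from the examples above is in use):

    # Sketch: reconfigure the Lustre network references on e1 through e6.
    # The iptables updates and the lnet module unload/reboot still have to follow.
    for node in e1 e2 e3 e4 e5 e6; do
        ssh root@${node} '
            sed -i "s/tcp(bond0)/tcp(eth3)/" /etc/modprobe.d/lustre-lnet.conf
            for tgt in /dev/vg00/MDT???? /dev/vg00/OST????; do
                [ -e "${tgt}" ] || continue
                tunefs.lustre --writeconf ${tgt}
                tunefs.lustre --erase-params --mgsnode=206.12.154.1@tcp ${tgt}
            done'
    done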

Note 5: Installing a patchless client on a workstation.

  • Download the Lustre source.
  • tar -xzvf lustre-1.8.2.tar.gz
  • cd lustre-1.8.2
  • ./configure --disable-server --enable-client --with-linux=/usr/src/kernels/2.6.18-164.15.1.el5-xen-i686
  • make
  • sudo make install
  • sudo depmod -a
  • sudo modprobe lustre
  • sudo mount -t lustre 206.12.154.1@tcp:/lustre /lustreFS/
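If the workstation should mount the file system automatically at boot, an equivalent /etc/fstab entry can be used. This is a suggestion, not part of the original notes; _netdev delays the mount until networking is up:

    # /etc/fstab entry on the workstation (suggested, not from the original setup):
    206.12.154.1@tcp:/lustre  /lustreFS  lustre  defaults,_netdev  0 0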

Note 6: Determining which OSS is serving an OST.

  On the MGS/MDT server:
  The IP address identifies which node is serving which OST.
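The command detail for this note is truncated in this revision of the page. The approach described in the Lustre Operations Manual is along these lines (fsname lustre as configured above; the exact commands used on elephant may differ):

    # Ask each OSC which NID it is connected to; the NID (e.g. 10.200.201.3@tcp)
    # identifies the OSS currently serving that OST.
    lctl get_param osc.lustre-OST*.ost_conn_uuid
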
Note 7: Temporarily deactivating an OST.

  On the MGS/MDT server:
  • Determine the device number for the MDT's OSC corresponding to the OST to be deactivated (a device is identified by its endpoints, e.g. lustre-OSTnnnn-osc and lustre-mdtlov_UUID):
 
  • The "lctl dl | grep osc" command can be used to check the change in status.

Note 8: Re-activating an OST.

  On the MGS/MDT server:
  • Determine the device number for the MDT's OSC corresponding to the OST to be re-activated (a device is identified by its endpoints, e.g. lustre-OSTnnnn-osc and lustre-mdtlov_UUID):
 
  • The "lctl dl | grep osc" command can be used to check the change in status.

Note 9: Determining which files have objects on a particular OST.

  This procedure can be performed on any lustre node:
 
[root@elephant bin]# lctl conf_param lustre-OST0004.osc.active=0
[root@elephant bin]#
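Most of the worked example is truncated in this revision. The usual way to list the files that have objects on a given OST (OST0004 used as an example, per the Lustre Operations Manual) is:

    # Find the OST's UUID, then list every file in the file system with an object on it.
    lctl dl | grep -i OST0004                     # shows e.g. lustre-OST0004_UUID
    lfs find --obd lustre-OST0004_UUID /lustreFS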

Note 10: Restoring a corrupted PV after a motherboard replacement.

 
On e4:
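This note contains only the node name in this revision. Purely as a hypothetical starting point (this is not the procedure the author recorded), a corrupted PV whose LVM metadata backup still exists can often be rebuilt like this:

    # Hypothetical outline only: recreate the PV label with its old UUID from the
    # LVM metadata backup, restore the volume group configuration, re-activate it.
    pvcreate --uuid "<old PV UUID from /etc/lvm/backup/vg00>" \
             --restorefile /etc/lvm/backup/vg00 /dev/sdb
    vgcfgrestore vg00
    vgchange -ay vg00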
 