XEN HA cluster installation

24.09.2006 22:15 | blackhole

XEN HA HOWTO

Introduction
1) Basic concepts
2) The actual installation
3) Basic commands
4) Configuration & installation
5) Links

1) Basic concepts

XEN, HA, LVM, DRBD
XEN is an open-source hypervisor project that makes it possible to run several operating
systems concurrently on a single computer; at the moment these are Linux with the 2.4 and
2.6 kernels, FreeBSD, NetBSD and Sun Solaris.
The Xen architecture looks roughly like this:

|Dom U|Dom U|Dom U|
------------------
|      Dom 0      |
------------------
|     XEN HP      |
------------------
|       HW        |
------------------

Dom 0 is the main domain, the one that drives the hardware of the machine it runs on.
Dom U is an instance of an operating system that runs on top of Dom 0
and has no access to the real hardware.
XEN HP is the hypervisor, which makes it possible to start and concurrently run
several U domains on one piece of hardware.

HA is short for High Availability; there are plenty of Linux projects focused on HA,
see the Links section. We will use software from the linux-ha.org project called heartbeat.
LVM is a layer that sits on top of the real disk and lets me grow and shrink
partitions and disks for the Dom U guests while the system is running,
or stripe and mirror data across different disks.
The LVM architecture looks roughly like this:

------------
|   LVM    |
|partitions|
------------
|   LVM    |
------------
|   Disk   |
------------

DRBD is something like Linux software RAID, except that it lets us mirror
partitions over the network and so keep the disks (in our case LVM partitions) of
the U domains in a consistent state. DRBD works on the principle of two disks:
one can be used normally as a disk, while the other one is dead and
cannot be mounted (of course it can be, but for simplicity let's assume it cannot).
In the status output this shows up as Primary/Secondary: the Primary
node works with the disk, while the Secondary only mirrors it and waits in case
it has to take over. If an outage occurs, DRBD detects it after a timeout, puts
the failed node into the Unknown state, and at that moment the resource can be
promoted to Primary on the other node.

2) The actual installation

The installation will consist of two servers, skylla and charybda, connected directly
by a network cable (it is recommended to make this link redundant, i.e. UTP plus a
serial line or similar).
One domain will run on each server, and it will mirror the contents of its disk
to the other server.
From the HA point of view this is an active/active cluster, where both servers perform
critical tasks while at the same time watching whether the other one is up, and taking
over its work if it is not. A necessary precondition for such a setup is shared storage,
e.g. a SAN, or (as a cheap Linux solution) a DRBD disk.
If the cluster is running with only one node, then after the second node starts up,
that server automatically takes back responsibility for its own tasks and the cluster
keeps running.

  skylla                          charybda
------------                    ------------
|          |    drbd raid 1     |          |
| domain1  |===================>| domain1  |
|   disk   |     domain1        |   disk   |
|----------|                    |----------|
|          | 10.0.1.1  10.0.1.2 |          |
|----------|    drbd raid 2     |----------|
| domain2  |<===================| domain2  |
|   disk   |     domain2        |   disk   |
------------                    ------------

3) Basic commands we will need (for installing and managing the servers):
LVM
pvcreate [dev]  creates on dev the structure that forms the basis for LVM.
vgcreate my_volume_group [dev1] [dev n]  on the devices prepared with pvcreate,
creates a so-called volume group, which aggregates several physical devices
into one.
vgchange -a y my_volume_group  manipulates the properties of LVM volumes; this example
activates the volume group, e.g. at boot time.
vgremove [vol_group]  removes the VG and all volumes on it.
vgscan  lists all volume groups present on the machine.
lvscan  lists all LVM volumes present on the machine.
lvdisplay  prints details about all LVM volumes.
vgdisplay  prints details about all VGs in the system.
lvcreate -L1500M -n[lv name] [my_volgroup]  creates an LVM volume named [lv name]
on the VG [my_volgroup].
lvextend  changes the size of an LVM volume.
vgextend  adds a [dev] to a VG.
vgreduce  removes a [dev] from a VG.
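As a sketch, the LVM layout shown below could be created with commands like these. The device names /dev/sda3 and /dev/sdb1 are assumptions for illustration; adapt them to your own disks:

```shell
# Prepare two physical volumes (example devices; this destroys data on them!)
pvcreate /dev/sda3 /dev/sdb1

# Aggregate them into one volume group named "lvm"
vgcreate lvm /dev/sda3 /dev/sdb1

# Create the logical volumes used later in this HOWTO
lvcreate -L35G    -nhome    lvm
lvcreate -L10.35G -ndomain1 lvm
lvcreate -L10.35G -ndomain2 lvm

# Verify the result
vgdisplay lvm
lvscan
```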
LVM installation
On the servers we create one VG, defined on top of the disk(s) and named lvm:
  --- Volume group ---
  VG Name              lvm
  System ID
  Format                lvm2
  Metadata Areas        2
  Metadata Sequence No  12
  VG Access            read/write
  VG Status            resizable
  MAX LV                0
  Cur LV                3
  Open LV              1
  Max PV                0
  Cur PV                2
  Act PV                2
  VG Size              67.55 GB
  PE Size              4.00 MB
  Total PE              17293
  Alloc PE / Size      14258 / 55.70 GB
  Free  PE / Size      3035 / 11.86 GB
  VG UUID              0DxUer-knvn-IOUI-I3G4-I6D7-qBBz-5ktO13


On it we define the LVM volumes like this:
  --- Logical volume ---
  LV Name                /dev/lvm/home
  VG Name                lvm
  LV UUID                14poHQ-kHwZ-GO4z-VYJn-Uvi2-Qq6P-AtkfhZ
  LV Write Access        read/write
  LV Status              available
  # open                1
  LV Size                35.00 GB
  Current LE            8960
  Segments              2
  Allocation            inherit
  Read ahead sectors    0
  Block device          253:0
  --- Logical volume ---
  LV Name                /dev/lvm/domain1
  VG Name                lvm
  LV UUID                fmOgOW-rVv0-Xr3l-2TrT-LJsR-V3Sp-J5rZsZ
  LV Write Access        read/write
  LV Status              available
  # open                0
  LV Size                10.35 GB
  Current LE            2649
  Segments              4
  Allocation            inherit
  Read ahead sectors    0
  Block device          253:1
  --- Logical volume ---
  LV Name                /dev/lvm/domain2
  VG Name                lvm
  LV UUID                Vg4U5z-4zV1-UOKg-lTlN-321N-bPPN-T5wJ5E
  LV Write Access        read/write
  LV Status              available
  # open                0
  LV Size                10.35 GB
  Current LE            2649
  Segments              2
  Allocation            inherit
  Read ahead sectors    0
  Block device          253:2

The backing disks for the running virtual domains are the volumes named domain1 and
domain2; a RAID is later built on top of them so that the data is shared between the
cluster nodes.

XEN in userspace consists of the xend server, which is started right after system
boot from the init.d scripts, and the xm utility for managing and monitoring
running domains.

xm list [-l for long output]  lists the virtual domains currently running on the system.
xm console [domain name]  attaches a virtual console to the console of the running
U domain named [domain name].
xm top  something like top for processes, only for domains.
xm create [-c] [path to domain config]  creates a new domain and starts it as defined
in the config file. By default the config files live in /etc/xend/.
xm destroy  immediately kills a virtual domain.
xm shutdown  these do what their names say :)
xm reboot
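A typical session with these commands might look like this (the domain name and config path follow the examples in this HOWTO; the exact output differs per system):

```shell
# Start the domain defined in its config file and attach to its console
xm create -c /etc/xend/domain1

# In another shell: list running domains and watch their resource usage
xm list
xm top

# Cleanly shut the domain down (or kill it immediately with `xm destroy domain1`)
xm shutdown domain1
```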

DRBD has its config file in /etc/drbd.conf;
to check the status of the drbd module we use /proc/drbd, which holds up-to-date
information.
!!Warning: you must never work with the device underneath drbd while drbd is
managing the disk, i.e. never try to do anything with the LVM volume domain1
directly, only through drbd.
drbdadm  utility for managing drbd disks.
drbdadm state [drbd resource name||all]  prints the state of the drbd raid.
drbdadm cstate [drbd resource name||all]  prints the connection state of the drbd raid.
drbdadm primary [drbd resource name||all]  !!dangerous: sets the drbd resource state
on this machine to primary.
drbdadm secondary [drbd resource name||all]  !!dangerous: sets the drbd resource state
on this machine to secondary; not possible while the drbd disk is mounted.
drbdadm connect [drbd resource name||all]  joins the nodes together.
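As a sketch, a manual failover of the domain2 resource onto this node could look like this. It assumes the peer is already down or demoted to Secondary; the resource and config names follow the examples in this HOWTO:

```shell
# Check the current state first: expect something like
# "cs:WFConnection st:Secondary/Unknown" after the peer has died
cat /proc/drbd

# Promote the resource on this node (dangerous if the peer is still Primary!)
drbdadm primary domain2

# The drbd device is now writable; the Xen domain can be started on top of it
xm create /etc/xend/domain2
```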
4) Configuration && Installation

I am not going to describe the configuration of the xend server here; there are already
countless manuals for that, covering various Linux distributions. So let's start
in medias res :D.
So, once LVM is configured on our server exactly as shown above (the names matter,
not the sizes), we can move on to configuring DRBD as the shared storage for our
domains. DRBD is distributed as a kernel module, so you have to get it into your
favourite distribution somehow.

The DRBD config file is /etc/drbd.conf:

resource domain2 {
  # transfer protocol to use.
  # C: write IO is reported as completed, if we know it has
  #    reached _both_ local and remote DISK.
  #    * for critical transactional data.
  # B: write IO is reported as completed, if it has reached
  #    local DISK and remote buffer cache.
  #    * for most cases.
  # A: write IO is reported as completed, if it has reached
  #    local DISK and local tcp send buffer. (see also sndbuf-size)
  #    * for high latency networks
  #
  #**********
  # uhm, benchmarks have shown that C is actually better than B.
  # this note shall disappear, when we are convinced that B is
  # the right choice "for most cases".
  # Until then, always use C unless you have a reason not to.
  #    --lge
  #**********
  #
  protocol C;
  # what should be done in case the cluster starts up in
  # degraded mode, but knows it has inconsistent data.
  #incon-degr-cmd "echo '!!DRBD!! raid was started in degradated mode see /etc/drbd.conf line 90  ' | wall ; sleep 60 ; halt -f";
  startup {
    # Wait for connection timeout.
    # The init script blocks the boot process until the resources
    # are connected. This is so when the cluster manager starts later,
    # it does not see a resource with internal split-brain.
    # In case you want to limit the wait time, do it here.
    # Default is 0, which means unlimited. Unit is seconds.
    #
    wfc-timeout  20;
    # Wait for connection timeout if this node was a degraded cluster.
    # In case a degraded cluster (= cluster with only one node left)
    # is rebooted, this timeout value is used.
    #
    degr-wfc-timeout 100;    # 100 seconds.
  }
  disk {
    # if the lower level device reports io-error you have the choice of
    #  "pass_on"  ->  Report the io-error to the upper layers.
    #                Primary  -> report it to the mounted file system.
    #                Secondary -> ignore it.
    #  "panic"    ->  The node leaves the cluster by doing a kernel panic.
    #  "detach"  ->  The node drops its backing storage device, and
    #                continues in disk less mode.
    #
    on-io-error  detach;
    # In case you only want to use a fraction of the available space
    # you might use the "size" option here.
    #
    # size 10G;
  }
  net {
    # this is the size of the tcp socket send buffer
    # increase it _carefully_ if you want to use protocol A over a
    # high latency network with reasonable write throughput.
    # defaults to 2*65535; you might try even 1M, but if your kernel or
    # network driver chokes on that, you have been warned.
    sndbuf-size 512k;
    timeout       30;    #  3 seconds  (unit = 0.1 seconds)
    connect-int    6;    #  6 seconds  (unit = 1 second)
    ping-int       6;    #  6 seconds  (unit = 1 second)
    # Maximal number of requests (4K) to be allocated by DRBD.
    # The minimum is hardcoded to 32 (=128 kb).
    # For hight performance installations it might help if you
    # increase that number. These buffers are used to hold
    # datablocks while they are written to disk.
    #
    max-buffers    8192;
    # The highest number of data blocks between two write barriers.
    # If you set this < 10 you might decrease your performance.
    max-epoch-size  10240;
    # if some block send times out this many times, the peer is
    # considered dead, even if it still answers ping requests.
    # ko-count 4;
    # if the connection to the peer is lost you have the choice of
    #  "reconnect"  -> Try to reconnect (AKA WFConnection state)
    #  "stand_alone" -> Do not reconnect (AKA StandAlone state)
    #  "freeze_io"  -> Try to reconnect but freeze all IO until
    #                  the connection is established again.
    # on-disconnect reconnect;
  }
  syncer {
    # Limit the bandwith used by the resynchronisation process.
    # default unit is KB/sec; optional suffixes K,M,G are allowed
    #
    rate 500M;
    # All devices in one group are resynchronized parallel.
    # Resychronisation of groups is serialized in ascending order.
    # Put DRBD resources which are on different physical disks in one group.
    # Put DRBD resources on one physical disk in different groups.
    #
    group 2;
    # Configures the size of the active set. Each extent is 4M,
    # 257 Extents ~> 1GB active set size. In case your syncer
    # runs @ 10MB/sec, all resync after a primary's crash will last
    # 1GB / ( 10MB/sec ) ~ 102 seconds ~ One Minute and 42 Seconds.
    # BTW, the hash algorithm works best if the number of al-extents
    # is prime. (To test the worst case performace use a power of 2)
    al-extents 257;
  }
  on skylla {
    device    /dev/drbd1;
    disk      /dev/lvm/domain2;
    address    10.0.1.1:7789;
    meta-disk  internal;
    # meta-disk is either 'internal' or '/dev/ice/name [idx]'
    #
    # You can use a single block device to store meta-data
    # of multiple DRBD's.
    # E.g. use meta-disk /dev/hde6[0]; and meta-disk /dev/hde6[1];
    # for two different resources. In this case the meta-disk
    # would need to be at least 256 MB in size.
    #
    # 'internal' means, that the last 128 MB of the lower device
    # are used to store the meta-data.
    # You must not give an index with 'internal'.
  }
  on charybda {
    device    /dev/drbd1;
    disk      /dev/lvm/domain2;
    address  10.0.1.2:7789;
    meta-disk internal;
  }
}

So, once DRBD is configured and started (I recommend checking the drbd status in
/proc/drbd), we can move on to installing the virtual machines and the software they
need. I will leave the installation itself up to you, but as soon as the machines are
installed, we can put the following config file in place so that they actually boot.
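Bringing a freshly configured resource up might, as a sketch, look like this on each node. Note that on the very first sync one node must be forced to Primary, and the exact override flag differs between DRBD versions, so check your drbdadm documentation before running the last command:

```shell
# On both nodes: write DRBD metadata for the resource and start the service
drbdadm create-md domain2
/etc/init.d/drbd start

# Watch the state; both sides should reach Connected, Secondary/Secondary
cat /proc/drbd

# On ONE node only: force it to Primary to kick off the initial full sync
# (flag shown is for DRBD 8.x; DRBD 0.7 uses a different override option)
drbdadm -- --overwrite-data-of-peer primary domain2
```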

XEND: the config file for xend is in /etc/xend;
the config files for the domains are in /etc/xend/domain?

#----------------------------------------------------------------------------
# Kernel image file.
kernel = "/boot/vmlinuz-2.6.1?-xenU"
# Optional ramdisk.
#ramdisk = "/boot/initrd.gz"
# The domain build function. Default is 'linux'.
#builder='linux'
# Initial memory allocation (in megabytes) for the new domain.
memory = 256
# A name for your domain. All domains must have different names.
name = "domain1"
# List of which CPUS this domain is allowed to use, default Xen picks
#cpus = ""        # leave to Xen to pick
cpus = "0"        # all vcpus run on CPU0
#cpus = "0-3,5,^1" # run on cpus 0,2,3,5
# Number of Virtual CPUS to use, default is 1
vcpus = 1
#----------------------------------------------------------------------------
# Define network interfaces.
# Number of network interfaces. Default is 1.
#nics=1
# Optionally define mac and/or bridge for the network interfaces.
# Random MACs are assigned if not given.
vif = [ '' ]
#----------------------------------------------------------------------------
# Define the disk devices you want the domain to have access to, and
# what you want them accessible as.
# Each disk entry is of the form phy:UNAME,DEV,MODE
# where UNAME is the device, DEV is the device name the domain will see,
# and MODE is r for read-only, w for read-write.
disk = [ 'phy:/dev/drbd0,sda1,w' ]
# Set the hostname.
#hostname= "vm%d" % vmid
# Set root device.
root = "/dev/sda1 rw"
# Boots the guest into runlevel 3.
extra = "3"

Heartbeat: the config file for the server is /etc/ha.d/ha.cf;
the config file for the resources the servers manage is /etc/ha.d/haresources. (If you
look through the links, you will find that heartbeat 2.x can also enable the CRM -
Cluster Resource Manager - but at the time these machines were built it was only in
test operation, so you will have to learn it on your own :)

When configuring ha.cf, the crucial part is setting the timeouts after which a machine
is declared dead because it does not respond to pings (heartbeats).

  • keepalive: the time between individual heartbeats.
  • initdead: the time after which a freshly booted node declares the other node dead.
  • deadtime: the time after which, if a node does not respond to heartbeats, the other
    node takes over its work.
  • node: a space-separated list of the nodes in the HA cluster.
  • bcast: a space-separated list of the interfaces on which the node sends heartbeats.
  • ping: a list of IP addresses the node tries to ping; if it fails, it declares itself
    unreachable and the other node takes over responsibility for the application.
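Putting those options together, a minimal ha.cf for this pair of servers might look like the sketch below. The timeout values, the eth1 interface, and the 10.0.0.1 ping target are illustrative assumptions, not tuned recommendations:

```shell
# /etc/ha.d/ha.cf -- minimal sketch for the skylla/charybda pair
keepalive 2           # heartbeat every 2 seconds
deadtime 30           # declare the peer dead after 30s of silence
initdead 120          # be more patient right after boot
bcast eth1            # the dedicated cross-over link carries the heartbeats
node skylla charybda  # all cluster members must be listed
ping 10.0.0.1         # e.g. the default gateway, as an external sanity check
auto_failback on      # resources return to their preferred node when it recovers
```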

haresources lists the applications that heartbeat starts, using the scripts in
/etc/heartbeat/init.d or /etc/init.d. In our case this file could look roughly like this:

skylla drbddisk::domain1 haxendomains::domain1

charybda drbddisk::domain2 haxendomains::domain2

(!! This is only an example; the drbddisk and haxendomains scripts are not my property,
so I cannot publish them - I wrote them for a company, so that's that :D. haxendomains
is otherwise a simple start script that, depending on its input, stops or starts the
given domain. drbddisk just sets the given resource to Primary/* mode on the given node.)
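For illustration only, a hypothetical wrapper in the spirit the author describes could look like the sketch below. This is my guess, not the original haxendomains script; it assumes domain configs live in /etc/xend/ as above, and relies on heartbeat invoking a resource script as `script <arg> <action>`:

```shell
#!/bin/sh
# Hypothetical sketch of a heartbeat resource script for Xen domains.
# Heartbeat calls it as: haxendomains <domain> {start|stop|status}
DOMAIN="$1"
ACTION="$2"

case "$ACTION" in
  start)
    # Boot the domain from its config file unless it is already running
    xm list "$DOMAIN" >/dev/null 2>&1 || xm create "/etc/xend/$DOMAIN"
    ;;
  stop)
    # Ask the domain to shut down cleanly
    xm shutdown "$DOMAIN"
    ;;
  status)
    xm list "$DOMAIN" >/dev/null 2>&1 && echo running || echo stopped
    ;;
  *)
    echo "usage: $0 <domain> {start|stop|status}" >&2
    exit 1
    ;;
esac
```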

5) Links:
LVM
http://www.tldp.org/HOWTO/LVM-HOWTO/

XEN
http://tx.downloads.xensource.com/downloads/docs/user/

Heartbeat & drbd
http://www.linux-ha.org/GettingStartedV2
http://www.linux-ha.org/DataRedundancyByDrbd
ha.cf
http://linux-ha.org/ha.cf
http://linux-ha.org/ha.cf/DefaultValues
haresources
http://linux-ha.org/haresources

    • LVM is similar to RAID, or... 26.09.2006 | 15:04
      blackhole   Guest

      LVM is similar to RAID, or did I misunderstand it?? Otherwise a solid article. gj.

      • RAID was, as far as I know... 26.09.2006 | 19:00
        blackhole   Guest

        RAID was, as far as I know, designed to make a system redundant against disk
        failure (raid 1...5) and to boost the performance of slow disks (raid 0).

        LVM is designed as a layer that hides from the user things like partitions,
        and also the disk a given partition lives on - resizing those is a really
        painful business, while resizing an LVM partition is a piece of cake.
        I can have several disks in a so-called VG (volume group), and when I create
        an LV (logical volume) in that VG, the LV spreads across the whole VG and I
        have no control over where my data ends up when I write to the LV.
        :)

        P.S. Definitely give LVM a try, it really is a great thing :D

    • (0_-) 28.09.2006 | 10:47
      blackhole   Guest

      pretty good

    • good 28.09.2006 | 19:47
      blackhole   Guest

      good, thanks...

    • xen rules 30.09.2006 | 00:49
      phb   User

      by the way guys, I don't know if you've noticed, but there is a very powerful
      tool available as a free download, xen enterprise; it's only a 30-day trial,
      but that's hardly a problem ;)
    • hypervisor 07.10.2006 | 19:23
      vid   User

      just out of curiosity, how many of you have a CPU that supports VMX/SVM, so
      that you could run a HW hypervisor on it at all?

      • no support needed 09.10.2006 | 17:38
        blackhole   Guest

        you don't need VMX/VT support; a plain x86/x86-64/ia64 and I forget which
        other CPUs are enough.
        at home I run xen3 on netbsd and my CPU is an old 400MHz Sun box :)