
Verifying operation after loss of a single storage shelf

You can test the failure of a single storage shelf to verify that there is no single point of failure.

About this task

This procedure has the following expected results:

  • The monitoring software should report an error message (see the example after this list).

  • No failover or loss of service should occur.

  • Mirror resynchronization starts automatically after power is restored to the shelf.
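
  For example, one way to check the first expected result is to look for error events in the cluster event log while the shelf is powered off. This is a minimal sketch, not the exact alert you will see; message names and severities vary by platform and by how monitoring is configured:

    cluster_A::> event log show -severity ERROR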

  1. Check the storage failover status: storage failover show

    Example

    cluster_A::> storage failover show
                                  Takeover
    Node           Partner        Possible State Description
    -------------- -------------- -------- -------------------------------------
    node_A_1       node_A_2       true     Connected to node_A_2
    node_A_2       node_A_1       true     Connected to node_A_1
    2 entries were displayed.

  2. Check the aggregate status: storage aggregate show

    Example

    cluster_A::> storage aggregate show

    cluster Aggregates:
    Aggregate     Size Available Used% State   #Vols  Nodes            RAID Status
    --------- -------- --------- ----- ------- ------ ---------------- ------------
    node_A_1_data01_mirrored
                4.15TB    3.40TB   18% online       3 node_A_1         raid_dp,
                                                                       mirrored,
                                                                       normal
    node_A_1_root
               707.7GB   34.29GB   95% online       1 node_A_1         raid_dp,
                                                                       mirrored,
                                                                       normal
    node_A_2_data01_mirrored
                4.15TB    4.12TB    1% online       2 node_A_2         raid_dp,
                                                                       mirrored,
                                                                       normal
    node_A_2_data02_unmirrored
                2.18TB    2.18TB    0% online       1 node_A_2         raid_dp,
                                                                       normal
    node_A_2_root
               707.7GB   34.27GB   95% online       1 node_A_2         raid_dp,
                                                                       mirrored,
                                                                       normal

  3. Verify that all data SVMs and data volumes are online and serving data: vserver show -type data, network interface show -fields is-home false, and volume show !vol0,!MDV*

    Example

    cluster_A::> vserver show -type data
                                   Admin      Operational Root
    Vserver     Type    Subtype    State      State       Volume     Aggregate
    ----------- ------- ---------- ---------- ----------- ---------- ----------
    SVM1        data    sync-source           running     SVM1_root  node_A_1_data01_mirrored
    SVM2        data    sync-source           running     SVM2_root  node_A_2_data01_mirrored

    cluster_A::> network interface show -fields is-home false
    There are no entries matching your query.

    cluster_A::> volume show !vol0,!MDV*
    Vserver   Volume       Aggregate    State      Type       Size  Available Used%
    --------- ------------ ------------ ---------- ---- ---------- ---------- -----
    SVM1      SVM1_root    node_A_1_data01_mirrored
                                        online     RW         10GB     9.50GB    5%
    SVM1      SVM1_data_vol
                           node_A_1_data01_mirrored
                                        online     RW         10GB     9.49GB    5%
    SVM2      SVM2_root    node_A_2_data01_mirrored
                                        online     RW         10GB     9.49GB    5%
    SVM2      SVM2_data_vol
                           node_A_2_data02_unmirrored
                                        online     RW          1GB    972.6MB    5%


  4. Identify a shelf in Pool 1 for node node_A_2 to power off to simulate a sudden hardware failure: storage aggregate show -r -node node-name !*root

    The shelf you select must contain drives that are part of a mirrored data aggregate.

    Example

    In the following example, shelf ID 31 is selected to fail.

    cluster_A::> storage aggregate show -r -node node_A_2 !*root
    Owner Node: node_A_2
     Aggregate: node_A_2_data01_mirrored (online, raid_dp, mirrored) (block checksums)
      Plex: /node_A_2_data01_mirrored/plex0 (online, normal, active, pool0)
       RAID Group /node_A_2_data01_mirrored/plex0/rg0 (normal, block checksums)
                                                                 Usable Physical
        Position Disk                        Pool Type     RPM     Size     Size Status
        -------- --------------------------- ---- ----- ------ -------- -------- ----------
        dparity  2.30.3                       0   BSAS    7200  827.7GB  828.0GB (normal)
        parity   2.30.4                       0   BSAS    7200  827.7GB  828.0GB (normal)
        data     2.30.6                       0   BSAS    7200  827.7GB  828.0GB (normal)
        data     2.30.8                       0   BSAS    7200  827.7GB  828.0GB (normal)
        data     2.30.5                       0   BSAS    7200  827.7GB  828.0GB (normal)

      Plex: /node_A_2_data01_mirrored/plex4 (online, normal, active, pool1)
       RAID Group /node_A_2_data01_mirrored/plex4/rg0 (normal, block checksums)
                                                                 Usable Physical
        Position Disk                        Pool Type     RPM     Size     Size Status
        -------- --------------------------- ---- ----- ------ -------- -------- ----------
        dparity  1.31.7                       1   BSAS    7200  827.7GB  828.0GB (normal)
        parity   1.31.6                       1   BSAS    7200  827.7GB  828.0GB (normal)
        data     1.31.3                       1   BSAS    7200  827.7GB  828.0GB (normal)
        data     1.31.4                       1   BSAS    7200  827.7GB  828.0GB (normal)
        data     1.31.5                       1   BSAS    7200  827.7GB  828.0GB (normal)

     Aggregate: node_A_2_data02_unmirrored (online, raid_dp) (block checksums)
      Plex: /node_A_2_data02_unmirrored/plex0 (online, normal, active, pool0)
       RAID Group /node_A_2_data02_unmirrored/plex0/rg0 (normal, block checksums)
                                                                 Usable Physical
        Position Disk                        Pool Type     RPM     Size     Size Status
        -------- --------------------------- ---- ----- ------ -------- -------- ----------
        dparity  2.30.12                      0   BSAS    7200  827.7GB  828.0GB (normal)
        parity   2.30.22                      0   BSAS    7200  827.7GB  828.0GB (normal)
        data     2.30.21                      0   BSAS    7200  827.7GB  828.0GB (normal)
        data     2.30.20                      0   BSAS    7200  827.7GB  828.0GB (normal)
        data     2.30.14                      0   BSAS    7200  827.7GB  828.0GB (normal)
    15 entries were displayed.

  5. Physically power off the shelf that you selected.
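
    If you want to confirm that ONTAP has registered the outage before continuing, you can list shelf status with the storage shelf show command. This is an optional, minimal check; exactly how a powered-off shelf is reported varies by ONTAP version and shelf model, so look for the shelf you selected (shelf 31 in this example) to show an error or missing state:

    cluster_A::> storage shelf show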
  6. Check the aggregate status again: storage aggregate show and storage aggregate show -r -node node_A_2 !*root

    Example

    The aggregate with drives on the powered-off shelf should have a degraded RAID status, and drives on the affected plex should have a failed status, as shown in the following example:

    cluster_A::> storage aggregate show
    Aggregate     Size Available Used% State   #Vols  Nodes            RAID Status
    --------- -------- --------- ----- ------- ------ ---------------- ------------
    node_A_1_data01_mirrored
                4.15TB    3.40TB   18% online       3 node_A_1         raid_dp,
                                                                       mirrored,
                                                                       normal
    node_A_1_root
               707.7GB   34.29GB   95% online       1 node_A_1         raid_dp,
                                                                       mirrored,
                                                                       normal
    node_A_2_data01_mirrored
                4.15TB    4.12TB    1% online       2 node_A_2         raid_dp,
                                                                       mirror
                                                                       degraded
    node_A_2_data02_unmirrored
                2.18TB    2.18TB    0% online       1 node_A_2         raid_dp,
                                                                       normal
    node_A_2_root
               707.7GB   34.27GB   95% online       1 node_A_2         raid_dp,
                                                                       mirror
                                                                       degraded

    cluster_A::> storage aggregate show -r -node node_A_2 !*root
    Owner Node: node_A_2
     Aggregate: node_A_2_data01_mirrored (online, raid_dp, mirror degraded) (block checksums)
      Plex: /node_A_2_data01_mirrored/plex0 (online, normal, active, pool0)
       RAID Group /node_A_2_data01_mirrored/plex0/rg0 (normal, block checksums)
                                                                 Usable Physical
        Position Disk                        Pool Type     RPM     Size     Size Status
        -------- --------------------------- ---- ----- ------ -------- -------- ----------
        dparity  2.30.3                       0   BSAS    7200  827.7GB  828.0GB (normal)
        parity   2.30.4                       0   BSAS    7200  827.7GB  828.0GB (normal)
        data     2.30.6                       0   BSAS    7200  827.7GB  828.0GB (normal)
        data     2.30.8                       0   BSAS    7200  827.7GB  828.0GB (normal)
        data     2.30.5                       0   BSAS    7200  827.7GB  828.0GB (normal)

      Plex: /node_A_2_data01_mirrored/plex4 (offline, failed, inactive, pool1)
       RAID Group /node_A_2_data01_mirrored/plex4/rg0 (partial, none checksums)
                                                                 Usable Physical
        Position Disk                        Pool Type     RPM     Size     Size Status
        -------- --------------------------- ---- ----- ------ -------- -------- ----------
        dparity  FAILED                       -   -          -  827.7GB        - (failed)
        parity   FAILED                       -   -          -  827.7GB        - (failed)
        data     FAILED                       -   -          -  827.7GB        - (failed)
        data     FAILED                       -   -          -  827.7GB        - (failed)
        data     FAILED                       -   -          -  827.7GB        - (failed)

     Aggregate: node_A_2_data02_unmirrored (online, raid_dp) (block checksums)
      Plex: /node_A_2_data02_unmirrored/plex0 (online, normal, active, pool0)
       RAID Group /node_A_2_data02_unmirrored/plex0/rg0 (normal, block checksums)
                                                                 Usable Physical
        Position Disk                        Pool Type     RPM     Size     Size Status
        -------- --------------------------- ---- ----- ------ -------- -------- ----------
        dparity  2.30.12                      0   BSAS    7200  827.7GB  828.0GB (normal)
        parity   2.30.22                      0   BSAS    7200  827.7GB  828.0GB (normal)
        data     2.30.21                      0   BSAS    7200  827.7GB  828.0GB (normal)
        data     2.30.20                      0   BSAS    7200  827.7GB  828.0GB (normal)
        data     2.30.14                      0   BSAS    7200  827.7GB  828.0GB (normal)
    15 entries were displayed.

  7. Verify that the data is being served and that all volumes are still online: vserver show -type data, network interface show -fields is-home false, and volume show !vol0,!MDV*

    Example

    cluster_A::> vserver show -type data
                                   Admin      Operational Root
    Vserver     Type    Subtype    State      State       Volume     Aggregate
    ----------- ------- ---------- ---------- ----------- ---------- ----------
    SVM1        data    sync-source           running     SVM1_root  node_A_1_data01_mirrored
    SVM2        data    sync-source           running     SVM2_root  node_A_2_data01_mirrored

    cluster_A::> network interface show -fields is-home false
    There are no entries matching your query.

    cluster_A::> volume show !vol0,!MDV*
    Vserver   Volume       Aggregate    State      Type       Size  Available Used%
    --------- ------------ ------------ ---------- ---- ---------- ---------- -----
    SVM1      SVM1_root    node_A_1_data01_mirrored
                                        online     RW         10GB     9.50GB    5%
    SVM1      SVM1_data_vol
                           node_A_1_data01_mirrored
                                        online     RW         10GB     9.49GB    5%
    SVM2      SVM2_root    node_A_2_data01_mirrored
                                        online     RW         10GB     9.49GB    5%
    SVM2      SVM2_data_vol
                           node_A_2_data02_unmirrored
                                        online     RW          1GB    972.6MB    5%

  8. Physically power on the shelf.

    Resynchronization starts automatically.
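
    For more detail than the aggregate-level view in the next step, you can also watch the affected plex directly with the storage aggregate plex show command. This is a minimal sketch that reuses the plex name from the earlier example output (plex4); the columns displayed vary by ONTAP version:

    cluster_A::> storage aggregate plex show -aggregate node_A_2_data01_mirrored -plex plex4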

  9. Verify that resynchronization has started: storage aggregate show

    Example

    The affected aggregate should have a resyncing RAID status, as shown in the following example:

    cluster_A::> storage aggregate show
    cluster Aggregates:
    Aggregate     Size Available Used% State   #Vols  Nodes            RAID Status
    --------- -------- --------- ----- ------- ------ ---------------- ------------
    node_A_1_data01_mirrored
                4.15TB    3.40TB   18% online       3 node_A_1         raid_dp,
                                                                       mirrored,
                                                                       normal
    node_A_1_root
               707.7GB   34.29GB   95% online       1 node_A_1         raid_dp,
                                                                       mirrored,
                                                                       normal
    node_A_2_data01_mirrored
                4.15TB    4.12TB    1% online       2 node_A_2         raid_dp,
                                                                       resyncing
    node_A_2_data02_unmirrored
                2.18TB    2.18TB    0% online       1 node_A_2         raid_dp,
                                                                       normal
    node_A_2_root
               707.7GB   34.27GB   95% online       1 node_A_2         raid_dp,
                                                                       resyncing

  10. Monitor the aggregate to confirm that resynchronization is complete: storage aggregate show

    Example

    The affected aggregate should have a normal RAID status, as shown in the following example:

    cluster_A::> storage aggregate show
    cluster Aggregates:
    Aggregate     Size Available Used% State   #Vols  Nodes            RAID Status
    --------- -------- --------- ----- ------- ------ ---------------- ------------
    node_A_1_data01_mirrored
                4.15TB    3.40TB   18% online       3 node_A_1         raid_dp,
                                                                       mirrored,
                                                                       normal
    node_A_1_root
               707.7GB   34.29GB   95% online       1 node_A_1         raid_dp,
                                                                       mirrored,
                                                                       normal
    node_A_2_data01_mirrored
                4.15TB    4.12TB    1% online       2 node_A_2         raid_dp,
                                                                       normal
    node_A_2_data02_unmirrored
                2.18TB    2.18TB    0% online       1 node_A_2         raid_dp,
                                                                       normal
    node_A_2_root
               707.7GB   34.27GB   95% online       1 node_A_2         raid_dp,
                                                                       resyncing