Verifying operation after loss of a single storage shelf
You can test the failure of a single storage shelf to verify that there is no single point of failure.
This procedure has the following expected results:
- An error message should be reported by the monitoring software.
- No failover or loss of service should occur.
- Mirror resynchronization starts automatically after the hardware failure is restored.
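Before starting, the `storage failover show` output used in the first step can be screened programmatically. The following is a hypothetical helper, not part of the ONTAP toolset; it assumes the tabular column layout shown in this procedure's sample output.

```python
# Hypothetical helper, not an ONTAP tool: screen `storage failover show`
# output (captured e.g. over SSH) and confirm every node reports takeover
# as possible. Assumes the column layout shown in this procedure's samples.

def failover_ready(output: str) -> bool:
    """True if every data row has 'true' in the Possible column."""
    possible = []
    for line in output.splitlines():
        parts = line.split()
        # Data rows look like: Node, Partner, Possible, State Description...
        if len(parts) >= 3 and parts[2] in ("true", "false"):
            possible.append(parts[2] == "true")
    return bool(possible) and all(possible)

sample = """\
Node           Partner        Possible State Description
-------------- -------------- -------- -------------------------------------
node_A_1       node_A_2       true     Connected to node_A_2
node_A_2       node_A_1       true     Connected to node_A_1
"""
print(failover_ready(sample))  # True
```

Header and separator rows are skipped because their third token is neither "true" nor "false"; an empty capture returns False rather than passing vacuously.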
- Check the storage failover status: storage failover show
cluster_A::> storage failover show
Node Partner Possible State Description
-------------- -------------- -------- -------------------------------------
node_A_1 node_A_2 true Connected to node_A_2
node_A_2 node_A_1 true Connected to node_A_1
2 entries were displayed.
- Check the aggregate status: storage aggregate show
cluster_A::> storage aggregate show
cluster Aggregates:
Aggregate Size Available Used% State #Vols Nodes RAID Status
--------- -------- --------- ----- ------- ------ ---------------- ------------
node_A_1_data01_mirrored
4.15TB 3.40TB 18% online 3 node_A_1 raid_dp,
mirrored,
normal
node_A_1_root
707.7GB 34.29GB 95% online 1 node_A_1 raid_dp,
mirrored,
normal
node_A_2_data01_mirrored
4.15TB 4.12TB 1% online 2 node_A_2 raid_dp,
mirrored,
normal
node_A_2_data02_unmirrored
2.18TB 2.18TB 0% online 1 node_A_2 raid_dp,
normal
node_A_2_root
707.7GB 34.27GB 95% online 1 node_A_2 raid_dp,
mirrored,
normal
- Verify that all data SVMs and data volumes are online and serving data:
vserver show -type data
network interface show -fields is-home false
volume show !vol0,!MDV*
cluster_A::> vserver show -type data
Admin Operational Root
Vserver Type Subtype State State Volume Aggregate
----------- ------- ---------- ---------- ----------- ---------- ----------
SVM1 data sync-source running SVM1_root node_A_1_data01_mirrored
SVM2 data sync-source running SVM2_root node_A_2_data01_mirrored
cluster_A::> network interface show -fields is-home false
There are no entries matching your query.
cluster_A::> volume show !vol0,!MDV*
Vserver Volume Aggregate State Type Size Available Used%
--------- ------------ ------------ ---------- ---- ---------- ---------- -----
SVM1
SVM1_root
node_A_1_data01_mirrored
online RW 10GB 9.50GB 5%
SVM1
SVM1_data_vol
node_A_1_data01_mirrored
online RW 10GB 9.49GB 5%
SVM2
SVM2_root
node_A_2_data01_mirrored
online RW 10GB 9.49GB 5%
SVM2
SVM2_data_vol
node_A_2_data02_unmirrored
online RW 1GB 972.6MB 5%
- Identify a shelf in Pool 1 for node node_A_2 to power off to simulate a sudden hardware failure:
storage aggregate show -r -node node-name !*root
The shelf you select must contain drives that are part of a mirrored data aggregate. In the following example, shelf ID 31 is selected to fail.
cluster_A::> storage aggregate show -r -node node_A_2 !*root
Owner Node: node_A_2
Aggregate: node_A_2_data01_mirrored (online, raid_dp, mirrored) (block checksums)
Plex: /node_A_2_data01_mirrored/plex0 (online, normal, active, pool0)
RAID Group /node_A_2_data01_mirrored/plex0/rg0 (normal, block checksums)
Usable Physical
Position Disk Pool Type RPM Size Size Status
-------- --------------------------- ---- ----- ------ -------- -------- ----------
dparity 2.30.3 0 BSAS 7200 827.7GB 828.0GB (normal)
parity 2.30.4 0 BSAS 7200 827.7GB 828.0GB (normal)
data 2.30.6 0 BSAS 7200 827.7GB 828.0GB (normal)
data 2.30.8 0 BSAS 7200 827.7GB 828.0GB (normal)
data 2.30.5 0 BSAS 7200 827.7GB 828.0GB (normal)
Plex: /node_A_2_data01_mirrored/plex4 (online, normal, active, pool1)
RAID Group /node_A_2_data01_mirrored/plex4/rg0 (normal, block checksums)
Usable Physical
Position Disk Pool Type RPM Size Size Status
-------- --------------------------- ---- ----- ------ -------- -------- ----------
dparity 1.31.7 1 BSAS 7200 827.7GB 828.0GB (normal)
parity 1.31.6 1 BSAS 7200 827.7GB 828.0GB (normal)
data 1.31.3 1 BSAS 7200 827.7GB 828.0GB (normal)
data 1.31.4 1 BSAS 7200 827.7GB 828.0GB (normal)
data 1.31.5 1 BSAS 7200 827.7GB 828.0GB (normal)
Aggregate: node_A_2_data02_unmirrored (online, raid_dp) (block checksums)
Plex: /node_A_2_data02_unmirrored/plex0 (online, normal, active, pool0)
RAID Group /node_A_2_data02_unmirrored/plex0/rg0 (normal, block checksums)
Usable Physical
Position Disk Pool Type RPM Size Size Status
-------- --------------------------- ---- ----- ------ -------- -------- ----------
dparity 2.30.12 0 BSAS 7200 827.7GB 828.0GB (normal)
parity 2.30.22 0 BSAS 7200 827.7GB 828.0GB (normal)
data 2.30.21 0 BSAS 7200 827.7GB 828.0GB (normal)
data 2.30.20 0 BSAS 7200 827.7GB 828.0GB (normal)
data 2.30.14 0 BSAS 7200 827.7GB 828.0GB (normal)
15 entries were displayed.
- Physically power off the shelf that you selected.
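When checking aggregate status after the failure, the RAID Status column of `storage aggregate show` wraps across several lines, which makes it awkward to grep. The following is a hypothetical parsing sketch (not an ONTAP tool) that collects the status keywords printed under a given aggregate's row, assuming the wrapped layout shown in the examples in this procedure.

```python
# Hypothetical sketch: extract RAID status keywords for one aggregate from
# `storage aggregate show` output whose status column wraps across lines,
# as in the sample outputs in this procedure.

def raid_keywords(output: str, aggregate: str) -> set[str]:
    """Collect status words (normal/degraded/resyncing/mirrored) printed
    after the named aggregate's row, up to the next aggregate name."""
    words = ("normal", "degraded", "resyncing", "mirrored")
    found: set[str] = set()
    in_block = False
    for line in output.splitlines():
        token = line.strip()
        if token == aggregate:
            in_block = True
            continue
        if in_block:
            # A bare aggregate name on its own line starts the next block.
            if " " not in token and token.startswith("node_"):
                break
            found.update(w for w in words if w in token.replace(",", " ").split())
    return found

sample = """\
node_A_2_data01_mirrored
    4.15TB 4.12TB 1% online 2 node_A_2 raid_dp,
                                        mirror
                                        degraded
node_A_2_data02_unmirrored
    2.18TB 2.18TB 0% online 1 node_A_2 raid_dp,
                                        normal
"""
print(raid_keywords(sample, "node_A_2_data01_mirrored"))  # {'degraded'}
```

Matching on whole words (after stripping commas) avoids false positives such as "unmirrored" matching "mirrored".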
- Check the aggregate status again:
storage aggregate show
storage aggregate show -r -node node_A_2 !*root
The aggregate with drives on the powered-off shelf should have a degraded RAID status, and drives on the affected plex should have a failed status, as shown in the following example:
cluster_A::> storage aggregate show
Aggregate Size Available Used% State #Vols Nodes RAID Status
--------- -------- --------- ----- ------- ------ ---------------- ------------
node_A_1_data01_mirrored
4.15TB 3.40TB 18% online 3 node_A_1 raid_dp,
mirrored,
normal
node_A_1_root
707.7GB 34.29GB 95% online 1 node_A_1 raid_dp,
mirrored,
normal
node_A_2_data01_mirrored
4.15TB 4.12TB 1% online 2 node_A_2 raid_dp,
mirror
degraded
node_A_2_data02_unmirrored
2.18TB 2.18TB 0% online 1 node_A_2 raid_dp,
normal
node_A_2_root
707.7GB 34.27GB 95% online 1 node_A_2 raid_dp,
mirror
degraded
cluster_A::> storage aggregate show -r -node node_A_2 !*root
Owner Node: node_A_2
Aggregate: node_A_2_data01_mirrored (online, raid_dp, mirror degraded) (block checksums)
Plex: /node_A_2_data01_mirrored/plex0 (online, normal, active, pool0)
RAID Group /node_A_2_data01_mirrored/plex0/rg0 (normal, block checksums)
Usable Physical
Position Disk Pool Type RPM Size Size Status
-------- --------------------------- ---- ----- ------ -------- -------- ----------
dparity 2.30.3 0 BSAS 7200 827.7GB 828.0GB (normal)
parity 2.30.4 0 BSAS 7200 827.7GB 828.0GB (normal)
data 2.30.6 0 BSAS 7200 827.7GB 828.0GB (normal)
data 2.30.8 0 BSAS 7200 827.7GB 828.0GB (normal)
data 2.30.5 0 BSAS 7200 827.7GB 828.0GB (normal)
Plex: /node_A_2_data01_mirrored/plex4 (offline, failed, inactive, pool1)
RAID Group /node_A_2_data01_mirrored/plex4/rg0 (partial, none checksums)
Usable Physical
Position Disk Pool Type RPM Size Size Status
-------- --------------------------- ---- ----- ------ -------- -------- ----------
dparity FAILED - - - 827.7GB - (failed)
parity FAILED - - - 827.7GB - (failed)
data FAILED - - - 827.7GB - (failed)
data FAILED - - - 827.7GB - (failed)
data FAILED - - - 827.7GB - (failed)
Aggregate: node_A_2_data02_unmirrored (online, raid_dp) (block checksums)
Plex: /node_A_2_data02_unmirrored/plex0 (online, normal, active, pool0)
RAID Group /node_A_2_data02_unmirrored/plex0/rg0 (normal, block checksums)
Usable Physical
Position Disk Pool Type RPM Size Size Status
-------- --------------------------- ---- ----- ------ -------- -------- ----------
dparity 2.30.12 0 BSAS 7200 827.7GB 828.0GB (normal)
parity 2.30.22 0 BSAS 7200 827.7GB 828.0GB (normal)
data 2.30.21 0 BSAS 7200 827.7GB 828.0GB (normal)
data 2.30.20 0 BSAS 7200 827.7GB 828.0GB (normal)
data 2.30.14 0 BSAS 7200 827.7GB 828.0GB (normal)
15 entries were displayed.
- Verify that the data is being served and that all volumes are still online:
vserver show -type data
network interface show -fields is-home false
volume show !vol0,!MDV*
cluster_A::> vserver show -type data
Admin Operational Root
Vserver Type Subtype State State Volume Aggregate
----------- ------- ---------- ---------- ----------- ---------- ----------
SVM1 data sync-source running SVM1_root node_A_1_data01_mirrored
SVM2 data sync-source running SVM2_root node_A_1_data01_mirrored
cluster_A::> network interface show -fields is-home false
There are no entries matching your query.
cluster_A::> volume show !vol0,!MDV*
Vserver Volume Aggregate State Type Size Available Used%
--------- ------------ ------------ ---------- ---- ---------- ---------- -----
SVM1
SVM1_root
node_A_1_data01_mirrored
online RW 10GB 9.50GB 5%
SVM1
SVM1_data_vol
node_A_1_data01_mirrored
online RW 10GB 9.49GB 5%
SVM2
SVM2_root
node_A_1_data01_mirrored
online RW 10GB 9.49GB 5%
SVM2
SVM2_data_vol
node_A_2_data02_unmirrored
online RW 1GB 972.6MB 5%
- Physically power on the shelf. Resynchronization starts automatically.
- Verify that resynchronization has started:
storage aggregate show
The affected aggregate should have a resyncing RAID status, as shown in the following example:
cluster_A::> storage aggregate show
cluster Aggregates:
Aggregate Size Available Used% State #Vols Nodes RAID Status
--------- -------- --------- ----- ------- ------ ---------------- ------------
node_A_1_data01_mirrored
4.15TB 3.40TB 18% online 3 node_A_1 raid_dp,
mirrored,
normal
node_A_1_root
707.7GB 34.29GB 95% online 1 node_A_1 raid_dp,
mirrored,
normal
node_A_2_data01_mirrored
4.15TB 4.12TB 1% online 2 node_A_2 raid_dp,
resyncing
node_A_2_data02_unmirrored
2.18TB 2.18TB 0% online 1 node_A_2 raid_dp,
normal
node_A_2_root
707.7GB 34.27GB 95% online 1 node_A_2 raid_dp,
resyncing
- Monitor the aggregate to confirm that resynchronization is complete:
storage aggregate show
The affected aggregate should have a normal RAID status, as shown in the following example:
cluster_A::> storage aggregate show
cluster Aggregates:
Aggregate Size Available Used% State #Vols Nodes RAID Status
--------- -------- --------- ----- ------- ------ ---------------- ------------
node_A_1_data01_mirrored
4.15TB 3.40TB 18% online 3 node_A_1 raid_dp,
mirrored,
normal
node_A_1_root
707.7GB 34.29GB 95% online 1 node_A_1 raid_dp,
mirrored,
normal
node_A_2_data01_mirrored
4.15TB 4.12TB 1% online 2 node_A_2 raid_dp,
normal
node_A_2_data02_unmirrored
2.18TB 2.18TB 0% online 1 node_A_2 raid_dp,
normal
node_A_2_root
707.7GB 34.27GB 95% online 1 node_A_2 raid_dp,
resyncing
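The final two checks amount to polling `storage aggregate show` until the resyncing status clears. A hypothetical polling sketch follows; the status-fetching callable, poll interval, and timeout are placeholders, not values from this procedure, and how the status text is retrieved (for example over SSH) is site-specific and omitted.

```python
import time
from typing import Callable

# Hypothetical sketch: poll until an aggregate's RAID status no longer
# reports 'resyncing'. get_status is a placeholder that should return the
# RAID Status text for the aggregate, e.g. from `storage aggregate show`.

def wait_for_resync(get_status: Callable[[], str],
                    poll_seconds: float = 60.0,
                    timeout_seconds: float = 4 * 3600) -> bool:
    """Return True once 'resyncing' disappears from the status; False on timeout."""
    deadline = time.monotonic() + timeout_seconds
    while time.monotonic() < deadline:
        if "resyncing" not in get_status():
            return True
        time.sleep(poll_seconds)
    return False

# Demonstration with a canned status sequence standing in for live output:
statuses = iter(["raid_dp, resyncing", "raid_dp, resyncing", "raid_dp, mirrored, normal"])
print(wait_for_resync(lambda: next(statuses), poll_seconds=0, timeout_seconds=5))  # True
```

Resynchronization time depends on the amount of data on the affected plex, so any real timeout should be sized accordingly.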