Replacing a ZFS Disk with Multipathing

The system on which we ran into issues was prado.ucsd.edu, but these commands should work on other systems as well.

Commands to locate the drive:

dmsetup ls
/opt/MegaRAID/perccli/perccli64 /c1 show all
smartctl -a /dev/sdbb | head -15
multipath -d -ll
zdb -lll /dev/disk/by-id/dm-uuid-mpath-35000cca2581304e0

After you remove the failed drive and install its replacement, the system needs to detect the new disk's paths and create a multipath device for it.

For example, suppose the new, configured device is /dev/sdal.

The number we need is the WWID that multipath prints in parentheses, here 35000cca25818d088:

partprobe /dev/sdal
multipath -v2 /dev/sdal

Jul 21 17:27:56 | sdal: No SAS end device for 'end_device-1:4'
create: mpathgy (35000cca25818d088) undef WDC     ,WUH721414AL4204 
size=13T features='0' hwhandler='0' wp=undef
`-+- policy='service-time 0' prio=1 status=undef
  `- 1:0:249:0 sdal 66:80   undef ready running

Then replace the old device in the pool with the new multipath device:

zpool replace zdata dm-uuid-mpath-35000cca25812fc0c /dev/disk/by-id/dm-uuid-mpath-35000cca25818d088
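
The mapping from the create: line to the device path passed to zpool replace can be scripted. The following is a minimal sketch, assuming multipath prints the "create: <alias> (<WWID>) ..." layout shown above. mpath_byid_from_create is a hypothetical helper name and is not part of multipath-tools.

```shell
# Hypothetical helper: read multipath output on stdin and print the
# /dev/disk/by-id path for every "create: <alias> (<WWID>) ..." line.
# Assumes the WWID is a hex string in parentheses, as in the output above.
mpath_byid_from_create() {
    sed -n 's|^create: [^ ]* (\([0-9a-fA-F]*\)).*|/dev/disk/by-id/dm-uuid-mpath-\1|p'
}
```

For the output above, multipath -v2 /dev/sdal | mpath_byid_from_create should print /dev/disk/by-id/dm-uuid-mpath-35000cca25818d088, provided the create: line is written to standard output.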

If you run into any issue with the disk, you can clear out its multipath maps with the following commands:

multipath -v2 /dev/sdbb
Jul 21 17:33:27 | sdbb: No SAS end device for 'end_device-1:4'
Jul 21 17:33:27 | mpathgz: ignoring map

dmsetup remove mpathgz9
dmsetup remove mpathgz1
dmsetup remove mpathgz

# This command should also remove a multipath device:
multipath -f /dev/disk/by-id/dm-name-mpathgz

If you don't know which drive was replaced, you can refresh all of the multipath devices; this flushes the unused maps and recreates any that are missing:

multipath -F
multipath -v2


Removing the Drive

Let's say the following drive is bad: dm-uuid-mpath-35000cca2581e4a44

	  raidz3-4                           DEGRADED     0     0     0
	    dm-uuid-mpath-35000cca259058cd0  ONLINE       0     0     0
	    dm-uuid-mpath-35000cca2581c1228  ONLINE       0     0     0
	    dm-uuid-mpath-35000cca2581e4a44  FAULTED      6   478     0  too many errors
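
On a pool with many vdevs, the faulted device can be picked out of the status listing automatically. This is a small sketch, assuming the column layout shown above (device name first, state second) and vdev names that start with dm-uuid-mpath-. faulted_mpath_vdevs is a hypothetical helper name.

```shell
# Hypothetical helper: read "zpool status" output on stdin and print the
# names of multipath-backed vdevs whose state column is FAULTED.
faulted_mpath_vdevs() {
    awk '$1 ~ /^dm-uuid-mpath-/ && $2 == "FAULTED" { print $1 }'
}
```

Typical use would be: zpool status zdata | faulted_mpath_vdevs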

Offline the drive:

zpool offline zdata dm-uuid-mpath-35000cca2581e4a44

Find out where that drive is:

multipath -ll | grep -A5 35000cca2581e4a44

mpathco (35000cca2581e4a44) dm-94 WDC,WUH721414AL4204
size=13T features='0' hwhandler='0' wp=rw
|-+- policy='service-time 0' prio=1 status=active
| `- 11:0:122:0 sdhl    133:176 active ready running
`-+- policy='service-time 0' prio=1 status=enabled
  `- 10:0:122:0 sdan    66:112  active ready running
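
The sdX path names can be pulled out of this listing with a short filter. This sketch assumes each path line contains a host:channel:target:lun tuple followed by the device name, as in the output above. mpath_paths is a hypothetical helper name.

```shell
# Hypothetical helper: read "multipath -ll" output on stdin and print the
# device name that follows each H:C:T:L tuple (for example 11:0:122:0 sdhl).
mpath_paths() {
    awk '{
        for (i = 1; i < NF; i++)
            if ($i ~ /^[0-9]+:[0-9]+:[0-9]+:[0-9]+$/)
                print $(i + 1)
    }'
}
```

Typical use would be: multipath -ll | grep -A5 35000cca2581e4a44 | mpath_paths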

Next, we need to locate the physical drive. Either sdhl or sdan can be used, since both are paths to the same drive.

To turn on the indicator light (echo 0 turns it off):

echo 1 > /sys/block/sdhl/device/enclosure_device*/locate
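
Toggling the light can be wrapped in a small function so that it is easy to switch off again after the swap. This is only a sketch: set_locate is a hypothetical name, and the optional third argument (an alternate sysfs root) is an assumption added here so the function can be tried against a scratch directory.

```shell
# Hypothetical helper: set_locate <disk> <0|1> [sysfs_root]
# Writes the state to every enclosure_device*/locate file for the disk.
# Returns non-zero if no locate file was found for that disk.
set_locate() {
    disk=$1
    state=$2
    root=${3:-/sys}   # assumption: overridable root, defaults to /sys
    rc=1
    for f in "$root"/block/"$disk"/device/enclosure_device*/locate; do
        [ -e "$f" ] || continue
        echo "$state" > "$f"
        rc=0
    done
    return "$rc"
}
```

For example, set_locate sdhl 1 turns the light on and set_locate sdhl 0 turns it off.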

Setting the fault light is not needed; it is shown just for reference:

echo 1 > /sys/block/sdhl/device/enclosure_device*/fault

Note: of the two JBODs attached to Whiskeytown, E1 is the upper unit and E0 is the lower unit.

Extra steps to find the serial number:

smartctl -a /dev/sdhl | head -15

smartctl 6.6 2017-11-05 r4594 [x86_64-linux-4.18.0-193.6.3.el8_2.x86_64] (local build)
Copyright (C) 2002-17, Bruce Allen, Christian Franke, www.smartmontools.org

Vendor:               WDC
Product:              WUH721414AL4204
Revision:             C240
Compliance:           SPC-4
User Capacity:        14,000,519,643,136 bytes [14.0 TB]
Logical block size:   4096 bytes
LU is fully provisioned
Rotation Rate:        7200 rpm
Form Factor:          3.5 inches
Logical Unit id:      0x5000cca2581e4a44
Serial number:        9JGJNDUT
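
In the listings here, the multipath name is the Logical Unit id with the 0x dropped and a 3 prepended (0x5000cca2581e4a44 becomes 35000cca2581e4a44). This makes it possible to confirm that a given sdX really is the faulted vdev. The sketch below assumes that relationship holds on your hardware. lu_id_to_mpath is a hypothetical helper name.

```shell
# Hypothetical helper: read "smartctl -a" output on stdin and print the
# matching dm-uuid-mpath-* name. Assumes the WWID is "3" followed by the
# Logical Unit id without its 0x prefix, as in the listings above.
lu_id_to_mpath() {
    awk '/^Logical Unit id:/ { sub(/^0x/, "", $4); print "dm-uuid-mpath-3" $4 }'
}
```

Typical use would be: smartctl -a /dev/sdhl | lu_id_to_mpath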

Another way to locate the drive in the JBOD:

/opt/MegaRAID/perccli/perccli64 /c1 show all | grep -B5 9JGJNDUT

Drive /c1/e1/s47 device attributes :
Manufacturer Id = WDC     
Model Number = WUH721414AL4204 
NAND Vendor = NA

The drive is located in enclosure 1, slot 47 (/c1/e1/s47).
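
Instead of reading the grep context by eye, the slot can be extracted directly. This sketch assumes that each drive section starts with a "Drive /cX/eY/sZ ..." header line and that the serial number appears on some later line of that same section. drive_slot_for_serial is a hypothetical helper name.

```shell
# Hypothetical helper: drive_slot_for_serial <serial>
# Reads "perccli64 /c1 show all" output on stdin, remembers the most recent
# "Drive /cX/eY/sZ" header, and prints that path for the first line that
# contains the serial number.
drive_slot_for_serial() {
    serial=$1
    awk -v sn="$serial" '
        /^Drive \/c[0-9]+\/e[0-9]+\/s[0-9]+/ { path = $2 }
        index($0, sn) && path != "" { print path; exit }
    '
}
```

Typical use would be: /opt/MegaRAID/perccli/perccli64 /c1 show all | drive_slot_for_serial 9JGJNDUT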

Installing a new drive:

Install the new drive and recreate the multipath device:

multipath -F
# Check the drive
multipath -v2 /dev/sdhl

Sep 14 13:16:52 | sdfd: No SAS end device for 'end_device-11:2'
Sep 14 13:16:52 | sdcv: No SAS end device for 'end_device-10:4'
create: mpathdq (35000cca26492bff4) undef WDC,WUH721414AL4204
size=13T features='0' hwhandler='0' wp=undef
|-+- policy='service-time 0' prio=1 status=undef
| `- 11:0:123:0 sdhl    129:240 undef ready running
`-+- policy='service-time 0' prio=1 status=undef
  `- 10:0:123:0 sdan    70:48   undef ready running

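Do a zpool replace. The first device is the old (faulted) one as it appears in zpool status; the second is the new multipath device:

zpool replace zdata dm-uuid-mpath-35000cca2581e4a44 dm-uuid-mpath-35000cca26492bff4

The new device name comes from the WWID in the multipath output:

create: mpathdq (35000cca26492bff4) → dm-uuid-mpath-35000cca26492bff4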

