#1031131 open-iscsi: Lot of iscsi/kernel errors in dmesg with Fujitsu Eternus DX100S4 connected with 2 10Gb ethernet paths with multipathd.

Package:
open-iscsi
Source:
open-iscsi
Description:
iSCSI initiator tools
Submitter:
migo
Date:
2023-02-27 19:27:07 UTC
Severity:
normal
#1031131#5
Date:
2023-02-12 10:29:30 UTC
From:
To:
Dear Maintainer,

I want to apologize if this is not the right place to report this issue, but I need somewhere to start.

I've lot of this error messages ind dmesg:

Feb 09 10:56:05 virt2n kernel:  connection2:0: detected conn error (1015)
Feb 09 10:56:05 virt2n iscsid[2790]: connection2:0 is operational after recovery (1 attempts)
Feb 09 10:56:05 virt2n iscsid[2790]: connection1:0 is operational after recovery (1 attempts)
Feb 09 10:56:05 virt2n iscsid[2790]: Kernel reported iSCSI connection 1:0 error (1015 - ISCSI_ERR_DATA_DGST: Data digest mismatch) state (3)
Feb 09 10:56:05 virt2n iscsid[2790]: Kernel reported iSCSI connection 2:0 error (1015 - ISCSI_ERR_DATA_DGST: Data digest mismatch) state (3)
Feb 09 10:56:10 virt2n iscsid[2790]: connection1:0 is operational after recovery (1 attempts)
Feb 09 10:56:10 virt2n iscsid[2790]: connection2:0 is operational after recovery (1 attempts)
Feb 09 10:56:28 virt2n kernel:  connection1:0: detected conn error (1015)
Feb 09 10:56:28 virt2n kernel:  connection2:0: detected conn error (1015)
Feb 09 10:56:28 virt2n iscsid[2790]: Kernel reported iSCSI connection 1:0 error (1015 - ISCSI_ERR_DATA_DGST: Data digest mismatch) state (3)
Feb 09 10:56:28 virt2n iscsid[2790]: Kernel reported iSCSI connection 2:0 error (1015 - ISCSI_ERR_DATA_DGST: Data digest mismatch) state (3)
Feb 09 10:56:33 virt2n kernel:  connection1:0: detected conn error (1015)
Feb 09 10:56:33 virt2n kernel:  connection2:0: detected conn error (1015)
Feb 09 10:56:33 virt2n iscsid[2790]: connection2:0 is operational after recovery (1 attempts)
Feb 09 10:56:33 virt2n iscsid[2790]: connection1:0 is operational after recovery (1 attempts)
Feb 09 10:56:33 virt2n iscsid[2790]: Kernel reported iSCSI connection 1:0 error (1015 - ISCSI_ERR_DATA_DGST: Data digest mismatch) state (3)
Feb 09 10:56:33 virt2n iscsid[2790]: Kernel reported iSCSI connection 2:0 error (1015 - ISCSI_ERR_DATA_DGST: Data digest mismatch) state (3)
Feb 09 10:56:38 virt2n kernel:  connection2:0: detected conn error (1015)
Feb 09 10:56:38 virt2n kernel:  connection1:0: detected conn error (1015)
Feb 09 10:56:38 virt2n iscsid[2790]: connection1:0 is operational after recovery (1 attempts)
Feb 09 10:56:38 virt2n iscsid[2790]: connection2:0 is operational after recovery (1 attempts)
Feb 09 10:56:38 virt2n iscsid[2790]: Kernel reported iSCSI connection 2:0 error (1015 - ISCSI_ERR_DATA_DGST: Data digest mismatch) state (3)
Feb 09 10:56:38 virt2n iscsid[2790]: Kernel reported iSCSI connection 1:0 error (1015 - ISCSI_ERR_DATA_DGST: Data digest mismatch) state (3)
Feb 09 10:56:43 virt2n iscsid[2790]: connection2:0 is operational after recovery (1 attempts)
Feb 09 10:56:43 virt2n iscsid[2790]: connection1:0 is operational after recovery (1 attempts)
Feb 09 10:56:43 virt2n kernel:  connection1:0: detected conn error (1015)
Feb 09 10:56:43 virt2n kernel:  connection2:0: detected conn error (1015)
Feb 09 10:56:44 virt2n iscsid[2790]: Kernel reported iSCSI connection 1:0 error (1015 - ISCSI_ERR_DATA_DGST: Data digest mismatch) state (3)
Feb 09 10:56:44 virt2n iscsid[2790]: Kernel reported iSCSI connection 2:0 error (1015 - ISCSI_ERR_DATA_DGST: Data digest mismatch) state (3)
Feb 09 10:56:49 virt2n iscsid[2790]: connection1:0 is operational after recovery (1 attempts)
Feb 09 10:56:49 virt2n iscsid[2790]: connection2:0 is operational after recovery (1 attempts)
Feb 09 10:57:00 virt2n kernel:  connection1:0: detected conn error (1015)

or enventually this:

Feb 10 20:39:39 virt2n kernel:  connection1:0: ping timeout of 5 secs expired, recv timeout 5, last rx 19077132455, last ping 19077133712, now 19077134976
Feb 10 20:39:39 virt2n kernel:  connection1:0: detected conn error (1022)
Feb 10 20:39:39 virt2n kernel: scsi_io_completion_action: 11 callbacks suppressed
Feb 10 20:39:39 virt2n kernel: sd 1:0:0:5: [sdl] tag#122 FAILED Result: hostbyte=DID_TRANSPORT_DISRUPTED driverbyte=DRIVER_OK
Feb 10 20:39:39 virt2n kernel: sd 1:0:0:5: [sdl] tag#122 CDB: Test Unit Ready 00 00 00 00 00 00
Feb 10 20:39:39 virt2n kernel: sd 1:0:0:16: [sdah] tag#103 FAILED Result: hostbyte=DID_TRANSPORT_DISRUPTED driverbyte=DRIVER_OK
Feb 10 20:39:39 virt2n kernel: sd 1:0:0:16: [sdah] tag#103 CDB: Test Unit Ready 00 00 00 00 00 00
Feb 10 20:39:39 virt2n kernel: sd 1:0:0:24: [sdax] tag#115 FAILED Result: hostbyte=DID_TRANSPORT_DISRUPTED driverbyte=DRIVER_OK
Feb 10 20:39:39 virt2n kernel: sd 1:0:0:24: [sdax] tag#115 CDB: Test Unit Ready 00 00 00 00 00 00
Feb 10 20:39:39 virt2n kernel: sd 1:0:0:29: [sdbi] tag#118 FAILED Result: hostbyte=DID_TRANSPORT_DISRUPTED driverbyte=DRIVER_OK
Feb 10 20:39:39 virt2n kernel: sd 1:0:0:29: [sdbi] tag#118 CDB: Test Unit Ready 00 00 00 00 00 00
Feb 10 20:39:39 virt2n kernel: sd 1:0:0:21: [sdaq] tag#108 FAILED Result: hostbyte=DID_TRANSPORT_DISRUPTED driverbyte=DRIVER_OK
Feb 10 20:39:39 virt2n kernel: sd 1:0:0:21: [sdaq] tag#108 CDB: Test Unit Ready 00 00 00 00 00 00
Feb 10 20:39:39 virt2n kernel: sd 1:0:0:2: [sdf] tag#106 FAILED Result: hostbyte=DID_TRANSPORT_DISRUPTED driverbyte=DRIVER_OK
Feb 10 20:39:39 virt2n kernel: sd 1:0:0:2: [sdf] tag#106 CDB: Test Unit Ready 00 00 00 00 00 00
Feb 10 20:39:39 virt2n kernel: sd 1:0:0:11: [sdx] tag#101 FAILED Result: hostbyte=DID_TRANSPORT_DISRUPTED driverbyte=DRIVER_OK
Feb 10 20:39:39 virt2n kernel: sd 1:0:0:11: [sdx] tag#101 CDB: Test Unit Ready 00 00 00 00 00 00
Feb 10 20:39:39 virt2n kernel: sd 1:0:0:10: [sdv] tag#100 FAILED Result: hostbyte=DID_TRANSPORT_DISRUPTED driverbyte=DRIVER_OK
Feb 10 20:39:39 virt2n kernel: sd 1:0:0:10: [sdv] tag#100 CDB: Test Unit Ready 00 00 00 00 00 00
Feb 10 20:39:39 virt2n kernel: sd 1:0:0:1: [sdd] tag#99 FAILED Result: hostbyte=DID_TRANSPORT_DISRUPTED driverbyte=DRIVER_OK
Feb 10 20:39:39 virt2n kernel: sd 1:0:0:1: [sdd] tag#99 CDB: Test Unit Ready 00 00 00 00 00 00
Feb 10 20:39:39 virt2n kernel: sd 1:0:0:26: [sdbb] tag#116 FAILED Result: hostbyte=DID_TRANSPORT_DISRUPTED driverbyte=DRIVER_OK
Feb 10 20:39:39 virt2n kernel: sd 1:0:0:26: [sdbb] tag#116 CDB: Test Unit Ready 00 00 00 00 00 00
Feb 10 20:39:39 virt2n iscsid[2790]: Kernel reported iSCSI connection 1:0 error (1022 - Invalid or unknown error code) state (3)
Feb 10 20:39:42 virt2n kernel:  connection1:0: pdu (op 0x5 itt 0xa000003c) rejected. Reason code 0x9
Feb 10 20:39:42 virt2n kernel:  connection1:0: pdu (op 0x5 itt 0xa0000070) rejected. Reason code 0x9
Feb 10 20:39:42 virt2n iscsid[2790]: connection1:0 is operational after recovery (1 attempts)

When this happens iscsi opration is interupted for few seconds, multipath -ll reports (this is only example, i/o pending is reported dor all 30 dm-s:


web3v (3600000e00d2c0000002c177200120000) dm-52 FUJITSU,ETERNUS_DXL
size=500G features='3 queue_if_no_path queue_mode mq' hwhandler='1 alua' wp=rw
`-+- policy='service-time 0' prio=10 status=active
  |- 1:0:0:16 sdah 66:16  active i/o pending running
  `- 2:0:0:16 sdai 66:32  active ready running
ais_app (3600000e00d2c0000002c1772001f0000) dm-99 FUJITSU,ETERNUS_DXL
size=500G features='3 queue_if_no_path queue_mode mq' hwhandler='1 alua' wp=rw
`-+- policy='service-time 0' prio=10 status=active
  |- 1:0:0:28 sdbh 67:176 active i/o pending running
  `- 2:0:0:28 sdbf 67:144 active ready running

SAN consists of one Fujitsu DX100 S4 storage with two controllers connected to LAN switch with two 10Gb fibre links (one from controller), each link has its own VLAN configured. Errors reported ocures on virtualization host that are connected mutipath with two 10Gb fibre links to respective VLANs. Jumbo frames are enabled across the way.

I'll add all neded info upun request.

I've consulted this issue with Fujitsu representative and it seems we have all configured right and he advised me to contact Debian support. So I'm here and would kindly ask to point me the right direction.

Thank you,
Kind regards,
Milan

Here is actual iscsi runnig conf:

root@virt2n:/home/migo# iscsiadm -m node -o show
# BEGIN RECORD 2.0-874
node.name = iqn.2000-09.com.fujitsu:storage-system.eternus-dxl:cm1ca0p1
node.tpgt = 4
node.startup = automatic
node.leading_login = No
iface.hwaddress = <empty>
iface.ipaddress = <empty>
iface.iscsi_ifacename = default
iface.net_ifacename = <empty>
iface.gateway = <empty>
iface.subnet_mask = <empty>
iface.transport_name = tcp
iface.initiatorname = <empty>
iface.state = <empty>
iface.vlan_id = 0
iface.vlan_priority = 0
iface.vlan_state = <empty>
iface.iface_num = 0
iface.mtu = 0
iface.port = 0
iface.bootproto = <empty>
iface.dhcp_alt_client_id_state = <empty>
iface.dhcp_alt_client_id = <empty>
iface.dhcp_dns = <empty>
iface.dhcp_learn_iqn = <empty>
iface.dhcp_req_vendor_id_state = <empty>
iface.dhcp_vendor_id_state = <empty>
iface.dhcp_vendor_id = <empty>
iface.dhcp_slp_da = <empty>
iface.fragmentation = <empty>
iface.gratuitous_arp = <empty>
iface.incoming_forwarding = <empty>
iface.tos_state = <empty>
iface.tos = 0
iface.ttl = 0
iface.delayed_ack = <empty>
iface.tcp_nagle = <empty>
iface.tcp_wsf_state = <empty>
iface.tcp_wsf = 0
iface.tcp_timer_scale = 0
iface.tcp_timestamp = <empty>
iface.redirect = <empty>
iface.def_task_mgmt_timeout = 0
iface.header_digest = <empty>
iface.data_digest = <empty>
iface.immediate_data = <empty>
iface.initial_r2t = <empty>
iface.data_seq_inorder = <empty>
iface.data_pdu_inorder = <empty>
iface.erl = 0
iface.max_receive_data_len = 0
iface.first_burst_len = 0
iface.max_outstanding_r2t = 0
iface.max_burst_len = 0
iface.chap_auth = <empty>
iface.bidi_chap = <empty>
iface.strict_login_compliance = <empty>
iface.discovery_auth = <empty>
iface.discovery_logout = <empty>
node.discovery_address = 10.11.12.6
node.discovery_port = 3260
node.discovery_type = send_targets
node.session.initial_cmdsn = 0
node.session.initial_login_retry_max = 8
node.session.xmit_thread_priority = -20
node.session.cmds_max = 128
node.session.queue_depth = 8
node.session.nr_sessions = 1
node.session.auth.authmethod = None
node.session.auth.username = <empty>
node.session.auth.password = <empty>
node.session.auth.username_in = <empty>
node.session.auth.password_in = <empty>
node.session.timeo.replacement_timeout = 120
node.session.err_timeo.abort_timeout = 15
node.session.err_timeo.lu_reset_timeout = 30
node.session.err_timeo.tgt_reset_timeout = 30
node.session.err_timeo.host_reset_timeout = 60
node.session.iscsi.FastAbort = Yes
node.session.iscsi.InitialR2T = No
node.session.iscsi.ImmediateData = Yes
node.session.iscsi.FirstBurstLength = 262144
node.session.iscsi.MaxBurstLength = 16776192
node.session.iscsi.DefaultTime2Retain = 0
node.session.iscsi.DefaultTime2Wait = 2
node.session.iscsi.MaxConnections = 1
node.session.iscsi.MaxOutstandingR2T = 1
node.session.iscsi.ERL = 0
node.conn[0].address = 10.11.15.6
node.conn[0].port = 3260
node.conn[0].startup = manual
node.conn[0].tcp.window_size = 524288
node.conn[0].tcp.type_of_service = 0
node.conn[0].timeo.logout_timeout = 15
node.conn[0].timeo.login_timeout = 15
node.conn[0].timeo.auth_timeout = 45
node.conn[0].timeo.noop_out_interval = 5
node.conn[0].timeo.noop_out_timeout = 5
node.conn[0].iscsi.MaxXmitDataSegmentLength = 0
node.conn[0].iscsi.MaxRecvDataSegmentLength = 262144
node.conn[0].iscsi.HeaderDigest = CRC32C,None
node.conn[0].iscsi.DataDigest = CRC32C,None
node.conn[0].iscsi.IFMarker = No
node.conn[0].iscsi.OFMarker = No
# END RECORD
# BEGIN RECORD 2.0-874
node.name = iqn.2000-09.com.fujitsu:storage-system.eternus-dxl:cm0ca0p1
node.tpgt = 2
node.startup = automatic
node.leading_login = No
iface.hwaddress = <empty>
iface.ipaddress = <empty>
iface.iscsi_ifacename = default
iface.net_ifacename = <empty>
iface.gateway = <empty>
iface.subnet_mask = <empty>
iface.transport_name = tcp
iface.initiatorname = <empty>
iface.state = <empty>
iface.vlan_id = 0
iface.vlan_priority = 0
iface.vlan_state = <empty>
iface.iface_num = 0
iface.mtu = 0
iface.port = 0
iface.bootproto = <empty>
iface.dhcp_alt_client_id_state = <empty>
iface.dhcp_alt_client_id = <empty>
iface.dhcp_dns = <empty>
iface.dhcp_learn_iqn = <empty>
iface.dhcp_req_vendor_id_state = <empty>
iface.dhcp_vendor_id_state = <empty>
iface.dhcp_vendor_id = <empty>
iface.dhcp_slp_da = <empty>
iface.fragmentation = <empty>
iface.gratuitous_arp = <empty>
iface.incoming_forwarding = <empty>
iface.tos_state = <empty>
iface.tos = 0
iface.ttl = 0
iface.delayed_ack = <empty>
iface.tcp_nagle = <empty>
iface.tcp_wsf_state = <empty>
iface.tcp_wsf = 0
iface.tcp_timer_scale = 0
iface.tcp_timestamp = <empty>
iface.redirect = <empty>
iface.def_task_mgmt_timeout = 0
iface.header_digest = <empty>
iface.data_digest = <empty>
iface.immediate_data = <empty>
iface.initial_r2t = <empty>
iface.data_seq_inorder = <empty>
iface.data_pdu_inorder = <empty>
iface.erl = 0
iface.max_receive_data_len = 0
iface.first_burst_len = 0
iface.max_outstanding_r2t = 0
iface.max_burst_len = 0
iface.chap_auth = <empty>
iface.bidi_chap = <empty>
iface.strict_login_compliance = <empty>
iface.discovery_auth = <empty>
iface.discovery_logout = <empty>
node.discovery_address = 10.11.12.6
node.discovery_port = 3260
node.discovery_type = send_targets
node.session.initial_cmdsn = 0
node.session.initial_login_retry_max = 8
node.session.xmit_thread_priority = -20
node.session.cmds_max = 128
node.session.queue_depth = 8
node.session.nr_sessions = 1
node.session.auth.authmethod = None
node.session.auth.username = <empty>
node.session.auth.password = <empty>
node.session.auth.username_in = <empty>
node.session.auth.password_in = <empty>
node.session.timeo.replacement_timeout = 120
node.session.err_timeo.abort_timeout = 15
node.session.err_timeo.lu_reset_timeout = 30
node.session.err_timeo.tgt_reset_timeout = 30
node.session.err_timeo.host_reset_timeout = 60
node.session.iscsi.FastAbort = Yes
node.session.iscsi.InitialR2T = No
node.session.iscsi.ImmediateData = Yes
node.session.iscsi.FirstBurstLength = 262144
node.session.iscsi.MaxBurstLength = 16776192
node.session.iscsi.DefaultTime2Retain = 0
node.session.iscsi.DefaultTime2Wait = 2
node.session.iscsi.MaxConnections = 1
node.session.iscsi.MaxOutstandingR2T = 1
node.session.iscsi.ERL = 0
node.conn[0].address = 10.11.13.6
node.conn[0].port = 3260
node.conn[0].startup = manual
node.conn[0].tcp.window_size = 524288
node.conn[0].tcp.type_of_service = 0
node.conn[0].timeo.logout_timeout = 15
node.conn[0].timeo.login_timeout = 15
node.conn[0].timeo.auth_timeout = 45
node.conn[0].timeo.noop_out_interval = 5
node.conn[0].timeo.noop_out_timeout = 5
node.conn[0].iscsi.MaxXmitDataSegmentLength = 0
node.conn[0].iscsi.MaxRecvDataSegmentLength = 262144
node.conn[0].iscsi.HeaderDigest = CRC32C,None
node.conn[0].iscsi.DataDigest = CRC32C,None
node.conn[0].iscsi.IFMarker = No
node.conn[0].iscsi.OFMarker = No
# END RECORD
# BEGIN RECORD 2.0-874
node.name = iqn.2000-09.com.fujitsu:storage-system.eternus-dxl:cm0ca0p0
node.tpgt = 1
node.startup = automatic
node.leading_login = No
iface.hwaddress = <empty>
iface.ipaddress = <empty>
iface.iscsi_ifacename = default
iface.net_ifacename = <empty>
iface.gateway = <empty>
iface.subnet_mask = <empty>
iface.transport_name = tcp
iface.initiatorname = <empty>
iface.state = <empty>
iface.vlan_id = 0
iface.vlan_priority = 0
iface.vlan_state = <empty>
iface.iface_num = 0
iface.mtu = 0
iface.port = 0
iface.bootproto = <empty>
iface.dhcp_alt_client_id_state = <empty>
iface.dhcp_alt_client_id = <empty>
iface.dhcp_dns = <empty>
iface.dhcp_learn_iqn = <empty>
iface.dhcp_req_vendor_id_state = <empty>
iface.dhcp_vendor_id_state = <empty>
iface.dhcp_vendor_id = <empty>
iface.dhcp_slp_da = <empty>
iface.fragmentation = <empty>
iface.gratuitous_arp = <empty>
iface.incoming_forwarding = <empty>
iface.tos_state = <empty>
iface.tos = 0
iface.ttl = 0
iface.delayed_ack = <empty>
iface.tcp_nagle = <empty>
iface.tcp_wsf_state = <empty>
iface.tcp_wsf = 0
iface.tcp_timer_scale = 0
iface.tcp_timestamp = <empty>
iface.redirect = <empty>
iface.def_task_mgmt_timeout = 0
iface.header_digest = <empty>
iface.data_digest = <empty>
iface.immediate_data = <empty>
iface.initial_r2t = <empty>
iface.data_seq_inorder = <empty>
iface.data_pdu_inorder = <empty>
iface.erl = 0
iface.max_receive_data_len = 0
iface.first_burst_len = 0
iface.max_outstanding_r2t = 0
iface.max_burst_len = 0
iface.chap_auth = <empty>
iface.bidi_chap = <empty>
iface.strict_login_compliance = <empty>
iface.discovery_auth = <empty>
iface.discovery_logout = <empty>
node.discovery_address = 10.11.12.6
node.discovery_port = 3260
node.discovery_type = send_targets
node.session.initial_cmdsn = 0
node.session.initial_login_retry_max = 8
node.session.xmit_thread_priority = -20
node.session.cmds_max = 128
node.session.queue_depth = 8
node.session.nr_sessions = 1
node.session.auth.authmethod = None
node.session.auth.username = <empty>
node.session.auth.password = <empty>
node.session.auth.username_in = <empty>
node.session.auth.password_in = <empty>
node.session.timeo.replacement_timeout = 120
node.session.err_timeo.abort_timeout = 15
node.session.err_timeo.lu_reset_timeout = 30
node.session.err_timeo.tgt_reset_timeout = 30
node.session.err_timeo.host_reset_timeout = 60
node.session.iscsi.FastAbort = Yes
node.session.iscsi.InitialR2T = No
node.session.iscsi.ImmediateData = Yes
node.session.iscsi.FirstBurstLength = 262144
node.session.iscsi.MaxBurstLength = 16776192
node.session.iscsi.DefaultTime2Retain = 0
node.session.iscsi.DefaultTime2Wait = 2
node.session.iscsi.MaxConnections = 1
node.session.iscsi.MaxOutstandingR2T = 1
node.session.iscsi.ERL = 0
node.conn[0].address = 10.11.12.6
node.conn[0].port = 3260
node.conn[0].startup = manual
node.conn[0].tcp.window_size = 524288
node.conn[0].tcp.type_of_service = 0
node.conn[0].timeo.logout_timeout = 15
node.conn[0].timeo.login_timeout = 15
node.conn[0].timeo.auth_timeout = 45
node.conn[0].timeo.noop_out_interval = 5
node.conn[0].timeo.noop_out_timeout = 5
node.conn[0].iscsi.MaxXmitDataSegmentLength = 0
node.conn[0].iscsi.MaxRecvDataSegmentLength = 262144
node.conn[0].iscsi.HeaderDigest = CRC32C,None
node.conn[0].iscsi.DataDigest = CRC32C,None
node.conn[0].iscsi.IFMarker = No
node.conn[0].iscsi.OFMarker = No
# END RECORD
# BEGIN RECORD 2.0-874
node.name = iqn.2000-09.com.fujitsu:storage-system.eternus-dxl:cm1ca0p0
node.tpgt = 3
node.startup = automatic
node.leading_login = No
iface.hwaddress = <empty>
iface.ipaddress = <empty>
iface.iscsi_ifacename = default
iface.net_ifacename = <empty>
iface.gateway = <empty>
iface.subnet_mask = <empty>
iface.transport_name = tcp
iface.initiatorname = <empty>
iface.state = <empty>
iface.vlan_id = 0
iface.vlan_priority = 0
iface.vlan_state = <empty>
iface.iface_num = 0
iface.mtu = 0
iface.port = 0
iface.bootproto = <empty>
iface.dhcp_alt_client_id_state = <empty>
iface.dhcp_alt_client_id = <empty>
iface.dhcp_dns = <empty>
iface.dhcp_learn_iqn = <empty>
iface.dhcp_req_vendor_id_state = <empty>
iface.dhcp_vendor_id_state = <empty>
iface.dhcp_vendor_id = <empty>
iface.dhcp_slp_da = <empty>
iface.fragmentation = <empty>
iface.gratuitous_arp = <empty>
iface.incoming_forwarding = <empty>
iface.tos_state = <empty>
iface.tos = 0
iface.ttl = 0
iface.delayed_ack = <empty>
iface.tcp_nagle = <empty>
iface.tcp_wsf_state = <empty>
iface.tcp_wsf = 0
iface.tcp_timer_scale = 0
iface.tcp_timestamp = <empty>
iface.redirect = <empty>
iface.def_task_mgmt_timeout = 0
iface.header_digest = <empty>
iface.data_digest = <empty>
iface.immediate_data = <empty>
iface.initial_r2t = <empty>
iface.data_seq_inorder = <empty>
iface.data_pdu_inorder = <empty>
iface.erl = 0
iface.max_receive_data_len = 0
iface.first_burst_len = 0
iface.max_outstanding_r2t = 0
iface.max_burst_len = 0
iface.chap_auth = <empty>
iface.bidi_chap = <empty>
iface.strict_login_compliance = <empty>
iface.discovery_auth = <empty>
iface.discovery_logout = <empty>
node.discovery_address = 10.11.12.6
node.discovery_port = 3260
node.discovery_type = send_targets
node.session.initial_cmdsn = 0
node.session.initial_login_retry_max = 8
node.session.xmit_thread_priority = -20
node.session.cmds_max = 128
node.session.queue_depth = 8
node.session.nr_sessions = 1
node.session.auth.authmethod = None
node.session.auth.username = <empty>
node.session.auth.password = <empty>
node.session.auth.username_in = <empty>
node.session.auth.password_in = <empty>
node.session.timeo.replacement_timeout = 120
node.session.err_timeo.abort_timeout = 15
node.session.err_timeo.lu_reset_timeout = 30
node.session.err_timeo.tgt_reset_timeout = 30
node.session.err_timeo.host_reset_timeout = 60
node.session.iscsi.FastAbort = Yes
node.session.iscsi.InitialR2T = No
node.session.iscsi.ImmediateData = Yes
node.session.iscsi.FirstBurstLength = 262144
node.session.iscsi.MaxBurstLength = 16776192
node.session.iscsi.DefaultTime2Retain = 0
node.session.iscsi.DefaultTime2Wait = 2
node.session.iscsi.MaxConnections = 1
node.session.iscsi.MaxOutstandingR2T = 1
node.session.iscsi.ERL = 0
node.conn[0].address = 10.11.14.6
node.conn[0].port = 3260
node.conn[0].startup = manual
node.conn[0].tcp.window_size = 524288
node.conn[0].tcp.type_of_service = 0
node.conn[0].timeo.logout_timeout = 15
node.conn[0].timeo.login_timeout = 15
node.conn[0].timeo.auth_timeout = 45
node.conn[0].timeo.noop_out_interval = 5
node.conn[0].timeo.noop_out_timeout = 5
node.conn[0].iscsi.MaxXmitDataSegmentLength = 0
node.conn[0].iscsi.MaxRecvDataSegmentLength = 262144
node.conn[0].iscsi.HeaderDigest = CRC32C,None
node.conn[0].iscsi.DataDigest = CRC32C,None
node.conn[0].iscsi.IFMarker = No
node.conn[0].iscsi.OFMarker = No
# END RECORD

#1031131#10
Date:
2023-02-14 12:04:47 UTC
From:
To:
Hello Migo,

Connection dropped and re-established. Looks normal.

Conn 2:0 recovered but the earlier error it gave looks misleading to
me. What target controller you have ? What there a failover across the
nodes ? How does the target handle the transition period to initiator
queries ?

Same logs again.

This all looks normal to me. If you want the errors to be suppressed,
you can increase the timeouts. On the other hand, otherwise, you could
also use dm-multipath on top, like you have shown below.

Looks all correct to me. You already are using feature queue_if_no_path

Good. As I expected, you do have a 2 node target controller setup.

Okay!! What behavior do you expect ? What anomaly do you see with the
iSCSI initiator in Debian ?

#1031131#15
Date:
2023-02-15 18:21:38 UTC
From:
To:
Hello Ritesh Raj Sarraf,

Yes I understand that message but WHY this happens, what can be the cause
for this? Link is stable and no errors are reported on ifaces connected to
SAN switch:

ens1f0: flags=4163<UP,BROADCAST,RUNNING,MULTICAST>  mtu 9000
        inet 10.11.12.17  netmask 255.255.255.0  broadcast 10.11.12.255
        inet6 fe80::9640:c9ff:fe5f:9172  prefixlen 64  scopeid 0x20<link>
        ether 94:40:c9:5f:91:72  txqueuelen 1000  (Ethernet)
        RX packets 184466309231  bytes 146589180968929 (133.3 TiB)
        RX errors 0  dropped 0  overruns 0  frame 0
        TX packets 86121730517  bytes 284905539017306 (259.1 TiB)
        TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0
        device interrupt 102  memory 0xa5820000-a583ffff

ens1f1: flags=4163<UP,BROADCAST,RUNNING,MULTICAST>  mtu 9000
        inet 10.11.14.17  netmask 255.255.255.0  broadcast 10.11.14.255
        inet6 fe80::9640:c9ff:fe5f:9173  prefixlen 64  scopeid 0x20<link>
        ether 94:40:c9:5f:91:73  txqueuelen 1000  (Ethernet)
        RX packets 114453605386  bytes 85545283479332 (77.8 TiB)
        RX errors 0  dropped 0  overruns 0  frame 0
        TX packets 236069263519  bytes 283187128321034 (257.5 TiB)
        TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0
        device interrupt 132  memory 0xa5800000-a581ffff

yes, two paths with failover

What can cause this errors? I've multipath setup, but I want to find root
cause of this error messages.
think this is not normal/standard behaviour.

This is log from past 1h:

[Wed Feb 15 18:17:18 2023]  connection1:0: pdu (op 0x5 itt 0xd0000057)
rejected. Reason code 0x9
[Wed Feb 15 18:42:10 2023]  connection1:0: pdu (op 0x5 itt 0xd000000c)
rejected. Reason code 0x9
[Wed Feb 15 18:47:23 2023]  connection1:0: pdu (op 0x5 itt 0xd000003f)
rejected. Reason code 0x9
[Wed Feb 15 19:05:12 2023]  connection1:0: ping timeout of 5 secs expired,
recv timeout 5, last rx 19183609844, last ping 19183611097, now 19183612352
[Wed Feb 15 19:05:12 2023]  connection1:0: detected conn error (1022)
[Wed Feb 15 19:05:12 2023] sd 1:0:0:27: [sdbe] tag#23 FAILED Result:
hostbyte=DID_TRANSPORT_DISRUPTED driverbyte=DRIVER_OK
[Wed Feb 15 19:05:12 2023] sd 1:0:0:27: [sdbe] tag#23 CDB: Test Unit Ready
00 00 00 00 00 00
[Wed Feb 15 19:05:12 2023] sd 1:0:0:24: [sdax] tag#12 FAILED Result:
hostbyte=DID_TRANSPORT_DISRUPTED driverbyte=DRIVER_OK
[Wed Feb 15 19:05:12 2023] sd 1:0:0:24: [sdax] tag#12 CDB: Test Unit Ready
00 00 00 00 00 00
[Wed Feb 15 19:05:12 2023] sd 1:0:0:0: [sdc] tag#4 FAILED Result:
hostbyte=DID_TRANSPORT_DISRUPTED driverbyte=DRIVER_OK
[Wed Feb 15 19:05:12 2023] sd 1:0:0:0: [sdc] tag#4 CDB: Test Unit Ready 00
00 00 00 00 00
[Wed Feb 15 19:05:12 2023] sd 1:0:0:4: [sdj] tag#66 FAILED Result:
hostbyte=DID_TRANSPORT_DISRUPTED driverbyte=DRIVER_OK
[Wed Feb 15 19:05:12 2023] sd 1:0:0:4: [sdj] tag#66 CDB: Test Unit Ready 00
00 00 00 00 00
[Wed Feb 15 19:05:12 2023] sd 1:0:0:5: [sdl] tag#3 FAILED Result:
hostbyte=DID_TRANSPORT_DISRUPTED driverbyte=DRIVER_OK
[Wed Feb 15 19:05:12 2023] sd 1:0:0:5: [sdl] tag#3 CDB: Test Unit Ready 00
00 00 00 00 00
[Wed Feb 15 19:05:12 2023] sd 1:0:0:16: [sdah] tag#7 FAILED Result:
hostbyte=DID_TRANSPORT_DISRUPTED driverbyte=DRIVER_OK
[Wed Feb 15 19:05:12 2023] sd 1:0:0:16: [sdah] tag#7 CDB: Test Unit Ready
00 00 00 00 00 00
[Wed Feb 15 19:05:12 2023] sd 1:0:0:11: [sdx] tag#5 FAILED Result:
hostbyte=DID_TRANSPORT_DISRUPTED driverbyte=DRIVER_OK
[Wed Feb 15 19:05:12 2023] sd 1:0:0:11: [sdx] tag#5 CDB: Test Unit Ready 00
00 00 00 00 00
[Wed Feb 15 19:05:12 2023] sd 1:0:0:19: [sdan] tag#8 FAILED Result:
hostbyte=DID_TRANSPORT_DISRUPTED driverbyte=DRIVER_OK
[Wed Feb 15 19:05:12 2023] sd 1:0:0:19: [sdan] tag#8 CDB: Test Unit Ready
00 00 00 00 00 00
[Wed Feb 15 19:05:12 2023] sd 1:0:0:3: [sdh] tag#65 FAILED Result:
hostbyte=DID_TRANSPORT_DISRUPTED driverbyte=DRIVER_OK
[Wed Feb 15 19:05:12 2023] sd 1:0:0:3: [sdh] tag#65 CDB: Test Unit Ready 00
00 00 00 00 00
[Wed Feb 15 19:05:12 2023] sd 1:0:0:15: [sdaf] tag#6 FAILED Result:
hostbyte=DID_TRANSPORT_DISRUPTED driverbyte=DRIVER_OK
[Wed Feb 15 19:05:12 2023] sd 1:0:0:15: [sdaf] tag#6 CDB: Test Unit Ready
00 00 00 00 00 00
[Wed Feb 15 19:13:26 2023]  connection1:0: pdu (op 0x5 itt 0xe000005c)
rejected. Reason code 0x9
[Wed Feb 15 19:16:52 2023]  connection1:0: pdu (op 0x5 itt 0xe000003e)
rejected. Reason code 0x9

What does this error mean: connection1:0: pdu (op 0x5 itt 0xe000003e)
rejected. Reason code 0x9

This one repeats very often in logs.

Thank you very much for your assistance!

Kind regards,
Milan

P.S. Sorry for a later response, gmail clasified this email as a SPAM and
I've discovered it only recently. :(

#1031131#20
Date:
2023-02-16 13:16:37 UTC
From:
To:
Hello,

That is mostly to analyze on the target side or on the network, to
start with.

Usually, in a multinode clustered target seutp, when a takeover
happens, connecitons from node1 will be dropped for a moment while it
is taken over by node2.

I don't know nothing about the Fujitsu target so I cannot generalize.

Thanks for sharing this. So as you mentioned there's no issue on the
network side.

So did you trigger a cluster takeover ? My guess is it is your target
initiating the connection drops, while taking over to the other node.

How a target behaves during the transition is left to the target. The
initiator will keep querying for recovery, until either it times out or
recovers.

I don't know what is causing to trigger those errors on your setup.

Please do note that you have different layers mixed to build that
storage solution


* TCP/IP
* iSCSI
* SCSI
* Device Mapper Multipath

Every layer has errors that it treats differently. For example, a SCSI
error is perfectly suppressed by the DM-Multipath layer depending on
how it is setup, to the upper layers.

Errors like:

* connection drops
* iscsi session drops/terminations
* SCSI errors
* multipath path checker errors

All these will be errors which will be recovered eventually. That is
why we have the need for close integration in between these layers,
when building a storage solution on top.



Note: These days I only have a software LIO target to test/play with,
where I have not seen any real issues/errors. How each SAN Target
behaves is something highly specific to the target, in your case the
Fujitsu target.

#1031131#25
Date:
2023-02-20 08:51:17 UTC
From:
To:
Hello,
There was no intentional cluster takeover. This happens during normal
operation.

recovery seems to work fine. I've tested it by disconnecting one path to
target and all was OK, second path was used and when first one recovered is
switched to that one over.
I can understand this is very complex problem, what do you suggest me to
debug this issues? We have 5 host connected to Fujitsu target, but errors
are only on those two which are running KVM guests. Other servers running
mail hosts and camera surveillance are not affected it seems. They are
Debian systems but not running multipath, so I've made test:

So far I've disabled all KVM guests on one (of those two KVM hosts used fro
virtualization) host and mounted  one KVM guest file system locally (/mnt)
and performed stress test on that mount:

root@virt1n:/mnt# stress-ng --ag -v -d 1
stress-ng: debug: [1186892] 256 processors online, 256 processors configured
stress-ng: info:  [1186892] defaulting to a 86400 second (1 day, 0.00 secs)
run per stressor
stress-ng: info:  [1186892] dispatching hogs: 1 hdd
stress-ng: debug: [1186892] cache allocate: default cache size: 16384K
stress-ng: debug: [1186892] starting stressors
stress-ng: debug: [1186892] 1 stressor started
stress-ng: debug: [1186893] stress-ng-hdd: started [1186893] (instance 0)

run this test for more than 12h and there were no errors in logs and no
path takeovers. Here is dmesg:

[Sat Feb 18 20:25:17 2023] EXT4-fs (dm-37): warning: checktime reached,
running e2fsck is recommended
[Sat Feb 18 20:25:17 2023] EXT4-fs (dm-37): mounted filesystem with ordered
data mode. Opts: (null)
[Sat Feb 18 20:25:37 2023] usb 1-1: USB disconnect, device number 2
[Sat Feb 18 20:26:21 2023] perf: interrupt took too long (2955 > 2500),
lowering kernel.perf_event_max_sample_rate to 67500
[Sat Feb 18 20:32:22 2023] perf: interrupt took too long (3711 > 3693),
lowering kernel.perf_event_max_sample_rate to 53750
[Sat Feb 18 20:34:46 2023] perf: interrupt took too long (5054 > 4638),
lowering kernel.perf_event_max_sample_rate to 39500
[Sat Feb 18 21:31:55 2023] perf: interrupt took too long (6810 > 6317),
lowering kernel.perf_event_max_sample_rate to 29250
[Sun Feb 19 07:17:14 2023] INFO: NMI handler (perf_ibs_nmi_handler) took
too long to run: 2.171 msecs
[Sun Feb 19 07:17:14 2023] perf: interrupt took too long (20455 > 8512),
lowering kernel.perf_event_max_sample_rate to 9750
root@virt1n:/mnt#

and some iface and disk i/o graphs are attached too.

When I let run one KVM guest (Debian) on this host system following errors
are in dmesg:

[Mon Feb 20 09:49:39 2023]  connection1:0: pdu (op 0x5 itt 0x53) rejected.
Reason code 0x9
[Mon Feb 20 09:49:41 2023]  connection1:0: pdu (op 0x5 itt 0x7c) rejected.
Reason code 0x9
[Mon Feb 20 09:49:41 2023]  connection1:0: pdu (op 0x5 itt 0x51) rejected.
Reason code 0x9
[Mon Feb 20 09:50:07 2023]  connection1:0: pdu (op 0x5 itt 0x33) rejected.
Reason code 0x9

Guest is using Virtio disk drivers (vda) so I've switched it to generic
SATA (sda) for this guest, but errors in log remains.

It seem that KVM is triggering this errors and make our NAS unstable.

What can KVM do differently? It is using /dev/mapper/dm-XX as disk devices
so no direct iscsi access.

Any ideas what should I try next?

This is very complex ecosystem. I know that error reporting is good think
:) and helping out to troubleshoot problems. But when everything is all
right there should be no errors, right?
Are you running KVM virtualization atop of your SAN target?

Thank you, kind regards,
Milan

#1031131#30
Date:
2023-02-23 12:35:45 UTC
From:
To:
Hello Milan,

By KVM, do you mean just pure KVM ? Or the management suite too,
libvirt ?

I'm afraid I do not have much direct hints for you here. given that
this issue does not happen when KVM is not involved, would imply that
the erratic behavior originates from that software's integration.

In an ideal scenario, yes, there will be no errors. But on a SAN setup,
cluster failovers are a feature of the SAN target, and as such during
that transition some errors are expected on the initiator, which are
eventually recovered.

Recovery is the critical part here. When states do not recover to
normal, it is an error; either the target or the initiator. Or even the
middleman (network) at times.

My LIO target runs in a KVM guest. So does the iSCSI initiator too.

#1031131#35
Date:
2023-02-23 13:19:08 UTC
From:
To:
Hello,

Yes, libvirt is used to manage KVM, and virt-manager on my desktop to
connect to it. It is simple setup without clustering.
Do you know someone who can help with this? Here is example of KVM guest
running configuration:

libvirt+ 244533      1  5 Feb03 ?        1-03:54:37 qemu-system-x86_64
-enable-kvm -name guest=me_test,process=qemu:me_test,debug-threads=on -S
-object
secret,id=masterKey0,format=raw,file=/var/lib/libvirt/qemu/domain-108-me_test/master-key.aes
-machine pc-1.1,accel=kvm,usb=off,dump-guest-core=off -cpu host -m 8096
-realtime mlock=off -smp 4,sockets=1,cores=4,threads=1 -uuid
1591f345-96b5-4077-9d32-b2991003753d -no-user-config -nodefaults -chardev
socket,id=charmonitor,fd=57,server,nowait -mon
chardev=charmonitor,id=monitor,mode=control -rtc base=utc -no-shutdown
-boot strict=on -device piix3-usb-uhci,id=usb,bus=pci.0,addr=0x1.0x2
-device virtio-serial-pci,id=virtio-serial0,bus=pci.0,addr=0x4 -drive
if=none,id=drive-ide0-1-0,readonly=on -device
ide-cd,bus=ide.1,unit=0,drive=drive-ide0-1-0,id=ide0-1-0,bootindex=1 -drive
file=/dev/mapper/me_test,format=raw,if=none,id=drive-virtio-disk0,cache=none,discard=unmap,aio=native
-device
virtio-blk-pci,scsi=off,bus=pci.0,addr=0x6,drive=drive-virtio-disk0,id=virtio-disk0,bootindex=2,write-cache=on
-netdev tap,fd=60,id=hostnet0,vhost=on,vhostfd=61 -device
virtio-net-pci,netdev=hostnet0,id=net0,mac=aa:bb:cc:00:10:31,bus=pci.0,addr=0x3
-chardev pty,id=charserial0 -device
isa-serial,chardev=charserial0,id=serial0 -chardev
spicevmc,id=charchannel0,name=vdagent -device
virtserialport,bus=virtio-serial0.0,nr=1,chardev=charchannel0,id=channel0,name=com.redhat.spice.0
-spice port=5929,addr=0.0.0.0,disable-ticketing,seamless-migration=on
-device
qxl-vga,id=video0,ram_size=67108864,vram_size=67108864,vram64_size_mb=0,vgamem_mb=16,max_outputs=1,bus=pci.0,addr=0x2
-device virtio-balloon-pci,id=balloon0,bus=pci.0,addr=0x5 -sandbox
on,obsolete=deny,elevateprivileges=deny,spawn=deny,resourcecontrol=deny
-msg timestamp=on

Maybe there is something undesirable for the iscsi target.

This part seems to work OK, so far no data loss was detected.

Pure KVM or libvirtd too?

Thank you, kind regards

Milan

#1031131#40
Date:
2023-02-23 13:44:33 UTC
From:
To:
Thanks for confirming.

You should start with the hypervisor where libvirtd is running. Check
if that hypervisor host is running normal.

From all the symptoms you've shared, my wild arrow in the dark is that
your hypervisor/guest network may be running into issues.

Seems like you are using host networking ? Have you validated that that
part is working reliably ?

From your initial logs, mpath reported 'io pending' on one of the
paths. I'm not sure if I'd call that fully recovered.

What if you down the other (healthy) path ? Does IO progress on the
initial (io pending) path ?

libvirtd it is. The same hypervisor with libvirtd running, hosts the
initiator and the target, in different VMs.

#1031131#45
Date:
2023-02-23 13:47:15 UTC
From:
To:
Also, what 10 GiB ethernet do you run on your hypervisor ? What
make/model ? Is that known to work reliably ?

My experience from around some years ago, 10 GiB ethernet support was
still budding. But things may have changed lately. I just don't have
access to those things no more.

#1031131#50
Date:
2023-02-23 14:12:30 UTC
From:
To:
Also, on my test setup, my LIO target is a standard one. No multinode
cluster. It is a single-node target.

The way I trigger path failovers is by bringing down individual network
interfaces. That serves well enough for my initiator focused tests.

#1031131#55
Date:
2023-02-27 19:26:35 UTC
From:
To:
hello Sarraf,

No unusual errors reported.
There is all ok with network.

I've made some experiments with storage parameters passed to KVM, standard
disk created within virt-manager is running with following paramaters:

format=raw,cache=none,discard=unmap,aio=native.

I've removed cache  parameter (let hypervisor set its default) and now
running disk with:
format=raw,discard=unmap.

With this setup, there are no errors (connection1:0: pdu (op 0x5 itt 0x48)
rejected. Reason code 0x9) in logs. Virtual host was stressed with:

ns2:/home/migo# stress-ng -v -d 1
stress-ng: debug: [858] 2 processors online, 2 processors configured
stress-ng: info:  [858] defaulting to a 86400 second (1 day, 0.00 secs) run
per stressor
stress-ng: info:  [858] dispatching hogs: 1 hdd
stress-ng: debug: [858] cache allocate: default cache size: 262144K
stress-ng: debug: [858] starting stressors
stress-ng: debug: [858] 1 stressor started
stress-ng: debug: [859] stress-ng-hdd: started [859] (instance 0)
^Cstress-ng: debug: [859] stress-ng-hdd: exited [859] (instance 0)
stress-ng: debug: [858] process [859] terminated
stress-ng: info:  [858] successful run completed in 30377.75s (8 hours, 26
mins, 17.75 secs)
stress-ng: debug: [858] metrics-check: all stressor metrics validated and
sane

It seems that using hypervisors cache is filtering those bogus commands
(causing trouble for target) sent to the target. What do you think about it?

But running iscsi target with cache is not a good idea, and this is no real
solution. :(

It recovers within 10-15s and path reporting 'io pending' recovers to
normal operation back.

I've tested disconnecting the link from the first path and second path
immediately taken over all operations. Reconnecting that path and all
operations moved to the original path with higher priority.

test-app (3600000e00d2c0000002c1772001f0000) dm-97 FUJITSU,ETERNUS_DXL
size=500G features='1 queue_if_no_path' hwhandler='1 alua' wp=rw
|-+- policy='round-robin 0' prio=50 status=active
| `- 1:0:0:28 sdar 66:176 active ready running
`-+- policy='round-robin 0' prio=10 status=enabled
  `- 2:0:0:28 sdbj 67:208 active ready running

Thank you, kind regards,

Milan