#1031131 open-iscsi: Lot of iscsi/kernel errors in dmesg with Fujitsu Eternus DX100S4 connected with 2 10Gb ethernet paths with multipathd. #1031131
- Package:
- open-iscsi
- Source:
- open-iscsi
- Description:
- iSCSI initiator tools
- Submitter:
- migo
- Date:
- 2023-02-27 19:27:07 UTC
- Severity:
- normal
Dear Maintainer, I want to apologize if this is not the right place to report this issue, but I need somewhere to start. I've lot of this error messages ind dmesg: Feb 09 10:56:05 virt2n kernel: connection2:0: detected conn error (1015) Feb 09 10:56:05 virt2n iscsid[2790]: connection2:0 is operational after recovery (1 attempts) Feb 09 10:56:05 virt2n iscsid[2790]: connection1:0 is operational after recovery (1 attempts) Feb 09 10:56:05 virt2n iscsid[2790]: Kernel reported iSCSI connection 1:0 error (1015 - ISCSI_ERR_DATA_DGST: Data digest mismatch) state (3) Feb 09 10:56:05 virt2n iscsid[2790]: Kernel reported iSCSI connection 2:0 error (1015 - ISCSI_ERR_DATA_DGST: Data digest mismatch) state (3) Feb 09 10:56:10 virt2n iscsid[2790]: connection1:0 is operational after recovery (1 attempts) Feb 09 10:56:10 virt2n iscsid[2790]: connection2:0 is operational after recovery (1 attempts) Feb 09 10:56:28 virt2n kernel: connection1:0: detected conn error (1015) Feb 09 10:56:28 virt2n kernel: connection2:0: detected conn error (1015) Feb 09 10:56:28 virt2n iscsid[2790]: Kernel reported iSCSI connection 1:0 error (1015 - ISCSI_ERR_DATA_DGST: Data digest mismatch) state (3) Feb 09 10:56:28 virt2n iscsid[2790]: Kernel reported iSCSI connection 2:0 error (1015 - ISCSI_ERR_DATA_DGST: Data digest mismatch) state (3) Feb 09 10:56:33 virt2n kernel: connection1:0: detected conn error (1015) Feb 09 10:56:33 virt2n kernel: connection2:0: detected conn error (1015) Feb 09 10:56:33 virt2n iscsid[2790]: connection2:0 is operational after recovery (1 attempts) Feb 09 10:56:33 virt2n iscsid[2790]: connection1:0 is operational after recovery (1 attempts) Feb 09 10:56:33 virt2n iscsid[2790]: Kernel reported iSCSI connection 1:0 error (1015 - ISCSI_ERR_DATA_DGST: Data digest mismatch) state (3) Feb 09 10:56:33 virt2n iscsid[2790]: Kernel reported iSCSI connection 2:0 error (1015 - ISCSI_ERR_DATA_DGST: Data digest mismatch) state (3) Feb 09 10:56:38 virt2n kernel: connection2:0: detected conn error (1015) Feb 09 10:56:38 virt2n kernel: connection1:0: detected conn error (1015) Feb 09 10:56:38 virt2n iscsid[2790]: connection1:0 is operational after recovery (1 attempts) Feb 09 10:56:38 virt2n iscsid[2790]: connection2:0 is operational after recovery (1 attempts) Feb 09 10:56:38 virt2n iscsid[2790]: Kernel reported iSCSI connection 2:0 error (1015 - ISCSI_ERR_DATA_DGST: Data digest mismatch) state (3) Feb 09 10:56:38 virt2n iscsid[2790]: Kernel reported iSCSI connection 1:0 error (1015 - ISCSI_ERR_DATA_DGST: Data digest mismatch) state (3) Feb 09 10:56:43 virt2n iscsid[2790]: connection2:0 is operational after recovery (1 attempts) Feb 09 10:56:43 virt2n iscsid[2790]: connection1:0 is operational after recovery (1 attempts) Feb 09 10:56:43 virt2n kernel: connection1:0: detected conn error (1015) Feb 09 10:56:43 virt2n kernel: connection2:0: detected conn error (1015) Feb 09 10:56:44 virt2n iscsid[2790]: Kernel reported iSCSI connection 1:0 error (1015 - ISCSI_ERR_DATA_DGST: Data digest mismatch) state (3) Feb 09 10:56:44 virt2n iscsid[2790]: Kernel reported iSCSI connection 2:0 error (1015 - ISCSI_ERR_DATA_DGST: Data digest mismatch) state (3) Feb 09 10:56:49 virt2n iscsid[2790]: connection1:0 is operational after recovery (1 attempts) Feb 09 10:56:49 virt2n iscsid[2790]: connection2:0 is operational after recovery (1 attempts) Feb 09 10:57:00 virt2n kernel: connection1:0: detected conn error (1015) or enventually this: Feb 10 20:39:39 virt2n kernel: connection1:0: ping timeout of 5 secs expired, recv timeout 5, last rx 19077132455, last ping 19077133712, now 19077134976 Feb 10 20:39:39 virt2n kernel: connection1:0: detected conn error (1022) Feb 10 20:39:39 virt2n kernel: scsi_io_completion_action: 11 callbacks suppressed Feb 10 20:39:39 virt2n kernel: sd 1:0:0:5: [sdl] tag#122 FAILED Result: hostbyte=DID_TRANSPORT_DISRUPTED driverbyte=DRIVER_OK Feb 10 20:39:39 virt2n kernel: sd 1:0:0:5: [sdl] tag#122 CDB: Test Unit Ready 00 00 00 00 00 00 Feb 10 20:39:39 virt2n kernel: sd 1:0:0:16: [sdah] tag#103 FAILED Result: hostbyte=DID_TRANSPORT_DISRUPTED driverbyte=DRIVER_OK Feb 10 20:39:39 virt2n kernel: sd 1:0:0:16: [sdah] tag#103 CDB: Test Unit Ready 00 00 00 00 00 00 Feb 10 20:39:39 virt2n kernel: sd 1:0:0:24: [sdax] tag#115 FAILED Result: hostbyte=DID_TRANSPORT_DISRUPTED driverbyte=DRIVER_OK Feb 10 20:39:39 virt2n kernel: sd 1:0:0:24: [sdax] tag#115 CDB: Test Unit Ready 00 00 00 00 00 00 Feb 10 20:39:39 virt2n kernel: sd 1:0:0:29: [sdbi] tag#118 FAILED Result: hostbyte=DID_TRANSPORT_DISRUPTED driverbyte=DRIVER_OK Feb 10 20:39:39 virt2n kernel: sd 1:0:0:29: [sdbi] tag#118 CDB: Test Unit Ready 00 00 00 00 00 00 Feb 10 20:39:39 virt2n kernel: sd 1:0:0:21: [sdaq] tag#108 FAILED Result: hostbyte=DID_TRANSPORT_DISRUPTED driverbyte=DRIVER_OK Feb 10 20:39:39 virt2n kernel: sd 1:0:0:21: [sdaq] tag#108 CDB: Test Unit Ready 00 00 00 00 00 00 Feb 10 20:39:39 virt2n kernel: sd 1:0:0:2: [sdf] tag#106 FAILED Result: hostbyte=DID_TRANSPORT_DISRUPTED driverbyte=DRIVER_OK Feb 10 20:39:39 virt2n kernel: sd 1:0:0:2: [sdf] tag#106 CDB: Test Unit Ready 00 00 00 00 00 00 Feb 10 20:39:39 virt2n kernel: sd 1:0:0:11: [sdx] tag#101 FAILED Result: hostbyte=DID_TRANSPORT_DISRUPTED driverbyte=DRIVER_OK Feb 10 20:39:39 virt2n kernel: sd 1:0:0:11: [sdx] tag#101 CDB: Test Unit Ready 00 00 00 00 00 00 Feb 10 20:39:39 virt2n kernel: sd 1:0:0:10: [sdv] tag#100 FAILED Result: hostbyte=DID_TRANSPORT_DISRUPTED driverbyte=DRIVER_OK Feb 10 20:39:39 virt2n kernel: sd 1:0:0:10: [sdv] tag#100 CDB: Test Unit Ready 00 00 00 00 00 00 Feb 10 20:39:39 virt2n kernel: sd 1:0:0:1: [sdd] tag#99 FAILED Result: hostbyte=DID_TRANSPORT_DISRUPTED driverbyte=DRIVER_OK Feb 10 20:39:39 virt2n kernel: sd 1:0:0:1: [sdd] tag#99 CDB: Test Unit Ready 00 00 00 00 00 00 Feb 10 20:39:39 virt2n kernel: sd 1:0:0:26: [sdbb] tag#116 FAILED Result: hostbyte=DID_TRANSPORT_DISRUPTED driverbyte=DRIVER_OK Feb 10 20:39:39 virt2n kernel: sd 1:0:0:26: [sdbb] tag#116 CDB: Test Unit Ready 00 00 00 00 00 00 Feb 10 20:39:39 virt2n iscsid[2790]: Kernel reported iSCSI connection 1:0 error (1022 - Invalid or unknown error code) state (3) Feb 10 20:39:42 virt2n kernel: connection1:0: pdu (op 0x5 itt 0xa000003c) rejected. Reason code 0x9 Feb 10 20:39:42 virt2n kernel: connection1:0: pdu (op 0x5 itt 0xa0000070) rejected. Reason code 0x9 Feb 10 20:39:42 virt2n iscsid[2790]: connection1:0 is operational after recovery (1 attempts) When this happens iscsi opration is interupted for few seconds, multipath -ll reports (this is only example, i/o pending is reported dor all 30 dm-s: web3v (3600000e00d2c0000002c177200120000) dm-52 FUJITSU,ETERNUS_DXL size=500G features='3 queue_if_no_path queue_mode mq' hwhandler='1 alua' wp=rw `-+- policy='service-time 0' prio=10 status=active |- 1:0:0:16 sdah 66:16 active i/o pending running `- 2:0:0:16 sdai 66:32 active ready running ais_app (3600000e00d2c0000002c1772001f0000) dm-99 FUJITSU,ETERNUS_DXL size=500G features='3 queue_if_no_path queue_mode mq' hwhandler='1 alua' wp=rw `-+- policy='service-time 0' prio=10 status=active |- 1:0:0:28 sdbh 67:176 active i/o pending running `- 2:0:0:28 sdbf 67:144 active ready running SAN consists of one Fujitsu DX100 S4 storage with two controllers connected to LAN switch with two 10Gb fibre links (one from controller), each link has its own VLAN configured. Errors reported ocures on virtualization host that are connected mutipath with two 10Gb fibre links to respective VLANs. Jumbo frames are enabled across the way. I'll add all neded info upun request. I've consulted this issue with Fujitsu representative and it seems we have all configured right and he advised me to contact Debian support. So I'm here and would kindly ask to point me the right direction. Thank you, Kind regards, Milan Here is actual iscsi runnig conf: root@virt2n:/home/migo# iscsiadm -m node -o show # BEGIN RECORD 2.0-874 node.name = iqn.2000-09.com.fujitsu:storage-system.eternus-dxl:cm1ca0p1 node.tpgt = 4 node.startup = automatic node.leading_login = No iface.hwaddress = <empty> iface.ipaddress = <empty> iface.iscsi_ifacename = default iface.net_ifacename = <empty> iface.gateway = <empty> iface.subnet_mask = <empty> iface.transport_name = tcp iface.initiatorname = <empty> iface.state = <empty> iface.vlan_id = 0 iface.vlan_priority = 0 iface.vlan_state = <empty> iface.iface_num = 0 iface.mtu = 0 iface.port = 0 iface.bootproto = <empty> iface.dhcp_alt_client_id_state = <empty> iface.dhcp_alt_client_id = <empty> iface.dhcp_dns = <empty> iface.dhcp_learn_iqn = <empty> iface.dhcp_req_vendor_id_state = <empty> iface.dhcp_vendor_id_state = <empty> iface.dhcp_vendor_id = <empty> iface.dhcp_slp_da = <empty> iface.fragmentation = <empty> iface.gratuitous_arp = <empty> iface.incoming_forwarding = <empty> iface.tos_state = <empty> iface.tos = 0 iface.ttl = 0 iface.delayed_ack = <empty> iface.tcp_nagle = <empty> iface.tcp_wsf_state = <empty> iface.tcp_wsf = 0 iface.tcp_timer_scale = 0 iface.tcp_timestamp = <empty> iface.redirect = <empty> iface.def_task_mgmt_timeout = 0 iface.header_digest = <empty> iface.data_digest = <empty> iface.immediate_data = <empty> iface.initial_r2t = <empty> iface.data_seq_inorder = <empty> iface.data_pdu_inorder = <empty> iface.erl = 0 iface.max_receive_data_len = 0 iface.first_burst_len = 0 iface.max_outstanding_r2t = 0 iface.max_burst_len = 0 iface.chap_auth = <empty> iface.bidi_chap = <empty> iface.strict_login_compliance = <empty> iface.discovery_auth = <empty> iface.discovery_logout = <empty> node.discovery_address = 10.11.12.6 node.discovery_port = 3260 node.discovery_type = send_targets node.session.initial_cmdsn = 0 node.session.initial_login_retry_max = 8 node.session.xmit_thread_priority = -20 node.session.cmds_max = 128 node.session.queue_depth = 8 node.session.nr_sessions = 1 node.session.auth.authmethod = None node.session.auth.username = <empty> node.session.auth.password = <empty> node.session.auth.username_in = <empty> node.session.auth.password_in = <empty> node.session.timeo.replacement_timeout = 120 node.session.err_timeo.abort_timeout = 15 node.session.err_timeo.lu_reset_timeout = 30 node.session.err_timeo.tgt_reset_timeout = 30 node.session.err_timeo.host_reset_timeout = 60 node.session.iscsi.FastAbort = Yes node.session.iscsi.InitialR2T = No node.session.iscsi.ImmediateData = Yes node.session.iscsi.FirstBurstLength = 262144 node.session.iscsi.MaxBurstLength = 16776192 node.session.iscsi.DefaultTime2Retain = 0 node.session.iscsi.DefaultTime2Wait = 2 node.session.iscsi.MaxConnections = 1 node.session.iscsi.MaxOutstandingR2T = 1 node.session.iscsi.ERL = 0 node.conn[0].address = 10.11.15.6 node.conn[0].port = 3260 node.conn[0].startup = manual node.conn[0].tcp.window_size = 524288 node.conn[0].tcp.type_of_service = 0 node.conn[0].timeo.logout_timeout = 15 node.conn[0].timeo.login_timeout = 15 node.conn[0].timeo.auth_timeout = 45 node.conn[0].timeo.noop_out_interval = 5 node.conn[0].timeo.noop_out_timeout = 5 node.conn[0].iscsi.MaxXmitDataSegmentLength = 0 node.conn[0].iscsi.MaxRecvDataSegmentLength = 262144 node.conn[0].iscsi.HeaderDigest = CRC32C,None node.conn[0].iscsi.DataDigest = CRC32C,None node.conn[0].iscsi.IFMarker = No node.conn[0].iscsi.OFMarker = No # END RECORD # BEGIN RECORD 2.0-874 node.name = iqn.2000-09.com.fujitsu:storage-system.eternus-dxl:cm0ca0p1 node.tpgt = 2 node.startup = automatic node.leading_login = No iface.hwaddress = <empty> iface.ipaddress = <empty> iface.iscsi_ifacename = default iface.net_ifacename = <empty> iface.gateway = <empty> iface.subnet_mask = <empty> iface.transport_name = tcp iface.initiatorname = <empty> iface.state = <empty> iface.vlan_id = 0 iface.vlan_priority = 0 iface.vlan_state = <empty> iface.iface_num = 0 iface.mtu = 0 iface.port = 0 iface.bootproto = <empty> iface.dhcp_alt_client_id_state = <empty> iface.dhcp_alt_client_id = <empty> iface.dhcp_dns = <empty> iface.dhcp_learn_iqn = <empty> iface.dhcp_req_vendor_id_state = <empty> iface.dhcp_vendor_id_state = <empty> iface.dhcp_vendor_id = <empty> iface.dhcp_slp_da = <empty> iface.fragmentation = <empty> iface.gratuitous_arp = <empty> iface.incoming_forwarding = <empty> iface.tos_state = <empty> iface.tos = 0 iface.ttl = 0 iface.delayed_ack = <empty> iface.tcp_nagle = <empty> iface.tcp_wsf_state = <empty> iface.tcp_wsf = 0 iface.tcp_timer_scale = 0 iface.tcp_timestamp = <empty> iface.redirect = <empty> iface.def_task_mgmt_timeout = 0 iface.header_digest = <empty> iface.data_digest = <empty> iface.immediate_data = <empty> iface.initial_r2t = <empty> iface.data_seq_inorder = <empty> iface.data_pdu_inorder = <empty> iface.erl = 0 iface.max_receive_data_len = 0 iface.first_burst_len = 0 iface.max_outstanding_r2t = 0 iface.max_burst_len = 0 iface.chap_auth = <empty> iface.bidi_chap = <empty> iface.strict_login_compliance = <empty> iface.discovery_auth = <empty> iface.discovery_logout = <empty> node.discovery_address = 10.11.12.6 node.discovery_port = 3260 node.discovery_type = send_targets node.session.initial_cmdsn = 0 node.session.initial_login_retry_max = 8 node.session.xmit_thread_priority = -20 node.session.cmds_max = 128 node.session.queue_depth = 8 node.session.nr_sessions = 1 node.session.auth.authmethod = None node.session.auth.username = <empty> node.session.auth.password = <empty> node.session.auth.username_in = <empty> node.session.auth.password_in = <empty> node.session.timeo.replacement_timeout = 120 node.session.err_timeo.abort_timeout = 15 node.session.err_timeo.lu_reset_timeout = 30 node.session.err_timeo.tgt_reset_timeout = 30 node.session.err_timeo.host_reset_timeout = 60 node.session.iscsi.FastAbort = Yes node.session.iscsi.InitialR2T = No node.session.iscsi.ImmediateData = Yes node.session.iscsi.FirstBurstLength = 262144 node.session.iscsi.MaxBurstLength = 16776192 node.session.iscsi.DefaultTime2Retain = 0 node.session.iscsi.DefaultTime2Wait = 2 node.session.iscsi.MaxConnections = 1 node.session.iscsi.MaxOutstandingR2T = 1 node.session.iscsi.ERL = 0 node.conn[0].address = 10.11.13.6 node.conn[0].port = 3260 node.conn[0].startup = manual node.conn[0].tcp.window_size = 524288 node.conn[0].tcp.type_of_service = 0 node.conn[0].timeo.logout_timeout = 15 node.conn[0].timeo.login_timeout = 15 node.conn[0].timeo.auth_timeout = 45 node.conn[0].timeo.noop_out_interval = 5 node.conn[0].timeo.noop_out_timeout = 5 node.conn[0].iscsi.MaxXmitDataSegmentLength = 0 node.conn[0].iscsi.MaxRecvDataSegmentLength = 262144 node.conn[0].iscsi.HeaderDigest = CRC32C,None node.conn[0].iscsi.DataDigest = CRC32C,None node.conn[0].iscsi.IFMarker = No node.conn[0].iscsi.OFMarker = No # END RECORD # BEGIN RECORD 2.0-874 node.name = iqn.2000-09.com.fujitsu:storage-system.eternus-dxl:cm0ca0p0 node.tpgt = 1 node.startup = automatic node.leading_login = No iface.hwaddress = <empty> iface.ipaddress = <empty> iface.iscsi_ifacename = default iface.net_ifacename = <empty> iface.gateway = <empty> iface.subnet_mask = <empty> iface.transport_name = tcp iface.initiatorname = <empty> iface.state = <empty> iface.vlan_id = 0 iface.vlan_priority = 0 iface.vlan_state = <empty> iface.iface_num = 0 iface.mtu = 0 iface.port = 0 iface.bootproto = <empty> iface.dhcp_alt_client_id_state = <empty> iface.dhcp_alt_client_id = <empty> iface.dhcp_dns = <empty> iface.dhcp_learn_iqn = <empty> iface.dhcp_req_vendor_id_state = <empty> iface.dhcp_vendor_id_state = <empty> iface.dhcp_vendor_id = <empty> iface.dhcp_slp_da = <empty> iface.fragmentation = <empty> iface.gratuitous_arp = <empty> iface.incoming_forwarding = <empty> iface.tos_state = <empty> iface.tos = 0 iface.ttl = 0 iface.delayed_ack = <empty> iface.tcp_nagle = <empty> iface.tcp_wsf_state = <empty> iface.tcp_wsf = 0 iface.tcp_timer_scale = 0 iface.tcp_timestamp = <empty> iface.redirect = <empty> iface.def_task_mgmt_timeout = 0 iface.header_digest = <empty> iface.data_digest = <empty> iface.immediate_data = <empty> iface.initial_r2t = <empty> iface.data_seq_inorder = <empty> iface.data_pdu_inorder = <empty> iface.erl = 0 iface.max_receive_data_len = 0 iface.first_burst_len = 0 iface.max_outstanding_r2t = 0 iface.max_burst_len = 0 iface.chap_auth = <empty> iface.bidi_chap = <empty> iface.strict_login_compliance = <empty> iface.discovery_auth = <empty> iface.discovery_logout = <empty> node.discovery_address = 10.11.12.6 node.discovery_port = 3260 node.discovery_type = send_targets node.session.initial_cmdsn = 0 node.session.initial_login_retry_max = 8 node.session.xmit_thread_priority = -20 node.session.cmds_max = 128 node.session.queue_depth = 8 node.session.nr_sessions = 1 node.session.auth.authmethod = None node.session.auth.username = <empty> node.session.auth.password = <empty> node.session.auth.username_in = <empty> node.session.auth.password_in = <empty> node.session.timeo.replacement_timeout = 120 node.session.err_timeo.abort_timeout = 15 node.session.err_timeo.lu_reset_timeout = 30 node.session.err_timeo.tgt_reset_timeout = 30 node.session.err_timeo.host_reset_timeout = 60 node.session.iscsi.FastAbort = Yes node.session.iscsi.InitialR2T = No node.session.iscsi.ImmediateData = Yes node.session.iscsi.FirstBurstLength = 262144 node.session.iscsi.MaxBurstLength = 16776192 node.session.iscsi.DefaultTime2Retain = 0 node.session.iscsi.DefaultTime2Wait = 2 node.session.iscsi.MaxConnections = 1 node.session.iscsi.MaxOutstandingR2T = 1 node.session.iscsi.ERL = 0 node.conn[0].address = 10.11.12.6 node.conn[0].port = 3260 node.conn[0].startup = manual node.conn[0].tcp.window_size = 524288 node.conn[0].tcp.type_of_service = 0 node.conn[0].timeo.logout_timeout = 15 node.conn[0].timeo.login_timeout = 15 node.conn[0].timeo.auth_timeout = 45 node.conn[0].timeo.noop_out_interval = 5 node.conn[0].timeo.noop_out_timeout = 5 node.conn[0].iscsi.MaxXmitDataSegmentLength = 0 node.conn[0].iscsi.MaxRecvDataSegmentLength = 262144 node.conn[0].iscsi.HeaderDigest = CRC32C,None node.conn[0].iscsi.DataDigest = CRC32C,None node.conn[0].iscsi.IFMarker = No node.conn[0].iscsi.OFMarker = No # END RECORD # BEGIN RECORD 2.0-874 node.name = iqn.2000-09.com.fujitsu:storage-system.eternus-dxl:cm1ca0p0 node.tpgt = 3 node.startup = automatic node.leading_login = No iface.hwaddress = <empty> iface.ipaddress = <empty> iface.iscsi_ifacename = default iface.net_ifacename = <empty> iface.gateway = <empty> iface.subnet_mask = <empty> iface.transport_name = tcp iface.initiatorname = <empty> iface.state = <empty> iface.vlan_id = 0 iface.vlan_priority = 0 iface.vlan_state = <empty> iface.iface_num = 0 iface.mtu = 0 iface.port = 0 iface.bootproto = <empty> iface.dhcp_alt_client_id_state = <empty> iface.dhcp_alt_client_id = <empty> iface.dhcp_dns = <empty> iface.dhcp_learn_iqn = <empty> iface.dhcp_req_vendor_id_state = <empty> iface.dhcp_vendor_id_state = <empty> iface.dhcp_vendor_id = <empty> iface.dhcp_slp_da = <empty> iface.fragmentation = <empty> iface.gratuitous_arp = <empty> iface.incoming_forwarding = <empty> iface.tos_state = <empty> iface.tos = 0 iface.ttl = 0 iface.delayed_ack = <empty> iface.tcp_nagle = <empty> iface.tcp_wsf_state = <empty> iface.tcp_wsf = 0 iface.tcp_timer_scale = 0 iface.tcp_timestamp = <empty> iface.redirect = <empty> iface.def_task_mgmt_timeout = 0 iface.header_digest = <empty> iface.data_digest = <empty> iface.immediate_data = <empty> iface.initial_r2t = <empty> iface.data_seq_inorder = <empty> iface.data_pdu_inorder = <empty> iface.erl = 0 iface.max_receive_data_len = 0 iface.first_burst_len = 0 iface.max_outstanding_r2t = 0 iface.max_burst_len = 0 iface.chap_auth = <empty> iface.bidi_chap = <empty> iface.strict_login_compliance = <empty> iface.discovery_auth = <empty> iface.discovery_logout = <empty> node.discovery_address = 10.11.12.6 node.discovery_port = 3260 node.discovery_type = send_targets node.session.initial_cmdsn = 0 node.session.initial_login_retry_max = 8 node.session.xmit_thread_priority = -20 node.session.cmds_max = 128 node.session.queue_depth = 8 node.session.nr_sessions = 1 node.session.auth.authmethod = None node.session.auth.username = <empty> node.session.auth.password = <empty> node.session.auth.username_in = <empty> node.session.auth.password_in = <empty> node.session.timeo.replacement_timeout = 120 node.session.err_timeo.abort_timeout = 15 node.session.err_timeo.lu_reset_timeout = 30 node.session.err_timeo.tgt_reset_timeout = 30 node.session.err_timeo.host_reset_timeout = 60 node.session.iscsi.FastAbort = Yes node.session.iscsi.InitialR2T = No node.session.iscsi.ImmediateData = Yes node.session.iscsi.FirstBurstLength = 262144 node.session.iscsi.MaxBurstLength = 16776192 node.session.iscsi.DefaultTime2Retain = 0 node.session.iscsi.DefaultTime2Wait = 2 node.session.iscsi.MaxConnections = 1 node.session.iscsi.MaxOutstandingR2T = 1 node.session.iscsi.ERL = 0 node.conn[0].address = 10.11.14.6 node.conn[0].port = 3260 node.conn[0].startup = manual node.conn[0].tcp.window_size = 524288 node.conn[0].tcp.type_of_service = 0 node.conn[0].timeo.logout_timeout = 15 node.conn[0].timeo.login_timeout = 15 node.conn[0].timeo.auth_timeout = 45 node.conn[0].timeo.noop_out_interval = 5 node.conn[0].timeo.noop_out_timeout = 5 node.conn[0].iscsi.MaxXmitDataSegmentLength = 0 node.conn[0].iscsi.MaxRecvDataSegmentLength = 262144 node.conn[0].iscsi.HeaderDigest = CRC32C,None node.conn[0].iscsi.DataDigest = CRC32C,None node.conn[0].iscsi.IFMarker = No node.conn[0].iscsi.OFMarker = No # END RECORD
Hello Migo, Connection dropped and re-established. Looks normal. Conn 2:0 recovered but the earlier error it gave looks misleading to me. What target controller you have ? What there a failover across the nodes ? How does the target handle the transition period to initiator queries ? Same logs again. This all looks normal to me. If you want the errors to be suppressed, you can increase the timeouts. On the other hand, otherwise, you could also use dm-multipath on top, like you have shown below. Looks all correct to me. You already are using feature queue_if_no_path Good. As I expected, you do have a 2 node target controller setup. Okay!! What behavior do you expect ? What anomaly do you see with the iSCSI initiator in Debian ?
Hello Ritesh Raj Sarraf,
Yes I understand that message but WHY this happens, what can be the cause
for this? Link is stable and no errors are reported on ifaces connected to
SAN switch:
ens1f0: flags=4163<UP,BROADCAST,RUNNING,MULTICAST> mtu 9000
inet 10.11.12.17 netmask 255.255.255.0 broadcast 10.11.12.255
inet6 fe80::9640:c9ff:fe5f:9172 prefixlen 64 scopeid 0x20<link>
ether 94:40:c9:5f:91:72 txqueuelen 1000 (Ethernet)
RX packets 184466309231 bytes 146589180968929 (133.3 TiB)
RX errors 0 dropped 0 overruns 0 frame 0
TX packets 86121730517 bytes 284905539017306 (259.1 TiB)
TX errors 0 dropped 0 overruns 0 carrier 0 collisions 0
device interrupt 102 memory 0xa5820000-a583ffff
ens1f1: flags=4163<UP,BROADCAST,RUNNING,MULTICAST> mtu 9000
inet 10.11.14.17 netmask 255.255.255.0 broadcast 10.11.14.255
inet6 fe80::9640:c9ff:fe5f:9173 prefixlen 64 scopeid 0x20<link>
ether 94:40:c9:5f:91:73 txqueuelen 1000 (Ethernet)
RX packets 114453605386 bytes 85545283479332 (77.8 TiB)
RX errors 0 dropped 0 overruns 0 frame 0
TX packets 236069263519 bytes 283187128321034 (257.5 TiB)
TX errors 0 dropped 0 overruns 0 carrier 0 collisions 0
device interrupt 132 memory 0xa5800000-a581ffff
yes, two paths with failover
What can cause this errors? I've multipath setup, but I want to find root
cause of this error messages.
think this is not normal/standard behaviour.
This is log from past 1h:
[Wed Feb 15 18:17:18 2023] connection1:0: pdu (op 0x5 itt 0xd0000057)
rejected. Reason code 0x9
[Wed Feb 15 18:42:10 2023] connection1:0: pdu (op 0x5 itt 0xd000000c)
rejected. Reason code 0x9
[Wed Feb 15 18:47:23 2023] connection1:0: pdu (op 0x5 itt 0xd000003f)
rejected. Reason code 0x9
[Wed Feb 15 19:05:12 2023] connection1:0: ping timeout of 5 secs expired,
recv timeout 5, last rx 19183609844, last ping 19183611097, now 19183612352
[Wed Feb 15 19:05:12 2023] connection1:0: detected conn error (1022)
[Wed Feb 15 19:05:12 2023] sd 1:0:0:27: [sdbe] tag#23 FAILED Result:
hostbyte=DID_TRANSPORT_DISRUPTED driverbyte=DRIVER_OK
[Wed Feb 15 19:05:12 2023] sd 1:0:0:27: [sdbe] tag#23 CDB: Test Unit Ready
00 00 00 00 00 00
[Wed Feb 15 19:05:12 2023] sd 1:0:0:24: [sdax] tag#12 FAILED Result:
hostbyte=DID_TRANSPORT_DISRUPTED driverbyte=DRIVER_OK
[Wed Feb 15 19:05:12 2023] sd 1:0:0:24: [sdax] tag#12 CDB: Test Unit Ready
00 00 00 00 00 00
[Wed Feb 15 19:05:12 2023] sd 1:0:0:0: [sdc] tag#4 FAILED Result:
hostbyte=DID_TRANSPORT_DISRUPTED driverbyte=DRIVER_OK
[Wed Feb 15 19:05:12 2023] sd 1:0:0:0: [sdc] tag#4 CDB: Test Unit Ready 00
00 00 00 00 00
[Wed Feb 15 19:05:12 2023] sd 1:0:0:4: [sdj] tag#66 FAILED Result:
hostbyte=DID_TRANSPORT_DISRUPTED driverbyte=DRIVER_OK
[Wed Feb 15 19:05:12 2023] sd 1:0:0:4: [sdj] tag#66 CDB: Test Unit Ready 00
00 00 00 00 00
[Wed Feb 15 19:05:12 2023] sd 1:0:0:5: [sdl] tag#3 FAILED Result:
hostbyte=DID_TRANSPORT_DISRUPTED driverbyte=DRIVER_OK
[Wed Feb 15 19:05:12 2023] sd 1:0:0:5: [sdl] tag#3 CDB: Test Unit Ready 00
00 00 00 00 00
[Wed Feb 15 19:05:12 2023] sd 1:0:0:16: [sdah] tag#7 FAILED Result:
hostbyte=DID_TRANSPORT_DISRUPTED driverbyte=DRIVER_OK
[Wed Feb 15 19:05:12 2023] sd 1:0:0:16: [sdah] tag#7 CDB: Test Unit Ready
00 00 00 00 00 00
[Wed Feb 15 19:05:12 2023] sd 1:0:0:11: [sdx] tag#5 FAILED Result:
hostbyte=DID_TRANSPORT_DISRUPTED driverbyte=DRIVER_OK
[Wed Feb 15 19:05:12 2023] sd 1:0:0:11: [sdx] tag#5 CDB: Test Unit Ready 00
00 00 00 00 00
[Wed Feb 15 19:05:12 2023] sd 1:0:0:19: [sdan] tag#8 FAILED Result:
hostbyte=DID_TRANSPORT_DISRUPTED driverbyte=DRIVER_OK
[Wed Feb 15 19:05:12 2023] sd 1:0:0:19: [sdan] tag#8 CDB: Test Unit Ready
00 00 00 00 00 00
[Wed Feb 15 19:05:12 2023] sd 1:0:0:3: [sdh] tag#65 FAILED Result:
hostbyte=DID_TRANSPORT_DISRUPTED driverbyte=DRIVER_OK
[Wed Feb 15 19:05:12 2023] sd 1:0:0:3: [sdh] tag#65 CDB: Test Unit Ready 00
00 00 00 00 00
[Wed Feb 15 19:05:12 2023] sd 1:0:0:15: [sdaf] tag#6 FAILED Result:
hostbyte=DID_TRANSPORT_DISRUPTED driverbyte=DRIVER_OK
[Wed Feb 15 19:05:12 2023] sd 1:0:0:15: [sdaf] tag#6 CDB: Test Unit Ready
00 00 00 00 00 00
[Wed Feb 15 19:13:26 2023] connection1:0: pdu (op 0x5 itt 0xe000005c)
rejected. Reason code 0x9
[Wed Feb 15 19:16:52 2023] connection1:0: pdu (op 0x5 itt 0xe000003e)
rejected. Reason code 0x9
What does this error mean: connection1:0: pdu (op 0x5 itt 0xe000003e)
rejected. Reason code 0x9
This one repeats very often in logs.
Thank you very much for your assistance!
Kind regards,
Milan
P.S. Sorry for a later response, gmail clasified this email as a SPAM and
I've discovered it only recently. :(
Hello, That is mostly to analyze on the target side or on the network, to start with. Usually, in a multinode clustered target seutp, when a takeover happens, connecitons from node1 will be dropped for a moment while it is taken over by node2. I don't know nothing about the Fujitsu target so I cannot generalize. Thanks for sharing this. So as you mentioned there's no issue on the network side. So did you trigger a cluster takeover ? My guess is it is your target initiating the connection drops, while taking over to the other node. How a target behaves during the transition is left to the target. The initiator will keep querying for recovery, until either it times out or recovers. I don't know what is causing to trigger those errors on your setup. Please do note that you have different layers mixed to build that storage solution * TCP/IP * iSCSI * SCSI * Device Mapper Multipath Every layer has errors that it treats differently. For example, a SCSI error is perfectly suppressed by the DM-Multipath layer depending on how it is setup, to the upper layers. Errors like: * connection drops * iscsi session drops/terminations * SCSI errors * multipath path checker errors All these will be errors which will be recovered eventually. That is why we have the need for close integration in between these layers, when building a storage solution on top. Note: These days I only have a software LIO target to test/play with, where I have not seen any real issues/errors. How each SAN Target behaves is something highly specific to the target, in your case the Fujitsu target.
Hello, There was no intentional cluster takeover. This happens during normal operation. recovery seems to work fine. I've tested it by disconnecting one path to target and all was OK, second path was used and when first one recovered is switched to that one over. I can understand this is very complex problem, what do you suggest me to debug this issues? We have 5 host connected to Fujitsu target, but errors are only on those two which are running KVM guests. Other servers running mail hosts and camera surveillance are not affected it seems. They are Debian systems but not running multipath, so I've made test: So far I've disabled all KVM guests on one (of those two KVM hosts used fro virtualization) host and mounted one KVM guest file system locally (/mnt) and performed stress test on that mount: root@virt1n:/mnt# stress-ng --ag -v -d 1 stress-ng: debug: [1186892] 256 processors online, 256 processors configured stress-ng: info: [1186892] defaulting to a 86400 second (1 day, 0.00 secs) run per stressor stress-ng: info: [1186892] dispatching hogs: 1 hdd stress-ng: debug: [1186892] cache allocate: default cache size: 16384K stress-ng: debug: [1186892] starting stressors stress-ng: debug: [1186892] 1 stressor started stress-ng: debug: [1186893] stress-ng-hdd: started [1186893] (instance 0) run this test for more than 12h and there were no errors in logs and no path takeovers. Here is dmesg: [Sat Feb 18 20:25:17 2023] EXT4-fs (dm-37): warning: checktime reached, running e2fsck is recommended [Sat Feb 18 20:25:17 2023] EXT4-fs (dm-37): mounted filesystem with ordered data mode. Opts: (null) [Sat Feb 18 20:25:37 2023] usb 1-1: USB disconnect, device number 2 [Sat Feb 18 20:26:21 2023] perf: interrupt took too long (2955 > 2500), lowering kernel.perf_event_max_sample_rate to 67500 [Sat Feb 18 20:32:22 2023] perf: interrupt took too long (3711 > 3693), lowering kernel.perf_event_max_sample_rate to 53750 [Sat Feb 18 20:34:46 2023] perf: interrupt took too long (5054 > 4638), lowering kernel.perf_event_max_sample_rate to 39500 [Sat Feb 18 21:31:55 2023] perf: interrupt took too long (6810 > 6317), lowering kernel.perf_event_max_sample_rate to 29250 [Sun Feb 19 07:17:14 2023] INFO: NMI handler (perf_ibs_nmi_handler) took too long to run: 2.171 msecs [Sun Feb 19 07:17:14 2023] perf: interrupt took too long (20455 > 8512), lowering kernel.perf_event_max_sample_rate to 9750 root@virt1n:/mnt# and some iface and disk i/o graphs are attached too. When I let run one KVM guest (Debian) on this host system following errors are in dmesg: [Mon Feb 20 09:49:39 2023] connection1:0: pdu (op 0x5 itt 0x53) rejected. Reason code 0x9 [Mon Feb 20 09:49:41 2023] connection1:0: pdu (op 0x5 itt 0x7c) rejected. Reason code 0x9 [Mon Feb 20 09:49:41 2023] connection1:0: pdu (op 0x5 itt 0x51) rejected. Reason code 0x9 [Mon Feb 20 09:50:07 2023] connection1:0: pdu (op 0x5 itt 0x33) rejected. Reason code 0x9 Guest is using Virtio disk drivers (vda) so I've switched it to generic SATA (sda) for this guest, but errors in log remains. It seem that KVM is triggering this errors and make our NAS unstable. What can KVM do differently? It is using /dev/mapper/dm-XX as disk devices so no direct iscsi access. Any ideas what should I try next? This is very complex ecosystem. I know that error reporting is good think :) and helping out to troubleshoot problems. But when everything is all right there should be no errors, right? Are you running KVM virtualization atop of your SAN target? Thank you, kind regards, Milan
Hello Milan, By KVM, do you mean just pure KVM ? Or the management suite too, libvirt ? I'm afraid I do not have much direct hints for you here. given that this issue does not happen when KVM is not involved, would imply that the erratic behavior originates from that software's integration. In an ideal scenario, yes, there will be no errors. But on a SAN setup, cluster failovers are a feature of the SAN target, and as such during that transition some errors are expected on the initiator, which are eventually recovered. Recovery is the critical part here. When states do not recover to normal, it is an error; either the target or the initiator. Or even the middleman (network) at times. My LIO target runs in a KVM guest. So does the iSCSI initiator too.
Hello, Yes, libvirt is used to manage KVM, and virt-manager on my desktop to connect to it. It is simple setup without clustering. Do you know someone who can help with this? Here is example of KVM guest running configuration: libvirt+ 244533 1 5 Feb03 ? 1-03:54:37 qemu-system-x86_64 -enable-kvm -name guest=me_test,process=qemu:me_test,debug-threads=on -S -object secret,id=masterKey0,format=raw,file=/var/lib/libvirt/qemu/domain-108-me_test/master-key.aes -machine pc-1.1,accel=kvm,usb=off,dump-guest-core=off -cpu host -m 8096 -realtime mlock=off -smp 4,sockets=1,cores=4,threads=1 -uuid 1591f345-96b5-4077-9d32-b2991003753d -no-user-config -nodefaults -chardev socket,id=charmonitor,fd=57,server,nowait -mon chardev=charmonitor,id=monitor,mode=control -rtc base=utc -no-shutdown -boot strict=on -device piix3-usb-uhci,id=usb,bus=pci.0,addr=0x1.0x2 -device virtio-serial-pci,id=virtio-serial0,bus=pci.0,addr=0x4 -drive if=none,id=drive-ide0-1-0,readonly=on -device ide-cd,bus=ide.1,unit=0,drive=drive-ide0-1-0,id=ide0-1-0,bootindex=1 -drive file=/dev/mapper/me_test,format=raw,if=none,id=drive-virtio-disk0,cache=none,discard=unmap,aio=native -device virtio-blk-pci,scsi=off,bus=pci.0,addr=0x6,drive=drive-virtio-disk0,id=virtio-disk0,bootindex=2,write-cache=on -netdev tap,fd=60,id=hostnet0,vhost=on,vhostfd=61 -device virtio-net-pci,netdev=hostnet0,id=net0,mac=aa:bb:cc:00:10:31,bus=pci.0,addr=0x3 -chardev pty,id=charserial0 -device isa-serial,chardev=charserial0,id=serial0 -chardev spicevmc,id=charchannel0,name=vdagent -device virtserialport,bus=virtio-serial0.0,nr=1,chardev=charchannel0,id=channel0,name=com.redhat.spice.0 -spice port=5929,addr=0.0.0.0,disable-ticketing,seamless-migration=on -device qxl-vga,id=video0,ram_size=67108864,vram_size=67108864,vram64_size_mb=0,vgamem_mb=16,max_outputs=1,bus=pci.0,addr=0x2 -device virtio-balloon-pci,id=balloon0,bus=pci.0,addr=0x5 -sandbox on,obsolete=deny,elevateprivileges=deny,spawn=deny,resourcecontrol=deny -msg timestamp=on Maybe there is something undesirable for the iscsi target. This part seems to work OK, so far no data loss was detected. Pure KVM or libvirtd too? Thank you, kind regards Milan
Thanks for confirming. You should start with the hypervisor where libvirtd is running. Check if that hypervisor host is running normal. From all the symptoms you've shared, my wild arrow in the dark is that your hypervisor/guest network may be running into issues. Seems like you are using host networking ? Have you validated that that part is working reliably ? From your initial logs, mpath reported 'io pending' on one of the paths. I'm not sure if I'd call that fully recovered. What if you down the other (healthy) path ? Does IO progress on the initial (io pending) path ? libvirtd it is. The same hypervisor with libvirtd running, hosts the initiator and the target, in different VMs.
Also, what 10 GiB ethernet do you run on your hypervisor ? What make/model ? Is that known to work reliably ? My experience from around some years ago, 10 GiB ethernet support was still budding. But things may have changed lately. I just don't have access to those things no more.
Also, on my test setup, my LIO target is a standard one. No multinode cluster. It is a single-node target. The way I trigger path failovers is by bringing down individual network interfaces. That serves well enough for my initiator focused tests.
hello Sarraf, No unusual errors reported. There is all ok with network. I've made some experiments with storage parameters passed to KVM, standard disk created within virt-manager is running with following paramaters: format=raw,cache=none,discard=unmap,aio=native. I've removed cache parameter (let hypervisor set its default) and now running disk with: format=raw,discard=unmap. With this setup, there are no errors (connection1:0: pdu (op 0x5 itt 0x48) rejected. Reason code 0x9) in logs. Virtual host was stressed with: ns2:/home/migo# stress-ng -v -d 1 stress-ng: debug: [858] 2 processors online, 2 processors configured stress-ng: info: [858] defaulting to a 86400 second (1 day, 0.00 secs) run per stressor stress-ng: info: [858] dispatching hogs: 1 hdd stress-ng: debug: [858] cache allocate: default cache size: 262144K stress-ng: debug: [858] starting stressors stress-ng: debug: [858] 1 stressor started stress-ng: debug: [859] stress-ng-hdd: started [859] (instance 0) ^Cstress-ng: debug: [859] stress-ng-hdd: exited [859] (instance 0) stress-ng: debug: [858] process [859] terminated stress-ng: info: [858] successful run completed in 30377.75s (8 hours, 26 mins, 17.75 secs) stress-ng: debug: [858] metrics-check: all stressor metrics validated and sane It seems that using hypervisors cache is filtering those bogus commands (causing trouble for target) sent to the target. What do you think about it? But running iscsi target with cache is not a good idea, and this is no real solution. :( It recovers within 10-15s and path reporting 'io pending' recovers to normal operation back. I've tested disconnecting the link from the first path and second path immediately taken over all operations. Reconnecting that path and all operations moved to the original path with higher priority. test-app (3600000e00d2c0000002c1772001f0000) dm-97 FUJITSU,ETERNUS_DXL size=500G features='1 queue_if_no_path' hwhandler='1 alua' wp=rw |-+- policy='round-robin 0' prio=50 status=active | `- 1:0:0:28 sdar 66:176 active ready running `-+- policy='round-robin 0' prio=10 status=enabled `- 2:0:0:28 sdbj 67:208 active ready running Thank you, kind regards, Milan