#956803 libteam-utils: teamd using 100% cpu

Package:
libteam-utils
Source:
libteam
Description:
library for controlling team network device -- userspace utilities
Submitter:
Jonathan Steinert
Date:
2023-08-12 02:45:04 UTC
Severity:
important
#956803#5
Date:
2020-04-15 12:27:12 UTC
Dear Maintainer,

After a recent upgrade of many Debian packages and a reboot, I found teamd
stuck at 100% usage of one of my CPUs. Restarting it does not change this
behavior.

I ran strace on the process and saw that it was mostly netlink traffic:

# strace -p {teamd pid} -T -ttt
1586953451.038076 epoll_wait(10, [{EPOLLIN, {u32=8, u64=8}}], 2, -1) = 1 <0.000032>
1586953451.038222 recvmsg(8, {msg_name={sa_family=AF_NETLINK, nl_pid=0, nl_groups=0x000008}, msg_namelen=12, msg_iov=[{iov_base=[{{len=72, type=team, flags=NLM_F_MULTI, seq=0, pid=0}, "\x02\x01\x00\x00\x08\x00\x01\x00\x0f\x00\x00\x00\x2c\x00\x02\x00\x28\x00\x01\x00\x0c\x00\x01\x00\x65\x6e\x61\x62\x6c\x65\x64\x00"...}, {len=16, type=NLMSG_DONE, flags=NLM_F_MULTI, seq=0, pid=0}], iov_len=16384}], msg_iovlen=1, msg_controllen=0, msg_flags=0}, MSG_PEEK|MSG_TRUNC) = 88 <0.000042>
1586953451.038650 recvmsg(8, {msg_name={sa_family=AF_NETLINK, nl_pid=0, nl_groups=0x000008}, msg_namelen=12, msg_iov=[{iov_base=[{{len=72, type=team, flags=NLM_F_MULTI, seq=0, pid=0}, "\x02\x01\x00\x00\x08\x00\x01\x00\x0f\x00\x00\x00\x2c\x00\x02\x00\x28\x00\x01\x00\x0c\x00\x01\x00\x65\x6e\x61\x62\x6c\x65\x64\x00"...}, {len=16, type=NLMSG_DONE, flags=NLM_F_MULTI, seq=0, pid=0}], iov_len=16384}], msg_iovlen=1, msg_controllen=0, msg_flags=0}, 0) = 88 <0.000061>
1586953451.038994 sendmsg(7, {msg_name={sa_family=AF_NETLINK, nl_pid=0, nl_groups=00000000}, msg_namelen=12, msg_iov=[{iov_base={{len=68, type=team, flags=NLM_F_REQUEST|NLM_F_ACK, seq=1589353843, pid=3548384774}, "\x01\x00\x00\x00\x08\x00\x01\x00\x0f\x00\x00\x00\x28\x00\x02\x00\x24\x00\x01\x00\x0c\x00\x01\x00\x65\x6e\x61\x62\x6c\x65\x64\x00"...}, iov_len=68}], msg_iovlen=1, msg_controllen=0, msg_flags=0}, 0) = 68 <0.000109>
1586953451.039295 recvmsg(7, {msg_name={sa_family=AF_NETLINK, nl_pid=0, nl_groups=00000000}, msg_namelen=12, msg_iov=[{iov_base={{len=36, type=NLMSG_ERROR, flags=NLM_F_CAPPED, seq=1589353843, pid=3548384774}, {error=0, msg={len=68, type=team, flags=NLM_F_REQUEST|NLM_F_ACK, seq=1589353843, pid=3548384774}}}, iov_len=16384}], msg_iovlen=1, msg_controllen=0, msg_flags=0}, MSG_PEEK|MSG_TRUNC) = 36 <0.000064>
1586953451.039523 recvmsg(7, {msg_name={sa_family=AF_NETLINK, nl_pid=0, nl_groups=00000000}, msg_namelen=12, msg_iov=[{iov_base={{len=36, type=NLMSG_ERROR, flags=NLM_F_CAPPED, seq=1589353843, pid=3548384774}, {error=0, msg={len=68, type=team, flags=NLM_F_REQUEST|NLM_F_ACK, seq=1589353843, pid=3548384774}}}, iov_len=16384}], msg_iovlen=1, msg_controllen=0, msg_flags=0}, 0) = 36 <0.000035>
1586953451.039956 select(17, [3 10 11 16], [], [], NULL) = 1 (in [10]) <0.000039>
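
The payload bytes in the trace can be read by hand. Below is a minimal Python
sketch that decodes the 32-byte prefix strace printed for the received and the
sent "team" message. It assumes a little-endian host and the usual generic
netlink layout (a 4-byte genlmsghdr followed by nlattr headers); the function
name decode_prefix is made up for illustration. Numeric command and attribute
types are reported as-is, since mapping them to names requires checking
linux/if_team.h for the running kernel.

```python
import struct

# First 32 payload bytes of the two "team" messages in the trace above.
# strace truncated the rest ("..."), so only this prefix can be decoded.
RECEIVED = bytes.fromhex(
    "02010000 08000100 0f000000 2c000200 28000100 0c000100 656e61626c656400"
)
SENT = bytes.fromhex(
    "01000000 08000100 0f000000 28000200 24000100 0c000100 656e61626c656400"
)


def decode_prefix(buf):
    """Decode the genetlink header and the first attribute chain (assumed layout)."""
    # struct genlmsghdr: u8 cmd, u8 version, u16 reserved
    cmd, version, _reserved = struct.unpack_from("<BBH", buf, 0)
    off = 4

    # First attribute: struct nlattr (u16 len, u16 type) plus a u32 value.
    attr_len, attr_type = struct.unpack_from("<HH", buf, off)
    (ifindex,) = struct.unpack_from("<I", buf, off + 4)
    off += attr_len  # 8 bytes here, already 4-byte aligned

    # Three nested attribute headers follow back to back.
    list_len, list_type = struct.unpack_from("<HH", buf, off)
    item_len, item_type = struct.unpack_from("<HH", buf, off + 4)
    name_len, name_type = struct.unpack_from("<HH", buf, off + 8)
    off += 12

    # The attribute length includes its 4-byte header and the trailing NUL.
    name = buf[off:off + name_len - 4].rstrip(b"\0").decode("ascii")
    return {
        "cmd": cmd,
        "version": version,
        "ifindex": ifindex,
        "list_len": list_len,
        "item_len": item_len,
        "option_name": name,
    }


if __name__ == "__main__":
    print("received:", decode_prefix(RECEIVED))
    print("sent:    ", decode_prefix(SENT))
```

Under these assumptions, both messages refer to interface index 15 and to the
option named "enabled": the received one carries command 2 and the sent one
command 1. That would be consistent with teamd answering each "enabled"
notification with a request on the same option, though the truncated trace
alone does not prove such a loop.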


I also watched the device with the teamnl command:

# teamnl lan monitor
options:
  *enabled (port:enp4s0f0) true changed
options:
  *enabled (port:enp4s0f0) true changed
options:
  *enabled (port:enp4s0f0) true changed
options:
  *enabled (port:enp4s0f1) true changed
options:
  *enabled (port:enp4s0f1) true changed
options:
  *enabled (port:enp4s0f1) true changed
options:
  *enabled (port:enp4s0f0) true changed
options:
  *enabled (port:enp4s0f0) true changed
options:
  *enabled (port:enp4s0f0) true changed
options:
  *enabled (port:enp4s0f1) true changed
options:
  *enabled (port:enp4s0f1) true changed
options:
  *enabled (port:enp4s0f1) true changed
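
To put a number on how often each port reports the same change, the monitor
output can be tallied. This is a rough Python sketch that assumes the line
format shown above ("*option (port:name) value changed"); the regular
expression and the count_events helper are illustrative and not part of
libteam.

```python
import re
import sys
from collections import Counter

# Matches lines such as "  *enabled (port:enp4s0f0) true changed".
# The "options:" separator lines do not match and are skipped.
EVENT_RE = re.compile(
    r"^\s*\*?(?P<option>\S+) \(port:(?P<port>[^)]+)\) (?P<value>\S+)"
)


def count_events(lines):
    """Count (port, option, value) triples seen in teamnl monitor output."""
    counts = Counter()
    for line in lines:
        match = EVENT_RE.match(line)
        if match:
            counts[(match["port"], match["option"], match["value"])] += 1
    return counts


# The excerpt above: three events for one port, then three for the other, twice.
SAMPLE = []
for port in ("enp4s0f0", "enp4s0f1", "enp4s0f0", "enp4s0f1"):
    for _ in range(3):
        SAMPLE.append("options:")
        SAMPLE.append(f"  *enabled (port:{port}) true changed")


if __name__ == "__main__":
    # With no piped input, fall back to the sample excerpt.
    source = SAMPLE if sys.stdin.isatty() else sys.stdin
    for key, n in sorted(count_events(source).items()):
        print(n, *key)
```

Piping a fixed-duration capture of the monitor output into the script gives an
events-per-port figure that can be compared before and after any change.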

Here is my team config:

{
	"device":		"lan",
	"hwaddr": "DE:AD:BE:EF:00:01",
	"runner": {
		"name": "loadbalance",
		"tx_hash": ["eth", "ipv4", "ipv6"]
	},
	"link_watch": {
		"name": "ethtool"
	},
	"ports":		{"enp4s0f0": {}, "enp4s0f1": {}}
}

Any advice on debugging or next steps would be appreciated.

#956803#10
Date:
2023-08-12 02:42:01 UTC
Dear Maintainer,

This bug still exists as of version 1.31-1. My current team config is:

{
  "device": "team1",
  "runner": {
    "name": "loadbalance",
    "tx_hash": ["eth"],
    "tx_balancer": {
      "name": "basic"
    }
  },
  "link_watch": {
    "name": "ethtool"
  },
  "ports": {
    "enp13s0": {},
    "wlp12s0": {
      "link_watch": {
        "delay_up": 4000,
        "delay_down": 1000
      }
    }
  }
}

But I have 100% CPU utilization even with a simple config like:

{
  "runner": {
    "name": "loadbalance"
  },
  "ports": {
    "enp13s0": {}
  }
}

I *do not* have the same issue with any of the other runners. (Well, I don't
have LACP set up on my router, so *maybe* LACP would have the same problem as
well, but I can't test it).
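
For anyone wanting to repeat the runner comparison, a small Python sketch can
generate one minimal config per runner. The runner names are the ones listed
in teamd.conf(5) as far as I know and should be checked against the installed
version; the port name comes from the config above, and the file names and
helper functions are hypothetical.

```python
import json

# Runner names as documented in teamd.conf(5); verify for your version.
RUNNERS = [
    "broadcast",
    "roundrobin",
    "random",
    "activebackup",
    "loadbalance",
    "lacp",
]


def minimal_config(runner, port="enp13s0"):
    """Return the smallest config of the shape shown above for one runner."""
    return {"runner": {"name": runner}, "ports": {port: {}}}


def render_configs(port="enp13s0"):
    """Map a hypothetical file name to JSON text for each runner."""
    return {
        f"{runner}.conf": json.dumps(minimal_config(runner, port), indent=2)
        for runner in RUNNERS
    }


if __name__ == "__main__":
    for name, text in render_configs().items():
        print(f"# {name}")
        print(text)
```

Each generated file differs only in the runner name, so any difference in CPU
usage between runs can be attributed to the runner rather than to other
options.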

This happens regardless of the current traffic on the network.

I can verify that the "team_mode_loadbalance" and "team" kernel modules are
loaded.

I can provide more info if needed.