#513635 nscd: uses 100% CPU

Package:
nscd
Source:
glibc
Description:
GNU C Library: Name Service Cache Daemon
Submitter:
Nicolas Boullis
Date:
2010-08-24 18:51:04 UTC
Severity:
important
#513635#5
Date:
2009-01-30 22:17:14 UTC
From:
To:
Hi,

I just upgraded nscd from version 2.3.6.ds1-13etch8 to 2.7-18 and it
started using 100% CPU. In fact it starts using 100% CPU a few seconds
after it was started.

I tried to run it in debug mode (nscd -d), and it also starts using 100%
CPU after a few seconds, without logging any activity. Note that it
answers queries anyway, and logs that activity fine.

Running top with threads shown shows several nscd threads that use
together 100% CPU:
 4335 root      20   0  112m 2272 1716 R 11.2  0.2   0:19.02 nscd
 4338 root      20   0  112m 2272 1716 R 11.2  0.2   0:03.04 nscd
 4339 root      20   0  112m 2272 1716 R 11.2  0.2   0:43.08 nscd
 4332 root      20   0  112m 2272 1716 R 10.6  0.2   1:12.14 nscd
 4333 root      20   0  112m 2272 1716 R 10.6  0.2   1:12.18 nscd
 4334 root      20   0  112m 2272 1716 R 10.6  0.2   1:17.18 nscd
 4336 root      20   0  112m 2272 1716 R 10.6  0.2   0:26.56 nscd
 4337 root      20   0  112m 2272 1716 R 10.6  0.2   0:44.26 nscd
 4381 root      20   0  112m 2272 1716 R 10.6  0.2   0:00.88 nscd

(The number of crazy nscd threads keeps increasing with time.)

This finally makes my system unusable while nscd is running, so I have
to stop it.

This would certainly deserve grave severity if it affected everyone, but
I can't believe it does and nobody reported the problem earlier. Hence,
there must be something specific to my system. The problem might for
example be powerpc-specific...

FWIW, whenever I start it in debug mode, it first prints:
4244: invalid persistent database file "/var/cache/nscd/passwd": file size does not match
4244: invalid persistent database file "/var/cache/nscd/group": file size does not match
4244: invalid persistent database file "/var/cache/nscd/services": file size does not match

If I remove those cache files, it does not print this, but does again
the next time I start it.


Cheers,

Nicolas

#513635#10
Date:
2009-03-12 22:00:13 UTC
From:
To:
We too have seen this behavior with Ubuntu Hardy and Intrepid,
but may be seeing multiple problems.

Using gdb to look at the stripped nscd and libs has not shown much,
other than that accept fails because all the FDs (1024) have been allocated.

In one type of failure, nscd keeps working, but using 100% CPU.

A second type of failure looks like a deadlock in libnss-ldap, in
_nss_ldap_endpwent from _nss_compat_getpwname_r where all 32 worker
threads are waiting for the same lock.

We now have a non-stripped version of nscd and libnss-ldap running,
waiting for the next occurrence so we can use gdb to see what is going on.


Comment on nscd:

It looks like nscd will accept requests (i.e. each using an FD)
and queue them for the worker threads until it runs out of FDs,
rather than stop accepting new requests. It makes no allowance for
the fact that the worker threads may also need to open files or
sockets.

#513635#15
Date:
2009-03-13 10:06:55 UTC
From:
To:
I have had the same problem as the original reporter for a couple of days.
Last night I downgraded nscd to etch's version and this morning nscd was
eating ~100% CPU again. I will try to use the version from sid and see what
happens, but this seems to point elsewhere.

#513635#20
Date:
2009-03-31 16:47:30 UTC
From:
To:
We too have seen this behavior with Ubuntu Hardy and Intrepid
but may be seeing multiple problems.

There is an issue with the /etc/ldap.conf file. This has the comment:

# Search timelimit
#timelimit 30

But the default in the nss-ldap code is NO_LIMIT! We are now testing by
uncommenting this line.
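
A quick way to check whether a given ldap.conf really sets the limit, rather than just carrying the commented-out hint, might look like the sketch below. The helper name effective_timelimit is made up for illustration, and the parsing assumes the simple "keyword value" line format shown above, with the last active line winning. It is not the parser nss-ldap itself uses.

```python
from typing import Optional


def effective_timelimit(conf_text: str) -> Optional[int]:
    """Return the active 'timelimit' value from ldap.conf text, or None if unset.

    Hypothetical helper: commented-out lines are ignored, and the last
    active 'timelimit' line wins (an assumption, not verified against
    the nss-ldap parser).
    """
    value = None
    for raw in conf_text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # blank line or comment such as "#timelimit 30"
        parts = line.split()
        if parts[0].lower() == "timelimit" and len(parts) >= 2:
            try:
                value = int(parts[1])
            except ValueError:
                pass  # ignore a malformed value and keep any earlier one
    return value
```

A result of None means the file does not set the limit at all, so the built-in default applies, which according to the observation above is NO_LIMIT.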

We have seen this on more than one machine. One nscd worker thread
will call nss-ldap, which will then call ldap_result with a 0 timelimit.
ldap_result calls ldap_int_select, which calls poll. netstat -n | grep tcp
shows connections in CLOSE_WAIT. This thread holds an nss_ldap lock, so
all the other threads are waiting for it at _nss_ldap_enter; thus
no worker threads are doing any work.
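
For anyone who wants to watch for this state without eyeballing netstat output, here is a minimal sketch that counts CLOSE_WAIT sockets from the text of /proc/net/tcp on Linux. The function name count_close_wait is hypothetical. It relies on the kernel's state code 08 for CLOSE_WAIT in the "st" column, covers IPv4 only (/proc/net/tcp6 has the same layout for IPv6), and counts every socket on the system, not just the ones owned by nscd.

```python
CLOSE_WAIT = 0x08  # TCP state code found in the "st" column of /proc/net/tcp


def count_close_wait(proc_net_tcp: str) -> int:
    """Count sockets in CLOSE_WAIT given the contents of /proc/net/tcp."""
    count = 0
    for line in proc_net_tcp.splitlines()[1:]:  # first line is the header
        fields = line.split()
        if len(fields) <= 3:
            continue
        try:
            state = int(fields[3], 16)  # columns: sl, local, remote, st, ...
        except ValueError:
            continue  # not a socket line
        if state == CLOSE_WAIT:
            count += 1
    return count
```

On a live system it would be fed with something like open("/proc/net/tcp").read().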

As new requests are received by nscd, it will do the accept and queue them
for a worker thread. Each request uses an fd.

As each caller to nscd does not get a response, it times out (after some seconds) and
appears to do its own ldap query, so things sort of work, but slowly. The used
fd count (/proc/<nscd-pid>/fd) continues to rise. Eventually nscd runs out
of fds and goes into the 100% CPU loop trying to do accepts.
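
The rising fd count can be tracked with a few lines of script. This is only a sketch: count_open_fds is a made-up name, the proc_root parameter exists purely to make the function testable, and reading another process's fd directory normally needs root or the same user.

```python
import os


def count_open_fds(pid: int, proc_root: str = "/proc") -> int:
    """Return the number of entries in <proc_root>/<pid>/fd.

    Each entry is one open file descriptor of that process.
    """
    return len(os.listdir(os.path.join(proc_root, str(pid), "fd")))
```

Calling it periodically with the nscd pid and logging the result would show how fast the count approaches the 1024 limit seen earlier.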

So there appear to be three separate problems:
(1) timelimit = 0 is the default in nss-ldap, but /etc/ldap.conf implies it is 30

(2) ldap wait4msg does not recognize that the connection is in CLOSE_WAIT,
     even with timeout = LDAP_NO_LIMIT

(3) nscd will accept requests (i.e. each using an FD)
     and queue them for the worker threads until it runs out of FDs,
     rather than stop accepting new requests. It makes no allowance for
     the fact that the worker threads may also need to open files or
     sockets.
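
To make point (3) concrete, here is a toy model of an accept loop that stops accepting before the FD limit is hit, so that some descriptors stay free for the workers. It is not how nscd is implemented and not a proposed patch. The class name BoundedAcceptor and the reserve of 64 used in the example are arbitrary choices for illustration; only the 1024 limit comes from the observations in this report.

```python
from collections import deque


class BoundedAcceptor:
    """Toy model: accept requests only while FDs remain outside the reserve."""

    def __init__(self, fd_limit: int, reserve: int) -> None:
        if reserve >= fd_limit:
            raise ValueError("reserve must be smaller than fd_limit")
        self.fd_limit = fd_limit
        self.reserve = reserve
        self.pending = deque()  # accepted requests, each holding one FD

    def can_accept(self) -> bool:
        # Stop accepting before the FDs kept back for workers are touched.
        return len(self.pending) < self.fd_limit - self.reserve

    def offer(self, request) -> bool:
        """Accept and queue the request if the budget allows, else refuse."""
        if not self.can_accept():
            return False
        self.pending.append(request)
        return True

    def complete_one(self):
        """A worker finished the oldest request and closed its FD."""
        return self.pending.popleft()


acceptor = BoundedAcceptor(fd_limit=1024, reserve=64)
accepted = sum(acceptor.offer(n) for n in range(2000))
print(accepted)  # number of requests accepted before the budget ran out
```

In a real server, a connection that is not accepted would simply wait in the kernel listen backlog until a slot frees up, instead of consuming a descriptor in the daemon.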

#513635#25
Date:
2010-08-24 15:59:20 UTC
From:
To:
Hi,

is there any news on this?
Today the same happened to one of my Debian/squeeze installations running:

nscd 2.10.2-6

My best guess is that nscd ran out of FDs after several days.

After all in my opinion this should be grave severity?


B

#513635#30
Date:
2010-08-24 18:15:05 UTC
From:
To:
The problems we found were a combination of nscd using one FD for each
request and having 32 worker threads, each of which was waiting in libnss-ldap
for one request. libnss-ldap had in the code a default timeout of NO_LIMIT,
even though the documentation indicated 30 seconds. Thus once libnss-ldap hung,
all the nscd threads would hang, but incoming requests would continue to
use one FD each until the FD limit was reached, at which time nscd would
loop. The system kept running, as each client would try nscd, wait 30 seconds,
then do the ldap calls from the application.

The problems we had seen were summed up in an internal note on 4/14/2009:
https://bugs.launchpad.net/ubuntu/+source/libnss-ldap/+bug/292971 which has our latest
fix for libnss-ldap.

So set timelimit 30 in ldap.conf and get the libnss-ldap 264-2ubuntu2 package.