#500778 nss-ldapd: problem resolving groups and users with nfs4

Package:
libnss-ldapd
Source:
nss-pam-ldapd
Description:
NSS module for using LDAP as a naming service
Submitter:
Patrick Schoenfeld
Date:
2015-11-08 20:33:03 UTC
Severity:
important
#500778#5
Date:
2008-10-01 11:11:56 UTC
From:
To:
Hi,

since we started using libnss-ldapd we have had a problem that is quite
serious for us, as it effectively affects login and group ACLs. However,
we have not yet been able to track the issue down to a specific
component, which is why we did not report it earlier.

The setup:
Our setup is a mixed Windows/Linux environment with an LDAP server for
central authentication. Linux clients use libnss-ldapd for resolution of
usernames and groups.

The problem:
After reboot of the Linux clients they are unable to resolve groups and
sometimes are also unable to resolve users. The result is that files are
owned by [nobody]:nogroup, while getent passwd and getent group show
the right result. As a consequence people are unable to log in properly
(because desktop environments need read permission on their settings ;)
and user permissions are broken.

After 10-30 minutes of running the problem disappears. This makes me
think that some timeout occurs, but I can't tell which. I thought it's
probably somehow related to the udev resolution issues, which
libnss-ldapd handles differently from libnss-ldap: libnss-ldap produces
a significant delay when booting because groups can't be resolved while
LDAP is not yet accessible, whereas libnss-ldapd handles this
gracefully. Maybe you gather invalid results while booting, because LDAP
is not accessible. But I don't see why nslcd should cache these results,
so I think my idea is absurd.

The problem is reproducible with or without nscd running, so the problem
is not related to it.

The problem seems not to be related to the groups which contain spaces,
except that it spams the log every second with error messages unless my
patch is applied.

The problem does not occur with libnss-ldap, so the problem is specific
to libnss-ldapd.

I've chosen severity serious for this issue because on the one hand the
problem would fit severity 'Critical', because it "makes unrelated
software on the system (or the whole system) break", but then again I
felt uncomfortable with it, because the problem does not persist over
the uptime of the system and after 10-30 minutes the problem disappears.
But I think it should definitely be fixed for lenny.

Best Regards,
Patrick

#500778#10
Date:
2008-10-01 20:27:04 UTC
From:
To:
Could you provide some more details? Is the LDAP server on the system
that also runs nss-ldapd, what options do you use, which LDAP server
software etc? Your configuration file should also help.

I don't understand this. If you perform getent passwd and getent group
you get the expected result but if you do ls -l the files are reported
as nobody:nogroup?

If ls can't resolve numeric user and group ids it should print the
numeric form, not make up something.

Can you produce logs of nslcd? It should report whether the LDAP server
was reachable or not. If you can run nslcd with the -d option it should
report more information that will help in tracking this down.

Note that for logging in you also need pam_ldap which has its own
configuration. If the problem is in that you should probably also
provide information about that.

nslcd only caches the relationship between DNs and uids for group
membership lookups (when the uniqueMember attribute is used). This
timeout is hardcoded at 15 minutes. Other than that I can't think of a
timeout that long unless you set it that high in the config.

The way nss-ldapd solves the udev problem is by not doing LDAP lookups
that early during boot at all and "fail" quickly. Only when nslcd is
started are lookups attempted. In any case I can't think of a case where
getent passwd should work and ls would fail.

One known issue (#475626) is related to the order in which nslcd is
started during boot. If the LDAP server is unavailable when nslcd is
started a timeout could occur and the LDAP server will not be found
immediately when it is available.

I am inclined to lower it to important because it seems to work in a lot
of common environments.

I hope to fix this soon. Thanks for your bugreport.

#500778#15
Date:
2008-10-01 21:07:26 UTC
From:
To:
Hi Arthur,

Yep, I can. I'm just unsure which information is of interest (I'm at a
point where I'm kinda clueless what's the cause of the trouble :/).

No, it runs on another host. I don't use any special options. In fact
the configuration is the default configuration, except the server
address and the search base.

root@teekanne:~# grep -v '\(^#\|^$\)' /etc/nss-ldapd.conf
uri ldap://majestix-linux.intra.in-medias-res.com
base dc=intra,dc=in-medias-res,dc=com
uid nslcd
gid nslcd

The LDAP server is a usual slapd as it is in Etch:

slapd (2.3.30-5+etch1)

Right. Sometimes all files are "owned" by nobody:nogroup but the most
common problem is that only groups are a problem. And yes, while the
problem exists getent passwd and getent group show up groups properly.

Well, I think this is related to the fact that it is a NFSv4 filesystem.
nobody:nogroup is what idmapd from NFS does if it cannot properly
resolve the ids.

OK. I will add these logs ASAP.

Well, the problem is not the login per se, but that some programs (for
example GNOME) simply do not work, because they can't read their settings
(if the nobody problem exists as well. if the groups are the only
problem, then only accessing shared files is a problem)

I would have said first, that 15 minutes could be the time frame, but
then again: no. Today I saw the problem disappearing after more than
half an hour.

Well, sounds reasonable and I don't see why this should cause the
problems.

Well, yes, that's true. But on the other hand it has a serious effect on
the functionality of the system as a whole (because it is a client that
mounts /home etc. from the server), so I felt serious is a good
compromise.

No bug report, no solution, right? So no need to thank me, instead I
thank you if you'd find a solution for it.

Best Regards,
Patrick

#500778#20
Date:
2008-10-02 08:28:22 UTC
From:
To:
Hi,

schoenfeld@teekanne ~ % ls -l test
-rw-rw-r-- 1 schoenfeld nogroup 0 12. Sep 09:49 test

Interestingly enough: The symptom is similar to the system behaviour if
nslcd is _not_ running. Then all files resolve to nobody:nogroup.

However there is no problem visible from the log.

Best Regards,
Patrick

#500778#25
Date:
2008-10-02 22:18:47 UTC
From:
To:
With nfs4 (I've been doing some reading up but still have no first-hand
experience), a user that doesn't exist is generally mapped to
nobody:nogroup.

The mapping is done by idmapd but at some point in combination with
something in the kernel. From what I understand from scanning the idmapd
code, there is a default cache expiry time (in the kernel) of 600
seconds (10 minutes). The current value should be available
in /proc/sys/fs/nfs/idmap_cache_timeout.
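
For reference, a minimal Python sketch for reading that value on a
client. The function name is made up for illustration, and the file may
be absent on some systems, in which case the sketch returns None rather
than guessing a default.

```python
from pathlib import Path


def read_idmap_cache_timeout(path="/proc/sys/fs/nfs/idmap_cache_timeout"):
    """Return the timeout stored in `path` as an int.

    Returns None if the file is missing, unreadable or not a number.
    """
    try:
        return int(Path(path).read_text().strip())
    except (OSError, ValueError):
        return None


if __name__ == "__main__":
    print(read_idmap_cache_timeout())
```

The path is a parameter only so that the function can be tried against
a scratch file without touching /proc.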

My guess is that name lookups are cached in idmapd. Can you check that
by restarting idmapd (/etc/init.d/nfs-common restart) the problem goes
away?

On my system, idmapd is started way before nslcd and it probably isn't a
good idea to start nslcd before idmapd. There seems to be an undocumented
Cache-Expiration option in the General section of /etc/idmapd.conf that
could help to bring down the cache timeout value.

Can you check the idmapd logs for anything out of the ordinary? Perhaps you
can increase the verbosity in /etc/idmapd.conf.

Thanks. Perhaps I should set up a test environment myself with NFS4. Do
you have some pointers for that? (I use NFS3 myself.)

#500778#30
Date:
2008-10-04 07:52:00 UTC
From:
To:
Hi,

right.

Nope, it does not.
Increasing the verbosity (default: 3, tried up to 10) does not seem to
change anything.
Basically this is all:

Oct  3 09:46:36 teekanne rpc.idmapd[3309]: libnfsidmap: using domain:
localdomain
Oct  3 09:46:36 teekanne rpc.idmapd[3309]: libnfsidmap: using
translation method: nsswitch
Oct  3 09:46:36 teekanne rpc.idmapd[3310]: Expiration time is 600
seconds.
Oct  3 09:46:36 teekanne rpc.idmapd[3310]: Opened
/proc/net/rpc/nfs4.nametoid/channel
Oct  3 09:46:36 teekanne rpc.idmapd[3310]: Opened
/proc/net/rpc/nfs4.idtoname/channel
Oct  3 09:46:36 teekanne rpc.idmapd[3310]: New client: 0
Oct  3 09:46:36 teekanne rpc.idmapd[3310]: Opened
/var/lib/nfs/rpc_pipefs/nfs/clnt0/idmap
Oct  3 09:46:36 teekanne rpc.idmapd[3310]: New client: 1
Oct  3 09:47:23 teekanne rpc.idmapd[3310]: Client 0: (user) id "30010"
-> name "schoenfeld@localdomain"
Oct  3 09:47:23 teekanne rpc.idmapd[3310]: Client 0: (group) id "65534"
-> name "nogroup@localdomain"

That's not a big deal. You need to set up an export entry much as you
would for NFSv3; however, there is a fundamental difference: with NFSv4
you export an NFS root rather than single exports. So you probably want
to set up a virtual export directory. It's described here [1].

Best Regards,
Patrick

[1] http://www.crazysquirrel.com/computing/debian/servers/setting-up-nfs4.jspx

#500778#35
Date:
2008-10-03 21:05:46 UTC
From:
To:
(Cc-ing the nfs-utils maintainers, perhaps they have some insight that
could solve this)

I have been able to reproduce this. On the server I have in /etc/exports
(/export/newhome is a bind-mounted /home with half a dozen users):

/export         192.168.1.0/24(ro,sync,insecure,root_squash,no_subtree_check,fsid=0)
/export/newhome 192.168.1.0/24(rw,nohide,sync,insecure,root_squash,no_subtree_check)

On the client I have in /etc/fstab:

fs:/newhome    /mnt        nfs4 rw 0 0

Now if I stop nslcd (all name lookup calls should now return
NSS_STATUS_UNAVAIL/ENOENT) an 'ls -l /mnt' shows:

[...]
drwx-----x 148 nobody users 12288 Oct  3 21:02 arthur
[...]

(the user arthur from the server is mapped to the user nobody on the
client because the name lookup failed). With some more verbose logging
rpc.idmapd shows:

[...]
rpc.idmapd: nfs4_name_to_uid: calling nsswitch->name_to_uid
rpc.idmapd: nss_getpwnam: name 'arthur@localdomain' domain 'localdomain': resulting localname 'arthur'
rpc.idmapd: nss_getpwnam: name 'arthur' not found in domain 'localdomain'
rpc.idmapd: nfs4_name_to_uid: nsswitch->name_to_uid returned -2
rpc.idmapd: nfs4_name_to_uid: final return value is -2
rpc.idmapd: Client 16: (user) name "arthur@localdomain" -> id "65534"
[...]

If I repeat the ls command a couple of times rpc.idmapd no longer logs
the failed lookups and a strace of rpc.idmapd also shows that that
process is no longer asked (by the kernel?) to look up the user.

If I then start nslcd (now name lookups should be performed as usual and
getent shows that they do) the results aren't quickly fixed.

After a while (I've been messing about with stuff in /proc so I don't
know how long this normally takes) the kernel asks rpc.idmapd again to
look up user arthur (and the other users in the filesystem). Also note
that the bugreporter had problems with groups and I've reproduced the
behaviour with users.

[...]
drwx-----x 148 arthur users 12288 Oct  3 21:02 /mnt/arthur
[...]


Now the question is, how should this caching mechanism be tuned and how
should we solve this problem? Is there a reliable way to flush the
cache? There seems to be /proc/net/rpc/nfs4.nametoid which contains some
stuff that could be relevant and /proc/sys/fs/nfs/idmap_cache_timeout.

However setting /proc/sys/fs/nfs/idmap_cache_timeout or Cache-Expiration
does not result in the expected timeout in seconds (read from the
idmapd.c). Setting it to 10 results in a retry every 30 to 60 seconds,
setting it to 100 seems to result in a retry in 60-120 seconds. Also,
writing to /proc/net/rpc/nfs4.idtoname/flush
and /proc/net/rpc/nfs4.nametoid/flush (like is done in
flush_nfsd_idmap_cache()) doesn't seem to make a difference.

I haven't had a look at the kernel code yet (this is running kernel
Linux 2.6.26-1-686 (SMP w/2 CPU cores)).


Patrick, does adding "Cache-Expiration = 10" to /etc/idmapd.conf in the
[General] section help at all in your setup? (the correct values should
be loaded sooner)

#500778#40
Date:
2008-10-06 09:42:30 UTC
From:
To:
Hi,

2008/10/3 Arthur de Jong <adejong@debian.org>:

Very good. This improves the situation a lot. It's a good workaround.
Now if you'd find the reason why the behaviour differs from
libnss-ldap and could enhance libnss-ldapd in this way, this would be
great :-))

Best Regards,
Patrick

#500778#45
Date:
2008-10-13 20:17:15 UTC
From:
To:
retitle 500778 nss-ldapd: problem resolving groups and users with nfs4
severity 500778 important
tags 500778 + help
thanks

I am lowering the severity of this bug for now because the problem is
limited to using nss-ldapd in combination with nfs4 and there is a
workaround (adding "Cache-Expiration = 10" to /etc/idmapd.conf).

I will try to investigate this some more but help is appreciated with
this.

#500778#56
Date:
2008-10-13 22:18:11 UTC
From:
To:
I have been able to reproduce the same behaviour with nss_ldap. If you
freshly mount a filesystem while the LDAP server is unavailable the
kernel will not re-ask idmapd to look up the usernames until the timeout
has expired.


I have dug a little through the code (nfs-utils, libnfsidmap and kernel)
and from what I understand the kernel should not cache negative
lookups. But idmapd seems to map IDMAP_STATUS_LOOKUPFAIL to
IDMAP_STATUS_SUCCESS which causes the kernel to remember the mapping.
This is done in:

nfs-utils-1.1.3/utils/idmapd/idmapd.c:674:

	/* XXX: I don't like ignoring this error in the id->name case,
	 * but we've never returned it, and I need to check that the client
	 * can handle it gracefully before starting to return it now. */

	if (im.im_status == IDMAP_STATUS_LOOKUPFAIL)
		im.im_status = IDMAP_STATUS_SUCCESS;

Not sure who made the comment and whether it is still valid. If this
is fixed this would result in negative entries not being cached at all
(except by nscd if it is enabled but the kernel would ask idmapd which
would ask nscd).

Looking through the kernel code (fs/nfs/idmap.c), there is no way to
flush the cache. Also, the value of /proc/sys/fs/nfs/idmap_cache_timeout
at the time the cache entry was created is used so it's no use in
lowering the value after the fact.
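
To make that last point concrete, here is a toy model in Python of a
cache whose expiry is fixed when an entry is created. It only
illustrates the behaviour as I read it from the code; it is not the
kernel's implementation. The class name, the dictionary standing in for
LDAP and the uid 1000 are all made up for the example.

```python
NOBODY = 65534  # fallback id seen in the idmapd logs in this report


class ToyIdmapCache:
    """Toy model of a name -> id cache; NOT the kernel's fs/nfs/idmap.c."""

    def __init__(self, lookup, timeout):
        self.lookup = lookup      # callable: name -> id (or NOBODY)
        self.timeout = timeout    # seconds, read only when an entry is created
        self._entries = {}        # name -> (id, expires_at)

    def resolve(self, name, now):
        entry = self._entries.get(name)
        if entry is not None and now < entry[1]:
            return entry[0]       # served from cache, lookup is not called
        value = self.lookup(name)
        # The expiry is fixed here, using the timeout in force right now.
        self._entries[name] = (value, now + self.timeout)
        return value


directory = {}  # stands in for LDAP; empty means "nslcd not running"
cache = ToyIdmapCache(lambda name: directory.get(name, NOBODY), timeout=600)

first = cache.resolve("arthur", now=0)     # failed lookup gets cached
directory["arthur"] = 1000                 # nslcd comes up (hypothetical uid)
cache.timeout = 10                         # lowering the timeout afterwards
stale = cache.resolve("arthur", now=60)    # still the cached fallback
fresh = cache.resolve("arthur", now=600)   # entry expired, looked up again
print(first, stale, fresh)
```

In this model the second call still returns the fallback id even though
the timeout was lowered to 10, because the entry created at time 0 keeps
its original 600 second expiry.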

That means that I think the only way to fix this in the short term is
to remove the LOOKUPFAIL to SUCCESS mangling from idmapd.c (which could
have other side effects) or to apply the workaround as described before.

Note that I have only read code and not done extensive debugging by
deploying modified versions of either the kernel or idmapd.

One thing that remains a little puzzling in the kernel code is the
cache retry. I can't explain the strange
timeout if you set the cache value really low like 1 jiffy. Then again I
don't know enough about jiffies and kernel internals to go hunting that
problem anyway.


What nss-ldapd could do is document that the Cache-Expiration option be
set. Perhaps a check could be implemented with a debconf note during
package installation.
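
A rough sketch of what such a check could look like, in Python rather
than as an actual maintainer script. It assumes /etc/idmapd.conf is
close enough to INI syntax for configparser to read; the function name
is hypothetical and this is not how any existing package does it.

```python
import configparser


def cache_expiration(conf_text):
    """Return Cache-Expiration from the [General] section as an int.

    Returns None if the option is absent, not a number, or the text
    cannot be parsed as INI-style configuration.
    """
    parser = configparser.ConfigParser(strict=False, interpolation=None)
    try:
        parser.read_string(conf_text)
    except configparser.Error:
        return None
    # Option names are matched case-insensitively by configparser.
    value = parser.get("General", "Cache-Expiration", fallback=None)
    if value is None:
        return None
    try:
        return int(value)
    except ValueError:
        return None


if __name__ == "__main__":
    try:
        with open("/etc/idmapd.conf") as conf:
            print(cache_expiration(conf.read()))
    except OSError:
        print("no /etc/idmapd.conf found")
```

A package could warn when this returns None or a large value, leaving
the decision to the administrator.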

Another option would be to start nslcd before nfs-common. This however
would probably break an environment where /usr is mounted over NFS. Also
that would cause problems because it is best to start nslcd after slapd.

#500778#61
Date:
2008-10-14 07:48:35 UTC
From:
To:
Hi Arthur,

That does not seem to be the root of the problem. I've built nfs-utils
with these lines commented out on one of my systems and disabled the
workaround in idmapd and the problem persists.

Hmm. Probably the workaround should then be included in the default
configuration of idmapd. It seems not to cause any harm and works around
these problems and IMHO its unlikely that this can be fixed *properly*
for lenny. What do you think about this approach? Shall we ask the NFS
maintainers about this change to the default configuration?

Best Regards,
Patrick

#500778#66
Date:
2008-10-14 08:28:06 UTC
From:
To:
Thanks for investigating this. Another thought occurred to me that the
kernel could be caching the contents of the directory at another level
(e.g. it could cache the directory information without ever hitting any
idmap code until that cache is expired).

If the NFS maintainers think this does not cause problems then I think
this will be the best solution for the short term. The only downside that
I can think of is that there might be some reduced performance because the
name to id lookups need to be done more frequently.

Can you open a new bugreport on nfs-utils?

For the longer term the kernel should probably provide a mechanism to
flush the idmap cache.

#500778#73
Date:
2013-08-27 19:32:02 UTC
From:
To:
I recently came across the nfsidmap -c option. I haven't thoroughly
tried to reproduce the problem but nslcd 0.9.1-1 in experimental has an
option to flush various caches. You could put

reconnect_invalidate nfsidmap

in nslcd.conf. I'm not 100% sure if this fixes the problem but can you
reproduce the problem with that option set?

Thanks,

#500778#81
Date:
2014-06-10 10:36:55 UTC
From:
To:
Just set sec=sys in both the exports entry on the server and the fstab options on the client, and it works. Tested on CentOS 6.5 and Ubuntu 12.04 (client)/14.04 (server); it should work with Debian too.
---
Roy Sigurd Karlsbakk <roysk@hioa.no>
Overingeniør, IT drift, HiOA
(+47) 9801 3356 / 6723 5827

#500778#86
Date:
2014-08-18 08:14:13 UTC
From:
To:
This is happening quite often to me, also with the proposed workaround of "Cache-Timeout = 10" (both on server and clients).
I don't have problems at boot time, but I get random uid/gid 4294967294, especially on file creation, and they disappear without any intervention, usually within a few minutes.
The NFS server is on Debian Wheezy and clients on Debian Testing. I also have a Debian Wheezy NFS client that doesn't have this problem at all, so I tried downgrading

nslcd (to 0.8.10-4)
libnss-ldapd (to 0.8.10-4)
libpam-ldapd (to 0.8.10-4)
nfs-common (to 1.2.6-4)
libnfsidmap2 (to 0.25-4)
libnss3 (to 2:3.14.5-1)

to match the versions on Wheezy but it's not helping. I also tried the "reconnect_invalidate nfsidmap" on the latest versions of nslcd (on Debian Testing) and it's also not helping. I already have sec=sys enabled in the NFS mount options.

Thanks,
Daniele

#500778#91
Date:
2015-09-24 11:22:21 UTC
From:
To:
Hi there!

It seems I'm also affected by this bug; as you can see, the uid/gid
of many users gets mapped to 4294967294:

tor@host:~$ ll -a $HOME|head -6
drwxr-xr-x 16 tor        root        752 Sep 24 13:00 .
drwxr-xr-x 25 nobody     nogroup     632 Jul 23 18:21 ..
drwx------  3 4294967294 4294967294   80 Dec 25  2012 .adobe
-rw-------  1 4294967294 4294967294 2.0K Sep 24 13:00 .bash_history
-rw-r--r--  1 4294967294 root       3.1K Sep 20  2013 .bashrc

After a few minutes the issue is gone:

tor@host:~$ ll -a $HOME|head -6
drwxr-xr-x 16 tor    root     752 Sep 24 13:00 .
drwxr-xr-x 25 nobody nogroup  632 Jul 23 18:21 ..
drwx------  3 tor    kassa     80 Dec 25  2012 .adobe
-rw-------  1 tor    kassa   2.0K Sep 24 13:00 .bash_history
-rw-r--r--  1 tor    root    3.1K Sep 20  2013 .bashrc

Unfortunately I have no idea how to debug this further :/
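
One small observation on the number itself, as a sketch in Python:
4294967294 is what -2 looks like when stored in an unsigned 32-bit id,
just as 65534 (the nobody/nogroup id seen in the idmapd logs earlier in
this report) is -2 in 16 bits. The helper name is made up for
illustration. This does not explain the failed mapping; it only
suggests the value is a placeholder for an unmapped id rather than a
random number, which nothing in this report confirms.

```python
def as_unsigned(value, bits):
    """Interpret a (possibly negative) integer as an unsigned value of the given width."""
    return value % (1 << bits)


print(as_unsigned(-2, 32))  # 4294967294
print(as_unsigned(-2, 16))  # 65534
```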

Cheers,
Simgund

#500778#96
Date:
2015-11-08 20:30:37 UTC
From:
To:
I have a similar issue, but in my case the invalid mapping does not go
away. I'm running Jessie with nslcd to authenticate against samba4 AD
and have a NAS configured as member server. To Linux clients it serves
NFS4 sec=krb5p. Actually I have 2 machines, which I think are configured
identically concerning nslcd, kerberos, and NFS.

On one machine (fresh Jessie install) everything works perfectly.

On the other machine (upgrade from wheezy) everything worked perfectly
on wheezy and indeed I noticed the issue only after several days, when I
first did ls -la on one of the imported drives. It looks extremely strange:

drwxrwxrwx 48 nobody 4294967294    4096 Sep 16 21:09 .
drwxr-xr-x  9 root   root          4096 Nov 23  2014 ..
drwxr-xr-x 38 nobody 4294967294    4096 Okt 23 07:16 adm
drwxr-xr-x  4 nobody ad_users      4096 Okt 14  2013 admin
-rw-r-xr--  1 nobody ad_users      1219 Okt 10  2001 adsl_suse71
[...]

The same directory on the other system:

drwxrwxrwx 48 mgr  lars        4096 Sep 16 21:09 .
drwxr-xr-x  9 root root        4096 Nov 23  2014 ..
drwxr-xr-x 38 mgr  lars        4096 Okt 23 07:16 adm
drwxr-xr-x  4 mgr  ad_users    4096 Okt 14  2013 admin
-rw-r-xr--  1 mgr  ad_users    1219 Okt 10  2001 adsl_suse71
[...]

which is the same as what I see on the NAS.

However, I can read everything, but if I create new files they're
created as guest:users on the NAS, which maps to nobody:users on the
machine where everything is all right.

I have a similar situation as described in this bug report during
start-up. Due to trouble with k5start, nslcd is not available, when NFS
is started on boot. I usually start nslcd manually from a root prompt.

As said, it does not go away. Changing the expiration does not change a
thing. The group number 4294967294 seems to pop out of thin air. It's
not the number of the 'lars' group on any system involved. Checking
syslog (Verbosity=9) it turns out that idmapd doesn't ever look up the
names. This is a log following a nfs-common restart, several 'ls' and a
'touch':

Nov  8 21:02:09 midgard rpc.idmapd[11283]: libnfsidmap: using domain:
ad.microsult.de
Nov  8 21:02:09 midgard rpc.idmapd[11283]: libnfsidmap: Realms list:
'AD.MICROSULT.DE'
Nov  8 21:02:09 midgard rpc.idmapd[11283]: libnfsidmap: loaded plugin
/lib/x86_64-linux-gnu/libnfsidmap/nsswitch.so for method nsswitch
Nov  8 21:02:09 midgard rpc.idmapd[11284]: Expiration time is 10 seconds.
Nov  8 21:02:09 midgard rpc.idmapd[11284]: Opened
/proc/net/rpc/nfs4.nametoid/channel
Nov  8 21:02:09 midgard rpc.idmapd[11284]: Opened
/proc/net/rpc/nfs4.idtoname/channel
Nov  8 21:02:09 midgard rpc.idmapd[11284]: New client: 13
Nov  8 21:02:09 midgard rpc.idmapd[11284]: New client: 14
Nov  8 21:02:09 midgard rpc.idmapd[11284]: Opened
/run/rpc_pipefs/nfs/clnt14/idmap
Nov  8 21:02:09 midgard rpc.idmapd[11284]: New client: 15
Nov  8 21:02:09 midgard nfs-common[11269]: Starting NFS common
utilities: statd idmapdrpc.idmapd: libnfsidmap: using domain:
ad.microsult.de
Nov  8 21:02:09 midgard nfs-common[11269]: rpc.idmapd: libnfsidmap:
Realms list: 'AD.MICROSULT.DE'
Nov  8 21:02:09 midgard nfs-common[11269]: rpc.idmapd: libnfsidmap:
loaded plugin /lib/x86_64-linux-gnu/libnfsidmap/nsswitch.so for method
nsswitch
Nov  8 21:15:29 midgard nfsidmap[11442]: key: 0x154df42a type: uid
value: guest@ad.microsult.de timeout 600
Nov  8 21:15:29 midgard nfsidmap[11442]: nfs4_name_to_uid: calling
nsswitch->name_to_uid
Nov  8 21:15:29 midgard nfsidmap[11442]: nss_getpwnam: name
'guest@ad.microsult.de' domain 'ad.microsult.de': resulting localname
'guest'
Nov  8 21:15:29 midgard nfsidmap[11442]: nss_getpwnam: name 'guest' not
found in domain 'ad.microsult.de'
Nov  8 21:15:29 midgard nfsidmap[11442]: nfs4_name_to_uid:
nsswitch->name_to_uid returned -2
Nov  8 21:15:29 midgard nfsidmap[11442]: nfs4_name_to_uid: final return
value is -2
Nov  8 21:15:29 midgard nfsidmap[11442]: nfs4_name_to_uid: calling
nsswitch->name_to_uid
Nov  8 21:15:29 midgard nfsidmap[11442]: nss_getpwnam: name
'nobody@ad.microsult.de' domain 'ad.microsult.de': resulting localname
'nobody'
Nov  8 21:15:29 midgard nfsidmap[11442]: nfs4_name_to_uid:
nsswitch->name_to_uid returned 0
Nov  8 21:15:29 midgard nfsidmap[11442]: nfs4_name_to_uid: final return
value is 0

So it obviously tries to resolve guest (during touch or the following
ls), but it never looked up any other name in 13 minutes with an expiry
time of 10 seconds. So it seems to be similarly related to caching of
negative results.

Please let me know if I can help with additional input.