Just after rebooting a slave server, this is the result on the client:

klecker:~$ su
Password:
do_ypcall: clnt_call: RPC: Unable to receive; errno = Connection refused
Segmentation fault

The same happens with login, sudo and any other auth program. Authentication is now simply locked; even root cannot log in on a tty or pty. When the slave comes back up, the status does not change at all. The only solution is a hard reboot by switching the power off :-(

The ypbind client is called with -no-ping. That could make the difference, possibly.
From reading the source code, indeed if you call it with -no-ping,
ypbind will bind to a server at startup and after that it won't
rebind to another server ever. It appears that that is exactly what
that option is there for ...
You can force it to rebind by sending a SIGHUP to ypbind. Currently,
/etc/init.d/nis reload doesn't signal ypbind with SIGHUP, so that
could perhaps be considered a bug.
However why are you running with -no-ping ? ("doctor doctor it
hurts when I do this" ;) )
Mike.
It's a bit difficult given the lack of documentation. I guess it would be reasonable to interpret it as meaning that the regular probes for the fastest server should be disabled but still redo discovery if it detects an error.
That could be possible if I could do that as a non-privileged user. Unfortunately that's impossible without su, sudo or anything like that. Moreover, that segfault is not a good thing.

Strictly due to port filtering. Without that, ypbind hangs at startup.
Having looked through the code some more this doesn't seem at all practical - unless ypbind is probing for servers it really has no idea if the server is working.
retitle 251108 ypbind does not rebind if server dies with -noping
thanks

You may also be able to use the ypbind RPC interface to cause the bindings to be re-probed by using the ypset program (provided you started ypbind with the -ypset option). This is rather a security hole, though, and will be unacceptable for many setups.

The segfaults are up to the client code - presumably some part of either su or pam_unix needs better error checking for RPC failures from NIS. That sounds like a problem which should be addressed anyway.

Moreover, looking at the code I can't immediately tell why this would help at all - the initial server discovery process shouldn't be changed by this option. Could you please describe in more detail the setup that you've got so that I can try to reproduce what's going on?

Similarly, the failure to recover when the server comes back on-line ought to be investigated - I'll try to have a look at that over the weekend.

As far as I can tell the behaviour that's being seen is unfortunate but what was requested - -no-ping tells ypbind not to test the servers periodically, so it doesn't do so. Unfortunately, without those periodic tests it becomes reliant on some external mechanism to tell it when it needs to re-probe, and at the minute there's no such mechanism.

While you need -no-ping you could perhaps try using a cron job to send SIGHUP to ypbind periodically.
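The cron workaround suggested above could look roughly like the following. This is only a sketch: the pid file path /var/run/ypbind.pid is an assumption about where ypbind records its pid on the system in question, and the function name and its `kill` parameter are made up for illustration (the parameter exists only so the logic can be exercised without signalling a real process).

```python
import os
import signal

# Assumption: ypbind writes its pid here; adjust for your system.
PID_FILE = "/var/run/ypbind.pid"


def hup_ypbind(pid_file=PID_FILE, kill=os.kill):
    """Send SIGHUP to the process named in pid_file.

    Returns the pid that was signalled, or None if the pid file is
    missing or does not contain a number.
    """
    try:
        with open(pid_file) as f:
            pid = int(f.read().strip())
    except (OSError, ValueError):
        return None
    kill(pid, signal.SIGHUP)
    return pid


if __name__ == "__main__":
    hup_ypbind()
```

Run from root's crontab at whatever interval is tolerable, this would make ypbind re-check its binding without anyone needing su or sudo on the affected client.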
That's not a good reason for it to work as it does. BTW, 'ypcat passwd' does work while ypbind goes crazy and causes segfaults during authentication, which is really strange. My guess is that there are some grave issues in the dialog between libc and ypbind, but I have not read the code at all.

Did you replicate the problem? I have a master and a slave here, and the problem appears whenever I reboot either the first or the second server.
ypbind should answer something valid.

A master and a slave server. The master is a Tru64 box, the slave is Debian sarge. The client has all incoming tcp/udp ports filtered, except for ssh.

# Generated by iptables-save v1.2.9 on Tue Mar 23 18:13:00 2004
*filter
:INPUT DROP [408:28655]
:FORWARD ACCEPT [0:0]
:OUTPUT ACCEPT [805:119348]
-A INPUT -p icmp -j ACCEPT
-A INPUT -m state --state RELATED,ESTABLISHED -j ACCEPT
-A INPUT -i ! eth0 -m state --state NEW -j ACCEPT
-A INPUT -p tcp --dport 22 -j ACCEPT
COMMIT
# Completed on Tue Mar 23 18:13:00 2004

Yep, that's a possible option. I'll try it as a workaround. Consider that all setuid programs apparently fail during the event (e.g. fetchmail when calling procmail), but hopefully crond should work.
That's really weird, since ypcat and the glibc routines both query ypbind to find the domain's NIS server.

The important thing is, what does 'ypwhich' say. 'ypwhich' queries ypbind for the current NIS server. If it doesn't print anything, or errors out, you know the domain is not bound.

Your app segfaulted, not ypbind, right? Your app crashing, that's a libc6 bug, or perhaps even a bug in the app. Ypbind doesn't have anything to do with that - whatever the result of a (libc) conversation with a (ypbind) daemon, crashing with SEGV is always a bug.

But why did you use -no-ping in the first place? I don't think it was meant for general usage, the man page talks about systems on a dialup line.

That would at least get the systems running reliably again.

Mike.
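The ypwhich check described in this message can be scripted, for instance to log the binding state while a server is being rebooted. A minimal sketch, assuming ypwhich is on the PATH and exits non-zero when the domain is not bound; the function name and its `run` parameter are illustrative only:

```python
import subprocess


def bound_server(run=subprocess.run):
    """Return the NIS server reported by ypwhich, or None if unbound."""
    try:
        result = run(["ypwhich"], capture_output=True, text=True, timeout=10)
    except (OSError, subprocess.TimeoutExpired):
        return None  # ypwhich is missing or did not answer in time
    name = result.stdout.strip()
    # No output or an error exit both mean the domain is not bound.
    if result.returncode != 0 or not name:
        return None
    return name
```

A None result while 'ypcat passwd' still works would be worth reporting, since both are expected to rely on the same binding.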
True, but that's exactly the problem. I cannot currently reboot the servers; I'll do that again during the weekend in order to better specify the problem. BTW, the client is currently using the slave. I have had the same event several times, and only recently discovered that it's due to a temporarily lost connection with the NIS servers.

No, ypbind does work; any app that needs auth crashes, e.g. su. This issue could maybe have security implications, too.

All apps which need to query for auth, to be more precise, so su, sudo, login, fetchmail, exim, ... Definitely a libc problem.

The interesting point is whether that's truly a problem of ypbind or a libc one. I suspect ypbind returns something weird, unexpected or some NULL value, and that probably causes a SEGV at libc level. Any application crashes in the same way, so it's a libc6 problem. The correct behavior should be the one seen without -no-ping, that is a delay and a timeout (after that, auth continues with other modules if possible). The current behavior, which locks down any local auth, is not acceptable, IMHO.

Already answered, I need to keep that client filtered. Anyway, a bug is a bug, isn't it? "Do not use that feature" is not a fix :)
You can just null-route the server's IP to simulate it being unreachable.

with ypbind. The NIS routines in libc6 should have a (better) timeout, and handle unexpected errors.

BTW, I can't reproduce this on current libc6 versions:

# id miquels
do_ypcall: clnt_call: RPC: Timed out
do_ypcall: clnt_call: RPC: Timed out
id: miquels: No such user
ssr2:~# dpkg -s libc6
Package: libc6
Version: 2.3.2.ds1-11

Now, internal libc6 routines printing to stderr, _that_ is a bug... a bug for a different day, though.

Okay, I did some tests and read the source code. It can be fixed, but only by rewriting parts of libc6 and ypbind.

You see, libc6 doesn't query ypbind directly. Ypbind maintains "binding" files in /var/yp/binding/domain.* that indicate the currently active NIS server. The libc6 routines then read this file, and try to talk to the server as read from the file. If the server doesn't respond, some of the libc6 routines segfault (though I tried to reproduce that, and on my unstable systems commands like 'su' and 'id' just hang and eventually time out).

Now ypbind never knows that the NIS server isn't responding. Libc6 fails to talk to the NIS server, but ypbind isn't informed of that.

It *could* be fixed in the following way:

- If libc6 cannot contact the NIS server as read from /var/yp/binding/domain, it should do a YPBIND_DOMAIN RPC call to ypbind, i.e. query ypbind directly, then if it gets a reply, retry the NIS server once.

- The YPBIND_DOMAIN RPC call should actually check if the currently bound server is still alive, and try to rebind to a different server if it isn't.

- To maintain the -no-ping functionality for dialup systems, ypbind should not do those checks if only one server is listed in /etc/yp.conf, since there won't be another server to bind to anyway.

That way, libc6 has a way to 'kick' ypbind if the NIS server isn't responding.

I think this bug has to be forwarded upstream to the NIS maintainer, and a separate bug should be filed against libc6.

Mike.
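The fix proposed in this message can be sketched as a small simulation. This is not libc6 or ypbind code: the class, the function names and the liveness probe are all invented for illustration, and the real change would be made in C inside both packages. It only shows the intended control flow: try the server from the binding file, kick ypbind on failure, retry once.

```python
class ServerUnreachable(Exception):
    """Raised by the (hypothetical) query function when a server is silent."""


class Ypbind:
    """Toy model of ypbind's side of the proposal."""

    def __init__(self, servers, is_alive):
        self.servers = list(servers)  # servers listed in /etc/yp.conf
        self.is_alive = is_alive      # hypothetical liveness probe
        self.bound = self.servers[0]  # what the binding file would name

    def ypbind_domain(self):
        """Model of YPBIND_DOMAIN with the proposed liveness check."""
        if len(self.servers) == 1:
            # Single server: skip the check, keeping -no-ping semantics.
            return self.bound
        if self.is_alive(self.bound):
            return self.bound
        for candidate in self.servers:
            if candidate != self.bound and self.is_alive(candidate):
                self.bound = candidate  # rebind to a live server
                return candidate
        return None  # nothing reachable


def nis_lookup(key, ypbind, query_server):
    """Model of the libc6 side: one attempt, one kick, one retry."""
    server = ypbind.bound  # stands in for reading the binding file
    try:
        return query_server(server, key)
    except ServerUnreachable:
        server = ypbind.ypbind_domain()  # 'kick' ypbind directly
        if server is None:
            raise
        return query_server(server, key)  # retry exactly once
```

With two configured servers and the bound one down, the lookup in this model succeeds against the other server; with a single configured server the error simply propagates after one retry, which is the timeout-and-fail behaviour the thread asks for instead of a segfault.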
The workaround of killing -HUP ypbind in order to recheck the NIS server is not useful.

Moreover, my own impression is that stopping the NIS service on one of the servers without rebooting the box is not critical. I see problems every time one of the servers is simply rebooted.

Maybe we should reassign this bug to libc6? I'm not so sure....
Do you mean that the HUP has no effect? Could you clarify what you mean here? How long do you leave the service stopped when you try stopping the box compared to the amount of time it's stopped when you're rebooting? My guess would be that either some timeout expires during the reboot or an error kicked back by the network stack while the host is down confuses ypserv. Well, until ypbind can cope with rebinding there is little point in libc telling it about problems.