Summary:
When distcc fails to distribute, it should give a reason, for each
host, why that host could not be used. (At the very least, it
should do so if the task fails.)
One of the hosts in our build cluster is broken:
osstest@army:~$ DISTCC_HOSTS=armpit/4 distcc gcc -Wall -c t.c
distcc[6366] (dcc_readx) ERROR: failed to read: Connection reset by peer
distcc[6366] (dcc_r_token_int) ERROR: read failed while waiting for token "DONE"
distcc[6366] (dcc_r_result_header) ERROR: server provided no answer. Is the server configured to allow access from your IP address? Does the server have the compiler installed? Is the server configured to access the compiler?
distcc[6366] Warning: failed to distribute t.c to armpit/4, running locally instead
osstest@army:~$ DISTCC_HOSTS=armpit/4 distcc gcc -Wall -c t.c
distcc[6372] (dcc_build_somewhere) Warning: failed to distribute, running locally instead
osstest@army:~$ DISTCC_FALLBACK=0 DISTCC_HOSTS=armpit/4 distcc gcc -Wall -c t.c
distcc[6399] (dcc_build_somewhere) Warning: failed to distribute and fallbacks are disabled
osstest@army:~$
There are some problems with these error messages. This bug report is
about the messages from the second and third runs above. It appears
from strace that the second and third runs weren't distributed to
armpit because of a backoff algorithm built into distcc:
mkdir("/local/scratch/osstest/.distcc", 0777) = -1 EEXIST (File exists)
mkdir("/local/scratch/osstest/.distcc/lock", 0777) = -1 EEXIST (File exists)
stat64("/local/scratch/osstest/.distcc/lock/backoff_tcp_armpit_3632_0", {st_mode=S_IFREG|0644, st_size=1, ...}) = 0
gettimeofday({1407754354, 166974}, NULL) = 0
write(2, "distcc[6430] (dcc_build_somewher"..., 92) = 92
But the second and third runs give no explanation of the problem.
IMO distcc ought to say why the task couldn't be run on armpit.
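To illustrate the kind of message I mean, here is a rough Python sketch of a backoff check that returns a reason string instead of failing silently. It is not distcc's actual code: the function name, the 60-second period and the use of the marker file's mtime are assumptions on my part. Only the file name pattern is taken from the strace output above.

```python
import os
import time

# Assumed backoff period in seconds; the real value may differ.
BACKOFF_PERIOD = 60


def backoff_reason(lock_dir, host, port, slot=0, now=None):
    """Return a human-readable reason if `host` is in backoff, else None."""
    # File name pattern as seen in the strace output.
    path = os.path.join(lock_dir, f"backoff_tcp_{host}_{port}_{slot}")
    try:
        mtime = os.stat(path).st_mtime
    except FileNotFoundError:
        return None  # no marker file, so the host is not backed off
    if now is None:
        now = time.time()
    age = now - mtime
    if age >= BACKOFF_PERIOD:
        return None  # marker is stale, the host may be tried again
    return (f"{host}: skipped, a previous attempt failed {age:.0f}s ago "
            f"(backoff for another {BACKOFF_PERIOD - age:.0f}s)")
```

With something like this, the second run could have said that armpit was skipped because of a recent failure, rather than only "failed to distribute".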
I think the best way to resolve this would be to record somewhere a
failure reason for each candidate host, and then report all those
reasons iff no host is suitable.
You might consider this too verbose when fallback mode is enabled
and the build succeeds on localhost. But at the very least the
complete report should be made if the task actually fails.
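The proposal could be sketched roughly as follows, in Python for brevity. This is only an illustration under assumptions: `try_hosts`, the `attempt` callback and the `report_on_fallback` switch are hypothetical names, not anything in distcc. The two warning strings are the ones from the transcript above.

```python
import sys


def try_hosts(hosts, attempt, fallback_enabled,
              report_on_fallback=True, log=sys.stderr):
    """Try each host in turn, remembering why each one was rejected.

    `attempt(host)` is a hypothetical callback that returns None on
    success, or a string saying why that host could not be used.
    Returns the host used, "localhost" on fallback, or None on failure.
    """
    reasons = []  # (host, reason) for every rejected candidate
    for host in hosts:
        reason = attempt(host)
        if reason is None:
            return host  # distributed successfully, nothing to report
        reasons.append((host, reason))

    # No host was suitable.
    if fallback_enabled:
        print("Warning: failed to distribute, running locally instead",
              file=log)
    else:
        print("Warning: failed to distribute and fallbacks are disabled",
              file=log)

    # Always explain when the task fails; optionally also on fallback.
    if not fallback_enabled or report_on_fallback:
        for host, reason in reasons:
            print(f"  {host}: {reason}", file=log)

    return "localhost" if fallback_enabled else None
```

Setting `report_on_fallback=False` gives the quieter behaviour for builds that succeed locally, while the per-host reasons are still printed whenever the task actually fails.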
Ian.