Upon

> library(snow)
> cl = makeCluster(7, type="MPI")

(on an 8-core machine), CPU utilization jumps *immediately* from 98% idle to 20% user, 70% system, 12% idle. strace reveals that each slave is spinning through poll() calls with timeout zero, rather than blocking until a message arrives, as the documentation for mpi.probe() suggests should happen.

I suppose this might be a problem in libopenmpi instead of the R binding; I haven't tried to reproduce it with anything lower-level.
Hi Zack,

On 29 September 2010 at 18:22, Zack Weinberg wrote:
| Package: r-cran-rmpi
| Version: 0.5-8-2
| Severity: normal
|
| Upon
|
| > library(snow)
| > cl = makeCluster(7, type="MPI")
|
| (on an 8-core machine), CPU utilization jumps *immediately* from 98% idle
| to 20% user, 70% system, 12% idle. strace reveals that each slave is
| spinning through poll() calls with timeout zero, rather than blocking
| until a message arrives, as the documentation for mpi.probe() suggests
| should happen.
|
| I suppose this might be a problem in libopenmpi instead of the R binding,
| I haven't tried to reproduce it with anything lower-level.

Very much so. It is "permanent polling" in Open MPI that does that --- and Rmpi can do little about it. So I think after some discussion we may want to reassign or close this.

I used to be a little closer to Open MPI development (but now Manuel does such wonderful work that I could step back from this :-) and there once was word of changing. Manuel, any idea if that happened? Wasn't Open MPI 1.4 supposed to take care of this? Is there a new option?

Dirk

|
| -- System Information:
| Debian Release: squeeze/sid
|   APT prefers unstable
|   APT policy: (500, 'unstable'), (101, 'experimental')
| Architecture: amd64 (x86_64)
|
| Kernel: Linux 2.6.35-trunk-amd64 (SMP w/8 CPU cores)
| Locale: LANG=en_US.UTF-8, LC_CTYPE=en_US.UTF-8 (charmap=UTF-8)
| Shell: /bin/sh linked to /bin/dash
|
| Versions of packages r-cran-rmpi depends on:
| ii  libc6            2.11.2-6  Embedded GNU C Library: Shared lib
| ii  libopenmpi1.3    1.4.2-4   high performance message passing l
| ii  mpi-default-bin  0.6       Standard MPI runtime programs
| ii  r-base-core      2.11.1-6  GNU R core of statistical computat

New R 2.12.0 builds in experimental by the way. I run those too. 2.12 will come on October 15.

|
| r-cran-rmpi recommends no packages.
|
| Versions of packages r-cran-rmpi suggests:
| ii  r-cran-rsprng    1.0-1     GNU R interface to SPRNG (Scalable
|
| -- no debconf information
Well, no. Actually, this behavior is by design. I'm not sure about the details exactly but can get back to Jeff if you're interested in those. This is coming up every now and then in the BTS or the user list. Open MPI is basically burning every free cycle that is not used for computation (busy wait). There are no immediate plans of changing that, as far as I know.

If your program is running correctly but your load is high, that's not a bug. If Open MPI eats up cycles that you need for computation, that's a bug in Open MPI. If you need MPI for a program that just idles, that's clearly a bug in your application. It's HPC after all, isn't it?! ;)

Hope I could shed some light on this!

Best regards,
Manuel
...

Well, I do think this is a design error in OpenMPI. There are plenty of use cases where an OpenMPI cluster might legitimately go idle for some time, and the CPU should be doing something other than busy-waiting.

The one _I_ care about is, I'm debugging a large genetic optimization that needs to be parallelized for runs to finish in a reasonable amount of time, so I want the cluster _available_ all the time (I don't want to have to do startCluster/stopCluster for every run) but the CPU should go to sleep when I'm not doing a run, so the fan quiets down and I can hear myself think.

Another, similar scenario is when the same machine is time-shared among several clusters each dedicated to a particular task, which only runs when jobs come in. When any given cluster is not doing any work it should not busy-wait, because that puts unnecessary load on the scheduler.

Also, I'd not be surprised if busy-waiting here actually made message receive latency _worse_ due to scheduler thrashing.

zw
reassign 598553 openmpi
thanks

On 2 October 2010 at 08:39, Zack Weinberg wrote:
| On Sat, Oct 2, 2010 at 6:01 AM, Manuel Prinz <manuel@debian.org> wrote:
| >> On 29 September 2010 at 18:22, Zack Weinberg wrote:
| >> | (on an 8-core machine), CPU utilization jumps *immediately* from 98% idle
| >> | to 20% user, 70% system, 12% idle. strace reveals that each slave is
| >> | spinning through poll() calls with timeout zero, rather than blocking
| >> | until a message arrives, as the documentation for mpi.probe() suggests
| >> | should happen.
| ...
| > Well, no. Actually, this behavior is by design. I'm not sure about the details
| > exactly but can get back to Jeff if you're interested in those. This is coming
| > up every now and then in the BTS or the user list. Open MPI is basically burning
| > every free cycle that is not used for computation (busy wait). There are no
| > immediate plans of changing that, as far as I know.
|
| Well I do think this is a design error in OpenMPI. There are plenty

I will let the two of you sort this out. Rmpi is simply standing in the middle, talking to Open MPI.

Zack: We didn't have a decent MPICH2 in Debian for ages, which is why I always defaulted to LAM and then Open MPI for Rmpi. You could try a local Rmpi package, or a direct installation to /usr/local/lib/R/site-packages, of Rmpi built against MPICH2 if Open MPI bugs you too much.

Dirk

| of use cases where an OpenMPI cluster might legitimately go idle for
| some time, and the CPU should be doing something other than
| busy-waiting.
|
| The one _I_ care about is, I'm debugging a large genetic optimization
| that needs to be parallelized for runs to finish in a reasonable
| amount of time, so I want the cluster _available_ all the time (I
| don't want to have to do startCluster/stopCluster for every run) but
| the CPU should go to sleep when I'm not doing a run, so the fan quiets
| down and I can hear myself think.
|
| Another, similar scenario is when the same machine is time-shared
| among several clusters each dedicated to a particular task, which only
| runs when jobs come in. When any given cluster is not doing any work
| it should not busy-wait, because that puts unnecessary load on the
| scheduler.
|
| Also, I'd not be surprised if busy-waiting here actually made message
| receive latency _worse_ due to scheduler thrashing.
|
| zw
Hi Zack!

I did some reading and it seems that Open MPI does indeed support two modes of waiting: aggressive and degraded. The default behavior is "aggressive", but you can switch between them by setting the mpi_yield_when_idle MCA parameter. See the following FAQ entries (and links therein):

http://www.open-mpi.org/faq/?category=running#force-aggressive-degraded
http://www.open-mpi.org/faq/?category=running#oversubscribing

I guess this is basically the behaviour you want. It would be great if you could give it a try and report back if it works for you. If it doesn't do what you (and I) expect, I'll forward this issue upstream.

Best regards,
Manuel
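For anyone who wants to try the parameter, here is a minimal Python sketch of two ways to pass it to a job: on the mpirun command line with --mca, and through the environment. It assumes Open MPI's OMPI_MCA_<name> naming for environment variables. The helper names mca_env and mpirun_command are made up for this illustration and are not part of Open MPI or Rmpi. The functions only build the environment and the argument list; nothing is launched.

```python
import os


def mca_env(params, base_env=None):
    """Return a copy of the environment with OMPI_MCA_<name> set per parameter.

    Hypothetical helper for illustration; only the OMPI_MCA_ prefix
    comes from Open MPI itself.
    """
    env = dict(os.environ if base_env is None else base_env)
    for name, value in params.items():
        env["OMPI_MCA_" + name] = str(value)
    return env


def mpirun_command(program, nprocs, params):
    """Build an mpirun argument list that passes each parameter via --mca."""
    cmd = ["mpirun", "-np", str(nprocs)]
    for name, value in params.items():
        cmd += ["--mca", name, str(value)]
    return cmd + [program]


if __name__ == "__main__":
    params = {"mpi_yield_when_idle": 1}
    print(" ".join(mpirun_command("./a.out", 2, params)))
    print(mca_env(params, base_env={}))
```

The environment route is the one that applies when the MPI processes are started from inside R rather than by an explicit mpirun call.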
On 2 October 2010 at 21:51, Manuel Prinz wrote:
| Hi Zack!
|
| On Sat, Oct 02, 2010 at 08:39:06AM -0700, Zack Weinberg wrote:
| > On Sat, Oct 2, 2010 at 6:01 AM, Manuel Prinz <manuel@debian.org> wrote:
| > >> On 29 September 2010 at 18:22, Zack Weinberg wrote:
| > >> | (on an 8-core machine), CPU utilization jumps *immediately* from 98% idle
| > >> | to 20% user, 70% system, 12% idle. strace reveals that each slave is
| > >> | spinning through poll() calls with timeout zero, rather than blocking
| > >> | until a message arrives, as the documentation for mpi.probe() suggests
| > >> | should happen.
| > ...
| > > Well, no. Actually, this behavior is by design. I'm not sure about the details
| > > exactly but can get back to Jeff if you're interested in those. This is coming
| > > up every now and then in the BTS or the user list. Open MPI is basically burning
| > > every free cycle that is not used for computation (busy wait). There are no
| > > immediate plans of changing that, as far as I know.
|
| I did some reading and it seems the Open MPI indeed does support two modes
| of waiting: aggressive and degraded. The default behavior is "aggressive",
| but you can switch them by setting the mpi_yield_when_idle MCA parameter.
| See the following FAQ entries (and links therein):
|
| http://www.open-mpi.org/faq/?category=running#force-aggressive-degraded
| http://www.open-mpi.org/faq/?category=running#oversubscribing
|
| I guess this is basically the behaviour you want. It would be great if you
| could give it a try and report back if it works for you. If it doesn't do
| what you (and I) expect, I'll forward this issue upstream.

Nice work! That totally rhymes with what I recall from late in the 1.2.* cycle, and it would indeed be nice if we could get this tested and then documented.

Dirk
I wrote a test MPI program that just calls MPI_Probe() once - this should block forever, since there are no sends happening. When run with

$ mpirun -np 2 ./a.out

MPI_Probe never returns and the processes spin through poll(), which is what I originally reported. So far so good. If I change the invocation to

$ mpirun -np 2 --mca mpi_yield_when_idle 1 ./a.out

the behavior is the same, except that the processes alternate between poll() and sched_yield(). This doesn't help anything; the scheduler is still being thrashed, and the CPU is not allowed to go idle. [In fact, my understanding of the Linux scheduler is that a zero-timeout poll() counts as a yield, so "Aggressive" mode isn't even doing anything constructive!]

The desired behavior is for an idle cluster's processes to BLOCK in poll(). So mpi_yield_when_idle does not do what I want.

Also, putting "mpi_yield_when_idle = 1" into ~/.openmpi/mca-params.conf has no effect, contra the documentation -- this perhaps ought to be its own bug. (I can set MCA parameters for R with environment variables, but that's not nearly as convenient as the host file.)

zw
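The difference between spinning through zero-timeout poll() calls and blocking in poll() can be shown without MPI. The following is a minimal Python sketch, assuming a POSIX system where select.poll is available. It waits for one byte on a pipe, once with timeout 0 in a loop (the pattern strace shows) and once with no timeout. The function names and the 0.2 second delay are made up for the illustration; it is not the test program described above and says nothing about Open MPI internals.

```python
import os
import select
import threading
import time


def wait_for_message(read_fd, timeout_ms):
    """Wait until read_fd is readable; return how many poll() calls it took.

    timeout_ms=0 reproduces the spinning pattern (poll returns at once);
    timeout_ms=None sleeps in the kernel until data arrives.
    """
    poller = select.poll()
    poller.register(read_fd, select.POLLIN)
    calls = 0
    while True:
        calls += 1
        if poller.poll(timeout_ms):
            return calls


def demo(timeout_ms, delay_s=0.2):
    """Send one byte after delay_s; return (poll calls, CPU seconds used)."""
    read_fd, write_fd = os.pipe()
    sender = threading.Timer(delay_s, os.write, args=(write_fd, b"x"))
    cpu_before = time.process_time()
    sender.start()
    calls = wait_for_message(read_fd, timeout_ms)
    cpu_used = time.process_time() - cpu_before
    sender.join()
    os.close(read_fd)
    os.close(write_fd)
    return calls, cpu_used


if __name__ == "__main__":
    print("busy poll (calls, cpu s):    ", demo(0))
    print("blocking poll (calls, cpu s):", demo(None))
```

The blocking variant needs a single poll() call and close to no CPU time, while the zero-timeout variant makes many calls and keeps a core busy for the whole wait.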
Indeed. I will add a note in README.Debian when I get feedback from Zack. The openmpi-bin package seems to be the correct place. I won't stop you from adding a note to rmpi as well, though! Best regards, Manuel
I'm out of ideas here. Jeff, could you please comment on the issue? You can find the full log here: http://bugs.debian.org/cgi-bin/bugreport.cgi?bug=598553 Thanks in advance! Best regards, Manuel
Sorry for the delay in answering. I'll try to address all points:

1. Yes, the busy-poll design is intentional in Open MPI. :-(

1a. Yes, it probably does cause some performance degradation when used with TCP.

1b. It quite definitely is a (major) performance win for non-TCP networks. That's (unfortunately) why it's there -- you can't poll/select/epoll/whatever for these non-TCP kinds of networks (e.g., OpenFabrics networks) without killing performance. So you have to busy poll those networks with their native poll functions and then periodically select/poll/epoll/whatever all file descriptors. This unfortunately became a central architecture point for Open MPI's progression engine (because it's in the performance-critical code path).

2. The behavior you're seeing with yield_when_idle is also intentional. We're busy polling but we're yielding so that we play well with others. It does not in any way reduce the CPU utilization; it just makes Open MPI share the CPU better. But it got somewhat weakened when sched_yield() lost its meaning in recent kernels.

3. We do know how to make our progression engine switch between blocking and busy-polling (i.e., we've had many discussions about it over the years -- shared memory message passing is the Big Problem). But no one has ever had the time / resources / motivation to implement it. If anyone has some time, I would love to explain what would need to be done (it's not rocket science, but it is a bit tricky and will require getting into some minutiae in the guts of Open MPI :-\ ).

Does that help at least explain why the code is the way it is?
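To illustrate what switching between busy-polling and blocking (point 3) can look like in the simplest case, here is a Python sketch of a generic spin-then-block wait on one file descriptor. It is an assumption-laden toy, not Open MPI's progression engine and not the design discussed upstream. The name hybrid_wait and the 1 ms spin budget are invented for the example. It also covers only the easy, TCP-like case where the kernel can wake the process, which is exactly what shared memory and the native non-TCP transports do not offer.

```python
import os
import select
import threading
import time


def hybrid_wait(read_fd, spin_budget_s=0.001):
    """Busy-poll for up to spin_budget_s, then block until read_fd is readable.

    Returns "spin" if data arrived during the busy-poll phase and
    "block" if the caller had to sleep in the kernel for it.
    """
    poller = select.poll()
    poller.register(read_fd, select.POLLIN)

    # Phase 1: low-latency busy polling while traffic is expected.
    deadline = time.monotonic() + spin_budget_s
    while True:
        if poller.poll(0):
            return "spin"
        if time.monotonic() >= deadline:
            break

    # Phase 2: nothing arrived, so give the CPU back and sleep until
    # the kernel reports the descriptor readable.
    poller.poll(None)
    return "block"


if __name__ == "__main__":
    read_fd, write_fd = os.pipe()
    # The message arrives long after the spin budget has run out.
    sender = threading.Timer(0.1, os.write, args=(write_fd, b"x"))
    sender.start()
    print(hybrid_wait(read_fd))
    sender.join()
    os.close(read_fd)
    os.close(write_fd)
```

An idle process ends up asleep in the second phase, while a message that arrives within the spin budget is still picked up without a kernel wakeup.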
On Monday, 11.10.2010, at 14:26 -0500, Jeff Squyres wrote:

Yes, thanks for your input!

The question that remains is how to proceed with the bug report. The closest ticket upstream seems to be #193 [1], which has not been updated for 3 years. I could either mark the Debian bug as "wontfix" or reference #193 and leave it open with severity "wishlist". (But I doubt that someone will implement that soonish.)

Zack, Jeff, any opinions on how to proceed?

Best regards,
Manuel

[1] https://svn.open-mpi.org/trac/ompi/ticket/193
The right ticket to reference is probably this one: https://svn.open-mpi.org/trac/ompi/ticket/1241
Thanks! I did miss this one. I linked it to our BTS and tagged the bug accordingly.

Best regards,
Manuel