
#598553 Add support for blocking progress #598553

Package:: openmpi

Source:: openmpi

Submitter:: Zack Weinberg

Date:: 2010-10-24 17:18:07 UTC

Severity:: wishlist

Tags:

#598553#5

Date:: 2010-09-30 01:22:23 UTC

From:

To:

Upon

> library(snow)
> cl = makeCluster(7, type="MPI")

(on an 8-core machine), CPU utilization jumps *immediately* from 98% idle
to 20% user, 70% system, 12% idle.  strace reveals that each slave is
spinning through poll() calls with timeout zero, rather than blocking
until a message arrives, as the documentation for mpi.probe() suggests
should happen.

I suppose this might be a problem in libopenmpi instead of the R binding,
I haven't tried to reproduce it with anything lower-level.

#598553#10

Date:: 2010-09-30 02:28:06 UTC

From:

To:

Hi Zack,

On 29 September 2010 at 18:22, Zack Weinberg wrote:
| Package: r-cran-rmpi
| Version: 0.5-8-2
| Severity: normal
|
| Upon
|
| > library(snow)
| > cl = makeCluster(7, type="MPI")
|
| (on an 8-core machine), CPU utilization jumps *immediately* from 98% idle
| to 20% user, 70% system, 12% idle.  strace reveals that each slave is
| spinning through poll() calls with timeout zero, rather than blocking
| until a message arrives, as the documentation for mpi.probe() suggests
| should happen.
|
| I suppose this might be a problem in libopenmpi instead of the R binding,
| I haven't tried to reproduce it with anything lower-level.

Very much so. It is "permanent polling" in Open MPI that does that --- and
Rmpi can do little about it.  So I think after some discussion we may want to
reassign or close this.

I used to be a little closer to Open MPI development (but now Manuel does
such wonderful work that I could step back from this :-) and there once was
talk of changing this.

Manuel, any idea if that happened?  Wasn't Open MPI 1.4 supposed to take care
of this?  Is there a new option?

Dirk

|
| -- System Information:
| Debian Release: squeeze/sid
|   APT prefers unstable
|   APT policy: (500, 'unstable'), (101, 'experimental')
| Architecture: amd64 (x86_64)
|
| Kernel: Linux 2.6.35-trunk-amd64 (SMP w/8 CPU cores)
| Locale: LANG=en_US.UTF-8, LC_CTYPE=en_US.UTF-8 (charmap=UTF-8)
| Shell: /bin/sh linked to /bin/dash
|
| Versions of packages r-cran-rmpi depends on:
| ii  libc6                         2.11.2-6   Embedded GNU C Library: Shared lib
| ii  libopenmpi1.3                 1.4.2-4    high performance message passing l
| ii  mpi-default-bin               0.6        Standard MPI runtime programs
| ii  r-base-core                   2.11.1-6   GNU R core of statistical computat

New R 2.12.0 builds in experimental by the way.  I run those too. 2.12 will
come on October 15.


|
| r-cran-rmpi recommends no packages.
|
| Versions of packages r-cran-rmpi suggests:
| ii  r-cran-rsprng                 1.0-1      GNU R interface to SPRNG (Scalable
|
| -- no debconf information
|
|

#598553#15

Date:: 2010-10-02 13:01:01 UTC

From:

To:

Well, no. Actually, this behavior is by design. I'm not sure about the details
exactly but can get back to Jeff if you're interested in those. This is coming
up every now and then in the BTS or the user list. Open MPI is basically burning
every free cycle that is not used for computation (busy wait). There are no
immediate plans of changing that, as far as I know. If your program is running
correctly but your load is high, that's not a bug. If Open MPI eats up cycles that
you need for computation, that's a bug in Open MPI. If you need MPI for a program
that just idles, that's clearly a bug in your application. It's HPC after all,
isn't it?! ;)

Hope I could shed some light on this!

Best regards,
Manuel

#598553#20

Date:: 2010-10-02 15:39:06 UTC

From:

To:

...

Well I do think this is a design error in OpenMPI.  There are plenty
of use cases where an OpenMPI cluster might legitimately go idle for
some time, and the CPU should be doing something other than
busy-waiting.

The one _I_ care about is, I'm debugging a large genetic optimization
that needs to be parallelized for runs to finish in a reasonable
amount of time, so I want the cluster _available_ all the time (I
don't want to have to do startCluster/stopCluster for every run) but
the CPU should go to sleep when I'm not doing a run, so the fan quiets
down and I can hear myself think.

Another, similar scenario is when the same machine is time-shared
among several clusters each dedicated to a particular task, which only
runs when jobs come in.   When any given cluster is not doing any work
it should not busy-wait, because that puts unnecessary load on the
scheduler.

Also, I'd not be surprised if busy-waiting here actually made message
receive latency _worse_ due to scheduler thrashing.

zw

#598553#25

Date:: 2010-10-02 16:00:46 UTC

From:

To:

reassign 598553 openmpi
thanks

On 2 October 2010 at 08:39, Zack Weinberg wrote:
| On Sat, Oct 2, 2010 at 6:01 AM, Manuel Prinz <manuel@debian.org> wrote:
| >> On 29 September 2010 at 18:22, Zack Weinberg wrote:
| >> | (on an 8-core machine), CPU utilization jumps *immediately* from 98% idle
| >> | to 20% user, 70% system, 12% idle.  strace reveals that each slave is
| >> | spinning through poll() calls with timeout zero, rather than blocking
| >> | until a message arrives, as the documentation for mpi.probe() suggests
| >> | should happen.
| ...
| > Well, no. Actually, this behavior is by design. I'm not sure about the details
| > exactly but can get back to Jeff if you're interested in those. This is coming
| > up every now and then in the BTS or the user list. Open MPI is basically burning
| > every free cycle that is not used for computation (busy wait). There are no
| > immediate plans of changing that, as far as I know.
|
| Well I do think this is a design error in OpenMPI.  There are plenty

I will let the two of you sort this out. Rmpi is simply standing in the
middle, talking to Open MPI.

Zack: We didn't have a decent MPICH2 in Debian for ages, which is why I always
defaulted to LAM and then Open MPI for Rmpi.  You could try a local Rmpi package,
or direct installation to /usr/local/lib/R/site-packages, of Rmpi built
against MPICH2 if Open MPI bugs you too much.

Dirk

| of use cases where an OpenMPI cluster might legitimately go idle for
| some time, and the CPU should be doing something other than
| busy-waiting.
|
| The one _I_ care about is, I'm debugging a large genetic optimization
| that needs to be parallelized for runs to finish in a reasonable
| amount of time, so I want the cluster _available_ all the time (I
| don't want to have to do startCluster/stopCluster for every run) but
| the CPU should go to sleep when I'm not doing a run, so the fan quiets
| down and I can hear myself think.
|
| Another, similar scenario is when the same machine is time-shared
| among several clusters each dedicated to a particular task, which only
| runs when jobs come in.   When any given cluster is not doing any work
| it should not busy-wait, because that puts unnecessary load on the
| scheduler.
|
| Also, I'd not be surprised if busy-waiting here actually made message
| receive latency _worse_ due to scheduler thrashing.
|
| zw

#598553#34

Date:: 2010-10-02 19:51:20 UTC

From:

To:

Hi Zack!

I did some reading and it seems that Open MPI does indeed support two modes
of waiting: aggressive and degraded. The default behavior is "aggressive",
but you can switch between them by setting the mpi_yield_when_idle MCA parameter.
See the following FAQ entries (and links therein):

http://www.open-mpi.org/faq/?category=running#force-aggressive-degraded
http://www.open-mpi.org/faq/?category=running#oversubscribing

I guess this is basically the behaviour you want. It would be great if you
could give it a try and report back if it works for you. If it doesn't do
what you (and I) expect, I'll forward this issue upstream.

Best regards,
Manuel

#598553#39

Date:: 2010-10-02 20:56:04 UTC

From:

To:

On 2 October 2010 at 21:51, Manuel Prinz wrote:
| Hi Zack!
|
| On Sat, Oct 02, 2010 at 08:39:06AM -0700, Zack Weinberg wrote:
| > On Sat, Oct 2, 2010 at 6:01 AM, Manuel Prinz <manuel@debian.org> wrote:
| > >> On 29 September 2010 at 18:22, Zack Weinberg wrote:
| > >> | (on an 8-core machine), CPU utilization jumps *immediately* from 98% idle
| > >> | to 20% user, 70% system, 12% idle.  strace reveals that each slave is
| > >> | spinning through poll() calls with timeout zero, rather than blocking
| > >> | until a message arrives, as the documentation for mpi.probe() suggests
| > >> | should happen.
| > ...
| > > Well, no. Actually, this behavior is by design. I'm not sure about the details
| > > exactly but can get back to Jeff if you're interested in those. This is coming
| > > up every now and then in the BTS or the user list. Open MPI is basically burning
| > > every free cycle that is not used for computation (busy wait). There are no
| > > immediate plans of changing that, as far as I know.
|
| I did some reading and it seems that Open MPI does indeed support two modes
| of waiting: aggressive and degraded. The default behavior is "aggressive",
| but you can switch between them by setting the mpi_yield_when_idle MCA parameter.
| See the following FAQ entries (and links therein):
|
| http://www.open-mpi.org/faq/?category=running#force-aggressive-degraded
| http://www.open-mpi.org/faq/?category=running#oversubscribing
|
| I guess this is basically the behaviour you want. It would be great if you
| could give it a try and report back if it works for you. If it doesn't do
| what you (and I) expect, I'll forward this issue upstream.

Nice work!  That totally rhymes with what I recall from late in the 1.2.*
cycle, and it would indeed be nice if we could get this tested and then documented.

Dirk

#598553#44

Date:: 2010-10-02 20:37:42 UTC

From:

To:

I wrote a test MPI program that just calls MPI_Probe() once - this
should block forever, since there are no sends happening.  When run
with

$ mpirun -np 2 ./a.out

MPI_Probe never returns and the processes spin through poll(), which
is what I originally reported.  So far so good.  If I change the
invocation to

$ mpirun -np 2 --mca mpi_yield_when_idle 1 ./a.out

the behavior is the same, except that the processes alternate between
poll() and sched_yield().  This doesn't help anything; the scheduler
is still being thrashed, and the CPU is not allowed to go idle.  [In
fact, my understanding of the Linux scheduler is that a zero-timeout
poll() counts as a yield, so "Aggressive" mode isn't even doing
anything constructive!]

The desired behavior is for an idle cluster's processes to BLOCK in
poll().  So mpi_yield_when_idle does not do what I want.
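
For illustration, the difference between the two wait styles can be
reproduced without MPI at all. The following is a minimal Python sketch,
not Open MPI code: it waits on a pipe once by spinning through poll()
with a zero timeout and once by blocking in poll(), then reports how many
calls and how much CPU time each wait used. The pipe, the helper thread
standing in for a remote sender, and all function names are assumptions
made for this example.

```python
import os
import select
import threading
import time


def spin_until_readable(poller):
    """Busy-wait: call poll() with a zero timeout until data is ready."""
    calls = 0
    while True:
        calls += 1
        if poller.poll(0):  # timeout 0 returns immediately, ready or not
            return calls


def block_until_readable(poller):
    """Blocking wait: a single poll() call that sleeps in the kernel."""
    poller.poll()  # no timeout: returns only once data is ready
    return 1


def measure(wait_fn, delay=0.2):
    """Return (number of poll() calls, CPU seconds used) for one wait."""
    read_fd, write_fd = os.pipe()
    poller = select.poll()
    poller.register(read_fd, select.POLLIN)
    # A helper thread plays the role of the remote sender (assumption).
    sender = threading.Timer(delay, os.write, args=(write_fd, b"x"))
    cpu_before = time.process_time()
    sender.start()
    calls = wait_fn(poller)
    cpu_used = time.process_time() - cpu_before
    sender.join()
    os.close(read_fd)
    os.close(write_fd)
    return calls, cpu_used


spin_calls, spin_cpu = measure(spin_until_readable)
block_calls, block_cpu = measure(block_until_readable)
print(f"zero-timeout poll: {spin_calls} calls, {spin_cpu:.3f}s CPU")
print(f"blocking poll:     {block_calls} call, {block_cpu:.3f}s CPU")
```

The spinning variant typically issues a very large number of poll() calls
while it waits, whereas the blocking variant issues exactly one and leaves
the CPU idle until the write arrives.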

Also, putting "mpi_yield_when_idle = 1" into
~/.openmpi/mca-params.conf has no effect, contra the documentation --
this perhaps ought to be its own bug.  (I can set MCA parameters for R
with environment variables, but that's not nearly as convenient as the
host file.)
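
A sketch of the environment-variable route, assuming Open MPI's convention
that an MCA parameter named X can be set through the variable OMPI_MCA_X.
The helper name mca_environment is hypothetical and exists only for this
example; it builds an environment mapping and does not launch anything.

```python
import os


def mca_environment(params, base_env=None):
    """Return a copy of base_env with each MCA parameter exported as
    OMPI_MCA_<name>. Uses the current environment if base_env is None."""
    env = dict(os.environ if base_env is None else base_env)
    for name, value in params.items():
        env["OMPI_MCA_" + name] = str(value)
    return env


# Start from an empty environment so the result is easy to inspect.
env = mca_environment({"mpi_yield_when_idle": 1}, base_env={})
print(env)
```

The resulting mapping could be passed as the env argument of
subprocess.run() when starting R or mpirun from a script.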

zw

#598553#49

Date:: 2010-10-02 21:09:00 UTC

From:

To:

Indeed. I will add a note in README.Debian when I get feedback from Zack. The
openmpi-bin package seems to be the correct place. I won't stop you from adding
a note to rmpi as well, though!

Best regards,
Manuel

#598553#54

Date:: 2010-10-02 23:30:15 UTC

From:

To:

I'm out of ideas here. Jeff, could you please comment on the issue?
You can find the full log here:

http://bugs.debian.org/cgi-bin/bugreport.cgi?bug=598553

Thanks in advance!

Best regards,
Manuel

#598553#63

Date:: 2010-10-11 19:26:59 UTC

From:

To:

Sorry for the delay in answering. I'll try to address all points:

1. Yes, the busy-poll design is intentional in Open MPI. :-(
1a. Yes, it probably does cause some performance degradation when used with TCP.
1b. It quite definitely is a (major) performance win for non-TCP networks. That's (unfortunately) why it's there -- you can't poll/select/epoll/whatever for these non-TCP kinds of networks (e.g., OpenFabrics networks) without killing performance. So you have to busy poll those networks with their native poll functions and then periodically select/poll/epoll/whatever all file descriptors. This unfortunately became a central architecture point for Open MPI's progression engine (because it's in the performance-critical code path).

2. The behavior you're seeing with yield_when_idle is also intentional. We're busy polling but we're yielding so that we play well with others. It does not in any way reduce the CPU utilization; it just makes Open MPI share the CPU better. But it got somewhat weakened when sched_yield() lost its meaning in recent kernels.

3. We do know how to make our progression engine switch between blocking and busy-polling (i.e., we've had many discussions about it over the years -- shared memory message passing is the Big Problem). But no one has ever had the time / resources / motivation to implement it. If anyone has some time, I would love to explain what would need to be done (it's not rocket science, but it is a bit tricky and will require getting into some minutiae in the guts of Open MPI :-\ ).
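
As a generic illustration of switching between busy-polling and blocking
(point 3), here is a minimal Python sketch of a "spin, then block" wait on
a file descriptor. It is not Open MPI's design and it ignores the hard
parts named above (shared memory and networks that can only be polled
through their native functions). The function name, the spin budget and
the pipe used as a stand-in for a network are all assumptions.

```python
import os
import select
import threading
import time


def wait_spin_then_block(poller, spin_seconds=0.01):
    """Busy-poll for at most spin_seconds, then fall back to blocking.

    Returns ("spin", events) if data arrived during the spin phase,
    otherwise ("block", events).
    """
    deadline = time.monotonic() + spin_seconds
    while time.monotonic() < deadline:
        events = poller.poll(0)    # zero timeout: low latency, burns CPU
        if events:
            return "spin", events
    return "block", poller.poll()  # no timeout: sleeps until data arrives


read_fd, write_fd = os.pipe()
poller = select.poll()
poller.register(read_fd, select.POLLIN)

# The "message" arrives long after the spin budget has run out.
sender = threading.Timer(0.2, os.write, args=(write_fd, b"x"))
sender.start()
phase, events = wait_spin_then_block(poller)
sender.join()
print(phase, events)

os.close(read_fd)
os.close(write_fd)
```

The spin budget trades latency against idle CPU use: a message that
arrives within the budget is picked up without a sleep and wake-up, while
a long idle period costs only the budget before the process goes to sleep.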

Does that help at least explain why the code is the way it is?

#598553#68

Date:: 2010-10-21 12:06:34 UTC

From:

To:

On Monday, 11.10.2010 at 14:26 -0500, Jeff Squyres wrote:

Yes, thanks for your input! The question that remains is how to proceed
with the bug report. The closest ticket upstream seems to be #193 [1]
which has not been updated for 3 years. I could either mark the Debian
bug as "wontfix" or reference #193 and leave it open with severity
"wishlist". (But I doubt that someone will implement that soonish.)

Zack, Jeff, any opinions on how to proceed?

Best regards,
Manuel

[1] https://svn.open-mpi.org/trac/ompi/ticket/193

#598553#73

Date:: 2010-10-21 17:54:37 UTC

From:

To:

The right ticket to reference is probably this one:

https://svn.open-mpi.org/trac/ompi/ticket/1241

#598553#86

Date:: 2010-10-24 17:14:59 UTC

From:

To:

Thanks! I did miss this one. I linked it to our BTS and tagged
the bug accordingly.

Best regards,
Manuel
