#1125487 tag2upload service needs to be able to retry against ftpmaster API

#1125487#5
Date: 2026-01-14 18:59:38 UTC
Debian tag2upload service writes ("[tag2upload 2420] failed, ifeffit 2:1.2.11d-13"):


This is good advice.

However, I think our system could have avoided this.  The ftpmaster
API is required for knowing what is *currently* in the target suite.
With a small amount of reorganisation on top of work we're doing for
#1112106 (retrying against Salsa), we could do this before we commit
to the upload.

I am filing this bug to track this, since #1112106 talks fairly
narrowly about Salsa.

Ian.

#1125487#10
Date: 2026-01-16 12:43:25 UTC
Ian Jackson writes ("Bug#1125487: tag2upload service needs to be able to retry against ftpmaster API"):

It turns out that this wasn't a transient error after all; it was
#1125668.  I still think we would like to be able to survive ftpmaster
API failures.

But having looked at the code some more, it's not so easy.  For
example, this particular ftpmaster API call occurs quite late, during
changes file generation.

I still think we *could* do it, but it would look something like this:

 * Add a new feature to dgit rpush to allow it to synchronise with its
   caller just before it starts signing things.  It should do this
   idempotently in `i_resp_want`.

 * Have dgit-repos-server use this feature, and only do the
   commit-to-public-upload dance when dgit rpush wants it.

Simplest would be if we could provide dgit rpush with the manager
connection.  But we want to be able to handle dgit rpush crashing,
without losing knowledge of the o2m protocol state.

Another complication is that, while running dgit rpush,
dgit-repos-server would then have to wait for *two* things:
  (a) dgit rpush terminates (SIGCHLD/waitpid)
  (b) dgit rpush wants to commit
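To make the two-event wait concrete, here is a minimal Python sketch of the self-pipe trick that option 1 mentions. The real dgit-repos-server is Perl, so this is illustrative only. It assumes that event (b) arrives as a write on a pipe from the child. `spawn`, `wait_for_event` and the `commit` message are hypothetical names, not dgit interfaces.

```python
import os
import select
import signal

# Self-pipe: the SIGCHLD handler writes a byte here, turning event (a),
# "child terminated", into a file descriptor that select() can watch.
_sig_r, _sig_w = os.pipe()
os.set_blocking(_sig_w, False)  # a signal handler must never block


def _on_sigchld(signum, frame):
    try:
        os.write(_sig_w, b"C")
    except BlockingIOError:
        pass  # pipe full: a wakeup is already pending anyway


def spawn(child_main):
    """Fork a stand-in child running child_main(req_w); return (pid, req_r).

    Hypothetical protocol: the child writes to req_w to say
    "I want to commit" (event (b))."""
    signal.signal(signal.SIGCHLD, _on_sigchld)  # install before forking
    req_r, req_w = os.pipe()
    pid = os.fork()
    if pid == 0:  # child
        os.close(req_r)
        try:
            child_main(req_w)
            code = 0
        except BaseException:
            code = 1
        os._exit(code)
    os.close(req_w)
    return pid, req_r


def wait_for_event(pid, req_r):
    """Block until the child asks to commit or terminates, whichever comes first."""
    watch = [_sig_r, req_r]
    while True:
        ready, _, _ = select.select(watch, [], [])
        if req_r in ready:
            data = os.read(req_r, 1024)
            if data:
                return ("commit", data)
            watch.remove(req_r)  # EOF: the child is on its way out
        if _sig_r in ready:
            os.read(_sig_r, 1024)  # drain the wakeup bytes
            done, status = os.waitpid(pid, os.WNOHANG)
            if done == pid:
                return ("exited", status)
```

Python's `signal.set_wakeup_fd()` packages the same trick internally; the manual version is shown here so that the mechanism is visible.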

I thought of a number of options for such an arrangement:

 1. dgit-repos-server uses the self-pipe trick to turn (a) into an
    fd, so that it can be selected on.  What a palaver, unless
    there's a convenient library we could use.

 2. dgit rpush notifies dgit-repos-server by sending a signal to its
    parent (!), and dgit-repos-server uses sigwait.  Does perl
    even have a convenient way to sigwait?

 3. dgit-repos-server forks again, creating a little child whose job
    it is to proxy the commit-to-public-upload dance.  That way, if
    *that* child doesn't crash, dgit-repos-server knows what the o2m
    protocol state is.

 4. dgit rpush writes the o2m protocol state to a file.
    Before it starts the commit-to-public-upload dance it writes
    UNKNOWN to the file.  After dgit rpush exits, dgit-repos-server can
    read this file to see whether it can reuse the o2m connection.
    (dgit rpush is very unlikely to crash during the
    commit-to-public-upload dance unless the o2m connection is broken
    anyway.)

 5. Instead of modifying dgit rpush, provide a stunt wrapper for gpg.
    This is *actually* the commitment point.  But the last thing we
    want to do is get more entangled with the gnupg CLI interface.
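Option 2 can be sketched in Python, which exposes sigwait() as `signal.sigwait()`. Whether Perl's POSIX module offers an equivalent is not checked here. The sketch assumes SIGUSR1 as the hypothetical "wants to commit" signal. Both signals are blocked before forking, so neither can be delivered before the parent is ready to collect it. `spawn_blocked` and `next_event` are hypothetical names.

```python
import os
import signal

# Hypothetical: dgit rpush would say "I want to commit" with SIGUSR1.
COMMIT_SIG = signal.SIGUSR1
WAITED = {signal.SIGCHLD, COMMIT_SIG}


def spawn_blocked(child_main):
    """Fork a stand-in child; both signals stay blocked in the parent."""
    # Portability precaution: give SIGCHLD a no-op handler so its
    # disposition is not "ignore" while it sits pending.
    signal.signal(signal.SIGCHLD, lambda signum, frame: None)
    # Block before forking, so nothing is delivered before sigwait() runs.
    signal.pthread_sigmask(signal.SIG_BLOCK, WAITED)
    pid = os.fork()
    if pid == 0:  # child
        # The signal mask is inherited (even across exec); undo it.
        signal.pthread_sigmask(signal.SIG_UNBLOCK, WAITED)
        try:
            child_main()
            code = 0
        except BaseException:
            code = 1
        os._exit(code)
    return pid


def next_event(pid):
    """Wait for either a commit request or the child's termination."""
    while True:
        sig = signal.sigwait(WAITED)
        if sig == COMMIT_SIG:
            return ("commit", None)
        done, status = os.waitpid(pid, os.WNOHANG)
        if done == pid:
            return ("exited", status)
        # SIGCHLD was for a stop or for some other child: keep waiting.
```

A bare SIGUSR1 carries no proof of who sent it, and standard signals do not queue, so a real design along these lines would probably still want a separate confirmation channel.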

None of this seems entirely trivial.  4 is probably easiest but it's a
bit of a bodge!  We may want to downgrade this bug again, and postpone
this work.
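For option 4, a minimal sketch of the state-file protocol. It assumes hypothetical state names (IDLE when the file is absent, UNKNOWN, COMMITTED) and hypothetical helper names. The file is replaced atomically via a temporary file and `os.replace()`, so dgit-repos-server never reads a half-written state.

```python
import os
import tempfile


def write_state(path, state):
    """Atomically replace the state file: write a temp file, then rename."""
    d = os.path.dirname(os.path.abspath(path))
    fd, tmp = tempfile.mkstemp(dir=d)
    try:
        with os.fdopen(fd, "w") as f:
            f.write(state + "\n")
            f.flush()
            os.fsync(f.fileno())
        os.replace(tmp, path)
    except BaseException:
        os.unlink(tmp)
        raise


def commit_dance(path, do_dance):
    """dgit rpush side (hypothetical): bracket the dance with state writes."""
    write_state(path, "UNKNOWN")  # a crash from here on leaves UNKNOWN
    do_dance()
    write_state(path, "COMMITTED")


def read_state(path):
    """dgit-repos-server side: read what rpush recorded before it exited."""
    try:
        with open(path) as f:
            return f.read().strip()
    except FileNotFoundError:
        return "IDLE"  # hypothetical: rpush never reached the dance


def o2m_reusable(path):
    """The o2m connection is reusable unless its state is unknown."""
    return read_state(path) != "UNKNOWN"
```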

Ian.

#1125487#15
Date: 2026-01-16 13:09:58 UTC
Hello,

Ian Jackson [16/Jan 12:43pm GMT] wrote:

(3) seems preferable to me.

IMO this should definitely not block ending the beta.

#1125487#20
Date: 2026-01-16 16:10:53 UTC
Sean Whitton writes ("Bug#1125487: tag2upload service needs to be able to retry against ftpmaster API"):

OK.

#1125487#27
Date: 2026-01-18 13:01:05 UTC
Sean Whitton writes ("Bug#1125487: tag2upload service needs to be able to retry against ftpmaster API"):

I thought of another option:

  6. dgit rpush raises SIGTTIN when it reaches the commitment point.
     d-r-s sends it SIGCONT, and provides it confirmation via a
     non-synchronising channel too (so that some administrator's
     SIGCONT isn't taken as success).  This allows d-r-s to use
     waitpid WSTOPPED to collect either possible next event.

     Something would need to take some extra measures to avoid leaking
     stopped dgit rpush processes, because a stopped process won't die
     from usually-fatal signals.  For example, maybe a child process,
     or timer_create(2), to send a SIGCONT after a timeout.  (Doing
     this in d-r-s itself is bad: what if d-r-s crashes due to a
     problem talking to the manager?)
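Option 6 might look roughly like this in Python. All function names and the `go` token are hypothetical. POSIX spells the waitpid() option WUNTRACED. WSTOPPED is the waitid() name, and on Linux the two have the same value. In this sketch the child puts itself in its own process group, because a default-action SIGTTIN does not stop a member of an orphaned process group, and a daemon's process group may be one.

```python
import os
import signal


def spawn(child_main):
    """Fork a stand-in for dgit rpush; return (pid, confirm_w)."""
    confirm_r, confirm_w = os.pipe()
    pid = os.fork()
    if pid == 0:  # child
        os.close(confirm_w)
        os.setpgid(0, 0)  # own group, so the stop signal is not discarded
        signal.signal(signal.SIGTTIN, signal.SIG_DFL)  # default action: stop
        os.set_blocking(confirm_r, False)  # the "non-synchronising channel"
        try:
            code = 0 if child_main(confirm_r) else 2
        except BaseException:
            code = 1
        os._exit(code)
    os.close(confirm_r)
    return pid, confirm_w


def child_commit_point(confirm_r):
    """Stop at the commitment point; proceed only if d-r-s confirmed."""
    os.kill(os.getpid(), signal.SIGTTIN)  # stopped here until SIGCONT
    try:
        return os.read(confirm_r, 16) == b"go"
    except BlockingIOError:
        return False  # resumed by a stray SIGCONT: not a success


def wait_stopped_or_exited(pid):
    """One waitpid call collects either possible next event."""
    _, status = os.waitpid(pid, os.WUNTRACED)
    if os.WIFSTOPPED(status):
        return ("stopped", os.WSTOPSIG(status))
    return ("exited", status)
```

The d-r-s side writes the confirmation *before* sending SIGCONT, so the data is already in the pipe when the child resumes. The timeout watchdog is left out. A related POSIX rule may also help with leaks: when a process group becomes orphaned while a member is stopped, each member is sent SIGHUP then SIGCONT. So, depending on which process adopts the child, a d-r-s crash could release it. This has not been verified for tag2upload's process layout.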

Ian.