#922240 ftp.debian.org: consider switching to merged pdiffs

#922240#5
Date:
2019-02-13 16:32:24 UTC
From:
To:
I'd like to reopen the discussion about the pdiff format on
the archive. Currently a pdiff is generated for each generation
of the archive, which means that apt has to fetch 4 pdiffs per
day it has to catch up.

This means that for a 10 day interval, we have to fetch 40 pdiffs
per index. Assuming amd64+i386 with Contents files and Sources
enabled, we are looking at 2*(1+1+1)*40=6*40=240 files to fetch.
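The arithmetic above can be written out as a small sketch. The function name and its parameters are hypothetical and only restate the figures in this message (4 dinstalls per day, 2 architectures, 3 index files per architecture); they are not part of apt or dak.

```python
def pdiffs_to_fetch(days_behind, dinstalls_per_day=4,
                    architectures=2, indices_per_arch=3, merged=False):
    """Estimate how many pdiff files a client has to download.

    With classic pdiffs the client needs one patch per missed dinstall
    and per index; with merged pdiffs it needs one patch per index.
    """
    per_index = 1 if merged else days_behind * dinstalls_per_day
    return architectures * indices_per_arch * per_index


print(pdiffs_to_fetch(10))               # classic pdiffs
print(pdiffs_to_fetch(10, merged=True))  # merged pdiffs
```

Under these assumptions the classic scheme needs 240 downloads for a 10 day gap, while a merged scheme would need 6.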

This is clearly suboptimal, as it makes the log output unreadable,
and causes severe slowdowns on high-latency or non-persistent
connections.

It might make sense to consider switching to merged pdiffs, where the
archive generates one pdiff from each generation to the latest one. This can be done either
by preserving old index files and creating pdiffs from them, or simply by
concatenating the new pdiff to the old ones.
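As a rough illustration of the concatenation variant, here is a toy model. It assumes a patch is just a list of ("add", entry) / ("del", entry) operations on a set of index entries; real pdiffs are ed-style scripts, so `apply_patch` and `merge_by_concatenation` are hypothetical names and not how dak or apt implement this.

```python
def apply_patch(state, patch):
    """Apply a toy patch (list of ("add"|"del", entry) ops) to a set."""
    result = set(state)
    for op, entry in patch:
        if op == "add":
            result.add(entry)
        elif op == "del":
            result.discard(entry)
        else:
            raise ValueError(f"unknown operation: {op!r}")
    return frozenset(result)


def merge_by_concatenation(old_merged, newest):
    """Append the newest incremental patch to every existing merged patch.

    old_merged[i] moves state i to the previous latest state; the result
    moves state i to the new latest state. The newest patch itself is
    added for clients that are exactly one generation behind.
    """
    return [patch + newest for patch in old_merged] + [newest]


# Three generations of a toy index and the incremental patches between them.
state0 = frozenset({"a"})
patch0 = [("add", "b")]               # state0 -> state1
patch1 = [("del", "a"), ("add", "c")]  # state1 -> state2
state1 = apply_patch(state0, patch0)

merged = merge_by_concatenation([patch0], patch1)
latest = apply_patch(state1, patch1)
```

In this model every client applies exactly one entry of `merged`, whichever matches its current state, and ends up at `latest`.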

A point against it could be increased space requirements and time to
compress the pdiffs, but I'd welcome more discussion on that subject.

#922240#10
Date:
2019-02-13 18:41:00 UTC
From:
To:
On a related note, the last known blocker for #649882 is solved.
If/when #649882 is implemented, we will have even more PDiffs by default
for apt-file, as a trade-off for making the Contents files smaller
(multi-arch setups in particular will benefit there).  I suspect it will
also offset any growth caused by PDiffs being merged server side.

Thanks,
~Niels

#922240#15
Date:
2019-02-13 21:07:27 UTC
From:
To:
The code is in dak, generate_index_diffs.py - as soon as someone gets me
an MR on salsa for it, we can have this.

Make it "just work", that is, when it's merged, the next run should just do
the right thing, and we are all happy. :)

#922240#25
Date:
2020-01-01 12:04:00 UTC
From:
To:
I am considering looking at this feature - I am looking for a review
before I invest a lot of time in an implementation, in case the design is
going to be rejected.


# Proposal
I have spoken with Julian about the APT side and we would end up doing
completely merged pdiffs for this to work (i.e. every patch must move
you from the current state to the newest state).  This means that every
dinstall will lead to a new generation of all existing patches.

To avoid a combinatorial blow-up, Julian and I propose that we limit the
number of patch generations to a low constant.  This would limit the number
of patches to 3x the current number. We can further reduce
the number by reducing the number of pdiffs.
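A minimal sketch of the "keep only the newest N generations" idea follows. It assumes each generation is identified by an increasing integer (a hypothetical dinstall sequence number) that maps to the list of merged patches targeting that state; `publish_generation` is an illustrative name and not a function in dak.

```python
def publish_generation(generations, target, patches, keep=3):
    """Return a new mapping with `target` added and old generations pruned.

    generations: dict mapping a generation number to its patch list.
    keep: how many of the newest generations stay published.
    """
    if keep < 1:
        raise ValueError("keep must be at least 1")
    updated = dict(generations)
    updated[target] = patches
    # Drop everything except the `keep` highest generation numbers.
    for old in sorted(updated)[:-keep]:
        del updated[old]
    return updated


published = {}
for run in range(1, 6):
    # Placeholder patch names; a real run would regenerate every patch.
    published = publish_generation(published, run, [f"patch-to-{run}"])

print(sorted(published))
```

After five runs with `keep=3`, only the generations targeting runs 3, 4 and 5 remain, so the number of patch files stays bounded by the number of generations times the patches per generation.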

## The rationale behind multiple generations of pdiffs:

This is to ensure that any "apt-get update" that fetches an Index file
during a mirror sync will still be able to see the patch files listed
in the Index file.

My understanding is that 3 generations will be sufficient to avoid
issues by giving "apt(-get) update" at least a 6 hour window to complete
before there is an issue.
  As pdiffs are only used during an "apt(-get) update", there is no
reason to be concerned about stale metadata in the Index after apt(-get)
has fetched all the files.

The number of generations will obviously be configurable, so we can
trivially change it if 3 is too much or too little.  My interest is that
we agree on the generation approach (also - my guesstimate of 3 is from
"rather safe than sorry" instead of a "carefully calculated math").

Addendum: Ideally, the Index file would be removed from the by-hash at
the same time as the patches file listed in it.  AFAICT, this is not
trivially possible in generate_releases.py and I have assumed it to be a
non-issue given the above safe-guard.  Let me know if you disagree with
this assumption.

# Alternatives

Theoretically, it is possible to make a trade-off where only
some of the pdiffs are merged.  However, neither apt(-get) nor the metadata
is currently geared/designed to do this efficiently.
  Furthermore, it would not have the full performance benefit for the
client, as it would in many cases still end up having to download at
least 2-3 patches (and worst case 5-7) if we want to avoid a
considerable increase in pdiff files.

For these reasons, this approach has not been considered in depth.

# Optional improvements

When merging pdiffs, it is possible to do something smarter than simply
concatenating two pdiffs together.  If it is interesting and I can
understand the runes that make up `diffindex-rred` in apt-file/2.5.4
then I will try to implement some of this in dak.  This will reduce the
file-size of the merged patches and possibly fix #947839 as a side-effect.



@FTP masters: Do you agree with the "fully-merged" approach with N
generations (with N=3 by default) as a solution to this request?


Thanks,
~Niels

#922240#30
Date:
2020-01-01 16:59:15 UTC
From:
To:
Yay.

Ok.

We currently seem to do up to 56 of them. And (for today) they seem to cover
a timeframe from the last run on 17th December to today's first run, so
about two weeks of pdiffs.

Does 3 generations mean we will have coverage for less than a day, or do I
misunderstand something here?

I'm happy with it, as long as we cover a long enough timeframe of diff
generation with it, ie. multiple days.

And preferably if it works as a drop-in replacement and we can just
start using it whenever it gets merged.

#922240#35
Date:
2020-01-01 19:07:00 UTC
From:
To:
Joerg Jaspert:

Ack, 14 days assuming 4 dinstalls per day.

The generation represents a different "time axis" than the one you are
thinking of.  On one axis, we have "how many dinstalls can you be behind
and still get patches?".  This is the "only axis" we have right now and
is currently set to 56 dinstalls (~14 days).  For reference, the code
refers to this limit/time axis as "--maxdiffs"/"MaxDiffs".

When we start to do patch merging, we introduce the time axis "which
dinstall state will this patch forward you to?".  I referred to this as
"Generations" in my description for lack of a better word (I am open to
suggestions for a better name).

As said, the primary purpose of using multiple generations is to avoid
removing pdiffs referenced by an Index file while an "apt-get update" is
fetching files (Time of fetch for Index vs. time of fetch for the actual
pdiff file).  As I understand it, we will never need more than 3
generations, but that still implies that we will have 3*56 = 168 pdiff
files.
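To make the two axes concrete, here is a small sketch that keeps them apart. The function and key names are hypothetical; only the numbers (56 for MaxDiffs, 3 generations, 4 dinstalls per day) come from this thread.

```python
def merged_pdiff_layout(max_diffs=56, generations=3, dinstalls_per_day=4):
    """Summarise the two independent axes of a merged-pdiff setup.

    max_diffs: how many dinstalls a client may be behind (coverage axis).
    generations: how many target states are published at the same time.
    """
    return {
        # The file count grows with both axes.
        "patch_files": max_diffs * generations,
        # Coverage depends on max_diffs only, not on generations.
        "coverage_days": max_diffs / dinstalls_per_day,
    }


print(merged_pdiff_layout())
```

With the defaults this gives 168 patch files and 14 days of coverage; changing `generations` alters the file count but leaves the coverage untouched.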

Ok.  :)

I will try to have a look at it soonish.

Thanks,
~Niels

#922240#40
Date:
2020-11-09 21:45:02 UTC
From:
To:
The code has been written and merged into dak.  The bug remains open
until the code has been activated on ftp-master and deployed in relevant
suites (presumably after a test phase in experimental).

~Niels

#922240#47
Date:
2025-08-10 21:48:57 UTC
From:
To:
Hi

we are using merged_pdiffs everywhere where it counts.