- Package:
- ftp.debian.org
- Source:
- ftp.debian.org
- Submitter:
- Julian Andres Klode
- Date:
- 2025-08-10 21:51:10 UTC
- Severity:
- wishlist
I'd like to reopen the discussion about the pdiff format on the archive. Currently a pdiff is generated for each generation of the archive, which means that apt has to fetch 4 pdiffs for each day it has to catch up. For a 10-day interval, that is 40 pdiffs per index. Assuming amd64+i386 with Contents files and Sources enabled, we are looking at 2*(1+1+1)*40 = 6*40 = 240 files to fetch. This is clearly suboptimal: it makes the log output unreadable and causes severe slowdowns on high-latency or non-persistent connections. It might make sense to consider switching to merged pdiffs, where one pdiff is generated from each generation to the latest one. This can be done either by preserving old index files and creating pdiffs from them, or simply by concatenating the new pdiff to the old ones. A point against it could be increased space requirements and the time needed to compress the pdiffs, but I'd welcome more discussion on that subject.
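The arithmetic above can be sketched as a small calculation. This is only an illustration: the function name and parameters are hypothetical, and the merged case assumes the client is still inside the window covered by the archive, so that a single patch per index is enough.

```python
def pdiffs_to_fetch(days_behind, dinstalls_per_day=4, indices=6, merged=False):
    """Number of pdiff files a client fetches to catch up.

    indices=6 corresponds to the 2*(1+1+1) index files in the example
    above. With merged pdiffs, one patch per index moves the client
    straight to the newest state (assumption: the client is still
    within the covered window).
    """
    per_index = 1 if merged else days_behind * dinstalls_per_day
    return indices * per_index


print(pdiffs_to_fetch(10))               # 240
print(pdiffs_to_fetch(10, merged=True))  # 6
```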
On a related note, the last known blocker for #649882 is solved. If/when #649882 is implemented, we will have even more pdiffs by default for apt-file, as a trade-off for making the Contents files smaller (multi-arch setups in particular will benefit there). I suspect it will also offset any growth from pdiffs being merged server-side. Thanks, ~Niels
The code is in dak, generate_index_diffs.py. As soon as someone gets me an MR on salsa for it, we can have this. Make it "just work", that is, when it's merged, the next run should just do the right thing, and we are all happy. :)
I am considering looking at this feature. I would like a review before I invest a lot of time in an implementation, in case the design is going to be rejected.

# Proposal

I have spoken with Julian about the APT side, and we would end up doing completely merged pdiffs for this to work (i.e. every patch must move you from the current state to the newest state). This means that every dinstall will lead to a new generation of all existing patches. To avoid a combinatorial blow-up, Julian and I propose that we limit the number of patch generations to a low constant. This would limit the number of patches to a factor of 3x the current number. We can further reduce the number by reducing the number of pdiffs.

## The rationale behind multiple generations of pdiffs

This is to ensure that any "apt-get update" that fetches an Index file during a mirror sync will still be able to see the patch files listed in the Index file. My understanding is that 3 generations will be sufficient to avoid issues by giving "apt(-get) update" at least a 6-hour window to complete before there is an issue. As pdiffs are only used during an "apt(-get) update", there is no reason to be concerned about stale metadata in the Index after apt(-get) has fetched all the files.

The number of generations will obviously be configurable, so we can trivially change it if 3 is too much or too little. My interest is that we agree on the generation approach (also, my guesstimate of 3 comes from "rather safe than sorry" rather than from "carefully calculated math").

Addendum: Ideally, the Index file would be removed from by-hash at the same time as the patch files listed in it. AFAICT, this is not trivially possible in generate_releases.py, and I have assumed it to be a non-issue given the above safeguard. Let me know if you disagree with this assumption.

# Alternatives

Theoretically, it is possible to make a trade-off where only some of the pdiffs are merged. However, neither apt(-get) nor the metadata is currently geared/designed to do this efficiently. Furthermore, it would not have the full performance benefit for the client, as they would in many cases still end up having to download at least 2-3 patches (and worst case 5-7) if we want to avoid a considerable increase in pdiff files. For these reasons, this approach has not been considered in depth.

# Optional improvements

When merging pdiffs, it is possible to do something smarter than simply concatenating two pdiffs together. If it is interesting and I can understand the runes that make up `diffindex-rred` in apt-file/2.5.4, then I will try to implement some of this in dak. This will reduce the file size of the merged patches and possibly fix #947839 as a side effect.

@FTP masters: Do you agree with the "fully-merged" approach with N generations (with N=3 by default) as a solution to this request?

Thanks, ~Niels
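To illustrate the "simple concatenation" baseline mentioned in this thread, here is a minimal sketch. It assumes pdiffs are ed-style scripts restricted to the `a`, `c` and `d` commands, with text blocks terminated by a lone `.` line. The `apply_ed` helper is hypothetical and is not the dak or apt code; it only shows that applying two patches in sequence gives the same result as applying their concatenation.

```python
import re

# Matches "N<cmd>" or "N,M<cmd>" where cmd is a (append), c (change), d (delete).
CMD = re.compile(r"^(\d+)(?:,(\d+))?([acd])$")


def apply_ed(lines, script):
    """Apply an ed-style script (a/c/d commands only) to a list of lines."""
    out = list(lines)
    it = iter(script)
    for cmd in it:
        m = CMD.match(cmd)
        if not m:
            raise ValueError(f"unsupported command: {cmd!r}")
        start = int(m.group(1))
        end = int(m.group(2) or start)
        op = m.group(3)
        text = []
        if op in "ac":
            # Collect the text block up to the terminating "." line.
            for line in it:
                if line == ".":
                    break
                text.append(line)
        if op == "a":
            out[start:start] = text    # insert after line `start`
        elif op == "c":
            out[start - 1:end] = text  # replace lines start..end
        else:
            out[start - 1:end] = []    # delete lines start..end
    return out


v0 = ["a", "b", "c"]
p1 = ["2c", "B", "."]          # v0 -> v1
p2 = ["3a", "d", ".", "1d"]    # v1 -> v2

step_by_step = apply_ed(apply_ed(v0, p1), p2)
merged = apply_ed(v0, p1 + p2)  # naive merge: plain concatenation
print(merged)                   # ['B', 'c', 'd']
```

A smarter merge would fold overlapping edits together so the merged patch does not carry lines that a later patch changes again; the sketch above deliberately does not attempt that.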
Yay. Ok. We currently seem to do up to 56 of them. And (for today) they seem to cover a timeframe from the last run on 17th December to the first run of today, so about two weeks of pdiffs. Does 3 generations mean we will have cover for less than a day, or do I misunderstand something here? I'm happy with it, as long as we cover a long enough timeframe of diff generation with it, i.e. multiple days. And preferably it works as a drop-in replacement and we can just start using it whenever it gets merged.
Joerg Jaspert: Ack, 14 days assuming 4 dinstalls per day. The generation represents a different "time axis" than the one you are thinking of. On one axis, we have "how many dinstalls can you be behind and still get patches?". This is the only axis we have right now, and it is currently set to 56 dinstalls (~14 days). For reference, the code refers to this limit/time axis as "--maxdiffs"/"MaxDiffs". When we start to do patch merging, we introduce the time axis "which dinstall state will this patch forward you to?". I referred to this as "Generations" in my description for lack of a better word (I am open to suggestions for a better name). As said, the primary purpose of using multiple generations is to avoid removing pdiffs referenced by an Index file while an "apt-get update" is fetching files (time of fetch for the Index vs. time of fetch for the actual pdiff file). As I understand it, we will never need more than 3 generations, but that still implies that we will have 3*56 = 168 pdiff files. Ok. :) I will try to have a look at it soonish. Thanks, ~Niels
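The two axes can be modelled with a short simulation. This is a sketch under stated assumptions, not the dak implementation: `retained_patches` and its parameters are hypothetical names, archive states are numbered by dinstall run, and each patch is represented as a `(from_state, to_state)` pair.

```python
from collections import deque


def retained_patches(dinstalls, max_diffs=56, generations=3):
    """Model which merged patches remain on the archive after `dinstalls` runs.

    max_diffs:   how many dinstalls a client may be behind (the MaxDiffs axis).
    generations: how many target states are kept (the "Generations" axis).
    """
    kept = deque(maxlen=generations)  # oldest generation drops off automatically
    for newest in range(1, dinstalls + 1):
        oldest = max(0, newest - max_diffs)
        # One merged patch from every covered old state to the newest state.
        kept.append([(old, newest) for old in range(oldest, newest)])
    return [patch for gen in kept for patch in gen]


patches = retained_patches(100)
print(len(patches))                        # 168
print(sorted({to for _, to in patches}))   # [98, 99, 100]
```

In this model the number of files per index is bounded by `max_diffs * generations`, which matches the 3*56 = 168 figure above once the archive has been running for at least `max_diffs` dinstalls.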
The code has been written and merged into dak. The bug remains open until the code has been activated on ftp-master and deployed in relevant suites (presumably after a test phase in experimental). ~Niels
Hi, we are using merged_pdiffs everywhere it counts.