#111879 apt-get: wishlist: random download order for better HTTP cache hit rate

Package:
apt
Source:
apt
Description:
commandline package manager
Submitter:
Zygo Blaxell
Date:
2021-11-07 09:03:03 UTC
Severity:
wishlist
#111879#5
Date:
2001-09-10 14:41:47 UTC
From:
To:
I am maintaining a group of 50+ machines running Debian.  These are
a mixture of stable, unstable, and testing machines distributed
geographically in three countries.  All have a local caching HTTP proxy
on their LANs, and they are configured to use it.

At regular intervals 'apt-get update; apt-get -ufy dist-upgrade' is
run on these machines almost simultaneously (it is actually manually
started by local support personnel, who for various unfortunate reasons
do it during the same hour each day).

This has a nasty side-effect: there may be many machines attempting to
download all of the same packages in the same order through the same
proxy if there is a large number of packages upgraded that day.  The HTTP
caches can't have a cache hit on a package until that package is fully
downloaded, so ultimately many machines will end up downloading the same
package at the same time through the same HTTP proxy and Internet feed.

Using rsync to create a local copy of debian.org uses much more bandwidth
than 50 machines simultaneously downloading new libc6 packages, so the
HTTP cache strategy seems to be more efficient even if it wastes 70%
of its bandwidth in the process.

It would be nice if apt-get could randomize the order in which it
attempts to fetch packages (and package lists files, for that matter)
from any given source.  In the absence of a scheme to schedule network
traffic synchronously among many machines running apt-get simultaneously,
randomizing the download order within apt-get would maximize the
probability that any two machines are fetching different packages at
any given time.  This in turn improves the cache hit rate when different
machines fetch the same package, as the first such machine will have had
sufficient time to download the package before the second and subsequent
machines make their requests from the cache.
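
A rough way to see the effect is to count how often two machines would be
on the same package at the same step. This is only a sketch under strong
assumptions: equal-sized packages, machines moving in lockstep, GNU `shuf`
available. The helper name `count_collisions` and the package names are
made up for illustration.

```shell
#!/bin/sh
# Sketch: count the steps at which two machines fetch the same package.
# $1, $2: files listing one package per line, in each machine's download order.
# `count_collisions` is a hypothetical helper, not part of apt.
count_collisions() {
    paste "$1" "$2" | awk -F '\t' '$1 == $2 { n++ } END { print n + 0 }'
}

tmp=$(mktemp -d)
seq 1 100 | sed 's/^/pkg/' > "$tmp/fixed"   # 100 made-up package names
shuf "$tmp/fixed" > "$tmp/random_a"         # machine A, randomized order
shuf "$tmp/fixed" > "$tmp/random_b"         # machine B, randomized order

echo "same order:   $(count_collisions "$tmp/fixed" "$tmp/fixed") of 100 steps collide"
echo "random order: $(count_collisions "$tmp/random_a" "$tmp/random_b") of 100 steps collide"
rm -rf "$tmp"
```

With identical orders every step collides. With independent shuffles only
a handful of steps should, though the exact count varies from run to run.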

#111879#10
Date:
2001-09-10 17:39:56 UTC
From:
To:
It's the fault of the web cache/proxy that it does not start serving out the
partial content, and then multiplexing the content still coming in from the
first request, out to all the subsequent requests.

It is my opinion that this bug should be filed on the web proxy/cache software
you are using, and not on apt, and that this bug should be closed.

#111879#15
Date:
2001-09-10 20:22:14 UTC
From:
To:
Actually, I have played around with this kind of solution, using
'apt-get --print-uris' to generate data to simulate a 'dist-upgrade',
and HTTP proxy cache logs for data to simulate an 'update'.  In a
nutshell, the solution you propose can be worse than using no cache
at all.  The solution I propose improves performance in some situations,
regardless of the kind of cache used.

I have to deal with some eastern European site offices which have lots
of available bandwidth, but average 30% (best case 10%, worst case 80%,
one or two hours per month of no connectivity at all) packet loss
between the ISP and anything interesting, like a Debian mirror or another
corporate office.  Any one TCP connection is able to use less than 2% of
the bandwidth between sites--the TCP congestion window never opens up,
because every third packet disappears in transit.  If I open 10 or 20
TCP connections, each fetching a different package, each one behaves the
same as if I had opened only a single connection--there is no bandwidth
starvation nor significant additional latency, because the TCP congestion
window never opens more than one or two segments.

Some results of my simulations:

With no cache at all, 30% packet loss at the ISP, 3% of local bandwidth
consumed per TCP connection, and unmodified apt-get, the average
run time of a parallel 'apt-get dist-upgrade' or 'apt-get upgrade'
is identical given a machine pool sized between 1 and 10 machines.
Some machines finish faster than others--there is considerable variance
between machines.  No machine ever uses enough bandwidth to affect packet
loss, latency, or available bandwidth for other connections, so the
TCP connections don't interact with each other at all.  Once you get
20 or so machines running in parallel, things start to get non-linear.

With a cache that behaves as I described (no cache hits until a complete
object is downloaded), under the same conditions, the total run time of
'apt-get dist-upgrade' run in parallel on 10 machines is about 30% less,
because some of the machines do get cache hits if they are sufficiently
delayed during one HTTP object fetch that the following object can
be fetched from the cache.  In the event of a cache miss, behavior is
identical to the uncached scenario.  I think the 30% packet loss and
30% speed improvement in this simulation is just a coincidence--I can't
think of any mechanism by which they would be related.

With a cache that behaves as you described (partial content hits and
multiplexing live upstream HTTP server connections amongst downstream
HTTP clients), under the same conditions, the total run time is
constant for all machines, and equal to the worst-case run time of
the uncached case.  Actually it's very close to having one machine
perform the entire apt-get dist-upgrade through a cache, followed by
all other machines using the cache sequentially.

In practice, running all of the apt-get dist-upgrade's sequentially
through a cache under these conditions is too slow to be worth mentioning.
The Packages.gz files for unstable change while the dist-upgrades are
running, invalidating any cached copies of these files and losing packages
(update has to be run again), and generally wasting an entire day.

If all of the apt-get's in a cluster can be configured somehow to fetch
their packages in parallel through a caching web proxy that behaves
either as you describe (joining all requests to the same object into one)
or as I describe (treating all requests independently until a complete
object can be cached), the run time on all machines is equal, and it is
the total download time of the longest file in each of the groups of N
packages fetched in the update/upgrade.  N would be about 10, given the
network (non-)performance characteristics I've been discussing so far.
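
One reading of that estimate, as a sketch: split the download list into
groups of N, charge each group the time of its slowest file, and add the
groups up. The helper name `estimate_runtime` and the sample times are
assumptions for illustration only.

```shell
#!/bin/sh
# Sketch: estimate run time when files are fetched in parallel groups of N.
# stdin: one download time per line (any consistent unit); $1: group size N.
# `estimate_runtime` is a hypothetical helper name.
estimate_runtime() {
    awk -v n="$1" '
        { if ($1 > max) max = $1 }             # slowest file in the current group
        NR % n == 0 { total += max; max = 0 }  # group is full: add its maximum
        END { print total + max }              # add the last partial group, if any
    '
}

# Five files fetched two at a time: groups (3,1), (2,5), (4)
printf '%s\n' 3 1 2 5 4 | estimate_runtime 2   # prints 12
```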

If each apt-get fetches its objects in random order, as I proposed,
you get as close to this result as possible without introducing an
external synchronization mechanism.  If I understand correctly, apt-get
does not care about the order of downloaded packages, since it won't
actually install any until the entire download of all packages is
complete.

Hmmm...it occurs to me that in the process of gathering experimental data,
I've already got tools that could be easily adapted to pre-fill the
HTTP cache for a dist-upgrade.  Something like...

for x in \
http://http.us.debian.org/debian/dists/{stable,testing,unstable}/{main,contrib,non-free}/binary-{i386,alpha}/Packages.gz \
http://non-us.debian.org/debian-non-US/dists/{stable,testing,unstable}/{main,contrib,non-free}/binary-{i386,alpha}/Packages.gz \
; do
  wget --cache on --delete-after "$x" &
done

wait

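for x in `cat machines`; do
  # The remote script uses ${url//\'/}, a bash substitution, so run it with bash
  ssh "root@$x" bash <<'COMMANDS'
    apt-get update;
    apt-get --print-uris -y dist-upgrade | \
      while read -r url other; do
        case "$url" in
          \'*\')
            # Keep only lines whose first field is a quoted URL; strip the quotes
            echo "${url//\'/}"
          ;;
        esac
      done | \
      tr '\n' '\0' | \
      xargs -0 -P10 -n1 -rt wget --cache on --delete-after
COMMANDS
done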

...would do the job crudely, but quite effectively.  A little bit of
Perl could parse /etc/apt/sources.list and handle the pre-fetching more
gracefully than wget.  Hmmm.

Of course the other issue is that ultimately I don't control all of
the HTTP caching software.  Two of the caches I use are supplied by the
upstream ISP or the corporate IT department...if I wanted my own caching,
I'd have to build yet another cache behind these machines...

#111879#20
Date:
2001-09-10 22:49:02 UTC
From:
To:
The question is not about raw performance but about a correctly behaving
cache. APT explicitly does not support parallel gets (and I don't really
care if it is 1 machine or ten), and the arguments you present for
why it is so good hold for pretty much every other internet user.

The plain and simple fact is that your cache is broken. It should preserve
its hit rate by responding before a fetch completes, and if it
desperately makes sense, it should process all pending transactions in
parallel (APT actually issues 10 gets to the cache at once; the cache
could easily fetch them in parallel, it just can't respond out of order).

A better solution to your problem is a predictive, usage-sensitive,
Debian-specific cache. It could download during off-peak hours, at a
slower and kinder rate, and have the new .debs prepared before the client
requests them.

Nobody has written such a cache (apt-proxy doesn't quite do it all), but I
think it would be incredibly handy for large organizations such as yours.

Jason

#111879#25
Date:
2001-09-14 03:09:51 UTC
From:
To:
What additional features would be needed in such a cache?

- Fine-tuning cache parameters for .deb's, Packages files, etc.
  (I finally seem to have convinced Squid to do this correctly)
- Priming the cache before running upgrades (wget --delete-after)
- Bandwidth management (if this is really necessary off-peak, it could
  probably be accomplished at the network layer)

#111879#30
Date:
2001-09-27 05:01:05 UTC
From:
To:
It wouldn't be a cache so much as an active partial mirror, whose content
is driven by client requests. So if I fetch foo.deb one day, the 'cache'
would fetch all future versions of foo.deb and all its dependencies
recursively in anticipation that I would eventually upgrade foo.deb.

Some people might want some code to make the cache bounded in size though.

That only works for 1 workstation. You really need to aggregate
package selections for all workstations and manage the cache that way.

Jason

#111879#35
Date:
2001-09-27 05:13:14 UTC
From:
To:
This still sounds like a client-side app to me.  Aggregate package
selections, build a list of all installed packages, cross-reference that
with an up-to-date available packages list to resolve dependencies and find
URLs, and download everything through a standard-issue caching proxy server.
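
The aggregation step could be sketched roughly as follows, assuming each
machine's selections have been saved as `dpkg --get-selections` output.
The helper name `aggregate_selections` and the sample files are made up
for illustration.

```shell
#!/bin/sh
# Sketch: merge per-machine package selections into one de-duplicated list.
# Input files are assumed to hold `dpkg --get-selections` output, one
# "name<TAB>state" pair per line. `aggregate_selections` is a hypothetical name.
aggregate_selections() {
    # keep only packages marked "install", print the name, drop duplicates
    awk '$2 == "install" { print $1 }' "$@" | sort -u
}

# Demo with two made-up selection lists
tmp=$(mktemp -d)
printf 'libc6\tinstall\napt\tinstall\nexim\tdeinstall\n' > "$tmp/host1"
printf 'libc6\tinstall\nsquid\tinstall\n' > "$tmp/host2"
aggregate_selections "$tmp/host1" "$tmp/host2"
rm -rf "$tmp"
```

Cross-referencing the merged list against the available packages and
resolving dependencies is left out of this sketch.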

#111879#38
Date:
2008-12-27 10:33:28 UTC
From:
To:
Hello Zygo.

We now have many apt proxies in the Debian archive: apt-proxy, apt-cacher,
apt-cacher-ng, and approx. Can one of these tools do what you need?

#111879#43
Date:
2008-12-27 20:00:00 UTC
From:
To:
I still believe the best solution is for each client to request distinct
packages through a common caching proxy, and the less apt-specific the
proxy the better.  Assuming that coordination between the clients is
not available, the best way to achieve this effect is for each client
to fetch its packages in random order.  Even if there is a caching proxy
available that can stream data from partially-downloaded cached objects to
multiple clients, throughput through the LAN gateway is usually improved
if multiple clients fetch distinct packages concurrently from multiple
archive mirrors.

For a few years I had a shell script which did random-order prefetch of
package files based on 'apt-get --print-uris dist-upgrade', but this
stopped being useful after I started using aptitude, which created
different dist-upgrade plans that made the prefetch less useful.  Often
I would find that apt-get could not find a working upgrade solution at
all, which prevented it from generating any useful prefetch data.
Eventually I stopped maintaining this script.
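
That script is not reproduced here. The core step might have looked
something like the sketch below. It assumes each package line of
'apt-get --print-uris' starts with the URL in single quotes and that GNU
`shuf` is installed. The helper name `random_prefetch_list` and the sample
lines are made up for illustration.

```shell
#!/bin/sh
# Sketch: turn `apt-get --print-uris` output into a randomly ordered URL list.
# `random_prefetch_list` is a hypothetical helper name.
random_prefetch_list() {
    # print only lines that begin with a quoted URL, minus the quotes, then shuffle
    sed -n "s/^'\([^']*\)'.*/\1/p" | shuf
}

# Made-up sample standing in for `apt-get --print-uris -y dist-upgrade`
sample="Reading package lists...
'http://mirror.example/pool/main/a/apt/apt_1.0_i386.deb' apt_1.0_i386.deb 1000 MD5Sum:0
'http://mirror.example/pool/main/g/glibc/libc6_2.0_i386.deb' libc6_2.0_i386.deb 2000 MD5Sum:0"

printf '%s\n' "$sample" | random_prefetch_list
```

The shuffled list could then be fed to a fetcher that goes through the
proxy and discards the files, as in the earlier wget loop.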

I had similar problems with the Gnome apt-watcher applet,
unattended-upgrades, apt-cacher, and similar prefetching-oriented
utilities.  They require detailed information on each client's installed
package lists--often they need to be installed and running on each
client--in order to assess what packages are prefetching candidates.
In the event that manual dependency resolution was required, there would
often be no prefetching at all.

Over the past seven years our networks have gotten a lot faster.  Not only
is bandwidth sufficient for a single client to download all upgrades in
a reasonable period of time, but we can now tolerate dozens of concurrent
downloads of the same packages to different machines on the same LAN.
Bandwidth and throughput limits have increased, costs have decreased, and
we have had no incentive to optimize apt's network usage for many years.

We're back to using unmodified apt-get/aptitude with a standard HTTP
proxy (e.g. Squid or Apache mod_proxy) configured to cache very large
files, and we ignore the fact that we can be concurrently downloading
the same package 20 or more times before a cached copy of the package
becomes complete and available in the caching proxy server.

#111879#81
Date:
2015-05-28 21:03:03 UTC
From:
To:
Hello,

After a spam e-mail reaching deity@, I was reading the thread of
messages of the bug and I think that it should be closed.

The discussion is about whether it makes sense or not to download in
random order, to improve cache hit rate and network performance.  The
issue is 14 years old, and 7 years ago (2008) even the submitter
declared that it no longer mattered for their use case, thanks to improved
bandwidth and network technology at their organisation.

Perhaps it wouldn't be hard to implement an extra option to provide
this functionality, but it is also adding complexity for little gain,
because (unless there are other reports asking for the same feature)
this only gathered the original submitter as supporter and a couple of
rejections, including apt's creator/maintainer at the time.

By now, I think that this report is only gathering dust and spam, so
(as a random bystander, no authority as maintainer or anything) I
think that it's better to close it.  Perhaps I will do it myself if nobody
else reacts/complains in a while.


Cheers.
