I am maintaining a group of 50+ machines running Debian. These are a mixture of stable, testing, and unstable machines distributed geographically across three countries. All have a local caching HTTP proxy on their LANs, and they are configured to use it.
At regular intervals 'apt-get update; apt-get -ufy dist-upgrade' is run on these machines almost simultaneously (it is actually started manually by local support personnel, who for various unfortunate reasons do it during the same hour each day). This has a nasty side effect: if a large number of packages is upgraded that day, many machines may attempt to download all of the same packages in the same order through the same proxy. The HTTP caches can't have a cache hit on a package until that package is fully downloaded, so ultimately many machines end up downloading the same package at the same time through the same HTTP proxy and Internet feed.
Using rsync to create a local copy of debian.org uses much more bandwidth than 50 machines simultaneously downloading new libc6 packages, so the HTTP cache strategy seems to be more efficient even if it wastes 70% of its bandwidth in the process.
It would be nice if apt-get could randomize the order in which it attempts to fetch packages (and package list files, for that matter) from any given source. In the absence of a scheme to schedule network traffic synchronously among many machines running apt-get simultaneously, randomizing the download order within apt-get would maximize the probability that any two machines are fetching different packages at any given time. This in turn improves the cache hit rate when different machines fetch the same package, as the first such machine will have had sufficient time to download the package before the second and subsequent machines request it from the cache.
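The effect described above can be illustrated with a toy model. This is only a sketch, not anything apt-get does: it assumes all machines run in perfect lockstep, every package takes one time slot to download on a miss, a cache hit costs nothing, and an object becomes cacheable only at the end of the slot in which some machine finished it. The function name `simulate_misses` and its arguments are made up for illustration.

```shell
#!/bin/sh
# Toy model: M machines fetch the same P packages through a cache that only
# serves an object after some client has downloaded it completely.
# Prints the number of upstream downloads (cache misses).
# usage: simulate_misses MACHINES PACKAGES MODE [SEED]   (MODE: same | random)
simulate_misses() {
    awk -v m="$1" -v p="$2" -v mode="$3" -v seed="${4:-1}" 'BEGIN {
        srand(seed)
        for (i = 1; i <= m; i++) {
            for (j = 1; j <= p; j++) order[i, j] = j
            if (mode == "random") {
                # Fisher-Yates shuffle of this machine'"'"'s fetch order
                for (j = p; j > 1; j--) {
                    k = int(rand() * j) + 1
                    t = order[i, j]; order[i, j] = order[i, k]; order[i, k] = t
                }
            }
            pos[i] = 1
        }
        misses = 0
        do {
            active = 0
            n_new = 0
            for (i = 1; i <= m; i++) {
                # Skip packages that are already complete in the cache (hits).
                while (pos[i] <= p && cached[order[i, pos[i]]]) pos[i]++
                if (pos[i] <= p) {
                    # Miss: this machine spends the slot downloading upstream.
                    misses++
                    newly[++n_new] = order[i, pos[i]]
                    pos[i]++
                    active = 1
                }
            }
            # Objects become cacheable only at the end of the slot.
            for (k = 1; k <= n_new; k++) cached[newly[k]] = 1
        } while (active)
        print misses
    }'
}

# Example: 10 machines, 20 packages.
simulate_misses 10 20 same        # every machine misses every package
simulate_misses 10 20 random 42   # fewer upstream downloads
```

In this model the identical-order run always costs MACHINES x PACKAGES upstream downloads, while the shuffled run can never cost more than that and never fewer than PACKAGES. How much it saves depends on the seed and on how well real networks match the lockstep assumption.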
It's the fault of the web cache/proxy that it does not start serving the partial content and then multiplex the content still coming in from the first request out to all the subsequent requests. It is my opinion that this bug should be filed against the web proxy/cache software you are using, not against apt, and that this bug should be closed.
Actually, I have played around with this kind of solution, using 'apt-get --print-uris' to generate data to simulate a 'dist-upgrade', and HTTP proxy cache logs for data to simulate an 'update'. In a nutshell, the solution you propose can be worse than using no cache at all. The solution I propose improves performance in some situations, regardless of the kind of cache used.
I have to deal with some eastern European site offices which have lots of available bandwidth, but average 30% (best case 10%, worst case 80%, one or two hours per month of no connectivity at all) packet loss between the ISP and anything interesting, like a Debian mirror or another corporate office. Any one TCP connection is able to use less than 2% of the bandwidth between sites--the TCP congestion window never opens up, because every third packet disappears in transit. If I open 10 or 20 TCP connections, each fetching a different package, each one behaves the same as if I had opened only a single connection--there is no bandwidth starvation nor significant additional latency, because the TCP congestion window never opens more than one or two segments.
Some results of my simulations:
With no cache at all, 30% packet loss at the ISP, 3% of local bandwidth consumed per TCP connection, and unmodified apt-get, the average run time of a parallel 'apt-get dist-upgrade' or 'apt-get upgrade' is identical for a machine pool of between 1 and 10 machines. Some machines finish faster than others--there is considerable variance between machines. No machine ever uses enough bandwidth to affect packet loss, latency, or available bandwidth for other connections, so the TCP connections don't interact with each other at all. Once you get 20 or so machines running in parallel, things start to get non-linear.
With a cache that behaves as I described (no cache hits until a complete object is downloaded), under the same conditions, the total run time of 'apt-get dist-upgrade' run in parallel on 10 machines is about 30% less, because some of the machines do get cache hits if they are sufficiently delayed during one HTTP object fetch that the following object can be fetched from the cache. In the event of a cache miss, behavior is identical to the uncached scenario. I think the 30% packet loss and 30% speed improvement in this simulation are just a coincidence--I can't think of any mechanism by which they would be related.
With a cache that behaves as you described (partial content hits and multiplexing live upstream HTTP server connections amongst downstream HTTP clients), under the same conditions, the total run time is constant for all machines, and equal to the worst-case run time of the uncached case. Actually it's very close to having one machine perform the entire apt-get dist-upgrade through a cache, followed by all other machines using the cache sequentially. In practice, running all of the apt-get dist-upgrade runs sequentially through a cache under these conditions is too slow to be worth mentioning. The Packages.gz files for unstable change while the dist-upgrades are running, invalidating any cached copies of these files and losing packages (update has to be run again), and generally wasting an entire day.
If all of the apt-get processes in a cluster can be configured somehow to fetch their packages in parallel through a caching web proxy that behaves either as you describe (joining all requests to the same object into one) or as I describe (treating all requests independently until a complete object can be cached), the run time on all machines is equal, and it is the sum, over the groups of N packages fetched in the update/upgrade, of the download time of the longest file in each group. N would be about 10, given the network (non-)performance characteristics I've been discussing so far.
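The run-time estimate in the previous paragraph can be worked through as arithmetic. The sketch below assumes that packages are fetched in consecutive groups of N in list order, that all N transfers in a group start together, and that every connection gets the same fixed rate. The function name `grouped_runtime` and its argument order are hypothetical.

```shell
#!/bin/sh
# Estimate total run time when packages are fetched N at a time:
# each group takes as long as its largest file, and the groups run in sequence.
# usage: grouped_runtime N RATE < sizes
#   N     - packages fetched in parallel per group
#   RATE  - per-connection throughput, in size units per second
#   stdin - one package size per line
grouped_runtime() {
    awk -v n="$1" -v rate="$2" '
        {
            if ($1 > max) max = $1                    # largest file in this group so far
            if (NR % n == 0) { total += max; max = 0 }  # group complete
        }
        END {
            if (NR % n != 0) total += max             # last, partially filled group
            printf "%.1f\n", total / rate
        }'
}

# Example: five packages, two connections, 100 units/s per connection.
# Groups are (100,300) (200,50) (400), so the time is (300+200+400)/100.
printf '%s\n' 100 300 200 50 400 | grouped_runtime 2 100
```

With N large enough to hold every package in a single group, the estimate reduces to the download time of the single largest file.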
If each apt-get fetches its objects in random order, as I proposed, you get as close to this result as possible without introducing an external synchronization mechanism. If I understand correctly, apt-get does not care about the order of downloaded packages, since it won't actually install any until the entire download of all packages is complete.
Hmmm...it occurs to me that in the process of gathering experimental data, I've already got tools that could be easily adapted to pre-fill the HTTP cache for a dist-upgrade. Something like...

# Pull the package lists through the cache, all in parallel.
for x in \
  http://http.us.debian.org/debian/dists/{stable,testing,unstable}/{main,contrib,non-free}/binary-{i386,alpha}/Packages.gz \
  http://non-us.debian.org/debian-non-US/dists/{stable,testing,unstable}/{main,contrib,non-free}/binary-{i386,alpha}/Packages.gz \
  ; do
  wget --cache on --delete-after "$x" &
done
wait

# On each machine, fetch the packages it would upgrade, 10 at a time,
# discarding the files so that only the cache keeps a copy.
# (bash rather than sh on the remote side: ${url//\'/} is a bashism.)
for x in `cat machines`; do
  ssh root@$x bash <<'COMMANDS'
apt-get update; apt-get --print-uris -y dist-upgrade | \
  while read url other; do
    case "$url" in
      \'*\') echo "${url//\'/}" ;;   # keep only the quoted URI lines
    esac
  done | \
  tr '\n' '\0' | \
  xargs -0 -P10 -n1 -rt wget --cache on --delete-after
COMMANDS
done

...would do the job crudely, but quite effectively. A little bit of Perl could parse /etc/apt/sources.list and handle the pre-fetching more gracefully than wget. Hmmm.
Of course the other issue is that ultimately I don't control all of the HTTP caching software. Two of the caches I use are supplied by the upstream ISP or the corporate IT department...if I wanted my own caching, I'd have to build yet another cache behind these machines...
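The pre-fill idea and the random-order idea could be combined by shuffling the URI list on each client before fetching. The following is a minimal sketch, not a tested tool: it assumes GNU coreutils `shuf`, GNU `xargs`, and `wget` are installed, and that `http_proxy` already points at the local cache. The function names `shuffled_uris` and `prefetch` are made up for illustration.

```shell
#!/bin/sh
# Read `apt-get --print-uris` output on stdin and print just the URIs,
# one per line, in random order. URI lines look like:
#   'http://host/path/foo_1.0_i386.deb' foo_1.0_i386.deb 12345 MD5Sum:...
# Any other line (progress messages and so on) is dropped.
shuffled_uris() {
    sed -n "s/^'\([^']*\)'.*/\1/p" | shuf
}

# Read URIs on stdin and fetch them through the proxy, ten at a time.
# --delete-after discards each file once fetched, so only the cache keeps a copy.
prefetch() {
    xargs -r -P10 -n1 wget -q --delete-after
}

# Intended use on each client (not executed here):
#   apt-get update
#   apt-get --print-uris -y dist-upgrade | shuffled_uris | prefetch
```

Because `shuf` draws its own randomness on every run, machines started in the same second still get different orders, which is the property the random-order proposal depends on.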
The question is not about raw performance but about a correctly behaving cache. APT explicitly does not support parallel gets (and I don't really care whether it is one machine or ten), and the arguments you present for why it is so good hold for pretty much every other Internet user. The plain and simple fact is that your cache is broken. It should preserve its hit rate by responding before a fetch completes, and if it desperately makes sense, it should process all pending transactions in parallel (APT actually issues 10 gets to the cache at once; the cache could easily fetch them in parallel, it just can't respond out of order).
A better solution to your problem is a predictive, usage-sensitive, Debian-specific cache. It could download during off-peak hours, at a slower and kinder rate, and have the new .debs prepared before the client requests them. Nobody has written such a cache (apt-proxy doesn't quite do it all), but I think it would be incredibly handy for large organizations such as yours.
Jason
What additional features would be needed in such a cache?
- Fine-tuning cache parameters for .debs, Packages files, etc. (I finally seem to have convinced Squid to do this correctly)
- Priming the cache before running upgrades (wget --delete-after)
- Bandwidth management (if this is really necessary off-peak, it could probably be accomplished at the network layer)
It wouldn't be a cache so much as an active partial mirror whose content is driven by client requests. So if I fetch foo.deb one day, the 'cache' would fetch all future versions of foo.deb and all its dependencies recursively, in anticipation that I would eventually upgrade foo.deb. Some people might want some code to make the cache bounded in size, though.
That only works for 1 workstation. You really need to aggregate package selections for all workstations and manage the cache that way.
Jason
This still sounds like a client-side app to me. Aggregate package selections, build a list of all installed packages, cross-reference that with an up-to-date available packages list to resolve dependencies and find URLs, and download everything through a standard-issue caching proxy server.
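As a rough illustration of the first step only (aggregating selections), the sketch below merges per-machine `dpkg --get-selections` dumps into one list of installed package names. The function name `aggregate_selections` and the `selections/` directory in the usage comment are hypothetical. Resolving dependencies and finding URLs is left to apt and is not shown.

```shell
#!/bin/sh
# Merge several `dpkg --get-selections` dumps (one file per machine) into a
# single sorted list of package names marked "install" on at least one machine.
# usage: aggregate_selections FILE...
aggregate_selections() {
    awk '$2 == "install" { print $1 }' "$@" | sort -u
}

# Possible way to collect the input (not executed here), reusing a 'machines'
# file with one host name per line:
#   mkdir -p selections
#   for x in `cat machines`; do
#     ssh root@$x dpkg --get-selections > selections/$x
#   done
#   aggregate_selections selections/*
```

Packages marked deinstall, purge, or hold are ignored here; whether held packages should count is a policy decision this sketch does not try to make.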
Hello Zygo. We now have many apt proxies in the Debian archive: apt-proxy, apt-cacher, apt-cacher-ng, approx. Can one of these tools do what you need?
I still believe the best solution is for each client to request distinct packages through a common caching proxy, and the less apt-specific the proxy the better. Assuming that coordination between the clients is not available, the best way to achieve this effect is for each client to fetch its packages in random order. Even if there is a caching proxy available that can stream data from partially-downloaded cached objects to multiple clients, throughput through the LAN gateway is usually improved if multiple clients fetch distinct packages concurrently from multiple archive mirrors.
For a few years I had a shell script which did random-order prefetch of package files based on 'apt-get --print-uris dist-upgrade', but this stopped being useful after I started using aptitude, which created different dist-upgrade plans that made the prefetch less useful. Often I would find that apt-get could not find a working upgrade solution at all, which prevented it from generating any useful prefetch data. Eventually I stopped maintaining this script.
I had similar problems with the Gnome apt-watcher applet, unattended-upgrades, apt-cacher, and similar prefetching-oriented utilities. They require detailed information on each client's installed package lists--often they need to be installed and running on each client--in order to assess which packages are prefetching candidates. Whenever manual dependency resolution was required, there would often be no prefetching at all.
Since I filed this report seven years ago, our networks have gotten a lot faster. Not only is bandwidth sufficient for a single client to download all upgrades in a reasonable period of time, but we can now tolerate dozens of concurrent downloads of the same packages to different machines on the same LAN. Bandwidth and throughput limits have increased, costs have decreased, and we have had no incentive to optimize apt's network usage for many years.
We're back to using unmodified apt-get/aptitude with a standard HTTP proxy (e.g. Squid or Apache mod_proxy) configured to cache very large files, and we ignore the fact that we can be concurrently downloading the same package 20 or more times before a cached copy of the package becomes complete and available in the caching proxy server.
Hello,
After a spam e-mail reached deity@, I read through the thread of messages on this bug, and I think that it should be closed.
The discussion is about whether or not it makes sense to download in random order, to improve cache hit rate and network performance. The issue is 14 years old, and 7 years ago (2008) even the submitter declared that it no longer mattered for their use case, due to improvements in the bandwidth and network technologies of their organisation.
Perhaps it wouldn't be hard to implement an extra option to provide this functionality, but it would also add complexity for little gain, because (unless there are other reports asking for the same feature) this report gathered only the original submitter as a supporter and a couple of rejections, including one from apt's creator/maintainer at the time.
By now, I think that this report is only gathering dust and spam, so (as a random bystander, with no authority as maintainer or anything) I think that it's better to close it. Perhaps I will do it myself if nobody else reacts or complains in a while.
Cheers.