I think I have a very good idea of what is causing all those MD5Sum mismatch errors during apt-get update. ( http://article.gmane.org/gmane.linux.debian.user.mirrors/1368 )

You see, during a single apt-get update, there will be TWO (2) queries made to the DNS server for each ONE (1) line in a sources.list file. I believe one query gets the thing. The other gets the checksum of the thing.

Now you can guess what will happen when that one line is a round robin site name. Yup, if the _two different machines_ now being called are slightly out of sync, naturally the checksums will not match!

The cure is to fix apt so that it only makes one query! Making a second query not only does not even out the total load on the servers any more, it also means there are several windows of time each day when you are comparing apples from machine 1 to oranges from machine 2! Keep it all on one machine and you will be safe.

You can test it yourself. Turn on verbose debugging in your DNS server, and do apt-get update, and check the log. Voila, two queries for each one line in sources.list! Now try a
$ ping example.com
Check your DNS logs. Only one DNS query is made, despite many repeated connections. Ping has got it right. Apt has got it wrong.
Gentlemen, junior programmer me has finally found the reason behind apt's MD5Sum mismatches: multiple DNS queries! http://bugs.debian.org/cgi-bin/bugreport.cgi?bug=636292
retitle 636292 MD5Sum mismatch error
thanks
jidanni@jidanni.org wrote on Tue 02 Aug 2011 09:10:40 +0800:
I'm getting the error on all ftp.{uk,ch,fr}.debian.org sites, which do
not use round robin at all.
Samuel
ST> I'm getting the error on all ftp.{uk,ch,fr}.debian.org sites, which do
ST> not use round robin at all.
All I know is rocky-mountain.csail.mit.edu is rock solid. Try that.
Hi, 積丹尼 wrote: That particular consequence of mirrors' use of round-robin DNS is tracked as Bug#582352. As far as I can tell, it violates the HTTP spec and can confuse proxies even if the clients are fixed. I would be willing to carry out a protocol change to make this work (doing one DNS query and using the IP as hostname from then on), but it's not clear anyone involved is interested, so for now I just avoid round-robin DNS in sources.list on machines I manage. Thanks for the reproduction recipe.
forcemerge 636292 582352
thanks

JN> 積丹尼 wrote:
JN> That particular consequence of mirrors' use of round-robin DNS is
JN> tracked as Bug#582352. As far as I can tell, it violates the HTTP
JN> spec and can confuse proxies even if the clients are fixed. I would
JN> be willing to carry out a protocol change to make this work (doing one
JN> DNS query and using the IP as hostname from then on), but it's not
JN> clear anyone involved is interested, so for now I just avoid
JN> round-robin DNS in sources.list on machines I manage.
JN> Thanks for the reproduction recipe.

I'll forcemerge the bugs. That will swing them into action.
the A (ipv4) and AAAA (ipv6) record:
19:03:20.575070 IP localhost.35750 > localhost.domain: 41865+ A? ftp.be.debian.org. (35)
19:03:20.575688 IP localhost.domain > localhost.35750: 41865 1/4/7 A 77.243.184.65 (281)
19:03:20.575885 IP localhost.35750 > localhost.domain: 48866+ AAAA? ftp.be.debian.org. (35)
19:03:20.576190 IP localhost.domain > localhost.35750: 48866 1/4/7 AAAA 2a01:300:11:4:2e0:81ff:fe63:cdb2 (293)
There are no other queries, and this is perfectly normal. There
is nothing wrong with this.
Even with multiple lines in the sources.list file I only see those
2 requests.
(tested with apt 0.8.15.4, I doubt 0.8.15.5 behaves differently.)
As far as I know, the hash sum mismatches have one of these
causes:
- The mirror uses an old version of the mirror script that didn't
exclude InRelease in the first stage. As a result the InRelease
file is already updated while the Packages/Sources files aren't
for a long time. This has been a problem since ftp-master started
generating those InRelease files, which was just after the
squeeze release.
- There is always a delay between updating the Release file and
the Packages and Sources files, and the error should go away
after a short time.
- ftp-master generated broken files for some reason. It sometimes
happens, but not that often.
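For readers who want to see what such a mismatch amounts to, here is a minimal sketch in Python. It assumes the usual Release layout, where an "MD5Sum:" field is followed by indented "hash size path" lines. The helper names parse_md5sums and matches_release are made up for this illustration and are not part of apt.

```python
import hashlib

def parse_md5sums(release_text):
    """Return {path: (md5hex, size)} from the MD5Sum section of a Release file."""
    entries = {}
    in_section = False
    for line in release_text.splitlines():
        if line.startswith("MD5Sum:"):
            in_section = True
            continue
        if in_section:
            if not line.startswith(" "):
                break  # the next field starts here, so the section is over
            md5hex, size, path = line.split()
            entries[path] = (md5hex, int(size))
    return entries

def matches_release(entries, path, data):
    """True if data has the size and MD5 that the Release file lists for path."""
    if path not in entries:
        return False
    md5hex, size = entries[path]
    return len(data) == size and hashlib.md5(data).hexdigest() == md5hex
```

If the Release file comes from one sync generation and the Packages file from another, matches_release returns False, which is the situation apt reports as a mismatch.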
So I suggest you make sure that all the mirrors that you see
an issue with have updated their mirror script, since I think
that's the biggest issue at the moment.
This was fixed with this commit in archvsync:
commit 77223bb1af262e139a898020a05680e932d51888
Author: Joerg Jaspert <joerg@debian.org>
Date: Tue Feb 22 22:32:13 2011 +0100
ftpsync
update rsync_options1 to also exclude the newish InRelease files in the first run
Signed-off-by: Joerg Jaspert <joerg@debian.org>
This is part of the 80387 version that you can find in
project/ftpsync/ on the Debian mirrors. 80387 was released
the next day.
If they are using this script to update the mirror, you should
be able to see the version in project/trace/
If there is no version in that file (only a date) they're probably
using an even older script that's also broken.
If they're not using that script or the latest version of it, you
will very likely see the hash sum issues during the mirror sync.
Another issue might be that you're behind some broken transparent
proxy and your connection gets directed to a different server for
each file you get. As far as I know apt will only open 1
connection to the server and request all files over that single
connection, so this really shouldn't happen.
Kurt
(cc's kept since I am not really sure everyone involved is subscribed to debian-mirrors. If you want me to start trimming them down, please say so).

Hmm, a normal request like this is supposed to return a number of A or AAAA records for, e.g. ftp.us.debian.org, and not just one.

Just so that we can close that door completely, does apt do the right thing and always use the same A record or AAAA record from the returned set, switching to the next one only if there are problems? I believe it does it right, but it would be nice to have a definitive answer on it (and I don't really grok apt to take a quick look at the source to check it myself).

That is actually quite possible. However, it is also something we can check for certain, so it is time to inspect the project/trace/* files in every mirror on the multi-mirror aliases that users have complained about.

That might not be true if it is a http/1.0 proxy, or if persistent connections get disabled for whatever reason. In that case, apt would have to make multiple connections, and therefore any proxy, transparent or not, would likely round-robin over the multiple A and AAAA records.

The answer for that would be to update our repository format to have something seqlock-like to allow apt to detect metadata generation mismatch, and thus be able to automatically refetch things until it gets all metadata with the same generation number: http://en.wikipedia.org/wiki/Seqlock

Maybe using rsync or ftp can help, if it enforces the "get everything using the same connection" that http might or might not allow apt to do. But that does NOT scale well at the mirror server side, at all.
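The seqlock-like idea above can be sketched roughly as follows, in Python. This is only an illustration under assumptions: the repository format has no generation number today, and fetch_consistent, the fetch callable and the FlakyMirror toy class are all hypothetical names. The client refetches the whole set until every file reports the same generation.

```python
def fetch_consistent(fetch, paths, max_attempts=5):
    """Fetch every path until all copies carry the same generation number.

    fetch(path) is a hypothetical callable returning (generation, data).
    """
    for _ in range(max_attempts):
        results = {path: fetch(path) for path in paths}
        generations = {gen for gen, _ in results.values()}
        if len(generations) == 1:
            # Every file came from the same sync run, so the set is usable
            return generations.pop(), {p: d for p, (_, d) in results.items()}
    raise RuntimeError("metadata still inconsistent after %d attempts" % max_attempts)

class FlakyMirror:
    """Toy mirror whose generation advances once, between the first two requests."""
    def __init__(self):
        self.generation = 1
        self.calls = 0

    def fetch(self, path):
        self.calls += 1
        if self.calls == 2:
            self.generation = 2  # the mirror syncs between two requests
        return self.generation, "%s@%d" % (path, self.generation)
```

With the toy mirror, the first pass sees generations 1 and 2 and is thrown away; the second pass sees only generation 2 and is accepted. A real design would also need to bound the retries, as the sketch does, so that a permanently skewed pair of mirrors ends in an error rather than a loop.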
Oh my god even my "rock solid" rocky-mountain server is crumbling today:

W: Failed to fetch http://rocky-mountain.csail.mit.edu/debian/dists/experimental/main/binary-i386/PackagesIndex MD5Sum mismatch
W: Failed to fetch http://rocky-mountain.csail.mit.edu/debian/dists/unstable/main/binary-i386/PackagesIndex MD5Sum mismatch
E: Some index files failed to download. They have been ignored, or old ones used instead.

My theories are up in the air. My reputation is ruined.
Ha ha ha, it really does split a single apt-get update into two different places completely across the Internet. And maybe even for singular servers like rocky-mountain... maybe upstream from it is the same splitting problem somewhere. Anyway here we go:

# cat /etc/apt/sources.list.d/*
deb http://ftp.us.debian.org/debian unstable contrib
# tcpflow -i ppp0 &
# apt-get update
# ls -og /tmp/m
-rw-r--r-- 1 146150 Aug 6 05:54 064.050.233.100.00080-218.163.001.135.45826
-rw-r--r-- 1 68985 Aug 6 05:54 199.006.012.070.00080-218.163.001.135.56243
-rw-r--r-- 1 185 Aug 6 05:54 218.163.001.135.45826-064.050.233.100.00080
-rw-r--r-- 1 1432 Aug 6 05:54 218.163.001.135.56243-199.006.012.070.00080
$ host ftp.us.debian.org
ftp.us.debian.org has address 128.30.2.36
ftp.us.debian.org has address 199.6.12.70
ftp.us.debian.org has address 35.9.37.225
ftp.us.debian.org has address 64.50.233.100
ftp.us.debian.org has address 64.50.236.52
ftp.us.debian.org has IPv6 address 2001:500:61:28::70
KR> - There is always a delay between updating the Release file and
KR> the Packages and Sources file, and the error should go away
KR> after a short time.

NOT acceptable. I hope on the mirrors they are not doing something like
$ cd staging_area && wget a b
when they should be doing
$ wget a b && mv a b staging_area
With a and b and staging_area all being on the same disk partition, for almost an atomic operation... OK this is probably not the culprit today, but it is just good practice.
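The "download first, then move into place" pattern can be sketched in Python as below. This is an illustration only, not how ftpsync works; publish_atomically is a made-up name. It relies on os.replace, which is an atomic rename on POSIX when source and target are on the same filesystem.

```python
import os
import tempfile

def publish_atomically(target_dir, files):
    """Write each file next to its final location, then rename it into place."""
    staged = []
    for name, data in files.items():
        # Temporary file in the same directory, hence on the same filesystem
        fd, tmp_path = tempfile.mkstemp(dir=target_dir, prefix=".tmp-")
        with os.fdopen(fd, "wb") as tmp:
            tmp.write(data)
        staged.append((tmp_path, os.path.join(target_dir, name)))
    # The slow writes are done; each rename below is individually atomic
    for tmp_path, final_path in staged:
        os.replace(tmp_path, final_path)
```

Each rename is atomic, but the sequence of renames is not, so a reader can still observe a new Release next to an old Packages file for a brief moment. That matches the "almost an atomic operation" wording above.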
H> Maybe using rsync or ftp can help, if it enforces the "get everything
H> using the same connection" that http might or might not allow apt to do.
H> But that does NOT scale well at the mirror server side, at all.

Well whatever you do, remember a+b+c+a+b+c=a+a+b+b+c+c, so please be sure no round robin switching is occurring when it shouldn't. No matter during user operations or mirror operations. In the big picture the load all evens out anyway, so no savings are had, and instead errors are introduced.
Except that it's about 1000 files. This is basically what rsync --delay-updates does, and what is being used. And on a very busy mirror this can actually take some time to do.

Kurt
KR> Except that it's about 1000 files. This is basically what rsync
KR> --delay-updates does, and what is being used. And on a very busy
KR> mirror this can actually take some time to do.

Well all I know is the 998 .debs should be done first. Then the 1 index file and 1 checksum file second. And that second step being as atomic as
$ ln a b staging_area
You get into trouble when you put the president on the same slow train as the common person, even if he is supposed to arrive after the other participants are seated.
No, this is 1000 index files. Please note that we have more than 1 suite and more than 1 arch, and each of those has several files. Just take a look at the Release file itself to know how many files need to be updated at the same time.

The new .debs are done first, so that if you get a Packages or Sources file, you can actually download the files mentioned in those files. They are directly copied to the correct place since they are new files and not updated files.

Then the Packages, Sources, Release and other files are first all transferred, then moved to the correct place.

After that old files are removed. And ftp-master only removes them once no Packages or Sources file has mentioned them for a few days.

The critical part is moving all the Packages/Sources/Release files to the new place. You want to do that in as short a time as possible.

The problem you're most likely seeing is that the InRelease file is done together with copying the .deb files, while it should be part of the Packages/Sources/Release files part. And I already explained that part.

But also note that an atomic update on the server side doesn't help. If I start downloading the Release file, and while I'm downloading the Release file the Release/Packages files are updated on the server, and then download a Packages file, the Packages and Release file still won't be from the same time.

Kurt
Well, that has two problems we have observed in practice:

1. Not all mirrors have up-to-date mirror scripts, and that _does_ include mirrors selected for the multi-mirror aliases;

2. Mirrors in the same multi-mirror alias are not updated at the same time, and it is very possible (especially in http scenarios) to get metadata skew problems across mirrors even when they are perfectly fine and internally consistent.

That doesn't even need a third issue (multiple DNS queries) to cause problems, way too many users are behind http proxies and caches that break things regardless.

Maybe we should start designing sequence tagging/generation tagging for the metadata? If nobody has time to implement it right now, it would be a damn fine GSOC project for 2013...
H> That doesn't even need a third issue (multiple DNS queries)

OK. But at least that part could be fixed now. No denying it is happening, as I showed with tcpflow!
SP> Could you please try to use ftp.us.d.o and confirm up to date ftpsync on all
SP> backends solved your problem ?

I would be extremely ecstatically happy to. However, as I _proved_ in 636292 using tcpflow(1), a simple "apt-get update" will make TWO calls to the DNS. The checksum will come from a _different_ round robin machine, four out of five times. It's Russian Roulette. I can't bear to pull the trigger.

A user would have to be crazy to use a round robin mirror until the apt team finally gets around to fixing this probably one line bug.
First it does some UDP thing to all IP addresses:
[pid 25433] socket(PF_INET, SOCK_DGRAM, IPPROTO_IP) = 3
[pid 25433] connect(3, {sa_family=AF_INET, sin_port=htons(80), sin_addr=inet_addr("128.30.2.36")}, 16) = 0
[pid 25433] getsockname(3, {sa_family=AF_INET, sin_port=htons(49660), sin_addr=inet_addr("10.0.200.1")}, [16]) = 0
[pid 25433] connect(3, {sa_family=AF_UNSPEC, sa_data="\0\0\0\0\0\0\0\0\0\0\0\0\0\0"}, 16) = 0
[pid 25433] connect(3, {sa_family=AF_INET, sin_port=htons(80), sin_addr=inet_addr("199.6.12.70")}, 16) = 0
[pid 25433] getsockname(3, {sa_family=AF_INET, sin_port=htons(35821), sin_addr=inet_addr("10.0.200.1")}, [16]) = 0
[pid 25433] connect(3, {sa_family=AF_UNSPEC, sa_data="\0\0\0\0\0\0\0\0\0\0\0\0\0\0"}, 16) = 0
[pid 25433] connect(3, {sa_family=AF_INET, sin_port=htons(80), sin_addr=inet_addr("35.9.37.225")}, 16) = 0
[pid 25433] getsockname(3, {sa_family=AF_INET, sin_port=htons(52379), sin_addr=inet_addr("10.0.200.1")}, [16]) = 0
[pid 25433] connect(3, {sa_family=AF_UNSPEC, sa_data="\0\0\0\0\0\0\0\0\0\0\0\0\0\0"}, 16) = 0
[pid 25433] connect(3, {sa_family=AF_INET, sin_port=htons(80), sin_addr=inet_addr("64.50.233.100")}, 16) = 0
[pid 25433] getsockname(3, {sa_family=AF_INET, sin_port=htons(39421), sin_addr=inet_addr("10.0.200.1")}, [16]) = 0
[pid 25433] connect(3, {sa_family=AF_UNSPEC, sa_data="\0\0\0\0\0\0\0\0\0\0\0\0\0\0"}, 16) = 0
[pid 25433] connect(3, {sa_family=AF_INET, sin_port=htons(80), sin_addr=inet_addr("64.50.236.52")}, 16) = 0
[pid 25433] getsockname(3, {sa_family=AF_INET, sin_port=htons(37020), sin_addr=inet_addr("10.0.200.1")}, [16]) = 0
[pid 25433] close(3) = 0
[pid 25433] socket(PF_INET6, SOCK_DGRAM, IPPROTO_IP) = 3
[pid 25433] connect(3, {sa_family=AF_INET6, sin6_port=htons(80), inet_pton(AF_INET6, "2001:500:61:28::70", &sin6_addr), sin6_flowinfo=0, sin6_scope_id=0}, 28) = 0
[pid 25433] getsockname(3, {sa_family=AF_INET6, sin6_port=htons(50188), inet_pton(AF_INET6, "2001:0:53aa:64c:2ca7:460f:aeac:9430", &sin6_addr), sin6_flowinfo=0, sin6_scope_id=0}, [28]) = 0
[pid 25433] close(3) = 0
No idea what it's really trying to do, but I guess it's trying to see which of them are routable.
The AF_UNSPEC part probably doesn't make much sense.
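One possible reading of the trace above, offered as an assumption rather than something confirmed in this thread: it looks like the probe a resolver library does when it sorts candidate destinations (as described in RFC 3484), asking the kernel which local source address it would use for each one. A connect() on a UDP socket sends no packet; it only performs the route lookup, and getsockname() then reveals the chosen source address. A connect() with AF_UNSPEC dissolves that association so the same socket can be reused for the next candidate. A minimal sketch in Python, with local_source_address being a made-up helper name:

```python
import socket

def local_source_address(dest_ip, port=80):
    """Return the local address the kernel would use to reach dest_ip, or None."""
    family = socket.AF_INET6 if ":" in dest_ip else socket.AF_INET
    with socket.socket(family, socket.SOCK_DGRAM) as probe:
        try:
            # UDP connect: a routing lookup only, no packet leaves the machine
            probe.connect((dest_ip, port))
        except OSError:
            return None  # no route to this destination
        return probe.getsockname()[0]
```

Calling it on each address of a round-robin name gives the same kind of information as the getsockname() results in the trace (10.0.200.1 for the IPv4 candidates there).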
Then it goes on with:
[pid 25433] socket(PF_INET, SOCK_STREAM, IPPROTO_TCP) = 3
[pid 25433] fcntl(3, F_GETFL) = 0x2 (flags O_RDWR)
[pid 25433] fcntl(3, F_SETFL, O_RDWR|O_NONBLOCK) = 0
[pid 25433] connect(3, {sa_family=AF_INET, sin_port=htons(80), sin_addr=inet_addr("128.30.2.36")}, 16) = -1 EINPROGRESS (Operation now in progress)
[...]
[pid 25433] write(3, "GET /debian/dists/sid/InRelease HTTP/1.1\r\nHost: ftp.us.debian.org\r\nConnection: keep-alive\r\nCache-Control: max-age=0\r\nIf-Modified-Since: Thu, 11 Aug 2011 20:22:47 GMT\r\nUser-Agent: Debian APT-HTTP/1.3 (0.8.15.5)\r\n\r\n", 213) = 213
[...]
[pid 25433] read(3, "HTTP/1.1 304 Not Modified\r\nDate: Thu, 11 Aug 2011 22:32:27 GMT\r\nServer: Apache/2.2.9 (Debian)\r\nConnection: Keep-Alive\r\nKeep-Alive: timeout=15, max=100\r\nETag: \"1d0a203-239d0-4aa408f6173c0\"\r\n\r\n", 65536) = 191
[...]
[pid 25433] write(3, "GET /debian/dists/sid/main/binary-amd64/Packages.diff/Index HTTP/1.1\r\nHost: ftp.us.debian.org\r\nConnection: keep-alive\r\nCache-Control: max-age=0\r\nIf-Modified-Since: Thu, 11 Aug 2011 20:16:48 GMT\r\nUser-Agent: Debian APT-HTTP/1.3 (0.8.15.5)\r\n\r\n", 241) = 241
[...]
[pid 25433] read(3, "HTTP/1.1 304 Not Modified\r\nDate: Thu, 11 Aug 2011 22:32:28 GMT\r\nServer: Apache/2.2.9 (Debian)\r\nConnection: Keep-Alive\r\nKeep-Alive: timeout=15, max=99\r\nETag: \"1d0a308-7f6-4aa4079fb8c00\"\r\n\r\n", 65345) = 188
So it asked for the InRelease and Packages files over the same connection.
And then, for some reason that is unclear to me, it closes the connection and opens a new one to get the i18n files:
[pid 25433] close(3) = 0
[pid 25433] read(0, 0x7fff66c68790, 64000) = -1 EAGAIN (Resource temporarily unavailable)
[pid 25433] close(4294967295) = -1 EBADF (Bad file descriptor)
[pid 25433] write(1, "102 Status\nURI: http://ftp.us.debian.org/debian/dists/sid/main/i18n/Index\nMessage: Connecting to ftp.us.debian.org (199.6.12.70)\n\n", 130) = 130
[pid 25433] socket(PF_INET, SOCK_STREAM, IPPROTO_TCP) = 3
[pid 25433] fcntl(3, F_GETFL) = 0x2 (flags O_RDWR)
[pid 25433] fcntl(3, F_SETFL, O_RDWR|O_NONBLOCK) = 0
[pid 25433] connect(3, {sa_family=AF_INET, sin_port=htons(80), sin_addr=inet_addr("199.6.12.70")}, 16) = -1 EINPROGRESS (Operation now in progress)
[...]
[pid 25433] write(3, "GET /debian/dists/sid/main/i18n/Index HTTP/1.1\r\nHost: ftp.us.debian.org\r\nConnection: keep-alive\r\nCache-Control: max-age=0\r\nIf-Modified-Since: Thu, 11 Aug 2011 19:55:34 GMT\r\nUser-Agent: Debian APT-HTTP/1.3 (0.8.15.5)\r\n\r\n", 219 <unfinished ...>
[...]
[pid 25433] read(3, "HTTP/1.1 304 Not Modified\r\nServer: nginx/0.8.54\r\nDate: Thu, 11 Aug 2011 22:32:46 GMT\r\nLast-Modified: Thu, 11 Aug 2011 19:55:34 GMT\r\nConnection: keep-alive\r\n\r\n", 65536) = 158
[...]
[pid 25433] exit_group(100) = ?
(It stops the program without closing the socket.)
This i18n/Index file is also covered by the InRelease, so this clearly is a problem.
Kurt
Now, don't be absurd.
On Fri, Aug 12, 2011 at 03:39, Henrique de Moraes Holschuh <hmh@debian.org> wrote:

Yeah, it has been getting hilarious for a while ... Now that ftpsync is fixed on the US mirrors, all checksum problems should be solved.
to be in sync *across* mirrors, and we cannot trust the network backends to always connect to the same mirror. The multiple DNS lookups bug just breaks a workaround for that design bug that works well in a particular case (fortunately, a common one): persistent connections. What I consider absurd is jidanni's "probably one line bug" comment.
H> What I consider absurd is jidanni's "probably one line bug" comment.
Naw... it's probably just a case of
for(thing,checksum_of_thing){
do_dns_query(); #move this line before the loop
get_it();
}
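Spelled out as runnable Python, the "move the lookup before the loop" idea would look something like the sketch below. Whether the real fix in apt is anywhere near this small is exactly what is disputed in this thread, so treat it purely as an illustration. resolve_once, fetch_all and the fetch callable are hypothetical names; only socket.getaddrinfo is a real library call.

```python
import socket

def resolve_once(hostname, port=80, resolver=socket.getaddrinfo):
    """Do a single lookup and return the first address it yields."""
    infos = resolver(hostname, port, type=socket.SOCK_STREAM)
    return infos[0][4][0]  # sockaddr of the first result, IP part

def fetch_all(hostname, paths, fetch, resolver=socket.getaddrinfo):
    """Fetch every path from one address chosen before the loop starts.

    fetch(address, hostname, path) is a hypothetical download helper.
    """
    address = resolve_once(hostname, resolver=resolver)  # moved out of the loop
    return [fetch(address, hostname, path) for path in paths]
```

The hostname is still passed to fetch so that a real helper could send it in the Host: header while connecting to the pinned address.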
SP> It's no longer the case, all ftp.us have now 80387.
SP> jidanni, do you still observe issues ?

Yes, as a matter of fact I do. I even recorded the exact time window for you. In UTC as a special bonus.

starting Sun Aug 21 21:30:51 UTC 2011
W: Failed to fetch http://ftp.us.debian.org/debian/dists/experimental/main/binary-i386/PackagesIndex MD5Sum mismatch
W: Failed to fetch http://ftp.us.debian.org/debian/dists/unstable/main/binary-i386/PackagesIndex MD5Sum mismatch
E: Some index files failed to download. They have been ignored, or old ones used instead.
ending Sun Aug 21 21:36:55 UTC 2011

I have a recommendation: that you fellows fix this bug. As I have noted, it is certainly a one-liner. I mean aren't we running out of other things to blame for the problem? Thanks.
[...] As we already pointed out, it is not a one-liner. If you're so sure it's a one-liner, I suggest you submit a patch. Even if we fix the problem with connecting to multiple servers, there are various other reasons why it can fail, and they have all been explained already. I'm not even sure that if you fix the multiple server connections that would get better or worse results. But I would still suggest that we do try and connect to only 1 server. Kurt
Actually the first five minutes were spent in my 'sleep 5m' so it really is
<< starting Sun Aug 21 21:35:51 UTC 2011

KR> I'm not even sure that if you fix the multiple server connections
KR> that would get better or worse results. But I would still suggest
KR> that we do try and connect to only 1 server.

I've now also added a tcpflow(1) wrapper enabling me to send you all byte-by-byte evidence the next time it happens... but why allow me that wicked pleasure?
We know what the problem is, that's not needed. Kurt
K> We know what the problem is, that's not needed.

Are you sure?
Actually all that is going to happen is one day I will accidentally send the tcpflow logs containing unrelated personal traffic too as the filtering is too complex, so I would appreciate it if someone looked into this bug.
Three different mirrors in a single _botched_ apt-get update.
I recall that was taken care of. Why doesn't someone take care of that.
tags 636292 will-get-fixed-by-donkult-then-hell-freezes-over
kthxbye

This is open source software: YOU are part of the awesome team! So feel free to blame yourself that you haven't taken care of it. In fact, as we are all volunteers you can only blame yourself… your one line patch to this bugreport. We need NOTHING else from you. I repeat: NOTHING ELSE! No goddamn tcpflow logs nor any other data, just provide your simple patch and everybody will be happy. Thanks.

I can only speak for myself, but I haven't even tried to look at this issue because of this sentence (and all the howling before and after that). And I am pretty sure it will need a loooooooong time until I feel motivated to do so thanks to your behavior in the buglog, so if I were you I would submit a patch or keep silent until I can provide something useful to fix the bug I respond to. Bonus points if you can do both.

If you want to blame anyone in the meantime, blame yourself for considerably lowering the chances of getting this or any related bug fixed by working hard on demotivating at least one of the few people who regularly contribute to APT… That's a great achievement, given that even the worst kids in my young groups can't make that happen, so: Congratulations!

David Kalnischkies

P.S.: Don't bother to answer, the buglog includes enough messages already and I will not read it anyway. Everything we need is your patch now, so hurry up.
It might be that the trade-off between "time to spend" and "what I like to do first" does not suit your wishes. As always, "Show the code" still applies.
A bug involving InRelease files which has similar symptoms was reported on http://bugs.debian.org/cgi-bin/bugreport.cgi?bug=641769
retitle 636292 dak/apt: deficiencies at handling out-of-sync metadata
summary 636292 87
thanks

If anyone disagrees with the above triage, please change the summary and/or title. Thank you.

You're expected to read the entire thing when referred to a bug report in a thread you're replying to.

Anyway, triaged. I didn't want to do it because it is not my bug, it is not a package I work on, and I have no idea whether the apt developers agree with my analysis of #636292 or not. Please feel free to improve the title or choose a new summary.

That misses the point, IMO. To me, it looks like what's "broken" is that the repository format _and_ the front-ends have deficiencies at handling metadata which is unsynchronized either in-mirror or across mirrors.

And these deficiencies are a lot more important nowadays than they once were, as we have now many dinstall runs per day, lots of users tracking testing and unstable, a larger set of metadata files, a larger and more diverse set of mirrors... I.e: a lot more chances to hit unsynchronized metadata windows.
On 2011-09-17 11:34, Henrique de Moraes Holschuh wrote:

I was unaware of that.

[...]

I don't think increasing dinstall frequency worsens these issues significantly if dinstalls get shorter (unless previous dinstalls ran during the night). I also think archive size growth should have been compensated by performance increases. I think the time spent synchronizing a mirror must not have increased a lot. What did change here (dramatically) is the proportion of that time where APT index updates fail. Round-robin mirrors might also have worsened.

Anyway, the repository format is not a problem per se, it's the combination of what's on a mirror and how APT fetches it that's a problem. If you assume the communication protocol is HTTP-like, then indeed there should be mechanisms to cope with race conditions - i.e. file versioning and/or having APT retry or report desynchronizations.