Dear Maintainer,
I mounted an S3QL file system and ran rsync to transfer data to the file
system. After a few minutes I noticed the rsync process was no longer
scanning the file system or transfering data. Ctrl-C would not terminate
the rsync (presumably locked in some I/O transaction). mount.s3ql appears
hung and I could not unmount the file system.
The mount log file (see below) showed an exception had been thrown and
my .s3ql directory has about 600 raw_streamXXX files in it.
I killed everything, ran fsck on the file system, remounted and tried again
with the same result.
Contents of mount.log:
2014-12-03 23:17:03.472 23261:MainThread (name)s.determine_threads: Using 8 upload threads.
2014-12-03 23:17:03.473 23261:MainThread (name)s.main: Autodetected 4040 file descriptors available for cache entries
2014-12-03 23:17:12.025 23261:MainThread (name)s.get_metadata: Using cached metadata.
2014-12-03 23:17:12.029 23261:MainThread (name)s.main: Mounting filesystem...
2014-12-03 23:17:12.043 23269:MainThread (name)s.detach_process_context: Daemonizing, new PID is 23270
2014-12-03 23:20:46.372 23270:Thread-18 (name)s._parse_xml_response: Server did not provide Content-Type, assuming XML
2014-12-03 23:21:21.816 23270:Thread-18 (name)s.excepthook: Uncaught top-level exception:
Traceback (most recent call last):
File "/usr/lib/s3ql/s3ql/mount.py", line 66, in run_with_except_hook
run_old(*args, **kw)
File "/usr/lib/python3.4/threading.py", line 868, in run
self._target(*self._args, **self._kwargs)
File "/usr/lib/s3ql/s3ql/block_cache.py", line 684, in _removal_loop
backend.delete_multi(['s3ql_data_%d' % i for i in ids])
File "/usr/lib/s3ql/s3ql/backends/comprenc.py", line 273, in delete_multi
return self.backend.delete_multi(keys, force=force)
File "/usr/lib/s3ql/s3ql/backends/s3.py", line 81, in delete_multi
self._delete_multi(tmp, force=force)
File "/usr/lib/s3ql/s3ql/backends/common.py", line 46, in wrapped
return method(*a, **kw)
File "/usr/lib/s3ql/s3ql/backends/s3.py", line 116, in _delete_multi
resp = self._do_request('POST', '/', subres='delete', body=body, headers=headers)
File "/usr/lib/s3ql/s3ql/backends/s3c.py", line 409, in _do_request
query_string=query_string, body=body)
File "/usr/lib/s3ql/s3ql/backends/s3c.py", line 638, in _send_request
self.conn.send_request(method, path, body=body, headers=headers)
File "/usr/lib/python3/dist-packages/dugong/__init__.py", line 484, in send_request
self.timeout)
File "/usr/lib/python3/dist-packages/dugong/__init__.py", line 1373, in eval_coroutine
if not next(crt).poll(timeout=timeout):
File "/usr/lib/python3/dist-packages/dugong/__init__.py", line 511, in co_send_request
self.connect()
File "/usr/lib/python3/dist-packages/dugong/__init__.py", line 410, in connect
self._sock = socket.create_connection((self.hostname, self.port))
File "/usr/lib/python3.4/socket.py", line 491, in create_connection
for res in getaddrinfo(host, port, 0, SOCK_STREAM):
File "/usr/lib/python3.4/socket.py", line 530, in getaddrinfo
for res in _socket.getaddrinfo(host, port, family, type, proto, flags):
socket.gaierror: [Errno -2] Name or service not known
severity 771969 important thanks Thanks for the report! I'll set the severity to important for now, because the package still works nicely for many other users. Not sure why it's making so much trouble for you.. but I notice that (like the last bug you found) this seems related to the handling of unreliable network connections. In this case there seems to be a temporary problem with your DNS server which S3QL has trouble coping with. The raw_streamXXX files are from the patch that you applied to the debug the previous issue. I'd recommend to purge and reinstall both the s3ql and python3-dugong packages to make sure that you're in vanilla state again. If the file system still hangs (or the next time it hangs), could you follow the instructions at https://bitbucket.org/nikratio/s3ql/wiki/Providing%20Debugging%20Info to obtain a Python stacktrace, and attach it to this issue? Thanks! -Nikolaus
Hi, Is it possible that your DNS server behaves as described in https://bitbucket.org/nikratio/python-dugong/issue/16/? If so, could you test if upgrading to python3-dugong 3.4 from experimental fixes the problem? (In that case there still remains the issue of mount.s3ql hanging instead of terminating cleanly, so the stacktrace would be useful even then). Best, -Nikolaus
Because computers hate me :-) I actually have yet another problem with an sshfs mounted s3ql file system but was trying to get the issues with my S3 mount resolved first. From a high level perspective, the connection (generally) is pretty reliable, but I wouldn't notice momentary network issues. I forgot about the dugong patch, so I reinstalled both packages, using the Debian-unstable version for s3ql Attached are two log files with the mount and stack trace information. Shannon C. Dealy | DeaTech Research Inc. dealy@deatech.com | - Custom Software Development - USA Phone: +1 800-467-5820 | - Natural Building Instruction - numbers : +1 541-929-4089 | www.deatech.com
On 12/03/2014 11:44 PM, Shannon Dealy wrote: the file system still hangs (or the next time it hangs), could you This looks odd. Did you issue both "setfattr -n fuse_stacktrace" *and* "kill -s USR1" commands? The attached files indicate both, and moreover, the number of active threads during the "kill -s USR1" command was much lower than during the "setfattr" command. How much time did pass between these commands (assuming that you issued both)? At least at the time that you issued the "kill" command, mount.s3ql was busy doing the SSL handshake with the remote server, so while it looked like it was hanging, it was probably waiting for the remote server. Best, -Nikolaus
Yes, I didn't realize that the setfattr command was being used to trigger a one time event, I thought it was more like turning on a debug mode with extra info. If I am not mixing things up, I believe that initial setfattr was run right after mounting the file system, and the "kill -s USR1" was run when the file system was hung during the unmount, just before I did a kill -9 on mount.s3ql I don't recall. for this case. I have limited the cache to 1GB so I don't have to wait too long if I need to kill a backup, pack up the computer and leave. Because of this, I know from past experience the maximum amount of time it takes to flush the cache and unmount using this network. I figure it is dead if the network has remained up, but network and cpu load for this process are minimal to zero and it has taken more than twice as long as it should for the unmount. In this case it should be done in 15 to 20 minutes max, so figure it had been at least twice that. Sorry I don't have a better answer. If you have a suggestion for a better way to tell if the filesystem is hung, it would be helpful. On one occasion I let a hung system sit for hours to see if it would recover (it didn't). Regards, Shannon C. Dealy | DeaTech Research Inc. dealy@deatech.com | - Custom Software Development - USA Phone: +1 800-467-5820 | - Natural Building Instruction - numbers : +1 541-929-4089 | www.deatech.com
lot of data to my S3 file system. Stopped it a couple of times, unmounted
once (successfully), remounted and continued transferring data. After
around 13-15 GB total had been uploaded, the file system crashed again (no
hang):
rsync: writefd_unbuffered failed to write 4 bytes to socket [sender]:
Broken pipe (32)
rsync: write failed on <file name>
Transport endpoint is not connected (107)
rsync error: error in file IO (code 11) at receiver.c(322)
[receiver=3.0.9]
rsync: connection unexpectedly closed (51874 bytes received so far)
[sender]
rsync error: error in rsync protocol data stream (code 12) at io.c(605)
[sender=3.0.9]
See attached mount log file for more details.
Shannon C. Dealy | DeaTech Research Inc.
dealy@deatech.com | - Custom Software Development -
USA Phone: +1 800-467-5820 | - Natural Building Instruction -
numbers : +1 541-929-4089 | www.deatech.com
Shannon Dealy <dealy@deatech.com> writes: [...] [...] [...] Unfortunately Linux does not provide an easy way to distinguish between an unavailable DNS server, and an un-resolvable host name. To distinguish these cases, S3QL/Dugong attempts to resolve a number of "test" hostnames. If these resolve, but the S3 hostname does not, S3QL/Dugong concludes that this hostname is not resolvable and terminates. Otherwise it assumes that the DNS server is currently not reachable and retries. Attempting to resolve hostnames on your system frequently fails (sometimes 3 times in a row), and sometimes it's apparently sufficiently flaky that 1. server-external-2014-11-21-deatech-com-s3ql.s3.amazonaws.com cannot be resolved 2. Any of google.com, iana.org, or root-servers.net can be resolved 3. server-external-2014-11-21-deatech-com-s3ql.s3.amazonaws.com cannot be resolved in this order, and without any waiting times. At this point, S3QL thus assumes that your bucket ceased to exist and terminates. To avoid this, you'll have to fix the DNS resolution issues on your system. Maybe install a caching proxy nameserver like dnsmasq? Best, -Nikolaus
[snip]
problems I am seeing under the old version of s3ql, and this is not the
only network I connect to where I am seeing these problems. This could
simply mean one or more of:
- the network(s) have become more unstable (certainly possible)
- that S3QL's test doesn't work quite the same way as before
- there is a bug somewhere (S3QL or python libraries - I had to upgrade
python to install the new S3QL) that makes it appear that DNS is still
broken after it recovers
- or something else is going on.
I see in the log I sent you that a few failures were reported 10 minutes
before S3QL gave up, but it is not clear to me if S3QL was able to resolve
DNS names between these. Am I correct in assuming that there a timer as
well as the three time in a row failure limit you mentioned? If not,
there should be. Network failures often take a few minutes to
resolve/recover so I would imagine the test should look something like:
while failing and < 10 minutes (or possibly some configurable value)
wait some short interval
try again
It is certainly not unusual to see short network glitches (DNS,
connection loss, routing problems) lasting a few seconds to (rarely) a
minute here, the five to ten minute DNS or general outage that I would
view as a minimum before S3QL gives up is very uncommon (I pretty much
live on the network when I am awake, so I am pretty sure I would notice
problems of this level). Of course, I may be wrong, perhaps I should set
up some kind of network monitor to see what kind of failure intervals
there are.
FWIW.
Shannon C. Dealy | DeaTech Research Inc.
dealy@deatech.com | - Custom Software Development -
USA Phone: +1 800-467-5820 | - Natural Building Instruction -
numbers : +1 541-929-4089 | www.deatech.com
Shannon Dealy <dealy@deatech.com> writes: The relevant code in S3QL 2.2 was last touched in version 2.2, so I don't think that's it. However, relevant code in Dugong was last modified in version 3.2. Prior to 3.2, *any* DNS problem was considered temporary. In 3.2 and later, DNS problems are only considered temporary of *no* hostname can be resolved. The problem with the prior behavior was that it would permanently retry even in situations like this: mkfs.s3ql s3://mybucket.aws.s3.cmo Having this command block for any significant amount of time in the hope that the wrong hostname becomes resolvable was rather surprising, and people complained that their scripts would just hang instead of properly reporting an error. It seems for you the situation is the other way around... There's no limit on the number of retries. After the 3rd retry, S3QL starts to emit log messages. The time interval between the retries is increasing over time. However, all this happens only if the DNS server is unreachable. If the DNS server is apparently reachable, but unable to resolve the hostname, no retrying is done. Best, -Nikolaus
It occurs to me that the above can be problematic as a test under some senarios. When segmentation of the internet occurs, routing and DNS (from a given location) is often lost to some but not all major players. On a couple of occasions I have been unable to resolve google.com but could resolve debian.org, amazon.com and many other sites. I understand your approach and reasoning, and at the moment can't think of a better solution to what you are trying to do here, however, even if my network link and DNS server are running flawlessly, events can occur on the internet where where my connection will fail the above test. [snip] Given the senario you were trying to fix with the change, perhaps a better approach would be to fail if initial resolution fails, but if the initial resolution succeeds, then the end point can reasonably be assumed to exist and future failures should keep retrying, at least for a substantial (possibly configurable) period of time. This provides the immediate feedback for manual or scripting related interactions, but once the file system is mounted focuses on maintaining/recovering the connection. Personally, I would love it if I could simply keep the file system mounted at all times, even when there is no network link, so that when there is a connection I can simply start using it, and when the connection goes away (even for a day or two), everything blocks and simply picks up where it left off when the connection returns. I often leave my sshfs mount connected this way as it recovers when the link returns. Of course my use-case may be different from others, I use it for maintaining laptop backups and access to data that won't fit on my laptop drive. This means that network connections come and go multiple times each day as I move between different locations, and stay down whenever I don't use the laptop for a few days. FWIW. Shannon C. Dealy | DeaTech Research Inc. dealy@deatech.com | - Custom Software Development - USA Phone: +1 800-467-5820 | - Natural Building Instruction - numbers : +1 541-929-4089 | www.deatech.com
Shannon Dealy <dealy@deatech.com> writes: Yes, that would be the best solution. It's just ugly to code, either way you to pass a "do_not_fail" parameter all the way from the main function to the low level network routines, or you have to do the first lookup on a higher level (which then needs to know details about all the storage backends). Well, that should actually work already. When there is no network connection, s3ql retries for a long amount. The problem only arises if there is partial connectivity. Best, -Nikolaus
Shannon Dealy <dealy@deatech.com> writes: The only truly reliable way is to generate a stacktrace (as you've done) and check if any threads are blocked in a read/recv/write/send syscall. If not, the file system hangs. If they are, then it's waiting for the remote server. Unfortunately S3QL's TCP timeout setting only applies after the SSL handshake, so the handshake can last as long as the Linux TCP stack allows... Best, -Nikolaus
[snip] I would suggest that it should pass a timeout value rather than a "do_not_fail". This would give the application level code the greatest flexibility allowing for both immediate and short term failure settings as well as an effective "never fail" by just passing a ridiculously large value. recovered (not sure why). Don't know on the current version. Had another filesystem crash last night and while I can't say the exact series of events from the perspective of the file system, it was a complete network failure that crashed it (not just DNS). This would imply that perhaps the test is too sensitive right at the boundary of a network failure (perhaps some packets get through, but most don't) and needs to retry over a longer period of time before deciding if the failure is network or DNS. Upon further reflection: I looked over your dugong code and have given some thought to what little I know of the local network topology, and my guess is that your test for live DNS will always decide that DNS is up at my location whenever the network connection fails (though it is just a guess). The ISP appears to be in another country, and their local network in this building feeds about 500 rooms. It is likely they are using a local DNS caching server (for the building, city or country, doesn't really matter which) which responds from its cache with anything it knows about, and forwards all other requests up to the ISP's main servers. If that is the case, any time the network gets cut off between the local caching DNS server and the primary DNS servers, the test will fail because google.com will always resolve (since everyone uses it), but www.iana.org and C.root-servers.org will not since most internet users never have reason to do a direct lookup on the later two domains, and any recursive lookups of these domains as a result of a local query will always happen and be stored only at the primary server. Based on this, I would suggest that a more robust test would be to declare a DNS failure if any of the three lookups fail (after the host lookup previously failed). After all, a failure on any of these lookups would at least suggest a serious problem with the internet which is a reasonably likely cause of the initial host lookup failure. FWIW. Shannon C. Dealy | DeaTech Research Inc. dealy@deatech.com | - Custom Software Development - USA Phone: +1 800-467-5820 | - Natural Building Instruction - numbers : +1 541-929-4089 | www.deatech.com
As a workaround, could you use the attached program instead of mount.s3ql? It should work just like mount.s3ql, but it will retry on any kind of DNS error. I'm still planning to fix this properly (probably by not retrying on the first resolution attempt), but that is going to take a while. Best, -Nikolaus
It should be noted for anyone else who tries to use this wrapper around the mount program, it requires python3-dugong version 3.4 (currently in experimental), it throws an exception with the 3.3 version in testing: AttributeError: 'module' object has no attribute 'HostnameNotResolvable' I'm giving it try now which should be a good test of the code and local network as I'm moving about 400GB into an S3 file system. Regards, Shannon C. Dealy | DeaTech Research Inc. dealy@deatech.com | - Custom Software Development - USA Phone: +1 800-467-5820 | - Natural Building Instruction - numbers : +1 541-929-4089 | www.deatech.com
This is tracked upstream at https://bitbucket.org/nikratio/s3ql/issue/110/improve-handling-of-name-resolution