#771969 mount.s3ql crashes if hostname resolution fails

Package:
s3ql
Source:
s3ql
Description:
Full-featured file system for online data storage
Submitter:
Shannon Dealy
Date:
2015-02-01 05:03:08 UTC
Severity:
important
#771969#5
Date:
2014-12-03 23:03:32 UTC
From:
To:
Dear Maintainer,

I mounted an S3QL file system and ran rsync to transfer data to the file
system.  After a few minutes I noticed the rsync process was no longer
scanning the file system or transfering data.  Ctrl-C would not terminate
the rsync (presumably locked in some I/O transaction).  mount.s3ql appears
hung and I could not unmount the file system.

The mount log file (see below) showed an exception had been thrown and
my .s3ql directory has about 600 raw_streamXXX files in it.

I killed everything, ran fsck on the file system, remounted and tried again
with the same result.

Contents of mount.log:

2014-12-03 23:17:03.472 23261:MainThread (name)s.determine_threads: Using 8 upload threads.
2014-12-03 23:17:03.473 23261:MainThread (name)s.main: Autodetected 4040 file descriptors available for cache entries
2014-12-03 23:17:12.025 23261:MainThread (name)s.get_metadata: Using cached metadata.
2014-12-03 23:17:12.029 23261:MainThread (name)s.main: Mounting filesystem...
2014-12-03 23:17:12.043 23269:MainThread (name)s.detach_process_context: Daemonizing, new PID is 23270
2014-12-03 23:20:46.372 23270:Thread-18 (name)s._parse_xml_response: Server did not provide Content-Type, assuming XML
2014-12-03 23:21:21.816 23270:Thread-18 (name)s.excepthook: Uncaught top-level exception:
Traceback (most recent call last):
  File "/usr/lib/s3ql/s3ql/mount.py", line 66, in run_with_except_hook
    run_old(*args, **kw)
  File "/usr/lib/python3.4/threading.py", line 868, in run
    self._target(*self._args, **self._kwargs)
  File "/usr/lib/s3ql/s3ql/block_cache.py", line 684, in _removal_loop
    backend.delete_multi(['s3ql_data_%d' % i for i in ids])
  File "/usr/lib/s3ql/s3ql/backends/comprenc.py", line 273, in delete_multi
    return self.backend.delete_multi(keys, force=force)
  File "/usr/lib/s3ql/s3ql/backends/s3.py", line 81, in delete_multi
    self._delete_multi(tmp, force=force)
  File "/usr/lib/s3ql/s3ql/backends/common.py", line 46, in wrapped
    return method(*a, **kw)
  File "/usr/lib/s3ql/s3ql/backends/s3.py", line 116, in _delete_multi
    resp = self._do_request('POST', '/', subres='delete', body=body, headers=headers)
  File "/usr/lib/s3ql/s3ql/backends/s3c.py", line 409, in _do_request
    query_string=query_string, body=body)
  File "/usr/lib/s3ql/s3ql/backends/s3c.py", line 638, in _send_request
    self.conn.send_request(method, path, body=body, headers=headers)
  File "/usr/lib/python3/dist-packages/dugong/__init__.py", line 484, in send_request
    self.timeout)
  File "/usr/lib/python3/dist-packages/dugong/__init__.py", line 1373, in eval_coroutine
    if not next(crt).poll(timeout=timeout):
  File "/usr/lib/python3/dist-packages/dugong/__init__.py", line 511, in co_send_request
    self.connect()
  File "/usr/lib/python3/dist-packages/dugong/__init__.py", line 410, in connect
    self._sock = socket.create_connection((self.hostname, self.port))
  File "/usr/lib/python3.4/socket.py", line 491, in create_connection
    for res in getaddrinfo(host, port, 0, SOCK_STREAM):
  File "/usr/lib/python3.4/socket.py", line 530, in getaddrinfo
    for res in _socket.getaddrinfo(host, port, family, type, proto, flags):
socket.gaierror: [Errno -2] Name or service not known

#771969#10
Date:
2014-12-04 01:06:57 UTC
From:
To:
severity 771969 important
thanks

Thanks for the report! I'll set the severity to important for now,
because the package still works nicely for many other users. Not sure
why it's making so much trouble for you.. but I notice that (like the
last bug you found) this seems related to the handling of unreliable
network connections. In this case there seems to be a temporary problem
with your DNS server which S3QL has trouble coping with.

The raw_streamXXX files are from the patch that you applied to the debug
the previous issue. I'd recommend to purge and reinstall both the s3ql
and python3-dugong packages to make sure that you're in vanilla state again.

If the file system still hangs (or the next time it hangs), could you
follow the instructions at
https://bitbucket.org/nikratio/s3ql/wiki/Providing%20Debugging%20Info to
obtain a Python stacktrace, and attach it to this issue?

Thanks!
-Nikolaus

#771969#17
Date:
2014-12-04 01:54:30 UTC
From:
To:
Hi,

Is it possible that your DNS server behaves as described in
https://bitbucket.org/nikratio/python-dugong/issue/16/?

If so, could you test if upgrading to python3-dugong 3.4 from
experimental fixes the problem?

(In that case there still remains the issue of mount.s3ql hanging
instead of terminating cleanly, so the stacktrace would be useful even
then).

Best,
-Nikolaus

#771969#22
Date:
2014-12-04 07:44:52 UTC
From:
To:
Because computers hate me :-)  I actually have yet another problem with
an sshfs mounted s3ql file system but was trying to get the issues with
my S3 mount resolved first.

From a high level perspective, the connection (generally) is pretty
reliable, but I wouldn't notice momentary network issues.

I forgot about the dugong patch, so I reinstalled both packages, using
the Debian-unstable version for s3ql
Attached are two log files with the mount and stack trace information.


Shannon C. Dealy           |         DeaTech Research Inc.
dealy@deatech.com          |    - Custom Software Development -
USA Phone: +1 800-467-5820 |    - Natural Building Instruction -
numbers  : +1 541-929-4089 |            www.deatech.com

#771969#27
Date:
2014-12-04 23:45:33 UTC
From:
To:
On 12/03/2014 11:44 PM, Shannon Dealy wrote:
the file system still hangs (or the next time it hangs), could you

This looks odd. Did you issue both "setfattr -n fuse_stacktrace" *and*
"kill -s USR1" commands?

The attached files indicate both, and moreover, the number of active
threads during the "kill -s USR1" command was much lower than during the
"setfattr" command. How much time did pass between these commands
(assuming that you issued both)?


At least at the time that you issued the "kill" command, mount.s3ql was
busy doing the SSL handshake with the remote server, so while it looked
like it was hanging, it was probably waiting for the remote server.


Best,
-Nikolaus

#771969#32
Date:
2014-12-05 00:25:13 UTC
From:
To:
Yes, I didn't realize that the setfattr command was being used to trigger
a one time event, I thought it was more like turning on a debug mode with
extra info.

If I am not mixing things up, I believe that initial setfattr was run
right after mounting the file system, and the "kill -s USR1" was run when
the file system was hung during the unmount, just before I did a kill -9
on mount.s3ql

I don't recall.
for this case.  I have limited the cache to 1GB so I don't have to wait
too long if I need to kill a backup, pack up the computer and leave.
Because of this, I know from past experience the maximum amount of time
it takes to flush the cache and unmount using this network.  I figure it
is dead if the network has remained up, but network and cpu load for this
process are minimal to zero and it has taken more than twice as long as it
should for the unmount.  In this case it should be done in 15 to 20
minutes max, so figure it had been at least twice that.  Sorry I don't
have a better answer.  If you have a suggestion for a better way to tell
if the filesystem is hung, it would be helpful.  On one occasion I let a
hung system sit for hours to see if it would recover (it didn't).

Regards,

Shannon C. Dealy           |         DeaTech Research Inc.
dealy@deatech.com          |    - Custom Software Development -
USA Phone: +1 800-467-5820 |    - Natural Building Instruction -
numbers  : +1 541-929-4089 |            www.deatech.com

#771969#37
Date:
2014-12-05 00:38:41 UTC
From:
To:
lot of data to my S3 file system.  Stopped it a couple of times, unmounted
once (successfully), remounted and continued transferring data.  After
around 13-15 GB total had been uploaded, the file system crashed again (no
hang):

    rsync: writefd_unbuffered failed to write 4 bytes to socket [sender]:
    Broken pipe (32)
    rsync: write failed on <file name>
    Transport endpoint is not connected (107)
    rsync error: error in file IO (code 11) at receiver.c(322)
    [receiver=3.0.9]
    rsync: connection unexpectedly closed (51874 bytes received so far)
    [sender]
    rsync error: error in rsync protocol data stream (code 12) at io.c(605)
    [sender=3.0.9]

See attached mount log file for more details.

Shannon C. Dealy           |         DeaTech Research Inc.
dealy@deatech.com          |    - Custom Software Development -
USA Phone: +1 800-467-5820 |    - Natural Building Instruction -
numbers  : +1 541-929-4089 |            www.deatech.com

#771969#42
Date:
2014-12-05 01:00:02 UTC
From:
To:
Shannon Dealy <dealy@deatech.com> writes:
[...]
[...]
[...]


Unfortunately Linux does not provide an easy way to distinguish between
an unavailable DNS server, and an un-resolvable host name. To
distinguish these cases, S3QL/Dugong attempts to resolve a number of
"test" hostnames. If these resolve, but the S3 hostname does not,
S3QL/Dugong concludes that this hostname is not resolvable and
terminates. Otherwise it assumes that the DNS server is currently not
reachable and retries.

Attempting to resolve hostnames on your system frequently fails
(sometimes 3 times in a row), and sometimes it's apparently sufficiently
flaky that

1. server-external-2014-11-21-deatech-com-s3ql.s3.amazonaws.com cannot
be resolved

2. Any of google.com, iana.org, or root-servers.net can be resolved

3. server-external-2014-11-21-deatech-com-s3ql.s3.amazonaws.com cannot
be resolved

in this order, and without any waiting times.


At this point, S3QL thus assumes that your bucket ceased to exist and
terminates.

To avoid this, you'll have to fix the DNS resolution issues on your
system. Maybe install a caching proxy nameserver like dnsmasq?

Best,
-Nikolaus

#771969#47
Date:
2014-12-05 07:48:05 UTC
From:
To:
[snip]
problems I am seeing under the old version of s3ql, and this is not the
only network I connect to where I am seeing these problems.  This could
simply mean one or more of:

  - the network(s) have become more unstable (certainly possible)
  - that S3QL's test doesn't work quite the same way as before
  - there is a bug somewhere (S3QL or python libraries - I had to upgrade
    python to install the new S3QL) that makes it appear that DNS is still
    broken after it recovers
  - or something else is going on.

I see in the log I sent you that a few failures were reported 10 minutes
before S3QL gave up, but it is not clear to me if S3QL was able to resolve
DNS names between these.  Am I correct in assuming that there a timer as
well as the three time in a row failure limit you mentioned?  If not,
there should be.  Network failures often take a few minutes to
resolve/recover so I would imagine the test should look something like:

    while failing and < 10 minutes (or possibly some configurable value)
       wait some short interval
       try again

It is certainly not unusual to see short network glitches (DNS,
connection loss, routing problems) lasting a few seconds to (rarely) a
minute here, the five to ten minute DNS or general outage that I would
view as a minimum before S3QL gives up is very uncommon (I pretty much
live on the network when I am awake, so I am pretty sure I would notice
problems of this level).  Of course, I may be wrong, perhaps I should set
up some kind of network monitor to see what kind of failure intervals
there are.

FWIW.

Shannon C. Dealy           |         DeaTech Research Inc.
dealy@deatech.com          |    - Custom Software Development -
USA Phone: +1 800-467-5820 |    - Natural Building Instruction -
numbers  : +1 541-929-4089 |            www.deatech.com

#771969#52
Date:
2014-12-05 17:02:02 UTC
From:
To:
Shannon Dealy <dealy@deatech.com> writes:

The relevant code in S3QL 2.2 was last touched in version 2.2, so I
don't think that's it. However, relevant code in Dugong was last
modified in version 3.2. Prior to 3.2, *any* DNS problem was considered
temporary. In 3.2 and later, DNS problems are only considered temporary
of *no* hostname can be resolved.

The problem with the prior behavior was that it would permanently retry
even in situations like this:

mkfs.s3ql s3://mybucket.aws.s3.cmo

Having this command block for any significant amount of time in the hope
that the wrong hostname becomes resolvable was rather surprising, and
people complained that their scripts would just hang instead of properly
reporting an error.

It seems for you the situation is the other way around...

There's no limit on the number of retries. After the 3rd retry, S3QL
starts to emit log messages. The time interval between the retries is
increasing over time.

However, all this happens only if the DNS server is unreachable. If the
DNS server is apparently reachable, but unable to resolve the hostname,
no retrying is done.


Best,
-Nikolaus

#771969#57
Date:
2014-12-05 18:03:10 UTC
From:
To:

It occurs to me that the above can be problematic as a test under some
senarios.  When segmentation of the internet occurs, routing and DNS
(from a given location) is often lost to some but not all major players.
On a couple of occasions I have been unable to resolve google.com but
could resolve debian.org, amazon.com and many other sites.  I understand
your approach and reasoning, and at the moment can't think of a better
solution to what you are trying to do here, however, even if my network
link and DNS server are running flawlessly, events can occur on the
internet where where my connection will fail the above test.

[snip]

Given the senario you were trying to fix with the change, perhaps a better
approach would be to fail if initial resolution fails, but if the
initial resolution succeeds, then the end point can reasonably be assumed
to exist and future failures should keep retrying, at least for a
substantial (possibly configurable) period of time.  This provides the
immediate feedback for manual or scripting related interactions, but once
the file system is mounted focuses on maintaining/recovering the
connection.

Personally, I would love it if I could simply keep the file system mounted
at all times, even when there is no network link, so that when there is a
connection I can simply start using it, and when the connection goes away
(even for a day or two), everything blocks and simply picks up where it
left off when the connection returns.  I often leave my sshfs mount
connected this way as it recovers when the link returns.

Of course my use-case may be different from others, I use it for
maintaining laptop backups and access to data that won't fit on my
laptop drive.  This means that network connections come and go multiple
times each day as I move between different locations, and stay down
whenever I don't use the laptop for a few days.

FWIW.

Shannon C. Dealy           |         DeaTech Research Inc.
dealy@deatech.com          |    - Custom Software Development -
USA Phone: +1 800-467-5820 |    - Natural Building Instruction -
numbers  : +1 541-929-4089 |            www.deatech.com

#771969#62
Date:
2014-12-06 05:00:02 UTC
From:
To:
Shannon Dealy <dealy@deatech.com> writes:

Yes, that would be the best solution. It's just ugly to code, either way
you to pass a "do_not_fail" parameter all the way from the main function
to the low level network routines, or you have to do the first lookup on
a higher level (which then needs to know details about all the storage
backends).

Well, that should actually work already. When there is no network
connection, s3ql retries for a long amount. The problem only arises if
there is partial connectivity.

Best,
-Nikolaus

#771969#67
Date:
2014-12-06 05:15:38 UTC
From:
To:
Shannon Dealy <dealy@deatech.com> writes:

The only truly reliable way is to generate a stacktrace (as you've done)
and check if any threads are blocked in a read/recv/write/send
syscall. If not, the file system hangs. If they are, then it's waiting
for the remote server.

Unfortunately S3QL's TCP timeout setting only applies after the SSL
handshake, so the handshake can last as long as the Linux TCP stack
allows...

Best,
-Nikolaus

#771969#72
Date:
2014-12-06 08:20:36 UTC
From:
To:
[snip]

I would suggest that it should pass a timeout value rather than a
"do_not_fail".  This would give the application level code the greatest
flexibility allowing for both immediate and short term failure settings as
well as an effective "never fail" by just passing a ridiculously large
value.
recovered (not sure why).  Don't know on the current version.

Had another filesystem crash last night and while I can't say the exact
series of events from the perspective of the file system, it was a
complete network failure that crashed it (not just DNS).  This would imply
that perhaps the test is too sensitive right at the boundary of a network
failure (perhaps some packets get through, but most don't) and needs to
retry over a longer period of time before deciding if the failure is
network or DNS.

Upon further reflection:

I looked over your dugong code and have given some thought to what little
I know of the local network topology, and my guess is that your test for
live DNS will always decide that DNS is up at my location whenever the
network connection fails (though it is just a guess).  The ISP appears to
be in another country, and their local network in this building feeds
about 500 rooms.  It is likely they are using a local DNS caching server
(for the building, city or country, doesn't really matter which) which
responds from its cache with anything it knows about, and forwards
all other requests up to the ISP's main servers.  If that is the case, any
time the network gets cut off between the local caching DNS server and the
primary DNS servers, the test will fail because google.com will always
resolve (since everyone uses it), but www.iana.org and C.root-servers.org
will not since most internet users never have reason to do a direct lookup
on the later two domains, and any recursive lookups of these domains as a
result of a local query will always happen and be stored only at the
primary server.

Based on this, I would suggest that a more robust test would be to declare
a DNS failure if any of the three lookups fail (after the host lookup
previously failed).  After all, a failure on any of these lookups would at
least suggest a serious problem with the internet which is a reasonably
likely cause of the initial host lookup failure.

FWIW.

Shannon C. Dealy           |         DeaTech Research Inc.
dealy@deatech.com          |    - Custom Software Development -
USA Phone: +1 800-467-5820 |    - Natural Building Instruction -
numbers  : +1 541-929-4089 |            www.deatech.com

#771969#77
Date:
2014-12-10 02:53:19 UTC
From:
To:
As a workaround, could you use the attached program instead of mount.s3ql?

It should work just like mount.s3ql, but it will retry on any kind of
DNS error.


I'm still planning to fix this properly (probably by not retrying on the
first resolution attempt), but that is going to take a while.

Best,
-Nikolaus

#771969#82
Date:
2014-12-13 10:24:26 UTC
From:
To:
It should be noted for anyone else who tries to use this wrapper around
the mount program, it requires python3-dugong version 3.4 (currently in
experimental), it throws an exception with the 3.3 version in testing:

   AttributeError: 'module' object has no attribute 'HostnameNotResolvable'

I'm giving it try now which should be a good test of the code and local
network as I'm moving about 400GB into an S3 file system.

Regards,

Shannon C. Dealy           |         DeaTech Research Inc.
dealy@deatech.com          |    - Custom Software Development -
USA Phone: +1 800-467-5820 |    - Natural Building Instruction -
numbers  : +1 541-929-4089 |            www.deatech.com

#771969#91
Date:
2015-02-01 04:38:17 UTC
From:
To:
This is tracked upstream at https://bitbucket.org/nikratio/s3ql/issue/110/improve-handling-of-name-resolution