#983170 s3ql: High load causes "Transport endpoint is not connected"

Package:
s3ql
Source:
s3ql
Description:
Full-featured file system for online data storage
Submitter:
Graham Cobb
Date:
2021-09-17 17:03:02 UTC
Severity:
important
#983170#5
Date:
2021-02-20 13:24:16 UTC
From:
To:
Dear Maintainer,

After upgrading s3ql from 3.3.2+dfsg-1 I suffered from bug #982381 (trio) so
I tried manually installing trio 0.15 (as mentioned in that thread).
Although that allowed some files to be created, any long or complex operation
(such as a backup, or even an `rm -rf` of a large directory) causes a
'Software caused connection abort' error followed by
enormous numbers of 'Transport endpoint is not connected' errors.

In case it was some transient network problem, I used fsck.s3ql to fix the
filesystem and retried: same errors. I get the same errors if I run fsck again
and then just try an `rm -rf` on a large directory.

I then tried installing trio 0.18. Same problem.

Mounting is working. fsck is working. Simple file operations are working.
But heavy load causes 'Software caused connection abort'. Completely repeatable.

The following commands reproduce the problem for me:

 cd /mnt/mountpoint
 count=100
 mkdir testdir
 for f in $(seq 1 "$count") ; do
   mkdir "testdir/$f"
   dd if=/dev/urandom bs=1000 count=1 of="testdir/$f/test" status=none
 done
 rm -rf testdir
 umount /mnt/mountpoint

With the count at 100 the problem occurs when the unmount happens. If the count
is increased to 2000 the problem occurs during the run.
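
For anyone who wants to vary the count more easily, the create-then-delete part
of the workload above can be sketched in Python. This is only an illustration:
`run_workload` is a hypothetical helper name, the mount point is passed in by
the caller, and the unmount step is deliberately left out.

```python
import os
import shutil
from pathlib import Path


def run_workload(mountpoint, count=100):
    """Create `count` directories with one 1000-byte random file each, then delete them.

    Returns the number of files that were created before the deletion.
    """
    testdir = Path(mountpoint) / "testdir"
    testdir.mkdir()
    for i in range(1, count + 1):
        subdir = testdir / str(i)
        subdir.mkdir()
        # Equivalent of: dd if=/dev/urandom bs=1000 count=1 of=testdir/$f/test
        (subdir / "test").write_bytes(os.urandom(1000))
    created = sum(1 for _ in testdir.rglob("test"))
    # Equivalent of: rm -rf testdir
    shutil.rmtree(testdir)
    return created
```

Calling `run_workload("/mnt/mountpoint", count=2000)` would correspond to the
larger run described above; the file system still has to be unmounted
separately afterwards.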

This is using the S3 backend.

By the way, this workload has been working for many years with no problems,
and was working with 3.3.2+dfsg-1 before I decided to try testing 3.7.0+dfsg-2.

#983170#10
Date:
2021-02-20 15:42:46 UTC
From:
To:
severity 983170 grave
tags 983170 + upstream help
thanks

It seems to me that this version cannot simply be distributed as is.
Moreover, its wrong assumption about trio version compatibility makes it
unsuitable for bullseye in its current state.

#983170#15
Date:
2021-02-20 19:34:31 UTC
From:
To:
Could you please follow up with the part of your ~/.s3ql/mount.log that covers the error?

#983170#20
Date:
2021-02-20 23:45:07 UTC
From:
To:
The mount.log consists of minor variations on the following...

2021-02-20 13:21:46.208 238604:MainThread s3ql.mount.determine_threads: Using 10 upload threads.
2021-02-20 13:21:46.210 238604:MainThread s3ql.mount.main: Autodetected 1048514 file descriptors available for cache entries
2021-02-20 13:21:46.982 238604:MainThread s3ql.mount.get_metadata: Using cached metadata.
2021-02-20 13:21:47.001 238604:MainThread s3ql.mount.main_async: Setting cache size to 17754 MB
2021-02-20 13:21:47.004 238604:MainThread s3ql.block_cache.__init__: Loaded 0 entries from cache
2021-02-20 13:21:47.040 238604:MainThread s3ql.mount.main_async: Mounting s3://eu-west-1/xxx/s3ql/yyy/ at /mnt/a...
2021-02-20 13:21:47.050 238624:MainThread s3ql.daemonize.detach_process_context: Daemonizing, new PID is 238625
2021-02-20 13:22:56.691 238625:MainThread s3ql.mount.unmount: Unmounting file system...
2021-02-20 13:23:01.703 238625:MainThread s3ql.block_cache.destroy: Could not complete object removals, no removal threads left alive
2021-02-20 13:23:01.710 238625:MainThread root.excepthook: Uncaught top-level exception:
Traceback (most recent call last):
  File "/usr/bin/mount.s3ql", line 33, in <module>
    sys.exit(load_entry_point('s3ql==3.7.0', 'console_scripts', 'mount.s3ql')())
  File "/usr/lib/s3ql/s3ql/mount.py", line 131, in main
    trio.run(main_async, options, stdout_log_handler)
  File "/usr/local/lib/python3.9/dist-packages/trio/_core/_run.py", line 1932, in run
    raise runner.main_task_outcome.error
  File "/usr/lib/s3ql/s3ql/mount.py", line 274, in main_async
    await pyfuse3.main()
  File "/usr/lib/python3/dist-packages/_pyfuse3.py", line 30, in wrapper
    await fn(*args, **kwargs)
  File "src/pyfuse3.pyx", line 776, in main
  File "/usr/local/lib/python3.9/dist-packages/trio/_core/_run.py", line 815, in __aexit__
    raise combined_error_from_nursery
trio.MultiError: NoWorkerThreads('no removal threads'), NoWorkerThreads('no removal threads'), NoWorkerThreads('no removal threads'), NoWorkerThreads('no removal threads'), NoWorkerThreads('no removal threads'), NoWorkerThreads('no removal threads'), NoWorkerThreads('no removal threads'), NoWorkerThreads('no removal threads'), NoWorkerThreads('no removal threads'), NoWorkerThreads('no removal threads'), NoWorkerThreads('no removal threads'), NoWorkerThreads('no removal threads')

Details of embedded exception 1:

  Traceback (most recent call last):
    File "/usr/lib/s3ql/s3ql/block_cache.py", line 598, in _deref_block
      self.to_remove.put(obj_id, block=False)
    File "/usr/lib/python3.9/queue.py", line 137, in put
      raise Full
  queue.Full

  During handling of the above exception, another exception occurred:

  Traceback (most recent call last):
    File "/usr/lib/python3/dist-packages/_pyfuse3.py", line 30, in wrapper
      await fn(*args, **kwargs)
    File "src/internal.pxi", line 278, in _session_loop
    File "/usr/lib/s3ql/s3ql/fs.py", line 1172, in forget
      await self.cache.remove(id_, 0, inode.size // self.max_obj_size + 1)
    File "/usr/lib/s3ql/s3ql/block_cache.py", line 847, in remove
      await self._deref_block(block_id)
    File "/usr/lib/s3ql/s3ql/block_cache.py", line 600, in _deref_block
      await trio.to_thread.run_sync(self._queue_removal, obj_id)
    File "/usr/local/lib/python3.9/dist-packages/trio/_threads.py", line 207, in to_thread_run_sync
      return await trio.lowlevel.wait_task_rescheduled(abort)
    File "/usr/local/lib/python3.9/dist-packages/trio/_core/_traps.py", line 166, in wait_task_rescheduled
      return (await _async_yield(WaitTaskRescheduled(abort_func))).unwrap()
    File "/usr/lib/python3/dist-packages/outcome/_sync.py", line 111, in unwrap
      raise captured_error
    File "/usr/local/lib/python3.9/dist-packages/trio/_threads.py", line 157, in do_release_then_return_result
      return result.unwrap()
    File "/usr/lib/python3/dist-packages/outcome/_sync.py", line 111, in unwrap
      raise captured_error
    File "/usr/local/lib/python3.9/dist-packages/trio/_threads.py", line 170, in worker_fn
      ret = sync_fn(*args)
    File "/usr/lib/s3ql/s3ql/block_cache.py", line 553, in _queue_removal
      raise NoWorkerThreads('no removal threads')
  s3ql.block_cache.NoWorkerThreads: no removal threads

Details of embedded exception 2:

...

Embedded exception 2 is a copy of embedded exception 1, and there are another 10 identical embedded exceptions.
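
The shape of the failure in the traceback (a non-blocking `put` raising
`queue.Full`, then a fallback that raises `NoWorkerThreads`) can be illustrated
with a small standalone sketch. This is not the s3ql code: the `RemovalQueue`
class, its `workers` list, the `poll` interval and the body of
`_queue_removal` are assumptions made for illustration, and plain threads
stand in for `trio.to_thread.run_sync`.

```python
import queue
import threading


class NoWorkerThreads(Exception):
    """Raised when no thread is left to drain the removal queue."""


class RemovalQueue:
    """Hypothetical stand-in for the removal queue seen in the traceback."""

    def __init__(self, maxsize, workers):
        self.to_remove = queue.Queue(maxsize)
        self.workers = workers  # list of threading.Thread objects (assumed)

    def deref(self, obj_id):
        try:
            # Fast path: works as long as the queue still has room.
            self.to_remove.put(obj_id, block=False)
        except queue.Full:
            # Slow path: only reached once the queue has filled up.
            self._queue_removal(obj_id)

    def _queue_removal(self, obj_id, poll=0.1):
        while True:
            # With no live consumer the queue can never drain, so give up.
            if not any(t.is_alive() for t in self.workers):
                raise NoWorkerThreads('no removal threads')
            try:
                self.to_remove.put(obj_id, timeout=poll)
                return
            except queue.Full:
                continue


if __name__ == "__main__":
    rq = RemovalQueue(maxsize=2, workers=[])  # no removal threads alive
    rq.deref(1)
    rq.deref(2)
    try:
        rq.deref(3)
    except NoWorkerThreads as exc:
        print("third removal failed:", exc)
```

In this sketch the error only appears once more removals are queued than the
queue can hold. Whether that is what actually happens inside s3ql, and why the
removal threads are gone in the first place, is not established by the log.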

The complete log is available at http://www.cobb.uk.net/s3ql-983170-mount.log.gz

#983170#25
Date:
2021-09-17 10:58:44 UTC
From:
To:
Now that bullseye has shipped, and I have moved on to bookworm, I am keen to do
anything I can to help resolve this. Is there anything I can do, for example
testing packages? Or is there an upstream fix available for testing?