#1014037 mailman3-web: Possible memory leak: uwsgi OOMs after a few weeks

#1014037#5
Date:
2022-06-29 01:11:15 UTC
From:
To:
Dear Maintainer,

 I have a mailman3 system backed by PostgreSQL, exim4, and nginx;
 and it is set up and works properly.  However, the uwsgi process
 keeps growing and growing until the system OOMs, typically
 after two to three weeks.

 I added more RAM (the system now has 3 GB), which postponed
 but did not fix the problem.  As a workaround I now restart
 the mailman3 service once a day.

#1014037#10
Date:
2022-06-30 08:18:46 UTC
From:
To:
Hi,

I'm pretty sure I'm seeing this too. I'm running it under apache2
and with mariadb.

After a week or so uwsgi was using about 7% RAM on an 8G machine. I
restarted mailman3-web and that went back to 1%. One day later it is
up to 1.2%; I guess it will keep growing and I will also have to
regularly restart mailman3-web.

Is it easy to switch the mailman3-web package to run under
gunicorn?

Cheers,
Andy

#1014037#15
Date:
2024-04-24 23:59:07 UTC
From:
To:
Hi,

Peter Chubb <peter.chubb@unsw.edu.au> wrote on 29/06/2022 at 03:11:15+0200:

Having the same kind of setup for the past 6 years, I never had such an
issue.

Do you have more intel?

#1014037#22
Date:
2024-04-25 01:05:26 UTC
From:
To:

Pierre-Elliott> Having the same kind of setup for the past 6 years, I
Pierre-Elliott> never had such an issue.


Since increasing the size of the VM and the last Mailman3 upgrade, I
haven't seen the issue.

#1014037#27
Date:
2024-05-10 11:43:16 UTC
From:
To:
I can also confirm this running mailman3-web in Apache. Usually it only takes a few days. I have attached a graph to illustrate the growth.

I am using this to launch it:

  WSGIDaemonProcess mailman3 processes=1 threads=8 display-name=%{GROUP} home=/usr/share/mailman3-web

Adding a "maximum-requests=1" does not help at all. Swapping processes and threads does not bring any change either.

I have no idea what to look for but I am happy to investigate if you have any ideas.

root@mail02:~/configuration/haj.ipfire.org/mail02# dpkg -l mailman3*
Desired=Unknown/Install/Remove/Purge/Hold
| Status=Not/Inst/Conf-files/Unpacked/halF-conf/Half-inst/trig-aWait/Trig-pend
|/ Err?=(none)/Reinst-required (Status,Err: uppercase=bad)
||/ Name           Version         Architecture Description
+++-==============-===============-============-============================================================
ii  mailman3       3.3.8-2~deb12u1 all          Mailing list management system
un  mailman3-core  <none>          <none>       (no description available)
un  mailman3-doc   <none>          <none>       (no description available)
ii  mailman3-full  3.3.8-2~deb12u1 all          Full Mailman3 mailing list management suite (metapackage)
ii  mailman3-web   0+20200530-2.1  all          Django project integrating Mailman3 Postorius and HyperKitty

#1014037#32
Date:
2025-01-15 15:04:54 UTC
From:
To:
Control: tags -1 -moreinfo

[...]

What do you need? :)

We've been running Mailman 3 from Debian packages for a couple of months
now, and we're seeing recurring OOM errors. At first, we were hitting
8GB memory usage, and bumped the memory of that machine to 16GB, but
we're still getting OOMs. Our incident log is in:

https://gitlab.torproject.org/tpo/tpa/team/-/issues/41957

Here's a screenshot of a Grafana dashboard of our "per-process memory
exporter" that shows, well, per-process memory usage:

#1014037#39
Date:
2025-01-15 15:18:21 UTC
From:
To:
Hello everyone,

I would also be happy to provide more information.

I am running mailman3-web in Apache with mod_wsgi and I also have the same memory usage problem. Therefore I thought it was a mailman3 problem rather than a problem in the application that is hosting it.

I would be happy to hear if running mailman3 in Gunicorn resolves the problem, but maybe it is just a coincidence that the problem doesn’t appear there?

All the best,
-Michael

#1014037#44
Date:
2025-01-15 15:41:02 UTC
From:
To:
Have you pinned down exactly *what* process is eating memory? For us
it's clearly uwsgi, so we're thinking the issue actually doesn't lie
within mailman itself, and upstream seems to think so as well.

It could be! If you could show us OOM dmesg logs, they should show which
process was actually using memory when the OOM happens, this should
inform next steps pretty well.

Alternatively, having per-process memory graphs would help too, I think.
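
For anyone without a per-process exporter, a rough equivalent can be read straight from /proc. The following is only a minimal sketch, assuming a Linux host; the helper names (parse_vmrss_kib, top_rss) are made up for illustration and are not part of any tool mentioned in this thread.

```python
import os


def parse_vmrss_kib(status_text):
    """Return the VmRSS value in KiB from /proc/<pid>/status text, or None."""
    for line in status_text.splitlines():
        if line.startswith("VmRSS:"):
            return int(line.split()[1])
    return None


def top_rss(limit=10, proc_root="/proc"):
    """List (rss_kib, pid, name) tuples for the largest resident processes."""
    if not os.path.isdir(proc_root):
        return []  # not a Linux-style /proc, nothing to report
    rows = []
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue  # only numeric directories are processes
        try:
            with open(os.path.join(proc_root, entry, "status")) as fh:
                text = fh.read()
        except OSError:
            continue  # process exited or is not readable
        rss = parse_vmrss_kib(text)
        if rss is None:
            continue  # kernel threads have no VmRSS line
        # The first line of the status file is "Name:\t<command>".
        name = text.splitlines()[0].split(":", 1)[1].strip()
        rows.append((rss, int(entry), name))
    return sorted(rows, reverse=True)[:limit]


if __name__ == "__main__":
    for rss, pid, name in top_rss():
        print(f"{rss / 1024:10.1f} MiB  {pid:>7}  {name}")
```

Running this from cron every few minutes and keeping the output should be enough to see whether uwsgi, apache2 or a python3 worker is the one growing.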

Otherwise I'm not sure what peb needs here. :)

a.

#1014037#49
Date:
2025-01-15 16:07:46 UTC
From:
To:
On 2025-01-15 15:56:27, Michael Tremer wrote:

[...]

I would try bumping memory to 16GiB, to see if it improves the situation
for you. In our case, it clearly showed, rather conclusively, that the
problem was not just "oh, mailman3 is using more memory" but more
clearly "wow, there's a problem with uwsgi".

In the above stats, it's not entirely clear to me the cause is with
Apache: you have a lot going on there, and it *could* actually be
there's an issue with the overall memory usage and Apache is just being
tagged as the culprit by the OOM...

But yeah, your numbers might show there's actually an underlying issue
with mailman-web itself. Our tests with gunicorn will more conclusively
show whether or not it's the case: if the issue goes away in gunicorn,
then this could be an issue in *both* uwsgi and apache2-wsgi...

a.

#1014037#54
Date:
2025-01-15 16:23:22 UTC
From:
To:
On 2025-01-15 16:13:44, Michael Tremer wrote:

[...]

For the record, I absolutely agree.

Honestly, I'm scratching an itch here. If I can get away with getting
rid of the OOM by switching to gunicorn, I'll be happy, especially since
we use gunicorn elsewhere...

a.

#1014037#59
Date:
2025-01-16 10:20:56 UTC
From:
To:
Good morning everyone,

I ran the machine now with a total of 16 GiB - no other modifications have been made.

Since then, the Apache process consumed the entirety of memory (minus the other basic system services) and was killed by the OOM. Graph attached.

#1014037#64
Date:
2025-01-16 14:40:05 UTC
From:
To:
I bet this is Apache killing and spawning its children from time to
time, sometimes hitting leaky ones.


Clearly there's a memory leak in this implementation as well, but we'll
know better whether it's specific to apache/uwsgi when we test with
gunicorn.

Stay tuned!

#1014037#69
Date:
2025-01-23 21:20:07 UTC
From:
To:
On 2025-01-15 10:04:54, Antoine Beaupré wrote:

[...]

It's been a little over 24 hours and we can already say that we still
get OOMs under gunicorn.

The interesting thing is that it's a different process showing the OOM
condition: instead of it being gunicorn itself (which you'd expect if it
was designed like uwsgi or apache2-mod-wsgi), it's Python itself eating
all the memory. See this comment:

https://gitlab.torproject.org/tpo/tpa/team/-/issues/41957#note_3151902

Direct link to the per-process memory graph:

https://gitlab.torproject.org/-/project/441/uploads/c8ebf60612c426688e651853f251edd5/mem3.png

So my theory, at this moment, is that the assumption that the problem is
related to the process manager (uwsgi or apache or gunicorn) is
incorrect; this is actually and truly a memory management issue inside
the Python process running mailman3-web.

This bug report would, therefore, seem to be filed at the right place.

The question at this point is: how do we profile this any further? Any
advice? run a memory profiler like austin?
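
One option that needs no external tool is Python's built-in tracemalloc module. The following is a minimal sketch of comparing two snapshots to see which source lines grew; the helper name top_growth and the artificial "leak" are assumptions for illustration, and where exactly to take the snapshots inside mailman3-web (a middleware, a signal handler, ...) is left open.

```python
import tracemalloc


def top_growth(before, after, limit=5):
    """Return the source lines whose allocations grew most between snapshots."""
    stats = after.compare_to(before, "lineno")
    return [s for s in stats if s.size_diff > 0][:limit]


if __name__ == "__main__":
    tracemalloc.start(10)  # keep up to 10 frames per allocation
    before = tracemalloc.take_snapshot()

    # Stand-in for a leak: memory that stays referenced after the request.
    leaked = [bytearray(1024) for _ in range(10_000)]

    after = tracemalloc.take_snapshot()
    for stat in top_growth(before, after):
        print(stat)
    tracemalloc.stop()
```

tracemalloc only sees allocations made through Python's allocator, so memory held by C libraries would not show up here.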

a.

#1014037#74
Date:
2025-02-07 03:12:28 UTC
From:
To:
At last, we have news!

I *think* I have identified the culprit. While handling an unrelated
issue (GDPR anyone?) we had to rebuild the search indexes and, while
testing *that*, we found that we could pretty reliably crash mailman-web
by... well, just searching all lists crashes it.

Boom. It's search?

I'm in the process of switching to Xapian now. This brings a whole lot
of other issues (it uses more disk space and there's a bug in the
xapian-haystack library that crashes indexing, see #), but so far, we've
completely cleared out any OOM errors we were previously getting.

Check out this beauty:

#1014037#79
Date:
2025-02-07 03:26:53 UTC
From:
To:
Forgot the bug number here, it's #1095320.

#1014037#84
Date:
2025-02-24 15:56:57 UTC
From:
To:
Hello,

Apologies for my late reply.

Hmm, I don’t want to bring everybody down, but I think I cannot confirm this.

# FTS
HAYSTACK_CONNECTIONS = {
    'default' : {
        'PATH'   : "/var/lib/mailman3/web/fulltext_index",
        'ENGINE' : 'xapian_backend.XapianEngine'
    },
}

Looks pretty much the same to me.
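
For comparison, a Whoosh-backed setup would differ only in the ENGINE line. This is a sketch, not a copy of any shipped file: the engine path is django-haystack's Whoosh backend, and reusing the same PATH as above is an assumption.

```python
# FTS (Whoosh variant, for comparison only)
HAYSTACK_CONNECTIONS = {
    'default' : {
        # Assumption: same index directory as in the Xapian config above.
        'PATH'   : "/var/lib/mailman3/web/fulltext_index",
        # django-haystack's pure-Python Whoosh backend.
        'ENGINE' : 'haystack.backends.whoosh_backend.WhooshEngine'
    },
}
```

Switching ENGINE requires rebuilding the index, since the two backends do not share an on-disk format.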

Xapian has not been giving us great results in the rest of our infrastructure. We used it in Dovecot, where it created HUGE indexes that were about half the size of the original inboxes, which also made searching them very slow. We migrated to Solr there, but that was not an option for Mailman.

Whoosh was expectedly worse.

So, has this solved it all for good for you guys? What release of xapian are you on?

# apt-cache show python3-xapian-haystack
Package: python3-xapian-haystack
Source: python-xapian-haystack
Version: 2.1.1-1+deb12u1
Installed-Size: 91
Maintainer: Debian Python Team <team+python@tracker.debian.org>
Architecture: all
Depends: python3-django-haystack, python3-xapian, python3-django, python3:any
Enhances: python3-django-haystack
Description-en: Xapian backend for Django-Haystack (Python3 version)
 Xapian-haystack is a backend of Django-Haystack for the Xapian search engine.
 It provides all the standard features of Haystack:
  * Weighting
  * Faceted search (date, query, etc.)
  * Sorting
  * Spelling suggestions
  * EdgeNGram and Ngram (for autocomplete)
 The endswith search operation is not supported.
 .
 This package contains the Python 3 version of the library.
Description-md5: 5e43ae0149e2df6b3df16ddcf87f3b13
Homepage: https://github.com/notanumber/xapian-haystack/
Section: python
Priority: optional
Filename: pool/main/p/python-xapian-haystack/python3-xapian-haystack_2.1.1-1+deb12u1_all.deb
Size: 21412
MD5sum: c203fd6ef9a992ad418f0685a528a45e
SHA256: 9b70209f36b9bccbfda0346b048d024c48fd4c168e8a0bfe811a3c770eb18287

#1014037#89
Date:
2025-02-24 16:51:16 UTC
From:
To:
On 2025-02-24 15:56:57, Michael Tremer wrote:

[...]
OOM/day, with peaks at 15, 120 when reindexing, and this is down to 1-5
a day, depending on the day. Kind of hard to track discrete events like
this...

We had a single OOM in the last 48h. That's "nice".

We also have stupidly large Xapian indexes now, it's ridiculous. Clearly
something wrong either with the haystack or the hyperkitty
implementation. So far I've filed it in the latter:

https://gitlab.com/mailman/hyperkitty/-/issues/533

So, TL;DR: improved, but not fixed. I suspect we had a multi-dimensional
issue, of which search/whoosh *was* a part, because we would see a
huge increase in OOMs when rebuilding the indexes. But we're still
having an issue, so perhaps there's something else.

We tried to hook up a memory profiler (austin) but it failed because it
didn't work with Python 3.11... so maybe that's something we'll try to
revisit after our trixie upgrades (hopefully soon!).

a.

#1014037#94
Date:
2025-02-24 17:51:17 UTC
From:
To:
Hello Antoine,

Okay, this might still be a slight step in the right direction.

This is my experience with Xapian and I have found confirmation that this is supposed to be normal. My mailbox indexes were massive and there was no point having them any more. So I can confirm that this looked very similar in Dovecot, too.

I tried that but I was struggling with a missing sssd and some other things. Not sure I am ready to try again. It would also not help us to find out where exactly this went wrong if it were fixed :(

#1014037#99
Date:
2025-02-24 18:34:57 UTC
From:
To:
[...]

A 10x amplification in the disk usage is not normal.

https://gitlab.com/mailman/hyperkitty/-/issues/533

I use notmuch as a search index here, and the amplification is
*opposite*, 4.5x *reduction* in disk usage compared to the original
dataset.

#1014037#104
Date:
2025-03-05 21:32:52 UTC
From:
To:
https://gitlab.torproject.org/tpo/tpa/team/-/issues/41957

We've gone from 20-40 OOMs/week (multiple daily) to ~3 per week, so
Xapian has definitely improved the situation.

I don't think this bug report should be closed though: we still have a
memory leak issue. I don't think it's reasonable for mailman to take
16GB of RAM for such small setups. Xapian is also using an unreasonable
amount of disk space.

But for all intents and purposes, this is as much effort as I can dedicate
to this. Hopefully, when we upgrade to trixie, we can run a profiler on
this.

Feel free to remind me of that in a year.

Otherwise happy to provide more info as needed of course.

Cheers!

#1014037#109
Date:
2025-03-06 10:27:52 UTC
From:
To:
From: Michael Tremer <michael.tremer@ipfire.org>
To: Antoine Beaupré <anarcat@debian.org>
Cc: 1014037@bugs.debian.org; Pierre-Elliott Bécue <peb@debian.org>; Peter Chubb <peter.chubb@unsw.edu.au>
Date: 6 March 2025 11:21:38
Subject: Re: Bug#1014037: mailman3-web: Possible memory leak: uwsgi OOMs after a few weeks

Hello,

For the sake of clarity I am waiting for transitional freeze to update all mailman3 packages as any py3 transition so far broke a lot of things.

In parallel I started to dive a bit in this Xapian matter. Using mu, I agree that the current size for the index is weird. I have yet to finish understanding the codebase but I'll definitely try to see through it ASAP

Bests,

#1014037#114
Date:
2025-03-06 10:21:19 UTC
From:
To:
Hello,

I have been looking at alternatives to mailman3 recently. I think that there is a very good chance we would migrate to mlmmj. There are currently too many large outstanding problems with mailman and it seems that there is not enough of a community around it to get them fixed in time. Although the large memory consumption is mostly annoying and not a deal-breaker, mailman sometimes stops accepting emails and needs a restart.

However, mlmmj (also packaged in Debian) is super small and super simple. The feature set is exactly what we need, although it would have been nice to have an API to subscribe/unsubscribe users. Since it is all a collection of small binaries, that can be built very easily with a couple of CGI scripts or so.

But it does not have any archiving features beyond storing all emails in a directory. So there is public-inbox which has a simple web UI, stores emails in Git repositories which can be cloned and backed up very easily, *and* it is using Xapian indexes. So I thought I would give this a go and import our lists into it - just so that I have a way to compare.

On mailman, my Xapian index is about 4.9 GiB, on public-inbox I have 1.4 GiB. A significant change. This is still kind of large, but roughly only a quarter. The search also seems to be much faster. So I assume there is some configuration here that makes the index a lot smaller and the smaller the index the faster the search usually.

The mlmmj + public-inbox solution seems to have gained a lot of traction recently. The Linux kernel people are using it (https://lore.kernel.org/), Gentoo is using it (https://public-inbox.gentoo.org/), Proxmox too; the list is actually quite long. So I think we might have a better chance to get something back that worked as well as mailman 2 without all this large complexity.

I agree. This is not the most painful problem in the world, but our mailing list needs to *just work* and I cannot spend a lot of time on keeping it working.

I would still be curious to find out what the actual problem was here though...

#1014037#119
Date:
2025-03-06 14:59:36 UTC
From:
To:
On 2025-03-06 11:27:52, Pierre-Elliott Bécue wrote:

[...]

That sounds fantastic peb! Let me know if you need any more data!

(What's "mu"?)

#1014037#124
Date:
2025-08-11 21:06:45 UTC
From:
To:
On 2025-01-23 16:20:07, Antoine Beaupré wrote:

[...]

We found a fix.

First, we tried austin, but it was filling up our logs more than
anything. Then lavamind found the trick: turns out the mbox export
slurps the entire archive in memory before compression.

https://gitlab.com/mailman/hyperkitty/-/issues/385

oops.

so this is a bug in hyperkitty, and a pretty big one, IMHO. this is
denial-of-service security level stuff, but it's being treated as a
feature request upstream.
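
to illustrate why this matters, here is a minimal sketch of the two patterns: joining the whole archive in memory before compressing versus compressing message by message. this is NOT hyperkitty's code; the function names and the fake messages are made up for illustration.

```python
import gzip
import io


def export_buffered(messages):
    """Join every message in memory, then compress: peak RAM ~ archive size."""
    whole = b"".join(messages)  # the entire archive is held at once
    return gzip.compress(whole)


def export_streamed(messages, out):
    """Compress message by message into `out`: peak RAM ~ one message."""
    with gzip.GzipFile(fileobj=out, mode="wb") as gz:
        for msg in messages:
            gz.write(msg)


if __name__ == "__main__":
    # Hypothetical stand-in for an archive; a real export would iterate
    # over database rows instead of generating dummy messages.
    msgs = (b"From sender@example.org\n\nbody %d\n\n" % i for i in range(1000))
    buf = io.BytesIO()
    export_streamed(msgs, buf)
    print(len(buf.getvalue()), "compressed bytes")
```

with the streamed variant, memory use depends on the largest single message rather than on the size of the list archive, as long as `out` is a file or socket and not another in-memory buffer.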

the workaround is to set `HYPERKITTY_MBOX_EXPORT = False`

perhaps we could ship such a configuration in debian by default?

incident closed on our end, thanks everyone here for the help!

a.