I'd like to request that lintian make use of multiple threads when performing its evaluations. I noticed that running lintian against the curl package takes a few seconds (on a powerful machine) and uses only a single thread. I believe there could be noticeable performance gains from using all the threads available, although I don't know how feasible that is with lintian+perl. Note that I didn't go as far as debugging lintian to confirm it's single-threaded. I only noticed that one thread was at 100% while lintian was running, and I consider that good evidence. In the worst case, the maintainer can clarify that I'm wrong (I know there's some chance lintian is actually multi-threaded but was waiting for something else that's single-threaded). Thanks,
Hi Samuel,

I share your hope and have implemented two attempts to parallelize the ~300 or so checks.

My first attempt used IO::Async but failed. That module is probably the best one currently available, but it replaces the SIGCHLD handler. Lintian uses dozens of other modules that call external programs via other means. Unfortunately, those do not interact well with IO::Async, which causes the parallel execution to freeze or otherwise experience strange bugs. A particularly serious problem for Lintian was the interaction with Path::Tiny. [1] You may be able to find some details by searching the Git log for "Heisenbug" (capital H, please).

My current implementation uses MCE [2] which works okay, but does not yet yield the performance gains you and I are hoping for. That is why the experimental branch has not been merged.

As far as I can tell, the degradation relates to the serializations Perl performs between parent and child processes. It is possible to "close" on the in-memory file indexes as part of the fork() but it's not enough to explain the difference. (The indexes are large and also being transitioned to disk for unrelated reasons.) Memory usage is higher, as well.

I may have to implement better profiling before we make significant progress. That is because at least half the time is spent generating the file indexes, which require a different parallelization strategy than the checks.

One long-term plan could be to have a data interchange format between the parent and the child processes. It would also allow checks to be written in other programming languages, such as Haskell, but I would seek further community input before proceeding with anything like that.

[1] https://github.com/dagolden/Path-Tiny/issues/224
[2] https://metacpan.org/pod/MCE

Perl performs surprisingly well for an interpreted language, but I am not sure true "threading" works well. In Lintian, we use multiple processes, if at all. That is how I interpreted your use of the word "threads".
You are right. For the purposes of your analysis, Lintian uses a single process. Thank you for your valuable suggestions! Kind regards, Felix Lechner
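For readers wondering what the "data interchange format between the parent and the child processes" mentioned above might look like, here is a minimal sketch in Python. It is not Lintian code and makes several assumptions: the request shape (`files`, `path`, `mode`), the tag name `example-non-executable-binary`, and the helper `run_check` are all hypothetical. The idea is only that the parent sends a JSON request on the child's stdin and reads one JSON object per finding from its stdout, so the child could in principle be written in any language.

```python
import json
import subprocess
import sys

# Hypothetical child "check". It reads one JSON request from stdin and
# prints one JSON object per finding to stdout. Any language that can
# read and write JSON could take its place.
CHILD_CHECK = r"""
import json, sys
request = json.load(sys.stdin)
for entry in request["files"]:
    # Flag files under usr/bin/ that have no executable bit set.
    if entry["path"].startswith("usr/bin/") and not entry["mode"] & 0o111:
        print(json.dumps({"tag": "example-non-executable-binary",
                          "path": entry["path"]}))
"""


def run_check(files):
    """Send only the needed index entries to a child process; parse its findings."""
    request = json.dumps({"files": files})
    proc = subprocess.run(
        [sys.executable, "-c", CHILD_CHECK],
        input=request,
        capture_output=True,
        text=True,
        check=True,
    )
    # One JSON document per non-empty output line.
    return [json.loads(line) for line in proc.stdout.splitlines() if line]


findings = run_check([
    {"path": "usr/bin/curl", "mode": 0o644},
    {"path": "usr/bin/ok", "mode": 0o755},
])
```

Sending only the entries a check needs, rather than the whole index, is one way such a format could limit the serialization cost described above. Whether it would help in practice is an open question.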
Is your work available anywhere? Thanks,
I would be very happy if this were implemented. To quote an email I sent to debian-devel:

For point 2, it seems the easiest way to make a significant difference would be if lintian could run multi-threaded. My current development CPU has 8 physical cores hyper-threaded, which present to the OS as 16 logical cores. Most of the build process is multi-threaded and uses all the cores to their maximum potential simultaneously. But lintian is single-threaded, so it only uses one core and the other 15 sit idle. There might be some lintian tests that depend on the output of other lintian tests, but I would imagine that most of them could be run in parallel with the results combined at the end. I don’t know enough Perl to know how easy it would be to run lintian in a multi-threaded manner, but if this was not a difficult change it would speed up lintian runs dramatically. In the case of qtwebengine-opensource-src on my hardware, assuming that all cores could be efficiently utilized and there are no other bottlenecks in RAM or disk access, it would drop lintian’s runtime from about 30 minutes to about 2 minutes.

https://lists.debian.org/debian-devel/2024/05/msg00169.html
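The "30 minutes to about 2 minutes" figure above is the ideal case of dividing the runtime evenly over 16 logical cores. The short Python sketch below reproduces that arithmetic and, for comparison, shows what happens if only part of the run can be parallelized (Amdahl's law). The function name and the 0.5 parallel fraction are illustrative assumptions, not measurements of qtwebengine-opensource-src.

```python
def estimated_runtime(serial_minutes, workers, parallel_fraction=1.0):
    """Estimate wall-clock minutes when `parallel_fraction` of the work
    scales perfectly across `workers` and the rest stays serial."""
    serial_part = serial_minutes * (1.0 - parallel_fraction)
    parallel_part = serial_minutes * parallel_fraction / workers
    return serial_part + parallel_part


# Ideal case from the quoted email: everything parallelizes over 16 cores.
ideal = estimated_runtime(30, 16)

# Hypothetical case: only half of the run parallelizes.
half_parallel = estimated_runtime(30, 16, parallel_fraction=0.5)

print(f"ideal: {ideal:.2f} min, half parallel: {half_parallel:.2f} min")
```

Under the ideal assumption the estimate is just under 2 minutes. If half of the run stays serial, it remains close to 16 minutes no matter how many cores are available.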
The current blocker for multithreading is Lintian::Index, which is not shareable/serializable at the moment so it can't be copied between all the threads. Even if it was, it's a huge amount of data being copied into each thread, and only a few checks need Lintian::Index.

Possible pathways:
- Make Lintian::Index shareable
- Implement some sort of communication between threads to allow a worker thread to RPC the Index in the main thread
- Refactor Lintian::Index and/or any checks that use it
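To illustrate the second pathway (workers asking the main thread for index data instead of holding a copy), here is a rough sketch in Python rather than Perl. Everything in it is hypothetical: `INDEX` is a toy stand-in for Lintian::Index, and `worker`, `serve_index`, and `run_checks` are names made up for this example. It only shows the request/reply pattern, not how Lintian would actually implement it.

```python
import queue
import threading

# Toy stand-in for Lintian::Index: path -> file metadata (hypothetical data).
INDEX = {
    "usr/bin/curl": {"size": 250000, "mode": 0o755},
    "usr/share/doc/curl/copyright": {"size": 4000, "mode": 0o644},
}


def worker(path, requests, results):
    """A check that needs one index entry and asks the owner thread for it."""
    reply = queue.Queue(maxsize=1)
    requests.put((path, reply))   # send the lookup request
    entry = reply.get()           # block until the owner thread answers
    is_executable = entry is not None and (entry["mode"] & 0o111) != 0
    results.put((path, is_executable))


def serve_index(index, requests, expected):
    """Runs in the thread that owns the index: answer `expected` lookups."""
    for _ in range(expected):
        path, reply = requests.get()
        reply.put(index.get(path))  # None if the path is not in the index


def run_checks(paths):
    requests = queue.Queue()
    results = queue.Queue()
    threads = [
        threading.Thread(target=worker, args=(path, requests, results))
        for path in paths
    ]
    for thread in threads:
        thread.start()
    # The main thread keeps sole ownership of INDEX and serves lookups.
    serve_index(INDEX, requests, expected=len(paths))
    for thread in threads:
        thread.join()
    return dict(results.get() for _ in paths)


report = run_checks(["usr/bin/curl", "usr/share/doc/curl/copyright", "missing"])
```

The trade-off is that every lookup becomes a round trip through the owning thread, so this only pays off if, as noted above, few checks need the index.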