Subject: ITP: pufferfish -- An efficient index for the colored, compacted, de Bruijn graph
Package: wnpp
Owner: Michael R. Crusoe <michael.crusoe@gmail.com>
Severity: wishlist

* Package name    : pufferfish
  Version         : 1.0.0
  Upstream Author : 2016 Rob Patro, Avi Srivastava, Hirak Sarkar
* URL             : https://github.com/COMBINE-lab/pufferfish
* License         : GPL-3+
  Programming Lang: C
  Description     : An efficient index for the colored, compacted, de Bruijn graph

Pufferfish is a new time- and memory-efficient data structure for indexing a compacted, colored de Bruijn graph (ccdBG).
.
Though the de Bruijn Graph (dBG) has enjoyed tremendous popularity as an assembly and sequence comparison data structure, it has only relatively recently begun to see use as an index of the reference sequences (e.g. deBGA, kallisto). In particular, these tools index the compacted dBG (cdBG), in which all non-branching paths are collapsed into individual nodes and labeled with the string they spell out. This data structure is particularly well-suited for representing repetitive reference sequences, since a single contig in the cdBG represents all occurrences of the repeated sequence. The original positions in the reference can be recovered with the help of an auxiliary "contig table" that maps each contig to the reference sequence, position, and orientation where it appears as a substring. The deBGA paper has a nice description of how this kind of index looks (they call it a unipath index, because the contigs we index are unitigs in the cdBG), and how all the pieces fit together to resolve the queries we care about.

Moreover, the cdBG can be built on multiple reference sequences (transcripts, chromosomes, genomes), where each reference is given a distinct color (or colour, if you're of the British persuasion). The resulting structure, which also encodes the relationships between the cdBGs of the underlying reference sequences, is called the compacted, colored de Bruijn graph (ccdBG).
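
To make the "contig table" idea above concrete, here is a minimal sketch in Python. It is not pufferfish's implementation: the function names (build_contig_table, find_all, reverse_complement) and the toy sequences are made up for illustration, and it locates contigs by brute-force substring search instead of deriving positions from the graph. It only shows the shape of the mapping: contig -> list of (reference, position, orientation).

```python
from collections import defaultdict

# Translation table for complementing DNA bases.
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of a DNA string over A, C, G, T."""
    return seq.translate(_COMPLEMENT)[::-1]


def find_all(text, pattern):
    """Yield every start offset (including overlapping ones) of pattern in text."""
    start = text.find(pattern)
    while start != -1:
        yield start
        start = text.find(pattern, start + 1)


def build_contig_table(contigs, references):
    """Map each contig id to the (reference, position, orientation) hits.

    Hypothetical helper: contigs and references are dicts of name -> sequence.
    Orientation is "+" when the contig appears as written and "-" when its
    reverse complement appears.
    """
    table = defaultdict(list)
    for contig_id, contig in contigs.items():
        rc = reverse_complement(contig)
        for ref_name, ref_seq in references.items():
            for pos in find_all(ref_seq, contig):
                table[contig_id].append((ref_name, pos, "+"))
            # Skip the second scan for palindromic contigs to avoid duplicates.
            if rc != contig:
                for pos in find_all(ref_seq, rc):
                    table[contig_id].append((ref_name, pos, "-"))
    return dict(table)
```

For example, with contigs = {"c0": "ACGTT"} and references = {"r1": "GGACGTTCC", "r2": "TTAACGTAA"}, the table is {"c0": [("r1", 2, "+"), ("r2", 2, "-")]}: the same contig is stored once, and every occurrence across the references (the "colors") is recovered from the table.
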
This is not, of course, the only variant of the dBG that has proven useful from an indexing perspective. The (pruned) dBG has also proven useful as a graph upon which to build a path index of arbitrary variation / sequence graphs, which has enabled very interesting and clever indexing schemes like that adopted in GCSA2. Also, thinking about sequence search in terms of the dBG has led to interesting representations for variation-aware sequence search backed by indexes like the vBWT (implemented in the excellent gramtools package).

Remark: This package is maintained by Debian Med Packaging Team at
   https://salsa.debian.org/med-team/pufferfish
This package will be team maintained by Debian-Med
Hello, I have stumbled upon pufferfish in Salsa which used to FTBFS. I patched its build system a bit and now it builds successfully. If it is OK with you, I can continue finalizing the package. Best, Andrius
That's great news, please go for it!
On Thu, Mar 24, 2022 at 06:39:22PM +0100, Michael Crusoe wrote: ... as always. If you fix something, just upload. Thanks a lot, Andreas.
Hello, Build dependencies for pufferfish 1.8.0+dfsg-1 [1] cannot be satisfied due to #1006920: twopaco needs an older libtbb, while pufferfish needs a newer one, and the two are not co-installable. I can confirm that applying the patch from #1006920 solves the issue. However, once #1006920 is out of the way, pufferfish wants to link with -ltwopaco -lgraphdump -lntcard, and neither twopaco (which seems to be the source for the first two libraries) nor ntcard (the source for the last) seems to build shared libraries. The pristine upstream tarball for pufferfish embeds these projects, thus it is possible they are patched forks. [1] salsa git commit 409e6a8dc660a12de6f1239cca106049a7318a3a Andrius
Hello, twopaco has entered testing (yay!), thus I gave its reverse dependency, pufferfish (ITP bug #944785), a look. pufferfish carries embedded copies of twopaco and ntcard with a modified build system to create static (or is it shared?) libraries for these two and then links pufferfish with them. For Debian, twopaco and ntcard have been un-embedded from pufferfish and packaged as separate binary packages instead. However, they build neither shared nor static libraries, just executables. I think we can get around this by patching the twopaco and ntcard builds to include static libraries in their binary packages. Does this sound right? Best, Andrius
Hi Andrius,
On Mon, Oct 03, 2022 at 10:19:41AM +0300, Andrius Merkys wrote:
My main motivation to start the ntcard and twopaco packages was to avoid
code duplication in pufferfish. I admit it seems I failed to do this
sensibly, since I forgot to create a library package. Simply do whatever
brings you forward with pufferfish and fix what I failed to do.
Kind regards
Andreas.
Hi, Thanks for the replies, Andreas and Steffen! Alas, I did not get far. Static libraries for ntcard and twopaco are easy to add (I have pushed 'static-library' branches to salsa for these packages). However, pufferfish has patched the main() functions of the ntcard and twopaco executables in order to call them internally. At this point I do not think much can be done without getting the upstreams of pufferfish, ntcard and twopaco to align their interfaces. Best, Andrius
Hi Andrius,
On Tue, Oct 04, 2022 at 10:38:01AM +0300, Andrius Merkys wrote:
This sounds like you should keep the code copies (which was your initial
strategy anyway if I remember correctly).
IMHO we need some fast migration path of pufferfish into Debian to
get salmon fixed / updated. I do not have the feeling that aligning
upstreams is a promising way to be fast.
Kind regards
Andreas.
Hi Andreas, Agreed. I would suggest bringing back the embedded twopaco and ntcard for now, and pursuing upstream alignment in the background. Best, Andrius