#944785 ITP: pufferfish -- An efficient index for the colored, compacted, de Bruijn graph

Package:
wnpp
Source:
wnpp
Submitter:
"Michael R. Crusoe"
Date:
2025-11-29 16:45:26 UTC
Severity:
wishlist
#944785#5
Date:
2019-11-15 11:21:05 UTC
From:
To:
Subject: ITP: pufferfish -- An efficient index for the colored, compacted, de Bruijn graph
Package: wnpp
Owner: Michael R. Crusoe <michael.crusoe@gmail.com>
Severity: wishlist

* Package name    : pufferfish
  Version         : 1.0.0
  Upstream Author : 2016 Rob Patro, Avi Srivastava, Hirak Sarkar
* URL             : https://github.com/COMBINE-lab/pufferfish
* License         : GPL-3+
  Programming Lang: C++
  Description     : An efficient index for the colored, compacted, de Bruijn graph
 Pufferfish is a new time- and memory-efficient data structure for indexing a
 compacted, colored de Bruijn graph (ccdBG).
 .
 Though the de Bruijn Graph (dBG) has enjoyed tremendous popularity as an
 assembly and sequence comparison data structure, it has only relatively
 recently begun to see use as an index of the reference sequences (e.g. deBGA,
 kallisto). Particularly, these tools index the compacted dBG (cdBG), in which
 all non-branching paths are collapsed into individual nodes and labeled with
 the string they spell out. This data structure is particularly well-suited for
 representing repetitive reference sequences, since a single contig in the cdBG
 represents all occurrences of the repeated sequence. The original positions in
 the reference can be recovered with the help of an auxiliary "contig table"
 that maps each contig to the reference sequence, position, and orientation
 where it appears as a substring. The deBGA paper has a nice description of how
 this kind of index looks (they call it a unipath index, because the contigs we
 index are unitigs in the cdBG), and how all the pieces fit together to be able
 to resolve the queries we care about.  Moreover, the cdBG can be built on
 multiple reference sequences (transcripts, chromosomes, genomes), where each
 reference is given a distinct color (or colour, if you're of the British
 persuasion). The resulting structure, which also encodes the relationships
 between the cdBGs of the underlying reference sequences, is called the
 compacted, colored de Bruijn graph (ccdBG).  This is not, of course, the only
 variant of the dBG that has proven useful from an indexing perspective. The
 (pruned) dBG has also proven useful as a graph upon which to build a path
 index of arbitrary variation / sequence graphs, which has enabled very
 interesting and clever indexing schemes like that adopted in GCSA2. Also,
 thinking about sequence search in terms of the dBG has led to interesting
 representations for variation-aware sequence search backed by indexes like the
 vBWT (implemented in the excellent gramtools package).
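
The "contig table" idea from the description can be illustrated with a toy
sketch. This is not pufferfish's implementation: the function names, the
example sequences, and the plain substring search are all assumptions made
for illustration, whereas a real index locates contigs through k-mer lookups.
The sketch only shows what the table stores: for each contig, every
(reference, position, orientation) where it occurs.

```python
from collections import defaultdict


def reverse_complement(seq):
    """Return the reverse complement of a DNA string over A, C, G, T."""
    comp = {"A": "T", "C": "G", "G": "C", "T": "A"}
    return "".join(comp[base] for base in reversed(seq))


def find_all(text, pattern):
    """Yield every start offset of pattern in text, overlapping hits included."""
    start = text.find(pattern)
    while start != -1:
        yield start
        start = text.find(pattern, start + 1)


def build_contig_table(contigs, references):
    """Map each contig id to a sorted list of (reference, position, orientation).

    Hypothetical helper: uses naive substring search on both strands.
    """
    table = defaultdict(list)
    for contig_id, contig in contigs.items():
        rc = reverse_complement(contig)
        for ref_name, ref_seq in references.items():
            for pos in find_all(ref_seq, contig):
                table[contig_id].append((ref_name, pos, "+"))
            if rc != contig:  # skip palindromic contigs to avoid double counting
                for pos in find_all(ref_seq, rc):
                    table[contig_id].append((ref_name, pos, "-"))
    return {cid: sorted(hits) for cid, hits in table.items()}


# Made-up example data: two references ("colors") sharing one repeated contig.
references = {
    "ref1": "ACGGTTTACGGT",
    "ref2": "GGACCGTCC",
}
contigs = {"u0": "ACGGT"}

contig_table = build_contig_table(contigs, references)
# The set of references a contig occurs in plays the role of its colors.
colors = {cid: {ref for ref, _, _ in hits} for cid, hits in contig_table.items()}
```

Here the single contig u0 stands for both forward occurrences in ref1 and the
reverse-complement occurrence in ref2, which is why one node in the cdBG is
enough to represent a repeated sequence.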

Remark: This package will be team-maintained by the Debian Med Packaging Team at
https://salsa.debian.org/med-team/pufferfish

#944785#12
Date:
2022-03-24 17:38:47 UTC
From:
To:
Hello,

I have stumbled upon pufferfish on Salsa, which used to FTBFS. I patched
its build system a bit and now it builds successfully. If it is OK with
you, I can continue finalizing the package.

Best,
Andrius

#944785#17
Date:
2022-03-24 17:39:22 UTC
From:
To:
That's great news, please go for it!

#944785#22
Date:
2022-03-24 20:23:24 UTC
From:
To:
On Thu, Mar 24, 2022 at 06:39:22PM +0100, Michael Crusoe wrote:

... as always.  If you fix something, just upload.

Thanks a lot, Andreas.

#944785#29
Date:
2022-04-25 13:48:33 UTC
From:
To:
Hello,

Build dependencies for pufferfish 1.8.0+dfsg-1 [1] cannot be satisfied
due to #1006920: twopaco needs an older libtbb, while pufferfish needs a
newer one, and the two are not co-installable. I can confirm that applying
the patch from #1006920 solves the issue.

However, once #1006920 is out of the way, pufferfish wants to link with
-ltwopaco -lgraphdump -lntcard, and neither twopaco (which seems to be
the source of the first two libraries) nor ntcard (the source of the last)
seems to build shared libraries. The pristine upstream tarball for
pufferfish embeds these projects, so it is possible they are patched
forks.

[1] salsa git commit 409e6a8dc660a12de6f1239cca106049a7318a3a

Andrius

#944785#40
Date:
2022-10-03 07:19:41 UTC
From:
To:
Hello,

twopaco has entered testing (yay!), thus I gave its reverse dependency,
pufferfish (ITP bug #944785), a look. pufferfish carries embedded copies
of twopaco and ntcard with a modified build system to create static (or
is it shared?) libraries for these two and then links pufferfish with
them. For Debian, twopaco and ntcard have been un-embedded from
pufferfish and packaged as separate binary packages instead. However,
they build neither shared nor static libraries, just executables.

I think we can get around this by patching the twopaco and ntcard builds to
include static libraries in their binary packages. Does this sound right?

Best,
Andrius

#944785#45
Date:
2022-10-03 15:32:14 UTC
From:
To:
Hi Andrius,

On Mon, Oct 03, 2022 at 10:19:41AM +0300, Andrius Merkys wrote:

My main motivation for starting the ntcard and twopaco packages was to avoid
code duplication in pufferfish.  I admit it seems I failed to do this
sensibly, since I forgot to create a library package.  Simply do whatever
brings you forward with pufferfish and fix what I failed to do.

Kind regards

      Andreas.

#944785#50
Date:
2022-10-04 07:38:01 UTC
From:
To:
Hi,

Thanks for the replies, Andreas and Steffen!

Alas, I did not get far. Static libraries for ntcard and twopaco are
easy to add (I have pushed 'static-library' branches to salsa for these
packages). However, pufferfish has patched the main() functions of the ntcard
and twopaco executables in order to call them internally.

At this point I do not think much can be done without getting the
upstreams of pufferfish, ntcard and twopaco to align their interfaces.

Best,
Andrius

#944785#55
Date:
2022-10-04 09:54:33 UTC
From:
To:
Hi Andrius,

On Tue, Oct 04, 2022 at 10:38:01AM +0300, Andrius Merkys wrote:

This sounds like you should keep the code copies (which was your initial
strategy anyway if I remember correctly).

IMHO we need a fast migration path for pufferfish into Debian to
get salmon fixed / updated.  I do not have the feeling that aligning
upstreams is a promising way to be fast.

Kind regards

    Andreas.

#944785#60
Date:
2022-10-07 06:26:32 UTC
From:
To:
Hi Andreas,

Agreed. I would suggest bringing back the embedded twopaco and ntcard for
now, and starting upstream alignment in the background.

Best,
Andrius