#924040 ITP: archivebox -- open source self-hosted web archive

Package:
wnpp
Source:
wnpp
Submitter:
Antoine Beaupre
Date:
2025-11-29 16:43:56 UTC
Severity:
wishlist
#924040#5
Date:
2019-03-08 18:39:51 UTC
From:
To:
* Package name    : archivebox
  Version         : 0.2.4
  Upstream Author : Nick Sweeting
* URL             : https://archivebox.io/
* License         : MIT/Expat?
  Programming Lang: Python
  Description     : open source self-hosted web archive

ArchiveBox takes a list of website URLs you want to archive, and
creates a local, static, browsable HTML clone of the content from
those websites (it saves HTML, JS, media files, PDFs, images and
more).

You can use it to preserve access to websites you care about by
storing them locally offline. ArchiveBox works by rendering the pages
in a headless browser, then saving all the requests and fully loaded
pages in multiple redundant common formats (HTML, PDF, PNG, WARC) that
will last long after the original content disappears off the
internet. It also automatically extracts assets like git repositories,
audio, video, subtitles, images, and PDFs into separate files using
youtube-dl, pywb, and wget.

ArchiveBox doesn’t require a constantly running server or backend;
instead you just run the ./archive command each time you want to
import new links and update the static output. It can import and
export JSON (among other formats), so it’s easy to script or hook up
to other APIs. If you run it on a schedule and import from browser
history or bookmarks regularly, you can sleep soundly knowing that the
slice of the internet you care about will be automatically preserved
in multiple, durable long-term formats that will be accessible for
decades (or longer).
----

I'm not using this just yet because the upstream packaging is somewhat
weird right now.

https://github.com/pirate/ArchiveBox/issues/120#issuecomment-471027516

It's eventually going to end up on pypi, at which point i'll look at
packaging this myself.

There is, as far as I know, no similar tool in Debian right
now. There are web crawlers and grabbers, but nothing as comprehensive
as this.

I'd be happy to co-maintain this or delegate to whoever is interested.

#924040#10
Date:
2019-04-20 09:34:00 UTC
From:
To:
Many people use xul-ext-scrapbook, but this means they are
stuck with firefox 52.9.0esr-1. I proposed packaging
webext-scrapbookq (#898545), but TBH it does not seem to be an
adequate replacement. At first sight, archivebox looks better.

I'm willing to help under the PAPT (or other team) umbrella.

#924040#15
Date:
2019-04-20 16:18:49 UTC
From:
To:
That would be great. I have very little time to work on this right now
but would be available to test a package.

Upstream did get their act together to fix Python packaging to be more
standard, I believe. There are talks of uploading to PyPI and using
setup.py properly, and I think it might be time to give it another try,
for what that's worth.

A.

#924040#20
Date:
2020-03-11 03:24:11 UTC
From:
To:
Dear colleagues,

I mourn the loss of Scrapbook, and ArchiveBox appears to be a viable alternative.

I've noticed this ITP but since there are no signs of initial packaging I
took the liberty of making a first draft:

https://salsa.debian.org/debian/archivebox

Let's work together on it, shall we? I hope that my initial packaging could
be of help as a starting point.

Thanks.
#924040#25
Date:
2020-03-11 13:24:14 UTC
From:
To:
Feel free to take over this ITP, I'm swamped. :)

a.

#924040#30
Date:
2020-03-11 22:14:18 UTC
From:
To:
So do I.

Thanks for that!

Like anarcat, I'm too busy with other things now, but I'm starting
to test your draft package right now. Observations so far:

 - the link "https://nicksweeting.com/images/archive.png" in
   archivebox/templates/link_index.html should be replaced with
   a local copy

 - when starting archivebox, I'm getting a strange message:
   "fatal: not a git repository (or any of the parent directories): .git"

 - archivebox likes to write to /usr/share/output/, which
   probably should be ~/.cache/archivebox/

 - I'm trying to archive www.debian.org using the command
   echo https://www.debian.org/ | archivebox
   but get the error:
   "! Failed to archive link: KeyError: 'domain'"
   and the resulting subdirectory
   /usr/share/output/archive/1583964646 remains empty

I'm desperately in need of something replacing scrapbook. If you
are working on archivebox, I'd in turn test the package and
send bug reports, maybe even with patches.

#924040#35
Date:
2020-03-13 09:30:27 UTC
From:
To:
Hi Martin,

Thanks a lot for the feedback.

Upstream addressed that already so I've packaged a new upstream snapshot to
pick up changes.

That was version detection logic which assumes that the archivebox executable
is running from a git repository. I've patched away this logic and embedded
the package version.

Indeed it is an inconvenient default, even though it is customizable by
setting the "OUTPUT_DIR" environment variable or in "~/.ArchiveBox.conf" as
per the template in

  /usr/share/doc/archivebox/examples/ArchiveBox.conf.default

I've patched ArchiveBox to use current working directory.
"~/.cache/archivebox/" feels oddly specific...

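The lookup order described above (an explicit "OUTPUT_DIR" setting, otherwise
the current working directory) could be sketched roughly as follows. This is
only an illustration: the function name resolve_output_dir is hypothetical and
not taken from ArchiveBox, and reading "~/.ArchiveBox.conf" is left out.

```python
import os
from pathlib import Path


def resolve_output_dir(environ=None, cwd=None):
    """Return the directory that archive output should be written to.

    Hypothetical helper, not ArchiveBox's actual code: prefer the
    OUTPUT_DIR environment variable, otherwise fall back to the
    current working directory.
    """
    environ = os.environ if environ is None else environ
    configured = environ.get("OUTPUT_DIR", "").strip()
    if configured:
        # An explicit setting wins; expand a leading "~" if present.
        return Path(configured).expanduser()
    # No setting (or only whitespace): use the working directory.
    return Path(cwd) if cwd is not None else Path.cwd()
```

Passing the environment and working directory as arguments keeps the sketch
easy to test without touching the real process environment.
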
It has been fixed by recent updates to packaging. I could successfully
archive some web sites but archived debian.org seems to be missing CSS, so
the menu is not rendered properly...

I need a Scrapbook alternative as well but I'll be working slowly towards that
goal due to pressure from other priorities. I have yet to learn how to use
ArchiveBox... Also upstream is preparing some serious changes for the next
release -- I hope it won't be too difficult to package. I'd appreciate any
help with ArchiveBox.

Did you have a look at Archivematica by any chance?
#924040#40
Date:
2020-03-13 09:30:58 UTC
From:
To:
Thank you very much. I'll probably take over in a few weeks once I find
another window of opportunity to spend more time on ArchiveBox...
#924040#45
Date:
2021-01-12 13:26:07 UTC
From:
To:
Hi all,

I'm @pirate (the ArchiveBox creator).

Thanks for all of your work here so far with trying to get ArchiveBox
packaged. I've finally gotten around to packaging ArchiveBox for Debian
myself; it's currently on a PPA. If you have any problems or concerns with
it, just let me know.

https://github.com/ArchiveBox/debian-archivebox
https://github.com/ArchiveBox/ArchiveBox/blob/dev/stdeb.cfg
https://launchpad.net/~archivebox/+archive/ubuntu/archivebox

v0.5.3 (the latest version) has a small bug with a missing python package
(base32-crockford), but it will be fixed in v0.5.4 which I'll be releasing
shortly.

It's my first time packaging for debian so I might've gotten some things
wrong, but so far it seems to be working well on my Ubuntu test machines.


#924040#50
Date:
2021-02-20 01:05:06 UTC
From:
To:
Hi Martin, Antoine,

I've updated ArchiveBox to the latest release and it seems to work
well. Could you have a look and provide your feedback please?

https://salsa.debian.org/debian/archivebox

Thanks.
#924040#55
Date:
2022-07-21 05:27:35 UTC
From:
To:
Hi! and thanks for your efforts to package archivebox.

I have picked it up from there and tried to build the current version
(0.6.2). For that I have refreshed the patches, here is a MR:
https://salsa.debian.org/debian/archivebox/-/merge_requests/1

Feel free to pick up whatever you need from there.

There is a test error:

     ============================= test session starts ==============================
     platform linux -- Python 3.9.2, pytest-6.0.2, py-1.10.0, pluggy-0.13.0
     rootdir: /tmp/build-area/archivebox-0.6.2
     plugins: django-3.5.1, sugar-0.9.4
     collected 44 items / 1 error / 43 selected

     ==================================== ERRORS ====================================
     ____ ERROR collecting .pybuild/cpython3_3.9/build/tests/test_extractors.py _____
     ImportError while importing test module '/tmp/build-area/archivebox-0.6.2/.pybuild/cpython3_3.9/build/tests/test_extractors.py'.
     Hint: make sure your test modules/packages have valid Python names.
     Traceback:
     ...
     archivebox/system.py:14: in <module>
         from .vendor.atomicwrites import atomic_write as lib_atomic_write
     E   ModuleNotFoundError: No module named 'archivebox.vendor.atomicwrites'
     =========================== short test summary info ============================
     ERROR tests/test_extractors.py
     !!!!!!!!!!!!!!!!!!!! Interrupted: 1 error during collection !!!!!!!!!!!!!!!!!!!!
     =============================== 1 error in 0.60s ===============================

but this does not prevent the package from building (?).

Anyway, I'm not familiar with this style of tracking only the debian dir
in salsa (I normally use gbp and push the upstream sources as well). How
can you enable Salsa CI, BTW?

Paolo

#924040#60
Date:
2022-07-21 07:19:12 UTC
From:
To:
I installed the package above and when I tried the archivebox command I
got the same error about the missing atomicwrites module.

This is easy to fix by adding
lib/python3.9/site-packages/atomicwrites/__init__.py from
https://pypi.org/project/atomicwrites/ 1.4.1 as
debian/vendor/atomicwrites.py.
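
One way a module could tolerate a missing vendored copy is to try the vendored
import path first and fall back to a system-wide package. The sketch below is
an assumption for illustration, not how ArchiveBox or the Debian packaging
handles it; import_first_available is a hypothetical helper.

```python
import importlib


def import_first_available(module_names):
    """Import and return the first module in module_names that exists.

    Hypothetical helper: raises ImportError naming every candidate
    if none of them can be imported.
    """
    for name in module_names:
        try:
            return importlib.import_module(name)
        except ImportError:
            # Covers ModuleNotFoundError too; try the next candidate.
            continue
    raise ImportError(
        "none of these modules could be imported: " + ", ".join(module_names)
    )


# Usage (commented out because it needs archivebox or atomicwrites):
# prefer the vendored copy, else a system-wide atomicwrites.
# atomicwrites = import_first_available(
#     ["archivebox.vendor.atomicwrites", "atomicwrites"])
```

Note that catching ImportError this broadly can also hide an import failure
inside the vendored module itself, so this is a convenience, not a fix.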

A better way of vendoring dependencies would be to use gbp components,
so that they are versioned, uscan looks for new versions etc.

Another issue is that if I try to "archivebox add" a URL I get these
warnings:

     ! SINGLEFILE_BINARY: single-file (unable to detect version)
     ! READABILITY_BINARY: readability-extractor (unable to detect version)
     ! MERCURY_BINARY: mercury-parser (unable to detect version)

Indeed the page is archived as (I have the recommended dep chromium):
- Chrome > PDF
- Chrome Screenshot
- wget > HTML
- Archive.org
- Headers
- Chrome > HTML
- Media
- Git

But these fail:
- Chrome > SingleFile
- Readability
- Mercury
- Git (not sure why)

Getting the first three to work would require installing the JS
dependencies listed in package.json which is a mess.
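
A quick way to see which of these optional helper binaries are installed is to
look them up on PATH. The binary names below are copied from the warnings
above; missing_binaries is a hypothetical helper and does not reproduce
ArchiveBox's own version detection.

```python
import shutil


def missing_binaries(names):
    """Return the entries of `names` that cannot be found on PATH."""
    return [name for name in names if shutil.which(name) is None]


if __name__ == "__main__":
    # Names taken from the SINGLEFILE/READABILITY/MERCURY warnings.
    optional = ["single-file", "readability-extractor", "mercury-parser"]
    for name in missing_binaries(optional):
        print(f"not found on PATH: {name}")
```

A binary that is found on PATH may still fail version detection, so this only
narrows down the cause.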

But after the atomicwrites fix the package seems to be usable as-is and
worth uploading!

Paolo