For a while now I have experienced that my tor hidden services,
typically ssh on my servers behind NAT, sometimes become unavailable.
The only fix I have found so far is to restart the tor daemon (or
sometimes ask for the machine to be rebooted if restarting tor is out of
reach). I have also experienced the Debian APT services behind .onion
addresses becoming unavailable, and wondered if this is the same
problem.
With this background, I asked my local artificial idiocy setup (OpenCode
using local llama.cpp with model Qwen 3.6) to analyze the code and see
if it could find the cause and perhaps provide a fix, as well as create
a synthetic test that could trigger the problem and demonstrate that the
fix actually works. As far as I can tell, it was able to come up with an
explanation, a fix and a test, and I will test it on my servers in the
near future to see if it improves reliability. The problem is that it
is quite unpredictable when one of my servers becomes unavailable, so it
is hard to know if the fix worked. Because of this, I decided to share
the findings here right now, in case someone else can help me test it.
I plan to submit this patch upstream too, and have tried to request
access to the upstream gitlab, but got a 500 server error on my access
application and am not sure it made it through. I will wait a few days
to see if I get any response on the request.
In any case, attached are the patches, one for the test case and another
for the fix, for your consideration. Feel free to pass them upstream if
you believe it is the right fix.
I also asked the bullshit generator to explain its findings, which
resulted in the markdown text below. I am sharing it here as background
information. Note that the "original analysis" it mentions was done by
Claw Code using the same llama.cpp setup and model, and did not seem to
reach as good a solution as OpenCode.
# Tor Hidden Service Gradual Failure — Corrected Root Cause Analysis
## Problem Statement
Tor hidden services (v3 onion services) gradually stop working after running for some time. This affects both custom SSH services and standard ones. The failure is progressive rather than sudden: the service becomes increasingly unreachable until it stops entirely, requiring a restart of Tor to recover.
## Repository Context
- **Repo**: `tor-packaging` — Debian/Ubuntu packaging of Tor
- **Branch**: `debian-main`
- **Source**: Full Tor source tree under `src/` (Debian patch format)
---
## Corrected Root Cause: Intro Point `circuit_retries` Never Resets on Success
### The Bug
The per-intro-point field `ip->circuit_retries` (`hs_service.h:89`) is incremented every time a circuit is launched for an intro point, in `launch_intro_point_circuits()` at `hs_service.c:3005`:
```c
ip->circuit_retries++;
if (hs_circ_launch_intro_point(service, ip, ei, direct_conn) < 0) { ... }
```
This counter is **never reset to zero when the circuit succeeds**. It is only ever checked against `MAX_INTRO_POINT_CIRCUIT_RETRIES` (3, per `or.h`) in `should_remove_intro_point()` at `hs_service.c:2549-2550`:
```c
bool has_no_retries = (ip->circuit_retries > MAX_INTRO_POINT_CIRCUIT_RETRIES);
```
### Why This Causes Gradual Failure
The lifecycle of a single intro point under normal relay churn:
1. Intro point selected, circuit launched → `circuit_retries = 1`, circuit succeeds
2. ~30–60 minutes later, the circuit times out naturally (relay churn, hibernation, etc.)
3. Scheduled event triggers rebuild → `circuit_retries = 2`, circuit succeeds again
4. Another natural timeout → rebuild → `circuit_retries = 3`
5. Another timeout → rebuild → `circuit_retries = 4`, which exceeds `MAX_INTRO_POINT_CIRCUIT_RETRIES (3)` → **intro point removed** (`hs_service.c:2615`)
Each intro point can survive roughly 3–4 circuit lifecycles before being discarded. With a default of 3 intro points and circuits lasting ~30–60 minutes, after several hours all three intro points will have been eliminated, leaving the hidden service with zero functional introduction points.
A new intro point is eventually selected to replace the removed one (via descriptor regeneration), but it too accumulates retries across its own lifecycle, eventually getting dropped again. The net effect is a gradually degrading service: intro points are lost faster than they can be stably maintained.
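The lifecycle above can be modelled in a few lines of standalone C. This is only a sketch of the reasoning, not Tor code: `model_intro_point_t` and the `model_*` functions are hypothetical names, every circuit is assumed to open successfully, and the only values taken from the analysis are the limit of 3 and the `>` comparison.

```c
#include <assert.h>
#include <stdbool.h>

/* Limit quoted in the analysis above. */
#define MAX_INTRO_POINT_CIRCUIT_RETRIES 3

/* Hypothetical stand-in for the real intro point object; only the
 * counter matters for this model. */
typedef struct {
  unsigned int circuit_retries;
} model_intro_point_t;

/* Same comparison as the one quoted from should_remove_intro_point(). */
bool
model_has_no_retries(const model_intro_point_t *ip)
{
  return ip->circuit_retries > MAX_INTRO_POINT_CIRCUIT_RETRIES;
}

/* One circuit launch per lifecycle, and every circuit is assumed to open.
 * Returns the 1-based lifecycle in which the intro point becomes
 * removable, or 0 if it survives all max_lifecycles. */
unsigned int
model_lifecycles_until_removal(unsigned int max_lifecycles)
{
  model_intro_point_t ip = { 0 };
  assert(max_lifecycles > 0);
  for (unsigned int n = 1; n <= max_lifecycles; n++) {
    ip.circuit_retries++; /* launch: the counter goes up */
    /* the circuit opens here, but nothing resets the counter */
    if (model_has_no_retries(&ip))
      return n;
  }
  return 0;
}
```

Calling `model_lifecycles_until_removal(10)` returns 4 in this model, matching step 5 of the list: the intro point is dropped on its fourth launch even though no launch ever failed.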
### Why Restarting Tor Fixes It
Restarting Tor creates fresh `hs_service_intro_point_t` objects with `circuit_retries = 0`, resetting the counter for all intro points. This aligns exactly with the observed symptom that a restart recovers service.
---
## What Is NOT the Root Cause (Corrected from Original Analysis)
### The Retry Budget Window Is 5 Minutes, Not 3 Hours
The original analysis claimed `IntroPointPeriod` (~3 hours) gates the circuit launch budget. This is incorrect. The circuit launch rate-limiting uses `INTRO_CIRC_RETRY_PERIOD = 300 seconds` (5 minutes), defined at `hs_common.h:38`. The counter resets every 5 minutes (`hs_service.c:3065-3069`):
```c
if (now > (service->state.intro_circ_retry_started_time + INTRO_CIRC_RETRY_PERIOD)) {
service->state.num_intro_circ_launched = 0;
}
```
The `IntroPointPeriod` (~3 hours) controls when intro points are scheduled for rotation — a separate mechanism.
### The Budget Is Generous (28 circuits per window with defaults)
With default config (`NumIntroPoints = 3`, two descriptors):
- Base: 3 intro points + 2 extra = 5
- Retries: `3 × MAX_INTRO_POINT_CIRCUIT_RETRIES (3)` = 9
- Multiplier (two descriptors): × 2
- **Total: (5 + 9) × 2 = 28 circuits per 5-minute window** (`hs_service.c:3018-3022`)
Exhausting this budget requires all 28 attempts to fail within a single 5-minute window, which is unlikely under normal conditions. Budget exhaustion is not the primary cause of gradual failure.
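The arithmetic behind the figure of 28 can be checked with a small standalone function. This is a sketch under the assumptions listed above, not the function at `hs_service.c:3018-3022`: the function name and its parameters are hypothetical, and the number of extra intro points (2) and of descriptors (2) are passed in explicitly rather than read from a service object.

```c
#include <assert.h>

/* Limit quoted in the analysis above. */
#define MAX_INTRO_POINT_CIRCUIT_RETRIES 3

/* Per-window circuit budget as described in the list above.
 * num_extra is the number of extra intro points launched on top of the
 * wanted ones (2 in the worked example); num_descriptors is 0, 1 or 2. */
unsigned int
model_max_intro_circ_per_period(unsigned int num_intro_points,
                                unsigned int num_extra,
                                unsigned int num_descriptors)
{
  assert(num_descriptors <= 2);
  /* Base: wanted intro points plus the extra ones. */
  unsigned int count = num_intro_points + num_extra;
  /* Retries: each wanted intro point may be retried up to the limit. */
  count += num_intro_points * MAX_INTRO_POINT_CIRCUIT_RETRIES;
  /* The whole budget applies once per descriptor. */
  return count * num_descriptors;
}
```

With the defaults from the text, `model_max_intro_circ_per_period(3, 2, 2)` gives (5 + 9) × 2 = 28, and the budget halves to 14 while only one descriptor exists.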
### Default `NumIntroPoints` Is 3, Not 2
The default is `NUM_INTRO_POINTS_DEFAULT = 3` (`hs_common.h:30`). The original analysis stated 2.
---
## Code Paths
| File | Line(s) | Role |
|------|---------|------|
| `src/feature/hs/hs_service.h` | 85–89 | `circuit_retries` field declaration and comment |
| `src/feature/hs/hs_service.c` | 3005 | `ip->circuit_retries++` — incremented on every launch |
| `src/feature/hs/hs_service.c` | 2549–2550 | Comparison against `MAX_INTRO_POINT_CIRCUIT_RETRIES` |
| `src/feature/hs/hs_service.c` | 2615–2617 | Intro point removal when retries exceeded |
| `src/feature/hs/hs_circuit.c` | 1087+ | Circuit open callback — where fix belongs |
| `src/feature/hs/hs_common.h` | 38, 41 | `INTRO_CIRC_RETRY_PERIOD`, `MAX_INTRO_CIRCS_PER_PERIOD` |
---
## Diagnosing the Bug in Production
### Log Message to Look For
When an intro point is removed due to retry exhaustion, `should_remove_intro_point()` logs at `LOG_INFO` level (`hs_service.c:2578-2583`):
```
Intro point <desc> (retried: N times). Removing it.
```
**Distinguishing the bug from other removal reasons:** The log suffix tells you why:
| Suffix | Cause | Bug or normal? |
|--------|-------|----------------|
| *(none)* — just `(retried: N times)` | Retry exhaustion | **This is the bug** |
| ` has expired` | Time or INTRODUCE2 count limit reached | Normal rotation |
| ` fell off the consensus` | Relay disappeared from network | Normal churn |
So look for lines matching this pattern where there's **no suffix** and `N >= 4`:
```bash
grep -E 'Intro point.*\(retried: [0-9]+ times\)\. Removing it\.' /var/log/tor/log \
  | grep -Ev 'has expired|fell off the consensus'
```
Each such line means an intro point was removed solely because its retry counter accumulated across successful circuit builds rather than actual failures. When all three intro points are removed this way, the service has zero functional introduction points and becomes unreachable to clients.
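For anyone who prefers to post-process the log in a program, the suffix table above translates into a small classifier. This is a hedged sketch: the enum and function names are hypothetical, and it relies only on the fixed substrings listed in the table, searching for them anywhere in the line rather than assuming an exact position.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical labels for the three removal reasons in the table above. */
typedef enum {
  REMOVAL_NOT_A_REMOVAL_LINE,
  REMOVAL_EXPIRED,          /* normal rotation */
  REMOVAL_OFF_CONSENSUS,    /* normal churn */
  REMOVAL_RETRY_EXHAUSTION  /* the case this analysis is about */
} removal_reason_t;

/* Classifies one log line using only the substrings from the table. */
removal_reason_t
classify_removal_line(const char *line)
{
  assert(line != NULL);
  /* Not an intro point removal message at all. */
  if (strstr(line, "Intro point ") == NULL ||
      strstr(line, "(retried: ") == NULL ||
      strstr(line, "Removing it.") == NULL)
    return REMOVAL_NOT_A_REMOVAL_LINE;
  if (strstr(line, " has expired") != NULL)
    return REMOVAL_EXPIRED;
  if (strstr(line, " fell off the consensus") != NULL)
    return REMOVAL_OFF_CONSENSUS;
  /* No suffix: removed purely because of the retry counter. */
  return REMOVAL_RETRY_EXHAUSTION;
}
```

Feeding each line of the info-level log through `classify_removal_line()` and counting `REMOVAL_RETRY_EXHAUSTION` results per hour would show how quickly intro points are being lost to the counter.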
### Enabling Info-Level Logging
By default Tor only logs `notice` level. To see these messages, add to `torrc`:
```
Log info file /var/log/tor/hs-info.log
```
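Alternatively, send the same messages to syslog with `Log info syslog`. Whichever destination you pick, point the `grep` above at that log rather than at `/var/log/tor/log`.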
---
## Synthetic Test
A test reproducing the bug was added to `src/test/test_hs_service.c` as `test_circuit_retries_reset_on_open`. It:
1. Creates a service with 3 intro points (default).
2. Mocks `node_get_by_id` so nodes are always found — isolating retry-exhaustion removal from "fell off consensus" removal.
3. Runs **6 rounds** of circuit turnover (`MAX_INTRO_POINT_CIRCUIT_RETRIES + 3 = 6`): each round increments retries (simulating `launch_intro_point_circuits()` at `hs_service.c:3005`) then opens the circuit via `hs_service_circuit_has_opened()`.
4. Runs **`run_housekeeping_event()`** — the actual production housekeeping that calls `should_remove_intro_point()` → `cleanup_intro_points()`.
5. Asserts all 3 intro points survive.
**Without the fix:** retries accumulate to 6 (> MAX=3), so housekeeping removes every intro point, leaving zero — service unreachable. Test fails with `tt_u64_op(remaining, OP_EQ, num_ips)` assertion at end of test.
**With the fix:** each successful open resets retries to 0, so housekeeping never sees retries exceeding the limit. All 3 intro points survive. Test passes.
---
## Fix
Reset `ip->circuit_retries = 0` when an intro circuit successfully opens, in `hs_circ_service_intro_has_opened()` at `src/feature/hs/hs_circuit.c`. A successful circuit open means the relay is reachable — the counter should only track consecutive failures.
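The effect of the fix can be illustrated with a standalone model that mirrors the structure of the synthetic test: a number of launch and open rounds followed by the housekeeping check. This is not the attached patch. All `model_*` names are hypothetical, and the `apply_fix` flag exists only so both behaviours can be compared side by side.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Limit quoted in the analysis above. */
#define MAX_INTRO_POINT_CIRCUIT_RETRIES 3

/* Hypothetical stand-in for the real intro point object. */
typedef struct {
  unsigned int circuit_retries;
} model_intro_point_t;

/* Models the launch path: the counter goes up on every attempt. */
void
model_launch_circuit(model_intro_point_t *ip)
{
  assert(ip != NULL);
  ip->circuit_retries++;
}

/* Models the circuit-open callback. With apply_fix set, a successful
 * open clears the counter, so it only tracks consecutive failures. */
void
model_circuit_has_opened(model_intro_point_t *ip, bool apply_fix)
{
  assert(ip != NULL);
  if (apply_fix)
    ip->circuit_retries = 0;
}

/* Runs `rounds` successful launch and open cycles, then reports whether
 * the housekeeping comparison would keep the intro point. */
bool
model_survives(unsigned int rounds, bool apply_fix)
{
  model_intro_point_t ip = { 0 };
  for (unsigned int i = 0; i < rounds; i++) {
    model_launch_circuit(&ip);
    model_circuit_has_opened(&ip, apply_fix);
  }
  return !(ip.circuit_retries > MAX_INTRO_POINT_CIRCUIT_RETRIES);
}
```

In this model `model_survives(6, false)` is false and `model_survives(6, true)` is true, which is the same outcome the synthetic test reports for its 6 rounds without and with the fix.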
---
## Debian-Specific Patches Checked
No Debian patches modify hidden service circuit logic. The issue exists upstream.