## What Is a Stuck Job?
A job is considered stuck when it has been in the InProgress state for longer than the configured timeout without producing any output, tool calls, or state updates. Common causes:

- LLM provider timeout or rate limit with no retry budget remaining
- Tool call hanging on an unresponsive external service
- Container resource exhaustion (OOM, CPU throttle)
- Network partition between the agent and a sandboxed worker
## Configuration
`SELF_REPAIR_TIMEOUT_SECS` should be set lower than `SANDBOX_TIMEOUT_SECS`. The sandbox enforces a hard kill; self-repair is a soft recovery that runs before the hard kill triggers.

## Detection
The self-repair system runs as a background task alongside the scheduler. It periodically scans all InProgress jobs and compares the last-activity timestamp against the stuck threshold.

The `tool_failures` table accumulates failure records per job and per tool. This data is used to assess whether recovery is worth attempting again or whether the job should be moved directly to Failed.
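The scan described above can be sketched as follows. The struct and function names are illustrative assumptions for this example, not the actual ironclaw code:

```rust
use std::time::{Duration, Instant};

/// Minimal stand-in for a job's activity record (names assumed).
struct JobActivity {
    last_activity: Instant,
}

/// A job counts as stuck when its last activity is older than the
/// configured threshold (`SELF_REPAIR_TIMEOUT_SECS`).
fn is_stuck(job: &JobActivity, timeout: Duration, now: Instant) -> bool {
    now.duration_since(job.last_activity) > timeout
}

/// One scan pass: return the indices of all InProgress jobs that
/// exceeded the threshold, for the repair task to act on.
fn find_stuck(jobs: &[JobActivity], timeout: Duration, now: Instant) -> Vec<usize> {
    jobs.iter()
        .enumerate()
        .filter(|(_, job)| is_stuck(job, timeout, now))
        .map(|(i, _)| i)
        .collect()
}
```

In the real system this pass would run on an interval alongside the scheduler; here it is reduced to a pure function over timestamps.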
## Recovery Flow
When a stuck job is detected, the self-repair system attempts to restart it:

1. **Transition to Stuck.** The job state changes from InProgress to Stuck. The event is logged with the failure reason and timestamp.
2. **Inspect failure history.** The system checks the `tool_failures` table for this job. If the job has exceeded the maximum retry count, it is transitioned directly to Failed and recovery is skipped.
3. **Re-enter InProgress.** If retries remain, the job transitions back to InProgress. A new worker picks it up and resumes execution from the last saved checkpoint.
## State Diagram
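The recovery flow can be summarized as a small state machine. The enum and function names below are illustrative sketches under the states named in this document, not the actual ironclaw types:

```rust
/// The three states involved in self-repair (sketch; the real job
/// lifecycle has additional states such as completion).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum JobState {
    InProgress,
    Stuck,
    Failed,
}

/// Recovery decision for a job that was just marked Stuck: give up once
/// the retry budget is exhausted, otherwise resume execution.
fn recover(failure_count: u32, max_retries: u32) -> JobState {
    if failure_count >= max_retries {
        JobState::Failed
    } else {
        JobState::InProgress
    }
}

/// Transitions the self-repair flow permits.
fn transition_allowed(from: JobState, to: JobState) -> bool {
    use JobState::*;
    matches!(
        (from, to),
        (InProgress, Stuck)       // stuck detection fired
            | (Stuck, InProgress) // retry budget remaining
            | (Stuck, Failed)     // retries exhausted
    )
}
```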
## Tool Failure Tracking
Every time a tool fails during job execution, the event is recorded:

| Field | Description |
|---|---|
| `job_id` | The job that experienced the failure |
| `tool_name` | Which tool failed |
| `error` | Error message or failure reason |
| `occurred_at` | Timestamp of the failure |
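An in-memory sketch of a `tool_failures` record and the per-tool aggregate the recovery flow consults. The Rust types are assumptions derived from the field list above, not the actual table schema:

```rust
use std::collections::HashMap;

/// Illustrative shape of one `tool_failures` row (types assumed).
#[derive(Debug, Clone)]
struct ToolFailure {
    job_id: u64,
    tool_name: String,
    error: String,
    occurred_at: u64, // unix timestamp; placeholder for a real datetime type
}

/// Count failures per tool for one job -- the kind of aggregate used to
/// decide whether another recovery attempt is worthwhile.
fn failures_by_tool(records: &[ToolFailure], job_id: u64) -> HashMap<&str, u32> {
    let mut counts = HashMap::new();
    for r in records.iter().filter(|r| r.job_id == job_id) {
        *counts.entry(r.tool_name.as_str()).or_insert(0) += 1;
    }
    counts
}
```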
## Observability
Stuck and recovered jobs are visible in:

- **Job history** — The web gateway’s job list shows state transitions with timestamps
- **Logs** — `RUST_LOG=ironclaw::agent::self_repair=debug` for detailed repair events
- **`list_jobs` tool** — Shows current state, including Stuck jobs
## Troubleshooting
### Job repeatedly gets stuck
- Check `RUST_LOG=ironclaw::agent::self_repair=debug` for failure reasons
- Inspect tool failure records via the `job_status` tool
- Consider increasing `SELF_REPAIR_TIMEOUT_SECS` if the job is legitimately long-running
- Check network connectivity to the LLM provider and external services
### Job fails immediately after recovery
- The tool failure history may reveal a specific tool that always fails
- Check whether the external service the tool depends on is available
- Review sandbox logs if the job runs in a container (`SANDBOX_ENABLED=true`)
### Stuck jobs not being detected
- Verify `SELF_REPAIR_ENABLED=true`
- Check that `SELF_REPAIR_TIMEOUT_SECS` is not set too high
- Confirm the self-repair monitor is running: look for `self_repair` in startup logs