fix(screentracker): make writeStabilize Phase 1 non-fatal when agents don't echo input #208

Open
johnstcn wants to merge 4 commits into main from fix/write-stabilize-non-fatal-phase1

johnstcn commented Mar 31, 2026

Fixes #123.

Changes

  • Make Phase 1 (echo detection) of writeStabilize non-fatal on timeout — proceed to Phase 2 instead of returning HTTP 500
  • Guard non-fatal path with errors.Is(err, util.WaitTimedOut) so context cancellation still propagates
  • Reduce Phase 1 timeout from 15s to 2s (terminal echo is near-instant)
  • Extract writeStabilizeEchoTimeout and writeStabilizeProcessTimeout constants
  • Log at Info level (not Warn) since non-echoing agents hit this on every message
  • Add send-message-no-echo-agent-reacts test: agent does not echo but reacts to Enter → success
  • Add send-message-no-echo-no-react test: agent is unresponsive → error from Phase 2
  • Add send-message-no-echo-context-cancelled test: context cancellation during Phase 1 propagates as fatal (validates errors.Is guard)
  • Add doc comment on formatPaste in claude.go documenting the ESC limitation with TUI selection prompts

Known limitation

For TUI selection prompts (numbered/arrow-key lists), this fix eliminates the 500 but does not deliver the correct selection — the \x1b (ESC) in bracketed paste cancels the selection widget. The correct approach is MessageTypeRaw. Documented via a comment on formatPaste in lib/httpapi/claude.go.

Also discovered a separate issue during smoke-testing: #209

Implementation plan and decision log

Root cause

writeStabilize Phase 1 assumes the screen will change after writing message text (echo detection). TUI agents using bracketed paste buffer input internally and do not render until Enter. Phase 1 waited 15s for a change that never came → timeout → HTTP 500.

Key decisions

| Decision | Rationale |
| --- | --- |
| Non-fatal only for WaitTimedOut | ctx.Err() must still propagate; otherwise context cancellation logs a misleading warning and writes a spurious \r |
| 2s timeout (down from 15s) | Echo is near-instant; WaitFor polls at 50ms intervals (5+ checks/s). 2s is generous. |
| Info level, not Warn | Non-echoing agents hit this on every message. Warn implies something a human should investigate. |
| "echo detection timed out" log message | Matches codebase style (short, descriptive). Structured timeout field carries the duration. |
| Doc comment on formatPaste instead of screentracker test | Per review feedback (mafredri P3): the ESC limitation lives in the formatting layer, not the screentracker layer. A comment is cheaper and equally durable. |

Behavioral changes

  • Slow-echo agents (>1s): may now trigger the non-fatal timeout. Benign — Phase 2 still succeeds.
  • Unresponsive agents: total timeout increases from 15s to ~17s (2s + 15s). Carriage return is now sent before failing, leaving PTY in a more consistent state.

🤖 Written by a Coder Agent. Will be reviewed by a human.

…agents don't echo input

Phase 1 of writeStabilize waited 15s for the screen to change after
writing message text (echo detection), returning HTTP 500 if it didn't.
Many TUI agents using bracketed paste don't echo input until Enter is
pressed, causing every message send to fail.

Phase 1 timeout is now non-fatal (2s) — if the screen doesn't change,
we log at Info level and proceed to Phase 2 (send carriage return).
Phase 2 remains the real indicator of agent responsiveness.

Key changes:
- Guard non-fatal path with errors.Is(err, util.WaitTimedOut) so
  context cancellation still propagates as a fatal error
- Reduce Phase 1 timeout from 15s to 2s (echo is near-instant)
- Extract named constants for both timeouts
- Add tests for no-echo-success and no-echo-no-react-failure
- Add documentation test for TUI selection prompt ESC limitation

Closes #123
@github-actions

✅ Preview binaries are ready!

To test with modules: agentapi_version = "agentapi_208" or download from: https://github.com/coder/agentapi/releases/tag/agentapi_208

…tests

Restructure test comments to follow Cucumber-style Given/When/Then
pattern for clarity. Also fix send-no-echo-agent-reacts assertion
to scan for the user message instead of assuming it's the last
message in the conversation (the snapshot loop may append an agent
message after Send returns).
johnstcn self-assigned this Mar 31, 2026
johnstcn requested a review from mafredri March 31, 2026 11:58

mafredri left a comment

Clean design. Making Phase 1 non-fatal for WaitTimedOut while preserving context cancellation as fatal is the right call. The errors.Is guard, extracted constants, and 2s timeout are all well-calibrated. Two P2 findings (missing test coverage for the key invariant, gofmt failure), two P3s (doc accuracy, test layering), and a handful of notes.

"Oh, this test suite looks lovely! Fifty rows, full coverage, green across the board. It's fake. Every row hits the same code path. You dressed up one test in fifty outfits." -- Bisky, on a different test. The new tests here are mostly genuine.

Severity count: 0 P0, 0 P1, 2 P2, 2 P3, 3 Notes.

🤖 This review was automatically generated with Coder Agents.

- Add send-message-no-echo-context-cancelled test: verifies the
  errors.Is(WaitTimedOut) guard by cancelling context during Phase 1
  and asserting context.Canceled propagates (P2)
- Fix gofmt: correct indentation, proper brace placement (P2)
- Fix constant comment: describe WaitFor timeout semantics accurately,
  note 1s stability check can extend past timeout, add TODO tag (P3)
- Drop send-tui-selection-esc-cancels test from screentracker, add
  ESC limitation comment to formatPaste in claude.go instead (P3)
- Shorten log message to match codebase style (Note)
- Rename tests to send-message-* prefix, use newConversation helper
  with opts callbacks (Note)

The test had a race: advanceFor could complete before the Send()
goroutine enqueued, so the stableSignal never fired, and sendCancel
ran while the message was still queued (never reaching writeStabilize).

Fix: use onWrite callback as a synchronization point. advanceUntil
waits for writeStabilize to start writing (onWrite fires), then
cancels. This guarantees Phase 1 WaitFor is running when ctx is
cancelled, and its sleep select sees ctx.Done() immediately.
johnstcn marked this pull request as ready for review March 31, 2026 15:28
Copilot AI review requested due to automatic review settings March 31, 2026 15:28

Copilot AI left a comment

Pull request overview

This PR updates the screentracker PTY conversation send pipeline to avoid failing requests when an agent doesn’t echo typed input during writeStabilize Phase 1 (echo detection), and adds tests to codify the new behavior.

Changes:

  • Make writeStabilize Phase 1 (echo detection) timeout non-fatal and proceed to Phase 2 (processing detection), while still propagating context cancellation.
  • Reduce Phase 1 timeout and extract Phase 1/2 timeouts into constants.
  • Add new test coverage for non-echoing agents (reacting vs unresponsive) and context-cancellation behavior; add clarifying documentation comment about bracketed paste and TUI selection cancellation.

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 3 comments.

| File | Description |
| --- | --- |
| lib/screentracker/pty_conversation.go | Makes echo-detection timeout non-fatal, adds timeouts as constants, and adjusts logging/error handling. |
| lib/screentracker/pty_conversation_test.go | Adds tests for non-echoing agents, unresponsive agents, and context cancellation during Phase 1. |
| lib/httpapi/claude.go | Documents bracketed-paste ESC interaction and suggests MessageTypeRaw for TUI selection prompts. |

Development

Successfully merging this pull request may close these issues.

failed to send message: failed to send message: failed to wait for screen to stabilize: timeout waiting for condition
