Skip to content

Several fixes to worker logic#57

Open
C-Achard wants to merge 22 commits intojaap/add_compatibility_testsfrom
cy/pre-release-2.0-review1
Open

Several fixes to worker logic#57
C-Achard wants to merge 22 commits intojaap/add_compatibility_testsfrom
cy/pre-release-2.0-review1

Conversation

@C-Achard
Copy link

@C-Achard C-Achard commented Mar 2, 2026

Motivation

The following addresses several review comments in #38, bringing fixes and improvements to workers lifecycles with a focus on safety and error handling.

Overall this is more of "thorough but still minimal" approach. This may benefit from further refactoring in the future yet most of these problems are not really expected to occur by design, so avoiding overengineering may be desirable.

NOTE: To avoid conflicts with #56, this PR is based on it.


This pull request introduces several improvements and bug fixes across the camera control, video recording, and pose estimation workers, with a particular focus on thread safety, lifecycle management, and error handling.

The most significant changes include a major refactor of the DLCLiveProcessor worker lifecycle, enhancements to multi-camera thread management, and improved robustness in the video recorder. These updates aim to make the system more reliable during startup, shutdown, and error scenarios.

DLCLiveProcessor Lifecycle & Thread Safety:

  • Refactored the worker thread management in DLCLiveProcessor by introducing a WorkerState enum and a _lifecycle_lock to ensure safe state transitions and prevent race conditions during startup and shutdown. The worker's state is now explicitly tracked, and error/fault states are handled more robustly. (dlclivegui/services/dlc_processor.py)

  • The configure method now raises an error if called while the processor is running, enforcing correct usage patterns and preventing misconfiguration. (dlclivegui/services/dlc_processor.py)

Multi-Camera Controller Improvements:

  • Improved the logic for tracking expected cameras by using unique camera IDs, which prevents issues with duplicate or reused IDs. (dlclivegui/services/multi_camera_controller.py)

  • Enhanced camera thread shutdown: If a camera thread fails to quit gracefully, it is now forcibly terminated, and an error is logged, increasing robustness against frozen threads. (dlclivegui/services/multi_camera_controller.py)

Video Recorder Robustness:

  • Improved the write and stop methods in the video recorder to better handle stop events, queue state, and thread termination. If the recorder thread does not terminate cleanly, a warning is logged. (dlclivegui/services/video_recorder.py) [1] [2]

General Bug Fixes & Minor Improvements:

  • Fixed a potential bug in camera ID selection logic by replacing a None check with a truthy check. (dlclivegui/gui/main_window.py)
  • Clarified timestamp handling in frame writing to avoid ambiguity when timestamp=0. (dlclivegui/gui/recording_manager.py)
  • Updated docstrings and minor code comments for clarity. (dlclivegui/services/multi_camera_controller.py, dlclivegui/services/dlc_processor.py) [1] [2]

These changes collectively improve the reliability and maintainability of the system, especially under edge cases involving thread management and error recovery.

C-Achard added 9 commits March 2, 2026 10:58
Preserve explicit timestamp values and strengthen worker shutdown behavior. Key changes:
- recording_manager: use timestamp if timestamp is not None else time.time() so falsy timestamps (e.g. 0) are preserved instead of being overridden.
- services/dlc_processor: import DLCLive at module top and enable profiling; make _stop_worker return a boolean and have reset() check that return value to avoid resetting the DLCLive instance while the worker thread is still alive (with warnings/logging). Adjusted comments about thread termination.
- gui/main_window: use a truthiness check for dlc_cam_id to avoid treating empty values as valid IDs.
These updates prevent spurious timestamp overrides, avoid unsafe DLCLive resets when threads are still running, and clarify import/profiling intent.
Introduce a WorkerState enum and import Enum/auto in dlc_processor.py to represent worker lifecycle states (STOPPED, STARTING, RUNNING, STOPPING, FAULTED). This centralizes state representation for processing workers and improves clarity for future state management.
Add explicit worker lifecycle management and synchronization to DLCLive processor: introduce WorkerState, a lifecycle Lock, and stop/start state transitions. Make _start/_stop behavior safer by snapshotting the queue, returning boolean on stop, joining the thread with fault handling, and avoiding races by using a local queue reference in the worker loop. Harden enqueue_frame to respect lifecycle/state/stop flags and reduce copies by reusing a single frame copy and enqueuing enqueue timestamp. Improve DLCLive initialization error handling (set FAULTED on failure) and ensure caller-visible errors when DLCLive is not initialized. Prevent configuring the processor while running and add logging/warnings for shutdown/termination issues. Minor cleanup: adjust where initialized/state flags are set and remove redundant checks.
When queuing work, ensure the background worker is started if the worker thread is None or not alive by calling _start_worker_locked(frame_c, timestamp). Also clear the _stop_event in both the early-return (no thread) and final cleanup paths when stopping so a stale stop flag doesn't prevent future restarts.
Improve error handling in the worker loop when the queue is missing or queue.get raises unexpected exceptions. When q is None the code now logs a clearer message, sets the worker state to WorkerState.FAULTED under the lifecycle lock, and emits an error signal. Also adds a broad except around queue.get to log the exception, set FAULTED state, emit the error message, and break the loop to avoid silent failures.
Make multi-camera shutdown more robust by iterating threads with their camera IDs, skipping non-running threads, and quitting each thread. If a thread fails to stop within 5000ms, log an error and call terminate() then wait an additional 1000ms to ensure it stops. Also simplify the start() docstring.
Previously _expected_cameras was set to len(active_settings), which could overcount when multiple CameraSettings referred to the same physical camera. This change computes a set of unique camera IDs (via get_camera_id) and sets _expected_cameras to its length, ensuring duplicates aren't double-counted.
Add robust shutdown and error handling for the video recorder. Prevent writes after stop by checking the stop event early; record and raise encoding errors as before. Refactor stop() to use local vars, signal the writer via a sentinel, join the writer thread safely, and avoid saving timestamps prematurely. Rework _writer_loop to handle a missing queue, broaden exception handling when pulling/writing items, ensure q.task_done() is always called, properly handle the sentinel, update encoding stats and logging, and perform final cleanup (save timestamps, clear queue and thread) when exiting.
Consolidate the writer-thread loop and tighten up exception handling. Removed the conditional _save_timestamps path during teardown. In the writer loop sentinel handling and frame processing are unified; on write errors the exception is stored under _stats_lock, logged, and stop_event/stop_now are set. The stats updates (frames_written, total_latency, last_latency, written_times, frame_timestamps and periodic FPS computation) were moved into the same control flow to ensure consistent accounting and termination behavior.
@C-Achard C-Achard self-assigned this Mar 2, 2026
@C-Achard C-Achard added enhancement New feature or request camera Related to cameras and camera backends DLClive Related to DLCLive inference, versioning, model export... recording Related to video writing, codecs, .. labels Mar 2, 2026
@C-Achard C-Achard mentioned this pull request Mar 2, 2026
20 tasks
Copy link

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR applies lifecycle and safety improvements across worker-style components (DLCLive pose processor, multi-camera controller, and video recorder), focusing on more robust startup/shutdown behavior and clearer handling of error conditions.

Changes:

  • Refactors DLCLiveProcessor worker lifecycle with explicit state tracking and a lifecycle lock.
  • Hardens multi-camera shutdown and expected-camera accounting using unique camera IDs.
  • Improves video recorder stop/write behavior and shifts timestamp persistence into the writer thread.

Reviewed changes

Copilot reviewed 5 out of 5 changed files in this pull request and generated 5 comments.

Show a summary per file
File Description
dlclivegui/services/video_recorder.py Adjusts write/stop flow and writer-loop robustness; moves timestamp saving to thread exit.
dlclivegui/services/multi_camera_controller.py Tracks expected cameras by unique IDs; adds forced termination for frozen camera threads.
dlclivegui/services/dlc_processor.py Introduces WorkerState + _lifecycle_lock and updates enqueue/start/stop lifecycle behavior.
dlclivegui/gui/recording_manager.py Fixes timestamp handling so timestamp=0 is preserved rather than replaced.
dlclivegui/gui/main_window.py Fixes inference camera ID selection logic by using a truthy string check.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

C-Achard added 3 commits March 2, 2026 13:48
Add cooperative shutdown timeouts and robust cleanup for camera threads: introduce QUIT_WAIT_MS and TERMINATE_WAIT_MS, connect thread.finished to a new _cleanup_camera that removes frame data and deletes worker/thread QObjects, and enhance stop() to wait, terminate, and log/retain references if threads refuse to terminate (avoids use-after-free/segfaults). Adjust per-camera shutdown checks and ensure frames/timestamps are cleared only after safe termination. In video_recorder, remove premature clearing of the queue and _writer_thread during finalization to avoid unsafe cleanup while writer thread may still be exiting.
Introduce lifecycle locking and an "abandoned" state to VideoRecorder to prevent unsafe restarts and improve shutdown handling. start() now acquires a lifecycle lock, refuses to restart when a leftover thread is marked abandoned, and initializes/reset writer state inside the lock. stop() now marks the recorder as abandoned and records an encode error if the writer thread fails to terminate within the timeout, and ensures timestamps are saved. The writer loop was simplified to break directly on sentinel or encode errors and now closes WriteGear during finalization with exception handling. Miscellaneous reordering and small bookkeeping fixes were made to improve safety and logging during shutdown.
Copy link

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 5 out of 5 changed files in this pull request and generated 8 comments.

Comments suppressed due to low confidence (1)

dlclivegui/services/video_recorder.py:344

  • The writer is closed in _writer_loop() (finally block) and then closed again in stop() after join(). If WriteGear.close() isn't idempotent, this can raise/log exceptions on normal shutdown. Consider choosing a single place to close the writer (either in the worker thread only, or only in stop() after join) and remove/adjust the other path accordingly. Also, _finalize_writer() appears unused now; removing or reusing it would reduce dead code.
        finally:
            writer = self._writer
            if writer is not None:
                try:
                    writer.close()
                except Exception:
                    logger.exception("Failed to close WriteGear during writer loop finalization")

    def _finalize_writer(self) -> None:
        writer = self._writer
        self._writer = None
        if writer is not None:
            try:
                writer.close()
                time.sleep(0.2)  # give some time to finalize

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

C-Achard added 3 commits March 2, 2026 14:33
dlclivegui/services/dlc_processor.py: Move the frame copy and enqueue timestamp into the lifecycle lock to avoid unnecessary copying when the processor is stopped; on initialization errors set WorkerState.FAULTED under the lifecycle lock before emitting the error.

dlclivegui/services/multi_camera_controller.py: Reformat the running-cameras shutdown check into a multi-line condition (keeps the same intent of detecting normal shutdown when no cameras are started and all threads are stopped).

dlclivegui/services/video_recorder.py: Add best-effort cleanup for a stale video writer in start() by attempting to call its close() and logging any errors before clearing writer/queue/thread references. Wrap stop() logic with the lifecycle lock to avoid races when stopping, preserve timeout/abandon behavior for a non-terminating writer thread, and ensure writer closure and timestamp saving are handled with proper error logging.
Detect and remove stale camera worker/thread state when starting cameras and during shutdown. _start_camera now checks the existing thread for a cam_id and calls _cleanup_camera if the worker exists but the thread is missing or not running, and updates the log message to indicate the camera is already running. The shutdown loop likewise invokes _cleanup_camera for None or not-running threads and after successful quit/terminate waits to ensure worker/thread entries are cleared. These changes prevent stale entries from blocking restarts and ensure cleaner shutdown behavior.
Introduce STOP_JOIN_TIMEOUT (5s) and use it for the writer thread join timeout instead of a hardcoded value. Reorder lifecycle checks in VideoRecorder.start to raise if the recorder is marked _abandoned before short-circuiting on is_running, preventing accidental restarts of abandoned recorders. Add a BlockingWriteGear fake, fixture, and a test (test_stop_timeout_marks_abandoned_and_prevents_restart) that forces a stop timeout, verifies the recorder is marked abandoned, prevents restart, and performs cleanup of the blocked writer thread.
@C-Achard C-Achard requested a review from Copilot March 2, 2026 14:05
Copy link

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 6 out of 6 changed files in this pull request and generated 5 comments.


💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Minimize time holding the DLCLiveProcessor lifecycle lock by moving expensive operations (frame.copy and timestamp handling) outside the lock and re-checking state before starting the worker to avoid races. Correct a duplicated condition in MultiCameraController so normal shutdown is detected using self._running. Move writer cleanup into a dedicated _finalize_writer(), add a warning when Queue.task_done() is called too many times, and centralize writer close/exception handling.
Copy link

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 6 out of 6 changed files in this pull request and generated 3 comments.

Comments suppressed due to low confidence (1)

dlclivegui/services/multi_camera_controller.py:224

  • stop() sets self._running = False immediately, but the method can return early when still_running is non-empty (threads refused to quit/terminate). That leaves the controller reporting "not running" (is_running() becomes false) while camera threads are still active, and it also prevents subsequent stop() calls from retrying because the top guard (if not self._running: return) will short-circuit. Consider only setting _running = False after a successful full stop, or introducing a FAULTED/STOPPING state that still allows repeated stop attempts while threads are alive.
    def stop(self, wait: bool = True) -> None:
        """Stop all cameras."""
        if not self._running:
            return

        self._running = False


💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

C-Achard added 3 commits March 2, 2026 16:17
Introduce STOP_WORKER_TIMEOUT and a background reaper to handle worker threads that don't terminate within the stop timeout. Increase join timeout to STOP_WORKER_TIMEOUT, set _pending_reset when a reset is requested while the worker is alive, and schedule _schedule_reap to join the thread in the background instead of immediately marking the processor as FAULTED. The reaper performs final cleanup (clearing thread, queue, state, stop event) and applies a pending DLCLive reset once the thread is reaped, with guards to ensure only one reaper runs at a time.
Cache self._queue to a local variable (q) and use it for subsequent operations to avoid races and repeated attribute access. Remove the redundant assert and simplify qsize lookup by relying on the cached reference. This makes queue checks and put/qsize calls consistent in a multithreaded context and prevents potential None/attribute changes between checks and use.
Add a lifecycle lock and guard in DLCLiveProcessor.configure to ensure the processor is stopped before reconfiguration. The method now acquires self._lifecycle_lock and raises a RuntimeError if the processor state is not WorkerState.STOPPED, preventing concurrent/race-condition updates to settings or processor while running.
Copy link

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 6 out of 6 changed files in this pull request and generated 5 comments.


💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Ensure proper cleanup and restartability when writer threads time out: VideoRecorder.stop now clears abandoned state and associated resources if the writer thread has already exited, allowing the recorder to be fully stopped and restarted. DLCLiveProcessor: moved settings/processor assignment under the lifecycle lock, and mark _pending_reset when shutdown cannot stop the worker thread so the instance can be reaped later. Minor comment tweak for STOP_WORKER_TIMEOUT semantics. Tests updated to assert abandoned is cleared and that recorder can be restarted after the writer exits.
Copy link

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 6 out of 6 changed files in this pull request and generated 2 comments.


💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Detect and early-return when recorder is already stopped or recovered from an abandoned state, and reduce the time holding the lifecycle lock by moving queue/thread/writer shutdown work outside the lock. Ensure sentinel is enqueued, writer thread is joined with timeout (marking recorder as abandoned and recording an encode_error if it remains alive), close the writer cleanly, save timestamps, and finally clear lifecycle fields under the lock. Improves concurrency, cleanup correctness, and prevents restarting a recorder that was marked abandoned.
@C-Achard C-Achard marked this pull request as ready for review March 2, 2026 16:34
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

camera Related to cameras and camera backends DLClive Related to DLCLive inference, versioning, model export... enhancement New feature or request recording Related to video writing, codecs, ..

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants