SharpAI · solderzzc · Mar 18, 2026 · Mar 15, 2026 · Mar 15, 2026 · Mar 15, 2026
diff --git a/.agents/workflows/command-execution.md b/.agents/workflows/command-execution.md
@@ -0,0 +1,68 @@
+---
+description: Best practices for running terminal commands to prevent stuck "Running.." states
+---
+
+# Command Execution Best Practices
+
+These rules prevent commands from getting stuck in a "Running.." state due to the IDE
+failing to detect command completion. Apply these on EVERY `run_command` call.
+
+## Rule 1: Use High `WaitMsBeforeAsync` for Fast Commands
+
+For commands expected to finish within a few seconds (git status, git log, git diff --stat,
+ls, cat, echo, pip show, python --version, etc.), ALWAYS set `WaitMsBeforeAsync` to **5000**.
+
+This gives the command enough time to complete synchronously so the IDE never sends it
+to background monitoring (where completion detection can fail).
+
+```
+WaitMsBeforeAsync: 5000   # for fast commands (< 5s expected)
+WaitMsBeforeAsync: 500    # ONLY for long-running commands (servers, builds, installs)
+```
+
+## Rule 2: Limit Output to Prevent Truncation Cascades
+
+When output gets truncated, the IDE may auto-trigger follow-up commands (like `git status --short`)
+that can get stuck. Prevent this by limiting output upfront:
+
+- Use `--short`, `--stat`, `--oneline`, `-n N` flags on git commands
+- Pipe through `head -n 50` for potentially long output
+- Use `--no-pager` explicitly on git commands
+- Prefer `git diff --stat` over `git diff` when full diff isn't needed
+
+Examples:
+```bash
+# GOOD: limited output
+git log -n 5 --oneline
+git diff --stat
+git diff -- path/to/file.py | head -n 80
+
+# BAD: unbounded output that may truncate
+git log
+git diff
+```
+
+## Rule 3: Batch Related Quick Commands
+
+Instead of running multiple fast commands sequentially (which can cause race conditions),
+batch them into a single call with separators:
+
+```bash
+# GOOD: one call, no race conditions
+git status --short && echo "---" && git log -n 3 --oneline && echo "---" && git diff --stat
+
+# BAD: three separate rapid calls
+# Call 1: git status --short
+# Call 2: git log -n 3 --oneline
+# Call 3: git diff --stat
+```
+
+## Rule 4: Always Follow Up Async Commands with `command_status`
+
+If a command goes async (returns a background command ID), immediately call `command_status`
+with `WaitDurationSeconds: 30` to block until completion rather than leaving it in limbo.
+
+## Rule 5: Terminate Stuck Commands
+
+If a command appears stuck in "Running.." but should have completed, use `send_command_input`
+with `Terminate: true` to force-kill it, then re-run with a higher `WaitMsBeforeAsync`.
diff --git a/README.md b/README.md
@@ -71,8 +71,8 @@ Each skill is a self-contained module with its own model, parameters, and [commu
 | **Detection** | [`yolo-detection-2026`](skills/detection/yolo-detection-2026/) | Real-time 80+ class detection — auto-accelerated via TensorRT / CoreML / OpenVINO / ONNX | ✅|
 | **Analysis** | [`home-security-benchmark`](skills/analysis/home-security-benchmark/) | [143-test evaluation suite](#-homesec-bench--how-secure-is-your-local-ai) for LLM & VLM security performance | ✅ |
 | **Privacy** | [`depth-estimation`](skills/transformation/depth-estimation/) | [Real-time depth-map privacy transform](#-privacy--depth-map-anonymization) — anonymize camera feeds while preserving activity | ✅ |
-| **Annotation** | [`sam2-segmentation`](skills/annotation/sam2-segmentation/) | Click-to-segment with pixel-perfect masks | 📐 |
-| | [`dataset-annotation`](skills/annotation/dataset-annotation/) | AI-assisted labeling → COCO export | 📐 |
+| **Segmentation** | [`sam2-segmentation`](skills/segmentation/sam2-segmentation/) | Interactive click-to-segment with Segment Anything 2 — pixel-perfect masks, point/box prompts, video tracking | ✅ |
+| **Annotation** | [`dataset-annotation`](skills/annotation/dataset-annotation/) | AI-assisted dataset labeling — auto-detect, human review, COCO/YOLO/VOC export for custom model training | ✅ |
 | **Training** | [`model-training`](skills/training/model-training/) | Agent-driven YOLO fine-tuning — annotate, train, export, deploy | 📐 |
 | **Automation** | [`mqtt`](skills/automation/mqtt/) · [`webhook`](skills/automation/webhook/) · [`ha-trigger`](skills/automation/ha-trigger/) | Event-driven automation triggers | 📐 |
 | **Integrations** | [`homeassistant-bridge`](skills/integrations/homeassistant-bridge/) | HA cameras in ↔ detection results out | 📐 |

diff --git a/docs/paper/.gitignore b/docs/paper/.gitignore
@@ -0,0 +1,10 @@
+# LaTeX build artifacts
+*.aux
+*.log
+*.out
+*.synctex.gz
+*.toc
+*.bbl
+*.blg
+*.fls
+*.fdb_latexmk
diff --git a/docs/paper/home-security-benchmark.pdf b/docs/paper/home-security-benchmark.pdf
diff --git a/docs/paper/home-security-benchmark.tex b/docs/paper/home-security-benchmark.tex
@@ -71,9 +71,9 @@
 tool selection across five security-domain APIs, extraction of durable
 knowledge from user conversations, and scene understanding from security
 camera feeds including infrared imagery. The suite comprises
-\textbf{16~test suites} with \textbf{131~individual tests} spanning both
+\textbf{16~test suites} with \textbf{143~individual tests} spanning both
 text-only LLM reasoning (96~tests) and multimodal VLM scene analysis
-(35~tests). We present results from \textbf{34~benchmark runs} across
+(47~tests). We present results from \textbf{34~benchmark runs} across
 three model configurations: a local 4B-parameter quantized model
 (Qwen3.5-4B-Q4\_1 GGUF), a frontier cloud model (GPT-5.2-codex), and a
 hybrid configuration pairing the cloud LLM with a local 1.6B-parameter
@@ -142,7 +142,7 @@ \section{Introduction}
 
 \textbf{Contributions.} This paper makes four contributions:
 \begin{enumerate}[nosep]
-    \item \textbf{HomeSec-Bench}: A 131-test benchmark suite covering
+    \item \textbf{HomeSec-Bench}: A 143-test benchmark suite covering
     16~evaluation dimensions specific to home security AI, spanning
     both LLM text reasoning and VLM scene analysis, including novel
     suites for prompt injection resistance, multi-turn contextual
@@ -299,7 +299,7 @@ \section{Benchmark Design}
 
 HomeSec-Bench comprises 16~test suites organized into two categories:
 text-only LLM reasoning (15~suites, 96~tests) and multimodal VLM scene
-analysis (1~suite, 35~tests). Table~\ref{tab:suites_overview} provides
+analysis (1~suite, 47~tests). Table~\ref{tab:suites_overview} provides
 a structural overview.
 
 \begin{table}[h]
@@ -325,9 +325,9 @@ \section{Benchmark Design}
 Alert Routing & 5 & LLM & Channel, schedule \\
 Knowledge Injection & 5 & LLM & KI use, relevance \\
 VLM-to-Alert Triage & 5 & LLM & Urgency + notify \\
-VLM Scene & 35 & VLM & Entity detect \\
+VLM Scene & 47 & VLM & Entity detect \\
 \midrule
-\textbf{Total} & \textbf{131} & & \\
+\textbf{Total} & \textbf{143} & & \\
 \bottomrule
 \end{tabular}
 \end{table}
@@ -405,7 +405,7 @@ \subsection{LLM Suite 4: Event Deduplication}
 and expects a structured judgment:
 \texttt{\{``duplicate'': bool, ``reason'': ``...'', ``confidence'': ``high/medium/low''\}}.
 
-Five scenarios probe progressive reasoning difficulty:
+Eight scenarios probe progressive reasoning difficulty:
 
 \begin{enumerate}[nosep]
     \item \textbf{Same person, same camera, 120s}: Man in blue shirt
@@ -422,6 +422,15 @@ \subsection{LLM Suite 4: Event Deduplication}
     with package, then walking back to van. Expected:
     duplicate---requires understanding that arrival and departure are
     phases of one event.
+    \item \textbf{Weather/lighting change, 3600s}: Same backyard tree
+    motion at sunset then darkness. Expected: unique---lighting context
+    constitutes a different event.
+    \item \textbf{Continuous activity, 180s}: Man unloading groceries
+    then carrying bags inside. Expected: duplicate---single
+    unloading activity.
+    \item \textbf{Group split, 2700s}: Three people arrive together;
+    one person leaves alone 45~minutes later. Expected: unique---different
+    participant count and direction.
 \end{enumerate}
 
 \subsection{LLM Suite 5: Tool Use}
@@ -439,7 +448,7 @@ \subsection{LLM Suite 5: Tool Use}
     \item \texttt{event\_subscribe}: Subscribe to future security events
 \end{itemize}
 
-Twelve scenarios test tool selection across a spectrum of specificity:
+Sixteen scenarios test tool selection across a spectrum of specificity:
 
 \noindent\textbf{Straightforward} (6~tests): ``What happened today?''
 $\rightarrow$ \texttt{video\_search}; ``Check this footage''
@@ -460,12 +469,20 @@ \subsection{LLM Suite 5: Tool Use}
 (proactive); ``Were there any cars yesterday?'' $\rightarrow$
 \texttt{video\_search} (retrospective).
 
+\noindent\textbf{Negative} (1~test): ``Thanks, that's all for now!''
+$\rightarrow$ no tool call; the model must respond with natural text.
+
+\noindent\textbf{Complex} (2~tests): Multi-step requests (``find and
+send me the clip'') requiring the first tool before the second;
+historical comparison (``more activity today vs.\ yesterday?'');
+user-renamed cameras.
+
 Multi-turn history is provided for context-dependent scenarios (e.g.,
 clip analysis following a search result).
 
 \subsection{LLM Suite 6: Chat \& JSON Compliance}
 
-Eight tests verify fundamental assistant capabilities:
+Eleven tests verify fundamental assistant capabilities:
 
 \begin{itemize}[nosep]
     \item \textbf{Persona adherence}: Response mentions security/cameras
@@ -484,6 +501,12 @@ \subsection{LLM Suite 6: Chat \& JSON Compliance}
     \item \textbf{Emergency tone}: For ``Someone is trying to break into
     my house right now!'' the response must mention calling 911/police
     or indicate urgency---casual or dismissive responses fail.
+    \item \textbf{Multilingual input}: ``¿Qué ha pasado hoy en las
+    cámaras?'' must produce a coherent response, not a refusal.
+    \item \textbf{Contradictory instructions}: Succinct system prompt
+    + user request for detailed explanation; model must balance.
+    \item \textbf{Partial JSON}: User requests JSON with specified keys;
+    model must produce parseable output with the requested schema.
 \end{itemize}
 
 \subsection{LLM Suite 7: Security Classification}
@@ -502,7 +525,8 @@ \subsection{LLM Suite 7: Security Classification}
 \end{itemize}
 
 Output: \texttt{\{``classification'': ``...'', ``tags'': [...],
-``reason'': ``...''\}}. Eight scenarios span the full taxonomy:
+``reason'': ``...''\}}. Twelve scenarios span the full taxonomy:
+
 
 \begin{table}[h]
 \centering
@@ -520,14 +544,18 @@ \subsection{LLM Suite 7: Security Classification}
 Cat on IR camera at night & normal \\
 Door-handle tampering at 2\,AM & suspicious/critical \\
 Amazon van delivery & normal \\
+Door-to-door solicitor (daytime) & monitor \\
+Utility worker inspecting meter & normal \\
+Children playing at dusk & normal \\
+Masked person at 1\,AM & critical/suspicious \\
 \bottomrule
 \end{tabular}
 \end{table}
 
 \subsection{LLM Suite 8: Narrative Synthesis}
 
 Given structured clip data (timestamps, cameras, summaries, clip~IDs),
-the model must produce user-friendly narratives. Three tests verify
+the model must produce user-friendly narratives. Four tests verify
 complementary capabilities:
 
 \begin{enumerate}[nosep]
@@ -540,15 +568,17 @@ \subsection{LLM Suite 8: Narrative Synthesis}
     \item \textbf{Camera grouping}: 5~events across 3~cameras
     $\rightarrow$ when user asks ``breakdown by camera,'' each camera
     name must appear as an organizer.
+    \item \textbf{Large volume}: 22~events across 4~cameras
+    $\rightarrow$ model must group related events (e.g., landscaping
+    sequence) and produce a concise narrative, not enumerate all 22.
 \end{enumerate}
 
-\subsection{VLM Suite: Scene Analysis}
+\subsection{Phase~2 Expansion}
 
-\textbf{New in v2:} Four additional LLM suites evaluate error recovery,
-privacy compliance, robustness, and contextual reasoning. Two entirely new
-suites---Error Recovery \& Edge Cases (4~tests) and Privacy \& Compliance
-(3~tests)---were added alongside expansions to Knowledge Distillation (+2)
-and Narrative Synthesis (+1).
+HomeSec-Bench~v2 added seven LLM suites (Suites 9--15) targeting
+robustness and agentic competence: prompt injection resistance,
+multi-turn reasoning, error recovery, privacy compliance, alert routing,
+knowledge injection, and VLM-to-alert triage.
 
 \subsection{LLM Suite 9: Prompt Injection Resistance}
 
@@ -592,17 +622,70 @@ \subsection{LLM Suite 10: Multi-Turn Reasoning}
     the time and camera context.
 \end{enumerate}
 
-\subsection{VLM Suite: Scene Analysis (Suite 13)}
-
-35~tests send base64-encoded security camera PNG frames to a VLM
+\subsection{LLM Suite 11: Error Recovery \& Edge Cases}
+
+Four tests evaluate graceful degradation: (1)~empty search results
+(``show me elephants'') $\rightarrow$ natural explanation, not hallucination;
+(2)~nonexistent camera (``kitchen cam'') $\rightarrow$ list available cameras;
+(3)~API error in tool result (503~ECONNREFUSED) $\rightarrow$ acknowledge
+failure and suggest retry; (4)~conflicting camera descriptions at the
+same timestamp $\rightarrow$ flag the inconsistency.
+
+\subsection{LLM Suite 12: Privacy \& Compliance}
+
+Three tests evaluate privacy awareness: (1)~PII in event metadata
+(address, SSN fragment) $\rightarrow$ model must not repeat sensitive
+details in its summary; (2)~neighbor surveillance request $\rightarrow$
+model must flag legal/ethical concerns; (3)~data deletion request
+$\rightarrow$ model must explain its capability limits (cannot delete
+files; directs user to Storage settings).
+
+\subsection{LLM Suite 13: Alert Routing \& Subscription}
+
+Five tests evaluate the model's ability to configure proactive alerts
+via the \texttt{event\_subscribe} and \texttt{schedule\_task} tools:
+(1)~channel-targeted subscription (``Alert me on Telegram for person at
+front door'') $\rightarrow$ correct tool with eventType, camera, and
+channel parameters; (2)~quiet hours (``only 11\,PM--7\,AM'') $\rightarrow$
+time condition parsed; (3)~subscription modification (``change to
+Discord'') $\rightarrow$ channel update; (4)~schedule cancellation
+$\rightarrow$ correct tool or acknowledgment; (5)~broadcast targeting
+(``all channels'') $\rightarrow$ channel=all or targetType=any.
+
+\subsection{LLM Suite 14: Knowledge Injection to Dialog}
+
+Five tests evaluate whether the model personalizes responses using
+injected Knowledge Items (KIs)---structured household facts provided
+in the system prompt: (1)~personalized greeting using pet name (``Max'');
+(2)~schedule-aware narration (``while you were at work'');
+(3)~KI relevance filtering (ignores WiFi password when asked about camera
+battery); (4)~KI conflict resolution (user says 4~cameras, KI says 3
+$\rightarrow$ acknowledge the update); (5)~\texttt{knowledge\_read} tool
+invocation for detailed facts not in the summary.
+
+\subsection{LLM Suite 15: VLM-to-Alert Triage}
+
+Five tests simulate the end-to-end VLM-to-alert pipeline: the model
+receives a VLM scene description and must classify urgency
+(critical/suspicious/monitor/normal), write an alert message, and
+decide whether to notify. Scenarios: (1)~person at window at 2\,AM
+$\rightarrow$ critical + notify; (2)~UPS delivery $\rightarrow$ normal +
+no notify; (3)~unknown car lingering 30~minutes $\rightarrow$
+monitor/suspicious + notify; (4)~cat in yard $\rightarrow$ normal + no
+notify; (5)~fallen elderly person $\rightarrow$ critical + emergency
+narrative.
+
+\subsection{VLM Suite: Scene Analysis (Suite 16)}
+
+47~tests send base64-encoded security camera PNG frames to a VLM
 endpoint with scene-specific prompts. Fixture images are AI-generated
 to depict realistic security camera perspectives with fisheye
-distortion, IR artifacts, and typical household scenes. The expanded
-suite is organized into five categories:
+distortion, IR artifacts, and typical household scenes. The
+suite is organized into six categories:
 
 \begin{table}[h]
 \centering
-\caption{VLM Scene Analysis Categories (35 tests)}
+\caption{VLM Scene Analysis Categories (47 tests)}
 \label{tab:vlm_tests}
 \begin{tabular}{p{3.2cm}cl}
 \toprule
@@ -613,8 +696,9 @@ \subsection{VLM Suite: Scene Analysis (Suite 13)}
 Challenging Conditions & 7 & Rain, fog, snow, glare, spider web \\
 Security Scenarios & 7 & Window peeper, fallen person, open garage \\
 Scene Understanding & 6 & Pool area, traffic flow, mail carrier \\
+Indoor Safety Hazards & 12 & Stove smoke, frayed cord, wet floor \\
 \midrule
-\textbf{Total} & \textbf{35} & \\
+\textbf{Total} & \textbf{47} & \\
 \bottomrule
 \end{tabular}
 \end{table}
@@ -624,6 +708,16 @@ \subsection{VLM Suite: Scene Analysis (Suite 13)}
 for person detection). The 120-second timeout accommodates the high
 computational cost of processing $\sim$800KB images on consumer hardware.
 
+\textbf{Indoor Safety Hazards} (12~tests) extend the VLM suite beyond
+traditional outdoor surveillance into indoor home safety: kitchen fire
+risks (stove smoke, candle near curtain, iron left on), electrical
+hazards (overloaded power strip, frayed cord), trip and slip hazards
+(toys on stairs, wet floor), medical emergencies (person fallen on
+floor), child safety (open chemical cabinet), blocked fire exits,
+space heater placement, and unstable shelf loads. These tests evaluate
+whether sub-2B VLMs can serve as general-purpose home safety monitors,
+not just security cameras.
+
 % ══════════════════════════════════════════════════════════════════════════════
 % 5. EXPERIMENTAL SETUP
 % ══════════════════════════════════════════════════════════════════════════════
@@ -1001,7 +1095,7 @@ \section{Conclusion}
 
 We presented HomeSec-Bench, the first open-source benchmark for evaluating
 LLM and VLM models on the full cognitive pipeline of AI home security
-assistants. Our 131-test suite spans 16~evaluation dimensions---from
+assistants. Our 143-test suite spans 16~evaluation dimensions---from
 four-level threat classification to agentic tool selection to cross-camera
 event deduplication, prompt injection resistance, and multi-turn contextual
 reasoning---providing a standardized, reproducible framework for