Grizzlebear Codebase Improvement Tracker

Auto-generated: 2026-05-29 (re-verified against the 2026-05-28 security/cleanup commits) Review scope: Full codebase (dev/, ci/, website/, specs/, localhost/, root config) Previous reviews: 2026-05-29 (earlier same-day run, superseded), 2026-05-27, 2026-05-11, 2026-05-05, 2026-05-04, 2026-05-01, 2026-04-30, 2026-04-29, 2026-04-28, 2026-04-27, 2026-04-26, 2026-04-25, 2026-04-24, 2026-04-23, 2026-04-22, 2026-04-21 (initial), 2026-04-15

Correction note (2026-05-29): the earlier same-day review re-listed items as open that three 2026-05-28 commits had already fixed (835108a secrets/JWT/escaping, ca76051 SessionState/input-validation, 6c66c41 dead-code cleanup) — a stale-seed regression. This refresh re-verified every CRITICAL/HIGH item against current HEAD and moved the confirmed fixes to RESOLVED. The genuinely-still-open critical is the DEK debug log (now Priority 1).

TOP 2 PRIORITIES

1. Encryption Key Secret Logged in Plaintext (CRITICAL SECURITY — STILL OPEN)

dev/core/data.py:74 — logger.debug(f"dek: {os.environ['ACCOUNT_ENCRYPTION_KEY_SECRET']}") writes the master account encryption key secret to application logs in plaintext. This runs on every save_dek() call (i.e., every time a new Data Encryption Key is saved for an account). Verified still present at HEAD 2026-05-29.

Impact: ACCOUNT_ENCRYPTION_KEY_SECRET is the key-encryption-key (KEK) that protects all per-account DEKs. If application logs are stored, forwarded to a log aggregator, or visible via modal app logs, this key is exposed. Combined with access to the encrypted DEKs in S3, an attacker can decrypt all account data.

Fix:

Immediately remove the logger.debug(f"dek: ...") line at data.py:74.
Audit modal app logs retention to confirm the key hasn't been captured in log storage.
Also fix line 72: logger.debug("save_dek", account_id) — the second positional arg is silently dropped (logger signature bug, see Priority 2).

Ops follow-up (carried over from the 2026-05-28 secrets fix): the previously-hardcoded Resend API key (re_FKkPHP8c_…, removed from notifier.py) still lives in git history. Rotate it and consider git-filter-repo to scrub history. Same for any "your-super-secret-key" JWT placeholder occurrences in history.

Resolved parts (2026-05-28, see RESOLVED): hardcoded Resend key and "your-super-secret-key" JWT placeholder are no longer in source — both now read from Modal secrets via env (RESEND_API_KEY / JWT_INVITE_SECRET). CORS wildcards remain open — see H4.

2. Centralized Logging Sweep (IN PROGRESS — complete the migration)

Current state: dev/core/logging_config.py provides the get_logger() factory. Raw print() and direct-logging usage remain concentrated in agent code, ML CLI entrypoints, and ops/diagnostic scripts.

Audit counts (re-run 2026-05-29):

print( — 433 total matches across 39 files (includes dev/_archived/, tests, and diagnostic scripts). Production-path prints are concentrated in the files below.
traceback.print_exc() — 44 active in production code (1 commented in livekit_rpc_agent.py:441, 1 in _archived/).
import logging (bypassing get_logger()) — 26 files (excluding logging_config.py itself).
bare except: — 16 active in production (see H2).

Remaining print() concentrations (production paths):

dev/livekit_ts/redis_dump.py (36) — diagnostic script.
dev/livekit_ts/agents/livekit_rpc_agent.py (26) — agent code.
dev/verify_supabase_integration.py (24) — root-level verify script (see N5).
dev/sync_admins.py (22) — admin sync utility.
dev/ml/ml_endpoint.py (18) — CLI entrypoints and diagnostics (see N2).
dev/images/testing_gemini.py (15), dev/images/stitcher.py (7), dev/images/gemini_segment.py (6).
dev/livekit_ts/agents/ — livekit_video_agent (8), livekit_elevenlabs_vision_agent (8), livekit_agent_manager_rpc (7), livekit_transcriber_agent (4), livekit_logger_agent (4), livekit_tts_agent (3).
dev/scripts/build_ecr_base_on_ec2.py (6) — ops tooling.
dev/static_site/preview_dashboard.py (5) — NEW local mock-preview tool (added in the deploy-dashboard redesign); dev-only.

Logger signature bugs: the logger.debug("label", value) anti-pattern (extra positional args silently dropped) — 27 instances in production code (recount 2026-05-29):

dev/core/data.py:72,89,132,161,175,190,200,223,244,255,265,274 — 12 instances.
dev/users/auth.py:339,407,423,439,451,578,587,595,664,703,711,809,883 — 13 instances (line numbers drifted from the prior review; count up from 11).
dev/core/decorators.py:18,74 — 2 instances passing args, kwargs as positional args.
(notifier.py's 2 instances were fixed 2026-05-28 — now f-strings.)

Misuse at dev/users/users_endpoint.py:271: logger.debug(tradespark_http_exception, traceback.print_exc()) — calls traceback.print_exc() as a side effect (prints to stderr), then passes its return value (None) as a format argument. Double-logs via stderr + logger; the logger message gets no useful info.

traceback.print_exc() in agent code: 44 active occurrences across dev/livekit_ts/agents/ (all 8 agent files), dev/livekit_ts/livekit_endpoint.py:93,168, and dev/users/users_endpoint.py:271. These bypass the logging framework and write to stderr.

26 modules still use import logging directly (reconfirmed 2026-05-29): dev/data/sync.py, dev/websocket/ (2: websocket.py, websocket_messagepack.py), dev/ml/ (17 files — count up from 14: training/{trainer,model_registry,dataset_builder}, serving/{vllm_server,ab_router}, mobile/{quantize,litert}, gateway/{providers,logger,gateway,chat}, eval/{metrics,llm_judge,eval_runner,benchmarks}, data_pipeline/{synthetic_generator,converters}), dev/livekit_ts/agent/image_segmentation.py, 3 tsweb files (location.py, queries.py, scraper.py), and 2 queues files (session_to_splat.py, session_to_splat.video_3d_reconstruction.py).

Note: dev/queues/session_to_splat.py also has a manual logger-setup block duplicating the centralized factory (see M57).

Fix:

Fix all logger.debug("label", value) signatures across data.py, auth.py, decorators.py — use f-strings or %s placeholders.
Replace traceback.print_exc() with logger.exception() in agent files and users_endpoint.py.
Migrate the remaining 26 import logging modules to from core.logging_config import get_logger.
Add a lint rule (ruff T201/T203) to prevent print() regression.

HIGH PRIORITY

H1. Test Coverage for Critical Modules

80%+ of backend modules have zero tests. Missing coverage:

model_proxy/ (2 files)
websocket/ (multiple agent/handler files)
livekit_ts/ (20+ files including agent logic)
capture/, geocoding/, voices/
data/ (2 endpoint files, minimal coverage)
ml/ (entire pipeline: gateway, eval, training, serving, mobile — zero tests)

Website has only 1 partial E2E test (test.auth.ts) with multiple FIXMEs blocking execution.

Impact: Regressions go undetected; refactoring is risky without a safety net.

Note (from feedback memory): pytest is owned by the account-system dev; ml/general tests go in Bruno or preflight scripts, not pytest. This item scopes to the account-system pytest suite only.

H2. Bare Exception Handlers Masking Failures

16 bare except: blocks remain in owned production code (excluding dev/_archived/):

dev/livekit_ts/livekit_server.py:69,77
dev/livekit_ts/redis_dump.py:132
dev/livekit_ts/agents/ — 13 across all 8 agent files (livekit_rpc_agent.py:342, livekit_transcriber_agent.py:260,340, livekit_elevenlabs_vision_agent.py:723,798, livekit_agents.py:152, livekit_video_agent.py:591, livekit_logger_agent.py:263, livekit_basic_agent.py:218,294, livekit_tts_agent.py:344,420, livekit_agent_manager_rpc.py:307)

These catch KeyboardInterrupt, SystemExit, and asyncio.CancelledError — all of which should propagate.

Fix: Start with dev/livekit_ts/agents/. Replace bare except: with except Exception as e:; prefer specific exception types for known failure modes.

H3. Hardcoded Configuration Values

S3 bucket names, AWS regions, and URLs are scattered as string literals:

dev/core/data.py:24-30 — 7 hardcoded bucket names
dev/core/data.py:67 — region_name='us-west-2' with FIXME comment
dev/core/core.py:120 — Full S3 URL with date suffix baked in
dev/core/core.py:474-500 — S3 mount definitions with hardcoded bucket names including dates (e.g., "modal-config-120125")
dev/deploy.sh:21 — Hardcoded AWS profile "tradespark"
dev/ml/training/model_registry.py:34-35 — Hardcoded bucket name and Modal Volume path
dev/ml/gateway/providers.py:287 — Hardcoded vLLM slot set
dev/tsweb/scraper.py:33 — Hardcoded default bucket "tradespark-ml-datasets" and region "us-west-2" (line 39)

Fix: Consolidate into a single config module or environment variables with startup validation.

H4. Wildcard CORS on All 12 Production Endpoints

Every FastAPI app in production code allows allow_origins=["*"] (12 files, reconfirmed 2026-05-29):

data_endpoint.py:47, websocket_endpoint.py:26, users_endpoint.py:79 (has TODO: restrict in prod), capture_endpoint.py:39, ml_endpoint.py:51, model_proxy_endpoint.py:67, geocoding_endpoint.py:30, livekit_endpoint.py:37, livekit_dashboard.py:127, voices_endpoint.py:30, static_site/endpoint.py:94, tsweb/endpoint.py:45.

Fix: Create a shared CORS config function that reads allowed origins from environment; restrict to known domains in production.

H5. Missing DEK Encryption at Rest

dev/core/dek_store.py:20 still stores Data Encryption Keys in plaintext memory with TODO comments about encrypting with a backend key. The global deks_by_account dictionary is also not thread-safe for concurrent Modal requests, has no TTL/expiration, and no access control.

H8. Fragile Dependency Workaround in ML Training Image

dev/core/core.py:726 uses .run_commands("pip install 'transformers>=5.5.0'") to force-upgrade transformers after vLLM 0.19.0 pins transformers<5. This is a fragile workaround that will break unpredictably when either package updates.

H9. Race Condition in S3 Training Data Logger

dev/ml/gateway/logger.py appends to daily JSONL files in S3 via download-append-reupload. With @modal.concurrent(max_inputs=100) on the ML endpoint, concurrent requests can overwrite each other's log entries, silently losing training data.

Extended finding: logger.py:88-89 catches ALL exceptions (not just NoSuchKey) and silently sets body = "". If S3 returns AccessDenied or a network timeout, existing log data is overwritten with just the new line. Line 86 correctly catches NoSuchKey for the "new file" case, but the except Exception: at line 88 masks real errors.

Also: dev/ml/data_pipeline/supabase_scraper.py:44 has the same pattern — except Exception: return "2020-01-01T00:00:00Z" silently resets the sync timestamp on ANY error, causing full re-processing of all data.

Also: dev/tsweb/scraper.py replicates both issues:

Line 48: _get_last_sync_timestamp() catches except Exception: and returns "2020-01-01T00:00:00Z".
Lines 243-247: download-append-reupload on daily JSONL files with except Exception: body = "".

Fix: Switch to per-request S3 keys (e.g., {prefix}/{date}/{uuid}.jsonl) or use S3 append-only semantics. Catch only NoSuchKey/404 for the "doesn't exist yet" path; log and re-raise other exceptions.

H10. Missing Input Validation on ML Request Parameters

Multiple ML endpoint request models lack bounds validation (line numbers drifted — ml_endpoint.py grew to 2,333 lines):

ServeRequest.max_tokens (ml_endpoint.py:998) — no upper bound; users can request arbitrarily large generations.
ServeRequest.temperature (ml_endpoint.py:997) — no bounds (should be 0.0–2.0).
ServeRequest.model_slot — bare str with no enum constraint; invalid slots fail inside generate() instead of at request validation time.
ChatStreamRequest.messages (chat.py) — no limit on list length or content size.
GatewayRequest.system_prompt / user_input (gateway.py) — no max_length.

Impact: DoS via resource exhaustion or excessive LLM API costs from unbounded requests.

H13. Missing Error Handling on Remote Modal Calls

ML endpoints invoke .remote.aio() with no try/except (e.g. modal_download_base_models, modal_train_model, modal_ingest_litert, modal_convert_litert). Modal timeout, compute, or network errors propagate as unstructured 500 responses with no useful error message.

Fix: Wrap each in try/except to return structured error responses with the Modal error context.

H15. S3 Model Registry Race Condition

dev/ml/training/model_registry.py:257-263 — register_version() does load → append → save on registry.json in S3 with no locking or conditional write. If two Modal functions register versions simultaneously, one silently overwrites the other's entry.

Fix: Use S3 conditional writes (If-None-Match / If-Match ETags) for optimistic locking, or use DynamoDB for atomic registry updates.

H17. `sync.py` Imports Private `core.data` Functions

dev/data/sync.py:82-83,184 imports _put_blob, _asset_s3_key, and _get_blob — underscore-prefixed internal functions from core.data. These are implementation details that may change. The sync module is the only external consumer.

Fix: Either promote _put_blob/_get_blob/_asset_s3_key to public API, or refactor sync.py to use the public store() function.

H18. Missing Error Handling on `call_provider()` in Synthetic Endpoints

dev/ml/ml_endpoint.py — auto_answer_questions calls call_provider() with no try/except. If the LLM provider returns an error, the raw exception propagates as an unstructured 500. The subsequent json.loads(result.response_text) also assumes result is not None.

Fix: Wrap call_provider() in try/except; validate result before accessing .response_text. Return a structured 502 error on provider failure.

H19. `main`/`beta` Env Supabase Routing — RESOLVED 2026-05-11

Resolution summary: Self-managed anon login sidestepped the prod Supabase anonymous sign-in blocker. dev/core/core.py:429-430 now routes main/beta → SupabaseProd, everything else → SupabaseDev.

Still required for the website:

[ ] Provision the prod Supabase URL + publishable anon key in website/scripts/supabase.ts (PROD_SUPABASE is currently empty with a TODO).
[ ] Confirm Modal secret SupabaseProd in the main Modal environment points at the new prod project URL/keys before the first main deploy.
[ ] Seed the canary fixtures (test_project_id, test_location_id) in the prod project, OR update bruno/.../environments/main.bru.
[ ] Run cd dev && ./test_app.sh --env main after the first prod deploy.

H20. Missing `await` on Async Cleanup in agent.py

dev/livekit_ts/agent/agent.py — _create_video_stream() calls session_state.video_stream.aclose() without await (author flagged with # FIXME Not async.). Without await, the async generator is never properly closed; the underlying RTC video stream remains open and accumulates frame buffers until the session ends.

Fix: Make _create_video_stream() async and await session_state.video_stream.aclose(). Wrap in asyncio.create_task(...) if the callback signature requires sync.

H23. Beta Should Point at Supabase Staging, Not Prod

dev/core/core.py:429-430 currently routes both main and beta to the same SupabaseProd Modal secret. Long-term, beta should talk to a separate SupabaseStaging project.

Work to do:

[ ] Create a new SupabaseStaging project in Supabase.
[ ] Provision a SupabaseStaging Modal secret.
[ ] Update routing to use a dict lookup (main→Prod, beta→Staging, else→Dev).
[ ] Seed canary fixtures in SupabaseStaging.

H25. Deploy Dashboard Exposes Exception Details in Error Responses

dev/static_site/deploy_dashboard.py — multiple occurrences of type(e).__name__: {e} in HTTP error responses or return values (e.g. raise HTTPException(502, f"history fetch failed: {type(e).__name__}: {e}")). (Line numbers shift frequently — the dashboard sees near-daily commits; grep type(e).__name__ to locate.)

While these endpoints are admin-gated (TradesparkEmailAdmin), the raw exception strings can reveal internal implementation details — Modal function names, class paths, connection strings, or timeout details.

Risk: Medium (admin-only), but follows the same information-disclosure anti-pattern as M17.

Fix: Log the full exception at ERROR level (already done via logger.exception), but return a generic user-facing message.

H27. CI Webhook Token Comparison Not Timing-Safe

ci/webhook.py:453,486,499 — x_gitlab_token != expected and x_approve_token != expected (now 3 comparison sites; line numbers drifted from the prior 244/261) use Python's != operator, which short-circuits on the first differing byte. This enables timing side-channel attacks to guess the token character by character. The webhook is internet-facing and the token is the sole auth mechanism.

Fix: Use hmac.compare_digest(...) from the stdlib for all three comparisons.

H28. `ensure_cicd_fresh.sh` Hash Only Covers `.py` Files

ci/ShellScripts/ensure_cicd_fresh.sh:47 — The staleness hash is sha256sum ci/*.py | sha256sum. Changes to dev/requirements.txt, Dockerfiles used by CI, or non-Python CI configs don't change the hash. The CI workers won't be redeployed, and the next CI run will fail with an import error until someone manually runs modal deploy -e cicd.

Fix: Include dev/requirements.txt and relevant Dockerfiles in the hash.

H29. Env Allow-List Drift Across CLI Scripts and Dashboard

(Renumbered from a second, duplicated "H26" header.)

dev/stop_app.sh:13 defines SUPPORTED_ENVS=(main beta dev jh rk cc) — missing fl (which deploy.sh and the dashboard support). Neither stop_app.sh nor deploy.sh includes aw or jc, but the deploy dashboard's DEV_ENV_OPTIONS and STOPPABLE_ENVS (deploy_dashboard.py:46,55) do. Additionally, the dashboard's ALLOWED_ENVS (for deploy triggers) doesn't include aw/jc, so the UI shows a Deploy button for those envs that always returns HTTP 400.

Note: the root Justfile _ENVS := "main beta dev jh rk cc fl" is yet another copy of the allow-list.

Impact: Devs with fl/aw/jc envs get inconsistent tooling. Broken UX and confusion during incident response.

Fix: Define a single ALL_ENVS / STOPPABLE_ENVS set in one place (e.g., core/core.py or a shared config), and have deploy.sh, stop_app.sh, deploy_dashboard.py, and the Justfile all derive from it.

N20. `build_splat_base` GitLab CI Job Has No Docker-Capable Runner

The build_splat_base job in .gitlab-ci.yml needs Docker to build dev/queues/dockerfile.session_to_splat.base, but every push routes it to proxmox-runner-1 which is a shell executor (not docker-in-docker). Result: the job fails immediately with apk: command not found because the shell executor ignores the image: docker:24 + services: docker:24-dind declarations.

Workaround in place (2026-05-16): the job's auto-trigger on Dockerfile changes was removed; it's now when: manual + allow_failure: true. The actual splat base build is done via dev/queues/build_splat_base_on_ec2.py, which spins up a temporary m5.xlarge with real Docker, builds + pushes to ECR, and tears down. Manual but reliable.

Proper fix:

Register a Docker-executor GitLab runner tagged something like docker-dind.
Add tags: [docker-dind] to the build_splat_base job.
Restore the auto-trigger rule (changes: → when: on_success, fallback when: manual + allow_failure: true).
Verify by changing the Dockerfile and pushing.

Alternative: replace the CI job with an AWS CodeBuild project triggered by a GitLab webhook.

N21. `deploy_and_merge_dev` Job Can't Push to `origin/beta` — HTTP Basic Auth Denied

GitLab CI pipeline #11147 on the dev branch ran ci/ShellScripts/merge_branches.sh dev beta …. The local git merge inside the runner succeeded cleanly, but the final git push origin beta failed with remote: HTTP Basic: Access denied. The CI runner's persisted git credential has expired or been rotated. dev → beta auto-merge is silently blocked.

Fix:

Inspect the runner / job's git-push credential (ci/ShellScripts/merge_branches.sh).
Rotate: create a fresh GitLab project access token with write_repository scope, store as a masked CI/CD variable, use it in the push URL.
Retry the failed deploy_and_merge_dev job (or push a new commit).

Same auth pattern is used by merge_beta_to_main — that path will hit the same wall when it next runs.

N3. Shell Scripts Missing `set -e` Error Handling

dev/deploy.sh and dev/stop_app.sh have no set -e, set -u, or set -o pipefail at top. If any step fails, the script continues, potentially deploying partial/broken state or silently failing to stop apps.

Fix: Add set -euo pipefail at top of both scripts, then fix any commands that relied on silent failure.

MEDIUM PRIORITY

M1. Circular Dependency Workarounds

dev/core/core.py:173 uses local imports (from users.auth import bind_user_token) to avoid circular dependencies between core and users. This hides architectural coupling. Consider extracting shared interfaces into a separate module.

M2. Overly Long Functions and Files

dev/ml/ml_endpoint.py — 2,333 lines (was ~1,256 on 2026-05-29; nearly doubled). Spans FastAPI routes, Modal class definitions, CLI entrypoints, diagnostics, and four near-identical vLLM serving classes. The single largest file in the codebase — strong candidate for splitting (routes vs Modal serving classes vs CLI).
dev/core/data.py — 1,357 lines.
dev/users/auth.py — 1,194 lines, mixed auth logic, database ops, and email handling.
ci/_git_in_modal.py — 1,023 lines — the CI git helper has grown into a monolith (merge-and-push, fetch-history, get-env-states, get-active-dispatches, get-branch-tips, last-test-results, record-event). Consider splitting by concern.
dev/livekit_ts/agent/agent.py — entrypoint() packs nested class definitions (EventEmitter, SparkyAssistant) inside the function, hampering testing/reuse (file 631 lines).
dev/livekit_ts/agents/livekit_elevenlabs_vision_agent.py — 816 lines.

M3. Inconsistent Auth Patterns

Three different auth checking mechanisms coexist:

Dependency-based: dev/core/core.py:215-263
Decorator-based: dev/core/decorators.py:52-71
Service-layer: dev/users/auth.py

No single canonical pattern is documented or enforced.

M4. Duplicate Supabase Client and S3 Client Creation

Supabase clients created independently in:

dev/core/db.py:19-39 (cached via @lru_cache(maxsize=1) — not thread-safe for Modal concurrency)
dev/users/auth.py:80 (direct)

S3/boto3 clients: get_s3_sync_client() in data.py creates a new boto3 client on every call — 24 call sites in data.py alone. dev/tsweb/scraper.py:38 adds yet another independent boto3.client("s3") factory.

Fix: Cache the S3 client per container (module-level singleton or @lru_cache). Consolidate Supabase into a single factory in core/db.py.

M5. Modal Secret Ordering Fragility

dev/core/core.py:465-466 documents that supabase_secret MUST be last in the secrets list because Modal merges left-to-right and the TradeSpark secret contains a stale SUPABASE_URL. This implicit ordering is a maintenance trap.

M6. Duplicate Model Defaults in ML Gateway

Model defaults defined in three places with risk of inconsistency:

dev/ml/gateway/gateway.py:85-90
dev/ml/gateway/chat.py:74-79
dev/ml/gateway/providers.py:37,74,121 (function parameter defaults)

All hardcode model names (e.g., "claude-sonnet-4-6", "gemini-2.5-flash"). Should be a single config source.

M8. CI Pipeline Gaps

No GitLab CI cache configuration (each job reinstalls dependencies).
pip install modal --break-system-packages risks contaminating system Python. Same pattern for boto3 pyyaml.
No post-deployment smoke tests or rollback procedure.
No CI stages for website building/testing.

M9. No Timeout on Async Queue in ML Chat Streaming

dev/ml/gateway/chat.py:205 — await queue.get() has no timeout. If a provider stops sending events, the endpoint hangs indefinitely.

Extended: Task cancellation calls t.cancel() but doesn't await the tasks. Cancelled tasks may still be executing when the StreamingResponse context exits, leaking connections.

M10. Missing Startup Configuration Validation

No checks at application startup to ensure all required environment variables and secrets are present. MODAL_ENVIRONMENT silently defaults to "dev" (core.py:24). A dev/.env.template exists but nothing validates env-var presence at startup.

M11. Stale Backlog Documents

specs/WEEK_1_BACKLOG.md and specs/WEEK_2_BACKLOG.md are 140+ days old (Jan 8, 2026) with no indication of completion status. Should be archived or updated.

M12. No Docker Compose for Local Development

Tests require a local PostgreSQL at postgresql://postgres:testpass@localhost:5432/test_grizzlebear but there is no docker-compose.yml or setup instructions for reproducibility.

M13. Missing Health Check Endpoints

Six top-level endpoints have /health. Missing from:

dev/users/users_endpoint.py
dev/geocoding/geocoding_endpoint.py
dev/livekit_ts/livekit_endpoint.py
dev/voices/voices_endpoint.py
dev/livekit_ts/livekit_dashboard.py

M14. No Graceful Shutdown Handlers

No FastAPI app uses lifespan context managers or shutdown handlers. In-flight requests may be abruptly terminated when Modal containers scale down.

M15. Triplicated vLLM Generate Code

dev/ml/ml_endpoint.py:2035,2081,2125,2169 — the vLLM serving classes each have identical ~10-line generate() methods. A _make_vllm_generate() factory exists earlier in the file but is never used — dead code. (Line numbers updated 2026-05-29 after the file grew to 2,333 lines.)

M16. S3 list_objects_v2 Missing Pagination

Five locations call list_objects_v2 without handling pagination (S3 returns max 1000 keys per response):

dev/capture/capture.py:185,238
dev/ml/eval/eval_runner.py:193
dev/ml/ml_endpoint.py
dev/ml/data_pipeline/converters.py:127

M17. Information Disclosure via Error Messages

Several ML endpoints return raw Python exception strings to clients:

dev/ml/gateway/gateway.py:105 — detail=f"LLM provider error: {str(e)}"
dev/ml/gateway/chat.py:182 — "error": str(e) in SSE events
dev/ml/gateway/providers.py:346 — yield f"[vLLM error for {slot}:{version_str}: {e}]"

See also H25 (deploy dashboard) for the same pattern.

M18. Quintuplicated vLLM Slot Mapping

Model slot membership is maintained in five separate locations that must be kept in sync:

dev/ml/gateway/providers.py:287 — _has_vllm_cls() hardcoded set
dev/ml/gateway/providers.py:325-329 — cls_map dict
dev/ml/serving/vllm_server.py:48-52 — identical cls_map dict
dev/ml/gateway/chat.py:82 — _MODEL_SLOT_NAMES set
dev/core/model_versions.py:20-24 — MODEL_VERSIONS dict keys

M19. Race Condition in Room Agent Management

dev/livekit_ts/agent/room_agent_worker.py:61-65,116,125,132 — check-then-act pattern on shared core.data.active_room_agents dict without locking. The code has a FIXME at line 60 acknowledging this.

M21. Triplicated Modal Image Dependency Blocks

Four Modal image definitions in dev/core/core.py independently specify their pip dependencies. Core deps are copy-pasted across all four.

Fix: Extract a CORE_DEPS list and compose image deps as CORE_DEPS + image_specific_deps.

M22. No Preflight Gate for Dev Branch Deploys

The deploy_and_merge_dev CI stage skips the preflight_manifest.py check and immediately merges dev → beta.

M23. CI Merge Script Silently Overwrites Conflicts

ci/ShellScripts/merge_branches.sh:28-29 — git merge --no-ff -X theirs always takes the source branch version on conflict.

M24. Excessive HTTP Timeout on vLLM Proxy Requests

dev/ml/gateway/providers.py:391 — 10-minute timeout for the vLLM endpoint.

M25. Uninitialized CI Variables

.gitlab-ci.yml:191 — $MATTERMOST_WEBHOOK never declared or documented. main_tests_notify.sh:40 references undefined $MODAL_URL.

M26. No Timeout on Modal Deploy Commands

ci/ShellScripts/deploy_modal.sh:35 and dev/deploy.sh:295 — modal deploy commands have no timeout.

M27. main_tests_notify.sh Has Shebang After Executable Code

ci/ShellScripts/main_tests_notify.sh:1-8 — Lines 1-6 execute before the #!/bin/bash shebang on line 7.

M28. `register_external_asset` Broad Exception Catch

dev/core/data.py:1313 — except Exception: on s3.head_object() silently sets file_size=0.

M29. `list_location_assets` Generates Presigned URL Per Asset

dev/core/data.py:1252 — one S3 presigned URL per asset, each creating a new boto3 client.

M30. `supabase_queries.py` Broad Exception Catches Silently Return Empty Data

dev/core/supabase_queries.py catches except Exception as e: and returns empty results.

M31. `decorators.py:20` Unsafe UUID Parsing With No Guard

dev/core/decorators.py:20 — account_id = uuid.UUID(args[0], version=4) without try/except.

M33. vLLM Streaming Response Line Splitting

dev/ml/gateway/providers.py:398 — byte-level iteration doesn't guarantee complete SSE lines.

M34. No `@modal.exit()` Cleanup on vLLM Serving Classes

dev/ml/ml_endpoint.py — the vLLM serving classes define @modal.enter() but no @modal.exit() handlers. GPU memory and pending requests are abandoned without cleanup on scale-down.

M35. Training Format Functions Vulnerable to Chat Delimiter Injection

dev/ml/data_pipeline/converters.py:35-63 — format_gemma(), format_chatml(), and format_llama() interpolate user-supplied fields directly into template strings containing chat delimiters.

M36. Container Environment Variables Leaking to Modal App Logs

While running modal app logs, the log stream emits KEY=value lines exposing live credentials. Source still unknown (no print(os.environ) in current source). Closely related to Priority 1 (DEK log) — both leak secrets to modal app logs.

M37. innerHTML XSS in mobile-session.html renderAssets()

dev/static_site/templates/demos/mobile-session.html:336 — li.innerHTML with unescaped API response data (a.fileName, a.category, a.download_url). The same file has an _escapeHtml() helper at line 460 that isn't used in renderAssets().

M38. tsweb/queries.py Broad Exception Catches on Supabase Queries

dev/tsweb/queries.py catches except Exception as e: on Supabase query calls and returns empty results.

M40. Missing Error Handling on tsweb `/projects` Supabase Queries

dev/tsweb/endpoint.py:103-105,112-114 — Two Supabase queries with no try/except.

M41. Silent Cron Failure in tsweb Nightly Sync

dev/tsweb/scheduled.py:22-28 — nightly_supabase_sync() calls sync_new_projects() with no try/except.

M42. Broad `except Exception:` on Media Read in `sync.py`

dev/data/sync.py:206 — except Exception as e: on _get_blob() call silently logs a warning and continues.

M43. Race Condition in `SparkyAssistant._tasks` List

dev/livekit_ts/agent/agent.py — self._tasks: list[asyncio.Task] mutated from multiple async callbacks without a lock.

M44. Race Condition in DEK Loading (`_ensure_dek`)

dev/core/data.py — _ensure_dek() does check-then-load on the global deks_by_account dict without a lock.

M45. recorder.py Unbounded Retry Loop With No Backoff or Timeout

dev/livekit_ts/agent/recorder.py:45-57 — wait_for_recorder() busy-loops on the recorder URL with a fixed 0.5s sleep and no maximum retries.

M46. websocket_messagepack.py Silently Substitutes Defaults on Parse Failure

dev/websocket/websocket_messagepack.py:28-29 — On msgpack.unpackb() failure, silently substitutes a fake default tuple. Exception variable e is captured but never logged.

M47. room_agent_worker.py Double Cleanup in except + finally

dev/livekit_ts/agent/room_agent_worker.py:114-131 — Both except asyncio.CancelledError: and finally: perform the same cleanup operations.

M50. localhost/dockerfile Missing `.dockerignore`

localhost/dockerfile:20 — ADD ./ /opt/app/tradespark/ copies the entire build context. No .dockerignore exists, so the image includes .git/, .env, dev/.venv/, etc.

M51. Large Blocks of Commented-Out Code in livekit_ts/ Agents

Three agent files contain 20-80+ line blocks of commented-out code.

M55. `traction.py` Broad `except Exception:` on Date Parsing and Supabase Queries

dev/users/traction.py has 3 broad exception catches.

M56. `innerHTML` With Unescaped `e.message` in ML Demo Pages

dev/static_site/templates/demos/ml-training.html:395 and ml-eval.html:945 — error messages interpolated via innerHTML without escaping.

M57. `session_to_splat.py` Manual Logger Setup Bypasses Centralized Factory

dev/queues/session_to_splat.py:9-14 — sets up its own logging.StreamHandler with a custom format and DEBUG level, duplicating the centralized logging_config.py factory.

Fix: Replace with from core.logging_config import get_logger; logger = get_logger("queues.session_to_splat").

LOW PRIORITY

L17. ABRouter A/B Routing Is Dead/Incomplete

dev/ml/serving/ab_router.py:26-32 — hardcoded split=1.0 makes routing inert.

L18. SQL Identifier Interpolation in Schema Migrations

dev/core/data.py:670 — f-string DDL from trusted-only _MIGRATION_COLUMNS constant.

L20. integration-tests.sh Uses `set -ex` (Echoes Env)

dev/integration-tests.sh:19 — -x xtrace echoes env vars to CI logs.

L21. integration-tests.sh `cd $cwd` Unquoted

dev/integration-tests.sh:130 — cd $cwd without quotes.

L22. TOCTOU on `remote_participants` in agent.py

dev/livekit_ts/agent/agent.py:209-210 — check-then-access on a dict mutated by the event loop.

L25. Dead `_smoketest_in_modal.py` Still Present After Phase 5

ci/_smoketest_in_modal.py:14 — The file's own docstring says "Delete this file after Phase 3 (deploy_in_modal.py) is green." Dead code that also uses the old add_local_dir pattern.

Fix: Delete ci/_smoketest_in_modal.py.

L26. `_escape()` in deploy_dashboard.py Missing Single-Quote Escaping

dev/static_site/deploy_dashboard.py:108-119 — The custom _escape() handles & < > " but not '. Currently safe because server-side rendering uses double-quoted attributes, but fragile.

Fix: Replace with html.escape(s, quote=True) from stdlib.

L27. Unbounded `limit` Query Parameter on Deploy Dashboard History API

dev/static_site/deploy_dashboard.py:150 — limit: int = 100 has no upper bound. A caller can pass limit=999999999, loading the entire CI-history JSONL in one response.

Fix: Use Query(default=100, ge=1, le=500).

L1. Missing Type Hints on Key Functions

Functions in dev/users/auth.py and dev/core/core.py lack complete type annotations.

L2. Missing Docstrings on Public APIs

30+ public functions across core/, users/, livekit_ts/ have no docstrings.

L5. Broken Email HTML Links

dev/core/notifier.py:21 has # FIXME HTML links not working.

L6. Localhost Dockerfile Runs as Root

localhost/dockerfile has no non-root user defined. Also installs unused packages.

L7. Website SRI Integrity Checks Broken

website/build-scripts/build-html.ts FIXME noting SRI integrity checks are broken.

L8. 100+ TODO/FIXME Comments Need Triage

Scattered across the codebase. Notable concentrations: dev/users/auth.py, dev/livekit_ts/livekit_endpoint.py, dev/core/core.py.

L10. Postman Collection Alongside Bruno

Both postman_collection.json (root) and bruno/ exist.

L11. Dead Code in crypto.py

dev/core/crypto.py:12-74 — ~60 lines of commented-out encryption functions.

L12. Inconsistent Error Handling in CI Scripts

Mixed set -e / set -ex / no set -e across scripts.

L13. Unsafe Dictionary Access in ML Serving

dev/ml/gateway/providers.py:350 — safe but yields "unknown" rather than an error.

L14. No Rate Limiting on Any Endpoint

No rate limiting on any FastAPI endpoint. ML gateway proxies to paid LLM APIs.

L15. LLM Prompt Injection via f-String User Input

dev/ml/ml_endpoint.py:152-162 interpolates user-supplied input directly into an LLM system prompt.

N8. Typo "settting" in Test Fixture Log Message

dev/tests/conftest.py:609 — triple-t typo.

N13. Stale NVM Version in dockerfile.modal

dockerfile.modal:91 installs NVM v0.39.1 via the deprecated creationix URL.

N14. Missing JSON Parse Guard in CI Report Generator

ci/generate-html-report.js:8-13 — JSON.parse with no try/catch.

N15. `_save_location_db` Deprecated But Still Present

dev/core/data.py:691-714 — Deprecated write-path helper marked for removal.

N2. `dev/ml/ml_endpoint.py` CLI Uses Raw print() After Logger Migration

18 raw print() calls remain in the CLI entrypoint and diagnostic blocks.

N4. requirements.txt Has Both `google-genai` and `google-generativeai`

dev/requirements.txt:36-37 — google-genai==1.32.0 and google-generativeai==0.8.5, two different Google packages with overlapping functionality. (The bare google==3.0.0 meta-package was removed 2026-05-28 — see N18 in RESOLVED.)

N5. Verification Scripts at dev/ Root

dev/verify_supabase_integration.py (152 lines, 24 prints) remains at dev/ root.

N11. Unescaped JSON in CI Notification Script

ci/ShellScripts/notify_mattermost.sh:17-26 — interpolates $MESSAGE directly into JSON payload without escaping.

N12. Unbounded _db_cache and _db_locks Growth

dev/core/data.py — dictionaries grow unbounded with no eviction mechanism.

L16-followup. Gemini API Key — RESOLVED 2026-05-28

dev/websocket/gemini.py now raises RuntimeError when GEMINI_API_KEY is unset (was a "no-key-available" fallback). See RESOLVED.

NOT YET

H11. LiteRT/Training Admin Endpoints Missing Admin Auth Check

dev/ml/ml_endpoint.py admin endpoints only use CurrentUser dependency. Any authenticated user can trigger GPU-capable Modal containers.

Blocked on: TradesparkAdminUser needs work before relying on it for these gates.

RESOLVED

Security & Cleanup Sweep (2026-05-28, verified 2026-05-29)

Three commits landed substantial fixes that the earlier 2026-05-29 review erroneously re-listed as open. Re-verified against HEAD:

835108a — secrets / JWT / escaping:

Priority 1 (hardcoded Resend key) — dev/core/notifier.py now reads RESEND_API_KEY from env (via notifier_secret); no key in source. (Ops: rotate the historical key — see Priority 1.)
Priority 1 (JWT placeholder) — dev/users/auth.py:556 reads JWT_INVITE_SECRET from env (via new GrizzlebearInvite Modal secret); "your-super-secret-key" removed.
H22 (_validate_jwt_payload_item_or_throw no-op) — auth.py:656-659 now correctly does if key not in jwt_payload or jwt_payload[key] is None: raise.
M39 / M54 (static_site unescaped HTML) — dev/static_site/endpoint.py now html.escape()s all metadata interpolations in render_post, posts_index, _render_landing, docs_index, doc_page.
N1 (partial) — notifier.py signature bugs fixed (now f-strings). data.py/auth.py/decorators.py instances remain (see Priority 2).

ca76051 — SessionState + input validation:

H21 (SessionState cross-session leak) — dev/livekit_ts/agent/session_state.py:37-38 now initializes rooms and sampled_product_links in __init__ (no shared class-level mutable defaults).
L19 (RPC unbounded payload) — dev/livekit_ts/agent/rpc_handler.py adds _validated_payload() enforcing isinstance(str) + 2 KiB cap, applied to all three RPC handlers.
M32 (WebSocket binary no size limit) — dev/websocket/websocket.py:21 caps per-frame binary at 25 MiB and evicts oldest metadata past 32 entries.
L16 (Gemini dummy-key fallback) — dev/websocket/gemini.py raises RuntimeError if GEMINI_API_KEY unset.

6c66c41 — dead-code cleanup:

N10 / N17 / L24 (unused imports + dead commented code) — ~30 dead imports dropped across 7 livekit agent files, livekit_dashboard, livekit_helpers, websocket/gemini, and core/core.py (from boto3.resources import model gone); livekit_helpers commented blocks removed.
N16 (dead message_queue) — removed from WebSocket PrivateSession.
N18 (deprecated packages) — fernet==1.0.1 and bare google==3.0.0 dropped from requirements.txt.
L23 (== True / == False) — all live comparisons rewritten to truthy form; only commented-out instances remain.
N6 (stale badauth comment) — already removed in a prior sweep.

H24. Bruno Env Files No Longer Ship Credentials (verified 2026-05-29)

bruno/Grizzlebear API Collection/environments/*.bru now contain only endpoints and non-sensitive test IDs (account/project/location UUIDs, plus code, address slug) — no passwords/secrets/tokens. Cleaned up prior to the 2026-05-28 sweep.

Billing Archived (2026-05-18)

dev/billing/ moved to dev/_archived/billing in commit b5ca2b5. Resolves/obsoletes: H16 (Webhook Mark-Before-Handle Race), M48 (Stripe Webhook Cleanup Failure), M49 (current_period_end Required), M52 (Stripe Webhook Secret Logged), M53 (billing.py generic raise).

Verify Scripts Partially Archived (2026-05-18)

dev/verify_task1.py, dev/verify_billing_task1.py, dev/test_billing_import.py moved to dev/_archived/. Only dev/verify_supabase_integration.py remains (see N5).

Devices Archived (2026-05-18)

dev/devices/ archived in commit 79391ac.

`dev/not_yet/` Renamed to `dev/_archived/` (2026-05-18)

Commit b6b23cf. All prior references to dev/not_yet/ should use dev/_archived/.

App Consolidation — 4 Modal Functions → 2 (2026-05-18)

Commits d9a7e9e, 8f10753, 2ab08e4: low_priority_app consolidates voices + geocoding + capture + static_site; user_data_app consolidates users + data. Reduces min_containers from 4 to 1.

M20. Unsafe tarfile.extractall Removed (2026-04-30)

ETag CAS + Per-Key Lock for Race-Safe Location DB Writes (2026-04-26)

Supabase Credentials Moved to Modal Secrets (2026-04-25)

Priority 2. Centralized Logging (Foundation Complete — 2026-04-23/24)

H6. Integration Test Script Typos (2026-04-23)

H7. Duplicate and Unpinned Dependencies (2026-04-23)

M7. Duplicate and Unused Imports (2026-04-23)

L3. README.md Duplicate `## Setup` Header (2026-04-23)

L4. Large `results.json` and `build.log` Committed (2026-04-23)

L9. Worktree Directories in Repo Root (verified 2026-04-24)

H14. Unsafe vLLM Output Indexing (2026-04-22)

H12. Convert-LiteRT Endpoint Allocates GPU for NotImplementedError (2026-04-22)

OBSOLETE

M8-old. CI dev deployment commented out

Deliberate decision, not a gap. Moved to OBSOLETE on 2026-04-21.

H16, M48, M49, M52, M53 — Billing-related items

Moved to DONE — billing module archived 2026-05-18.

Grizzlebear Codebase Improvement Tracker

TOP 2 PRIORITIES

1. Encryption Key Secret Logged in Plaintext (CRITICAL SECURITY — STILL OPEN)

2. Centralized Logging Sweep (IN PROGRESS — complete the migration)

HIGH PRIORITY

H1. Test Coverage for Critical Modules

H2. Bare Exception Handlers Masking Failures

H3. Hardcoded Configuration Values

H4. Wildcard CORS on All 12 Production Endpoints

H5. Missing DEK Encryption at Rest

H8. Fragile Dependency Workaround in ML Training Image

H9. Race Condition in S3 Training Data Logger

H10. Missing Input Validation on ML Request Parameters

H13. Missing Error Handling on Remote Modal Calls

H15. S3 Model Registry Race Condition

H17. sync.py Imports Private core.data Functions

H18. Missing Error Handling on call_provider() in Synthetic Endpoints

H19. main/beta Env Supabase Routing — RESOLVED 2026-05-11

H20. Missing await on Async Cleanup in agent.py

H23. Beta Should Point at Supabase Staging, Not Prod

H25. Deploy Dashboard Exposes Exception Details in Error Responses

H27. CI Webhook Token Comparison Not Timing-Safe

H28. ensure_cicd_fresh.sh Hash Only Covers .py Files

H29. Env Allow-List Drift Across CLI Scripts and Dashboard

N20. build_splat_base GitLab CI Job Has No Docker-Capable Runner

N21. deploy_and_merge_dev Job Can't Push to origin/beta — HTTP Basic Auth Denied

N3. Shell Scripts Missing set -e Error Handling

MEDIUM PRIORITY

M1. Circular Dependency Workarounds

M2. Overly Long Functions and Files

M3. Inconsistent Auth Patterns

M4. Duplicate Supabase Client and S3 Client Creation

M5. Modal Secret Ordering Fragility

M6. Duplicate Model Defaults in ML Gateway

M8. CI Pipeline Gaps

M9. No Timeout on Async Queue in ML Chat Streaming

M10. Missing Startup Configuration Validation

M11. Stale Backlog Documents

M12. No Docker Compose for Local Development

M13. Missing Health Check Endpoints

M14. No Graceful Shutdown Handlers

M15. Triplicated vLLM Generate Code

M16. S3 list_objects_v2 Missing Pagination

M17. Information Disclosure via Error Messages

M18. Quintuplicated vLLM Slot Mapping

M19. Race Condition in Room Agent Management

M21. Triplicated Modal Image Dependency Blocks

M22. No Preflight Gate for Dev Branch Deploys

M23. CI Merge Script Silently Overwrites Conflicts

M24. Excessive HTTP Timeout on vLLM Proxy Requests

M25. Uninitialized CI Variables

M26. No Timeout on Modal Deploy Commands

M27. main_tests_notify.sh Has Shebang After Executable Code

M28. register_external_asset Broad Exception Catch

M29. list_location_assets Generates Presigned URL Per Asset

M30. supabase_queries.py Broad Exception Catches Silently Return Empty Data

M31. decorators.py:20 Unsafe UUID Parsing With No Guard

M33. vLLM Streaming Response Line Splitting

M34. No @modal.exit() Cleanup on vLLM Serving Classes

M35. Training Format Functions Vulnerable to Chat Delimiter Injection

M36. Container Environment Variables Leaking to Modal App Logs

M37. innerHTML XSS in mobile-session.html renderAssets()

M38. tsweb/queries.py Broad Exception Catches on Supabase Queries

M40. Missing Error Handling on tsweb /projects Supabase Queries

M41. Silent Cron Failure in tsweb Nightly Sync

M42. Broad except Exception: on Media Read in sync.py

M43. Race Condition in SparkyAssistant._tasks List

M44. Race Condition in DEK Loading (_ensure_dek)

M45. recorder.py Unbounded Retry Loop With No Backoff or Timeout

M46. websocket_messagepack.py Silently Substitutes Defaults on Parse Failure

M47. room_agent_worker.py Double Cleanup in except + finally

M50. localhost/dockerfile Missing .dockerignore

M51. Large Blocks of Commented-Out Code in livekit_ts/ Agents

M55. traction.py Broad except Exception: on Date Parsing and Supabase Queries

M56. innerHTML With Unescaped e.message in ML Demo Pages

M57. session_to_splat.py Manual Logger Setup Bypasses Centralized Factory

LOW PRIORITY

L17. ABRouter A/B Routing Is Dead/Incomplete

L18. SQL Identifier Interpolation in Schema Migrations

L20. integration-tests.sh Uses set -ex (Echoes Env)

H17. `sync.py` Imports Private `core.data` Functions

H18. Missing Error Handling on `call_provider()` in Synthetic Endpoints

H19. `main`/`beta` Env Supabase Routing — RESOLVED 2026-05-11

H20. Missing `await` on Async Cleanup in agent.py

H28. `ensure_cicd_fresh.sh` Hash Only Covers `.py` Files

N20. `build_splat_base` GitLab CI Job Has No Docker-Capable Runner

N21. `deploy_and_merge_dev` Job Can't Push to `origin/beta` — HTTP Basic Auth Denied

N3. Shell Scripts Missing `set -e` Error Handling

M28. `register_external_asset` Broad Exception Catch

M29. `list_location_assets` Generates Presigned URL Per Asset

M30. `supabase_queries.py` Broad Exception Catches Silently Return Empty Data

M31. `decorators.py:20` Unsafe UUID Parsing With No Guard

M34. No `@modal.exit()` Cleanup on vLLM Serving Classes

M40. Missing Error Handling on tsweb `/projects` Supabase Queries

M42. Broad `except Exception:` on Media Read in `sync.py`

M43. Race Condition in `SparkyAssistant._tasks` List

M44. Race Condition in DEK Loading (`_ensure_dek`)

M50. localhost/dockerfile Missing `.dockerignore`

M55. `traction.py` Broad `except Exception:` on Date Parsing and Supabase Queries

M56. `innerHTML` With Unescaped `e.message` in ML Demo Pages

M57. `session_to_splat.py` Manual Logger Setup Bypasses Centralized Factory

L20. integration-tests.sh Uses `set -ex` (Echoes Env)

L21. integration-tests.sh `cd $cwd` Unquoted

L22. TOCTOU on `remote_participants` in agent.py

L25. Dead `_smoketest_in_modal.py` Still Present After Phase 5

L26. `_escape()` in deploy_dashboard.py Missing Single-Quote Escaping

L27. Unbounded `limit` Query Parameter on Deploy Dashboard History API

N15. `_save_location_db` Deprecated But Still Present

N2. `dev/ml/ml_endpoint.py` CLI Uses Raw print() After Logger Migration

N4. requirements.txt Has Both `google-genai` and `google-generativeai`

`dev/not_yet/` Renamed to `dev/_archived/` (2026-05-18)

L3. README.md Duplicate `## Setup` Header (2026-04-23)

L4. Large `results.json` and `build.log` Committed (2026-04-23)

M8-old. CI dev deployment commented out

H16, M48, M49, M52, M53 — Billing-related items