Grizzlebear Codebase Improvement Tracker
Auto-generated: 2026-05-29 (re-verified against the 2026-05-28 security/cleanup commits) Review scope: Full codebase (dev/, ci/, website/, specs/, localhost/, root config) Previous reviews: 2026-05-29 (earlier same-day run, superseded), 2026-05-27, 2026-05-11, 2026-05-05, 2026-05-04, 2026-05-01, 2026-04-30, 2026-04-29, 2026-04-28, 2026-04-27, 2026-04-26, 2026-04-25, 2026-04-24, 2026-04-23, 2026-04-22, 2026-04-21 (initial), 2026-04-15
Correction note (2026-05-29): the earlier same-day review re-listed items as open that three 2026-05-28 commits had already fixed (
835108asecrets/JWT/escaping,ca76051SessionState/input-validation,6c66c41dead-code cleanup) — a stale-seed regression. This refresh re-verified every CRITICAL/HIGH item against current HEAD and moved the confirmed fixes to RESOLVED. The genuinely-still-open critical is the DEK debug log (now Priority 1).
TOP 2 PRIORITIES
1. Encryption Key Secret Logged in Plaintext (CRITICAL SECURITY — STILL OPEN)
dev/core/data.py:74 — logger.debug(f"dek: {os.environ['ACCOUNT_ENCRYPTION_KEY_SECRET']}") writes the master account encryption key secret to application logs in plaintext. This runs on every save_dek() call (i.e., every time a new Data Encryption Key is saved for an account). Verified still present at HEAD 2026-05-29.
Impact: ACCOUNT_ENCRYPTION_KEY_SECRET is the key-encryption-key (KEK) that protects all per-account DEKs. If application logs are stored, forwarded to a log aggregator, or visible via modal app logs, this key is exposed. Combined with access to the encrypted DEKs in S3, an attacker can decrypt all account data.
Fix:
- Immediately remove the
logger.debug(f"dek: ...")line atdata.py:74. - Audit
modal app logsretention to confirm the key hasn't been captured in log storage. - Also fix line 72:
logger.debug("save_dek", account_id)— the second positional arg is silently dropped (logger signature bug, see Priority 2).
Ops follow-up (carried over from the 2026-05-28 secrets fix): the previously-hardcoded Resend API key (re_FKkPHP8c_…, removed from notifier.py) still lives in git history. Rotate it and consider git-filter-repo to scrub history. Same for any "your-super-secret-key" JWT placeholder occurrences in history.
Resolved parts (2026-05-28, see RESOLVED): hardcoded Resend key and "your-super-secret-key" JWT placeholder are no longer in source — both now read from Modal secrets via env (RESEND_API_KEY / JWT_INVITE_SECRET). CORS wildcards remain open — see H4.
2. Centralized Logging Sweep (IN PROGRESS — complete the migration)
Current state: dev/core/logging_config.py provides the get_logger() factory. Raw print() and direct-logging usage remain concentrated in agent code, ML CLI entrypoints, and ops/diagnostic scripts.
Audit counts (re-run 2026-05-29):
print(— 433 total matches across 39 files (includesdev/_archived/, tests, and diagnostic scripts). Production-path prints are concentrated in the files below.traceback.print_exc()— 44 active in production code (1 commented inlivekit_rpc_agent.py:441, 1 in_archived/).import logging(bypassingget_logger()) — 26 files (excludinglogging_config.pyitself).- bare
except:— 16 active in production (see H2).
Remaining print() concentrations (production paths):
dev/livekit_ts/redis_dump.py(36) — diagnostic script.dev/livekit_ts/agents/livekit_rpc_agent.py(26) — agent code.dev/verify_supabase_integration.py(24) — root-level verify script (see N5).dev/sync_admins.py(22) — admin sync utility.dev/ml/ml_endpoint.py(18) — CLI entrypoints and diagnostics (see N2).dev/images/testing_gemini.py(15),dev/images/stitcher.py(7),dev/images/gemini_segment.py(6).dev/livekit_ts/agents/— livekit_video_agent (8), livekit_elevenlabs_vision_agent (8), livekit_agent_manager_rpc (7), livekit_transcriber_agent (4), livekit_logger_agent (4), livekit_tts_agent (3).dev/scripts/build_ecr_base_on_ec2.py(6) — ops tooling.dev/static_site/preview_dashboard.py(5) — NEW local mock-preview tool (added in the deploy-dashboard redesign); dev-only.
Logger signature bugs: the logger.debug("label", value) anti-pattern (extra positional args silently dropped) — 27 instances in production code (recount 2026-05-29):
dev/core/data.py:72,89,132,161,175,190,200,223,244,255,265,274— 12 instances.dev/users/auth.py:339,407,423,439,451,578,587,595,664,703,711,809,883— 13 instances (line numbers drifted from the prior review; count up from 11).dev/core/decorators.py:18,74— 2 instances passingargs, kwargsas positional args.- (notifier.py's 2 instances were fixed 2026-05-28 — now f-strings.)
Misuse at dev/users/users_endpoint.py:271: logger.debug(tradespark_http_exception, traceback.print_exc()) — calls traceback.print_exc() as a side effect (prints to stderr), then passes its return value (None) as a format argument. Double-logs via stderr + logger; the logger message gets no useful info.
traceback.print_exc() in agent code: 44 active occurrences across dev/livekit_ts/agents/ (all 8 agent files), dev/livekit_ts/livekit_endpoint.py:93,168, and dev/users/users_endpoint.py:271. These bypass the logging framework and write to stderr.
26 modules still use import logging directly (reconfirmed 2026-05-29): dev/data/sync.py, dev/websocket/ (2: websocket.py, websocket_messagepack.py), dev/ml/ (17 files — count up from 14: training/{trainer,model_registry,dataset_builder}, serving/{vllm_server,ab_router}, mobile/{quantize,litert}, gateway/{providers,logger,gateway,chat}, eval/{metrics,llm_judge,eval_runner,benchmarks}, data_pipeline/{synthetic_generator,converters}), dev/livekit_ts/agent/image_segmentation.py, 3 tsweb files (location.py, queries.py, scraper.py), and 2 queues files (session_to_splat.py, session_to_splat.video_3d_reconstruction.py).
Note: dev/queues/session_to_splat.py also has a manual logger-setup block duplicating the centralized factory (see M57).
Fix:
- Fix all
logger.debug("label", value)signatures across data.py, auth.py, decorators.py — use f-strings or%splaceholders. - Replace
traceback.print_exc()withlogger.exception()in agent files and users_endpoint.py. - Migrate the remaining 26
import loggingmodules tofrom core.logging_config import get_logger. - Add a lint rule (ruff
T201/T203) to prevent print() regression.
HIGH PRIORITY
H1. Test Coverage for Critical Modules
80%+ of backend modules have zero tests. Missing coverage:
model_proxy/(2 files)websocket/(multiple agent/handler files)livekit_ts/(20+ files including agent logic)capture/,geocoding/,voices/data/(2 endpoint files, minimal coverage)ml/(entire pipeline: gateway, eval, training, serving, mobile — zero tests)
Website has only 1 partial E2E test (test.auth.ts) with multiple FIXMEs blocking execution.
Impact: Regressions go undetected; refactoring is risky without a safety net.
Note (from feedback memory): pytest is owned by the account-system dev; ml/general tests go in Bruno or preflight scripts, not pytest. This item scopes to the account-system pytest suite only.
H2. Bare Exception Handlers Masking Failures
16 bare except: blocks remain in owned production code (excluding dev/_archived/):
dev/livekit_ts/livekit_server.py:69,77dev/livekit_ts/redis_dump.py:132dev/livekit_ts/agents/— 13 across all 8 agent files (livekit_rpc_agent.py:342, livekit_transcriber_agent.py:260,340, livekit_elevenlabs_vision_agent.py:723,798, livekit_agents.py:152, livekit_video_agent.py:591, livekit_logger_agent.py:263, livekit_basic_agent.py:218,294, livekit_tts_agent.py:344,420, livekit_agent_manager_rpc.py:307)
These catch KeyboardInterrupt, SystemExit, and asyncio.CancelledError — all of which should propagate.
Fix: Start with dev/livekit_ts/agents/. Replace bare except: with except Exception as e:; prefer specific exception types for known failure modes.
H3. Hardcoded Configuration Values
S3 bucket names, AWS regions, and URLs are scattered as string literals:
dev/core/data.py:24-30— 7 hardcoded bucket namesdev/core/data.py:67—region_name='us-west-2'with FIXME commentdev/core/core.py:120— Full S3 URL with date suffix baked indev/core/core.py:474-500— S3 mount definitions with hardcoded bucket names including dates (e.g.,"modal-config-120125")dev/deploy.sh:21— Hardcoded AWS profile"tradespark"dev/ml/training/model_registry.py:34-35— Hardcoded bucket name and Modal Volume pathdev/ml/gateway/providers.py:287— Hardcoded vLLM slot setdev/tsweb/scraper.py:33— Hardcoded default bucket"tradespark-ml-datasets"and region"us-west-2"(line 39)
Fix: Consolidate into a single config module or environment variables with startup validation.
H4. Wildcard CORS on All 12 Production Endpoints
Every FastAPI app in production code allows allow_origins=["*"] (12 files, reconfirmed 2026-05-29):
data_endpoint.py:47,websocket_endpoint.py:26,users_endpoint.py:79(has TODO: restrict in prod),capture_endpoint.py:39,ml_endpoint.py:51,model_proxy_endpoint.py:67,geocoding_endpoint.py:30,livekit_endpoint.py:37,livekit_dashboard.py:127,voices_endpoint.py:30,static_site/endpoint.py:94,tsweb/endpoint.py:45.
Fix: Create a shared CORS config function that reads allowed origins from environment; restrict to known domains in production.
H5. Missing DEK Encryption at Rest
dev/core/dek_store.py:20 still stores Data Encryption Keys in plaintext memory with TODO comments about encrypting with a backend key. The global deks_by_account dictionary is also not thread-safe for concurrent Modal requests, has no TTL/expiration, and no access control.
H8. Fragile Dependency Workaround in ML Training Image
dev/core/core.py:726 uses .run_commands("pip install 'transformers>=5.5.0'") to force-upgrade transformers after vLLM 0.19.0 pins transformers<5. This is a fragile workaround that will break unpredictably when either package updates.
H9. Race Condition in S3 Training Data Logger
dev/ml/gateway/logger.py appends to daily JSONL files in S3 via download-append-reupload. With @modal.concurrent(max_inputs=100) on the ML endpoint, concurrent requests can overwrite each other's log entries, silently losing training data.
Extended finding: logger.py:88-89 catches ALL exceptions (not just NoSuchKey) and silently sets body = "". If S3 returns AccessDenied or a network timeout, existing log data is overwritten with just the new line. Line 86 correctly catches NoSuchKey for the "new file" case, but the except Exception: at line 88 masks real errors.
Also: dev/ml/data_pipeline/supabase_scraper.py:44 has the same pattern — except Exception: return "2020-01-01T00:00:00Z" silently resets the sync timestamp on ANY error, causing full re-processing of all data.
Also: dev/tsweb/scraper.py replicates both issues:
- Line 48:
_get_last_sync_timestamp()catchesexcept Exception:and returns"2020-01-01T00:00:00Z". - Lines 243-247: download-append-reupload on daily JSONL files with
except Exception: body = "".
Fix: Switch to per-request S3 keys (e.g., {prefix}/{date}/{uuid}.jsonl) or use S3 append-only semantics. Catch only NoSuchKey/404 for the "doesn't exist yet" path; log and re-raise other exceptions.
H10. Missing Input Validation on ML Request Parameters
Multiple ML endpoint request models lack bounds validation (line numbers drifted — ml_endpoint.py grew to 2,333 lines):
ServeRequest.max_tokens(ml_endpoint.py:998) — no upper bound; users can request arbitrarily large generations.ServeRequest.temperature(ml_endpoint.py:997) — no bounds (should be 0.0–2.0).ServeRequest.model_slot— barestrwith no enum constraint; invalid slots fail insidegenerate()instead of at request validation time.ChatStreamRequest.messages(chat.py) — no limit on list length or content size.GatewayRequest.system_prompt/user_input(gateway.py) — no max_length.
Impact: DoS via resource exhaustion or excessive LLM API costs from unbounded requests.
H13. Missing Error Handling on Remote Modal Calls
ML endpoints invoke .remote.aio() with no try/except (e.g. modal_download_base_models, modal_train_model, modal_ingest_litert, modal_convert_litert). Modal timeout, compute, or network errors propagate as unstructured 500 responses with no useful error message.
Fix: Wrap each in try/except to return structured error responses with the Modal error context.
H15. S3 Model Registry Race Condition
dev/ml/training/model_registry.py:257-263 — register_version() does load → append → save on registry.json in S3 with no locking or conditional write. If two Modal functions register versions simultaneously, one silently overwrites the other's entry.
Fix: Use S3 conditional writes (If-None-Match / If-Match ETags) for optimistic locking, or use DynamoDB for atomic registry updates.
H17. sync.py Imports Private core.data Functions
dev/data/sync.py:82-83,184 imports _put_blob, _asset_s3_key, and _get_blob — underscore-prefixed internal functions from core.data. These are implementation details that may change. The sync module is the only external consumer.
Fix: Either promote _put_blob/_get_blob/_asset_s3_key to public API, or refactor sync.py to use the public store() function.
H18. Missing Error Handling on call_provider() in Synthetic Endpoints
dev/ml/ml_endpoint.py — auto_answer_questions calls call_provider() with no try/except. If the LLM provider returns an error, the raw exception propagates as an unstructured 500. The subsequent json.loads(result.response_text) also assumes result is not None.
Fix: Wrap call_provider() in try/except; validate result before accessing .response_text. Return a structured 502 error on provider failure.
H19. main/beta Env Supabase Routing — RESOLVED 2026-05-11
Resolution summary: Self-managed anon login sidestepped the prod Supabase anonymous sign-in blocker. dev/core/core.py:429-430 now routes main/beta → SupabaseProd, everything else → SupabaseDev.
Still required for the website:
- [ ] Provision the prod Supabase URL + publishable anon key in
website/scripts/supabase.ts(PROD_SUPABASEis currently empty with a TODO). - [ ] Confirm
Modal secret SupabaseProdin themainModal environment points at the new prod project URL/keys before the firstmaindeploy. - [ ] Seed the canary fixtures (
test_project_id,test_location_id) in the prod project, OR updatebruno/.../environments/main.bru. - [ ] Run
cd dev && ./test_app.sh --env mainafter the first prod deploy.
H20. Missing await on Async Cleanup in agent.py
dev/livekit_ts/agent/agent.py — _create_video_stream() calls session_state.video_stream.aclose() without await (author flagged with # FIXME Not async.). Without await, the async generator is never properly closed; the underlying RTC video stream remains open and accumulates frame buffers until the session ends.
Fix: Make _create_video_stream() async and await session_state.video_stream.aclose(). Wrap in asyncio.create_task(...) if the callback signature requires sync.
H23. Beta Should Point at Supabase Staging, Not Prod
dev/core/core.py:429-430 currently routes both main and beta to the same SupabaseProd Modal secret. Long-term, beta should talk to a separate SupabaseStaging project.
Work to do:
- [ ] Create a new
SupabaseStagingproject in Supabase. - [ ] Provision a
SupabaseStagingModal secret. - [ ] Update routing to use a dict lookup (
main→Prod,beta→Staging, else→Dev). - [ ] Seed canary fixtures in SupabaseStaging.
H25. Deploy Dashboard Exposes Exception Details in Error Responses
dev/static_site/deploy_dashboard.py — multiple occurrences of type(e).__name__: {e} in HTTP error responses or return values (e.g. raise HTTPException(502, f"history fetch failed: {type(e).__name__}: {e}")). (Line numbers shift frequently — the dashboard sees near-daily commits; grep type(e).__name__ to locate.)
While these endpoints are admin-gated (TradesparkEmailAdmin), the raw exception strings can reveal internal implementation details — Modal function names, class paths, connection strings, or timeout details.
Risk: Medium (admin-only), but follows the same information-disclosure anti-pattern as M17.
Fix: Log the full exception at ERROR level (already done via logger.exception), but return a generic user-facing message.
H27. CI Webhook Token Comparison Not Timing-Safe
ci/webhook.py:453,486,499 — x_gitlab_token != expected and x_approve_token != expected (now 3 comparison sites; line numbers drifted from the prior 244/261) use Python's != operator, which short-circuits on the first differing byte. This enables timing side-channel attacks to guess the token character by character. The webhook is internet-facing and the token is the sole auth mechanism.
Fix: Use hmac.compare_digest(...) from the stdlib for all three comparisons.
H28. ensure_cicd_fresh.sh Hash Only Covers .py Files
ci/ShellScripts/ensure_cicd_fresh.sh:47 — The staleness hash is sha256sum ci/*.py | sha256sum. Changes to dev/requirements.txt, Dockerfiles used by CI, or non-Python CI configs don't change the hash. The CI workers won't be redeployed, and the next CI run will fail with an import error until someone manually runs modal deploy -e cicd.
Fix: Include dev/requirements.txt and relevant Dockerfiles in the hash.
H29. Env Allow-List Drift Across CLI Scripts and Dashboard
(Renumbered from a second, duplicated "H26" header.)
dev/stop_app.sh:13 defines SUPPORTED_ENVS=(main beta dev jh rk cc) — missing fl (which deploy.sh and the dashboard support). Neither stop_app.sh nor deploy.sh includes aw or jc, but the deploy dashboard's DEV_ENV_OPTIONS and STOPPABLE_ENVS (deploy_dashboard.py:46,55) do. Additionally, the dashboard's ALLOWED_ENVS (for deploy triggers) doesn't include aw/jc, so the UI shows a Deploy button for those envs that always returns HTTP 400.
Note: the root Justfile _ENVS := "main beta dev jh rk cc fl" is yet another copy of the allow-list.
Impact: Devs with fl/aw/jc envs get inconsistent tooling. Broken UX and confusion during incident response.
Fix: Define a single ALL_ENVS / STOPPABLE_ENVS set in one place (e.g., core/core.py or a shared config), and have deploy.sh, stop_app.sh, deploy_dashboard.py, and the Justfile all derive from it.
N20. build_splat_base GitLab CI Job Has No Docker-Capable Runner
The build_splat_base job in .gitlab-ci.yml needs Docker to build dev/queues/dockerfile.session_to_splat.base, but every push routes it to proxmox-runner-1 which is a shell executor (not docker-in-docker). Result: the job fails immediately with apk: command not found because the shell executor ignores the image: docker:24 + services: docker:24-dind declarations.
Workaround in place (2026-05-16): the job's auto-trigger on Dockerfile changes was removed; it's now when: manual + allow_failure: true. The actual splat base build is done via dev/queues/build_splat_base_on_ec2.py, which spins up a temporary m5.xlarge with real Docker, builds + pushes to ECR, and tears down. Manual but reliable.
Proper fix:
- Register a Docker-executor GitLab runner tagged something like
docker-dind. - Add
tags: [docker-dind]to thebuild_splat_basejob. - Restore the auto-trigger rule (
changes:→when: on_success, fallbackwhen: manual+allow_failure: true). - Verify by changing the Dockerfile and pushing.
Alternative: replace the CI job with an AWS CodeBuild project triggered by a GitLab webhook.
N21. deploy_and_merge_dev Job Can't Push to origin/beta — HTTP Basic Auth Denied
GitLab CI pipeline #11147 on the dev branch ran ci/ShellScripts/merge_branches.sh dev beta …. The local git merge inside the runner succeeded cleanly, but the final git push origin beta failed with remote: HTTP Basic: Access denied. The CI runner's persisted git credential has expired or been rotated. dev → beta auto-merge is silently blocked.
Fix:
- Inspect the runner / job's git-push credential (ci/ShellScripts/merge_branches.sh).
- Rotate: create a fresh GitLab project access token with
write_repositoryscope, store as a masked CI/CD variable, use it in the push URL. - Retry the failed
deploy_and_merge_devjob (or push a new commit).
Same auth pattern is used by merge_beta_to_main — that path will hit the same wall when it next runs.
N3. Shell Scripts Missing set -e Error Handling
dev/deploy.sh and dev/stop_app.sh have no set -e, set -u, or set -o pipefail at top. If any step fails, the script continues, potentially deploying partial/broken state or silently failing to stop apps.
Fix: Add set -euo pipefail at top of both scripts, then fix any commands that relied on silent failure.
MEDIUM PRIORITY
M1. Circular Dependency Workarounds
dev/core/core.py:173 uses local imports (from users.auth import bind_user_token) to avoid circular dependencies between core and users. This hides architectural coupling. Consider extracting shared interfaces into a separate module.
M2. Overly Long Functions and Files
dev/ml/ml_endpoint.py— 2,333 lines (was ~1,256 on 2026-05-29; nearly doubled). Spans FastAPI routes, Modal class definitions, CLI entrypoints, diagnostics, and four near-identical vLLM serving classes. The single largest file in the codebase — strong candidate for splitting (routes vs Modal serving classes vs CLI).dev/core/data.py— 1,357 lines.dev/users/auth.py— 1,194 lines, mixed auth logic, database ops, and email handling.ci/_git_in_modal.py— 1,023 lines — the CI git helper has grown into a monolith (merge-and-push, fetch-history, get-env-states, get-active-dispatches, get-branch-tips, last-test-results, record-event). Consider splitting by concern.dev/livekit_ts/agent/agent.py—entrypoint()packs nested class definitions (EventEmitter,SparkyAssistant) inside the function, hampering testing/reuse (file 631 lines).dev/livekit_ts/agents/livekit_elevenlabs_vision_agent.py— 816 lines.
M3. Inconsistent Auth Patterns
Three different auth checking mechanisms coexist:
- Dependency-based:
dev/core/core.py:215-263 - Decorator-based:
dev/core/decorators.py:52-71 - Service-layer:
dev/users/auth.py
No single canonical pattern is documented or enforced.
M4. Duplicate Supabase Client and S3 Client Creation
Supabase clients created independently in:
dev/core/db.py:19-39(cached via@lru_cache(maxsize=1)— not thread-safe for Modal concurrency)dev/users/auth.py:80(direct)
S3/boto3 clients: get_s3_sync_client() in data.py creates a new boto3 client on every call — 24 call sites in data.py alone. dev/tsweb/scraper.py:38 adds yet another independent boto3.client("s3") factory.
Fix: Cache the S3 client per container (module-level singleton or @lru_cache). Consolidate Supabase into a single factory in core/db.py.
M5. Modal Secret Ordering Fragility
dev/core/core.py:465-466 documents that supabase_secret MUST be last in the secrets list because Modal merges left-to-right and the TradeSpark secret contains a stale SUPABASE_URL. This implicit ordering is a maintenance trap.
M6. Duplicate Model Defaults in ML Gateway
Model defaults defined in three places with risk of inconsistency:
dev/ml/gateway/gateway.py:85-90dev/ml/gateway/chat.py:74-79dev/ml/gateway/providers.py:37,74,121(function parameter defaults)
All hardcode model names (e.g., "claude-sonnet-4-6", "gemini-2.5-flash"). Should be a single config source.
M8. CI Pipeline Gaps
- No GitLab CI cache configuration (each job reinstalls dependencies).
pip install modal --break-system-packagesrisks contaminating system Python. Same pattern forboto3 pyyaml.- No post-deployment smoke tests or rollback procedure.
- No CI stages for website building/testing.
M9. No Timeout on Async Queue in ML Chat Streaming
dev/ml/gateway/chat.py:205 — await queue.get() has no timeout. If a provider stops sending events, the endpoint hangs indefinitely.
Extended: Task cancellation calls t.cancel() but doesn't await the tasks. Cancelled tasks may still be executing when the StreamingResponse context exits, leaking connections.
M10. Missing Startup Configuration Validation
No checks at application startup to ensure all required environment variables and secrets are present. MODAL_ENVIRONMENT silently defaults to "dev" (core.py:24). A dev/.env.template exists but nothing validates env-var presence at startup.
M11. Stale Backlog Documents
specs/WEEK_1_BACKLOG.md and specs/WEEK_2_BACKLOG.md are 140+ days old (Jan 8, 2026) with no indication of completion status. Should be archived or updated.
M12. No Docker Compose for Local Development
Tests require a local PostgreSQL at postgresql://postgres:testpass@localhost:5432/test_grizzlebear but there is no docker-compose.yml or setup instructions for reproducibility.
M13. Missing Health Check Endpoints
Six top-level endpoints have /health. Missing from:
dev/users/users_endpoint.pydev/geocoding/geocoding_endpoint.pydev/livekit_ts/livekit_endpoint.pydev/voices/voices_endpoint.pydev/livekit_ts/livekit_dashboard.py
M14. No Graceful Shutdown Handlers
No FastAPI app uses lifespan context managers or shutdown handlers. In-flight requests may be abruptly terminated when Modal containers scale down.
M15. Triplicated vLLM Generate Code
dev/ml/ml_endpoint.py:2035,2081,2125,2169 — the vLLM serving classes each have identical ~10-line generate() methods. A _make_vllm_generate() factory exists earlier in the file but is never used — dead code. (Line numbers updated 2026-05-29 after the file grew to 2,333 lines.)
M16. S3 list_objects_v2 Missing Pagination
Five locations call list_objects_v2 without handling pagination (S3 returns max 1000 keys per response):
dev/capture/capture.py:185,238dev/ml/eval/eval_runner.py:193dev/ml/ml_endpoint.pydev/ml/data_pipeline/converters.py:127
M17. Information Disclosure via Error Messages
Several ML endpoints return raw Python exception strings to clients:
dev/ml/gateway/gateway.py:105—detail=f"LLM provider error: {str(e)}"dev/ml/gateway/chat.py:182—"error": str(e)in SSE eventsdev/ml/gateway/providers.py:346—yield f"[vLLM error for {slot}:{version_str}: {e}]"
See also H25 (deploy dashboard) for the same pattern.
M18. Quintuplicated vLLM Slot Mapping
Model slot membership is maintained in five separate locations that must be kept in sync:
dev/ml/gateway/providers.py:287—_has_vllm_cls()hardcoded setdev/ml/gateway/providers.py:325-329—cls_mapdictdev/ml/serving/vllm_server.py:48-52— identicalcls_mapdictdev/ml/gateway/chat.py:82—_MODEL_SLOT_NAMESsetdev/core/model_versions.py:20-24—MODEL_VERSIONSdict keys
M19. Race Condition in Room Agent Management
dev/livekit_ts/agent/room_agent_worker.py:61-65,116,125,132 — check-then-act pattern on shared core.data.active_room_agents dict without locking. The code has a FIXME at line 60 acknowledging this.
M21. Triplicated Modal Image Dependency Blocks
Four Modal image definitions in dev/core/core.py independently specify their pip dependencies. Core deps are copy-pasted across all four.
Fix: Extract a CORE_DEPS list and compose image deps as CORE_DEPS + image_specific_deps.
M22. No Preflight Gate for Dev Branch Deploys
The deploy_and_merge_dev CI stage skips the preflight_manifest.py check and immediately merges dev → beta.
M23. CI Merge Script Silently Overwrites Conflicts
ci/ShellScripts/merge_branches.sh:28-29 — git merge --no-ff -X theirs always takes the source branch version on conflict.
M24. Excessive HTTP Timeout on vLLM Proxy Requests
dev/ml/gateway/providers.py:391 — 10-minute timeout for the vLLM endpoint.
M25. Uninitialized CI Variables
.gitlab-ci.yml:191 — $MATTERMOST_WEBHOOK never declared or documented. main_tests_notify.sh:40 references undefined $MODAL_URL.
M26. No Timeout on Modal Deploy Commands
ci/ShellScripts/deploy_modal.sh:35 and dev/deploy.sh:295 — modal deploy commands have no timeout.
M27. main_tests_notify.sh Has Shebang After Executable Code
ci/ShellScripts/main_tests_notify.sh:1-8 — Lines 1-6 execute before the #!/bin/bash shebang on line 7.
M28. register_external_asset Broad Exception Catch
dev/core/data.py:1313 — except Exception: on s3.head_object() silently sets file_size=0.
M29. list_location_assets Generates Presigned URL Per Asset
dev/core/data.py:1252 — one S3 presigned URL per asset, each creating a new boto3 client.
M30. supabase_queries.py Broad Exception Catches Silently Return Empty Data
dev/core/supabase_queries.py catches except Exception as e: and returns empty results.
M31. decorators.py:20 Unsafe UUID Parsing With No Guard
dev/core/decorators.py:20 — account_id = uuid.UUID(args[0], version=4) without try/except.
M33. vLLM Streaming Response Line Splitting
dev/ml/gateway/providers.py:398 — byte-level iteration doesn't guarantee complete SSE lines.
M34. No @modal.exit() Cleanup on vLLM Serving Classes
dev/ml/ml_endpoint.py — the vLLM serving classes define @modal.enter() but no @modal.exit() handlers. GPU memory and pending requests are abandoned without cleanup on scale-down.
M35. Training Format Functions Vulnerable to Chat Delimiter Injection
dev/ml/data_pipeline/converters.py:35-63 — format_gemma(), format_chatml(), and format_llama() interpolate user-supplied fields directly into template strings containing chat delimiters.
M36. Container Environment Variables Leaking to Modal App Logs
While running modal app logs, the log stream emits KEY=value lines exposing live credentials. Source still unknown (no print(os.environ) in current source). Closely related to Priority 1 (DEK log) — both leak secrets to modal app logs.
M37. innerHTML XSS in mobile-session.html renderAssets()
dev/static_site/templates/demos/mobile-session.html:336 — li.innerHTML with unescaped API response data (a.fileName, a.category, a.download_url). The same file has an _escapeHtml() helper at line 460 that isn't used in renderAssets().
M38. tsweb/queries.py Broad Exception Catches on Supabase Queries
dev/tsweb/queries.py catches except Exception as e: on Supabase query calls and returns empty results.
M40. Missing Error Handling on tsweb /projects Supabase Queries
dev/tsweb/endpoint.py:103-105,112-114 — Two Supabase queries with no try/except.
M41. Silent Cron Failure in tsweb Nightly Sync
dev/tsweb/scheduled.py:22-28 — nightly_supabase_sync() calls sync_new_projects() with no try/except.
M42. Broad except Exception: on Media Read in sync.py
dev/data/sync.py:206 — except Exception as e: on _get_blob() call silently logs a warning and continues.
M43. Race Condition in SparkyAssistant._tasks List
dev/livekit_ts/agent/agent.py — self._tasks: list[asyncio.Task] mutated from multiple async callbacks without a lock.
M44. Race Condition in DEK Loading (_ensure_dek)
dev/core/data.py — _ensure_dek() does check-then-load on the global deks_by_account dict without a lock.
M45. recorder.py Unbounded Retry Loop With No Backoff or Timeout
dev/livekit_ts/agent/recorder.py:45-57 — wait_for_recorder() busy-loops on the recorder URL with a fixed 0.5s sleep and no maximum retries.
M46. websocket_messagepack.py Silently Substitutes Defaults on Parse Failure
dev/websocket/websocket_messagepack.py:28-29 — On msgpack.unpackb() failure, silently substitutes a fake default tuple. Exception variable e is captured but never logged.
M47. room_agent_worker.py Double Cleanup in except + finally
dev/livekit_ts/agent/room_agent_worker.py:114-131 — Both except asyncio.CancelledError: and finally: perform the same cleanup operations.
M50. localhost/dockerfile Missing .dockerignore
localhost/dockerfile:20 — ADD ./ /opt/app/tradespark/ copies the entire build context. No .dockerignore exists, so the image includes .git/, .env, dev/.venv/, etc.
M51. Large Blocks of Commented-Out Code in livekit_ts/ Agents
Three agent files contain 20-80+ line blocks of commented-out code.
M55. traction.py Broad except Exception: on Date Parsing and Supabase Queries
dev/users/traction.py has 3 broad exception catches.
M56. innerHTML With Unescaped e.message in ML Demo Pages
dev/static_site/templates/demos/ml-training.html:395 and ml-eval.html:945 — error messages interpolated via innerHTML without escaping.
M57. session_to_splat.py Manual Logger Setup Bypasses Centralized Factory
dev/queues/session_to_splat.py:9-14 — sets up its own logging.StreamHandler with a custom format and DEBUG level, duplicating the centralized logging_config.py factory.
Fix: Replace with from core.logging_config import get_logger; logger = get_logger("queues.session_to_splat").
LOW PRIORITY
L17. ABRouter A/B Routing Is Dead/Incomplete
dev/ml/serving/ab_router.py:26-32 — hardcoded split=1.0 makes routing inert.
L18. SQL Identifier Interpolation in Schema Migrations
dev/core/data.py:670 — f-string DDL from trusted-only _MIGRATION_COLUMNS constant.
L20. integration-tests.sh Uses set -ex (Echoes Env)
dev/integration-tests.sh:19 — -x xtrace echoes env vars to CI logs.
L21. integration-tests.sh cd $cwd Unquoted
dev/integration-tests.sh:130 — cd $cwd without quotes.
L22. TOCTOU on remote_participants in agent.py
dev/livekit_ts/agent/agent.py:209-210 — check-then-access on a dict mutated by the event loop.
L25. Dead _smoketest_in_modal.py Still Present After Phase 5
ci/_smoketest_in_modal.py:14 — The file's own docstring says "Delete this file after Phase 3 (deploy_in_modal.py) is green." Dead code that also uses the old add_local_dir pattern.
Fix: Delete ci/_smoketest_in_modal.py.
L26. _escape() in deploy_dashboard.py Missing Single-Quote Escaping
dev/static_site/deploy_dashboard.py:108-119 — The custom _escape() handles & < > " but not '. Currently safe because server-side rendering uses double-quoted attributes, but fragile.
Fix: Replace with html.escape(s, quote=True) from stdlib.
L27. Unbounded limit Query Parameter on Deploy Dashboard History API
dev/static_site/deploy_dashboard.py:150 — limit: int = 100 has no upper bound. A caller can pass limit=999999999, loading the entire CI-history JSONL in one response.
Fix: Use Query(default=100, ge=1, le=500).
L1. Missing Type Hints on Key Functions
Functions in dev/users/auth.py and dev/core/core.py lack complete type annotations.
L2. Missing Docstrings on Public APIs
30+ public functions across core/, users/, livekit_ts/ have no docstrings.
L5. Broken Email HTML Links
dev/core/notifier.py:21 has # FIXME HTML links not working.
L6. Localhost Dockerfile Runs as Root
localhost/dockerfile has no non-root user defined. Also installs unused packages.
L7. Website SRI Integrity Checks Broken
website/build-scripts/build-html.ts FIXME noting SRI integrity checks are broken.
L8. 100+ TODO/FIXME Comments Need Triage
Scattered across the codebase. Notable concentrations: dev/users/auth.py, dev/livekit_ts/livekit_endpoint.py, dev/core/core.py.
L10. Postman Collection Alongside Bruno
Both postman_collection.json (root) and bruno/ exist.
L11. Dead Code in crypto.py
dev/core/crypto.py:12-74 — ~60 lines of commented-out encryption functions.
L12. Inconsistent Error Handling in CI Scripts
Mixed set -e / set -ex / no set -e across scripts.
L13. Unsafe Dictionary Access in ML Serving
dev/ml/gateway/providers.py:350 — safe but yields "unknown" rather than an error.
L14. No Rate Limiting on Any Endpoint
No rate limiting on any FastAPI endpoint. ML gateway proxies to paid LLM APIs.
L15. LLM Prompt Injection via f-String User Input
dev/ml/ml_endpoint.py:152-162 interpolates user-supplied input directly into an LLM system prompt.
N8. Typo "settting" in Test Fixture Log Message
dev/tests/conftest.py:609 — triple-t typo.
N13. Stale NVM Version in dockerfile.modal
dockerfile.modal:91 installs NVM v0.39.1 via the deprecated creationix URL.
N14. Missing JSON Parse Guard in CI Report Generator
ci/generate-html-report.js:8-13 — JSON.parse with no try/catch.
N15. _save_location_db Deprecated But Still Present
dev/core/data.py:691-714 — Deprecated write-path helper marked for removal.
N2. dev/ml/ml_endpoint.py CLI Uses Raw print() After Logger Migration
18 raw print() calls remain in the CLI entrypoint and diagnostic blocks.
N4. requirements.txt Has Both google-genai and google-generativeai
dev/requirements.txt:36-37 — google-genai==1.32.0 and google-generativeai==0.8.5, two different Google packages with overlapping functionality. (The bare google==3.0.0 meta-package was removed 2026-05-28 — see N18 in RESOLVED.)
N5. Verification Scripts at dev/ Root
dev/verify_supabase_integration.py (152 lines, 24 prints) remains at dev/ root.
N11. Unescaped JSON in CI Notification Script
ci/ShellScripts/notify_mattermost.sh:17-26 — interpolates $MESSAGE directly into JSON payload without escaping.
N12. Unbounded _db_cache and _db_locks Growth
dev/core/data.py — dictionaries grow unbounded with no eviction mechanism.
L16-followup. Gemini API Key — RESOLVED 2026-05-28
dev/websocket/gemini.py now raises RuntimeError when GEMINI_API_KEY is unset (was a "no-key-available" fallback). See RESOLVED.
NOT YET
H11. LiteRT/Training Admin Endpoints Missing Admin Auth Check
dev/ml/ml_endpoint.py admin endpoints only use CurrentUser dependency. Any authenticated user can trigger GPU-capable Modal containers.
Blocked on: TradesparkAdminUser needs work before relying on it for these gates.
RESOLVED
Security & Cleanup Sweep (2026-05-28, verified 2026-05-29)
Three commits landed substantial fixes that the earlier 2026-05-29 review erroneously re-listed as open. Re-verified against HEAD:
835108a — secrets / JWT / escaping:
- Priority 1 (hardcoded Resend key) —
dev/core/notifier.pynow readsRESEND_API_KEYfrom env (vianotifier_secret); no key in source. (Ops: rotate the historical key — see Priority 1.) - Priority 1 (JWT placeholder) —
dev/users/auth.py:556readsJWT_INVITE_SECRETfrom env (via newGrizzlebearInviteModal secret);"your-super-secret-key"removed. - H22 (
_validate_jwt_payload_item_or_throwno-op) —auth.py:656-659now correctly doesif key not in jwt_payload or jwt_payload[key] is None: raise. - M39 / M54 (static_site unescaped HTML) —
dev/static_site/endpoint.pynowhtml.escape()s all metadata interpolations inrender_post,posts_index,_render_landing,docs_index,doc_page. - N1 (partial) —
notifier.pysignature bugs fixed (now f-strings). data.py/auth.py/decorators.py instances remain (see Priority 2).
ca76051 — SessionState + input validation:
- H21 (SessionState cross-session leak) —
dev/livekit_ts/agent/session_state.py:37-38now initializesroomsandsampled_product_linksin__init__(no shared class-level mutable defaults). - L19 (RPC unbounded payload) —
dev/livekit_ts/agent/rpc_handler.pyadds_validated_payload()enforcingisinstance(str)+ 2 KiB cap, applied to all three RPC handlers. - M32 (WebSocket binary no size limit) —
dev/websocket/websocket.py:21caps per-frame binary at 25 MiB and evicts oldest metadata past 32 entries. - L16 (Gemini dummy-key fallback) —
dev/websocket/gemini.pyraisesRuntimeErrorifGEMINI_API_KEYunset.
6c66c41 — dead-code cleanup:
- N10 / N17 / L24 (unused imports + dead commented code) — ~30 dead imports dropped across 7 livekit agent files, livekit_dashboard, livekit_helpers, websocket/gemini, and
core/core.py(from boto3.resources import modelgone); livekit_helpers commented blocks removed. - N16 (dead
message_queue) — removed from WebSocket PrivateSession. - N18 (deprecated packages) —
fernet==1.0.1and baregoogle==3.0.0dropped from requirements.txt. - L23 (
== True/== False) — all live comparisons rewritten to truthy form; only commented-out instances remain. - N6 (stale
badauthcomment) — already removed in a prior sweep.
H24. Bruno Env Files No Longer Ship Credentials (verified 2026-05-29)
bruno/Grizzlebear API Collection/environments/*.bru now contain only endpoints and non-sensitive test IDs (account/project/location UUIDs, plus code, address slug) — no passwords/secrets/tokens. Cleaned up prior to the 2026-05-28 sweep.
Billing Archived (2026-05-18)
dev/billing/ moved to dev/_archived/billing in commit b5ca2b5. Resolves/obsoletes: H16 (Webhook Mark-Before-Handle Race), M48 (Stripe Webhook Cleanup Failure), M49 (current_period_end Required), M52 (Stripe Webhook Secret Logged), M53 (billing.py generic raise).
Verify Scripts Partially Archived (2026-05-18)
dev/verify_task1.py, dev/verify_billing_task1.py, dev/test_billing_import.py moved to dev/_archived/. Only dev/verify_supabase_integration.py remains (see N5).
Devices Archived (2026-05-18)
dev/devices/ archived in commit 79391ac.
dev/not_yet/ Renamed to dev/_archived/ (2026-05-18)
Commit b6b23cf. All prior references to dev/not_yet/ should use dev/_archived/.
App Consolidation — 4 Modal Functions → 2 (2026-05-18)
Commits d9a7e9e, 8f10753, 2ab08e4: low_priority_app consolidates voices + geocoding + capture + static_site; user_data_app consolidates users + data. Reduces min_containers from 4 to 1.
M20. Unsafe tarfile.extractall Removed (2026-04-30)
ETag CAS + Per-Key Lock for Race-Safe Location DB Writes (2026-04-26)
Supabase Credentials Moved to Modal Secrets (2026-04-25)
Priority 2. Centralized Logging (Foundation Complete — 2026-04-23/24)
H6. Integration Test Script Typos (2026-04-23)
H7. Duplicate and Unpinned Dependencies (2026-04-23)
M7. Duplicate and Unused Imports (2026-04-23)
L3. README.md Duplicate ## Setup Header (2026-04-23)
L4. Large results.json and build.log Committed (2026-04-23)
L9. Worktree Directories in Repo Root (verified 2026-04-24)
H14. Unsafe vLLM Output Indexing (2026-04-22)
H12. Convert-LiteRT Endpoint Allocates GPU for NotImplementedError (2026-04-22)
OBSOLETE
~~M8-old. CI dev deployment commented out~~
Deliberate decision, not a gap. Moved to OBSOLETE on 2026-04-21.
~~H16, M48, M49, M52, M53 — Billing-related items~~
Moved to DONE — billing module archived 2026-05-18.