
CARI Technical Specification

Implemented — Phases 1–5 complete, 92 tests passing.

  1. Zero external cost — no LLM calls, no paid APIs, no database servers
  2. Single-file output — one SQLite database, portable and inspectable
  3. Three-signal fusion — AST structure + document semantics + git temporal data
  4. Gap detection — disagreements between signals are the most valuable findings
  5. CI-ready — exit codes, machine-readable formats, fast execution
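As a rough illustration of principle 4, gap detection can be thought of as a set difference between the pairs each signal considers related. The sketch below compares document-derived edges with git co-change edges. The `Edge` shape, the gap names, and the 0.5 threshold are all assumptions made for this example and are not taken from the CARI schema.

```typescript
// Hypothetical edge shape: an undirected link between two files with a strength score.
interface Edge {
  a: string;
  b: string;
  score: number;
}

// Hypothetical gap categories; the names are illustrative only.
type GapKind = "coupled-but-undocumented" | "documented-but-uncoupled";

interface Gap {
  pair: string;
  kind: GapKind;
}

// Order-independent key so (a, b) and (b, a) collapse to the same pair.
function pairKey(a: string, b: string): string {
  return a < b ? `${a}|${b}` : `${b}|${a}`;
}

// Keep only pairs whose score reaches the (assumed) threshold.
function strongPairs(edges: Edge[], minScore: number): Set<string> {
  const pairs = new Set<string>();
  for (const e of edges) {
    if (e.score >= minScore) pairs.add(pairKey(e.a, e.b));
  }
  return pairs;
}

// Report pairs that one signal supports and the other does not.
function findGaps(docEdges: Edge[], coChangeEdges: Edge[], minScore = 0.5): Gap[] {
  const docPairs = strongPairs(docEdges, minScore);
  const gitPairs = strongPairs(coChangeEdges, minScore);
  const gaps: Gap[] = [];
  gitPairs.forEach((pair) => {
    if (!docPairs.has(pair)) gaps.push({ pair, kind: "coupled-but-undocumented" });
  });
  docPairs.forEach((pair) => {
    if (!gitPairs.has(pair)) gaps.push({ pair, kind: "documented-but-uncoupled" });
  });
  return gaps;
}
```

Files that change together but are never mentioned together surface as one kind of gap, and documented relationships with no co-change history surface as the other.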

Tree-sitter parses source files to produce a symbol registry:

  • Classes, interfaces, type aliases, enums
  • Functions (top-level and methods)
  • Exported variables and constants
  • Parameter types and return types (where available)

Supported languages: TypeScript, JavaScript, Swift.
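To make the registry contents concrete, here is one possible shape for a registry entry and a name-keyed lookup. This is a sketch only: `SymbolEntry`, `SymbolKind`, and `buildRegistry` are hypothetical names, and the actual tables are defined in `packages/index/src/schema.ts`.

```typescript
// Hypothetical symbol kinds, mirroring the categories listed above.
type SymbolKind =
  | "class"
  | "interface"
  | "typeAlias"
  | "enum"
  | "function"
  | "method"
  | "variable"
  | "constant";

// Hypothetical shape of one registry entry; field names are illustrative.
interface SymbolEntry {
  name: string;
  kind: SymbolKind;
  file: string;
  language: "typescript" | "javascript" | "swift";
  exported: boolean;
  paramTypes?: string[]; // present only where the parser could recover them
  returnType?: string; // likewise optional
}

// Index entries by name. Names can collide across files, so each key maps to a list.
function buildRegistry(entries: SymbolEntry[]): Map<string, SymbolEntry[]> {
  const registry = new Map<string, SymbolEntry[]>();
  for (const entry of entries) {
    const bucket = registry.get(entry.name) ?? [];
    bucket.push(entry);
    registry.set(entry.name, bucket);
  }
  return registry;
}
```

A `Map` is used rather than a plain object so that symbols named `constructor` or `toString` do not collide with inherited object properties.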

Markdown documents are scanned for entity mentions:

| Source | What It Captures | Depth |
| --- | --- | --- |
| Headings (H1-H6) | Section topics | structured |
| Bold text | Emphasized entities | structured |
| Code spans | Inline code references | structured |
| Identifiers | CamelCase / snake_case tokens | structured |
| Body text | Dictionary-matched terms | full only |
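The structured-depth sources can be approximated with a few regular expressions. The sketch below is an assumption-laden illustration and not the logic in `heuristicExtractor.ts`: the function names and patterns are hypothetical, and a real Markdown parser would handle edge cases (nested emphasis, fenced code blocks) that these regexes ignore.

```typescript
type MentionSource = "heading" | "bold" | "codeSpan" | "identifier";

interface Mention {
  text: string;
  source: MentionSource;
}

// Collect every match of a global regex, preferring capture group 1 when present.
function allMatches(re: RegExp, text: string): string[] {
  const found: string[] = [];
  re.lastIndex = 0;
  let m: RegExpExecArray | null;
  while ((m = re.exec(text)) !== null) {
    found.push(m[1] ?? m[0]);
  }
  return found;
}

const HEADING = /^#{1,6}\s+(.+?)\s*$/;
const BOLD = /\*\*([^*]+)\*\*/g;
const CODE_SPAN = /`([^`]+)`/g;
// camelCase, PascalCase with two or more humps, or snake_case tokens.
const IDENTIFIER =
  /\b(?:[a-z]+(?:[A-Z][a-z0-9]+)+|[A-Z][a-z0-9]+(?:[A-Z][a-z0-9]+)+|[a-z0-9]+(?:_[a-z0-9]+)+)\b/g;

function extractStructuredMentions(markdown: string): Mention[] {
  const mentions: Mention[] = [];
  for (const line of markdown.split("\n")) {
    const heading = HEADING.exec(line);
    if (heading) {
      mentions.push({ text: heading[1], source: "heading" });
      continue;
    }
    for (const text of allMatches(BOLD, line)) mentions.push({ text, source: "bold" });
    for (const text of allMatches(CODE_SPAN, line)) mentions.push({ text, source: "codeSpan" });
    // Blank out code spans so their contents are not counted again as identifiers.
    const prose = line.replace(/`[^`]*`/g, " ");
    for (const text of allMatches(IDENTIFIER, prose)) mentions.push({ text, source: "identifier" });
  }
  return mentions;
}
```

Body-text dictionary matching (full mode) is deliberately left out here, since it depends on the symbol registry rather than on document syntax alone.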

Git log analysis produces:

  • Co-change Jaccard — P(A∩B) / P(A∪B) for file pairs across commits
  • Recency weighting — recent co-changes score higher
  • Hotspot detection — files with high churn
  • Ownership — most frequent committer per file
  • Staleness — time since last modification relative to dependent code
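The co-change score can be sketched as follows. For a file pair, the Jaccard ratio reduces to the number of commits touching both files divided by the number touching either. The recency weighting shown here is an exponential half-life decay, which is an assumption for illustration: the specification does not state the weighting scheme, and `Commit`, `ageDays`, and `halfLifeDays` are hypothetical names.

```typescript
// Hypothetical commit shape: files touched plus the commit's age in days.
interface Commit {
  files: string[];
  ageDays: number;
}

// Jaccard co-change score for a file pair. Each commit contributes a weight of
// 0.5 ^ (ageDays / halfLifeDays); the default half-life of Infinity gives every
// commit weight 1, which is the plain unweighted Jaccard ratio.
function coChangeJaccard(
  commits: Commit[],
  fileA: string,
  fileB: string,
  halfLifeDays: number = Infinity
): number {
  let both = 0;
  let either = 0;
  for (const commit of commits) {
    const hasA = commit.files.includes(fileA);
    const hasB = commit.files.includes(fileB);
    if (!hasA && !hasB) continue;
    const weight = Math.pow(0.5, commit.ageDays / halfLifeDays);
    either += weight;
    if (hasA && hasB) both += weight;
  }
  // Files that never appear in history have no co-change signal.
  return either === 0 ? 0 : both / either;
}
```

With a finite half-life, an old commit that touched only one of the two files counts for less in the denominator, so a pair whose shared changes are recent scores higher than the unweighted ratio would suggest.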

The annotator matches document mentions to code symbols:

  1. For each keyword mention in a document, search the symbol registry
  2. Exact matches get confidence 1.0
  3. Partial/fuzzy matches get reduced confidence
  4. In full mode, apply IDF penalties to body-text matches
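Steps 1 to 3 can be sketched as a tiered lookup. Only the exact-match confidence of 1.0 comes from the description above. The reduced values (0.8 for a case-insensitive match, 0.5 for a substring match) and the minimum mention length are placeholders chosen for this example, and `matchMention` is a hypothetical name rather than the function in `annotator.ts`.

```typescript
interface Annotation {
  mention: string;
  symbol: string;
  confidence: number;
}

// Assumed confidence tiers; only EXACT is specified by the document.
const EXACT = 1.0;
const CASE_INSENSITIVE = 0.8; // assumption
const SUBSTRING = 0.5; // assumption
const MIN_SUBSTRING_LENGTH = 4; // assumption: avoid matching very short fragments

// Return the highest-confidence symbol for a mention, or null if nothing matches.
function matchMention(mention: string, symbolNames: string[]): Annotation | null {
  const lowerMention = mention.toLowerCase();
  let best: Annotation | null = null;
  for (const symbol of symbolNames) {
    const lowerSymbol = symbol.toLowerCase();
    let confidence = 0;
    if (symbol === mention) {
      confidence = EXACT;
    } else if (lowerSymbol === lowerMention) {
      confidence = CASE_INSENSITIVE;
    } else if (lowerMention.length >= MIN_SUBSTRING_LENGTH && lowerSymbol.includes(lowerMention)) {
      confidence = SUBSTRING;
    }
    if (confidence > 0 && (best === null || confidence > best.confidence)) {
      best = { mention, symbol, confidence };
    }
  }
  return best;
}
```

Step 4, the IDF penalty for body-text matches in full mode, would be applied to the returned confidence afterwards.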

Terms appearing in > 85% of documents are penalized:

  • Stopword baseline: ~50 known-noisy terms (e.g., “data”, “type”, “value”)
  • Ceiling: 0.15 maximum IDF score for baseline terms
  • Exemptions: Heading, bold, and code-span annotations are never penalized
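One way to combine these three rules is sketched below. The 85% threshold, the 0.15 ceiling, and the exemptions come from the list above. The normalized IDF formula, log(N / df) / log(N), and the function and constant names are assumptions, since the exact scoring in `idf.ts` is not described here.

```typescript
type AnnotationSource = "heading" | "bold" | "codeSpan" | "identifier" | "body";

const COMMON_THRESHOLD = 0.85; // terms in more than 85% of documents are penalized
const BASELINE_CEILING = 0.15; // maximum score for stopword-baseline terms

// Tiny stand-in for the ~50-term stopword baseline.
const STOPWORD_BASELINE = new Set<string>(["data", "type", "value"]);

// Assumed normalization: log(N / df) / log(N), which lies in [0, 1] for N > 1.
function normalizedIdf(docFreq: number, totalDocs: number): number {
  if (totalDocs <= 1) return 1;
  const df = Math.min(Math.max(docFreq, 1), totalDocs);
  return Math.log(totalDocs / df) / Math.log(totalDocs);
}

// Weight applied to an annotation's confidence; 1 means no penalty.
function idfWeight(
  term: string,
  source: AnnotationSource,
  docFreq: number,
  totalDocs: number
): number {
  // Heading, bold, and code-span annotations are never penalized.
  if (source === "heading" || source === "bold" || source === "codeSpan") return 1;
  const idf = normalizedIdf(docFreq, totalDocs);
  // Baseline stopwords are capped regardless of how rare they are in this corpus.
  if (STOPWORD_BASELINE.has(term.toLowerCase())) return Math.min(idf, BASELINE_CEILING);
  // Other terms are penalized only once they cross the document-frequency threshold.
  if (docFreq / totalDocs > COMMON_THRESHOLD) return idf;
  return 1;
}
```

On a small corpus such as seven documents, the threshold is crossed only by terms that appear in six or all seven of them, which is one reason a fixed stopword baseline is useful alongside corpus statistics.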

Measured on the IntentWeave monorepo (264 code files, 7 docs, 5,316 symbols):

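| Metric | Structured | Full | Delta |
| --- | --- | --- | --- |
| Build time | 1.1 s | 2.8 s | +1.7 s |
| Annotations | 6,721 | 11,533 | +72% |
| Grounded (code-linked) | 2,548 (38%) | 7,360 (64%) | +189% |
| Co-occurrence edges | 1,099 | 2,631 | +139% |
| IDF terms tracked | | 2,843 | |
| Index file size | ~2 MB | ~4 MB | +100% |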

All phases are complete:

| Phase | Scope | Tests |
| --- | --- | --- |
| 1. Foundation | Writer, schema, retrieval queries | 22 |
| 2. Connections | Co-occurrence, co-change, gap detection | 20 |
| 3. CI Drift | Check command, severity levels, formats | 16 |
| 4. Incremental | Content-hash updates, corpus report | 12 |
| 5. Annotation Depth | Dictionary matching, IDF filtering, stopword baseline | 22 |
| Total | | 92 |

| File | Purpose |
| --- | --- |
| packages/index/src/writer.ts | SQLite index builder |
| packages/index/src/annotator.ts | Mention→symbol matching + IDF |
| packages/index/src/idf.ts | IDF scorer + stopword baseline |
| packages/index/src/schema.ts | SQLite table definitions |
| packages/index/src/queries/retrieve.ts | Ranked retrieval |
| packages/index/src/queries/connections.ts | Connection discovery |
| packages/index/src/queries/check.ts | CI drift detection |
| packages/index/src/queries/report.ts | Health dashboard |
| packages/index/src/incremental.ts | Content-hash updates |
| packages/analyzer/src/kwg/heuristicExtractor.ts | Keyword extraction |
| packages/analyzer/src/kwg/kwxStage.ts | KWX stage options |
| packages/cli/src/commands/indexBuild.ts | Build orchestrator |