CARI Internals
Scoring Formula
CARI’s retrieval score combines four signals into a single ranked score:

```
score = w_ann  × annotation_relevance
      + w_sym  × symbol_match
      + w_cooc × co_occurrence_boost
      + w_coch × co_change_signal
```

Each signal is normalized to [0, 1] before weighting.
This formula is implemented in the retrieve query (packages/index/src/queries/retrieve.ts)
and is exposed via CariIndex.retrieve() and the cari_retrieve MCP tool. The four
component tables are annotations, symbols, co_occurrences, and co_changes.
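The weighted sum can be sketched in a few lines of TypeScript. This is only an illustration of the formula, not the code in retrieve.ts: the interface names, the `clamp01` helper, and the weight values are all assumptions made for the example.

```typescript
// Hypothetical shapes: the fields mirror the four signals in the formula,
// but the names and the weight values are placeholders, not CARI's defaults.
interface Signals {
  annotationRelevance: number;
  symbolMatch: number;
  coOccurrenceBoost: number;
  coChangeSignal: number;
}

interface Weights {
  ann: number;
  sym: number;
  cooc: number;
  coch: number;
}

// Clamp a signal into [0, 1]; assumes the caller has already scaled it sensibly.
function clamp01(x: number): number {
  return Math.min(1, Math.max(0, x));
}

// Weighted sum of the four normalized signals.
function combineScore(s: Signals, w: Weights): number {
  return (
    w.ann * clamp01(s.annotationRelevance) +
    w.sym * clamp01(s.symbolMatch) +
    w.cooc * clamp01(s.coOccurrenceBoost) +
    w.coch * clamp01(s.coChangeSignal)
  );
}

// Example weights chosen only so the arithmetic is easy to follow.
const exampleWeights: Weights = { ann: 0.5, sym: 0.25, cooc: 0.125, coch: 0.125 };
const exampleScore = combineScore(
  { annotationRelevance: 1, symbolMatch: 0.5, coOccurrenceBoost: 0, coChangeSignal: 2 },
  exampleWeights,
);
console.log(exampleScore); // 0.5 + 0.125 + 0 + 0.125 = 0.75
```

Note that the out-of-range `coChangeSignal` of 2 is clamped to 1 before weighting, which is why normalizing each signal first keeps any single component from dominating the ranking.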
Annotation Sources
Annotations are created when a document mention matches a code symbol. Each annotation records its source, meaning how the match was discovered:
| Source | Description | Depth Mode |
|---|---|---|
| `heading` | H1–H6 heading text | structured |
| `bold` | Bold/strong text | structured |
| `code_span` | Inline code (backticks) | structured |
| `identifier` | CamelCase / snake_case tokens | structured |
| `dictionary` | Body text matched against symbol dictionary | full only |
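The table can be read as a mapping from depth mode to the sources that are scanned. The sketch below encodes that reading; the type and function names are hypothetical, and it assumes that full mode runs the structured sources in addition to dictionary matching, which the table implies but does not state outright.

```typescript
// Source values match the `source` column documented for annotations.
type AnnotationSource = "heading" | "bold" | "code_span" | "identifier" | "dictionary";

// Hypothetical mode names taken from the "Depth Mode" column.
type DepthMode = "structured" | "full";

const STRUCTURED_SOURCES: readonly AnnotationSource[] = [
  "heading",
  "bold",
  "code_span",
  "identifier",
];

// Assumption: full mode is a superset of structured mode.
function sourcesFor(mode: DepthMode): AnnotationSource[] {
  return mode === "full"
    ? [...STRUCTURED_SOURCES, "dictionary"]
    : [...STRUCTURED_SOURCES];
}

console.log(sourcesFor("structured"));
console.log(sourcesFor("full"));
```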
IDF Noise Filtering
In full depth mode, body text scanning can match many common words that happen to
be symbol names (e.g., “data”, “error”, “config”). The IDF (Inverse Document Frequency)
filter penalizes terms that appear in nearly every document.
How It Works
- After all annotations are computed, the document frequency of each term is calculated.
- Terms appearing in more than 85% of documents get an IDF penalty.
- A stopword baseline of ~50 known-noisy terms (like “data”, “type”, “value”) is capped at a ceiling IDF of 0.15.
Configuration
The IDF system is automatic and needs no configuration. The stopword baseline and ceiling are tuned for typical codebases:
- STOPWORD_BASELINE: 50 terms known to be noisy across most projects
- STOPWORD_CEILING: 0.15 — maximum IDF score for baseline terms
- Exemptions: Annotations from `heading`, `bold`, and `code_span` sources are never penalized (they’re explicit mentions, not body-text noise)
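A rough sketch of how these rules could fit together is shown below. The 85% threshold, the 0.15 ceiling, and the exempt sources come from the text; everything else is an assumption for illustration, including the `Mention` shape, the three-term stand-in for the baseline list, and especially the linear penalty curve, since the actual penalty function is not described here.

```typescript
const DF_THRESHOLD = 0.85;     // from the text: terms in more than 85% of documents
const STOPWORD_CEILING = 0.15; // from the text: maximum IDF score for baseline terms
// Tiny stand-in for the ~50-term baseline list (assumed contents).
const STOPWORD_BASELINE = new Set(["data", "type", "value"]);
const EXEMPT_SOURCES = new Set(["heading", "bold", "code_span"]);

// Hypothetical in-memory view of an annotation before scoring.
interface Mention {
  term: string;
  source: string;
  docFile: string;
}

// Returns one IDF score per mention; assumes totalDocs > 0.
function idfScores(mentions: Mention[], totalDocs: number): number[] {
  // Document frequency: distinct documents in which each term appears.
  const docsByTerm = new Map<string, Set<string>>();
  for (const m of mentions) {
    const docs = docsByTerm.get(m.term) ?? new Set<string>();
    docs.add(m.docFile);
    docsByTerm.set(m.term, docs);
  }
  return mentions.map((m) => {
    // Explicit mentions are never penalized.
    if (EXEMPT_SOURCES.has(m.source)) return 1.0;
    const ratio = (docsByTerm.get(m.term)?.size ?? 0) / totalDocs;
    // Assumed penalty: falls linearly from 1 to 0 once the threshold is crossed.
    let idf = ratio > DF_THRESHOLD ? (1 - ratio) / (1 - DF_THRESHOLD) : 1.0;
    // Baseline stopwords can never score above the ceiling.
    if (STOPWORD_BASELINE.has(m.term)) idf = Math.min(idf, STOPWORD_CEILING);
    return idf;
  });
}

const sampleScores = idfScores(
  [
    { term: "config", source: "dictionary", docFile: "a.md" },
    { term: "config", source: "dictionary", docFile: "b.md" },
    { term: "config", source: "heading", docFile: "a.md" },
    { term: "data", source: "dictionary", docFile: "a.md" },
    { term: "parseRef", source: "dictionary", docFile: "a.md" },
  ],
  2,
);
console.log(sampleScores);
```

In this sample, "config" appears in every document, so its dictionary matches are penalized while its heading match keeps a score of 1.0; "data" is rare but still capped because it is in the baseline list.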
SQLite Schema
The index database (.iw/index.db) contains these tables:
symbols
```sql
CREATE TABLE symbols (
  id INTEGER PRIMARY KEY,
  name TEXT NOT NULL,
  kind TEXT NOT NULL,          -- 'class', 'function', 'interface', 'type', 'enum', etc.
  file TEXT NOT NULL,
  line INTEGER,
  exported INTEGER DEFAULT 0,
  UNIQUE(name, kind, file, line)
);
```

annotations
```sql
CREATE TABLE annotations (
  id INTEGER PRIMARY KEY,
  doc_file TEXT NOT NULL,
  symbol_id INTEGER REFERENCES symbols(id),
  mention TEXT NOT NULL,
  source TEXT NOT NULL,        -- 'heading', 'bold', 'code_span', 'identifier', 'dictionary'
  confidence REAL DEFAULT 1.0,
  idf_score REAL DEFAULT 1.0,
  line INTEGER
);
```

co_occurrences
```sql
CREATE TABLE co_occurrences (
  entity_a TEXT NOT NULL,
  entity_b TEXT NOT NULL,
  doc_file TEXT,
  score REAL DEFAULT 1.0,
  layer TEXT NOT NULL,         -- 'doc_cooc', 'code_import'
  UNIQUE(entity_a, entity_b, layer, doc_file)
);
```

co_changes
```sql
CREATE TABLE co_changes (
  file_a TEXT NOT NULL,
  file_b TEXT NOT NULL,
  jaccard REAL NOT NULL,
  commit_count INTEGER DEFAULT 0,
  recency REAL DEFAULT 0.0,
  UNIQUE(file_a, file_b)
);
```

files

```sql
CREATE TABLE files (
  path TEXT PRIMARY KEY,
  last_modified INTEGER,
  content_hash TEXT,
  churn INTEGER DEFAULT 0,
  hotspot REAL DEFAULT 0.0,
  owner TEXT,
  kind TEXT                    -- 'code' or 'doc'
);
```

def_use_chains
Intra-function def-use relationships extracted during AST traversal (§16.1). Each row records a variable assignment inside a function body, plus the expression that was assigned to it.
```sql
CREATE TABLE def_use_chains (
  id INTEGER PRIMARY KEY,
  file TEXT NOT NULL,
  function_name TEXT NOT NULL,
  variable TEXT NOT NULL,      -- local variable name
  assigned_from TEXT NOT NULL, -- the expression (e.g. "item.resource.path")
  line INTEGER,
  UNIQUE(file, function_name, variable, assigned_from)
);

CREATE INDEX def_use_chains_file ON def_use_chains(file);
CREATE INDEX def_use_chains_fn_var ON def_use_chains(function_name, variable);
```

This table powers the taint_propagation feature in semantic rule checking:
when a property_access or call rule has taint_propagation: true, the rule
engine joins against def_use_chains to find variables that were assigned from
a matching expression, then checks whether those variables are subsequently used
in further accesses or calls within the same function.
Intra-Function Def-Use Extraction
The AST extractor visits function bodies and records local variable declarations whose initializer matches one of these node kinds:
| AST Node Kind | Example | Recorded as |
|---|---|---|
| `member_expression` | `item.resource.path` | `assigned_from = "item.resource.path"` |
| `call_expression` | `parseRef(id)` | `assigned_from = "parseRef(id)"` |
| `await_expression` | `await fetch(url)` | `assigned_from = "await fetch(url)"` |
Only local variables within function/method bodies are tracked — top-level declarations and module-level assignments are excluded. Tracking is strictly intra-function: def-use chains do not follow values across function boundaries or module exports.
How Taint Propagation Works
Given this code:
```ts
function render(item: Item) {
  const path = item.resource.path;      // ← def-use chain recorded
  const label = path.split("/").pop();  // ← tainted call
  return label;
}
```

The rule engine:
- Finds `item.resource.path` as a direct violation of the `property_access` chain `**.resource.path`
- Looks up `def_use_chains` for `(file, function_name = "render", assigned_from LIKE "%.resource.path%")`
- Finds that `variable = "path"` was tainted
- Scans for subsequent `property_access` or `call` expressions on `path` in the same function
- Reports `path.split` as a secondary (taint-propagated) violation
Secondary violations are reported with a (taint: path ← item.resource.path) suffix
in CLI output so they are distinguishable from direct violations.
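The lookup and scan steps can be approximated over in-memory rows, as in the sketch below. It is a simplified model rather than the rule engine itself: the `DefUseRow` fields follow the def_use_chains columns, but the `Use` record, the `secondaryViolations` function, the substring match standing in for the SQL `LIKE`, and the exact layout of the report line (apart from the taint suffix) are assumptions.

```typescript
// Mirrors the def_use_chains columns used by the lookup.
interface DefUseRow {
  file: string;
  functionName: string;
  variable: string;
  assignedFrom: string;
  line: number;
}

// Hypothetical record of a later property access or call, e.g. "path.split".
interface Use {
  file: string;
  functionName: string;
  expression: string;
  line: number;
}

function secondaryViolations(rows: DefUseRow[], uses: Use[], chainSuffix: string): string[] {
  const reports: string[] = [];
  for (const row of rows) {
    // Stand-in for: assigned_from LIKE '%.resource.path%'
    if (!row.assignedFrom.includes(chainSuffix)) continue;
    for (const use of uses) {
      const sameFunction = use.file === row.file && use.functionName === row.functionName;
      // The tainted variable must be the receiver, and the use must come after the assignment.
      const usesVariable = use.expression.startsWith(row.variable + ".");
      if (sameFunction && usesVariable && use.line > row.line) {
        reports.push(`${use.expression} (taint: ${row.variable} ← ${row.assignedFrom})`);
      }
    }
  }
  return reports;
}

const taintReports = secondaryViolations(
  [{ file: "render.ts", functionName: "render", variable: "path", assignedFrom: "item.resource.path", line: 2 }],
  [
    { file: "render.ts", functionName: "render", expression: "path.split", line: 3 },
    { file: "render.ts", functionName: "other", expression: "path.join", line: 9 },
  ],
  ".resource.path",
);
console.log(taintReports); // ["path.split (taint: path ← item.resource.path)"]
```

The `path.join` use is ignored because it sits in a different function, which reflects the strictly intra-function scope of the def-use chains.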