CARI Internals

CARI’s retrieval score is a weighted sum of four signals:

score = w_ann  × annotation_relevance
      + w_sym  × symbol_match
      + w_cooc × co_occurrence_boost
      + w_coch × co_change_signal

Each signal is normalized to [0, 1] before weighting.
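
As a minimal sketch of the formula above, the weighted sum can be written in a few lines of Python. The weight values, the `Weights` class, and the `clamp01` helper are illustrative assumptions, not CARI's actual defaults; clamping merely stands in for whatever normalization CARI applies to bring each signal into [0, 1].

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Weights:
    # Hypothetical weights chosen only for illustration; CARI's real values may differ.
    ann: float = 0.4
    sym: float = 0.3
    cooc: float = 0.2
    coch: float = 0.1


def clamp01(x: float) -> float:
    """Stand-in normalization (an assumption): force a signal into [0, 1]."""
    return max(0.0, min(1.0, x))


def combined_score(
    annotation_relevance: float,
    symbol_match: float,
    co_occurrence_boost: float,
    co_change_signal: float,
    w: Weights = Weights(),
) -> float:
    """Weighted sum of the four signals, each normalized before weighting."""
    return (
        w.ann * clamp01(annotation_relevance)
        + w.sym * clamp01(symbol_match)
        + w.cooc * clamp01(co_occurrence_boost)
        + w.coch * clamp01(co_change_signal)
    )
```

With weights that sum to 1, the combined score also stays in [0, 1], which keeps scores comparable across queries.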

Annotations are created when a document mention matches a code symbol. Each annotation records its source — how the match was discovered:

| Source | Description | Depth Mode |
|---|---|---|
| heading | H1–H6 heading text | structured |
| bold | Bold/strong text | structured |
| code_span | Inline code (backticks) | structured |
| identifier | CamelCase / snake_case tokens | structured |
| dictionary | Body text matched against symbol dictionary | full only |
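
To make the structured sources concrete, here is a rough sketch of how one line of Markdown could be split into candidate mentions tagged by source. The regular expressions and the `extract_mentions` function are assumptions for illustration and not CARI's parser. The sketch stops at candidate extraction and omits the later step of matching each mention against the symbol table.

```python
import re

# Illustrative patterns (assumptions), one per structured source.
HEADING = re.compile(r"^(#{1,6})\s+(.*\S)\s*$")
BOLD = re.compile(r"\*\*(.+?)\*\*")
CODE_SPAN = re.compile(r"`([^`]+)`")
# CamelCase with at least two humps, or snake_case with at least one underscore.
IDENTIFIER = re.compile(
    r"\b(?:[A-Z][a-z0-9]+(?:[A-Z][a-z0-9]+)+|[a-z][a-z0-9]*(?:_[a-z0-9]+)+)\b"
)


def extract_mentions(line: str) -> list[tuple[str, str]]:
    """Return (source, mention) pairs found in a single Markdown line."""
    heading = HEADING.match(line)
    if heading:
        return [("heading", heading.group(2))]

    mentions = [("bold", m.group(1)) for m in BOLD.finditer(line)]
    mentions += [("code_span", m.group(1)) for m in CODE_SPAN.finditer(line)]

    # Blank out bold and code spans so their contents are not counted twice.
    rest = CODE_SPAN.sub(" ", BOLD.sub(" ", line))
    mentions += [("identifier", m.group(0)) for m in IDENTIFIER.finditer(rest)]
    return mentions
```

A `dictionary` pass would additionally compare every remaining body-text word against known symbol names, which is why it is the noisiest source and the one the IDF filter targets.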

In full depth mode, body text scanning can match many common words that happen to be symbol names (e.g., “data”, “error”, “config”). The IDF (Inverse Document Frequency) filter penalizes terms that appear in nearly every document.

  1. After all annotations are computed, calculate the document frequency for each term
  2. Terms appearing in more than 85% of documents get an IDF penalty
  3. A stopword baseline of ~50 known-noisy terms (like “data”, “type”, “value”) is capped at a ceiling IDF of 0.15

The IDF filter runs automatically and needs no configuration. The stopword baseline and ceiling are tuned for typical codebases:

  • STOPWORD_BASELINE: roughly 50 terms known to be noisy across most projects
  • STOPWORD_CEILING: 0.15, the maximum IDF score a baseline term can receive
  • Exemptions: Annotations from heading, bold, and code_span sources are never penalized (they’re explicit mentions, not body-text noise)
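
The filter described above can be sketched as follows. This is an approximation under stated assumptions: the penalty formula (`1 - document_frequency`) is a placeholder because the exact penalty is not specified here, the three-term baseline is a tiny subset of the real list, and `idf_scores` and `DF_THRESHOLD` are hypothetical names.

```python
from collections import defaultdict

DF_THRESHOLD = 0.85          # terms in more than 85% of documents are penalized
STOPWORD_CEILING = 0.15      # maximum IDF score for baseline terms
# Illustrative subset only; the real baseline holds roughly 50 terms.
STOPWORD_BASELINE = frozenset({"data", "type", "value"})
EXEMPT_SOURCES = frozenset({"heading", "bold", "code_span"})


def idf_scores(annotations: list[dict]) -> list[float]:
    """Return one IDF score per annotation, in input order.

    Each annotation is a dict with "doc_file", "mention", and "source" keys.
    """
    all_docs = {a["doc_file"] for a in annotations}
    docs_per_term: dict[str, set[str]] = defaultdict(set)
    for a in annotations:
        docs_per_term[a["mention"].lower()].add(a["doc_file"])

    scores = []
    for a in annotations:
        if a["source"] in EXEMPT_SOURCES:
            scores.append(1.0)  # explicit mentions are never penalized
            continue
        term = a["mention"].lower()
        df = len(docs_per_term[term]) / len(all_docs)
        # Assumed penalty shape: the more widespread the term, the lower the score.
        score = 1.0 - df if df > DF_THRESHOLD else 1.0
        if term in STOPWORD_BASELINE:
            score = min(score, STOPWORD_CEILING)
        scores.append(score)
    return scores
```

Because document frequency needs the full annotation set, this pass can only run after all annotations are computed, which matches step 1 of the list above.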

The index database (.iw/index.db) contains these tables:

CREATE TABLE symbols (
    id INTEGER PRIMARY KEY,
    name TEXT NOT NULL,
    kind TEXT NOT NULL, -- 'class', 'function', 'interface', 'type', 'enum', etc.
    file TEXT NOT NULL,
    line INTEGER,
    exported INTEGER DEFAULT 0,
    UNIQUE(name, kind, file, line)
);

CREATE TABLE annotations (
    id INTEGER PRIMARY KEY,
    doc_file TEXT NOT NULL,
    symbol_id INTEGER REFERENCES symbols(id),
    mention TEXT NOT NULL,
    source TEXT NOT NULL, -- 'heading', 'bold', 'code_span', 'identifier', 'dictionary'
    confidence REAL DEFAULT 1.0,
    idf_score REAL DEFAULT 1.0,
    line INTEGER
);

CREATE TABLE co_occurrences (
    entity_a TEXT NOT NULL,
    entity_b TEXT NOT NULL,
    doc_file TEXT,
    score REAL DEFAULT 1.0,
    layer TEXT NOT NULL, -- 'doc_cooc', 'code_import'
    UNIQUE(entity_a, entity_b, layer, doc_file)
);

CREATE TABLE co_changes (
    file_a TEXT NOT NULL,
    file_b TEXT NOT NULL,
    jaccard REAL NOT NULL,
    commit_count INTEGER DEFAULT 0,
    recency REAL DEFAULT 0.0,
    UNIQUE(file_a, file_b)
);

CREATE TABLE files (
    path TEXT PRIMARY KEY,
    last_modified INTEGER,
    content_hash TEXT,
    churn INTEGER DEFAULT 0,
    hotspot REAL DEFAULT 0.0,
    owner TEXT,
    kind TEXT -- 'code' or 'doc'
);
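
Since the index is a plain SQLite file, it can be inspected with any SQLite client. The sketch below uses Python's standard `sqlite3` module against an in-memory copy of the `symbols` and `annotations` tables. The helper names (`open_memory_index`, `symbols_for_doc`), the `min_idf` cutoff, and the `confidence * idf_score` weighting are assumptions for illustration, not queries CARI is known to run.

```python
import sqlite3

# The two tables needed for this example, copied from the schema above.
SCHEMA = """
CREATE TABLE symbols (
    id INTEGER PRIMARY KEY,
    name TEXT NOT NULL,
    kind TEXT NOT NULL,
    file TEXT NOT NULL,
    line INTEGER,
    exported INTEGER DEFAULT 0,
    UNIQUE(name, kind, file, line)
);
CREATE TABLE annotations (
    id INTEGER PRIMARY KEY,
    doc_file TEXT NOT NULL,
    symbol_id INTEGER REFERENCES symbols(id),
    mention TEXT NOT NULL,
    source TEXT NOT NULL,
    confidence REAL DEFAULT 1.0,
    idf_score REAL DEFAULT 1.0,
    line INTEGER
);
"""


def open_memory_index() -> sqlite3.Connection:
    """Create an empty in-memory database with the two tables above."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    return conn


def symbols_for_doc(conn: sqlite3.Connection, doc_file: str, min_idf: float = 0.5):
    """List symbols a document mentions, strongest first.

    Annotations with an idf_score at or below min_idf (a hypothetical cutoff)
    are ignored; the rest are summed as confidence * idf_score per symbol.
    """
    return conn.execute(
        """
        SELECT s.name, s.kind, s.file, SUM(a.confidence * a.idf_score) AS weight
        FROM annotations AS a
        JOIN symbols AS s ON s.id = a.symbol_id
        WHERE a.doc_file = ? AND a.idf_score > ?
        GROUP BY s.id
        ORDER BY weight DESC, s.name
        """,
        (doc_file, min_idf),
    ).fetchall()
```

To run the same query against a real index, replace `open_memory_index()` with `sqlite3.connect(".iw/index.db")`, ideally on a copy of the file so the indexer's database is not locked.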