feat(ai): add markdown chunking, wiki-link parsing, and knowledge graph utilities #12

Open
DhruvK278 wants to merge 3 commits into AOSSIE-Org:main from DhruvK278:feat/ai-chunking-knowledge-graph

@DhruvK278 DhruvK278 commented Mar 5, 2026

Summary

Introduce foundational AI utilities for Smart Notes.

This PR adds a set of lightweight, storage-agnostic utilities that prepare notes for semantic search and smart context features described in the project roadmap.

Features included

  • Markdown chunking utility with heading-aware segmentation
  • Wiki-link parser supporting [[Note]] and [[Note|Alias]] syntax
  • Knowledge graph builder for note relationships
  • Backlink computation for bidirectional linking
  • Unit tests for all utilities using Jest

The chunking utility splits markdown notes into smaller sections that can later be embedded and indexed for semantic search.
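
As a rough illustration of heading-aware chunking, the sketch below splits a note at Markdown headings and then pages each section by word count. It is not the code from this PR: the `Chunk` fields, the `chunkMarkdown` signature, and the default of 200 words are assumptions made for the example.

```typescript
// Hypothetical chunk shape; the PR's real Chunk interface may differ.
interface Chunk {
  noteId: string;
  heading: string | null; // nearest heading above the chunk, if any
  text: string;
}

// Minimal sketch: split on ATX headings (#, ##, ...), then page each
// section into pieces of at most `maxWords` words.
function chunkMarkdown(noteId: string, markdown: string, maxWords = 200): Chunk[] {
  // Fall back to 1 for non-finite or non-positive limits so the loop advances.
  const limit = Number.isFinite(maxWords) && maxWords >= 1 ? Math.floor(maxWords) : 1;
  const chunks: Chunk[] = [];
  let heading: string | null = null;
  let words: string[] = [];

  // Emit the words collected so far under the current heading.
  const flush = (): void => {
    for (let i = 0; i < words.length; i += limit) {
      chunks.push({ noteId, heading, text: words.slice(i, i + limit).join(" ") });
    }
    words = [];
  };

  for (const line of markdown.split(/\r?\n/)) {
    const match = /^#{1,6}\s+(.*)$/.exec(line);
    if (match) {
      flush(); // close the previous section before starting a new one
      heading = (match[1] ?? "").trim();
    } else {
      words.push(...line.split(/\s+/).filter((w) => w.length > 0));
    }
  }
  flush();
  return chunks;
}
```

Keeping the heading on each chunk is one way to preserve context for a later embedding step; whether the PR stores it this way is not stated here.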

The link parser and graph builder extract relationships between notes and construct a basic knowledge graph, which can support features like related notes, auto-linking, and knowledge graph visualization.
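
To make the two supported link forms concrete, here is a minimal sketch of a wiki-link extractor. The function name `extractWikiLinks` matches the one mentioned in the review comments, but the regular expression and the trimming behavior are assumptions, not the PR's implementation.

```typescript
// Minimal sketch: return the target of every [[Target]] or [[Target|Alias]]
// link, in document order. Duplicates are kept; aliases are discarded.
function extractWikiLinks(content: string): string[] {
  // Group 1 is the target; the optional "|alias" part is matched but not captured.
  const pattern = /\[\[([^\[\]|]+)(?:\|[^\[\]]*)?\]\]/g;
  const targets: string[] = [];
  let match: RegExpExecArray | null;
  while ((match = pattern.exec(content)) !== null) {
    const target = (match[1] ?? "").trim();
    if (target.length > 0) {
      targets.push(target);
    }
  }
  return targets;
}

// Example: both links to "A" are reported, and the alias "See B" is dropped.
const exampleLinks = extractWikiLinks("Start with [[A]], then [[B|See B]], then [[A]] again.");
console.log(exampleLinks);
```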

All utilities are implemented as pure TypeScript modules with no dependency on the editor or storage layers, allowing them to integrate cleanly with the ongoing work in those areas.


Addressed Issues

N/A


Screenshots / Recordings

Not applicable.
This PR adds backend utilities and tests.


Additional Notes

  • Utilities are placed under src/ai/ to keep AI-related logic modular.
  • Tests are included to verify chunking behavior, link parsing, and graph construction.
  • The implementation is intentionally storage-agnostic so it can be integrated later with note storage and indexing pipelines.

These utilities will support future work on:

  • semantic search
  • local RAG pipelines
  • related notes sidebar
  • knowledge graph visualization

Checklist

  • My code follows the project's code style and conventions
  • I have made corresponding changes to the documentation
  • My changes generate no new warnings or errors
  • I have joined the Discord server and I will share a link to this PR with the project maintainers there
  • I have read the Contributing Guidelines

⚠️ AI Notice - Important!

AI tools were used to assist with drafting and structuring parts of this implementation.
All code generated by AI has been reviewed, tested locally, and verified to pass the included unit tests.

Summary by CodeRabbit

  • New Features

    • Intelligent markdown chunking for semantic content processing
    • Wiki-link parsing to automatically extract note connections
    • Knowledge-graph construction to visualize note relationships
    • Related-notes discovery to recommend connected content
  • Tests

    • Comprehensive unit tests covering chunking, link parsing, graph building, backlinks, and related-note logic
  • Chores

    • Project initialization: package manifest, TypeScript config, Jest setup, and VCS ignore rules

…ph utilities

Introduce foundational AI utilities for Smart Notes.

This commit adds a set of lightweight, storage-agnostic utilities that
prepare notes for semantic search and smart context features.

Features included:
- Markdown chunking utility with heading-aware segmentation
- Wiki-link parser supporting [[Note]] and [[Note|Alias]] syntax
- Knowledge graph builder for note relationships
- Backlink computation for bidirectional linking
- Unit tests for all utilities using Jest

These utilities form the basis for upcoming features such as:
semantic search, local RAG pipelines, related notes sidebar,
and knowledge graph visualization.

The implementation is modular and independent from the editor
and storage layers to avoid conflicts with ongoing work.

coderabbitai bot commented Mar 5, 2026

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Pro

Run ID: f84dfea3-00d7-461e-aaa7-6e106bf0f3ef

📥 Commits

Reviewing files that changed from the base of the PR and between c2a7b46 and 4bbdb10.

📒 Files selected for processing (2)
  • src/ai/chunker.ts
  • src/ai/knowledgeGraph.ts

Walkthrough

Adds a TypeScript/Node.js project scaffold (package.json, tsconfig, jest, .gitignore) and new AI utilities: markdown chunking, wiki-link extraction, knowledge-graph construction, related-note discovery, plus comprehensive Jest tests and a central re-export index.

Changes

| Cohort / File(s) | Summary |
| --- | --- |
| Project configuration: /.gitignore, package.json, tsconfig.json, jest.config.js | Adds Node/TypeScript project files: ignore rules, package manifest with build/test scripts and devDependencies, TypeScript compiler settings, and Jest configuration. |
| AI core modules: src/ai/chunker.ts, src/ai/linkParser.ts, src/ai/knowledgeGraph.ts, src/ai/relatedNotes.ts, src/ai/index.ts | New exported APIs: heading-aware Markdown chunker (Chunk interface, chunkMarkdown), wiki-style link extractor (extractWikiLinks), knowledge-graph builder (buildKnowledgeGraph, KnowledgeGraph), backlink computation (getBacklinks), and getRelatedNotes; index re-exports them. |
| Unit tests: src/ai/__tests__/chunker.test.ts, src/ai/__tests__/linkParser.test.ts, src/ai/__tests__/knowledgeGraph.test.ts, src/ai/__tests__/relatedNotes.test.ts | Adds Jest test suites covering chunking behavior and edge cases, link parsing variants (including aliases/duplicates), graph building/backlinks (including circular refs and deduplication), and related-note logic. |

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Suggested labels

Typescript Lang

Poem

🐰 I nibbled headings, links, and lore with glee,

Split notes to chunks and traced the wiki tree,
I wove backlinks, found friends both near and far,
With Jest to prove it, TypeScript is the star,
A rabbit's patch of code — hop, test, and see!

🚥 Pre-merge checks | ✅ 2
✅ Passed checks (2 passed)
| Check name | Status | Explanation |
| --- | --- | --- |
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title check | ✅ Passed | The PR title accurately describes the main changes: it highlights the three core features added (markdown chunking, wiki-link parsing, knowledge graph utilities) and uses the conventional 'feat(ai):' prefix to indicate a new feature in the AI module. |


@github-actions github-actions bot added size/XL and removed size/XL labels Mar 5, 2026

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 8

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@package.json`:
- Around line 5-6: The package.json "main" and "types" entries point to
dist/index.js and dist/index.d.ts but your compiled files are emitted under
dist/ai/index.js and dist/ai/index.d.ts; update the package.json entries (the
"main" and "types" keys) to reference "dist/ai/index.js" and
"dist/ai/index.d.ts" (or alternatively adjust the build/output configuration so
compilation emits to dist/index.js and dist/index.d.ts) so consumers can import
the package correctly.

In `@src/ai/__tests__/chunker.test.ts`:
- Around line 18-24: Update the tests to consistently assert the noteId field
and add coverage for metadata: for every test that calls chunkMarkdown (e.g.,
the "returns single chunk for small content" case) add an assertion that
chunks[0].noteId equals the input noteId ("note1"), and add a new test that
constructs/returns a Chunk with a metadata object to assert metadata is present
and correct (reference Chunk interface and chunkMarkdown function to locate
where to add assertions). Ensure both single-chunk and multi-chunk tests include
noteId assertions and create one explicit test that verifies metadata handling
on the returned Chunk.

In `@src/ai/__tests__/knowledgeGraph.test.ts`:
- Around line 3-58: Add an integration test to cover aliased wiki-links so
buildKnowledgeGraph correctly parses links of the form [[Target|Alias]]: create
a note string containing an aliased link (e.g., "See [[C|See C]]") and assert
that buildKnowledgeGraph(notes) records an outgoing edge to "C" (the target) and
does not create a node for the alias ("See C"); place the test alongside the
existing cases so it verifies parser+graph integration for buildKnowledgeGraph.

In `@src/ai/__tests__/relatedNotes.test.ts`:
- Around line 3-14: Add tests in src/ai/__tests__/relatedNotes.test.ts to cover
deduplication and self-exclusion for getRelatedNotes: add a case where graph
contains duplicate paths to the same related note (e.g., two different nodes
linking to "C") and assert the returned array contains "C" only once, and add a
case where the source node links to itself and assert the source id is not
included in the results; reference getRelatedNotes in your new test cases and
use expect.arrayContaining plus length or Set checks to verify duplicates
removed and self excluded.

In `@src/ai/chunker.ts`:
- Around line 25-29: The chunkMarkdown function can enter an infinite loop when
maxWords <= 0 because the pagination loop uses i += maxWords; guard the
parameter at the start of chunkMarkdown (e.g., if maxWords is undefined/null or
<= 0) by either throwing a descriptive error or normalizing it to a safe minimum
(e.g., maxWords = Math.max(1, maxWords)) before the loop; update references
around the loop increment (i += maxWords) to rely on this validated value so the
loop always makes progress.
- Line 13: The exported Chunk type uses metadata?: Record<string, any> which
weakens type safety; change the metadata type to Record<string, unknown> in the
Chunk declaration (and any related exported interfaces/types or function
signatures that reference metadata) to avoid using any while preserving
extensibility—update occurrences of metadata, the Chunk type name, and any
imports/exports that expose that type so consumers receive the stronger
unknown-based typing.

In `@src/ai/knowledgeGraph.ts`:
- Around line 14-16: The graph currently keeps duplicate outgoing links because
extractWikiLinks returns duplicates, causing getBacklinks to report the same
source multiple times; update the graph construction where graph[noteName] is
assigned (use the extractWikiLinks result) to deduplicate links per source note
(e.g., convert to a Set then back to an array) so each outgoing link appears
once, and adjust or comment near extractWikiLinks and getBacklinks to note the
deduplication behavior if needed.

In `@src/ai/relatedNotes.ts`:
- Around line 18-19: The forEach callbacks on outgoing and backlinks return the
result of related.add(n) which triggers the lint rule; change both callbacks to
use a statement body (e.g., outgoing.forEach(n => { related.add(n); }); and
backlinks.forEach(n => { related.add(n); });) or replace with for..of loops over
outgoing and backlinks that call related.add(n) so the callbacks do not return a
value; update the lines referencing outgoing, backlinks, and related in
relatedNotes.ts accordingly.
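
The last comment above, together with the deduplication and self-exclusion tests requested for relatedNotes.test.ts, can be sketched as follows. The `KnowledgeGraph` shape (note name mapped to outgoing targets) and the `getRelatedNotes` signature are assumptions for illustration; the PR's actual code may differ.

```typescript
// Assumed graph shape: note name -> outgoing link targets.
type KnowledgeGraph = Record<string, string[]>;

// Sketch: related notes = outgoing links plus backlinks, deduplicated,
// with the source note itself excluded.
function getRelatedNotes(graph: KnowledgeGraph, noteId: string): string[] {
  const related = new Set<string>();
  const hasOwn = Object.prototype.hasOwnProperty;
  const outgoing = hasOwn.call(graph, noteId) ? (graph[noteId] ?? []) : [];

  // for..of keeps the loop body a statement, so nothing is returned from a
  // callback. The flagged form was: outgoing.forEach(n => related.add(n));
  for (const target of outgoing) {
    related.add(target);
  }

  // Backlinks: every note whose outgoing links include noteId.
  for (const source of Object.keys(graph)) {
    if ((graph[source] ?? []).indexOf(noteId) !== -1) {
      related.add(source);
    }
  }

  related.delete(noteId); // a self-link must not make a note related to itself
  return Array.from(related);
}
```

Using a `Set` gives deduplication without extra bookkeeping, and deleting the source at the end covers both self-links and notes that link back to themselves.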

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Pro

Run ID: 011d7bb9-4a2c-4cd3-8683-2080bc44c244

📥 Commits

Reviewing files that changed from the base of the PR and between a3ccb2b and e89732e.

⛔ Files ignored due to path filters (1)
  • package-lock.json is excluded by !**/package-lock.json
📒 Files selected for processing (13)
  • .gitignore
  • jest.config.js
  • package.json
  • src/ai/__tests__/chunker.test.ts
  • src/ai/__tests__/knowledgeGraph.test.ts
  • src/ai/__tests__/linkParser.test.ts
  • src/ai/__tests__/relatedNotes.test.ts
  • src/ai/chunker.ts
  • src/ai/index.ts
  • src/ai/knowledgeGraph.ts
  • src/ai/linkParser.ts
  • src/ai/relatedNotes.ts
  • tsconfig.json

Clamp chunkMarkdown maxWords to at least 1 and tighten Chunk.metadata typing to Record<string, unknown> to prevent potential infinite loops and improve typing. Deduplicate extracted wiki links in buildKnowledgeGraph by using a Set. Minor cleanup in getRelatedNotes and ensure related notes are deduplicated and the source note is excluded. Update package.json main/types to point to dist/ai. Add unit tests to cover maxWords <= 0 guard, aliased wiki links, duplicate link deduplication, related-note deduplication, and self-link exclusion.
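
A minimal sketch of the Set-based deduplication described in this commit message is shown below. The `notes` input shape (note name mapped to Markdown content), the inline `extractWikiLinks` helper, and the `getBacklinks` signature are assumptions made so the example is self-contained.

```typescript
type KnowledgeGraph = Record<string, string[]>;

// Hypothetical stand-in for the PR's link parser; keeps duplicates on purpose.
function extractWikiLinks(content: string): string[] {
  const pattern = /\[\[([^\[\]|]+)(?:\|[^\[\]]*)?\]\]/g;
  const targets: string[] = [];
  let match: RegExpExecArray | null;
  while ((match = pattern.exec(content)) !== null) {
    targets.push((match[1] ?? "").trim());
  }
  return targets;
}

// Sketch: each source note keeps every outgoing target at most once.
function buildKnowledgeGraph(notes: Record<string, string>): KnowledgeGraph {
  const graph: KnowledgeGraph = {};
  for (const noteName of Object.keys(notes)) {
    const links = extractWikiLinks(notes[noteName] ?? "");
    graph[noteName] = Array.from(new Set(links)); // Set preserves first-seen order
  }
  return graph;
}

// With deduplicated edges, each source is reported once per target.
function getBacklinks(graph: KnowledgeGraph, target: string): string[] {
  return Object.keys(graph).filter((source) => (graph[source] ?? []).indexOf(target) !== -1);
}
```

The aliased-link case from the earlier review round falls out of the parser: only the target before the `|` becomes an edge, so the alias text never appears as a node.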
@github-actions github-actions bot added size/XL and removed size/XL labels Mar 5, 2026

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 2

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@src/ai/chunker.ts`:
- Line 30: Normalize and validate the maxWords parameter to a finite positive
integer before chunking to avoid NaN/Infinity/fractional values corrupting the
loop; specifically, in the routine that sets maxWords (the variable named
maxWords in src/ai/chunker.ts) replace the current Math.max(1, maxWords) usage
with a check that coerces maxWords to a number, falls back to 1 on
NaN/Infinity/non-number, and uses Math.floor/Math.trunc to ensure an integer > 0
so the chunking loop (the for/while that increments i and slices words) never
receives NaN or fractional steps and processes all words predictably.

In `@src/ai/knowledgeGraph.ts`:
- Around line 12-17: The graph is built using native object keys from
user-supplied note names which allows prototype-pollution keys (e.g.,
"__proto__", "constructor", "prototype") to mutate behavior; to fix, create the
graph with a null prototype (use Object.create(null)) and skip or sanitize any
noteName that equals dangerous identifiers before assigning into graph in the
loop that populates KnowledgeGraph (the block using graph, noteName, content and
extractWikiLinks); also ensure any future lookups against graph handle its
null-prototype shape.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Pro

Run ID: bc5b9ff6-9180-45d6-91e4-cf828791a613

📥 Commits

Reviewing files that changed from the base of the PR and between e89732e and c2a7b46.

📒 Files selected for processing (7)
  • package.json
  • src/ai/__tests__/chunker.test.ts
  • src/ai/__tests__/knowledgeGraph.test.ts
  • src/ai/__tests__/relatedNotes.test.ts
  • src/ai/chunker.ts
  • src/ai/knowledgeGraph.ts
  • src/ai/relatedNotes.ts

chunker.ts: Validate and normalize the maxWords parameter into an integer (normalizedMaxWords) using Number.isFinite and Math.floor, defaulting to 1, and use it for chunking logic to avoid issues with non-finite or non-integer inputs.
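
The normalization described here might look like the sketch below. The helper names `normalizeMaxWords` and `pageWords` are hypothetical; only the use of `Number.isFinite`, `Math.floor`, and a fallback of 1 comes from the commit message.

```typescript
// Sketch: coerce maxWords to a finite integer >= 1, falling back to 1.
function normalizeMaxWords(maxWords: number): number {
  if (!Number.isFinite(maxWords)) {
    return 1; // NaN, Infinity, -Infinity
  }
  return Math.max(1, Math.floor(maxWords)); // 2.9 -> 2, 0 -> 1, -5 -> 1
}

// The paging loop then always advances by a whole, positive step.
function pageWords(text: string, maxWords: number): string[] {
  const step = normalizeMaxWords(maxWords);
  const words = text.split(/\s+/).filter((w) => w.length > 0);
  const pages: string[] = [];
  for (let i = 0; i < words.length; i += step) {
    pages.push(words.slice(i, i + step).join(" "));
  }
  return pages;
}
```

Without the finite check, `Math.max(1, NaN)` evaluates to `NaN`, which is why clamping alone was not enough in the previous revision.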

knowledgeGraph.ts: Create graph and backlinks as null-prototype objects (Object.create(null)) to avoid prototype key collisions, cast graph to KnowledgeGraph, and use Object.prototype.hasOwnProperty.call when checking backlinks existence before pushing. These changes prevent unexpected behavior from inherited properties and improve robustness.
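
The null-prototype approach can be sketched as follows. The function names `toNullProtoGraph` and `computeBacklinks`, and the choice to return backlinks as a map from target to sources, are assumptions for illustration rather than the PR's actual API.

```typescript
type KnowledgeGraph = Record<string, string[]>;

// Sketch: build the graph on an object with no prototype, so note names such
// as "__proto__" or "constructor" become ordinary own keys.
function toNullProtoGraph(links: Array<[string, string[]]>): KnowledgeGraph {
  const graph = Object.create(null) as KnowledgeGraph;
  for (const [noteName, targets] of links) {
    graph[noteName] = Array.from(new Set(targets));
  }
  return graph;
}

// Sketch: invert the graph into target -> sources.
function computeBacklinks(graph: KnowledgeGraph): Record<string, string[]> {
  const backlinks = Object.create(null) as Record<string, string[]>;
  for (const source of Object.keys(graph)) {
    for (const target of graph[source] ?? []) {
      // A null-prototype object has no hasOwnProperty method of its own,
      // so borrow it from Object.prototype.
      if (!Object.prototype.hasOwnProperty.call(backlinks, target)) {
        backlinks[target] = [];
      }
      (backlinks[target] as string[]).push(source);
    }
  }
  return backlinks;
}
```

One consequence for callers: methods such as `graph.hasOwnProperty(...)` or `graph.toString()` are unavailable on the returned objects, so lookups should go through `Object.keys` or the borrowed `hasOwnProperty.call` form.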
@github-actions github-actions bot added size/XL and removed size/XL labels Mar 5, 2026