
LibrEd

LibrEd is a purely local, containerized, and agent-driven platform for exam preparation. It combines a modern React frontend with an autonomous backend pipeline that scrapes, classifies, and generates study materials from raw syllabus PDFs and local LLMs.

Live: https://dontcompete.vercel.app

Core Philosophy & Features

  • 100% Local & Private: All data processing and AI generation happens on your machine using Ollama. No external APIs, no cloud dependencies.
  • Container-First Architecture: The entire system runs via Docker Compose. No local Python or Node.js environment setup required.
  • Functional Asset Generator:
    • Sequential Pipeline: 8-stage functional sequence (Download -> OCR -> DB Sync -> Syllabus Parsing -> Classification -> Theory -> Manifest -> Audit).
    • Deterministic: Heuristic Parsing ensures high-fidelity image extraction for questions and explanations.
    • Idempotent: Re-runs extend existing datasets instead of recreating them.
  • Modern Modular Interface:
    • Modular Shell: Minimal root layout delegating logic to specialized, reusable components.
    • Adaptive Assessment: Handles MCQ, MSQ, and Numeric inputs with real-time validation.
    • Dynamic Navigation: URI-based breadcrumbs and stateful dashboard expansion.

Possible Improvements and prototype ideas (during GSoC)

  • OCR doesn't work well on some color schemes and layouts.
  • LLaMA 3.1 isn't accurate enough.
  • Duplicate handling in topic classification is a bit too strict.
  • Consider shifting knowledge generation fully to TypeScript?
  • Edit Markdown from the frontend?
  • Improve performance on CPU.
  • Platform is currently exam-specific; could be generalized.
  • Shift to asynchronous operations where viable.
  • Shift to better official sources for PYQs and answer keys, and generate explanations with an LLM. (The project currently relies on GateAcademy's explanations, which we don't want to depend on.)
  • A system to generate a study plan from previous-year question patterns (for example, use topic frequency to produce an ordered list of topics to study).
  • Re-evaluate the decision to move away from Git LFS; it will likely be needed for assets.

Getting Started

Prerequisites

  • Docker and Docker Compose (no local Python or Node.js environment setup is required).

Quick Start

  1. Clone the repository:

    git clone https://github.com/AOSSIE-Org/LibrEd.git
    cd LibrEd
  2. Launch the System:

    docker compose up --build
    • Frontend: Accessible at http://localhost:3000.
    • Generator: Autonomously populates content in the background.
    • Idempotency: Existing data is skipped; re-launching only processes new or missing streams.
  3. Monitor Pipeline:

    docker compose logs -f generator

Configuration

Central configuration is managed in generator/src/config.py. You can customize:

  • TARGET_STREAMS: Which exam streams to process (e.g., CS, DA).
  • OLLAMA_MODEL: The local LLM to use (default: llama3.1).
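A minimal sketch of what generator/src/config.py might look like. Only TARGET_STREAMS and OLLAMA_MODEL are documented above; the remaining fields are illustrative assumptions based on paths mentioned elsewhere in this README:

```python
# Hypothetical sketch of generator/src/config.py. Only TARGET_STREAMS and
# OLLAMA_MODEL are documented settings; the rest are illustrative.

# Exam streams to process (e.g., CS, DA).
TARGET_STREAMS = ["cs", "da"]

# Local LLM served by Ollama (default: llama3.1).
OLLAMA_MODEL = "llama3.1"

# Illustrative extras: paths referenced elsewhere in this README.
RAW_DATA_DIR = "data/raw"
DB_PATH = "data/app.duckdb"
```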

System Architecture

The system is split into two autonomous components that communicate via shared file-system artifacts:

  1. Asset Generator (/generator): A functional Python pipeline using DuckDB, PyMuPDF, tenacity (retries), and Ollama.
  2. Frontend (/frontend): A high-performance React application (Vite, TanStack Router) that dynamically discovers generated static assets via filesystem structure (Zero-Config discovery).

Data Pipeline: Detailed Components & Flows

The generator (generator/src/main.py) runs a sequential, atomic pipeline.

Stage 1: Acquisition (Scraping)

Component: ScraperEngine (scraper_engine.py) using Playwright.

  • Constraint: LLMs are explicitly NOT used for detection/downloading. Logic must be procedural/heuristic.
  • Logic:
    1. Syllabus: Visit /syllabus/{stream} -> Find year page -> Extract PDF link.
    2. PYQs: Visit /py-papers -> Filter by stream slug -> Iterate years -> Extract PDF links.
  • Optimization: Skips re-downloading if file exists in data/raw/.
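The skip-if-exists optimization reduces to a small guard before each download. A sketch (the function name is hypothetical; only the data/raw/ convention comes from the pipeline description):

```python
from pathlib import Path

def should_download(raw_dir: Path, pdf_name: str) -> bool:
    """Skip re-downloading: return False when the PDF already exists in raw_dir.

    Hypothetical helper illustrating the Stage 1 optimization; the real
    ScraperEngine may also validate file size or integrity.
    """
    return not (raw_dir / pdf_name).exists()
```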

Stage 2: Processing (Ingestion)

Component: pdf_utils.py using PyMuPDF (fitz) and Pillow.

  • State Machine:
    • START -> Question \d+ -> QUESTION
    • QUESTION -> Ans. -> ANSWER
    • ANSWER -> Sol. -> EXPLANATION
  • Image Stitching:
    • Full Width: Captures full content width.
    • Vertical Merge: Merges multi-page segments into single q.png/exp.png.
  • Validation: Image extraction occurs only when valid boundaries are detected.
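The state machine above can be sketched as a line-driven parser. This is an illustrative reduction, not the actual pdf_utils.py code: marker lines (Question \d+, Ans., Sol.) drive the transitions, and intervening text accumulates into whichever section is open:

```python
import re

# Sketch of the Stage 2 state machine:
# START -> QUESTION -> ANSWER -> EXPLANATION, driven by marker lines.
QUESTION_RE = re.compile(r"^Question\s+(\d+)")

def split_sections(lines):
    state = "START"
    sections = {"QUESTION": [], "ANSWER": [], "EXPLANATION": []}
    for line in lines:
        if QUESTION_RE.match(line):
            state = "QUESTION"          # START -> QUESTION
            continue
        if line.startswith("Ans.") and state == "QUESTION":
            state = "ANSWER"            # QUESTION -> ANSWER
            sections["ANSWER"].append(line[len("Ans."):].strip())
            continue
        if line.startswith("Sol.") and state == "ANSWER":
            state = "EXPLANATION"       # ANSWER -> EXPLANATION
            continue
        if state in sections:
            sections[state].append(line)
    return {k: " ".join(v).strip() for k, v in sections.items()}
```

Image extraction in the real pipeline follows the same boundaries: only spans with a detected start and end marker are cropped and stitched.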

Stage 3: AI Analysis (Syllabus Parsing)

Component: knowledge_utils.py + SyllabusParser

  • Input: Syllabus PDFs from Stage 1.
  • Prompt: Extracts structured hierarchy (Subjects -> Subtopics) from raw PDF text.
  • Output: Populates subjects and subtopics tables (idempotent).
  • Constraint: Must run before Question Classification.

Stage 4: AI Analysis (Classification)

Component: knowledge_utils.py + prompt_utils.py + Ollama.

Prompt Generation
  • Stateless: Prompts must be self-contained within context window.
  • Input: Syllabus Database + Batch of Questions (default 5).
  • Task: Map Question ID -> Subject -> Subtopic.
  • Handling Unknowns: Maps "Other" to "General Aptitude" -> "Miscellaneous".
LLM Processing
  • Orchestration: Sequential/Batched execution to handle local resource limits.
  • Output: JSON-only response parsed and synced to questions table.
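The batching and stateless-prompt constraints can be sketched as follows. The prompt wording and function names are assumptions; only the shape (syllabus context + batch of 5 questions, JSON-only reply) comes from the description above:

```python
import json

BATCH_SIZE = 5  # default batch size from the pipeline description

def build_prompt(syllabus: dict, batch: list) -> str:
    """Build a stateless, self-contained classification prompt (illustrative)."""
    lines = [
        "You are a classifier. Reply with JSON only.",
        "Syllabus: " + json.dumps(syllabus),
        "Map each question id to a subject and subtopic:",
    ]
    for q in batch:
        lines.append(f'- {q["id"]}: {q["q_text"]}')
    return "\n".join(lines)

def batches(questions: list, size: int = BATCH_SIZE):
    """Yield fixed-size batches so each prompt fits the local context window."""
    for i in range(0, len(questions), size):
        yield questions[i:i + size]
```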

Stage 5: AI Analysis (Theory Generation)

For each Subtopic with > 0 questions:

  • Prompt: Includes existing theory and all questions as context to determine depth/scope.
  • Output: Markdown with Mermaid diagrams (graph LR, etc.) and KaTeX math.
  • Update Rule: Updates existing files only if there's something new to add.
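The update rule amounts to a write-if-changed guard. A sketch under the assumption that "something new to add" can be approximated by content inequality (the real check may be richer):

```python
from pathlib import Path

def write_if_changed(path: Path, new_md: str) -> bool:
    """Write the generated Markdown only when it differs from what is on disk.

    Hypothetical sketch of the Stage 5 update rule.
    """
    if path.exists() and path.read_text() == new_md:
        return False  # nothing new: leave the existing file untouched
    path.write_text(new_md)
    return True
```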

Stage 6: Manifest Generation (Export)

Component: knowledge_utils.generate_manifest (Per-Stream)

  • No Global Registry: Does not generate a global exams.json or info.json. Discovery is purely filesystem-based.
  • Output: Generates structure.json inside each stream's folder.
  • Copy/Linking: Ensures all referenced images exist in frontend/assets.
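Since discovery is purely filesystem-based, manifest generation reduces to walking a stream's folder. A sketch only: the actual structure.json schema is not documented here, so this version simply records subject folders and their Markdown subtopics:

```python
import json
from pathlib import Path

def generate_manifest(stream_dir: Path) -> Path:
    """Write a per-stream structure.json by walking the filesystem (sketch).

    Assumes subject subdirectories contain one .md file per subtopic and
    that the questions/ folder is not part of the taxonomy.
    """
    structure = {
        subject.name: sorted(md.stem for md in subject.glob("*.md"))
        for subject in sorted(stream_dir.iterdir())
        if subject.is_dir() and subject.name != "questions"
    }
    out = stream_dir / "structure.json"
    out.write_text(json.dumps(structure, indent=2))
    return out
```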

Stage 7: Auditing

Users can improve the generated notes, and the LLM uses those improvements as a reference in subsequent runs.


Database Schema (DuckDB)

The system uses DuckDB (data/app.duckdb) as an intermediate relational store.

| Table     | Column       | Type    | Description                                      |
| --------- | ------------ | ------- | ------------------------------------------------ |
| questions | id           | VARCHAR | Global composite ID ({stream}_{packet}_{qno})    |
| questions | stream_code  | VARCHAR | e.g., computer-science-information-technology    |
| questions | packet_id    | VARCHAR | Source PDF identifier (e.g., 2024-M)             |
| questions | question_no  | VARCHAR | e.g., 1, 55                                      |
| questions | q_type       | VARCHAR | MCQ, MSQ, NAT                                    |
| questions | q_key        | VARCHAR | Answer key (e.g., A, 55.2)                       |
| questions | q_text       | TEXT    | Extracted text of the question                   |
| questions | a_text       | TEXT    | Extracted text of the answer                     |
| questions | exp_text     | TEXT    | Extracted text of the explanation                |
| questions | subtopic_id  | VARCHAR | FK to subtopics.id; populated by the LLM         |
| questions | img_path_q   | VARCHAR | Relative path to the question image              |
| questions | img_path_exp | VARCHAR | Relative path to the explanation image           |
| subjects  | id           | VARCHAR | e.g., cs_subj_1                                  |
| subjects  | name         | VARCHAR | e.g., Digital Logic                              |
| subtopics | id           | VARCHAR | e.g., cs_subj_1_topic_3                          |
| subtopics | subject_id   | VARCHAR | FK to subjects.id                                |
| subtopics | name         | VARCHAR | e.g., Minimization                               |
| theory    | id           | VARCHAR | e.g., theory_cs_subj_1_topic_3                   |
| theory    | subtopic_id  | VARCHAR | FK to subtopics.id                               |
| theory    | content_md   | TEXT    | Generated Markdown content                       |
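The "Safe DB Ops" constraint below (INSERT OR IGNORE/REPLACE) is what keeps syncing into these tables idempotent. A sketch using the stdlib sqlite3 module as a stand-in for DuckDB (both accept INSERT OR IGNORE), with the column set trimmed for brevity:

```python
import sqlite3

def sync_question(con, qid, stream_code, q_type):
    """Idempotent sync: re-running the pipeline never duplicates a row."""
    con.execute(
        "INSERT OR IGNORE INTO questions (id, stream_code, q_type) VALUES (?, ?, ?)",
        (qid, stream_code, q_type),
    )

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE questions (id TEXT PRIMARY KEY, stream_code TEXT, q_type TEXT)")
sync_question(con, "cs_2024-M_1", "cs", "MCQ")
sync_question(con, "cs_2024-M_1", "cs", "MCQ")  # re-run: ignored, not duplicated
```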

Frontend Architecture (React)

Tech Stack

  • Framework: TanStack Start / React (Vite).
  • Styling: Tailwind CSS + DaisyUI.
  • Routing: File-based (@tanstack/react-router).
  • Linting: Biome (No ESLint/Prettier).
  • MDX: rehype-katex and mermaid support.

Modular & Dynamic UI

  • Modular Root Layout: __root.tsx acts as a minimal structural shell, delegating specific behaviors to:
    • ThemeScript: Injects a synchronous, blocking script into <head> to prevent Flash of Unstyled Content (FOUC).
    • GlobalBreadcrumbs: Dynamically generates consistent navigation from URI path segments, avoiding hardcoded labels.
  • Stateful Dashboard: Uses query parameters (?expanded=) for targeted expansion while defaulting to "All Expanded" to maximize content visibility.

Assessment Logic

  • Flow: Stream -> Subject -> Subtopic -> Theory -> Assessment.
  • Rules:
    • Max 20 questions per attempt (Randomized).
    • Time Limit: 4 minutes per question.
  • Interaction:
    • MCQ/MSQ/NAT: Adaptive input fields.
    • Submission: Correct -> Next; Incorrect -> Show Explanation.
  • Rendering:
    • Theory: MDX with rehype-katex and mermaid.
    • Placeholders: Code-based UI for missing artifacts.

Data Contracts & Artifacts

Location

All frontend-consumable data resides in: frontend/public/assets/gate/

File Structure

assets/gate/
└── cs/
    ├── structure.json
    ├── digital-logic/
    │   ├── boolean-algebra.md
    │   └── number-systems.md
    └── questions/
        └── 2024-M/
            └── 1/
                ├── q.png
                ├── exp.png
                └── data.json

System Design Diagram

sequenceDiagram
    participant S as Scraper
    participant FS as FileSystem
    participant P as Processor (No OOP)
    participant DB as DuckDB
    participant G as Generator (Func)
    participant LLM as Ollama
    participant FE as Frontend

    Note over S, FS: Stage 1: Acquisition
    S->>S: Heuristic DOM Analysis (No LLM)
    S->>FS: Download PDF (Skip if exists)

    Note over P, DB: Stage 2: Processing & Sync
    P->>FS: Read PDF
    P->>P: Stitch (Full Width) & Crop (3%/5%)
    P->>FS: Save q.png, exp.png, data.json
    P->>DB: Sync Metadata

    Note over G, LLM: Stage 3,4,5: Classification
    G->>DB: Fetch Questions
    G->>G: Create Stateless Prompts
    G->>LLM: Classify (JSON)
    LLM-->>G: Response
    G->>DB: Update Taxonomy

    Note over G, LLM: Stage 6,7: Theory
    G->>DB: Fetch Context
    G->>LLM: Generate Theory (MD + Mermaid)
    G->>FS: Save {topic}.md

    Note over G, FS: Stage 8: Manifest
    G->>FS: Generate structure.json

    Note over FE, FS: Runtime
    FE->>FS: Load structure.json
    FE->>FE: Select Subtopic
    FE->>FE: Render Theory
    FE->>FE: Start Test (Random 20, 4min/q)

Constraints

Execution & Environment

  • No Local Installations: Entire workflow must run via Docker / Docker Compose.
  • Single-Entry Workflow: Docker Compose runs both asset generation and frontend.
  • Local & Private: Relies entirely on local LLMs (Ollama) and local artifacts; No remote API support.

Data Integrity & Reusability

  • Incremental & Idempotent: Re-runs extend existing datasets instead of recreating them.
  • Reusability-First: Existing PDFs, databases, and artifacts must be reused.
  • Single Source of Truth: All derived data must be traceable to original PDFs.
  • Robust Prompting: Prompts must be self-contained (stateless) and designed to fit within model context windows.
  • No Hardcoded Values: Architecture should minimize hardcoded values, unless module-specific.

Performance & Safety

  • Skip Re-downloading: Do not download PDFs if they already exist.
  • Safe DB Ops: Use INSERT OR IGNORE/REPLACE to maintain idempotency.
  • Valid Extraction: Image extraction occurs only when valid boundaries are detected.

Non-Goals

  • Authentication, Cloud Deployment, Real-time collaboration, Analytics (beyond counts).


Contributing

We are building a free, high-quality platform for everyone, and we need your help to achieve that!

Non-Coding Contributions

AI is a powerful accelerator, but it's not perfect. We rely on the community to ensure quality and depth.

  • Improve Theories: AI-generated explanations can be generic or miss nuance. If you have a better explanation, analogy, or diagram for a concept, please submit a PR!
  • Quality Assurance:
    • Review: Help us verify the correctness of questions and answers in Pull Requests.
    • Model Testing: Run the generator with different LLMs (Mistral, Gemma, Phi-3, or larger parameter models on your local machine) and report which yields the best results.
  • Community Questions: Identify gaps in our question bank and add commonly asked questions or "gotchas" for specific topics.
  • Expand Scope: PRs adding support for other competitive exams are highly welcome! Let's build a universal free platform together.

Testing

The project includes a comprehensive test suite that runs in Docker.

1. Generator Tests (Backend)

docker compose run --rm asset-generator pytest generator/tests

2. Frontend Tests (Playwright)

Run the end-to-end tests using the official Playwright container:

docker run --rm --network gatebuster_app_network -e BASE_URL=http://frontend:3000 -v "$(pwd)/frontend:/app" -w /app mcr.microsoft.com/playwright:v1.58.0-jammy sh -c "npm install && npx playwright test"

Note: Ensure the frontend service is running (docker compose up) before starting Playwright tests.

License

Apache 2.0 License - see LICENSE for details.
