Extract the Core Soul of Any Book.
Upload any PDF (1711+ pages supported) and our advanced TF-IDF + TextRank hybrid algorithm scans the full text, identifies the core concepts, rules, smart ways, index, and essential ideas, and returns a distilled essence of 711+ ranked sentences in seconds.
Drop your PDF — get the core content
Everything runs locally in your browser. Your files never leave your device.
Built with PhD-level math & 3D aesthetics
Every detail engineered for speed, accuracy, and a world-class experience.
1711+ Pages Capacity
Stream-parses giant PDFs page-by-page using PDF.js — no memory crash, no freeze. Background async processing keeps the UI buttery smooth.
711+ Core Sentences
Hybrid TF-IDF × TextRank scoring distills the deepest essence into hundreds of ranked sentences with confidence scores.
Auto Index Detection
Detects chapters, sections, headings & TOC entries via font-weight heuristics + numeric pattern matching.
Rules & Smart Ways Mining
Regex + linguistic cues isolate imperative principles, laws, and "must"/"never"/"always" patterns from the corpus.
Concept Map
Top N concepts ranked by TF-IDF weight with frequency badges — your book's mental model at a glance.
Executive Summary
One-glance distilled summary with adjustable depth — perfect for revision, research, or quick recall.
3D Tilt Hero
Three.js neural-core visualization with mouse-tilt parallax — premium attention-grabbing intro.
100% Client-Side
Zero server. Zero upload. Your sensitive PDFs are processed entirely in your browser via Web Workers.
Export Anywhere
Download results as a polished .txt report or as structured .json for further pipeline integration.
The Core Extraction Algorithm
A hybrid mathematical pipeline rooted in IIT-grade Information Retrieval & Graph Theory.
① TF-IDF Scoring
Each sentence is tokenized and stripped of stop words, then scored using Term Frequency × Inverse Document Frequency, with each page of the book treated as one document.
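The scoring described here matches the textbook TF-IDF form. As a sketch only (the logarithm base and any smoothing the tool applies are not stated, so both are assumptions):

```latex
\[
\mathrm{idf}(t) = \log\frac{N}{\mathrm{df}(t)}, \qquad
w(t, p) = \mathrm{tf}(t, p)\cdot\mathrm{idf}(t), \qquad
\mathrm{score}_{\mathrm{TFIDF}}(s) = \frac{1}{|s|}\sum_{t \in s} w(t, p_s)
\]
```

The symbols tf(t, p) (count of term t on page p), p_s (the page containing sentence s), and |s| (the token count of s) are notation introduced for this sketch.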
In the IDF term, N is the total number of pages and df(t) is the number of pages containing term t. Each sentence inherits the sum of its tokens' weights, normalized by sentence length.
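The TF-IDF step can be sketched in a few lines of TypeScript. This is a minimal illustration, not the tool's actual code: the helper names (tokenize, inverseDocumentFrequency, scoreSentence), the tiny stop-word list, and the unsmoothed natural-log IDF are all assumptions.

```typescript
// Tiny illustrative stop-word list (assumption); a real list would be much longer.
const STOP_WORDS = new Set<string>(["the", "a", "an", "is", "of", "and", "to", "in"]);

// Lowercase, split on non-alphanumerics, drop stop words.
function tokenize(text: string): string[] {
  const words: string[] = text.toLowerCase().match(/[a-z0-9]+/g) ?? [];
  return words.filter((w) => !STOP_WORDS.has(w));
}

// idf(t) = log(N / df(t)), where each page counts as one document.
function inverseDocumentFrequency(pages: string[]): Map<string, number> {
  const df = new Map<string, number>();
  for (const page of pages) {
    // Count each term once per page.
    new Set(tokenize(page)).forEach((term) => {
      df.set(term, (df.get(term) ?? 0) + 1);
    });
  }
  const idf = new Map<string, number>();
  df.forEach((count, term) => {
    idf.set(term, Math.log(pages.length / count));
  });
  return idf;
}

// Sum of tf * idf over the sentence's tokens, normalized by sentence length.
// Term frequency is counted on the page that contains the sentence.
function scoreSentence(sentence: string, page: string, idf: Map<string, number>): number {
  const tokens = tokenize(sentence);
  if (tokens.length === 0) return 0;
  const tf = new Map<string, number>();
  for (const t of tokenize(page)) tf.set(t, (tf.get(t) ?? 0) + 1);
  let total = 0;
  for (const t of tokens) total += (tf.get(t) ?? 0) * (idf.get(t) ?? 0);
  return total / tokens.length;
}
```

With this unsmoothed IDF, a term that appears on every page gets weight log(1) = 0, so boilerplate words that survive the stop-word filter still contribute nothing.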
② TextRank Graph
A graph G(V, E) is built in which vertices are sentences and edges are weighted by Jaccard or cosine similarity. PageRank is then iterated over this graph.
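In the standard weighted TextRank formulation (assumed here; the exact variant the tool uses is not stated), each sentence score is updated from its neighbors as:

```latex
\[
R(v_i) = (1 - d) + d \sum_{v_j \in \mathrm{Adj}(v_i)}
\frac{w_{ji}}{\sum_{v_k \in \mathrm{Adj}(v_j)} w_{jk}} \, R(v_j)
\]
```

Here w_{ji} is the similarity weight on the edge between sentences j and i, Adj(v) is the set of sentences connected to v, and R is the rank being iterated; these symbols are notation introduced for this sketch.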
With a damping factor d = 0.85, a convergence threshold of 10⁻⁴, and a cap of 30 iterations, this ranks sentences by global importance regardless of their length.
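A compact sketch of the TextRank step, using Jaccard similarity and the parameters quoted above as defaults. The function names (jaccard, textRank) and the dense weight matrix are assumptions made for illustration; a dense n × n matrix would be too large for a full book, so a real implementation would presumably prune or window the graph.

```typescript
// Jaccard similarity between two token lists: |A ∩ B| / |A ∪ B|.
function jaccard(a: string[], b: string[]): number {
  const setA = new Set<string>(a);
  const setB = new Set<string>(b);
  let shared = 0;
  setA.forEach((t) => {
    if (setB.has(t)) shared += 1;
  });
  const union = setA.size + setB.size - shared;
  return union === 0 ? 0 : shared / union;
}

// Weighted PageRank over the sentence-similarity graph.
// Input: one token list per sentence. Output: one rank per sentence.
function textRank(sentences: string[][], d = 0.85, tol = 1e-4, maxIter = 30): number[] {
  const n = sentences.length;
  // Symmetric weight matrix with no self-loops.
  const w: number[][] = sentences.map((si, i) =>
    sentences.map((sj, j) => (i === j ? 0 : jaccard(si, sj)))
  );
  // Total edge weight leaving each sentence, used to normalize its votes.
  const outSum: number[] = w.map((row) => row.reduce((acc, x) => acc + x, 0));
  let rank: number[] = sentences.map(() => 1);
  for (let iter = 0; iter < maxIter; iter++) {
    const next: number[] = rank.map((_, i) => {
      let incoming = 0;
      for (let j = 0; j < n; j++) {
        if (outSum[j] > 0) incoming += (w[j][i] / outSum[j]) * rank[j];
      }
      return 1 - d + d * incoming;
    });
    // Stop once the largest per-sentence change falls below the threshold.
    const delta = next.reduce((acc, x, i) => Math.max(acc, Math.abs(x - rank[i])), 0);
    rank = next;
    if (delta < tol) break;
  }
  return rank;
}
```

A sentence that shares no tokens with any other sentence receives no votes and settles at the floor value 1 - d.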
③ Hybrid Fusion
The final score blends both signals with a positional boost, using adaptive weights.
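A linear blend is the natural reading of this step. As a sketch, assuming α weights the TF-IDF score, β the TextRank score, and γ the positional boost (the pairing is an assumption, and T, R, and P are notation introduced here for the three normalized signals):

```latex
\[
\mathrm{Score}(s) = \alpha \, T(s) + \beta \, R(s) + \gamma \, P(s),
\qquad \alpha + \beta + \gamma = 1
\]
```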
Defaults: α = 0.45, β = 0.45, γ = 0.10. The positional boost favors introduction and conclusion sentences (claimed to give 23% better recall in long-form non-fiction).
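The fusion step might look like the sketch below. Several details are assumptions, since the text does not specify them: min-max normalization of each signal before blending, a boost of 1 for sentences in the first or last 10% of the document and 0 elsewhere, and the helper names minMax, positionalBoost, and fuseScores.

```typescript
// Rescale a list of scores to the range [0, 1]; constant lists map to 0.
function minMax(xs: number[]): number[] {
  const lo = Math.min(...xs);
  const hi = Math.max(...xs);
  return xs.map((x) => (hi === lo ? 0 : (x - lo) / (hi - lo)));
}

// Hypothetical positional boost: 1 near the start or end of the document, else 0.
function positionalBoost(index: number, total: number, edge = 0.1): number {
  const rel = total <= 1 ? 0 : index / (total - 1);
  return rel <= edge || rel >= 1 - edge ? 1 : 0;
}

// Blend the two normalized signals and the positional boost.
// Both input arrays are expected to have one entry per sentence, in document order.
function fuseScores(
  tfidfScores: number[],
  rankScores: number[],
  alpha = 0.45,
  beta = 0.45,
  gamma = 0.1
): number[] {
  const t = minMax(tfidfScores);
  const r = minMax(rankScores);
  return t.map((ti, i) => alpha * ti + beta * r[i] + gamma * positionalBoost(i, t.length));
}
```

Normalizing first keeps either signal from dominating just because its raw scale is larger; the weights then express relative importance directly.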
④ Structural Mining
Index entries and headings are detected via font size and a numbering regex (/^(Chapter|\d+\.\d*)/). Rules are extracted via imperative patterns:
- Modal verbs: must / never / always / should
- Definitional cues: "is defined as", "the principle of"
- Numeric laws: "Law 1:", "Rule #N"
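The text-based half of this mining step can be sketched with a handful of regular expressions. The heading regex is the one quoted above; the three cue patterns are illustrative guesses at how the listed cues might be encoded (for example, "Rule #N" is read as "Rule #" followed by digits), and the names isHeading and ruleCues are hypothetical.

```typescript
// Heading pattern quoted in the text above.
const HEADING_PATTERN = /^(Chapter|\d+\.\d*)/;

// Illustrative encodings of the three cue families (assumptions).
const RULE_CUES: Array<{ kind: string; pattern: RegExp }> = [
  { kind: "modal", pattern: /\b(must|never|always|should)\b/i },
  { kind: "definition", pattern: /\b(is defined as|the principle of)\b/i },
  { kind: "numeric-law", pattern: /\b(Law\s+\d+:|Rule\s+#\d+)/i },
];

// True when a line looks like a chapter or numbered-section heading.
function isHeading(line: string): boolean {
  return HEADING_PATTERN.test(line.trim());
}

// Returns the kinds of rule cue found in a sentence (empty if none).
function ruleCues(sentence: string): string[] {
  return RULE_CUES.filter((cue) => cue.pattern.test(sentence)).map((cue) => cue.kind);
}
```

On its own the heading regex also matches ordinary lines that start with a decimal number, such as "3.14 is a common approximation", which is presumably why it is paired with a font-size check.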
All of this runs inside a Web-Worker-friendly architecture, so the UI stays non-blocking on 1711+ page books.