pdf-text-extraction

Here are 28 public repositories matching this topic...

houking-can / PDFSDK

Python interface based on the Foxit Quick PDF Library.

pdf-merge pdf-split pdf-document-processor pdf-sdk pdf-text-extraction

Updated Apr 4, 2020
Python

mamiriqbal1 / rag_book_qa_prompt

A simple demonstration of how you can implement retrieval augmented generation (RAG) for a book.

question-answering rag pdf-text-extraction large-language-models llm chatgpt-web retrieval-augmented-generation

Updated Nov 29, 2023
Jupyter Notebook

rithulkamesh / docproc

Document Intelligence Platform — Extract, refine, and query documents with vision LLMs and config-driven RAG.

python machine-learning ocr text-classification text-extraction data-extraction region-detection content-extraction document-analysis layout-analysis pdf-processing pdf-text-extraction document-parsing equation-detection mathematical-symbols

Updated Mar 30, 2026
Python

hyeonsangjeon / PDF2LLM-Tuning-Studio

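A solution that automatically generates high-quality question-answering (QA) data from PDF documents using GPU-accelerated processing and efficiently fine-tunes LLMs. It generates domain-specific QA pairs with the Unstructured library and AWS Bedrock Claude, and trains lightweight models with LoRA.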

Updated Jan 22, 2026
Jupyter Notebook

vijayengineer / PDFTextSpeechConverter

Converts scanned and ordinary documents into MP3 speech using Amazon Polly.

pdf text images speech aws-polly audiobook synthesis scanned-documents pdf-text-extraction

Updated Dec 30, 2020
Python

PrathameshDhande22 / PdfTxtBot

A Telegram bot that extracts text from PDFs and also extracts images of PDF pages. Made with Python.

python telegram telegram-bot python3 python-telegram-bot image-extractor python-telegram pdf-text pdf-text-extraction pdf-image

Updated Feb 27, 2023
Python

Zeeshanahmad4 / NLP-Pdf-Minning-Extracting-text-from-pdf

NLP PDF mining: extracting text from PDFs.

python pdf pdf-converter text-extraction pdfkit pdf-files extract-text pdftotext pdf-format pdf-document-processor pdftoimage pdftools pdftohtml pdf-text-extraction pdfcon

Updated Apr 2, 2020
Python

kushalpatel0265 / Resume-Parser

A resume parser that extracts key details from PDF files using Groq's LLM

python nlp api google-colab pdf-text-extraction streamlit-webapp llm

Updated Apr 14, 2025
Jupyter Notebook

eli64s / pdflex

CLI for merging PDF contexts.

pdf-converter pdf-document pdf-generator pdf-manipulation pdf-extractor pdf-library pdf-parser pdf-data-extraction pdf-processor pdf-tools pdf-document-processor python-pdf pdf-search pdf-text-extraction pdf-python pdf-automation python-pdf-tools pdf-document-parser pdf-regex

Updated Mar 20, 2025
Python

VirajMadhu / pdf_key_matcher

Highlights the key matches between a given PDF and the description text.

python open-source pdf cv python-script python3 text-extraction terminal-based ats text-compression pdf-text-extraction virajmadhu

Updated Dec 4, 2024
Python

bladeacer / pdf-fmt

A PDF text extractor, processor and formatter. Supports regex based exclusions and other niceties.

python pdf text-formatting pdf-text-extraction

Updated Mar 24, 2026
Python

nsourlos / OCR_and_RAG

Tests of OCR and RAG with LLMs

information-retrieval ocr gemini openai mistral document-processing cohere rag pdf-text-extraction colpali qwen2-vl

Updated Jun 23, 2025
Jupyter Notebook

ZobayerAkib / AI-Invoice-Analyzer

An AI-powered invoice and receipt analyzer that extracts structured invoice data from images (JPG/PNG) and PDF documents using a Large Language Model (LLM).

pdf image fastapi pdf-text-extraction openai-api pymupdf-fitz llm invoice-analysis

Updated Mar 3, 2026
Python

holasoymas / text-finder

Console app that finds text in a PDF and reports the page number.

csharp console-app pdf-text-extraction pdf-text-processing

Updated Mar 20, 2025
C#

rmottanet / unchainedtext

UnchainedText: Break free from PDFs! Easily extract raw text to .txt for preprocessing.

extractor text-extraction data-extraction text-processing pdf-text-extraction text-extraction-tool

Updated Apr 2, 2024
Python

alorbach / pypdf-toolbox-gui

A local, Python-based GUI toolbox for common PDF operations such as merge, split, scan, OCR, and document preprocessing. Fully offline, extensible, and open source.

python pdf cross-platform pdf-converter pdf-manipulation pdf-merge pdf-utilities pdf-tools pdf-splitter pdf-processing pdf-text-extraction

Updated Mar 20, 2026
Python

craigtrim / gpu-text-harvest

GPU-accelerated batch PDF text extraction wrapper for marker-pdf on NVIDIA Grace Blackwell.

pymupdf pdf-text-extraction marker-pdf nvidia-graceblackwell ollama-cleanup

Updated Dec 5, 2025
Python

ahsan-javed-ds / file-text-extractor-java-project

Text extraction utility for multiple file formats (PDF/DOC/DOCX/XLSX/XLS/CSV), written in Java.

java maven log4j intellij text-extraction pdfbox java-programming java-project apache-tika apache-poi apache-maven pdf-text-extraction jdk17 doc-text-extraction docx-text-extraction xls-text-extraction xlsx-text-extraction csv-text-extraction

Updated Oct 24, 2024
Java

hafsa-imtiaz / legal-nlp-pipeline-from-scratch

This repository implements an end-to-end NLP pipeline for legal documents, including OCR-based text extraction, neural language modeling from scratch (NumPy), sentence and document embeddings, extractive and abstractive summarization, grammar refinement, and semantic case similarity retrieval using cosine similarity.

natural-language-processing ocr numpy word-embeddings semantic-similarity cosine-similarity extractive-summarization sentence-embeddings abstractive-summarization document-embeddings grammar-correction legal-nlp pdf-text-extraction nlp-from-scratch

Updated Feb 7, 2026
Jupyter Notebook

harisuraram / Nitw-chatbot

NITW Chatbot is a Retrieval-Augmented Generation (RAG) based AI system that answers queries using official institutional documents. It scrapes PDFs, generates embeddings, stores them in a FAISS vector index, and retrieves relevant context for LLM-based response generation, ensuring grounded and accurate answers.

python selenium faiss pdf-text-extraction embedding-model llm-api rag-pipeline

Updated Mar 1, 2026
Python
