Our Review

Search and extract content from PDF files without loading the entire document.

Strengths

Fast search with pdfgrep, including regular expressions; matching is case-sensitive unless -i is given.
Extraction of specific pages via pdftotext, which helps avoid overloading the context.
Supports plain-text output and layout-preserving output for tables.

Limitations

Does not work with scanned (image-based) PDFs.
Requires pdfgrep and poppler-utils to be installed beforehand.
Contextual search is limited to a few lines before and after each match.

When to Use It

To quickly extract specific passages from a large PDF without reading it in full.

When to Avoid It

When the PDF consists mainly of scanned images, or when the extraction is complex and must reproduce the exact layout (prefer OCR or a dedicated tool).

Examples

Search for a term in a PDF

Search for 'authentication' in the manual.pdf file and show me the page numbers and a few lines of context around each match.

Extract pages from a PDF as text

Extract pages 10 to 15 from large-report.pdf and output the text.

Get page count of a PDF

Tell me how many pages are in the document guide.pdf.

name: pdf-tools
description: Search and extract content from PDF files. Use when searching PDFs, finding text in documents, or extracting specific pages without reading the entire file.
allowed-tools: Bash, Read, Glob

PDF Tools

Search and extract content from PDFs without loading entire files into context.

Installation

# macOS
brew install pdfgrep poppler

# Ubuntu/Debian
sudo apt install pdfgrep poppler-utils

Quick Reference

| Task | Command |
|------|---------|
| Search | pdfgrep "term" file.pdf |
| Search with page numbers | pdfgrep -n "term" file.pdf |
| Search with context | pdfgrep -n -C 2 "term" file.pdf |
| Get page count | pdfinfo file.pdf \| grep Pages |
| Extract pages 5-10 | pdftotext -f 5 -l 10 file.pdf - |

Core Workflow

Step 1: Search - Find where content lives

pdfgrep -n "authentication" large-manual.pdf
# Output: 42: User authentication requires...
#         45: Authentication tokens expire...

Step 2: Extract - Get just those pages

pdftotext -f 41 -l 46 large-manual.pdf -
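
The two steps above can be chained. The following is a minimal sketch, assuming `pdfgrep -n` prints one `PAGE: text` line per match as in the sample output; `match_range` and `extract_around` are hypothetical helper names, not part of pdfgrep or poppler, and the one-page padding is an arbitrary default.

```shell
# Read "PAGE: text" lines on stdin; print "FIRST LAST", padded by $1 pages (default 1).
match_range() {
  local pad="${1:-1}"
  awk -F: -v pad="$pad" '
    $1 ~ /^[0-9]+$/ {
      if (min == "" || $1 + 0 < min) min = $1 + 0
      if ($1 + 0 > max) max = $1 + 0
    }
    END {
      if (min == "") exit 1        # no match lines were read
      first = min - pad
      if (first < 1) first = 1     # page numbers start at 1
      print first, max + pad
    }'
}

# Search FILE for TERM, then extract the padded page range that covers every match.
extract_around() {
  local term="$1" file="$2" pad="${3:-1}" range
  range="$(pdfgrep -n "$term" "$file" | match_range "$pad")" || {
    echo "no matches for '$term' in $file" >&2
    return 1
  }
  # range is "FIRST LAST"; split it on the single space
  pdftotext -f "${range% *}" -l "${range#* }" "$file" -
}
```

Calling `extract_around "authentication" large-manual.pdf` with the matches shown above would run the same `pdftotext -f 41 -l 46` command. When matches are spread far apart, a single covering range can be large, so extracting around each match separately may suit the context budget better.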

Search Commands

# Basic search
pdfgrep "search term" document.pdf

# Case-insensitive
pdfgrep -i "search term" document.pdf

# With page numbers
pdfgrep -n "search term" document.pdf

# With context (2 lines before/after)
pdfgrep -n -C 2 "search term" document.pdf

# Count occurrences
pdfgrep -c "search term" document.pdf

# Search all PDFs in directory
pdfgrep -r "term" /path/to/pdfs/
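
The count from `-c` is reported per file. To see which pages hold the most matches, one option is to tally the page numbers from `-n` output. This is a rough sketch that assumes the `PAGE: text` line format shown earlier and a single input file; `hits_per_page` is a hypothetical helper name.

```shell
# Read `pdfgrep -n` output ("PAGE: text") on stdin; print "PAGE COUNT" per page, in page order.
hits_per_page() {
  awk -F: '$1 ~ /^[0-9]+$/ { count[$1]++ }
           END { for (p in count) print p, count[p] }' | sort -n
}
```

Usage: `pdfgrep -n -i "search term" document.pdf | hits_per_page`.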

Extract Commands

# Extract specific page range
pdftotext -f 10 -l 15 document.pdf -

# Extract single page
pdftotext -f 42 -l 42 document.pdf -

# Preserve layout (for tables)
pdftotext -layout -f 10 -l 10 document.pdf -

# Extract and limit output
pdftotext -f 10 -l 15 document.pdf - | head -50

Metadata

# Get page count
pdfinfo document.pdf | grep Pages

# Full metadata
pdfinfo document.pdf
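
For scripting, the page count is easier to use as a bare number. The sketch below assumes `pdfinfo` prints a line beginning with `Pages:` followed by the count; `page_count` and `last_pages` are hypothetical helper names introduced for illustration.

```shell
# Print only the number of pages in the PDF.
page_count() {
  pdfinfo "$1" | awk '/^Pages:/ { print $2 }'
}

# Extract the last N pages (default 1) as text, e.g. an index or appendix.
last_pages() {
  local file="$1" n="${2:-1}" total first
  total="$(page_count "$file")"
  [ -n "$total" ] || { echo "could not read page count of $file" >&2; return 1; }
  first=$(( total - n + 1 ))
  [ "$first" -ge 1 ] || first=1    # never start before page 1
  pdftotext -f "$first" -l "$total" "$file" -
}
```

Usage: `last_pages document.pdf 3` extracts the final three pages.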

Troubleshooting

Empty output from pdftotext: PDF is image-based (scanned). These tools work with text-based PDFs only.

pdfgrep missing matches: Try case-insensitive (-i). Check if PDF has selectable text.
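
Both symptoms can be checked up front by testing whether the PDF yields any text at all. The following is a minimal sketch; `has_text_layer` is a hypothetical helper name, and sampling only the first three pages is an assumption, so a scanned document with a typed cover page could still pass.

```shell
# Succeed if the first pages of the PDF yield any non-whitespace text.
has_text_layer() {
  local file="$1" pages="${2:-3}"
  pdftotext -f 1 -l "$pages" "$file" - 2>/dev/null | grep -q '[^[:space:]]'
}
```

Usage: `has_text_layer document.pdf || echo "document.pdf looks image-based; consider OCR"`.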

PDF Tools

Recommended for