
Extracting Text from Documents

PDFDancer provides multiple methods for extracting text from PDF documents. Whether you need to extract all text from a document, specific paragraphs, or text from particular areas, this guide covers all available options.


Overview: Text Extraction Approaches

PDFDancer offers several approaches to text extraction:

  1. Document-level extraction: Extract all text from the entire document
  2. Page-level extraction: Extract text from specific pages
  3. Element-based extraction: Extract text from individual paragraphs or text lines
  4. Position-based extraction: Extract text from specific coordinates or regions
  5. Content-based extraction: Extract text matching specific patterns or criteria

Each approach suits different use cases, from dumping a document's full text to parsing structured data.


Extracting All Text

From Entire Document

Extract all text content from every page in the document:

from pdfdancer import PDFDancer

with PDFDancer.open("document.pdf") as pdf:
    # Get all paragraphs in the document
    all_paragraphs = pdf.select_paragraphs()

    # Extract text from each paragraph
    full_text = ""
    for paragraph in all_paragraphs:
        full_text += paragraph.text + "\n\n"

    print(full_text)

    # Or using a list comprehension
    text_list = [para.text for para in all_paragraphs]
    combined_text = "\n\n".join(text_list)
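
Extracted paragraph text often carries stray line breaks or blank entries. The sketch below shows one way to tidy the strings before joining them. It works on plain Python strings only; `join_paragraph_texts` is a hypothetical helper for illustration, not part of PDFDancer.

```python
import re

def join_paragraph_texts(texts, separator="\n\n"):
    """Join paragraph strings, dropping blanks and collapsing runs of whitespace."""
    cleaned = []
    for text in texts:
        # Collapse internal whitespace (including stray newlines) to single spaces
        normalized = re.sub(r"\s+", " ", text).strip()
        if normalized:
            cleaned.append(normalized)
    return separator.join(cleaned)

# Stand-in for [para.text for para in all_paragraphs]
sample = ["First  paragraph\nwrapped.", "   ", "Second paragraph."]
print(join_paragraph_texts(sample))
```

Whether collapsing line breaks is desirable depends on the document; skip the normalization step if line breaks inside a paragraph are meaningful.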

From Specific Page

Extract all text from a single page:

with PDFDancer.open("document.pdf") as pdf:
    # Extract text from the first page
    page_paragraphs = pdf.page(1).select_paragraphs()
    page_text = "\n\n".join([para.text for para in page_paragraphs])
    print(f"Page 1 text:\n{page_text}")

    # Extract from multiple specific pages
    # (page numbers assumed 1-based, matching pdf.page(1) above)
    pages_to_extract = [1, 3, 5]
    for page_num in pages_to_extract:
        paragraphs = pdf.page(page_num).select_paragraphs()
        text = "\n\n".join([p.text for p in paragraphs])
        print(f"\n--- Page {page_num} ---\n{text}")

From Page Range

Extract text from a range of consecutive pages:

from pdfdancer import PDFDancer

def extract_page_range(pdf_path: str, start_page: int, end_page: int) -> str:
    """Extract text from a range of pages (inclusive)."""
    with PDFDancer.open(pdf_path) as pdf:
        extracted_text = []

        for page_num in range(start_page, end_page + 1):
            paragraphs = pdf.page(page_num).select_paragraphs()
            page_text = "\n\n".join([p.text for p in paragraphs])
            extracted_text.append(f"--- Page {page_num} ---\n{page_text}")

        return "\n\n\n".join(extracted_text)

# Extract pages 2-5
text = extract_page_range("document.pdf", start_page=2, end_page=5)
print(text)
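
When page selections come from user input, a small parser can turn a spec such as "2-5,8" into the page numbers to pass to `pdf.page(...)`. This is a minimal sketch: `parse_page_ranges` is a hypothetical helper, and it assumes the same inclusive page numbering as `extract_page_range` above.

```python
def parse_page_ranges(spec: str) -> list[int]:
    """Expand a spec such as "2-5,8" into [2, 3, 4, 5, 8] (ranges are inclusive)."""
    pages = []
    for part in spec.split(","):
        part = part.strip()
        if not part:
            continue
        if "-" in part:
            start_text, end_text = part.split("-", 1)
            start, end = int(start_text), int(end_text)
            if start > end:
                raise ValueError(f"Invalid range: {part!r}")
            pages.extend(range(start, end + 1))
        else:
            pages.append(int(part))
    if any(page < 1 for page in pages):
        raise ValueError("Page numbers must be 1 or greater")
    return pages

print(parse_page_ranges("2-5,8"))
```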

Line-by-Line Text Extraction

For finer control, extract text line by line instead of by paragraph:

with PDFDancer.open("document.pdf") as pdf:
    # Extract all text lines from the document
    all_lines = pdf.select_text_lines()

    # Print each line with its position
    for line in all_lines:
        print(f"[{line.position.x():.1f}, {line.position.y():.1f}] {line.text}")

    # Extract lines from a specific page
    page_lines = pdf.page(1).select_text_lines()
    page_text = "\n".join([line.text for line in page_lines])

    print(f"\nPage text (line by line):\n{page_text}")
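
If the lines need to be in a particular reading order, their positions can be used to sort them. The sketch below sorts plain `(x, y, text)` tuples top to bottom, then left to right. It assumes the usual PDF convention that y grows upward from the bottom of the page; `sort_reading_order` is a hypothetical helper, and the tuples stand in for values read from `line.position.x()`, `line.position.y()`, and `line.text`.

```python
def sort_reading_order(lines):
    """Sort (x, y, text) tuples by descending y (top first), then ascending x."""
    return sorted(lines, key=lambda item: (-item[1], item[0]))

lines = [
    (72.0, 650.0, "second line"),
    (300.0, 700.0, "right of first"),
    (72.0, 700.0, "first line"),
]
ordered = [text for _, _, text in sort_reading_order(lines)]
print(ordered)
```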

Position-Based Text Extraction

Extract Text at Specific Coordinates

Extract text located at known positions on a page:

with PDFDancer.open("invoice.pdf") as pdf:
    # Extract invoice number at known position
    invoice_num = pdf.page(1).select_paragraphs_at(x=450, y=750)
    if invoice_num:
        print(f"Invoice: {invoice_num[0].text}")

    # Extract total amount at bottom right
    total = pdf.page(1).select_paragraphs_at(x=450, y=100)
    if total:
        print(f"Total: {total[0].text}")

    # Extract multiple fields at different positions
    fields = {
        "date": (100, 750),
        "customer": (100, 700),
        "amount": (450, 100)
    }

    extracted_data = {}
    for field_name, (x, y) in fields.items():
        elements = pdf.page(1).select_paragraphs_at(x=x, y=y)
        if elements:
            extracted_data[field_name] = elements[0].text

    print(f"Extracted data: {extracted_data}")

Extract Text from Regions

Extract all text within a specific area or bounding box:

with PDFDancer.open("document.pdf") as pdf:
    # Get all paragraphs on the page
    all_paragraphs = pdf.page(1).select_paragraphs()

    # Define region of interest (x, y, width, height)
    region_x, region_y = 100, 400
    region_width, region_height = 300, 200

    # Extract paragraphs within the region
    region_paragraphs = []
    for para in all_paragraphs:
        x = para.position.x()
        y = para.position.y()

        # Check if the paragraph's position falls within the region
        if (region_x <= x <= region_x + region_width and
                region_y <= y <= region_y + region_height):
            region_paragraphs.append(para)

    # Extract text from the region
    region_text = "\n\n".join([p.text for p in region_paragraphs])
    print(f"Text in region:\n{region_text}")
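
The bounds test above can be wrapped in a small reusable type. This is a sketch only: `Region` is a hypothetical dataclass, not a PDFDancer class, and like the example above it tests a single position point rather than the element's full extent.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Region:
    x: float
    y: float
    width: float
    height: float

    def contains(self, px: float, py: float) -> bool:
        """Return True if the point lies inside the region (edges inclusive)."""
        return (self.x <= px <= self.x + self.width
                and self.y <= py <= self.y + self.height)

region = Region(x=100, y=400, width=300, height=200)
points = [(120, 450), (50, 450), (400, 600), (401, 600)]
inside = [p for p in points if region.contains(*p)]
print(inside)
```

With a helper like this, the filter in the loop becomes `if region.contains(para.position.x(), para.position.y())`.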

Content-Based Text Extraction

Extract Text by Prefix

Extract text elements that start with specific text:

with PDFDancer.open("document.pdf") as pdf:
    # Extract all invoice numbers
    invoices = pdf.select_paragraphs_starting_with("Invoice #")
    for inv in invoices:
        print(f"Found: {inv.text}")

    # Extract all dates
    dates = pdf.select_paragraphs_starting_with("Date:")

    # Extract specific section headers
    headers = pdf.select_paragraphs_starting_with("Chapter")

    # Extract from specific page
    page_totals = pdf.page(1).select_paragraphs_starting_with("Total:")
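
Once a paragraph has been selected by prefix, the label usually needs to be stripped to get at the value. A minimal sketch on plain strings follows; `value_after_prefix` is a hypothetical helper for illustration.

```python
def value_after_prefix(text: str, prefix: str):
    """Return the text following a leading prefix, or None if the prefix is absent."""
    stripped = text.strip()
    if not stripped.startswith(prefix):
        return None
    # Remove only the leading label, leaving later occurrences untouched
    return stripped[len(prefix):].strip()

print(value_after_prefix("Invoice # 12345", "Invoice #"))
print(value_after_prefix("Date: 2024-03-15", "Date:"))
```

Unlike `str.replace`, this removes the label only at the start of the string, so a value that happens to repeat the label stays intact.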

Extract Text by Pattern (Regex)

Use regular expressions to extract text matching specific patterns:

Java Regex Syntax

All SDKs use Java-compatible regular expression syntax as defined in the Java Pattern class. Common patterns work the same across Python, TypeScript, and Java.

from pdfdancer import PDFDancer

with PDFDancer.open("document.pdf") as pdf:
    # Extract email addresses
    emails = pdf.select_paragraphs_matching(r"[\w\.-]+@[\w\.-]+\.\w+")
    for email in emails:
        print(f"Email: {email.text}")

    # Extract phone numbers
    phones = pdf.select_paragraphs_matching(r"\(\d{3}\) \d{3}-\d{4}")

    # Extract dates (YYYY-MM-DD format)
    dates = pdf.select_paragraphs_matching(r"\d{4}-\d{2}-\d{2}")

    # Extract dollar amounts
    amounts = pdf.select_paragraphs_matching(r"\$[\d,]+\.\d{2}")
    for amount in amounts:
        print(f"Amount: {amount.text}")

    # Extract invoice numbers with custom pattern
    invoice_pattern = r"INV-\d{6}"
    invoices = pdf.select_paragraphs_matching(invoice_pattern)

    # Extract ZIP codes
    zip_codes = pdf.select_text_lines_matching(r"\d{5}(-\d{4})?")
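
The patterns above use only constructs that Java's Pattern class and Python's `re` module share (character classes, quantifiers, groups), so they can be sanity-checked locally before being sent to PDFDancer. The sketch below does that with `re.search` against sample strings. It is an approximation: Java-only constructs such as `\p{Alpha}` will not compile under `re`, and whether PDFDancer searches within a paragraph or matches it as a whole is not stated here. `check_patterns` and the sample strings are hypothetical.

```python
import re

PATTERNS = {
    "email": r"[\w\.-]+@[\w\.-]+\.\w+",
    "phone": r"\(\d{3}\) \d{3}-\d{4}",
    "iso_date": r"\d{4}-\d{2}-\d{2}",
    "amount": r"\$[\d,]+\.\d{2}",
    "invoice": r"INV-\d{6}",
    "zip": r"\d{5}(-\d{4})?",
}

SAMPLES = {
    "email": "Contact: jane.doe@example.com",
    "phone": "Call (555) 123-4567",
    "iso_date": "Issued 2024-03-15",
    "amount": "Total: $1,234.56",
    "invoice": "Ref INV-004217",
    "zip": "Springfield, IL 62704-1234",
}

def check_patterns(patterns, samples):
    """Return {name: matched text or None} using re.search on each sample."""
    results = {}
    for name, pattern in patterns.items():
        match = re.search(pattern, samples[name])
        results[name] = match.group(0) if match else None
    return results

for name, matched in check_patterns(PATTERNS, SAMPLES).items():
    print(f"{name}: {matched}")
```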

Advanced Text Extraction Patterns

Extract Structured Data from Tables

Extract tabular data by analyzing text positions:

from pdfdancer import PDFDancer
from collections import defaultdict

def extract_table_data(pdf_path: str, page_num: int,
                       table_x: float, table_y: float,
                       table_width: float, table_height: float) -> list:
    """
    Extract table-like data from a specific region.
    Returns list of rows, where each row is a list of cell texts.
    """
    with PDFDancer.open(pdf_path) as pdf:
        # Get all text lines on the page
        all_lines = pdf.page(page_num).select_text_lines()

        # Filter lines within table bounds
        table_lines = []
        for line in all_lines:
            x = line.position.x()
            y = line.position.y()
            if (table_x <= x <= table_x + table_width and
                    table_y <= y <= table_y + table_height):
                table_lines.append(line)

        # Group lines by Y coordinate (rows)
        rows_dict = defaultdict(list)
        for line in table_lines:
            # Round Y to group nearby lines
            y_rounded = round(line.position.y())
            rows_dict[y_rounded].append(line)

        # Sort rows by Y coordinate (top to bottom; larger Y is higher on the page)
        rows = []
        for y in sorted(rows_dict.keys(), reverse=True):
            # Sort cells in row by X coordinate (left to right)
            row_cells = sorted(rows_dict[y], key=lambda l: l.position.x())
            row_texts = [cell.text for cell in row_cells]
            rows.append(row_texts)

        return rows

# Extract table data from the first page
# (page numbers assumed 1-based, matching pdf.page(1) in the earlier examples)
table_data = extract_table_data(
    "invoice.pdf",
    page_num=1,
    table_x=50,
    table_y=300,
    table_width=500,
    table_height=200
)

# Print extracted table
for row in table_data:
    print(" | ".join(row))
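
Grouping rows with `round()` can split a row when two cells sit on either side of a rounding boundary, for example y = 500.6 and y = 500.4. One alternative, sketched below on plain `(x, y, text)` tuples, is to group cells whose y values lie within a tolerance of the row's first cell. `group_rows` and its `y_tolerance` parameter are hypothetical, and the right tolerance depends on the document's line spacing.

```python
def group_rows(cells, y_tolerance=2.0):
    """Group (x, y, text) cells into rows, top to bottom, cells left to right."""
    rows = []  # list of (row_y, [cells]); row_y is the y of the row's first cell
    for cell in sorted(cells, key=lambda c: -c[1]):  # highest y first
        if rows and abs(rows[-1][0] - cell[1]) <= y_tolerance:
            rows[-1][1].append(cell)
        else:
            rows.append((cell[1], [cell]))
    # Sort each row's cells by x and keep only the text
    return [[text for _, _, text in sorted(row_cells)] for _, row_cells in rows]

cells = [
    (50, 500.6, "Item"), (200, 500.4, "Qty"),
    (50, 480.0, "Widget"), (200, 480.3, "2"),
]
for row in group_rows(cells):
    print(" | ".join(row))
```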

Extract with Metadata

Extract text along with formatting and position metadata:

from dataclasses import dataclass
from typing import List

from pdfdancer import PDFDancer

@dataclass
class TextElement:
    text: str
    page: int
    x: float
    y: float
    font_name: str
    font_size: float
    color: tuple

def extract_with_metadata(pdf_path: str) -> List[TextElement]:
    """Extract text with complete metadata."""
    elements = []

    with PDFDancer.open(pdf_path) as pdf:
        pages = pdf.pages()

        # page_num is a 0-based list index here; add 1 when displaying it
        for page_num, page in enumerate(pages):
            paragraphs = page.select_paragraphs()

            for para in paragraphs:
                element = TextElement(
                    text=para.text,
                    page=page_num,
                    x=para.position.x(),
                    y=para.position.y(),
                    font_name=para.font_name or "Unknown",
                    font_size=para.font_size or 0,
                    color=(
                        para.color.r if para.color else 0,
                        para.color.g if para.color else 0,
                        para.color.b if para.color else 0
                    )
                )
                elements.append(element)

    return elements

# Extract with metadata
elements = extract_with_metadata("document.pdf")

# Print elements with metadata
for elem in elements:
    print(f"Page {elem.page + 1}: [{elem.x:.1f}, {elem.y:.1f}] "
          f"{elem.font_name} {elem.font_size}pt - {elem.text[:50]}")
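
One use for the font metadata is spotting likely headings. The sketch below treats the most common font size as body text and flags anything noticeably larger. It is a heuristic under stated assumptions, not a PDFDancer feature: `find_headings` and `min_ratio` are hypothetical, and the `(text, font_size)` pairs stand in for fields of the `TextElement` records above.

```python
from collections import Counter

def find_headings(elements, min_ratio=1.2):
    """Return texts whose font size is at least min_ratio times the most common size."""
    sizes = [size for _, size in elements if size]
    if not sizes:
        return []
    body_size = Counter(sizes).most_common(1)[0][0]
    return [text for text, size in elements if size and size >= body_size * min_ratio]

elements = [
    ("Introduction", 18.0),
    ("Body text one.", 11.0),
    ("Body text two.", 11.0),
    ("Details", 14.0),
    ("Footnote", 8.0),
]
print(find_headings(elements))
```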

Use Cases and Examples

Invoice Data Extraction

Complete example of extracting structured data from invoices:

from dataclasses import dataclass, field
from pdfdancer import PDFDancer

@dataclass
class InvoiceData:
    invoice_number: str = ""
    date: str = ""
    customer_name: str = ""
    total_amount: str = ""
    # default_factory gives each instance its own list
    line_items: list = field(default_factory=list)

def extract_invoice_data(pdf_path: str) -> InvoiceData:
    """Extract structured data from an invoice PDF."""
    invoice = InvoiceData()

    with PDFDancer.open(pdf_path) as pdf:
        # Extract invoice number
        inv_nums = pdf.select_paragraphs_starting_with("Invoice #")
        if inv_nums:
            invoice.invoice_number = inv_nums[0].text

        # Extract date
        dates = pdf.select_paragraphs_starting_with("Date:")
        if dates:
            invoice.date = dates[0].text.replace("Date:", "").strip()

        # Extract customer name
        customers = pdf.select_paragraphs_starting_with("Bill To:")
        if customers:
            invoice.customer_name = customers[0].text.replace("Bill To:", "").strip()

        # Extract total amount
        totals = pdf.select_paragraphs_starting_with("Total:")
        if totals:
            invoice.total_amount = totals[0].text.replace("Total:", "").strip()

        # Extract line items (lines between "Description" and "Total")
        all_lines = pdf.page(1).select_text_lines()
        in_items_section = False

        for line in all_lines:
            if "Description" in line.text:
                in_items_section = True
                continue
            if "Total" in line.text:
                # Stop at the first line containing "Total"
                break
            if in_items_section and line.text.strip():
                invoice.line_items.append(line.text.strip())

    return invoice

# Extract invoice data
invoice = extract_invoice_data("invoice.pdf")
print(f"Invoice: {invoice.invoice_number}")
print(f"Date: {invoice.date}")
print(f"Customer: {invoice.customer_name}")
print(f"Total: {invoice.total_amount}")
print(f"Line items: {len(invoice.line_items)}")
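
The extracted total is still a string. To compute with it, it can be converted to a `Decimal`. The sketch below assumes US-style formatting (a `$` sign, comma thousands separators, a dot as the decimal point); `parse_amount` is a hypothetical helper, and other locales would need different handling.

```python
from decimal import Decimal, InvalidOperation

def parse_amount(text: str):
    """Convert a string such as "$1,234.56" to Decimal, or return None if it does not parse."""
    cleaned = text.strip().replace("$", "").replace(",", "")
    try:
        return Decimal(cleaned)
    except InvalidOperation:
        return None

print(parse_amount("$1,234.56"))
print(parse_amount("n/a"))
```

`Decimal` avoids the rounding surprises of binary floats, which matters when summing currency values.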

Export to JSON

Export extracted text data to JSON format:

import json
from pdfdancer import PDFDancer

def extract_to_json(pdf_path: str, output_path: str):
    """Extract all text and metadata to JSON."""
    document_data = {
        "pages": []
    }

    with PDFDancer.open(pdf_path) as pdf:
        pages = pdf.pages()

        for page_num, page in enumerate(pages):
            page_data = {
                "page_number": page_num + 1,
                "paragraphs": []
            }

            paragraphs = page.select_paragraphs()
            for para in paragraphs:
                para_data = {
                    "text": para.text,
                    "position": {
                        "x": para.position.x(),
                        "y": para.position.y()
                    },
                    "font": {
                        "name": para.font_name,
                        "size": para.font_size
                    }
                }
                if para.color:
                    para_data["color"] = {
                        "r": para.color.r,
                        "g": para.color.g,
                        "b": para.color.b
                    }
                page_data["paragraphs"].append(para_data)

            document_data["pages"].append(page_data)

    # Write to JSON file
    with open(output_path, 'w', encoding='utf-8') as f:
        json.dump(document_data, f, indent=2, ensure_ascii=False)

    print(f"Exported to {output_path}")

# Extract to JSON
extract_to_json("document.pdf", "extracted_text.json")
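
Downstream code can read the exported structure back without PDFDancer. The sketch below rebuilds per-page text from data shaped like the export above. `page_texts_from_export` is a hypothetical helper, and the sample document is made up for illustration.

```python
import json

def page_texts_from_export(document_data: dict) -> dict:
    """Map each page_number to its paragraph texts joined by blank lines."""
    return {
        page["page_number"]: "\n\n".join(p["text"] for p in page["paragraphs"])
        for page in document_data["pages"]
    }

# Made-up sample with the same keys that extract_to_json writes
sample_json = json.dumps({
    "pages": [
        {"page_number": 1, "paragraphs": [{"text": "Title"}, {"text": "Intro text."}]},
        {"page_number": 2, "paragraphs": []},
    ]
})

texts = page_texts_from_export(json.loads(sample_json))
print(texts[1])
```

For a real export, replace `json.loads(sample_json)` with `json.load(f)` on the opened output file.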

Performance Tips

Optimize for Large Documents

When working with large PDFs:

  1. Process pages in batches: Don't load all pages at once
  2. Use page-level selection: Limit searches to specific pages when possible
  3. Filter early: Apply position/pattern filters to reduce data
  4. Stream results: Process results as you extract them instead of accumulating

from pdfdancer import PDFDancer

def extract_large_document(pdf_path: str, start_page: int = 1, batch_size: int = 10):
    """Yield (page_num, paragraph_text) pairs from a large document, batch by batch."""
    with PDFDancer.open(pdf_path) as pdf:
        total_pages = len(pdf.pages())

        # Page numbers assumed 1-based, matching pdf.page(1) in the earlier examples
        for batch_start in range(start_page, total_pages + 1, batch_size):
            batch_end = min(batch_start + batch_size - 1, total_pages)

            print(f"Processing pages {batch_start} to {batch_end}...")

            for page_num in range(batch_start, batch_end + 1):
                # Process one page at a time
                paragraphs = pdf.page(page_num).select_paragraphs()

                # Yield each paragraph immediately instead of accumulating
                for para in paragraphs:
                    yield (page_num, para.text)

# Process large document
for page_num, text in extract_large_document("large.pdf", batch_size=5):
    print(f"Page {page_num}: {text[:100]}...")
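
To act on the "stream results" tip, the generator's output can be written out record by record rather than collected in a list. The sketch below writes one JSON object per line (JSON Lines) to any text stream. `write_jsonl` and `fake_records` are hypothetical; `fake_records` stands in for a generator such as `extract_large_document(...)`.

```python
import io
import json

def write_jsonl(records, stream):
    """Write (page_num, text) pairs to stream as one JSON object per line; return the count."""
    count = 0
    for page_num, text in records:
        stream.write(json.dumps({"page": page_num, "text": text}, ensure_ascii=False) + "\n")
        count += 1
    return count

def fake_records():
    # Stand-in for a generator that yields (page_num, paragraph_text) pairs
    yield 1, "First paragraph"
    yield 1, "Second paragraph"
    yield 2, "Third paragraph"

buffer = io.StringIO()
written = write_jsonl(fake_records(), buffer)
print(f"Wrote {written} records")
```

In practice the stream would be a file opened with `open(path, "w", encoding="utf-8")`, so memory use stays flat regardless of document size.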

Comparison: Paragraphs vs Text Lines

When to use Paragraphs:

  • Extracting semantic blocks of text (articles, sections, headings)
  • Working with formatted multi-line content
  • When logical text grouping matters

When to use Text Lines:

  • Extracting tabular data
  • Processing forms with field labels
  • When precise line-level positioning is needed
  • Extracting structured data where line breaks are significant

Next Steps