Finding Content

PDFDancer provides flexible selection methods to find content in your PDF documents. You can select all elements of a type, find elements at specific positions, or search by text content. All selection methods work at both document and page level.


Document-Level vs Page-Level Selection

Every selection method can search the entire document or be limited to a single page:

from pdfdancer import PDFDancer

with PDFDancer.open("document.pdf") as pdf:
    # Document-level: searches ALL pages
    all_paragraphs = pdf.select_paragraphs()

    # Page-level: searches only page 1
    page_paragraphs = pdf.page(1).select_paragraphs()

1. Select All Elements

Get all elements of a specific type from the document or page.

Available Content Types

with PDFDancer.open("document.pdf") as pdf:
    # Text content
    all_paragraphs = pdf.select_paragraphs()
    all_lines = pdf.select_text_lines()  # or select_lines()

    # Visual content
    all_images = pdf.select_images()
    all_paths = pdf.select_paths()  # vector graphics
    all_forms = pdf.select_forms()  # form XObjects (templates, watermarks)

    # Interactive content
    all_fields = pdf.select_form_fields()  # AcroForm fields

    # Page-scoped selections
    page_paragraphs = pdf.page(1).select_paragraphs()
    page_images = pdf.page(1).select_images()

Select All Elements (Convenience Method)

Use select_elements() to get all content types at once:

with PDFDancer.open("document.pdf") as pdf:
    # Get all elements in the entire document
    # Returns paragraphs, text lines, images, paths, forms, and form fields
    all_elements = pdf.select_elements()

    # Get all elements on a specific page
    page_elements = pdf.page(1).select_elements()

    # Process all elements (x() and y() are called as methods, as in the examples below)
    for element in page_elements:
        print(f"Type: {element.object_type}, Position: ({element.position.x()}, {element.position.y()})")
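Because select_elements() mixes every content type in one list, a quick tally by object_type is often the first thing worth doing with the result. The sketch below is a minimal illustration, not part of the SDK: count_by_type is a hypothetical helper, and the stand-in objects and the type names "PARAGRAPH" and "IMAGE" are assumptions used only so the example runs without a PDF.

```python
from collections import Counter
from types import SimpleNamespace


def count_by_type(elements):
    """Count how many elements share each object_type value."""
    return Counter(str(element.object_type) for element in elements)


# Stand-in objects (assumption): real elements would come from
# pdf.select_elements() or pdf.page(1).select_elements().
sample = [
    SimpleNamespace(object_type="PARAGRAPH"),
    SimpleNamespace(object_type="IMAGE"),
    SimpleNamespace(object_type="PARAGRAPH"),
]

counts = count_by_type(sample)
for type_name, total in counts.most_common():
    print(f"{type_name}: {total}")
```

With a real document, passing the list returned by select_elements() in place of sample should give the same kind of summary.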

Working with Selected Elements

All selection methods return lists/arrays of typed objects with methods and properties:

with PDFDancer.open("document.pdf") as pdf:
    # Select and inspect
    paragraphs = pdf.page(1).select_paragraphs()
    for para in paragraphs:
        print(f"Text: {para.text}")
        print(f"Font: {para.font_name} at {para.font_size}pt")
        print(f"Position: ({para.position.x()}, {para.position.y()})")

    # Select and manipulate
    images = pdf.select_images()
    for img in images:
        print(f"Image at ({img.position.x()}, {img.position.y()})")
        # Objects have .delete(), .move_to(), etc.

    # Select and check form fields
    fields = pdf.select_form_fields()
    for field in fields:
        print(f"{field.field_name}: {field.value}")

2. Select by Position

Find elements at specific x, y coordinates on a page. Useful when you know the layout structure.

Selecting at Coordinates

with PDFDancer.open("document.pdf") as pdf:
    # Find paragraphs at specific coordinates
    header = pdf.page(1).select_paragraphs_at(x=72, y=750)

    # Find images at logo position
    logo = pdf.page(1).select_images_at(x=50, y=750)

    # Find form fields at signature area
    signature = pdf.page(2).select_form_fields_at(x=150, y=100)

    # Find paths (vector graphics) at specific position
    lines = pdf.page(1).select_paths_at(x=200, y=400)

    # Find form XObjects at watermark position
    watermark = pdf.page(1).select_forms_at(x=300, y=400)

    # Find text lines at specific coordinates
    date_line = pdf.page(1).select_text_lines_at(x=500, y=750)

Practical Example: Fixed Layout Documents

with PDFDancer.open("invoice.pdf") as pdf:
    # Invoice header is always at (72, 750)
    header = pdf.page(1).select_paragraphs_at(72, 750)
    if header:
        print(f"Invoice: {header[0].text}")

    # Total amount always at (450, 150)
    total = pdf.page(1).select_paragraphs_at(450, 150)
    if total:
        print(f"Total: {total[0].text}")

    # Company logo always at (50, 750)
    logo = pdf.page(1).select_images_at(50, 750)
    if logo:
        print("Found logo at expected position")
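When a fixed layout has many known positions, the repeated select-then-check pattern above can be driven from a table of coordinates. The following is a minimal sketch under stated assumptions: extract_fields and INVOICE_LAYOUT are hypothetical names, and FakePage and FakeParagraph are stand-ins so the example runs without a PDF. The only SDK call it relies on is select_paragraphs_at(x, y) returning a list whose items have a text attribute, as shown in this section.

```python
def extract_fields(page, layout):
    """Return {label: text} for each labelled coordinate that holds a paragraph."""
    found = {}
    for label, (x, y) in layout.items():
        matches = page.select_paragraphs_at(x, y)
        if matches:  # an empty list means nothing sits at that point
            found[label] = matches[0].text
    return found


class FakeParagraph:
    """Stand-in (assumption) for a paragraph object with a text attribute."""

    def __init__(self, text):
        self.text = text


class FakePage:
    """Stand-in (assumption) for pdf.page(n), backed by a coordinate lookup."""

    def __init__(self, contents):
        self._contents = contents

    def select_paragraphs_at(self, x, y):
        text = self._contents.get((x, y))
        return [FakeParagraph(text)] if text is not None else []


# Hypothetical layout table for the invoice described above.
INVOICE_LAYOUT = {"header": (72, 750), "total": (450, 150)}

page = FakePage({(72, 750): "Invoice #1001"})
fields = extract_fields(page, INVOICE_LAYOUT)
print(fields)
```

Labels with no paragraph at their coordinate are simply left out of the result, so a missing key signals that the document deviates from the expected layout.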

Visual Example: Coordinate-Based Selection

[Figure: a PDF page with a crosshair at the queried X,Y coordinate; the elements that intersect that point are highlighted.]
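One way to picture coordinate-based selection is as a hit test of the query point against each element's bounding box. The sketch below is an illustration only: this page does not state how PDFDancer performs the test internally, and Box, elements_at, and the sample rectangles are hypothetical names and values chosen for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Box:
    """Hypothetical axis-aligned bounding box, in PDF points."""

    x: float       # left edge
    y: float       # bottom edge
    width: float
    height: float

    def contains(self, px: float, py: float) -> bool:
        """True when the point lies inside the box or on its border."""
        return (self.x <= px <= self.x + self.width
                and self.y <= py <= self.y + self.height)


def elements_at(boxes, px, py):
    """Return the names of all boxes that contain the point (px, py)."""
    return [name for name, box in boxes.items() if box.contains(px, py)]


# Made-up sample boxes, not taken from any real document.
boxes = {
    "header": Box(72, 740, 300, 20),
    "logo": Box(40, 730, 40, 40),
    "footer": Box(72, 30, 300, 12),
}

hits = elements_at(boxes, 75, 750)
print(hits)
```

In this model, overlapping elements are all returned for the same point, which is consistent with the position-based methods returning lists rather than single objects.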


3. Select by Content

Search for text elements by their content using text matching or regular expressions.

Select by Text Prefix

Find paragraphs or text lines that start with specific text. You can search across the entire document or limit the search to a specific page.

with PDFDancer.open("invoice.pdf") as pdf:
    # Document-level search (searches all pages)
    invoices = pdf.select_paragraphs_starting_with("Invoice #")
    totals = pdf.select_paragraphs_starting_with("Total:")
    disclaimers = pdf.select_paragraphs_starting_with("Note:")

    # Find text lines by prefix across all pages
    date_lines = pdf.select_text_lines_starting_with("Date:")
    amount_lines = pdf.select_text_lines_starting_with("Amount:")

    # Page-scoped prefix search (only searches specific page)
    page_headers = pdf.page(1).select_paragraphs_starting_with("Executive Summary")

Select by Pattern (Regex)

Use regular expressions to find text matching complex patterns. Pattern matching works at both document and page levels.

Java Regex Syntax

All SDKs use Java-compatible regular expression syntax as defined in the Java Pattern class. Common patterns work the same across Python, TypeScript, and Java.

from pdfdancer import PDFDancer

with PDFDancer.open("document.pdf") as pdf:
    # Patterns are passed as plain (raw) strings; Python's re module is not needed

    # Document-level pattern matching (searches all pages)
    # Find dates in format YYYY-MM-DD
    dates = pdf.select_paragraphs_matching(r"\d{4}-\d{2}-\d{2}")

    # Find email addresses
    emails = pdf.select_paragraphs_matching(r"[\w\.-]+@[\w\.-]+\.\w+")

    # Find dollar amounts
    prices = pdf.select_paragraphs_matching(r"\$\d+\.\d{2}")

    # Find phone numbers
    phones = pdf.select_text_lines_matching(r"\(\d{3}\) \d{3}-\d{4}")

    # Find ZIP codes
    zips = pdf.select_text_lines_matching(r"\d{5}(-\d{4})?")

    # Find invoice numbers (custom pattern)
    invoice_nums = pdf.select_paragraphs_matching(r"INV-\d{6}")

    # Page-scoped pattern matching (only searches specific page)
    page_dates = pdf.page(1).select_paragraphs_matching(r"\d{4}-\d{2}-\d{2}")
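Before running a pattern against a PDF, it can help to try it on a plain string. The sketch below does that with Python's re module, as a rough local check only: re is not the Java Pattern engine, although the simple constructs used here (\d, \w, \b, quantifiers, groups) behave the same in both. Whether the SDK matches a pattern anywhere in the text or against the whole text is not stated on this page, so the example assumes "anywhere" semantics. PATTERNS, find_all, and the sample sentence are made up for illustration.

```python
import re

# The same pattern strings used in the examples above.
PATTERNS = {
    "date": r"\d{4}-\d{2}-\d{2}",
    "price": r"\$\d+\.\d{2}",
    "phone": r"\(\d{3}\) \d{3}-\d{4}",
    "zip": r"\d{5}(-\d{4})?",
    "invoice": r"INV-\d{6}",
}


def find_all(pattern, text):
    """Return every substring of text that the pattern matches."""
    return [match.group(0) for match in re.finditer(pattern, text)]


sample = "Invoice INV-004217 dated 2024-03-15, total $1299.50, call (555) 123-4567."

for name, pattern in PATTERNS.items():
    print(name, find_all(pattern, sample))

# An unanchored ZIP pattern also matches digits inside the invoice number.
# Word boundaries (\b, supported by both Java and Python) avoid that.
loose_zip = find_all(PATTERNS["zip"], sample)
strict_zip = find_all(r"\b\d{5}(-\d{4})?\b", sample)
print(loose_zip, strict_zip)
```

The ZIP case shows why short numeric patterns usually benefit from boundaries or surrounding context before they are used for selection.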

Select by Field Name

For form fields, you can select by the field's name:

with PDFDancer.open("form.pdf") as pdf:
    # Find specific form fields by name
    signature_fields = pdf.select_form_fields_by_name("signature")
    name_fields = pdf.select_form_fields_by_name("applicant_name")
    date_fields = pdf.select_form_fields_by_name("application_date")

    # Check and modify field values
    if signature_fields:
        sig = signature_fields[0]
        print(f"Signature value: {sig.value}")
        sig.edit().value("John Doe").apply()
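When several fields need checking at once, it can be simpler to collect all fields into a dictionary and look for gaps. This is a minimal sketch, assuming only that each field object exposes field_name and value as in the examples on this page; field_values, missing_fields, and the stand-in objects are hypothetical and exist so the example runs without a PDF.

```python
from types import SimpleNamespace


def field_values(fields):
    """Map each field name to its current value."""
    return {field.field_name: field.value for field in fields}


def missing_fields(fields, required):
    """Return the required names that are absent or have an empty value."""
    values = field_values(fields)
    return [name for name in required if not values.get(name)]


# Stand-in objects (assumption): real ones would come from
# pdf.select_form_fields().
fields = [
    SimpleNamespace(field_name="applicant_name", value="Jane Roe"),
    SimpleNamespace(field_name="application_date", value=""),
]

required = ["applicant_name", "application_date", "signature"]
unfilled = missing_fields(fields, required)
print(unfilled)
```

If a form contains several fields with the same name, the dictionary keeps only the last one, so select_form_fields_by_name() remains the better choice in that situation.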

Combining Selection Methods

You can combine several selection methods in one session to build more complex workflows:

from pdfdancer import PDFDancer

with PDFDancer.open("invoice.pdf") as pdf:
    # Find all invoices on first page
    invoices = pdf.page(1).select_paragraphs_starting_with("Invoice #")

    # Find all dollar amounts
    prices = pdf.select_paragraphs_matching(r"\$\d+\.\d{2}")

    # Find signature field and check if signed
    sig_fields = pdf.select_form_fields_by_name("signature")
    if sig_fields and sig_fields[0].value:
        print("Document is signed")

    # Delete all images on page 2
    for img in pdf.page(2).select_images():
        img.delete()

    pdf.save("processed.pdf")

Convenience Methods: Select First Match

All multi-result selection methods (select_paragraphs_*, select_text_lines_*, etc.) return arrays/lists. For convenience, singular versions are available that return just the first match (or null/None if no matches found).

When to Use Singular Methods

Use singular selection methods when you:

  • Know there's only one match expected
  • Only need the first occurrence
  • Want to avoid index access (results[0])

from pdfdancer import PDFDancer

with PDFDancer.open("invoice.pdf") as pdf:
    # Plural: returns list (may be empty)
    headers = pdf.select_paragraphs_starting_with("Invoice #")
    if headers:
        first_header = headers[0]

    # Singular: returns first match or None
    first_header = pdf.select_paragraph_starting_with("Invoice #")
    if first_header:
        print(f"Invoice number: {first_header.text}")

Available Singular Methods

| Plural Method | Singular Method | Return Type |
|---|---|---|
| select_paragraphs() | select_paragraph() | First paragraph or null/None/Optional.empty() |
| select_paragraphs_starting_with(text) | select_paragraph_starting_with(text) | First match or null/None/Optional.empty() |
| select_paragraphs_matching(pattern) | select_paragraph_matching(pattern) | First match or null/None/Optional.empty() |
| select_paragraphs_at(x, y) | select_paragraph_at(x, y) | First match or null/None/Optional.empty() |
| select_text_lines() | select_text_line() | First text line or null/None/Optional.empty() |
| select_text_lines_starting_with(text) | select_text_line_starting_with(text) | First match or null/None/Optional.empty() |
| select_text_lines_matching(pattern) | select_text_line_matching(pattern) | First match or null/None/Optional.empty() |
| select_text_lines_at(x, y) | select_text_line_at(x, y) | First match or null/None/Optional.empty() |

Java Optional Pattern

Java returns Optional<T> for singular methods, following Java best practices:

pdf.selectTextLineMatching("\\d{3}-\\d{4}")
    .ifPresent(line -> System.out.println(line.getText()));

Python/TypeScript Return null/None

Python returns None and TypeScript returns null when no matches are found. Always check before using:

line = pdf.select_text_line_matching(r"\d{3}-\d{4}")
if line:  # Check for None
    print(line.text)
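The relationship between the plural and singular forms can be pictured with two small helpers. This is an illustrative sketch, not the SDK's implementation: first_or_none mirrors the "first match or None" behaviour described above, and exactly_one is a hypothetical stricter variant for cases where more than one match would indicate a problem.

```python
def first_or_none(results):
    """Return the first item of a list, or None when the list is empty."""
    return results[0] if results else None


def exactly_one(results, description):
    """Return the only item, raising if there are zero or several."""
    if len(results) != 1:
        raise LookupError(f"expected one {description}, found {len(results)}")
    return results[0]


# Plain strings stand in for the objects a plural selection would return.
print(first_or_none([]))
print(first_or_none(["Invoice #1001", "Invoice #1002"]))

try:
    exactly_one(["Invoice #1001", "Invoice #1002"], "invoice header")
except LookupError as error:
    print(f"Lookup failed: {error}")
```

The singular methods cover the first helper directly; the second is worth considering when silently taking the first of several matches could hide a layout change.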

Selection Method Summary

By Content Type

| Content Type | All | By Text/Name | By Pattern | At Coordinates |
|---|---|---|---|---|
| Paragraphs | select_paragraphs() | select_paragraphs_starting_with() | select_paragraphs_matching() | select_paragraphs_at() |
| Text Lines | select_text_lines() | select_text_lines_starting_with() | select_text_lines_matching() | select_text_lines_at() |
| Images | select_images() | - | - | select_images_at() |
| Form Fields | select_form_fields() | select_form_fields_by_name() | - | select_form_fields_at() |
| Form XObjects | select_forms() | - | - | select_forms_at() |
| Paths | select_paths() | - | - | select_paths_at() |

Scope

All selection methods work at two levels:

  • Document-level: pdf.select_*() - searches across all pages
  • Page-level: pdf.page(pageNumber).select_*() - searches only on specified page

Next Steps