Working with Snapshots

Snapshots provide an efficient way to retrieve all elements from a PDF document or page in a single API call. This is particularly useful for bulk operations, filtering elements by type, and inspecting document structure.


What are Snapshots?

A snapshot is a complete view of a PDF document or page that includes:

  • All elements (paragraphs, text lines, images, paths, form fields)
  • Page metadata (size, orientation)
  • Font information (for document snapshots)

Snapshots are cached internally for performance, making repeated selections much faster.


Document Snapshots

Get a complete snapshot of the entire PDF document:

from pdfdancer import PDFDancer

with PDFDancer.open("document.pdf") as pdf:
    # Get snapshot of entire document
    snapshot = pdf.get_document_snapshot()

    # Access document information
    print(f"Total pages: {snapshot.page_count}")
    print(f"Fonts used: {len(snapshot.fonts)}")

    # Iterate through all pages
    for page_snapshot in snapshot.pages:
        page_number = page_snapshot.page_ref.position.page_number
        element_count = len(page_snapshot.elements)
        print(f"Page {page_number}: {element_count} elements")

    # Get all elements across all pages
    all_elements = []
    for page_snap in snapshot.pages:
        all_elements.extend(page_snap.elements)

    print(f"Total elements in document: {len(all_elements)}")

Page Snapshots

Get a snapshot of a specific page:

from pdfdancer import PDFDancer

with PDFDancer.open("document.pdf") as pdf:
    # Get snapshot of page 1
    page_snapshot = pdf.get_page_snapshot(1)

    # Access page information
    page_ref = page_snapshot.page_ref
    print(f"Page size: {page_ref.page_size}")
    print(f"Orientation: {page_ref.orientation}")

    # Get all elements on this page
    elements = page_snapshot.elements
    print(f"Elements on page: {len(elements)}")

    # Iterate through elements
    for element in elements:
        print(f"  {element.object_type}: {element.internal_id}")

Using Page Object

You can also get a snapshot from a page object:

with PDFDancer.open("document.pdf") as pdf:
    page = pdf.page(1)

    # Get snapshot from page object
    snapshot = page.get_snapshot()

    print(f"Elements: {len(snapshot.elements)}")

Filtering by Object Type

Snapshots can be filtered to only include specific types of elements:

from pdfdancer import PDFDancer, ObjectType

with PDFDancer.open("document.pdf") as pdf:
    # Get only paragraphs and images
    snapshot = pdf.get_document_snapshot(types="PARAGRAPH,IMAGE")

    # Count elements by type
    paragraph_count = 0
    image_count = 0

    for page_snap in snapshot.pages:
        for element in page_snap.elements:
            if element.object_type == "PARAGRAPH":
                paragraph_count += 1
            elif element.object_type == "IMAGE":
                image_count += 1

    print(f"Paragraphs: {paragraph_count}")
    print(f"Images: {image_count}")

Available Object Types:

  • PARAGRAPH - Text paragraphs
  • TEXT_LINE - Individual text lines
  • IMAGE - Images
  • PATH - Vector graphics (lines, shapes)
  • FORM_X_OBJECT - Form XObjects (templates, watermarks)
  • FORM_FIELD - AcroForm fields
  • PAGE - Page references
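
The filtering example passes these names to `types` as a single comma-separated string. A minimal sketch of a helper that builds that string and rejects misspelled names is shown below. `build_types_filter` and `KNOWN_OBJECT_TYPES` are hypothetical names introduced for illustration and are not part of pdfdancer; the set of names is simply copied from the list above.

```python
# Names copied from the list above; not imported from pdfdancer.
KNOWN_OBJECT_TYPES = {
    "PARAGRAPH", "TEXT_LINE", "IMAGE", "PATH",
    "FORM_X_OBJECT", "FORM_FIELD", "PAGE",
}


def build_types_filter(*names: str) -> str:
    """Hypothetical helper: join type names into a comma-separated filter string."""
    cleaned = [name.strip().upper() for name in names]
    unknown = [name for name in cleaned if name not in KNOWN_OBJECT_TYPES]
    if unknown:
        raise ValueError(f"Unknown object types: {', '.join(unknown)}")
    # dict.fromkeys drops duplicates while keeping the caller's order.
    return ",".join(dict.fromkeys(cleaned))


types_filter = build_types_filter("paragraph", "IMAGE")
print(types_filter)
```

The resulting string could then be passed as the `types` argument of `get_document_snapshot`, which catches a typo such as "PARAGRAF" locally before a request is made.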

Use Cases

1. Bulk Text Extraction

Extract all text from a document efficiently:

from pdfdancer import PDFDancer

with PDFDancer.open("document.pdf") as pdf:
    # Get snapshot with only text elements
    snapshot = pdf.get_document_snapshot(types="PARAGRAPH,TEXT_LINE")

    all_text = []
    for page_snap in snapshot.pages:
        for element in page_snap.elements:
            if hasattr(element, 'text') and element.text:
                all_text.append(element.text)

    full_text = "\n".join(all_text)
    print(full_text)
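
If the text is needed per page rather than as one string, the same loop can collect it into a dictionary. The sketch below runs on stand-in objects built with `SimpleNamespace` so that it works without a PDF; `text_by_page` is a hypothetical helper, and numbering pages from 1 by their position in the list is an assumption, not something read from the snapshot.

```python
from types import SimpleNamespace


def text_by_page(pages):
    """Map each page's 1-based position to the joined text of its elements."""
    result = {}
    for number, page in enumerate(pages, start=1):
        # Skip elements that have no text attribute or whose text is empty.
        texts = [e.text for e in page.elements if getattr(e, "text", None)]
        result[number] = "\n".join(texts)
    return result


# Stand-in pages for illustration; real ones would come from snapshot.pages.
sample_pages = [
    SimpleNamespace(elements=[
        SimpleNamespace(text="Hello"),
        SimpleNamespace(text=None),
        SimpleNamespace(text="World"),
    ]),
    SimpleNamespace(elements=[]),
]
page_texts = text_by_page(sample_pages)
print(page_texts)
```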

2. Document Analysis

Analyze document structure and content distribution:

from pdfdancer import PDFDancer

with PDFDancer.open("document.pdf") as pdf:
    snapshot = pdf.get_document_snapshot()

    # Analyze each page (numbered from 1, matching get_page_snapshot(1) above)
    for i, page_snap in enumerate(snapshot.pages, start=1):
        elements_by_type = {}

        for element in page_snap.elements:
            obj_type = element.object_type
            elements_by_type[obj_type] = elements_by_type.get(obj_type, 0) + 1

        print(f"\nPage {i}:")
        for obj_type, count in elements_by_type.items():
            print(f"  {obj_type}: {count}")
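
The per-page tally can also be written with `collections.Counter` from the standard library. The sketch below uses stand-in elements built with `SimpleNamespace` so that it runs without a PDF; it assumes only that each element exposes a hashable `object_type`, as in the example above. `count_by_type` is a hypothetical helper name.

```python
from collections import Counter
from types import SimpleNamespace


def count_by_type(elements):
    """Tally elements by their object_type attribute."""
    return Counter(element.object_type for element in elements)


# Stand-in elements for illustration; real ones would come from page_snap.elements.
sample_elements = [
    SimpleNamespace(object_type="PARAGRAPH"),
    SimpleNamespace(object_type="IMAGE"),
    SimpleNamespace(object_type="PARAGRAPH"),
]
counts = count_by_type(sample_elements)

# most_common() lists the most frequent types first.
for obj_type, count in counts.most_common():
    print(f"  {obj_type}: {count}")
```

A `Counter` returns 0 for types that never occurred, so lookups such as `counts["PATH"]` need no guard.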

Performance Benefits

Snapshots provide significant performance improvements:

  1. Single API Call: Get all elements in one request instead of multiple selection calls
  2. Internal Caching: Snapshots are cached automatically, making subsequent selections faster
  3. Efficient Filtering: Filter by object type on the server side
  4. Bulk Operations: Process multiple elements without repeated API calls
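
To make the first two points concrete, here is a toy model of request counting with a cache. It is not pdfdancer's actual implementation: `FakeClient` and `fetch_snapshot` are hypothetical stand-ins, and keying the cache by page number and type filter is an assumption made for the illustration.

```python
class FakeClient:
    """Hypothetical stand-in for a remote PDF service; counts simulated requests."""

    def __init__(self):
        self.request_count = 0
        self._cache = {}

    def fetch_snapshot(self, page_number, types=None):
        key = (page_number, types)
        if key not in self._cache:
            # Only a cache miss costs a (simulated) round trip.
            self.request_count += 1
            self._cache[key] = [f"element-{page_number}-{i}" for i in range(3)]
        return self._cache[key]


client = FakeClient()
first = client.fetch_snapshot(1)
second = client.fetch_snapshot(1)             # served from the cache
filtered = client.fetch_snapshot(1, "IMAGE")  # different filter, new request
print(client.request_count)  # prints 2
```

Three lookups cost two simulated requests here; the repeated call for the same page and filter is answered from memory.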

When to Use Snapshots

Use snapshots when you need to:

  • Inspect or analyze the entire document structure
  • Perform bulk operations on multiple elements
  • Filter elements by type across multiple pages
  • Extract all content of a specific type

For simple, targeted selections (e.g., finding one paragraph), use the regular select_* methods instead.


Next Steps