Core Concepts

This guide explains the fundamental concepts you need to understand when working with PDFDancer, including both PDF-specific concepts and PDFDancer's content model.


PDF Fundamentals

PDF Coordinate System

PDF uses a Cartesian coordinate system with the origin at the bottom-left corner of the page:

  • X-axis: Increases from left (0) to right
  • Y-axis: Increases from bottom (0) to top
  • Units: PDF points (1 point = 1/72 inch)

[Figure: PDF coordinate system with the origin at the bottom-left, X increasing to the right and Y increasing upward]

Common Page Sizes:

  • Letter (US): 612 × 792 points (8.5" × 11")
  • A4: 595 × 842 points (210mm × 297mm)
  • Legal: 612 × 1008 points (8.5" × 14")
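Many image editors and UI toolkits measure Y from the top of the page, so coordinates taken from them usually need to be flipped before use in a PDF. The following is a minimal sketch of that conversion; the helper name top_left_to_pdf is hypothetical and is not part of PDFDancer.

```python
def top_left_to_pdf(x: float, y_from_top: float, page_height: float) -> tuple[float, float]:
    """Convert a point measured from the top-left corner to PDF coordinates.

    X is unchanged; Y is mirrored around the page height because the PDF
    origin sits at the bottom-left corner.
    """
    return (x, page_height - y_from_top)


LETTER_HEIGHT = 792  # US Letter height in points

# A point 72 points (1 inch) from the left edge and 72 points below the top edge
print(top_left_to_pdf(72, 72, LETTER_HEIGHT))  # (72, 720)
```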

PDF Points

All measurements in PDF use points as the base unit:

  • 1 point = 1/72 inch
  • 72 points = 1 inch
  • 1 inch margin = 72 points
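The arithmetic above can be sketched as a few conversion helpers. These functions are illustrative only and are not part of PDFDancer; they rely on nothing beyond 1 inch = 72 points and 1 inch = 25.4 mm.

```python
POINTS_PER_INCH = 72.0
MM_PER_INCH = 25.4


def inches_to_points(inches: float) -> float:
    """Convert inches to PDF points."""
    return inches * POINTS_PER_INCH


def mm_to_points(mm: float) -> float:
    """Convert millimetres to PDF points."""
    return mm / MM_PER_INCH * POINTS_PER_INCH


def points_to_mm(points: float) -> float:
    """Convert PDF points back to millimetres."""
    return points / POINTS_PER_INCH * MM_PER_INCH


print(inches_to_points(8.5))     # 612.0 (US Letter width)
print(round(mm_to_points(210)))  # 595 (A4 width)
print(round(mm_to_points(297)))  # 842 (A4 height)
```

The A4 values are rounded because 210 mm and 297 mm do not map to a whole number of points.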

Page Sizes

PDFDancer provides constants for standard page sizes. All dimensions are in points.

ISO A Series:

from pdfdancer import PageSize

# ISO A Series
PageSize.A4 # 595 × 842 points (210mm × 297mm)
PageSize.A3 # 842 × 1191 points (297mm × 420mm)
PageSize.A5 # 420 × 595 points (148mm × 210mm)

US/North American Sizes:

# US Sizes
PageSize.LETTER # 612 × 792 points (8.5" × 11")
PageSize.LEGAL # 612 × 1008 points (8.5" × 14")
PageSize.TABLOID # 792 × 1224 points (11" × 17")

Custom Page Sizes:

# Create custom page size (width, height in points)
custom_size = PageSize(name=None, width=500.0, height=700.0)

Bounding Rectangles

Every PDF element has a bounding rectangle that defines its position and size:

{
"x": 100, # Left edge (from page left)
"y": 500, # Bottom edge (from page bottom)
"width": 200, # Width in points
"height": 50 # Height in points
}
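Since x and y give the bottom-left corner, the right and top edges follow by adding the width and height. The sketch below shows that arithmetic with a stand-in Rect class; it is an assumption made for illustration and does not mirror any PDFDancer type.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Rect:
    """Stand-in bounding rectangle (hypothetical, not a PDFDancer class)."""
    x: float       # left edge, measured from the page's left side
    y: float       # bottom edge, measured from the page's bottom side
    width: float
    height: float

    @property
    def right(self) -> float:
        return self.x + self.width

    @property
    def top(self) -> float:
        return self.y + self.height

    def contains_point(self, px: float, py: float) -> bool:
        """True if the point lies inside the rectangle or on its border."""
        return self.x <= px <= self.right and self.y <= py <= self.top


rect = Rect(x=100, y=500, width=200, height=50)
print(rect.right, rect.top)           # 300 550
print(rect.contains_point(150, 520))  # True
print(rect.contains_point(150, 600))  # False
```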

PDFDancer Content Model

PDFDancer provides a structured way to interact with PDF content through several key object types.

Pages

Pages are the fundamental containers in a PDF document. Each page has:

  • A page number (page 1 is the first page)
  • Dimensions (width and height in points)
  • A bounding rectangle defining its size
  • Content (paragraphs, images, paths, form fields)

Pages are accessed using pdf.page(number). PDFDancer uses 1-based page numbering, so page 1 is the first page:

# Get first page
first_page = pdf.page(1)

# Get all pages
all_pages = pdf.pages()

Paragraphs

Paragraphs are PDFDancer's high-level text abstraction. A paragraph represents a logical block of text that may span multiple lines.

Key Properties:

  • text: The complete text content
  • position: Bounding rectangle and page location
  • internal_id: Unique identifier within the PDF

When to use Paragraphs:

  • Finding text blocks by content (e.g., "Invoice #12345")
  • Editing multi-line text blocks
  • Replacing entire sections of text
  • Adding formatted text content

# Select all paragraphs
paragraphs = pdf.select_paragraphs()

# Select by text prefix
headers = pdf.select_paragraphs_starting_with("Chapter")

# Access paragraph properties
for para in paragraphs:
    print(f"Text: {para.text}")
    print(f"Position: {para.position.bounding_rect}")

TextLines

TextLines represent individual lines of text within a paragraph. They provide finer-grained control than paragraphs.

Key Properties:

  • text: The text content of the line
  • position: Bounding rectangle of the line
  • internal_id: Unique identifier

When to use TextLines:

  • Precise line-by-line text manipulation
  • Finding single-line text elements
  • Working with tabular data or structured text

# Select all text lines
lines = pdf.page(1).select_lines()

# Select lines by prefix
date_lines = pdf.select_text_lines_starting_with("Date:")

for line in lines:
    print(f"Line text: {line.text}")

Paragraph vs TextLine:

  • Use Paragraphs for semantic text blocks (headings, body text, captions)
  • Use TextLines for precise line-level control or single-line elements

Images

Images represent raster graphics (PNG, JPEG, etc.) embedded in the PDF.

Key Properties:

  • internal_id: Unique identifier
  • position: Bounding rectangle and location
  • Image data (for export/manipulation)

Common Operations:

  • Selecting images by position
  • Adding new images at specific coordinates
  • Deleting existing images
  • Replacing images

# Select all images on a page
images = pdf.page(1).select_images()

# Select images at coordinates
images_at_point = pdf.page(1).select_images_at(x=100, y=500)

# Add a new image
pdf.new_image() \
.from_file("logo.png") \
.at(page=1, x=50, y=700) \
.add()

Paths (Vector Graphics)

Paths are vector graphics elements that can represent lines, curves, shapes, and complex drawings.

What Paths Represent:

  • Lines and curves (straight lines, Bézier curves)
  • Shapes (rectangles, circles, polygons)
  • Borders and decorative elements
  • Technical drawings and diagrams

Key Concepts:

  • Bézier curves: Mathematical curves defined by control points
  • Stroke: The outline of a path (color, width)
  • Fill: The interior color of closed paths

# Select all paths on a page
paths = pdf.page(1).select_paths()

# Select paths at specific coordinates
paths_at_point = pdf.page(1).select_paths_at(x=150, y=320)

for path in paths:
    print(f"Path ID: {path.internal_id}")
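To make the Bézier concept concrete, the sketch below evaluates a cubic Bézier curve (the kind PDF paths use) from its four control points with the standard Bernstein form. The function name cubic_bezier is hypothetical and the code is independent of PDFDancer.

```python
def cubic_bezier(p0, p1, p2, p3, t):
    """Return the (x, y) point on a cubic Bézier curve at parameter t in [0, 1].

    p0 and p3 are the endpoints; p1 and p2 are the control points that
    pull the curve toward them without (in general) lying on it.
    """
    u = 1.0 - t
    x = u**3 * p0[0] + 3 * u**2 * t * p1[0] + 3 * u * t**2 * p2[0] + t**3 * p3[0]
    y = u**3 * p0[1] + 3 * u**2 * t * p1[1] + 3 * u * t**2 * p2[1] + t**3 * p3[1]
    return (x, y)


start, control1, control2, end = (0, 0), (0, 100), (100, 100), (100, 0)

print(cubic_bezier(start, control1, control2, end, 0.0))  # (0.0, 0.0)
print(cubic_bezier(start, control1, control2, end, 0.5))  # (50.0, 75.0)
print(cubic_bezier(start, control1, control2, end, 1.0))  # (100.0, 0.0)
```

The midpoint reaches y = 75 rather than 100, which shows that the curve is attracted to the control points but does not pass through them.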

Form Fields (AcroForms)

Form Fields are interactive elements in PDF forms (AcroForms) that can be filled programmatically.

Common Field Types:

  • Text fields: Single-line or multi-line text input
  • Checkboxes: Boolean on/off values
  • Radio buttons: Single choice from multiple options
  • Dropdowns: Selection from a list

Key Properties:

  • name: Field identifier (e.g., "firstName", "email")
  • object_type: Type of field
  • position: Location on the page

# Select all form fields
fields = pdf.select_form_fields()

# Select by name
name_fields = pdf.select_form_fields_by_name("firstName")

# Fill a field
if name_fields:
    name_fields[0].edit().value("John Doe").apply()

FormXObjects

FormXObjects (form XObjects in PDF terminology, one kind of XObject) are reusable content streams that can be referenced multiple times throughout a document.

Use Cases:

  • Company logos appearing on every page
  • Page headers and footers
  • Watermarks
  • Template overlays

Benefits:

  • Efficiency: Content is stored once, referenced many times
  • Consistency: Ensures identical appearance across pages
  • Smaller file size: No content duplication

FormXObjects can be transformed (scaled, rotated, positioned) each time they're used without modifying the original content.
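In PDF itself, such transformations are expressed as a six-number matrix [a b c d e f] that maps a point (x, y) to (a·x + c·y + e, b·x + d·y + f). The sketch below applies that general PDF convention to plain tuples; it does not use or describe any PDFDancer API, and the helper names are hypothetical.

```python
import math


def apply_matrix(matrix, x, y):
    """Apply a PDF transformation matrix (a, b, c, d, e, f) to a point."""
    a, b, c, d, e, f = matrix
    return (a * x + c * y + e, b * x + d * y + f)


def rotation(degrees):
    """Matrix for a counter-clockwise rotation around the origin."""
    r = math.radians(degrees)
    return (math.cos(r), math.sin(r), -math.sin(r), math.cos(r), 0.0, 0.0)


# Scale to half size, then place the origin of the content at (50, 700)
half_size_at_50_700 = (0.5, 0, 0, 0.5, 50, 700)

# The content's own point (200, 100) ends up here on the page
print(apply_matrix(half_size_at_50_700, 200, 100))  # (150.0, 750.0)
```

The content stream is untouched by the transformation; only the matrix changes between uses, which is why the same content can appear at different sizes and positions.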

Working with FormXObjects:

from pdfdancer import PDFDancer

with PDFDancer.open("document.pdf") as pdf:
    # Select all FormXObjects on a page
    formxobjects = pdf.page(1).select_formxobjects()

    # Select FormXObjects at specific coordinates
    formxobjects_at_point = pdf.page(1).select_formxobjects_at(x=100, y=500)

    for fxo in formxobjects:
        print(f"FormXObject ID: {fxo.internal_id}")
        print(f"Position: {fxo.position.bounding_rect}")

Fonts

PDF supports both standard and custom fonts.

Standard PDF Fonts

The 14 standard PDF fonts are guaranteed to be available in all PDF readers and do not need to be embedded in the PDF document.

Serif Fonts (Times family):

from pdfdancer import StandardFonts

StandardFonts.TIMES_ROMAN # Times-Roman
StandardFonts.TIMES_BOLD # Times-Bold
StandardFonts.TIMES_ITALIC # Times-Italic
StandardFonts.TIMES_BOLD_ITALIC # Times-BoldItalic

Sans-serif Fonts (Helvetica family):

StandardFonts.HELVETICA # Helvetica
StandardFonts.HELVETICA_BOLD # Helvetica-Bold
StandardFonts.HELVETICA_OBLIQUE # Helvetica-Oblique
StandardFonts.HELVETICA_BOLD_OBLIQUE # Helvetica-BoldOblique

Monospace Fonts (Courier family):

StandardFonts.COURIER # Courier
StandardFonts.COURIER_BOLD # Courier-Bold
StandardFonts.COURIER_OBLIQUE # Courier-Oblique
StandardFonts.COURIER_BOLD_OBLIQUE # Courier-BoldOblique

Symbol and Decorative Fonts:

StandardFonts.SYMBOL # Symbol (mathematical and special characters)
StandardFonts.ZAPF_DINGBATS # ZapfDingbats (decorative symbols)

Using Standard Fonts:

# Use standard font constant
pdf.new_paragraph() \
.text("Hello World") \
.font(StandardFonts.HELVETICA.value, 12) \
.add()

# Or use font name string directly
pdf.new_paragraph() \
.text("Hello World") \
.font("Helvetica", 12) \
.add()

Custom Fonts

PDFDancer supports embedding custom TrueType fonts (.ttf) for precise typography.

# Use standard font
pdf.new_paragraph() \
.text("Hello World") \
.font("Helvetica", 12) \
.add()

# Use custom font
pdf.new_paragraph() \
.text("Custom Typography") \
.font_file("custom-font.ttf", 14) \
.add()

Position Objects

Position objects encapsulate coordinate information for precise element placement and selection.

Creating Positions

from pdfdancer import Position, PositionMode

# Create position at point
position = Position.at_page_coordinates(page=1, x=100, y=200)

# Create position with bounding rect
position = Position(
page_number=1,
bounding_rect={"x": 100, "y": 200, "width": 50, "height": 30},
mode=PositionMode.INTERSECT
)

# Use for selection
paragraphs = pdf.select_paragraphs_at(position)

Position Modes

  • INTERSECT: Select elements that overlap with the position area
  • CONTAIN: Select elements fully contained within the position area
  • EXACT: Select elements at exact coordinates

Color

PDFDancer uses RGB color values for text and graphics.

from pdfdancer import Color

# Create colors
black = Color(0, 0, 0)
red = Color(255, 0, 0)
gray = Color(128, 128, 128)
custom = Color(70, 130, 180) # Steel blue

# Apply to text
pdf.new_paragraph() \
.text("Colored text") \
.color(red) \
.add()
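Colors often arrive as hex strings from design tools. The sketch below converts a hex string into the 0-255 components used in the Color examples above, and also shows the 0-1 scale that PDF content streams use internally. Both helpers are hypothetical and are not part of PDFDancer.

```python
def hex_to_rgb(hex_color: str) -> tuple[int, int, int]:
    """Convert '#RRGGBB' (or 'RRGGBB') to a tuple of 0-255 integers."""
    value = hex_color.lstrip("#")
    if len(value) != 6:
        raise ValueError(f"expected 6 hex digits, got {hex_color!r}")
    return tuple(int(value[i:i + 2], 16) for i in (0, 2, 4))


def rgb_to_unit(r: int, g: int, b: int) -> tuple[float, float, float]:
    """Scale 0-255 components to the 0-1 range used inside PDF content streams."""
    return (r / 255, g / 255, b / 255)


print(hex_to_rgb("#4682B4"))   # (70, 130, 180), the steel blue used above
print(rgb_to_unit(255, 0, 0))  # (1.0, 0.0, 0.0)
```

The resulting integers could then be unpacked into the constructor shown earlier, for example Color(*hex_to_rgb("#4682B4")).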

Selection vs Creation

PDFDancer provides two primary workflows:

Selection (Read/Modify)

Use select_* methods to find existing content:

# Find existing content
paragraphs = pdf.select_paragraphs()
images = pdf.page(1).select_images()
fields = pdf.select_form_fields_by_name("email")

# Modify it
paragraphs[0].edit().replace("New text").apply()

Creation (Add)

Use new_* methods to add new content:

# Add new content
pdf.new_paragraph() \
.text("New content") \
.at(page_number=1, x=100, y=500) \
.add()

pdf.new_image() \
.from_file("logo.png") \
.at(page=1, x=50, y=700) \
.add()

Fluent Builders

PDFDancer uses fluent builder patterns for creating and editing content. Builders allow you to chain method calls for readable, declarative code:

# Paragraph builder
pdf.new_paragraph() \
.text("Hello World") \
.font("Helvetica", 12) \
.color(Color(0, 0, 0)) \
.line_spacing(1.5) \
.at(page_number=1, x=100, y=500) \
.add()

# Edit builder
paragraph.edit() \
.replace("New text") \
.font("Helvetica-Bold", 14) \
.color(Color(255, 0, 0)) \
.apply()

Thread Safety

Important

PDFDancer sessions are not thread-safe and must not be used concurrently.

Each session instance should only be accessed from a single thread at a time. Do not share session objects across threads or use them in concurrent operations.

Why This Matters:

When you call PDFDancer.open(), you create a session that maintains state on both the client and server. Concurrent access from multiple threads can lead to:

  • Race conditions and unpredictable behavior
  • Corrupted PDF state
  • API errors and failed operations

Safe Patterns:

from pdfdancer import PDFDancer
from concurrent.futures import ThreadPoolExecutor

# ✓ SAFE: Each thread creates its own session
def process_pdf(file_path: str) -> None:
    with PDFDancer.open(file_path) as pdf:
        # Operations on this session
        paragraphs = pdf.select_paragraphs()
        pdf.save(f"output_{file_path}")

# Process multiple PDFs in parallel - each gets its own session
with ThreadPoolExecutor() as executor:
    executor.map(process_pdf, ["doc1.pdf", "doc2.pdf", "doc3.pdf"])


# ✗ UNSAFE: Sharing a session across threads
pdf = PDFDancer.open("document.pdf")

def unsafe_operation():
    # DON'T DO THIS - multiple threads using the same session
    pdf.select_paragraphs() # Not thread-safe!

with ThreadPoolExecutor() as executor:
    executor.submit(unsafe_operation)
    executor.submit(unsafe_operation)

Best Practice: Always create a new session instance for each thread or concurrent operation. Sessions are lightweight to create and are designed for single-threaded access.
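When running one session per worker, note that an exception raised inside a worker only surfaces when its result is retrieved. The sketch below collects results and errors per file using futures. The process_pdf function here is a stand-in that does no PDF work; in real code its body would open its own PDFDancer session as in the safe pattern above.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def process_pdf(file_path: str) -> str:
    """Stand-in worker (hypothetical): real code would open its own session here."""
    if not file_path.endswith(".pdf"):
        raise ValueError(f"not a PDF: {file_path}")
    return f"output_{file_path}"


def process_all(paths):
    """Run one worker per path and separate successes from failures."""
    results, errors = {}, {}
    with ThreadPoolExecutor(max_workers=4) as executor:
        futures = {executor.submit(process_pdf, path): path for path in paths}
        for future in as_completed(futures):
            path = futures[future]
            try:
                results[path] = future.result()  # re-raises worker exceptions
            except Exception as exc:
                errors[path] = exc
    return results, errors


results, errors = process_all(["doc1.pdf", "doc2.pdf", "notes.txt"])
print(sorted(results))  # ['doc1.pdf', 'doc2.pdf']
print(sorted(errors))   # ['notes.txt']
```

Keeping a map from each future to its input path makes it possible to report which file failed without sharing any state between workers.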


Next Steps

Now that you understand the core concepts, explore how to use them: