Reference¶
Functions¶
convert_from_path & convert_from_bytes¶
Converts a PDF into image(s)
convert_from_path(
    pdf_path,
    dpi=200,
    output_folder=None,
    first_page=None,
    last_page=None,
    fmt="ppm",
    jpegopt=None,
    thread_count=1,
    userpw=None,
    use_cropbox=False,
    strict=False,
    transparent=False,
    single_file=False,
    output_file=uuid_generator(),
    poppler_path=None,
    grayscale=False,
    size=None,
    paths_only=False,
    hide_annotations=False,
)
convert_from_bytes(
    pdf_bytes,
    dpi=200,
    output_folder=None,
    first_page=None,
    last_page=None,
    fmt="ppm",
    jpegopt=None,
    thread_count=1,
    userpw=None,
    use_cropbox=False,
    strict=False,
    transparent=False,
    single_file=False,
    output_file=uuid_generator(),
    poppler_path=None,
    grayscale=False,
    size=None,
    paths_only=False,
    hide_annotations=False,
)
pdf_path
Path to the PDF file. Can be a string or a pathlib.Path object.
pdf_bytes
Bytes of the PDF file.
dpi
Dots per inch. This can be seen as the relative resolution of the output images: higher is sharper, but anything above 300 is usually not discernible to the naked eye. Keep in mind that dpi directly determines the size of the output images when using file formats without compression (like PPM).
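To make the link between dpi and output size concrete, here is a rough estimate, assuming page dimensions are given in PDF points (72 points per inch) and that an uncompressed image uses 3 bytes per pixel. The helper names estimate_pixels and estimate_rgb_bytes are hypothetical and not part of pdf2image; the exact pixel counts produced by poppler may differ by a pixel because of rounding.

```python
POINTS_PER_INCH = 72  # PDF page sizes are expressed in points


def estimate_pixels(width_pt, height_pt, dpi=200):
    """Approximate the pixel dimensions of a rendered page."""
    return (
        round(width_pt / POINTS_PER_INCH * dpi),
        round(height_pt / POINTS_PER_INCH * dpi),
    )


def estimate_rgb_bytes(width_pt, height_pt, dpi=200):
    """Approximate the uncompressed size in bytes (3 bytes per RGB pixel)."""
    width_px, height_px = estimate_pixels(width_pt, height_pt, dpi)
    return width_px * height_px * 3


# US Letter is 612 x 792 points (8.5 x 11 inches).
letter_200 = estimate_rgb_bytes(612, 792, dpi=200)
letter_400 = estimate_rgb_bytes(612, 792, dpi=400)
print(letter_200, letter_400)
```

Doubling the dpi quadruples the pixel count, which is why high dpi values combined with PPM output can consume a lot of memory or disk space.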
output_folder
Output directory for the generated files. It is better thought of as a “working directory” than as an output folder: the converted images are written there instead of being held in memory, which saves system memory.
first_page
First page that will be converted. For example, first_page=2 will skip page 1.
last_page
Last page that will be converted. For example, last_page=2 will skip all pages after page 2.
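Since first_page and last_page are 1-based and inclusive, a long PDF can be converted in chunks to bound memory use. The following is a minimal sketch, assuming the total page count is already known; page_batches and convert_in_batches are hypothetical helper names and are not part of pdf2image.

```python
def page_batches(total_pages, batch_size):
    """Return 1-based, inclusive (first_page, last_page) pairs covering all pages."""
    if total_pages < 0 or batch_size < 1:
        raise ValueError("total_pages must be >= 0 and batch_size must be >= 1")
    return [
        (first, min(first + batch_size - 1, total_pages))
        for first in range(1, total_pages + 1, batch_size)
    ]


def convert_in_batches(pdf_path, total_pages, batch_size=10, **kwargs):
    """Yield one list of images per batch (requires pdf2image and poppler)."""
    # Imported lazily so page_batches can be used without pdf2image installed.
    from pdf2image import convert_from_path

    for first, last in page_batches(total_pages, batch_size):
        yield convert_from_path(
            pdf_path, first_page=first, last_page=last, **kwargs
        )
```

Each batch can be processed and discarded before the next one is rendered, so only batch_size pages are in memory at any time.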
fmt
File format of the output images. Supported values are ppm, jpeg, png and tiff.
jpegopt
Configuration for the JPEG output format. As such, it is only used with fmt='jpeg'.
jpegopt={
"quality": 100,
"progressive": True,
"optimize": True
}
- quality: Selects the JPEG quality value. The value must be an integer between 0 and 100.
- progressive: Selects progressive JPEG output. The possible values are True and False, indicating progressive or non-progressive output, respectively.
- optimize: Sets whether to compute optimal Huffman coding tables for the JPEG output, which creates smaller files but makes an extra pass over the data. The value must be True or False. With True, optimization is performed; otherwise the default Huffman tables are used.
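The constraints above can be checked before calling the converter. This is only an illustrative sketch: validate_jpegopt is a hypothetical helper, and whether pdf2image performs similar validation itself is not stated here.

```python
ALLOWED_JPEGOPT_KEYS = {"quality", "progressive", "optimize"}


def validate_jpegopt(jpegopt):
    """Check a jpegopt dict against the documented constraints and return it."""
    unknown = set(jpegopt) - ALLOWED_JPEGOPT_KEYS
    if unknown:
        raise ValueError(f"unknown jpegopt keys: {sorted(unknown)}")

    if "quality" in jpegopt:
        quality = jpegopt["quality"]
        # bool is a subclass of int in Python, so exclude it explicitly.
        if isinstance(quality, bool) or not isinstance(quality, int):
            raise ValueError("quality must be an integer")
        if not 0 <= quality <= 100:
            raise ValueError("quality must be between 0 and 100")

    for key in ("progressive", "optimize"):
        if key in jpegopt and not isinstance(jpegopt[key], bool):
            raise ValueError(f"{key} must be True or False")

    return jpegopt


options = validate_jpegopt({"quality": 90, "progressive": True, "optimize": True})
```

The validated dict can then be passed as jpegopt=options together with fmt="jpeg".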
thread_count
Number of threads to use when converting the PDF. Limited to the actual number of pages.
userpw
Password for the PDF if it is password-protected.
use_cropbox
Uses the PDF cropbox instead of the default mediabox. This is a rather obscure feature that should be set to True when the module does not seem to work with your data.
strict
Raises PDFSyntaxError when the PDF is partially malformed. Most PDFs are partially malformed, so this parameter should be kept at False unless standards compliance is paramount to your use case.
transparent
Instead of returning a white background, make the PDF background transparent. Only compatible with file formats that support transparency.
single_file
Converts only the first page of the PDF and does not append an index to the output file name.
output_file
Output file name. This is normally a string, but a generator that yields strings is also accepted.
poppler_path
Path to the poppler directory containing the libraries and executable files.
grayscale
Returns grayscale images.
size
Size of the output images. Using None for one of the dimensions resizes the image while preserving its aspect ratio.
Examples of valid sizes are:
- size=400 will fit the image to a 400x400 box, preserving aspect ratio
- size=(400, None) will make the image 400 pixels wide, preserving aspect ratio
- size=(500, 500) will resize the image to 500x500 pixels, not preserving aspect ratio
This behavior is derived directly from the -scale-to, -scale-to-x, and -scale-to-y parameters.
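To illustrate how a size value could correspond to those command-line parameters, here is a sketch of one possible mapping. The function size_to_flags is hypothetical and is not presented as pdf2image's actual implementation; it assumes that -1 is passed for a dimension that should follow the aspect ratio, as described in the pdftoppm manual.

```python
def size_to_flags(size):
    """Translate a size value into a list of poppler-style scaling flags."""
    if size is None:
        return []  # no scaling requested

    if isinstance(size, (int, float)):
        # A single number fits the page inside a square box.
        return ["-scale-to", str(int(size))]

    if isinstance(size, tuple) and len(size) == 2:
        width, height = size
        # Assumption: -1 lets the other dimension decide via the aspect ratio.
        x = -1 if width is None else int(width)
        y = -1 if height is None else int(height)
        return ["-scale-to-x", str(x), "-scale-to-y", str(y)]

    raise ValueError(f"unsupported size value: {size!r}")


for example in (400, (400, None), (500, 500)):
    print(example, size_to_flags(example))
```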
paths_only
Returns a list of image paths rather than preloaded images.
jpegopt
Provides additional options for JPEG format conversions. Requires fmt="jpeg" and is provided as a dict in which all keys are optional:
jpegopt={"quality": 100, "optimize": True, "progressive": False}
hide_annotations
Hide link bounding boxes and other PDF annotations. This is only implemented in pdftoppm at the moment so it cannot be combined with pdftocairo flags.
Exceptions¶
from pdf2image.exceptions import (
    PDFInfoNotInstalledError,
    PDFPageCountError,
    PDFSyntaxError,
)
PDFInfoNotInstalledError¶
Exception raised when pdfinfo, which is part of poppler-utils, was not found on your system. You can check this by trying to call pdfinfo from your command line.
When this error is raised, the cause is almost always installation-related.
PDFPageCountError¶
Exception raised when pdfinfo, which is part of poppler-utils, was unable to get the page count from the PDF file. This is usually due to:
- An invalid PDF file path
- A malformed or invalid PDF
PDFSyntaxError¶
Exception raised when convert_from_path or convert_from_bytes is called with strict=True and the input PDF contains a syntax error. Using strict=False will usually solve this issue.
Note that most PDFs contain syntax errors, so strict mode can safely be left off in most cases.
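One way these three exceptions might be handled together is sketched below. The wrapper convert_with_fallback is a hypothetical name, and the policy of trying strict mode first and then retrying leniently is only an illustration, not a recommendation from the library.

```python
def convert_with_fallback(pdf_path, **kwargs):
    """Try a strict conversion first, then retry leniently on syntax errors."""
    # Imported lazily so this module can be loaded without pdf2image installed.
    from pdf2image import convert_from_path
    from pdf2image.exceptions import (
        PDFInfoNotInstalledError,
        PDFPageCountError,
        PDFSyntaxError,
    )

    try:
        return convert_from_path(pdf_path, strict=True, **kwargs)
    except PDFSyntaxError:
        # The PDF is partially malformed: retry without strict checking.
        return convert_from_path(pdf_path, strict=False, **kwargs)
    except PDFInfoNotInstalledError as exc:
        # Almost always an installation problem with poppler-utils.
        raise RuntimeError("pdfinfo was not found; is poppler installed?") from exc
    except PDFPageCountError as exc:
        # Usually an invalid path or a malformed PDF.
        raise ValueError(f"could not read the page count of {pdf_path!r}") from exc
```

A caller would then write images = convert_with_fallback("document.pdf", dpi=150), where "document.pdf" is a placeholder path.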