Class COSParser

java.lang.Object
org.apache.pdfbox.pdfparser.BaseParser
org.apache.pdfbox.pdfparser.COSParser
Direct Known Subclasses:
FDFParser, PDFParser

public class COSParser extends BaseParser
PDF parser that first reads startxref and the xref tables to determine which objects are valid, and then parses only those objects. PDFParser.parse() or FDFParser.parse() must be called before page objects can be retrieved, e.g. via PDFParser.getPDDocument(). This class is a much enhanced version of the QuickParser presented in PDFBOX-1104 by Jeremy Villalobos.
  • Field Details

    • PDF_HEADER

      private static final String PDF_HEADER
    • FDF_HEADER

      private static final String FDF_HEADER
    • PDF_DEFAULT_VERSION

      private static final String PDF_DEFAULT_VERSION
    • FDF_DEFAULT_VERSION

      private static final String FDF_DEFAULT_VERSION
    • XREF_TABLE

      private static final char[] XREF_TABLE
    • XREF_STREAM

      private static final char[] XREF_STREAM
    • STARTXREF

      private static final char[] STARTXREF
    • ENDSTREAM

      private static final byte[] ENDSTREAM
    • ENDOBJ

      private static final byte[] ENDOBJ
    • MINIMUM_SEARCH_OFFSET

      private static final long MINIMUM_SEARCH_OFFSET
    • X

      private static final int X
    • STRMBUFLEN

      private static final int STRMBUFLEN
    • strmBuf

      private final byte[] strmBuf
    • source

      protected final RandomAccessRead source
    • accessPermission

      private AccessPermission accessPermission
    • keyStoreInputStream

      private InputStream keyStoreInputStream
    • password

      private String password
    • keyAlias

      private String keyAlias
    • SYSPROP_PARSEMINIMAL

      public static final String SYSPROP_PARSEMINIMAL
      Only parse the PDF file minimally, allowing access to basic information only.
    • SYSPROP_EOFLOOKUPRANGE

      public static final String SYSPROP_EOFLOOKUPRANGE
      The range within which the %%EOF marker will be searched. Useful if there are additional characters after %%EOF within the PDF.
    • DEFAULT_TRAIL_BYTECOUNT

      private static final int DEFAULT_TRAIL_BYTECOUNT
      How many trailing bytes to read for EOF marker.
    • EOF_MARKER

      protected static final char[] EOF_MARKER
      EOF-marker.
    • OBJ_MARKER

      protected static final char[] OBJ_MARKER
      obj-marker.
    • TRAILER_MARKER

      private static final char[] TRAILER_MARKER
      trailer-marker.
    • OBJ_STREAM

      private static final char[] OBJ_STREAM
      ObjStream-marker.
    • trailerOffset

      private long trailerOffset
    • fileLen

      protected long fileLen
      file length.
    • isLenient

      private boolean isLenient
      Whether the parser uses its auto-healing capability.
    • initialParseDone

      protected boolean initialParseDone
    • trailerWasRebuild

      private boolean trailerWasRebuild
    • bfSearchCOSObjectKeyOffsets

      private Map<COSObjectKey,Long> bfSearchCOSObjectKeyOffsets
      Contains all found objects of a brute force search.
    • lastEOFMarker

      private Long lastEOFMarker
    • bfSearchXRefTablesOffsets

      private List<Long> bfSearchXRefTablesOffsets
    • bfSearchXRefStreamsOffsets

      private List<Long> bfSearchXRefStreamsOffsets
    • encryption

      private PDEncryption encryption
    • securityHandler

      protected SecurityHandler securityHandler
      The security handler.
    • readTrailBytes

      private int readTrailBytes
      How many trailing bytes to read for the EOF marker.
    • LOG

      private static final org.apache.commons.logging.Log LOG
    • xrefTrailerResolver

      protected XrefTrailerResolver xrefTrailerResolver
      Collects all xref/trailer objects and resolves them into a single object using the startxref reference.
    • TMP_FILE_PREFIX

      public static final String TMP_FILE_PREFIX
      The prefix for the temp file being used.
    • STREAMCOPYBUFLEN

      private static final int STREAMCOPYBUFLEN
    • streamCopyBuf

      private final byte[] streamCopyBuf
  • Constructor Details

    • COSParser

      public COSParser(RandomAccessRead source)
      Default constructor.
      Parameters:
      source - input representing the pdf.
    • COSParser

      public COSParser(RandomAccessRead source, String password, InputStream keyStore, String keyAlias)
      Constructor for encrypted pdfs.
      Parameters:
      source - input representing the pdf.
      password - password to be used for decryption.
      keyStore - key store to be used for decryption when using public key security
      keyAlias - alias to be used for decryption when using public key security
  • Method Details

    • setEOFLookupRange

      public void setEOFLookupRange(int byteCount)
      Sets how many trailing bytes of the PDF file are searched for the EOF marker and the 'startxref' marker. If not set, the default value DEFAULT_TRAIL_BYTECOUNT is used.

      The new value is checked to be at least 16. For practical use, however, it should not be lower than 1000; even 2000 was found to be insufficient in some cases where trailing garbage such as HTML snippets followed the EOF marker.

      If the system property SYSPROP_EOFLOOKUPRANGE is defined, its value is applied on initialization but can be overwritten later.

      Parameters:
      byteCount - number of trailing bytes
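      The rules above can be sketched as follows. This is a minimal illustration, not the PDFBox implementation: the class EofLookupRange, the constant ASSUMED_DEFAULT and its value 2048 are hypothetical (the real value of DEFAULT_TRAIL_BYTECOUNT is not stated on this page), and silently ignoring values below 16 or unparsable property values is an assumption.

```java
final class EofLookupRange {
    // Placeholder default; the real DEFAULT_TRAIL_BYTECOUNT value is not given on this page.
    static final int ASSUMED_DEFAULT = 2048;
    // Lower bound taken from the method description.
    static final int MINIMUM = 16;

    private int readTrailBytes = ASSUMED_DEFAULT;

    /** Applies an optional override supplied as text, e.g. a system property value. */
    EofLookupRange(String propertyValue) {
        if (propertyValue != null) {
            try {
                setEOFLookupRange(Integer.parseInt(propertyValue.trim()));
            } catch (NumberFormatException e) {
                // Assumption: an unparsable value leaves the default in place.
            }
        }
    }

    /** Accepts the new range only if it is at least MINIMUM (assumed: otherwise ignored). */
    void setEOFLookupRange(int byteCount) {
        if (byteCount >= MINIMUM) {
            readTrailBytes = byteCount;
        }
    }

    int getReadTrailBytes() {
        return readTrailBytes;
    }
}
```

      A caller dealing with files that carry trailing garbage would typically raise the range, for example to 4096, before parsing starts.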
    • retrieveTrailer

      protected COSDictionary retrieveTrailer() throws IOException
      Read the trailer information and provide a COSDictionary containing the trailer information.
      Returns:
      a COSDictionary containing the trailer information
      Throws:
      IOException - if something went wrong
    • parseXref

      protected COSDictionary parseXref(long startXRefOffset) throws IOException
      Parses cross reference tables.
      Parameters:
      startXRefOffset - start offset of the first table
      Returns:
      the trailer dictionary
      Throws:
      IOException - if something went wrong
    • parseXrefObjStream

      private long parseXrefObjStream(long objByteOffset, boolean isStandalone) throws IOException
      Parses an xref object stream starting with indirect object id.
      Returns:
      value of PREV item in dictionary or -1 if no such item exists
      Throws:
      IOException
    • getStartxrefOffset

      protected final long getStartxrefOffset() throws IOException
      Looks for and parses startxref. We first look for the last '%%EOF' marker within the last DEFAULT_TRAIL_BYTECOUNT bytes (or within the range set via setEOFLookupRange(int)) and then go back to find startxref.
      Returns:
      the offset of StartXref
      Throws:
      IOException - If something went wrong.
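      The two-step lookup described above (last '%%EOF' inside a trailing window, then the nearest preceding 'startxref') might look like the sketch below. It works on an in-memory byte array instead of a RandomAccessRead, and the class and method names are hypothetical; it only locates the keyword and does not parse the offset value that follows it.

```java
import java.nio.charset.StandardCharsets;

final class StartXrefLocator {
    /**
     * Returns the absolute offset of the last "startxref" keyword located before
     * the last "%%EOF" within the trailing window, or -1 if either is missing.
     */
    static long findStartXref(byte[] file, int trailBytes) {
        int windowLen = Math.max(0, Math.min(trailBytes, file.length));
        int windowStart = file.length - windowLen;
        // ISO-8859-1 maps every byte to exactly one char, so string indexes equal byte offsets.
        String tail = new String(file, windowStart, windowLen, StandardCharsets.ISO_8859_1);
        int eof = tail.lastIndexOf("%%EOF");
        if (eof < 0) {
            return -1;
        }
        // Search backwards, starting from the position of the EOF marker.
        int startXref = tail.lastIndexOf("startxref", eof);
        if (startXref < 0) {
            return -1;
        }
        return (long) windowStart + startXref;
    }
}
```

      If the window is too small to contain both markers, the sketch returns -1, which mirrors why a larger lookup range helps with files that have trailing garbage.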
    • lastIndexOf

      protected int lastIndexOf(char[] pattern, byte[] buf, int endOff)
      Searches for the last appearance of the pattern within the buffer. The lookup starts before endOff and goes back until offset 0.
      Parameters:
      pattern - pattern to search for
      buf - buffer to search pattern in
      endOff - offset (exclusive) where lookup starts at
      Returns:
      start offset of pattern within buffer or -1 if pattern could not be found
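      A straightforward way to implement such a backward search is shown below. This is a naive sketch with a hypothetical class name, not the optimized routine used by the parser; it treats endOff as exclusive, as documented for the parameter.

```java
final class BackwardSearch {
    /**
     * Returns the start offset of the last occurrence of pattern that ends
     * before endOff (exclusive), or -1 if there is none.
     */
    static int lastIndexOf(char[] pattern, byte[] buf, int endOff) {
        int limit = Math.min(endOff, buf.length);
        // The last possible start position leaves room for the whole pattern.
        for (int start = limit - pattern.length; start >= 0; start--) {
            int i = 0;
            while (i < pattern.length && (buf[start + i] & 0xFF) == pattern[i]) {
                i++;
            }
            if (i == pattern.length) {
                return start;
            }
        }
        return -1;
    }
}
```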
    • isLenient

      public boolean isLenient()
      Returns true if the parser is lenient, meaning that the auto-healing capabilities of the parser are used.
      Returns:
      true if parser is lenient
    • setLenient

      public void setLenient(boolean lenient)
      Change the parser leniency flag. This method can only be called before the parsing of the file.
      Parameters:
      lenient - try to handle malformed PDFs.
    • getObjectId

      private long getObjectId(COSObject obj)
      Creates a unique object id using the object number and the object generation number (requires object number < 2^31).
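      One common way to build such an id is to pack both numbers into a single long, which also explains the 2^31 restriction: the object number is shifted into the upper 32 bits and must not reach the sign bit. The sketch below is an illustration under that assumption, with hypothetical names; the exact bit layout used by the parser is not stated on this page.

```java
final class ObjectIds {
    /** Packs the object number into the upper 32 bits and the generation into the lower 32 bits. */
    static long objectId(long objectNumber, int generationNumber) {
        return (objectNumber << 32) | (generationNumber & 0xFFFFFFFFL);
    }

    /** Extracts the object number from a packed id. */
    static long objectNumberOf(long id) {
        return id >>> 32;
    }

    /** Extracts the generation number from a packed id. */
    static int generationOf(long id) {
        return (int) id;
    }
}
```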
    • addNewToList

      private void addNewToList(Queue<COSBase> toBeParsedList, Collection<COSBase> newObjects, Set<Long> addedObjects)
      Adds all objects from newObjects to toBeParsedList if they are not a COSObject or if the COSObject was not added already (checked via addedObjects).
    • addNewToList

      private void addNewToList(Queue<COSBase> toBeParsedList, COSBase newObject, Set<Long> addedObjects)
      Adds newObject to toBeParsedList if it is not a COSObject or if this COSObject was not added already (checked via addedObjects). Simple objects are not added because nothing is done with them when toBeParsedList is processed.
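      The de-duplication idea can be sketched with plain collections. Everything here is a stand-in: Ref is a hypothetical replacement for an indirect object reference, Java collections and maps play the role of container objects, and any other value counts as a "simple object" that is skipped.

```java
import java.util.Collection;
import java.util.Map;
import java.util.Queue;
import java.util.Set;

final class ParseQueue {
    /** Hypothetical stand-in for an indirect object reference identified by an id. */
    static final class Ref {
        final long id;

        Ref(long id) {
            this.id = id;
        }
    }

    /** Queues item if it still needs processing; returns true if it was queued. */
    static boolean addNew(Queue<Object> toBeParsed, Object item, Set<Long> added) {
        if (item instanceof Ref) {
            // Set.add returns false when the id was queued before.
            if (!added.add(((Ref) item).id)) {
                return false;
            }
            toBeParsed.add(item);
            return true;
        }
        if (item instanceof Collection || item instanceof Map) {
            // Containers may hold further references, so they are queued as well.
            toBeParsed.add(item);
            return true;
        }
        // Simple values are skipped: nothing would be done with them later.
        return false;
    }
}
```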
    • parseDictObjects

      protected void parseDictObjects(COSDictionary dict, COSName... excludeObjects) throws IOException
      Will parse every object necessary to load a single page from the pdf document. We try our best to order objects according to offset in file before reading to minimize seek operations.
      Parameters:
      dict - the COSObject from the parent pages.
      excludeObjects - dictionary object reference entries with these names will not be parsed
      Throws:
      IOException - if something went wrong
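      Ordering objects by file offset before reading them can be illustrated with a sorted map. The sketch below uses hypothetical names and plain ids and offsets rather than COS types; ids without a known offset are simply skipped here, which is an assumption of the example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

final class OffsetOrdering {
    /** Returns the ids sorted by ascending file offset; ids without a known offset are skipped. */
    static List<Long> orderByOffset(List<Long> objectIds, Map<Long, Long> offsetById) {
        // TreeMap keeps its keys (the offsets) in ascending order.
        TreeMap<Long, List<Long>> byOffset = new TreeMap<>();
        for (Long id : objectIds) {
            Long offset = offsetById.get(id);
            if (offset == null) {
                continue;
            }
            byOffset.computeIfAbsent(offset, k -> new ArrayList<>()).add(id);
        }
        List<Long> ordered = new ArrayList<>();
        for (List<Long> ids : byOffset.values()) {
            ordered.addAll(ids);
        }
        return ordered;
    }
}
```

      Reading in this order lets the underlying source advance mostly forward instead of seeking back and forth.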
    • addExcludedToList

      private void addExcludedToList(COSName[] excludeObjects, COSDictionary dict, Set<Long> parsedObjects)
    • parseObjectDynamically

      protected final COSBase parseObjectDynamically(COSObject obj, boolean requireExistingNotCompressedObj) throws IOException
      This will parse the next object from the stream and add it to the local state.
      Parameters:
      obj - object to be parsed (we only take object number and generation number for lookup start offset)
      requireExistingNotCompressedObj - if true object to be parsed must not be contained within compressed stream
      Returns:
      the parsed object (which is also added to document object)
      Throws:
      IOException - If an IO error occurs.
    • parseObjectDynamically

      protected COSBase parseObjectDynamically(long objNr, int objGenNr, boolean requireExistingNotCompressedObj) throws IOException
      This will parse the next object from the stream and add it to the local state. It's reduced to parsing an indirect object.
      Parameters:
      objNr - object number of object to be parsed
      objGenNr - object generation number of object to be parsed
      requireExistingNotCompressedObj - if true the object to be parsed must be defined in xref (comment: null objects may be missing from xref) and it must not be a compressed object within object stream (this is used to circumvent being stuck in a loop in a malicious PDF)
      Returns:
      the parsed object (which is also added to document object)
      Throws:
      IOException - If an IO error occurs.
    • parseFileObject

      private void parseFileObject(Long offsetOrObjstmObNr, COSObjectKey objKey, COSObject pdfObject) throws IOException
      Throws:
      IOException
    • parseObjectStream

      private void parseObjectStream(int objstmObjNr) throws IOException
      Throws:
      IOException
    • getLength

      private COSNumber getLength(COSBase lengthBaseObj, COSName streamType) throws IOException
      Returns length value referred to or defined in given object.
      Throws:
      IOException
    • parseCOSStream

      protected COSStream parseCOSStream(COSDictionary dic) throws IOException
      This will read a COSStream from the input stream using the length attribute within the dictionary. If the length attribute is an indirect reference, it is first resolved to get the stream length. This means we copy the stream data without testing for 'endstream' or 'endobj', and thus it is no problem if these keywords occur within the stream. We require 'endstream' to be found after the stream data is read.
      Parameters:
      dic - dictionary that goes with this stream.
      Returns:
      parsed pdf stream.
      Throws:
      IOException - if an error occurred reading the stream, like problems with reading length attribute, stream does not end with 'endstream' after data read, stream too short etc.
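      A length-based read of this kind can be sketched on a byte array as follows. The class and method names are hypothetical, and skipping CR/LF bytes between the data and the keyword is an assumption of this example rather than a statement about the parser.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

final class LengthBasedStreamReader {
    private static final byte[] ENDSTREAM = "endstream".getBytes(StandardCharsets.ISO_8859_1);

    /**
     * Copies exactly length bytes starting at dataStart without inspecting them,
     * then requires the keyword "endstream" right after the data.
     */
    static byte[] readStream(byte[] file, int dataStart, int length) throws IOException {
        if (length < 0 || dataStart < 0 || dataStart > file.length - length) {
            throw new IOException("stream too short for declared length " + length);
        }
        byte[] data = Arrays.copyOfRange(file, dataStart, dataStart + length);
        int pos = dataStart + length;
        // Assumption: an end-of-line marker may separate the data from the keyword.
        while (pos < file.length && (file[pos] == '\r' || file[pos] == '\n')) {
            pos++;
        }
        if (pos + ENDSTREAM.length > file.length) {
            throw new IOException("'endstream' missing after stream data");
        }
        for (int i = 0; i < ENDSTREAM.length; i++) {
            if (file[pos + i] != ENDSTREAM[i]) {
                throw new IOException("expected 'endstream' at offset " + pos);
            }
        }
        return data;
    }
}
```

      Because the data is copied purely by length, a payload that itself contains the text "endstream" is read correctly as long as the declared length is right.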
    • readUntilEndStream

      private void readUntilEndStream(OutputStream out) throws IOException
      This method will read through the current stream object until we find the keyword "endstream" meaning we're at the end of this object. Some pdf files, however, forget to write some endstream tags and just close off objects with an "endobj" tag so we have to handle this case as well. This method is optimized using buffered IO and reduced number of byte compare operations.
      Parameters:
      out - stream we write out to.
      Throws:
      IOException - if something went wrong
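      The fallback described above, stopping at "endstream" or, if that is missing, at "endobj", can be illustrated with a simple scan. This sketch uses hypothetical names and works on a complete in-memory buffer; it does not reproduce the buffered, optimized reading of the real method.

```java
import java.nio.charset.StandardCharsets;

final class EndStreamScanner {
    /**
     * Returns the offset of the first "endstream" or "endobj" keyword at or
     * after from, whichever comes first, or -1 if neither is present.
     */
    static int findStreamEnd(byte[] buf, int from) {
        // ISO-8859-1 keeps a one-to-one mapping between bytes and chars.
        String text = new String(buf, StandardCharsets.ISO_8859_1);
        int endStream = text.indexOf("endstream", from);
        int endObj = text.indexOf("endobj", from);
        if (endStream < 0) {
            return endObj;
        }
        if (endObj < 0) {
            return endStream;
        }
        return Math.min(endStream, endObj);
    }
}
```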
    • readValidStream

      private void readValidStream(OutputStream out, COSNumber streamLengthObj) throws IOException
      Throws:
      IOException
    • validateStreamLength

      private boolean validateStreamLength(long streamLength) throws IOException
      Throws:
      IOException
    • checkXRefOffset

      private long checkXRefOffset(long startXRefOffset) throws IOException
      Check if the cross reference table/stream can be found at the current offset.
      Parameters:
      startXRefOffset - the expected start offset of the cross reference table/stream
      Returns:
      the revised offset
      Throws:
      IOException
    • checkXRefStreamOffset

      private boolean checkXRefStreamOffset(long startXRefOffset) throws IOException
      Check if the cross reference stream can be found at the current offset.
      Parameters:
      startXRefOffset - the expected start offset of the XRef stream
      Returns:
      true if the cross reference stream was found at the given offset
      Throws:
      IOException - if something went wrong
    • calculateXRefFixedOffset

      private long calculateXRefFixedOffset(long objectOffset, boolean streamsOnly) throws IOException
      Try to find a fixed offset for the given xref table/stream.
      Parameters:
      objectOffset - the given offset where to look at
      streamsOnly - search for xref streams only
      Returns:
      the fixed offset
      Throws:
      IOException - if something went wrong
    • validateXrefOffsets

      private boolean validateXrefOffsets(Map<COSObjectKey,Long> xrefOffset) throws IOException
      Throws:
      IOException
    • checkXrefOffsets

      private void checkXrefOffsets() throws IOException
      Check the XRef table by dereferencing all objects and fixing the offset if necessary.
      Throws:
      IOException - if something went wrong.
    • findObjectKey

      private COSObjectKey findObjectKey(COSObjectKey objectKey, long offset) throws IOException
      Check if the given object can be found at the given offset. Returns the provided object key if everything is ok. If the generation number differs it will be fixed and a new object key is returned.
      Parameters:
      objectKey - the key of object we are looking for
      offset - the offset where to look
      Returns:
      returns the found/fixed object key
      Throws:
      IOException - if something went wrong
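      Checking for an object header of the form "N G obj" at a given offset can be sketched with a regular expression. The names below are hypothetical, and the method only reports the generation number found; a caller could keep the original key when it matches and build a corrected key when it differs.

```java
import java.nio.charset.StandardCharsets;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class ObjectHeaderCheck {
    // Object number, generation number and the "obj" keyword, separated by whitespace.
    private static final Pattern HEADER = Pattern.compile("(\\d{1,10})\\s+(\\d{1,5})\\s+obj");

    /**
     * Returns the generation number found in the header at offset if the header
     * belongs to object objNr, or -1 if no such header starts at that offset.
     */
    static int generationAt(byte[] file, int offset, long objNr) {
        if (offset < 0 || offset >= file.length) {
            return -1;
        }
        int len = Math.min(64, file.length - offset);
        String text = new String(file, offset, len, StandardCharsets.ISO_8859_1);
        Matcher m = HEADER.matcher(text);
        // lookingAt anchors the match at the start of the text, i.e. at offset.
        if (!m.lookingAt() || Long.parseLong(m.group(1)) != objNr) {
            return -1;
        }
        return Integer.parseInt(m.group(2));
    }
}
```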
    • bfSearchForObjects

      private void bfSearchForObjects() throws IOException
      Brute force search for every object in the pdf.
      Throws:
      IOException - if something went wrong
    • bfSearchForXRef

      private long bfSearchForXRef(long xrefOffset, boolean streamsOnly) throws IOException
      Search for the offset of the given xref table/stream among those found by a brute force search.
      Parameters:
      streamsOnly - search for xref streams only
      Returns:
      the offset of the xref entry
      Throws:
      IOException - if something went wrong
    • searchNearestValue

      private long searchNearestValue(List<Long> values, long offset)
    • bfSearchForTrailer

      private boolean bfSearchForTrailer(COSDictionary trailer) throws IOException
      Brute force search for all trailer markers.
      Throws:
      IOException - if something went wrong
    • bfSearchForLastEOFMarker

      private void bfSearchForLastEOFMarker() throws IOException
      Brute force search for the last EOF marker.
      Throws:
      IOException - if something went wrong
    • bfSearchForObjStreams

      private void bfSearchForObjStreams() throws IOException
      Brute force search for all object streams.
      Throws:
      IOException - if something went wrong
    • bfSearchForXRefTables

      private void bfSearchForXRefTables() throws IOException
      Brute force search for all xref entries (tables).
      Throws:
      IOException - if something went wrong
    • bfSearchForXRefStreams

      private void bfSearchForXRefStreams() throws IOException
      Brute force search for all /XRef entries (streams).
      Throws:
      IOException - if something went wrong
    • rebuildTrailer

      protected final COSDictionary rebuildTrailer() throws IOException
      Rebuild the trailer dictionary if startxref can't be found.
      Returns:
      the rebuilt trailer dictionary
      Throws:
      IOException - if something went wrong
    • searchForTrailerItems

      private boolean searchForTrailerItems(COSDictionary trailer) throws IOException
      Search for the different parts of the trailer dictionary.
      Parameters:
      trailer - the trailer dictionary
      Returns:
      true if the root was found, false if not.
      Throws:
      IOException
    • compareCOSObjects

      private COSObject compareCOSObjects(COSObject newObject, Long newOffset, COSObject currentObject, Long currentOffset)
    • retrieveCOSDictionary

      private COSDictionary retrieveCOSDictionary(COSObject object) throws IOException
      Throws:
      IOException
    • retrieveCOSDictionary

      private COSDictionary retrieveCOSDictionary(COSObjectKey key, long offset) throws IOException
      Throws:
      IOException
    • checkPages

      protected void checkPages(COSDictionary root)
      Check if all entries of the pages dictionary are present. Those which can't be dereferenced are removed.
      Parameters:
      root - the root dictionary of the pdf
    • checkPagesDictionary

      private int checkPagesDictionary(COSDictionary pagesDict, Set<COSObject> set)
    • isCatalog

      protected boolean isCatalog(COSDictionary dictionary)
      Tell if the dictionary is a PDF catalog. Override this for an FDF catalog.
      Parameters:
      dictionary - the dictionary to be checked
      Returns:
      true if the given dictionary is a root dictionary
    • isInfo

      private boolean isInfo(COSDictionary dictionary)
      Tell if the dictionary is an info dictionary.
      Parameters:
      dictionary - the dictionary to be checked
      Returns:
      true if the given dictionary is an info dictionary
    • parseStartXref

      private long parseStartXref() throws IOException
      This will parse the startxref section from the stream and return the startxref value.
      Returns:
      the startxref value or -1 on parsing error
      Throws:
      IOException - If an IO error occurs.
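      The contract above (the value on success, -1 on a parsing error) can be illustrated on a text fragment. This is a sketch with hypothetical names that expects the keyword at the very start of the input; it is not the parser's own tokenizer.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class StartXrefValue {
    // Keyword, whitespace, then a decimal offset that is not followed by further digits.
    private static final Pattern START_XREF = Pattern.compile("startxref\\s+(\\d{1,18})(?!\\d)");

    /** Returns the offset that follows "startxref", or -1 if the section is malformed. */
    static long parse(String text) {
        Matcher m = START_XREF.matcher(text);
        return m.lookingAt() ? Long.parseLong(m.group(1)) : -1L;
    }
}
```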
    • isString

      private boolean isString(byte[] string) throws IOException
      Checks if the given string can be found at the current offset.
      Parameters:
      string - the bytes of the string to look for
      Returns:
      true if the bytes are in place, false if not
      Throws:
      IOException - if something went wrong
    • isString

      private boolean isString(char[] string) throws IOException
      Checks if the given string can be found at the current offset.
      Parameters:
      string - the characters of the string to look for
      Returns:
      true if the bytes are in place, false if not
      Throws:
      IOException - if something went wrong
    • parseTrailer

      private boolean parseTrailer() throws IOException
      This will parse the trailer from the stream and add it to the state.
      Returns:
      false on parsing error
      Throws:
      IOException - If an IO error occurs.
    • parsePDFHeader

      protected boolean parsePDFHeader() throws IOException
      Parse the header of a pdf.
      Returns:
      true if a PDF header was found
      Throws:
      IOException - if something went wrong
    • parseFDFHeader

      protected boolean parseFDFHeader() throws IOException
      Parse the header of a fdf.
      Returns:
      true if a FDF header was found
      Throws:
      IOException - if something went wrong
    • parseHeader

      private boolean parseHeader(String headerMarker, String defaultVersion) throws IOException
      Throws:
      IOException
    • parseXrefTable

      protected boolean parseXrefTable(long startByteOffset) throws IOException
      This will parse the xref table from the stream and add it to the state. The XrefTable contents are ignored.
      Parameters:
      startByteOffset - the offset to start at
      Returns:
      false on parsing error
      Throws:
      IOException - If an IO error occurs.
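      For readers unfamiliar with the classic cross reference table layout, the sketch below parses a well-formed table given as text: the keyword "xref", then subsections that start with "first count", followed by one "offset generation n|f" entry per object. The names are hypothetical, only in-use ("n") entries are collected, and malformed input is not handled beyond the basic checks shown.

```java
import java.util.LinkedHashMap;
import java.util.Map;

final class XrefTableSketch {
    /** Maps object numbers to byte offsets for all in-use entries of a textual xref table. */
    static Map<Long, Long> parse(String table) {
        String[] lines = table.split("\\r?\\n|\\r");
        if (lines.length == 0 || !lines[0].trim().equals("xref")) {
            throw new IllegalArgumentException("table must start with 'xref'");
        }
        Map<Long, Long> offsets = new LinkedHashMap<>();
        int i = 1;
        while (i < lines.length) {
            String[] header = lines[i].trim().split("\\s+");
            if (header.length != 2) {
                break; // e.g. the "trailer" keyword ends the table
            }
            long first = Long.parseLong(header[0]);
            int count = Integer.parseInt(header[1]);
            i++;
            for (int n = 0; n < count && i < lines.length; n++, i++) {
                String[] entry = lines[i].trim().split("\\s+");
                // Entries marked "f" are free objects and carry no usable offset.
                if (entry.length == 3 && entry[2].equals("n")) {
                    offsets.put(first + n, Long.parseLong(entry[0]));
                }
            }
        }
        return offsets;
    }
}
```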
    • parseXrefStream

      private void parseXrefStream(COSStream stream, long objByteOffset, boolean isStandalone) throws IOException
      Fills the XrefTrailerResolver with the data of the given stream. The stream must be of type XRef.
      Parameters:
      stream - the stream to be read
      objByteOffset - the offset to start at
      isStandalone - should be set to true if the stream is not part of a hybrid xref table
      Throws:
      IOException - if there is an error parsing the stream
    • getDocument

      public COSDocument getDocument() throws IOException
      This will get the document that was parsed. The document must be parsed before this is called. When you are done with this document you must call close() on it to release resources.
      Returns:
      The document that was parsed.
      Throws:
      IOException - If there is an error getting the document.
    • getEncryption

      public PDEncryption getEncryption() throws IOException
      This will get the encryption dictionary. The document must be parsed before this is called.
      Returns:
      The encryption dictionary of the document that was parsed.
      Throws:
      IOException - If there is an error getting the document.
    • getAccessPermission

      public AccessPermission getAccessPermission() throws IOException
      This will get the AccessPermission. The document must be parsed before this is called.
      Returns:
      The access permission of document that was parsed.
      Throws:
      IOException - If there is an error getting the document.
    • parseTrailerValuesDynamically

      protected COSBase parseTrailerValuesDynamically(COSDictionary trailer) throws IOException
      Parse the values of the trailer dictionary and return the root object.
      Parameters:
      trailer - The trailer dictionary.
      Returns:
      The parsed root object.
      Throws:
      IOException - If an IO error occurs or if the root object is missing in the trailer dictionary.
    • prepareDecryption

      private void prepareDecryption() throws IOException
      Prepare for decryption.
      Throws:
      InvalidPasswordException - If the password is incorrect.
      IOException - if something went wrong
    • parseDictionaryRecursive

      private void parseDictionaryRecursive(COSObject dictionaryObject) throws IOException
      Recursively resolves all objects of a dictionary that have not been parsed yet.
      Parameters:
      dictionaryObject - dictionary to be parsed
      Throws:
      IOException - if something went wrong