Understanding and Reconstructing PDFs

Shobhit Bhosure
Mar 9, 2021

In this post, I’ll talk about the structure of PDFs and how I solved an engineering problem related to them.

So, coming straight to the point.

At LiveHealth, the lab software (a web application) comes with an option that lets users attach a PDF to an existing lab report. The result is a PDF version of the lab report merged with the attached file. The web application runs on Django 1.x.x and uses Python 2.x.x.

The Problem

While merging one such PDF, the PyPDF2 library raised an AssertionError. The reason was that PyPDF2 doesn’t support PDF files of version > 1.3, while the user was trying to upload a PDF of version 1.5. The obvious solution was to switch to a library that supports the highest PDF versions. The best alternative I found was pikepdf, which supports PDF versions 1.3 to 1.7, but it only supports Python > 3.6 and, as mentioned, we were on Python 2.x.x.

AssertionError

In Python, an AssertionError is raised whenever the condition following the assert keyword evaluates to false. In this case it came from generic.py in the PyPDF2 library:

assert "/Length" in data
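To make the failure mode concrete, here is a minimal sketch of how such an assert behaves. The function read_stream_length and the plain dict standing in for a parsed object dictionary are hypothetical names for illustration, not PyPDF2's actual code.

```python
def read_stream_length(data):
    # `data` stands in for a parsed object dictionary (hypothetical helper).
    # The assert raises AssertionError when the key is missing.
    assert "/Length" in data
    return data["/Length"]


try:
    read_stream_length({"/Type": "/XObject"})  # no /Length key
    outcome = "no error"
except AssertionError:
    outcome = "AssertionError"

print(outcome)
```

Note that Python skips assert statements entirely when run with the -O flag, so this check only fires in normal mode.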

To understand why this occurred, we first need to understand the structure of a PDF file, which is the informative part of this post.

Understanding PDF Format

A PDF file is made up of a collection of objects, a table known as the xref (cross-reference) table, and a pointer that gives the starting position of the xref table.
Let's look at an example.
If we tail the contents of a PDF, the output will look something like this:

<<
/Size 36
/Root 35 0 R
/Info 34 0 R
/ID [<7B5D1B00ED1D55440A5C21D656836736><7B5D1B00ED1D55440A5C21D656836736>]
>>
startxref
49556
%%EOF

Here, /Size 36 is the number of entries in the xref table, i.e. the objects in the file plus the special free entry 0. /Root points to the document catalog, and the /Info object holds metadata about the PDF, e.g. Title, Author, Creation Date. The number after the startxref keyword is the byte offset of the xref table, so seeking to byte 49556 of the file, opened in binary mode, lands on the xref table. The interpreter therefore looks for the startxref keyword near the end of the file and reads the number on the line after it.
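The lookup described above can be sketched in a few lines of Python 3. This is only an illustration: find_startxref and its tail_size parameter are made-up names, and the sample bytes are a shortened copy of the trailer shown earlier.

```python
import re


def find_startxref(pdf_bytes, tail_size=1024):
    """Return the byte offset written after the last startxref keyword."""
    # Only the end of the file needs to be inspected.
    tail = pdf_bytes[-tail_size:]
    matches = re.findall(rb"startxref\s+(\d+)\s+%%EOF", tail)
    if not matches:
        raise ValueError("startxref not found near the end of the file")
    return int(matches[-1])


# Shortened copy of the trailer from the example above.
sample_tail = (
    b"<<\n/Size 36\n/Root 35 0 R\n/Info 34 0 R\n>>\n"
    b"startxref\n49556\n%%EOF\n"
)
offset = find_startxref(sample_tail)
print(offset)
```

With a real file you would pass the bytes read from open("file.pdf", "rb") and then seek to the returned offset.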

xref
0 36
0000000000 65535 f
0000048904 00000 n
0000000010 00000 n
0000004069 00000 n
0000005827 00000 n
0000009675 00000 n
0000014503 00000 n
0000024041 00000 n
0000026784 00000 n
0000029247 00000 n
...
0000049389 00000 n

Once the interpreter has the starting position of the xref table, it seeks to that location, i.e. byte offset 49556 in the file, and finds the table shown above.

xref table

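The xref table consists of two main parts:

  • A header line (0 36 above) giving the first object number and the number of entries that follow
  • One entry per object giving its byte offset in the file, its generation number, and a flag: n for in use, f for free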

For example, take the entry for object 2 from the xref table (the third entry, since numbering starts at 0), which is

0000000010 00000 n

and jump to that byte offset (10) in the file. There we have

2 0 obj
<< /Type /XObject
/Subtype /Image
...
/Interpolate true
>>

Here 2 is the object number and 0 is its generation number.
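As a rough sketch of how the table maps object numbers to byte offsets, the snippet below parses a trimmed copy of the table above (only the first four entries, so the header reads 0 4 instead of 0 36). The function name parse_xref_section is hypothetical, and the sketch assumes a single subsection.

```python
def parse_xref_section(section):
    """Map object number -> (byte offset, generation, in_use)."""
    lines = section.strip().splitlines()
    if lines[0].strip() != "xref":
        raise ValueError("not an xref section")
    # Header line: first object number, then the number of entries.
    first_obj, count = (int(part) for part in lines[1].split())
    entries = {}
    for index, line in enumerate(lines[2:2 + count]):
        offset, generation, flag = line.split()
        entries[first_obj + index] = (int(offset), int(generation), flag == "n")
    return entries


# Trimmed copy of the table shown earlier (4 entries instead of 36).
sample = """xref
0 4
0000000000 65535 f
0000048904 00000 n
0000000010 00000 n
0000004069 00000 n
"""
table = parse_xref_section(sample)
offset_of_object_2 = table[2][0]
print(offset_of_object_2)
```

Seeking to offset_of_object_2 in the file is what lands on the 2 0 obj line shown above.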

Object in a PDF File

An object like this one (a stream object) is divided into two parts:

  • A dictionary holding the metadata of the object
  • The actual data of the object

Object format:

2 0 obj
<< /Type /XObject
/Subtype /Image
/Width 172
/Height 26
/ColorSpace /DeviceRGB
/BitsPerComponent 8
/Filter /DCTDecode
/Interpolate true
/Length 434
>>
stream
------ Actual data of an object -----
endstream
endobj

Most of the object properties here are self-explanatory; the most important one is /Length. After reading it, the PDF interpreter expects the number of bytes between the stream and endstream keywords to be exactly 434 in the example above. Once it encounters the endobj keyword, the interpreter picks the next object from the xref table and continues until all the objects are processed.
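The /Length rule can be checked with a small Python 3 sketch. It assumes /Length is written as a direct integer (not an indirect reference such as 5 0 R) and that the stream data itself does not contain the endstream keyword. The name stream_length_matches and the sample objects are made up for illustration.

```python
import re


def stream_length_matches(obj_bytes):
    """Compare the declared /Length with the bytes between stream and endstream."""
    declared = re.search(rb"/Length\s+(\d+)", obj_bytes)
    body = re.search(rb"stream\r?\n(.*?)\r?\nendstream", obj_bytes, re.DOTALL)
    if declared is None or body is None:
        # Missing /Length (the case from this post) or no stream at all.
        return False
    return int(declared.group(1)) == len(body.group(1))


good = b"2 0 obj\n<< /Length 5 >>\nstream\nhello\nendstream\nendobj\n"
missing = b"2 0 obj\n<< /Type /XObject >>\nstream\nhello\nendstream\nendobj\n"

print(stream_length_matches(good))
print(stream_length_matches(missing))
```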

Now, going back to the original problem, the AssertionError. As you may have guessed, the condition following the assert keyword, i.e. "/Length" in data, failed because the /Length property was missing from the metadata of an object.

Solution

The solution I came up with was the hard way to solve this problem. Nevertheless, it worked flawlessly.

What the code does is reconstruct the PDF file in the following steps:

  • Create an empty IO object to write the new PDF file to

import io
buffer = io.BytesIO()

  • Open the PDF file to be reconstructed in binary mode

pdf = open("file.pdf", "rb")

  • Iterate over the binary data and identify each object's metadata by its identifier, i.e. n 0 obj, and its actual data by the stream and endstream keywords. Once the length of the stream data has been calculated, append it to the object's metadata

e.g. /Length new_calculated_length

  • Once this is done for all objects, iterate over the newly written IO object once more, keeping track of the starting position of every object. After all the objects have been visited, rebuild the xref table with their new starting positions
  • While writing the new xref table, store the position where it begins in a variable, then write that position after the startxref keyword to indicate where the new xref table starts
  • Finally, write the IO buffer to a file to get the reconstructed PDF.
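The steps above can be sketched roughly as follows. This is a simplified Python 3 illustration, not the code from the repository. It assumes a single-revision PDF with a classic trailer, uncompressed objects, and stream data that never contains the obj, endobj or endstream keywords. It only adds /Length where it is missing, and it records offsets while writing instead of making a second pass. The names rebuild_object and reconstruct are hypothetical.

```python
import io
import re

OBJ_RE = re.compile(rb"(\d+) (\d+) obj\b(.*?)endobj", re.DOTALL)
STREAM_RE = re.compile(rb"(.*?)>>\s*stream\r?\n(.*)\r?\nendstream", re.DOTALL)


def rebuild_object(number, generation, body):
    """Return one object as bytes, adding /Length when a stream lacks it."""
    match = STREAM_RE.match(body)
    if match and b"/Length" not in match.group(1):
        dictionary, data = match.groups()
        body = b"%s/Length %d\n>>\nstream\n%s\nendstream\n" % (
            dictionary, len(data), data)
    return b"%d %d obj%sendobj\n" % (number, generation, body)


def reconstruct(pdf_bytes):
    """Rewrite all objects into a new buffer and append a fresh xref table."""
    trailer = re.search(rb"trailer\s*(<<.*?>>)\s*startxref", pdf_bytes, re.DOTALL)
    if trailer is None:
        raise ValueError("classic trailer not found")
    buffer = io.BytesIO()
    buffer.write(pdf_bytes.split(b"\n", 1)[0] + b"\n")  # keep the header line
    offsets = {}
    for match in OBJ_RE.finditer(pdf_bytes):
        number, generation = int(match.group(1)), int(match.group(2))
        offsets[number] = (buffer.tell(), generation)  # new starting position
        buffer.write(rebuild_object(number, generation, match.group(3)))
    if not offsets:
        raise ValueError("no objects found")
    xref_position = buffer.tell()
    size = max(offsets) + 1
    buffer.write(b"xref\n0 %d\n" % size)
    buffer.write(b"0000000000 65535 f \n")
    for number in range(1, size):
        offset, generation = offsets.get(number, (0, 0))
        flag = b"n" if number in offsets else b"f"
        buffer.write(b"%010d %05d %s \n" % (offset, generation, flag))
    buffer.write(b"trailer\n%s\nstartxref\n%d\n%%%%EOF\n"
                 % (trailer.group(1), xref_position))
    return buffer.getvalue()


# Tiny made-up file whose stream object is missing /Length.
sample = (
    b"%PDF-1.5\n"
    b"1 0 obj\n<< /Type /Catalog >>\nendobj\n"
    b"2 0 obj\n<< /Type /XObject >>\nstream\nhello\nendstream\nendobj\n"
    b"xref\n0 3\n0000000000 65535 f \n0000000009 00000 n \n0000000045 00000 n \n"
    b"trailer\n<< /Size 3 /Root 1 0 R >>\nstartxref\n100\n%%EOF\n"
)
rebuilt = reconstruct(sample)
```

The returned bytes can then be written to a file opened with mode "wb". The original trailer dictionary is reused unchanged, which assumes its /Size already matches the rebuilt table.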

Here’s the link to the solution/code: https://github.com/shobhit99/ReconstructPdf
