Understanding and Reconstructing PDFs
In this post, I’ll talk about the structure of PDFs and how I solved an engineering problem related to them.
So, coming straight to the point.
At LiveHealth, the lab software (a web application) has an option that lets users attach a PDF to an existing lab report. The result is a PDF version of the lab report merged with the attached report. The web application runs on Django 1.x.x and uses Python 2.x.x.
The Problem
While working with PDFs, the PyPDF2 library raised an AssertionError exception. The reason, as far as I could tell, was that PyPDF2 doesn’t support PDF files of version > 1.3, while the user was trying to upload a PDF file of version 1.5. The obvious solution was to switch to a library that supports the newer PDF versions. The best alternative I found was the pikepdf library, which supports PDF versions 1.3–1.7, but it only supports Python 3 (3.6 and newer) and, as mentioned, we were on Python 2.x.x.
AssertionError
In Python, an AssertionError is raised whenever the condition following the assert keyword evaluates to false. In this case, it came from generic.py in the PyPDF2 library:
assert "/Length" in data
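As a quick illustration of how assert behaves, here is a minimal sketch. The function read_stream_length and the plain dict it receives are hypothetical stand-ins, not PyPDF2’s actual code; only the assert line mirrors the check quoted above.

```python
def read_stream_length(data):
    # Hypothetical stand-in: "data" is a plain dict playing the role of
    # the stream dictionary that the library inspects.
    assert "/Length" in data
    return data["/Length"]


try:
    read_stream_length({"/Type": "/XObject"})  # no /Length key
except AssertionError:
    outcome = "AssertionError raised"
else:
    outcome = "no error"

print(outcome)
```

A dictionary that does contain the key, such as {"/Length": 434}, passes the check and the function simply returns 434.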
To understand why this occurred, we first need to understand the structure of a PDF file, which is the informative part of this post.
Understanding PDF Format
A PDF file is made up of a collection of objects, a table known as the xref (cross-reference) table, and a pointer that gives the starting position of the xref table.
Let’s understand this through an example.
If we tail the contents of any PDF, the output will look something like this:
<<
/Size 36
/Root 35 0 R
/Info 34 0 R
/ID [<7B5D1B00ED1D55440A5C21D656836736><7B5D1B00ED1D55440A5C21D656836736>]
>>
startxref
49556
%%EOF
Here, /Size 36 is the number of entries in the xref table, which is one more than the highest object number in the file. The Info object holds metadata about the PDF file, e.g. Title, Author, Creation Date, etc. The number after the startxref keyword is the starting position of the xref table, so byte offset 49556 in the file, when opened in binary mode, is where the xref table begins. The interpreter scans the end of the binary data for the startxref keyword and reads the number on the next line.
xref
0 36
0000000000 65535 f
0000048904 00000 n
0000000010 00000 n
0000004069 00000 n
0000005827 00000 n
0000009675 00000 n
0000014503 00000 n
0000024041 00000 n
0000026784 00000 n
0000029247 00000 n
...
0000049389 00000 n
Once the interpreter has the starting position of the xref table, it seeks to that location, i.e. byte offset 49556 in the file.
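The lookup described above can be sketched in a few lines. This is a minimal illustration, assuming the whole file is already in memory as bytes; the function name find_startxref and the 1024-byte tail window are my own choices, not something taken from PyPDF2.

```python
import re


def find_startxref(pdf_bytes):
    """Return the byte offset written after the last startxref keyword."""
    # The keyword sits near the end of the file, so only the tail is scanned.
    tail = pdf_bytes[-1024:]
    matches = re.findall(br"startxref\s+(\d+)", tail)
    if not matches:
        raise ValueError("startxref keyword not found")
    # A file with incremental updates can contain several; the last one wins.
    return int(matches[-1])
```

With the offset in hand, reading the table is a matter of slicing the bytes from that position, or calling seek() with it on an open file.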
xref table
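The xref table consists of two main parts:
- The first line after the xref keyword gives the first object number and the number of entries in the table (0 36 here)
- Each following line gives the byte offset of one object, its generation number, and a flag: n for an object in use, f for a free entry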
For example, if we take the entry for object 2 from the xref table (the third line, since entries are numbered from 0), which is
0000000010 00000 n
and jump to that byte offset (10) in the file, we have
2 0 obj
<< /Type /XObject
/Subtype /Image
...
/Interpolate true
>>
Here, 2 is the object number and 0 is its generation number.
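To make the lookup concrete, here is a rough sketch that reads a classic xref table and returns the offset of every object in use. It assumes a single subsection (one "first count" line) and a plain-text table rather than a compressed xref stream; parse_xref is a hypothetical helper name.

```python
def parse_xref(pdf_bytes, xref_offset):
    """Return {object_number: byte_offset} for the in-use entries."""
    lines = pdf_bytes[xref_offset:].splitlines()
    if lines[0].strip() != b"xref":
        raise ValueError("no xref keyword at the given offset")
    # Subsection header: first object number and number of entries.
    first, count = (int(part) for part in lines[1].split())
    offsets = {}
    for index, line in enumerate(lines[2:2 + count]):
        offset, generation, flag = line.split()
        if flag == b"n":  # "n" marks an object in use, "f" a free entry
            offsets[first + index] = int(offset)
    return offsets
```

Slicing the file at offsets[2] should then land exactly on the 2 0 obj line, as in the example above.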
Object in Pdf File
A stream object like this one is divided into two parts:
- A dictionary holding the metadata of the object
- The actual data of the object, stored as a stream
Object format:
2 0 obj
<< /Type /XObject
/Subtype /Image
/Width 172
/Height 26
/ColorSpace /DeviceRGB
/BitsPerComponent 8
/Filter /DCTDecode
/Interpolate true
/Length 434
>>
stream
------ Actual data of an object -----
endstream
endobj
Most of the object properties here are self-explanatory; the most important one is /Length. After reading the length, the PDF interpreter expects the number of bytes between the stream and endstream keywords to be exactly 434 in the above example. Once it encounters the endobj keyword, the interpreter picks the next object from the xref table and continues until all the objects are rendered.
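The relationship between /Length and the stream data can be checked with a short sketch. It assumes /Length is written as a direct number (not an indirect reference such as 12 0 R) and that the stream data does not itself contain the endstream keyword; check_stream_length is a hypothetical name.

```python
import re

STREAM_RE = re.compile(br"stream\r?\n(.*?)\r?\nendstream", re.DOTALL)
LENGTH_RE = re.compile(br"/Length\s+(\d+)")


def check_stream_length(obj_bytes):
    """Return (declared, actual) lengths; declared is None if /Length is absent."""
    match = STREAM_RE.search(obj_bytes)
    if match is None:
        return None  # not a stream object
    dictionary = obj_bytes[:match.start()]
    declared = LENGTH_RE.search(dictionary)
    actual = len(match.group(1))
    return (int(declared.group(1)) if declared else None, actual)
```

A result such as (434, 434) means the object is consistent, while (None, 434) describes exactly the kind of object that trips the assertion discussed in this post.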
Now, going back to the original problem, the AssertionError. As you may have guessed, the condition following the assert keyword, i.e. "/Length" in data, evaluated to false because the /Length property was missing from the metadata of an object.
Solution
The solution I came up with was the hard way to solve this problem; nevertheless, it worked flawlessly.
What I did in code was reconstruct the PDF file in the following steps:
- Create an empty IO object to write the new PDF file to
buffer = io.BytesIO()
- Open the PDF file to be reconstructed in binary mode
pdf = open("file.pdf", "rb")
- Iterate over the binary data and identify each object’s metadata using its identifier, i.e. n 0 obj, and its actual data using the stream and endstream keywords. Once the length of the stream data is calculated, append it to the object’s metadata, e.g. /Length new_calculated_length
- Once this step is done for all objects, iterate over the newly written IO object and keep track of the starting position of every object. After iterating over all the objects, rebuild the xref table with their new starting positions
- While writing the new xref table, keep track of the position where it is being written, and write that position after the startxref keyword to indicate the starting position of the new xref table
- Finally, write this IO buffer object to a file to get the reconstructed PDF
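The steps above can be sketched as follows. This is a simplified illustration rather than the code in the repository: it does everything in a single pass, the regular expressions and the reconstruct name are my own assumptions, it assumes stream data never contains the endstream or endobj keywords, and the new trailer keeps only /Size and /Root.

```python
import io
import re

OBJ_RE = re.compile(br"(\d+)\s+(\d+)\s+obj\b(.*?)\bendobj", re.DOTALL)
STREAM_RE = re.compile(br"(.*?)stream\r?\n(.*?)\r?\nendstream", re.DOTALL)
LENGTH_RE = re.compile(br"/Length\b")
ROOT_RE = re.compile(br"/Root\s+(\d+\s+\d+\s+R)")


def reconstruct(pdf_bytes):
    """Return pdf_bytes rewritten with /Length on every stream and a new xref table."""
    buffer = io.BytesIO()
    buffer.write(pdf_bytes.splitlines()[0] + b"\n")  # keep the %PDF-x.y header
    offsets = {}  # object number -> (byte offset, generation)

    for match in OBJ_RE.finditer(pdf_bytes):
        number, generation = int(match.group(1)), int(match.group(2))
        body = match.group(3)
        stream = STREAM_RE.match(body)
        if stream and not LENGTH_RE.search(stream.group(1)):
            dictionary, data = stream.group(1).strip(), stream.group(2)
            length = ("/Length %d " % len(data)).encode("ascii")
            # Put the computed /Length just before the closing ">>".
            dictionary = dictionary[:-2] + length + b">>"
            body = b"\n" + dictionary + b"\nstream\n" + data + b"\nendstream\n"
        # Record where this object starts in the new file.
        offsets[number] = (buffer.tell(), generation)
        buffer.write(("%d %d obj" % (number, generation)).encode("ascii"))
        buffer.write(body + b"endobj\n")

    xref_position = buffer.tell()
    size = max(offsets) + 1
    buffer.write(("xref\n0 %d\n" % size).encode("ascii"))
    buffer.write(b"0000000000 65535 f \n")
    for number in range(1, size):
        if number in offsets:
            entry = "%010d %05d n \n" % offsets[number]
        else:
            # Simplification: unused object numbers become bare free entries.
            entry = "0000000000 00000 f \n"
        buffer.write(entry.encode("ascii"))

    root = ROOT_RE.findall(pdf_bytes)[-1]  # reuse the original /Root reference
    size_text = str(size).encode("ascii")
    buffer.write(b"trailer\n<< /Size " + size_text + b" /Root " + root + b" >>\n")
    buffer.write(("startxref\n%d\n%%%%EOF\n" % xref_position).encode("ascii"))
    return buffer.getvalue()
```

The returned bytes can be written to disk with an ordinary open("reconstructed.pdf", "wb") call; the output file name here is only an example.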
Here’s the link to the solution/code https://github.com/shobhit99/ReconstructPdf