How to generate large PDF files using WKHTMLTOPDF library

Ronas IT
9 min read · Feb 24, 2023

On one of our projects, our team of backend developers was asked to build a task tracker for employees of large enterprises. One of the required features was letting users generate PDF files for an arbitrary amount of data. The generated PDFs served as progress reports and were expected to contain both text and images. In this article, our middle backend developer Konstantin explains how he and his team built a feature that generates large PDF files using specific libraries, and how they made the process user-friendly. The article will be especially useful for junior and middle backend developers who want to deepen their knowledge of backend solutions and learn more about the challenges they may face in their work.

Details of the project

We built the project’s backend with Django and PostgreSQL, while the front end was written in Angular. The server side acted as an API, so we also decided to use Django REST Framework. In addition, we used expandable fields to fetch entities associated with a resource.

The PDF reports had to combine text and images. Each task in a report had a change log and could be updated with new data up to several dozen times. Because users were allowed to filter records, a single PDF file could contain several thousand tasks.

WKHTMLTOPDF library is a key solution

We started by looking for a ready-made solution for generating PDF files. The WKHTMLTOPDF library seemed to be the most suitable option. It is really popular, judging by GitHub, and quite performant because it is written in C++ and doesn’t require a lot of resources. To make the library more convenient to use from Django, we added the Django WKHTMLTOPDF adapter.

The initial setup was quite simple. First of all, we had to add the installation of the WKHTMLTOPDF package to the Dockerfile.

Then we installed Django-WKHTMLTOPDF, following the README file.
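For reference, a setup of this kind usually comes down to a few lines in the Django settings module. The fragment below is only a sketch of what such a configuration might look like. The app list, the binary path, and the option values are assumptions made for illustration, not the project's actual settings.

```python
# settings.py (fragment): illustrative values, not the project's real configuration

INSTALLED_APPS = [
    "django.contrib.contenttypes",
    "django.contrib.staticfiles",
    "rest_framework",
    "wkhtmltopdf",  # the django-wkhtmltopdf app
]

# Path to the wkhtmltopdf binary inside the Docker image (assumed location).
WKHTMLTOPDF_CMD = "/usr/local/bin/wkhtmltopdf"

# Default command-line options applied to every wkhtmltopdf call.
WKHTMLTOPDF_CMD_OPTIONS = {
    "quiet": True,
    "page-size": "A4",
    "encoding": "UTF-8",
}
```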

PDF files were generated for a particular entity, and we also had to take into account the filters selected by the user. To cope with this, our team had to dive into the source code of the Django WKHTMLTOPDF library. There we found the PDFTemplateResponse response type, which generates a PDF from a specific template. That is what we used, returning it from a viewset method for tasks.
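The response described above can be sketched as a small helper. This is a minimal illustration, not the project's code: the helper name, the template path, the file name, and the margin option are hypothetical. In a DRF viewset, a helper like this would typically be called from an extra action with the result of self.filter_queryset(self.get_queryset()), so that the user's filters are respected.

```python
def build_pdf_response(request, tasks, filename="tasks-report.pdf"):
    """Render the given tasks into a downloadable PDF using a Django template."""
    # Imported lazily so this module can be loaded without Django configured.
    from wkhtmltopdf.views import PDFTemplateResponse

    return PDFTemplateResponse(
        request=request,
        template="reports/tasks.html",  # hypothetical template path
        filename=filename,
        context={"tasks": tasks},
        show_content_in_browser=False,  # send the file as an attachment
        cmd_options={"margin-top": 10},  # illustrative wkhtmltopdf option
    )
```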

We also needed some specific fields in the reports, and not all of them were directly related to the entity. For example, we needed to know who had made the last update to a task. This information wasn’t stored in the entity itself; it was kept in the change log. To solve the problem, we created a separate TaskReportSerializer serializer based on FlexFieldsSerializer. SerializerMethodField in combination with expandable fields helped us with this.
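The "last updated by" lookup can be illustrated without any framework. The sketch below assumes that each change-log entry has an author and a created_at timestamp; both field names are hypothetical. In a DRF serializer, logic like this would live in the get_<field_name> method that backs a SerializerMethodField.

```python
from datetime import datetime


def last_updated_by(change_log):
    """Return the author of the most recent change-log entry, or None."""
    if not change_log:
        return None
    latest = max(change_log, key=lambda entry: entry["created_at"])
    return latest["author"]


# Example change log; the entries are deliberately not in chronological order.
change_log = [
    {"author": "alice", "created_at": datetime(2023, 1, 10, 9, 0)},
    {"author": "bob", "created_at": datetime(2023, 1, 12, 14, 30)},
    {"author": "carol", "created_at": datetime(2023, 1, 11, 8, 15)},
]
print(last_updated_by(change_log))  # bob
```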

At the same time, we had to remember to apply record filtering.

As a result, users could open the page with tasks, filter them, and export them to a PDF file, receiving the completed file within a few seconds. However, this worked well only for a limited number of tasks, up to several hundred. With more tasks, users had to wait longer.

The thing is that there could be several thousand tasks, and we had to keep that in mind. What did this mean for users? First of all, they had to leave the task page open until the report was generated. This process could take a long time: minutes or even hours. If the user closed the task page, went to another page, or a network failure occurred, the PDF generation would stop. No matter how long the user had already waited, the process had to be started again.

The PDF generation was too slow

The main cause was the browser, which slowed the process down significantly. If the browser were taken out of the loop, the reports could be generated faster. One possible solution was to make PDF generation a background process and send the completed file by email once it was finished. That way, users wouldn’t need to wait for the PDF file to be generated. They would just receive a notification that the report would be sent to their email. Once the customer approved the idea, we started the implementation, which involved the following steps.

First, we shortened the code in the viewset method that was responsible for PDF generation.

Then we created a job and moved all the generation logic there.

Let’s figure out what changed. send_report is a job for delayed processing, created with the @job decorator from the Django RQ library. We put these jobs in a separate queue called reports, which let us manage resources specifically for it. There was nothing special in the queue setup, as everything was configured according to the library documentation. We had to keep in mind that there could be a lot of tasks, so file generation could take hours. At first, we estimated that the longest PDF generation would take 5 hours. Then we set a time limit of 10 hours for the job, which gave us a safety margin.
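As a rough sketch of the job registration described above: RQ timeouts are expressed in seconds, so the 10-hour limit becomes 36000. The function bodies and the register_report_job helper are hypothetical. The helper is equivalent to decorating the function with @job("reports", timeout=REPORT_JOB_TIMEOUT), and it assumes that a queue named reports is declared in the RQ_QUEUES setting.

```python
# 10 hours in seconds: twice the 5-hour worst-case estimate.
REPORT_JOB_TIMEOUT = 10 * 60 * 60


def send_report(user_email, filters):
    """Placeholder job body: generate the PDF, store it, and email the link."""


def register_report_job(func):
    """Wrap func as a django-rq job on the 'reports' queue."""
    # Imported lazily: django_rq needs configured Django settings.
    from django_rq import job

    return job("reports", timeout=REPORT_JOB_TIMEOUT)(func)
```

Once registered, the wrapped function gets a delay() method, which the viewset can call to enqueue the job instead of generating the PDF inside the request.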

render_pdf_from_template is a function responsible for generating PDF files from a template. It returned a completed file, which we wrapped in SimpleUploadedFile because MediaSerializer required it. We won’t go into detail about MediaSerializer, as it isn’t important here: it saved the file to the file storage. We only needed the absolute path of the file to build a download link for the report.

ReportCreated() is an email class from the project’s own code for working with emails. The names of its functions explain how they work.

From that point on, users followed these steps to get a report. First, they went to the task page, filtered the tasks, and started PDF generation. Then they received a notification that generation had started and that the completed file would be sent to their email. After that, users could visit other pages of the site, request more reports, or even close the browser. These changes let us focus on our primary problem: the speed of file generation.

How did we speed up PDF generation?

We needed to figure out how to speed up the PDF generation process.

“This task was a real challenge. The huge amount of data made us thoroughly analyze each stage of PDF generation. We did this to make the process clear and user-friendly.”

Konstantin, backend developer

One solution was to generate the files in parallel. Let’s explore how we did it step by step.

First, we defined the destination file name. We also created a name prefix shared by the small PDF files that would later be merged into one larger file.

Then we initialized the limit and offset for the queryset.
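To illustrate the limit and offset bookkeeping, here is a minimal sketch with a hypothetical helper name. With a Django queryset, each pair would be applied as queryset[offset:offset + limit], which Django translates into SQL LIMIT and OFFSET.

```python
def iter_batches(total, limit):
    """Yield (offset, size) pairs that cover `total` rows in order."""
    for offset in range(0, total, limit):
        yield offset, min(limit, total - offset)


print(list(iter_batches(total=25, limit=10)))
# [(0, 10), (10, 10), (20, 5)]
```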

We also used Azure on this project. To work with it, we installed django-storages, which relies on the Azure Storage library for Python. Then we used the service factory to create a block blob service. Azure distinguishes block, page, and append blobs; block blobs are the type intended for storing files such as our PDFs.

We didn’t want the system to spawn PDF generation threads without any control. To limit them, we created a fixed-size queue in which we kept the running threads.

Then we fetched a batch of tasks.

We created a thread that would generate a PDF for these tasks.

Then we put the thread in the queue.

And finally, we launched it.

As soon as the queue was full, we waited until all the threads had finished.

We repeated this cycle until all the requested tasks were processed.

Sometimes there were not enough tasks to create 20 threads and fill the queue. In that case, the check described above would never trigger, and the script would carry on although the files had not been generated yet, which would lead to an error. Therefore, at the end we also checked whether any threads were left in the queue and waited until all of them had completed their work.
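The cycle described above, including the final wait for a partially filled queue, can be sketched with the standard library alone. This is an illustration under assumptions, not the project's code: generate_file here is a stand-in that writes bytes into an in-memory dict instead of rendering a PDF and uploading it to Azure, and all names except the 20-thread limit are hypothetical.

```python
import queue
import threading

MAX_THREADS = 20  # size of the fixed queue that holds running threads


def generate_file(storage, prefix, offset, tasks):
    """Stand-in worker: 'render' a chunk and store it under prefix + offset."""
    storage[f"{prefix}{offset}"] = ",".join(str(t) for t in tasks).encode()


def wait_for_all(threads):
    """Join and remove every thread currently held in the queue."""
    while not threads.empty():
        threads.get().join()


def generate_chunks(tasks, limit, prefix, storage):
    threads = queue.Queue(maxsize=MAX_THREADS)
    for offset in range(0, len(tasks), limit):
        chunk = tasks[offset:offset + limit]  # next batch of tasks
        worker = threading.Thread(
            target=generate_file, args=(storage, prefix, offset, chunk)
        )
        threads.put(worker)
        worker.start()
        if threads.full():
            wait_for_all(threads)  # queue is full: let the whole batch finish
    wait_for_all(threads)  # the last batch may not have filled the queue


storage = {}
generate_chunks(list(range(95)), limit=2, prefix="report-42-", storage=storage)
print(len(storage))  # 48
```

Without the final wait_for_all call, the last eight threads in this example would still be running when the function returned, which is exactly the error case described above.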

Next, let’s look at the generate_file function, which generates the PDF files. The following part deserves an explanation.

We created a blob from the generated PDF. The name of the created file mattered, because we later used it to assemble the final PDF.

Let’s have a look at how we created the final file.

We found the required list of files by their prefix. The list contained only metadata about the stored objects, not the files themselves. The order of files was important to us, so we sorted them by the numerical suffix in their names.

This way, the files were arranged in the same order in which the tasks had been retrieved during generation. Users could set the sort order when sending the request, and we needed to preserve it.
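Sorting by the numerical suffix matters because a plain string sort would put "report-42-100" before "report-42-20". A minimal sketch, assuming file names of the form prefix plus an integer offset (the naming scheme and the helper name are hypothetical):

```python
def chunk_order(blob_name, prefix):
    """Return the integer suffix that follows `prefix` in a chunk file name."""
    return int(blob_name[len(prefix):])


prefix = "report-42-"
names = ["report-42-100", "report-42-0", "report-42-20", "report-42-1000"]

ordered = sorted(names, key=lambda name: chunk_order(name, prefix))
print(ordered)
# ['report-42-0', 'report-42-20', 'report-42-100', 'report-42-1000']
```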

We produced the final file by merging the small PDF files. We decided to use the PyPDF2 library for that. Although the library was quite old and had not been updated for a long time, it coped well with the task.

Then we created a merger object.

At this stage, the file was ready to be added to the parent object.

The downloaded file then had to be deleted immediately to free up space.

After that, we created the final PDF and saved the merged file in one stream.
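The merge steps can be sketched as follows. This is an assumption-laden illustration: it works on in-memory bytes rather than on files downloaded from blob storage, and it uses the PdfMerger class from PyPDF2 2.x and 3.x (older releases expose the same functionality as PdfFileMerger).

```python
import io


def merge_pdfs(chunks):
    """Merge PDF chunks (bytes, already in their final order) into one PDF."""
    # Imported lazily so the sketch can be loaded without PyPDF2 installed.
    from PyPDF2 import PdfMerger

    merger = PdfMerger()
    for chunk in chunks:
        merger.append(io.BytesIO(chunk))  # add one small PDF to the parent object
    output = io.BytesIO()
    merger.write(output)  # write the merged document to a single stream
    merger.close()
    return output.getvalue()
```

Because the chunks are appended one after another, this step is sequential no matter how many threads produced the small files.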

Finally, we turned the result into an object that was convenient for us to use.

As a result, PDF generation time was significantly reduced. By the end of the project, we understood that generating small PDF files in parallel is quite fast, but merging them synchronously slows the process down.

To sum up, we have to admit that the solution we created wasn’t perfect, for several reasons. First, each time the process was interrupted, it had to be started from scratch. Tasks could be updated in the meantime, so the files already saved in the cloud could contain outdated information. Second, tasks were loaded from the database not in a single request but in parts. This means the PDF files didn’t reflect the state of the tasks at one specific moment; they showed information collected over a period of time. That is another reason why the files could contain outdated information.

Although we thought we would only need to install the Django-WKHTMLTOPDF library, the solution required some additional pieces. We ended up implementing multi-threading and queues in Python, blob storage in Azure, and PDF merging with PyPDF2. We were really glad to complete this challenging task, as we gained a lot of valuable experience.
