PDF to JSON – Trick to convert PDF documents to JSON format

PDF and Its Disadvantages

Portable Document Format (PDF) is a file format used to present and exchange documents reliably, independent of software, hardware, or operating system. However, PDFs are not easily editable, which becomes a problem when a user needs to change the content to suit their requirements.

With PDF conversion, a user can convert a PDF to .txt, .doc, .xlsx, and other file types. This blog explains how to convert a PDF to JSON using JavaScript modules. This comes in handy in projects that need conversion functionality, e.g. reading a PDF and showing its content on screen or as separate fields in a user interface.

Overcoming the problem with NodeJS

Using NodeJS, a user can convert a PDF to JSON, an Excel workbook, or a Word document with the help of npm packages such as Excel-JS, Excel-Export, mammoth, and Officegen. Internally, the PDF is first converted to a string. The document or Excel file is then generated from that string, depending on the requirements.

For the purpose of this blog, the goal is to extract the text from the PDF and format the extracted string.

The npm package used for converting PDF to JSON is pdf2json.

Why pdf2json?

The most noteworthy feature of pdf2json is that it provides page numbers and callback methods, which is useful when a PDF needs to be parsed page by page. Furthermore, the entire text can be obtained at once using the getRawTextContent() method.
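To illustrate parsing page by page, here is a minimal sketch that walks the JSON produced by pdf2json and collects the text of each page. It assumes the parsed object exposes a Pages array whose Texts entries hold URI-encoded strings in R[].T (older releases nest Pages under formImage). The textByPage helper and the sample object are hypothetical names used only for illustration.

```javascript
// Collect the text of each page from pdf2json's parsed output.
// Assumption: `pdfData` is the object passed to the "pdfParser_dataReady"
// event and exposes a `Pages` array (older releases nest it under `formImage`).
function textByPage(pdfData) {
  const pages =
    pdfData.Pages || (pdfData.formImage && pdfData.formImage.Pages) || [];
  return pages.map((page, index) => ({
    page: index + 1,
    // Each text run stores its content URI-encoded in R[].T
    text: (page.Texts || [])
      .map((t) => t.R.map((run) => decodeURIComponent(run.T)).join(""))
      .join(" "),
  }));
}

// Hypothetical parsed output, trimmed to the fields used above.
const sample = {
  Pages: [
    { Texts: [{ R: [{ T: "Invoice%20No" }] }, { R: [{ T: "1001" }] }] },
    { Texts: [{ R: [{ T: "Total" }] }] },
  ],
};

console.log(JSON.stringify(textByPage(sample)));
// prints [{"page":1,"text":"Invoice No 1001"},{"page":2,"text":"Total"}]
```

Because every entry carries its page number, a single page can be picked out or processed on its own.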

1. File system (fs)

fs is a module built into NodeJS that lets the user read and write files.

2. pdf2json

Using pdf2json, the user can extract the content from the PDF.

Code:

1. Add the pdf2json package using npm
npm install pdf2json --save

2. Import the package in your project
let PDFParser = require("pdf2json");

3. Add the following code:
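A minimal sketch of this step could look like the following. It assumes pdf2json's event-based API (the pdfParser_dataReady and pdfParser_dataError events, loadPDF() and getRawTextContent()). The pdfToText helper, its ParserClass parameter, and the file paths are hypothetical and are not taken from the sample project.

```javascript
const fs = require("fs");

// Extract the raw text of a PDF and save it to a text file.
// `ParserClass` defaults to the pdf2json parser and is loaded only when no
// other class is passed in; it is a parameter (an assumption of this sketch)
// so the function can be tried out without a real PDF.
function pdfToText(pdfPath, txtPath, ParserClass = require("pdf2json")) {
  return new Promise((resolve, reject) => {
    // The second argument (1) asks the parser to keep the raw text content
    const pdfParser = new ParserClass(null, 1);

    pdfParser.on("pdfParser_dataError", (errData) =>
      reject(errData.parserError)
    );

    pdfParser.on("pdfParser_dataReady", () => {
      // The whole document as one string
      const text = pdfParser.getRawTextContent();
      fs.writeFile(txtPath, text, (err) => (err ? reject(err) : resolve(text)));
    });

    pdfParser.loadPDF(pdfPath);
  });
}
```

Calling pdfToText("./sample.pdf", "./sample.txt") would then resolve with the extracted string once the text file has been written.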

To see how this works, take a PDF document in the format below.

The PDF is converted to the string shown in Notepad below.

The PDF is converted to the JSON shown in the browser below.
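As a rough sketch of the formatting step, the extracted string could be split into pages and lines before being returned as JSON. The rawTextToJson helper and the sample text are hypothetical, and the page-break marker it looks for is an assumption about the raw text, so adjust the pattern to match what your PDF actually produces.

```javascript
// Split raw text into one entry per page and return a JSON-ready object.
// Assumption: pages are separated by a marker line such as
// "----------------Page (0) Break----------------"; pass another pattern
// if the extracted text looks different.
function rawTextToJson(rawText, pageBreak = /-+Page \(\d+\) Break-+/) {
  const pages = rawText
    .split(pageBreak)
    .map((chunk) => chunk.trim())
    // Chunks with no text (including the one after the last marker) are dropped
    .filter((chunk) => chunk.length > 0)
    .map((chunk, index) => ({
      page: index + 1,
      // One array element per non-empty line of text
      lines: chunk
        .split(/\r?\n/)
        .map((line) => line.trim())
        .filter(Boolean),
    }));
  return { pageCount: pages.length, pages };
}

// Hypothetical raw text standing in for the extracted string.
const raw =
  "Name: Jane\r\nCity: Pune\r\n" +
  "----------------Page (0) Break----------------\r\n" +
  "Total: 42\r\n" +
  "----------------Page (1) Break----------------\r\n";

console.log(JSON.stringify(rawTextToJson(raw), null, 2));
```

Keeping each page as an array of lines makes it straightforward to map individual lines to fields later on.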

The code for the sample project is attached below. To start the project:

  1. Install node and npm
  2. Run the command node server.js in the command prompt
  3. Finally, navigate to the URL http://localhost:3000 to see the output

The output is displayed in the browser, and the result is also written to sample.txt in the project folder.
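For orientation, a server that shows JSON in the browser could be sketched with Node's built-in http module. This is not the code of the attached sample project; createJsonServer and the data it serves are hypothetical.

```javascript
const http = require("http");

// Build a server that answers every request with the given data as JSON.
// `getData` is a hypothetical callback returning the object produced
// from the PDF.
function createJsonServer(getData) {
  return http.createServer((req, res) => {
    res.writeHead(200, { "Content-Type": "application/json" });
    res.end(JSON.stringify(getData(), null, 2));
  });
}
```

Calling createJsonServer(() => data).listen(3000) would then make the JSON available at http://localhost:3000.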

Download the sample project here! 

Sample project for converting PDF to JSON

Finally, once you know how to convert a PDF to JSON as well as to a text file using the pdf2json package, you can build features such as converting the extracted data to an Excel or Word document. Moreover, having the PDF content in JSON format makes it easy to loop over the data, add properties, and so on. It also lets the data pass between the server and web applications without hindrance.

Check our blog site to learn about more interesting features and projects in MEAN stack. For more information, please contact us.

 
