PDF and Its Disadvantages
Portable Document Format (PDF) is a file format used to present and exchange documents reliably, independent of software, hardware, or operating system. However, PDFs are not easily editable, which becomes a problem when a user needs to change or reuse the content.
Overcoming the problem with NodeJS
Using NodeJS, a user can convert a PDF to JSON, an Excel document, or a Word document with the help of npm packages such as Excel-JS, Excel-Export, mammoth, and Officegen. Internally, the PDF is first converted to a string. The document or Excel file is then generated from that string, depending on the requirements.
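To make the "string first, document second" idea concrete, here is a minimal sketch that turns an extracted string into CSV, which Excel can open. CSV is used as a simple stand-in for the Excel packages named above. The helper names (textToCsv, csvEscape) and the rule that columns are separated by two or more spaces are assumptions for illustration; real PDFs may need a different splitting rule.

```javascript
// Hypothetical helper: quote a cell only when it contains a comma, quote, or newline.
function csvEscape(cell) {
  return /[",\n]/.test(cell) ? '"' + cell.replace(/"/g, '""') + '"' : cell;
}

// Hypothetical helper: convert extracted text into CSV.
// Assumption: each non-empty line is a row, and columns are
// separated by runs of two or more whitespace characters.
function textToCsv(rawText) {
  return rawText
    .split(/\r?\n/)
    .map((line) => line.trim())
    .filter((line) => line.length > 0)
    .map((line) => line.split(/\s{2,}/).map(csvEscape).join(","))
    .join("\n");
}

// Usage: pass in any string extracted from a PDF.
console.log(textToCsv("Name  Age\nAnn, B  30\n"));
```

The same rows could instead be handed to an Excel-writing package; the splitting step stays the same.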
This blog focuses on extracting the content of a PDF and formatting the extracted string.
The npm package available for converting PDF to JSON is pdf2json.
pdf2json is a NodeJS module that parses a PDF and converts it from binary to JSON format. It is built on pdf.js and adds interactive form element and text content parsing capabilities outside the browser.
A noteworthy feature of pdf2json is that it provides page numbers and callback methods, which is useful when the PDF needs to be parsed page by page. The entire text can also be obtained at once using the getRawTextContent() method.
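As a rough illustration of page-by-page processing, the sketch below walks a parsed JSON object and collects the text of each page. It assumes the output has a Pages array whose entries hold Texts, each with text runs in R[].T, and that the run text may be URI-encoded. Older pdf2json versions are assumed to nest Pages under formImage. The function names textByPage and safeDecode are hypothetical, so check the shape against the version you have installed.

```javascript
// Hypothetical helper: decode a URI-encoded run, falling back to the
// original string if it is not valid encoded text (for example "50%").
function safeDecode(value) {
  try {
    return decodeURIComponent(value);
  } catch {
    return value;
  }
}

// Hypothetical helper: return one { page, text } entry per page.
// Assumption: pdfData.Pages (or pdfData.formImage.Pages in older
// versions) is an array of pages with Texts[].R[].T text runs.
function textByPage(pdfData) {
  const pages =
    pdfData.Pages || (pdfData.formImage && pdfData.formImage.Pages) || [];
  return pages.map((page, index) => ({
    page: index + 1, // 1-based page number
    text: (page.Texts || [])
      .map((item) => (item.R || []).map((run) => safeDecode(run.T)).join(""))
      .join(" "),
  }));
}

// Usage with a hand-made object in the assumed shape:
const samplePdfData = {
  Pages: [
    { Texts: [{ R: [{ T: "Hello%20World" }] }, { R: [{ T: "50%" }] }] },
    { Texts: [] },
  ],
};
console.log(textByPage(samplePdfData));
```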
Modules in use
The user can read and write files using fs (file system), which is a built-in module in NodeJS.
Using pdf2json, the user can extract the content from the PDF.
1. Add pdf2json package using npm
npm install pdf2json --save
2. Import the package in your project
let PDFParser = require("pdf2json");
3. Add the following code:
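The original listing for this step is not reproduced here, so what follows is only a minimal sketch of what such code might look like, not the sample project's actual source. It uses the pdf2json events pdfParser_dataReady and pdfParser_dataError together with loadPDF() and getRawTextContent(). The helper name convertPdf and the file names are assumptions, and the parser class is passed in as an argument so the helper does not depend on how the package is loaded (some pdf2json versions may need import instead of require).

```javascript
const fs = require("fs");

// Hypothetical helper: parse a PDF, write its raw text to textPath,
// and resolve with both the text and the parsed JSON.
function convertPdf(PDFParserClass, pdfPath, textPath) {
  return new Promise((resolve, reject) => {
    // The second argument (1) asks pdf2json to keep the raw text,
    // which getRawTextContent() relies on.
    const pdfParser = new PDFParserClass(null, 1);

    pdfParser.on("pdfParser_dataError", (errData) => {
      reject(errData && errData.parserError ? errData.parserError : errData);
    });

    pdfParser.on("pdfParser_dataReady", (pdfData) => {
      try {
        const text = pdfParser.getRawTextContent();
        fs.writeFileSync(textPath, text); // plain-text copy of the PDF
        resolve({ text, json: pdfData });
      } catch (err) {
        reject(err);
      }
    });

    pdfParser.loadPDF(pdfPath);
  });
}

// Usage, after `npm install pdf2json` (file names are placeholders):
// const PDFParser = require("pdf2json");
// convertPdf(PDFParser, "./sample.pdf", "./sample.txt")
//   .then(({ json }) => console.log(JSON.stringify(json, null, 2)))
//   .catch(console.error);
```

Wrapping the event callbacks in a Promise is a design choice of this sketch; the events can equally be handled directly.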
As a sample of how this works, consider a PDF document with the format below.
The PDF is converted to the string shown in Notepad below.
The PDF is converted to the JSON shown in the browser below.
The code for the sample project is attached below. To start the project:
- Install node and npm
- Run command node server.js in the command prompt
- Finally, navigate to the URL http://localhost:3000 to see the output
The output is displayed in the browser, and you can also view the result in sample.txt in the project folder.
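The sample project's server.js is not shown in this post. As a hedged sketch of how a server might expose the parsed JSON at http://localhost:3000, the example below uses only Node's built-in http module. The names makeHandler and startServer are hypothetical, and the object passed in stands for whatever the parsing step produced.

```javascript
const http = require("http");

// Hypothetical helper: build a request handler that returns the
// given result as pretty-printed JSON for every request.
function makeHandler(result) {
  return (req, res) => {
    res.writeHead(200, { "Content-Type": "application/json" });
    res.end(JSON.stringify(result, null, 2));
  };
}

// Hypothetical helper: start listening on the given port (3000 by default).
function startServer(result, port = 3000) {
  return http.createServer(makeHandler(result)).listen(port, () => {
    console.log("Open http://localhost:" + port + " to see the output");
  });
}

// Usage: call this once the PDF has been parsed, for example
// startServer(parsedJson);
```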
Download the sample project here!
By learning how to convert a PDF to JSON as well as to a text file using the pdf2json package, you can build features such as generating an Excel file or a document from the extracted data. Moreover, the JSON format is convenient for looping over the content, adding properties, and so on, and it makes it easier to pass the data between the server and web applications.