PDF to Text

  • Thread starter: arieger

Okay, I have PDF claim documents that are submitted to us every day with general claims information (e.g. Insured Name, Insured Number, etc.). These documents do not come in a standard form, because different insurance associations have their own forms. I am in the process of developing a program to read in these PDF documents and extract certain information to store in our database. I have tried some software trials that convert PDF to text, and I have tried using regular expressions to extract certain data. Here is my question: how can you extract certain blocks of information from a PDF and put them into separate text files? For example, suppose I had the following format:

Insured                  Adjuster
[NAME]                   [NAME]
[ADDRESS]                [ADDRESS]
[CITY, STATE, ZIP]       [CITY, STATE, ZIP]

Agent
[NAME]
[ADDRESS]
[CITY, STATE, ZIP]

Now I want to be able to extract the information for each category (insured, adjuster, agent). When I convert the PDF to text and then run a regular expression over the text, the matches come back by position: the insured name at index 0, the adjuster name at index 1, and the agent name at index 2. That is all good until my program reads a PDF file with a different format, where the order changes. What I want is to be able to say: extract the insured block of text, then the adjuster block, and so on. I know this sounds confusing, so just let me know if I need to be clearer about something.
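
To make the "extract a block by its label" idea more concrete, here is a rough Python sketch of what I have in mind. It is not a tested solution. It assumes the PDF has already been converted to text with the column layout preserved (for example with something like `pdftotext -layout`), that every block starts with a header line containing only known labels, and that a block ends at the next blank line. The `LABELS` tuple, the `extract_blocks` name, and the sample data are all made up for illustration.

```python
import re

# Hypothetical set of block labels; real forms would need their own list.
LABELS = ("Insured", "Adjuster", "Agent")


def extract_blocks(text, labels=LABELS):
    """Return {label: [lines]} for each labeled block in layout-preserved text."""
    lines = text.splitlines()
    label_re = re.compile(r"\b(" + "|".join(map(re.escape, labels)) + r")\b")
    blocks = {}
    i = 0
    while i < len(lines):
        # Column offset and name of every label found on this line.
        headers = [(m.start(), m.group(1)) for m in label_re.finditer(lines[i])]
        # Treat the line as a header only if it holds nothing except labels.
        if headers and not label_re.sub("", lines[i]).strip():
            # The block body runs until the next blank line (assumption).
            j = i + 1
            while j < len(lines) and lines[j].strip():
                j += 1
            body = lines[i + 1:j]
            for k, (start, label) in enumerate(headers):
                # Each column spans from its label to the next label's offset.
                end = headers[k + 1][0] if k + 1 < len(headers) else None
                cells = [row[start:end].strip() for row in body]
                blocks[label] = [c for c in cells if c]
            i = j
        else:
            i += 1
    return blocks


# Made-up sample in the two-column layout described above.
rows = [
    ("Insured", "Adjuster"),
    ("John Doe", "Jane Roe"),
    ("1 Main St", "22 Oak Ave"),
    ("Springfield, IL 62701", "Chicago, IL 60601"),
]
sample = "\n".join(left.ljust(24) + right for left, right in rows)
sample += "\n\nAgent\nSam Poe\n9 Elm Rd\nPeoria, IL 61602\n"

blocks = extract_blocks(sample)
for label, block_lines in blocks.items():
    print(label, block_lines)
```

Because each block is looked up by its label and column offset instead of by match index, the order of the blocks on the page should not matter. Each entry in `blocks` could then be written to its own text file. The sketch would still break on forms where the columns are not aligned or the labels are worded differently.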
