VB.NET RegEx: PDF to DataGridView

  • Thread starter: Innovators World Wide

I need to convert a PDF into a format that can be loaded into a DataGridView.

The only solution I have come up with is to use iTextSharp to convert the PDF to a text file; for the most part the format is kept.


Here is the code that extracts the text.

' Requires: Imports System.IO, System.Text, iTextSharp.text.pdf, iTextSharp.text.pdf.parser
Dim mPDF As String = "C:\Users\Innovators World Wid\Documents\test.pdf"
Dim mTXT As String = "C:\Users\Innovators World Wid\Documents\test.txt"
Dim mPDFreader As New PdfReader(mPDF)
Dim mPageCount As Integer = mPDFreader.NumberOfPages
Dim parser As New PdfReaderContentParser(mPDFreader)
' Create the text file; Using closes it even if extraction throws.
Using fs As FileStream = File.Create(mTXT)
    For i As Integer = 1 To mPageCount
        ' Extract the text of page i.
        Dim strategy As SimpleTextExtractionStrategy = parser.ProcessContent(i, New SimpleTextExtractionStrategy())
        ' No separator is written between pages, so page i+1's text starts right after page i's last character.
        Dim info As Byte() = New UTF8Encoding(True).GetBytes(strategy.GetResultantText())
        fs.Write(info, 0, info.Length)
    Next
End Using
mPDFreader.Close()

The text output ends up looking like this. (also see attached copy of file.txt)


63 FMPC0847535411 OD119523523152105000 Aug 28, 2020 02:18 PM EXPRESS
64 FMPP0532201112 OD119523544975573000 Aug 28, 2020 02:18 PM EXPRESS
65 FMPP0532243104 OD119523557412412000 Aug 28, 2020 02:18 PM EXPRESS66 FMPC0847516962 OD119523576945605000 Aug 28, 2020 02:18 PM EXPRESS
67 FMPC0847520947 OD119523760191783000 Aug 28, 2020 02:19 PM EXPRESS

Which is "pretty close".

The issue is the lines where EXPRESS is immediately followed by the next row's number (look at row 65, where row 66 starts on the same line). It should look like this throughout, to make adding it to a DataGridView easier:

63 FMPC0847535411 OD119523523152105000 Aug 28, 2020 02:18 PM EXPRESS
64 FMPP0532201112 OD119523544975573000 Aug 28, 2020 02:18 PM EXPRESS
65 FMPP0532243104 OD119523557412412000 Aug 28, 2020 02:18 PM EXPRESS
66 FMPC0847516962 OD119523576945605000 Aug 28, 2020 02:18 PM EXPRESS
67 FMPC0847520947 OD119523760191783000 Aug 28, 2020 02:19 PM EXPRESS

My attempt was to use RegEx to remove everything except rows like

"FMPC0847520947 OD119523760191783000 Aug 28, 2020 02:19 PM EXPRESS"

In some cases a row may end a bit differently, for example:

FMPC0847520947 OD119523760191783000 Aug 28, 2020 02:19 PM EXPRESS , Replacement Order

The RegEx is:
(\d{2}\s.{14}\s.{20}\s.{3}\s\d{1,2},\s\d{4}\s\d{2}:\d{2}\s.{2}\sEXPRESS,*\s*R*e*p*l*a*c*e*m*e*n*t*\s*o*r*d*e*r*)
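One thing that may be worth checking in the pattern above is its tail. Every letter of "Replacement" carries its own *, so each one may match zero characters, and the comma is expected directly after EXPRESS while the sample row has a space before it. Below is a minimal sketch comparing that tail with a single optional group. The grouped pattern and the module name SuffixDemo are my own illustration, not part of the original code.

```vbnet
Imports System
Imports System.Text.RegularExpressions

Module SuffixDemo
    Sub Main()
        ' Row ending taken from the sample in the post.
        Dim sample As String = "EXPRESS , Replacement Order"

        ' Tail of the pattern from the post: every character is individually optional.
        Dim original As New Regex("EXPRESS,*\s*R*e*p*l*a*c*e*m*e*n*t*\s*o*r*d*e*r*")

        ' Alternative tail (an assumption): the whole suffix is one optional group.
        Dim grouped As New Regex("EXPRESS(?:\s*,\s*Replacement Order)?")

        ' Brackets make trailing whitespace in the match visible.
        Console.WriteLine("[" & original.Match(sample).Value & "]") ' prints [EXPRESS ]
        Console.WriteLine("[" & grouped.Match(sample).Value & "]")  ' prints [EXPRESS , Replacement Order]
    End Sub
End Module
```

With the original tail, the match stops after the space that follows EXPRESS, so the ", Replacement Order" part is never included. The leading \d{2} may also be worth a look, since the row numbers in the later output have three digits.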

Question: does anyone have a better or cleaner solution? What I need is:

The PDF somehow converted to a format that can be loaded into a DataGridView, in the appropriate rows and columns.

Any method that does this is appreciated.
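One possible route, sketched under assumptions: skip the line-by-line cleanup and run a pattern with named groups over the whole extracted text, adding one DataTable row per match. The row layout ("FMP" plus a letter plus 10 digits, "OD" plus 18 digits) is inferred only from the sample rows, and the column names and the names RowsToTable and ParseRows are hypothetical.

```vbnet
Imports System
Imports System.Data
Imports System.Text.RegularExpressions

Module RowsToTable
    ' Assumed row layout, inferred from the sample rows in the post:
    ' row number, tracking id, order id, date and time, then EXPRESS with an optional suffix.
    Private ReadOnly RowPattern As New Regex(
        "(?<No>\d+)\s+(?<TrackingId>FMP[A-Z]\d{10})\s+(?<OrderId>OD\d{18})\s+" &
        "(?<RtsDoneOn>[A-Z][a-z]{2}\s\d{1,2},\s\d{4}\s\d{2}:\d{2}\s[AP]M)\s+" &
        "(?<Notes>EXPRESS(?:\s*,\s*Replacement Order)?)")

    ' Scans the whole text instead of single lines, so rows that share a line
    ' and header text sitting between rows do not affect the result.
    Function ParseRows(text As String) As DataTable
        Dim table As New DataTable()
        For Each name As String In {"No", "TrackingId", "OrderId", "RtsDoneOn", "Notes"}
            table.Columns.Add(name, GetType(String))
        Next
        For Each m As Match In RowPattern.Matches(text)
            table.Rows.Add(m.Groups("No").Value,
                           m.Groups("TrackingId").Value,
                           m.Groups("OrderId").Value,
                           m.Groups("RtsDoneOn").Value,
                           m.Groups("Notes").Value)
        Next
        Return table
    End Function

    Sub Main()
        ' Sample text copied from the post: two rows fused together, then header text fused to a row.
        Dim sample As String =
            "65 FMPP0532243104 OD119523557412412000 Aug 28, 2020 02:18 PM EXPRESS66 FMPC0847516962 OD119523576945605000 Aug 28, 2020 02:18 PM EXPRESS" &
            Environment.NewLine &
            "Tracking Id Forms Required Order Id RTS done on Notes492 FMPP0532194482 OD119522868525176000 Aug 28, 2020 03:21 PM EXPRESS"

        Dim table As DataTable = ParseRows(sample)
        For Each row As DataRow In table.Rows
            Console.WriteLine(String.Join(" | ", row.ItemArray))
        Next
    End Sub
End Module
```

In a WinForms form the table could then be bound with something like DataGridView1.DataSource = ParseRows(File.ReadAllText(mTXT)), where DataGridView1 is a placeholder control name. Rows that do not fit the assumed layout are silently skipped, so the row count is worth comparing against the PDF.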

Edit:


I am using RegEx at the moment. This is the sub

Private Sub Fixtext()
    Dim regex As Regex = New Regex("\d{2}\s.{14}\s.{20}\s.{3}\s\d{1,2},\s\d{4}\s\d{2}:\d{2}\s.{2}\sEXPRESS,*\s*R*e*p*l*a*c*e*m*e*n*t*\s*o*r*d*e*r*")
    Using reader As StreamReader = New StreamReader("C:\Users\Innovators World Wid\Documents\test.txt")
        While (True)
            Dim line As String = reader.ReadLine()
            ' ReadLine returns Nothing at end of file; "line = Nothing" would also be True for an empty line.
            If line Is Nothing Then
                Return
            End If
            Dim match As Match = regex.Match(line)
            If match.Success Then
                ' This pattern has no capture group, so Groups(1) is empty; Groups(0) holds the matched text.
                Dim value As String = match.Groups(1).Value
                ' Prints the whole input line, not just the matched text.
                Console.WriteLine(line)
            End If
        End While
    End Using
End Sub

The output still has a few problems.


490 FMPC0847531898 OD119522758218348000 Aug 28, 2020 03:20 PM EXPRESS 491 FMPP0532220915 OD119522825195489000 Aug 28, 2020 03:21 PM EXPRESS Tracking Id Forms Required Order Id RTS done on Notes492 FMPP0532194482 OD119522868525176000 Aug 28, 2020 03:21 PM EXPRESS 493 FMPP0532195684 OD119522871090000000 Aug 28, 2020 03:21 PM EXPRESS 494 FMPP0532224318 OD119522895172342000 Aug 28, 2020 03:21 PM EXPRESS 495 FMPC0847571813 OD119522919323643000 Aug 28, 2020 03:21 PM EXPRESS

That is one problem: it isn't removing the "Tracking Id Forms Required Order Id RTS done on Notes" header text, which should be removed.

Also, a few rows are still crammed together.


65 FMPP0532243104 OD119523557412412000 Aug 28, 2020 02:18 PM EXPRESS66 FMPC0847516962 OD119523576945605000 Aug 28, 2020 02:18 PM EXPRESS

should be

65 FMPP0532243104 OD119523557412412000 Aug 28, 2020 02:18 PM EXPRESS 66 FMPC0847516962 OD119523576945605000 Aug 28, 2020 02:18 PM EXPRESS
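
If the goal is only to clean up the text file, both problems might be handled in one pass over the whole text: delete the header string, then insert a line break before anything that looks like the start of a row. This is a sketch under the same layout assumptions as the sample rows ("FMP" plus a letter plus 10 digits, "OD" plus 18 digits). The names NormalizeRows, OneRowPerLine and RowStart are made up for illustration.

```vbnet
Imports System
Imports System.Text.RegularExpressions

Module NormalizeRows
    ' Header text exactly as it appears in the sample output in the post.
    Private Const Header As String = "Tracking Id Forms Required Order Id RTS done on Notes"

    ' Optional whitespace followed (lookahead only) by the start of a row:
    ' row number, tracking id, order id. (?<!\d) keeps the break from landing
    ' in the middle of a row number.
    Private ReadOnly RowStart As New Regex("\s*(?<!\d)(?=\d+\s+FMP[A-Z]\d{10}\s+OD\d{18}\s)")

    Function OneRowPerLine(text As String) As String
        ' Remove every copy of the header, then start each row on its own line.
        Dim withoutHeader As String = text.Replace(Header, "")
        Return RowStart.Replace(withoutHeader, Environment.NewLine).Trim()
    End Function

    Sub Main()
        ' Sample text copied from the post: fused rows, then header text fused to the next row.
        Dim sample As String =
            "65 FMPP0532243104 OD119523557412412000 Aug 28, 2020 02:18 PM EXPRESS66 FMPC0847516962 OD119523576945605000 Aug 28, 2020 02:18 PM EXPRESS " &
            "Tracking Id Forms Required Order Id RTS done on Notes492 FMPP0532194482 OD119522868525176000 Aug 28, 2020 03:21 PM EXPRESS"

        Console.WriteLine(OneRowPerLine(sample))
    End Sub
End Module
```

Because the break is inserted only in front of a row start, a trailing ", Replacement Order" stays on the same line as its row.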
