Innovators World Wide
Guest
I need to convert a PDF into a format that can be loaded into a DataGridView.
The only solution I can come up with is to use iTextSharp to convert the PDF to a text file; for the most part the format is kept.
Here is the code that extracts the text:
' At the top of the file:
Imports System.IO
Imports System.Text
Imports iTextSharp.text.pdf.parser

Dim mPDF As String = "C:\Users\Innovators World Wid\Documents\test.pdf"
Dim mTXT As String = "C:\Users\Innovators World Wid\Documents\test.txt"
Dim mPDFreader As New iTextSharp.text.pdf.PdfReader(mPDF)
Dim mPageCount As Integer = mPDFreader.NumberOfPages
Dim parser As PdfReaderContentParser = New PdfReaderContentParser(mPDFreader)
' Create the text file; Using closes it even if extraction throws.
Using fs As FileStream = File.Create(mTXT)
    Dim strategy As iTextSharp.text.pdf.parser.SimpleTextExtractionStrategy
    For i As Integer = 1 To mPageCount
        ' Extract the text of page i and append it to the file as UTF-8 bytes.
        strategy = parser.ProcessContent(i, New iTextSharp.text.pdf.parser.SimpleTextExtractionStrategy())
        Dim info As Byte() = New UTF8Encoding(True).GetBytes(strategy.GetResultantText())
        fs.Write(info, 0, info.Length)
    Next
End Using
mPDFreader.Close()
The text output ends up looking like this (also see the attached copy of file.txt):
63 FMPC0847535411 OD119523523152105000 Aug 28, 2020 02:18 PM EXPRESS
64 FMPP0532201112 OD119523544975573000 Aug 28, 2020 02:18 PM EXPRESS
65 FMPP0532243104 OD119523557412412000 Aug 28, 2020 02:18 PM EXPRESS66 FMPC0847516962 OD119523576945605000 Aug 28, 2020 02:18 PM EXPRESS
67 FMPC0847520947 OD119523760191783000 Aug 28, 2020 02:19 PM EXPRESS
Which is "pretty close".
The issue is the lines where EXPRESS is immediately followed by the next row's number (look at row 65, where row 66 starts on the same line). The output should look like this throughout, to make adding it to a DataGridView easier:
63 FMPC0847535411 OD119523523152105000 Aug 28, 2020 02:18 PM EXPRESS
64 FMPP0532201112 OD119523544975573000 Aug 28, 2020 02:18 PM EXPRESS
65 FMPP0532243104 OD119523557412412000 Aug 28, 2020 02:18 PM EXPRESS
66 FMPC0847516962 OD119523576945605000 Aug 28, 2020 02:18 PM EXPRESS
67 FMPC0847520947 OD119523760191783000 Aug 28, 2020 02:19 PM EXPRESS
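One thing that may be worth checking: the loop above writes each page's text with no line break after it, so the last row of one page and the first text of the next page end up on the same line. If the fused rows fall on page boundaries (an assumption, not confirmed by the post), ending each page with a line break might be enough. A minimal sketch, using iTextSharp 5's `PdfTextExtractor`; the module and Sub names are made up for illustration:

```vbnet
Imports System.IO
Imports iTextSharp.text.pdf
Imports iTextSharp.text.pdf.parser

Module ExtractSketch
    ' Writes the text of every page to txtPath and ends each page with a
    ' line break, so rows from consecutive pages cannot run together.
    Sub PdfToText(pdfPath As String, txtPath As String)
        Dim reader As New PdfReader(pdfPath)
        Try
            Using writer As New StreamWriter(txtPath)
                For pageNumber As Integer = 1 To reader.NumberOfPages
                    Dim pageText As String = PdfTextExtractor.GetTextFromPage(
                        reader, pageNumber, New SimpleTextExtractionStrategy())
                    ' WriteLine (not Write) adds the separator between pages.
                    writer.WriteLine(pageText)
                Next
            End Using
        Finally
            reader.Close()
        End Try
    End Sub
End Module
```

This would not remove the repeated column-header text at the top of each page; that still has to be filtered out afterwards.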
My attempt was to use a RegEx to remove everything except
"FMPC0847520947 OD119523760191783000 Aug 28, 2020 02:19 PM EXPRESS"
In some cases a row may end a bit differently, like this:
FMPC0847520947 OD119523760191783000 Aug 28, 2020 02:19 PM EXPRESS , Replacement Order
The RegEx is:
(\d{2}\s.{14}\s.{20}\s.{3}\s\d{1,2},\s\d{4}\s\d{2}:\d{2}\s.{2}\sEXPRESS,*\s*R*e*p*l*a*c*e*m*e*n*t*\s*o*r*d*e*r*)
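A tighter pattern might look like the sketch below. The field shapes (four letters plus ten digits for the tracking id, `OD` plus eighteen digits for the order id) are assumptions read off the sample rows, and the group names are made up for illustration. Compared with the pattern above, it accepts row numbers of any length, treats `, Replacement Order` as one optional unit instead of a run of optional letters, and allows the space before the comma that appears in the sample.

```vbnet
Imports System
Imports System.Text.RegularExpressions

Module RowPatternSketch
    ' Hypothetical pattern: field shapes are guessed from the sample rows only.
    Const RowPattern As String =
        "(?<No>\d+)\s+" &
        "(?<Tracking>[A-Z]{4}\d{10})\s+" &
        "(?<OrderId>OD\d{18})\s+" &
        "(?<Date>[A-Z][a-z]{2} \d{1,2}, \d{4} \d{2}:\d{2} [AP]M)\s+" &
        "(?<Type>EXPRESS(?:\s*,\s*Replacement Order)?)"

    Sub Main()
        Dim sample As String =
            "67 FMPC0847520947 OD119523760191783000 Aug 28, 2020 02:19 PM EXPRESS , Replacement Order"
        Dim m As Match = Regex.Match(sample, RowPattern)
        If m.Success Then
            ' Each named group holds one column of the row.
            Console.WriteLine(m.Groups("No").Value)        ' 67
            Console.WriteLine(m.Groups("Tracking").Value)  ' FMPC0847520947
            Console.WriteLine(m.Groups("OrderId").Value)   ' OD119523760191783000
            Console.WriteLine(m.Groups("Date").Value)      ' Aug 28, 2020 02:19 PM
            Console.WriteLine(m.Groups("Type").Value)      ' EXPRESS , Replacement Order
        End If
    End Sub
End Module
```

If real rows contain other tracking prefixes or service types than the ones in the sample, the corresponding groups would need loosening.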
Question: does anyone have a better or cleaner solution? What I need is
a PDF somehow converted to a format that can be loaded into a DataGridView in the appropriate rows and columns.
Any method that does this is appreciated.
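One possible route is to skip the cleaned-up text file entirely: run a row pattern over the whole extracted text and put each match into a `DataTable`, which a DataGridView can bind to. The sketch below assumes the field shapes seen in the sample rows; the pattern, the column names, and the `ToTable` function are hypothetical.

```vbnet
Imports System
Imports System.Data
Imports System.Text.RegularExpressions

Module GridSketch
    ' Hypothetical pattern: field shapes are guessed from the sample rows only.
    Const RowPattern As String =
        "(?<No>\d+)\s+(?<Tracking>[A-Z]{4}\d{10})\s+(?<OrderId>OD\d{18})\s+" &
        "(?<Date>[A-Z][a-z]{2} \d{1,2}, \d{4} \d{2}:\d{2} [AP]M)\s+" &
        "(?<Type>EXPRESS(?:\s*,\s*Replacement Order)?)"

    ' Builds one table row per match. Text between matches (page headers,
    ' missing line breaks) is simply skipped.
    Function ToTable(text As String) As DataTable
        Dim table As New DataTable("Orders")
        For Each columnName As String In {"No", "Tracking", "OrderId", "Date", "Type"}
            table.Columns.Add(columnName, GetType(String))
        Next
        For Each m As Match In Regex.Matches(text, RowPattern)
            table.Rows.Add(m.Groups("No").Value, m.Groups("Tracking").Value,
                           m.Groups("OrderId").Value, m.Groups("Date").Value,
                           m.Groups("Type").Value)
        Next
        Return table
    End Function

    Sub Main()
        ' Sample text with two fused rows and a page header in front of a row.
        Dim text As String =
            "65 FMPP0532243104 OD119523557412412000 Aug 28, 2020 02:18 PM EXPRESS" &
            "66 FMPC0847516962 OD119523576945605000 Aug 28, 2020 02:18 PM EXPRESS" &
            Environment.NewLine &
            "Tracking Id Forms Required Order Id RTS done on Notes" &
            "492 FMPP0532194482 OD119522868525176000 Aug 28, 2020 03:21 PM EXPRESS"
        Dim table As DataTable = ToTable(text)
        Console.WriteLine(table.Rows.Count)             ' 3
        Console.WriteLine(table.Rows(1)("No"))          ' 66
        Console.WriteLine(table.Rows(2)("Tracking"))    ' FMPP0532194482
    End Sub
End Module
```

In a WinForms project the table could then be bound with something like `DataGridView1.DataSource = ToTable(File.ReadAllText(mTXT))`, where `DataGridView1` stands for whatever the grid control is actually called.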
Edit:
I am using RegEx at the moment. This is the Sub:
Private Sub Fixtext()
    Dim regex As Regex = New Regex("\d{2}\s.{14}\s.{20}\s.{3}\s\d{1,2},\s\d{4}\s\d{2}:\d{2}\s.{2}\sEXPRESS,*\s*R*e*p*l*a*c*e*m*e*n*t*\s*o*r*d*e*r*")
    Using reader As StreamReader = New StreamReader("C:\Users\Innovators World Wid\Documents\test.txt")
        While (True)
            Dim line As String = reader.ReadLine()
            If line = Nothing Then
                Return
            End If
            Dim match As Match = regex.Match(line)
            If match.Success Then
                Dim value As String = match.Groups(1).Value
                Console.WriteLine(line)
            End If
        End While
    End Using
End Sub
The output still has a few problems:
490 FMPC0847531898 OD119522758218348000 Aug 28, 2020 03:20 PM EXPRESS
491 FMPP0532220915 OD119522825195489000 Aug 28, 2020 03:21 PM EXPRESS
Tracking Id Forms Required Order Id RTS done on Notes492 FMPP0532194482 OD119522868525176000 Aug 28, 2020 03:21 PM EXPRESS
493 FMPP0532195684 OD119522871090000000 Aug 28, 2020 03:21 PM EXPRESS
494 FMPP0532224318 OD119522895172342000 Aug 28, 2020 03:21 PM EXPRESS
495 FMPC0847571813 OD119522919323643000 Aug 28, 2020 03:21 PM EXPRESS
That is one issue: it isn't removing the "Tracking Id Forms Required Order Id RTS done on Notes" text, which should be removed.
And a few lines are still crammed together:
65 FMPP0532243104 OD119523557412412000 Aug 28, 2020 02:18 PM EXPRESS66 FMPC0847516962 OD119523576945605000 Aug 28, 2020 02:18 PM EXPRESS
should be
65 FMPP0532243104 OD119523557412412000 Aug 28, 2020 02:18 PM EXPRESS
66 FMPC0847516962 OD119523576945605000 Aug 28, 2020 02:18 PM EXPRESS
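Judging from the Sub as posted, both symptoms may share a cause: it prints `line` (the whole input line) instead of the matched text, and it calls `Match` only once per line, so a second row on the same line is never separated. `match.Groups(1)` is also empty there, because the pattern passed to `New Regex` has no parentheses. Separately, `If line = Nothing` is also true for an empty line in VB, so a blank line would end the loop early; `Is Nothing` avoids that. A minimal sketch of one way around the first two points, using `Regex.Matches` and keeping only each match's `Value`; the pattern is an assumption based on the sample rows and the names are made up:

```vbnet
Imports System
Imports System.Collections.Generic
Imports System.Text.RegularExpressions

Module FixTextSketch
    ' Hypothetical pattern: field shapes are guessed from the sample rows only.
    Const RowPattern As String =
        "\d+\s+[A-Z]{4}\d{10}\s+OD\d{18}\s+" &
        "[A-Z][a-z]{2} \d{1,2}, \d{4} \d{2}:\d{2} [AP]M\s+" &
        "EXPRESS(?:\s*,\s*Replacement Order)?"

    ' Returns every row found in the input, one list entry per row,
    ' dropping any text that is not part of a row.
    Function ExtractRows(lines As IEnumerable(Of String)) As List(Of String)
        Dim rows As New List(Of String)
        For Each line As String In lines
            ' Matches (plural) finds all rows on the line, not just the first.
            For Each m As Match In Regex.Matches(line, RowPattern)
                rows.Add(m.Value)
            Next
        Next
        Return rows
    End Function

    Sub Main()
        Dim lines As String() = {
            "65 FMPP0532243104 OD119523557412412000 Aug 28, 2020 02:18 PM EXPRESS66 FMPC0847516962 OD119523576945605000 Aug 28, 2020 02:18 PM EXPRESS",
            "Tracking Id Forms Required Order Id RTS done on Notes492 FMPP0532194482 OD119522868525176000 Aug 28, 2020 03:21 PM EXPRESS"
        }
        ' Prints rows 65, 66 and 492, each on its own line, without the header text.
        For Each row As String In ExtractRows(lines)
            Console.WriteLine(row)
        Next
    End Sub
End Module
```

To run it against the extracted file, `File.ReadLines(path)` from `System.IO` can be passed to `ExtractRows` in place of the sample array.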