Extracting Text from HTML

samsmithnz

Is there a way of loading an HTML page and extracting the text (with no formatting, tags, etc.) in .NET?
 
[VB]
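' Requires: Imports System.Text.RegularExpressions
Public Shared Function RemoveHtmlTags(ByVal htmlText As String) As String
    ' Replace every tag (anything between "<" and ">") with an empty string.
    Return Regex.Replace(htmlText, "(<[^>]*>)", "", RegexOptions.Multiline Or RegexOptions.Compiled)
End Function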
[/VB]
 
Nice try, but the result is not even close. I want the TEXT, not the tags.

I was hoping that I'd be able to load it into some sort of HTMLDocument (I may be making things up now) and then just query for the InnerHTML... or is that just an XML thing? :p
 
This is my entire project, but it doesn't work. It copies the HTML file as is...

Code:
Private Sub LoadForm()
    Dim objFile As System.IO.File
    Dim objSR As System.IO.StreamReader
    Dim strText As String

    objSR = objFile.OpenText("C:\Projects\Books.html")
    strText = objSR.ReadToEnd
    strText = RemoveHtmlTags(strText)
    TextBox1.Text = strText
End Sub

Private Function RemoveHtmlTags(ByVal htmlText As String) As String
    RemoveHtmlTags = System.Text.RegularExpressions.Regex.Replace(htmlText, "(<[^>]*> )", "", System.Text.RegularExpressions.RegexOptions.Multiline Or System.Text.RegularExpressions.RegexOptions.Compiled)
End Function
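
A likely reason the file comes through unchanged: the pattern in the project code is "(<[^>]*> )", with a space before the closing parenthesis, whereas the pattern posted earlier in the thread is "(<[^>]*>)". With the extra space, only tags that happen to be followed by a space are matched, so most of the markup survives. Below is a minimal sketch of a tag stripper without that space. The module name HtmlTextDemo, the script/style handling, the entity decoding, and the whitespace collapsing are my own additions rather than anything from the posts above, and WebUtility.HtmlDecode assumes .NET Framework 4.0 or later. A regex is only a rough approximation of an HTML parser, so treat this as a starting point.

```vbnet
Imports System
Imports System.Net
Imports System.Text.RegularExpressions

Module HtmlTextDemo
    ' Hypothetical helper: same idea as RemoveHtmlTags in the thread,
    ' but without the stray space inside the pattern.
    Function RemoveHtmlTags(ByVal htmlText As String) As String
        ' Drop <script> and <style> blocks together with their contents,
        ' since their bodies are not visible text.
        Dim text As String = Regex.Replace(htmlText, "<(script|style)\b[^>]*>.*?</\1\s*>", "", RegexOptions.IgnoreCase Or RegexOptions.Singleline)

        ' Replace each remaining tag with a space so that words on either
        ' side of a tag do not run together.
        text = Regex.Replace(text, "<[^>]*>", " ")

        ' Turn entities such as &amp; back into plain characters.
        text = WebUtility.HtmlDecode(text)

        ' Collapse runs of whitespace (including line breaks) into one space.
        Return Regex.Replace(text, "\s+", " ").Trim()
    End Function

    Sub Main()
        Console.WriteLine(RemoveHtmlTags("<p>Hello <b>world</b> &amp; goodbye</p>"))
    End Sub
End Module
```

Replacing tags with a space rather than an empty string is a trade-off: it keeps adjacent paragraphs from merging, but it will also split a word that has a tag in the middle of it. In the project above, the result of RemoveHtmlTags(strText) could be assigned to TextBox1.Text exactly as before.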
 

