Extracting Text from HTML

samsmithnz · Jan 16, 2004

Is there a way of loading an HTML page and extracting the text (with no formatting, tags, etc.) in .NET?

splice · Jan 16, 2004

[VB]
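' Requires "Imports System.Text.RegularExpressions" at the top of the file.
' Replaces every "<...>" run (anything that looks like a tag) with an empty string.
Public Shared Function RemoveHtmlTags(ByVal htmlText As String) As String
    Return Regex.Replace(htmlText, "(<[^>]*>)", "", RegexOptions.Multiline Or RegexOptions.Compiled)
End Function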
[/VB]

samsmithnz · Jan 16, 2004

Nice try, but the result is not even close. I want the TEXT, not the tags.

I was hoping that I'd be able to load it into some sort of HTMLDocument (I may be making things up now) and then just query for the InnerHTML... or is that just an XML thing?

splice · Jan 16, 2004

What? That does strip the tags! Did you try it?

samsmithnz · Jan 16, 2004

This is my entire project, but it doesn't work. It copies the HTML file as is...

Code:

Private Sub LoadForm()

    Dim objFile As System.IO.File
    Dim objSR As System.IO.StreamReader
    Dim strText As String

    objSR = objFile.OpenText("C:\Projects\Books.html")

    strText = objSR.ReadToEnd

    strText = RemoveHtmlTags(strText)

    TextBox1.Text = strText

End Sub

Private Function RemoveHtmlTags(ByVal htmlText As String) As String
    RemoveHtmlTags = System.Text.RegularExpressions.Regex.Replace(htmlText, "(<[^>]*> )", "", System.Text.RegularExpressions.RegexOptions.Multiline Or System.Text.RegularExpressions.RegexOptions.Compiled)
End Function

splice · Jan 16, 2004

Remove the space between ">" and ")" in the pattern.

The VB code block on the forum formats that incorrectly. There should be no space there.
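For reference, here is a minimal, self-contained sketch of the function with the stray space removed. The module name HtmlTextDemo and the sample string are made up for illustration and are not from the original project. The RegexOptions arguments are left out only to keep the sketch short.

```vb
Imports System
Imports System.Text.RegularExpressions

Module HtmlTextDemo

    ' Matches "<", then any run of characters other than ">", then ">".
    ' Note that there is no space between ">" and ")" in the pattern.
    Function RemoveHtmlTags(ByVal htmlText As String) As String
        Return Regex.Replace(htmlText, "(<[^>]*>)", "")
    End Function

    Sub Main()
        ' Hypothetical sample input, standing in for the contents of the HTML file.
        Dim sample As String = "<html><body><h1>Books</h1><p>Two <b>new</b> titles</p></body></html>"

        ' Prints: BooksTwo new titles
        Console.WriteLine(RemoveHtmlTags(sample))
    End Sub

End Module
```

Note that this only deletes the tags themselves. Text inside script or style elements and character entities such as &amp; are left in the output, so a real page may need extra clean-up.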

samsmithnz · Jan 16, 2004

EXCELLENT, that works perfectly. Thank you for your help!
