Problems reading source code from some webpages

Arokh

I searched the forum for a way to download the HTML source of a webpage and found several threads on it.

So far I have:
[VB]
Public Function GetURLSource(ByVal URL As String) As String
    Dim wClient = New System.Net.WebClient()
    Dim buffer As Byte()

    buffer = wClient.DownloadData(URL)
    GetURLSource = System.Text.Encoding.Default.GetString(buffer, 0, buffer.Length)
End Function
[/VB]

On most webpages it works as it should, but on the page I want to read it doesn't.
For example: http://anidb.info/perl-bin/animedb.pl?show=anime&aid=96
I want to read the page so I can get the episode names of the series.
But all the function returns is "
 
GZip encoded

According to the page headers, the content is GZip-compressed. If you are using version 2.0 of the framework, you can use System.IO.Compression.GZipStream to decompress it. Besides that, I would recommend using the System.Net.WebRequest class rather than System.Net.WebClient. This way you can handle different content encodings, content types, and so forth:

Code:
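    ' Requires: Imports System.IO, System.IO.Compression, System.Net, System.Text
    Public Function GetURLSource(ByVal URL As String) As String
        Dim httpReq As WebRequest
        Dim httpRes As HttpWebResponse
        Dim stm As Stream
        Dim charset As Encoding

        httpReq = WebRequest.Create(URL)
        httpRes = DirectCast(httpReq.GetResponse(), HttpWebResponse)

        ' Perhaps check status code here

        ' Check encoding
        stm = httpRes.GetResponseStream()
        If String.Equals(httpRes.ContentEncoding, "gzip", StringComparison.OrdinalIgnoreCase) Then
            ' Content is GZipped, must extract first
            stm = New GZipStream(stm, CompressionMode.Decompress)
        End If
        ' Not GZip compressed: the stream is read as-is

        ' Determine the character encoding from the response headers.
        ' Do NOT assume UTF-8; it is used here only as a fallback when the
        ' reported charset is missing or unknown.
        Try
            charset = Encoding.GetEncoding(httpRes.CharacterSet)
        Catch ex As ArgumentException
            charset = Encoding.UTF8
        End Try

        ' Disposing the reader also closes the response stream
        Using reader As New StreamReader(stm, charset)
            Return reader.ReadToEnd()
        End Using
    End Function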
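
If you would rather not handle the GZip stream yourself, framework 2.0 also has the HttpWebRequest.AutomaticDecompression property, which asks the framework to decompress the response for you. Below is a minimal sketch of that approach, not a definitive implementation. The names AutoDecompressSketch, CreateRequest and GetURLSourceAuto are made up for illustration, and the UTF-8 fallback is my assumption for pages that report no usable charset.

```vb
Imports System.IO
Imports System.Net
Imports System.Text

Module AutoDecompressSketch
    ' Hypothetical helper: builds a request that lets the framework
    ' undo gzip/deflate compression transparently.
    Public Function CreateRequest(ByVal url As String) As HttpWebRequest
        Dim req As HttpWebRequest = DirectCast(WebRequest.Create(url), HttpWebRequest)
        req.AutomaticDecompression = DecompressionMethods.GZip Or DecompressionMethods.Deflate
        Return req
    End Function

    ' Hypothetical helper: downloads the page and returns it as text.
    Public Function GetURLSourceAuto(ByVal url As String) As String
        Using res As HttpWebResponse = DirectCast(CreateRequest(url).GetResponse(), HttpWebResponse)
            Dim charset As Encoding
            Try
                ' Use the charset the server reported in Content-Type
                charset = Encoding.GetEncoding(res.CharacterSet)
            Catch ex As ArgumentException
                ' Assumption: fall back to UTF-8 if the charset is missing or unknown
                charset = Encoding.UTF8
            End Try

            Using reader As New StreamReader(res.GetResponseStream(), charset)
                Return reader.ReadToEnd()
            End Using
        End Using
    End Function
End Module
```

With this in place, calling GetURLSourceAuto with the page URL should return the decompressed HTML, provided the server marks the response with a Content-Encoding the framework recognises.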

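To see why the original function returned garbage, it can help to look at what GZip does to the bytes. The sketch below is only an offline illustration: it compresses a made-up HTML string in memory, prints the first two bytes of the result, and decompresses it again with GZipStream. The module and function names are hypothetical.

```vb
Imports System
Imports System.IO
Imports System.IO.Compression
Imports System.Text

Module GZipRoundTripSketch
    ' Compress a string to GZip bytes (stands in for what the server sends).
    Public Function Compress(ByVal text As String) As Byte()
        Using ms As New MemoryStream()
            Using gz As New GZipStream(ms, CompressionMode.Compress)
                Dim raw As Byte() = Encoding.UTF8.GetBytes(text)
                gz.Write(raw, 0, raw.Length)
            End Using
            ' ToArray still works after the GZipStream has closed the MemoryStream
            Return ms.ToArray()
        End Using
    End Function

    ' Decompress GZip bytes back to a string (what the client has to do).
    Public Function Decompress(ByVal data As Byte()) As String
        Using gz As New GZipStream(New MemoryStream(data), CompressionMode.Decompress)
            Using reader As New StreamReader(gz, Encoding.UTF8)
                Return reader.ReadToEnd()
            End Using
        End Using
    End Function

    Sub Main()
        Dim packed As Byte() = Compress("<html><body>episode list</body></html>")

        ' GZip data starts with the magic bytes 1F 8B, which is not text.
        ' Decoding these bytes directly as a string is what produces garbage.
        Console.WriteLine(packed(0).ToString("X2") & " " & packed(1).ToString("X2"))

        ' After decompression the original markup is back.
        Console.WriteLine(Decompress(packed))
    End Sub
End Module
```
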
Good luck :cool: