web spider questions

dotnetnoob (New member, joined Jan 26, 2003, 3 messages)
This is for learning purposes and doesn't have to work really well... What I would like to do is create a simplified web spider, but I'm running into a few problems/questions:

- I can't seem to read an HTML page into a string. My program does nothing for a while, then an error message appears saying the connection was closed and it could not connect to the remote server. Is this my PC, or am I forgetting something? Does anyone have working code for this?

- If I ever manage to successfully read in an HTML file, what would be the best way to collect all the <a> tags out of it? I want to use all of those links as the next URLs to visit.

- If I collect all of the links from a page and then do the same for those pages, chances are my program ends up in a never-ending loop. For example, 4 pages sharing the same links (a menu) would make my program keep visiting the same pages forever. What would be a good and fast way to check which URLs I have already visited?

It's for a school assignment, so it doesn't have to work really well... just well enough.
 
Well, for the first question, use a WebClient:
Code:
Dim wc As New System.Net.WebClient()
Dim bhtml() As Byte, html As String

bhtml = wc.DownloadData("http://www.somesite.com")
html = System.Text.Encoding.ASCII.GetString(bhtml)
For your second question, there are a few choices. You can go with
some simple link parsing using .IndexOf and .Substring calls
and the like, but I would use regular expressions (the Regex class in
System.Text.RegularExpressions) for something like this. A regular
expression is a sort of search language: you write a pattern, and it
will attempt to find every match. It's a complex thing to master, but
it can save a lot of time once you learn it. Read about it in the MSDN.
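Here's a rough sketch of what that could look like. The pattern below only handles double-quoted href attributes, so treat it as a starting point, not a complete HTML parser:
Code:
' Extract href values from <a> tags with a regular expression.
' Simplified pattern: matches href="..." inside <a ...> tags only.
Dim html As String = "<a href=""http://www.somesite.com/page.html"">link</a>"
Dim re As New System.Text.RegularExpressions.Regex( _
    "<a[^>]+href\s*=\s*""([^""]+)""", _
    System.Text.RegularExpressions.RegexOptions.IgnoreCase)

Dim m As System.Text.RegularExpressions.Match
For Each m In re.Matches(html)
    ' Group 1 is the URL captured between the quotes.
    Console.WriteLine(m.Groups(1).Value)
Next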

As for your third question, you could use an ArrayList, I believe. Like this:
Code:
Dim visited As New ArrayList()

' After visiting a page:
visited.Add("http://www.somesite.com")

' To check whether it has already been visited:
If visited.Contains("http://www.somesite.com") Then
    ' skip this page
End If
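One caveat: ArrayList.Contains scans the whole list on every check, which gets slow as the spider collects many URLs. A Hashtable keyed by URL does the same job with a constant-time lookup; something like this (sketch only):
Code:
' Track visited URLs with a Hashtable for fast lookups.
Dim visited As New Hashtable()
Dim url As String = "http://www.somesite.com"

If Not visited.ContainsKey(url) Then
    visited.Add(url, True)   ' mark as visited
    ' ... download and parse the page here ...
End If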
 
Thanks, I'll look into it.

However, I still can't get the HTML reading to work. Even with your code I get the following error:
An unhandled exception of type System.Net.WebException occurred in system.dll

Additional information: The underlying connection was closed: Unable to connect to the remote server.

I have internet on this PC via cable (and I'm behind a router)... what should I do to fix this?
 
Oh, add http:// in front of the URL, and make sure the address is
valid.
 
I have http:// in front of it, and I tried multiple URLs... I even tried specifying a full path to an HTML file (so .../index.html), no luck.
I'm using a proxy... perhaps that's the problem? Please, anyone? What can I do to fix this problem? Even the sample code from MSDN gives me the same error.
 
I have a similar problem when using the WebClient class except my error message is "The underlying connection was closed: The remote name could not be resolved."

I believe it to be a security/configuration issue, as our internet connection goes through one of our company's proxy servers and prompts for a username and password. I have to enter this every time I open a browser and attempt to connect to the internet, but when I use the WebClient class, the dialog asking for the username and password doesn't appear.
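If the proxy is indeed the problem, you may be able to tell the framework about it explicitly instead of waiting for a dialog that never comes. Something along these lines might work; the proxy address, port, and credentials below are placeholders you'd replace with your own:
Code:
' Sketch: route WebClient traffic through an authenticating proxy.
' "proxyserver", 8080, and the credentials are placeholder values.
Dim proxy As New System.Net.WebProxy("http://proxyserver:8080", True)
proxy.Credentials = New System.Net.NetworkCredential("username", "password")

' Make this the default proxy for the application.
System.Net.GlobalProxySelection.Select = proxy

Dim wc As New System.Net.WebClient()
Dim html As String = System.Text.Encoding.ASCII.GetString( _
    wc.DownloadData("http://www.somesite.com"))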
 