HTML link extract function

neodammer

Does anybody know of a good regex function for extracting links from HTML code? I'm finding it hard given the various ways links can be written.
 
IngisKahn said:
(?<=href=")\S+?(?=") will extract everything in href="..."
What else do you need?


Well, I'm still kind of learning this Regex. I tried searching MSDN for the syntax or some beginner examples of Regex for VB.NET but haven't found any. Could you give me a small example of how I'd use that with a string? Just curious, I'm not asking you to write the whole code (I'd never do that), just a small example that I could work with.
 
Check the sticky for info and tools; I use Regex Master.
C#:
Regex regex = new Regex(@"(?<=href="")\S+?(?="")");
Match match = regex.Match(htmlDocument);

Now you can use the match object to iterate through all the matches: check match.Success, then call match.NextMatch() to advance to the next one.
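To make the iteration concrete, here is a minimal sketch that collects every href value into a list. The class and method names (HrefExtractor, ExtractHrefs) are made up for illustration; the pattern is the one above, so it only handles double-quoted href values that contain no whitespace.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Hypothetical helper, not part of any library.
public static class HrefExtractor
{
    // Lookbehind for href=", lazy run of non-whitespace, lookahead for the closing quote.
    private static readonly Regex HrefRegex = new Regex(@"(?<=href="")\S+?(?="")");

    public static List<string> ExtractHrefs(string html)
    {
        var links = new List<string>();
        Match match = HrefRegex.Match(html);   // first match (or a failed Match)
        while (match.Success)
        {
            links.Add(match.Value);            // the text between the quotes
            match = match.NextMatch();         // continue after the previous match
        }
        return links;
    }
}
```

Calling HrefExtractor.ExtractHrefs(htmlDocument) returns the links in document order; Regex.Matches with a foreach loop would do the same job.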
 
Ahh, C# is good. I'll try to port it over to VB.NET. Thanks man, you rock. :cool:

Just curious, wouldn't that take every link on the page? I guess that works. I will figure out how to include just the links with .jpg endings; it shouldn't be too hard.
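One possible way to keep only the .jpg links, sketched in C# to match the earlier snippet: run the same href pattern and filter each hit with String.EndsWith. The names JpgLinkFilter and ExtractJpgLinks are hypothetical, and the case-insensitive comparison is an assumption about what counts as a .jpg link.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Hypothetical helper for illustration only.
public static class JpgLinkFilter
{
    public static List<string> ExtractJpgLinks(string html)
    {
        var jpgLinks = new List<string>();
        // Same href pattern as in the post above.
        foreach (Match m in Regex.Matches(html, @"(?<=href="")\S+?(?="")"))
        {
            // Keep the link only if it ends in .jpg (also accepts .JPG).
            if (m.Value.EndsWith(".jpg", StringComparison.OrdinalIgnoreCase))
            {
                jpgLinks.Add(m.Value);
            }
        }
        return jpgLinks;
    }
}
```

A link with a query string after the file name (for example pic.jpg?size=2) would be skipped by this check.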
 
You could just use the Document Object Model (you need to add the Microsoft MSHTML reference):

Code:
Dim I As Integer
Dim WDoc As HTMLDocument
Dim Wlval As HTMLAnchorElement
Dim nelements As Integer
Dim sHref As String
Dim sTitle As String
Dim sText As String

' Grab the loaded document and count its links
WDoc = WebBrowser1.Document
nelements = WDoc.links.length

For I = 0 To nelements - 1
    Wlval = WDoc.links.item(I)
    sHref = Wlval.href
    sText = Wlval.outerText
    sTitle = Wlval.title
    lstBox1.Items.Add(sHref)
    ' To see if it ends with .jpg, you could just do the following:
    If sHref.EndsWith(".jpg") Then
        lstBox2.Items.Add(sHref)
    End If
Next

By using this you can get so much information about a web page :)
 
neodammer said:
Does anybody know of a good regex function for extracting links from HTML code? I'm finding it hard given the various ways links can be written.

Here are a couple of good ones from http://www.regular-expressions.info/.
That site is a great reference for both new and experienced regex users.

// regexObj and matchObj are fields shared with the print method further down
regexObj = new Regex("<" + tagName + "[^>]*>(.*?)</" + tagName + ">");
matchObj = regexObj.Match(search);

You could then loop through the matches.
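As a rough sketch of that loop, the helper below returns the inner text of every tagName element it finds. TagContentFinder and FindContents are hypothetical names; Regex.Escape, the \b after the tag name (so "b" does not also match "body"), and the IgnoreCase and Singleline options are additions that the snippet above does not have.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Hypothetical helper for illustration only.
public static class TagContentFinder
{
    public static List<string> FindContents(string search, string tagName)
    {
        string tag = Regex.Escape(tagName);
        // Opening tag with optional attributes, lazy capture of the content, closing tag.
        var regexObj = new Regex(
            "<" + tag + @"\b[^>]*>(.*?)</" + tag + ">",
            RegexOptions.IgnoreCase | RegexOptions.Singleline);

        var contents = new List<string>();
        for (Match m = regexObj.Match(search); m.Success; m = m.NextMatch())
        {
            contents.Add(m.Groups[1].Value);   // group 1 is the text between the tags
        }
        return contents;
    }
}
```

For link extraction you would pass "a" as the tag name and get the link text back, not the href value.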

This next one does the same thing for any tag: it uses a backreference (\1) so the closing tag must match the opening tag's name, and a second group captures the text inside the tags.

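// IgnoreCase is needed because the pattern only lists A-Z; \b stops the tag name from matching a prefix of a longer name
regexObj = new Regex(@"<([A-Z][A-Z0-9]*)\b[^>]*>(.*?)</\1>", RegexOptions.IgnoreCase);
matchObj = regexObj.Match(search);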

You could write a generic print method to show all the results like this:

private void printMatch()
{
    // Regex.Match constructs and returns a Match object
    // You can query this object to get all possible information about the match
    while (matchObj.Success)
    {
        Console.WriteLine("Match offset: " + matchObj.Index.ToString() + "\r\n");
        Console.WriteLine("Match length: " + matchObj.Length.ToString() + "\r\n");
        Console.WriteLine("Matched text: " + matchObj.Value + "\r\n");
        if (matchObj.Groups.Count > 1)
        {
            // matchObj.Groups[0] holds the entire regex match also held by
            // matchObj itself. The other Group objects hold the matches for
            // capturing parentheses in the regex
            for (int i = 1; i < matchObj.Groups.Count; i++)
            {
                Group g = matchObj.Groups[i];
                if (g.Success)
                {
                    Console.WriteLine("Group " + i.ToString() +
                        " offset: " + g.Index.ToString() + "\r\n");
                    Console.WriteLine("Group " + i.ToString() +
                        " length: " + g.Length.ToString() + "\r\n");
                    Console.WriteLine("Group " + i.ToString() +
                        " text: " + g.Value + "\r\n");
                }
                else
                {
                    Console.WriteLine("Group " + i.ToString() +
                        " did not participate in the overall match\r\n");
                }
            }
        }
        else
        {
            Console.WriteLine("no backreferences/groups");
        }

        // Get the next match
        matchObj = matchObj.NextMatch();
    }
}

Neither of these gets tags within tags. You would need to loop through the captured groups (backreferences) and match again inside them to do that.
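One way to read that suggestion is to re-run the regex on the text captured by the second group, recursively. The sketch below does this and returns the tag names it finds, outer tags first. NestedTagFinder and FindAllTags are hypothetical names, and the \b, IgnoreCase and Singleline additions are assumptions on top of the pattern above.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Hypothetical helper for illustration only.
public static class NestedTagFinder
{
    // Group 1 is the tag name, group 2 the text between the opening and closing tag.
    private static readonly Regex TagRegex = new Regex(
        @"<([A-Z][A-Z0-9]*)\b[^>]*>(.*?)</\1>",
        RegexOptions.IgnoreCase | RegexOptions.Singleline);

    public static List<string> FindAllTags(string html)
    {
        var found = new List<string>();
        Collect(html, found);
        return found;
    }

    private static void Collect(string text, List<string> found)
    {
        foreach (Match m in TagRegex.Matches(text))
        {
            found.Add(m.Groups[1].Value);        // record this tag's name
            Collect(m.Groups[2].Value, found);   // then search inside its content
        }
    }
}
```

This still pairs an opening tag with the first closing tag of the same name, so a div nested inside another div is cut short; a real HTML parser or the DOM approach earlier in the thread avoids that.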
 