How do I use HtmlAgilityPack to retrieve only links from a website that start with http or https?

EDN Admin

I have this code:

[code]
private List<string> getLinks(HtmlAgilityPack.HtmlDocument document)
{
    List<string> mainLinks = new List<string>();
    var linkNodes = document.DocumentNode.SelectNodes("//a[@href]");
    if (linkNodes != null)
    {
        foreach (HtmlNode link in linkNodes)
        {
            var href = link.Attributes["href"].Value;
            mainLinks.Add(href);
        }
    }
    return mainLinks;
}
[/code]
Then I add the links I get to a List<string> like this:

[code]
private List<string> test(string url, int levels, DoWorkEventArgs eve)
{
    HtmlAgilityPack.HtmlDocument doc;
    HtmlWeb hw = new HtmlWeb();
    List<string> webSites;

    try
    {
        doc = hw.Load(url);
        webSites = getLinks(doc);
[/code]

The problem is that webSites sometimes contains links like "/", "/videos" or "//gifs".
From what I understand, those are subfolders of the site. For example, on www.google.com the link /videos points to www.google.com/videos.
What I want is for webSites to always contain only links to websites, like:

www.google.com
http://www.google.com
https://www.google.com

Only these kinds of links, and not subfolders/links like "/" or "/videos".
So how can I filter/check for this in the getLinks function?
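
One possible way to do this check, sketched under assumptions rather than offered as the definitive answer: parse each href with Uri.TryCreate and keep it only if the scheme is http or https. The class name LinkFilter and the method KeepHttpLinks are hypothetical helpers that are not part of the original code or of HtmlAgilityPack. The explicit scheme check matters because on some platforms a value like "/videos" can parse as an absolute file URI.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper, not part of the original post or of HtmlAgilityPack.
public static class LinkFilter
{
    // Returns only the hrefs that are absolute URLs with an http or https scheme.
    public static List<string> KeepHttpLinks(IEnumerable<string> hrefs)
    {
        var result = new List<string>();
        foreach (string href in hrefs)
        {
            if (string.IsNullOrWhiteSpace(href))
                continue;

            string candidate = href.Trim();
            Uri uri;

            // Relative links such as "/" or "/videos" either fail to parse as
            // absolute or parse with a non-http scheme, so both checks are needed.
            if (Uri.TryCreate(candidate, UriKind.Absolute, out uri)
                && (uri.Scheme == Uri.UriSchemeHttp || uri.Scheme == Uri.UriSchemeHttps))
            {
                result.Add(candidate);
            }
        }
        return result;
    }
}
```

In getLinks this could be applied at the end, for example `return LinkFilter.KeepHttpLinks(mainLinks);`. A bare link such as www.google.com has no scheme and would be dropped by this sketch. Keeping those would need an extra rule (for instance prefixing "http://"), which is an assumption about what the page author intended.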
