Hello to all,
I'm working in Visual C# Express 2005 (Whidbey) and I want to scan a plain-text file (mostly .html and other source pages) to get every URI/URL in it, whether absolute or relative.
Example:
<a href="/james/photos.htm">James's Pics</a>
<a href="http://www.jamespics.com">...</a>
I want to get all the URIs on the page. So far I only have a pattern for matching a single URI, and it doesn't work: it throws an exception.
Code:
string pattern = @"^(http|https|ftp)\://([a-zA-Z0-9\.\-]+(\:[a-zA-Z0-9\.&%\$\-]
+)*@)*((25[0-5]|2[0-4][0-9]|[0-1]{1}[0-9]{2}|[1-9]{1}[0-9]{1
}|[1-9])\.(25[0-5]|2[0-4][0-9]|[0-1]{1}[0-9]{2}|[1-9]{1}[0-9
]{1}|[1-9]|0)\.(25[0-5]|2[0-4][0-9]|[0-1]{1}[0-9]{2}|[1-9]{1
}[0-9]{1}|[1-9]|0)\.(25[0-5]|2[0-4][0-9]|[0-1]{1}[0-9]{2}|[1
-9]{1}[0-9]{1}|[0-9])|localhost|([a-zA-Z0-9\-]+\.)*[a-zA-Z0-
9\-]+\.(com|edu|gov|int|mil|net|org|biz|arpa|info|name|pro|a
ero|coop|museum|[a-zA-Z]{2}))(\:[0-9]+)*(/($|[a-zA-Z0-9\.\,\
?\\\\+&%\$#\=~_\-]+))*$";
Regex regexp = new Regex(pattern);
MatchCollection mc = regexp.Matches("http://www.yahoo.co.uk/nostream.php?acting=lolz", 0);
foreach (Match match in mc)
{
    MessageBox.Show(match.Value);
}
And the exception:
Code:
parsing \"^(http|https|ftp)\\://([a-zA-Z0-9\\.\\-]+(\\:[a-zA-Z0-9\\.&%\\$\\-]\r\n +)*@)*((25[0-5]|2[0-4][0-9]|[0-1]{1}[0-9]{2}|[1-9]{1}[0-9]{1\r\n }|[1-9])\\.(25[0-5]|2[0-4][0-9]|[0-1]{1}[0-9]{2}|[1-9]{1}[0-9\r\n ]{1}|[1-9]|0)\\.(25[0-5]|2[0-4][0-9]|[0-1]{1}[0-9]{2}|[1-9]{1\r\n }[0-9]{1}|[1-9]|0)\\.(25[0-5]|2[0-4][0-9]|[0-1]{1}[0-9]{2}|[1\r\n -9]{1}[0-9]{1}|[0-9])|localhost|([a-zA-Z0-9\\-]+\\.)*[a-zA-Z0-\r\n 9\\-]+\\.(com|edu|gov|int|mil|net|org|biz|arpa|info|name|pro|a\r\n ero|coop|museum|[a-zA-Z]{2}))(\\:[0-9]+)*(/($|[a-zA-Z0-9\\.\\,\\\r\n ?\\\\\\\\+&%\\$#\\=~_\\-]+))*$\" - [x-y] range in reverse order.
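The `\r\n` sequences in that message suggest the line breaks of the wrapped literal are part of the pattern itself, since a verbatim string (`@"..."`) keeps them. The sketch below is only an assumption about the cause, not a confirmed diagnosis. It uses a hypothetical helper `TryParse` and a shortened pattern: once a line break lands inside a class such as `[a-zA-Z0-9\-]` right after `0-`, the parser reads a range from `0` to a carriage return, which is in reverse order.

```csharp
using System;
using System.Text.RegularExpressions;

static class RangeErrorDemo
{
    // Hypothetical helper: returns null when the pattern parses,
    // otherwise the message reported by the regex parser.
    public static string TryParse(string pattern)
    {
        try
        {
            new Regex(pattern);
            return null;
        }
        catch (ArgumentException ex)
        {
            return ex.Message;
        }
    }

    // The character class from the long pattern, kept on one line.
    public const string OneLine = @"[a-zA-Z0-9\-]+";

    // The same class split after "0-". The "\r\n " stands in for the line
    // break and indentation that a wrapped verbatim string would keep.
    public static readonly string Wrapped = "[a-zA-Z0-" + "\r\n " + @"9\-]+";
}
```

Calling `RangeErrorDemo.TryParse(RangeErrorDemo.OneLine)` returns null, while the `Wrapped` variant returns a parser message. If this is indeed the cause, keeping the literal on one line, or joining single-line pieces with `+`, should avoid the exception.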
So, if someone has a solution to get all URIs, I'd be grateful. That means absolute ones in every style (http(s)://(www.)yahoo.co.uk/dir/page.php?var=none&var2=LOL, or without a scheme like www.hey-you.com/mister/james.php?page=pics, etc.) and also relative ones (e.g. ./dir/page.php?lol=no). In other words, we must get the CONTENT of href="CONTENT".
Thanks, I've tried but never succeeded in writing my own regexp :s
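To make the href part of the goal concrete, here is a rough sketch under stated assumptions. The names `HrefScanner`, `ExtractHrefs` and `Resolve` are hypothetical, the pattern only handles quoted `href` attributes, and the page address passed to `Resolve` is something the caller has to supply. It does not cover bare URLs in running text such as www.hey-you.com/mister/james.php?page=pics.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

static class HrefScanner
{
    // Simplified, assumed pattern: captures whatever sits between the
    // quotes of href="..." or href='...'. It does not validate the URI.
    private static readonly Regex HrefPattern = new Regex(
        "href\\s*=\\s*[\"']([^\"'>]+)[\"']",
        RegexOptions.IgnoreCase);

    // Returns the raw href values in document order, relative or absolute.
    public static List<string> ExtractHrefs(string html)
    {
        List<string> result = new List<string>();
        foreach (Match match in HrefPattern.Matches(html))
        {
            result.Add(match.Groups[1].Value);
        }
        return result;
    }

    // Turns a relative href into an absolute URI, given the address of the
    // page it was found on. An absolute href is returned unchanged.
    public static Uri Resolve(Uri pageUri, string href)
    {
        return new Uri(pageUri, href);
    }
}
```

Run on the two example links above, `ExtractHrefs` would give `/james/photos.htm` and `http://www.jamespics.com`. Capturing the attribute value first and leaving interpretation to `System.Uri` avoids having to encode every URI style in a single regular expression.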
Thanks a lot!