Need RegExp for href="<THIS>" !

MoPraL

Member
Joined
Aug 28, 2004
Messages
16
Hello to all,

I'm working in Visual C# Express 2005 (Whidbey) and I want to scan a plain-text file (mainly .html and other source pages) to get every URI/URL in it, whether relative or absolute.
Example:

<a href="/james/photos.htm">James's Pics</a>
<a href="http://www.jamespics.com">...</a>

I want to get every URI on the page! But I only have a pattern for matching absolute URIs, and it doesn't work; it throws an exception:
Code:
            string pattern = @"^(http|https|ftp)\://([a-zA-Z0-9\.\-]+(\:[a-zA-Z0-9\.&%\$\-]
            +)*@)*((25[0-5]|2[0-4][0-9]|[0-1]{1}[0-9]{2}|[1-9]{1}[0-9]{1
            }|[1-9])\.(25[0-5]|2[0-4][0-9]|[0-1]{1}[0-9]{2}|[1-9]{1}[0-9
            ]{1}|[1-9]|0)\.(25[0-5]|2[0-4][0-9]|[0-1]{1}[0-9]{2}|[1-9]{1
            }[0-9]{1}|[1-9]|0)\.(25[0-5]|2[0-4][0-9]|[0-1]{1}[0-9]{2}|[1
            -9]{1}[0-9]{1}|[0-9])|localhost|([a-zA-Z0-9\-]+\.)*[a-zA-Z0-
            9\-]+\.(com|edu|gov|int|mil|net|org|biz|arpa|info|name|pro|a
            ero|coop|museum|[a-zA-Z]{2}))(\:[0-9]+)*(/($|[a-zA-Z0-9\.\,\
            ?\\\\+&%\$#\=~_\-]+))*$";

            Regex regexp = new Regex(pattern);
            MatchCollection mc = regexp.Matches("http://www.yahoo.co.uk/nostream.php?acting=lolz", 0);

            foreach(Match match in mc)
            {
                MessageBox.Show(match.Value);
            }
And the exception:
Code:
parsing \"^(http|https|ftp)\\://([a-zA-Z0-9\\.\\-]+(\\:[a-zA-Z0-9\\.&%\\$\\-]\r\n            +)*@)*((25[0-5]|2[0-4][0-9]|[0-1]{1}[0-9]{2}|[1-9]{1}[0-9]{1\r\n            }|[1-9])\\.(25[0-5]|2[0-4][0-9]|[0-1]{1}[0-9]{2}|[1-9]{1}[0-9\r\n            ]{1}|[1-9]|0)\\.(25[0-5]|2[0-4][0-9]|[0-1]{1}[0-9]{2}|[1-9]{1\r\n            }[0-9]{1}|[1-9]|0)\\.(25[0-5]|2[0-4][0-9]|[0-1]{1}[0-9]{2}|[1\r\n            -9]{1}[0-9]{1}|[0-9])|localhost|([a-zA-Z0-9\\-]+\\.)*[a-zA-Z0-\r\n            9\\-]+\\.(com|edu|gov|int|mil|net|org|biz|arpa|info|name|pro|a\r\n            ero|coop|museum|[a-zA-Z]{2}))(\\:[0-9]+)*(/($|[a-zA-Z0-9\\.\\,\\\r\n            ?\\\\\\\\+&%\\$#\\=~_\\-]+))*$\" - [x-y] range in reverse order.
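
A likely cause, judging from the message itself: the pattern was pasted into a verbatim (@"...") string across several lines, so each line break and its indentation became part of the pattern (the \r\n sequences in the message show this). Where a break falls inside a character class, as in [a-zA-Z0- followed on the next line by 9\-], the class now contains the range 0-\r, and '\r' sorts before '0', which is what "[x-y] range in reverse order" complains about. The sketch below reproduces that on a small fragment only; the class name PatternLineBreakDemo and the helper Compiles are made up for illustration.

```csharp
using System;
using System.Text.RegularExpressions;

public static class PatternLineBreakDemo
{
    // Hypothetical helper: true if Regex accepts the pattern, false if it rejects it.
    public static bool Compiles(string pattern)
    {
        try
        {
            new Regex(pattern);
            return true;
        }
        catch (ArgumentException)
        {
            return false;
        }
    }

    public static void Main()
    {
        // What the verbatim string effectively contains: a line break and
        // indentation in the middle of the character class.
        string broken = "[a-zA-Z0-\r\n            9\\-]+";

        // The same fragment built from adjacent literals, with no line break inside.
        string joined = @"[a-zA-Z0-" +
                        @"9\-]+";

        Console.WriteLine(Compiles(broken)); // False
        Console.WriteLine(Compiles(joined)); // True
    }
}
```

If that is indeed the cause, writing the whole pattern on one line, or concatenating the pieces with + as above, should let it parse. RegexOptions.IgnorePatternWhitespace is unlikely to help on its own, because white space inside a character class is still taken literally.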

So, does anyone have a solution that gets all URIs, in every style (http(s)://(www.)yahoo.co.uk/dir/page.php?var=none&var2=LOL, and others like www.hey-you.com/mister/james.php?page=pics, etc.), as well as relative URIs (e.g. ./dir/page.php?lol=no)? In other words, we need to get the CONTENT of href="CONTENT".

Thanks. I've tried but never succeeded in writing my own regexp :s

Thanks a lot!
 
Once you have your whole HTML file in a string variable, you could run that string through the following regular expression:
C#:
Regex reg = new Regex("<a.*?(href=(\"|').*?(\"|'))+.*?>");
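
For getting just the href content rather than the whole tag, a capture group can do the slicing. The following is a minimal sketch under stated assumptions, not a full HTML parser: the pattern is a tightened variant and not the exact expression above, the names HrefExtractor, ExtractHrefs and the group name url are invented for the example, and unquoted href values are not handled.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class HrefExtractor
{
    // Assumed pattern: "<a", white space, any other attributes, then href = quoted value.
    // Group 1 remembers which quote opened the value; group "url" is the value itself.
    private static readonly Regex HrefPattern = new Regex(
        "<a\\s[^>]*?href\\s*=\\s*([\"'])(?<url>.*?)\\1",
        RegexOptions.IgnoreCase | RegexOptions.Singleline);

    // Returns every quoted href value found in <a> tags, in document order.
    public static List<string> ExtractHrefs(string html)
    {
        List<string> urls = new List<string>();
        foreach (Match m in HrefPattern.Matches(html))
        {
            urls.Add(m.Groups["url"].Value);
        }
        return urls;
    }

    public static void Main()
    {
        string html = "<a href=\"/james/photos.htm\">James's Pics</a>\n" +
                      "<a href='http://www.jamespics.com'>...</a>";

        // Prints /james/photos.htm and then http://www.jamespics.com
        foreach (string url in ExtractHrefs(html))
        {
            Console.WriteLine(url);
        }
    }
}
```

Because RegexOptions.IgnoreCase is used instead of lowercasing the input, the URLs come back with their original casing, and relative and absolute links are returned alike.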

I used a specialized function in my program that only looked for JPEG and GIF files, so here it is; you only have to change the specialized part:

C#:
// Requires: using System.Collections; using System.IO; using System.Text.RegularExpressions;
private string[] GetImgListFromHtml(string html)
{
    // Match every <a ...> tag that carries a quoted href (single or double quotes).
    Regex reg = new Regex("<a.*?(href=(\"|').*?(\"|'))+.*?>");
    // Note: lowercasing the input means the returned URLs are lowercased too.
    MatchCollection ms = reg.Matches(html.ToLower());
    ArrayList found = new ArrayList();
    for (int i = 0; i < ms.Count; i++)
    {
        string elem = ms[i].Value;
        // Look for a single-quoted href first, then fall back to a double-quoted one.
        char quote = '\'';
        int ihrefB = elem.IndexOf("href='", 1);
        if (ihrefB == -1)
        {
            ihrefB = elem.IndexOf("href=\"", 1);
            quote = '"';
        }
        if (ihrefB == -1)
            continue;
        // The value starts 6 characters in: "href=" plus the opening quote.
        int ihrefE = elem.IndexOf(quote, ihrefB + 6);
        if (ihrefE == -1)
            continue;
        elem = elem.Substring(ihrefB + 6, ihrefE - ihrefB - 6);
        // Specialized part: keep only JPEG and GIF links. Change this test to keep other URLs.
        string ext = Path.GetExtension(elem);
        if (ext == ".jpg" || ext == ".gif")
            found.Add(elem);
    }
    return (string[])found.ToArray(typeof(string));
}

N.B.: Sorry if the programming is not perfect. I was only trying to make it work; however, it does work without any problem.

Let me know how it goes.
 