Regex Question

darknuke

I am trying to get data between HTML tags, but I am doing it wrong, as it is returning nearly all the source. :(

I modified the MSDN example to try and do it, but no success...

Code:
        Dim r As System.Text.RegularExpressions.Regex
        Dim m As System.Text.RegularExpressions.Match

        r = New System.Text.RegularExpressions.Regex("<td.*>(.*)</td>", _
             System.Text.RegularExpressions.RegexOptions.IgnoreCase Or System.Text.RegularExpressions.RegexOptions.Compiled)

        m = r.Match(inputString)

        While m.Success
            MsgBox(m.Groups(1).Value.ToString)
            m = m.NextMatch()
        End While

I am trying to get what is in between <td (attributes here)> and </td>... what am I doing wrong?
 
Code:
        Dim r As System.Text.RegularExpressions.Regex
        Dim m As System.Text.RegularExpressions.Match

        ' [^>]* also accepts a bare <td>; [^<]+ captures text up to the next tag.
        r = New System.Text.RegularExpressions.Regex("<td[^>]*>([^<]+)</td>", _
             System.Text.RegularExpressions.RegexOptions.IgnoreCase Or System.Text.RegularExpressions.RegexOptions.Compiled)

        m = r.Match(inputString)

        While m.Success
            MsgBox(m.Groups(1).Value.ToString)
            m = m.NextMatch()
        End While

^^ try that ^^

Hope this helps!

Andreas
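
For anyone wondering why the first pattern grabbed so much: `.*` is greedy, so it runs as far as it can and only backs off to the last `>` that still lets the rest of the pattern match. Negated classes such as `[^>]` and `[^<]` cannot cross a tag boundary, which is why they behave. Here is a minimal sketch of the difference; the sample string and the module name are made up for illustration, not taken from the real page.

```vbnet
Imports System
Imports System.Text.RegularExpressions

Module GreedyDemo
    Sub Main()
        ' Hypothetical one-line sample with two cells.
        Dim html As String = "<td class=""a"">one</td><td class=""b"">two</td>"

        ' Greedy: the first .* swallows everything up to the last usable ">".
        Dim greedy As Match = Regex.Match(html, "<td.*>(.*)</td>", RegexOptions.IgnoreCase)
        Console.WriteLine(greedy.Value)            ' the whole string, both cells
        Console.WriteLine(greedy.Groups(1).Value)  ' two

        ' Negated classes stop at tag boundaries, so each cell matches on its own.
        For Each m As Match In Regex.Matches(html, "<td[^>]*>([^<]*)</td>", RegexOptions.IgnoreCase)
            Console.WriteLine(m.Groups(1).Value)   ' one, then two
        Next
    End Sub
End Module
```

The single greedy match covers both cells, while the negated-class pattern yields one match per cell.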
 
One problem: there are HTML tags between the <td> tags :( ... How can I include anything that appears between the tags?
 
Trying to get eBay listings:

(I have already looked at the eBay developer SDK and at example source.)

Stuff like (full source is huuuge):

Code:
<tr bgcolor="#eeeeee">
 <td valign="center" align="middle" width="12%" rowspan="">
   <a href="http://cgi.ebay.com/ws/eBayISAPI.dll?ViewItem&amp;item=3660198719&amp;category=51347">
   <img height="64" width="64" border="0" src="http://thumbs.ebaystatic.com/pict/36601987196464.jpg" alt="**NEW** Disney Toy Story 2 Activity Studio CD"></a></td>
 <td valign="top">
  <font size="3">
  <a href="http://cgi.ebay.com/ws/eBayISAPI.dll?ViewItem&amp;item=3660198719&amp;category=51347"> **NEW** Disney Toy Story 2 Activity Studio CD </a>
  </font>?
  <img src="http://pics.ebaystatic.com/aw/pics/paypal/logo_paypalPPBuyerProtection_28x16.gif" alt="PayPal Buyer Protection Program" border="0" width="28" height="16">
  <br>
  <img height="1" width="200" border="0" alt="" src="http://pics.ebaystatic.com/aw/pics/s.gif"></td>

The regex should return:

Code:
 <a href="http://cgi.ebay.com/ws/eBayISAPI.dll?ViewItem&amp;item=3660198719&amp;category=51347">
   <img height="64" width="64" border="0" src="http://thumbs.ebaystatic.com/pict/36601987196464.jpg" alt="**NEW** Disney Toy Story 2 Activity Studio CD"></a>

and

Code:
  <font size="3">
  <a href="http://cgi.ebay.com/ws/eBayISAPI.dll?ViewItem&amp;item=3660198719&amp;category=51347"> **NEW** Disney Toy Story 2 Activity Studio CD </a>
  </font>?
  <img src="http://pics.ebaystatic.com/aw/pics/paypal/logo_paypalPPBuyerProtection_28x16.gif" alt="PayPal Buyer Protection Program" border="0" width="28" height="16">
  <br>
  <img height="1" width="200" border="0" alt="" src="http://pics.ebaystatic.com/aw/pics/s.gif">
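
One way to get output like that with a regex alone is a lazy capture plus the Singleline option. What follows is only a sketch under assumptions: the HTML is a shortened, made-up stand-in for the rows above, and `CellContents` is a hypothetical name. It also assumes cells never contain a nested table, since the match ends at the first `</td>` it reaches.

```vbnet
Imports System
Imports System.Text.RegularExpressions

Module CellContents
    Sub Main()
        ' Hypothetical, shortened stand-in for the eBay rows.
        Dim nl As String = Environment.NewLine
        Dim html As String = _
            "<tr>" & nl & _
            " <td valign=""center"">" & nl & _
            "   <a href=""item1""><img src=""thumb.jpg""></a></td>" & nl & _
            " <td valign=""top"">" & nl & _
            "  <font size=""3""><a href=""item1""> Title </a></font>" & nl & _
            "  <br></td>" & nl & _
            "</tr>"

        ' (.*?) is lazy: it stops at the first </td> it can reach.
        ' Singleline lets "." match line breaks, so a cell may span several lines.
        Dim r As New Regex("<td\b[^>]*>(.*?)</td>", _
            RegexOptions.IgnoreCase Or RegexOptions.Singleline)

        ' Print each cell's inner markup, child tags included.
        For Each m As Match In r.Matches(html)
            Console.WriteLine("[" & m.Groups(1).Value.Trim() & "]")
        Next
    End Sub
End Module
```

On this sample the loop finds two matches, one per cell, each holding the markup between the opening and closing td tags.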
 
Do you know C#?

What I would do is use mshtml to parse the page, look for the table that has the listings, and then iterate through each row and do whatever you want with the data. I made an object that downloads a web page, removes the scripting, and creates an HTML document. From there you can call HtmlDocument.getElementsByTagName("TABLE") to retrieve all the HTML tables in the page, find the table that has the rows you want, and then iterate through each row.

The thing is, I did it in C#, and you're using VB. For this to work, I could 1) give you the DLL, or 2) give you the C# code, and you use the C# compiler to build the DLL and reference it from your project. Or you can try to convert the C# code to VB.NET.
 
I don't have MSHTML as far as I know. I don't have C#; I got VB.NET 2003 as a stand-alone package.
 
Originally posted by darknuke
I don't have MSHTML as far as I know. I don't have C#; I got VB.NET 2003 as a stand-alone package.

Sure you do, Project -> Add Reference -> COM -> Microsoft HTML Object Library.
 
Originally posted by darknuke
How can I put a string (HTML) into the HTMLDocumentClass class...

You need to save the contents of the string to disk first, then load the document from that file.
C# code:
Code:
HTMLDocumentClass htmlDoc = new HTMLDocumentClass();
// The document object implements IPersistFile; cast to it to load from a file on disk.
System.Runtime.InteropServices.UCOMIPersistFile pf = (System.Runtime.InteropServices.UCOMIPersistFile) htmlDoc;
pf.Load(filename, 0);
// Loading is asynchronous: pump messages until the document has finished parsing.
while(htmlDoc.body == null)
	System.Windows.Forms.Application.DoEvents();
while(htmlDoc.readyState != "complete")
	System.Windows.Forms.Application.DoEvents();
 
Originally posted by darknuke
I don't have C#; I got VB.NET 2003 as a stand-alone package.

:D

*screams wildly* I don't know what I'm doing wrong :(

Code:
        Dim htmlDoc As New mshtml.HTMLDocument
        Dim pf As System.Runtime.InteropServices.UCOMIPersistFile
        pf.Load("c:\eBay.html", 0)
        htmlDoc = pf

I get an error on the pf.Load line... (unhandled exception: object reference not set to an instance of an object)
 
Would someone please show me an example use of MSHTML that is in VB.NET (that does not require me to use a browser control, if possible)?
 
Change this:
Code:
        Dim htmlDoc As New mshtml.HTMLDocument
        Dim pf As System.Runtime.InteropServices.UCOMIPersistFile
        pf.Load("c:\eBay.html", 0)
        htmlDoc = pf

...to this:

Code:
        Dim htmlDoc As New mshtml.HTMLDocument
        Dim pf As System.Runtime.InteropServices.UCOMIPersistFile
        ' Added line: pf was Nothing before, which caused the exception on pf.Load.
        pf = CType(htmlDoc, System.Runtime.InteropServices.UCOMIPersistFile)
        pf.Load("c:\eBay.html", 0)
        htmlDoc = pf

Hope this helps!

Andreas
 
Remove scripting code

Hi,

If you are still offering, I would be grateful to receive the C# code you mentioned below.

Many thanks.

HJB417 said:
Do you know C#?

What I would do is use mshtml to parse the page, look for the table that has the listings, and then iterate through each row and do whatever you want with the data. I made an object that downloads a web page, removes the scripting, and creates an HTML document. From there you can call HtmlDocument.getElementsByTagName("TABLE") to retrieve all the HTML tables in the page, find the table that has the rows you want, and then iterate through each row.

The thing is, I did it in C#, and you're using VB. For this to work, I could 1) give you the DLL, or 2) give you the C# code, and you use the C# compiler to build the DLL and reference it from your project. Or you can try to convert the C# code to VB.NET.
 
You will need to add references to mshtml and System.Windows.Forms.

Code:
using System;
using System.Diagnostics;
using System.Text;
using System.Text.RegularExpressions;
using System.IO;
using mshtml;

namespace HB.Net
{
	/// <summary>
	/// Creates a managed wrapper for a <see cref="mshtml.HTMLDocument"/> object.
	/// </summary>
	public class HtmlDocument : IDisposable
	{

		private bool _deleteWhenDone;

		/// <summary>
		/// The underlying <see cref="mshtml.HTMLDocument"/>.
		/// </summary>
		public readonly HTMLDocument MsHtmlDoc;
		
		/// <summary>
		/// The file path of the downloaded html document.
		/// </summary>
		public readonly string LocalPath;

		private bool _disposed;

		/// <summary>
		/// The content of the webpage.
		/// </summary>
		public readonly string AsciiData;
		
		private static readonly Regex ScriptParser;
		private static readonly Regex FileExtRemover;
		
		static HtmlDocument()
		{
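			string[] tags = new string[] {"script", /*"style", */"object", "head", "map", "iframe", "javascript"};
			// Match <tag ...> ... </tag>: \b[^>]* allows attributes, and \1 requires the same tag name to close.
			string scriptParserPattern = @"<(" + string.Join("|", tags) + @")\b[^>]*>.*?</\1\s*>";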
			ScriptParser = new Regex(scriptParserPattern, RegexOptions.Compiled | RegexOptions.IgnoreCase | RegexOptions.Singleline);
			FileExtRemover = new Regex(@"\.\w+$", RegexOptions.Compiled);
		}

		/// <summary>
		/// Creates a <see cref="HtmlDocument"/> from the binary data of a webpage.
		/// </summary>
		/// <param name="data">The binary data of the webpage</param>
		/// <param name="removeScripting">true to remove javascript.</param>
		public HtmlDocument(byte[] data, bool removeScripting)
			: this(CreateFile(data), true, removeScripting)
		{
		}

		public HtmlDocument(string html, bool removeScripting)
			: this(html, removeScripting, Encoding.ASCII)
		{
		}

		public HtmlDocument(string html, bool removeScripting, Encoding encoding)
			: this(encoding.GetBytes(html), removeScripting)
		{
		}

		/// <summary>
		/// Creates a <see cref="HtmlDocument"/> from a webpage file.
		/// </summary>
		/// <param name="filename">The file path of the webpage.</param>
		/// <param name="deleteFile">true to delete the file on dispose.</param>
		/// <param name="removeScripting">set to true to remove script tags.</param>
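		public HtmlDocument(string filename, bool deleteFile, bool removeScripting)
		{
			_disposed = false;
			// Honor the caller's choice; this was previously hard-coded to true.
			_deleteWhenDone = deleteFile;
			LocalPath = filename;
			try
			{
				if(removeScripting)
					Preparse(filename);
				MsHtmlDoc = CreateHTMLDocument(out AsciiData);
			}
			catch
			{
				// On failure, clean up only files this object is allowed to delete.
				if(deleteFile)
				{
					try
					{
						File.Delete(filename);
					}
					catch{}
				}
				throw;
			}
		}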

		/// <summary>
		/// Deletes the webpage and closes the underlying <see cref="mshtml.HTMLDocument"/> object.
		/// </summary>
		public void Dispose()
		{
			if(_disposed)
				return;
			MsHtmlDoc.close();
			if(_deleteWhenDone)
			{
				try
				{
					File.Delete(LocalPath);
				}
				catch{}
			}
			_disposed = true;
			GC.SuppressFinalize(this);
		}

		~HtmlDocument()
		{
			try
			{
				Dispose();
			}
			catch{}
		}

		/// <summary>
		/// Creates a HTMLDocument.
		/// </summary>
		private HTMLDocumentClass CreateHTMLDocument(out string asciiData)
		{
			byte[] _htmlData;
			FileStream file = File.OpenRead(LocalPath);
			try
			{
				_htmlData = new byte[file.Length];
				for(int read = 0; read < file.Length;)
					read+=file.Read(_htmlData, read, (int)(file.Length - read));
			}
			finally
			{
				file.Close();
			}
			HTMLDocumentClass htmlDoc = new HTMLDocumentClass();
			try
			{
				System.Runtime.InteropServices.UCOMIPersistFile pf = (System.Runtime.InteropServices.UCOMIPersistFile) htmlDoc;
				pf.Load(LocalPath, 0);
				while(htmlDoc.body == null)
					System.Windows.Forms.Application.DoEvents();
				while(htmlDoc.readyState != "complete")
					System.Windows.Forms.Application.DoEvents();
				asciiData = Encoding.ASCII.GetString(_htmlData);
			}
			catch(Exception e)
			{
				htmlDoc.close();
				throw new ApplicationException("An error occurred while creating a mshtml.HTMLDocumentClass object.", e);
			}
			return htmlDoc;
		}

		/// <summary>
		/// Removes scripting from an HTML file.
		/// </summary>
		/// <param name="filename">The path of the file.</param>
		public void Preparse(string filename)
		{
			//read in txt file
			TextReader file = File.OpenText(filename);
			string text = null;
			try
			{
				text = file.ReadToEnd();
				text = ScriptParser.Replace(text, "");
			}
			finally
			{
				file.Close();
			}
			TextWriter output = File.CreateText(filename);
			try
			{
				output.Write(text);
				output.Flush();
			}
			finally
			{
				output.Close();
			}
		}

		private static string CreateTempHtmlFile()
		{
			while(true)
			{
				string filename = Path.GetTempFileName();
				try
				{
					string htmlFileName = FileExtRemover.Replace(filename, ".html");
					File.Move(filename, htmlFileName);
					return htmlFileName;
				}
				catch
				{
					File.Delete(filename);
				}
			}
		}

		/// <summary>
		/// Creates a html file from an array of bytes.
		/// </summary>
		/// <param name="data">The array of bytes to create the data from.</param>
		private static string CreateFile(byte[] data)
		{
			string filename = CreateTempHtmlFile();
			FileStream file = File.OpenWrite(filename);
			try
			{
				file.Write(data, 0, data.Length);
				file.Flush();
				return filename;
			}
			catch
			{
				try
				{
					File.Delete(filename);
				}
				catch{}
				throw;
			}
			finally
			{
				file.Close();
			}
		}

		/// <summary>
		/// Returns the content of the html document.
		/// </summary>
		/// <returns>The content of the html document.</returns>
		[System.Diagnostics.DebuggerStepThrough]
		public override string ToString()
		{
			return AsciiData;
		}
	}
}

edit: cleaned up the code.
 