Finding text in an HTML document

amir100

Hi all.

Can anyone suggest a way to find specific text in an HTML document without confusing it with HTML tags and their attributes? For instance, if I search for the word "body", the match inside the <body> tag should be skipped.

Any help would be appreciated.

I've done some reading, but I can only find methods that strip the HTML tags, and in my case stripping tags is out of the question. I have to preserve the original HTML document.
 
Okay. At this point I've managed to skip the <head> part of the HTML document by starting to process the document right after the <body> tag. It's a straightforward solution, but I can't figure out any other way. :D

Anyway, the question persists. How do I tell whether the text I found is part of an HTML tag? For instance, if I search for "mytzixklyptomic", how do I know that an occurrence of that word is not inside a tag?

Anyone?

Even the slightest hint would help. So please help me. :D
 
Re: Text parsing

Thx for replying, MrPaul. I'd thought of the same solution. Still, you've managed to provide the technical detail on how to accomplish it. :D

I'll give it a shot.

Thx again.
 
Re: Text parsing

Almost forgot.

This wouldn't work if I'm dealing with documents containing mathematical equations that use < and >, right?

text here < text here ... mytext here ... text here > text here

I bet mytext would be considered part of an HTML tag. CMIIW.
 
Re: Text parsing

If the document has been properly created, then such characters should have been encoded, i.e. < and > would appear as &lt; and &gt;, so in that case it should still be OK.

Unfortunately if the document does contain such symbols in an un-encoded form it will make parsing of the file very difficult indeed.
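As a small illustration of the encoding point above, here is a sketch that assumes .NET's System.Net.WebUtility class is available (.NET 4.0 or later). EncodeDemo and Encode are made-up names for this example only.

```csharp
using System;
using System.Net;

static class EncodeDemo
{
    // Hypothetical helper: encodes characters such as < and > as entities
    // so they can no longer be mistaken for tag delimiters.
    public static string Encode(string text)
    {
        return WebUtility.HtmlEncode(text);
    }

    static void Main()
    {
        // Prints: 8 &lt; 9 and 9 &gt; 8
        Console.WriteLine(Encode("8 < 9 and 9 > 8"));
    }
}
```

Text that was encoded this way when the document was written contains no raw angle brackets, so a simple "am I between < and >" check stays reliable.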
 
Re: Text parsing

The code from MrPaul works perfectly. It is a fairly basic approach, but right now I'll go with that. Thx again, MrPaul.

It is true, just like PlausiblyDamp said, that if my HTML document were properly created then having a document with equations won't be a problem.

The complete idea for finding text in an HTML document would be:
- Skip past the <body> tag.
- From that point, start searching for the text.
- Use the method proposed by MrPaul to determine whether each match is part of an HTML tag or not.
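
The three steps above can be sketched roughly as follows. MrPaul's actual code is not shown in this thread, so the tag check here is an assumption: a match counts as "inside a tag" when the nearest angle bracket before it is an unclosed <. HtmlTextFinder, FindInText and IsInsideTag are hypothetical names.

```csharp
using System;
using System.Collections.Generic;

static class HtmlTextFinder
{
    // Assumed check (not MrPaul's code): the position is inside a tag
    // when the last '<' before it comes after the last '>' before it.
    static bool IsInsideTag(string html, int index)
    {
        int lastOpen = html.LastIndexOf('<', index);
        int lastClose = html.LastIndexOf('>', index);
        return lastOpen > lastClose;
    }

    // Returns the start index of every match of term that lies outside a tag.
    public static List<int> FindInText(string html, string term)
    {
        var hits = new List<int>();
        if (string.IsNullOrEmpty(term)) return hits;

        // Step 1: skip everything before the <body> tag (or start at 0 if absent).
        int start = html.IndexOf("<body", StringComparison.OrdinalIgnoreCase);
        if (start < 0) start = 0;

        // Steps 2 and 3: search from there and keep only matches outside tags.
        int pos = html.IndexOf(term, start, StringComparison.OrdinalIgnoreCase);
        while (pos >= 0)
        {
            if (!IsInsideTag(html, pos)) hits.Add(pos);
            pos = html.IndexOf(term, pos + term.Length, StringComparison.OrdinalIgnoreCase);
        }
        return hits;
    }

    static void Main()
    {
        string html = "<html><head><title>body</title></head>" +
                      "<body class=\"body\"><p>The body text</p></body></html>";
        foreach (int hit in FindInText(html, "body"))
            Console.WriteLine("Match at index " + hit);
    }
}
```

Because the document string itself is never modified, the original HTML is preserved. The sketch does not handle un-encoded < or > in the text, nor comments or script blocks.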

Well then. That wraps it up.

Thank you all.
 
Re: Text parsing

Sorry for not thinking of this sooner but what about using a DOM parser to extract the data?

Using a DOM parser would avoid the issues caused by stray < and > characters scattered throughout the text.
 
Re: Text parsing

I've never really used a DOM parser before. I don't even know which one you're referring to, mskeel. :D In any case, I don't think a DOM parser is really what I need. Care to explain how I would use a DOM parser in my case?
 
This code still has a limitation: a stray > in the text will work, but a stray < will not. Both should be converted into &gt; and &lt; anyway. You can probably work around this, but I'm just using the default behaviors. The code should look similar to this.

This is pretty quick and dirty, but it demonstrates the basics of how you might use a Document Object Model parser to extract the information you want.

Let me know if you have any questions.
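
For readers without the attachment, here is a minimal sketch of the approach described above. It is not the attached code: it assumes well-formed (XHTML-style) markup and uses XmlReader from System.Xml, and TextNodeDemo and ExtractText are made-up names.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Xml;

static class TextNodeDemo
{
    // Returns the content of every text node, in document order.
    // Tag names and attribute values are never reported as text.
    public static List<string> ExtractText(string xhtml)
    {
        var pieces = new List<string>();
        // Assumption: skip any DOCTYPE instead of failing on it.
        var settings = new XmlReaderSettings { DtdProcessing = DtdProcessing.Ignore };
        using (XmlReader reader = XmlReader.Create(new StringReader(xhtml), settings))
        {
            while (reader.Read())
            {
                if (reader.NodeType == XmlNodeType.Text)
                    pieces.Add(reader.Value);
            }
        }
        return pieces;
    }

    static void Main()
    {
        string sample = "<html><body><p class=\"body\">the body text: 8 > 9</p></body></html>";
        foreach (string piece in ExtractText(sample))
            Console.WriteLine(piece);
    }
}
```

With the default reader behavior a stray > inside text is accepted, while a stray < makes the reader throw an XmlException, which matches the limitation mentioned above.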
 

This feels nostalgic. The first time I used an XmlTextReader, it sure gave me a hard time. But after looking at the link you provided, it turns out that an XmlTextReader really is that simple to use. I guess at that time I was really lacking references.

Anyway, about your code: I have to say that it works fine. I don't really have anything to ask. I get the big picture.

[CS]
private void backgroundWorker1_ProgressChanged(object sender, ProgressChangedEventArgs e)
{
    // Append each reported text element, followed by a visible separator.
    this.textBox1.Text += e.UserState.ToString() + " ---- ";
}
[/CS]

I altered your code a bit. As you can see above, I've added a simple concatenation to separate the text elements that you process. Using your sample.html, I got this result after running your code.

Tri-Corner Humor Web Shoppe ---- Home of the original HA! HA! Guy Whiteboard ----
In the year of our Lord two thousand and seven, we at Tri-Corner Humor bear somber witness to the future of door enhancements: the world's first HA! HA! Guy Whiteboard!
---- See HA! HA! Whiteboards in action! ---- Win free whiteboards throughout the month of April by sending questions to the HA! HA! Guy!! ----
A ---- blockbuster Internet phenomenon ---- spanning more than two years, HA! HA! Guy's iron grip on the groin of our collective imagination is as strong as ever. The secret to the Quaker's staying power is his unique delivery, one which is certain to ---- delight even the most cynical jerkface. ---- Regular checkups help detect polyps! ----
Like having your very own incarnation of the Dalai Lama, a HA! HA! Guy Whiteboard is there when you need a little extra something in a delicate situation. ---- This offering is your ticket to a new world, one unshackled from the demands of flowery diplomacy and tact. ---- We envision a future where every major business, political, and medical transaction occurs through the ritual exchange of HA! HA!s.
---- He'll understand. ----
Before this product was available you would have had to hot glue a laptop to your door if you wanted to use the HA! HA! Guy to let your roommate know you ate the last yogurt and his cobra escaped while you were playing with it. Now you can spare the laptop and spoil yourself with this handsome offering for only ---- $9.95! ---- Yes! Look me in the eye and tell me you haven't spent more and gotten a whole lot less. ---- We promise that this will be the best online whiteboard impulse buy you will ever make! ----

When you order now you will receive:
---- One HA! HA! Guy Whiteboard. ---- One stylish black whiteboard marker. ---- Two heavy duty epoxy strips for affixing your whiteboard to ---- consenting ---- surfaces. ---- Packing material suitable for preserving your whiteboard in "mint" condition for collector's purposes. ----

this is a test for stray greater thans: 8 > 9 8 > 9 8 > 9

---- Shopping Cart ---- | ---- Policies ---- | ---- About ---- | ---- 2007 Tri-corner Humor. All Rights Reserved. ----

As I said earlier, your code works fine as a parser for well-formed HTML. But this is not what I really need.

I need a library that finds the right word or phrase in an HTML document and replaces it with an appropriate replacement. I must do that without changing anything else in the HTML document.

Here's an example. I want to change all occurrences of "when" in your sample.html. My library then has to produce sample.html with every "when" replaced.

I'm thinking of ways to use your code to develop the library I need, but I'm kinda stuck here. Any idea how to achieve my goal using your code?
 