EDN Admin
Well-known member
I have this function that keeps calling itself, like a loop. It's called recursive.
[code]
private List<string> webCrawler(string url, int levels, DoWorkEventArgs eve)
{
this.Invoke(new MethodInvoker(delegate { label3.Text = (Int32.Parse(label12.Text) + Int32.Parse(label10.Text)).ToString(); }));
// TODO: use CancelAsync to abort the process, returning early without doing the work and without returning anything.
// To check about timeout when loading url/s
// to check the site familymediation.co.il
// *** To save/keep all settings like url change and levels to crawl and all checkboxes and options in designer to keep/save while program is running *** \
//levels = levelsToCrawl;
HtmlWeb hw = new HtmlWeb();
List<string> webSites;
List<string> csFiles = new List<string>();
csFiles.Add("temp string to know that something is happening in level = " + levels.ToString());
csFiles.Add("current site name in this level is : " + url);
/* later should be replaced with real cs files .. cs files links..*/
try
{
this.Invoke(new MethodInvoker(delegate { Texts(richTextBox1, "Loading The Url: " , Color.Red); }));
this.Invoke(new MethodInvoker(delegate { Texts(richTextBox1, url + "...",Color.Blue); }));
HtmlAgilityPack.HtmlDocument doc = TimeOut.getHtmlDocumentWebClient(url, false, "", 0, "", "");
/* if (doc == null)
{
this.Invoke(new MethodInvoker(delegate { Texts(richTextBox1, " Check The Link" + Environment.NewLine, Color.Green); }));
return csFiles;
}*/
//string html = doc.DocumentNode.InnerHtml;
//get Text
//string pageText = doc.DocumentNode.InnerText;
//doc = hw.Load(url);
this.Invoke(new MethodInvoker(delegate { Texts(richTextBox1, " Done " + Environment.NewLine, Color.Red); }));
currentCrawlingSite.Add(url);
webSites = getLinks(doc);
removeDupes(webSites);
removeDuplicates(webSites, currentCrawlingSite);
removeDuplicates(webSites, sitesToCrawl);
if (removeExt == true)
{
removeExternals(webSites);
}
if (downLoadImages == true)
{
webContent.retrieveImages(url); // TODO: when this image-retrieval call is enabled, the program stops crawling properly; check why.
}
// maybe something like this :
if (levels > 0)
sitesToCrawl.AddRange(webSites); // we want this list to grow, but not at the deepest level, because we are not going to dive any further from there
this.Invoke(new MethodInvoker(delegate { label7.Text = sitesToCrawl.Count.ToString(); }));
this.Invoke(new MethodInvoker(delegate { label12.Text = currentCrawlingSite.Count.ToString(); }));
/* TODO: call the duplicates function here with the current sites against the sites already visited,
then call it again with the same current sites against the second form-level list (the sites I am going to visit).
The webSites list holds the links I am going to visit; I add them to sitesToCrawl.
Filter/clean sites that were already seen when getting all the links here.

TODO:
webSites = FilterJunkLinks(webSites); // keeps only things that start with http or https, and maybe
removes the self site or other junk.
*/
if (levels == 0)
{
return csFiles;
}
else
{
for (int i = 0; i < webSites.Count(); i++)//&& i < 20; i++) // limiting ourselves to 20 sites for each level for now,
// or it will take forever.
{
//int mx = Math.Min(webSites.Count(), 20);
string t = webSites[i];
if ((t.StartsWith("http://") == true) || (t.StartsWith("https://") == true)) // replace this with future FilterJunkLinks function
{
csFiles.AddRange(webCrawler(t, levels - 1, eve));
}
}
return csFiles;
}
}
catch
{
failedUrls++;
this.Invoke(new MethodInvoker(delegate { label10.Text = failedUrls.ToString(); }));
this.Invoke(new MethodInvoker(delegate { Texts(richTextBox1, " Failed " + Environment.NewLine, Color.Green); }));
return csFiles;
}
}[/code]
<br/>
Now I have the BackgroundWorker DoWork event:
[code]
private void backgroundWorker1_DoWork(object sender, DoWorkEventArgs e)
{
    test(mainUrl, levelsToCrawl, e);
}
[/code]
And a button click event that starts the BackgroundWorker:
[code]
private void button1_Click(object sender, EventArgs e)
{
    backgroundWorker1.RunWorkerAsync();
    button1.Enabled = false;
    this.Text = "Processing...";
    label6.Text = "Processing...";
    label6.Visible = true;
    button2.Enabled = false;
    checkBox1.Enabled = false;
    checkBox2.Enabled = false;
    numericUpDown1.Enabled = false;
    button3.Enabled = true;
}
[/code]
Now I have a new, empty button click event in which I want to implement pause and resume. I also added a line at the top level of Form1, before the constructor:
[code]
System.Threading.ManualResetEvent _busy = new System.Threading.ManualResetEvent(false);
[/code]
I thought I could use this line for the pause and resume, but I'm not sure how to do it.
<hr class="sig">danieli
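One possible way to use a ManualResetEvent for this, shown as a minimal console sketch rather than the actual form code: the worker calls WaitOne() before each unit of work, a pause calls Reset(), and a resume calls Set(). Everything here apart from the `_busy` field is hypothetical (the `Crawl`, `Pause` and `Resume` methods and the simulated work stand in for webCrawler and the button handlers), and the event is assumed to start signaled (`true`) instead of `false`, so the worker does not block before the first pause.

```csharp
using System;
using System.Threading;

class PauseResumeSketch
{
    // Signaled means "keep running"; non-signaled means "paused".
    // Assumption: created with true, unlike the false in the question,
    // so the worker is not blocked before Pause() is ever called.
    static readonly ManualResetEvent _busy = new ManualResetEvent(true);

    static int _processed; // number of work items handled so far

    // Hypothetical stand-in for the crawler: one iteration = one URL.
    static void Crawl(int items)
    {
        for (int i = 0; i < items; i++)
        {
            _busy.WaitOne();                       // blocks here while paused
            Interlocked.Increment(ref _processed);
            Thread.Sleep(10);                      // simulate loading a page
        }
    }

    static void Pause()  { _busy.Reset(); }        // what a Pause button could call
    static void Resume() { _busy.Set(); }          // what a Resume button could call

    static void Main()
    {
        var worker = new Thread(() => Crawl(50));
        worker.Start();

        Thread.Sleep(100);
        Pause();
        Thread.Sleep(50);                          // let any in-flight item finish
        int atPause = Volatile.Read(ref _processed);
        Thread.Sleep(200);
        int advanced = Volatile.Read(ref _processed) - atPause;
        Console.WriteLine("Items processed while paused: " + advanced);

        Resume();
        worker.Join();
        Console.WriteLine("Total processed: " + _processed);
    }
}
```

In the form, the equivalent would presumably be a `_busy.WaitOne()` call near the top of webCrawler (or just before the recursive call inside the loop), with the new button's click handler calling `_busy.Reset()` to pause and `_busy.Set()` to resume. Because WaitOne() runs on the BackgroundWorker thread, the UI thread stays responsive while the crawl is paused.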