C#: How to download web pages in parallel, fast yet correctly

zydjohn
Hello:

I have to log in to one website and browse many similar web pages, with URLs like these: https://www.myweb.com/page1/ ... https://www.myweb.com/page100/

The number of pages varies from time to time: sometimes there is only one page, but sometimes there are 400+ pages.

I want to use HttpClient to download all the pages, and to save time I want to use a Parallel loop. However, since each HttpClient has to carry the cookies obtained from the login page, I have to create every HttpClient with the necessary cookies.
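
For reference, the same login cookies could also be carried in a CookieContainer on the handler instead of a raw Cookie header; here is a minimal sketch of that alternative (the cookie values are the ones from my code below):

    // Sketch: attach the login cookies through a CookieContainer
    // rather than a hand-built "Cookie" header.
    CookieContainer cookie_jar = new CookieContainer();
    Uri site_root = new Uri("https://www.myweb.com/");
    cookie_jar.Add(site_root, new Cookie("csrftoken", "H6EyIoa9njatwHUPH2PdPRIlULApxZDQ4mie6"));
    cookie_jar.Add(site_root, new Cookie("_gid", "GA1.2.657661824.1565778096"));
    HttpClientHandler handler_with_cookies = new HttpClientHandler()
    {
        CookieContainer = cookie_jar,
        AutomaticDecompression = DecompressionMethods.Deflate | DecompressionMethods.GZip
    };
    HttpClient client_with_cookies = new HttpClient(handler_with_cookies);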

I set up a counter to record each page's length, so I can see how the pages differ.

The following is my C# (.NET Core) code:



using System;
using System.Collections.Concurrent;
using System.Collections.Generic;
using System.Net;
using System.Net.Http;
using System.Net.Http.Headers;
using System.Threading.Tasks;

class Program
{
    public const string _userAgent =
        "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/77.0.3844.0 Safari/537.36";

    public static ParallelOptions Para_Option =
        new ParallelOptions() { MaxDegreeOfParallelism = Environment.ProcessorCount };

    // Records the HTML length of each downloaded page, keyed by URL.
    public static ConcurrentDictionary<string, int> Dpage_Counter =
        new ConcurrentDictionary<string, int>();

    public static string login_cookies =
        "csrftoken=H6EyIoa9njatwHUPH2PdPRIlULApxZDQ4mie6; _gid=GA1.2.657661824.1565778096";

    public static async Task<HttpClient> Create_HttpClient()
    {
        ServicePointManager.UseNagleAlgorithm = true;
        ServicePointManager.Expect100Continue = true;
        ServicePointManager.DefaultConnectionLimit = int.MaxValue;
        ServicePointManager.EnableDnsRoundRobin = true;
        ServicePointManager.ReusePort = true;

        HttpClientHandler clientHandler = new HttpClientHandler()
        {
            AutomaticDecompression = DecompressionMethods.Deflate | DecompressionMethods.GZip
        };

        HttpClient client1 = new HttpClient(clientHandler);
        client1.DefaultRequestHeaders.Accept.Clear();
        client1.DefaultRequestHeaders.Accept.Add(new MediaTypeWithQualityHeaderValue("*/*"));
        client1.DefaultRequestHeaders.Add("Accept-Encoding", "gzip, deflate");
        client1.DefaultRequestHeaders.AcceptLanguage.Add(new StringWithQualityHeaderValue("en-US"));
        client1.DefaultRequestHeaders.Add("User-Agent", _userAgent);
        client1.DefaultRequestHeaders.TryAddWithoutValidation("Cookie", login_cookies);
        await Task.Delay(1);
        return client1;
    }

    public static async Task Http_Page_Length(string url1)
    {
        using (HttpClient client1 = await Create_HttpClient())
        using (HttpResponseMessage http_reply1 = await client1.GetAsync(url1))
        {
            string html_content1 = await http_reply1.Content.ReadAsStringAsync();
            Dpage_Counter.AddOrUpdate(url1, html_content1.Length, (k, v) => v);
        }
    }

    static async Task Main()
    {
        int max_page_num = 20;
        HashSet<string> Page_URLs = new HashSet<string>();
        for (int i = 1; i <= max_page_num; i++)
        {
            string page_url1 = "https://www.myweb.com/page" + i.ToString();
            Page_URLs.Add(page_url1);
        }
        Parallel.ForEach(Page_URLs, Para_Option, async (page_url1) =>
        {
            await Http_Page_Length(page_url1).ConfigureAwait(false);
        });
    }
}



When I run my code, I found an issue: if the total number of web pages is less than 10, it always works. But if there are more than 10 pages, I have a problem: the counter ends up with only about 10 pages' length data.

I searched around, and someone said the default connection limit for HttpClient is 10 (or 2); but when I create the HttpClient, I set the default connection limit to a huge number:

ServicePointManager.DefaultConnectionLimit = int.MaxValue;

But I still get the error: the number of pages and the number of entries in the page-length counter do not match.
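
To see the mismatch, I compare the counter against the URL list after the loop returns; roughly like this (Page_URLs and Dpage_Counter are the collections from my code above):

    // Sketch: compare what was recorded against what was requested.
    Console.WriteLine($"URLs requested:   {Page_URLs.Count}");
    Console.WriteLine($"Lengths recorded: {Dpage_Counter.Count}");
    foreach (string url in Page_URLs)
    {
        if (!Dpage_Counter.ContainsKey(url))
            Console.WriteLine($"No length recorded for: {url}");
    }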

Please advise: what went wrong with my code?

By the way, if I change my code from the Parallel loop to an ordinary sequential loop, it works without the error.
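
That sequential version of Main looks roughly like this (same Http_Page_Length as above), and with it every page's length is recorded:

    // Sequential variant: await each download in turn instead of
    // firing the downloads through Parallel.ForEach.
    static async Task Main()
    {
        int max_page_num = 20;
        HashSet<string> Page_URLs = new HashSet<string>();
        for (int i = 1; i <= max_page_num; i++)
        {
            Page_URLs.Add("https://www.myweb.com/page" + i.ToString());
        }
        foreach (string page_url1 in Page_URLs)
        {
            await Http_Page_Length(page_url1);
        }
        Console.WriteLine($"Recorded lengths: {Dpage_Counter.Count}");
    }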

Finally, I am using Visual Studio 2019 (Version 16.2.2) on Windows 10.
