Parsing URL with HttpClient and adding new URL to parse

Hiline1961 · Aug 21, 2020

Hi. I'm probably overthinking this so I'm confusing myself.

I have 10 original URLs whose content I want to fetch and parse asynchronously with HttpClient. Once I have parsed a page, I want to find its "Next URL" tag and parse that URL as well. This continues until a page has no "Next URL" tag.

I don't want to call HttpClient recursively, so I was thinking about using a Semaphore together with a concurrent collection.

My initial thought is this.

1) Create a Concurrent Bag with the initial 10 URLs.

2) Use a Semaphore to limit access to the Concurrent Bag.

3) Parse each URL's content with HttpClient and find the Next URL.

4) Add the Next URL to the Concurrent Bag before I release the Semaphore, so that it will be processed subsequently.

5) At the end of all this, I want a Collection of URLs that includes the 10 Original URLs and all Next URLs.
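One possible shape for the steps above, sketched in Python's asyncio rather than .NET so that it runs without a network. It assumes each page yields at most one Next URL, in which case each original URL is its own chain and can be followed with a plain loop instead of recursion, with a semaphore capping how many requests run at once. FAKE_PAGES, fetch_next_url, follow_chain and crawl are hypothetical names, and the fetch is a stand-in for the real HttpClient call and parsing.

```python
import asyncio

# Hypothetical stand-in for real pages: URL -> its "Next URL" (None = no tag).
FAKE_PAGES = {
    "a1": "a2", "a2": "a3", "a3": None,
    "b1": None,
}

async def fetch_next_url(url):
    """Placeholder for the HTTP request plus parsing; returns the Next URL or None."""
    await asyncio.sleep(0)  # simulate I/O
    return FAKE_PAGES.get(url)

async def follow_chain(start_url, limiter):
    """Follow one chain of Next URLs with a loop, not recursion."""
    visited = []
    url = start_url
    while url is not None:
        async with limiter:  # cap the number of concurrent requests
            next_url = await fetch_next_url(url)
        visited.append(url)
        url = next_url
    return visited

async def crawl(start_urls, max_concurrent=3):
    limiter = asyncio.Semaphore(max_concurrent)
    # One task per original URL; gather returns results in input order.
    chains = await asyncio.gather(*(follow_chain(u, limiter) for u in start_urls))
    # Flatten into one collection: original URLs plus every Next URL.
    return [url for chain in chains for url in chain]

all_urls = asyncio.run(crawl(["a1", "b1"]))
```

In this shape no shared bag is needed, because each chain only ever has one pending URL. It does not guard against a Next URL that loops back to an earlier page.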

This is where I confuse myself.

a) If an item is removed from the Concurrent Bag (CB1) once it is processed, does this mean I need a second Concurrent Bag (CB2) to hold the items that have been processed from CB1? That way, once every item in CB1 has been processed, they will all be in CB2?

b) How do I know I'm done with CB1? Let's say I have 3 threads. Threads 1 and 2 reach the end of their Next URLs and release the Semaphore, so they have nothing new to process. However, Thread 3 is still working and finds a Next URL. If I add it to the Concurrent Bag, how do I ensure it gets processed?
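A minimal sketch of one way both questions could be handled if a shared collection is kept, again in Python's asyncio as a stand-in for the .NET types. Pending URLs live in a queue, processed URLs go into a separate list (the "CB2" idea), and the queue's count of unfinished items decides when the work is done. The names FAKE_PAGES, fetch_next_url, worker and crawl are hypothetical, and the fetch is faked.

```python
import asyncio

# Hypothetical stand-in for real pages: URL -> its "Next URL" (None = no tag).
FAKE_PAGES = {"a1": "a2", "a2": None, "b1": "b2", "b2": "b3", "b3": None}

async def fetch_next_url(url):
    """Placeholder for the HTTP request plus parsing."""
    await asyncio.sleep(0)  # simulate I/O
    return FAKE_PAGES.get(url)

async def worker(queue, processed):
    while True:
        url = await queue.get()  # idle workers wait here for new work
        try:
            next_url = await fetch_next_url(url)
            processed.append(url)
            if next_url is not None:
                # Enqueue BEFORE task_done() so the unfinished count
                # never drops to zero while work is still outstanding.
                queue.put_nowait(next_url)
        finally:
            queue.task_done()

async def crawl(start_urls, worker_count=3):
    queue = asyncio.Queue()
    processed = []  # second collection: everything that has been handled
    for url in start_urls:
        queue.put_nowait(url)
    workers = [asyncio.create_task(worker(queue, processed))
               for _ in range(worker_count)]
    await queue.join()  # returns once every queued URL is marked done
    for w in workers:
        w.cancel()  # workers are idle at queue.get() by now
    await asyncio.gather(*workers, return_exceptions=True)
    return processed

processed_urls = asyncio.run(crawl(["a1", "b1"]))
```

The point of the sketch is that a worker with nothing to do does not exit; it waits on the queue, so a URL added late by another worker is still picked up. "Done" means the count of outstanding items reached zero, not that the collection looked empty at some moment. The order of processed_urls depends on scheduling.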

Thanks much!!!
