Downloading a dynamically-generated file using C#

  • Thread starter: William Snell

I was working on a web scraping project and was able to use ScrapySharp to log into a site and grab data for thousands of users. My boss was happy enough with that to ask me to automate pulling a CSV report from the same site. I figured I just had to modify a few form values and go. I can log in, find and load the form, and change the pertinent values, but the site runs some kind of script to generate the CSV report I need, and I can't get at it through the scraper.

So I changed gears and tried HttpWebRequest/HttpWebResponse objects instead. I think I'm authenticating successfully, but I can't be sure: I get a 200 status code, but the HTML returned isn't the HTML of the form page I need. So I tried sending a request to the login URL first, and then posting the form data to the form page's URL. When I read the response stream from the second request, I still get the login page's HTML.

I ran into something similar in ScrapySharp. ScrapySharp has a browser emulator and a way to submit a page's form. The Submit method returns a WebPage object, and when I submit the login form through ScrapySharp, the returned WebPage contains the login page's HTML. However, if I then use ScrapySharp's browser to navigate to the required URL (which requires authentication), I get the correct page's HTML. I thought I could emulate this with HttpWebRequest by sending a second request to the required page with the form data it needs. That isn't working, or at least I can't get the generated CSV file when I submit. I'm worried I might need to copy over headers, access tokens, or something else, but I'm not sure what I'm missing. Here's my code:

    public void DownloadCSV()
    {
        var cookieContainer = new CookieContainer();

        var request = WebRequest.Create(_loginUri) as HttpWebRequest;
        request.Credentials = GenerateCredentials();
        request.PreAuthenticate = true;
        request.CookieContainer = cookieContainer;
        request.KeepAlive = true;
        request.Method = WebRequestMethods.Http.Post;
        request.ContentType = "application/x-www-form-urlencoded";

        var loginResponse = request.GetResponse() as HttpWebResponse;

        using (var loginStream = loginResponse.GetResponseStream())
        using (var output = File.Create(_loginResponseSavePath))
        {
            loginStream.CopyTo(output);
        }

        // Logged in, now submit form.
        var postData = "huge-string-of-post-data";
        var postBytes = Encoding.UTF8.GetBytes(postData);

        request = WebRequest.Create(_csvFormUri) as HttpWebRequest;
        request.Credentials = GenerateCredentials();
        request.ContentLength = postBytes.Length;
        request.CookieContainer = cookieContainer;
        request.KeepAlive = true;
        request.Method = WebRequestMethods.Http.Post;
        request.ContentType = "application/x-www-form-urlencoded";

        using (Stream postStream = request.GetRequestStream())
        {
            postStream.Write(postBytes, 0, postBytes.Length);
        }

        var formResponse = request.GetResponse() as HttpWebResponse;

        using (var stream = formResponse.GetResponseStream())
        using (var output = File.Create(_csvSavePath))
        {
            stream.CopyTo(output);
        }
    }


And here's the code that generates the credentials:

    private CredentialCache GenerateCredentials()
    {
        var username = _configuration.GetValue<string>("LoginCreds:username");
        var password = _configuration.GetValue<string>("LoginCreds:password");

        var credentialCache = new CredentialCache();
        credentialCache.Add(_loginUri, "Basic", new NetworkCredential(username, password));

        return credentialCache;
    }


I'm thinking that maybe the first request authenticates as required, but that I lose that state when I re-create the request for the form. I regenerate the credentials for the second request, but nothing has worked so far. The actual site has a form, but some kind of report-builder script runs and the browser automatically downloads the CSV file as an Excel spreadsheet. It's super easy to grab the report manually, but automating it is driving me nuts.
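
For comparison, here is a rough sketch of the "log in, navigate to the form page, then submit" flow done with HttpClient instead of HttpWebRequest. It is not a confirmed fix. It assumes the site uses an ordinary HTML login form with a session cookie rather than HTTP Basic authentication, that the form page may carry hidden inputs (such as anti-forgery tokens) that have to be posted back, and that the report comes back directly in the response to the form POST. The class name CsvReportSketch, the "username" and "password" field names, and the reportFields parameter are all hypothetical, and the regex-based parsing is deliberately crude.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Net;
using System.Net.Http;
using System.Text.RegularExpressions;
using System.Threading.Tasks;

public static class CsvReportSketch
{
    // Collects name/value pairs from <input type="hidden"> tags so that any
    // tokens the server embedded in a page can be posted back with the form.
    public static Dictionary<string, string> ExtractHiddenFields(string html)
    {
        var fields = new Dictionary<string, string>();
        foreach (Match tag in Regex.Matches(html, "<input\\b[^>]*>", RegexOptions.IgnoreCase))
        {
            var type = GetAttribute(tag.Value, "type");
            var name = GetAttribute(tag.Value, "name");
            if (name.Length > 0 && string.Equals(type, "hidden", StringComparison.OrdinalIgnoreCase))
            {
                fields[name] = WebUtility.HtmlDecode(GetAttribute(tag.Value, "value"));
            }
        }
        return fields;
    }

    // Returns the quoted value of an attribute, or an empty string if absent.
    private static string GetAttribute(string tag, string attribute)
    {
        var pattern = "(?<![\\w-])" + attribute + "\\s*=\\s*[\"']([^\"']*)[\"']";
        var match = Regex.Match(tag, pattern, RegexOptions.IgnoreCase);
        return match.Success ? match.Groups[1].Value : "";
    }

    public static async Task DownloadCsvAsync(
        Uri loginUri, Uri csvFormUri, string username, string password,
        IDictionary<string, string> reportFields, string savePath)
    {
        // One handler and one cookie container for every request, so the
        // session cookie set at login is sent with the later requests.
        var handler = new HttpClientHandler
        {
            CookieContainer = new CookieContainer(),
            AllowAutoRedirect = true
        };

        using (var client = new HttpClient(handler))
        {
            // 1. Load the login page, keep its hidden fields, add the credentials.
            var loginFields = ExtractHiddenFields(await client.GetStringAsync(loginUri));
            loginFields["username"] = username; // hypothetical field name
            loginFields["password"] = password; // hypothetical field name
            using (var loginResponse = await client.PostAsync(loginUri, new FormUrlEncodedContent(loginFields)))
            {
                loginResponse.EnsureSuccessStatusCode();
            }

            // 2. GET the protected form page, as the ScrapySharp browser does.
            var formFields = ExtractHiddenFields(await client.GetStringAsync(csvFormUri));
            foreach (var pair in reportFields)
            {
                formFields[pair.Key] = pair.Value;
            }

            // 3. POST the form and save whatever the server sends back.
            using (var response = await client.PostAsync(csvFormUri, new FormUrlEncodedContent(formFields)))
            {
                response.EnsureSuccessStatusCode();
                File.WriteAllBytes(savePath, await response.Content.ReadAsByteArrayAsync());
            }
        }
    }
}
```

If the saved file still contains the login page's HTML, the login POST probably did not match what the browser sends. Comparing the field names and request headers against the browser's network tab would be the next thing to check. If the report is generated by a script that calls a different URL, that URL would need to replace csvFormUri in step 3.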
