RE: [orlandophp] Health Insurance Web Services and/or Scraping

From: Tony T.
Sent on: Saturday, September 22, 2012 10:19 AM

Maybe check out the built-in downloader middleware, which includes CookiesMiddleware.

It has a debug option; if you enable it, Scrapy will log the cookies it's sending and receiving. Scrapy can also send requests through a proxy (using HttpProxyMiddleware), so if you are using Burp already you can leverage it to debug the spider as well.
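
A minimal sketch of both suggestions, assuming a Scrapy project and an interception proxy listening on 127.0.0.1:8080 (adjust to your setup). COOKIES_DEBUG is the setting that turns on the CookiesMiddleware logging, and HttpProxyMiddleware honors a "proxy" key in a request's meta. The names PROXY_URL and via_proxy are hypothetical helpers, not part of Scrapy.

```python
# settings.py fragment: log the cookies sent and received by the spider.
COOKIES_ENABLED = True
COOKIES_DEBUG = True

# Assumed address of the local interception proxy (Burp, ZAP, ...).
PROXY_URL = "http://127.0.0.1:8080"


def via_proxy(meta=None):
    """Return a copy of a request meta dict that routes through PROXY_URL.

    HttpProxyMiddleware reads the "proxy" key from request.meta.
    """
    routed = dict(meta or {})
    routed["proxy"] = PROXY_URL
    return routed
```

In a spider this would be passed along as scrapy.Request(url, meta=via_proxy()), so each request shows up in the proxy's history.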

On Sep 22,[masked]:53 AM, "Joseph Persie" <[address removed]> wrote:
Thanks

FormRequest.from_response takes care of handling the CSRF token, and Selenium does execute the JavaScript, at the cost of an extra layer of complexity.
The next step is figuring out which cookies to snatch. I was able to replicate some weirdness with Chromium and will definitely attempt the Burp analysis
to get a better idea of my cookie selection.


Subject: Re: [orlandophp] Health Insurance Web Services and/or Scraping
From: [address removed]
To: [address removed]
Date: Sat, 22 Sep[masked]:28:11 -0400

Joseph,

(Token-based) CSRF protection just ensures that a request includes the valid token issued for the session, so the site knows the request is coming from the actual session owner (as opposed to some nasty XSS embedded in a site hijacking a separate active session for the user). So if the site is using such tokens, you'd see that token value presented when you initiate the session with the site. It's your session, you aren't hijacking another, but your spider needs to be aware that it has to present that token. So you have to parse that token out of the initial response and grab the auth cookie. I will sometimes use Zed Attack Proxy or Burp (interception proxies) to watch the requests and responses from a real browser, get a better idea of what is happening behind the scenes for a site, and then replicate that within the spider.
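
As a rough illustration of the "parse that token" step, here is a standard-library sketch that collects the hidden inputs of a form so they can be sent back with the POST. The field name csrf_token and the sample HTML are made up; the real site will use its own names.

```python
from html.parser import HTMLParser


class HiddenFieldParser(HTMLParser):
    """Collect name/value pairs of every <input type="hidden"> in a page."""

    def __init__(self):
        super().__init__()
        self.fields = {}

    def handle_starttag(self, tag, attrs):
        if tag != "input":
            return
        attr = dict(attrs)
        if (attr.get("type") or "").lower() == "hidden" and attr.get("name"):
            self.fields[attr["name"]] = attr.get("value") or ""


def hidden_fields(html):
    """Return the hidden form fields found in an HTML string."""
    parser = HiddenFieldParser()
    parser.feed(html)
    parser.close()
    return parser.fields


# Hypothetical form, standing in for the site's initial response.
SAMPLE = """
<form method="post" action="/quote">
  <input type="hidden" name="csrf_token" value="abc123">
  <input type="text" name="zip">
</form>
"""

# Start from the hidden fields, then add the values a user would type in.
form_data = hidden_fields(SAMPLE)
form_data.update({"zip": "32801", "gender": "M", "age": "30"})
```

The resulting form_data is what gets posted back, together with the session cookie from that same initial response.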

If you want to dynamically follow links you will need to use CrawlSpider (and create some rules) instead of BaseSpider, as BaseSpider doesn't work that way. http://doc.scrapy.org/en/latest/topics/spiders.html

I'm not 100% sure on your JavaScript question. I have had colleagues tell me they used Selenium for JavaScript stuff, but you can find code snippets for Scrapy at Snipplr which may be helpful for your project. Here's one that may provide some insight: http://snipplr.com/view/66998/rendered-javascript-crawler-with-scrapy-and-selenium-rc/

Good luck! :)



On Sat, Sep 22, 2012 at 2:28 AM, Joseph Persie <[address removed]> wrote:
Success.. I was able to do what I needed with Scrapy, it's a great tool!


Above is the script that does what is necessary to retrieve rates from response.body; however,
it contains JavaScript which must be executed to redirect to the rates listings.

Is there any way to execute JavaScript in Scrapy? I can't seem to find anything in the docs.
Otherwise I would simply parse the URL within the noscript tag to scrape my rates. Any feedback would be great.
Thanks for the recommendation.
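
For the noscript fallback, pulling the URL out does not need a JavaScript engine. A standard-library sketch, assuming the noscript block contains an ordinary <a href> link (the sample markup and URLs are invented, and the real page may use a different element):

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class NoscriptLinkParser(HTMLParser):
    """Collect href values of <a> tags that appear inside <noscript>."""

    def __init__(self):
        super().__init__()
        self.depth = 0      # > 0 while inside a <noscript> element
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "noscript":
            self.depth += 1
        elif tag == "a" and self.depth:
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)

    def handle_endtag(self, tag):
        if tag == "noscript" and self.depth:
            self.depth -= 1


def noscript_links(html, base_url):
    """Return absolute URLs for links found inside <noscript> blocks."""
    parser = NoscriptLinkParser()
    parser.feed(html)
    parser.close()
    return [urljoin(base_url, href) for href in parser.links]


# Hypothetical response body.
BODY = """
<script>window.location = "/rates?session=1";</script>
<noscript><a href="/rates?session=1">Continue</a></noscript>
<a href="/about">About</a>
"""

print(noscript_links(BODY, "http://example.com/quote"))
```

Inside a spider callback the same function would take the decoded response body and response.url, and the returned URL becomes the next request.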


Subject: Re: [orlandophp] Health Insurance Web Services and/or Scraping
From: [address removed]
To: [address removed]
Date: Fri, 21 Sep[masked]:47:55 -0400

Jorge,

I may not have the most intelligent answer for you, because most of the time I've used this I've been scraping sites with rate-limiting protections. The last site I did had a 200 ms delay between each request, but I had to do it that way or my IP would be shunned. I haven't had the pleasure of benchmarking against a site where I had a large enough pool of requests without such protections to give accurate numbers here. I do know that Scrapy takes an asynchronous approach built on Twisted, as opposed to a threaded one, which is faster than using curl.
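
For reference, a fixed delay like the 200 ms one mentioned above can be sketched in plain Python as follows. Scrapy has its own download-delay setting, so this is only to show the idea; the class name Throttle and the example URLs are hypothetical.

```python
import time


class Throttle:
    """Enforce a minimum interval between consecutive requests."""

    def __init__(self, min_interval=0.2, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = min_interval
        self._clock = clock
        self._sleep = sleep
        self._last = None   # time of the previous request, if any

    def wait(self):
        """Block until at least min_interval has passed since the last call."""
        now = self._clock()
        if self._last is not None:
            remaining = self.min_interval - (now - self._last)
            if remaining > 0:
                self._sleep(remaining)
                now = self._clock()
        self._last = now


throttle = Throttle(min_interval=0.2)   # 200 ms between requests
for url in ["http://example.com/a", "http://example.com/b"]:
    throttle.wait()
    # the actual fetch of `url` would go here
```

The clock and sleep arguments are injectable only so the delay logic can be checked without really waiting.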

-Tony

On Fri, Sep 21, 2012 at 3:04 PM, Jorge Colon <[address removed]> wrote:
Tony,

Had a look at the docs. How's this in terms of speed? Anything that involves a lot of parsing, especially HTML and XML, is slow.

Regards,

Jorge Colon
Director of Web Development

Zend Certified Engineer

Sent from my mobile device

On Sep 21, 2012, at 12:07 AM, Tony Turner <[address removed]> wrote:

I've been using Scrapy for my web scraping projects and it's very tunable. You can change user agents, set delays between requests, set or even disable cookies, and a multitude of other options. You can export to CSV, JSON and other formats as well as db connectors, though I've not gotten that fancy yet. It includes an interactive shell so you can pretty easily work out the XPaths without having to recode, so the spider usually works the first time you run it. It's also Python so it's super easy to set up.
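
To make those knobs concrete, a settings.py sketch. The setting names are Scrapy's; the values, the bot name in the user agent, and the output file name are placeholders. FEEDS is the feed-export setting in recent Scrapy releases; older versions configured exports differently.

```python
# settings.py fragment (values are illustrative)

# Identify the spider with a custom user agent string.
USER_AGENT = "rates-bot/0.1 (+http://example.com/contact)"

# Base delay in seconds between consecutive requests to the same site.
DOWNLOAD_DELAY = 0.2

# Set to False to stop sending and storing cookies altogether.
COOKIES_ENABLED = True

# Export scraped items; other formats such as "csv" and "xml" are available.
FEEDS = {
    "rates.json": {"format": "json"},
}
```

The interactive shell is started with "scrapy shell <url>", which is where the XPaths can be tried out before they go into the spider.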
On Sep 20,[masked]:30 PM, "Joseph Persie" <[address removed]> wrote:
Off Topic
Don't abuse the mailing list; it's a great resource and should be used only when absolutely needed.

Intention

I'm building a health insurance rate request application that requires the following parameters:
zip, gender, age

With these args I want to retrieve near-accurate rates for various insurance companies.

Web Services?
Unfortunately we are still on the eve of the 21st century and people are still using file cabinets!
Does anyone tap into health insurance web services?
I've tried contacting:

eHealth
It seems they are pretty preoccupied with their corporate customers.

eligibleAPI
They require too many parameters and seem geared towards hospitals and insurance agents.

So if you know of service that I can send the following paremte

Scrape time!

Additionally, let's say I wanted the rates from the following:

After having a gander at the source, there is CSRF protection in place.
Using curl with an HTML DOM parser should get me to the rates I'm looking for.
What is the best method of mimicking a legit user from a browser via curl, to trick the CSRF check into thinking I'm
real? I understand curl has a cookie jar, but I was curious if someone had a gist of an implementation they could point me to.
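
For what it's worth, the cookie-jar idea looks roughly like this with Python's standard library instead of curl. Everything site-specific here (the user agent string, any URLs and field names you pass in) is a placeholder, and the hidden CSRF field still has to be read from the form page and included in the posted fields.

```python
import urllib.parse
import urllib.request
from http.cookiejar import CookieJar

# Placeholder browser-like user agent string.
USER_AGENT = "Mozilla/5.0 (X11; Linux x86_64) rates-example/0.1"


def make_session():
    """Return (opener, jar): an opener that stores and resends cookies."""
    jar = CookieJar()
    opener = urllib.request.build_opener(urllib.request.HTTPCookieProcessor(jar))
    opener.addheaders = [("User-Agent", USER_AGENT)]
    return opener, jar


def get(opener, url):
    """GET a page; any cookies the server sets land in the session's jar."""
    with opener.open(url, timeout=30) as resp:
        return resp.read().decode("utf-8", errors="replace")


def post_form(opener, url, fields):
    """POST urlencoded form fields using the same cookies as earlier requests."""
    data = urllib.parse.urlencode(fields).encode("ascii")
    with opener.open(url, data=data, timeout=30) as resp:
        return resp.read().decode("utf-8", errors="replace")
```

A typical flow would be to call make_session() once, get() the form page to pick up the session cookie and token, and then post_form() the token together with zip, gender and age through the same opener.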

Thanks!





--
Please Note: If you hit "REPLY", your message will be sent to everyone on this mailing list ([address removed])
This message was sent by Joseph Persie ([address removed]) from The Orlando PHP User Group.
To learn more about Joseph Persie, visit his/her member profile
Set my mailing list to email me As they are sent | In one daily email | Don't send me mailing list messages

Meetup, PO Box 4668 #37895 New York, New York[masked] | [address removed]




--
Please Note: If you hit "REPLY", your message will be sent to everyone on this mailing list ([address removed])
This message was sent by Tony Turner ([address removed]) from The Orlando PHP User Group.
To learn more about Tony Turner, visit his/her member profile
Set my mailing list to email me As they are sent | In one daily email | Don't send me mailing list messages

Meetup, PO Box 4668 #37895 New York, New York[masked] | [address removed]




--
Please Note: If you hit "REPLY", your message will be sent to everyone on this mailing list ([address removed])
This message was sent by Jorge Colon ([address removed]) from The Orlando PHP User Group.
To learn more about Jorge Colon, visit his/her member profile



--
Tony Turner
OWASP Orlando Chapter Founder/Co-Leader
[address removed]






