At my current job, my boss took on a project that involves getting data from websites and posting it into another system. Sadly, one of those websites has no API, so we needed to figure out another way to get the information out of it. The short answer is scraping. To keep the story short: after evaluating several projects, I concluded that the best option was CasperJS.
CasperJS is an open-source navigation scripting and testing utility written in JavaScript for the PhantomJS (WebKit) headless browser and SlimerJS (Gecko). It eases the process of defining a full navigation scenario and provides useful high-level functions and methods for common tasks such as:
- defining and ordering browsing navigation steps,
- filling and submitting forms,
- clicking and following links,
- capturing screenshots of a page (or part of it),
- testing remote DOM,
- logging events,
- downloading resources, including binary ones,
- writing functional test suites, saving results as JUnit XML,
- scraping Web content.
Installation
To install CasperJS, I did the following:
- I set up OKay's RPM repository,
- I installed PhantomJS 1.9.x by running yum install phantom19. Note that the binary ends up at /usr/bin/phantomjs19,
- I downloaded CasperJS 1.1-beta3 (the latest version available at the time of writing) from its GitHub page and unzipped it. CasperJS 1.0.x won't run under PhantomJS 1.9.x.
We are done! To call CasperJS, run a command like this:
PHANTOMJS_EXECUTABLE=/usr/bin/phantomjs19 ~/casperjs-1.1-beta3/bin/casperjs
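Typing the environment variable on every call gets tedious, so the command above could be wrapped in a small shell function. This is only a sketch: the names run_casper, PHANTOMJS_BIN and CASPERJS_BIN are made up for illustration, and the default paths assume the same install locations as in the steps above.

```shell
# Hypothetical helper, not part of CasperJS itself.
# Default paths assume PhantomJS from the RPM and CasperJS unzipped in $HOME;
# both can be overridden by setting the variables before sourcing this file.
PHANTOMJS_BIN="${PHANTOMJS_BIN:-/usr/bin/phantomjs19}"
CASPERJS_BIN="${CASPERJS_BIN:-$HOME/casperjs-1.1-beta3/bin/casperjs}"

# Run casperjs with PHANTOMJS_EXECUTABLE set for that one command only,
# passing every argument through unchanged.
run_casper() {
    PHANTOMJS_EXECUTABLE="$PHANTOMJS_BIN" "$CASPERJS_BIN" "$@"
}
```

With that in a shell profile, a script would be started with something like run_casper scrape.js, where scrape.js stands for whatever CasperJS script you have written.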