This may not seem very Web 2.0 (O'Reilly has said that web services are 2.0 while screen scraping is 1.0), but I think there are a variety of reasons screen scraping is still helpful, including:
Need to be closer to what the user sees
Don't have direct access to the database, or to a web service that will provide the information you need (or you won't have access soon enough)
For example:
Testing whether your web pages are looking the way you expect. Sometimes testing this from the back end just isn't going to cut it, and you need to analyze the HTML to see if the page looks reasonable.
Writing a report that doesn't already exist on top of some reporting tool (for instance, on top of a defect-tracking system that you don't have access to the code for).
Creating archived versions of sites. Sometimes using HTTRACK, for example, isn't enough on its own (for example, when you need to pull in full-sized videos from the source system as opposed to the streamed version on the web). Also, you can use Perl to wrap around HTTRACK so that you have a standard way of passing options to HTTRACK.
Seeing which of a large set of your sites are indexed in Google.
Testing your RSS feeds to determine whether they have the right number of content items, etc. (I guess this would be more "RSS scraping" than screen scraping).
Importing from a static site to a CMS (less and less commonly needed nowadays).
Often, if there's a direct DB connection or an RSS feed or some other XML interface that you can use, then it probably makes sense to use that. Even in that case, the archiving and web page testing cases would probably benefit from screen scraping.
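As a minimal sketch of the page-testing idea above, here is one way to scrape a page and check part of its HTML with only the Python standard library. The URL and the expected title are placeholders, not anything from a real site:

```python
# Sketch: fetch a page and verify its <title> looks right.
# Uses only the standard library; URL and expected title are examples.
from html.parser import HTMLParser
from urllib.request import urlopen


class TitleGrabber(HTMLParser):
    """Collects the text inside the first <title> element."""

    def __init__(self):
        super().__init__()
        self.in_title = False
        self.title = ""

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self.in_title = True

    def handle_endtag(self, tag):
        if tag == "title":
            self.in_title = False

    def handle_data(self, data):
        if self.in_title:
            self.title += data


def page_title(html):
    """Return the stripped contents of the page's <title> tag."""
    parser = TitleGrabber()
    parser.feed(html)
    return parser.title.strip()


# Usage (requires network access; URL is illustrative):
# html = urlopen("http://example.com/").read().decode("utf-8")
# assert page_title(html) == "Example Domain"
```

In practice you would check more than the title (specific elements, counts, attribute values), but the shape is the same: fetch, parse, assert.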
Source: http://hobbsontech.com/content/screen-scraping
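The "RSS scraping" case above can be sketched in a few lines as well. This assumes an RSS 2.0 feed (items nested under `<channel>`); the inline feed in the usage comment is a made-up example:

```python
# Sketch: count the <item> elements in an RSS 2.0 feed, using only
# the standard library. Feed contents here are illustrative.
import xml.etree.ElementTree as ET


def item_count(feed_xml):
    """Return the number of <item> elements in an RSS 2.0 feed string."""
    root = ET.fromstring(feed_xml)
    # RSS 2.0 nests items under <channel>.
    return len(root.findall("./channel/item"))


# Usage:
# feed = urlopen("http://example.com/feed.rss").read().decode("utf-8")
# assert item_count(feed) >= 10  # e.g. expect at least ten items
```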