Friday, 31 May 2013

Screen Scraping

This may not seem very Web 2.0 (O'Reilly wrote that web services are 2.0 but screen scraping is 1.0), but I think there are a variety of reasons that screen scraping is still helpful, including:

    Need to be closer to what the user sees
    Don't have access directly to the database or a web service that will provide you the information you need (or you won't have access soon enough)

For example:

    Testing whether your web pages are looking the way you expect. Sometimes testing this from the back end just isn't going to cut it, and you need to analyze the HTML to see if the page looks reasonable.
    Writing a report that doesn't already exist on top of some reporting tool (for instance, on top of a defect-tracking system that you don't have access to the code for).
    Creating archived versions of sites. Sometimes using HTTrack, for example, isn't enough on its own (for example, when you need to pull in full-sized videos from the source system as opposed to the streamed version on the web). Also, you can use Perl to wrap around HTTrack so that you have a standard way of passing options to it.
    Seeing which of a large set of your sites are indexed in Google.
    Testing your RSS feeds to determine if they have the right number of content items, etc. (I guess this would be more "RSS scraping" than screen scraping).
    Importing from a static site to a CMS (less and less commonly needed nowadays).

Often, if there's a direct DB connection or an RSS feed or some other XML interface that you can use, then it probably makes sense to use that. Even in that case, the archiving and web page testing cases would probably benefit from screen scraping.
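Several of the cases above are easy to automate. As an illustrative sketch (in Python rather than Perl, purely for brevity), here is how the RSS-feed check might count content items; the sample feed below is invented:

```python
import xml.etree.ElementTree as ET

def count_rss_items(rss_xml):
    """Parse an RSS 2.0 document and return the number of <item> elements."""
    root = ET.fromstring(rss_xml)
    # RSS 2.0 nests items under <channel>.
    return len(root.findall("./channel/item"))

sample_feed = """<?xml version="1.0"?>
<rss version="2.0">
  <channel>
    <title>Example feed</title>
    <item><title>Post one</title></item>
    <item><title>Post two</title></item>
    <item><title>Post three</title></item>
  </channel>
</rss>"""

print(count_rss_items(sample_feed))  # 3
```

A real test would fetch the feed over HTTP and assert that the count matches what the CMS claims to have published.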


Source: http://hobbsontech.com/content/screen-scraping

Wednesday, 29 May 2013

Screen-scraping with WWW::Mechanize

Screen-scraping is the process of emulating an interaction with a Web site - not just downloading pages, but filling out forms, navigating around the site, and dealing with the HTML received as a result. As well as for traditional lookups of information - like the example we'll be exploring in this article - we can use screen-scraping to enhance a Web service into doing something the designers hadn't given us the power to do in the first place. Here's an example:

I do my banking online, but get quickly bored with having to go to my bank's site, log in, navigate around to my accounts and check the balance on each of them. One quick Perl module (Finance::Bank::HSBC) later, and now I can loop through each of my accounts and print their balances, all from a shell prompt. Some more code, and I can do something the bank's site doesn't ordinarily let me - I can treat my accounts as a whole instead of individual accounts, and find out how much money I have, could possibly spend, and owe, all in total. Another step forward would be to schedule a crontab every day to use the HSBC option to download a copy of my transactions in Quicken's QIF format, and use Simon Cozens' Finance::QIF module to interpret the file and run those transactions against a budget, letting me know whether I'm spending too much lately. This takes a simple Web-based system from being merely useful to being automated and bespoke; if you can think of how to write the code, then you can do it. (It's probably wise for me to add the caveat, though, that you should be extremely careful working with banking information programmatically, and even more careful if you're storing your login details in a Perl script somewhere.)

Back to screen-scrapers, and introducing WWW::Mechanize, written by Andy Lester and based on Skud's WWW::Automate. Mechanize allows you to go to a URL and explore the site, following links by name, taking cookies, filling in forms and clicking "submit" buttons. We're also going to use HTML::TokeParser to process the HTML we're given back, which is a process I've written about previously.

The site I've chosen to demonstrate on is the BBC's Radio Times site, which allows users to create a "Diary" for their favorite TV programs, and will tell you whenever any of the programs is showing on any channel. Being a London Perl M[ou]nger, I have an obsession with Buffy the Vampire Slayer. If I tell this to the BBC's site, then it'll tell me when the next episode is, and what the episode name is - so I can check whether it's one I've seen before. I'd have to remember to log into their site every few days to check whether there was a new episode coming along, though. Perl to the rescue! Our script will check to see when the next episode is and let us know, along with the name of the episode being shown.

Here's the code:


  #!/usr/bin/perl -w
  use strict;


  use WWW::Mechanize;
  use HTML::TokeParser;

If you're going to run the script yourself, then you should register with the Radio Times site and create a diary, before giving the e-mail address you used to do so below.


  my $email = "";
  die "Must provide an e-mail address" unless $email ne "";

We create a WWW::Mechanize object, and tell it the address of the site we'll be working from. The Radio Times' front page has an image link with an ALT text of "My Diary", so we can use that to get to the right section of the site:


  my $agent = WWW::Mechanize->new();
  $agent->get("http://www.radiotimes.beeb.com/");
  $agent->follow("My Diary");

The returned page contains two forms - one to allow you to choose from a list box of program types, and then a login form for the diary function. We tell WWW::Mechanize to use the second form for input. (Something to remember here is that WWW::Mechanize's list of forms, unlike an array in Perl, is indexed starting at 1 rather than 0. Our index is, therefore, '2'.)


  $agent->form(2);

Now we can fill in our e-mail address for the '<INPUT name="email" type="text">' field, and click the submit button. Nothing too complicated.


  $agent->field("email", $email);
  $agent->click();

WWW::Mechanize moves us to our diary page. This is the page we need to process to find the date details. Upon looking at the HTML source for this page, we can see that the HTML we need to work through is something like:


  <input>
  <tr><td></td></tr>
  <tr><td></td><td></td><td class="bluetext">Date of episode</td></tr>
  <td></td><td></td>
  <td class="bluetext"><b>Time of episode</b></td></tr>
  <a href="page_with_episode_info"></a>

This can be modeled with HTML::TokeParser as below. The important methods to note are get_tag - which will move the stream on to the next opening for the tag given - and get_trimmed_text, which returns the text between the current and given tags. For example, for the HTML code "<b>Bold text here</b>", calling $stream->get_trimmed_text("/b") would return "Bold text here".

Also note that we're initializing HTML::TokeParser on '\$agent->{content}' - this is an internal variable for WWW::Mechanize, exposing the HTML content of the current page.


  my $stream = HTML::TokeParser->new(\$agent->{content});
  my $date;
 
  # <input>
  $stream->get_tag("input");


  # <tr><td></td></tr><tr>
  $stream->get_tag("tr"); $stream->get_tag("tr");


  # <td></td><td></td>
  $stream->get_tag("td"); $stream->get_tag("td");


  # <td class="bluetext">Date of episode</td></tr>
  my $tag = $stream->get_tag("td");
  if ($tag->[1]{class} and $tag->[1]{class} eq "bluetext") {
      $date = $stream->get_trimmed_text("/td");
      # The date contains '&nbsp;', which we'll translate to a space.
      $date =~ s/\xa0/ /g;
  }

  # <td></td><td></td>
  $stream->get_tag("td");


  # <td class="bluetext"><b>Time of episode</b> 
  $tag = $stream->get_tag("td");


  if ($tag->[1]{class} eq "bluetext") {
      $stream->get_tag("b");
      # This concatenates the time of the showing to the date.
      $date .= ", from " . $stream->get_trimmed_text("/b");
  }


  # </td></tr><a href="page_with_episode_info"></a>
  $tag = $stream->get_tag("a");


  # Match the URL to find the page giving episode information.
  $tag->[1]{href} =~ m!src=(http://.*?)'!;

We have a scalar, $date, containing a string that looks something like "Thursday 23 January, from 6:45pm to 7:30pm.", and we have a URL, in $1, that will tell us more about that episode. We tell WWW::Mechanize to go to the URL:


  $agent->get($1);

The navigation we want to perform on this page is far less complex than on the last page, so we can avoid using a TokeParser for it - a regular expression should suffice. The HTML we want to parse looks something like this:


  <br><b>Episode</b><br>  The Episode Title<br>

We use a regex delimited with '!' in order to avoid having to escape the slashes present in the HTML, and capture the episode title after some whitespace, between <br> tags after the Episode header:


  $agent->{content} =~ m!<br><b>Episode</b><br>\s+?(.+?)<br>!;

$1 now contains our episode, and all that's left to do is print out what we've found:


  my $episode = $1;
  print "The next Buffy episode ($episode) is on $date.\n";

And we're all set. We can run our script from the shell:


  $ perl radiotimes.pl
  The next Buffy episode (Gone) is on Thursday 23 January, from 6:45pm to 7:30pm.

I hope this gives a light-hearted introduction to the usefulness of the modules involved. As a note for your own experiments, WWW::Mechanize supports cookies - in that the requestor is a normal LWP::UserAgent object - but they aren't enabled by default. If you need to support cookies, your script should add "use HTTP::Cookies;" and call "$agent->cookie_jar(HTTP::Cookies->new);" on the agent object in order to enable session-volatile cookies for your own code.

Happy screen-scraping, and may you never miss a Buffy episode again.

Monday, 27 May 2013

Beneficial Data Collection Services

The Internet is becoming the biggest source for information gathering. A variety of search engines are available over the World Wide Web, which help in searching for any kind of information easily and quickly. Every business needs relevant data for its decision making, for which market research plays a crucial role. One of the services booming very fast is data collection. This data mining service helps in gathering relevant data that is hugely needed for your business or personal use.

Traditionally, data collection has been done manually, which is not very feasible in the case of bulk data requirements. Some people still manually copy and paste data from Web pages or download a complete Web site, which is a sheer waste of time and effort. A more reliable and convenient method is automated data collection. A web scraping technique crawls through thousands of web pages for the specified topic and simultaneously incorporates this information into a database, XML file, CSV file, or other custom format for future reference. A few common web data extraction tasks are gathering competitor pricing and feature data from websites; spidering a government portal to extract the names of citizens for an investigation; and collecting images from websites that host a variety of downloadable images.
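As a rough sketch of the automated approach described above - extract fields from fetched HTML and write them to a CSV file - here is a minimal Python example. The markup and field names are invented, and a real crawler would fetch the pages over HTTP first:

```python
import csv
import io
import re

# A tiny sample of the kind of HTML a crawler might fetch (hypothetical markup).
html = """
<div class="product"><span class="name">Widget A</span><span class="price">9.99</span></div>
<div class="product"><span class="name">Widget B</span><span class="price">14.50</span></div>
"""

# Pull (name, price) pairs out of the markup.
rows = re.findall(
    r'<span class="name">(.*?)</span><span class="price">(.*?)</span>', html)

# Write the extracted records as CSV, ready for a spreadsheet or database import.
buffer = io.StringIO()
writer = csv.writer(buffer)
writer.writerow(["name", "price"])
writer.writerows(rows)
print(buffer.getvalue())
```

In practice the CSV would be written to a file, and the regular expression would be replaced by a proper HTML parser for anything beyond trivially regular markup.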

In addition, there is a more sophisticated form of automated data collection service. Here, you can easily scrape web site information on a daily basis, automatically. This method greatly helps you in discovering the latest market trends, customer behavior, and future trends. A few major examples of automated data collection solutions are price monitoring; collecting data from various financial institutions on a daily basis; and verifying different reports on a constant basis and using them to make better, more progressive business decisions.

While using these services, make sure you use the right procedure. For example, when you are retrieving data, download it into a spreadsheet so that the analysts can do the comparison and analysis properly. This will also help in getting accurate results in a faster and more refined manner.



Source: http://ezinearticles.com/?Beneficial-Data-Collection-Services&id=5879822

Saturday, 18 May 2013

Screen Scraping

There is a huge difference between screen scraping and data mining. Basically, screen scraping allows you to obtain information, while data mining allows you to analyze the information you obtain. Before the advent of the internet, screen scraping literally meant scraping off or extracting information from text so it could be analyzed. Today, screen scraping is basically used to scrape information off the web. With that, specially designed programs and applications crawl through websites to pull out data needed by the individuals doing the scraping. This is usually done when a person wants to build websites for price and product comparison, archive web pages, or acquire texts so they can be easily evaluated and filtered.

When you perform screen scraping, you are able to scrape off data more directly. This is also one of the fastest ways to obtain data, since the process is fully automated. Different types of screen scraping services offer different ways of obtaining information. This is usually the solution when the website that is subject to scraping has several barriers designed to block this type of automated activity. Some screen scraping services offer text grepping and regular expression matching. Extracting information from the web can be done through a UNIX grep command or other related techniques for expression matching. Some services offer web scraping applications that can be used to customize and tailor-fit web-based scraping solutions.
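The text grepping and regular expression matching mentioned above can be sketched in a few lines. This hypothetical example pulls e-mail addresses out of a page with Python's re module (the page content is invented):

```python
import re

# A fragment of fetched HTML (hypothetical).
page = """<html><body>
<p>Contact: sales@example.com</p>
<p>Support: help@example.com</p>
</body></html>"""

# A simple (deliberately loose) e-mail pattern; real-world address matching
# needs considerably more care.
emails = re.findall(r"[\w.+-]+@[\w-]+\.[\w.]+", page)
print(emails)  # ['sales@example.com', 'help@example.com']
```

The same idea, run from a shell, is just `grep -Eo` with an equivalent pattern over the downloaded page.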

These applications can try to automatically recognize the data structure of a particular page, or offer a recording interface that significantly reduces the need to write screen scraping code manually, as well as other scraping functions that can be used to extract and convert web content, and database interfaces that store the scraped information in local databases.

On the other hand, data mining is basically the process of automatically searching large caches of information and data for patterns. This means that you already have the information, and what you need to do is analyze the contents to find the useful things you need. This is very different from screen scraping, in which you still need to gather the data before you can analyze it.

Data mining also involves a lot of complicated algorithms, often based on various statistical methods. This process has nothing to do with how you obtain the data; all it cares about is analyzing what is available for evaluation. Screen scraping is often mistaken for data mining, when in fact these are two different things. Today, there are online services that offer screen scraping. Depending on what you need, you can have it custom tailored to meet your specific needs and perform precisely the tasks you want. Finding reliable screen scraping services is not difficult; you can simply search online and find the right company with the right solution for your needs.

Source: http://www.fetch.com/screen-scraping-article/

Thursday, 16 May 2013

Comparable Sales — and Widgets, APIs and Screen Scraping

Recently someone contacted me and inquired about pulling comparable sales info from Zillow, Trulia, Eppraisal, Yahoo and Cyberhomes. He also wanted to sort the details before storing them in Excel for further analysis. If it worked well, he wanted to package it and sell it to the realtor community.

I told him that storing comparable sales data (or any other data, for that matter) is against their terms and conditions. Most of these providers (except Zillow and Eppraisal) do not provide APIs for comparable sales. Regardless, he found someone to screen scrape these sites and create an Excel spreadsheet for a cheap price.

Screen scraping and stealing information from web sites has serious ramifications. It can create legal issues once the web site owners trace the activity to his IP address, and his site could be blacklisted. Even worse, screen scraping will stop functioning once a site makes minor changes to its HTML, which is not uncommon in today's world dominated by screen scrapers.

Screen scraping is simple parsing of web pages using a programming language (like PHP, ColdFusion, Java, ASP, Perl, or Python), looking for specific patterns in the HTML code and extracting certain key details. It requires only basic programming skills, and most of these languages make it easy with powerful parsing capabilities. It amounts to piracy, can have legal ramifications, and is best avoided. The temptation is high, given that there are many freelancers on the internet offering cheap solutions using screen scraping.

This leads to the basic question – How can one access comparable sales information to attract traffic to his site?

The answer depends on your needs and capability. If you don’t want to get your hands dirty with the programming and/or you have low budget, your best bet will be using readily available widgets provided by most of these sites. You will only need some basic HTML skills to make sure the widget is placed properly on your web site without distorting the layout. There may also be plug-ins available (like the WordPress local market explorer) if you want to add these to your blog.

For advanced users with programming skills, you can try the API (Application Programming Interface) offered by these providers. You can also hire programmers to do this for you. These APIs are mostly Web services based on REST (vs. SOAP). Amazon was the pioneer in this area, later embraced by most major players. There are very good frameworks and libraries available for using these APIs. This gives you maximum flexibility, and you can combine them with other APIs like Google Maps, Facebook, Twitter, Walkscore and Yelp (to name a few) to create very interesting end results, known as mashups. A word of caution: make sure that you read and follow their terms and conditions when doing this.
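Once an API returns structured data, the parsing step is far simpler than scraping HTML. A minimal Python sketch, assuming a hypothetical comparable-sales endpoint that returns JSON (the payload shape and field names here are invented, not any provider's real API):

```python
import json

# A hypothetical JSON payload, shaped like what a comparable-sales REST API
# might return. All field names are invented for illustration.
payload = '''{
  "comparables": [
    {"address": "12 Oak St",  "sale_price": 310000},
    {"address": "48 Elm Ave", "sale_price": 287500}
  ]
}'''

data = json.loads(payload)
for comp in data["comparables"]:
    print(comp["address"], comp["sale_price"])

# A simple derived statistic, the kind of thing a mashup might display.
average = sum(c["sale_price"] for c in data["comparables"]) / len(data["comparables"])
print(average)  # 298750.0
```

The structured records go straight into code; there is no reverse engineering of markup, which is exactly why an API beats scraping whenever one is offered.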

API offerings may be very limited in many cases, and you may end up using the widgets in those situations. One also has to be aware of API changes, which may break your code and require fixes to keep it running; something Google, Amazon, Twitter and Facebook have done frequently. I'd recommend making sure you have an ongoing relationship with the programmer you hire for this kind of job. Don't go only by cost, since the cheapest option may not be around when your code needs a fix.

Questions about APIs, widgets, or screen scraping? Ask away in the comments!

Source: http://geekestateblog.com/comparable-sales-and-widgets-apis-and-screen-scraping/

Sunday, 5 May 2013

What are the main reasons to prevent screen scraping?

Operational and infrastructure impact – Overloads server and bandwidth resources

Screen scrapers often hit websites at very high rates, causing significant load on servers and infrastructure. The fact that they use programs rather than the browsers your site is optimized for may also cause additional problems with, for example, caching.

Denial of Service from high volume scraping

As scrapers often use hundreds or even thousands of IP addresses to access a site, even a small bug in the program they use can cause traffic very similar to a DDoS attack against the site.
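A common first line of defense against this kind of volume is per-IP rate limiting over a sliding time window. A minimal Python sketch (the window length and threshold below are arbitrary assumptions):

```python
from collections import defaultdict, deque

WINDOW_SECONDS = 60   # sliding window length (assumed)
MAX_REQUESTS = 50     # hypothetical per-IP threshold within the window

hits = defaultdict(deque)

def is_rate_limited(ip, now):
    """Record a request at time `now` and report whether this IP is over the limit."""
    q = hits[ip]
    q.append(now)
    # Drop timestamps that have fallen out of the window.
    while q and now - q[0] > WINDOW_SECONDS:
        q.popleft()
    return len(q) > MAX_REQUESTS

# Simulate one client requesting once per second for 200 seconds.
flagged = [is_rate_limited("203.0.113.9", t) for t in range(200)]
print(flagged[40], flagged[150])  # False True
```

Real deployments track this at the load balancer or a shared cache rather than in process memory, and must account for many IPs cooperating, which per-IP limits alone cannot catch.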

Content is no longer unique to you – Dilutes the value of your data

Often the end goal of the scrapers is to take the unique data provided by a website and redisplay it on other sites in a slightly different way. This in turn can cause problems with search engine rankings and steal traffic away from the site.

Commercial impact and loss of revenue

The commercial impact of not preventing screen scraping is very much dependent on the specific type of business that is affected. See the executive report for detailed information about each sector's problems with screen scraping.

Without proper detection and blocking, you do the work for the scrapers.

Source: http://www.scrapesentry.com/threat

Friday, 3 May 2013

Content access basics – Part I – screen scraping

In this multi-part series we will look at a number of different approaches that federated search engines (FSEs) take to access content from remote databases.

FSEs are always at the mercy of the content provider when it comes to searching and retrieving content. FSEs perform deep web searches since they access content that lives inside of databases. Read the earlier articles on crawling vs. deep web searching and introduction to the deep web for background information on deep web searching. Also, read the article about connectors to understand how the query processing and search engine submission process works for deep web searching.

When FSEs search deep web databases they often do so by filling out search forms much like humans do and they also process result lists (summaries of documents generated by the remote search engines) much like the way humans examine the search results in their browsers. Processing a list of search results by reading and dissecting the HTML that a search engine provides is called “screen scraping.” Wikipedia has an article about screen scraping.

Screen scraping is the most difficult way to obtain search results because the result data is not structured in a way that makes it easy to identify the fields in the result records. Unfortunately, however, screen scraping is also the most prevalent approach to extracting field data because a majority of the content that is published electronically is expected to be consumed by humans and not by search engines. Fortunately, the use of structured result data is increasing.

Let’s look at what happens when a human searches a deep web database to better understand how screen scraping works. First, he enters his search terms into one or more search fields and submits the form. Then he examines the results that are returned by the form. The results are returned as HTML, which the browser renders, or draws, to look nice on the screen, displaying result text in different fonts and styles. To the user the results are nicely structured. One record after another is displayed. It’s obvious to him where a record starts and where it ends.

Software that has to read the HTML doesn’t have as easy a time as does the human for a number of reasons:

    The screen scraping software has to determine which sections of the HTML document containing the search results it must ignore. Headers, footers, and advertisements are examples of parts of the HTML that need to be ignored. In other words, the screen scraper must determine where a record starts and where it ends.
    Some result fields may be missing in some of the result records. The screen scraper needs to be able to determine when fields are missing and keep processing the results while not getting confused about what fields it is extracting.
    It is not obvious to the screen scraper which field is the title, which is the author, which is the publisher, and so on.
    Some FSEs retrieve multiple pages of results from a source. When result data is returned in XML or other structured format it is a straightforward process to request subsequent sets of results. Screen scraping software however, must be configured for every possible next-page scenario. Sometimes there’s a “next” button to press, sometimes there’s an arrow button to go to the next page, sometimes there’s a page number with a link to click. Also, the page navigation elements may be anywhere on an HTML page.
    Data may be returned in an inconsistent format between records. A date may appear as "January 1, 2008" in one record and as "1/1/2008" in another. While this is not very common within results from one publisher, it becomes a major issue when the FSE is aggregating results from multiple sources which return date, author, and other fields in different formats, and must normalize the field data (convert it to one format) in order to sort by one of those fields.
    Authentication to access restricted content is often problematic for the screen scrapers. When result data is returned in a structured format the expectation is that a computer program will be processing that data and that a computer program that is not a browser will be performing the authentication steps. Thus, there are usually fewer hoops to jump through when authenticating to retrieve structured results data. Additionally, the authentication steps are likely to be documented. In the screen scraping approach the FSE typically has to deal with session information, cookies, and perhaps IP-based authentication. The FSE connector developer has to manually reverse engineer the authentication steps (since there’s usually no documentation) and implement them on a per-source basis.
    The HTML may not be consistent from one record to another. In particular, the first or last record in a result list often has HTML tags around it that are different from the tags around the other records in the results list. And, if a search returns only one result the HTML tags around that result are often different than in the multiple result case.
    The HTML may not be correct. Small errors in HTML may be corrected by browsers and may trip up screen scrapers.
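Two of these difficulties - locating record boundaries and tolerating missing fields - can be illustrated with a small sketch. The markup here is invented, and a real scraper needs far more robust, per-source patterns:

```python
import re

# Two result records in hypothetical markup; the second record has no author.
results_html = """
<div class="rec"><span class="t">Paper One</span><span class="a">Smith</span></div>
<div class="rec"><span class="t">Paper Two</span></div>
"""

records = []
# First find each record's boundaries, then look for fields inside the record.
for block in re.findall(r'<div class="rec">(.*?)</div>', results_html, re.S):
    title = re.search(r'<span class="t">(.*?)</span>', block)
    author = re.search(r'<span class="a">(.*?)</span>', block)
    # A missing field becomes None instead of derailing the whole parse.
    records.append({
        "title": title.group(1) if title else None,
        "author": author.group(1) if author else None,
    })

print(records)
```

The key design point is scoping field searches to one record at a time, so a missing author in one record cannot cause a field from the next record to be misattributed.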

There are two important points to make about the issues with screen scraping we just enumerated. First, humans can deal with ambiguous structure and missing fields much more easily than computers can. Humans have no trouble identifying titles, authors, and other fields without being explicitly told what the fields are. Humans can ignore ads, footers, and headers with little effort. Humans can find links to subsequent results. Computers have to work much harder to obtain the same results.

The second point is that the problems of screen scraping are exacerbated when federation occurs. It is no longer enough to identify the date field in a result record. The format of that date field has to be normalized across all results from all sources. Granted, this is not a problem unique to screen scraping but, in screen scraping we must always be on the lookout for data that is inconsistent within results from the same source. When the data is structured, as XML for example, the data is more likely, but never guaranteed, to be more consistent in its format.
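The date-normalization step described above is straightforward to sketch: try each known source format in turn and convert to one canonical form. The format list here is an assumption; a real FSE would maintain format knowledge per source:

```python
from datetime import datetime

# Candidate date formats seen across sources (assumed; extend as needed).
FORMATS = ["%B %d, %Y", "%m/%d/%Y", "%Y-%m-%d"]

def normalize_date(text):
    """Try each known format and return an ISO 8601 date string, or None."""
    for fmt in FORMATS:
        try:
            return datetime.strptime(text, fmt).date().isoformat()
        except ValueError:
            continue
    return None  # unrecognized format: leave for manual review

# The two inconsistent forms from the example both normalize to one value.
print(normalize_date("January 1, 2008"))  # 2008-01-01
print(normalize_date("1/1/2008"))         # 2008-01-01
```

Once every source's dates are in one canonical form, sorting federated results by date becomes a plain string (or date) comparison.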

It is important to note that the arguments about whether it is better to screen scrape or not are nonsensical ones. Federated search engine vendors do screen scraping when there is no better method to access content from a source. Vendors who brag about not screen scraping are also telling you indirectly that there’s a large pool of sources that they just don’t search.

In subsequent parts of this series we will look at XML gateways, SRU/SRW, OpenSearch, Z39.50, and other methods of accessing content.

Source: http://federatedsearchblog.com/2007/12/27/content-access-basics-part-i-screen-scraping/

Note:

Justin Stephens is an experienced web scraping consultant and writes articles on screen scraping services, website scrapers, Yellow Pages data scraping, Amazon data scraping, and product information scraping.

How to get rid of Screen Scrapers from your Website

While driving on a long trip this weekend, I had a bit of time to think. One topic that came to my mind was screen scraping, with a focus on APIs. It hit me: screen scraping is more of a problem with the content producer than it is with the “unauthorized scraping” application.

Screen scraping is the process of taking information that is rendered on the client and then transforming it in another process. Typically, the information that is obtained is later processed for filtering, saving, or making a calculation on it. Everyone has performed some [legitimate form] of screen scraping: when you print a web page, the content is reformatted to be printed. Many of the unauthorized forms of screen scraping have been collecting information on current gambling games [poker, etc.], redirecting captchas, and collecting airline fare/availability information.

The scrapee’s [the organization that the scraper is targeting] argument against the process is typically a claim that the tool puts an unusual demand on their service. Typically this demand does not provide them with their usual predictable probability of profit that they are used to. Another argument is that the scraper provides an unfair advantage to other users on the service. In most cases, the scrapee fights against this in legal or technical manners. A third argument is that the content is being misappropriated, or some value is being gained by the scraper and defrauded from the scrapee.

The problem I have with fighting back against scrapers is that it never solves the problem that the scrapers try to fix. Let's take a few examples to illustrate my point: the KVS tool, TV schedules, and poker bots. The KVS tool uses [frequently updated] plugins to scrape airline sites to get accurate pricing and seat availability details. The tool is really good for people who want to get a fair bit of information on what fares are available and when. It does not provide any information that was not provided to anyone else; it just makes many more queries than most people can do manually. Airlines fight against this because they make a lot of money on uninformed users. Their business model is to guarantee that their passengers are not buying up cheap seats. When an airline claims that it has a "lowest price guarantee," that typically means that it shows the discount tickets for as long as possible, until they're gone.

Another case where web scraping has caused an issue is with TV schedules. With the MythTV craze a few years ago, many open source users were using MythTV to record programs via their TV card. It's a great technology; however, the schedule is not provided in the cable TV feed, at least not in an unencrypted manner. Users had to resort to scraping television sites for publicly available "copyrighted" schedules.

The poker bots are a bit of an ethical issue, since they change the real-world rules of the game: when playing poker outside of the internet, players do not have access to real-time statistical tools. Online poker providers aggressively fight against the bots. It makes sense; bots can perform the calculations a lot faster than humans can.

Service providers try to block scrapers in a few different ways (the end of the Wikipedia article lists more; this is a shortened version). Web sites try to deny or misinform scrapers by profiling web request traffic (clients that have difficulty with cookies and do not load JavaScript/images are big warning signs), blocking the requesting provider, providing "invisible false data" (honeypot-like paths in the content), etc. Application-based services [poker bots] are more focused on looking for processes that may influence the running executable, securing internal message handling, and sometimes recording the session (as is also typically done in MMORPGs).
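The "invisible false data" honeypot technique mentioned above can be sketched very simply: publish links that are hidden from human visitors (e.g. via CSS), and flag any client that requests one, since only an automated crawler would follow them. The paths below are invented:

```python
# Links to these paths are hidden from human visitors, so any client that
# requests one is almost certainly an automated scraper. Paths are hypothetical.
HONEYPOT_PATHS = {"/catalog/.hidden-listing", "/search?page=9999"}

flagged_ips = set()

def check_request(ip, path):
    """Flag the IP if it requests a honeypot path; return whether it's flagged."""
    if path in HONEYPOT_PATHS:
        flagged_ips.add(ip)
    return ip in flagged_ips

check_request("198.51.100.7", "/catalog/.hidden-listing")  # scraper trips the trap
print(check_request("198.51.100.7", "/catalog/page1"))     # True: already flagged
print(check_request("203.0.113.5", "/catalog/page1"))      # False: normal browsing
```

In production the flag would feed a blocklist or serve deliberately false data, which is the "trap street" idea mapmakers have used for decades.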

In the three cases, my point is not to argue why the service is justified in attempting to block them, my point is that the service providers are ignoring an untapped secondary market. Those service providers have refused to address the needs of this market – or maybe just haven’t seen the market as viable, and are merely ignoring it.

If people wish to make poker bots, create a service that allows just the bots to compete against each other. The developers of these bots are [generally] interested in the technology, not so much the part about ripping-off non-bot users.

For airlines: do not try to hide your data. Open up API keys for individual users. If an individual user tries to abuse the data to resell it, or to create a Hipmunk/Kayak clone, revoke the key. Even if the individual user's service requests don't fit the profile, there are ways of catching this behavior; mapmakers solved this problem long ago by creating trap streets. Scrapers are typically a last resort; they're used to do something that the current process makes very difficult to do.

Warning, more ranting: with airline sites it's difficult to get a good impression of the cost differences of flying from different markets [like flying from Greensboro rather than Charlotte], or of changing tickets, so purchasing from an airline without the aid of this kind of tool is difficult. Most customers want to book a single round-trip ticket, but some may have a complex itinerary that has them leaving Charlotte, stopping over in Texas, continuing to San Francisco, then returning to Texas and flying back to their original city. That could be accomplished by purchasing separate round-trip tickets, but the fare rules allow such combinations to exist on a single itinerary. Why not let your users take advantage of these rules [without the aid of a costly customer service representative]?

People who use scrapers do not represent the majority of a service's customers. In the television-schedules example, the users did not profit off the information; the content they wished to retrieve wasn't even motivated by profit. Luckily, an organization, SchedulesDirect, stepped in and now provides this information at a reasonable cost [$25/yr].

The silver lining is that the battle over scrapers can get interesting. Poker clients have prompted scraper developers to come up with clever solutions; the “Coding the Wheel” blog has an interesting article about this, describing how they inject DLLs into running applications, use OCR, and abuse another process's Windows message handles. Web scraping also introduces interesting topics in machine learning [to build usage profiles] and in identifying usage patterns.

In conclusion: solve the problem that the screen scrapers are attempting to solve, and if you have a situation like poker, prevent the specific behavior you wish to deny.

Source: http://theexceptioncatcher.com/blog/2012/07/how-to-get-rid-of-screen-scrapers-from-your-website/

Note:

Roze Tailer is an experienced web scraping consultant and writes articles on screen scraping services, website scrapers, Yellow Pages data scraping, Amazon data scraping, and product information scraping.

MYOB warns “screen-scraped” bank feeds are unreliable and inaccurate (1 May 2013)

Many business owners use cloud accounting solutions and benefit from daily bank-feeds, a feature where bank transactions are automatically imported and matched to the correct accounts in their accounting software. Bank feeds remove both the tedious task of data entry and the challenge of correctly allocating numerous transactions in the bank reconciliation process. However MYOB warns bank feeds services from some software providers may be unreliable and inaccurate.

MYOB General Manager, User Experience and Design, Ben Ross, says the company is committed to providing reliable, accurate data and maintaining rigorous standards of security when managing financial data.

“At MYOB, we understand that reliable access to accurate data is absolutely fundamental for our customers. Automatically importing transaction details into MYOB accounting solutions significantly reduces manual data entry, improves accuracy and saves both time and money,” he says.

Mr Ross explains that it is important for business owners to understand exactly how their accounting software accesses their sensitive banking information, and whether that access is authorised by their bank's online terms and conditions.

“There are several ways that accounting service providers can access aggregated bank transaction data and unfortunately some software providers play fast and loose with data quality and customer security,” he says.

MYOB uses a bank-authorised data collection system provided by BankLink for its LiveAccounts and AccountRight Live products. In this process, BankLink supplies secure bank transaction data via direct feeds from financial institutions without needing to disclose logon details. The data is supplied in a secure, ‘read only’ format. The entire process complies with the stringent Payment Card Industry Data Security Standard for the safe handling of transaction data and meets the requirements of more than 100 financial institutions.

“MYOB chose to work with BankLink for its proven reliability, security and coverage of feeds from financial institutions across Australia. BankLink has a team of data accuracy specialists reviewing bank data feeds using processes they have refined over their 25 years of providing this service. For this reason, BankLink feeds are 99.9999% accurate and in some cases, more reliable than the bank’s own raw feeds,” says Mr Ross.

BankLink applies a series of proprietary, data validation routines to all bank transactions that identify and correct any anomalies in the data. This sophisticated error detection system results in a significant increase in data accuracy. Furthermore, BankLink’s direct contractual relationship with the banks means that they have protocols in place to fix any errors promptly without any interruption to service.

Some cloud accounting providers use a method commonly called “screen-scraping”. This process requires a business owner to disclose their internet banking username and password to a third-party ‘screen-scraper’. That third party then automatically logs in to the business's internet banking account at regular intervals, copies the transactions, and supplies them to the accounting services provider.
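To make concrete why this approach is fragile: after logging in, a screen-scraper typically parses the bank's HTML transaction table, and any change to the markup silently breaks or corrupts the feed. A toy sketch (the HTML layout here is invented; real scrapers face login flows, pagination, and frequent redesigns on top of this):

```python
import re

def scrape_transactions(html):
    """Pull (description, amount) pairs out of a transaction table by
    pattern-matching the markup. The regex encodes assumptions about
    HTML that the bank never promised to keep stable."""
    rows = re.findall(r"<td>(.*?)</td>\s*<td>([-\d.]+)</td>", html)
    return [(desc, float(amt)) for desc, amt in rows]

page = """
<table>
<tr><td>COFFEE SHOP</td><td>-4.50</td></tr>
<tr><td>PAYROLL</td><td>2500.00</td></tr>
</table>
"""
print(scrape_transactions(page))  # [('COFFEE SHOP', -4.5), ('PAYROLL', 2500.0)]

# If the bank adds a date column, the same code returns corrupted
# data -- and raises no error at all:
redesigned = "<tr><td>12/05</td><td>COFFEE SHOP</td><td>-4.50</td></tr>"
print(scrape_transactions(redesigned))
```

That silent-corruption failure mode is exactly the contrast with a contractual direct feed: the direct feed changes only when both parties agree to change it.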

The screen-scraping process may contravene internet banking terms and conditions.

“Most online banking terms and conditions forbid the disclosure of login and password details to any party, and exclude the bank from liability for any fraud which may then occur on the account – whether or not the fraud is related to the actions of the screen-scraper. We caution users of other software against passing on their online banking credentials through to third parties in return for bank feeds that are insecure and contain inaccuracies,” says Mr Ross.

Along with the potential security risks, screen-scraping can also be unreliable as the third party isn’t working directly with the banks. Not surprisingly, this lack of reliability can lead to frustration for accountants, bookkeepers and business owners.

“The concern for business owners is in the accuracy of their business financials. Even if only two in every hundred transactions are wrong, how do you know which two? That adds in a whole lot of extra work that undermines the original time saving benefit of the bank feed system.”
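Finding “which two” is essentially a multiset reconciliation between the imported feed and the actual statement. A minimal sketch (the transaction shape is invented for illustration) that surfaces both duplicated and dropped transactions by comparing counts rather than sets:

```python
from collections import Counter

def reconcile(feed, statement):
    """Compare an imported bank feed against the real statement.
    Transactions are (date, description, amount) tuples; Counter
    catches duplicate imports that a plain set comparison would miss."""
    f, s = Counter(feed), Counter(statement)
    extras = list((f - s).elements())   # in the feed but not the statement (duplicates)
    missing = list((s - f).elements())  # on the statement but dropped by the feed
    return extras, missing

statement = [("2013-05-01", "CARD PAYMENT", -120.00),
             ("2013-05-02", "CARD PAYMENT", -120.00)]
feed = statement + [("2013-05-02", "CARD PAYMENT", -120.00)]  # one duplicated import

extras, missing = reconcile(feed, statement)
print(extras)   # [('2013-05-02', 'CARD PAYMENT', -120.0)]
print(missing)  # []
```

Note that the two legitimate transactions here have identical descriptions and amounts, the exact situation described in the next quote, which is why count-based matching matters.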

According to Debbie Vihi, owner of Mobile Bookkeeping Services, the bank feeds associated with some cloud accounting packages that rely on a third party or ‘screen scraping’ can be both unreliable and inaccurate. This makes it time-consuming to reconcile a client's bank accounts.

“I was reconciling a client’s accounts when I noticed that the software had duplicated transactions. The client often had a lot of similar amounts coming out of their bank account so the double ups were not picked up — they were mainly related to credit cards,” she says.

“What was usually a five minute job turned out to be quite time consuming. I had to take a million steps backwards and had to manually tick the statements off against the correct transactions,” says Ms Vihi.

Fortunately for accountants, bookkeepers and business owners who want to enjoy the time and cost benefits of bank feeds, MYOB’s provider BankLink offers both an accurate and more reliable alternative to screen-scraping.

“For anyone using cloud accounting, accurate bank feeds can be a real time-saver; inaccurate bank feeds can be a nightmare. To ensure you are getting accurate, reliable data in a way that doesn't contravene your bank's terms and conditions, it's important to understand how your cloud accounting provider obtains its bank feeds. Users should check that they haven't inadvertently supplied a third party with their banking login details, and ask their provider what industry standards their third-party supplier complies with,” says Mr Ross.

Source: http://business.scoop.co.nz/2013/05/01/screen-scraped-bank-feeds-are-unreliable-and-inaccurate/
