Sunday, 28 April 2013

How To Do Screen Scraping, Web Scraping And URL Scraping?

Web Scraping

Extracting information from websites is called scraping. In most cases, scraper software reproduces the way a human explores the web, either by performing low-level HTTP requests or by automating fully developed web browsers such as Mozilla Firefox, Google Chrome and Internet Explorer.

A web scraper can collect specific details from a number of different sites without you having to gather them manually. That is, the data are collected using automated scraper tools and software. So if you need to gather a large amount of information about a list of products, you simply scrape or extract that information using automated web scraping tools.

By doing this, you will be able to retrieve the prices, pictures, descriptions and names of the products you are searching for. After extraction, the data can be exported to various formats, including TXT, XML, HTML and SQL scripts. This gives you the advantage of getting the details in whatever format you normally use.

URL Scraping

Some software can even scrape URLs. A URL scraper is useful for gathering high-quality URLs for your blogs and other articles. For people who want to build better backlinks, a URL scraping tool is useful. You can even use it for SEO research and study. When you type in a keyword, the software will scrape the URLs of the top web pages that rank for that keyword.

You can then easily target your campaign to the specific market you are after, and spend your backlink budget with confidence, because you can be sure that the best sites on the web are linking back.

This technique can be used to help your own website and improve its performance on the Internet.

Now let’s see an example.

Screen scraping

The process of collecting visual data from web pages is known as screen scraping. It is the method of acquiring data displayed on screen by capturing the text manually or via software. To perform scraping automatically, software must be used that can recognize the specific data. This screen scraping software takes the data from HTML web pages and converts the unstructured data into structured records or reports.

Screen scraping software can be used for a number of applications. For example, a broker may use a screen scraper to gather information from competitors' websites to work out an average price to offer for a house in a particular region. A marketer may use screen scraping software to collect customer emails, while researchers generally use it as a tool for gathering a wide range of data and information.

Through webscrapingexpert.com you can easily get access to screen scraping services: scraping of regular price updates, scraping for sales leads, and scraping of the regular product or service updates of competitors.

Extracting data across all types of information, cities and countries is a snap with the screen scrapers offered by this scraping solutions provider. These scrapers can write the extracted data to Excel or CSV format and also save it as XML.

They can be used dynamically for both unlimited ad extraction and unique ad extraction. With automatic updating, the latest software versions are compatible with the latest operating system platforms, such as Windows XP, Vista and Windows 7, as well as Linux and Mac.

Source: http://data.ezinemark.com/how-to-do-screen-scraping-web-scraping-and-url-scraping-7d3143b93016.html

Note:

Roze Tailer is an experienced web scraping consultant who writes articles on screen scraping services, website scrapers, Yellow Pages data scraping, Amazon data scraping, and product information scraping.

Screen Scraping HTML

We've all found useful information on the web. Occasionally, it's even necessary to retrieve that information in an automated fashion. It could be just for your own amusement, possibly a new web service that hasn't yet published an API, or even a critical business partner who only exposes a web-based interface to you.

Of course, screen scraping web pages is not the optimal solution to any problem, and I highly advise you to look into APIs or formal web services that will provide a more consistent and intentional programming interface. Potential problems could arise for a number of reasons.
Step 0 : Considerations

The most obvious and annoying problem is that you are not guaranteed any form of consistency in the presentation of your data. Websites are under construction constantly. Even when they look the same, programmers and designers are behind the scenes tweaking little pieces to optimize, straighten, or update. This means that your data is likely to move or disappear entirely. As you can imagine, this can lead to erroneous data or your program failing to complete.

A problem that you might not think of immediately is the impact of your screen scraping on the target's web server. During the development phase especially, you should give serious thought to mirroring the website using any number of mirroring applications available on the web. This will protect against you accidentally Denial-of-Servicing the target's web site. Once you move to production, out of common courtesy, you should limit the running of your program to as few times as possible while still providing the accuracy you require. Obviously, if this is a business-to-business transaction, you should keep the other guy in the loop. It won't be good for your business relationships should you trip the other company's Intrusion Detection System and then have to explain what you're doing to a defensive security administrator.

Along the same lines, consider the legality of the screen scraping. To a web server, your traffic could masquerade as 100% interactive, valid traffic, but upon closer inspection, a wise system administrator will likely put the pieces together. Search that company's website for "Acceptable Use Policies" and "Terms of Service." In some cases they may not apply, but it's likely that the privilege to access the data is granted only after agreeing to one of the two aforementioned documents.
Step 1 : Research

At this point, it's necessary to dive into the task at hand. Go through the motions manually in a web browser that supports thorough debugging. My experience with Firefox has always been a positive one. Through the use of tools like the DOM Inspector, the built-in JavaScript Debugger, and extensions like Web Developer, View Source With .., and Venkman, it's been one of the best platforms for web development I've encountered. Incidentally, the elements of web design are critical to the automated extraction of that data. There are two phases to debug in order to write a good screen scraper.

The Request

A web server is not a mind reader; it has to know what you're after. HTTP requests tell the web server what document to serve and how to serve it. The request can be issued through the address bar, a form, or a link. As you navigate the site, take note of the parameters passed in the query string of the URL. If you need to log in, use the Web Developer extension to "Display Form Details" and take note of the names of the login prompt and the form objects themselves. Also, it's important to take note of the "METHOD" the form is going to use, either "GET" or "POST". As you go through, sketch out the process on a scrap piece of paper with details on the parameters along the way. If you're clicking on links to get where you need to go, use the right-click option "View Link Properties" to get details.

A key thing people often miss when doing web automation is the effect of client-side scripting. You can use Venkman to step through the entire run of client-side code. You want to pay attention to hidden form fields that are often set "onClick" of the submit button, or through other types of normal user interaction. Without knowing and setting these hidden fields to the correct values, the page will refuse to load or will cause problems. Granted, this isn't good practice on the site designer's part, as a growing number of security-aware web surfers are limiting, or disabling, client-side scripting entirely.

The Response

After sketching out the path to your data, you've finally arrived at the page that contains the data itself. You now need to map out the page in a way that your data can be identified among the rest of the insignificant details, styling, and advertisements! I've always believed in syntax highlighting and have become accustomed to vim's flavor of highlighting. I've got the View Source With .. extension configured to use gvim, so I right click and, with any luck, the page source is displayed in the gvim buffer with syntax highlighting enabled. If the page has a weird extension, or no extension, I might have to "set syntax=html" if it's not presenting the proper page headers.

Search through the source file, correlating the visual representations in the browser with the source code that's generating them. You'll need to find landmarks in the HTML to use as a means to guide your parser through an obscure landscape of markup language. If you're having problems, another indispensable tool provided by Firefox is "View Selection Source". To use it, simply highlight some content and then right click -> "View Selection Source". A Mozilla source viewer opens with just the HTML that generated the selected content highlighted, with some surrounding HTML to provide context.

You're going to have to start thinking like a machine. Think simple: 1's and 0's, true and false! I usually start at my data and work backwards, looking for a unique tag or pattern that I can use to locate the data moving forward. Look not only at the HTML elements (<b>, <td>, etc.) but at their attributes (color="#FF0000", colspan="3") to profile the areas containing and surrounding your data.

The lay of the land is changing these days. It should be getting much easier to treat HTML as a data source thanks to Web Standards and the growing number of web designers pushing whole-heartedly for their adoption. The old table-based layouts, styled by font tags and animated GIFs, are giving way to "Document Object Model"-aware design and styling fueled mostly by Cascading Style Sheets (CSS). CSS works most effectively when the document layout emulates an object: there are "classes", "ids", and tags that establish relationships. CSS makes it trivial for web designers with passion and experience in the design arts to cooperate with web programmers whose passion is the art of programming and whose idea of "progressive design" is white text on a black background! The cues that programmers and designers specify to ensure interoperability of content and presentation give the screen scraper a legible road map by which to extract their data. If you see "div", "span", "tbody", or "thead" elements bearing attributes like "class" and "id", favor using these elements as landmarks. Though nothing is guaranteed, it's much more likely that these elements will maintain their relationships, as they're more often the result of deliberate cooperation than entropy.

One of the simplest ways to keep your bearings is to print out the section of HTML you're targeting, and sketch out some simple logic to be able to quickly identify it. I use a highlighter and a red pen to make notes on the printout that I can glance at as a sanity check.

Step 2 : Automated Retrieval of Your Content

Depending on how complicated the path to your data is, there are a number of tools available. Basic "GET" method requests that don't require cookies, session management, or form tracking can take advantage of the simple interface provided by the LWP::Simple package.
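As a sketch of how simple that interface is (the URL is an illustrative placeholder):

```perl
#!/usr/bin/perl
use strict;
use warnings;
use LWP::Simple;

# get() performs a single HTTP GET and returns the page body,
# or undef on failure.
my $html = get('http://www.example.com/data.html')
    or die "Couldn't fetch the page!\n";

print $html;
```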

That's it. Simple.

More complex problems involving cookies and logins will require a more sophisticated tool. WWW::Mechanize offers a simple solution to a complex path to your data, with the ability to store cookies and construct form objects that can intelligently initialize themselves. An example:
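A sketch of a login-and-navigate session; the URL, form field names, and link text are illustrative stand-ins for whatever you recorded during the research phase:

```perl
#!/usr/bin/perl
use strict;
use warnings;
use WWW::Mechanize;

# WWW::Mechanize keeps cookies between requests automatically.
my $bot = WWW::Mechanize->new();

# Fetch the login page (URL is illustrative).
$bot->get('http://www.example.com/login.html');

# Fill in and submit the login form, using the field names noted
# with the Web Developer extension's "Display Form Details".
$bot->submit_form(
    form_number => 1,
    fields      => {
        username => 'myuser',
        password => 'mypass',
    },
);

# Follow a link by its visible text to reach the page holding the data.
$bot->follow_link( text => 'My Account' );

print $bot->content();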

Step 3 : Data Processing

There are two main ways to parse markup languages like HTML, XHTML, and XML. I've always preferred the "event driven" methodology. Essentially, as the document is parsed, new tags trigger events in the code, calling functions you've defined with the attributes of the tag included as arguments. The content between a start and end tag is handled through another callback function that you've defined. This method requires that you build your own data structures. The second method parses the entire document, building a tree-like object which it then returns to the programmer. This second method is very useful when you have to process an entire document, modify its contents and transform it back into markup language. Usually, a screen scraping program cares very little about the "entire document" and more about the interesting tidbits; everything else can be ignored.

HTML::Parser

HTML::Parser is an event driven HTML parser module available on CPAN. Using the above content retrieval code snippet, delete the "print $bot->content();" line, and insert this code, with "use" statements at the top for consistency.
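One way to wire that up is sketched below. The landmark (a <b> element with class "obsTempTextA") is an assumption about the target page's markup, not a guaranteed detail; substitute whatever landmark your own research turned up.

```perl
use strict;
use warnings;
use HTML::Parser;

# State variables: $grabText flags when we're inside the element
# holding the temperature; $textStr accumulates its text content.
my $grabText = 0;
my $textStr  = '';

# Called for every start tag: turn recording on at our landmark.
sub startHandler {
    my ( $tag, $attr ) = @_;
    if ( $tag eq 'b' && defined $attr->{class} && $attr->{class} eq 'obsTempTextA' ) {
        $grabText = 1;
    }
}

# Called for text between tags: record it only while inside the landmark.
sub textHandler {
    my ($text) = @_;
    $textStr .= $text if $grabText;
}

# Called for every end tag: stop recording once the element closes.
sub endHandler {
    my ($tag) = @_;
    $grabText = 0 if $tag eq 'b';
}

my $parser = HTML::Parser->new(
    api_version => 3,
    start_h     => [ \&startHandler, 'tagname, attr' ],
    text_h      => [ \&textHandler,  'dtext' ],
    end_h       => [ \&endHandler,   'tagname' ],
);

# $bot is the WWW::Mechanize object from the retrieval snippet above.
$parser->parse( $bot->content() );
$parser->eof();
```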

Using this, it's simple to extract the temperature from the variable $textStr. If you wanted to extract more information, you could use a more complex data structure to hold all the variables. The important thing to remember about the event-based model is that everything happens linearly. It's good practice to keep state, either through a simple scalar, like the $grabText var above, or in an array or hash. If you're dealing with data that's nested in several layers of tags, you might consider something like this:
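A sketch of that idea, tracking nesting with a tag stack and collecting several fields into a hash; the "id" values mapped to fields are hypothetical:

```perl
use strict;
use warnings;

# A stack of open tags tracks nesting depth; %data collects the fields.
my @tagStack;
my %data;
my $currentField = '';

sub startHandler {
    my ( $tag, $attr ) = @_;
    push @tagStack, $tag;

    # Map landmark ids to the fields we want (ids are illustrative).
    if ( defined $attr->{id} ) {
        $currentField = 'temperature' if $attr->{id} eq 'temp';
        $currentField = 'humidity'    if $attr->{id} eq 'humid';
    }
}

sub textHandler {
    my ($text) = @_;
    $data{$currentField} .= $text if $currentField;
}

sub endHandler {
    my ($tag) = @_;
    pop @tagStack;

    # Leaving the landmark element ends the current field.
    $currentField = '' if $tag eq 'div';
}
```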

This model works great for most screen scraping, as we're usually interested in key pieces of data on a page by page basis. However, this can quickly turn your program into a mess of handler subroutines and complex tracking variables that make managing your screen scraper closer to voodoo than programming. Thankfully, HTML::Parser is fully prepared to make our lives easier by supporting subclassing.
Step 4 : SubClassing for Sanity

I usually like to have one subclassed HTML::Parser class per page. In that class I'll include accessors to the relevant data on that page. That way, I can just "use" my class where I'm processing the data for that one page, and keep the main program relatively free of unnecessary clutter.

The following script uses a simple interface to pull down the current temperature in Fahrenheit. The accessor method allows the user to specify the units they'd like the temperature back in.
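A sketch of such a script; the weather.com URL is illustrative, and the parsing is delegated entirely to the homemade module:

```perl
#!/usr/bin/perl
use strict;
use warnings;
use LWP::Simple;
use MyParsers::Weather::Current;

# Fetch the page (URL is an illustrative placeholder).
my $html = get('http://www.weather.com/weather/local/21201')
    or die "Couldn't fetch the weather page!\n";

# The subclass inherits parse() and eof() from HTML::Parser.
my $parser = MyParsers::Weather::Current->new();
$parser->parse($html);
$parser->eof();

# The accessor takes the desired units: 'F' or 'C'.
print 'Current temperature: ', $parser->temperature('F'), "F\n";
print 'Current temperature: ', $parser->temperature('C'), "C\n";
```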

The script uses a homemade module "MyParsers::Weather::Current" to handle all the parsing. The code for that module is provided below.
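One possible shape for that module, subclassing HTML::Parser's version 2 method-callback interface (an argument-less new() sets up the start/text/end methods as callbacks); the landmark class name is again an assumption about the page's markup:

```perl
package MyParsers::Weather::Current;
use strict;
use warnings;
use base 'HTML::Parser';

# Keep all parsing state inside the object itself.
sub new {
    my $class = shift;
    my $self  = $class->SUPER::new();   # v2 API: events call our methods
    $self->{_grabText} = 0;
    $self->{_tempF}    = undef;
    return $self;
}

# Start-tag event: turn recording on at our (assumed) landmark.
sub start {
    my ( $self, $tag, $attr, $attrseq, $origtext ) = @_;
    if ( $tag eq 'b' && defined $attr->{class} && $attr->{class} eq 'obsTempTextA' ) {
        $self->{_grabText} = 1;
    }
}

# Text event: pull the first number out of the landmark's content.
sub text {
    my ( $self, $text ) = @_;
    if ( $self->{_grabText} && $text =~ /(-?\d+)/ ) {
        $self->{_tempF} = $1;
    }
}

# End-tag event: stop recording.
sub end {
    my ( $self, $tag, $origtext ) = @_;
    $self->{_grabText} = 0;
}

# Accessor: return the temperature in the requested units.
sub temperature {
    my ( $self, $units ) = @_;
    return undef unless defined $self->{_tempF};
    return $self->{_tempF} if !$units || uc($units) eq 'F';
    return sprintf '%.1f', ( $self->{_tempF} - 32 ) * 5 / 9;
}

1;
```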

Wrapping Up

HTML can be an incredibly effective transport mechanism for data, even if the original author hadn't intended it to be that way. With the advent of Web Services and standards-compliant designs utilizing Cascading Style Sheets, it's becoming more and more interoperable and cooperative. Learning screen scraping techniques can provide a wealth of information for the programmer to analyze and format to their heart's content.

As an exercise, you might want to expand on the "MyParsers::Weather::Current" object to pull additional information from weather.com's page, and add a few more accessors! If you'd really like a challenge, it'd be kind of fun to write a parser for each of the major weather sites, pull the forecast data down, and use a weighted average based on each site's past accuracy to get an "educated guess" at the weather conditions!

Feel free to contact me with Questions/Comments on this article!

Source: http://edgeofsanity.net/article/2005/04/06/html-parsing.html


RYANAIR: screen scrapers, databases, free-riding and unfair competition in Spain

Here's an instructive piece from our man in Spain, Fidel Porcuna, on a situation in which -- even for a business that has an ample portfolio of rights -- it may be difficult or impossible to guard against free-riding.  Fidel writes:

On 9 October 2012 the Spanish Supreme Court ruled on the dispute between Ryanair Ltd and Atrápalo, S.A., a Spanish online travel agency using screen scraper software on Ryanair's website. The Court confirmed the lower courts' decisions dismissing Ryanair's claims based on copyright infringement of a database, infringement of a sui generis or standalone database right, and unfair competition. The proven facts were as follows: Atrápalo regularly enters Ryanair's website as a mere user. By means of screen scraper software that reads the search patterns of the Ryanair website, Atrápalo extracts the information on flights that its own user is requesting through Atrápalo's website and provides it, omitting that such information is scraped from Ryanair's website. Atrápalo collects not only time details, but also prices as displayed on Ryanair's website. To such prices Atrápalo adds a cut (its profit). Ryanair offers a whole range of complementary services to anyone who navigates through its website searching for a flight. The terms and conditions regulating the use of Ryanair's websites prohibit the use of screen scrapers and any use of the websites for commercial purposes.

Drawing exhaustively on the CJEU's interpretation of Directive 96/9 on the legal protection of databases (cases C-604/10 Football Dataco Ltd, The Scottish Premier League Ltd, The Scottish Football League, PA Sport UK Ltd v Sportradar GmbH and Sportradar AG; C-545/07 Apis-Hristovich EOOD v Lakorda AD; C-444/02 Fixtures Marketing Ltd v Organismos Prognostikon Agonon Podosfairou; C-338/02 Fixtures Marketing Board Ltd v Svenska Spel AB; C-203/02 British Horseracing Board and Others; etc.), the findings of the Court are as follows:

    Ryanair does not have a database protected under Article 12.2 of the Spanish Copyright Act (as implemented by Articles 1(2) and 3(1) of Directive 96/9). The Court declares that there is no proper database (no collection of independent data), but rather software that generates the information requested under the parameters introduced by the user (that is, software that provides the best price for the flight the user is looking for, considering a range of variable factors). Even in the hypothetical case that the Court accepted Ryanair's allegation of the existence of a database, in no case could such a hypothetical database's structure meet the originality threshold necessary for protection; indeed, the selection and arrangement come from software, says the Court. Ryanair countered that, in contrast, the Regional Court of Hamburg had declared in Ryanair v Cheaptickets of 26 February 2010 that Ryanair did have a database.
    There is no sui generis right in a database, as Ryanair's substantial investment was not directed at collecting data: it was directed at creating software that generates information under the parameters introduced by the user of Ryanair's website. That is, the investment ultimately relates to the creation of information, not to its collection, verification or presentation.

Importantly, the Court refers to violation of contractual law and unfair competition as follows:

    The Court concludes that there is no contractual relationship between Atrápalo and Ryanair and therefore no violation of a contract exists. The Court accepts that the supply of, or access to, information on flights could be subject to a contract under Spanish law, but it considers that the use of the Ryanair website (free to anyone who types the URL address) does not entail consent to enter into such a contract. Therefore Ryanair failed to prove Atrápalo's consent to its terms and conditions for navigating through its website, despite the latter using the website through a screen scraper expressly forbidden by those terms and conditions. The situation then, as viewed by the Court, is that Atrápalo carried out something not allowed by Ryanair in a contract to which it did not consent, so no violation of the contract could exist. The Court noted that Ryanair acknowledged that it does not apply proper (technical) means to prevent travel agencies from using its websites.

    For procedural reasons, the Supreme Court did not decide on the merits regarding unfair competition, but that claim was nonetheless rejected by the lower court (the Court of Appeals). Ryanair argued that Atrápalo was free-riding on its effort to create powerful and reliable software that optimizes flight information according to users' requests. Through the screen scraping, Atrápalo is also diverting users away from Ryanair's website, where a range of complementary services (car rentals, hostels, etc.) are offered to anyone looking for a cheap flight, causing loss of profits to Ryanair. The Court of Appeals held that there is no unfair advantage taken of Ryanair's repute (as argued by Ryanair's lawyers), and that Atrápalo and other travel agencies do not need authorisation from Ryanair to exercise their intermediary role, as there is no legal right that would support this. Nor is there bad-faith conduct, as Atrápalo is not affecting the normal functioning of the market or altering the market's competitive structure. Indeed, its activity is beneficial for users and therefore helps to maintain and foster free competition in the current economic order.

Source: https://www.marques.org/class46/default.asp?D_A=20130301

Screen Scraping, Screen Scrapping Software, Screen Scraper, Screen Scrappers

Screen Scraping

The process of collecting visual data from web pages is known as screen scraping. It is the method of acquiring data displayed on screen by capturing the text manually or via software. To perform scraping automatically, software must be used that can recognize the specific data. This screen scraping software takes the data from HTML web pages and converts the unstructured data into structured records or reports. This software for screen scraping can be used for a number of applications. For instance, a real estate agent may use a screen scraper to gather data from competing websites to form an average price or offer for a given house in an area. A marketer may use screen scraping software to collect customer emails, whereas researchers in general use it as a tool to gather a wide array of data or information.

Through scrappingexpert.com one can easily get access to screen scraping services: scraping of regular price updates, scraping for sales leads, and scraping of the regular product or service updates of competitors.

Extracting data from all categories of information, cities and countries is really easy with the screen scrapers offered by this scraping solutions provider. These scrapers can write the extracted data to Excel or CSV format and also save it as XML.
They can be used dynamically for both unlimited ad extraction and unique ad extraction. With the feature of automatic updating, these are the latest versions of scraping software, compatible with the latest OS platforms like Windows XP, Vista, Windows 7, Linux and Mac.

Source: http://www.offroadwithcobba.com/screen_scraping_screen_scrappi/
