The World Wide Web has made text data available like never before. Almost any site accessible through the Internet can be addressed either by a browser or by a program that retrieves the page's plain-text HTML source. This availability of raw data has caused a renaissance of programs called screen scrapers. Screen scrapers access data normally targeted at a screen (or browser window) and scrape the desired data from it for storage, or for repackaging and display.
Brief History of Screen Scrapers
Between the 1970s era of widespread deployment of text-based mainframe/terminal applications and the twenty-first century browser era came the age of the graphical user interface (GUI). Ushered in by the success of the Macintosh, computer applications began to feature windows, drop-down menus, checkboxes, and other user-interface elements that made programs far more flexible to use than their text-based predecessors.
The revolution in GUI adoption caused a problem for organizations that had invested tremendous amounts of time and money in mainframe-based text applications. In less than a decade, these text-based applications went from being cutting edge to antiquated. For productivity reasons, organizations had to rewrite these applications to take advantage of the new GUI paradigm. Even more daunting than the mountain of reprogramming was the conversion of data stored in these mostly custom systems. Few standard data formats existed when they were initially designed, so retrieving and converting the data posed tremendous difficulties.
Enter the screen scraper. A screen scraper was a program that sat between the text-based application and either a user or a more advanced data retrieval program. For GUI applications, the screen scraper acted as a middle layer between the graphical interface in the foreground and the real text-based application in the background. Data would be scraped from the text-based screen and loaded into the frontend graphical interface. A user would work with a modern interface with all of the advantages provided by a GUI, including Undo, Cut, Copy, and Paste (among other functions). When the user clicked the Save button, the screen scraper would interface with the text-based application and act as if the user were punching in the keystrokes and data by hand.
More commonly, a screen scraper would be used to access a system and “scrape” the screen for data held in the text-based system. The scraper might send a query into the text-based system, scrape the results displayed by the mainframe, and place them in a new data store such as a database management system (DBMS).
As time passed and text-based applications were retired, screen scrapers became less and less common. Organizations upgraded antiquated text-based systems and traded custom data storage for standardized database servers with flexible data stores. Scrapers were no longer needed when data retrieval through many different types of retrieval middleware was available.
The explosive growth of HTML changed all that. Almost overnight, oceans of data were being widely published in an unencrypted, public, and quickly accessible data format. Anything Web sites published for viewing by individual browsers could easily be harvested by a program, stripped of its formatting, and either reformatted for display or stored for later access.
Since PHP comes with built-in HTTP capabilities, it is a fairly easy task to write code to go to a Web site, grab the HTML code from the site, and strip the desired data from the page. Implementing such code as a Joomla component is only a short development step from there.
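As a sketch of that process, the following PHP fragment fetches a page over HTTP and strips out a single piece of data (here, the page title). The URL and the regular-expression pattern are illustrative assumptions; a real scraper must be written against the actual markup of the target page, and will break whenever that markup changes.

```php
<?php
// Minimal screen-scraper sketch. The URL and the pattern below are
// hypothetical examples, not taken from any real target site.

// Retrieve the raw HTML of a page using PHP's built-in HTTP stream wrapper.
function fetchPage(string $url): string
{
    $html = @file_get_contents($url);
    return ($html === false) ? '' : $html;
}

// Strip one piece of data -- the page title -- out of the HTML.
function extractTitle(string $html): string
{
    if (preg_match('/<title>(.*?)<\/title>/is', $html, $matches)) {
        return trim($matches[1]);
    }
    return '';
}

// Against a live site, the two calls would be combined:
//   $title = extractTitle(fetchPage('http://www.example.com/'));
// A hard-coded page shows the extraction logic without a network connection:
$sample = '<html><head><title>CDC Safety Alerts</title></head><body>...</body></html>';
echo extractTitle($sample); // prints "CDC Safety Alerts"
```

Wrapping logic like this in a Joomla component is then mostly a matter of placing the scraped text into the component's output.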
Pitfalls of Using a Screen Scraper
Whenever you are retrieving content from a remote site and publishing it for the use of your Web site visitors, a number of problems arise — technical, legal, and ethical. These problems should be taken into account before you undertake a screen scraping application.
On the technical side, using a scraper means that you have no access to a standard API for the data you need. That means any change to the host Web page can instantly cause your scraper to stop functioning properly. Your program relies on the data being presented in a particular format. Unlike a human viewer, who understands the information whether it is displayed in three columns or four, a program must have consistency in order to process the page and retrieve the information. Therefore, a scraping application can go from working flawlessly one day to not working at all the next. Scrapers require constant monitoring.
Further technical problems arise when the site tries to discourage scraper applications. Some organizations hate scraping applications because the programs retrieve the information while expending site bandwidth and other resources. Further, scrapers eliminate the possibility of obtaining visitor data from the user's browser and prevent the display of the Internet ads that generate revenue for the site.
On the legal side, even though the data may be publicly available, it may be illegal for you to redisplay it on your own Web site, in the same way that you can't plagiarize a written article and claim it as your own. For Web-based content, these issues are very cloudy. Some Web data is compiled from numerous sources before being processed and published to the world, making it difficult to determine the information's provenance. Republishing it through scraping and reformatting complicates the issue further.
Additional legal problems arise when the site's terms of use prohibit scraping. Google, for example, includes such a clause in its Terms of Service document. That means that, although Google is very easy to scrape, doing so violates the site's terms.
On the ethical side, even if you can retrieve the information and republish it without any legal ramifications, it might not be ethical to do so. There is no question that an automated solution that queries a Web data store consumes resources on the target Web server. It is also very likely that the site owner never intended the information published on the site to be harvested by a machine for unknown repurposing.
However, despite these significant pitfalls to screen scraping, there are numerous applications where scraping is not only possible, but extremely useful. Many government Web sites publish information to the Web that is free and in the public domain. The information is provided because the taxpayers have paid for it to be widely disseminated (for example, Centers for Disease Control and Prevention safety alerts). Often, this government data is located in difficult-to-find places, or in such cryptic form that a site republishing the information is performing a public service by making it accessible and usable.
Even so, some public databases warn against automated retrieval. The California Department of Real Estate, for example, posts the following notice on its search page:
The online status inquiry feature is a service for consumers. It is not intended for, nor capable of, automated database searches or sorts. If you desire such database files, contact the Department for information on availability and costs.
While the technology of the scraper presented here can be used to obtain data from a site whether permission has been granted or not, you should respect the wishes of a site and not apply your scraper there.