Parsing HTML Documents with the Html Agility Pack
<p><i>Screen scraping</i> is the process of programmatically accessing and processing information from an external website. For example, a price comparison website
might screen scrape a variety of online retailers to build a database of products and what various retailers are selling them for. Typically, screen scraping is
performed by mimicking the behavior of a browser - namely, by making an HTTP request from code and then parsing and analyzing the returned HTML.
</p><p>
The .NET Framework offers a variety of classes for accessing data from a remote website. These classes are useful for making an HTTP request to a remote website and pulling down the markup from a particular URL, but they offer no assistance in parsing the returned HTML. Instead, developers commonly
rely on string parsing methods like <code>String.IndexOf</code>, <code>String.Substring</code>, and the like, or on regular expressions.
</p><p>
Another option for parsing HTML documents is to use the Html Agility Pack, a free, open-source library designed to
simplify reading from and writing to HTML documents. The Html Agility Pack constructs a Document Object Model (DOM) view of the HTML document being parsed. With a
few lines of code, developers can walk through the DOM, moving from a node to its children, or vice versa. Also, the Html Agility Pack can return specific nodes in the
DOM through the use of XPath expressions. (The Html Agility Pack also includes a class for downloading an HTML document from a remote website; this means you can both
download and parse an external web page using the Html Agility Pack.)
</p><p>
This article shows how to get started using the Html Agility Pack and includes a number of real-world examples that illustrate this utility. A complete, working
demo is available for download at the end of this article. Read on to learn more!
</p>
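<p>
To make the contrast between string searching and parser-based extraction concrete, here is a minimal sketch. It does not use the Html Agility Pack, which is a .NET library; it is a Python analog built on the standard library's <code>html.parser</code>, which is an event-based parser rather than a DOM. The sample markup and the names <code>SAMPLE</code>, <code>hrefs_by_string_search</code>, <code>LinkCollector</code>, and <code>hrefs_by_parser</code> are hypothetical and chosen only for illustration.
</p>

```python
from html.parser import HTMLParser

# Hypothetical sample markup: the second link uses uppercase names
# and single quotes, which is still valid HTML.
SAMPLE = """
<html><body>
  <a class="nav" href="/home">Home</a>
  <A HREF='/products'>Products</A>
  <a name="top">No link here</a>
</body></html>
"""


def hrefs_by_string_search(html):
    """Naive approach: scan for the literal text href=" with str.find."""
    marker = 'href="'
    found, pos = [], 0
    while True:
        start = html.find(marker, pos)
        if start == -1:
            return found
        start += len(marker)
        end = html.find('"', start)
        if end == -1:  # unterminated attribute value: stop scanning
            return found
        found.append(html[start:end])
        pos = end + 1


class LinkCollector(HTMLParser):
    """Parser-based approach: the parser normalizes tags and attributes."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        # html.parser lowercases tag and attribute names and strips quotes.
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value is not None:
                    self.hrefs.append(value)


def hrefs_by_parser(html):
    collector = LinkCollector()
    collector.feed(html)
    collector.close()
    return collector.hrefs


if __name__ == "__main__":
    print(hrefs_by_string_search(SAMPLE))
    print(hrefs_by_parser(SAMPLE))
```

<p>
On this sample the string search finds only <code>/home</code>, because it misses the link written with uppercase names and single quotes, while the parser-based version returns both <code>/home</code> and <code>/products</code>. A DOM library with XPath support goes a step further by letting the caller select nodes declaratively instead of handling parse events by hand.
</p>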
