Five web sources need to be parsed and data entries (say search results) need to be extracted. What is the best approach?
One could use regular expressions to work on the data. However, thanks to my experience with jQuery I am more familiar with XPath selectors (which are similar to CSS selectors), so I’ll describe an approach that does not use regular expressions.
Some information on how the two (XPath selectors and CSS selectors) are related can be found in this post by John Resig, the creator of jQuery.
Here are the steps required to extract the data from the web sources:
- Start by fetching the page itself: connect to it with an HTTP library available in your language (all good languages have one anyway).
- Once you’ve got the raw HTML as a string, you need to ‘massage’ it into well-formed XML. The approach depends on the language: I have found that BeautifulSoup is good for Python, and JTidy might be good for Java.
- The above libraries will transform your HTML string into a well-formed XML tree structure. By inspecting the page you then manually identify the repeating structure that holds each result entry. For example, you may find that your XML tree has a snippet like the following:
<tr>
<td colspan="3">
<a href="..." class="medium-text" target="_self">
Experiments on Design Pattern Discovery
</a>
<div class="authors">
Jing Dong, Yajing Zhao
</div>
</td>
</tr>
- In the above example we would create an XPath selector as follows:
//tr/td[contains(@colspan,'3')]
- This returns a list of the elements that matched the selector. The contents of one such element look like this:
<a href="..." class="medium-text" target="_self"> Experiments on Design Pattern Discovery </a> <div class="authors"> Jing Dong, Yajing Zhao </div>
- Once you have that list you can start pulling the details out of each result entry. To do this you may write custom string-parsing functions, for example one that pulls the authors out of the result entry and separates them from its title.
- Alternatively, you could apply Natural Language Processing (NLP) to the entries. NLP attempts to identify the different kinds of words and text within a larger body of text. For Python I believe NLTK is appropriate, but NLP itself is beyond the scope of this discussion.
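The selection and string-parsing steps above can be sketched in Python as follows. This is only a minimal illustration under a few assumptions: the fetch and tidy steps are skipped and the snippet from above is pasted in directly, the wrapping `<table>` element and the name `extract_entries` are made up for the example, and the standard library's `xml.etree.ElementTree` is used in place of a full XPath engine. ElementTree supports only a subset of XPath and has no `contains()`, so the sketch uses an exact match on `colspan` instead.

```python
import xml.etree.ElementTree as ET

# Sample input: the snippet from above, wrapped in a hypothetical <table>
# root. In practice this would be the tidied output of the clean-up step.
SAMPLE = """
<table>
<tr>
<td colspan="3">
<a href="..." class="medium-text" target="_self">
Experiments on Design Pattern Discovery
</a>
<div class="authors">
Jing Dong, Yajing Zhao
</div>
</td>
</tr>
</table>
"""


def extract_entries(xml_text):
    """Return a list of {'title': str, 'authors': [str, ...]} dicts."""
    root = ET.fromstring(xml_text.strip())
    entries = []
    # Exact attribute match; ElementTree's XPath subset lacks contains().
    for cell in root.findall(".//tr/td[@colspan='3']"):
        link = cell.find("a")
        authors_div = cell.find("div[@class='authors']")
        if link is None or authors_div is None:
            continue  # not a result entry, skip it
        # Collapse the newlines and spaces around the title text.
        title = " ".join((link.text or "").split())
        # Simple custom string parsing: authors are comma-separated.
        authors = [
            name.strip()
            for name in (authors_div.text or "").split(",")
            if name.strip()
        ]
        entries.append({"title": title, "authors": authors})
    return entries


entries = extract_entries(SAMPLE)
print(entries)
```

Splitting on commas is enough for this sample, but author lists in other formats (for instance "Last, First") would need a different parsing rule for each web source.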
