Post by Jamesrp » Wed Feb 08, 2006 23:40

I want to develop a web crawler to extract information from HTML real estate listings.

Sure, I could just write specific regular expressions and customize them on a site-by-site basis, but I want to create a more "intelligent" program that can take any generic real estate site (within reason) and extract the property's price, description, address, etc.

I'm a lot more familiar with PHP than Python, but would Orange and Python be well suited to this? If not, can anyone direct me to some good resources on this?



Post by Blaz » Wed Feb 22, 2006 7:37

James, in your case Orange could be used only for postprocessing the data. For example, if you wanted to determine from the extracted features what type of real estate a web page refers to, you could use Orange to induce a classifier from manually curated examples. For the HTML information extraction and web crawling themselves, you would need either to resort to some other software or to write a corresponding agent in Python, Perl, or the like.
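
To give a rough idea of what the extraction part of such a Python agent might look like, here is a minimal sketch that uses only the standard library. The names TextExtractor, PRICE_RE and extract_listing are made up for illustration, and the price pattern is just an assumption about how prices tend to be written; real listing pages would need more robust rules, plus the crawling itself.

```python
import re
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects visible text, skipping <script> and <style> contents."""

    def __init__(self):
        super().__init__()
        self.chunks = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep only non-empty text that is outside script/style blocks.
        if self._skip_depth == 0 and data.strip():
            self.chunks.append(data.strip())


# Assumed price format: a currency symbol followed by digits with
# optional thousands separators and optional cents, e.g. "$250,000".
PRICE_RE = re.compile(r"[$€£]\s?\d{1,3}(?:,\d{3})*(?:\.\d{2})?")


def extract_listing(html):
    """Return the first price-like string and the page's visible text."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    text = " ".join(parser.chunks)
    match = PRICE_RE.search(text)
    return {"price": match.group(0) if match else None, "text": text}
```

Features pulled out this way (price, text, and so on) are the kind of thing that could then be handed to Orange for the classification step.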

