Datasets
For the experimental evaluation, the datasets used to test our system consist of search result pages generated by many search engine websites in response to search queries. The datasets also contain some list pages that were generated without search queries. We divided the datasets into two groups:
- Dataset 1 is an up-to-date dataset collected in 2013 from 100 well-known websites, including Google, Yahoo, Bing, YouTube, Amazon, eBay, IMDb, IEEE Xplore, SpringerLink, and ScienceDirect.
- Dataset 2 was taken from the ViNT testbed. This dataset consists of two groups of web pages. The first group contains 50 pages from 50 websites, one page per website. The second group contains 1100 pages from 100 websites, categorized into 4 domains: education, general, government, and medical. Note that this dataset is outdated; it was collected in 2005. However, because it has been used in many previous works, we use it as an indicator of the accuracy and reliability of our system.
- Group 1 + Group 2 (150 websites, 1150 pages)
- Group 1 (50 websites, 50 pages)
- Group 2 (100 websites, 1100 pages)
- List of websites in all datasets