Many wrappers generate tables without column names for human consumption, because the meaning of the columns is apparent from context and easy for humans to understand. In emerging applications, however, labels are needed for autonomous assignment and schema mapping, where a machine tries to understand the tables. Autonomous label assignment is critical in large-volume data processing where ad hoc mediation, extraction, and querying are involved.
We propose an algorithm, Lads (Labeling Anonymous Datasets), that holistically labels (annotates) tabular Web documents. We report experimental results on anonymous datasets from sites in a number of domains (e.g., music, movie, watch, political, automobile, and synthetic) obtained through different search engines such as Google, Yahoo, and MSN. The comparative probabilities of attributes being candidate labels are promising: the probability of assigning a good label to an anonymous attribute reaches as high as 98%. To the best of our knowledge, this is the first approach to label assignment based on the recommendations of multiple search engines. It introduces a new paradigm, the Web search engine based annotator. We categorize columns into three types: disjoint set column (DSC), repeated prefix/suffix column (RPS), and numeric column (NUM). For labeling a DSC, our method relies on hit counts from Web search engines (e.g., Google, Yahoo, and MSN). We formulate speculative queries to the Web search engines and use the principle of disambiguation by maximal evidence to arrive at our solution. Our algorithm Lads is guaranteed to work for the disjoint set column.
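The column typing and the hit-count step described above can be illustrated with a minimal Python sketch. It is not the Lads implementation: the typing heuristics, the `min_affix` threshold, the query template, and the function names `classify_column` and `label_by_maximal_evidence` are all assumptions made for illustration, and `hit_count` is a hypothetical caller-supplied function standing in for a real search engine lookup.

```python
import os
import re
from typing import Callable, Dict, List, Tuple

NUMERIC = re.compile(r"[-+]?\d[\d,]*(\.\d+)?")


def classify_column(values: List[str], min_affix: int = 3) -> str:
    """Guess a column type: 'NUM', 'RPS', or 'DSC' (heuristics are assumed)."""
    cells = [v.strip() for v in values if v.strip()]
    # NUM: every non-empty cell looks like a plain number.
    if cells and all(NUMERIC.fullmatch(c) for c in cells):
        return "NUM"
    # RPS: all cells share a sufficiently long common prefix or suffix.
    if len(cells) > 1:
        prefix = os.path.commonprefix(cells)
        suffix = os.path.commonprefix([c[::-1] for c in cells])[::-1]
        if len(prefix) >= min_affix or len(suffix) >= min_affix:
            return "RPS"
    # DSC: everything else is treated as a set of distinct values.
    return "DSC"


def label_by_maximal_evidence(
    values: List[str],
    candidates: List[str],
    hit_count: Callable[[str], int],
) -> Tuple[str, Dict[str, float]]:
    """Pick the candidate label with the largest share of total hits."""
    if not candidates:
        raise ValueError("at least one candidate label is required")
    totals = {}
    for label in candidates:
        # One speculative query per (label, value) pair; template is assumed.
        totals[label] = sum(hit_count(f'"{label}" "{v}"') for v in values)
    grand_total = sum(totals.values())
    scores = {
        label: (total / grand_total if grand_total else 0.0)
        for label, total in totals.items()
    }
    best = max(scores, key=scores.get)
    return best, scores
```

In this sketch the score of a label is its share of all hits across the candidate set, so the label with the most supporting evidence wins. Hit counts from several engines could be combined by passing a `hit_count` function that sums or averages them.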
Experimental results from a large number of sites in different domains, together with a subjective evaluation of our approach, show that the proposed algorithm Lads works fairly well; on this basis we claim that Lads is robust. To assign a label to a disjoint set column, we need a candidate set of labels (e.g., a label library), which can be collected on the fly from the variables of the user's SQL query as well as from Web form label tags. We classify a set of homogeneous anonymous datasets under a meaningful label and, at the same time, cluster those labels into a label library by learning the user's expectation and how a site materializes that expectation. Previous work in this field relies on extraction ontologies; we eliminate the need for domain-specific ontologies because we can extract labels from the Web form. Our system is novel in that it accommodates labels from the user's query variables. We hypothesize that Lads will perform well for autonomous label assignment. We bridge the gap between two orthogonal research directions: wrapper generation and ontology generation from Web sites (i.e., label extraction). We are not aware of any prior work that connects these two research directions for value-added services such as online comparison shopping.
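As a rough illustration of how a candidate label library might be gathered on the fly, the following Python sketch collects the text of `<label>` tags from a Web form and the column names in the select list of a SQL query. It is a sketch under stated assumptions, not the dissertation's method: reading "SQL query variable" as the names in the select list is an assumption, the regular expression handles only simple queries, and the names `LabelTagCollector`, `labels_from_form`, `labels_from_sql`, and `build_label_library` are hypothetical.

```python
import re
from html.parser import HTMLParser
from typing import List


class LabelTagCollector(HTMLParser):
    """Collect the visible text of every <label> element in a form."""

    def __init__(self) -> None:
        super().__init__()
        self._in_label = False
        self._buffer: List[str] = []
        self.labels: List[str] = []

    def handle_starttag(self, tag, attrs):
        if tag == "label":
            self._in_label = True
            self._buffer = []

    def handle_data(self, data):
        if self._in_label:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "label" and self._in_label:
            # Normalize whitespace, drop a trailing colon, lowercase.
            text = " ".join("".join(self._buffer).split())
            text = text.rstrip(":").strip().lower()
            if text:
                self.labels.append(text)
            self._in_label = False


def labels_from_form(html: str) -> List[str]:
    collector = LabelTagCollector()
    collector.feed(html)
    collector.close()
    return collector.labels


def labels_from_sql(query: str) -> List[str]:
    """Extract select-list names from a simple SELECT ... FROM query."""
    match = re.search(r"select\s+(.*?)\s+from\s", query, flags=re.I | re.S)
    if not match:
        return []
    names = []
    for item in match.group(1).split(","):
        # Prefer the alias after AS; otherwise strip any table qualifier.
        alias = re.split(r"\s+as\s+", item.strip(), flags=re.I)[-1]
        name = alias.split(".")[-1].strip().lower()
        if name and name != "*":
            names.append(name)
    return names


def build_label_library(html: str, query: str) -> List[str]:
    """Merge both sources, keeping first-seen order and removing duplicates."""
    library: List[str] = []
    for label in labels_from_form(html) + labels_from_sql(query):
        if label not in library:
            library.append(label)
    return library
```

The resulting list could serve as the candidate set when labeling a disjoint set column.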
|Advisor:||Jamil, Hasan M.|
|Committee:||Fotouhi, Farshad, Lu, Shiyong, Nathan, Geoffrey|
|School:||Wayne State University|
|School Location:||United States -- Michigan|
|Source:||DAI-B 72/04, Dissertation Abstracts International|
|Keywords:||Anonymous datasets, Hidden web, Html table, Web data integration, Web form, Wrapper|
Copyright in each Dissertation and Thesis is retained by the author. All Rights Reserved