VEGA 1-1060-04

Document Classification and Annotation for the Semantic Web

Project summary:

The project focuses on the design and evaluation of methods for annotating text documents with metadata that describe, in a machine-processable way, what the documents are about. The emphasis is on exploiting domain theories represented as ontologies, whose components can be used to annotate documents. In this context, the project aims at:
  • Document classification using machine learning methods
  • Using natural language processing methods for document annotation
  • Annotation based on the use of lexical databases
  • Abstract generation
  • Document re-annotation as a result of a domain theory change
The project also addresses the creation of ontologies suitable for document annotation. In this area, the focus is on:
  • Automatic generation of ontological models based on text document collections
  • Ontology modification using text mining methods
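
As an illustration of the first aim (document classification using machine learning methods), the following is a minimal sketch of a multinomial naive Bayes text classifier. It is not the project's own method: the choice of naive Bayes, the whitespace tokenizer, the function names (tokenize, train, classify), the class labels, and the toy training data are all assumptions made for this example.

```python
import math
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase and split on whitespace (a real system would use NLP tokenization)."""
    return text.lower().split()


def train(labelled_docs):
    """Collect per-class word counts and document counts from (text, label) pairs."""
    word_counts = defaultdict(Counter)
    class_counts = Counter()
    vocabulary = set()
    for text, label in labelled_docs:
        tokens = tokenize(text)
        word_counts[label].update(tokens)
        class_counts[label] += 1
        vocabulary.update(tokens)
    return word_counts, class_counts, vocabulary


def classify(text, model):
    """Return the class with the highest log-posterior, using Laplace smoothing."""
    word_counts, class_counts, vocabulary = model
    total_docs = sum(class_counts.values())
    best_label, best_score = None, -math.inf
    for label in class_counts:
        # Log prior of the class.
        score = math.log(class_counts[label] / total_docs)
        total_words = sum(word_counts[label].values())
        for token in tokenize(text):
            if token not in vocabulary:
                continue  # ignore words never seen in training
            smoothed = (word_counts[label][token] + 1) / (total_words + len(vocabulary))
            score += math.log(smoothed)
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Hypothetical toy training data, invented for illustration only.
TRAINING = [
    ("ontology concept hierarchy reasoning", "semantic_web"),
    ("metadata annotation ontology rdf", "semantic_web"),
    ("clustering classifier training features", "machine_learning"),
    ("supervised learning classifier accuracy", "machine_learning"),
]

model = train(TRAINING)
predicted = classify("ontology annotation of documents", model)
```

In an annotation setting, the predicted label could be attached to the document as a piece of metadata; how labels map onto ontology components is outside the scope of this sketch.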

Key words:

Document classification and annotation, domain knowledge modelling, ontology creation, natural language processing, machine learning, text mining

Project participants:

  • Marian Mach - project leader
  • Sabol Tomas
  • Paralic Jan - deputy project leader
  • Kende Robert
  • Hreno Jan
  • Machova Kristina
  • Hudak Slavomir
  • Bednar Peter
  • Kostial Ivan
  • Sarnovsky Martin
  • Mraz Miroslav
  • Babic Frantisek
  • Smatana Peter
  • Rockai Viliam

Annotation of project results:

  • Design of various methods for increasing the efficiency of text document classification (using Bayesian networks and reducing the number of documents) and of text document clustering (controlled initialisation, attribute-oriented induction).
  • Extraction of key terms, phrases, and relations among terms from documents, together with synonym identification, using statistical methods and the theory of associative concept learning.
  • Creation of hierarchical concept models based on clustering and fuzzy formal concept analysis, and their use for document content annotation.
  • Transformation of unstructured documents into structured ones using regular and linguistic analyses.
  • A Java library for developing text mining applications. It provides facilities for text analysis and for building, evaluating, and applying various supervised and unsupervised learning methods.
  • Implementation of a service for text document classification in a grid environment provided by GridMiner.
  • A method for creating dedicated text collections from web sources, suggesting alternative documents based on user stereotypes.
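
To make the key-term extraction result more concrete, the sketch below ranks the terms of each document by TF-IDF, one common statistical weighting. The results above do not say which statistical measures were actually used, so TF-IDF, the function name key_terms, its top_n parameter, and the sample documents are all assumptions for illustration.

```python
import math
from collections import Counter


def key_terms(documents, top_n=3):
    """Rank the terms of each document by TF-IDF and return the top_n per document."""
    tokenized = [doc.lower().split() for doc in documents]
    # Document frequency: number of documents that contain each term.
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    n_docs = len(tokenized)
    results = []
    for tokens in tokenized:
        term_counts = Counter(tokens)
        scores = {
            term: (count / len(tokens)) * math.log(n_docs / doc_freq[term])
            for term, count in term_counts.items()
        }
        # Sort by descending score, then alphabetically so ties are deterministic.
        ranked = sorted(scores, key=lambda term: (-scores[term], term))
        results.append(ranked[:top_n])
    return results


# Hypothetical sample documents, invented for illustration only.
DOCS = [
    "the grid service classifies the documents",
    "the ontology describes the ontology concepts",
    "the clustering groups the documents",
]

top_terms = key_terms(DOCS, top_n=1)
```

Terms that occur in every document (here "the") receive a weight of zero, so they never outrank terms that are specific to a single document.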

Copyright © MM
Last updated 17.8.2009