Document Classification and Annotation for the Semantic Web
Project summary:
The project focuses on the design and evaluation of methods for
annotating text documents with metadata that describe, in a
machine-processable way, what the documents are about. The emphasis
is on exploiting domain theories represented as ontologies, whose
components can be used to annotate documents. In this context the
project aims at:
- Document classification using machine learning methods
- Using natural language processing methods for document annotation
- Annotation based on employment of lexical databases
- Abstract generation
- Document re-annotation as a result of a domain theory change
The project also addresses the creation of ontologies applicable to
document annotation. In this area the focus is on:
- Automatic generation of ontological models based on text
document collections
- Ontology modification using text mining methods
Key words:
Document classification and annotation, domain knowledge modelling,
ontology creation, natural language processing, machine learning,
text mining
Project participants:
- Marian Mach - project leader
- Sabol Tomas
- Paralic Jan - deputy project leader
- Kende Robert
- Hreno Jan
- Machova Kristina
- Hudak Slavomir
- Bednar Peter
- Kostial Ivan
- Sarnovsky Martin
- Mraz Miroslav
- Babic Frantisek
- Smatana Peter
- Rockai Viliam
Annotation of project results:
- Design of various methods for increasing the efficiency of text
document classification (using Bayesian networks, reducing the
number of documents) and text document clustering (controlled
initialisation, attribute-oriented induction).
- Extraction of key terms from documents, relations among terms,
phrases, and synonym identification using statistical methods
and the theory of associative concept learning.
- Creation of hierarchical concept models based on clustering and
fuzzy formal concept analysis and their use for document
content annotation.
- Transformation of unstructured documents into structured ones
using regular and linguistic analyses.
- Java library for the development of text mining applications. It
provides facilities for text analysis as well as for building,
evaluating, and applying various supervised and unsupervised
learning methods.
- Implementation of a service for text document classification
in a grid environment provided by GridMiner.
- Method for creation of dedicated text collections from web
sources, suggesting alternative documents based on user
stereotypes.
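
As a rough illustration of the statistical key term extraction
mentioned in the results above, the following sketch ranks the terms
of each document by TF-IDF weight. It is only an assumption about what
such a method could look like: TF-IDF is swapped in as a common
statistical weighting, not taken from the project itself, and the names
tokenize and key_terms are hypothetical. It is written in Python for
brevity and is unrelated to the project's Java library.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())


def key_terms(documents, top_n=3):
    """Return the top_n highest-scoring TF-IDF terms for each document.

    Hypothetical helper: TF-IDF stands in for the unspecified
    statistical methods used in the project.
    """
    tokenized = [tokenize(doc) for doc in documents]
    # Document frequency: the number of documents containing each term.
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    n_docs = len(documents)
    result = []
    for tokens in tokenized:
        counts = Counter(tokens)
        # Term frequency times inverse document frequency.
        scores = {
            term: (count / len(tokens)) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        }
        # Sort by descending score, then alphabetically to break ties.
        ranked = sorted(scores, key=lambda term: (-scores[term], term))
        result.append(ranked[:top_n])
    return result


if __name__ == "__main__":
    docs = [
        "ontology ontology annotation of documents",
        "classification of documents",
        "clustering of documents",
    ]
    for doc, terms in zip(docs, key_terms(docs, top_n=2)):
        print(doc, "->", terms)
```

Terms that occur in every document receive a weight of zero, so
words such as "of" and "documents" in the sample collection rank
below the terms that distinguish one document from the others.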