Open Topic Model (OTM) is a text analysis tool written in C++ (Qt Development Frameworks) and ported to Java and Apache Lucene. We consider the problem of topic identification on Open Topic Models. That is, we do not aim at clustering a document collection but at labelling individual documents with the best-fitting topic names obtained from a social ontology.
The Open Topic Model comes with two language models, German and English. The English model utilizes the category taxonomy of en.wikipedia.org. The German model utilizes the category taxonomy of de.wikipedia.org.
The OTM tool can be applied directly with either of these two language models through two different interfaces, a socket interface and a jar application.
Each input file or string stream (UTF-8) is converted into a tokenized TEI-P5 representation, using the preprocessing tool Prepro2010:
<seg xml:id="xd1_segm_1">
  <w xml:id="xd1_wo1" type="#NE" subtype="#companyName" lemma="FC" function="pro">FC</w>
  <w xml:id="xd1_wo2" type="#NE" subtype="#companyName" lemma="Bayern" function="dic">Bayern</w>
  <w xml:id="xd1_wo3" type="#NE" subtype="#companyName" lemma="München" function="dic">München</w>
</seg>
The pre-analyzed text stream is then used for topic prediction by means of our Open Topic Model algorithm. The Topic Model output representation highlights the best-fitting articles found in the dataset. In addition, the tool generates the best direct topics and the best generalized topics with respect to the category taxonomy.
<article>
  <node id="1" value="0.872">Arjen Robben</node>
  <node id="2" value="0.326">Fußballer des Monats</node>
  <node id="3" value="0.289">FC Bayern München/Namen und Zahlen</node>
  <node id="4" value="0.285">FC Bayern München</node>
</article>
<directTopics>
  <node id="1" value="0.887">Deutscher Meister (Fußball)</node>
  <node id="2" value="0.887">Englischer Meister (Fußball)</node>
  <node id="3" value="0.887">Fußballspieler (Niederlande)</node>
  <node id="4" value="0.779">Fußballspieler (Deutschland)</node>
</directTopics>
<generalizedTopics>
  <node id="1" value="0.977">Sportler</node>
  <node id="2" value="0.843">Nationaler Meister</node>
  <node id="3" value="0.568">Fußballspieler</node>
  <node id="4" value="0.449">Sport (Deutschland)</node>
</generalizedTopics>
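A client will usually want the scored labels of one section of this output. The following is a minimal Java sketch of how that might be done with the JDK's DOM parser. The class name TopicOutputReader, the helper ScoredLabel, and the synthetic <otm> wrapper element are assumptions made for illustration; they are not part of the OTM tool. The wrapper is added because the output shown above has three top-level elements and the sketch does not know whether the server adds a root element of its own.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

/** Hypothetical helper for reading OTM output; not part of the OTM distribution. */
class TopicOutputReader {

    /** One scored entry of the output, e.g. ("Sportler", 0.977). */
    static final class ScoredLabel {
        final String label;
        final double score;

        ScoredLabel(String label, double score) {
            this.label = label;
            this.score = score;
        }
    }

    /**
     * Returns the node entries of one section ("article", "directTopics"
     * or "generalizedTopics") in document order.
     */
    static List<ScoredLabel> readSection(String outputXml, String section) {
        // Assumption: the output has no XML declaration and no single root,
        // so it is wrapped in a synthetic root element before parsing.
        String wrapped = "<otm>" + outputXml + "</otm>";
        Document doc;
        try {
            doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new ByteArrayInputStream(wrapped.getBytes(StandardCharsets.UTF_8)));
        } catch (Exception e) {
            throw new IllegalArgumentException("Could not parse topic output", e);
        }
        List<ScoredLabel> result = new ArrayList<>();
        NodeList sections = doc.getElementsByTagName(section);
        for (int s = 0; s < sections.getLength(); s++) {
            NodeList nodes = ((Element) sections.item(s)).getElementsByTagName("node");
            for (int i = 0; i < nodes.getLength(); i++) {
                Element node = (Element) nodes.item(i);
                // The "value" attribute carries the score, the text content the label.
                result.add(new ScoredLabel(node.getTextContent().trim(),
                        Double.parseDouble(node.getAttribute("value"))));
            }
        }
        return result;
    }
}
```

Calling TopicOutputReader.readSection(xml, "generalizedTopics") on the output above would yield the four generalized topics with their scores, in the order they are listed.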
In principle, the generalized topics are predicted by connecting a given text fragment to its most specific category and then, figuratively speaking, walking uphill within the taxonomy and 'dyeing' the trail visited. Walking up the taxonomy means following the hypernym edges of the category tree.
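The uphill walk can be illustrated with a small Java sketch. It is not the OTM scoring scheme, which is not specified here: the class name TaxonomyWalk, the decay factor, the depth limit, and the additive accumulation of weights are all assumptions chosen to make the idea concrete. Each start category carries a score; every hypernym reached from it is 'dyed' with that score, reduced once per level climbed.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

/** Hypothetical illustration of the uphill walk; not the OTM implementation. */
class TaxonomyWalk {

    /**
     * Walks upward from every start category along hypernym edges
     * (category -> parent categories) and accumulates a decayed weight
     * on each category visited.
     *
     * @param parents     hypernym edges of the taxonomy
     * @param startScores most specific categories with their scores
     * @param maxDepth    number of levels to climb (assumed parameter)
     * @param decay       factor applied per level climbed (assumed parameter)
     */
    static Map<String, Double> generalize(Map<String, List<String>> parents,
                                          Map<String, Double> startScores,
                                          int maxDepth, double decay) {
        Map<String, Double> dyed = new HashMap<>();
        for (Map.Entry<String, Double> start : startScores.entrySet()) {
            // Each walk keeps its own visited set so that cycles or
            // converging paths do not dye a category twice per walk.
            Set<String> visited = new HashSet<>();
            visited.add(start.getKey());
            List<String> frontier = Collections.singletonList(start.getKey());
            double weight = start.getValue();
            for (int depth = 1; depth <= maxDepth && !frontier.isEmpty(); depth++) {
                weight *= decay;
                List<String> next = new ArrayList<>();
                for (String category : frontier) {
                    for (String parent : parents.getOrDefault(category, Collections.emptyList())) {
                        if (visited.add(parent)) {
                            dyed.merge(parent, weight, Double::sum);
                            next.add(parent);
                        }
                    }
                }
                frontier = next;
            }
        }
        return dyed;
    }
}
```

Categories that lie on the trails of several start categories collect weight from each of them, which is one plausible way for a shared, more general category to outrank the specific ones.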
In order to evaluate topic identification with OTM, we compiled two datasets of 1000 articles each, one from the Meyers Lexikon collection (dataset A) and one from the German Wikipedia (dataset B). Since both datasets are encyclopedia-based and categorized by a taxonomy, we chose ten categories (e.g. fashion, politics, sports) as our open topics and selected 100 articles for each category. For each document, we computed the five and ten best generalized categories and checked whether one of them matched the document's initial category in the taxonomy.
      info    spor    poli    medi    liter   cult    econ    pada    reli    psycho
A0    .638    .745    .750    .710    .660    .495    .710    .710    .760    .462
A1    .670    .798    .940    .770    .750    .546    .710    .810    .850    .527
A2    .766    .957    1       .860    .830    .825    .940    .920    .960    .714
A3    .798    .979    1       .860    .970    .979    .980    .940    .960    .725
A4    .872    .979    1       .890    1       .989    1       .950    .970    .725
A5    .894    .979    1       .910    1       1       1       .950    .970    .824

      info    spor    poli    medi    liter   cult    econ    mili    educ    cloth
B0    .677    .630    .740    .660    .780    .520    .460    .620    .560    .240
B1    .768    .690    .880    .700    .850    .650    .490    .620    .710    .240
B2    .849    .870    .960    .810    .900    .970    .920    .890    .890    .280
B3    .879    .880    .970    .820    .960    .990    .990    .910    .950    .360
B4    .889    .900    .980    .830    1       1       .990    .930    .970    .360
B5    .929    .900    .990    .850    1       1       .990    .930    .980    .360

Accuracy results of the topic identification experiments by means of OTM using the Meyers Lexicon (A) and Wikipedia (B) corpus and ten topic labels.
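The matching criterion used in the evaluation (a document counts as correct if its initial category appears among its best k generalized categories) can be written down as a short Java sketch. The class and method names are hypothetical, and the sketch does not attempt to reproduce the individual rows of the table, whose exact settings are not described here.

```java
import java.util.List;

/** Hypothetical helper illustrating the top-k matching criterion. */
class TopKAccuracy {

    /**
     * Fraction of documents whose gold category occurs among the first k
     * entries of their ranked list of predicted categories.
     *
     * @param rankedPredictions per document, categories ordered best first
     * @param goldCategories    per document, the initial (gold) category
     * @param k                 how many top predictions to consider, e.g. 5 or 10
     */
    static double accuracyAtK(List<List<String>> rankedPredictions,
                              List<String> goldCategories, int k) {
        if (rankedPredictions.size() != goldCategories.size()) {
            throw new IllegalArgumentException("One gold category per document is required");
        }
        if (k < 0) {
            throw new IllegalArgumentException("k must not be negative");
        }
        if (goldCategories.isEmpty()) {
            return 0.0;
        }
        int hits = 0;
        for (int i = 0; i < goldCategories.size(); i++) {
            List<String> ranked = rankedPredictions.get(i);
            // Only the k best predictions count; shorter lists are used as they are.
            List<String> topK = ranked.subList(0, Math.min(k, ranked.size()));
            if (topK.contains(goldCategories.get(i))) {
                hits++;
            }
        }
        return (double) hits / goldCategories.size();
    }
}
```

With 100 articles per category, calling accuracyAtK once per category with k = 5 and again with k = 10 would give per-category accuracies of the kind reported in the table.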
Most approaches to topic identification focus either on topic clustering, in which keywords are clustered using different notions of a similarity measure, or on automatic text categorization with a small set of given categories. In contrast, our approach utilizes over 55,000 different categories as topic labels and combines keyword extraction as a type of text representation with categorization by means of topic labelling.
Social ontologies are used as a source of terminological knowledge, providing a large-scale and, most importantly, flexible knowledge system for building OTM. OTM are topic-related models in which content categories are not assigned in advance but change over time as the open community contributes to them. The content categories themselves are defined by the constantly growing social ontology. Our approach utilizes such a social ontology by aligning documents within a social network comprising category information trails. We therefore treat the task of topic identification as a problem of ontology mapping. In doing so, we identify the documents of a collection that are most closely related to a given text fragment.
You can use the Open Topic Model tool through two different interfaces:
You can connect to the socket interface as follows (PHP socket example):
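// Open a TCP connection to the OTM server; keep the handle for later calls.
$fp = fsockopen("129.70.40.30", 6665);
// $myRawData holds the UTF-8 input text; the language model is selected via "lang".
$inputPortStream = "$myRawData&lang=german";
fwrite($fp, $inputPortStream);
// Read the XML topic result until the server closes the connection.
$topicXml = stream_get_contents($fp);
fclose($fp);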
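The same exchange can be sketched in Java. This mirrors the PHP example step by step and makes the same assumptions about the protocol: the request is the raw text followed by "&lang=" and the language name, and the server closes the connection once it has sent the XML result. The class name OtmSocketClient and its methods are hypothetical and not part of the OTM distribution.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.net.Socket;
import java.nio.charset.StandardCharsets;

/** Hypothetical Java counterpart of the PHP socket example. */
class OtmSocketClient {

    /** Builds the request string the same way the PHP example does. */
    static String buildRequest(String rawText, String language) {
        return rawText + "&lang=" + language;
    }

    /**
     * Sends the request and returns everything the server writes back.
     * Assumption: the server closes the connection after its response,
     * as implied by the use of stream_get_contents in the PHP example.
     */
    static String query(String host, int port, String request) throws IOException {
        try (Socket socket = new Socket(host, port)) {
            OutputStream out = socket.getOutputStream();
            out.write(request.getBytes(StandardCharsets.UTF_8));
            out.flush();

            InputStream in = socket.getInputStream();
            ByteArrayOutputStream buffer = new ByteArrayOutputStream();
            byte[] chunk = new byte[4096];
            int read;
            // Read until end of stream, i.e. until the server closes the connection.
            while ((read = in.read(chunk)) != -1) {
                buffer.write(chunk, 0, read);
            }
            return new String(buffer.toByteArray(), StandardCharsets.UTF_8);
        }
    }
}
```

A call would then look like OtmSocketClient.query("129.70.40.30", 6665, OtmSocketClient.buildRequest(text, "german")), using the host and port from the PHP example.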
You can test the functionality of the Open Topic Model as a jar application (the default model is German) on the server Varda/Hydra:
TopicServer.getOpenTopicQuery(String textStream, int usedString, int usedQueryTerms, int usedTaxonomyDepth, double minimumArticleSim);
Ulli Waltinger
University of Bielefeld
Faculty of Technology
Text Technology / Applied Computational Linguistics
ulli_marc.waltinger@uni-bielefeld.de
www.ulliwaltinger.de
Ulli Waltinger, Alexander Mehler 2009. Social Semantics And Its Evaluation By Means Of Semantic Relatedness And Open Topic Models. Proceedings of the 2009 IEEE/WIC/ACM International Conference on Web Intelligence, Milan (Italy), 2009.
Alexander Mehler and Ulli Waltinger 2009. Enhancing Document Modeling by Means of Open Topic Models: Crossing the Frontier of Classification Schemes in Digital Libraries by Example of the DDC. Library Hi Tech, 27 (4), 2009.
Ulli Waltinger, Irene Cramer and Tonio Wandmacher 2009. From Social Networks To Distributional Properties: A Comparative Study On Computing Semantic Relatedness. Proceedings of the Annual Meeting of the Cognitive Science Society - CogSci 2009, Amsterdam (NL), 2009.
Ulli Waltinger, Alexander Mehler and Rüdiger Gleim 2009. Social Semantics And Its Evaluation By Means of Closed Topic Models: An SVM-Classification Approach Using Semantic Feature Replacement By Topic Generalization. Proceedings of the GSCL-Conference, Potsdam (DE), 2009.
Last changed: 12 August 2010, Ulli Waltinger