Wikipedia - OpenTopic Model

Open Topic Model (OTM) is a text analysis tool written in C++ (Qt development framework) and ported to Java and Apache Lucene. It addresses the problem of topic identification: rather than clustering a document collection, it labels individual documents with the best-fitting topic names obtained from a social ontology.

The Open Topic Model comes with two language models, German and English. The English model utilizes the category taxonomy of en.wikipedia.org; the German model utilizes the category taxonomy of de.wikipedia.org.

The OTM tool can be applied directly with either of these two language models through two different interfaces:

Binary
XML-API

Representation

Each input file or string stream (UTF-8) is converted into a tokenized TEI-P5 representation, using the preprocessing tool Prepro2010:

	<seg xml:id="xd1_segm_1">
	   <w xml:id="xd1_wo1" type="#NE" subtype="#companyName" lemma="FC" function="pro">FC</w>
	   <w xml:id="xd1_wo2" type="#NE" subtype="#companyName" lemma="Bayern" function="dic">Bayern</w>
	   <w xml:id="xd1_wo3" type="#NE" subtype="#companyName" lemma="München" function="dic">München</w>
	</seg>

The pre-analyzed text stream is then used for topic prediction by the Open Topic Model algorithm. The output representation lists the best-fitting articles found in the dataset. In addition, the tool generates the best direct topics and the best generalized topics with respect to the category taxonomy.

	<article>
	  <node id="1" value="0.872">Arjen Robben</node>
	  <node id="2" value="0.326">Fußballer des Monats</node>
	  <node id="3" value="0.289">FC Bayern München/Namen und Zahlen</node>
	  <node id="4" value="0.285">FC Bayern München</node>
	</article>	
	
	<directTopics>
	  <node id="1" value="0.887">Deutscher Meister (Fußball)</node>
	  <node id="2" value="0.887">Englischer Meister (Fußball)</node>
	  <node id="3" value="0.887">Fußballspieler (Niederlande)</node>
	  <node id="4" value="0.779">Fußballspieler (Deutschland)</node>
	</directTopics>	
	
	
	<generalizedTopics>
	  <node id="1" value="0.977">Sportler</node>
	  <node id="2" value="0.843">Nationaler Meister</node>
	  <node id="3" value="0.568">Fußballspieler</node>
	  <node id="4" value="0.449">Sport (Deutschland)</node>
	</generalizedTopics>
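
The ranked `<node>` entries of one output block can be read with a standard XML parser. The following is a minimal sketch, not part of OTM: the class `TopicOutputParser`, its method, and the `ScoredLabel` type are hypothetical names introduced for illustration. It assumes that the `id` attribute holds the rank and the `value` attribute the score, as the example output suggests.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

/** Hypothetical helper (not part of OTM) that reads the <node> entries of one output block. */
final class TopicOutputParser {

    /** One ranked entry: rank taken from the id attribute, score from the value attribute. */
    static final class ScoredLabel {
        final int rank;
        final double score;
        final String label;

        ScoredLabel(int rank, double score, String label) {
            this.rank = rank;
            this.score = score;
            this.label = label;
        }
    }

    /** Parses a single block such as <generalizedTopics>...</generalizedTopics>. */
    static List<ScoredLabel> parseBlock(String xml) {
        try {
            Element root = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)))
                    .getDocumentElement();
            NodeList nodes = root.getElementsByTagName("node");
            List<ScoredLabel> result = new ArrayList<>();
            for (int i = 0; i < nodes.getLength(); i++) {
                Element node = (Element) nodes.item(i);
                result.add(new ScoredLabel(
                        Integer.parseInt(node.getAttribute("id")),
                        Double.parseDouble(node.getAttribute("value")),
                        node.getTextContent().trim()));
            }
            return result;
        } catch (Exception e) {
            // Malformed XML or non-numeric attributes end up here.
            throw new IllegalArgumentException("Not a well-formed topic block", e);
        }
    }
}
```

As displayed above, the three blocks (`article`, `directTopics`, `generalizedTopics`) have no common root element, so this sketch expects each block to be passed to `parseBlock` separately.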

In principle, the generalized topics are predicted by connecting a given text fragment to the most specific category and then, figuratively speaking, walking uphill within the taxonomy and 'dyeing' the trail visited. Walking up the taxonomy means following the hypernym edges of the category tree.
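
The uphill walk can be illustrated with a short sketch. It is not the OTM implementation: the class name, the map-based taxonomy, and the scoring rule are assumptions. Here every category on a trail simply accumulates the score of the specific category the walk started from, and the walk stops after a fixed number of hypernym edges. The source does not specify how OTM weights the trail.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

/** Hypothetical sketch of walking up a category taxonomy and "dyeing" the visited trail. */
final class TaxonomyWalkSketch {

    /**
     * @param hypernyms maps a category to its direct parent categories (hypernym edges)
     * @param start     scored most specific categories of the text fragment
     * @param maxDepth  maximum number of hypernym edges to climb (assumed cut-off)
     * @return accumulated "dye" per visited ancestor category
     */
    static Map<String, Double> generalize(Map<String, List<String>> hypernyms,
                                          Map<String, Double> start,
                                          int maxDepth) {
        Map<String, Double> dye = new HashMap<>();
        for (Map.Entry<String, Double> entry : start.entrySet()) {
            Set<String> visited = new HashSet<>();
            visited.add(entry.getKey());
            List<String> frontier = Collections.singletonList(entry.getKey());
            for (int depth = 0; depth < maxDepth && !frontier.isEmpty(); depth++) {
                List<String> next = new ArrayList<>();
                for (String category : frontier) {
                    List<String> parents =
                            hypernyms.getOrDefault(category, Collections.<String>emptyList());
                    for (String parent : parents) {
                        // visited guards against cycles and against dyeing a node twice per trail
                        if (visited.add(parent)) {
                            dye.merge(parent, entry.getValue(), Double::sum);
                            next.add(parent);
                        }
                    }
                }
                frontier = next;
            }
        }
        return dye;
    }
}
```

Categories reached from several specific categories collect more dye, so sorting the returned map by value yields a ranking of generalized topics under these assumptions.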

Evaluation

In order to evaluate topic identification with OTM, we compiled two datasets of 1000 articles each, one from the Meyers Lexikon collection (dataset A) and one from the German Wikipedia (dataset B). Since both datasets are encyclopedia-based and categorized by a taxonomy, we chose ten categories (e.g. fashion, politics, sports) as our open topics and selected 100 articles for each category. For each document, we computed the five and ten best generalized categories and checked whether one of them matched the document's initial category in the taxonomy.


      info    spor    poli    medi    liter   cult    econ    pada    reli    psycho
A0    .638    .745    .750    .710    .660    .495    .710    .710    .760    .462
A1    .670    .798    .940    .770    .750    .546    .710    .810    .850    .527
A2    .766    .957    1       .860    .830    .825    .940    .920    .960    .714
A3    .798    .979    1       .860    .970    .979    .980    .940    .960    .725
A4    .872    .979    1       .890    1       .989    1       .950    .970    .725
A5    .894    .979    1       .910    1       1       1       .950    .970    .824

      info    spor    poli    medi    liter   cult    econ    mili    educ    cloth
B0    .677    .630    .740    .660    .780    .520    .460    .620    .560    .240
B1    .768    .690    .880    .700    .850    .650    .490    .620    .710    .240
B2    .849    .870    .960    .810    .900    .970    .920    .890    .890    .280
B3    .879    .880    .970    .820    .960    .990    .990    .910    .950    .360
B4    .889    .900    .980    .830    1       1       .990    .930    .970    .360
B5    .929    .900    .990    .850    1       1       .990    .930    .980    .360

Accuracy results of the topic identification experiments with OTM on the Meyers Lexikon (A) and Wikipedia (B) corpora, using ten topic labels.
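
The matching step of the evaluation (is the initial category among the k best generalized categories?) corresponds to a top-k accuracy. The sketch below shows one way to compute it. The class and method names are hypothetical, and exact string equality between predicted and initial category labels is an assumption, since the source does not describe how labels were compared.

```java
import java.util.List;

/** Hypothetical sketch of a top-k accuracy computation for topic labels. */
final class TopKAccuracySketch {

    /**
     * @param rankedPredictions per document, predicted categories ordered from best to worst
     * @param gold              per document, the initial category from the taxonomy
     * @param k                 number of best predictions to consider (e.g. 5 or 10)
     * @return fraction of documents whose gold category is among the k best predictions
     */
    static double accuracyAtK(List<List<String>> rankedPredictions, List<String> gold, int k) {
        if (rankedPredictions.size() != gold.size()) {
            throw new IllegalArgumentException("One prediction list per document is required");
        }
        if (gold.isEmpty()) {
            return 0.0;
        }
        int hits = 0;
        for (int i = 0; i < gold.size(); i++) {
            List<String> ranked = rankedPredictions.get(i);
            List<String> topK = ranked.subList(0, Math.min(k, ranked.size()));
            if (topK.contains(gold.get(i))) {
                hits++;
            }
        }
        return (double) hits / gold.size();
    }
}
```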

OpenTopicModel-API

Open Topic Model Architecture

Most approaches to topic identification focus either on topic clustering, in which keywords are clustered using different notions of similarity, or on automatic text categorization with a small set of given categories. In contrast, our approach utilizes over 55,000 different categories as topic labels and combines keyword extraction, as a type of text representation, with categorization by means of topic labelling.

Social ontologies are used as a source of terminological knowledge, providing a large-scale and, most importantly, flexible knowledge system for building OTMs. OTMs are topic-related models in which content categories are not fixed in advance but change over time as the open community contributes to them; the categories are defined by the constantly growing social ontology itself. Our approach utilizes such a social ontology by aligning documents within a social network comprising category information trails. We therefore treat the task of topic identification as a problem of ontology mapping: we identify the documents of a collection that are most closely related to a given text fragment.
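
To make the mapping idea concrete, the following sketch relates a text fragment to the articles of a collection and collects the categories of sufficiently similar articles as candidate topics. It rests on several assumptions that the source does not confirm: whitespace tokenization, cosine similarity over raw term frequencies, and scoring a category by the best similarity among its articles. All class, method, and parameter names are hypothetical. The threshold parameter merely echoes the name `minimumArticleSim` from the Java interface.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

/** Hypothetical sketch: map a text fragment to articles, then to the articles' categories. */
final class ArticleMappingSketch {

    /** Term-frequency vector of a lower-cased, whitespace-tokenized text. */
    static Map<String, Integer> termFrequencies(String text) {
        Map<String, Integer> tf = new HashMap<>();
        for (String token : text.toLowerCase(Locale.ROOT).split("\\s+")) {
            if (!token.isEmpty()) {
                tf.merge(token, 1, Integer::sum);
            }
        }
        return tf;
    }

    /** Cosine similarity of two term-frequency vectors; 0 if either vector is empty. */
    static double cosine(Map<String, Integer> a, Map<String, Integer> b) {
        double dot = 0, normA = 0, normB = 0;
        for (Map.Entry<String, Integer> e : a.entrySet()) {
            normA += e.getValue() * e.getValue();
            Integer other = b.get(e.getKey());
            if (other != null) {
                dot += e.getValue() * other;
            }
        }
        for (int v : b.values()) {
            normB += v * v;
        }
        return (normA == 0 || normB == 0) ? 0.0 : dot / Math.sqrt(normA * normB);
    }

    /** Scores each category by the best similarity among its articles at or above the threshold. */
    static Map<String, Double> directTopics(String fragment,
                                            Map<String, String> articleTexts,
                                            Map<String, List<String>> articleCategories,
                                            double minimumArticleSim) {
        Map<String, Integer> query = termFrequencies(fragment);
        Map<String, Double> topics = new HashMap<>();
        for (Map.Entry<String, String> article : articleTexts.entrySet()) {
            double sim = cosine(query, termFrequencies(article.getValue()));
            if (sim < minimumArticleSim) {
                continue; // article is not close enough to the fragment
            }
            List<String> categories =
                    articleCategories.getOrDefault(article.getKey(), Collections.<String>emptyList());
            for (String category : categories) {
                topics.merge(category, sim, Math::max);
            }
        }
        return topics;
    }
}
```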

Open Topic Model Usage

You can use the Open Topic Model tool through two different interfaces:

PortSocket-API

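You can connect to the socket interface as follows (PHP socket example):

	// $myRawData holds the UTF-8 text to analyze.
	$fp = fsockopen("129.70.40.30", 6665, $errno, $errstr);
	if (!$fp) { die("Connection failed: $errstr ($errno)"); }
	$inputPortStream = "$myRawData&lang=german";
	fwrite($fp, $inputPortStream);
	$topicXml = stream_get_contents($fp); // topic XML returned by the server
	fclose($fp);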
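
The same request can be sketched in Java with a plain TCP socket. This is an illustrative client, not an official one: the class `OtmSocketClient` and its methods are hypothetical. The request format (raw text followed by `&lang=<language>`) and the read-until-close behaviour are inferred from the PHP example above, and decoding the response as UTF-8 is an assumption.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.net.Socket;
import java.nio.charset.StandardCharsets;

/** Hypothetical Java counterpart of the PHP socket example. */
final class OtmSocketClient {

    /** Builds the request string as in the PHP example: raw text followed by "&lang=<language>". */
    static String buildRequest(String rawText, String language) {
        return rawText + "&lang=" + language;
    }

    /** Sends the request and returns everything the server writes before closing the connection. */
    static String query(String host, int port, String rawText, String language) throws IOException {
        try (Socket socket = new Socket(host, port)) {
            OutputStream out = socket.getOutputStream();
            out.write(buildRequest(rawText, language).getBytes(StandardCharsets.UTF_8));
            out.flush();

            InputStream in = socket.getInputStream();
            ByteArrayOutputStream response = new ByteArrayOutputStream();
            byte[] buffer = new byte[4096];
            int read;
            while ((read = in.read(buffer)) != -1) {
                response.write(buffer, 0, read);
            }
            // Assumption: the server answers in UTF-8, like the input encoding.
            return new String(response.toByteArray(), StandardCharsets.UTF_8);
        }
    }
}
```

With the host and port from the PHP example, a call would look like `String topicXml = OtmSocketClient.query("129.70.40.30", 6665, text, "german");`.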

Java-Jar

You can test the functionality of the OpenTopicModel as a jar application (the default model is German) on the server Varda/Hydra:

	TopicServer.getOpenTopicQuery(String textStream, 
	                                int usedString, 
	                                int usedQueryTerms, 
	                                int usedTaxonomyDepth, 
	                                double minimumArticleSim);

Contact

Ulli Waltinger
University of Bielefeld
Faculty of Technology
Text Technology / Applied Computational Linguistics
ulli_marc.waltinger@uni-bielefeld.de
www.ulliwaltinger.de

Reference

Ulli Waltinger and Alexander Mehler. 2009. Social Semantics And Its Evaluation By Means Of Semantic Relatedness And Open Topic Models. Proceedings of the 2009 IEEE/WIC/ACM International Conference on Web Intelligence, Milan (Italy).

Alexander Mehler and Ulli Waltinger. 2009. Enhancing Document Modeling by Means of Open Topic Models: Crossing the Frontier of Classification Schemes in Digital Libraries by Example of the DDC. Library Hi Tech, 27 (4).

Ulli Waltinger, Irene Cramer and Tonio Wandmacher. 2009. From Social Networks To Distributional Properties: A Comparative Study On Computing Semantic Relatedness. Proceedings of the Annual Meeting of the Cognitive Science Society - CogSci 2009, Amsterdam (NL).

Ulli Waltinger, Alexander Mehler and Rüdiger Gleim. 2009. Social Semantics And Its Evaluation By Means of Closed Topic Models: An SVM-Classification Approach Using Semantic Feature Replacement By Topic Generalization. Proceedings of the GSCL-Conference, Potsdam (DE).

Last changed: 12 August 2010, Ulli Waltinger
