ChungwonSeo

Monday, August 27, 2007

Software Downloads

Co-occurrence computation

We use TinyCC2, one of the Leipzig tools, to compute co-occurrences. It reports the log-likelihood ratio for significant neighbour and sentence collocations.

Platform: Linux (x86)
Location: http://wortschatz.uni-leipzig.de/~cbiemann/software/TinyCC2.htm
Input: Text file, Mark-up text (xml, html)
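
To make the log-likelihood ratio concrete, here is a minimal sketch of Dunning's G^2 statistic for a word pair, computed from a 2x2 contingency table. The exact variant that TinyCC2 implements is not stated here, so treat the formula choice and the function and parameter names (log_likelihood_ratio, n_ab, n_a, n_b, n) as assumptions for illustration only.

```python
import math


def _xlogx(x):
    # x * ln(x), with the usual convention that 0 * ln(0) = 0.
    return x * math.log(x) if x > 0 else 0.0


def log_likelihood_ratio(n_ab, n_a, n_b, n):
    """Dunning's G^2 for words A and B.

    n_ab: units (e.g. sentences) containing both A and B
    n_a:  units containing A
    n_b:  units containing B
    n:    total number of units
    """
    k11 = n_ab                    # A and B
    k12 = n_a - n_ab              # A without B
    k21 = n_b - n_ab              # B without A
    k22 = n - n_a - n_b + n_ab    # neither
    cells = _xlogx(k11) + _xlogx(k12) + _xlogx(k21) + _xlogx(k22)
    rows = _xlogx(k11 + k12) + _xlogx(k21 + k22)
    cols = _xlogx(k11 + k21) + _xlogx(k12 + k22)
    return 2.0 * (cells + _xlogx(n) - rows - cols)


# Hypothetical counts: A occurs in 20 of 100 sentences, B in 50.
independent = log_likelihood_ratio(10, 20, 50, 100)  # co-occur as often as chance predicts
associated = log_likelihood_ratio(20, 20, 50, 100)   # A never occurs without B
```

A pair that co-occurs exactly as often as chance predicts scores close to zero, while a strongly associated pair scores high; a significance threshold then selects the co-occurrences to keep.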

Clustering tool

We use an open-source clustering tool for hierarchical clustering. The tool supports hierarchical, k-means, and SOM-based clustering.

Platform: Windows/Linux/MacOS
Location: http://bonsai.ims.u-tokyo.ac.jp/~mdehoon/software/cluster/
Input: Feature matrix

Utils

We wrote some format converters for the co-occurrence tool and the clustering tool.

Platform: Linux (x86)
Required: php (http://www.php.net)
Location:

http://csace.kaist.ac.kr/~cwseo/gen_matrix.tar.gz

Make a directory named "matrix" and extract the archive there.

http://csace.kaist.ac.kr/~cwseo/tinyCC2.tar.gz

Extract the archive and copy its contents to the tinyCC2 directory.

Generating collocation vectors

extCoc_s.sh

Generating the matrix from the co-occurrence result

gen_matrix.sh

Cluster Extraction

cnvFormatx.php

Computing Semantic Relatedness

Platform: Independent
Required: WordNet (>=2.1)
Location: http://csace.kaist.ac.kr/~cwseo/WNSearch.zip
Input: word vector

football basketball
convolution cable_television
coaxial_cable convolution cable_television
ruby_programming_language php
cricket football basketball
xhtml xml
tiff gif
system operating_system
cybertron galvatron
ruby_programming_language php tcl perl java_#programming_languag

Output

Ball_games SYNSET{SID-2752393-n#:#Words[W-2752393-n-1-ball]} 2.772588722239781 3.4011973816621555 football basketball
Communication SYNSET{SID-1930-n#:#Words[W-1930-n-1-physical_entity]} 0.6931471805599453 0.8362480242006186 convolution cable_television
Communication SYNSET{SID-1930-n#:#Words[W-1930-n-1-physical_entity]} 0.7801585575495751 0.8109302162163287 coaxial_cable convolution cable_television
Culture NULL 0.8266785731844679 -1.0 ruby_programming_language php
Ball_games SYNSET{SID-462746-n#:#Words[W-462746-n-1-field_game]} 3.1780538303479458 0.3409265869705933 cricket football basketball
Human_communication NULL 0.8266785731844679 -1.0 xhtml xml
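
For post-processing, each output record appears to consist of a category label, a WordNet synset (or NULL), two numeric scores, and the words of the cluster. The following is a small sketch of a parser under that assumption. The field names (label, synset, score_a, score_b) are our own hypothetical names; the meaning of the two scores is not documented here, so the sketch only stores them.

```python
from typing import List, NamedTuple, Optional


class ClusterRecord(NamedTuple):
    label: str             # first field, e.g. "Culture"
    synset: Optional[str]  # second field, or None when it is "NULL"
    score_a: float         # third field (meaning assumed unknown)
    score_b: float         # fourth field (meaning assumed unknown)
    words: List[str]       # remaining fields: the words of the cluster


def parse_record(line):
    # Fields are whitespace-separated; the synset field contains no spaces.
    fields = line.split()
    synset = None if fields[1] == "NULL" else fields[1]
    return ClusterRecord(fields[0], synset, float(fields[2]), float(fields[3]), fields[4:])


sample = "Culture NULL 0.8266785731844679 -1.0 ruby_programming_language php"
record = parse_record(sample)
```

With the sample line, record.words holds the two terms of the cluster and record.synset is None because no synset was found.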

Thursday, August 2, 2007

Computing Semantic Relatedness

Overview

We can extract clusters from the nodes of a hierarchical clustering result. For conceptualization, we need to find clusters that consist of similar words and can therefore become classes of an ontology.
We extract them by computing semantic relatedness. The semantic relatedness of a cluster is obtained by measuring the distance between its terms and their lowest common subsumer (LCS).
Any kind of taxonomy can be used for computing semantic relatedness.
In this case, we use the WordNet hierarchy and the Wikipedia category hierarchy.
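
As an illustration of the idea, here is a minimal sketch that finds the lowest common subsumer of a cluster in a toy taxonomy and measures the mean path length from the terms to it. The taxonomy (PARENT) is made up for the example, and the mean path length is only one plausible distance; it is not necessarily the measure that WNSearch computes.

```python
# Hypothetical toy taxonomy: each term maps to its single parent (None = root).
PARENT = {
    "football": "ball_game",
    "basketball": "ball_game",
    "cricket": "field_game",
    "ball_game": "game",
    "field_game": "game",
    "game": "activity",
    "activity": None,
}


def path_to_root(term):
    # The term itself, then its parent, grandparent, ... up to the root.
    path = []
    while term is not None:
        path.append(term)
        term = PARENT[term]
    return path


def lowest_common_subsumer(terms):
    paths = [path_to_root(t) for t in terms]
    common = set(paths[0]).intersection(*paths[1:])
    # Walking up from the first term, the first shared node is the deepest one.
    for node in paths[0]:
        if node in common:
            return node
    return None


def mean_distance_to_lcs(terms):
    lcs = lowest_common_subsumer(terms)
    if lcs is None:
        return None
    return sum(path_to_root(t).index(lcs) for t in terms) / len(terms)


tight = mean_distance_to_lcs(["football", "basketball"])
loose = mean_distance_to_lcs(["football", "basketball", "cricket"])
```

A smaller distance means the terms sit close under a shared, specific concept, so the cluster is a better candidate for an ontology class.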

Computing Semantic Relatedness

Platform: Independent
Required: WordNet (>=2.1)
Location: http://csace.kaist.ac.kr/~cwseo/WNSearch.zip
Input: word vector

Output

Usage

Command: Computing.bat "input_file" "output_file"

ex)
Computing.bat wiki5000_cluster.txt wiki5000_res.txt

Term Clustering for Domain Ontology Building

Overview

To build an ontology from text, we need to find terms and conceptualize them as classes of the ontology. The first step of conceptualization is finding synonyms and grouping terms into clusters that have a similar meaning and can be defined by the same properties.
For example, the set of terms {hard_disk, floppy_disk, cd-rom, linux, unix, bsd, unix-like operating_systems} can be partitioned into two concepts: {hard_disk, floppy_disk, cd-rom} is classified as a disk device, and {linux, unix, bsd, unix-like operating_systems} is classified as an operating system.
We use paradigmatic relations to obtain synonym sets. Hierarchical clustering of the synonym sets yields candidate concepts. We use 1st-order and 2nd-order collocations to extract paradigmatic relations. A cluster that consists of similar words can become a class of the ontology. We extract such clusters by computing semantic relatedness. The semantic relatedness of a cluster is obtained by measuring the distance between its terms and their lowest common subsumer (LCS).
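
The intuition behind using collocations to find paradigmatic relations is that two terms are likely to be interchangeable when they share many co-occurring words. Here is a minimal sketch that compares co-occurrence vectors with cosine similarity. The vectors and weights are invented for the example, and using cosine is an assumption on our part, not a description of what extCoc_s.sh does internally.

```python
import math


def cosine(u, v):
    # u and v are sparse vectors: dict mapping a context word to its weight.
    dot = sum(weight * v[word] for word, weight in u.items() if word in v)
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


# Hypothetical co-occurrence vectors (context word -> significance weight).
vectors = {
    "linux": {"kernel": 3.0, "install": 2.0},
    "unix": {"kernel": 2.0, "install": 1.0, "shell": 1.0},
    "hard_disk": {"sector": 4.0, "install": 1.0},
}

sim_os = cosine(vectors["linux"], vectors["unix"])
sim_mixed = cosine(vectors["linux"], vectors["hard_disk"])
```

Here "linux" and "unix" share most of their contexts and get a high similarity, while "linux" and "hard_disk" share only one context word and score much lower.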

Demo

http://cseight.kaist.ac.kr:8080/TermCluster

Co-occurrence computation

We use TinyCC2, one of the Leipzig tools, to compute co-occurrences. It reports the log-likelihood ratio for significant neighbour and sentence collocations.

Platform: Linux (x86)
Location: http://wortschatz.uni-leipzig.de/~cbiemann/software/TinyCC2.htm
Input: Text file, Mark-up text (xml, html)

Clustering tool

We use an open-source clustering tool for hierarchical clustering. The tool supports hierarchical, k-means, and SOM-based clustering.

Platform: Windows/Linux/MacOS
Location: http://bonsai.ims.u-tokyo.ac.jp/~mdehoon/software/cluster/
Input: Feature matrix
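
To show what hierarchical clustering does with a feature matrix, here is a small, self-contained sketch of agglomerative clustering with average linkage and Euclidean distance. It is not the implementation of the tool above; the feature values and the function name average_linkage are assumptions made for the example.

```python
import math
from itertools import combinations


def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def average_linkage(rows):
    """rows: dict mapping a term to its feature vector.

    Returns the dendrogram as nested 2-tuples of terms.
    """
    # Each cluster is (subtree, list of member terms).
    clusters = [(term, [term]) for term in rows]
    while len(clusters) > 1:
        best = None
        for i, j in combinations(range(len(clusters)), 2):
            members_i, members_j = clusters[i][1], clusters[j][1]
            # Average pairwise distance between the two clusters.
            total = sum(euclidean(rows[a], rows[b]) for a in members_i for b in members_j)
            dist = total / (len(members_i) * len(members_j))
            if best is None or dist < best[0]:
                best = (dist, i, j)
        _, i, j = best
        merged = ((clusters[i][0], clusters[j][0]), clusters[i][1] + clusters[j][1])
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)] + [merged]
    return clusters[0][0]


# Hypothetical feature matrix: one row per term.
matrix = {
    "hard_disk": [1.0, 0.0],
    "floppy_disk": [0.8, 0.2],
    "linux": [0.0, 1.0],
    "unix": [0.1, 0.9],
}
tree = average_linkage(matrix)
```

The two closest terms are merged first, and the procedure repeats until a single tree remains; every inner node of that tree is a candidate cluster.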

Utils

We wrote some format converters for the co-occurrence tool and the clustering tool.

Platform: Linux (x86)
Required: php (http://www.php.net)
Location:

http://csace.kaist.ac.kr/~cwseo/gen_matrix.tar.gz

Make a directory named "matrix" and extract the archive there.

http://csace.kaist.ac.kr/~cwseo/tinyCC2.tar.gz

Extract the archive and copy its contents to the tinyCC2 directory.

Generating collocation vectors

extCoc_s.sh

Generating the matrix from the co-occurrence result

gen_matrix.sh

Cluster Extraction

cnvFormatx.php

Usage

Computing co-occurrence

/tinyCC2/tinyCC.sh
Command: sh tinyCC.sh "prefix" "datadir" none

Ex) Input files in ~cwseo/tinyCC2/wikiCS2/*.txt
cd ~cwseo/tinyCC2
tinyCC.sh wikiCS wikiCS2/ none
extCoc_s.sh wikiCS 50

After execution, a coc_"prefix"_"threshold" directory is created that contains the context vector files. In the tinyCC2 directory, "prefix"_cos.src is generated (the result of extCoc_s.sh).

2nd order collocation

Execute tinyCC.sh again on the coc_"prefix"_"threshold" directory.

ex)
tinyCC.sh cocWikiCS coc_wikiCS2_50 none
extCoc_s.sh cocWikiCS 20

Result: cocWikiCS_cos.src

Hierarchical Clustering

/matrix

Make a new directory under "matrix", then copy "wikiCS_cos.src" to "freq_src.txt" and "cocWikiCS_cos.src" to "list.txt".
ex)
cd ~cwseo/tinyCC2/matrix
mkdir 07WikiCS
cp ../wikiCS_cos.src ./07WikiCS/freq_src.txt
cp ../cocWikiCS_cos.src ./07WikiCS/list.txt
sh gen_matrix.sh 07WikiCS

Result) result/07WikiCS.newick, result/07WikiCS.sif, result/07WikiCS.graphml
*.newick (for TreeQVista)
*.sif (for Cytoscape)
*.graphml (for yEd)

TreeQVista: http://genome.lbl.gov/vista/TreeQVista/
Cytoscape: http://www.cytoscape.org/
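
For readers who want to inspect the *.newick result by hand, here is a minimal sketch of how a cluster tree maps to a Newick string. The nested-tuple tree and the function name to_newick are assumptions for illustration; gen_matrix.sh may also write branch lengths or other details that this sketch omits.

```python
def to_newick(tree):
    # A leaf is a term (str); an inner node is a tuple of child subtrees.
    def render(node):
        if isinstance(node, tuple):
            return "(" + ",".join(render(child) for child in node) + ")"
        return node

    # A Newick tree is terminated by a semicolon.
    return render(tree) + ";"


# Hypothetical dendrogram with two clusters of two terms each.
tree = (("linux", "unix"), ("hard_disk", "floppy_disk"))
newick = to_newick(tree)
```

Each pair of parentheses corresponds to one inner node of the dendrogram, that is, one candidate cluster.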