- Overview
For example, the set of terms {“hard_disk, floppy_disk, cd-rom, linux, unix, bsd, unix-like operating_systems”} can be partitioned into two concepts. {“hard_disk, floppy_disk, cd-rom”} is classified as a disc device and {“linux, unix, bsd, unix-like operating_systems”} is classified as an operating system.
We use paradigmatic relations to obtain synonym sets. Hierarchical clustering of these synonym sets yields candidate concepts. We use first-order and second-order collocations to extract the paradigmatic relations. A cluster that consists of similar words can become a class of the ontology. We extract such clusters by computing semantic relatedness: the semantic relatedness of a cluster is obtained by measuring the distance between its terms and their lowest common subsumer (LCS).
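The distance between terms and their lowest common subsumer can be illustrated with a small sketch. Everything below is an assumption made for illustration: the toy taxonomy `PARENT`, the helper names, and the choice of "mean number of edges to the LCS" as the distance are hypothetical and are not taken from the tools described on this page.

```python
# Hypothetical toy taxonomy (child -> parent), built from the Overview example.
PARENT = {
    "hard_disk": "disk_device",
    "floppy_disk": "disk_device",
    "cd-rom": "disk_device",
    "linux": "operating_system",
    "unix": "operating_system",
    "bsd": "operating_system",
    "disk_device": "computing",
    "operating_system": "computing",
}


def path_to_root(term):
    """Return the chain [term, parent, ..., root]."""
    path = [term]
    while path[-1] in PARENT:
        path.append(PARENT[path[-1]])
    return path


def lowest_common_subsumer(terms):
    """Return the deepest node that subsumes every term, or None."""
    paths = [path_to_root(t) for t in terms]
    common = set(paths[0]).intersection(*paths[1:])
    # Walking upward from the first term, the first shared node is the deepest.
    for node in paths[0]:
        if node in common:
            return node
    return None


def cluster_spread(terms):
    """Return (lcs, mean edge count from each term to the lcs).

    A smaller mean distance suggests a tighter, more related cluster.
    """
    lcs = lowest_common_subsumer(terms)
    if lcs is None:
        return None, float("inf")
    distances = [path_to_root(t).index(lcs) for t in terms]
    return lcs, sum(distances) / len(distances)


print(cluster_spread(["hard_disk", "floppy_disk", "cd-rom"]))
print(cluster_spread(["hard_disk", "linux"]))
```

Under this assumed measure, the disk cluster sits one edge below its LCS, while a mixed cluster such as hard_disk and linux only meets at a more general node, so it scores as less related.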
- Demo
- Co-occurrence computation
- Platform: Linux (x86)
- Location: http://wortschatz.uni-leipzig.de/~cbiemann/software/TinyCC2.htm
- Input: Text file, Mark-up text (xml, html)
- Clustering tool
- Platform: Windows/Linux/MacOS
- Location: http://bonsai.ims.u-tokyo.ac.jp/~mdehoon/software/cluster/
- Input: Feature matrix
- Utils
- Platform: Linux (x86)
- Required: php (http://www.php.net)
- Location:
- http://csace.kaist.ac.kr/~cwseo/gen_matrix.tar.gz
- make a directory named "matrix" and extract the archive there
- http://csace.kaist.ac.kr/~cwseo/tinyCC2.tar.gz
- extract the archive and copy its contents to the tinyCC2 directory
- generating collocation vector
- extCoc_s.sh
- generating matrix from co-occurrence result
- gen_matrix.sh
- Cluster Extraction
- cnvFormatx.php
- Usage
- Computing co-occurrence
Command: sh tinyCC.sh "prefix" "datadir" none
Ex) Input files in ~cwseo/tinyCC2/wikiCS2/*.txt
cd ~cwseo/tinyCC2
tinyCC.sh wikiCS wikiCS2/ none
extCoc_s.sh wikiCS 50
After execution, the coc_"prefix"_"threshold" directory contains the context vector files. The file "prefix"_cos.src (the result of extCoc_s.sh) is generated in the tinyCC2 directory.
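To make the idea of a thresholded context vector concrete, here is a minimal sketch of first-order co-occurrence counting. It is not the TinyCC implementation: it assumes sentence-level co-occurrence, raw counts instead of whatever significance measure the tool applies, and a hypothetical function name `cooccurrence_vectors`.

```python
from collections import Counter, defaultdict
from itertools import combinations


def cooccurrence_vectors(sentences, threshold=1):
    """Build a context vector per term from sentence-level co-occurrence.

    Two terms co-occur when they appear in the same sentence. Pairs seen
    fewer than `threshold` times are dropped (assumed role of the threshold).
    """
    pair_counts = Counter()
    for sentence in sentences:
        # Count each unordered pair once per sentence.
        tokens = sorted(set(sentence.lower().split()))
        pair_counts.update(combinations(tokens, 2))

    vectors = defaultdict(dict)
    for (a, b), count in pair_counts.items():
        if count >= threshold:
            vectors[a][b] = count
            vectors[b][a] = count
    return dict(vectors)


sentences = ["linux kernel", "unix kernel", "linux kernel source"]
print(cooccurrence_vectors(sentences, threshold=2))
```

With a threshold of 2, only the linux/kernel pair survives in this toy input, which mirrors how a higher threshold keeps only the stronger collocations.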
- 2nd order collocation
ex)
tinyCC.sh cocWikiCS coc_wikiCS2_50 none
extCoc_s.sh cocWikiCS 20
Result: cocWikiCS_cos.src
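Second-order collocation compares terms by the contexts they share rather than by direct co-occurrence. The sketch below assumes cosine similarity between sparse context vectors, which the "_cos" file suffix hints at but this page does not confirm; the function names and the sample vectors are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(weight * v[key] for key, weight in u.items() if key in v)
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def second_order_pairs(vectors, min_similarity=0.5):
    """Return (term_a, term_b, similarity) for pairs with similar contexts."""
    terms = sorted(vectors)
    pairs = []
    for i, a in enumerate(terms):
        for b in terms[i + 1:]:
            similarity = cosine(vectors[a], vectors[b])
            if similarity >= min_similarity:
                pairs.append((a, b, similarity))
    return pairs


# Hypothetical first-order context vectors.
vectors = {
    "linux": {"kernel": 2, "shell": 1},
    "unix": {"kernel": 2, "shell": 1},
    "cd-rom": {"drive": 3},
}
for a, b, similarity in second_order_pairs(vectors):
    print(a, b, round(similarity, 3))
```

In this toy input, linux and unix never need to co-occur directly: they are paired because their context vectors point in the same direction, while cd-rom shares no context with either.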
- Hierarchical Clustering
Make a new directory under "matrix", then copy "wikiCS_cos.src" into it as "freq_src.txt" and "cocWikiCS_cos.src" as "list.txt".
ex)
cd ~cwseo/tinyCC2/matrix
mkdir 07WikiCS
cp ../wikiCS_cos.src ./07WikiCS/freq_src.txt
cp ../cocWikiCS_cos.src ./07WikiCS/list.txt
sh gen_matrix.sh 07WikiCS
Result) result/07WikiCS.newick, result/07WikiCS.sif, result/07WikiCS.graphml
*.newick (for TreeQVista)
*.sif (for Cytoscape)
*.graphml (for yEd)
TreeQVista: http://genome.lbl.gov/vista
Cytoscape: http://www.cytoscape.org/
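As an illustration of what the hierarchical clustering step produces, the following sketch merges terms bottom-up and writes the resulting tree as a Newick string. It is not the clustering tool listed above: average linkage, the function name `average_linkage_newick`, and the similarity table `SIM` are all assumptions chosen to keep the example small.

```python
def average_linkage_newick(terms, sim):
    """Agglomerative clustering; returns the merge tree as a Newick string.

    `sim(a, b)` must return a similarity score for two terms. At every step
    the two clusters with the highest average pairwise similarity are merged.
    """
    # Each cluster is (newick_fragment, member_terms).
    clusters = [(term, [term]) for term in terms]

    def cluster_similarity(c1, c2):
        scores = [sim(a, b) for a in c1[1] for b in c2[1]]
        return sum(scores) / len(scores)

    while len(clusters) > 1:
        candidates = [
            (i, j)
            for i in range(len(clusters))
            for j in range(i + 1, len(clusters))
        ]
        i, j = max(
            candidates,
            key=lambda ij: cluster_similarity(clusters[ij[0]], clusters[ij[1]]),
        )
        merged = (
            "(%s,%s)" % (clusters[i][0], clusters[j][0]),
            clusters[i][1] + clusters[j][1],
        )
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append(merged)

    return clusters[0][0] + ";"


# Hypothetical pairwise similarities; unlisted pairs default to 0.1.
SIM = {
    frozenset(("hard_disk", "floppy_disk")): 0.9,
    frozenset(("linux", "unix")): 0.8,
}


def toy_similarity(a, b):
    return SIM.get(frozenset((a, b)), 0.1)


tree = average_linkage_newick(
    ["hard_disk", "floppy_disk", "linux", "unix"], toy_similarity
)
print(tree)  # ((hard_disk,floppy_disk),(linux,unix));
```

Each inner pair of parentheses in the output is a candidate concept: here the disk terms and the operating system terms form separate subtrees before being joined at the root.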