Architecture

Getting started

Citation

When using TextImager, please cite the following:

Running a process

Here is an example of what a minimal class using the web service could look like:

package example;

import java.io.IOException;
import java.net.URISyntaxException;

import org.hucompute.services.webservices.TextImagerInterface;
import org.hucompute.services.webservices.TextImagerService;

import net.java.dev.jaxb.array.StringArray;

public class Example {

	public static void main(String[] args) throws IOException, URISyntaxException {
		TextImagerService service = new TextImagerService();
		TextImagerInterface serviceInterface = service.getTextImagerPort();


		StringArray options = new StringArray();

		// Options are key-value pairs: the language is the key, the NLP tools are the value.
		options.getItem().add("de");
		options.getItem().add("MarMoTTagger, MarMoTLemma");
		// Output format; xmi is the default when this pair is omitted.
		options.getItem().add("outputFormat");
		options.getItem().add("tcf");
		// Input format of the text passed to process().
		options.getItem().add("inputFormat");
		options.getItem().add("plain");


		System.out.println(serviceInterface.process("Das ist ein simpler Test.", options));

	}

}
	
The output format can be changed; the default format is xmi. With tcf, the output should look like this:
<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<?xml-model href="http://de.clarin.eu/images/weblicht-tutorials/resources/tcf-04/schemas/latest/d-spin_0_4.rnc" type="application/relax-ng-compact-syntax"?>
<D-Spin xmlns="http://www.dspin.de/data" version="0.4">
  <ns2:MetaData xmlns="http://www.clarin.eu/cmd/" xmlns:ns2="http://www.dspin.de/data/metadata" .../>
  <TextCorpus xmlns="http://www.dspin.de/data/textcorpus" lang="de">
    <text>Das ist ein simpler Test.</text>
    <tokens>
      <token ID="t_0">Das</token>
      <token ID="t_1">ist</token>
      <token ID="t_2">ein</token>
      <token ID="t_3">simpler</token>
      <token ID="t_4">Test</token>
      <token ID="t_5">.</token>
    </tokens>
    <sentences>
      <sentence tokenIDs="t_0 t_1 t_2 t_3 t_4 t_5"/>
    </sentences>
    <lemmas>
      <lemma ID="l_0" tokenIDs="t_0">der</lemma>
      <lemma ID="l_1" tokenIDs="t_1">sein</lemma>
      <lemma ID="l_2" tokenIDs="t_2">ein</lemma>
      <lemma ID="l_3" tokenIDs="t_3">simpel</lemma>
      <lemma ID="l_4" tokenIDs="t_4">Test</lemma>
      <lemma ID="l_5" tokenIDs="t_5">--</lemma>
    </lemmas>
    <POStags tagset="STTS">
      <tag tokenIDs="t_0">PDS</tag>
      <tag tokenIDs="t_1">VAFIN</tag>
      <tag tokenIDs="t_2">ART</tag>
      <tag tokenIDs="t_3">ADJA</tag>
      <tag tokenIDs="t_4">NN</tag>
      <tag tokenIDs="t_5">$.</tag>
    </POStags>
  </TextCorpus>
</D-Spin>
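
To work with the tcf result programmatically, the returned string can be read with the standard Java DOM parser. The following is a minimal sketch, not part of TextImager: the class TcfSketch and its method texts are hypothetical names, and the embedded document is a shortened copy of the sample output with the MetaData element left out. It assumes that tokens, lemmas and POS tags line up one-to-one by position, as they do in the sample; a more careful reader would resolve the tokenIDs attributes instead.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;

// Hypothetical helper for illustration; not part of the TextImager client.
final class TcfSketch {

	// Returns the text content of every element named tagName, in document order.
	static List<String> texts(String tcf, String tagName) {
		try {
			Document doc = DocumentBuilderFactory.newInstance()
					.newDocumentBuilder()
					.parse(new ByteArrayInputStream(tcf.getBytes(StandardCharsets.UTF_8)));
			NodeList nodes = doc.getElementsByTagName(tagName);
			List<String> result = new ArrayList<>();
			for (int i = 0; i < nodes.getLength(); i++) {
				result.add(nodes.item(i).getTextContent());
			}
			return result;
		} catch (Exception e) {
			throw new IllegalArgumentException("Could not parse TCF document", e);
		}
	}

	public static void main(String[] args) {
		// Shortened copy of the sample tcf output; in real use this would be
		// the string returned by serviceInterface.process(...).
		String tcf = "<D-Spin xmlns=\"http://www.dspin.de/data\" version=\"0.4\">"
				+ "<TextCorpus xmlns=\"http://www.dspin.de/data/textcorpus\" lang=\"de\">"
				+ "<tokens><token ID=\"t_0\">Das</token><token ID=\"t_1\">ist</token></tokens>"
				+ "<lemmas><lemma ID=\"l_0\" tokenIDs=\"t_0\">der</lemma>"
				+ "<lemma ID=\"l_1\" tokenIDs=\"t_1\">sein</lemma></lemmas>"
				+ "<POStags tagset=\"STTS\"><tag tokenIDs=\"t_0\">PDS</tag>"
				+ "<tag tokenIDs=\"t_1\">VAFIN</tag></POStags>"
				+ "</TextCorpus></D-Spin>";

		List<String> tokens = texts(tcf, "token");
		List<String> lemmas = texts(tcf, "lemma");
		List<String> tags = texts(tcf, "tag");

		// Assumes the three layers are aligned by position (true for this sample).
		for (int i = 0; i < tokens.size(); i++) {
			System.out.println(tokens.get(i) + "\t" + lemmas.get(i) + "\t" + tags.get(i));
		}
	}
}
```

The parser is left at its default, non-namespace-aware setting, so the unprefixed element names token, lemma and tag can be looked up directly.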

Options

Languages: see Available Services
Tools: see Available Services
Input format (inputFormat): tei, json, xmi, plain
Output format (outputFormat): tei, tcf, xmi, webanno
Options are passed as key-value pairs. For the tools, the key is the language, and the value is the list of tools to run for that language.
You can use multiple language-tools pairs in your options, but at most one input format and one output format.
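
The key-value layout can be sketched with a small helper that flattens the pairs into the order the service expects. This is an illustration only: OptionsSketch and buildOptions are hypothetical names, not part of TextImager, and joining several tools with ", " is an assumption taken from the example in "Running a process".

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper for illustration; not part of the TextImager client.
final class OptionsSketch {

	// Flattens language-tools pairs and the optional formats into one key-value list.
	static List<String> buildOptions(Map<String, List<String>> toolsByLanguage,
			String inputFormat, String outputFormat) {
		List<String> items = new ArrayList<>();
		for (Map.Entry<String, List<String>> entry : toolsByLanguage.entrySet()) {
			items.add(entry.getKey());                        // key: the language
			items.add(String.join(", ", entry.getValue()));   // value: the tools (separator assumed)
		}
		if (inputFormat != null) {                            // at most one input format
			items.add("inputFormat");
			items.add(inputFormat);
		}
		if (outputFormat != null) {                           // at most one output format
			items.add("outputFormat");
			items.add(outputFormat);
		}
		return items;
	}

	public static void main(String[] args) {
		// LinkedHashMap keeps the language-tools pairs in insertion order.
		Map<String, List<String>> tools = new LinkedHashMap<>();
		tools.put("de", Arrays.asList("MarMoTTagger", "MarMoTLemma"));
		tools.put("en", Arrays.asList("StanfordPosTagger"));

		for (String item : buildOptions(tools, "plain", "tcf")) {
			System.out.println(item);
		}
	}
}
```

With the generated client classes, the resulting list could be copied into the request with options.getItem().addAll(items), since getItem() is the same list the example in "Running a process" adds to.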

Available Services

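Group                Name                           Language  Model
parser               StanfordParser                 ar        factored
parser               StanfordParser                 de        sr, factored, pcfg
parser               StanfordParser                 en        rnn, sr, sr-beam, wsj-factored, wsj-pcfg, wsj-rnn, factored, pcfg
parser               StanfordParser                 es        pcfg, sr, sr-beam
parser               StanfordParser                 fr        factored, sr-beam
parser               StanfordParser                 zh        xinhua-factored, xinhua-pcfg, factored, pcfg, sr
parser               ClearNlpDependencyParser       en        mayo, ontonotes
ner                  StanfordNamedEntityRecognizer  de        dewac_175m_600.crf, hgc_175m_600.crf
ner                  StanfordNamedEntityRecognizer  en        all.3class.caseless.distsim.crf, all.3class.distsim.crf, conll.4class.caseless.distsim.crf, conll.4class.distsim.crf, muc.7class.caseless.distsim.crf, muc.7class.distsim.crf, nowiki.3class.caseless.distsim.crf
ner                  StanfordNamedEntityRecognizer  es        ancora.distsim.s512.crf
ner                  NERAnnotator                   de        ?
ner                  OpenNlpNameFinder              en        ?
time                 HeidelTime                     en        ?
time                 HeidelTime                     de        ?
sentiment            Sentiws                        en        date, location, money, organization, percentage, person, time
sentiment            Sentiws                        es        location, misc, organization, person
sentiment            Sentiws                        nl        location, misc, organization
lemmatizer           LanguageToolLemmatizer         en        ?
lemmatizer           LanguageToolLemmatizer         de        ?
lemmatizer           StanfordLemmatizer             en        ?
lemmatizer           MarMoTLemma                    la        ?
lemmatizer           MarMoTLemma                    de        ?
pos                  StanfordPosTagger              ar        accurate
pos                  StanfordPosTagger              de        fast-caseless, dewac, fast, hgc
pos                  StanfordPosTagger              en        bidirectional-distsim, caseless-left3words-distsim, fast.41, left3words-distsim, twitter, twitter-fast, wsj-0-18-bidirectional-nodistsim, wsj-0-18-caseless-left3words-distsim, wsj-0-18-left3words-distsim, wsj-0-18-left3words-nodistsim
pos                  StanfordPosTagger              es        default, distsim
pos                  MarMoTTagger                   la        ?
pos                  MarMoTTagger                   de        ?
tokenizer            OpenNlpSegmenter               da        maxent
tokenizer            OpenNlpSegmenter               de        maxent
tokenizer            OpenNlpSegmenter               en        maxent
tokenizer            OpenNlpSegmenter               it        maxent
tokenizer            OpenNlpSegmenter               nb        maxent
tokenizer            OpenNlpSegmenter               nl        maxent
tokenizer            OpenNlpSegmenter               pt        maxent
tokenizer            OpenNlpSegmenter               sv        maxent
tokenizer            LanguageToolSegmenter          en        ?
tokenizer            LanguageToolSegmenter          de        ?
tokenizer            BreakIteratorSegmenter         en        ?
tokenizer            BreakIteratorSegmenter         de        ?
tokenizer            BreakIteratorSegmenter         la        ?
tokenizer            StanfordTokenizer              ar        ?
tokenizer            StanfordTokenizer              en        ?
tokenizer            StanfordTokenizer              es        ?
tokenizer            StanfordTokenizer              fr        ?
tokenizer            OpenNLPTokenizer               en        ?
tokenizer            OpenNLPTokenizer               de        ?
tokenizer            ClearNlpSegmenter              en        ?
paragraphSplitter    ParagraphSplitter              en        ?
paragraphSplitter    ParagraphSplitter              de        ?
paragraphSplitter    ParagraphSplitter              la        ?
similarity           CosineSimilarity               de        ?
similarity           CosineSimilarity               en        ?
similarity           CosineSimilarity               la        ?
similarity           Biemann                        de        ?
similarity           Biemann                        en        ?
similarity           Biemann                        la        ?
similarity           WordNGramJaccardMeasure        de        ?
similarity           WordNGramJaccardMeasure        en        ?
similarity           WordNGramJaccardMeasure        la        ?
grammaticalcategory  TreeTaggerGrammaticalCategory  de        ?
disambiguation       HUComputeWSD                   en        ?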
