<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE book PUBLIC "-//OASIS//DTD DocBook XML V4.5//EN"
"../../../docbook-xml-4.5/docbookx.dtd">
<appendix id="appendix.Tokenizers.inOmegaT">
<title>Tokenizers<indexterm class="singular">
<primary>Tokenizers</primary>
</indexterm></title>
<section>
<title>Introduction<indexterm class="singular">
<primary>Lucene</primary>
<see>Tokenizer</see>
</indexterm><indexterm class="singular">
<primary>Stemmer</primary>
<see>Tokenizer</see>
</indexterm></title>
<para>Tokenizers (or stemmers) improve the quality of matches by
recognizing inflected words in source and translation memory data. They
also improve glossary matching.</para>
<para>A stemmer for English, for example, should identify the string
"cats" (and possibly "catlike", "catty" etc.) as based on the root "cat",
and "stemmer", "stemming", "stemmed" as based on "stem". Likewise, a
stemming algorithm reduces the words "fishing", "fished", "fish", and
"fisher" to the root word "fish". This is especially useful for languages
that attach prefixes and suffixes to the stem. As an example, here are
some of the grammatical forms of the Slovenian adjective "lep"
("beautiful"):</para>
<itemizedlist>
<listitem>
<para>lep, lepa, lepo - singular: masculine, feminine, and neuter,
respectively</para>
</listitem>
<listitem>
<para>lepši, lepša, lepše - comparative, nominative: masculine,
feminine, and neuter, respectively; plural form of the adjective</para>
</listitem>
<listitem>
<para>najlepših - superlative, genitive plural for all three
genders</para>
</listitem>
</itemizedlist>
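<para>The idea of reducing inflected forms to a common stem can be
sketched with a toy suffix stripper. This is only an illustration: the
suffix list, the minimum stem length, and the function names are
assumptions made for the example, and the tokenizers bundled with OmegaT
use far more complete, language-specific rules.</para>
<programlisting><![CDATA[
```python
# Toy suffix-stripping stemmer. The suffix list and the minimum stem length
# are illustrative assumptions, not the rules OmegaT's tokenizers apply.
SUFFIXES = ("ing", "ed", "er", "s")
MIN_STEM = 3


def stem(word):
    """Return the word lowercased, with the first matching suffix removed."""
    word = word.lower()
    for suffix in SUFFIXES:
        # Strip only if enough of the word is left to serve as a stem.
        if word.endswith(suffix) and len(word) - len(suffix) >= MIN_STEM:
            return word[: len(word) - len(suffix)]
    return word


def same_stem(a, b):
    """Two words are treated as a match when they share a stem."""
    return stem(a) == stem(b)


words = ["fishing", "fished", "fish", "fisher", "cats"]
stems = [stem(w) for w in words]
print(stems)
```
]]></programlisting>
<para>With these assumed rules, the four "fish" words collapse to the
same stem and "cats" is reduced to "cat", so a segment containing
"fished" can be matched against one containing "fishing". The sketch
does not handle doubled consonants ("stemming" becomes "stemm", not
"stem"), which is one reason real stemmers need language-specific
rules.</para>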
</section>
<section>
<title>Language selection</title>
<para>Tokenizers are included in OmegaT and are active by default. OmegaT
automatically selects a tokenizer for the source language and for the
target language according to the language settings of the project. You
can select another tokenizer (Language Tokenizer) or a different version
of the tokenizer (Tokenizer Behavior) in the Project Properties
window.</para>
<para>If no tokenizer exists for the project's languages, OmegaT uses
Hunspell instead. In that case, make sure that the relevant Hunspell
dictionaries are installed.</para>
<warning>
<title>Incompatibilities</title>
<para>OmegaT will not launch if tokenizers are found in the /plugin
folder. Remove all the tokenizers from the /plugin folder before
starting OmegaT.</para>
</warning>
</section>
</appendix>