Permalink
Cannot retrieve contributors at this time
Name already in use
A tag already exists with the provided branch name. Many Git commands accept both tag and branch names, so creating this branch may cause unexpected behavior. Are you sure you want to create this branch?
2015OmegaT/OmegaT/doc_src/en/Segmentation.xml
Go to fileThis commit does not belong to any branch on this repository, and may belong to a fork outside of the repository.
285 lines (221 sloc)
10.8 KB
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
<?xml version="1.0" encoding="UTF-8"?> | |
<!DOCTYPE book PUBLIC "-//OASIS//DTD DocBook XML V4.5//EN" | |
"../../../docbook-xml-4.5/docbookx.dtd"> | |
<chapter id="chapter.segmentation"> | |
<title>Source segmentation</title> | |
<para>Translation memory tools work with textual units called segments. | |
<application>OmegaT</application> has two ways to segment a text: by | |
paragraph or by sentence segmentation (also referred to as “rule-based | |
segmentation”). In order to select the type of segmentation, select | |
<menuchoice> | |
<guimenu><indexterm class="singular"> | |
<primary>Project</primary> | |
<secondary>Properties</secondary> | |
</indexterm>Project</guimenu> | |
<guimenuitem>Properties...</guimenuitem> | |
</menuchoice> from the main menu and tick or untick the check box | |
provided. Paragraph segmentation is advantageous in certain cases, such as | |
highly creative or stylistic translations in which the translator may wish | |
to change the order of entire sentences; for the majority of projects, | |
however, sentence segmentation is a choice to be preferred, since it | |
delivers better matches from previous translations. If sentence segmentation | |
has been selected, you can setup the rules by selecting <menuchoice> | |
<guimenu><indexterm class="singular"> | |
<primary>Project</primary> | |
<secondary>Options</secondary> | |
</indexterm>Options</guimenu> | |
<guimenuitem>Segmentation...</guimenuitem> | |
</menuchoice>from the main menu.</para> | |
<para>Dependable segmentation rules are already available for many | |
languages, so it is likely that you will not need to get involved with | |
writing your own segmentation rules. On the other hand this functionality | |
can be very useful in special cases, where you can increase your | |
productivity by tuning the segmentation rules to the text to be | |
translated.</para> | |
<para><emphasis role="bold">Warning: </emphasis>because the text will | |
segment differently after filter options have been changed, so you may have | |
to start translating from scratch. At the same time the previous valid | |
segments in the project translation memory will turn into orphan segments. | |
If you change segmentation options when a project is open, you must reload | |
the project in order for the changes to take effect.</para> | |
<para><application>OmegaT</application> uses the following sequence of | |
steps:</para> | |
<variablelist> | |
<varlistentry> | |
<term><indexterm class="singular"> | |
<primary>Segmentation</primary> | |
<secondary>Source level segmentation</secondary> | |
</indexterm>Structure level segmentation</term> | |
<listitem> | |
<para><application>OmegaT</application> first parses the text for | |
structure-level segmentation. During this process it is only the | |
structure of the source file that is used to produce segments.</para> | |
<para>For example, text files may be segmented on line breaks, empty | |
lines, or not be segmented at all. Files containing formatting | |
(ODF documents, HTML documents, etc.) are segmented on the | |
block-level (paragraph) tags. Translatable object attributes in XHTML | |
or HTML files can be extracted as separate segments.</para> | |
</listitem> | |
</varlistentry> | |
<varlistentry> | |
<term><indexterm class="singular"> | |
<primary>Segmentation</primary> | |
<secondary>Sentence level segmentation</secondary> | |
</indexterm>Sentence level segmentation</term> | |
<listitem> | |
<para>After segmenting the source file into structural units, | |
<application>OmegaT</application> will segment these blocks further | |
into sentences.</para> | |
</listitem> | |
</varlistentry> | |
</variablelist> | |
<section id="segmentation.rules"> | |
<title>Segmentation rules<indexterm class="singular"> | |
<primary>Segmentation</primary> | |
<secondary>Rules</secondary> | |
</indexterm></title> | |
<para>The process of segmenting can be pictured as follows: the cursor | |
moves along the text, one character at a time. At each cursor position | |
rules, consisting of a <emphasis role="bold">Before </emphasis>and | |
<emphasis role="bold">After </emphasis>pattern, are applied in their given | |
order to see if any of the<emphasis role="bold"> Before</emphasis> | |
patterns are valid for the text on the left and the corresponding | |
<emphasis role="bold">After</emphasis> pattern for the text on the right | |
of the cursor. If the rule matches, either the cursor moves on without | |
inserting a segment break (for an exception rule) or a new segment break | |
is created at the current cursor position (for the break rule).</para> | |
<para>The two types of rules behave as follows:</para> | |
<variablelist> | |
<varlistentry> | |
<term><indexterm class="singular"> | |
<primary>Segmentation</primary> | |
<secondary>Rules</secondary> | |
<tertiary>Break rule</tertiary> | |
</indexterm>Break rule</term> | |
<listitem> | |
<para>Separates the source text into segments. For example, | |
"<emphasis>Did it make sense? I was not sure</emphasis>." should be | |
split into two segments. For this to happen, there should be a break | |
rule for "?", when followed by spaces and a capitalized word. To | |
define a rule as a break rule, tick the Break/Exception check | |
box.</para> | |
</listitem> | |
</varlistentry> | |
</variablelist> | |
<variablelist> | |
<varlistentry> | |
<term><indexterm class="singular"> | |
<primary>Segmentation</primary> | |
<secondary>Rules</secondary> | |
<tertiary>Exception rule</tertiary> | |
</indexterm>Exception rule</term> | |
<listitem> | |
<para>specify what parts of text should NOT be separated. In spite | |
of the period, <emphasis>"Mrs. Dalloway "</emphasis> should not be | |
split in two segments, so an exception rule should be established | |
for Mrs (and for Mr, for Dr, for prof etc), followed by a period. To | |
define a rule as an exception rule, leave the Break/Exception check | |
box unticked.</para> | |
</listitem> | |
</varlistentry> | |
</variablelist> | |
<para>The predefined break rules should be sufficient for most European | |
languages and Japanese. In view of the flexibility, you may consider | |
defining more exception rules for your source language in order to provide | |
more meaningful and coherent segments.</para> | |
</section> | |
<section id="rules.priority"> | |
<title>Rule priority<indexterm class="singular"> | |
<primary>Segmentation</primary> | |
<secondary>Rules priority</secondary> | |
</indexterm></title> | |
<para>All segmentation rule sets for a matching language pattern are | |
active and are applied in the given order of priority, so rules for | |
specific language should be higher than default ones. For example, rules | |
for Canadian French (FR-CA) should be set higher than rules for French | |
(FR.*), and higher than Default (.*) ones. Thus, when translating from | |
Canadian French the rules for Canadian French - if any - will be applied | |
first, followed by the rules for French and lastly, by the Default | |
rules.</para> | |
</section> | |
<section id="creating.new.rule"> | |
<title>Creating a new rule<indexterm class="singular"> | |
<primary>Segmentation</primary> | |
<secondary>Creating a new rule</secondary> | |
<seealso>Regular expressions</seealso> | |
</indexterm></title> | |
<para>Major changes to the segmentation rules should be generally avoided, | |
especially after completion of the first draft, but minor changes, such as | |
the addition of a recognized abbreviation, can be advantageous.</para> | |
<para>In order to edit or expand an existing set of rules, simply click on | |
it in the top table. The rules for that set will appear in the bottom half | |
of the window.</para> | |
<para>In order to create an empty set of rules for a new language pattern | |
click <emphasis role="bold">Add </emphasis>in the upper half of the | |
dialog. An empty line will appear at the bottom of the upper table (you | |
may have to scroll down to see it). Change the name of the rule set and | |
the language pattern to the language concerned and its code (see <xref | |
linkend="appendix.languages"/> for a list of language codes). The syntax | |
of the language pattern conforms to regular expression syntax. If your set | |
of rules handles a language-country pair, we advise you to move it to the | |
top using the <emphasis role="bold">Move Up</emphasis> button.</para> | |
<para>Add the <emphasis role="bold">Before</emphasis> and<emphasis | |
role="bold"> After</emphasis> patterns. To check their syntax and their | |
applicability, it is advisable to use tools which allow you to see their | |
effect directly. See the chapter on <link linkend="chapter.regexp">Regular | |
expressions</link>. A good starting point will always be the existing | |
rules.</para> | |
</section> | |
<section id="few.simple.examples"> | |
<title><indexterm class="singular"> | |
<primary>Segmentation</primary> | |
<secondary>Examples</secondary> | |
</indexterm>A few simple examples</title> | |
<informaltable> | |
<tgroup cols="4"> | |
<colspec align="left" colnum="1"/> | |
<colspec align="center" colnum="2"/> | |
<colspec align="center" colnum="3"/> | |
<colspec align="left" colnum="4"/> | |
<thead> | |
<row> | |
<entry>Intention</entry> | |
<entry align="center">Before</entry> | |
<entry align="center">After</entry> | |
<entry>Note</entry> | |
</row> | |
</thead> | |
<tbody> | |
<row> | |
<entry>Set the segment start after a period ('.') followed by a | |
space, tab ...</entry> | |
<entry align="center">\.</entry> | |
<entry align="center">\s</entry> | |
<entry>"\." stands for the period character. "\s" means any white | |
space character (space, tab, new page etc.)</entry> | |
</row> | |
<row> | |
<entry>Do not segment after Mr.</entry> | |
<entry align="center">Mr\.</entry> | |
<entry align="center">\s</entry> | |
<entry>This an exception rule, so the rule check box must not be | |
ticked</entry> | |
</row> | |
<row> | |
<entry>Set a segment after "。" (Japanese period)</entry> | |
<entry align="center">。</entry> | |
<entry align="center"/> | |
<entry>Note that <literal>after</literal> is empty</entry> | |
</row> | |
<row> | |
<entry>Do not segment after M. Mr. Mrs. and Ms.</entry> | |
<entry align="center">Mr??s??\.</entry> | |
<entry align="center">\s</entry> | |
<entry>Exception rule - see the use of ? in regular | |
expressions</entry> | |
</row> | |
</tbody> | |
</tgroup> | |
</informaltable> | |
</section> | |
</chapter> |