TranslationMemories.xml

<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE book PUBLIC "-//OASIS//DTD DocBook XML V4.5//EN"
"../../../docbook-xml-4.5/docbookx.dtd">
<chapter id="chapter.translation.memories">
  <title>Translation memories<indexterm class="singular">
      <primary>Translation memories</primary>
    </indexterm><indexterm class="singular">
      <primary>TMX</primary>

      <see>Translation memories</see>
    </indexterm></title>

  <section id="OmegaT.and.tmx.files">
    <title>Translation memories in OmegaT</title>

    <section id="tmx.files.location.and.purpose">
      <title>tmx folders - location and purpose</title>

      <para><application>OmegaT</application> projects can have translation
      memory files - i.e. files with the extension tmx - in five different
      places:</para>

      <variablelist>
        <varlistentry>
          <term><indexterm class="singular">
              <primary>Translation memories</primary>

              <secondary>Subfolder omegat</secondary>

              <seealso>Project files</seealso>
            </indexterm>omegat folder</term>

          <listitem>
            <para>The omegat folder contains the
            <filename>project_save.tmx</filename> and possibly a number of
            backup TMX files. The <filename>project_save.tmx</filename> file
            contains all the segments that have been recorded in memory since
            you started the project. This file always exists in the project.
            Its contents will always be sorted alphabetically by the source
            segment.</para>
          </listitem>
        </varlistentry>

        <varlistentry>
          <term><indexterm class="singular">
              <primary>Translation memories</primary>

              <secondary>Project main folder</secondary>
            </indexterm>main project folder</term>

          <listitem>
            <para>The main project folder contains 3 tmx files,
            <filename>project_name-omegat.tmx</filename>,
            <filename>project_name-level1.tmx</filename> and
            <filename>project_name-level2.tmx</filename> (project_name being
            the name of your project).</para>

            <itemizedlist>
              <listitem>
                <para>The level1 file contains only textual
                information.</para>
              </listitem>

              <listitem>
                <para>The level2 file encapsulates
                <application>OmegaT</application> specific tags in correct tmx
                tags so that the file can be used with its formatting
                information in a translation tool that supports tmx level 2
                memories, or <application>OmegaT</application> itself.</para>
              </listitem>

              <listitem>
                <para>The <application>OmegaT</application> file includes
                <application>OmegaT</application> specific formatting tags so
                that the file can be used in other
                <application>OmegaT</application> projects</para>
              </listitem>
            </itemizedlist>

            <para>These files are copies of the file
            <filename>project_save.tmx</filename>, i.e. of the project's main
            translation memory, excluding the so-called orphan segments. They
            carry appropriately changed names, so that its contents still
            remain identifiable, when used elsewhere, for instance in the
            <filename>tm</filename> subfolder of some other project (see
            below).</para>
          </listitem>
        </varlistentry>

        <varlistentry>
          <term><filename><indexterm class="singular">
              <primary>Translation memories</primary>

              <secondary>Subfolder tm</secondary>

              <seealso>Project files</seealso>
            </indexterm>tm</filename> folder</term>

          <listitem>
            <para>The /tm/ folder can contain any number of ancillary
            translation memories - i.e. tmx files. Such files can be created
            in any of the three varieties indicated above. Note that other CAT
            tools can export (and import as well) tmx files, usually in all
            three forms. The best thing of course is to use OmegaT-specific
            TMX files (see above), so that the in-line formatting within the
            segment is retained.</para>

            <para>The contents of translation memories in the tm subfolder
            serve to generate suggestions for the text(s) to be translated.
            Any text, already translated and stored in those files, will
            appear among the fuzzy matches, if it is sufficiently similar to
            the text currently being translated.</para>

            <para>If the source segment in one of the ancillary TMs is
            identical to the text being translated, OmegaT acts as defined in
            the <menuchoice>
                <guimenu>Options</guimenu>

                <guimenuitem>Editing Behavior...</guimenuitem>
              </menuchoice> dialog window. For instance (if the default is
            accepted), the translation from the ancillary TM is accepted and
            prefixed with<emphasis> [fuzzy]</emphasis>, so that the translator
            can review the translations at a later stage and check whether the
            segments tagged this way, have been translated correctly (see the
            <link linkend="chapter.translation.editing">Editing
            behavior</link> chapter) <menuchoice>
                <guimenu>.</guimenu>
              </menuchoice></para>

            <para>It may happen, that translation memories, available in the
            <filename>tm</filename> subfolder, contain segments with identical
            source text, but differing targets. TMX files are read sorted by
            their names and segments within a given TMX file line by line. The
            last segment with the identical source text will thus prevail
            (Note: of course it makes more sense to avoid this to happen in
            the first place).</para>

            <para>Note that the TMX files in the tm folder can be compressed
            with gzip.<indexterm class="singular">
                <primary>Translation memories</primary>

                <secondary>compressed</secondary>
              </indexterm></para>
          </listitem>
        </varlistentry>

        <varlistentry>
          <term><indexterm class="singular">
              <primary>Translation memories</primary>

              <secondary>Subfolder tm/auto</secondary>

              <seealso>Project files</seealso>
            </indexterm>tm/auto folder<indexterm class="singular">
              <primary>Project</primary>

              <secondary>Pretranslation</secondary>
            </indexterm></term>

          <listitem>
            <para>If it is clear from the very start, that translations in a
            given TM (or TMs) are all correct, one can put them into
            the<emphasis role="bold"> tm/auto</emphasis> folder and avoid
            confirming a lot of<emphasis> [fuzzy]</emphasis> cases. This will
            effectively <emphasis role="bold">pre-translate </emphasis>the
            source text: all the segments in the source text, for which
            translations can be found in those "auto" TMs, will land in the
            main TM of the project without any user intervention.</para>
          </listitem>
        </varlistentry>

        <varlistentry>
          <term>tm/enforce folder</term>

          <listitem>
            <para>If you have no doubt that a TMX is more accurate than the
            <filename>project_save.tmx</filename> of OmegaT, put this TMX in
            /tm/enforce to overwrite existing default translations
            unconditionally.</para>
          </listitem>
        </varlistentry>

        <varlistentry>
          <term>tm/mt folder</term>

          <listitem>
            <para>In the editor pane, when a match is inserted from a TMX
            contained in a folder named <emphasis role="bold">mt</emphasis>,
            the background of the active segment is changed to red. The
            background is restored to normal when the segment is left.</para>
          </listitem>
        </varlistentry>

        <varlistentry>
          <term><indexterm class="singular">
              <primary>Translation memories</primary>

              <secondary>Subfolders tm/penalty-xxx</secondary>

              <seealso>Project files</seealso>
            </indexterm>tm/penalty-xxx folders</term>

          <listitem>
            <para>Sometimes, it is useful to distinguish between high-quality
            translation memories and those that are, because of the subject
            matter, client, revision status, etc., less reliable. For
            translation memories in folders with a name "penalty-xxx" (with
            xxx between 0 and 100), matches will be degraded according to the
            name of the folder: a 100% match in any of TMs, residing in a
            folder called Penalty-30 for instance, will be lowered to a 70%
            match. The penalty applies to all three match percentages: matches
            75, 80, 90 will in this case be lowered to 45, 50, 60.</para>
          </listitem>
        </varlistentry>
      </variablelist>

      <para>Optionally, you can let <application>OmegaT</application> have an
      additional tmx file (<application>OmegaT</application>-style) anywhere
      you specify, containing all translatable segments of the project. See
      pseudo-translated memory below.</para>

      <para>Note that all the translation memories are loaded into memory when
      the project is opened. Back-ups of the project translation memory are
      produced regularly (see next chapter), and
      <filename>project_save.tmx</filename> is also saved/updated when the
      project is closed or loaded again. This means for instance that you do
      not need to exit a project you are currently working on if you decide to
      add another ancillary TM to it: you simply reload the project, and the
      changes you have made will be included.</para>

      <para>The locations of the various different translation memories for a
      given project are user-defined (see Project dialog window in <link
      linkend="chapter.project.properties">Project properties)</link></para>

      <para>Depending on the situation, different strategies are thus
      possible, for instance:</para>

      <para><emphasis role="bold">several projects on the same subject:
      </emphasis>keep the project structure, and change source and target
      folders (Source = source/order1, target = target/order1 etc). Note that
      you segments from order1, that are not present in order2 and other
      subsequent jobs, will be tagged as orphan segments; however, they will
      still be useful for getting fuzzy matches.</para>

      <para><emphasis role="bold">several translators working on the same
      project:</emphasis> split the source files into source/Alice,
      source/Bob... and allocate them to team members (Alice, Bob ...). They
      can then create their own projects and, deliver their own
      <filename>project_save.tmx</filename>, when finished or when a given
      milestone has been reached. The <filename>project_save.tmx</filename>
      files are then collected and possible conflicts as regards terminology
      for instance get resolved. A new version of the master TM is then
      created, either to be put in team members'
      <emphasis>tm/auto</emphasis>subfolders or to replace their
      <filename>project_save.tmx</filename> files. The team can also use the
      same subfolder structure for the target files. This allows them for
      instance to check at any moment, whether the target version for the
      complete project is still OK</para>
    </section>

    <section id="tmx.backup">
      <title>tmx backup<indexterm class="singular">
          <primary>Translation memories</primary>

          <secondary>Backup</secondary>
        </indexterm></title>

      <para>As you translate your files, <application>OmegaT</application>
      stores your work continually in <filename>project_save.tmx</filename> in
      the project's /<filename>omegat</filename> subfolder.</para>

      <para><application>OmegaT</application> also backups translation memory
      to <filename>project_save.tmx.YEARMMDDHHNN.bak</filename> in the same
      subfolder whenever a project is opened or reloaded. YEAR is 4-digit
      year, MM is a month, DD day of the month, HH and NN are hours and
      minutes when the previous translation memory was saved.</para>

      <para>If you believe you have lost translation data, follow the
      following procedure:</para>

      <orderedlist>
        <listitem>
          <para>Close the project</para>
        </listitem>

        <listitem>
          <para>Rename the current <filename>project_save.tmx</filename> file
          ( e.g. to <filename>project_save.tmx.temporary</filename>)</para>
        </listitem>

        <listitem>
          <para>Select the backup translation memory that is most likely -
          e.g. the most recent one, or the last version from the day before)
          to contain the data you are looking for</para>
        </listitem>

        <listitem>
          <para>Copy it to <filename>project_save.tmx</filename></para>
        </listitem>

        <listitem>
          <para>Open the project</para>
        </listitem>
      </orderedlist>
    </section>

    <section id="tmx.files.and.language">
      <title>tmx files and language<indexterm class="singular">
          <primary>Translation memories</primary>

          <secondary>Language</secondary>
        </indexterm></title>

      <para>Tmx files contain translation units, made of a number of
      equivalent segments in several languages. A translation unit comprises
      at least two translation unit variants (TUV). Either can be used as the
      source or target.</para>

      <para>The settings in your project indicate which is the source and
      which the target language. OmegaT thus takes the TUV segments
      corresponding to the project's source and target language codes and uses
      them as the source and target segments respectively. OmegaT recognizes
      the language codes using the following two standard conventions :</para>

      <itemizedlist>
        <listitem>
          <para>2 letters (e.g. JA for Japanese), or</para>
        </listitem>

        <listitem>
          <para>2- or 3-letter language code followed by the 2-letter country
          code (e.g. EN-US - See <xref linkend="appendix.languages"/> for a
          partial list of language and country codes).</para>
        </listitem>
      </itemizedlist>

      <para>If the project language codes and the tmx language codes fully
      match, the segments are loaded in memory. If languages match but not the
      country, the segments still get loaded. If neither the language code not
      the country code match, the segments will be ignored.</para>

      <para><indexterm class="singular">
          <primary>Translation memories</primary>

          <secondary>multilingual, handling of</secondary>
        </indexterm>TMX files can generally contain translation units with
      several candidate languages. If for a given source segment there is no
      entry for the selected target language, all other target segments are
      loaded, regardless of the language. For instance, if the language pair
      of the project is DE-FR, it can be still be of some help to see hits in
      the DE-EN translation, if there's none in the DE-FR pair.</para>
    </section>

    <section>
      <title>Orphan segments<indexterm class="singular">
          <primary>Translation memories</primary>

          <secondary>Orphan segments</secondary>
        </indexterm></title>

      <para>The file <filename>project_save.tmx</filename> contains all the
      segments that have been translated since you started the project. If you
      modify the project segmentation or delete files from the source, some
      matches may appear as <emphasis role="bold">orphan strings</emphasis> in
      the Match Viewer: such matches refer to segments that do not exist any
      more in the source documents, as they correspond to segments translated
      and recorded before the modifications took place.</para>
    </section>
  </section>

  <section id="using.translation.memories.from.previous.projects">
    <title>Reusing translation memories<indexterm class="singular">
        <primary>Translation memories</primary>

        <secondary>Reusing translation memories</secondary>
      </indexterm></title>

    <para>Initially, that is when the project is created, the main TM of the
    project, <filename>project_save.tmx</filename> is empty. This TM gradually
    becomes filled during the translation. To speed up this process, existing
    translations can be reused. If a given sentence has already been
    translated once, and translated correctly, there is no need for it to be
    retranslated. Translation memories may also contain reference
    translations: multinational legislation, such as that of the European
    Community, is a typical example.</para>

    <para>When you create the target documents in an
    <application>OmegaT</application> project, the translation memory of the
    project is output in the form of three files in the root folder of your
    <application>OmegaT</application> project (see the above description). You
    can regard these three tmx files (<filename>-omegat.tmx</filename>,
    <filename>-level1.tmx</filename> and <filename>-level2.tmx</filename>) as
    an "export translation memory", i.e. as an export of your current
    project's content in bilingual form.</para>

    <para>Should you wish to reuse a translation memory from a previous
    project (for example because the new project is similar to the previous
    project, or uses terminology which might have been used before), you can
    use these translation memories as "input translation memories", i.e. for
    import into your new project. In this case, place the translation memories
    you wish to use in the <emphasis>/tm</emphasis> or
    <emphasis>/tm</emphasis>/auto folder of your new project: in the former
    case you will get hits from these translation memories in the fuzzy
    matches viewer, and in the latter case these TMs will be used to
    pre-translate your source text.</para>

    <para>By default, the /tm folder is below the project's root folder (e.g.
    ...<emphasis>/MyProject/tm</emphasis>), but you can choose a different
    folder in the project properties dialog if you wish. This is useful if you
    frequently use translation memories produced in the past, for example
    because they are on the same subject or for the same customer. In this
    case, a useful procedure would be:</para>

    <itemizedlist>
      <listitem>
        <para>Create a folder (a "repository folder") in a convenient location
        on your hard drive for the translation memories for a particular
        customer or subject.</para>
      </listitem>
    </itemizedlist>

    <itemizedlist>
      <listitem>
        <para>Whenever you finish a project, copy one of the three "export"
        translation memory files from the root folder of the project to the
        repository folder.</para>
      </listitem>
    </itemizedlist>

    <itemizedlist>
      <listitem>
        <para>When you begin a new project on the same subject or for the same
        customer, navigate to the repository folder in the
        <guimenuitem>Project &gt; Properties &gt; Edit Project
        dialog</guimenuitem> and select it as the translation memory
        folder.</para>
      </listitem>
    </itemizedlist>

    <para>Note that all the tmx files in the /tm repository are parsed when
    the project is opened, so putting all different TMs you may have on hand
    into this folder may unnecessarily slow OmegaT down. You may even consider
    removing those that are not required any more, once you have used their
    contents to fill up the <filename>project-save.tmx</filename> file.</para>

    <section id="importing.and.exporting.translation.memories">
      <title>Importing and exporting translation memories<indexterm
          class="singular">
          <primary>Translation memories</primary>

          <secondary>Importing and exporting</secondary>
        </indexterm></title>

      <para>OmegaT supports imported tmx versions 1.1-1.4b (both level 1 and
      level 2). This enables the translation memories produced by other tools
      to be read by OmegaT. However, OmegaT does not fully support imported
      level 2 tmx files (these store not only the translation, but also the
      formatting). Level 2 tmx files will still be imported and their textual
      content can be seen in OmegaT, but the quality of fuzzy matches will be
      somewhat lower.</para>

      <para>OmegaT follows very strict procedures when loading translation
      memory (tmx) files. If an error is found in such a file, OmegaT will
      indicate the position within the defective file at which the error is
      located.</para>

      <para>Some tools are known to produce invalid tmx files under certain
      conditions. If you wish to use such files as reference translations in
      OmegaT, they must be repaired, or OmegaT will report an error and fail
      to load them. Fixes are trivial operations and OmegaT assists
      troubleshooting with the related error message. You can ask the user
      group for advice if you have problems.</para>

      <para>OmegaT exports version 1.4 TMX files (both level 1 and level 2).
      The level 2 export is not fully compliant with the level 2 standard, but
      is sufficiently close and will generate correct matches in other
      translation memory tools supporting TMX Level 2. If you only need
      textual information (and not formatting information), use the level 1
      file that OmegaT has created.</para>
    </section>

    <section id="Creating.a.translation.memory.for.selected.documents">
      <title>Creating a translation memory for selected documents</title>

      <para>In case translators need like to share their TMX bases while
      excluding some of their parts or including just translations of certain
      files, sharing the complete <filename>ProjectName-omegat.tmx</filename>
      is out of question. The following recipee is just one of the
      possibilities, but simple enough to follow and without any dangers for
      the assets.</para>

      <itemizedlist>
        <listitem>
          <para>Create a project, separate for other projects, in the desired
          language pair, with an appropriate name - note that the TMXs created
          will include this name.</para>
        </listitem>
      </itemizedlist>

      <itemizedlist>
        <listitem>
          <para>Copy the documents, you need the translation memory for, into
          the source folder of the project.</para>
        </listitem>
      </itemizedlist>

      <itemizedlist>
        <listitem>
          <para>Copy the translation memories, containing the translations of
          the documents above, into <filename>tm/auto</filename> subfolder of
          the new project.</para>
        </listitem>
      </itemizedlist>

      <itemizedlist>
        <listitem>
          <para>Start the project. Check for possible Tag errors with
          <keycap>Ctrl+T </keycap>and untranslated segments with
          <keycap>Ctrl+U</keycap>. To check everything is as expected, you may
          press <keycap>Ctrl+D</keycap> to create the target documents and
          check their contents.</para>
        </listitem>
      </itemizedlist>

      <itemizedlist>
        <listitem>
          <para>When you exit the project. the TMX files in the main project
          folder (see above) now contain the transltions in the selected
          language pair, for the files, you have copied into the source
          folder. Copy them to a safe place for future referrals.</para>
        </listitem>
      </itemizedlist>

      <itemizedlist>
        <listitem>
          <para>To avoid reusing the project and thus possibly polluting
          future cases, delete the project folder or archive it away from your
          workplace.</para>
        </listitem>
      </itemizedlist>
    </section>

    <section id="sharing.translation.memories">
      <title>Sharing translation memories<indexterm class="singular">
          <primary>Translation memories</primary>

          <secondary>Sharing</secondary>

          <seealso>Project,Download Team Project...</seealso>
        </indexterm></title>

      <para>In cases where a team of translators is involved, translators will
      prefer to share common translation memories rather than distribute their
      local versions.</para>

      <para>OmegaT interfaces to SVN and Git, two common team software
      versioning and revision control systems (RCS), available under an open
      source license. In case of OmegaT complete project folders - in other
      words the translation memories involved as well as source folders,
      project settings etc - are managed by the selected RCS. see more in
      Chapter</para>
    </section>

    <section>
      <title>Using TMX with alternative language pairs<indexterm
          class="singular">
          <primary>Translation memories</primary>

          <secondary>Alternative language pairs</secondary>
        </indexterm></title>

      <para>There may be cases where you have done a project with e.g. Dutch
      sources, and a translation in say English. Then you need a translation
      in e.g. Chinese, but your translator does not understand Dutch; she,
      however, understands perfectly English. In this case NL-EN translation
      memory can serve as a go-between to help generate NL to ZH
      translation.</para>

      <para>The solution in our example is to copy the existing translation
      memory into the tm/tmx2source/ subfolder and rename it ZH_CN.tmx to indicate the
      target language of the tmx. The translator will be shown English
      translations for source segments in Dutch and use them to create the
      Chinese translation.</para>

      <para><emphasis role="bold">Important: </emphasis>the supporting TMX
      must be renamed XX_YY.tmx, where XX_YY is the target language of the
      tmx, for instance to ZH_CN.tmx in the example above. The project and TMX
      source languages should of course be identical - NL in our example. Note
      that only one TMX for a given language pair is possible, so if several
      translation memories should be involved, you will need to merge them all
      into the XX_YY.tmx.</para>
    </section>
  </section>

  <section>
    <title>Sources with existing translations<indexterm class="singular">
        <primary>Translation memories</primary>

        <secondary>PO and OKAPI TTX files</secondary>

        <seealso>Translation memories Subfolder tm/auto</seealso>
      </indexterm></title>

    <para>Some types of source files (for instance PO, TTX, etc.) are
    bilingual, i.e. they serve both as a source and as a translation memory.
    In such cases, an existing translation, found in the file, is included in
    the <filename>project_save.tmx</filename>. It is treated as a default
    translation, if no match has been found, or as an alternative translation,
    in case the same source segment exists, but with a target text. The result
    will thus depend on the order in which the source segments have been
    loaded.</para>

    <para>All translations from source documents are also displayed in the
    Comment pane, in addition to the Match pane. In case of PO files, a 20%
    penalty applied to the alternative translation (i.e., a 100% match becomes
    an 80% match). The word [Fuzzy] is displayed on the source segment.</para>

    <para>When you load a segmented TTX file, segments with source = target
    will be included, if "Allow translation to be equal to source" in Options
    → Editing Behavior... has been checked. This may be confusing, so you may
    consider unchecking this option in this case.</para>
  </section>

  <section id="pseudo.translated.memory">
    <title>Pseudo-translated memory<indexterm class="singular">
        <primary>Translation memories</primary>

        <secondary>Pseudotranslation</secondary>
      </indexterm></title>

    <note>
      <para>Of interest for advanced users only!</para>
    </note>

    <para>Before segments get translated, you may wish to pre-process them or
    address them in some other way than is possible with OmegaT. For example,
    if you wish to create a pseudo-translation for testing purposes, OmegaT
    enables you to create an additional tmx file that contains all segments of
    the project. The translation in this tmx can be either</para>

    <itemizedlist>
      <listitem>
        <para>translation equals source (default)</para>
      </listitem>
    </itemizedlist>

    <itemizedlist>
      <listitem>
        <para>translation segment is empty</para>
      </listitem>
    </itemizedlist>

    <para>The tmx file can be given any name you specify. A pseudo-translated
    memory can be generated with the following command line parameters:</para>

    <para><literal>java -jar omegat.jar --pseudotranslatetmx=&lt;filename&gt;
    [pseudotranslatetype=[equal|empty]]</literal></para>

    <para>Replace <literal>&lt;filename&gt;</literal> with the name of the
    file you wish to create, either absolute or relative to the working folder
    (the folder you start <application>OmegaT</application> from). The second
    argument <literal>--pseudotranslatetype</literal> is optional. Its value
    is either <literal>equal</literal> (default value, for source=target) or
    <literal>empty</literal> (target segment is empty). You can process the
    generated tmx with any tool you want. To reuse it in
    <application>OmegaT</application> rename it to <emphasis>project_save.tmx
    </emphasis> and place it in the <literal>omegat</literal>-folder of your
    project.</para>
  </section>

  <section id="upgrading.translation.memories">
    <title>Upgrading translation memories<indexterm class="singular">
        <primary>Translation memories</primary>

        <secondary>Upgrading to sentence segmentation</secondary>
      </indexterm></title>

    <para>Very early versions of <application>OmegaT</application> were
    capable of segmenting source files into paragraphs only and were
    inconsistent when numbering formatting tags in HTML and Open Document
    files. <application>OmegaT</application> can detect and upgrade such tmx
    files on the fly to increase fuzzy matching quality and leverage your
    existing translation better, saving you the work of doing this
    manually.</para>

    <para>A project's tmx will be upgraded only once, and will be written in
    upgraded form into the <filename>project-save.tmx</filename>; legacy tmx
    files will be upgraded on the fly each time the project is loaded. Note
    that in some cases changes in file filters in
    <application>OmegaT</application> may lead to totally different
    segmentation; as a result, you will have to upgrade your translation
    manually in such rare cases.</para>
  </section>
</chapter>