OmegaT/doc_src/en/FileFilters.xml

<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE book PUBLIC "-//OASIS//DTD DocBook XML V4.5//EN"
"../../../docbook-xml-4.5/docbookx.dtd">
<chapter id="chapter.file.filters">
  <title>File Filters</title>

  <para>OmegaT features highly customizable filters, enabling you to configure
  numerous aspects. File filters are pieces of code capable of:</para>

  <itemizedlist>
    <listitem>
      <para>Reading the document in some specific file format. For instance,
      plain text files.</para>
    </listitem>
  </itemizedlist>

  <itemizedlist>
    <listitem>
      <para>Extracting the translatable content out of the file.</para>
    </listitem>
  </itemizedlist>

  <itemizedlist>
    <listitem>
      <para>Automating modifications of the translated document file names by
      replacing translatable contents with its translation.</para>
    </listitem>
  </itemizedlist>

  <para>To see which file formats can be handled by OmegaT, see the menu
  <guimenuitem>Options &gt; File Filters ... </guimenuitem></para>

  <para>Most users will find the default file filter options sufficient. If
  this is not the case, open the main dialog by selecting<emphasis
  role="bold"> Options → File Filters...</emphasis> from the main menu. You
  can also enable project-specific file filters, which will only be used on
  the current project, by selecting the <emphasis role="bold">File
  Filters...</emphasis> option in Project Properties.</para>

  <para>You can enable project specific filters via the <emphasis
  role="bold">Project → Properties...</emphasis>. Click on <guibutton>File
  Filters</guibutton> button and activate the check box <guimenuitem>Enable
  project specific filters</guimenuitem><indexterm class="singular">
      <primary>File filters</primary>

      <secondary>Project specific file filters</secondary>
    </indexterm>. A copy of the filters configuration will be stored with the
  project in this case. If you later change filters, only the project filters
  will be updated, while the user filters stay unchanged.</para>

  <para><emphasis role="bold">Warning!</emphasis> Should you change filter
  options whilst a project is open, you must reload the project in order for
  the changes to take effect.</para>

  <section id="file.filters.dialog">
    <title>File filters dialog<indexterm class="singular">
        <primary>File filters</primary>

        <secondary>Dialog</secondary>
      </indexterm></title>

    <para>This dialog lists available file filters. Should you wish not to use
    OmegaT to translate files of a certain type, you can turn off the
    corresponding filter by deactivating the check box beside its name. OmegaT
    will then omit the appropriate files while loading projects, and will copy
    them unmodified when creating target documents. When you wish to use the
    filter again, just tick the check box. Click <emphasis
    role="bold">Defaults</emphasis> to reset the file filters to the default
    settings. To edit which files in which encodings the filter is to process,
    select the filter from the list and click <emphasis
    role="bold">Edit.</emphasis></para>

    <para>The dialog allows to enable or disable the following options:</para>

    <itemizedlist>
      <listitem>
        <para>Remove leading and trailing tags: uncheck this option to display
        all the tags including the leading and trailing ones. Warning: in
        Microsoft Open XML formats (docx, xlsx, etc.), if all tags are
        displayed, DO NOT write text before the first tag (it is a technical
        tag that must always begin the segment).</para>
      </listitem>

      <listitem>
        <para>Remove leading and trailing whitespace in non-segmented
        projects: by default, OmegaT removes leading and trailing whitespace. 
        In non-segmented projects, it is possible to keep it by unchecking this option.</para>
      </listitem>

      <listitem>
        <para>Preserve spaces for all tags: check this option if the source
        documents contain significant spaces (for layout matters) that must
        not be ignored.</para>
      </listitem>

      <listitem>
        <para>Ignore file context when identifying segments with alternate
        translations: by default, OmegaT uses the source file name as part of the 
        identification of an alternate translation.
        if the option is checked, the source file name will not be used, and
        alternative translations will take effect in any file as long 
        as the other context (previous/next
        segments, or some sort of ID depending on the file format) matches.</para>
      </listitem>
    </itemizedlist>
  </section>

  <section id="filters.options">
    <title>Filter options<indexterm class="singular">
        <primary>File filters</primary>

        <secondary>Options</secondary>
      </indexterm></title>

    <para>Several filters (Text files, XHTML files, HTML and XHTML files,
    OpenDocument files and Microsoft Open XML files) have one or more specific
    options. To modify the options select the filter from the list and click
    on <emphasis role="bold">Options</emphasis>. The available options
    are:</para>

    <para><emphasis role="bold">Text files</emphasis></para>

    <itemizedlist>
      <listitem>
        <para><emphasis>Paragraph segmentation on line breaks, empty lines or
        never:</emphasis></para>

        <para>if sentence segmentation rules are active, the text will further
        be segmented according to the option selected here.</para>
      </listitem>
    </itemizedlist>

    <para><emphasis role="bold">PO files</emphasis></para>

    <itemizedlist>
      <listitem>
        <para><emphasis>Allow blank translations in the target
        file</emphasis>:</para>

        <para>If on, when a PO segment (which may be a whole paragraph) is not
        translated, the translation will be empty in the target file.
        Technically speaking, the <code>msgstr</code> segment in the PO target
        file, if created, will be left empty. As this is the standard behavior
        for PO files, it is on by default. If the option is off, the source
        text will be copied to the target segment.</para>
      </listitem>
    </itemizedlist>

    <itemizedlist>
      <listitem>
        <para><emphasis>Skip PO header</emphasis></para>

        <para>PO header will be skipped and left unchanged, if this option is
        checked.</para>
      </listitem>
    </itemizedlist>

    <itemizedlist>
      <listitem>
        <para><emphasis>Auto replace 'nplurals=INTEGER; plural=EXPRESSION;' in
        header</emphasis></para>

        <para><emphasis>The option allows OmegaT to override the
        specification in the PO file header and use the default for the
        selected target language. </emphasis></para>
      </listitem>
    </itemizedlist>

    <para><emphasis role="bold">XHTML Files</emphasis></para>

    <itemizedlist>
      <listitem>
        <para><emphasis>Add or rewrite encoding declaration in HTML and XHTML
        files</emphasis>: frequently the target files must have the encoding
        character set different from the one in the source file (wether it is
        explicitly defined or implied). Using this option the translator can
        specify, whether the target files are to have the encoding declaration
        included. For instance, if the file filter specifies UTF8 as the
        encoding scheme for the target files, selecting Always will assure
        that this information is included in the translated files.</para>
      </listitem>
    </itemizedlist>

    <itemizedlist>
      <listitem>
        <para><emphasis>Translate the following attributes</emphasis>: the
        selected attributes will appear as segments in the Editor
        window.</para>
      </listitem>
    </itemizedlist>

    <itemizedlist>
      <listitem>
        <para><emphasis>Start a new paragraph on: the &lt;br&gt;</emphasis>
        HTML tag will constitute a paragraph for segmentation purposes.</para>
      </listitem>
    </itemizedlist>

    <itemizedlist>
      <listitem>
        <para><emphasis>Skip text matching regular expression</emphasis>: the
        text matching the regular expression gets skipped. It is shown
        rendered red in the tag validator. Text in source segment that matches
        is shown in italic.</para>
      </listitem>
    </itemizedlist>

    <itemizedlist>
      <listitem>
        <para><emphasis>Do not translate the content attribute of meta-tags
        ... :</emphasis> The following meta-tags will not be
        translated.</para>
      </listitem>
    </itemizedlist>

    <itemizedlist>
      <listitem>
        <para><emphasis>Do not translate the content of tags with the
        following attribute key-value pairs (separate with commas)</emphasis>:
        a match in the list of key-value pairs will cause the content of tags
        to be ignored</para>

        <para>It is sometimes useful to be able make some tags untranslatable
        based on the value of attributes. For example, <literal>&lt;div
        class="hide"&gt; &lt;span translate="no"&gt;</literal> You can define
        key-value pairs for tags to be left untranslated. For the example
        above, the field would contain: <literal>class=hide, translate=no
        </literal></para>
      </listitem>
    </itemizedlist>

    <para><emphasis role="bold">Microsoft Office Open XML
    files</emphasis></para>

    <para>You can select which elements are to be translated. They will appear
    as separate segments in the translation.</para>

    <itemizedlist>
      <listitem>
        <para><emphasis role="bold">Word:</emphasis> non-visible instruction
        text, comments, footnotes, endnotes, footers</para>
      </listitem>
    </itemizedlist>

    <itemizedlist>
      <listitem>
        <para><emphasis role="bold">Excel: </emphasis>comments, sheet
        names</para>
      </listitem>
    </itemizedlist>

    <itemizedlist>
      <listitem>
        <para><emphasis role="bold">Power Point</emphasis>: slide comments,
        slide masters, slide layouts</para>
      </listitem>
    </itemizedlist>

    <itemizedlist>
      <listitem>
        <para><emphasis role="bold">Global:</emphasis> charts, diagrams,
        drawings, WordArt</para>
      </listitem>
    </itemizedlist>

    <itemizedlist>
      <listitem>
        <para><emphasis role="bold">Other Options: </emphasis></para>

        <itemizedlist>
          <listitem>
            <para><emphasis>Aggregate tags</emphasis>: if checked, tags
            without translatable text between them will be aggregated into
            single tags.</para>
          </listitem>

          <listitem>
            <para><emphasis>Preserve spaces for all tags</emphasis>: if
            checked, "white space" (i.e., spaces and newlines) will be
            preserved, even if not set technically in the document</para>
          </listitem>
        </itemizedlist>
      </listitem>
    </itemizedlist>

    <para><emphasis role="bold">HTML and XHTML files</emphasis></para>

    <itemizedlist>
      <listitem>
        <para><emphasis>Add or rewrite encoding declaration in HTML and XHTML
        files</emphasis>: Always (default), Only if (X)HTML file has a header,
        Only if (X)HTML file has an encoding declaration, Never</para>
      </listitem>
    </itemizedlist>

    <itemizedlist>
      <listitem>
        <para><emphasis>Translate the following attributes</emphasis>: the
        selected attributes will appear as segments in the Editor
        window.</para>
      </listitem>
    </itemizedlist>

    <itemizedlist>
      <listitem>
        <para><emphasis>Start a new paragraph on</emphasis>: the &lt;br&gt;
        HTML tag will constitute a paragraph for segmentation purposes.</para>
      </listitem>
    </itemizedlist>

    <itemizedlist>
      <listitem>
        <para><emphasis>Skip text matching regular expression</emphasis>: The
        text, matching the regular expression, will be skipped.</para>
      </listitem>
    </itemizedlist>

    <itemizedlist>
      <listitem>
        <para><emphasis>Do not translate the content attribute of meta-tags
        ... :</emphasis> The following meta-tags will not be
        translated.</para>
      </listitem>
    </itemizedlist>

    <itemizedlist>
      <listitem>
        <para><emphasis>Do not translate the content of tags with the
        following attribute key-value pairs (separate with commas)</emphasis>:
        a match in the list of key-value pairs will cause the content of tags
        to be ignored</para>
      </listitem>
    </itemizedlist>

    <para><emphasis role="bold">Text files</emphasis></para>

    <itemizedlist>
      <listitem>
        <para><emphasis>Paragraph segmentation on line breaks, empty lines or
        never</emphasis>:</para>

        <para>if sentence segmentation rules are active, the text will further
        be segmented according to the option selected here.</para>
      </listitem>
    </itemizedlist>

    <para><emphasis role="bold">Open Document Format (ODF)
    files</emphasis></para>

    <itemizedlist>
      <listitem>
        <para>You can select which of the following items are to be
        translated:</para>

        <para>index entries, bookmarks, bookmark references, notes, comments,
        presentation notes, links (URL), sheet names</para>
      </listitem>
    </itemizedlist>
  </section>

  <section id="edit.filter.dialog">
    <title>Edit filter dialog<indexterm class="singular">
        <primary>File filters</primary>

        <secondary>Editing</secondary>
      </indexterm></title>

    <para>This dialog enables you to set up the source filename patterns of
    files to be processed by the filter, customize the filenames of translated
    files, and select which encodings should be used for loading the file and
    saving its translated counterpart. To modify a file filter pattern, either
    modify the fields directly or click <emphasis role="bold">Edit.</emphasis>
    To add a new file filter pattern, click <emphasis
    role="bold">Add</emphasis>. The same dialog is used to add a pattern or to
    edit a particular pattern. The dialog is useful because it includes a
    special target filename pattern editor with which you can customize the
    names of output files.</para>

    <section id="source.filetype.and.filename.pattern">
      <title><indexterm class="singular">
          <primary>Source files</primary>

          <secondary>File type and name pattern</secondary>
        </indexterm>Source file type, filename pattern<indexterm
          class="singular">
          <primary>File filters</primary>

          <secondary>File type and name pattern</secondary>
        </indexterm></title>

      <para>When OmegaT encounters a file in its source folder, it attempts to
      select the filter based upon the file's extension. More precisely,
      OmegaT attempts to match each filter's source filename patterns against
      the filename. For example, the pattern <literal>*.xhtml
      </literal>matches any file with the <literal>.xhtml</literal> extension.
      If the appropriate filter is found, the file is assigned to it for
      processing. For example, by default, XHTML filters are used for
      processing files with the .xhtml extension. You can change or add
      filename patterns for files to be handled by each file. Source filename
      patterns use wild card characters similar to those used in <emphasis
      role="bold">Searches. </emphasis>The '*' character matches zero or more
      characters. The '?' character matches exactly one character. All other
      characters represent themselves. For example, if you wish the text
      filter to handle readme files (<literal>readme, read.me</literal>, and
      <literal>readme.txt</literal>) you should use the pattern
      <literal>read*</literal>.</para>
    </section>

    <section id="source.and.target.files.encoding">
      <title>Source and Target file encoding</title>

      <indexterm class="singular">
        <primary>Source files</primary>

        <secondary>Encoding</secondary>
      </indexterm>

      <indexterm class="singular">
        <primary>Target files</primary>

        <secondary>Encoding</secondary>
      </indexterm>

      <indexterm class="singular">
        <primary>File filters</primary>

        <secondary>Source, target - encoding</secondary>
      </indexterm>

      <para>Only a limited number of file formats specify a mandatory
      encoding. File formats that do not specify their encoding will use the
      encoding you set up for the extension that matches their name. For
      example, by default <literal>.txt</literal> files will be loaded using
      the default encoding of your operating system. You may change the source
      encoding for each different source filename pattern. Such files may also
      be written out in any encoding. By default, the translated file encoding
      is the same as the source file encoding. Source and target encoding
      fields use combo boxes with all supported encodings included.
      &lt;auto&gt; leaves the encoding choice to
      <application>OmegaT</application>. This is how it works:</para>

      <itemizedlist>
        <listitem>
          <para>OmegaT identifies the source file encoding by using its
          encoding declaration, if present (HTML files, XML based
          files)</para>
        </listitem>
      </itemizedlist>

      <itemizedlist>
        <listitem>
          <para>OmegaT is instructed to use a mandatory encoding for certain
          file formats (Java properties etc)</para>
        </listitem>
      </itemizedlist>

      <itemizedlist>
        <listitem>
          <para>OmegaT uses the default encoding of the operating system for
          text files.</para>
        </listitem>
      </itemizedlist>
    </section>

    <section id="target.name">
      <title>Target filename<indexterm class="singular">
          <primary>Target files</primary>

          <secondary>Filenames</secondary>
        </indexterm></title>

      <para>Sometimes you may wish to rename the files you translate
      automatically, for example adding a language code after the file name.
      The target filename pattern uses a special syntax, so if you wish to
      edit this field, you must click <emphasis
      role="bold">Edit...</emphasis>and use the <indexterm class="singular">
          <primary>File filters</primary>

          <secondary>Dialog</secondary>
        </indexterm>Edit Pattern Dialog. If you wish to revert to default
      configuration of the filter, click <emphasis
      role="bold">Defaults.</emphasis> You may also modify the name directly
      in the target filename pattern field of the file filters dialog. The
      Edit Pattern Dialog offers among others the following options:</para>

      <itemizedlist>
        <listitem>
          <para>Default is <literal>${filename}</literal>– full filename of
          the source file with extension: in this case the name of the
          translated file is the same as that of the source file.</para>
        </listitem>
      </itemizedlist>

      <itemizedlist>
        <listitem>
          <para><literal>${nameOnly}</literal>– allows you to insert only the
          name of the source file without the extension.</para>
        </listitem>

        <listitem>
          <para><literal>${extension} </literal>- the original file
          extension</para>
        </listitem>
      </itemizedlist>

      <itemizedlist>
        <listitem>
          <para><literal>${targetLocale}</literal>– target locale code (of a
          form "xx_YY").</para>
        </listitem>
      </itemizedlist>

      <itemizedlist>
        <listitem>
          <para><literal>${targetLanguage}</literal>– the target language and
          country code together (of a form "XX-YY").</para>
        </listitem>
      </itemizedlist>

      <itemizedlist>
        <listitem>
          <para><literal>${targetLanguageCode}</literal> – the target language
          - only "XX"</para>
        </listitem>
      </itemizedlist>

      <itemizedlist>
        <listitem>
          <para><literal>${targetCountryCode}</literal>– the target country -
          only "YY"</para>
        </listitem>

        <listitem>
          <para><literal>${timestamp-????}</literal> – system date time at
          generation time in various patterns</para>

          <para>See <ulink
          url="http://docs.oracle.com/javase/1.4.2/docs/api/java/text/SimpleDateFormat.html">Oracle
          documentation</ulink> for examples of the "SimpleDateFormat"
          patterns</para>
        </listitem>

        <listitem>
          <para><literal>${system-os-name}</literal> - operating system of the
          computer used</para>
        </listitem>

        <listitem>
          <para><literal>${system-user-name}</literal> - system user
          name</para>
        </listitem>

        <listitem>
          <para><literal>${system-host-name}</literal> - system host
          name</para>
        </listitem>

        <listitem>
          <para><literal>${file-source-encoding}</literal> - source file
          encoding</para>
        </listitem>

        <listitem>
          <para><literal>${file-target-encoding}</literal> - target file
          encoding</para>
        </listitem>

        <listitem>
          <para><literal>${targetLocaleLCID}</literal> - Microsoft target
          locale</para>
        </listitem>
      </itemizedlist>

      <para>Additional variants are available for variables ${nameOnly} and
      ${Extension}. In case the file name has ambivalent name, one can apply
      variables of the form <literal>${name only</literal><emphasis>-extension
      number</emphasis>} and
      <literal>${extension</literal>-<emphasis>extension number} </emphasis>.
      If for example the original file is named Document.xx.docx, the
      following variables will give the following results:</para>

      <itemizedlist>
        <listitem>
          <para><literal>${nameOnly-0}</literal> Document</para>
        </listitem>

        <listitem>
          <para><literal>${nameOnly-1}</literal> Document.xx</para>
        </listitem>

        <listitem>
          <para><literal>${nameOnly-2}</literal> Document.xx.docx</para>
        </listitem>

        <listitem>
          <para><literal>${extension-0}</literal> docx</para>
        </listitem>

        <listitem>
          <para><literal>${extension-1}</literal> xx.docx</para>
        </listitem>

        <listitem>
          <para><literal>${extension-2}</literal> Document.xx.docx</para>
        </listitem>
      </itemizedlist>
    </section>
  </section>
</chapter>