OmegaT/doc_src/en/PlainText.xml

<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE book PUBLIC "-//OASIS//DTD DocBook XML V4.5//EN"
"../../../docbook-xml-4.5/docbookx.dtd">
<chapter id="chapter.plain.text">
  <title>Working with plain text<indexterm class="singular">
      <primary>Source files</primary>

      <secondary>Plain text files</secondary>
    </indexterm></title>

  <section id="default.encoding">
    <title>Default encoding<indexterm class="singular">
        <primary>Encoding</primary>

        <secondary>Plain text files</secondary>
      </indexterm><indexterm class="singular">
        <primary>Source files</primary>

        <secondary>Encoding</secondary>
      </indexterm></title>

    <para>Plain text files (in most cases, files with a
    <filename>.txt</filename> extension) contain only text and offer no
    clearly defined way to tell the computer which language or encoding they
    use. The most that <application>OmegaT</application> can do in such a
    case is to assume that the text is written in the same language the
    computer itself uses. This is not a problem for files encoded in Unicode
    with a 16-bit character encoding. If the text is encoded in 8 bits,
    however, you may face the following awkward situation: instead of
    displaying the text as Japanese characters...</para>

    <mediaobject>
      <imageobject role="html">
        <imagedata fileref="images/OmT_Japanese.png"/>
      </imageobject>

      <imageobject role="fo">
        <imagedata fileref="images/OmT_Japanese.png" width="60%"/>
      </imageobject>
    </mediaobject>

    <para>...the system may display it like this, for instance:</para>

    <mediaobject>
      <imageobject role="html">
        <imagedata fileref="images/OmT_Cyrillic.png"/>
      </imageobject>

      <imageobject role="fo">
        <imagedata fileref="images/OmT_Cyrillic.png" width="60%"/>
      </imageobject>
    </mediaobject>

    <para>The computer running <application>OmegaT</application> has Russian
    as its default language, and therefore shows the characters in the
    Cyrillic alphabet instead of kanji.</para>
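
    <para>The effect can be reproduced outside
    <application>OmegaT</application>. The following minimal Python sketch is
    only an illustration: the sample string, and the choice of Shift_JIS as
    the real encoding and windows-1251 as the wrongly assumed one, are
    assumptions and are not taken from the screenshots above.</para>

    <programlisting language="python"><![CDATA[# Illustration only: sample text and both encodings are assumptions.
text = "日本語のテキスト"

# Bytes as they would be stored in a legacy (non-Unicode) Japanese text file.
raw = text.encode("shift_jis")

# Decoding the same bytes with a Cyrillic 8-bit encoding gives garbage.
# errors="replace" guards against byte values that cp1251 leaves undefined.
garbled = raw.decode("cp1251", errors="replace")

# Decoding with the encoding the file was really written in restores the text.
restored = raw.decode("shift_jis")

print(garbled)
print(restored)
]]></programlisting>

    <para>The bytes in the file never change. Only the encoding used to
    interpret them differs, which is why choosing the right encoding is
    enough to recover the text.</para>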
  </section>

  <section id="OmegaT.solution">
    <title>The <application>OmegaT</application> solution</title>

    <para>There are three basic ways to address this problem in
    <application>OmegaT</application>. They all rely on the file filters,
    which are available from the <emphasis role="bold">Options</emphasis>
    menu.</para>

    <variablelist>
      <varlistentry>
        <term>Change the encoding of your files to Unicode</term>

        <listitem>
          <para>Open your source file in a text editor that correctly
          interprets its encoding and save the file in <emphasis
          role="bold">UTF-8</emphasis> encoding. Change the file extension
          from <literal>.txt</literal> to <literal>.utf8</literal>.
          <application>OmegaT</application> will automatically interpret the
          file as a UTF-8 file. This is the most sensible option and spares
          you problems in the long run.</para>
        </listitem>
      </varlistentry>
    </variablelist>

    <variablelist>
      <varlistentry>
        <term>Specify the encoding for your plain text files</term>

        <listitem>
          <para>This applies to files with a <filename>.txt</filename>
          extension. In the <emphasis role="bold">Text files</emphasis>
          section of the file filters dialog, change the <emphasis
          role="bold">Source File Encoding</emphasis> from &lt;auto&gt; to
          the encoding that corresponds to your source
          <filename>.txt</filename> file, for instance to a Japanese encoding
          such as Shift_JIS or EUC-JP for the above example.</para>
        </listitem>
      </varlistentry>
    </variablelist>

    <variablelist>
      <varlistentry>
        <term>Change the extensions of your plain text source files</term>

        <listitem>
          <para>Rename the files, for instance from
          <filename>.txt</filename> to <filename>.jp</filename> for Japanese
          plain texts. Then, in the <emphasis role="bold">Text
          files</emphasis> section of the file filters dialog, add a new
          <emphasis role="bold">Source Filename Pattern</emphasis>
          (<filename>*.jp</filename> for this example) and select the
          appropriate source and target encodings.</para>
        </listitem>
      </varlistentry>
    </variablelist>

    <para>By default, <application>OmegaT</application> provides the
    following short list of extensions to make it easier to deal with some
    plain text files:</para>

    <itemizedlist>
      <listitem>
        <para><literal>.txt</literal> files are automatically (&lt;auto&gt;)
        interpreted by <application>OmegaT</application> as being encoded in
        the computer's default encoding.</para>
      </listitem>
    </itemizedlist>

    <itemizedlist>
      <listitem>
        <para><literal>.txt1</literal> files are interpreted as being encoded
        in ISO-8859-1, which covers most <emphasis role="bold">Western
        European</emphasis> languages.<indexterm class="singular">
            <primary>Encoding</primary>

            <secondary>Western</secondary>
          </indexterm></para>
      </listitem>
    </itemizedlist>

    <itemizedlist>
      <listitem>
        <para><literal>.txt2</literal> files are interpreted as being encoded
        in ISO-8859-2, which covers most <emphasis role="bold">Central and
        Eastern European</emphasis> languages.<indexterm class="singular">
            <primary>Encoding</primary>

            <secondary>Central and Eastern European</secondary>
          </indexterm></para>
      </listitem>
    </itemizedlist>

    <itemizedlist>
      <listitem>
        <para><literal>.utf8</literal> files are interpreted by
        <application>OmegaT</application> as being encoded in UTF-8 (an
        encoding that covers almost all languages in the world).<indexterm
            class="singular">
            <primary>Encoding</primary>

            <secondary>Unicode</secondary>
          </indexterm></para>
      </listitem>
    </itemizedlist>
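
    <para>The default list above amounts to a lookup from file extension to
    encoding. The Python sketch below only illustrates that idea and is not
    how <application>OmegaT</application> is implemented: the table
    <literal>DEFAULT_ENCODINGS</literal> and the function
    <literal>encoding_for</literal> are hypothetical names introduced for
    this example.</para>

    <programlisting language="python"><![CDATA[import locale
from pathlib import Path

# Hypothetical table mirroring the default extensions listed above.
# None stands for <auto>, i.e. the computer's default encoding.
DEFAULT_ENCODINGS = {
    ".txt": None,
    ".txt1": "ISO-8859-1",
    ".txt2": "ISO-8859-2",
    ".utf8": "UTF-8",
}


def encoding_for(filename):
    """Return the encoding the default list would assume for a file name."""
    suffix = Path(filename).suffix
    if suffix not in DEFAULT_ENCODINGS:
        raise ValueError(f"no plain text default for {suffix!r}")
    # For <auto>, fall back to the platform's preferred encoding.
    return DEFAULT_ENCODINGS[suffix] or locale.getpreferredencoding(False)


print(encoding_for("readme.txt2"))
print(encoding_for("notes.utf8"))
]]></programlisting>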

    <para>You can check these defaults yourself by selecting <emphasis
    role="bold">File Filters</emphasis> in the <emphasis
    role="bold">Options</emphasis> menu. For example, if you have a Czech
    text file (very probably encoded in <emphasis
    role="bold">ISO-8859-2</emphasis>), you just need to change the
    extension from <literal>.txt</literal> to <literal>.txt2</literal> and
    <application>OmegaT</application> will interpret its contents correctly.
    If you wish to be on the safe side, consider converting this kind of
    file to Unicode, i.e. to the <literal>.utf8</literal> file
    format.</para>
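
    <para>If you prefer a script to a text editor, the conversion to UTF-8
    can be sketched in Python as follows. This is a minimal example under
    stated assumptions: you already know the real encoding of the source
    file, and the function name <literal>convert_to_utf8</literal> and the
    sample file name are hypothetical.</para>

    <programlisting language="python"><![CDATA[from pathlib import Path


def convert_to_utf8(source, source_encoding):
    """Re-encode a plain text file as UTF-8 and give it a .utf8 extension."""
    source = Path(source)
    target = source.with_suffix(".utf8")
    # Decode with the encoding the file was really written in ...
    text = source.read_text(encoding=source_encoding)
    # ... and write the same text back out as UTF-8.
    target.write_text(text, encoding="utf-8")
    return target


# Usage: create a small Czech sample in ISO-8859-2, then convert it.
sample = Path("sample_cs.txt")
sample.write_bytes("Příliš žluťoučký kůň".encode("iso-8859-2"))
print(convert_to_utf8(sample, "iso-8859-2"))
]]></programlisting>

    <para>The original file is left untouched, so you can compare both
    versions before placing the <literal>.utf8</literal> file in your
    project.</para>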
  </section>
</chapter>