Character encoding
==================

Subtitle files in the wild are a character-encoding minefield. A Russian
``.srt`` is very likely Windows-1251; a Chinese one might be GBK or Big5; older
files show up as Latin-1, Shift-JIS, or UTF-16 with a byte-order mark. Get the
encoding wrong and you don't get an error: you get mojibake, or a subtitle that
fails to parse.

Robust handling of these legacy encodings is one of the things ffsubsync does
notably well compared to other subtitle sync tools, and it happens
automatically by default.

Automatic detection (``infer``)
-------------------------------

The input encoding option, ``--encoding``, defaults to the sentinel value
``infer``. In this mode ffsubsync reads the subtitle file as **raw bytes** and
asks a character-encoding detection library to guess the encoding, then decodes
with the winning guess.

To be resilient, ffsubsync consults up to three detectors in a fixed preference
order and takes the **first** one that returns a result:

1. **cchardet**: a fast C-based detector (see the availability note below),
2. **charset_normalizer**: a pure-Python detector, always installed,
3. **chardet**: the classic pure-Python detector.

Whichever library is installed and answers first wins; if a detector is missing
or raises, ffsubsync simply moves on to the next. The detected encoding is
logged so you can see what it chose.

Detection then degrades gracefully. When ffsubsync decodes the bytes it uses
Python's ``errors="replace"`` mode, so an imperfect guess produces a few
replacement characters rather than crashing the whole sync.

Byte-order marks (BOMs)
-----------------------

There is no special-case BOM-stripping code, and none is needed. Because the
file is handed to the detector as raw bytes, a UTF-8/UTF-16/UTF-32 BOM is part
of what the detector inspects, so it reports the appropriate codec
(``UTF-8-SIG``, ``UTF-16``, and so on). Python's decoder for those codecs then
consumes the BOM during decoding.
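
As a quick illustration of that last point, the snippet below uses only
Python's standard library (it is not ffsubsync code, and the variable names are
made up for the example) to show the BOM-aware codecs dropping the mark while
decoding:

.. code-block:: python

   import codecs

   text = "Hello"

   # Build byte strings that start with a BOM, as a subtitle file on disk might.
   utf8_with_bom = codecs.BOM_UTF8 + text.encode("utf-8")
   utf16_with_bom = codecs.BOM_UTF16_LE + text.encode("utf-16-le")

   # The BOM-aware codecs consume the mark and return only the text.
   decoded_utf8_sig = utf8_with_bom.decode("utf-8-sig")
   decoded_utf16 = utf16_with_bom.decode("utf-16")

   # The plain utf-8 codec keeps the BOM as a leading U+FEFF character.
   decoded_plain_utf8 = utf8_with_bom.decode("utf-8")

   assert decoded_utf8_sig == text
   assert decoded_utf16 == text
   assert decoded_plain_utf8 == "\ufeff" + text

Decoding the same bytes with plain ``utf-8`` leaves a stray ``\ufeff`` at the
start of the text, so it is the BOM-aware codec name reported by the detector
that makes the mark disappear.
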
UTF-16 in particular is explicitly handled this way.

Forcing an encoding
-------------------

If you already know the encoding, or the detector guesses wrong on an ambiguous
file, pass it explicitly to skip detection entirely:

.. code-block:: console

   $ ffs video.mp4 -i input.srt -o output.srt --encoding windows-1251

Any codec name Python understands is accepted (``latin-1``, ``cp1251``,
``shift_jis``, ``big5``, ``utf-16``, ...).

Output encoding
---------------

Output is controlled separately by ``--output-encoding``, which defaults to
``utf-8``. Modern players prefer UTF-8, so converting legacy-encoded input to
UTF-8 on the way out is usually what you want and happens by default.

To instead preserve the input's encoding, pass the special value ``same``:

.. code-block:: console

   $ ffs video.mp4 -i input.srt -o output.srt --output-encoding same

.. note::

   ffsubsync defaults to UTF-8 output regardless of the input encoding. This is
   a deliberate change from very early versions, which reused the input
   encoding by default; ``--output-encoding same`` restores that older behavior
   when you need it.

Reference encoding
------------------

When the reference is itself a subtitle file (see :doc:`reference_types`), its
encoding is auto-detected the same way as the input. Override it with
``--reference-encoding`` if needed. This option only applies to subtitle
references; passing it alongside a video reference is an error.

.. _cchardet-availability:

The cchardet availability caveat
--------------------------------

The fastest and often most accurate detector in the chain, **cchardet**,
deserves a closer look, because whether it is present depends on your Python
version.

The original ``cchardet`` package is unmaintained. ffsubsync switched to the
maintained fork, faust-cchardet, in v0.4.25.
The important quirk is that faust-cchardet still installs under the **module
name** ``cchardet``, which is why the code simply does ``import cchardet`` even
though the declared dependency is ``faust-cchardet``.

faust-cchardet is declared as a dependency only for **Python < 3.13**:

.. code-block:: text

   chardet;python_version>='3.7'
   charset_normalizer
   faust-cchardet;python_version<'3.13'

The practical consequences:

- **On Python < 3.13**, the full chain is available. cchardet is tried first,
  so you get the fast C detector.
- **On Python 3.13+**, faust-cchardet is not installed. The ``import cchardet``
  fails quietly, and detection falls through to ``charset_normalizer`` (always
  present) and then ``chardet``. Everything still works; you just lose the C
  detector and rely on the pure-Python ones.

For the vast majority of files this makes no observable difference; the
pure-Python detectors handle common encodings well. The edge cases are
ambiguous legacy encodings where the detectors can disagree.

If you are on Python 3.13+ and hit a file that detects wrong, you have two
clean options:

1. Pass the correct encoding explicitly with ``--encoding`` (see above), or
2. Run ffsubsync under an older Python (3.12 or earlier) where faust-cchardet
   is available.

.. tip::

   You can check which detectors are active in your environment with:

   .. code-block:: console

      $ python -c "import cchardet" && echo "cchardet available" || echo "cchardet NOT available"
      $ python -c "import charset_normalizer, chardet; print('pure-python detectors present')"
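
The fall-through behavior described in this section can be sketched in a few
lines of Python. This is an illustrative sketch under stated assumptions, not
ffsubsync's actual code: the helper names ``available_detectors``,
``guess_encoding`` and ``decode_subtitle_bytes`` are hypothetical, and so is
the UTF-8 fallback used when no detector answers. It relies only on the
``detect(bytes)`` function that ``cchardet``, ``charset_normalizer`` and
``chardet`` each expose, which returns a dict with an ``encoding`` key.

.. code-block:: python

   import importlib

   # Preference order taken from the text: C detector first, then pure Python.
   DETECTOR_MODULES = ("cchardet", "charset_normalizer", "chardet")


   def available_detectors(module_names=DETECTOR_MODULES):
       """Return (name, detect_function) pairs for the importable detectors."""
       detectors = []
       for name in module_names:
           try:
               module = importlib.import_module(name)
           except ImportError:
               # For example, cchardet on Python 3.13+: skip it quietly.
               continue
           detectors.append((name, module.detect))
       return detectors


   def guess_encoding(raw, detectors, default="utf-8"):
       """Return (detector_name, encoding) from the first detector that answers.

       The ``default`` fallback is an assumption made for this sketch.
       """
       for name, detect in detectors:
           try:
               encoding = detect(raw).get("encoding")
           except Exception:
               # A detector that raises is treated like a missing one.
               continue
           if encoding:
               return name, encoding
       return None, default


   def decode_subtitle_bytes(raw, detectors):
       """Decode raw bytes with the guessed codec, tolerating bad bytes."""
       _, encoding = guess_encoding(raw, detectors)
       return raw.decode(encoding, errors="replace")

A caller would typically build the chain once with ``available_detectors()``
and then pass the bytes read from the subtitle file to
``decode_subtitle_bytes``. Taking the detectors as an argument keeps the sketch
easy to exercise with stand-in functions when none of the libraries is
installed.
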