Character encoding

Subtitle files in the wild are a character-encoding minefield. A Russian .srt is very likely Windows-1251; a Chinese one might be GBK or Big5; older files show up as Latin-1, Shift-JIS, or UTF-16 with a byte-order mark. Get the encoding wrong and you don’t get an error — you get mojibake, or a subtitle that fails to parse. Robust handling of these legacy encodings is one of the things ffsubsync does notably well compared to other subtitle sync tools, and it happens automatically by default.

Automatic detection (infer)

The input encoding option, --encoding, defaults to the sentinel value infer. In this mode ffsubsync reads the subtitle file as raw bytes and asks a character-encoding detection library to guess the encoding, then decodes with the winning guess.

To be resilient, ffsubsync consults up to three detectors in a fixed preference order and takes the first one that returns a result:

  1. cchardet — a fast C-based detector (see the availability note below),

  2. charset_normalizer — a pure-Python detector, always installed,

  3. chardet — the classic pure-Python detector.

Whichever library is installed and answers first wins; if a detector is missing or raises, ffsubsync simply moves on to the next. The detected encoding is logged so you can see what it chose.

Detection then degrades gracefully. When ffsubsync decodes the bytes it uses Python’s errors="replace" mode, so an imperfect guess produces a few replacement characters rather than crashing the whole sync.

Byte-order marks (BOMs)

There is no special-case BOM-stripping code, and none is needed. Because the file is handed to the detector as raw bytes, a UTF-8/UTF-16/UTF-32 BOM is part of what the detector inspects, so it reports the appropriate codec (UTF-8-SIG, UTF-16, and so on). Python’s decoder for those codecs then consumes the BOM during decoding. UTF-16 in particular is explicitly handled this way.

Forcing an encoding

If you already know the encoding — or the detector guesses wrong on an ambiguous file — pass it explicitly to skip detection entirely:

$ ffs video.mp4 -i input.srt -o output.srt --encoding windows-1251

Any codec name Python understands is accepted (latin-1, cp1251, shift_jis, big5, utf-16, …).

Output encoding

Output is controlled separately by --output-encoding, which defaults to utf-8. Modern players prefer UTF-8, so converting legacy-encoded input to UTF-8 on the way out is usually what you want and happens by default. To instead preserve the input’s encoding, pass the special value same:

$ ffs video.mp4 -i input.srt -o output.srt --output-encoding same

Note

ffsubsync defaults to UTF-8 output regardless of the input encoding. This is a deliberate change from very early versions, which reused the input encoding by default; --output-encoding same restores that older behavior when you need it.

Reference encoding

When the reference is itself a subtitle file (see Reference types), its encoding is auto-detected the same way as the input. Override it with --reference-encoding if needed. This option only applies to subtitle references — passing it alongside a video reference is an error.

The cchardet availability caveat

The fastest and often most accurate detector in the chain, cchardet, deserves a closer look, because whether it is present depends on your Python version.

The original cchardet package is unmaintained. ffsubsync switched to the maintained fork, faust-cchardet, in v0.4.25. The important quirk is that faust-cchardet still installs under the module name cchardet — which is why the code simply does import cchardet even though the declared dependency is faust-cchardet.

faust-cchardet is declared as a dependency only for Python < 3.13:

chardet;python_version>='3.7'
charset_normalizer
faust-cchardet;python_version<'3.13'

The practical consequences:

  • On Python < 3.13, the full chain is available. cchardet is tried first, so you get the fast C detector.

  • On Python 3.13+, faust-cchardet is not installed. The import cchardet fails quietly, and detection falls through to charset_normalizer (always present) and then chardet. Everything still works — you just lose the C detector and rely on the pure-Python ones.

For the vast majority of files this makes no observable difference; the pure-Python detectors handle common encodings well. The edge cases are ambiguous legacy encodings where the detectors can disagree. If you are on Python 3.13+ and hit a file that detects wrong, you have two clean options:

  1. Pass the correct encoding explicitly with --encoding (see above), or

  2. Run ffsubsync under an older Python (3.12 or earlier) where faust-cchardet is available.

Tip

You can check which detectors are active in your environment with:

$ python -c "import cchardet" && echo "cchardet available" || echo "cchardet NOT available"
$ python -c "import charset_normalizer, chardet; print('pure-python detectors present')"