Reference types

The reference is whatever ffsubsync treats as the ground truth for timing. ffsubsync inspects the reference — mostly its file extension — and picks one of several strategies for turning it into a speech signal to align against. Understanding these paths helps you choose the fastest and most accurate option for what you have on hand.
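
The extension-based choice described above can be pictured as a small lookup. The sketch below is illustrative only: classify_reference and the strategy labels are hypothetical names, not ffsubsync's internal API, and the sketch ignores flags such as --pgs-ref-stream that also influence which path is taken.

```python
from pathlib import Path

# Extensions named in this section; anything else is treated as media.
SUBTITLE_EXTENSIONS = {".srt", ".ass", ".ssa", ".sub"}
SERIALIZED_EXTENSIONS = {".npy", ".npz"}


def classify_reference(path: str) -> str:
    """Return a label for the strategy suggested by the file extension."""
    ext = Path(path).suffix.lower()
    if ext in SUBTITLE_EXTENSIONS:
        return "subtitle-timings"
    if ext in SERIALIZED_EXTENSIONS:
        return "serialized-speech"
    return "media"


for name in ("ref.mkv", "ref.srt", "speech.npz"):
    print(name, "->", classify_reference(name))
```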

Video or audio (voice-activity detection)

When the reference is a media file, ffsubsync uses ffmpeg to extract the audio and then runs a voice-activity detector (VAD) to label each 10 ms window as speech or silence. This is the most general path — it works for any video with a dialogue track — but also the most expensive, since audio extraction dominates the runtime.

Which detector runs, and how to tune it for difficult audio, is covered under Voice-activity detectors (--vad). The audio is extracted at a sample rate controlled by --frame-rate (default 48000; this is the audio sample rate used for VAD, not the video’s frames per second).
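
To make the numbers concrete: at the default rate of 48000 samples per second, one 10 ms window holds 480 samples. The sketch below labels windows with a naive energy threshold. This is a stand-in chosen for brevity, not the WebRTC, auditok, or Silero detectors; label_windows and the threshold value are assumptions made for illustration.

```python
import math

FRAME_RATE = 48000         # audio samples per second (the --frame-rate default)
WINDOWS_PER_SECOND = 100   # one label per 10 ms
SAMPLES_PER_WINDOW = FRAME_RATE // WINDOWS_PER_SECOND  # 480


def label_windows(samples, threshold=0.01):
    """Label each full window 1 (speech-like) or 0 (silence) by mean energy.

    A toy energy detector, NOT one of ffsubsync's real VADs.
    """
    labels = []
    last_start = len(samples) - SAMPLES_PER_WINDOW
    for start in range(0, last_start + 1, SAMPLES_PER_WINDOW):
        window = samples[start:start + SAMPLES_PER_WINDOW]
        energy = sum(x * x for x in window) / SAMPLES_PER_WINDOW
        labels.append(1 if energy > threshold else 0)
    return labels


# 20 ms of silence followed by 20 ms of a 440 Hz tone.
silence = [0.0] * (2 * SAMPLES_PER_WINDOW)
tone = [0.5 * math.sin(2 * math.pi * 440 * n / FRAME_RATE)
        for n in range(2 * SAMPLES_PER_WINDOW)]
print(label_windows(silence + tone))
```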

Embedded subtitles first (subs_then_*)

Many video containers (especially MKV) carry one or more embedded text subtitle streams. Their timings already provide a ready-made speech signal that is far cheaper to obtain, and often more accurate, than running a VAD over the audio.

The default detector, subs_then_webrtc, exploits this: it first tries to use an embedded text-subtitle stream from the reference, and falls back to the WebRTC audio VAD only if no usable embedded subtitles are found. Every detector in the subs_then_* family (subs_then_webrtc, subs_then_auditok, subs_then_silero) behaves this way; they differ only in which audio VAD they fall back to. Use a bare detector name (e.g. --vad webrtc) to skip the embedded-subtitle shortcut and force audio detection.
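
The fallback order can be sketched as follows. The function names and the callables passed in are hypothetical placeholders rather than ffsubsync internals; only the detector names come from the text above.

```python
from typing import Callable, List, Optional, Tuple

Signal = List[int]


def split_detector_name(vad: str) -> Tuple[bool, str]:
    """Split e.g. 'subs_then_webrtc' into (try_subs_first, audio_vad_name)."""
    prefix = "subs_then_"
    if vad.startswith(prefix):
        return True, vad[len(prefix):]
    return False, vad


def speech_signal(
    vad: str,
    from_embedded_subs: Callable[[], Optional[Signal]],
    from_audio: Callable[[str], Signal],
) -> Signal:
    """Prefer embedded subtitles when the detector name asks for it."""
    try_subs, audio_vad = split_detector_name(vad)
    if try_subs:
        signal = from_embedded_subs()
        if signal:  # None or empty means no usable embedded subtitles
            return signal
    return from_audio(audio_vad)


# Stub sources standing in for the real extraction steps.
def no_subs() -> Optional[Signal]:
    return None


def fake_audio(vad_name: str) -> Signal:
    return [0, 1, 1, 0]


print(split_detector_name("subs_then_silero"))
print(speech_signal("subs_then_webrtc", no_subs, fake_audio))
```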

Subtitle file

If the reference itself is a subtitle file — extension .srt, .ass, .ssa, or .sub — ffsubsync derives the speech signal straight from the reference’s on/off subtitle timings. No audio is extracted, so this is the fastest path (typically under a second). This is the “sync against an already-correct subtitle” workflow from Usage.
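
Deriving a speech signal from subtitle timings amounts to marking every sample that falls inside a cue as speech. A minimal sketch, assuming 100 samples per second (one per 10 ms) and a hypothetical helper timings_to_signal; ffsubsync's own implementation may differ in rounding and other details.

```python
SAMPLE_RATE = 100  # samples per second, i.e. one sample per 10 ms


def timings_to_signal(cues, duration_seconds):
    """Turn (start, end) cue times in seconds into a list of 0/1 samples."""
    n = int(round(duration_seconds * SAMPLE_RATE))
    signal = [0] * n
    for start, end in cues:
        # Clamp to the signal bounds so out-of-range cues are harmless.
        first = max(0, int(round(start * SAMPLE_RATE)))
        last = min(n, int(round(end * SAMPLE_RATE)))
        for i in range(first, last):
            signal[i] = 1
    return signal


# Two short cues inside a 120 ms clip.
cues = [(0.02, 0.05), (0.08, 0.10)]
print(timings_to_signal(cues, 0.12))
```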

When the reference is a subtitle file you can also control its text encoding with --reference-encoding (it defaults to auto-detection, just like input subtitles — see Character encoding), and merge the reference into the output with --merge-with-reference.

PGS image subtitles

Blu-ray rips often ship subtitles as PGS (Presentation Graphic Stream) image-based tracks rather than text. ffsubsync can use a PGS track as the sync reference without any OCR, deriving speech timing from when each subtitle image is displayed:

$ ffs ref.mkv -i in.srt -o out.srt --pgs-ref-stream

Passing --pgs-ref-stream with no value auto-detects the first hdmv_pgs_subtitle track. To pick a specific track, pass a stream specifier as the value (the leading 0: is optional):

$ ffs ref.mkv -i in.srt -o out.srt --pgs-ref-stream s:2

Serialized speech (.npy / .npz)

If you pass a .npy or .npz file as the reference, ffsubsync loads a previously serialized speech signal instead of computing one. You produce such a file with --serialize-speech (see Advanced options). This is handy when you want to sync several subtitle files against the same video: extract the speech signal once, then reuse it repeatedly without re-decoding the audio.

Selecting a stream from the reference

A video file can contain several audio or subtitle tracks. Use --reference-stream to choose one, giving the stream specifier in ffmpeg’s notation:

$ ffs ref.mkv -i in.srt -o out.srt --reference-stream s:2

For example, 0:s:0 uses the first subtitle track and 0:a:3 uses the fourth audio track; you may drop the leading 0: and write s:0 or a:3.
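
Because the leading 0: is optional, both spellings name the same stream. A small sketch of that normalization, using hypothetical helpers normalize_stream and describe (not ffsubsync functions) that handle only the a:N and s:N forms shown here:

```python
def normalize_stream(spec: str) -> str:
    """Return the specifier with an explicit leading input index '0:'."""
    return spec if spec.startswith("0:") else "0:" + spec


def describe(spec: str) -> str:
    """Explain a specifier of the form [0:]a:N or [0:]s:N in words."""
    _, kind, index = normalize_stream(spec).split(":")
    names = {"a": "audio", "s": "subtitle"}
    # Stream indices are zero-based, so N refers to track number N + 1.
    return f"{names[kind]} track #{int(index) + 1}"


for spec in ("s:0", "0:s:0", "a:3"):
    print(spec, "->", normalize_stream(spec), "->", describe(spec))
```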

Offset-only mode (no reference)

Finally, ffsubsync doesn’t strictly need a reference at all. If you already know the correction you want, --apply-offset-seconds shifts every subtitle by a fixed amount with no alignment step:

$ ffs -i in.srt -o out.srt --apply-offset-seconds 3.5

This is covered further in Advanced options.
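
Conceptually, a fixed offset adds the same amount to every start and end time. A minimal sketch with a hypothetical shift_cues helper; it illustrates the idea only and does not reproduce how ffsubsync handles edge cases such as times that would become negative.

```python
from datetime import timedelta


def shift_cues(cues, offset_seconds):
    """Return new (start, end) pairs shifted by a fixed number of seconds."""
    delta = timedelta(seconds=offset_seconds)
    return [(start + delta, end + delta) for start, end in cues]


# One cue from 1.0 s to 2.5 s, delayed by 3.5 s.
cues = [(timedelta(seconds=1), timedelta(seconds=2.5))]
shifted = shift_cues(cues, 3.5)
print(shifted[0][0].total_seconds(), shifted[0][1].total_seconds())
```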