Reference types
The reference is whatever ffsubsync treats as the ground truth for timing. ffsubsync inspects the reference (mostly its file extension) and picks one of several strategies for turning it into a speech signal to align against. Knowing these paths helps you choose the fastest and most accurate option for the files you have on hand.
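As a rough illustration of extension-based selection, the sketch below maps a reference filename to one of the paths described on this page. The function name and the strategy labels are hypothetical and are not taken from ffsubsync's source; only the extensions come from this page.

```python
from pathlib import Path

# Extensions as listed on this page; the strategy labels are illustrative
# only and are not identifiers from ffsubsync itself.
SUBTITLE_EXTENSIONS = {".srt", ".ass", ".ssa", ".sub"}
SERIALIZED_EXTENSIONS = {".npy", ".npz"}


def pick_reference_strategy(reference: str) -> str:
    """Return a label for how a reference file would become a speech signal."""
    suffix = Path(reference).suffix.lower()
    if suffix in SUBTITLE_EXTENSIONS:
        return "subtitle-timings"
    if suffix in SERIALIZED_EXTENSIONS:
        return "serialized-speech"
    # Assumption: anything else is a media file handled by ffmpeg plus a VAD.
    return "media-vad"
```

The real tool also considers options such as --vad and --pgs-ref-stream, which this sketch leaves out.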
Video or audio (voice-activity detection)
When the reference is a media file, ffsubsync uses ffmpeg to extract the audio and then runs a voice-activity detector (VAD) to label each 10 ms window as speech or silence. This is the most general path — it works for any video with a dialogue track — but also the most expensive, since audio extraction dominates the runtime.
Which detector runs, and how to tune it for difficult audio, are covered under
Voice-activity detectors (--vad). The audio is extracted at a sample rate controlled by
--frame-rate (default 48000 Hz; this is the audio sample rate used for VAD,
not the video’s frames per second).
Embedded subtitles first (subs_then_*)
Many video containers (especially MKV) carry one or more embedded text subtitle streams. Their timings already mark where speech occurs, so using them is far cheaper and often more accurate than running a VAD over the audio.
The default detector, subs_then_webrtc, exploits this: it first tries to use
an embedded text-subtitle stream from the reference, and only falls back to the
WebRTC audio VAD if no usable embedded subtitles are found. Every detector in
the subs_then_* family (subs_then_webrtc, subs_then_auditok, subs_then_silero)
behaves this way; they differ only in which audio VAD they fall back to. Use a bare
detector name (e.g. --vad webrtc) to skip the embedded-subtitle shortcut and
force audio detection.
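The try-then-fall-back behaviour can be sketched as follows. This is a minimal illustration and not ffsubsync's implementation: subs_then_fallback and its two callable parameters are hypothetical names, and treating an empty result as "no usable subtitles" is an assumption.

```python
from typing import Callable, Optional, Sequence

Signal = Sequence[int]  # one 0/1 entry per analysis window


def subs_then_fallback(
    extract_embedded: Callable[[], Optional[Signal]],
    run_audio_vad: Callable[[], Signal],
) -> Signal:
    """Prefer an embedded-subtitle signal; otherwise run the audio VAD."""
    signal = extract_embedded()
    # Assumption: None or an empty signal means no usable embedded subtitles.
    if signal:
        return signal
    return run_audio_vad()
```

Because the audio VAD is passed as a callable, it is only invoked when the cheaper subtitle path yields nothing.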
Subtitle file
If the reference itself is a subtitle file — extension .srt, .ass,
.ssa, or .sub — ffsubsync derives the speech signal straight from the
reference’s on/off subtitle timings. No audio is extracted, so this is the
fastest path (typically under a second). This is the “sync against an
already-correct subtitle” workflow from Usage.
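To make "on/off subtitle timings" concrete, here is a small sketch that turns cue intervals into a binary signal. It assumes the same 10 ms window size as the VAD path; the function name and the exact rounding are illustrative choices and may differ from what ffsubsync does internally.

```python
from typing import List, Tuple

WINDOWS_PER_SECOND = 100  # assumption: 10 ms windows, as in the VAD path


def subtitle_timings_to_signal(
    cues: List[Tuple[float, float]], duration: float
) -> List[int]:
    """Mark each 10 ms window 1 while a cue is displayed, else 0."""
    n = int(round(duration * WINDOWS_PER_SECOND))
    signal = [0] * n
    for start, end in cues:
        # Clamp each cue to the signal's bounds before marking it.
        first = max(0, int(round(start * WINDOWS_PER_SECOND)))
        last = min(n, int(round(end * WINDOWS_PER_SECOND)))
        for i in range(first, last):
            signal[i] = 1
    return signal
```

Since this only reads timestamps from a text file, no decoding is involved, which is why this path is so much faster than audio extraction.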
When the reference is a subtitle file you can also control its text encoding with
--reference-encoding (it defaults to auto-detection, just like input
subtitles — see Character encoding), and merge the reference into the output with
--merge-with-reference.
PGS image subtitles
Blu-ray rips often ship subtitles as PGS (Presentation Graphic Stream) image-based tracks rather than text. ffsubsync can use a PGS track as the sync reference without any OCR, deriving speech timing from when each subtitle image is displayed:
$ ffs ref.mkv -i in.srt -o out.srt --pgs-ref-stream
Passing --pgs-ref-stream with no value auto-detects the first
hdmv_pgs_subtitle track. To pick a specific track, give it a stream
specifier (the leading 0: is optional):
$ ffs ref.mkv -i in.srt -o out.srt --pgs-ref-stream s:2
Serialized speech (.npy / .npz)
If you pass a .npy or .npz file as the reference, ffsubsync loads a
previously-serialized speech signal instead of computing one. You produce such a
file with --serialize-speech (see Advanced options). This is handy when you
want to sync several subtitle files against the same video: extract the speech
signal once, then reuse it repeatedly without re-decoding the audio.
Selecting a stream from the reference
A video file can contain several audio or subtitle tracks. Use
--reference-stream to choose one, written as an ffmpeg stream
specifier:
$ ffs ref.mkv -i in.srt -o out.srt --reference-stream s:2
For example, 0:s:0 uses the first subtitle track and 0:a:3 uses the
fourth audio track; you may drop the leading 0: and write s:0 or a:3.
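The optional leading 0: can be handled with a one-line normalization, sketched below. The helper names are hypothetical, and passing the result to ffmpeg's -map option is an assumption about how such a specifier would typically be consumed, not a description of ffsubsync's internals.

```python
from typing import List


def normalize_stream_specifier(spec: str) -> str:
    """Add the input index so that "s:0" and "0:s:0" are equivalent."""
    return spec if spec.startswith("0:") else "0:" + spec


def ffmpeg_map_args(spec: str) -> List[str]:
    """Build an ffmpeg -map argument pair for the chosen stream (assumed usage)."""
    return ["-map", normalize_stream_specifier(spec)]
```

For instance, ffmpeg_map_args("a:3") produces the arguments that select the fourth audio track of the first input.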
Offset-only mode (no reference)
Finally, ffsubsync doesn’t strictly need a reference at all. If you already know
the correction you want, --apply-offset-seconds shifts every subtitle by a
fixed number of seconds with no alignment step:
$ ffs -i in.srt -o out.srt --apply-offset-seconds 3.5
This is covered further in Advanced options.
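Conceptually, a fixed offset adds the same constant to every cue's start and end time. The sketch below assumes that a positive value moves subtitles later; apply_offset is a hypothetical helper, not part of ffsubsync.

```python
from typing import List, Tuple

Cue = Tuple[float, float]  # (start, end) in seconds


def apply_offset(cues: List[Cue], offset_seconds: float) -> List[Cue]:
    """Shift every cue by the same amount; durations are unchanged."""
    # Assumption: a positive offset delays the subtitles.
    return [(start + offset_seconds, end + offset_seconds) for start, end in cues]
```

With an offset of 3.5, a cue spanning 1.0 to 2.5 seconds would move to 4.5 to 6.0 seconds.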