From 1b4d0401e47e9cd9340a0dd071285a26ac8674a1 Mon Sep 17 00:00:00 2001
From: TheQWERTYCodr <93845040+TheQWERTYCodr@users.noreply.github.com>
Date: Fri, 1 Aug 2025 02:55:53 -0400
Subject: [PATCH 1/6] fix youtube music metadata extraction
fixed the metadata extraction regex's catastrophic backtracking, made it faster on all inputs, and added proper support for artists using the middle dot character
and now, a rant about properly checking your work and learning how to do shit before you publish changes:
simulated atomic groups did not make the regex faster - you added a newline.
simulated atomic groups are always (guaranteed!) slower than normal groups and removing them from the old regex makes that regex faster: https://regex101.com/r/8Ssf2h/3
this is fairly obvious to anyone who has actually learned how regexes are matched.
the fix is to add a delimiter to the start of the expression: https://regex101.com/r/XqqucW/1
without (?:\n|^), the regex attempts a match starting at every possible title character (which is virtually every position in the input).
each attempt extends as far as it can before it fails.
for the string "hello", it would have to try "hello", "ello", "llo", "lo", and "o".
retrying from every start position like this, with each attempt scanning up to the rest of the line, is what causes quadratic performance in the number of input characters.
again, this is fairly obvious to anyone who has actually learned how regexes are matched.
i really hope the next person to "improve" this actually takes the time to review their changes before pushing them.
---
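a rough sketch of the start-position argument from the message above, not a benchmark. the two patterns and the sample description are made up for illustration and are much simpler than the real expression in _video.py; the helper names are hypothetical too. the helpers only count how many offsets can get past the first token of each pattern.

```python
import re

# hypothetical, simplified patterns for illustration only;
# these are NOT the expressions used in _video.py
UNANCHORED = re.compile(r'(?P<track>[^·\n]+)·(?P<artist>[^\n]+)\n')
ANCHORED = re.compile(r'(?:\n|^)(?P<track>[^·\n]+)·(?P<artist>[^\n]+)\n')

# made-up sample description
SAMPLE = 'Provided to YouTube by Label\n\nSong · Artist\n\nAuto-generated by YouTube.'


def unanchored_starts(text):
    # any character other than "·" or "\n" can begin [^·\n]+
    return sum(ch not in '·\n' for ch in text)


def anchored_starts(text):
    # without re.MULTILINE, ^ matches only at offset 0, so (?:\n|^)
    # succeeds only at offset 0 or on a newline character
    return len({0} | {i for i, ch in enumerate(text) if ch == '\n'})


def fields(pattern, text):
    # return the stripped track and artist captured by the first match
    mobj = pattern.search(text)
    return mobj.group('track').strip(), mobj.group('artist').strip()
```

both patterns pull the same track and artist out of SAMPLE, but anchored_starts(SAMPLE) is bounded by the number of lines while unanchored_starts(SAMPLE) grows with the number of characters.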
yt_dlp/extractor/youtube/_video.py | 19 +++++++------------
1 file changed, 7 insertions(+), 12 deletions(-)
diff --git a/yt_dlp/extractor/youtube/_video.py b/yt_dlp/extractor/youtube/_video.py
index 171aa9b5c4..ada2f495ae 100644
--- a/yt_dlp/extractor/youtube/_video.py
+++ b/yt_dlp/extractor/youtube/_video.py
@@ -4177,20 +4177,15 @@ def process_language(container, base_url, lang_code, sub_name, client_name, quer
# Youtube Music Auto-generated description
if (video_description or '').strip().endswith('\nAuto-generated by YouTube.'):
- # XXX: Causes catastrophic backtracking if description has "·"
- # E.g. https://www.youtube.com/watch?v=DoPaAxMQoiI
- # Simulating atomic groups: (?P<a>[^xy]+)x => (?=(?P<a>[^xy]+))(?P=a)x
- # reduces it, but does not fully fix it. https://regex101.com/r/8Ssf2h/2
+ # Before you change this, learn how regexes work. The last guy didn't.
mobj = re.search(
r'''(?xs)
- (?=(?P