mirror of https://github.com/yt-dlp/yt-dlp.git synced 2025-08-15 08:58:28 +00:00

fix youtube music metadata extraction

fixed the metadata extraction regex's catastrophic backtracking, made it faster on all inputs, and added proper support for artists using the middle dot character

and now, a rant about properly checking your work and learning how to do shit before you publish changes:
simulated atomic groups did not make the regex faster - you added a newline.
simulated atomic groups are always (guaranteed!) slower than normal groups and removing them from the old regex makes that regex faster: https://regex101.com/r/8Ssf2h/3
this is fairly obvious to anyone who has actually learned how regexes are matched.
the fix is to add a delimiter to the start of the expression: https://regex101.com/r/XqqucW/1
without (?:\n|^), the regex attempts to find a match starting at every possible title character (which is virtually every location)
it will then attempt to extend this until it can't do so.
for the string "hello", it would have to check "hello", "ello", "llo", "lo", and "o".
these repeated restarts, combined with the backtracking inside each attempt, are what cause quadratic performance in the number of input characters.
again, this is fairly obvious to anyone who has actually learned how regexes are matched.
i really hope the next person to "improve" this actually takes the time to review their changes before pushing them.
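
The effect of the leading delimiter can be sketched with a small Python check. This is only an illustration under stated assumptions: `viable_starts` is a hypothetical helper, not part of yt-dlp, the sample description is made up, and the two patterns are simplified fragments of the track group rather than the full expression. It counts the positions at which each pattern can begin a match and does not measure time.

```python
import re

# Simplified fragments of the track group (assumption: not the full expression).
UNANCHORED = r'[^\n·]+'
ANCHORED = r'(?:\n|^)[^\n·]+'


def viable_starts(pattern, text):
    """Return the suffixes of `text` at which `pattern` can begin a match.

    Hypothetical helper for illustration only. Pattern.match(text, pos) tries
    exactly one start position, and `^` still refers to the real start of the
    string, so this mirrors the start positions re.search would have to try.
    """
    compiled = re.compile(pattern)
    return [text[pos:] for pos in range(len(text)) if compiled.match(text, pos)]


print(viable_starts(UNANCHORED, 'hello'))  # ['hello', 'ello', 'llo', 'lo', 'o']
print(viable_starts(ANCHORED, 'hello'))    # ['hello']

# Made-up description fragment: 20 characters, 3 lines (one of them empty).
sample = 'Song · Artist\n\nAlbum'
print(len(viable_starts(UNANCHORED, sample)))  # 17
print(len(viable_starts(ANCHORED, sample)))    # 2
```

For "hello" the unanchored fragment can start at all five positions, while the anchored one can start only at the beginning of the string. On the multi-line sample the anchored fragment is limited to the start of the string and the start of a non-empty line.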
TheQWERTYCodr 2025-08-01 02:55:53 -04:00 committed by GitHub
parent 71f30921a2
commit 1b4d0401e4
No known key found for this signature in database
GPG Key ID: B5690EEEBB952194

@@ -4177,20 +4177,15 @@ def process_language(container, base_url, lang_code, sub_name, client_name, quer
 # Youtube Music Auto-generated description
 if (video_description or '').strip().endswith('\nAuto-generated by YouTube.'):
-    # XXX: Causes catastrophic backtracking if description has "·"
-    # E.g. https://www.youtube.com/watch?v=DoPaAxMQoiI
-    # Simulating atomic groups: (?P<a>[^xy]+)x => (?=(?P<a>[^xy]+))(?P=a)x
-    # reduces it, but does not fully fix it. https://regex101.com/r/8Ssf2h/2
+    # Before you change this, learn how regexes work. The last guy didn't.
     mobj = re.search(
         r'''(?xs)
-            (?=(?P<track>[^\n·]+))(?P=track)·
-            (?=(?P<artist>[^\n]+))(?P=artist)\n+
-            (?=(?P<album>[^\n]+))(?P=album)\n
-            (?:.+?\s*(?P<release_year>\d{4})(?!\d))?
-            (?:.+?Released\ on\s*:\s*(?P<release_date>\d{4}-\d{2}-\d{2}))?
-            (.+?\nArtist\s*:\s*
-                (?=(?P<clean_artist>[^\n]+))(?P=clean_artist)\n
-            )?.+\nAuto-generated\ by\ YouTube\.\s*$
+            (?:\n|^)(?P<track>[^\n·]+)\ ·\ (?P<artist>[^\n]+)\n+
+            (?P<album>[^\n]+)\n+
+            (?:\s*(?P<release_year>\d{4})[^\n]+\n+)?
+            (?:Released\ on\s*:\s*(?P<release_date>\d{4}-\d{2}-\d{2}))?.+?
+            (\nArtist\s*:\s*(?P<clean_artist>[^\n]+)\n)?
+            .+Auto-generated\ by\ YouTube\.\s*$
         ''', video_description)
     if mobj:
         release_year = mobj.group('release_year')