mirror of https://github.com/yt-dlp/yt-dlp.git
synced 2025-08-15 00:48:28 +00:00
fix youtube music metadata extraction
fixed the metadata extraction regex's catastrophic backtracking, made it faster on all inputs, and added proper support for artists using the middle dot character.

and now, a rant about properly checking your work and learning how to do shit before you publish changes:

simulated atomic groups did not make the regex faster - you added a newline. simulated atomic groups are always (guaranteed!) slower than normal groups, and removing them from the old regex makes that regex faster: https://regex101.com/r/8Ssf2h/3

this is fairly obvious to anyone who has actually learned how regexes are matched.

the fix is to add a delimiter to the start of the expression: https://regex101.com/r/XqqucW/1

without (?:\n|^), the regex attempts to find a match starting at every possible title character (which is virtually every location). it will then attempt to extend this until it can't do so. for the string "hello", it would have to check "hello", "ello", "llo", "lo", and "o". rescanning from every start position like this is what causes quadratic performance in the number of input characters.

again, this is fairly obvious to anyone who has actually learned how regexes are matched.

i really hope the next person to "improve" this actually takes the time to review their changes before pushing them.
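The start-position argument in the commit message can be modelled roughly as follows. This is a simplified sketch, not a measurement of what Python's `re` engine does internally: the function name `chars_scanned` and its cost model (one unit per character consumed while looking for a run followed by `·`) are assumptions made for illustration.

```python
def chars_scanned(line: str, anchored: bool) -> int:
    """Count characters a naive matcher reads while looking for '<run>·' in one line.

    Simplified model (an assumption, not how `re` accounts for its work):
    at each allowed start position the matcher consumes every character up
    to the next '·' or the end of the line, then checks whether '·' follows.
    """
    # Anchored: only the start of the line is tried.
    # Unanchored: every character position is tried in turn.
    starts = [0] if anchored else range(len(line))
    total = 0
    for start in starts:
        pos = start
        while pos < len(line) and line[pos] != '·':
            pos += 1
            total += 1
        if start < pos < len(line):
            # A non-empty run followed by '·' was found, so the search stops.
            break
    return total


if __name__ == '__main__':
    for n in (10, 100, 1000):
        line = 'a' * n  # a line with no middle dot, so every attempt fails
        print(n, chars_scanned(line, anchored=False), chars_scanned(line, anchored=True))
```

Under this model, a line of n characters without a middle dot costs n(n+1)/2 steps when every start position is tried and n steps when only the line start is tried, which matches the "hello", "ello", "llo", "lo", "o" walk-through.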
parent 71f30921a2
commit 1b4d0401e4
@@ -4177,20 +4177,15 @@ def process_language(container, base_url, lang_code, sub_name, client_name, quer

         # Youtube Music Auto-generated description
         if (video_description or '').strip().endswith('\nAuto-generated by YouTube.'):
-            # XXX: Causes catastrophic backtracking if description has "·"
-            # E.g. https://www.youtube.com/watch?v=DoPaAxMQoiI
-            # Simulating atomic groups: (?P<a>[^xy]+)x => (?=(?P<a>[^xy]+))(?P=a)x
-            # reduces it, but does not fully fix it. https://regex101.com/r/8Ssf2h/2
+            # Before you change this, learn how regexes work. The last guy didn't.
             mobj = re.search(
                 r'''(?xs)
-                    (?=(?P<track>[^\n·]+))(?P=track)·
-                    (?=(?P<artist>[^\n]+))(?P=artist)\n+
-                    (?=(?P<album>[^\n]+))(?P=album)\n
-                    (?:.+?℗\s*(?P<release_year>\d{4})(?!\d))?
-                    (?:.+?Released\ on\s*:\s*(?P<release_date>\d{4}-\d{2}-\d{2}))?
-                    (.+?\nArtist\s*:\s*
-                        (?=(?P<clean_artist>[^\n]+))(?P=clean_artist)\n
-                    )?.+\nAuto-generated\ by\ YouTube\.\s*$
+                    (?:\n|^)(?P<track>[^\n·]+)\ ·\ (?P<artist>[^\n]+)\n+
+                    (?P<album>[^\n]+)\n+
+                    (?:℗\s*(?P<release_year>\d{4})[^\n]+\n+)?
+                    (?:Released\ on\s*:\s*(?P<release_date>\d{4}-\d{2}-\d{2}))?.+?
+                    (\nArtist\s*:\s*(?P<clean_artist>[^\n]+)\n)?
+                    .+Auto-generated\ by\ YouTube\.\s*$
                 ''', video_description)
             if mobj:
                 release_year = mobj.group('release_year')
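To see what the replacement pattern captures, here is a minimal sketch that runs it against a made-up description. The pattern text is copied from the added lines of the diff above; the names `AUTO_DESC_RE`, `parse_auto_description` and `SAMPLE`, and the sample text itself, are hypothetical and not part of yt-dlp.

```python
import re

# Pattern copied from the added lines of the diff; the constant name is hypothetical.
AUTO_DESC_RE = re.compile(r'''(?xs)
    (?:\n|^)(?P<track>[^\n·]+)\ ·\ (?P<artist>[^\n]+)\n+
    (?P<album>[^\n]+)\n+
    (?:℗\s*(?P<release_year>\d{4})[^\n]+\n+)?
    (?:Released\ on\s*:\s*(?P<release_date>\d{4}-\d{2}-\d{2}))?.+?
    (\nArtist\s*:\s*(?P<clean_artist>[^\n]+)\n)?
    .+Auto-generated\ by\ YouTube\.\s*$
''')

# Invented sample in the general shape of an auto-generated description
# (an assumption; real descriptions may differ).
SAMPLE = '\n\n'.join([
    'Provided to YouTube by Example Records',
    'Some Song · Some Artist',
    'Some Album',
    '℗ 2019 Example Records',
    'Released on: 2019-05-17',
    'Artist: Some Artist',
    'Auto-generated by YouTube.',
])


def parse_auto_description(video_description):
    """Return the named groups as a dict, or None if the text does not match."""
    # Same guard as in the diff: only try descriptions with the YouTube footer.
    if not (video_description or '').strip().endswith('\nAuto-generated by YouTube.'):
        return None
    mobj = AUTO_DESC_RE.search(video_description)
    return mobj.groupdict() if mobj else None


if __name__ == '__main__':
    print(parse_auto_description(SAMPLE))
```

Because the pattern is compiled without `re.MULTILINE`, `^` only matches at the very start of the string, so a match can begin only there or directly after a newline. The first line of the sample contains no ` · `, so the search moves on to the next line start without retrying every character in between.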