gh-80667: Fix Tangut ideographs names in unicodedata#101585
gh-80667: Fix Tangut ideographs names in unicodedata#101585wismill wants to merge 3 commits intopython:mainfrom
Conversation
|
@malemburg, @ezio-melotti as Unicode experts. |
|
@malemburg @ezio-melotti kind reminder |
|
@malemburg @ezio-melotti @arhadthedev kind reminder |
|
Another attempt: @corona10 as a developer who worked with the unicode part of Python via Asian scripts. |
SnoopJ
left a comment
There was a problem hiding this comment.
I do not have review authority on CPython, but since this PR has been sitting for a few months, I thought I'd chime in to say it looks good to me in terms of functionality. I just discovered this PR because I just wrote some downstream code to work around the lack of support for this range.
The additional test looks a little complicated to me, it may be duplicating functionality from makeunicodedata.py, but the code added to the runtime is pretty much following the lead of the CJK stuff that already exists.
| while (namelen--) { | ||
| v *= 16; | ||
| if (*name >= '0' && *name <= '9') | ||
| v += *name - '0'; | ||
| else if (*name >= 'A' && *name <= 'F') | ||
| v += *name - 'A' + 10; | ||
| else | ||
| return 0; | ||
| name++; | ||
| } |
There was a problem hiding this comment.
I'm a little unsettled that this loop is duplicated from above, but I don't see a better way to do it aside from maybe some preprocessor abuse.
There was a problem hiding this comment.
Yeah I agree. But since this MR has received very little attention, I am not going to dedicate time to this if there is no opportunity to merge it.
3657122 to
b2c4e92
Compare
|
@SnoopJ thank you for your review, I resolved 2 items. I am quite demotivated because no CPython reviewer has been able to review this in close to half a year. @malemburg @ezio-melotti @arhadthedev @corona10 kind reminder 🙏 |
|
If it wasn't already known, Example (tried in Python 3.10.8): import unicodedata
try:
tangut_ideograph_172fd = b"\xf0\x97\x81\x83".decode("utf8")
unicodedata.name(tangut_ideograph_172fd)
except ValueError as e:
raise ValueError(f"Couldn't handle Tangut ideograph {tangut_ideograph_172fd}") from eOutput: |
Yep, this PR includes changes to I think |
@rcgale I find that
@SnoopJ I confirm that, good catch! |
Thanks, I appreciate the confirmation! Nice to know the bug has been identified, and that there's already a fix implemented. I hope it can be released soon!
Yes, it was a typo. I was encountering the same problem with several Tangut characters, and I probably copy/pasted the output from a different example. |
|
Really disappointed by the lack of interest of the Python devs after such a long time. |
|
It wasn't clear to me if this PR was merged or superseded. It has not been merged and Tangut codepoints still give errors. They were added in Unicode 9.0 in 2016. |
|
@rogerbinns feel free to revive this PR. |
|
Revived in #144789. |
|
I apologize for not getting a review of this PR in time, @wismill. There are not many core developers, especially experts in Unicode. I am now determined to see the matter through to the end. Your PR was generally good, it only needed some updates because the surrounding code changed. |
Fix Tangut ideographs names in unicodedata.
Partially fixes #80667.