Template talk:StripAccents

From ChoralWiki

Nice template, Carlos! One thing, vis-à-vis use for NameSorter and for sortkeys in general: it would be nice if it converted something like "DeFord, Sally" to "Deford, Sally", "de Koven, Reginald" to "De Koven, Reginald", "D'Indy, Vincent" to "Dindy, Vincent", etc. (although in some worlds, the latter would be sorted as "Indy, Vincent D"). I.e., upper case for the first letter of each word in the name, lower case thereafter, and apostrophes removed. Sorry for adding a "wishlist" here. Obviously, I've been trying to synchronize the lists provided by the Composers and Composer works categories. -- Chucktalk Giffen 16:00, 13 May 2009 (UTC)
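
For illustration only, the rule requested above (first letter of each word upper case, the rest lower case, apostrophes removed) can be sketched in Python rather than in template syntax. The function name `normalize_name` and the set of apostrophe characters are assumptions made for this sketch; it is not the template's actual implementation.

```python
import re

# Characters treated as apostrophes (an assumption for this sketch):
# the ASCII apostrophe and the typographic right single quote.
APOSTROPHES = "'\u2019"


def normalize_name(name):
    """Drop apostrophes, then capitalize each whitespace-separated word."""
    stripped = name.translate({ord(c): None for c in APOSTROPHES})
    # str.capitalize() upper-cases the first character and lower-cases the rest.
    return re.sub(r"\S+", lambda m: m.group(0).capitalize(), stripped)


print(normalize_name("DeFord, Sally"))       # Deford, Sally
print(normalize_name("de Koven, Reginald"))  # De Koven, Reginald
print(normalize_name("D'Indy, Vincent"))     # Dindy, Vincent
```

Note that this treats a hyphenated name as a single word, so "Saint-Saëns" would come out as "Saint-saëns"; whether that is acceptable for a sortkey is a separate decision.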

Hi Carlos, there is a bug somewhere, because your 2nd example in the documentation returns something way off base! -- Chucktalk Giffen 20:22, 13 May 2009 (UTC)
Hi Chuck, thanks for alerting me. It seems that RegExp doesn't work well with Unicode characters unless they are hex encoded, so I had to revert. Your suggestions above would be a nice improvement to NameSorter too. I'll think of a way to gather them in yet another template (also covering particles such as "von" and "van"). -- Carlos 21:14, 13 May 2009 (UTC)
No, thank YOU!! -- Chucktalk Giffen 22:10, 13 May 2009 (UTC)
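
As a side note on the Unicode problem mentioned above: one common way this kind of failure arises is a regular-expression engine that matches bytes rather than characters, so a character class built from multi-byte UTF-8 letters matches individual bytes. Whether that is what happened in the RegExp extension is not confirmed here; the Python sketch below only illustrates the general mechanism, and the function names are hypothetical.

```python
import re


def byte_level_replace(text):
    """Replace 'à' or 'é' with 'e' using a byte-oriented pattern (buggy)."""
    # The class contains the bytes c3 a0 c3 a9, so it matches any one of
    # the single bytes c3, a0, a9 rather than the two letters.
    pattern = re.compile(b"[" + "àé".encode("utf-8") + b"]")
    return pattern.sub(b"e", text.encode("utf-8"))


def char_level_replace(text):
    """Same replacement with code points written as hex escapes."""
    return re.sub("[\u00e0\u00e9]", "e", text)


# 'ñ' is c3 b1 in UTF-8; the byte-level class wrongly matches its first byte.
print(byte_level_replace("ñ"))     # b'e\xb1'
print(char_level_replace("ñ"))     # ñ
print(char_level_replace("café"))  # cafe
```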
Hi again, Carlos. I've been thinking about the sortkey situation some more. Here are a few items for thought:
  1. A few more ligatures/double letters should be added, for example ae and oe, "eth" and "thorn"; and "å=aa" is probably better than "å=a" (the Danish/Norwegian usage is more predominant, I think). Other changes might be to set "č=cz" and "š=sz", so that "čech" would become "czech" and "Miloš" would become "Milosz" (as in Miloš/Milosz Forman), the usual way of converting these Čech (Czech) words. Transliteration of Ukrainian and other Slavic languages often uses "č" and "š", which would be converted the same way, e.g. in the surname "Ščarba", which is usually rendered "Szczarba".
  2. Capital letters with diacritics should be included, at least Å, Æ, Ä, Č, Š, Ö, Ü, and Ø, since these do appear as first letters in names.
  3. It might be best if all letters were converted to lower case (or upper case), to avoid problems like "DeFord" (a real case here), "DeForest", "VanBuren", "VandeGraaf", "VanderCook", "VanderWerp", "VonBraun", "VonKleist", "LaRue", "LaPlace", "LeGrand", etc. Additionally, if this were done, there would be no need to duplicate the inclusion of diacritic conversion for capital letters, and the additional coding is easy enough - just do a lc (or UC) conversion before invoking StripAccents.
  4. In names like d'Indy, d'Ambleville, and d'Astorgia, it is probably best to remove the apostrophe (that seems to be what Wikipedia usually does, and I did the same manually). However, whereas d'Indy is sorted under I, the other two are sorted under D (as Dambleville and Dastorgia, respectively).
There are always problematic issues, because of the varying usage of some of these letters in other languages. All this is very tricky stuff, with many pitfalls, and there are bound to be disagreements - some would have us keep (some of) the diacritics and/or use one of several possible different sort orders. Generally, I think simply removing diacritics is best except in the few instances I've raised (and even these may be subject to some debate). I'm fiddling with some changes on my User page, but I'm not finished yet. -- Chucktalk Giffen 16:00, 11 June 2009 (UTC)
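
Items 1 to 3 above can be sketched as a small Python function, purely as an illustration of the proposed order of operations (lower-case first, then the special multi-letter substitutions, then plain diacritic removal). The name `sort_key`, the `slavic_digraphs` switch, and the specific choices "ø=o", "ð=d" and "þ=th" are assumptions made for this sketch; the discussion above does not say what eth, thorn or ø should map to.

```python
import unicodedata

# Multi-letter substitutions proposed in item 1, plus assumed values
# for ø, ð and þ, none of which decompose into base letter + accent.
SPECIAL = {"æ": "ae", "œ": "oe", "å": "aa", "ø": "o", "ð": "d", "þ": "th"}
# The debated Slavic conversions, kept optional.
SLAVIC = {"č": "cz", "š": "sz"}


def sort_key(name, slavic_digraphs=False):
    """Build a lower-case, accent-free sortkey from a name."""
    # Compose first so that 'a' + combining ring is seen as a single 'å',
    # then lower-case so only lower-case letters need table entries (item 3).
    text = unicodedata.normalize("NFC", name).lower()
    table = dict(SPECIAL)
    if slavic_digraphs:
        table.update(SLAVIC)
    # Special cases must run before stripping, or 'å' would become 'a'.
    text = "".join(table.get(ch, ch) for ch in text)
    # Decompose and drop the combining marks (the remaining diacritics).
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


print(sort_key("Åkesson"))                       # aakesson
print(sort_key("DeFord"))                        # deford
print(sort_key("Miloš"))                         # milos
print(sort_key("Ščarba", slavic_digraphs=True))  # szczarba
```

Apostrophes are deliberately left alone in this sketch, since item 4 shows they do not follow a single rule.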
Chuck, I agree with some of the suggestions (such as the Latin 'ae' and 'oe'), but am not sure about the rest. We should think of this not in terms of which transliteration is best, but in terms of how these names are sorted under English spelling rules. As an example, I have never seen Miloš converted to Milosz for sorting purposes (though I know these forms are considered equivalent in some Slavic countries), but if you think this is the correct way to do it in English, then go ahead!
I like the suggestion to convert the text to lower case before the substitutions; it will simplify the treatment of uppercase letters, as you have pointed out.
As there is no clear rule about apostrophes, and the cases are very few, it is better to keep dealing with them case by case.
When you finish the tests and find the results are OK, go ahead and apply the changes here, but remember that the composer categories will probably undergo a few changes in their sorting order once again. -- Carlos 05:18, 12 June 2009 (UTC)