Umlauts and other diacritics are broken

Information from and to the site administrators.

Moderator: Alastair

Garry
Posts: 513
Joined: Sun Oct 28, 2012 11:43 am
Location: Sydney, Australia

Umlauts and other diacritics are broken

#1 Post by Garry »

Try Games > Browse by language and select a language such as Danish or German that uses diacritical marks. Now browse through the list of games and you'll see all sorts of strange characters where umlauts or other non-ASCII characters should appear in game titles and authors' names.

Can someone please fix this?
Alastair
Posts: 1169
Joined: Fri Nov 11, 2005 12:21 am

Re: Umlauts and other diacritics are broken

#2 Post by Alastair »

Thanks, Garry, for bringing this to our attention. I noticed this for one entry back on the 16th, but I hadn't realised it was site-wide. I hope there is an easy solution; the thought of correcting each error individually does not appeal.
Gunness
Site Admin
Posts: 1951
Joined: Tue Dec 07, 2004 7:04 pm
Location: Copenhagen, Denmark

Re: Umlauts and other diacritics are broken

#3 Post by Gunness »

I presume that it's a display error, as I don't know why these specific characters should have been corrupted.
As far as I can tell, all pages are displayed with UTF-8, which *is* able to display special characters such as these. I don't have any ideas straight away but I'll investigate.
Mr Creosote
Posts: 1146
Joined: Tue Sep 22, 2009 9:23 am

Re: Umlauts and other diacritics are broken

#4 Post by Mr Creosote »

My guess is on a database upgrade gone wrong. I can check this afternoon.
Gunness
Site Admin
Posts: 1951
Joined: Tue Dec 07, 2004 7:04 pm
Location: Copenhagen, Denmark

Re: Umlauts and other diacritics are broken

#5 Post by Gunness »

Much appreciated!
Mr Creosote
Posts: 1146
Joined: Tue Sep 22, 2009 9:23 am

Re: Umlauts and other diacritics are broken

#6 Post by Mr Creosote »

Can you help assembling a list of characters which will need replacing - i.e. how they look now and what they are supposed to be? I have so far:

Code: Select all

    'ü' => 'ü',
    'ä' => 'ä',
    'ö' => 'ö',
    'Ü' => 'Ü',
    'é' => 'é',
    'á' => 'á',
    'ê' => 'ê'
Also, are other fields than the title also affected?
Alastair
Posts: 1169
Joined: Fri Nov 11, 2005 12:21 am

Re: Umlauts and other diacritics are broken

#7 Post by Alastair »

Add

Code: Select all

Â½ => ½
Ã¢ => â
Ã¨ => è
Ã‡ => Ç
Å« => ū
also I am seeing Ã® (see links below), I think it should be Î or î
https://solutionarchive.com/game/id%2C9 ... %2C+L.html
https://solutionarchive.com/game/id%2C1 ... 2C+Le.html

I'm guessing that the Â¿ in Â¿La sentencia? - https://solutionarchive.com/game/id%2C6 ... 2C+La.html - should be ¿ and the Ã in the author's name "JosÃ© Luis DÃaz" should be í, but then we have an instance where just one rather than two faulty characters represents the real one.

(Note that since Â½ => ½ and Â¿ may => ¿ it could be that all instances of Â{character} resolve to {character}.)
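The guess about Â{character} can be checked directly. The following is a minimal sketch, assuming the corruption is UTF-8 text being read as Windows-1252 (the thread does not confirm which charset was involved); `misread` is a hypothetical helper name.

```python
def misread(text: str) -> str:
    """Encode as UTF-8, then decode the bytes as Windows-1252 (the assumed wrong charset)."""
    return text.encode("utf-8").decode("cp1252")

# U+00BD (½) and U+00BF (¿) are stored in UTF-8 as the byte pairs C2 BD and C2 BF.
assert "½".encode("utf-8") == b"\xc2\xbd"
assert "¿".encode("utf-8") == b"\xc2\xbf"

# Read as Windows-1252, byte C2 is "Â" (U+00C2) and the second byte is the character itself.
assert misread("½") == "\u00c2" + "½"
assert misread("¿") == "\u00c2" + "¿"

# The same holds for every code point from U+00A0 to U+00BF.
assert all(misread(chr(cp)) == "\u00c2" + chr(cp) for cp in range(0xA0, 0xC0))
```

Under that assumption, every character in the range U+00A0 to U+00BF shows up as Â followed by itself, which matches the ½ case already in the list.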

I also see https://solutionarchive.com/game/id%2C9 ... A9lye.html; what the title and the author's name are supposed to be I cannot guess.

Mr Creosote wrote: Sat May 30, 2026 12:41 pm Also, are other fields than the title also affected?
Yes, see https://solutionarchive.com/game/id%2C5 ... glub!.html for an example where "Related" (no surprise since it contains the game's title) and "Notes" are affected.
Alastair
Posts: 1169
Joined: Fri Nov 11, 2005 12:21 am

Re: Umlauts and other diacritics are broken

#8 Post by Alastair »

This is a rough equation in hex I've devised that may explain what is going on, where:

x is the value of the actual character
y is the value of the first rogue character
z is the value of the second rogue character

I'm getting the values from https://en.wikipedia.org/wiki/List_of_U ... characters

x = 40(y - C2) + z

some examples which all correlate with what is known (for the third example see the note):

Code: Select all

Ã = 00C3
¤ = 00A4

x = 40(C3 - C2) + A4 = 40 + A4 = E4

00E4 = ä

Code: Select all

Å = 00C5
« = 00AB

x = 40(C5 - C2) + AB = C0 + AB = 16B

016B = ū

Code: Select all

Ã = 00C3
{Soft hyphen} = 00AD

x = 40(C3 - C2) + AD = ED

00ED = í

N.B. The soft hyphen would explain the "Díaz" issue.
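For what it is worth, the equation x = 40(y - C2) + z agrees with the standard UTF-8 rule for two-byte sequences, where the lead byte carries the top five bits and the continuation byte the low six. The sketch below compares the two over every valid two-byte sequence; the function names are made up for illustration, and it says nothing about three-byte sequences.

```python
def rough_equation(y: int, z: int) -> int:
    """The equation from this post: x = 0x40 * (y - 0xC2) + z."""
    return 0x40 * (y - 0xC2) + z

def utf8_two_byte(y: int, z: int) -> int:
    """Standard UTF-8 decoding of a two-byte sequence (lead byte y, continuation byte z)."""
    return ((y & 0x1F) << 6) | (z & 0x3F)

# Compare both against Python's own decoder for all valid two-byte sequences.
for y in range(0xC2, 0xE0):
    for z in range(0x80, 0xC0):
        expected = ord(bytes([y, z]).decode("utf-8"))
        assert rough_equation(y, z) == utf8_two_byte(y, z) == expected

# The three worked examples from the post.
assert rough_equation(0xC3, 0xA4) == 0xE4    # ä
assert rough_equation(0xC5, 0xAB) == 0x16B   # ū
assert rough_equation(0xC3, 0xAD) == 0xED    # í
```

The two forms match because 0x40 * 0xC2 = 0x3080, which is exactly the offset removed by masking the lead byte (0x40 * 0xC0) plus the continuation byte's marker bit (0x80).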
Gunness
Site Admin
Posts: 1951
Joined: Tue Dec 07, 2004 7:04 pm
Location: Copenhagen, Denmark

Re: Umlauts and other diacritics are broken

#9 Post by Gunness »

Mr Creosote wrote: Sat May 30, 2026 12:41 pm Can you help assembling a list of characters which will need replacing - i.e. how they look now and what they are supposed to be? I have so far:

Code: Select all

    'ü' => 'ü',
    'ä' => 'ä',
    'ö' => 'ö',
    'Ü' => 'Ü',
    'é' => 'é',
    'á' => 'á',
    'ê' => 'ê'
Also, are other fields than the title also affected?
I have these:
'Ã¥' => 'å'
'Ã¸' => 'ø'
'Ã˜' => 'Ø'
'Ã¦' => 'æ'
'Ã„' => 'Ä'

This ought to cover the various Scandinavian languages.

A few French characters:
'Ã‡' => 'Ç'
'Ã¢' => 'â'
'Ã¨' => 'è'
Gunness
Site Admin
Posts: 1951
Joined: Tue Dec 07, 2004 7:04 pm
Location: Copenhagen, Denmark

Re: Umlauts and other diacritics are broken

#10 Post by Gunness »

Mr Creosote wrote: Sat May 30, 2026 12:41 pm Also, are other fields than the title also affected?
Yes, the user comments - see: https://solutionarchive.com/game/id%2C3 ... Karma.html:
"I’m a big fan of Avalon Hill Microcomputer Games and have nearly all of these titles in my game collection. Lords of Karma can’t be 1978 given one primary fact. Avalon Hill in context to computer games first presented offerings to the public for sale at the Origins Gaming Convention on June 27–29 1980"

What's worrying here is that the bug also seems to affect apostrophes and dashes?
Mr Creosote
Posts: 1146
Joined: Tue Sep 22, 2009 9:23 am

Re: Umlauts and other diacritics are broken

#11 Post by Mr Creosote »

Well, I can run a script over a number of database fields to re-encode characters. Though honestly, it will never be fully complete. You don't happen to have a backup from before this happened, do you?
Gunness
Site Admin
Posts: 1951
Joined: Tue Dec 07, 2004 7:04 pm
Location: Copenhagen, Denmark

Re: Umlauts and other diacritics are broken

#12 Post by Gunness »

No, it would be pretty outdated, I'm afraid.

To avoid several passes, maybe we should ensure that the list of characters to be replaced is as complete as possible. I can take another look tomorrow, but don't have the time tonight.
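One way to make the list complete without collecting examples by hand would be to generate it. This is only a sketch, assuming the damage is UTF-8 read as Windows-1252; the function name and the default range (U+00A0 to U+017F, i.e. Latin-1 Supplement plus Latin Extended-A) are arbitrary choices for illustration, not something the site uses.

```python
def mojibake_table(first: int = 0xA0, last: int = 0x17F) -> dict[str, str]:
    """Map each corrupted form to the character it should be, for one code point range."""
    table = {}
    for cp in range(first, last + 1):
        good = chr(cp)
        try:
            bad = good.encode("utf-8").decode("cp1252")
        except UnicodeDecodeError:
            # Bytes 81, 8D, 8F, 90 and 9D are undefined in Python's cp1252 codec,
            # so characters whose encoding contains them are skipped here.
            continue
        table[bad] = good
    return table

table = mojibake_table()
assert table["\u00c3\u00bc"] == "ü"        # Ã¼ => ü
assert table["\u00c3\u00a5"] == "å"        # Ã¥ => å
assert table["\u00c5\u00ab"] == "\u016b"   # Å« => ū
```

Characters that are skipped (Á and Í, for example) would still need handling by hand, since how they appear on the page depends on what the misreading software does with undefined bytes.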

Equally important, what can be done to avoid this in the future, other than restoring backups or running char replacement scripts?
Alastair
Posts: 1169
Joined: Fri Nov 11, 2005 12:21 am

Re: Umlauts and other diacritics are broken

#13 Post by Alastair »

Gunness wrote: Mon Jun 01, 2026 1:51 pm Yes, the user comments - see: https://solutionarchive.com/game/id%2C3 ... Karma.html:
"I’m a big fan of Avalon Hill Microcomputer Games and have nearly all of these titles in my game collection. Lords of Karma can’t be 1978 given one primary fact. Avalon Hill in context to computer games first presented offerings to the public for sale at the Origins Gaming Convention on June 27–29 1980"

What's worrying here is that the bug also seems to affect apostrophes and dashes?
Looking at that page in the Wayback Machine - https://web.archive.org/web/20251015052 ... Karma.html - shows that the apostrophes and the dash in 27–29 are not the standard ASCII ' and -; they probably came from cutting and pasting from a word processor.
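Typographic apostrophes and en dashes take three bytes in UTF-8, which would explain why they come out as three rogue characters instead of two. A small sketch under the same assumption as before (UTF-8 read as Windows-1252); `misread` and `repair` are hypothetical helper names.

```python
def misread(text: str) -> str:
    """Encode as UTF-8, then decode the bytes as Windows-1252 (the assumed wrong charset)."""
    return text.encode("utf-8").decode("cp1252")

def repair(text: str) -> str:
    """Reverse of misread: re-encode as Windows-1252, then decode as UTF-8."""
    return text.encode("cp1252").decode("utf-8")

apostrophe = "\u2019"  # right single quotation mark
en_dash = "\u2013"     # en dash

# Both are three-byte sequences starting with E2 80.
assert apostrophe.encode("utf-8") == b"\xe2\x80\x99"
assert en_dash.encode("utf-8") == b"\xe2\x80\x93"

# Read as Windows-1252, the three bytes become three visible characters.
assert misread(apostrophe) == "\u00e2\u20ac\u2122"  # â, €, ™
assert misread(en_dash) == "\u00e2\u20ac\u201c"     # â, €, left double quote

# The round trip restores the original text.
assert repair(misread("27\u201329")) == "27\u201329"
```

So the apostrophes and the dash in that comment appear to be the same fault as the umlauts, just with longer byte sequences.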

The Wayback Machine also shows that the problem occurred between 3rd April - https://web.archive.org/web/20260403032 ... chive.com/ - and 8th May - https://web.archive.org/web/20260508140 ... chive.com/ (search for "sur ma Cour" or "ologie en Aveugle" for a couple of examples).
Mr Creosote
Posts: 1146
Joined: Tue Sep 22, 2009 9:23 am

Re: Umlauts and other diacritics are broken

#14 Post by Mr Creosote »

I think I have an algorithmic solution, not relying on a whitelist of specific characters (thanks, Alastair, your math approach nudged me in the right direction).

Here is an overview of what it would do if I put it live: https://solutionarchive.com/__umlauts/ (temporary page, will be deleted again after committing the fix). The ones marked in red do contain question marks, which could happen if my algorithm fails. Though of course, a question mark may be correctly part of a game title or a sentence. I.e. it just means we should manually check whether this proposed result is correct. A bit of sanity checking would be appreciated.
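The actual script is not shown in this thread, so the following is only a guess at what a whitelist-free approach could look like: re-encode the stored text as Windows-1252 and decode the resulting bytes as UTF-8, leaving the text alone when that fails. The names `repair` and `needs_review` are invented for the example, and the Windows-1252 assumption is unconfirmed.

```python
def repair(text: str) -> str:
    """Undo one round of UTF-8-read-as-Windows-1252 corruption, if the text fits that pattern."""
    try:
        return text.encode("cp1252").decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        # Not representable in cp1252, or not valid UTF-8 afterwards:
        # treat the text as undamaged and return it unchanged.
        return text

def needs_review(result: str) -> bool:
    """Flag results containing a question mark for a manual check."""
    return "?" in result

assert repair("\u00c3\u00bc") == "ü"          # Ã¼ => ü
assert repair("D\u00c3\u00adaz") == "Díaz"    # Ã + soft hyphen => í
assert repair("Jörg") == "Jörg"               # already correct, left alone
assert repair("Gold Fever") == "Gold Fever"   # plain ASCII, left alone
```

A string that mixes damaged and undamaged non-ASCII characters would be returned unchanged by this sketch, so a real script would probably need to work on smaller units than a whole field.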

I'm still looking into what other database fields could be affected. Is there any impact on the forums as well or only on the website?
Alastair
Posts: 1169
Joined: Fri Nov 11, 2005 12:21 am

Re: Umlauts and other diacritics are broken

#15 Post by Alastair »

Mr Creosote wrote: Thu Jun 04, 2026 4:51 pm I'm still looking into what other database fields could be affected. Is there any impact on the forums as well or only on the website?
The umlauts in the thread Jörg Walkowiak's "Gold Fever" - https://solutionarchive.com/phpBB3/viewtopic.php?t=871 - are present and correct. So the forum is probably unaffected.