Discussion:
character set encoding of scheme source files
Jonathan Rees
2012-10-26 21:01:04 UTC
Mike: I noticed the presence of an apparent u-umlaut in your recent Scheme 48 sources. This raises an interesting question for me: How are various parts of the ecosystem supposed to know what the character set encoding of scheme files is? The question hasn't come up much previously because Scheme is so retro (or provincial) that most source files so far are encoded in 7-bit ASCII, which is a subset shared among UTF-8, Latin-1, etc., so until now it just hasn't mattered.

This is a practical question; I don't mean to be pedantic. My emacs displayed the source in what I think was the intended way, but it feels like an accident that it did so. What if it had treated the source file as UTF-8 or Latin-5? What if it had paid attention to the Unix locale, and the locale specified an encoding different from what was intended?
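To make the failure mode concrete, here is a minimal Python sketch (Python is used only for illustration; none of this is Scheme 48 code). It shows why pure-ASCII sources never raised the question, and what happens to a u-umlaut when the reader guesses the wrong one of Latin-1 and UTF-8.

```python
# The character in question: LATIN SMALL LETTER U WITH DIAERESIS (U+00FC).
u_umlaut = "\u00fc"

latin1_bytes = u_umlaut.encode("latin-1")  # a single byte, 0xFC
utf8_bytes = u_umlaut.encode("utf-8")      # two bytes, 0xC3 0xBC

# Pure ASCII text is byte-identical under both encodings,
# so a wrong guess between the two is harmless for such files.
ascii_text = "(define x 1)"
ascii_is_unambiguous = ascii_text.encode("latin-1") == ascii_text.encode("utf-8")

# UTF-8 bytes read as Latin-1 decode without error,
# but silently yield two wrong characters instead of one.
mojibake = utf8_bytes.decode("latin-1")

# Latin-1 bytes read as UTF-8 are rejected, because 0xFC
# cannot start a valid UTF-8 sequence.
try:
    latin1_bytes.decode("utf-8")
    rejected_as_utf8 = False
except UnicodeDecodeError:
    rejected_as_utf8 = True
```

The two wrong guesses fail differently: one silently corrupts the text, the other raises an error. That asymmetry is part of why an explicit declaration is safer than relying on the editor's or the locale's default.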

Similarly, there's the question of how the encoding is (or should be) determined by Scheme. I haven't looked at the Unicode support in Scheme 48 yet, but I'm guessing that the problem of deciding on the encoding is pushed off to the user/programmer. Maybe that's OK, but the problem remains: how is the user/programmer supposed to decide what the encoding is, in a 'best practice' sense?

XML and N3 have answers to these questions: XML says (IIUIC) the encoding can be determined by reading the first few bytes of the file assuming 7-bit ASCII; having decoded that, you can decode the rest (because the initial text tells you the encoding). N3 just says that the source file is in UTF-8, period. HTTP also has an answer: if content comes in an HTTP message, the answer is given by the HTTP headers and, if necessary, the media type documentation (the story differs between text/ and application/, etc.). So what does Scheme (or R6RS Scheme or R7RS Scheme or Scheme 48 or folklore) say about this?
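The XML approach described above can be sketched roughly as follows. This is a simplified illustration in Python, not the full autodetection procedure from the XML specification (which also covers cases such as UTF-16 without a byte order mark); the function name `sniff_xml_encoding` and the regular expression are assumptions made up for this example.

```python
import codecs
import re

# Hypothetical helper: matches an XML declaration at the very start of the
# data and captures the value of its encoding pseudo-attribute.
_XML_DECL = re.compile(
    rb'^<\?xml[^>]*?encoding\s*=\s*["\']([A-Za-z][A-Za-z0-9._-]*)["\']'
)

def sniff_xml_encoding(head: bytes, default: str = "utf-8") -> str:
    """Guess the encoding of an XML document from its first bytes."""
    # A byte order mark, when present, settles the question first.
    if head.startswith(codecs.BOM_UTF8):
        return "utf-8"
    if head.startswith((codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)):
        return "utf-16"
    # Otherwise the declaration is read as plain ASCII bytes.
    match = _XML_DECL.match(head)
    if match:
        return match.group(1).decode("ascii").lower()
    # With no declaration and no byte order mark, fall back to UTF-8.
    return default
```

The point of the design is that the declaration itself uses only ASCII characters, so it reads the same under any ASCII-compatible encoding and can be parsed before the real encoding is known.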

Just wondering, too lazy to dig around to find the answers, and hoping you'll brief me so I don't have to dig. Pointers to documentation would be a fine answer to my question.

Thanks
Jonathan
Michael Sperber
2012-10-27 13:50:35 UTC
Post by Jonathan Rees
Mike: I noticed the presence of an apparent u-umlaut in your recent
Scheme 48 sources. This raises an interesting question for me: How are
various parts of the ecosystem supposed to know what the character set
encoding of scheme files is? The question hasn't come up much
previously because Scheme is so retro (or provincial) that most source
files so far are encoded in 7-bit ASCII, which is a subset shared
among UTF-8, Latin-1, etc., so until now it just hasn't mattered.
It's come up, and I don't have a good answer yet: The default encoding
of source files is Latin-1. While the system is capable of reading
other encodings, there's no easy way to influence that setting. But
it's on my list. My personal preference is for moving to UTF-8
eventually.

As to the Latin-1 u-umlaut, I was quite consciously following your lead
of putting a © (copyright) character in, for example, doc/deriving.txt
from 0.58.
--
Regards,
Mike
Jonathan Rees
2012-10-27 22:02:34 UTC
Post by Michael Sperber
It's come up, and I don't have a good answer yet: The default encoding
of source files is Latin-1. While the system is capable of reading
other encodings, there's no easy way to influence that setting. But
it's on my list. My personal preference is for moving to UTF-8
eventually.
Mine too.
Post by Michael Sperber
As to the Latin-1 u-umlaut, I was quite consciously following your lead
of putting a © (copyright) character in, for example, doc/deriving.txt
from 0.58.
That was before I (we?) saw the Unicode light.

Is there a way to tell emacs to switch to a UTF-8 default for everything? And a way to hack at the Scheme 48 sources (off label) to make it do the same?

Thanks
Jonathan
Alex Shinn
2012-10-28 07:56:23 UTC
Post by Jonathan Rees
Is there a way to tell emacs to switch to a UTF-8 default for everything? And a way to hack at the Scheme 48 sources (off label) to make it do the same?
(prefer-coding-system 'utf-8)

Although for the sake of other users who don't
necessarily assume utf-8 you may want to keep
a habit of adding "coding: utf-8" to files that have
non-ASCII characters.
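
As a rough illustration of how a tool other than Emacs could honor such a marker, here is a minimal Python sketch. It assumes the marker is written in a Scheme comment on one of the first two lines, e.g. `;; -*- coding: utf-8 -*-`; the helper names `declared_coding` and `read_source` are hypothetical, and the Latin-1 fallback simply mirrors the Scheme 48 default mentioned earlier in the thread.

```python
import re

# Matches "coding: NAME" or "coding=NAME", as found in an Emacs-style
# "-*- coding: utf-8 -*-" line.
_COOKIE = re.compile(rb"coding[:=][ \t]*([-_.a-zA-Z0-9]+)")

def declared_coding(data: bytes, default: str = "latin-1") -> str:
    """Return the encoding named in a leading ';' comment, else the default."""
    # Only the first two lines are inspected, and only if they are comments.
    for line in data.splitlines()[:2]:
        if line.lstrip().startswith(b";"):
            match = _COOKIE.search(line)
            if match:
                return match.group(1).decode("ascii").lower()
    return default

def read_source(data: bytes) -> str:
    """Decode source bytes using the declared encoding (hypothetical helper)."""
    return data.decode(declared_coding(data))
```

A file that carries the marker then decodes the same way regardless of the reader's locale or editor defaults, while untagged files keep their existing behavior.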
--
Alex
Michael Sperber
2012-10-28 15:14:48 UTC
Post by Jonathan Rees
Is there a way to tell emacs to switch to a UTF-8 default for
everything? And a way to hack at the Scheme 48 sources (off label) to
make it do the same?
Easiest would be to call `(set-port-text-codec! port utf-8-codec)' on
the input port created in scheme/bcomp/read-form.scm.
--
Regards,
Mike