Analysis of string representations for a modern programming language.
Naugler, David
If you are designing a language that will provide only one
primitive string type that subsumes character and supports Unicode, what
is the best internal representation? Should it be mutable or immutable?
Which encoding should it use: UTF-8, UTF-16, UTF-32, some hybrid, or
multiple encodings? Should the length be encoded as part of the string,
and if so, how? Should the string support list-like head/tail recursive
algorithms? Should strings be interned (stored in a global hash table)
to save space and provide constant-time equality checks? If so, how
should the hashing work? In general, should strings be viewed as
suitable data structures for most common text operations, or are they
opaque containers that must be converted to some other type (list,
vector, deque, etc.) for processing? No perfect solution exists, but I
analyze the alternatives and justify the string representation I use for
my programming language Rune.
* Shade, E. Computer Science Department, Southwest Missouri State
University.
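
Two of the trade-offs raised in the abstract can be made concrete with a short sketch. It is written in Python purely for illustration and does not reflect how Rune represents strings; the names `encoded_sizes` and `InternTable` are hypothetical. The first function compares the storage cost of one string under UTF-8, UTF-16, and UTF-32, and the class shows how a global hash table can reduce equality between interned strings to an identity check.

```python
def encoded_sizes(text):
    """Return the size in bytes of text under each Unicode encoding form."""
    return {
        "utf-8": len(text.encode("utf-8")),
        # The -le variants are used so that no byte-order mark is counted.
        "utf-16": len(text.encode("utf-16-le")),
        "utf-32": len(text.encode("utf-32-le")),
    }


class InternTable:
    """Hypothetical intern table: one canonical object per distinct string."""

    def __init__(self):
        self._table = {}

    def intern(self, text):
        # Hashing and comparing text costs time proportional to its length,
        # but it is paid once per call; afterwards, equality between two
        # interned strings is a constant-time identity comparison.
        return self._table.setdefault(text, text)


sample = "na\u00efve \U0001F600"      # 7 code points, one outside the BMP
sizes = encoded_sizes(sample)

table = InternTable()
a = table.intern("".join(["ru", "ne"]))   # built at run time
b = table.intern("".join(["r", "une"]))   # equal contents, separate object
same_object = a is b
```

For this sample, the seven code points occupy 11 bytes in UTF-8, 16 in UTF-16, and 28 in UTF-32, which is the kind of space difference the encoding question turns on. The intern table trades a hash lookup at creation time for cheap comparisons later, and it also implies that interned strings should be immutable, since every holder shares the same object.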