Strings are sequences of characters. The length of a string is
the number of characters that it contains, as an exact non-negative integer.
This number is usually fixed when the string is created;
however, you can extend a mutable string with
the (Kawa-specific) string-append! function.
The valid indices of a string are the
exact non-negative integers less than the length of the string.
The first character of a
string has index 0, the second has index 1, and so on.
Strings are implemented as a sequence of 16-bit char values,
even though they’re semantically a sequence of 32-bit Unicode code points.
A character whose value is greater than #xffff
is represented using two surrogate characters.
This implementation allows for natural interoperability with Java APIs.
However, it does make certain operations (indexing or counting based on
character counts) difficult to implement efficiently. Luckily, one
rarely needs to index or count based on character counts;
alternatives are discussed below.
Some of the procedures that operate on strings ignore the
difference between upper and lower case. The names of
the versions that ignore case end with “-ci” (for “case insensitive”).
Type: string
The type of string objects. The underlying type is the interface java.lang.CharSequence. Immutable strings are java.lang.String, while mutable strings are gnu.lists.FString.
Procedure: string? obj
Return #t if obj is a string, #f otherwise.
Procedure: string char …
Return a newly allocated string composed of the arguments. This is analogous to list.
Procedure: make-string k [char]
Return a newly allocated string of length k. If char is given, then all elements of the string are initialized to char, otherwise the contents of the string are unspecified.
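As a small illustration (a sketch written for this section, not taken from the Kawa sources), passing the optional char gives a string with known contents, while a zero-length string is a convenient starting point for later growth. The names dashes and empty-buffer are hypothetical.

```scheme
;; Three copies of the fill character.
(define dashes (make-string 3 #\-))

;; A zero-length mutable string; no fill character is needed.
(define empty-buffer (make-string 0))

(string-length dashes)        ; ⇒ 3
(string-length empty-buffer)  ; ⇒ 0
```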
Procedure: string-length string
Return the number of characters in the given string as an exact integer object.
Performance note: Calling string-length may take time proportional to the length of the string, because of the need to scan for surrogate pairs.
Procedure: string-ref string k
k must be a valid index of string. The string-ref procedure returns character k of string using zero–origin indexing.
Performance note: Calling string-ref may take time proportional to k because of the need to check for surrogate pairs. An alternative is to use string-cursor-ref. If iterating through a string, use string-for-each.
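To make the iteration advice concrete, here is a minimal sketch that visits every character once with string-for-each instead of calling string-ref in a loop. The procedure name count-vowels is hypothetical and only standard R7RS procedures are assumed.

```scheme
;; Count the lowercase ASCII vowels in a string,
;; visiting each character exactly once.
(define (count-vowels str)
  (let ((n 0))
    (string-for-each
      (lambda (ch)
        (if (memv ch '(#\a #\e #\i #\o #\u))
            (set! n (+ n 1))))
      str)
    n))

(count-vowels "education")  ; ⇒ 5
```

A loop over (string-ref str i) would compute the same result, but each call may rescan the string from the start.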
Procedure: string-set! string k char
This procedure stores char in element k of string.
(define s1 (make-string 3 #\*))
(define s2 "***")
(string-set! s1 0 #\?) ⇒ void
s1 ⇒ "?**"
(string-set! s2 0 #\?) ⇒ error
(string-set! (symbol->string 'immutable) 0 #\?) ⇒ error
Performance note: Calling string-set! may take time proportional to the length of the string: First it must scan for the right position, like string-ref does. Then if the new character requires using a surrogate pair (and the old one doesn’t) then we have to make room in the string, possibly re-allocating the char array. Alternatively, if the old character requires using a surrogate pair (and the new one doesn’t) then following characters need to be moved.
The function string-set! is deprecated: It is inefficient, and it very seldom does the correct thing. Instead, you can construct a string with string-append!.
Procedure: substring string start end
string must be a string, and start and end must be exact integer objects satisfying:
0 <= start <= end <= (string-length string)
The substring procedure returns a newly allocated string formed from the characters of string beginning with index start (inclusive) and ending with index end (exclusive).
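The inclusive start and exclusive end can be seen in a short sketch (the variable greeting is only an illustration):

```scheme
(define greeting "hello world")

(substring greeting 0 5)   ; ⇒ "hello"  (indexes 0 through 4)
(substring greeting 6 11)  ; ⇒ "world"  (11 is the string length)
(substring greeting 3 3)   ; ⇒ ""       (start = end gives an empty string)
```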
Procedure: string-append string …
Return a newly allocated string whose characters form the concatenation of the given strings.
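For instance, in the following sketch (the names part and both are hypothetical), the arguments are left untouched and the result is a fresh string:

```scheme
(define part "Kawa")
(define both (string-append part " " "Scheme"))

both             ; ⇒ "Kawa Scheme"
part             ; ⇒ "Kawa" (the argument is not modified)
(string-append)  ; ⇒ ""     (no arguments yields an empty string)
```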
Procedure: string-append! string value …
The string must be a mutable string, such as one returned by make-string or string-copy. The string-append! procedure extends string by appending each value (in order) to the end of string. Each value should be a character or a string.
Performance note: The compiler converts a call with multiple values to multiple string-append! calls. If a value is known to be a character, then no boxing (object-allocation) is needed.
The following example shows how to efficiently process a string using string-for-each and incrementally “building” a result string using string-append!.
(define (translate-space-to-newline str::string)::string
  (let ((result (make-string 0)))
    (string-for-each
      (lambda (ch)
        (string-append! result
          (if (char=? ch #\Space) #\Newline ch)))
      str)
    result))
Procedure: string->list string [start [end]]
Procedure: list->string list
It is an error if any element of list is not a character.
The string->list procedure returns a newly allocated list of the characters of string between start and end. The list->string procedure returns a newly allocated string formed from the characters in list. In both procedures, order is preserved. The string->list and list->string procedures are inverses so far as equal? is concerned.
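A brief sketch of the two conversions, using only standard procedures; the round trip through a list is one simple (if not the most efficient) way to reverse a string:

```scheme
(string->list "abc")           ; ⇒ (#\a #\b #\c)
(string->list "abcde" 1 3)     ; ⇒ (#\b #\c)
(list->string (list #\x #\y))  ; ⇒ "xy"

;; Round trip: convert to a list, reverse it, convert back.
(list->string (reverse (string->list "stressed")))  ; ⇒ "desserts"
```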
Procedure: string-for-each proc string1 string2 …
Procedure: string-for-each proc string1 [start [end]]
The strings must all have the same length. proc should accept as many arguments as there are strings.
The start-end variant is provided for compatibility with the SRFI-13 version. (In that case start and end count Unicode scalar values (character values), not Java 16-bit char values.)
The string-for-each procedure applies proc element–wise to the characters of the strings for its side effects, in order from the first characters to the last. proc is always called in the same dynamic environment as string-for-each itself.
Analogous to for-each.
(let ((v '()))
  (string-for-each
    (lambda (c) (set! v (cons (char->integer c) v)))
    "abcde")
  v) ⇒ (101 100 99 98 97)
Performance note: The compiler generates efficient code for string-for-each. If proc is a lambda expression, it is inlined.
Procedure: string-map proc string1 string2 …
The string-map procedure applies proc element-wise to the elements of the strings and returns a string of the results, in order. It is an error if proc does not accept as many arguments as there are strings, or return other than a single character. If more than one string is given and not all strings have the same length, string-map terminates when the shortest string runs out. The dynamic order in which proc is applied to the elements of the strings is unspecified.
(string-map char-foldcase "AbdEgH") ⇒ "abdegh"
(string-map
  (lambda (c) (integer->char (+ 1 (char->integer c))))
  "HAL") ⇒ "IBM"
(string-map
  (lambda (c k)
    ((if (eqv? k #\u) char-upcase char-downcase) c))
  "studlycaps xxx"
  "ululululul") ⇒ "StUdLyCaPs"
Performance note: The string-map procedure has not been optimized (mainly because it is not very useful): The characters are boxed, and the proc is not inlined even if it is a lambda expression.
Procedure: string-copy string [start [end]]
Returns a newly allocated copy of the part of the given string between start and end.
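The optional arguments behave as in the sketch below (variable names are illustrative only). Omitting end copies to the end of the string; omitting both copies the whole string.

```scheme
(define original "abcde")

(define whole  (string-copy original))      ; "abcde"
(define tail   (string-copy original 2))    ; "cde"
(define middle (string-copy original 1 3))  ; "bc"
```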
Procedure: string-replace! dst dst-start dst-end src [src-start [src-end]]
Replaces the characters of string dst (between dst-start and dst-end) with the characters of src (between src-start and src-end). The number of characters from src may be different than the number replaced in dst, so the string may grow or contract. The special case where dst-start is equal to dst-end corresponds to insertion; the case where src-start is equal to src-end corresponds to deletion. The order in which characters are copied is unspecified, except that if the source and destination overlap, copying takes place as if the source is first copied into a temporary string and then into the destination. (This is achieved without allocating storage by making sure to copy in the correct direction in such circumstances.)
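The following sketch walks through replacement, insertion, and deletion. It assumes the argument order given above (destination range first, then the source) and a mutable destination obtained from string-copy; the variable name buf is hypothetical, and the comments show what the description implies rather than captured output.

```scheme
(define buf (string-copy "abcdef"))

;; Replacement: "bc" (indexes 1 to 3) becomes "XYZ", so the string grows.
(string-replace! buf 1 3 "XYZ")   ; buf should now be "aXYZdef"

;; Insertion: dst-start = dst-end, nothing is removed.
(string-replace! buf 0 0 ">> ")   ; buf should now be ">> aXYZdef"

;; Deletion: an empty source removes the selected range.
(string-replace! buf 0 3 "")      ; buf should now be "aXYZdef"
```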
Procedure: string-copy! to at from [start [end]]
Copies the characters of the string from that are between start and end into the string to, starting at index at. The order in which characters are copied is unspecified, except that if the source and destination overlap, copying takes place as if the source is first copied into a temporary string and then into the destination. (This is achieved without allocating storage by making sure to copy in the correct direction in such circumstances.)
This is equivalent to (and implemented as):
(string-replace! to at (+ at (- end start)) from start end)
(define a "12345")
(define b (string-copy "abcde"))
(string-copy! b 1 a 0 2)
b ⇒ "a12de"
Procedure: string-fill! string fill [start [end]]
The string-fill! procedure stores fill in the elements of string between start and end. It is an error if fill is not a character or is forbidden in strings.
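A short sketch of both forms, assuming a mutable string created with make-string (the name bar is hypothetical):

```scheme
(define bar (make-string 5 #\-))

;; Fill only the range [1, 3).
(string-fill! bar #\* 1 3)   ; bar is now "-**--"

;; With no range, every element is overwritten.
(string-fill! bar #\.)       ; bar is now "....."
```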
Procedure: string=? string1 string2 string3 …
Return #t if the strings are the same length and contain the same characters in the same positions. Otherwise, the string=? procedure returns #f.
(string=? "Straße" "Strasse") ⇒ #f
Procedure: string<? string1 string2 string3 …
Procedure: string>? string1 string2 string3 …
Procedure: string<=? string1 string2 string3 …
Procedure: string>=? string1 string2 string3 …
These procedures return #t if their arguments are (respectively): monotonically increasing, monotonically decreasing, monotonically non-decreasing, or monotonically non-increasing. These predicates are required to be transitive.
These procedures are the lexicographic extensions to strings of the corresponding orderings on characters. For example, string<? is the lexicographic ordering on strings induced by the ordering char<? on characters. If two strings differ in length but are the same up to the length of the shorter string, the shorter string is considered to be lexicographically less than the longer string.
(string<? "z" "ß") ⇒ #t
(string<? "z" "zz") ⇒ #t
(string<? "z" "Z") ⇒ #f
Procedure: string-ci=? string1 string2 string3 …
Procedure: string-ci<? string1 string2 string3 …
Procedure: string-ci>? string1 string2 string3 …
Procedure: string-ci<=? string1 string2 string3 …
Procedure: string-ci>=? string1 string2 string3 …
These procedures are similar to string=?, etc., but behave as if they applied string-foldcase to their arguments before invoking the corresponding procedures without -ci.
(string-ci<? "z" "Z") ⇒ #f
(string-ci=? "z" "Z") ⇒ #t
(string-ci=? "Straße" "Strasse") ⇒ #t
(string-ci=? "Straße" "STRASSE") ⇒ #t
(string-ci=? "ΧΑΟΣ" "χαοσ") ⇒ #t
Procedure: string-upcase string
Procedure: string-downcase string
Procedure: string-titlecase string
Procedure: string-foldcase string
These procedures take a string argument and return a string result. They are defined in terms of Unicode’s locale–independent case mappings from Unicode scalar–value sequences to scalar–value sequences. In particular, the length of the result string can be different from the length of the input string. When the specified result is equal in the sense of string=? to the argument, these procedures may return the argument instead of a newly allocated string.
The string-upcase procedure converts a string to upper case; string-downcase converts a string to lower case. The string-foldcase procedure converts the string to its case–folded counterpart, using the full case–folding mapping, but without the special mappings for Turkic languages. The string-titlecase procedure converts the first cased character of each word, and downcases all other cased characters.
(string-upcase "Hi") ⇒ "HI"
(string-downcase "Hi") ⇒ "hi"
(string-foldcase "Hi") ⇒ "hi"
(string-upcase "Straße") ⇒ "STRASSE"
(string-downcase "Straße") ⇒ "straße"
(string-foldcase "Straße") ⇒ "strasse"
(string-downcase "STRASSE") ⇒ "strasse"
(string-downcase "Σ") ⇒ "σ"
; Chi Alpha Omicron Sigma:
(string-upcase "ΧΑΟΣ") ⇒ "ΧΑΟΣ"
(string-downcase "ΧΑΟΣ") ⇒ "χαος"
(string-downcase "ΧΑΟΣΣ") ⇒ "χαοσς"
(string-downcase "ΧΑΟΣ Σ") ⇒ "χαος σ"
(string-foldcase "ΧΑΟΣΣ") ⇒ "χαοσσ"
(string-upcase "χαος") ⇒ "ΧΑΟΣ"
(string-upcase "χαοσ") ⇒ "ΧΑΟΣ"
(string-titlecase "kNock KNoCK") ⇒ "Knock Knock"
(string-titlecase "who's there?") ⇒ "Who's There?"
(string-titlecase "r6rs") ⇒ "R6rs"
(string-titlecase "R6RS") ⇒ "R6rs"
Note: The case mappings needed for implementing these procedures can be extracted from UnicodeData.txt, SpecialCasing.txt, WordBreakProperty.txt (the “MidLetter” property partly defines case–ignorable characters), and CaseFolding.txt from the Unicode Consortium.
Since these procedures are locale–independent, they may not be appropriate for some locales.
Note: Word breaking, as needed for the correct casing of the upper case Greek sigma and for string-titlecase, is specified in Unicode Standard Annex #29.
Kawa Note: The implementation of string-titlecase does not correctly handle the case where an initial character needs to be converted to multiple characters, such as “LATIN SMALL LIGATURE FL” which should be converted to the two letters "Fl".
Procedure: string-normalize-nfd string
Procedure: string-normalize-nfkd string
Procedure: string-normalize-nfc string
Procedure: string-normalize-nfkc string
These procedures take a string argument and return a string result, which is the input string normalized to Unicode normalization form D, KD, C, or KC, respectively. When the specified result is equal in the sense of string=? to the argument, these procedures may return the argument instead of a newly allocated string.
(string-normalize-nfd "\xE9;") ⇒ "\x65;\x301;"
(string-normalize-nfc "\xE9;") ⇒ "\xE9;"
(string-normalize-nfd "\x65;\x301;") ⇒ "\x65;\x301;"
(string-normalize-nfc "\x65;\x301;") ⇒ "\xE9;"
Using function-call syntax with strings is convenient and efficient. However, it has some “gotchas”.
We will use the following example string:
(! str1 "Smile \x1f603;!")
or if you’re brave:
(! str1 "Smile 😃!")
This is "Smile " followed by an emoticon (“smiling face with open mouth”) followed by "!".
The emoticon has scalar value \x1f603 - it is not
in the 16-bit Basic Multilingual Plane,
and so it must be encoded by a surrogate pair
(#\xd83d followed by #\xde03).
The number of scalar values (characters) is 8,
while the number of 16-bit code units (chars) is 9.
The java.lang.CharSequence#length method
counts chars; the length function calls that method;
the string-length procedure counts characters. Thus:
(length str1) ⇒ 9
(str1:length) ⇒ 9
(string-length str1) ⇒ 8
Counting chars is a constant-time operation (since the count
is stored in the data structure), while counting characters
takes time proportional to the length of the string,
since it has to subtract one for each surrogate pair.
Similarly we can index the string in 3 ways:
(str1 1) ⇒ #\m :: character
(str1:charAt 1) ⇒ #\m :: char
(string-ref str1 1) ⇒ #\m :: character
Note that using the function-call syntax returns a character.
Things become interesting when we reach the emoticon:
(str1 6) ⇒ #\😃 :: character
(str1:charAt 6) ⇒ #\d83d :: char
(string-ref str1 6) ⇒ #\😃 :: character
Both string-ref and the function-call syntax return the
real character, while the charAt method returns a partial character.
However, string-ref needs to linearly count from the
start of the string, while the function-call syntax can do a constant-time
lookup. It does this by calling (str1:charAt 6) first.
If that returns a leading-surrogate character, it checks that the
next character (i.e. (str1:charAt 7)) is a trailing-surrogate character,
and if so combines the two. (If there is no next character or it is not
a trailing-surrogate, then indexing just returns the leading-surrogate
partial character.)
In other words, (string-ref s i) returns the i’th character,
while (s i) returns the character at the i’th index.
If the character at the i’th index is a surrogate pair,
then (s (+ i 1)) returns a special pseudo-character
named #\ignorable-char. This pseudo-character should
generally be ignored. (It is automatically ignored by
Kawa functions including write-char, string-append!,
and the string constructor.)
(str1 7) ⇒ #\ignorable-char :: character
(str1 8) ⇒ #\! :: character
(str1:charAt 7) ⇒ #\de03 :: char
(str1:charAt 8) ⇒ #\! :: char
(string-ref str1 7) ⇒ #\! :: character
(string-ref str1 8) ⇒ throws StringIndexOutOfBoundsException
Following are two possible implementations
of a single-string version of string-for-each.
The string-for-each-1 version is simple,
obvious, and traditional - but the execution time is
quadratic in the string length.
The string-for-each-2 version requires filtering
out any #\ignorable-char, but the execution time
is only linear.
(define (string-for-each-1 proc s::string)
  (! slen (string-length s))
  (do ((i ::int 0 (+ i 1))) ((= i slen))
    (proc (string-ref s i))))
(define (string-for-each-2 proc s::string)
  (! slen (length s))
  (do ((i ::int 0 (+ i 1))) ((= i slen))
    (let ((ch (s i)))
      (if (not (char=? ch #\ignorable-char))
          (proc ch)))))
You can index a string with a list of integer indexes, most commonly a range:
(str [i ...])
is basically the same as:
(string (str:charAt i) ...)
This is usually the same as the following:
(string (str i) ...)
(The exception is if you select only part of a surrogate pair.)
Generally when working with strings it is best to work with substrings rather than individual characters:
(str [start <: end])
This is equivalent to invoking the CharSequence:subSequence method:
(str:subSequence start end)
This is much more efficient than the substring procedure,
since the latter has to convert character indexes
to char offsets.
Indexing into a string (using for example string-ref)
is inefficient because of the possible presence of surrogate pairs.
Hence given an index i, access normally requires linearly
scanning the string until we have seen i characters.
The string-cursor API is defined in terms of abstract “cursor values”, which point to a position in the string. This avoids the linear scan.
The API is non-standard, but is based on that in Chibi Scheme.
Type: string-cursor
An abstract position (index) in a string. Implemented as a primitive int which counts the number of preceding code units (16-bit char values).
Procedure: string-cursor-start str
Returns a cursor for the start of the string. The result is always 0, cast to a string-cursor.
Procedure: string-cursor-end str
Returns a cursor for the end of the string - one past the last valid character. Implemented as (as string-cursor (invoke str 'length)).
Procedure: string-cursor-ref str cursor
Return the character at the cursor.
Procedure: string-cursor-next string cursor [count]
Return the cursor position count (default 1) character positions forwards beyond cursor. For each count this may add either 1 or 2 (if pointing at a surrogate pair) to the cursor.
Procedure: string-cursor-prev string cursor [count]
Return the cursor position count (default 1) character positions backwards before cursor.
Procedure: substring-cursor string [start [end]]
Create a substring of the section of string between the cursors start and end.
Procedure: string-cursor<? cursor1 cursor2
Procedure: string-cursor<=? cursor1 cursor2
Procedure: string-cursor=? cursor1 cursor2
Procedure: string-cursor>=? cursor1 cursor2
Procedure: string-cursor>? cursor1 cursor2
Is the position of cursor1 respectively before, before or same, same, after, or after or same, as cursor2.
Performance note: Implemented as the corresponding int comparison.
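Putting the cursor procedures together, a linear-time traversal might look like the sketch below. It relies only on the procedures documented in this section with the argument orders shown; the procedure name count-characters is hypothetical, and the code has not been checked against a particular Kawa release.

```scheme
;; Count the characters (Unicode scalar values) in str by stepping
;; a cursor from the start to the end of the string.
(define (count-characters str::string)
  (let ((end (string-cursor-end str)))
    (let loop ((cur (string-cursor-start str))
               (n 0))
      (if (string-cursor<? cur end)
          ;; string-cursor-next advances by 1 or 2 code units,
          ;; so a surrogate pair is counted once.
          (loop (string-cursor-next str cur) (+ n 1))
          n))))

;; Should agree with (string-length "Smile \x1f603;!"), which is 8.
(count-characters "Smile \x1f603;!")
```

Each step is constant-time, so the whole loop is linear in the length of the string, unlike repeated calls to string-ref.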
Procedure: string-cursor-for-each proc string [start [end]]
Apply the procedure proc to each character position in string between the cursors start and end.