Kawa: Strings

Strings

Strings are sequences of characters. The length of a string is the number of characters that it contains, as an exact non-negative integer. This number is usually fixed when the string is created, however, you can extend a mutable string with the (Kawa-specific) string-append! function. The valid indices of a string are the exact non-negative integers less than the length of the string. The first character of a string has index 0, the second has index 1, and so on.

Strings are implemented as a sequence of 16-bit char values, even though they’re semantically a sequence of 32-bit Unicode code points. A character whose value is greater than #xffff is represented using two surrogate characters. The implementation allows for natural interoperability with Java APIs. However it does make certain operations (indexing or counting based on character counts) difficult to implement efficiently. Luckily one rarely needs to index or count based on character counts; alternatives are discussed below.

Some of the procedures that operate on strings ignore the difference between upper and lower case. The names of the versions that ignore case end with “-ci” (for “case insensitive”).

Type: string

The type of string objects. The underlying type is the interface java.lang.CharSequence. Immultable strings are java.lang.String, while mutable strings are gnu.lists.FString.

Basic string procedures

Procedure: string? obj

Return #t if obj is a string, #f otherwise.

Constructor: string char …

Return a newly allocated string composed of the arguments. This is analogous to list.

Procedure: make-string k

Procedure: make-string k char

Return a newly allocated string of length k. If char is given, then all elements of the string are initialized to char, otherwise the contents of the string are unspecified.

Procedure: string-length string

Return the number of characters in the given string as an exact integer object.

Performance note: Calling string-length may take time propertial to the length of the string, because of the need to scan for surrogate pairs.

Procedure: string-ref string k

k must be a valid index of string. The string-ref procedure returns character k of string using zero–origin indexing.

Performance note: Calling string-ref may take time propertial to k because of the need to check for surrogate pairs. An alternative is to use string-cursor-ref. If iterating through a string, use string-for-each.

Procedure: string-set! string k char

This procedure stores char in element k of string.
(define s1 (make-string 3 #\*))
(define s2 "***")
(string-set! s1 0 #\?) ⇒ void
s1 ⇒ "?**"
(string-set! s2 0 #\?) ⇒ error
(string-set! (symbol->string 'immutable) 0 #\?) ⇒ error
Performance note: Calling string-set! may take time propertial to the length of the string: First it must scan for the right position, like string-ref does. Then if the new character requires using a surrogate pair (and the old one doesn’t) then we have to make rom in the string, possible re-allocating a new char array. Alternatively, if the old character requires using a surrogate pair (and the new one doesn’t) then following characters need to be moved.

The function string-set! is deprecated: It is inefficient, and it very seldom does the correct thing. Instead, you can construct a string with string-append!.

Procedure: substring string start end

string must be a string, and start and end must be exact integer objects satisfying:
0 <= start <= end <= (string-length string)
The substring procedure returns a newly allocated string formed from the characters of string beginning with index start (inclusive) and ending with index end (exclusive).

Procedure: string-append string …

Return a newly allocated string whose characters form the concatenation of the given strings.

Procedure: string-append! string value …

The string must be a mutable string, such as one returned by make-string or string-copy. The string-append! procedure extends string by appending each value (in order) to the end of string. Each value should be a character or a string.

Performance note: The compiler converts a call with multiple values to a multiple string-append! calls. If a value is known to be a character, then no boxing (object-allocation) is needed.

The following example show to to efficiently process a string using string-for-each and incrementally “building” a result string using string-append!.
(define (translate-space-to-newline str::string)::string
  (let ((result (make-string 0)))
    (string-for-each
     (lambda (ch)
       (string-append! result
                       (if (char=? ch #\Space) #\Newline ch)))
     str)
    result))

Procedure: string->list string [start [end]]

Procedure: list->string list

It is an error if any element of list is not a character.

The string->list procedure returns a newly allocated list of the characters of string between start and end. The list->string procedure returns a newly allocated string formed from the characters in list. In both procedures, order is preserved. The string->list and list->string procedures are inverses so far as equal? is concerned.

Procedure: string-for-each proc string₁ string₂ …

Procedure: string-for-each proc string₁ [start [end]]

The strings must all have the same length. proc should accept as many arguments as there are strings.

The start-end variant is provided for compatibility with the SRFI-13 version. (In that case start and end count code Unicode scalar values (character values), not Java 16-bit char values.)

The string-for-each procedure applies proc element–wise to the characters of the strings for its side effects, in order from the first characters to the last. proc is always called in the same dynamic environment as string-for-each itself.

Analogous to for-each.
(let ((v '()))
  (string-for-each
    (lambda (c) (set! v (cons (char->integer c) v)))
    "abcde")
   v)
  ⇒ (101 100 99 98 97)
Performance note: The compiler generates efficient code for string-for-each. If proc is a lambda expression, it is inlined,

Procedure: string-map proc string₁ string₂ …

The string-map procedure applies proc element-wise to the elements of the strings and returns a string of the results, in order. It is an error if proc does not accept as many arguments as there are strings, or return other than a single character. If more than one string is given and not all strings have the same length, string-map terminates when the shortest string runs out. The dynamic order in which proc is applied to the elements of the strings is unspecified.
(string-map char-foldcase "AbdEgH")  ⇒ "abdegh"
(string-map
  (lambda (c) (integer->char (+ 1 (char->integer c))))
  "HAL")
        ⇒ "IBM"
(string-map
  (lambda (c k)
    ((if (eqv? k #\u) char-upcase char-downcase) c))
  "studlycaps xxx"
  "ululululul")
        ⇒ "StUdLyCaPs"
Performance note: The string-map procedure has not been optimized (mainly because it is not very useful): The characters are boxed, and the proc is not inlined even if a lambda expression.

Procedure: string-copy string [start [end]]

Returns a newly allocated copy of the the part of the given string between start and end.

Procedure: string-replace! dst dst-start dst-end src [src-start [src-end]]

Replaces the characters of string dst (between dst-start and dst-end) with the characters of src (between src-start and src-end). The number of characters from src may be different than the number replaced in dst, so the string may grow or contract. The special case where dst-start is equal to dst-end corresponds to insertion; the case where src-start is equal to src-end corresponds to deletion. The order in which characters are copied is unspecified, except that if the source and destination overlap, copying takes places as if the source is first copied into a temporary string and then into the destination. (This is achieved without allocating storage by making sure to copy in the correct direction in such circumstances.)

Procedure: string-copy! to at from [start [end]]

Copies the characters of the string from that are between start end end into the string to, starting at index at. The order in which characters are copied is unspecified, except that if the source and destination overlap, copying takes places as if the source is first copied into a temporary string and then into the destination. (This is achieved without allocating storage by making sure to copy in the correct direction in such circumstances.)

This is equivalent to (and implemented as):
(string-replace! to at (+ at (- end start)) from start end))
(define a "12345")
(define b (string-copy "abcde"))
(string-copy! b 1 a 0 2)
b  ⇒  "a12de"

Procedure: string-fill! string fill [start [end]]

The string-fill! procedure stores fill in the elements of string between start and end. It is an error if fill is not a character or is forbidden in strings.

String Comparisons

Procedure: string=? string₁ string₂ string₃ …

Return #t if the strings are the same length and contain the same characters in the same positions. Otherwise, the string=? procedure returns #f.
(string=? "Straße" "Strasse")    ⇒ #f

Procedure: string<? string₁ string₂ string₃ …

Procedure: string>? string₁ string₂ string₃ …

Procedure: string<=? string₁ string₂ string₃ …

Procedure: string>=? string₁ string₂ string₃ …

These procedures return #t if their arguments are (respectively): monotonically increasing, monotonically decreasing, monotonically non-decreasing, or monotonically nonincreasing. These predicates are required to be transitive.

These procedures are the lexicographic extensions to strings of the corresponding orderings on characters. For example, string<? is the lexicographic ordering on strings induced by the ordering char<? on characters. If two strings differ in length but are the same up to the length of the shorter string, the shorter string is considered to be lexicographically less than the longer string.
(string<? "z" "ß")      ⇒ #t
(string<? "z" "zz")     ⇒ #t
(string<? "z" "Z")      ⇒ #f

Procedure: string-ci=? string₁ string₂ string₃ …

Procedure: string-ci<? string₁ string₂ string₃ …

Procedure: string-ci>? string₁ string₂ string₃ …

Procedure: string-ci<=? string₁ string₂ string₃ …

Procedure: string-ci>=? string₁ string₂ string₃ …

These procedures are similar to string=?, etc., but behave as if they applied string-foldcase to their arguments before invokng the corresponding procedures without -ci.
(string-ci<? "z" "Z")                   ⇒ #f
(string-ci=? "z" "Z")                   ⇒ #t
(string-ci=? "Straße" "Strasse")        ⇒ #t
(string-ci=? "Straße" "STRASSE")        ⇒ #t
(string-ci=? "ΧΑΟΣ" "χαοσ")             ⇒ #t

String Conversions

Procedure: string-upcase string

Procedure: string-downcase string

Procedure: string-titlecase string

Procedure: string-foldcase string

These procedures take a string argument and return a string result. They are defined in terms of Unicode’s locale–independent case mappings from Unicode scalar–value sequences to scalar–value sequences. In particular, the length of the result string can be different from the length of the input string. When the specified result is equal in the sense of string=? to the argument, these procedures may return the argument instead of a newly allocated string.

The string-upcase procedure converts a string to upper case; string-downcase converts a string to lower case. The string-foldcase procedure converts the string to its case–folded counterpart, using the full case–folding mapping, but without the special mappings for Turkic languages. The string-titlecase procedure converts the first cased character of each word, and downcases all other cased characters.
(string-upcase "Hi")              ⇒ "HI"
(string-downcase "Hi")            ⇒ "hi"
(string-foldcase "Hi")            ⇒ "hi"

(string-upcase "Straße")          ⇒ "STRASSE"
(string-downcase "Straße")        ⇒ "straße"
(string-foldcase "Straße")        ⇒ "strasse"
(string-downcase "STRASSE")       ⇒ "strasse"

(string-downcase "Σ")             ⇒ "σ"
; Chi Alpha Omicron Sigma:
(string-upcase "ΧΑΟΣ")            ⇒ "ΧΑΟΣ"
(string-downcase "ΧΑΟΣ")          ⇒ "χαος"
(string-downcase "ΧΑΟΣΣ")         ⇒ "χαοσς"
(string-downcase "ΧΑΟΣ Σ")        ⇒ "χαος σ"
(string-foldcase "ΧΑΟΣΣ")         ⇒ "χαοσσ"
(string-upcase "χαος")            ⇒ "ΧΑΟΣ"
(string-upcase "χαοσ")            ⇒ "ΧΑΟΣ"

(string-titlecase "kNock KNoCK")  ⇒ "Knock Knock"
(string-titlecase "who's there?") ⇒ "Who's There?"
(string-titlecase "r6rs")         ⇒ "R6rs"
(string-titlecase "R6RS")         ⇒ "R6rs"
Note: The case mappings needed for implementing these procedures can be extracted from UnicodeData.txt, SpecialCasing.txt, WordBreakProperty.txt (the “MidLetter” property partly defines case–ignorable characters), and CaseFolding.txt from the Unicode Consortium.

Since these procedures are locale–independent, they may not be appropriate for some locales.

Note: Word breaking, as needed for the correct casing of the upper case greek sigma and for string-titlecase, is specified in Unicode Standard Annex #29.

Kawa Note: The implementation of string-titlecase does not correctly handle the case where an initial character needs to be converted to multiple characters, such as “LATIN SMALL LIGATURE FL” which should be converted to the two letters "Fl".

Procedure: string-normalize-nfd string

Procedure: string-normalize-nfkd string

Procedure: string-normalize-nfc string

Procedure: string-normalize-nfkc string

These procedures take a string argument and return a string result, which is the input string normalized to Unicode normalization form D, KD, C, or KC, respectively. When the specified result is equal in the sense of string=? to the argument, these procedures may return the argument instead of a newly allocated string.
(string-normalize-nfd "\xE9;")          ⇒ "\x65;\x301;"
(string-normalize-nfc "\xE9;")          ⇒ "\xE9;"
(string-normalize-nfd "\x65;\x301;")    ⇒ "\x65;\x301;"
(string-normalize-nfc "\x65;\x301;")    ⇒ "\xE9;"

Strings as sequences

Indexing a string

Using function-call syntax with strings is convenient and efficient. However, it has some “gotchas”.

We will use the following example string:

(! str1 "Smile \x1f603;!")

or if you’re brave:

(! str1 "Smile 😃!")

This is "Smile " followed by an emoticon (“smiling face with open mouth”) followed by "!". The emoticon has scalar value \x1f603 - it is not in the 16-bit Basic Multi-language Plane, and so it must be encoded by a surrogate pair (#\xd83d followed by #\xde03).

The number of scalar values (characters) is 8, while the number of 16-bits code units (chars) is 9. The java.lang.CharSequence#length method counts chars; the length function calls that method; the string-length procedure counts characters. Thus:

(length str1)          ⇒ 9
(str1:length)          ⇒ 9
(string-length str1)   ⇒ 8

Counting chars is a constant-time operation (since it is stored in the data structure), while counting characters takes time propertional to the length of the string, since it has subtract one for each surrogate pair.

Similarly we can can index the string in 3 ways:

(str1 1)              ⇒ #\m :: character
(str1:charAt 1)       ⇒ #\m :: char
(string-ref str1 1)   ⇒ #\m :: character

Note using the function-call syntax returns a character.

Things become interesting when we reach the emoticon:

(str1 6)              ⇒ #\😃 :: character
(str1:charAt 6)       ⇒ #\d83d :: char
(string-ref str1 6)   ⇒ #\😃 :: character

Both string-ref and the function-call syntax return the real character, while the charAt methods returns a partial character. However, string-ref needs to linearly count from the start of the string, while the function-call syntax can do a constant-time lookup. It does this by calling (str1:charAt 6) first. If that returns a leading-surrogate character, it checks that the next character (i.e. (str1:charAt 7)) is a trailing-surrogate character, and if so combines the two. (If there is no next character or it is not a trailing-surrogate, then indexing just returns the leading-surrogate partial character.)

In other words (string-ref s i) returns the i’th character, while (s i) return the character at the i’th index.

If the character at the i’th index is a surrogate pair, then (s (+ i 1)) returns a special pseudo-character named #\ignorable-char. This pseudo-character should generally be ignored. (It is automatically ignored by Kawa functions including write-char, string-append!, and the string constructor.)

(str1 7)              ⇒ #\ignorable-char :: character
(str1 8)              ⇒ #\! :: character
(str1:charAt 7)       ⇒ #\de03 :: char
(str1:charAt 8)       ⇒ #\! :: char
(string-ref str1 7)   ⇒ #\! :: character
(string-ref str1 8)   ⇒ throws StringIndexOutOfBoundsException

Following are two possible implementations of a single-string version of string-for-each. The string-for-each-1 version is simple, obvious, and traditional - but the execution time is quadratic in the string length. The string-for-each-2 version requires filtering out any #\ignorable-char, but the execution time is only linear.

(define (string-for-each-1 proc s::string)
  (! slen (string-length s))
  (do ((i ::int 0 (+ i 1))) ((= i slen))
    (proc (string-ref s i))))

(define (string-for-each-2 proc s::string)
  (! slen (length s))
  (do ((i ::int 0 (+ i 1))) ((= i slen))
    (let ((ch (s i)))
      (if (not (char=? ch #\ignorable-char))
          (proc ch)))))

Indexing with a sequence

You can index a string with a list of integer indexes, most commonly a range:

(str [i ...])

is basically the same as:

(string (str:charAt i) ...)

This is usually the same as the following:

(string (str i) ...)

(The exception is if you select only part of a surrogate pair.)

Generally when working with strings it is best to work with substrings rather than individual characters:

(str [start <: end])

This is equivalent to invoking the CharSequence:subSequence method:

(str:subSequence start end)

This is much more efficient than the substring procedure, since the latter has to convert character indexes to char offsets.

String Cursor API

Indexing into a string (using for example string-ref) is inefficient because of the possible presence of surrogate pairs. Hence given an index i access normally requires linearly scanning the string until we have seen i characters.

The string-cursor API is defined in terms of abstract “cursor values”, which point to a position in the string. This avoids the linear scan.

The API is non-standard, but is based on that in Chibi Scheme.

Type: string-cursor

An abstract posistion (index) in a string. Implemented as a primitive int which counts the number of preceding code units (16-bit char values).

Procedure: string-cursor-start str

Returns a cursor for the start of the string. The result is always 0, cast to a string-cursor.

Procedure: string-cursor-end str

Returns a cursor for the end of the string - one past the last valid character. Implemented as (as string-cursor (invoke str 'length)).

Procedure: string-cursor-ref str cursor

Return the character at the cursor.

Procedure: string-cursor-next string cursor [count]

Return the cursor position count (default 1) character positions forwards beyond cursor. For each count this may add either 1 or 2 (if pointing at a surrogate pair) to the cursor.

Procedure: string-cursor-prev string cursor [count]

Return the cursor position count (default 1) character positions backwards before cursor.

Procedure: substring-cursor string [start [end]]

Create a substring of the section of string between the cursors start and end.

Procedure: string-cursor<? cursor1 cursor2

Procedure: string-cursor<=? cursor1 cursor2

Procedure: string-cursor=? cursor1 cursor2

Procedure: string-cursor>=? cursor1 cursor2

Procedure: string-cursor>? cursor1 cursor2

Is the position of cursor1 respectively before, before or same, same, after, or after or same, as cursor2.

Performance note: Implemented as the corresponding int comparison.

Procedure: string-cursor-for-each proc string [start [end]]

Apply the procedure proc to each character position in string between the cursors start and end.