mem2sz, mem2zsz, str2sz, str2zsz, szencode, str_decode, szfree, sztrunc, sztail, szchr, szschr, szcmp, szcspn, szgetp, szicmp, szindex, szkill, szlen, szncmp, sznicmp, szrindex, szspn, szcat, szccat, szcpy, szdup, szncat, szncpy, szpbrk, szsrchr, szrchr, szsbrk, szsep, szsz, sztok, sztr, mmspn, mmcspn, szdata, szstats, szunzen, szzen - handle non-null-terminated strings
#include <sz.h>
sz *mem2sz(char *buf, size_t len);
sz *mem2zsz(char *buf, size_t len);
sz *str2sz(char *str);
sz *str2zsz(char *str);
char *szencode(void *s);
sz *str_decode(char *str);
void szfree(sz *s);
void sztrunc(sz *s, size_t len);
sz *sztail(void *s, size_t len);
sz *szchr(void *s, int c);
char *szschr(void *s, int c);
int szcmp(void *s1, void *s2);
size_t szcspn(void *s, void *charset);
sz *szgetp(void *v);
int szicmp(void *s1, void *s2);
int szindex(void *s, int c);
void szkill(sz *s);
size_t szlen(void *s);
int szncmp(void *s1, void *s2, size_t len);
int sznicmp(void *s1, void *s2, size_t len);
int szrindex(void *s, int c);
size_t szspn(void *s, void *charset);
sz *szcat(void *dest, void *s);
sz *szccat(void *dest, int c);
sz *szcpy(void *dest, void *src);
sz *szdup(void *s);
sz *szncat(void *dest, void *s, size_t len);
sz *szncpy(void *dest, void *src, size_t len);
sz *szpbrk(void *s, void *set);
char *szsrchr(void *s, int c);
sz *szrchr(void *s, int c);
sz *szsep(sz **stringp, void *delim);
char *szsbrk(void *s, void *set);
sz *szsz(void *s1, void *s2);
sz *sztok(void *string, void *delim);
sz *sztr(void *s, void *from, void *to);
size_t mmspn(char *str, void *charset, size_t len);
size_t mmcspn(char *str, void *charset, size_t len);
char *szdata(void *s);
void szstats(void);
sz *szunzen(sz *s);
sz *szzen(sz *s);
These functions implement a string-like type. The intent is that, in most ways, you can use an object of type sz * as though it were an object of type char *, except, of course, for dereferencing it. The functions behave much like their analogues among the standard library str*() functions, with similar semantics. By design, the implementation of the sz type is opaque; client programs cannot refer to the internals, in case the mechanisms are altered later.
When a function in this library takes a "void *" argument, it typically can accept either a normal C string, or an sz *. The limitation is that, if you pass a normal C string beginning with a magic character sequence (currently 0xFF, 0x01), the library may behave unexpectedly. Otherwise, the string will be silently converted for internal use. This allocates a temporary sz. The temporary object will be deleted automatically, unless the function returns a reference to it. (For instance, szchr may return a reference into an object that the caller has no handle for.) For this reason, it is best not to pass a plain C string for any argument into which the function returns a reference.
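Because a plain C string that happens to begin with the magic sequence can confuse the library, a caller handling untrusted input may want to test for it first. The following is a minimal sketch and not part of the library: the name starts_with_sz_magic() is hypothetical, and the two byte values are the ones quoted above, which are described as current rather than permanent.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper, not part of sz.h.  The byte values are the ones
 * this page lists as the current magic sequence; they may change. */
int starts_with_sz_magic(const char *str)
{
    const unsigned char *p = (const unsigned char *)str;

    assert(str != NULL);
    /* The first byte is tested before the second, so an empty or
     * one-byte string is never read past its terminating NUL. */
    return p[0] == 0xFF && p[1] == 0x01;
}
```

A caller could reject or copy such input before handing it to a function that takes a "void *" argument.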
The major differences between the unterminated strings and standard C strings are simple. The unterminated variety does not need to have a NUL character at its end, and automatically adapts in size when concatenated to. The terminated variety has much lower overhead. The semantics are largely identical; for instance, szchr() produces an object which, if truncated, truncates the original string, just as you would expect with the result of strchr().
However, many of these functions silently allocate space. The space thus allocated is tracked; all substrings of a given string are deleted when the object itself is deleted, with szfree(), but they will consume memory until then, or until they are explicitly deleted.
Most of these functions have names starting with sz, which is intended, in naming, to correspond roughly to the str prefix used in <string.h>. In general, functions which return a C-style string have an infix 's' immediately after the 'sz' prefix. A 'c' indicates a character (passed as an int).
sz *szpbrk(void *s, void *set); /* analogue to strpbrk */
char *szsbrk(void *s, void *set); /* returns (char *) */
Some functions refer to a case-insensitive matching operation; this is indicated by an 'i' infix in the name, and implemented by treating all alphabetical characters as if they were lower case, which may have surprising results.
Strings may have parents or children. When a substring is created, it has a parent, which is the string it is a substring of; it is added to the list of children of that parent.
Modifications of substrings propagate to the parent, to its parent, and so on, and then on back down to all children. Not every modification of a given string has a real effect on its substrings.
Some strings (notably, substrings of other strings) have a magic bit set called the zen bit. This bit indicates that the given string does not actually own its storage; the space it points to was not allocated for it, and, when it is deleted, it will not attempt to free that memory. There are entry points to create zen strings pointing at user-provided space; these would be used to avoid copying string literals, for instance.
Attempts to modify a zen string with no parent will cause it to become a normal string, with allocated storage, which contains a copy of the original data.
All children are considered zen strings, and have no data storage of their own.
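The copy-on-modify rule for a parentless zen string can be modelled in a few lines. This is a sketch under stated assumptions, not the library's implementation: struct view and the view_*() functions are hypothetical stand-ins for the opaque sz type, and parents and children are left out entirely.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for the opaque sz type; the real layout is private. */
struct view {
    char  *data;   /* bytes, not necessarily NUL-terminated */
    size_t len;    /* number of valid bytes */
    int    zen;    /* nonzero: data is borrowed and never freed here */
};

/* Wrap caller-owned memory without copying it (a "zen" view). */
struct view view_borrow(char *buf, size_t len)
{
    struct view v = { buf, len, 1 };
    return v;
}

/* Append one byte.  A borrowed view is first replaced by a private copy,
 * so the caller's buffer is never written to.  Returns 0, or -1 if
 * memory is exhausted (the view is then left unchanged). */
int view_append(struct view *v, int c)
{
    char *p;

    assert(v != NULL);
    if (v->zen) {
        p = malloc(v->len + 1);
        if (p == NULL)
            return -1;
        if (v->len > 0)
            memcpy(p, v->data, v->len);
    } else {
        p = realloc(v->data, v->len + 1);
        if (p == NULL)
            return -1;
    }
    p[v->len] = (char)(unsigned char)c;
    v->data = p;
    v->len += 1;
    v->zen = 0;    /* the view now owns its storage */
    return 0;
}

/* Release storage only when the view owns it. */
void view_free(struct view *v)
{
    assert(v != NULL);
    if (!v->zen)
        free(v->data);
    v->data = NULL;
    v->len = 0;
}
```

After the first view_append() the view no longer refers to the caller's buffer, which matches the statement that a modified zen string becomes a normal string holding a copy of the original data.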
The functions mem2sz() and str2sz() create sz's from existing memory. mem2sz() copies len bytes from buf into newly allocated space; str2sz() copies bytes from str into newly allocated space, until it hits a terminating NUL byte, which is not copied.
The functions mem2zsz() and str2zsz() are equivalent to mem2sz() and str2sz(), except that they do not copy the space, but simply maintain a pointer to it. If the space is deallocated before the string is deleted, referencing the string will invoke undefined behavior.
The functions szgetp() and szkill() handle semi-automatic translations from strings. The argument to szgetp() is either a plain C string, or an sz object. If it is a string, a zen wrapper is put on it; otherwise, the original object is returned. szgetp() returns a null pointer on error, or if the object appears to be another sort of magically wrapped object. The function szkill() will destroy any sz object which has not been passed through szgetp() at least once. These functions allow the automatic deletion of temporary wrapper strings.
The functions szencode() and str_decode() convert between unterminated strings and regular strings of a normalized format. Unprintable characters, including NUL bytes, are translated to C-style escape sequences by szencode(), and C-style escape sequences are translated back into the bytes they represent by str_decode(). (The underscore prevents the library from stepping on the compiler's namespace.)
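To make the idea of a normalized, printable form concrete, here is a small encoder for a length-delimited buffer. It is only a sketch: encode_bytes() is a hypothetical name, and the escape format (a doubled backslash, and three-digit octal for every unprintable byte) is an assumption, since this page does not specify which escapes szencode() emits.

```c
#include <assert.h>
#include <ctype.h>
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical encoder; the escape format used here is an assumption.
 * Returns a malloc()ed, NUL-terminated string, or NULL on failure. */
char *encode_bytes(const unsigned char *buf, size_t len)
{
    char *out, *w;
    size_t i;

    assert(buf != NULL || len == 0);
    out = malloc(len * 4 + 1);   /* worst case: every byte becomes \ooo */
    if (out == NULL)
        return NULL;
    w = out;
    for (i = 0; i < len; i++) {
        if (buf[i] == '\\') {
            *w++ = '\\';         /* a literal backslash is doubled */
            *w++ = '\\';
        } else if (isprint(buf[i])) {
            *w++ = (char)buf[i];
        } else {
            /* Always three octal digits, so a digit that follows is
             * never mistaken for part of the escape. */
            w += snprintf(w, 5, "\\%03o", (unsigned)buf[i]);
        }
    }
    *w = '\0';
    return out;
}
```

The result contains no NUL bytes before its terminator, so it can be printed or stored as an ordinary C string; the caller frees it with free().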
Deleting strings is accomplished by means of szfree(), which deletes the string given to it, and any children that string may have. If the string was a zen string, this is all that happens; otherwise, the allocated memory is freed.
The sztrunc() function truncates the string referred to by s to len bytes. This truncation affects children as follows; if it specifies a location inside the child string, it truncates the child string to the same point. Otherwise, it has no effect. It is semantically equivalent to the assignment
s[len] = '\0';
for a normal C-style string.
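The effect of a truncation on a child can be expressed as arithmetic on offsets. The sketch below is a model only: struct child and child_after_trunc() are hypothetical, and treating a cut exactly at the start of the child as "inside" it (leaving an empty child) is an assumption, since the text does not cover that boundary.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical model of a child: a window onto its parent's bytes. */
struct child {
    size_t off;   /* where the child starts within the parent */
    size_t len;   /* how many bytes the child covers */
};

/* Apply a truncation of the parent to new_len bytes.  If the cut falls
 * inside the child's window, the child is shortened to end at the cut;
 * otherwise it is left alone, as described above. */
void child_after_trunc(struct child *c, size_t new_len)
{
    assert(c != NULL);
    if (new_len >= c->off && new_len < c->off + c->len)
        c->len = new_len - c->off;
}
```

For a child covering bytes 4 through 9 of its parent, truncating the parent to 7 bytes leaves the child 3 bytes long.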
The sztail() function returns a substring of s offset by len bytes. It is analogous to the C expression
(s + len)
for a normal C-style string. Modifications of the tail affect the original string.
The functions szchr(), szcmp(), szcspn(), szlen(), szncmp(), szspn(), szcat(), szcpy(), szncat(), szncpy(), szpbrk(), szrchr(), szsz(), and sztok() perform functions analogous to their str*() counterparts; the only distinction is that functions which, for C-style strings, return a pointer into the string, actually create a substring in this library. (As noted above, this substring is deleted when the parent is deleted.)
The functions szindex(), szrindex() and szipbrk() are equivalent to szchr(), szrchr() and szpbrk() respectively, except that they return an offset into the original string, or -1 if c is not found in s. They do not allocate memory.
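The offset-or-minus-one convention is easy to illustrate on a length-delimited buffer, where a search must not stop at an embedded NUL. This is a sketch using only the standard library; mem_index() and mem_rindex() are hypothetical names and are not how the library is known to be implemented.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical helper: offset of the first byte equal to c in a
 * length-delimited buffer, or -1 if there is none. */
int mem_index(const char *buf, size_t len, int c)
{
    const char *hit;

    assert(buf != NULL || len == 0);
    hit = len ? memchr(buf, c, len) : NULL;
    return hit ? (int)(hit - buf) : -1;
}

/* Hypothetical helper: offset of the last byte equal to c, or -1. */
int mem_rindex(const char *buf, size_t len, int c)
{
    size_t i;

    assert(buf != NULL || len == 0);
    for (i = len; i > 0; i--)
        if ((unsigned char)buf[i - 1] == (unsigned char)c)
            return (int)(i - 1);
    return -1;
}
```

Because the result is a plain integer, nothing is allocated and nothing needs to be freed afterwards.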
The functions szschr(), szsrchr() and szsbrk() are equivalent to szchr(), szrchr() and szpbrk() respectively, except that they return a pointer to the character which matched, rather than a substring. They do not allocate memory.
The functions szicmp() and sznicmp() are equivalent to szcmp() and szncmp() respectively, except that they attempt a case-insensitive comparison; the implications of this are ill-defined for many character sets, but the intent is that the strings be compared as though case-mashed with tolower(). Likewise, the functions szsicmp() and szsnicmp() are equivalent to szscmp() and szsncmp() respectively, with the same difference.
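Comparing "as though case-mashed with tolower()" can be sketched for two length-delimited buffers as follows. mem_icmp() is a hypothetical name; the only behavior taken from this page is that each byte is lowered before comparison.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>

/* Hypothetical helper: compare two length-delimited buffers as though
 * every byte had been passed through tolower().  Returns <0, 0 or >0. */
int mem_icmp(const char *a, size_t alen, const char *b, size_t blen)
{
    size_t i, n = alen < blen ? alen : blen;

    assert((a != NULL || alen == 0) && (b != NULL || blen == 0));
    for (i = 0; i < n; i++) {
        int ca = tolower((unsigned char)a[i]);
        int cb = tolower((unsigned char)b[i]);
        if (ca != cb)
            return ca < cb ? -1 : 1;
    }
    /* Equal up to the shorter length: the shorter buffer sorts first. */
    return alen < blen ? -1 : alen > blen;
}
```

One surprising consequence in ASCII: byte-wise, "Z" sorts before "_", but once lowered to "z" it sorts after it, so the ordering of letters relative to punctuation can differ from that of a case-sensitive comparison.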
The function szccat() concatenates the single character c onto the string dest and returns dest. It treats c as though it were an unsigned char converted to int.
The function sztr() performs a function similar to that of the UNIX utility tr. For each character in the string s, if that character occurs in from, it is replaced with the character in the same position in the string to. In both from and to, ranges (of the form x-y) are interpreted to mean all characters from x to y inclusive. (Using the local character set; ASCII collation is not guaranteed.)
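The translation rule, including range expansion, can be sketched on a plain byte buffer. This is an illustration under assumptions, not the library's code: tr_bytes() and expand_set() are hypothetical names, a byte whose position in from has no counterpart in to is left unchanged (this page does not say what sztr() does in that case), and ranges are expanded by character code.

```c
#include <assert.h>
#include <stddef.h>

#define SET_MAX 256

/* Expand a set such as "a-cx" into out[] ("abcx") and return its length.
 * out must have room for SET_MAX entries; longer sets are cut short. */
static size_t expand_set(const char *set, unsigned char *out)
{
    const unsigned char *p = (const unsigned char *)set;
    size_t n = 0;

    while (*p != '\0' && n < SET_MAX) {
        if (p[1] == '-' && p[2] != '\0' && p[2] >= p[0]) {
            unsigned int c;
            for (c = p[0]; c <= p[2] && n < SET_MAX; c++)
                out[n++] = (unsigned char)c;
            p += 3;
        } else {
            out[n++] = *p++;
        }
    }
    return n;
}

/* Replace each byte of buf found in from with the byte at the same
 * position in to.  Operates in place on len bytes. */
void tr_bytes(char *buf, size_t len, const char *from, const char *to)
{
    unsigned char f[SET_MAX], t[SET_MAX];
    size_t nf, nt, i, j;

    assert((buf != NULL || len == 0) && from != NULL && to != NULL);
    nf = expand_set(from, f);
    nt = expand_set(to, t);
    for (i = 0; i < len; i++) {
        for (j = 0; j < nf; j++) {
            if ((unsigned char)buf[i] == f[j]) {
                if (j < nt)          /* no counterpart: leave as is */
                    buf[i] = (char)t[j];
                break;
            }
        }
    }
}
```

With from "a-z" and to "A-Z" on an ASCII system, this upper-cases the buffer; bytes outside the from set, including embedded NULs, pass through untouched.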
The function szdup(), much like the strdup() function provided by some libraries, produces a duplicate of the provided string; it returns NULL if no memory is available. The duplicate string will have no parent, and will have its own duplicate of the storage of the original. It will not have the zen bit set.
The szsep() function, much like the strsep() provided by some libraries, is an alternative to sztok(). It runs through the string pointed to by stringp, looking for any instance of a character in delim. When it finds one, it stores a string starting one character after the delimiter into stringp, truncates the original string at the delimiter, and returns the original string. This allows detection of empty fields, such as those in a traditional UNIX password file. If *stringp is initially NULL, szsep() returns NULL.
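The way this style of splitting exposes empty fields can be shown with a cursor over length-delimited bytes. The sketch is not the library's szsep(): struct span and span_sep() are hypothetical, the input is never modified (szsep() truncates the original string), and a NUL byte cannot act as a delimiter because delim is a plain C string here.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical cursor over length-delimited bytes. */
struct span {
    const char *ptr;   /* NULL once the input is exhausted */
    size_t      len;
};

/* Return the next field of *rest and advance *rest past the delimiter
 * that ended it.  When no delimiter remains, the final field is returned
 * and rest->ptr becomes NULL; the call after that returns a NULL span. */
struct span span_sep(struct span *rest, const char *delim)
{
    struct span field;
    size_t i;

    assert(rest != NULL && delim != NULL);
    field = *rest;
    if (rest->ptr == NULL)
        return field;
    for (i = 0; i < rest->len; i++) {
        char ch = rest->ptr[i];
        /* strchr() would match the terminator for ch == 0, so skip it. */
        if (ch != '\0' && strchr(delim, ch) != NULL) {
            field.len = i;
            rest->ptr += i + 1;
            rest->len -= i + 1;
            return field;
        }
    }
    rest->ptr = NULL;
    rest->len = 0;
    return field;
}
```

Splitting the seven bytes of "root::0" on ":" yields "root", an empty field with a non-NULL pointer, and "0"; a tokenizer in the style of strtok() would skip the empty field.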
The functions mmspn() and mmcspn() are analogous to strspn() and strcspn() from the standard library, and operate on bytes of memory. They return at most len, if no characters from charset occur within str. Their names are not memspn() and memcspn() because that would be implementation namespace.
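A length-bounded span search differs from strspn() and strcspn() mainly in that it does not stop at a NUL byte. The following sketch is modelled on the description above rather than on the library's source: span_in() and span_not_in() are hypothetical names, and charset is restricted to a plain C string, whereas the real functions take a "void *".

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: length of the leading run of bytes in str that
 * all occur in charset, examining at most len bytes. */
size_t span_in(const char *str, const char *charset, size_t len)
{
    size_t i;

    assert(charset != NULL && (str != NULL || len == 0));
    for (i = 0; i < len; i++)
        if (str[i] == '\0' || strchr(charset, str[i]) == NULL)
            break;
    return i;
}

/* Hypothetical helper: length of the leading run of bytes in str that
 * do not occur in charset, examining at most len bytes. */
size_t span_not_in(const char *str, const char *charset, size_t len)
{
    size_t i;

    assert(charset != NULL && (str != NULL || len == 0));
    for (i = 0; i < len; i++)
        if (str[i] != '\0' && strchr(charset, str[i]) != NULL)
            break;
    return i;
}
```

When nothing in the buffer ends the run, the result is len, which is the largest value either function can return.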
The function szdata() returns a pointer to the data associated with a given string; this may be treated as a standard C-style string, as the library maintains a null character past the end of the string.
The functions szunzen() and szzen() clear and set the zen bit, respectively. They return s. An idiomatic usage would be
return szunzen(str2zsz(string));
to generate a string using a pre-allocated buffer, but which will free the buffer when it is deleted.
The szstats() function prints, to standard error, statistics about the number of strings created and deleted. It is a debugging tool only.
The example has not been written, as follows:
#include "sz.h"
#include <stdio.h>
string (3)
The man page may be incomplete.
It is undesirable that szchr() allocates memory, even though it does go away; this is why szindex() and szschr() were created.
The string library is not optimally fast, although it's not as bad as it could be.
It would be nice if it were easier to mix these with plain old strings.
Because it is now easier to mix these with plain old strings, there is a serious memory leak in any function which returns a reference into an argument, if that argument is a plain old string.