[Swan-dev] can shunk_t (string chunk) be merged into chunk_t (byte chunk)

Mon Jan 21 20:53:12 UTC 2019

| From: Andrew Cagney <andrew.cagney at gmail.com>

|  While chunk_t is intended for
| bytes and shunk_t is intended for characters, they do both provide
| pointer+length abstractions.  This begs the question: should they be
| merged?

C's type system is useful but not very powerful.

If shunk turns out to be useful enough, it should be kept.  And it
should be kept distinct from chunk.  I have not checked how often
shunk has been useful.

The main reason is char vs. unsigned char.

A secondary reason is const.  It is a little funny that "const"ness is
different between chunk and shunk, but you go where the use cases take
you.

const vs. non-const and char vs. unsigned char ought to be orthogonal,
but who wants four abstractions if it can be helped?
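
To make the four corners concrete, here is a minimal sketch.  The
struct layouts are assumptions for illustration (the real definitions
may differ), and const_chunk_t, writable_shunk_t and
const_chunk_from_chunk() are names invented here.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Assumed layouts, for illustration only. */
typedef struct { uint8_t *ptr; size_t len; } chunk_t;      /* writable bytes */
typedef struct { const char *ptr; size_t len; } shunk_t;   /* read-only chars */

/* The other two corners; these names are invented for this sketch. */
typedef struct { const uint8_t *ptr; size_t len; } const_chunk_t;
typedef struct { char *ptr; size_t len; } writable_shunk_t;

/*
 * Moving along the const axis needs no cast: adding const is an
 * implicit conversion.  Moving along the char / unsigned char axis
 * would need a cast, which is where type errors can hide.
 */
static const_chunk_t const_chunk_from_chunk(chunk_t c)
{
    assert(c.ptr != NULL || c.len == 0);
    const_chunk_t r = { .ptr = c.ptr, .len = c.len };
    return r;
}
```

Only the two types at the top are under discussion; the other two are
shown just to make the cost of full orthogonality visible.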

The main reason for chunk was for things that would go on the wire.
Most on-the-wire structures have lengths.

Strings inside our system are standard NUL terminated C things.
shunks should never go on the wire.  So the need for
strings-with-lengths isn't obvious.

When I wrote pluto there was really only one place I wanted strings
with lengths: when parsing configuration files, I'd like a way to
reference a substring of the input without having to make a copy.  (I
religiously avoid writing on my input.)  strtok is an abomination.
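
A sketch of what that looks like, assuming a shunk_t of the shape
below.  shunk_next_token() is a hypothetical helper written for this
message, not an existing library function.  Unlike strtok, it neither
writes on the input nor keeps hidden state.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

typedef struct { const char *ptr; size_t len; } shunk_t;   /* assumed layout */

/*
 * Return the leading part of *input up to (not including) the first
 * character found in delims, and advance *input past that token and
 * the one delimiter that ended it.  Nothing is copied or modified;
 * the token points into the caller's buffer.
 * Note: strchr() also matches the terminating NUL of delims, so a
 * NUL byte inside the input acts as a delimiter too.
 */
static shunk_t shunk_next_token(shunk_t *input, const char *delims)
{
    assert(input != NULL && delims != NULL);

    size_t n = 0;
    while (n < input->len && strchr(delims, input->ptr[n]) == NULL)
        n++;

    shunk_t token = { .ptr = input->ptr, .len = n };

    /* Skip the token and, if there was one, its delimiter. */
    size_t skip = (n < input->len) ? n + 1 : n;
    input->ptr += skip;
    input->len -= skip;
    return token;
}
```

Given the input "left=192.0.2.1" and delims "=", the first call yields
a 4-character token pointing at "left" and leaves the remaining 9
characters in *input.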

A number of string-eating operators in our library take character
counts and could use shunks.  But I think most calls use 0 for the
length which means "eat up to a '\0'".  That would be more awkward to
say with a shunk.  It would be better to have two distinct versions of
each function (with one calling the other).
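
For illustration, a minimal sketch of such a pair.  Both function
names are hypothetical and the shunk_t layout is assumed.  The counted
version does the work; the NUL-terminated version just measures the
string and delegates.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

typedef struct { const char *ptr; size_t len; } shunk_t;   /* assumed layout */

/* Counted version: true if s is a non-empty run of decimal digits. */
static bool shunk_is_decimal(shunk_t s)
{
    if (s.len == 0)
        return false;
    for (size_t i = 0; i < s.len; i++) {
        if (s.ptr[i] < '0' || s.ptr[i] > '9')
            return false;
    }
    return true;
}

/* NUL-terminated version: no magic 0 length, just strlen(). */
static bool str_is_decimal(const char *str)
{
    assert(str != NULL);
    shunk_t s = { .ptr = str, .len = strlen(str) };
    return shunk_is_decimal(s);
}
```

Callers with an ordinary C string use str_is_decimal(); callers
holding a substring of a larger buffer use shunk_is_decimal().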

Yet another dimension: does the chunk "own" the memory, with the
obligation that the memory be freed when the code is done with the
chunk?  Right now, it is up to the programmer to remember this aspect,
with no help.

|  It turns out that they have a critical difference:
| 
| - chunk_t points at writable data but shunk_t points at read-only data

It is surprising that they are different.  Why does that turn out to
be the best choice?

| so I'd argue no.  For instance, a construct like:
| 
|     const char string[4] = { 'H', 'e', 'l', 'p', };
|     shunk_t s = { .ptr = string, .len = sizeof(string), };

Aside: I'd probably write that as
    const char string[4] = "Help";
    shunk_t s = { .ptr = string, .len = sizeof(string), };
or better:
    const char string[] = "Help";
    shunk_t s = { .ptr = string, .len = sizeof(string) - 1, };
(I knew that having programmers count characters was for the birds
when I had to use FORTRAN string literals in 1967: 4HHELP (no lower
case then).)

| is valid (unlike strspn() et al., shunks don't assume NUL
| termination),

I'm not sure what you are saying.  Are you thinking sizeof is (un)like
strspn?  I don't see an analogy.  The difference between sizeof and
strlen is more interesting.
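
To spell out that difference, a small sketch.  SHUNK_LITERAL() and
shunk_from_str() are names invented here, and the shunk_t layout is
assumed.  sizeof is evaluated at compile time and counts the
terminating NUL of a string literal; strlen runs at run time and
needs that NUL to be present.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

typedef struct { const char *ptr; size_t len; } shunk_t;   /* assumed layout */

/* Exactly 4 bytes and no NUL: sizeof works, strlen would be undefined. */
static const char unterminated[4] = { 'H', 'e', 'l', 'p', };

/* 5 bytes including the NUL: sizeof gives 5, strlen gives 4. */
static const char terminated[] = "Help";

/* Compile-time length; only valid for string literals and char arrays. */
#define SHUNK_LITERAL(lit) ((shunk_t){ .ptr = (lit), .len = sizeof(lit) - 1, })

/* Run-time length; valid for any NUL-terminated string. */
static shunk_t shunk_from_str(const char *str)
{
    assert(str != NULL);
    shunk_t s = { .ptr = str, .len = strlen(str), };
    return s;
}
```

Applying SHUNK_LITERAL() to a pointer rather than an array would
silently use the size of the pointer, so the macro is only safe on
literals and arrays.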

| whereas:
| 
|     chunk_t c = { .ptr = (uint8_t*) string, .len = sizeof(string), };
| 
| is not.

It is idiomatic to put in the "-1".  Not a problem.

But casting is to be avoided since it can cover up type errors.

| What might be useful are:
| 
| - think of a better name - string_t would be terrible, slice_t might be better

There was a time in the Windows world when those were considered
"Pascal strings".  But I don't wish to memorialize that.

Tentative proposal: call it a substring (substr_t).  That's what it
seems good for.

The elephant in the room: should we really be dealing with UTF-8 /
Unicode?  The difference between unsigned char and signed char is
quite minor, but handling the real world of UTF-8 takes a bigger step.

As far as I'm concerned, UTF-32 is a non-starter.  Except when
handling a single character.

If we wish to get to Unicode eventually, perhaps the shunk should be
more opaque now.  We should discourage subscripting into it because a
Unicode character cannot be accessed that way.  But most of our
conversion work would be with regular string code.  This would
probably fatten up the abstraction considerably (sadly).
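
A small sketch of why subscripting stops working.  shunk_utf8_count()
is a hypothetical helper, the shunk_t layout is assumed, and the
input is assumed to be well-formed UTF-8 (no validation is done).  In
UTF-8 a character occupies one to four bytes, so byte index i is not
character index i.

```c
#include <assert.h>
#include <stddef.h>

typedef struct { const char *ptr; size_t len; } shunk_t;   /* assumed layout */

/*
 * Count the code points in s.  Every byte of the form 10xxxxxx is a
 * continuation byte; every other byte starts a new code point.
 */
static size_t shunk_utf8_count(shunk_t s)
{
    assert(s.ptr != NULL || s.len == 0);

    size_t count = 0;
    for (size_t i = 0; i < s.len; i++) {
        /* Implicit conversion, so plain char signedness does not matter. */
        unsigned char byte = s.ptr[i];
        if ((byte & 0xC0) != 0x80)
            count++;
    }
    return count;
}
```

For "na\xc3\xafve" (the word "naive" with a diaeresis on the i) .len
is 6 but the count is 5, and .ptr[2] is half a character.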

| - change .ptr to 'const uint8_t*', but that would break things like
| .ptr = "a string" :-(

The tension of reducing four cases to two -- which two?

This particular version seems more like a chunk variant, not a string
variant.