[Swan-dev] can shunk_t (string chunk) be merged into chunk_t (byte chunk)

Tue Jan 22 16:24:04 UTC 2019

On Mon, 21 Jan 2019 at 15:53, D. Hugh Redelmeier <hugh at mimosa.com> wrote:
>
> | From: Andrew Cagney <andrew.cagney at gmail.com>
>
> |  While chunk_t is intended for
> | bytes and shunk_t is intended for characters, they do both provide
> | pointer+length abstractions.  This begs the question: should they be
> | merged?
>
> C's type system is useful but not very powerful.
>
> If shunk turns out to be useful enough, it should be kept.  And it
> should be kept distinct from chunk.  I have not checked how often
> shunk has been useful.
>
> The main reason is char vs. unsigned char.

From the POV of merging shunk_t into chunk_t, that is in the noise.  A
properly implemented abstraction would hide that choice - they could
both use 'void *'.

OTOH, it does make it really clear that the original purpose of
shunk_t was [constant] ASCII string manipulation.  Just like, by using
uint8_t, chunk_t makes it clear it is intended for manipulating raw
bytes.

> A secondary reason is const.  It is a little funny that "const"ness is
> different between chunk and shunk, but you go where the use cases take
> you.

No accident.  It was a deliberate design choice:
- modern languages pretty much all define strings as immutable
- the code should have a fighting chance of compiling and working when
fed -Wwrite-strings "strings" (a PITA that we've yet to try
navigating)
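
As a rough sketch of what that design choice buys (the typedefs are
simplified stand-ins, and shunk_from_cstr() and chunk_clone_shunk()
are hypothetical helpers, not the actual libreswan API): a read-only
shunk can point straight at a string literal even when literals are
treated as const, as they are under -Wwrite-strings, whereas a
writable chunk needs storage of its own.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Simplified stand-ins, not the real libreswan definitions. */
typedef struct {
	const char *ptr;	/* read-only characters */
	size_t len;
} shunk_t;

typedef struct {
	uint8_t *ptr;		/* writable bytes */
	size_t len;
} chunk_t;

/* Hypothetical helper: wrap a C string without copying or casting. */
static shunk_t shunk_from_cstr(const char *s)
{
	assert(s != NULL);
	return (shunk_t) { .ptr = s, .len = strlen(s), };
}

/*
 * Hypothetical helper: writable data needs its own storage, so copy
 * instead of casting away const.  The caller frees .ptr.
 */
static chunk_t chunk_clone_shunk(shunk_t s)
{
	chunk_t c = { .ptr = NULL, .len = 0, };
	if (s.len > 0) {
		c.ptr = malloc(s.len);
		if (c.ptr != NULL) {
			memcpy(c.ptr, s.ptr, s.len);
			c.len = s.len;
		}
	}
	return c;
}
```

Writing through the clone leaves the literal untouched, and no cast
is needed anywhere.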

> const vs not-const and char vs unsigned char ought to be orthogonal
> but who wants four abstractions if it can be helped.
>
> The main reason for chunk was for things that would go on the wire.
> Most on-the-wire structures have lengths.
>
> Strings inside our system are standard NUL terminated C things.
> shunks should never go on the wire.  So the need for
> strings-with-lengths isn't obvious.

But strings on the wire do have lengths.  For instance those encoded using DER.
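
To illustrate, a minimal sketch of pointing a shunk at a DER-encoded
string in place.  It assumes a simplified shunk_t and a hypothetical
der_short_string() helper, and it only handles the short-form length
(a single length byte below 0x80); long-form lengths are rejected.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Simplified stand-in for shunk_t. */
typedef struct {
	const char *ptr;
	size_t len;
} shunk_t;

/*
 * Hypothetical helper: buf holds tag, length, then the contents.
 * On success *out references the contents inside buf; nothing is
 * copied and no NUL terminator is expected or added.
 */
static bool der_short_string(const uint8_t *buf, size_t buflen,
			     uint8_t expected_tag, shunk_t *out)
{
	assert(buf != NULL && out != NULL);
	if (buflen < 2 || buf[0] != expected_tag) {
		return false;
	}
	if (buf[1] & 0x80) {
		return false;	/* long-form length: not handled here */
	}
	size_t len = buf[1];
	if (len > buflen - 2) {
		return false;	/* contents would run past the buffer */
	}
	out->ptr = (const char *)(buf + 2);
	out->len = len;
	return true;
}
```

For the bytes 0x0c 0x04 'H' 'e' 'l' 'p' (0x0c being the UTF8String
tag) the resulting shunk has length 4 and references the 'H' inside
the wire buffer.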

> When I wrote pluto there was really only one place I wanted strings
> with lengths: when parsing configuration files, I'd like a way to
> reference a substring of the input without having to make a copy.  (I
> religiously avoid writing on my input.)  strtok is an abomination.

And the algorithm parser is where it is used, and for just that reason.
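
A minimal sketch of that style of parsing, assuming a simplified
shunk_t and a hypothetical next_token() helper (not the real parser
code): each token is a pointer+length into the original input, so
nothing is copied and the input is never written on.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Simplified stand-in for shunk_t. */
typedef struct {
	const char *ptr;
	size_t len;
} shunk_t;

/* True when c is one of the (NUL-terminated) delimiter characters. */
static int is_delim(char c, const char *delims)
{
	return memchr(delims, c, strlen(delims)) != NULL;
}

/*
 * Hypothetical helper: skip leading delimiters, return the next
 * token, and advance *input past it.  An empty token (.len == 0)
 * means the input is exhausted.
 */
static shunk_t next_token(shunk_t *input, const char *delims)
{
	assert(input != NULL && input->ptr != NULL && delims != NULL);
	size_t start = 0;
	while (start < input->len && is_delim(input->ptr[start], delims)) {
		start++;
	}
	size_t end = start;
	while (end < input->len && !is_delim(input->ptr[end], delims)) {
		end++;
	}
	shunk_t token = { .ptr = input->ptr + start, .len = end - start, };
	input->ptr += end;
	input->len -= end;
	return token;
}
```

Unlike strtok(), there is no hidden state and the input can be a
const string.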

> A number of string-eating operators in our library take character
> counts and could use shunks.  But I think most calls use 0 for the
> length which means "eat up to a '\0'".  That would be more awkward to
> say with a shunk.  It would be better to have two distinct versions of
> each function (with one calling the other).

And code assuming NUL terminated strings should stick with standard C functions.

> |  It turns out that they have a critical difference:
> |
> | - chunk_t points at writable data but shunk_t points at read-only data
>
> It is surprising that they are different.  Why does that turn out to
> be the best choice?

I'm surprised you're surprised.  See above.

What I do keep encountering is cases where chunk_t should be 'const'.

> | so I'd argue no.  For instance, a construct like:
> |
> |     const char string[4] = { 'H', 'e', 'l', 'p', };
> |     shunk_t s = { .ptr = string, .len = sizeof(string), };
>
> Aside: I'd probably write that as
>     const char string[4] = "Help";
>     shunk_t s = { .ptr = string, .len = sizeof(string), };
> or better:
>     const char string[] = "Help";
>     shunk_t s = { .ptr = string, .len = sizeof(string) - 1, };
> (I knew that having programmers count characters was for the birds
> when I had to use FORTRAN string literals in 1967: 4HHELP (no lower
> case then).)

It is to illustrate a point, not to be pretty.

> | is valid (unlike strspn() et.al., shunks don't assume NUL
> | termination),
>
> I'm not sure what you are saying.  Are you thinking sizeof is (un)like
> strspn?  I don't see an analogy.  The difference between sizeof and
> strlen is more interesting.

To paraphrase:

Shunks don't assume NUL termination (so using sizeof() and not
strlen() was deliberate).

Code using strspn(), and for that matter, every other variation, does.
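
A small sketch of the difference, assuming a simplified shunk_t and
hypothetical shunk_equal() and shunk_span() helpers: every operation
is bounded by .len, so it works on an array with no terminating NUL,
where strlen() or strspn() would read past the end.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Simplified stand-in for shunk_t. */
typedef struct {
	const char *ptr;
	size_t len;
} shunk_t;

/* Hypothetical helper: compare by length and contents only. */
static bool shunk_equal(shunk_t a, shunk_t b)
{
	assert(a.len == 0 || a.ptr != NULL);
	assert(b.len == 0 || b.ptr != NULL);
	return a.len == b.len &&
		(a.len == 0 || memcmp(a.ptr, b.ptr, a.len) == 0);
}

/*
 * Hypothetical length-bounded analogue of strspn(): count the
 * leading characters of s that also appear in accept.
 */
static size_t shunk_span(shunk_t s, shunk_t accept)
{
	size_t n = 0;
	while (n < s.len && accept.len > 0 &&
	       memchr(accept.ptr, s.ptr[n], accept.len) != NULL) {
		n++;
	}
	return n;
}
```

With these, a shunk over { 'H', 'e', 'l', 'p', } using sizeof()
compares equal to one over "Help" using sizeof() - 1.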

> | whereas:
> |
> |     chunk_t c = { .ptr = (uint8_t*) string, .len = sizeof(string), };
> |
> | is not.
>
> It is idiomatic to put in the "-1".  Not a problem.

?

> But casting is to be avoided since it can cover up type errors.
>
> | What might be useful are:
> |
> | - think of a better name - string_t would be terrible, slice_t might be better
>
> There was a time in the Windows world where those were considered
> "Pascal strings".  But I don't wish to memorialize that.

pstring_t wouldn't be that bad.  Except, according to Wikipedia, what
is known as a Pascal string (p-string), and used by many Pascal
dialects, has a one-byte length prefix.

> Tentative proposal: call it a substring (substr_t).  That's what it
> seems good for.

Something feels wrong about including 'str' in an abstract type not
actually a C string, and the existing code tends to use 'str' when a C
string is involved.  However, to your point, I suggested slice_t for
similar reasons. [sub]text_t :-)

Andrew