[Swan-dev] can shunk_t (string chunk) be merged into chunk_t (byte chunk)
Andrew Cagney
andrew.cagney at gmail.com
Tue Jan 22 16:24:04 UTC 2019
On Mon, 21 Jan 2019 at 15:53, D. Hugh Redelmeier <hugh at mimosa.com> wrote:
>
> | From: Andrew Cagney <andrew.cagney at gmail.com>
>
> | While chunk_t is intended for
> | bytes and shunk_t is intended for characters, they do both provide
> | pointer+length abstractions. This begs the question: should they be
> | merged?
>
> C's type system is useful but not very powerful.
>
> If shunk turns out to be useful enough, it should be kept. And it
> should be kept distinct from chunk. I have not checked how often
> shunk has been useful.
>
> The main reason is char vs. unsigned char.
From the POV of merging shunk_t into chunk_t it is in the noise. A
properly implemented abstraction would hide that choice - they could
both use 'void *'.
OTOH, it does make it really clear that the original purpose of
shunk_t was [constant] ASCII string manipulation. Just like, by using
uint8_t, chunk_t makes it clear it is intended for manipulating raw
bytes.
> A secondary reason is const. It is a little funny that "const"ness is
> different between chunk and shunk, but you go where the use cases take
> you.
No accident. It was a deliberate design choice:
- modern languages pretty much all define strings as immutable
- the code should have a fighting chance of compiling and working when
fed -Wwrite-strings "strings" (a PITA that we've yet to try
navigating)
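To make the const point concrete, here is a minimal sketch. The struct layouts are assumed from the .ptr/.len initializers quoted in this thread, and both helper names are hypothetical; the real headers may differ.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Assumed shapes, inferred from the .ptr/.len initializers in this thread. */
typedef struct { const char *ptr; size_t len; } shunk_t; /* read-only text */
typedef struct { uint8_t *ptr; size_t len; } chunk_t;    /* writable bytes */

/* Hypothetical helper: view a string literal without copying it. */
static shunk_t shunk_from_literal(const char *s)
{
	assert(s != NULL);
	shunk_t v = { .ptr = s, .len = strlen(s), };
	return v;
}

/* Hypothetical helper: wrap a caller-owned, writable buffer. */
static chunk_t chunk_from_buffer(uint8_t *buf, size_t len)
{
	assert(buf != NULL || len == 0);
	chunk_t c = { .ptr = buf, .len = len, };
	return c;
}
```

Under -Wwrite-strings a literal such as "Help" has type const char[5], so it can be handed to shunk_from_literal() as-is, whereas storing it in chunk_t.ptr would need a cast that throws the const away.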
> const vs not-const and char vs unsigned char ought to be orthogonal
> but who wants four abstractions if it can be helped.
>
> The main reason for chunk was for things that would go on the wire.
> Most on-the-wire structures have lengths.
>
> Strings inside our system are standard NUL terminated C things.
> shunks should never go on the wire. So the need for
> strings-with-lengths isn't obvious.
But strings on the wire do have lengths. For instance, those encoded using DER.
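As an illustration only: a DER string is a tag octet, a length, then the contents, with no NUL terminator. The sketch below handles just the short length form (a single length octet below 0x80). The helper name der_short_string is hypothetical and the shunk_t layout is assumed.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

typedef struct { const char *ptr; size_t len; } shunk_t; /* assumed shape */

/*
 * Hypothetical helper: point OUT at the contents of a DER string that
 * uses the short length form.  Returns false on a tag mismatch,
 * truncated input, or a long-form length.  Nothing is copied.
 */
static bool der_short_string(const uint8_t *der, size_t der_len,
			     uint8_t expected_tag, shunk_t *out)
{
	assert(der != NULL && out != NULL);
	if (der_len < 2 || der[0] != expected_tag)
		return false;
	size_t len = der[1];
	if (len >= 0x80 || len > der_len - 2)
		return false;
	/* The one place a cast is needed: wire bytes viewed as text. */
	out->ptr = (const char *)(der + 2);
	out->len = len;
	return true;
}
```

For example, the six octets 0x13 0x04 'H' 'e' 'l' 'p' (a PrintableString) yield a four-character view of "Help" that points into the original buffer.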
> When I wrote pluto there was really only one place I wanted strings
> with lengths: when parsing configuration files, I'd like a way to
> reference a substring of the input without having to make a copy. (I
> religiously avoid writing on my input.) strtok is an abomination.
And the algorithm parser is where it is used, and for just that reason.
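A minimal sketch of that kind of copy-free tokenizing, assuming the shunk_t layout below. The name next_token is hypothetical and this is not the actual parser code.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

typedef struct { const char *ptr; size_t len; } shunk_t; /* assumed shape */

/*
 * Hypothetical helper: split the next token off the front of *input at
 * the first character found in DELIMS.  *input is advanced past that
 * delimiter.  Unlike strtok(), the input bytes are never written to.
 */
static shunk_t next_token(shunk_t *input, const char *delims)
{
	assert(input != NULL && delims != NULL);
	size_t ndelims = strlen(delims);
	size_t n = 0;
	while (n < input->len &&
	       memchr(delims, input->ptr[n], ndelims) == NULL)
		n++;
	shunk_t token = { .ptr = input->ptr, .len = n, };
	/* Step over the delimiter itself, if one was found. */
	size_t skip = (n < input->len) ? n + 1 : n;
	input->ptr += skip;
	input->len -= skip;
	return token;
}
```

Calling it repeatedly on "aes-sha2;dh14" with delimiters "-;" returns views of "aes", "sha2" and "dh14", each pointing into the original string.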
> A number of string-eating operators in our library take character
> counts and could use shunks. But I think most calls use 0 for the
> length which means "eat up to a '\0'". That would be more awkward to
> say with a shunk. It would be better to have two distinct versions of
> each functions (with one calling the other).
And code assuming NUL terminated strings should stick with standard C functions.
> | It turns out that they have a critical difference:
> |
> | - chunk_t points at writable data but shunk_t points at read-only data
>
> It is surprising that they are different. Why does that turn out to
> be the best choice?
I'm surprised you're surprised. See above.
What I do keep encountering is cases where chunk_t should be 'const'.
> | so I'd argue no. For instance, a construct like:
> |
> | const char string[4] = { 'H', 'e', 'l', 'p', };
> | shunk_t s = { .ptr = string, .len = sizeof(string), };
>
> Aside: I'd probably write that as
> const char string[4] = "Help";
> shunk_t s = { .ptr = string, .len = sizeof(string), };
> or better:
> const char string[] = "Help";
> shunk_t s = { .ptr = string, .len = sizeof(string) - 1, };
> (I knew that having programmers count characters was for the birds
> when I had to use FORTRAN string literals in 1967: 4HHELP (no lower
> case then).)
It is to illustrate a point, not to be pretty.
> | is valid (unlike strspn() et.al., shunks don't assume NUL
> | termination),
>
> I'm not sure what you are saying. Are you thinking sizeof is (un)like
> strspn? I don't see an analogy. The difference between sizeof and
> strlen is more interesting.
To paraphrase:
Shunks don't assume NUL termination (so using sizeof() and not strlen()
was deliberate).
Code using strspn(), and for that matter, every other variation, does.
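A small sketch of the difference, using the unterminated array from the earlier example. The shunk_t layout is assumed and shunk_equals is a hypothetical name.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

typedef struct { const char *ptr; size_t len; } shunk_t; /* assumed shape */

/* Four characters, no terminating NUL: strlen(help) would be undefined. */
static const char help[4] = { 'H', 'e', 'l', 'p', };

/* Hypothetical helper: length-bounded compare; never reads past s.len. */
static bool shunk_equals(shunk_t s, const char *bytes, size_t len)
{
	assert(bytes != NULL || len == 0);
	return s.len == len &&
	       (len == 0 || memcmp(s.ptr, bytes, len) == 0);
}
```

With shunk_t s = { .ptr = help, .len = sizeof(help), } the length comes from the array's size, so nothing ever goes looking for a terminator that is not there.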
> | where as:
> |
> | chunk_t c = { .ptr = (uint8_t*) string, .len = sizeof(string), };
> |
> | is not.
>
> It is idiomatic to put in the "-1". Not a problem.
?
> But casting is to be avoided since it can cover up type errors.
>
> | What might useful are:
> |
> | - think of a better name - string_t would be terrible, slice_t might be better
>
> There was a time in the Windows world where those were considered
> "Pascal strings". But I don't wish to memorialize that.
pstring_t wouldn't be that bad. Except, according to wikipedia, what
is known as a Pascal String (p-string), and used by many Pascal
dialects, had a one byte prefix.
> Tentative proposal: call it a substring (substr_t). That's what it
> seems good for.
Something feels wrong about including 'str' in an abstract type not
actually a C string, and the existing code tends to use 'str' when a C
string is involved. However, to your point, I suggested slice_t for
similar reasons. [sub]text_t :-)
Andrew