Datatypes [Scratchbook] (Article)

Description :: Domains, classes, inheritance, value sets, type constraints

[This article has slowly evolved into random notes on the subject. I'm starting a new, separate (set of) article(s) which should be cleaner, and maybe I can iteratively arrive at clean, concise, precise notes on this subject matter. My apologies in advance.]

A domain can be understood as a set of (unique) values. As such, a value may be a member of several sets (domains). For example, "2" is a member of the domain of "all real numbers" as well as the domain of "positive integer numbers" as well as the domain of "letters and digits used in english text".

Domains can be defined in terms of what they are or what they are not. You could say that a given domain is "anything that looks like a number" or you could say that a given domain is "not a number". You could enumerate all possible values of the set, too: the domain of "all values that are either 'Cindy', 'Larry', or 'Jerry'" would be such a case.

Domains can be finite or infinite. Finite sets can be enumerated, that is, we can know every possible value (and, classically, we can assign a number/index to each one: thus "enumerating" or "giving each item a number".) Infinite sets are not finitely enumerable: you would never be able to list every possible value in the domain. Domains can be empty: if so, they are finite sets with zero elements. In any case, it's generally easier to find out whether or not a value is a member of a set than to get a list of some or all other members of that set.

Set logic and boolean logic are closely related. "Must be part of A and must be part of B" is equivalent to "Must be part of the intersection of A and B". "Must be part of A or a part of B" is equivalent to "Must be a part of the union of A and B". If domains are expressed with set or boolean logic, it should be possible to guess whether values known to be members of one domain can be automatically treated as members of another domain. This, however, is likely rare.

Having read plenty on the topic of structural vs. nominative typing, generalization and specializiation by constraint, inheritance, multiple inheritance, and interfaces, it's clear nobody seems to know what they want. And that's okay. Based on my experience elsewhere, the best solution is always to just provide for a callback (or anonymous function, lambda) -- basically, the ability to plug a function in (anywhere) as a definition of something, such as "what to do when the timer goes off", or even "what to do" + "when to do it" as separate functions.

A domain in math refers to the input possibilities for a function: f(x) = 1/x is defined for any x that is not zero. The domain of the function is therefore "any number that is not zero". Its output domain, called range, is defined also as "any number that is not zero" (1/x is never equal to zero.) For sin(x), the output range is "any number between -1 and +1, included", but the input domain is "any number". The phrasing of these domains should indicate why it's obvious we can say that a domain is just a boolean function: given a value, it tells you whether or not that value is suitable, or likely to happen.

Because it could be expensive to verify these domains at each point where they appear, a shortcut system could be helpful. It would basically say "any value which has qualified as part of domain A, is also part of domain B" (without running the function B against the value to make sure this is true.) This method may involve users explicitly stating "all A are also B", or it may involve code looking at other code directly to make sure this is true. For example, such a method could "see" that domain D is equivalent to "E and F", thus stating that all D values are automatically E or F values, if they need to be. From my readings on the topic of lambda calculus, this is a problem which does not have a general solution in finite time -- it could take forever to programmatically verify that function B returns true for any input on which function A would return true. (Even for finite sets, testing every input isn't a guaranteed method: the function could very well decide randomly, or based on time, or any other factor you're not aware of. Black-box testing isn't enough, and even if it were, would only apply to finite, and probably small, sets.)

Identifying the output range of a function has its uses. Even if not checked on a routine basis, this information could provide some optimizations. f(g(x)) (that is, f() of g() of x) wouldn't necessarily need to verify that the value returned by g(x) is actually valid in the input domain of f(x). Instead, it could be assumed by looking at g(x)'s output range definition. If it states that its output conforms to function A (domain A), and f(x)'s input domain demands that values be constrained to domain A or some superset thereof, then the above method might tell us that all values coming from g(x) are automatically valid in f(x), thus reducing the number of required verifications. Then again, just because a function says its output will obey a particular constraint doesn't mean it will unless we actually verify that it does -- therefore we would need to at least check that the value is allowed by the output domain, even if we can skip the input domain verification by deduction. Yes, there's been a lot written about verifying programmatically that functions will do what they say they will do; that would require source-code analysis though, which I'm not assuming to be always available, nor a good idea. Regardless, naming domains and using named domains should cover simple cases, a human-editable list of rules can cover more complicated cases, and if someone comes up with a cheap way to do it all automatically then we can do that later too.

A representation is the specific in-memory/on-disk expression of the value -- it's the format of the bits that make up a standard 4-byte integer, it's the format of the bits that make up a string, and in theory, it's the shape of the letter your fourth-grader just painted on the side of the fridge with permanent markers. A value will always need some representation, though two representations of the same value can be considered equal; a variable has a constraint (domain) that identifies what possible values can be stored in its space, but it doesn't necessarily care whether a number is stored as a 4-byte integer, or as a string representation of the number.

It seems reasonable to assume that domain definitions may sometimes look at the representation being used (for example, some functions may know how to work with integers, but refuse to do so if the integers are in ascii decimal format, preferring to deal only with 4-byte binary representations.) Variables should carry a domain definition with them, which can be used to deduce whether or not the value being represented (and symbolized) by the variable is suitable for use in a given function, but direct equality isn't required. Domain and representation are therefore distinct; the output value of a function has a representation, but the function itself has a domain indicating what it guarantees it will return. The only values likely not to be encapsulated by a symbol carrying a domain would be constants or values coming from outside the system, such as user-provided binary encodings, which would be verified on their way into the system. (This could be done at the client -or- server end, but is likely not relevant to this discussion. By the time these outside values are seen, they should have been validated against the domains of the symbols to which they were attached during transfer, yes?)

While it's not difficult to see variables as being storage space with an optional constraint on possible values to be stored, things get messier with functions. (Don't constrain the idea of "named things" to variables though; constants have values, but don't need domains because they never change, and they have names.) Consider the case of subtraction applies to integers, and consider a datatype for dates which will store a date as an integer ("2005" is a year and an integer.) Now consider the result of "2005 - 1999", "6". "6" is also an integer, but "2005 - 1999" is clearly not a year, but a difference of years. You can add such a difference to a year and get a year ("1997 + 6" = "2003",) but it is not itself a year; it is, however, an integer. Since any integer is a valid year, a year variable should likely accept a cast from an integer to a year. But such a cast is obviously dangerous, as illustrated above. Multiplying integers makes sense, but multiplying years generally does not; multiplying year differences does make sense though. How, therefore, shall functions properly indicate the constraints on their parameters and the type of their return values? A function can require that its parameters be integers and promise that its return value will be an integer, but that is not sufficient semantically. We can define a new representation, 'year', which happens to be identical to integers. Functions written for integers will reject values tagged with this representation, and we can now write functions designed solely for the purpose of operating on years. While this seems to solve the problem of functions being available by default where they should not be, it does make new domains harder to express. We would want an easy way to re-use implementations for other types."The difference between two years is a year-difference; this implementation assumes that years and year-differences are represented as integers internally." Implementations of new functions may therefore rely on other functions and internal casts in order to circumvent the type system; code re-use is generally a good thing. If such casts are available outside of function implementations, any user may simply state "my_year = (year)((integer)2005 * (integer)1999)" to avoid type restrictions. Regardless, absolute safety is not possible; a user could always go to the trouble of writing a multiplication function for years that manipulates bits "manually" to achieve the desired result; such abilities are required for a user-extensible system to work at all.

Systems that define domains as constraints also tend to name such domains, and use the domain names themselves as identifiers as in the above case. Rather than having functions look at the representation of the value, functions look at the domain of the variable, by name, to decide if the variable's value my be used as a parameter to the function. The confusion of constraint-based types and name-based types is obvious. Besides, not all values are in variables: you could pass a constant to a function, for example; as the constant cannot change, it has no need for a type constraint.

Revised thoughts on named domains, domains on variables, etc.
(This should replace some of the above, but the editing will have to wait.) From reading up on how various languages handles classes and inheritance, something became clear to me: a class is just an assertion. When you say "this is an integer", you're not actually saying anything. You might as well say "this is a pumpkinshoe". What matters is that functions, when verifying their preconditions, need a starting place. If you were to hand them a pure binary block of data and ask them to do something with it, they likely would have trouble even making heads and tails of the block of data in order to decide what could or could not be done with it. An integer looks like a boolean at the binary level. An image file could be (painfully) interpreted as a raw sound file. You need a starting point, a basic assumption, from which you can ask other questions. If you say it's an integer, then you can then ask questions like "is it greater than 0?". These assertions, these basic assumptions about a value (not a variable -- though if a value is stored in a variable, we can assume all the requirements of the variable are true of the value) are useful tools. They do not need to be tied directly to an "interface" or "protocol" as some languages do. They do not need to be tied directly to any functions at all. You could start with an assumption that is "this is a binary block of data" or with one that is "this is an 800x600 image that includes a naked female". Functions will take what you give them, try to find out if what they care about is true or not, and proceed from there. A value must therefore have at least one basic assumption tied to it (though we might assume that on a computer, all values implicitly are going to be passed around as arbitrary blocks of bits, and that this is an assumption that could be useful under some circumstances.) So maybe values don't actually need assumptions attached to them. But it's quasi-necessary. You don't have to specify everything that's true -- there's no way you can plan that. A function, somewhere, might take it into its head to ask "is this integer either 3 or 394392324?" and you shouldn't be required to name this condition and assign the condition name as an assumption of a particular value. Individual functions can perform their tests as necessary, though any assumptions you provide could be beneficial, efficiency-wise. It might be wise, however, for functions to verify whatever assumptions you provide, just in case there's some inconsistency. Just because a value is tagged as being both "positive non-zero integer" and "negative non-zero integer" doesn't mean it's possible to be both. Are these basic assumptions the same thing as representations? I don't know -- what are representation tags for? Do functions look at them directly to determine what can and/or should be done with a particular value? If so, then there's at least some cross-over. Maybe this is a superset that happens to include the representation tag on a value, as imagined by Date and Darwen.

So we have the following:
- a value may have zero or more named assumptions (its type) associated with it,
- a variable may have a constraint (its type) on the values it can carry,
- a function should look at the value (not the variable) to decide whether or not its domain constraint (its type) is satisfied.

Variables and functions similarly check values sent to them for validity against their constraints, which may happen to be of the form "is_Z(value) must be true" (is_Z could check the assumptions of the value, and if Z is not one of them, try to find some way to prove that Z is also true of the value.)

From here, polymorphism, overloading, overriding, and all that jazz.

But first, and again, before I redo all the text above ... I was just re-reading parts (in the appendix) of "Foundation for Ojbect/Relational Databases" (Date, Darwen), in particular their stuff on "specialization by constraint". Aside from the fact that they repeatedly talk about violating type constraints on values rather than variables ... it occurs to me that in my world, "types are dead". We don't need to know the type of a value at any time other than when checking to make sure it'll fit in a variable or through a function call without causing any harm. They insist on knowing the "most specific type" of a value -- but really, I only care about knowing whether or not a value can conform to a given type if need-be.

[Scratch that; in my private notes, I even wrote "the concept of a most specific type is crock", but there's a good reason for it: when you go looking through the namespace to find an appropriate function, given a value of type ColoredCircle being passed to a function named Draw() and you have two functions named Draw(Circle) and Draw(ColoredItem), you may wish to have some way to prioritize which one gets picked. That's what the most specific type is for -- though it works better in cases where you have, say, Draw(Circle) and Draw(ColoredCircle) ... with multiple-inheritance, things get nastier, and it solves even less ... but there is a valid problem here to be solved, in the domain of namespaces -- which should be a separate article.]

 // SetWidth(x : CircleOrEllipse) : CircleOrEllipse
 var E : CircleOrEllipse;
 E = Circle(2);
 PrintCircle(E);
 E = SetWidth(E, 3);
 PrintCircle(E); // fails

SetWidth was okay with receiving a circle, it knew to convert it to an ellipse as necessary (assuming ellipses are internally different from circles) and as E was okay with containing an ellipse value, the assignment didn't fail. See? All done. Yay for dynamic typing. (Note that the examples provided by Date and Darwen assume that functions may directly modify their parameters by reference, and that somehow the type of the parameter must be preserved through those changes ... because they don't do everything by assignment, I think they miss out on some insights. Admittedly, not passing by reference is a pain in the butt, but ... so?)

Continued at top