Monday, March 21, 2011

Human Language Compilation

I believe that computers should make the translation of documents from one language to another simple, quick, and easy.

Computers do not actually work with our letters and words. All that they understand is the alignment of magnetic bits representing either a 1 or a 0. But if we group such bits into sets of eight, known as a byte, we have 256 possible combinations, because each of the eight bits can be set to either of two states, and two multiplied by itself eight times is 256.
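As a quick check of that arithmetic, here is a small Python sketch (my illustration, not part of anything described in this post):

    # Each bit has 2 possible states, so 8 bits give 2^8 combinations.
    combinations = 2 ** 8
    print(combinations)            # 256

    # One of those 256 patterns, shown as its individual bits:
    value = 0b01000001
    print(format(value, '08b'))    # 01000001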

In a code system known by its acronym, ASCII, each letter of the alphabet, lower case and capitals, as well as numbers, punctuation, and control characters, is represented by one of these combinations. (Strictly speaking, standard ASCII defines 128 characters using seven bits; it is the eight-bit extensions that fill out all 256.)
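Python exposes this mapping directly, so the letter-to-number correspondence is easy to see. A small sketch, with the characters chosen just as examples:

    # ord() gives the ASCII code for a character; chr() goes the other way.
    for ch in ['A', 'a', '0', '!']:
        print(ch, ord(ch), format(ord(ch), '08b'))
    # A 65 01000001
    # a 97 01100001
    # 0 48 00110000
    # ! 33 00100001
    print(chr(65))  # A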

In the programming of computers, special languages are developed as a link between the tasks that humans want the computer to do and the opcodes (operation codes) that are wired into its processor. The processor in a typical computer might have several hundred opcodes, representing the tasks that are wired into it.

Opcodes are used in combination to create a vast number of possible language commands. They are written in hexadecimal notation, a numbering system based on sixteen rather than ten. It uses the digits 0 through 9, followed by the letters a through f. This base is used because a unit of four bits (a nibble) has sixteen possible combinations. Hexadecimal is used for such things as memory addresses in the computer, as well as opcodes.
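Since one hexadecimal digit covers exactly one nibble, a byte is always two hex digits. A small Python illustration:

    # A byte split into its two nibbles, each one hex digit.
    byte = 0b10101111
    print(hex(byte))             # 0xaf
    print(format(byte, '08b'))   # 10101111: high nibble 1010 (a), low nibble 1111 (f)
    print(int('af', 16))         # 175, converting back to decimal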

Computer programming languages fall into two broad categories: those that are interpreted and those that are compiled. Simple web scripts, such as JavaScript, are interpreted by the browser line by line rather than compiled in advance. BASIC (Beginner's All-purpose Symbolic Instruction Code) was originally designed as an interpreted language so that programming students could write a program and watch it being run line by line.

In high-level languages that are compiled, such as C++, a special program must be written to link the language with each processor that enters the market, because each processor has its own set of opcodes. This special program is called a compiler, and it goes over the program and, in several steps, breaks it down into the opcodes of the particular processor on which it will run.
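To make the idea concrete, here is a deliberately tiny Python sketch of what "breaking commands down into opcodes" means. The two-instruction language and the opcode numbers are invented for the example; a real compiler is vastly more elaborate:

    # A toy "compiler": translate a made-up mini-language into made-up
    # opcode bytes. This opcode table is hypothetical.
    OPCODES = {'LOAD': 0x01, 'ADD': 0x02}

    def compile_line(line):
        mnemonic, operand = line.split()
        return bytes([OPCODES[mnemonic], int(operand)])

    program = ["LOAD 7", "ADD 35"]
    machine_code = b"".join(compile_line(l) for l in program)
    print(machine_code.hex())  # 01070223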

In assembly language, which is a low-level computer language only a step above the computer's machine code, or opcodes, a related translator called an assembler turns the commands into the machine code that the processor can work with. Such a low-level language is arduous to write, but it is used when a very short program that runs very quickly is required.

There are computer languages which are neither purely compiled nor purely interpreted. The popular Java uses a "virtual machine" on the computer to enable it to operate across all computer platforms. Java source code is compiled into "Java bytecode", a form of "p-code" (portable code, sometimes read as pseudo-code), which the virtual machine then executes.

What I want to ask is: why can't we write a compiler for human languages? If compilers are special programs that break the commands of high-level computer languages down into the opcodes that are wired into the processor, then the next step should be a compiler for any written language. The compiler could scan each sentence and break it down into numeric code.

A compiler could be written to link each human language to this code so that a document in one human language could be easily translated into another, as long as a compiler had been written for both languages.
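In the simplest possible terms, such a scheme might look like the Python sketch below. The sentences, codes, and two-language tables are invented examples of the concept, not a real system:

    # One numeric code per sentence meaning; each language supplies a
    # table mapping that code to its own wording.
    ENGLISH = {1001: "Where is the train station?",
               1002: "How much does this cost?"}
    SPANISH = {1001: "¿Dónde está la estación de tren?",
               1002: "¿Cuánto cuesta esto?"}

    ENGLISH_TO_CODE = {s: c for c, s in ENGLISH.items()}

    def translate(sentence):
        code = ENGLISH_TO_CODE[sentence]   # "compile" English into the code
        return SPANISH[code]               # render the code in Spanish

    print(translate("How much does this cost?"))  # ¿Cuánto cuesta esto?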

The roadblock to an accomplishment such as this is, as I pointed out in the posting "Numbered Sentences" on my progress blog, that the word is not the primary unit of human communication when we are concerned with translating one language into another. Working with letters and words is fine as long as we remain within one language. But it is the sentence, not the word, which must be translated from one language to another. A word-for-word translation usually produces little more than gibberish, simply because grammar and syntax differ from one language to another.

Such a coding system does not yet exist. It would be similar in concept to ASCII, but would require us to break down every possible sentence and, after eliminating redundancies, assign each a numeric code that would be the same regardless of which human language it was in. It would be fairly simple to assign nouns and verbs a place in a language tree structure. Spell-checking and grammar-checking software is already widely used, and this is the next logical step.

My vision for the next breakthrough in the progress of computers lies not in technology, but in how we approach language. I have written about this topic already on the progress blog. The great limitation is that we are still using the ASCII system of coding that has been in use since the 1960s, when available computer memory was perhaps one-thousandth of what it is now.

Basically, computer storage revolves around magnetic bits. Each bit can be either a 1 or a 0, on or off, so that there are only two possible states for each bit. This means that eight such bits have 256 possible combinations, which is more than enough to encode the alphabet, lower case and capitals, numbers, and punctuation, as well as non-printing controls such as carriage return and space. This is the system that we use, reading computer memory in the groupings of eight bits referred to as "bytes".

Computers only deal with numbers, while we communicate mostly with words. This means that we have to create artificial languages to communicate with computers and to tell them what to do. There are several hundred opcodes, or basic instructions, wired into each computer processor. Machine code tells the computer what we want it to do by combining the instructions in these opcodes.

This machine code, which is expressed in the hexadecimal number system of the digits 0-9 and the letters A-F, is the most fundamental level of computer language. One step up from this is assembly language, which is expressed in simple letter mnemonics and works by combining machine-code instructions to the processor.

We can build higher level computer languages from this, all of which work by combining the instructions of lower-level languages. Some languages, such as those for web scripting, are interpreted in that they are simply read by the browser line-by-line. Most must have a compiler written to link each computer language to each new processor that comes on the market. The great advantage of higher-level languages is that the programmer does not have to understand exactly how the processor works in order to write instructions.

I find this system to be inefficient in the extreme for modern computing, as described on the progress blog. This is another example of how we have a way of being technologically forward, but systemically backward.

For one thing, with the spell-check technology available nowadays, there is no need to encode capital letters. We could shorten the encoding, and so speed up computers, by storing all letters in lower case and letting spell-checkers capitalize the appropriate letters at the receiving end.
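A very rough Python sketch of the receiving end, assuming only two rules (capitalize sentence starts and the standalone pronoun "i"); a real spell-checker would also need a dictionary of proper nouns:

    import re

    def recapitalize(text):
        # Capitalize the standalone pronoun "i".
        text = re.sub(r'\bi\b', 'I', text)
        # Capitalize the first letter and any letter following . ! or ?
        return re.sub(r'(^|[.!?]\s+)([a-z])',
                      lambda m: m.group(1) + m.group(2).upper(), text)

    print(recapitalize("hello there. i think this works. does it?"))
    # Hello there. I think this works. Does it?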

For another thing, I had the idea that we could consider all of the letters, numbers, punctuation, and controls in a document as one big number. This would mean treating it as a base-256 number, instead of the base-ten system that we are used to. This relatively simple change could greatly multiply both the storage space and the speed available by compressing any document, as described in "The Floating Base System" on this blog.
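Python can show this reinterpretation directly, since it handles arbitrarily large integers. The reading by itself doesn't shrink anything, but it is the view of a document on which such a scheme would operate:

    # Read a document's bytes as one big base-256 number: each byte is
    # a "digit" from 0 to 255.
    text = b"Hi!"
    number = int.from_bytes(text, byteorder='big')
    print(number)  # 4745505 = 72*256**2 + 105*256 + 33
    print(number.to_bytes(3, byteorder='big'))  # b'Hi!'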

Today I would like to write more about what should definitely be the next frontier in computing: reforming the basic system of encoding.

There are three possible ways to encode written information in a computer: by letters, by words, or by sentences. The way it is done now is still by letters, which is by far the most primitive and inefficient of the three, and is a reflection of the strict memory limitations of the 1960s. The real unit of communication is actually the sentence, as we have seen on the progress blog in "Human Language Compilation" and "Numbered Sentences". Notice that languages must be translated from one to another by the sentence, not by the word. This is because grammar and syntax differ from language to language, and word-for-word translations usually produce little more than gibberish.

To encode by sentences, we could scan the dictionary for sensible combinations of words that make sentences and then eliminate redundancies, or sentences that mean the same thing. This would give us a few million sentences that are used in communication. There would also be special pointers to names, place names, and culturally specific words. This would not only make storage and transmission of information many times more efficient, but would also facilitate easy translation from one language to another because all sentences could already be pre-translated.

The user would type a sentence, and then pick the one that came up from the database that was most like the one that was typed. Each one of these would have a pre-assigned bit code, similar in concept to the present ASCII.
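The lookup step might work roughly like this Python sketch, which uses simple string similarity to find the closest stored sentence; the three-sentence database and its codes are invented for the example:

    import difflib

    SENTENCES = {"where is the train station": 1001,
                 "how much does this cost": 1002,
                 "what time is it": 1003}

    def encode(typed):
        # Find the stored sentence most like the typed one.
        match = difflib.get_close_matches(typed.lower().rstrip('?.!'),
                                          SENTENCES.keys(), n=1, cutoff=0.5)
        return SENTENCES[match[0]] if match else -1  # -1: nothing close enough

    print(encode("How much is this costing?"))  # 1002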

There is yet another approach to better integrating computers with the ordinary language that we communicate with, one that I have not written about yet. This approach involves words, rather than sentences. It will be more complex and difficult than numbering sentences, but it would be the ultimate in language-computer integration, and it is what I want to add today.

Words are actually codes, which is why we have dictionaries for words but not for numbers. A word serves to differentiate something that exists from everything else. This fits with the all-pervasive pattern that I termed "The One And The Many", as described in the posting by that name on the patterns blog.

Since we are more complex than our inanimate surroundings, there is not enough complexity for everything that we could conceive of to actually exist. So words also define for us that which does exist from that which doesn't. This is why we require words as well as numbers: only a fraction of what could exist, from our complexity perspective, actually does exist.

Words, as codes, are far more complex than numbers. Although it may not seem like it, there is a vast amount of complexity packed into each and every word. All of the complexity of its pre-agreed-upon meaning is contained in a word. Words can be thought of as a kind of "higher level" of numbers, in a way similar to higher-level computer languages.

Numbers differ from words in that everything is basically numbers being manifested. Numbers exist in the universe of inanimate space and matter, while words don't. Numbers are less complex than words, but they are not required, as words are, to differentiate that which exists from that which doesn't.

We must completely understand something in order to describe it with numbers, although that is not the case with less-precise words. (In "The Progression Of Knowledge", on the progress blog, I explained how this can give us an idea of where we stand as far as how the volume of knowledge that we have now compares with all that we can possibly know).

We cannot determine the complexity of the words that we must fall back on, because if we could, we could continue our description of reality with numbers and would not need the words. We know what words mean, or else they would not be useful, but we do not know how much actual complexity a word contains in its meaning, because if we did, we could express its meaning with numbers and would no longer really need the word.

In "Outer Mathematics" on either this or the patterns and complexity blog, we saw how numbers are all that there really is. Everything is actually numbers being manifested. This means that there must be a formula for everything that exists. But because of our complexity level, we are unable to discern formulae about ourselves or things more complex than us.

We can only arrive at a formula for something that is less complex than our brains, which have to figure it out and completely understand it. To derive a formula about ourselves, or things more complex than us, we would have to be "smarter than ourselves", which is impossible. We could take the communication systems of animals and break them down into numbers, but we cannot do that with our own. So we can only rely on words for such descriptions.

But if there must be a formula for everything, even if it is hidden from us by our complexity perspective, that must also include words. Out there somewhere, there must be a way to substitute a number or a formula for every word in our language. If only we could arrive at this, it would be possible to construct a very complex system of numbers and formulae that would parallel the words that we use to communicate.

If we could only accomplish this, we would have the numbers that computers can deal with. Computers could deal directly with ordinary words, at least the ones that we had incorporated into this matching structure of numbers and formulae, and these artificial computer languages would no longer be necessary. We cannot see this, at our complexity level, because we are up against our own complexity and we cannot be "smarter than ourselves".

In the universe of inanimate matter, there is only quantity. In other words, everything is really numbers but with inanimate matter these numbers and formulae that describe everything are only one-dimensional. When we deal with living things, particularly ourselves, we have to deal with quality as well as quantity.

We can differentiate between the two by describing quantity as one-dimensional and quality as multi-dimensional. Quality forms a peak, which is the intersection of at least two slopes, while quantity forms a simple slope. Quality is not simply "the more, the better", but is a peak factor. This is why we are so much more complex than the surrounding inanimate reality.

I explained a simple version of structuring words like numbers in "The Root Word System" on this blog.

So, just as we did with the Human Genome Project and the Sloan Digital Sky Survey, let's get powerful supercomputers to work developing the structure that must exist, so that each word that we use can be expressed as a number or formula in the overall structure. Computers would then be capable of dealing with ordinary human language. All that we would have to do is tell the computer what we wanted it to do, and computers would be unimaginably more useful and easier to use than they are now.
