Thursday, June 18, 2009

Numbered Sentences

It is really amazing how much rapid progress has been made in improving computer processor speeds as well as hard drive capacity. But while this progress is being made, our basic system of digital coding remains so inefficient. I find this to be yet another example of how we can be technically forward but system backward at the same time.

The best-known system of digital coding is ASCII. Strictly speaking, ASCII defines 128 characters using seven bits, but each character is normally stored in eight bits, known as a byte. In binary form, a byte looks like this: 01101001, since each bit can be either on or off, represented by a 1 or 0.

On a hard drive, each bit is stored as a tiny magnetized region. Eight bits of two possibilities each mean that a byte has 256 different possibilities. ASCII uses each byte to store a text character such as a letter, number, or punctuation mark. Some of the combinations are unprintable control codes. Besides ASCII, there are other systems such as EBCDIC, used in mainframe computers, and Unicode.
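To make the byte example concrete, here is a small sketch in Python. The function names are my own, chosen only for illustration. It shows that the bit pattern 01101001 used above is the code for the lowercase letter "i", and that eight bits give 256 combinations.

```python
def char_to_bits(ch):
    """Return the 8-bit pattern that stores a single character."""
    return format(ord(ch), "08b")


def bits_to_char(bits):
    """Turn an 8-bit pattern back into the character it stores."""
    return chr(int(bits, 2))


# Eight bits, two possibilities each.
BYTE_COMBINATIONS = 2 ** 8

print(char_to_bits("i"))         # the pattern for the letter "i"
print(bits_to_char("01101001"))  # back to the letter
print(BYTE_COMBINATIONS)
```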

The problem with a coding system such as ASCII is that its coding represents individual characters, such as numbers and the letters of the alphabet. No matter how we improve processor speeds and hard drive capacity, this simple coding system remains inefficient in the extreme and thus limits the potential of computers.

I find that our digital communications and data storage would be multiplied many times in efficiency if our primary unit of communication were not the character, not the word, but the sentence. If we can categorize DNA in the Genome Project, then why can't we categorize all the things that people say and write to each other while communicating?

People all across the world say pretty much the same things to each other. The sentences could be arranged in a logical order and each one assigned a number. This would greatly simplify all digital storage and communications and make them far more efficient. This should have been done long ago, actually in the early days of computing.

Microsoft Office has standardized and categorized office documents into word processing, spreadsheets, databases, and presentations. Why not categorize every sentence that is used in communications and assign it a number? Then we would only need to communicate and store that number instead of the sentence written out in characters. This would be immeasurably more efficient. Dictionary writers take great care to categorize words, and we can expand the idea to entire sentences. The sentences could be arranged into a hundred or so logical categories and then selected from there.

It is all right, in most cases, if the sentences are somewhat generic. In most human communications, flowery prose is unnecessary. Several words may have the same meaning, and many sentences can be phrased in different ways, but for our purposes of efficient data storage and communications, only one choice of sentence would be necessary.

This concept of using sentences, rather than characters or words, as our main unit of digital communication and storage has another tremendous advantage besides the great increase in efficiency. Grammar and alphabets or character sets are not the same from one language to another, so literal word-by-word translation usually produces little more than gibberish. It is the sentence, not just the words, that must be translated.

This system of sentence numbering would also make quick and easy translation of data from one language to another possible. The entire world uses the same numbering system. Data could be stored and transmitted as numbers and each number would represent a sentence. The data numbers could be easily displayed in any language. If, for example, all communication was broken down into a million sentences, sentence 130461 might be "I went to the store today." All we would need to do would be to transfer and store the number.
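As a rough sketch of how this could work, here is a tiny lookup table in Python. Everything in it is hypothetical: the table has one entry, the number 130461 is simply the example used above, and the Spanish and French renderings are my own illustrations of what such a table might hold. Only the number would be stored or sent, and the receiving computer would display it in the reader's language.

```python
# Hypothetical sentence table: one number, many languages.
# A real table would hold on the order of a million entries.
SENTENCE_TABLE = {
    130461: {
        "en": "I went to the store today.",
        "es": "Hoy fui a la tienda.",
        "fr": "Je suis allé au magasin aujourd'hui.",
    },
}


def display_sentence(number, language):
    """Look up a sentence number and return its text in one language."""
    return SENTENCE_TABLE[number][language]


# The same stored number is shown differently to different readers.
message = 130461
for language in ("en", "es", "fr"):
    print(display_sentence(message, language))
```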

In data transfers, computer systems make extensive use of codecs, meaning compression and decompression, so why not take the same approach to the basic coding of data? Numbering sentences would be far more efficient than today's coding of characters, which was developed long before the easy and widespread global communication of the internet. This sentence transmission and display could readily be included in future operating systems.

Next, let's move on to make this concept even more efficient with what I will call the "sentence package". Each sentence will be assigned a number. My guess is that we can expect to have a million or so sentences, which will be arranged into a logical sequence before being assigned numbers. The way to make this process more efficient is to encode each sentence number directly in binary, instead of spelling out the digits of the number as ASCII characters.

A string of 20 of the digital bits that computers use to store and transmit data will give us 1,048,576 possible combinations. I believe that this will be enough to assign a number to all necessary sentence combinations, along with any control characters that will be needed. We will call this 20-bit string a "sentence package".

It will operate in the same way as the 8-bit bytes used to encode each character and number in ASCII. It might be more effective to enclose this 20-bit sentence package within a string of 24 bits because that would comprise 3 of the bytes that the computer world is accustomed to dealing with. This would also provide plenty of room for any specialized sub-sets of sentences that may be necessary such as one each for doctors, physicists, astronomers, etc. to include the sentences only used by these particular groups in communication. Names could still be spelled out in ASCII when necessary.
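Here is a minimal sketch in Python of what a sentence package might look like when stored in 3 bytes. The function names are made up for illustration, and the use of the 4 leftover bits as a "subset" selector is my own assumption about how the spare room could be used for specialized sets of sentences; it is only one of many possible layouts.

```python
SENTENCE_BITS = 20                      # 2**20 = 1,048,576 sentence numbers
SUBSET_BITS = 4                         # spare bits in a 24-bit package (assumed use)
SENTENCE_LIMIT = 2 ** SENTENCE_BITS
SUBSET_LIMIT = 2 ** SUBSET_BITS


def pack_sentence(number, subset=0):
    """Store a sentence number, plus an assumed subset code, in 3 bytes."""
    if not 0 <= number < SENTENCE_LIMIT:
        raise ValueError("sentence number does not fit in 20 bits")
    if not 0 <= subset < SUBSET_LIMIT:
        raise ValueError("subset code does not fit in 4 bits")
    value = (subset << SENTENCE_BITS) | number
    return value.to_bytes(3, "big")


def unpack_sentence(package):
    """Recover (subset, number) from a 3-byte sentence package."""
    value = int.from_bytes(package, "big")
    return value >> SENTENCE_BITS, value & (SENTENCE_LIMIT - 1)


package = pack_sentence(130461)
print(len(package), unpack_sentence(package))
```

Subset 0 here stands for the general set of sentences; other subset codes could be reserved for doctors, physicists, astronomers, and so on.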

What makes this great increase in efficiency possible is the vast gaps in our written communication: potential words that are not used as words. We could call them non-words. For example, "Ncbda" could be a word but it isn't. The existence of such non-words means that our alphabetic system has much built-in spatial inefficiency. Quite a bit of this inefficiency arises because the words we use revolve around the positioning of vowels and consonants.

Our wording system, if printed as a graph, would look something like a map of the South Pacific or the West Indies. The words we use would be islands, but there would be vast gaps, represented by sea, of potential words that are not used. The way to cut out this inefficiency is to use this idea of sentence packaging. It is the ultimate codec.

In ASCII coding, a simple sentence like "I went to the store today" requires 25 bytes, one for each character, including spaces. Since a byte consists of 8 bits, that means a total of 200 bits of data. In this new system of sentence packaging, it will require only the 20 bits of one sentence package. This is an increase in efficiency by a factor of 10. Even if we use 24 bits, or 3 bytes, per sentence, it brings an increase in efficiency by more than a factor of 8. On top of that, text encoded in this way can be easily displayed in any language.
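The arithmetic can be checked with a few lines of Python. This is only a sketch of the comparison for this one example sentence, written without its closing period as in the count above; the variable names are my own.

```python
sentence = "I went to the store today"

# One byte (8 bits) per character, spaces included.
ascii_bytes = len(sentence.encode("ascii"))
ascii_bits = ascii_bytes * 8

# One sentence package, either bare (20 bits) or padded to 3 bytes (24 bits).
ratio_20_bit = ascii_bits / 20
ratio_24_bit = ascii_bits / 24

print(ascii_bytes, ascii_bits)
print(ratio_20_bit, round(ratio_24_bit, 2))
```

Longer sentences would show a larger gain and shorter ones a smaller gain, since the package size stays fixed while the character count changes.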
