Convert a Block of Digits from Base X to Base Y

Posted on May 1, 2011 by Jonny Dee

This article describes a simple algorithm for transforming a block of digits from base X to base Y. It can be used to encode a string or binary data not only into a standard representation like base16, base32 or base64, but into any base you like.

Recently, I had to encode a block of bytes containing binary data into another representation using fewer than 256 values per character. To be more concrete, the new representation should contain only a specific set of printable characters.

This problem is not new, and there are many more or less standardized encodings for this purpose. A well-known one is Base64, which defines a charset of 64 characters that can be used to represent data. Many such encodings use a power of two as the number of available characters, because this makes encoding and decoding easier and, in fact, more efficient. However, I had to use a non-standardized set of characters, and the number of characters was not a power of two. To make it sound a bit harder (it isn’t, though), the number of characters was a prime number, let’s say 97.

My first step when I need to solve a problem is to look around to see whether others have already solved it. While I could find a lot of converters for converting integer values between decimal, binary, hexadecimal and octal representations, I could not find a general solution that works with numbers of arbitrary length. Note that encoders/decoders like the ones for Base64 can work with numbers of arbitrary length (just see a byte stream as a long number with base 256 digits), but they are not general enough to be used for arbitrary number base conversion. They can only transform a number with base 256 (bytes) into a corresponding number with base 64, and vice versa.

However, I needed a flexible solution which works with arbitrarily long numbers and where the source and destination number bases can be configured. Furthermore, the solution should have as few dependencies on external libraries as possible, because it should run on an embedded system. So I came up with a simple solution written in C which does not depend on any external libraries at all. It might not be the most efficient implementation, but it works for me and seems to perform quite well.

While my original solution is written in C, I will describe the algorithm using Python syntax, because it is very compact and readable and it looks like a pseudo-programming language used to express algorithms in general. Also, I did the C implementation at work and I am not sure if I am allowed to give you the code. But given my explanation, I think you should be able to implement the solution with the programming language of your choice yourself.

The algorithm is based on the following idea:

Suppose you have a series of digits [1,2,3,4] that denotes the number 1234 in the base 10 number system. To get this value you can calculate it like this:

number = 0
for digit in [1,2,3,4]:
   number = number * 10 + digit
print(number)

If you execute this short Python program it will correctly print 1234 to the console. Let’s look at another example using the digits of a binary number. Suppose we have a series of binary digits [1,0,0,1,0,1,1,0,1,1,0,0,0,0,1,1,1,0,1,1,0,1,0,1,1] and we want to calculate the corresponding decimal value. We can calculate it analogously. I have just put the code into a function:

def convertNumberToBase10(digits, base):
   number = 0
   for digit in digits:
      number = number * base + digit
   return number

Invoking the function will yield:

>>> convertNumberToBase10([1,0,0,1,0,1,1,0,1,1,0,0,0,0,1,1,1,0,1,1,0,1,0,1,1], 2)
19761003

Note that this also works for hexadecimal digits (let’s try 0x12D876B):

>>> convertNumberToBase10([0x1, 0x2, 0xD, 0x8, 0x7, 0x6, 0xB], 16)
19761003

This principle works with any base. In particular, we are already able to convert a few bytes into a representation which only uses the base 10 characters [0-9]. How? Just consider a byte stream as a series of digits of a number with base 256. In other words, an ASCII string is actually a number whose digits are the ASCII codes of its characters. So if we want to encode the ASCII string “Easy!” into a base 10 representation we can use exactly the same algorithm (in Python you can get the corresponding list of ASCII codes using the expression [ord(c) for c in "Easy!"], but to make the code understandable for non-Python programmers I evaluated this expression myself and used the resulting ASCII values directly):

>>> convertNumberToBase10([69, 97, 115, 121, 33], 256)
297987701025
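
As a cross-check, and assuming Python 3 is available, the built-in ‘int.from_bytes’ applies the same “bytes are base 256 digits” interpretation. The sketch below repeats the function from above so that it runs on its own; the variable names are chosen for this illustration only:

```python
def convertNumberToBase10(digits, base):
    # Positional evaluation: shift the number by one digit, then add the next digit.
    number = 0
    for digit in digits:
        number = number * base + digit
    return number

text = "Easy!"
codes = [ord(c) for c in text]  # [69, 97, 115, 121, 33]
viaLoop = convertNumberToBase10(codes, 256)

# With big-endian byte order the first byte is the most significant
# base 256 digit, which matches the loop above.
viaBuiltin = int.from_bytes(text.encode("ascii"), "big")

print(viaLoop, viaBuiltin)  # both values are 297987701025
```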

I think after following the previous examples you get the idea. But does it only work with base 10 as the destination number base? The answer is no, it works with any number base as the destination. There is nothing special about the base 10 number system compared to others, except that we are used to thinking and calculating in it. Of course, calculations in base 10 are easier for humans, but for a computer there is no difference. So how can we adapt the algorithm to get a destination number base other than 10? One could use programming language support to get a base 2, base 8, or base 16 representation like this:

>>> bin(convertNumberToBase10([69, 97, 115, 121, 33], 256))
'0b100010101100001011100110111100100100001'
>>> oct(convertNumberToBase10([69, 97, 115, 121, 33], 256))
'04254134674441'
>>> hex(convertNumberToBase10([69, 97, 115, 121, 33], 256))
'0x4561737921'

But what about unsupported number bases? Furthermore, this algorithm only works for small numbers, doesn’t it? Well, for unsupported bases it is up to you to provide the conversion from base 10 to base Y. And regarding the second question, you can be happy if you work with Python, because it has built-in support for calculations with arbitrarily large integers. So a call like this works out of the box:

>>> convertNumberToBase10([ord(c) for c in "The quick brown fox jumps over the lazy dog."], 256)
3024830571690175283291907639196436031967763819210983988162282536502237781693262640684650930677706176554798L
>>> hex(convertNumberToBase10([ord(c) for c in "The quick brown fox jumps over the lazy dog."], 256))
'0x54686520717569636b2062726f776e20666f78206a756d7073206f76657220746865206c617a7920646f672eL'
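
For a base without built-in support, the conversion from a Python integer to base Y can be done by repeated division. The helper below is a minimal sketch of that idea; the name ‘intToDigits’ is made up for this illustration, and it relies on Python’s arbitrary precision integers, so it does not carry over to C as it stands:

```python
def intToDigits(number, base):
    # Hypothetical helper: digits of a non-negative integer in the given
    # base, most significant digit first.
    if number == 0:
        return [0]
    digits = []
    while number > 0:
        # divmod yields the remaining number and the current lowest digit.
        number, digit = divmod(number, base)
        digits.append(digit)
    digits.reverse()
    return digits

print(intToDigits(1234, 10))      # [1, 2, 3, 4]
print(intToDigits(19761003, 16))  # [1, 2, 13, 8, 7, 6, 11]
```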

However, if there is no native language support for arbitrary precision integer calculations, as is the case with C, and if there is little chance you can use an external library such as GMP, then you might not be as happy. And even if you can, there is still the problem of expressing numbers in bases other than the common ones. How do we cope with the latter problem? The idea is to do the calculation of the algorithm right within the destination number base. As you know, the computer internally calculates in base 2, and the output of integer values is produced by an algorithm which converts the binary representation to a common one, like the decimal number system. We will, however, emulate calculation in an arbitrary number base.

So in order to arrive at a solution, we need to solve two problems. To use the algorithm discussed before, we need support for calculations with integers of arbitrary length. And we need to do the calculation within the destination number base. The proposed solution tackles both problems at once: we will implement arbitrary precision integer calculation within an arbitrary number base.

A look at the function ‘convertNumberToBase10’ shown above reveals that a multiplication as well as an addition operation is involved. So we need to implement both operations for arbitrarily long integers. Let’s have a look at the following Python code which implements the addition operation for an arbitrarily long integer within a given base:

def incNumberByValue(digits, base, value):
   # The initial overflow is the 'value' to add to the number.
   overflow = value
   # Traverse list of digits in reverse order.
   for i in reversed(range(len(digits))):
      # If there is no overflow we can stop overflow propagation to next higher digit(s).
      if not overflow:
         return
      sum = digits[i] + overflow
      digits[i] = sum % base
      # Integer division ('//' behaves the same in Python 2 and Python 3).
      overflow = sum // base

This function assumes the number’s digits are right-aligned within the ‘digits’ list, and it assumes the list is big enough to hold the result of the addition. The algorithm itself is straightforward. It adds the provided ‘value’ to the right-most digit. Then it calculates the ‘sum’ modulo the given ‘base’, and the result becomes the new digit. It also does an integer division by ‘base’ and propagates the result to the next higher digit by storing it in ‘overflow’, which is fed into the next iteration. The loop stops as soon as there is no overflow left to propagate to the next higher digit(s). The algorithm is more or less what you do by hand when you add two numbers on paper. However, by hand you calculate everything modulo 10, while this algorithm does it modulo the given base.
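
To see the carry propagation at work, here is a small worked example. It is a sketch that repeats the addition function in a form that also runs under Python 3 (the local variable is called ‘total’ here to avoid shadowing a built-in); the sample numbers are chosen for illustration:

```python
def incNumberByValue(digits, base, value):
    overflow = value
    for i in reversed(range(len(digits))):
        if not overflow:
            return
        total = digits[i] + overflow
        digits[i] = total % base
        overflow = total // base

# 995 + 7 in base 10: the carry travels through both nines.
decimalDigits = [0, 9, 9, 5]
incNumberByValue(decimalDigits, 10, 7)
print(decimalDigits)  # [1, 0, 0, 2]

# 0xFF + 1 in base 16.
hexDigits = [0x0, 0xF, 0xF]
incNumberByValue(hexDigits, 16, 1)
print(hexDigits)  # [1, 0, 0]
```

Note that the leading zero in both inputs is what makes the lists “big enough”: without it, the final carry would have nowhere to go and would be dropped silently.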

Now let’s have a look at the corresponding multiplication implementation:

def multNumberByValue(digits, base, value):
   overflow = 0
   # Traverse list of digits in reverse order.
   for i in reversed(range(len(digits))):
      tmp = (digits[i] * value) + overflow
      digits[i] = tmp % base
      # Integer division ('//' behaves the same in Python 2 and Python 3).
      overflow = tmp // base

As you can see this turns out to be at least as easy as the addition. We traverse the digits from right to left and multiply each one by the provided ‘value’. We calculate the new digit and the overflow as we did in the addition case, and the overflow is propagated to the next higher digit(s) as before. Note that in the multiplication case we cannot stop the loop if there is no overflow, because we must visit each digit and multiply it by ‘value’.
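
A short worked example may help here as well. The sketch below repeats the multiplication function in a form that runs under Python 3 and multiplies a decimal number by 7 and by 256; the second case shows that ‘overflow’ may well become larger than the base itself. The sample values are illustrative:

```python
def multNumberByValue(digits, base, value):
    overflow = 0
    for i in reversed(range(len(digits))):
        tmp = digits[i] * value + overflow
        digits[i] = tmp % base
        overflow = tmp // base

# 123 * 7 = 861 in base 10.
small = [0, 1, 2, 3]
multNumberByValue(small, 10, 7)
print(small)  # [0, 8, 6, 1]

# 99 * 256 = 25344 in base 10; the overflow reaches values like 230 and 253.
large = [0, 0, 0, 9, 9]
multNumberByValue(large, 10, 256)
print(large)  # [2, 5, 3, 4, 4]
```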

So what do we need to do now? Nothing, we are done! Ok, nearly: we need to rewrite the algorithm discussed at the beginning such that it uses our new functions, but that’s it. Here is what it looks like:

def convertNumber(srcDigits, srcBase, destDigits, destBase):
   for srcDigit in srcDigits:
      multNumberByValue(destDigits, destBase, srcBase)
      incNumberByValue(destDigits, destBase, srcDigit)

You can now convert arbitrarily long numbers from base X to base Y using this function. To facilitate experimentation, however, I’ll introduce a slightly improved version, which allocates enough digits for the destination number and returns the converted list of digits:

import math

def convertNumberExt(srcDigits, srcBase, destBase):
   # Generate a list of zeros which is long enough to hold the destination number.
   destDigits = [0] * int(math.ceil(len(srcDigits)*math.log(srcBase)/math.log(destBase)))
   # Do conversion.
   convertNumber(srcDigits, srcBase, destDigits, destBase)
   # Return result.
   return destDigits
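
The length calculation follows from a simple bound: a number with n digits in base X is smaller than X^n, so m destination digits are enough as soon as Y^m >= X^n, that is, m >= n * log(X) / log(Y). The sketch below isolates this calculation in a helper (the name ‘requiredDigits’ is made up for this illustration) and evaluates it for a few input lengths:

```python
import math

def requiredDigits(srcLength, srcBase, destBase):
    # Hypothetical helper: smallest m with destBase**m >= srcBase**srcLength,
    # computed the same way as in convertNumberExt.
    return int(math.ceil(srcLength * math.log(srcBase) / math.log(destBase)))

print(requiredDigits(44, 256, 64))  # 59
print(requiredDigits(59, 64, 97))   # 54
print(requiredDigits(54, 97, 256))  # 45
```

Because the result is rounded up, the destination list can be one digit longer than strictly necessary, which shows up as a leading zero in the output.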

For example, to encode a string to base 64 you can do:

>>> base256 = [ord(c) for c in "The quick brown fox jumps over the lazy dog."]
>>> base256
[84, 104, 101, 32, 113, 117, 105, 99, 107, 32, 98, 114, 111, 119, 110, 32, 102, 111, 120, 32, 106, 117, 109, 112, 115, 32, 111, 118, 101, 114, 32, 116, 104, 101, 32, 108, 97, 122, 121, 32, 100, 111, 103, 46]
>>> base64 = convertNumberExt(base256, 256, 64)
>>> base64
[5, 17, 40, 25, 18, 1, 49, 29, 22, 37, 35, 26, 50, 1, 34, 28, 38, 61, 55, 27, 34, 1, 38, 27, 55, 32, 32, 26, 39, 21, 45, 28, 7, 12, 32, 27, 55, 25, 37, 28, 34, 1, 52, 26, 6, 20, 32, 27, 6, 5, 58, 30, 18, 1, 36, 27, 54, 28, 46]

Or if you want a base 64 to base 97 conversion you can do:

>>> base97 = convertNumberExt(base64, 64, 97)
>>> base97
[1, 50, 41, 20, 95, 6, 37, 16, 82, 57, 92, 96, 67, 35, 52, 44, 84, 85, 62, 5, 14, 39, 23, 21, 74, 30, 25, 18, 64, 79, 78, 51, 89, 73, 87, 24, 27, 24, 43, 1, 72, 18, 11, 74, 57, 65, 41, 58, 84, 24, 76, 84, 49, 78]

And if you want to convert back to binary/byte representation you can do:

>>> convertNumberExt(base97, 97, 256)
[0, 84, 104, 101, 32, 113, 117, 105, 99, 107, 32, 98, 114, 111, 119, 110, 32, 102, 111, 120, 32, 106, 117, 109, 112, 115, 32, 111, 118, 101, 114, 32, 116, 104, 101, 32, 108, 97, 122, 121, 32, 100, 111, 103, 46]

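As you can see, the last output is equal to the one for ‘base256’ except for the leading 0, which can be truncated, because leading zeros of a number don’t carry information. (This holds as long as the data is treated as a number: if the original byte block itself starts with zero bytes, those are dropped as well, so their count has to be stored separately.) Note that we did a round trip (base256 -> base64 -> base97 -> base256), which should give you confidence that the solution works.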

If we don’t want leading zeros in the converted result we can change our implementation accordingly:

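def withoutLeadingZeros(digits):
   # Return the digits starting at the first non-zero digit.
   for i in range(len(digits)):
      if digits[i] != 0:
         return digits[i:]
   # All digits are zero (or the list is empty): keep at most a single zero.
   return digits[-1:]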

def convertNumberExt(srcDigits, srcBase, destBase):
   # Generate a list of zeros which is long enough to hold the destination number.
   destDigits = [0] * int(math.ceil(len(srcDigits)*math.log(srcBase)/math.log(destBase)))
   # Do conversion.
   convertNumber(srcDigits, srcBase, destDigits, destBase)
   # Return result (without leading zeros).
   return withoutLeadingZeros(destDigits)
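
To get from digit values to the printable representation mentioned at the beginning, each digit can be used as an index into a character set. The following is a minimal sketch for Python 3, not the original C implementation: the 62 character alphabet and the names ‘ALPHABET’, ‘convertDigits’, ‘encode’ and ‘decode’ are assumptions made for this example, and the multiplication and addition steps are merged into one pass to keep it short:

```python
import math
import string

# Hypothetical alphabet: 62 printable characters, so the target base is 62.
ALPHABET = string.digits + string.ascii_letters

def convertDigits(srcDigits, srcBase, destBase):
    # Compact variant of convertNumberExt: returns the digits in destBase
    # without leading zeros.
    size = int(math.ceil(len(srcDigits) * math.log(srcBase) / math.log(destBase)))
    destDigits = [0] * size
    for srcDigit in srcDigits:
        # destDigits = destDigits * srcBase + srcDigit, done in one pass.
        overflow = srcDigit
        for i in reversed(range(size)):
            tmp = destDigits[i] * srcBase + overflow
            destDigits[i] = tmp % destBase
            overflow = tmp // destBase
    first = 0
    while first < size - 1 and destDigits[first] == 0:
        first += 1
    return destDigits[first:]

def encode(data):
    # Bytes are base 256 digits; map each resulting digit to a character.
    digits = convertDigits(list(data), 256, len(ALPHABET))
    return "".join(ALPHABET[d] for d in digits)

def decode(text):
    # Map characters back to digit values, then convert back to base 256.
    digits = [ALPHABET.index(c) for c in text]
    return bytes(convertDigits(digits, len(ALPHABET), 256))

encoded = encode(b"Easy!")
print(encoded)
print(decode(encoded))
```

Since leading zeros are stripped, this sketch only round-trips data that does not start with a zero byte.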

You can download the complete source code for number conversion here.

That’s it. Have fun with your experiments!



Stephen McKamey, on July 24, 2012 at 17:12 said:

Jonny Dee, you rock! This solved a big problem I was having. The key insight that made the solution click for me was “Just consider a byte stream as a series of digits of a number with base 256.” That’s a great way of thinking of the problem.

I implemented your algorithm and it is working great. BTW, I noticed when implementing mine, that the multiply and add functions are nearly identical. You can combine them into a single “pushDigit” function by simply initializing the multiply overflow to the add value.

Jonny Dee, on July 25, 2012 at 06:43 said:

Hi Stephen, thanks for your feedback 🙂 I’m glad to hear this post helped you. Using a single generic “pushDigit” function is a great idea, thanks again! 🙂
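
The combined function suggested in this exchange could be sketched as follows (Python 3). Only the name “pushDigit” comes from the comment above; the signature and parameter names are assumptions made for this illustration:

```python
def pushDigit(digits, base, factor, value):
    # Computes digits = digits * factor + value in place, in a single pass.
    # Starting the overflow at 'value' folds the addition into the
    # multiplication loop.
    overflow = value
    for i in reversed(range(len(digits))):
        tmp = digits[i] * factor + overflow
        digits[i] = tmp % base
        overflow = tmp // base

def convertNumber(srcDigits, srcBase, destDigits, destBase):
    for srcDigit in srcDigits:
        pushDigit(destDigits, destBase, srcBase, srcDigit)

# 0x4D2 is 1234 in decimal.
destDigits = [0] * 4
convertNumber([0x4, 0xD, 0x2], 16, destDigits, 10)
print(destDigits)  # [1, 2, 3, 4]
```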

C, on September 26, 2013 at 23:30 said:

I followed your link on stack overflow to here – http://stackoverflow.com/questions/849598/print-large-base-256-array-in-base-10-in-c/962806#962806

You stipulate that you don’t use any “python magic”. You do, in fact. Python supports arbitrary width integers, C does not.

If you pass a number larger than the highest integer value supported by your C implementation to convertNumberToBase10(), then the returned number represents an overflow. Therefore, your simple algorithm doesn’t work with C. NDA aside, your algorithm doesn’t work with C.

Jonny Dee, on September 27, 2013 at 07:54 said:

I guess you stopped reading this article after you saw the mentioned function. Please read it completely as the issue with large integers in C is mentioned afterwards. And a solution which works for arbitrary large integers in C is presented, too.
Regards, Jonny

- James, on November 25, 2014 at 09:53 said:
  
  I’m seeing the overflow problem too and have read the entire article so I must be missing something. I can see that digits appears to be an array and you iterate through that array during your addition/multiplication operation. As stated, this array needs to be big enough to store the result (a large integer), however the overflow does not appear to be an array so will be limited by your C implementation as mentioned previously. Is this correct?
  
  As I see it, this algorithm works fine to perform additions/multiplications on arbitrary large numbers but only providing the number you want to add/multiply by is not overly large. You can’t add a large number to another large one?
  
  What am I missing?
- Jonny Dee, on November 25, 2014 at 19:33 said:
  
  Hi, thank you for your feedback!
  
  interloper255@gmail.com mentioned the function convertNumberToBase10():
  
  “If you pass a number larger than the highest integer value supported by your C implementation to convertNumberToBase10(), then the returned number represents an overflow.”
  
  Of course, he is right — this function really can cause an overflow in C if the resulting base 10 integer is too big. The reason why I guessed he stopped reading is that, later on, I present an algorithm which does not suffer from this problem.
  
  Maybe you misunderstood something. I guess you think that the conversion from, let’s say, base 10 to base 16 is done by providing the base 10 number as a single integer, an “unsigned long long” for instance. But this is not the case. You have to provide the base 10 integer as an array of integral numbers. So in order to convert the base 10 number 12345678901234567890 to a base 16 number you would create an integer array [1,2,3,4,5,6,7,8,9,0,1,2,3,4,5,6,7,8,9,0] and pass that array to the function convertNumber() or convertNumberExt(). As a result you would get a new integer array with digit values from 0 up to 15.
  
  Let’s look at the code in function multNumberByValue(). Maybe this is the line which causes the headache:
  tmp = (digits[i] * value) + overflow
  So for ‘tmp’ we need an integral type which is able to hold the resulting value. But this should nearly always be the case, as “(digits[i] * value)” represents the multiplication of the i-th digit value with a number base value. And the overflow is much smaller, as its value is “tmp / base”. So in fact, an overflow would only occur if you tried to convert base 2^64 numbers where a single digit may be as big as 2^64-1 (!).
  
  I did implement this algorithm in C, and it worked well. But maybe it’s indeed me who is missing something. In this case I really would appreciate an example which demonstrates the problem. Just use the Python implementation and present me an input for the convert function which would be a problem in the C implementation case. That’d be nice 🙂
  
  Cheers, Jonny
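
The size of the intermediate values discussed in this thread can also be checked empirically. Each destination digit is at most destBase - 1 and the overflow of the multiplication step never exceeds srcBase - 1, so ‘tmp’ stays below srcBase * destBase; the sum in the addition step is smaller still. The sketch below (Python 3, with a function name made up for this illustration) performs the same steps as convertNumber() and records the largest intermediate value. Under this reasoning, an integer type in C that can hold srcBase * destBase - 1 would be wide enough, which is an inference from the Python code and not a statement about the original C implementation:

```python
def convertTrackingMax(srcDigits, srcBase, destDigits, destBase):
    # Same steps as convertNumber(), but returns the largest intermediate value.
    largest = 0
    for srcDigit in srcDigits:
        # Multiplication step.
        overflow = 0
        for i in reversed(range(len(destDigits))):
            tmp = destDigits[i] * srcBase + overflow
            largest = max(largest, tmp)
            destDigits[i] = tmp % destBase
            overflow = tmp // destBase
        # Addition step.
        overflow = srcDigit
        for i in reversed(range(len(destDigits))):
            if not overflow:
                break
            tmp = destDigits[i] + overflow
            largest = max(largest, tmp)
            destDigits[i] = tmp % destBase
            overflow = tmp // destBase
    return largest

# 40 bytes of 0xFF (the number 256**40 - 1) converted to base 97.
srcDigits = [255] * 40
destDigits = [0] * 49
largest = convertTrackingMax(srcDigits, 256, destDigits, 97)
print(largest, 256 * 97)
```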

Little John, on January 17, 2015 at 20:02 said:

Hi, thank you very much for that great code!
I implemented it in PureBasic, and it works fine.
Much appreciated!
see http://www.purebasic.fr/english/viewtopic.php?p=459536#p459536
