Archive for January, 2010

Filed Under (Python) by Marcin Kuźmiński on January-2-2010

Reading a huge file (e.g. 70 GB) in Python by simply calling open() and reading it all at once can get you into trouble :).
There is a clean and simple solution for reading huge files or lists.
Using the with open() as statement together with a Python generator function, you can write a simple function
that reads such files without eating up all of the machine's memory. Each iteration over such a function performs a read of
a given size, and the with statement makes sure the file is closed when the iteration is finished.

An example of such a function:

def read_large_file(filename, mode='rb', size=1024):
    '''
    A lazy generator function that reads a file in chunks of a given size
    USAGE:
    for data_chunk in read_large_file('/tmp/huge.file', 'rb', 10240):
        print(data_chunk)
    @param filename: a filename
    @param mode: read mode
    @param size: size to read at one iteration
    '''
    with open(filename, mode) as f:
        while True:
            data = f.read(size)
            # an empty read means the end of the file was reached
            if not data:
                break
            yield data
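
Because each chunk is handed over one at a time, the generator plugs straight into anything that can consume data piece by piece. Below is a minimal sketch, assuming a small temporary file as a stand-in for the huge one; the helper name sha256_of_file and the chunk sizes are made up for this illustration.

```python
import hashlib
import os
import tempfile


def read_large_file(filename, mode='rb', size=1024):
    """Yield successive chunks of at most `size` bytes from `filename`."""
    with open(filename, mode) as f:
        while True:
            data = f.read(size)
            if not data:
                break
            yield data


def sha256_of_file(filename, size=1024 * 1024):
    """Hypothetical helper: hash a file without loading it whole."""
    digest = hashlib.sha256()
    for chunk in read_large_file(filename, 'rb', size):
        digest.update(chunk)
    return digest.hexdigest()


# A 5000-byte temporary file stands in for the huge file here.
fd, demo_path = tempfile.mkstemp()
os.write(fd, b'x' * 5000)
os.close(fd)

# Only one chunk of at most 2048 bytes is held by the generator at a time.
chunk_sizes = [len(chunk) for chunk in read_large_file(demo_path, 'rb', 2048)]
checksum = sha256_of_file(demo_path)
os.remove(demo_path)

print(chunk_sizes)
```

The last chunk is simply shorter than the others, so the calling code never has to know the file size in advance.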

Remember that the with statement is supported out of the box from Python 2.6 on; in Python
2.5 you have to do

from __future__ import with_statement

before using it. If you have trouble using the with statement, I recommend reading
this link
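
For anyone unsure what the with statement does in this function, a rough equivalent written with try/finally is sketched below. The function name read_large_file_no_with and the temporary demo file are made up for this illustration.

```python
import os
import tempfile


def read_large_file_no_with(filename, mode='rb', size=1024):
    """Same lazy reader, with the cleanup spelled out by hand."""
    f = open(filename, mode)
    try:
        while True:
            data = f.read(size)
            if not data:
                break
            yield data
    finally:
        # Runs when the loop ends, when an error is raised,
        # or when the generator itself is closed early.
        f.close()


# A 3000-byte temporary file stands in for the huge file here.
fd, demo_path = tempfile.mkstemp()
os.write(fd, b'abc' * 1000)
os.close(fd)

chunks = list(read_large_file_no_with(demo_path, 'rb', 1024))
os.remove(demo_path)

print(len(chunks))
```

The with version is shorter and harder to get wrong, which is why it is the one used in the function above.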



Filed Under (Python) by Marcin Kuźmiński on January-1-2010

I always loved how the print_r() function worked in PHP. In Python, print does print everything, but when large collections (e.g. huge dicts) are printed with it you can get lost in the output. I just discovered a slightly smarter print function called pprint, from the pprint module:

from pprint import pprint

This is great for debugging collections. pprint prints data types such as lists, dicts and tuples in a nicer, human-readable form.

l = [x for x in range(100)]
l.insert(0, l[:])
pprint(l)
l = dict([(x, x) for x in range(100)])

pprint(l)
# this prints the dict in a more convenient way, one item per line
{0: 0,
 1: 1,
 2: 2,
 3: 3,
 4: 4,
 5: 5,
 6: 6,
 7: 7,
 8: 8,
 9: 9,
 ...
 98: 98,
 99: 99}

There are two interesting arguments to the pprint function:
width
The attempted maximum number of columns in the output.
width=100 will wrap any output that would be longer than 100 characters.
and
depth
The maximum depth to which nested structures are printed.
e.g.:

pprint(stuff, depth=3)
['aaaaaaaaaa',
 ('spam', ('eggs', (...))),
 ['aaaaaaaaaaaaaaaaaaaaaaaaaaaaaa', 'bbbbbbbbbbbbbbbbbbbbbbbbbbbbbb'],
 ['cccccccccccccccccccc', 'dddddddddddddddddddd']]
# As you can see, the tuple at index 1 is nested more than 3 levels deep, so pprint cuts it off with (...).
# Imagine how much this helps when debugging really complicated structures.
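
Both arguments are easy to try on small data. The sketch below uses pformat from the same module, which takes the same width and depth arguments but returns the formatted string instead of printing it. The words and nested values are made up for this illustration.

```python
from pprint import pformat

words = ['alpha', 'beta', 'gamma', 'delta', 'epsilon']

# With the default width of 80 the whole list fits on one line.
wide = pformat(words)
# With width=20 it no longer fits, so each item goes on its own line.
narrow = pformat(words, width=20)

nested = ['a' * 10,
          ('spam', ('eggs', ('ham', ('bacon',)))),
          ['c' * 20, 'd' * 20]]

# Everything nested deeper than 3 levels is replaced with (...).
shallow = pformat(nested, depth=3)

print(wide)
print(narrow)
print(shallow)
```

Note that width is only an attempted maximum: a single item longer than the limit is still printed in full.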