Rezha Julio

Hi!
My name is Rezha Julio
I am a chemist graduate from Bandung Institute of Technology. Currently working as Data Engineer at Traveloka.
You can reach me by email:

contact@rezhajulio.id

, @ Q Q

Powered by Hugo

filter by tags

New interesting data structures in Python 3

time to read 4 min | 845 words

Python 3’s uptake is dramatically on the rise rise these days, and I think therefore that it is a good time to take a look at some data structures that Python 3 offers, but that are not available in Python 2.

We will take a look at typing.NamedTuple, types.MappingProxyType and types.SimpleNamespace, all of which are new to Python 3.

typing.NamedTuple

typing.NamedTuple is a supercharged version of the venerable collections.namedtuple and while it was added in Python 3.5, it really came into its own in Python 3.6.

In comparions to collections.namedtuple, typing.NamedTuple gives you (Python >= 3.6):

  • nicer syntax

  • inheritance

  • type annotations

  • default values (python >= 3.6.1)

  • equally fast

See an illustrative typing.NamedTuple example below:

>>> from typing import NamedTuple

>>> class Student(NamedTuple):
>>>     name: str
>>>     address: str
>>>     age: int
>>>     sex: str
    
>>> tommy = Student(name='Tommy Johnson', address='Main street',     age=22, sex='M')
>>> tommy
    Student(name='Tommy Johnson', address='Main street', age=22, sex='M')

I like the class-based syntax compared to the old function-based syntax, and find this much more readable.

The Student class is a subclass of tuple, so it can be handled like any normal tuple:

>>> isinstance(tommy, tuple)
    True
>>> tommy[0]
    'Tommy Johnson'

A more advanced example, subclassing Student and using default values (note: default values require Python >= 3.6.1):

>>> class MaleStudent(Student):
>>>     sex: str = 'M'  # default value, requires Python >= 3.6.1 

>>>  MaleStudent(name='Tommy Johnson', address='Main street', age=22)
     MaleStudent(name='Tommy Johnson', address='Main street', age=22, sex='M')  # note that sex defaults to 'M'

In short, this modern version of namedtuples is just super-nice, and will no doubt become the standard namedtuple variation in the future.

See the docs for further details.

types.MappingProxyType

types.MappingProxyType is used as a read-only dict and was added in Python 3.3.

That types.MappingProxyType is read-only means that it can’t be directly manipulated and if users want to make changes, they have to deliberately make a copy, and make changes to that copy. This is perfect if you’re handing a dict -like structure over to a data consumer, and you want to ensure that the data consumer is not unintentionally changing the original data. This is often extremely useful, as cases of data consumers changing passed-in data structures leads to very obscure bugs in your code that are difficult to track down.

A types.MappingProxyType example:

>>>  from  types import MappingProxyType
>>>  data = {'a': 1, 'b':2}
>>>  read_only = MappingProxyType(data)
>>>  del read_only['a']
TypeError: 'mappingproxy' object does not support item deletion
>>>  read_only['a'] = 3
TypeError: 'mappingproxy' object does not support item assignment

Note that the example shows that the read_only object cannot be directly changed.

So, if you want to deliver data dicts to different functions or threads and want to ensure that a function is not changing data that is also used by another function, you can just deliver a MappingProxyType object to all functions, rather than the original dict, and the data dict now cannot be changed unintentionally. An example illustrates this usage of MappingProxyType:

>>>  def my_func(in_dict):
>>>     ...  # lots of code
>>>     in_dict['a'] *= 10  # oops, a bug, this will change the sent-in dict

...
# in some function/thread:
>>>  my_func(data)
>>>  data
data = {'a': 10, 'b':2}  # oops, note that data['a'] now has changed as an side-effect of calling my_func

If you send in a mappingproxy to my_func instead, however, attempts to change the dict will result in an error:

>>>  my_func(MappingProxyType(data))
TypeError: 'mappingproxy' object does not support item deletion

We now see that we have to correct the code in my_func to first copy in_dict and then alter the copied dict to avoid this error. This feature of mappingproxy is great, as it helps us avoid a whole class of difficult-to-find bugs.

Note though that while read_only is read-only, it is not immutable, so if you change data, read_only will change too:

>>>  data['a'] = 3
>>>  data['c'] = 4
>>>  read_only  # changed!
mappingproxy({'a': 3, 'b': 2, 'c': 4})

We see that read_only is actually a view of the underlying dict, and is not an independent object. This is something to be aware of. See the docs for further details.

types.SimpleNamespace

types.SimpleNamespace is a simple class that provides attribute access to its namespace, as well as a meaningful repr. It was added in Python 3.3.

>>>  from types import SimpleNamespace

>>>  data = SimpleNamespace(a=1, b=2)
>>>  data
namespace(a=1, b=2)
>>>  data.c = 3
>>>  data
namespace(a=1, b=2, c=3)

In short, types.SimpleNamespace is just a ultra-simple class, allowing you to set, change and delete attributes while it also provides a nice repr output string.

I sometimes use this as an easier-to-read-and-write alternative to dict. More and more though, I subclass it to get the flexible instantiation and repr output for free:

>>>  import random

>>>  class DataBag(SimpleNamespace):
>>>     def choice(self):
>>>         items = self.__dict__.items()
>>>         return random.choice(tuple(items))

>>>  data_bag = DataBag(a=1, b=2)
>>>  data_bag
DataBag(a=1, b=2)  
>>>  data_bag.choice()
(b, 2)

This subclassing of types.SimpleNamespace is not revolutionary really, but it can save on a few lines of text in some very common cases, which is nice. See the docs for details.

Keyword argument demystify

time to read 3 min | 637 words

There’s a lot of baffling among Python programmers on what exactly “keyword arguments” are. Let’s go through some of them.

If you somehow are writing for a Python 3 only codebase, I highly recommend making all your keyword arguments keyword only, especially keyword arguments that represent “options”.

There are many problems with this sentence. The first is that this is mixing up “arguments” (i.e. things at the call site) and “parameters” (i.e. things you declare when defining a function). So:

def foo(a, b):  # <- a and b are "parameters" or "formal arguments"
    pass

foo(1, 2)  # <- 1 and 2 are arguments to foo, that match a and b

This confusion is common among programmers. I also use the word “argument” when I mean “parameter”, because normally in conversation we can tell the difference in context. Even the documentation in the Python standard library uses these as synonyms.

The code above is the basic case with positional arguments. But we were talking about keyword arguments so let’s talk about those too:

def bar(a,    # <- this parameter is a normal python parameter
        b=1,  # <- this is a parameter with a default value
        *,    # <- all parameters after this are keyword only
        c=2,  # <- keyword only argument with default value
        d):   # <- keyword only argument without default value
    pass

So far so good. Now, let’s think about the statement we started with:

I highly recommend making all your keyword arguments keyword only

That implies there are keyword arguments that are not keyword only arguments. That’s sort of correct, but also very wrong. Let’s have some examples of usages of bar :

bar(1)         # one positional argument
bar(1, 2)      # two positional arguments
bar(a=1)       # one keyword argument
bar(a=1, b=2)  # two keyword arguments
bar(1, d=2)    # one positional and one keyword argument

The trick here is to realize that a “keyword argument” is a concept of the call site, not the declaration. But a “keyword only argument” is a concept of the declaration, not the call site. Super confusing!

There are also parameters that are positional only. The function sum in the standard library is like this: according to the documentation it looks like this:sum(iterable[, start]) But there’s a catch!

>>> sum(iterable=[1, 2])
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
TypeError: sum() takes no keyword arguments

And the start parameter can’t be used as a keyword argument either, even though it’s optional! Recap

(I’m using “argument” here even though “parameter” or “formal argument” would be more correct, but the Python standard library uses these all as synonyms so I will too, so my wording matches the documentation.)

Python functions can have :

  • Arguments that can be used both as positional and keyword arguments (this is the most common case)
  • Arguments that can be used both as positional and keyword arguments with default values (or just “arguments with default values”)
  • Positional only arguments (like the first argument tosum, this is uncommon and can only be achieved by functions implemented in C)
  • Positional only arguments with default values (like above, only for C)
  • Optional positional only arguments (2nd argument to sum, like above, only for C)
  • Keyword only arguments
  • Keyword only arguments with default values
  • Arbitrary positional arguments ( *args )
  • Arbitrary keyword arguments ( **kwargs )

When calling Python functions you can have:

  • Positional arguments
  • Keyword arguments

It’s very simple at the call site, but a lot more complex at the function definition, and how call site arguments are mapped to the declaration is quite complex.

Summary

Python appears simple because most of these rules and distinctions are so well thought out that many programmers can go years in a professional career and believe defaults arguments and keyword arguments are the same, and never get bitten by this incorrect belief.

Looping techniques in Python

time to read 1 min | 164 words

Python has multiple techniques for looping over data structures.

Dictionary looping with both key and value can be done using the items() method:

my_dict = {'first': 'a', 'second': 'b'}
for k, v in my_dict.items():
    print(k, v)
# first a
# second b

The enumerate() function allows looping with both index and value through any sequence:

my_list = ['a', 'b']
for i, v in enumerate(my_list):
    print(i, v)
# 0 a
# 1 b

zip() function can be used to pair two or more sequences in order to loop over both of them in parallel:

first_list = ['a', 'b']
second_list = ['one', 'two']
for f, s in zip(first_list, second_list):
    print(f, s)
# a one
# b two

To loop in a sorted order, use the sorted() function:

my_list = ['b', 'c', 'a']
for f in sorted(my_list):
    print(f)
# a
# b
# c

To loop in reverse, pass the sorted list to the reversed() function:

for f in reversed(sorted(set(my_list))):
  print(f)
# c
# b
# a

Python Data StructureEnhance your tuples

time to read 1 min | 110 words

Standard Python tuples are lightweight sequences of immutable objects, yet their implementation may prove inconvenient in some scenarios.

Instead, the collections module provides an enhanced version of a tuple, namedtuple, that makes member access more natural (rather than using integer indexes).

Import namedtuple:

from collections import namedtuple

Create a namedtuple object:

point = namedtuple('3DPoint', 'x y z')
A = point(x=3, y=5, z=6)
print(A)
# 3DPoint(x=3, y=5, z=6)

Access a specific member:

print(A.x)
# 3

Because namedtuples are backwards compatible with normal tuples, member access can be also done with indexes:

print(A[0])
# 3

To convert a namedtuple to a dict (Actually an OrderedDict):

print(A._asdict())
#OrderedDict([('x', 3), ('y', 5), ('z', 6)])

Python Data StructureGet more with collections!

time to read 1 min | 97 words

In addition to Python’s built-in data structures (such as tuples, dicts, and lists), a library module called collections provides data structures with additional features, some of which are specializations of the built-in ones.

Import the module:

import collections

Specialized container datatypes are usually dict subclasses or wrappers around other classes like lists, tuples, etc.

Notable implementations are :

  • the Counter class used for counting hashable objects.
  • defaultdict class used as a faster implementation of a specialised dictionary.
  • namedtuple class used for defining a meaning for every position in a tuple, often useful with databases or CSV files.

There is more to copying

time to read 1 min | 181 words

An assignment only creates a “binding” (an association) between a name and a “target” (object of some type). A copy is sometimes necessary so you can change the value of one object without changing the other (when two names are pointing to the same object).

# Assignment: bind the name y to 
# the list [1, 2].
y = [1, 2 ]
# Create another binding - 
# bind the name x to the same 
# object that y is currently bound to.
x = y 
# x[0 ] is changed too, when y[0 ] is. 
y[0] = 99
print(x[0])
# 99

The copy module has methods to support both shallow and deep copying of objects.

To create a shallow copy (construct a new object, but references to the objects found in the original are inserted):

from copy import copy

y = [1, 2 ]
x = copy(y)
# note that x = y.copy() works
y[0] = 99
print(x[0])
# 1

To create a deep copy (instead of references, multiple copies are used):

from copy import deepcopy
#...
x = deepcopy(y)

RECENT SERIES

  1. java 101 (13):
    Apr 29, 2017 - Translating Scanner tokens into primitive types
  2. python data structure (5):
    May 03, 2017 - Enhance your tuples
  3. python function (2):
    Apr 16, 2017 - Lambda Functions in Python
  4. python generator (4):
    Apr 26, 2017 - Next, Function or Method ?

Friends of Rezha