Today Greece is again all over the news because of the Eurogroup and the new deal the Greek government is trying to achieve through its representatives. So why not check which website has a headline about Greece on its front page, as fast as possible?
Python 3.2 introduced concurrent.futures, which is a simple interface for running tasks asynchronously in parallel: you submit callables to an executor and collect the results through Future objects. Let's check it out with a very simple example.
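In a nutshell, the interface looks like this (a toy sketch unrelated to the crawling example below; square is just a placeholder function):

{% highlight python startinline=true %}
import concurrent.futures

def square(n):
    return n * n

# submit() schedules the callable and returns a Future;
# result() blocks until the value is ready.
with concurrent.futures.ThreadPoolExecutor(max_workers=2) as executor:
    future = executor.submit(square, 7)
    print(future.result())  # 49
{% endhighlight %}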
We will crawl 5 websites; some of them do not contain English characters, and I selected them just to check the difference. I will use lxml and requests, which are easily installed with pip (pip3 in my case :)).
The simple iteration through the URLs
{% highlight python startinline=true %}
from lxml import html
import requests
import sys

urls = [
    'http://www.theguardian.com/uk',
    'http://www.bbc.co.uk',
    'http://www.in.gr',
    'http://www.ethnos.gr',
    'https://uk.yahoo.com'
]

def parse(urls):
    # Fetch each page sequentially and print the position of the search
    # term (passed as the first command line argument) in the page source.
    for url in urls:
        page = requests.get(url)
        print(page.text.find(sys.argv[1]))

parse(urls)
{% endhighlight %}
![simple loop iteration]({{ site.url }}/assets/media/single.png)
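If you want to reproduce the timing comparison, a simple wall-clock wrapper is enough; the sketch below is one way to do it (not necessarily how the screenshots were produced):

{% highlight python startinline=true %}
import time

# Wrap either version in a wall-clock measurement; parse() and urls come
# from the block above.
start = time.perf_counter()
parse(urls)  # or run the ThreadPoolExecutor block from the next section
print('took %.2f seconds' % (time.perf_counter() - start))
{% endhighlight %}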
The asynchronous concurrent approach
{% highlight python startinline=true %}
import concurrent.futures
import urllib.request
from lxml import html
import requests
import sys

urls = [
    'http://www.theguardian.com/uk',
    'http://www.bbc.co.uk',
    'http://www.in.gr',
    'http://www.ethnos.gr',
    'https://uk.yahoo.com'
]

def parse(url):
    # Fetch a single page and print the position of the search term
    # (passed as the first command line argument) in the page source.
    page = requests.get(url)
    print(page.text.find(sys.argv[1]))

# Submit one task per URL to a pool of 5 worker threads and consume the
# futures as they complete.
with concurrent.futures.ThreadPoolExecutor(max_workers=5) as executor:
    future_to_url = {executor.submit(parse, url): url for url in urls}
    for future in concurrent.futures.as_completed(future_to_url):
        url = future_to_url[future]
        future.result()  # re-raise any exception from the worker
{% endhighlight %}
![parallel iteration]({{ site.url }}/assets/media/multi.png)
On average, the parallel version ran about twice as fast over the full set of URLs. As you might have noticed, because the tasks complete asynchronously, the result of the second URL is printed third in the parallel run.
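If the order of the output matters to you, executor.map is a convenient alternative: it still runs the calls concurrently but yields the results in the same order as the input URLs. A sketch, reusing the urls list from above (find_term is just a variant of parse that returns the position instead of printing it):

{% highlight python startinline=true %}
import concurrent.futures
import requests
import sys

def find_term(url):
    # Return the position of the search term instead of printing it,
    # so the caller can pair each result with its URL.
    page = requests.get(url)
    return page.text.find(sys.argv[1])

# map() yields results in input order, regardless of completion order.
with concurrent.futures.ThreadPoolExecutor(max_workers=5) as executor:
    for url, position in zip(urls, executor.map(find_term, urls)):
        print(url, position)
{% endhighlight %}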
You can very easily replace the body of the parse function to process your own set of data :)
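For example, a variant that actually uses lxml to look at the title of each page could look roughly like this (check_title is a hypothetical name; adapt it to whatever you want to extract):

{% highlight python startinline=true %}
from lxml import html
import requests
import sys

def check_title(url):
    # Parse the page with lxml and report whether the search term
    # appears in its <title> element.
    page = requests.get(url)
    tree = html.fromstring(page.content)
    title = tree.findtext('.//title') or ''
    print(url, sys.argv[1] in title, title)
{% endhighlight %}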
Cheers