Thursday, June 28, 2007

Creating a simple yet powerful anonymous downloader in Python

Let's suppose, for example, that we have to download a large number of pages from some web server. It may be all the personal pages of the server's users, a list of postal addresses, or a list of firms.
Python is a very convenient language for this task because it has an Interactive Mode (or REPL - Read-Eval-Print Loop). So we can start Python and write something like this:

>>> import urllib, re  # some modules which we will need later

If the target web site has some kind of index page, we should start by fetching it and parsing it into items:

>>> index_page = urllib.urlopen('http://example.com/index').read()  # we have read the contents of the page
>>> items_to_get = re.findall('some regexp', index_page)  # now we have a list of all items to download; the regexp may be something like "<a href='([^']*)'>"
>>> out = open('index', 'w')  # let's save all the items for future use
>>> for item in items_to_get:
...     out.write(item + '\n')
...
>>> out.close()

It is really simple!
Now we can get all the pages. Let's suppose that the items are simple pages, something like 001.html:

>>> for item in items_to_get:
...     open(item, 'w').write(urllib.urlopen('http://example.com/%s' % item).read())
...

These two lines of code will download and save all the pages into files like 001.html.

But there are two big problems. First, we download only one page at a time, so it will take a long time to get all the pages even on a broadband connection, because of the delays in sending requests, writing pages to disk and so on. Second, and more important: the site admin can notice that someone (you) is trying to download the whole site and ban your IP.
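Both problems can be addressed with threads and anonymous HTTP proxies. As a minimal sketch of the proxy part (the address below is just a placeholder, not a real working proxy), urllib2 lets us route a request through a proxy like this:

>>> import urllib2
>>> proxy_handler = urllib2.ProxyHandler({'http': '203.0.113.1:8080'})  # placeholder proxy address
>>> opener = urllib2.build_opener(proxy_handler)
>>> opener.addheaders = [('User-agent', 'Mozilla/5.0')]  # fake User-Agent
>>> data = opener.open('http://example.com/001.html').read()

The full script below does exactly this, only for many proxies and in many threads.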

So we need to write a real piece of code.
I will write it in a separate file, getter.py:

# Let's import something
import os, urllib, urllib2, socket, sys, re, threading, time, random

max_thread_num = 10  # we will download in ten threads!


# Let's create a class for a download thread
class ProxyThread(threading.Thread):
    def set_params(self, proxy):  # proxy address
        self.proxy = proxy

    def run(self):  # main method of the thread
        global max_thread_num
        # first we create a proxy_handler to handle the connection with an anonymous HTTP proxy
        proxy_handler = urllib2.ProxyHandler(self.proxy)
        auth = urllib2.HTTPBasicAuthHandler()

        # then we build an opener which works through the proxy
        opener = urllib2.build_opener(proxy_handler, auth, urllib2.HTTPHandler)
        # some fake User-Agent (may be useful for some sites)
        opener.addheaders = [('User-agent', 'Mozilla/5.0')]
        # Let's now check the connection with the proxy by getting some data from Google
        try:
            opener.open('http://google.com').read(5)
        except:
            # this proxy doesn't work, we will try another one.
            # We need to increment the number of available threads.
            # Simply use the global variable
            max_thread_num += 1
            return

        # this proxy is good, so we get the items which need to be downloaded
        items_to_get = open('index').readlines()
        # we shuffle the list so that different threads try to download different items
        random.shuffle(items_to_get)

        for itemstr in items_to_get:
            item = itemstr.rstrip()  # strip the newline symbol
            if os.path.exists(item):  # already downloaded?
                continue

            # Let's try to download something :)
            try:
                data = opener.open('http://example.com/%s' % item).read()
                # CoDeeN proxies give a greetings page on the first request, so we need to skip it
                if not re.search('CoDeeN', data):
                    open(item, 'w').write(data)
            except:
                # If there was some error during the download,
                # we remove the incompletely downloaded item
                # and skip this proxy, because the majority of errors are proxy errors:
                try:
                    os.remove(item)
                except OSError:
                    pass

                max_thread_num += 1
                return

        max_thread_num += 1
        return

# Well, it was quite a long method, mostly because of error handling.

# Now a small piece of code which gets the list of proxies and starts the threads:
socket.setdefaulttimeout(120)  # increase the connection timeout because some proxies respond after a rather long time
# We will go through all 50 pages of the free proxy list at http://www.samair.ru
for num in range(50):
    # We get a page of proxies
    proxy_page = urllib.urlopen('http://www.samair.ru/proxy/time-%02d.htm' % num).read()
    # and search for all proxy addresses on it
    proxies = re.findall(r'<span class="proxy\d*">(\d+)</span>.<span class="proxy\d*">(?P<n2>\d+)</span>.(\d+).(\d+):\s*(\d+)', proxy_page)
    # rather complex regexp :)
    for proxy in proxies:  # try to start a thread for each proxy
        while max_thread_num <= 0:  # all 10 threads have started, so we need to wait
            time.sleep(1)
        # Then we can start a thread:
        prox = {'http': '%s.%s.%s.%s:%s' % proxy}
        pt = ProxyThread()
        pt.set_params(prox)
        max_thread_num -= 1
        pt.start()


Then we can start the program: python getter.py.
Well, it works!

Almost :)
I've written no ending condition, so this code will stop not when all the items have been downloaded, but when all the proxies have been checked. In any case, you need to decide which items you should download and what to do in case of errors.
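If you do want the script to stop once everything is on disk, one possible ending condition (just a sketch, assuming the items are saved under the same names that are listed in the index file) is to make the main program wait until every item exists:

# possible ending condition (sketch): wait until every item from 'index' exists on disk
items = [line.rstrip() for line in open('index')]
while not all(os.path.exists(item) for item in items):
    time.sleep(5)
print 'All items downloaded'

You would still need to decide what to do about items that no proxy manages to fetch, otherwise this loop may wait forever.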

Well, it looks like the next post could be about spam bots or other kinds of malware :)
Nonetheless, I believe that you won't use this code for illegal actions :)
