Thursday, June 28, 2007

Creating a simple yet powerful anonymous downloader in Python

Let's suppose, for example, that we have to download a large number of pages from some web server. It may be all the personal pages of the server's users, a list of postal addresses, or a list of firms.
Python is a very convenient language for this task because it has an Interactive Mode (or REPL - Read-Eval-Print Loop). So we can start Python and write something like this:

>>> import urllib, re  # some modules which we will need later

If the target web site has some kind of index page, we should start by fetching it and parsing it into items:

>>> index_page = urllib.urlopen('http://example.com/index').read()  # we have read the contents of the page
>>> items_to_get = re.findall('some regexp', index_page)  # now we have a list of all items to download; the regexp may be something like "<a href='([^']*)'>"
>>> out = open('index', 'w')  # let's save all the items for future use
>>> for item in items_to_get:
...     out.write(item + '\n')
...
>>> out.close()

It is really simple!
Now we can get all the pages. Let's suppose that the items are simple pages, something like 001.html:

>>> for item in items_to_get:
...     open(item, 'w').write(urllib.urlopen('http://example.com/%s' % item).read())
...

These two lines of code will download and save all the pages into files like 001.html.

But there are two big problems. First, we download only one page at a time, so it will take a long time to get all the pages even on a broadband connection, because of the delays in sending requests, writing pages to disk and so on. Second, and more important: the site admin can notice that someone (you) is trying to download the whole site and ban your IP.
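Both problems can be addressed with threads and anonymous HTTP proxies. As a minimal sketch of the proxy part (the address below is just a placeholder, not a real working proxy), urllib2 lets us route a request through a proxy like this:

>>> import urllib2
>>> proxy_handler = urllib2.ProxyHandler({'http': '203.0.113.1:8080'})  # placeholder proxy address
>>> opener = urllib2.build_opener(proxy_handler)
>>> opener.addheaders = [('User-agent', 'Mozilla/5.0')]  # fake User-Agent
>>> data = opener.open('http://example.com/001.html').read()

The full script below does exactly this, only for many proxies and in many threads.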

So we need to write a real piece of code.
I will write it in a separate file, getter.py:

# Let's import something
import os, urllib, urllib2, socket, sys, re, threading, time, random

max_thread_num = 10  # we will download in ten threads!


# Let's create a class for a download thread
class ProxyThread(threading.Thread):
    def set_params(self, proxy):  # proxy address
        self.proxy = proxy

    def run(self):  # main method of the thread
        global max_thread_num
        # first we create a proxy_handler to handle the connection with an anonymous HTTP proxy
        proxy_handler = urllib2.ProxyHandler(self.proxy)
        auth = urllib2.HTTPBasicAuthHandler()

        # then we build an opener which works through the proxy
        opener = urllib2.build_opener(proxy_handler, auth, urllib2.HTTPHandler)
        # some fake User-Agent (may be useful for some sites)
        opener.addheaders = [('User-agent', 'Mozilla/5.0')]
        # Let's now check the connection with the proxy by getting some data from Google
        try:
            opener.open('http://google.com').read(5)
        except:
            # this proxy doesn't work, we will try another one.
            # We need to increment the number of available threads.
            # Simply use the global variable
            max_thread_num += 1
            return

        # this proxy is good, so we get the items which need to be downloaded
        items_to_get = open('index').readlines()
        # we shuffle the list so that different threads try to download different items
        random.shuffle(items_to_get)

        for itemstr in items_to_get:
            item = itemstr.rstrip()  # strip the newline symbol
            if os.path.exists(item):  # already downloaded?
                continue

            # Let's try to download something :)
            try:
                data = opener.open('http://example.com/%s' % item).read()
                # CoDeeN proxies give a greetings page on the first request, so we need to skip it
                if not re.search('CoDeeN', data):
                    open(item, 'w').write(data)
            except:
                # If there was some error during the download,
                # we remove the incompletely downloaded item
                # and skip this proxy, because the majority of errors are proxy errors:
                try:
                    os.remove(item)
                except OSError:
                    pass

                max_thread_num += 1
                return

        max_thread_num += 1
        return

# Well, it was quite a long method, mostly because of error handling.

# Now a small piece of code which gets the list of proxies and starts the threads:
socket.setdefaulttimeout(120)  # increase the connection timeout because some proxies respond after a rather long time
# We will go through all 50 pages of the free proxy list at http://www.samair.ru
for num in range(50):
    # We get a page of proxies
    proxy_page = urllib.urlopen('http://www.samair.ru/proxy/time-%02d.htm' % num).read()
    # and search for all proxy addresses on it
    proxies = re.findall(r'<span class="proxy\d*">(\d+)</span>.<span class="proxy\d*">(?P<n2>\d+)</span>.(\d+).(\d+):\s*(\d+)', proxy_page)
    # rather complex regexp :)
    for proxy in proxies:  # try to start a thread for each proxy
        while max_thread_num <= 0:  # all 10 threads have started, so we need to wait
            time.sleep(1)
        # Then we can start a thread:
        prox = {'http': '%s.%s.%s.%s:%s' % proxy}
        pt = ProxyThread()
        pt.set_params(prox)
        max_thread_num -= 1
        pt.start()


Then we can start the program: python getter.py.
Well, it works!

Almost :)
I've written no ending condition, so this code will stop not when all the items have been downloaded, but when all the proxies have been checked. In any case, you need to decide which items you should download and what to do in case of errors.
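If you do want the script to stop once everything is on disk, one possible ending condition (just a sketch, assuming the items are saved under the same names that are listed in the index file) is to make the main program wait until every item exists:

# possible ending condition (sketch): wait until every item from 'index' exists on disk
items = [line.rstrip() for line in open('index')]
while not all(os.path.exists(item) for item in items):
    time.sleep(5)
print 'All items downloaded'

You would still need to decide what to do about items that no proxy manages to fetch, otherwise this loop may wait forever.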

Well, it looks like the next post could be about spam bots or other kinds of malware :)
Nonetheless, I believe that you won't use this code for illegal actions :)
