Thursday, June 28, 2007

Creating a simple yet powerful anonymous downloader in Python

Let's suppose, for example, that we have to download a large number of pages from some web server. It may be all the personal pages of the server's users, a list of postal addresses, or a list of firms.
Python is a very convenient language for this task because it has an interactive mode (a REPL, Read-Eval-Print Loop). So we can start Python and write something like this:

>>> import urllib, re # some modules which we will need later

If the target web site has some kind of index page, we should start by getting it and parsing it into items:

>>> index_page = urllib.urlopen('http://example.com/index').read() # we have read the contents of the page
>>> items_to_get = re.findall('some regexp', index_page) # now we have a list of all items to download; the regexp may be something like "<a href='([^']*)'>" (see the quick check below)
>>> out = open('index', 'w') # let's save all the items for future use
>>> for item in items_to_get:
...     out.write(item + '\n')
...
>>> out.close()
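
To see what this kind of regexp actually captures, we can try it on a tiny hand-made snippet right in the same session (the HTML below is just an illustration, not taken from a real site):

>>> sample = "<a href='001.html'>John</a> <a href='002.html'>Kate</a>"
>>> re.findall("<a href='([^']*)'>", sample)
['001.html', '002.html']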

It is really simple!
Now we can get all the pages. Let's suppose that the items are simple page names, something like 001.html:

>>> for item in items_to_get:
...     open(item, 'w').write(urllib.urlopen('http://example.com/%s' % item).read())
...

These two lines of code will download all the pages and save them into files like 001.html.

But there are two big problems. First, we download only one page at a time, so it will take a long time to get all the pages even over a broadband connection, because of delays in sending requests, writing pages to disk and so on. Second, and more important: the site admin can notice that someone (you) is trying to download the whole site and ban your IP.
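
The second problem is usually solved by sending requests through anonymous HTTP proxies and switching between them. In urllib2 this comes down to building an opener with a ProxyHandler; here is a minimal sketch of that building block (the proxy address below is just a placeholder, not a real proxy):

import urllib2

# placeholder address; the real program below takes addresses from a public proxy list
proxy = {'http': '203.0.113.10:3128'}

opener = urllib2.build_opener(urllib2.ProxyHandler(proxy))
opener.addheaders = [('User-agent', 'Mozilla/5.0')] # pretend to be an ordinary browser
page = opener.open('http://example.com/001.html').read()

The program below does exactly this, but creates one such opener per proxy, each in its own thread.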

So we need to write a real piece of code. I will put it in a separate file, getter.py:

# Let's import something
import os, urllib, urllib2, socket, sys, re, threading, time, random

max_thread_num = 10  # we will download in ten threads!


# Let's create a class for a download thread
class ProxyThread(threading.Thread):
    def set_params(self, proxy):  # proxy address
        self.proxy = proxy

    def run(self):  # main method of the thread
        global max_thread_num

        # first we create a proxy handler for the connection through an anonymous http proxy
        proxy_handler = urllib2.ProxyHandler(self.proxy)
        auth = urllib2.HTTPBasicAuthHandler()

        # then we build an opener which works through the proxy
        opener = urllib2.build_opener(proxy_handler, auth, urllib2.HTTPHandler)
        # some fake User-Agent (may be useful for some sites)
        opener.addheaders = [('User-agent', 'Mozilla/5.0')]

        # Let's now check the proxy by getting some data from google through it
        try:
            opener.open('http://google.com').read(5)
        except:
            # this proxy doesn't work, we will try another one.
            # We need to increment the number of available threads;
            # we simply use the global variable
            max_thread_num += 1
            return

        # this proxy is good, so we get the items which need to be downloaded
        items_to_get = open('index').readlines()
        # we shuffle the list so that different threads try to download different items
        random.shuffle(items_to_get)

        for itemstr in items_to_get:
            item = itemstr.rstrip()  # strip the newline symbol
            if os.path.exists(item):  # already downloaded?
                continue

            # Let's try to download something :)
            try:
                data = opener.open('http://example.com/%s' % item).read()
                # CoDeeN proxies give a greetings page on the first request, so we need to skip it
                if not re.search('CoDeeN', data):
                    open(item, 'w').write(data)
            except:
                # If there was some error during the download,
                # we remove the incompletely downloaded item
                # and skip this proxy, because the majority of errors are proxy errors
                try:
                    os.remove(item)
                except OSError:
                    pass

                max_thread_num += 1
                return

        max_thread_num += 1
        return

# Well, it was quite a long method, mostly because of the error handling.

# Now a small piece of code which gets the list of proxies and starts the threads:
socket.setdefaulttimeout(120)  # increase the connection timeout, because some proxies respond after a rather long time
# We will go through all 50 pages of the free proxy list at http://www.samair.ru
for num in range(50):
    # We get a page of proxies
    proxy_page = urllib.urlopen('http://www.samair.ru/proxy/time-%02d.htm' % num).read()
    # and search for all proxy addresses on it
    proxies = re.findall(r'<span class="proxy\d*">(\d+)</span>.<span class="proxy\d*">(?P<n2>\d+)</span>.(\d+).(\d+):\s*(\d+)', proxy_page)
    # a rather complex regexp :)
    for proxy in proxies:  # try to start a thread for each proxy
        while max_thread_num <= 0:  # all 10 threads have started, so we need to wait
            time.sleep(1)
        # Then we can start a thread:
        prox = {'http': '%s.%s.%s.%s:%s' % proxy}
        pt = ProxyThread()
        pt.set_params(prox)
        max_thread_num -= 1
        pt.start()


Then we can start the program: python getter.py.
Well, it works!

Almost :)
I haven't written an ending condition, so this code will finish not when all the items have been downloaded, but when all the proxies have been checked. In any case, you need to decide which items you should download and what to do in case of errors.
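
For example, one rough sketch of an ending check (assuming it is appended after the proxy loop at the end of getter.py) is to wait for the running threads to give back their slots and then report which items are still missing, so the program can simply be re-run with a fresh proxy list:

# wait until all started threads have finished and returned their "slots"
while max_thread_num < 10:
    time.sleep(1)

# report the items that are still missing after this run
missing = [i.rstrip() for i in open('index') if not os.path.exists(i.rstrip())]
print '%d items are still missing' % len(missing)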

Well, it looks like the next post could be about spam bots or other kinds of malware :)
Nonetheless, I believe you won't use this code for anything illegal :)

Friday, June 22, 2007

Fit2PDA moved to Google Code, new project started

Today I moved Fit2PDA development to Google Code. Thanks to geomatsi for the proposal!

All the source code is in SVN now. I've also written some help pages for the project and uploaded a build for Windows.
A Linux build will be ready tomorrow.

So please feel free to download it, try it and write bug reports.

I will be happy to get new ideas and patches too!

New project vkontakte.net.ru

For the last two weeks there has been very little activity on Fit2PDA from my side, because of my new web project vkontakte.net.ru.
It is devoted to research on the most successful Russian social network, vkontakte.ru. I created a small program in Python to draw maps of connections between people in this network. The web site has a wiki for discussing new ideas.

I need to warn you that vkontakte.net.ru is and will be in Russian only. Then again, why would anyone who doesn't read Russian be interested in research on a Russian social network :)

You can visit vkontakte.net.ru for more information if you read Russian.

Monday, June 4, 2007

Eric IDE on Unfriendly System

In the morning I read on linux.org.ru about a new version of the Eric IDE. After looking at the screenshots I was very interested in this IDE.

So I decided to install it on my work PC.
Unfortunately it runs Windows XP, and I can't change that. And that OS doesn't have a normal package manager.

To install Eric I needed Qt, Python, PyQt, QScintilla.

I had the open-source Qt (installed with MinGW) and ActivePython installed on my PC.
So I downloaded PyQt, QScintilla and Eric.
I installed the binary package of PyQt and started to build QScintilla with MinGW. The build was successful, but the Python bindings didn't install because of some error with SIP. So I tried to build SIP separately. It didn't build because of some unknown linking errors with the Python libraries.
I searched through the internet, but was unable to find a solution.
So I decided to reinstall everything.

First of all, I downloaded Python from python.org and installed it (that was what really helped).
Then I reinstalled Qt.
Then I downloaded the source package of PyQt, unpacked it and ran

python configure.py

make

make install

The build completed with no errors.
Next I downloaded the QScintilla source and unpacked it. Then I went to the Qt4 subdirectory and ran

qmake qscintilla.pro

make

make install

After that I went to the Python subfolder of the QScintilla source and ran

python configure.py

make

make install

When the build was completed I had everything needed to run Eric4.
(SIP is also included in PyQt, so one doesn't need to build it separately.)
Then I unpacked Eric and ran
python install.py
eric4.bat

It started after all!

Eric is really great, but I ran into a problem on the second run: it didn't start because of some error in the debugger configuration. So I had to turn remote debugging and passive debugging on through eric-configure.bat. After that the IDE started normally.