Python: Web Crawler

0, Environment

Ubuntu 14.04 LTS

1, Install pip - Python package management tool

sudo apt-get install python-pip

Install the BeautifulSoup package:

pip install BeautifulSoup

2, IDE: PyCharm

Save the following script as main.py:

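#!/usr/bin/env python
# Python 2 script: download the images listed on the dbmeizi.com front page.

import urllib
import urllib2

from BeautifulSoup import BeautifulSoup

# Directory where the images are saved; it must already exist.
SAVE_DIR = '/home/xnpeng/PycharmProjects/MeiziImage'

def getallimagelink():
    html = urllib2.urlopen('http://www.dbmeizi.com').read()
    soup = BeautifulSoup(html)
    # Look for <img> tags inside every <li class="span3"> element.
    for li in soup.findAll('li', attrs={'class': 'span3'}):
        for image in li.findAll('img'):
            link = image.get('data-src')       # image URL
            imagename = image.get('data-id')   # used as the file name
            if not link or not imagename:
                continue  # skip <img> tags without the expected attributes
            filesavepath = '%s/%s.jpg' % (SAVE_DIR, imagename)
            urllib.urlretrieve(link, filesavepath)
            print filesavepath

if __name__ == '__main__':
    getallimagelink()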
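
The extraction step in the script above can be tried offline, without touching the network. The following is a minimal sketch, assuming Python 3 and only the standard library (html.parser instead of BeautifulSoup). The names ImageLinkParser, extract_image_links and SAMPLE, and the example.com URLs, are made up for illustration; the sample markup only imitates the li/img structure the script expects and is not copied from the real site.

```python
from html.parser import HTMLParser


class ImageLinkParser(HTMLParser):
    """Collect (data-id, data-src) pairs from <img> tags inside <li class="span3">."""

    def __init__(self):
        super().__init__()
        self.depth = 0    # > 0 while we are inside a matching <li>
        self.images = []  # collected (name, link) tuples

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "li":
            classes = (attrs.get("class") or "").split()
            # Enter a matching <li>, or track an <li> nested inside one.
            if self.depth or "span3" in classes:
                self.depth += 1
        elif tag == "img" and self.depth:
            link, name = attrs.get("data-src"), attrs.get("data-id")
            if link and name:
                self.images.append((name, link))

    def handle_endtag(self, tag):
        if tag == "li" and self.depth:
            self.depth -= 1


def extract_image_links(html):
    """Return a list of (image name, image URL) tuples found in `html`."""
    parser = ImageLinkParser()
    parser.feed(html)
    return parser.images


# Hypothetical markup that imitates the structure the crawler looks for.
SAMPLE = """
<ul>
  <li class="span3"><img data-id="101" data-src="http://example.com/a.jpg"></li>
  <li class="other"><img data-id="102" data-src="http://example.com/b.jpg"></li>
  <li class="span3"><img data-id="103" data-src="http://example.com/c.jpg"></li>
</ul>
"""

if __name__ == "__main__":
    for name, link in extract_image_links(SAMPLE):
        print(name, link)
```

Keeping the parsing in its own function, separate from the download, makes it possible to check the selectors against a saved page before running the crawler for real.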

3, Run the program

Run it in the IDE, or make the script executable and run it from a console:

chmod +x main.py

./main.py

4, Python web crawler toolkit

(1). Scrapy (scrapy.org) - https://github.com/scrapy/scrapy

(a) Import the GPG key used to sign Scrapy packages into the APT keyring:

sudo apt-key adv --keyserver hkp://keyserver.ubuntu.com:80 --recv 627220E7

(b) Create the /etc/apt/sources.list.d/scrapy.list file using the following command:

echo 'deb http://archive.scrapy.org/ubuntu scrapy main' | sudo tee /etc/apt/sources.list.d/scrapy.list

(c) Update the package lists and install the scrapy-0.24 package:

sudo apt-get update && sudo apt-get install scrapy-0.24

(2). BeautifulSoup

pip install BeautifulSoup

(3). Python-Goose - https://github.com/grangier/python-goose

Given the URL of an article, it extracts the article's title and content.

mkvirtualenv --no-site-packages goose
git clone https://github.com/grangier/python-goose.git
cd python-goose
pip install -r requirements.txt
python setup.py install
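
Once Goose is installed, getting the title and content of an article might look like the sketch below. It relies on the usage shown in the python-goose README (Goose().extract(url=...), then the title and cleaned_text attributes of the result). The function name fetch_title_and_text and its optional extractor parameter are assumptions added here so the function can be exercised without Goose or a network connection.

```python
def fetch_title_and_text(url, extractor=None):
    """Return (title, body text) for the article at `url`.

    `extractor` is a hypothetical hook: any object with an
    extract(url=...) method. When omitted, a Goose instance is used.
    """
    if extractor is None:
        # Imported lazily so this module loads even where python-goose is absent.
        from goose import Goose
        extractor = Goose()
    article = extractor.extract(url=url)
    return article.title, article.cleaned_text
```

Calling fetch_title_and_text with the URL of an article returns the two values as a tuple; any URL used for a first try is up to the reader, since none is given here.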

Written on May 12, 2015