Scrapy 1.1.0 has been released. Scrapy is an asynchronous processing framework built on Twisted, a crawler framework implemented in pure Python: you only need to customize a few modules to easily build a crawler that scrapes web pages and images.
The changes are as follows:
Scrapy 1.1 has beta Python 3 support (requires Twisted >= 15.5). See :ref:`news_betapy3` for more details and some limitations.
Hot new features:
- Item loaders now support nested loaders (:issue:`1467`); a sketch follows this list.
- ``FormRequest.from_response`` improvements (:issue:`1382`, :issue:`1137`); example below.
- Added setting :setting:`AUTOTHROTTLE_TARGET_CONCURRENCY` and improved AutoThrottle docs (:issue:`1324`); see the settings sketch below.
- Added ``response.text`` to get the body as unicode (:issue:`1730`); example below.
- Anonymous S3 connections (:issue:`1358`).
- Deferreds in downloader middlewares (:issue:`1473`). This enables better robots.txt handling (:issue:`1471`); a middleware sketch follows this list.
- HTTP caching now follows RFC 2616 more closely; added settings :setting:`HTTPCACHE_ALWAYS_STORE` and :setting:`HTTPCACHE_IGNORE_RESPONSE_CACHE_CONTROLS` (:issue:`1151`); see the settings sketch below.
- Selectors were extracted to the parsel library (:issue:`1409`). This means you can use Scrapy Selectors without Scrapy, and you can upgrade the selectors engine without needing to upgrade Scrapy; a standalone example follows this list.
- The HTTPS downloader now does TLS protocol negotiation by default, instead of forcing TLS 1.0. You can also set the SSL/TLS method using the new :setting:`DOWNLOADER_CLIENT_TLS_METHOD`; see the settings sketch below.
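A minimal sketch of the new nested loaders, assuming a hypothetical ``Product`` item with ``name`` and ``image_url`` fields::

    import scrapy
    from scrapy.loader import ItemLoader

    class Product(scrapy.Item):
        name = scrapy.Field()
        image_url = scrapy.Field()

    def parse_product(response):
        loader = ItemLoader(item=Product(), response=response)
        # Selectors added through the nested loader are relative to //footer
        footer_loader = loader.nested_xpath('//footer')
        footer_loader.add_xpath('name', 'a/text()')
        footer_loader.add_xpath('image_url', 'img/@src')
        # load_item() is still called on the parent loader
        return loader.load_item()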
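A login-spider sketch using ``FormRequest.from_response``; the URL, form selector and field names are placeholders::

    import scrapy

    class LoginSpider(scrapy.Spider):
        name = 'login_example'
        start_urls = ['https://example.com/login']

        def parse(self, response):
            # from_response() pre-fills fields found in the page's form;
            # formcss (new in 1.1) selects the form with a CSS selector
            return scrapy.FormRequest.from_response(
                response,
                formcss='form#login',
                formdata={'user': 'john', 'pass': 'secret'},
                callback=self.after_login,
            )

        def after_login(self, response):
            self.logger.info('Logged in, landed on %s', response.url)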
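``response.text`` complements ``response.body``: the former is decoded unicode text, the latter the raw bytes. A quick contrast inside a callback::

    def parse(self, response):
        raw = response.body    # bytes, exactly as received
        text = response.text   # unicode, decoded using the response encoding
        self.logger.info('Got %d bytes, %d characters', len(raw), len(text))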
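A sketch of a downloader middleware returning a Deferred from ``process_request``; Scrapy 1.1+ waits for it to fire before continuing, which is how the wait-for-robots.txt behavior is built. The fixed delay here is a stand-in for any asynchronous lookup::

    from twisted.internet import reactor
    from twisted.internet.defer import Deferred

    class DelayMiddleware(object):
        def process_request(self, request, spider):
            d = Deferred()
            # Fire with None after 0.1s: a None result means "continue
            # processing this request through the remaining middlewares"
            reactor.callLater(0.1, d.callback, None)
            return d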
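Since selectors now live in parsel, they can be used without Scrapy at all; a standalone example::

    from parsel import Selector

    sel = Selector(text=u'<html><body><h1>Hello</h1></body></html>')
    print(sel.xpath('//h1/text()').extract_first())  # Hello
    print(sel.css('h1::text').extract_first())       # Hello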
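A ``settings.py`` sketch exercising the new settings mentioned above; the values are illustrative, not recommendations::

    # AutoThrottle: aim for an average of 2 requests in flight per remote site
    AUTOTHROTTLE_ENABLED = True
    AUTOTHROTTLE_TARGET_CONCURRENCY = 2.0

    # HTTP cache: store responses unconditionally, and ignore the listed
    # Cache-Control directives in responses
    HTTPCACHE_ENABLED = True
    HTTPCACHE_ALWAYS_STORE = True
    HTTPCACHE_IGNORE_RESPONSE_CACHE_CONTROLS = ['no-store']

    # 'TLS' negotiates the highest protocol both ends support (the new default);
    # specific versions such as 'TLSv1.2' can be forced instead
    DOWNLOADER_CLIENT_TLS_METHOD = 'TLS'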
These bug fixes may require your attention:
- Don't retry bad requests (HTTP 400) by default (:issue:`1289`). If you need the old behavior, add ``400`` to :setting:`RETRY_HTTP_CODES`; see the settings sketch after this list.
- Fixed shell files argument handling (:issue:`1710`, :issue:`1550`). If you run ``scrapy shell index.html``, it will try to load the URL http://index.html; use ``scrapy shell ./index.html`` to load a local file.
- Robots.txt compliance is now enabled by default for newly-created projects (:issue:`1724`). Scrapy will also wait for robots.txt to be downloaded before proceeding with the crawl (:issue:`1735`). If you want to disable this behavior, update :setting:`ROBOTSTXT_OBEY` in the ``settings.py`` file after creating a new project; see the settings sketch after this list.
- Exporters now work on unicode instead of bytes by default (:issue:`1080`). If you use ``PythonItemExporter``, you may want to update your code to disable binary mode, which is now deprecated; example below.
- Accept XML node names containing dots as valid (:issue:`1533`).
- When uploading files or images to S3 (with ``FilesPipeline`` or ``ImagesPipeline``), the default ACL policy is now "private" instead of "public". Warning: backwards incompatible! You can use :setting:`FILES_STORE_S3_ACL` to change it; see the settings sketch after this list.
- We've reimplemented ``canonicalize_url()`` for more correct output, especially for URLs with non-ASCII characters (:issue:`1947`). This could change link extractors' output compared to previous Scrapy versions, and it may also invalidate cache entries left over from pre-1.1 runs. Warning: backwards incompatible! An example follows this list.
Download: