Currently, my stack is Flask, Redis, RabbitMQ and Celery. For scraping, I am using `requests` and `BeautifulSoup`. My Flask app runs in production on Apache with mod_wsgi, with `app.run(threaded=True)`.
I have 25 APIs: 10 scrape the URL itself (headers, etc.), and the rest use a third-party API for that URL.
I am using a Celery chord to process these APIs, fetching the data with `requests`.
For my chord header I have 3 workers, while for my callback I only have 1. I am hitting a bottleneck in the form of `ConnectTimeoutError` and `MaxRetryError`.
From threads I have read, the advice is to set a timeout for every request, because these errors mean you are overloading the remote server. The problem is that since I am using a chord, a `time.sleep()` makes no sense: all 25 API calls run at the same time. Has anyone encountered this? Or am I doing this wrong?
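To be concrete, my understanding of that timeout advice is something like the sketch below, using a per-request timeout plus urllib3's `Retry` with backoff instead of a sleep. The URL and the timeout/retry values are placeholders, not my actual code:

```python
import requests
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry

session = requests.Session()

# Retry failed requests with an increasing delay instead of
# immediately re-hitting the remote server.
retries = Retry(total=3, backoff_factor=1, status_forcelist=[429, 500, 502, 503])
session.mount("https://", HTTPAdapter(max_retries=retries))

# (connect timeout, read timeout) in seconds; the URL is a placeholder
response = session.get("https://example.com/api", timeout=(5, 30))
```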
The threads I read also suggest switching from `requests` to pycurl, or using Scrapy, but I don't think that's the issue, since `ConnectTimeoutError` is about my host overloading a specific URL's server.
My chord process:
```python
callback = create_document.s(url, company_logo, api_list)
header = [api_request.s(key) for key in api_list.keys()]
result = chord(header)(callback)
```
Inside the `api_request` task, `requests` is used to make the call.
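For context, this is a sketch of what I am considering for `api_request`, not what is deployed: a per-request timeout combined with Celery's built-in `rate_limit` and automatic retry with backoff, so the chord still fans out but the remote server is not hit with all 25 calls at once. The `rate_limit` value and the `API_ENDPOINTS` lookup are placeholders:

```python
import requests
from celery import shared_task
from requests.exceptions import ConnectionError, Timeout

@shared_task(
    bind=True,
    rate_limit="5/s",                     # per worker, so ~15/s across my 3 workers
    autoretry_for=(ConnectionError, Timeout),
    retry_backoff=True,                   # wait 1s, 2s, 4s, ... between retries
    retry_kwargs={"max_retries": 3},
)
def api_request(self, key):
    # API_ENDPOINTS is a placeholder for however the key maps to a URL.
    # (connect timeout, read timeout): a stalled connection fails fast
    # instead of blocking the worker.
    response = requests.get(API_ENDPOINTS[key], timeout=(5, 30))
    response.raise_for_status()
    return response.json()
```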