#scrapy


      • rsrx
        has anyone used Splash and knows how to pass headers?
      • response = requests.get(splash_url, params={'url': url, 'timeout': timeout, 'wait': 0.5, 'headers': headers})
      • this returns http status 400 as soon as I call it
      • headers is dict
      • if I omit 'headers', it works
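A likely cause, sketched below: requests does not JSON-encode a nested dict passed inside `params`, so only the dict's keys survive URL-encoding and Splash receives an unusable `headers` value. The URL and header values here are hypothetical:

```python
import requests

# Hypothetical values, assuming a local Splash instance on port 8050.
prepared = requests.Request(
    "GET",
    "http://localhost:8050/render.html",
    params={"url": "http://example.com", "headers": {"User-Agent": "my-bot"}},
).prepare()

print(prepared.url)
# Prints something like:
#   http://localhost:8050/render.html?url=http%3A%2F%2Fexample.com&headers=User-Agent
# Only the dict's *keys* end up in the query string, which would explain
# why Splash rejects the request with HTTP 400.
```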
      • nyov
        is that not an issue with requests then?
      • ohhh nvm
      • i'll do a search
      • rsrx
        thanks
      • nyov
        actually, it looks like with `requests.get` you do mean the requests library. so I don't know
      • rsrx
        ok i figured it out
      • had to use POST request to splash render.html page
      • response = requests.post(splash_url, json={'url': url, 'http_method': 'GET', 'timeout': timeout, 'wait': 0.5, 'headers': headers})
      • this works
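For reference, a self-contained version of the working call. The endpoint URL and header values are assumptions; per the Splash HTTP API docs, the `headers` argument is only honored for `application/json` POST requests, which is why the GET variant failed:

```python
import requests

SPLASH_URL = "http://localhost:8050/render.html"  # assumed local Splash

def render_html(url, headers, timeout=30, wait=0.5):
    """Render a page through Splash, forwarding custom headers."""
    response = requests.post(SPLASH_URL, json={
        "url": url,
        "http_method": "GET",  # method Splash uses for its outgoing request
        "timeout": timeout,
        "wait": wait,
        "headers": headers,    # honored because this is a JSON POST
    })
    response.raise_for_status()
    return response.text

html = render_html("http://example.com", {"User-Agent": "my-bot/1.0"})
```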
      • fpghost84
        I am thinking of having a pipeline send my Scrapy item to a celery task which would use the Django ORM to create/update a Product model
      • Should I do it like that or would it be better just to import the django Product model directly into the pipeline and do the create/update there? or via a deferred?
      • nyov
        have you seen the DjangoItem package? wouldn't that work for you?
      • fpghost84
        nyov possibly it would
      • I'm trying to think about if that would be blocking or what...or if that even matters for my app
      • Also if I used this package would django post_save signals still be fired?
      • nyov
        the RDBMS is blocking, but in reality that's often not an issue unless you're really high quantity
      • fpghost84
        quantity is about 15k rows per day
      • so prob would be ok
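(For scale: 15,000 writes spread over a day averages out to roughly one write every six seconds, so a blocking ORM call per item is very unlikely to matter here.)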
      • nyov
        see the caveats section: "DjangoItem is a rather convenient way to integrate Scrapy projects with Django models, but bear in mind that Django ORM may not scale well if you scrape a lot of items (ie. millions) with Scrapy."
      • disclaimer: I haven't used it. (so I don't know about post_save signals)
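A minimal sketch of the DjangoItem route, assuming a hypothetical `Product` model and that Django settings are configured before the crawl starts:

```python
from scrapy_djangoitem import DjangoItem  # pip install scrapy-djangoitem
from myapp.models import Product          # hypothetical app and model

class ProductItem(DjangoItem):
    django_model = Product

class DjangoItemPipeline:
    def process_item(self, item, spider):
        # DjangoItem.save() builds and saves a model instance through the
        # model's own save(), so post_save signals should fire as usual.
        item.save()
        return item
```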
      • fpghost84
        I suppose it is worth a try first
if I didn't use this package... then I was thinking to publish a message to celery and have a task pick it up and do the django orm save
      • or possibly just importing my django model directly into the pipelines of my project and having a pipeline use the django model directly to save
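That second option might look like the sketch below. The model, field names, and natural key are assumptions, and `django.setup()` must have run with `DJANGO_SETTINGS_MODULE` set before the pipeline imports the model:

```python
from myapp.models import Product  # hypothetical model

class DirectOrmPipeline:
    def process_item(self, item, spider):
        # update_or_create gives idempotent create/update semantics and,
        # like any other model save, fires post_save.
        Product.objects.update_or_create(
            sku=item["sku"],  # assumed natural key
            defaults={"name": item["name"], "price": item["price"]},
        )
        return item
```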
      • nyov
        isn't that what this plugin does?
      • fpghost84
        I think it probably does something close to the second option, yeah
      • nyov
        I don't see how it would be different, doing it in the pipeline, but perhaps you can mod it better there
      • fpghost84
        but the first option would be to just shove the items back into celery queue and have that task write them
      • which might lead to really fast scrape times without scrapy stopping to write out to django
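A sketch of that hand-off, with hypothetical task, module, and field names; `.delay()` only enqueues the message, so the crawl never waits on the database write:

```python
# tasks.py (runs in the Celery worker, where the Django ORM is available)
from celery import shared_task
from myapp.models import Product  # hypothetical model

@shared_task
def save_product(data):
    Product.objects.update_or_create(sku=data["sku"], defaults=data)

# pipelines.py (runs inside Scrapy)
from myproject.tasks import save_product

class CeleryHandoffPipeline:
    def process_item(self, item, spider):
        save_product.delay(dict(item))  # enqueue and return immediately
        return item
```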
      • nyov
        dunno celery much, try it I guess
      • fpghost84
        yeah, will give these options a try and see what works best
      • thanks for the help
      • nyov
        for what it's worth :)
      • yw