• McBain@feddit.ch
    arrow-up
    13
    ·
    1 year ago

    I use Scrapy. It has a steeper learning curve than other libraries, but it’s totally worth it.

  • UraniumBlazer@lemm.ee
    English
    arrow-up
    8
    arrow-down
    4
    ·
    1 year ago

    Sorry, I’m ignorant in this matter. Why exactly would you want to scrape websites aside from collecting data for ML? What kind of irreplaceable API are you using? Someone please educate me here.

    • coltorl@programming.dev
      arrow-up
      29
      ·
      1 year ago

      An API might cost a lot of money for the number of requests you want to send. An API may not include some of the fields you want. An API is rate limited; scraping might not be. An API requires agreeing to usage terms; scraping does not (though the recent LinkedIn scraping case might weaken that argument).
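
To illustrate the "missing fields" point above: a minimal sketch, using only Python's standard-library `html.parser`, of pulling a value out of page markup when no API exposes it. The `FieldExtractor` class, the `extract_field` helper, the sample HTML, and the `stock-note` class name are all hypothetical and made up for this example; a real page would need its own selectors, and this does not cover fetching the page or checking the site's terms.

```python
from html.parser import HTMLParser


class FieldExtractor(HTMLParser):
    """Collect the text of every element carrying a given class name.

    Limitation of this sketch: it assumes the matched element contains no
    unclosed void tags such as <br>, which would throw off the depth count.
    """

    def __init__(self, class_name):
        super().__init__()
        self.class_name = class_name
        self._depth = 0      # > 0 while we are inside a matching element
        self.values = []     # one entry per matching element

    def handle_starttag(self, tag, attrs):
        classes = (dict(attrs).get("class") or "").split()
        if self._depth:
            # Nested tag inside a match: track it so we know when the match ends.
            self._depth += 1
        elif self.class_name in classes:
            self._depth = 1
            self.values.append("")

    def handle_endtag(self, tag):
        if self._depth:
            self._depth -= 1

    def handle_data(self, data):
        if self._depth:
            self.values[-1] += data


def extract_field(html, class_name):
    """Return the stripped text of each element with the given class."""
    parser = FieldExtractor(class_name)
    parser.feed(html)
    parser.close()
    return [value.strip() for value in parser.values]


# Hypothetical markup: imagine the site's API returns the product name
# but not the stock note that is visible on the page.
SAMPLE_HTML = """
<div class="product">
  <h2 class="name">Widget</h2>
  <span class="stock-note">Only <b>3</b> left</span>
</div>
"""

print(extract_field(SAMPLE_HTML, "stock-note"))
```

Hand-rolled parsing like this is exactly the part that tends to break when a site changes its markup, which is one reason people reach for a dedicated library instead.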

  • redw04@lemmy.ca
    arrow-up
    2
    arrow-down
    8
    ·
    1 year ago

    So uh…as someone who’s currently trying to scrape the web for email addresses to add to my potential client list … where do I start researching this?

    • lutillian@sh.itjust.works
      arrow-up
      3
      ·
      1 year ago

      Start looking into Selenium, probably in Python. It’s one of the easier-to-understand forms of scraping. It’s mainly used for web testing, though you can definitely use it for less… nice purposes.

  • lemmywizard@lemm.ee
    arrow-up
    19
    ·
    1 year ago

    It’s all fun and games until you have to support all this shit and it breaks weekly!

    That being said, I do miss the simplicity of maintaining Selenium projects for work.

  • Fisch@lemmy.ml
    arrow-up
    5
    ·
    1 year ago

    I really hope Libreddit switches to scraping. The “Error: Too many request” thing is so annoying; I have to click the redirect button in Libredirect like 20 times until I can actually see a post.

    Still a better experience than Reddit’s official site, though.