What webcrawlers have you made that you use on a regular basis?

One that checks every Chaturbate room for a pair of tits, then takes a screenshot, compresses it down to a 32x32 gif, and sends it to my phone.

why

ur my hero m8

EVERYONE GIMME YOUR WEBCRAWLER IDEAS

One that logs into my bank account and pulls any cash movements.

A crawler to download the posted/favorited videos of a user from xHamster.

How do I make one? Where can I get examples? What do I need? Is this like iMacros in Firefox? I'm really interested.

I used an HTML parser module for Python called BeautifulSoup; you won't need more than this 99% of the time. If more interaction with the website is needed, Selenium is my go-to module.
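A minimal sketch of what that looks like; the URL and the tags pulled below are placeholders, not anything from this thread:

# Minimal BeautifulSoup sketch -- URL and selector are placeholders.
import requests
from bs4 import BeautifulSoup

resp = requests.get("https://example.com/some-listing")
soup = BeautifulSoup(resp.text, "html.parser")

# Grab every link on the page; swap this for whatever you actually need.
for a in soup.find_all("a", href=True):
    print(a["href"], a.get_text(strip=True))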

Is there any method of doing this without resorting to the cucks of programming languages?

is it easy to do with C or Ada?

I hope this is a bait post.
You are constrained mostly by your network speed, so the C/Ada program running faster makes absolutely no difference.
And no, it isn't easy either.
Just use Python/Ruby/any scripting language with a library that has bindings to a native HTML parser.

Mostly I don't know those and would like to use something I'm comfortable with.

Just came up with the idea for a crawler that looks for credit card info or pics of cards (e.g. on Twitter) and makes donations to something worthwhile, like cloning Harambe.

>I used an HTML parser module for Python called BeautifulSoup
BeautifulSoup is nice, but it's slow as shit. Unless you REALLY need the tolerance, use lxml.
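Roughly the same fetch with lxml, as a sketch; again the URL and the XPath are placeholders:

# Same idea with lxml -- noticeably faster on big pages.
import requests
from lxml import html

resp = requests.get("https://example.com/some-listing")
tree = html.fromstring(resp.content)

# XPath instead of find_all; grabs every href on the page.
for href in tree.xpath("//a/@href"):
    print(href)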

Why the crap would you want to tackle a problem like this using C or Ada?

>cucks of programming languages
Are you 12?

>Mostly i dont know those
Then learn one. If you know C well then I can't imagine you'll have trouble picking up Python.

>Automatic image recognition against random images on Twitter
Good luck.

Training a neural net to recognize credit cards would be stupidly easy. Throw in some OCR magic and hook it up to Twitter's API and there you go. Something similar was already done; there used to be an account that would retweet photos of credit cards.

Tempted to make an Interpals crawler that searches girls' profiles for keywords and sends a message built around those keywords.

>Training a neural net to recognize credit cards would be stupidly easy.
Maybe?
I suspect the broad range of patterns and images on credit cards would make identifying them tricky, but I don't have real experience in that area.

Did one to log into Mergent Online and search for the top performers for the day in, well, some market like the NYSE. It downloads financial statements, competitors, etc.


pretty useless thing to do

Parse university canteen websites

and offer the data in machine-readable form.

any guides on building one?

How hard would it be to do it in C, though? Just for fun, to learn and become more familiar with C.
Or is it just too bad of an idea?

WEBCRAWLING IN MY SKIN

I was fucking listening to Crawling too.

I work at an industrial equipment distributor.
Made some scripts to gather and process data from manufacturers' pages in order to use it on our company page.
Does this count as crawling? I haven't really looked into the definition of crawling; I just made my stuff do what I needed it to.

The Scrapy framework for Python is very good and uses concurrent requests.
A lot of options are also available.
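For reference, a bare-bones spider sketch; the domain and the CSS selectors are made up:

# Minimal Scrapy spider -- run with: scrapy runspider example_spider.py -o out.json
import scrapy

class ExampleSpider(scrapy.Spider):
    name = "example"
    start_urls = ["https://example.com/page/1"]

    def parse(self, response):
        # One item per listing row; Scrapy schedules the requests concurrently.
        for row in response.css("div.item"):
            yield {"title": row.css("h2::text").get()}

        # Follow pagination if there is a next page.
        next_page = response.css("a.next::attr(href)").get()
        if next_page:
            yield response.follow(next_page, callback=self.parse)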

Made some Python/Scrapy cronjobs to automatically like the FB/Twitter posts of my gf every hour or so.

'Cause you know, I'm a vagina slave developer with no time for childishness like social networks.

Sometimes I like to use nmap to scan millions of random IPs on port 80 and then see if a web page resolves. It's usually just boring shit like Chinese sites and stuff. I found someone's home videos once.
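A rough sketch of the same idea without nmap, assuming you just want to see which random IPs answer with something HTTP-ish on port 80 (the probe count and timeout are arbitrary):

# Probe random IPs on port 80 and print whatever answers like a web server.
import random
import socket

def random_ip():
    # Naive: can land in reserved/private ranges; fine for a toy.
    return ".".join(str(random.randint(1, 254)) for _ in range(4))

for _ in range(100):
    ip = random_ip()
    try:
        with socket.create_connection((ip, 80), timeout=1) as s:
            s.sendall(b"HEAD / HTTP/1.0\r\nHost: " + ip.encode() + b"\r\n\r\n")
            banner = s.recv(256)
            if banner.startswith(b"HTTP/"):
                print(ip, banner.splitlines()[0].decode(errors="replace"))
    except OSError:
        pass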

One for "subscribing" to youtube channels, without having an account, navigating through a laggy GUI or getting distracted from my work by recommendations.
After the scan, the videos open in a vlc media stream.
Quite comfy on low-end computers.

Nicely done.

>One for "subscribing" to youtube channels, without having an account
I do that too.

>After the scan, the videos open in a VLC media stream.
Huh, okay. Mine returns an Atom feed that gets read by my feed reader.

Are you doing that via the YouTube API? I did a similar API-to-RSS kind of thing for search results a while ago, but it hit some arbitrary limits in API v2 or whatever it was at the time.

Y'know, every channel does have an RSS feed. Y'can just use that.

Uuhh uh whaat

>Are you doing that via the YouTube API?
God no. The YouTube API actually requires you to authenticate with an account.

I'm just scraping the HTML of the Uploads page (or Playlist page) and the individual Video pages. To save re-scraping the same pages over and over, I store the info on the Video pages in a SQLite DB between scrapes.

If Google doesn't like me doing that, then they're free to bring back channel RSS feeds.
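For what it's worth, the SQLite-between-scrapes part looks something like this; the actual scraping of the Uploads/Video pages is the brittle part and is only stubbed out here:

# Cache per-video info in SQLite so pages aren't re-scraped every run.
import sqlite3

db = sqlite3.connect("videos.db")
db.execute("""CREATE TABLE IF NOT EXISTS videos
              (video_id TEXT PRIMARY KEY, title TEXT, published TEXT)""")

def cached(video_id):
    # Returns (title, published) if we already scraped this video, else None.
    return db.execute("SELECT title, published FROM videos WHERE video_id = ?",
                      (video_id,)).fetchone()

def store(video_id, title, published):
    db.execute("INSERT OR REPLACE INTO videos VALUES (?, ?, ?)",
               (video_id, title, published))
    db.commit()

# for video_id in scrape_uploads_page(channel_url):      # placeholder scraper
#     if cached(video_id) is None:
#         store(video_id, *scrape_video_page(video_id))  # placeholder scraper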

>every channel does have an RSS feed
That's been gone for years.

Yeah. There was some specific URL you paste the channel's ID after, but sometimes even just viewing source and looking for 'RSS' works. I'll see if I have the URL saved somewhere.
All you need is RSS for that, though, yeah.

>That's been gone for years.
It's not. It's still there. Just not obviously available.

>It's not. It's still there. Just not obviously available.
Shit, really? I did a bunch of searching for stuff like that before I wrote the scraper, but I found nothing that still worked.
Do you have any information you could post / link to?

I have some automated betting process going on with a few sports betting sites.

youtube.com/feeds/videos.xml?channel_id=[HEX-ID]

[HEX-ID] => search for the tag "channel-external-id" in the channel HTML
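A quick way to check that feed with nothing but the standard library; the channel ID below is just the one posted further down the thread:

# Fetch a channel's Atom feed and list its latest uploads.
import urllib.request
import xml.etree.ElementTree as ET

CHANNEL_ID = "UCxr2d4As312LulcajAkKJYw"
url = f"https://www.youtube.com/feeds/videos.xml?channel_id={CHANNEL_ID}"

ns = {"atom": "http://www.w3.org/2005/Atom"}
root = ET.fromstring(urllib.request.urlopen(url).read())

for entry in root.findall("atom:entry", ns):
    title = entry.find("atom:title", ns).text
    link = entry.find("atom:link", ns).attrib["href"]
    print(title, link)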

This seems cool... is it checking the betting lines on those games?

However, on some channels you can just view source and Ctrl+F 'rss'. For example, from the channel of a random video in my recommended videos:
youtube.com/channel/UCxr2d4As312LulcajAkKJYw
youtube.com/feeds/videos.xml?channel_id=UCxr2d4As312LulcajAkKJYw

Otherwise you'll have to do it that way, though.
That said, I'm having trouble finding one it's not working for right now (still looking for one), even though it didn't work for a lot of the channels in my RSS. So it's pretty helpful still, apparently.

If there are existing libs in C for scraping like BeautifulSoup, then it's easy.

Otherwise, starting from scratch would be an intermediate task for a new C programmer.

>channel RSS feeds are gone
It's still there, using it right now...

I suggest going for libcurl and libtidy.
libtidy comes with a buffer type that can be passed to curl_easy_setopt as CURLOPT_WRITEDATA and filled from the write callback.
But listen to

>youtube.com/feeds/videos.xml?channel_id=UCqbkm47qBxDj-P3lI9voIAw
Alright, I don't know if that was added since I wrote this thing, or if I missed it somehow.
Still, thanks!

>>I used an HTML parser module for Python called BeautifulSoup
>BeautifulSoup is nice, but it's slow as shit. Unless you REALLY need the tolerance, use lxml.
What the fuck is this and how do I use it?
I'm manually parsing HTML with C right now...
>protip: it just werks