Sushant Hiray's Webspace

Sushant Hiray

TSHN ~ Top Stories of Hacker News in last 24 hours

What is this?

Pretty simple. Every 15 minutes, TSHN scrapes the first page of Hacker News and generates a new page containing the top stories from the past 24 hours, sorted in descending order by overall points. You can check out a quick demo here.

How do I make this work?

  • Clone the repo here
  • Check out run.sh for a sample bash script. Change the paths accordingly to make your own bash script.
  • Change the parent folder path in scrape.py
  • Now add a cron task as follows:
    • Type crontab -e in terminal
    • Append the following line into the opened file: */15 * * * * {path to run.sh}
    • Once the cron task is set up correctly, the script will scrape the front page of HN every 15 minutes and update the top stories data accordingly.

Why the hell would you want to do that?

I was bored during our winter holidays, so I started reading about Beautiful Soup. I was simultaneously checking out HN in a different tab.
So I thought HN could be a nice place to start scraping. This is pretty useful, as I check HN quite a few times during the day, and having a sorted view of the posts helps me read the interesting stuff first ^_^
Also, as a friend pointed out, it is particularly useful in the morning to check on interesting posts that could have gotten buried while you were sleeping!



How'd you do it?

Pretty simple! The code’s up on GitHub.
As I mentioned before, take a look at Beautiful Soup and see how you can go about scraping the relevant data.
Once you are done with that, the rest is just a simple Python script to combine the scraped data.
Thanks to Bootstrap for the minimal UI.
The page makes ajax calls to fetch the data.
That’s it.
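
To make the scraping step a bit more concrete, here is a minimal sketch of pulling titles, links and points out of a front-page-like HTML snippet with Beautiful Soup. It is not the code from the repo. The sample HTML is hand-written, and the class names (`athing`, `titleline`, `score`) are assumptions about HN's markup that may not match the live page; `parse_front_page` is a hypothetical helper name.

```python
from bs4 import BeautifulSoup

# Hand-written sample that imitates the assumed front-page layout:
# one row for the story title, followed by a row holding the score.
SAMPLE_HTML = """
<table>
  <tr class="athing" id="1">
    <td class="title"><span class="titleline"><a href="https://example.com/a">Story A</a></span></td>
  </tr>
  <tr><td class="subtext"><span class="score" id="score_1">120 points</span></td></tr>
  <tr class="athing" id="2">
    <td class="title"><span class="titleline"><a href="https://example.com/b">Story B</a></span></td>
  </tr>
  <tr><td class="subtext"><span class="score" id="score_2">45 points</span></td></tr>
</table>
"""


def parse_front_page(html):
    """Return a list of {title, url, points} dicts found in the HTML."""
    soup = BeautifulSoup(html, "html.parser")
    stories = []
    for row in soup.select("tr.athing"):
        link = row.select_one("span.titleline a")
        # The score is assumed to live in the table row right after the title row.
        meta = row.find_next_sibling("tr")
        score = meta.select_one("span.score") if meta else None
        if link is None or score is None:
            continue  # skip rows without a link or a score
        stories.append({
            "title": link.get_text(strip=True),
            "url": link["href"],
            "points": int(score.get_text().split()[0]),  # "120 points" -> 120
        })
    return stories


if __name__ == "__main__":
    for story in parse_front_page(SAMPLE_HTML):
        print(story["points"], story["title"], story["url"])
```

In a real run the HTML would come from fetching the HN front page instead of `SAMPLE_HTML`, and the resulting list would be dumped to a json file for that 15 min slot.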

Further Improvements

  • Currently I'm reloading the page every 15 min. It ain't the right way, of course; a simple improvement would be to make ajax calls every 15 min using setTimeout.
  • Also, I'm currently storing 96 files, one dump for each 15 min slot in the last 24 hours. While processing, I append all the files together and then sort during the ajax call. This could be made more efficient by making sure the final json file has only unique entries.
  • If you find any other issues, feel free to fork it and fix it.
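
The "unique entries" improvement could look roughly like the sketch below. It is only an illustration under assumptions: each slot dump is taken to be a json list of objects with `url` and `points` keys, the dumps are assumed to sit in one folder as `*.json` files, and `merge_dumps` is a hypothetical name, not something from the repo.

```python
import json
from pathlib import Path


def merge_dumps(dump_dir):
    """Merge all slot dumps into one list with a single entry per story."""
    best = {}
    for path in sorted(Path(dump_dir).glob("*.json")):
        for story in json.loads(path.read_text()):
            key = story["url"]  # assumed to identify a story across dumps
            # Keep the snapshot with the highest point count for each story.
            if key not in best or story["points"] > best[key]["points"]:
                best[key] = story
    # Highest-scoring stories first, matching the order shown on the page.
    return sorted(best.values(), key=lambda s: s["points"], reverse=True)


if __name__ == "__main__":
    # "dumps" is a placeholder folder name for this sketch.
    print(json.dumps(merge_dumps("dumps"), indent=2))
```

Writing this merged list out once per scrape would let the page fetch one already-sorted file instead of sorting 96 appended dumps during the ajax call.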
Sushant Hiray - Foodie. Coder. Reader. Binge Watching.
Open Source Evangelist