Message ID: 273319
Posted By: ColonelZen
Posted On: 2005-06-13 01:43:00
Subject: ybsnarfz

Having gotten back from my trip yesterday, I looked at this board, and decided NO.

Well, having said no, I decided to play a little. While yahouevre does a better job and has lots more functionality, I always wanted a simpler Perl tool to snarf the Yahoo boards.

It needs some polish and some automation. But simple it is - a grand total of 537 lines in six files (one of which is a configuration and one of which is the table definition).

It's just playing, not anything serious, but if you don't want to do the full warmcat thing and just want to have the posts in a db for future reference or scanning, this'll do it.

I'm rerunning it against the board now, but with coming up on 300k posts it'll take a few days. I'll catch and fix bugs as it runs.

I'll tarball and post it later. For now it's available as cut-and-pasteable text (Perl) at http://www.ip-wars.net/?op=displaystory;sid=2005/6/13/1336/32119

-- TWZ


Message ID: 273323
Posted By: ColonelZen
Posted On: 2005-06-13 03:09:00
Subject: Re: ybsnarfz

As this is a work in progress, look at the comments: I've found and fixed two bugs already...

As said, I'll find time to clean it up and package it sometime... it shouldn't take too long.

-- TWZ


Message ID: 273521
Posted By: ColonelZen
Posted On: 2005-06-13 18:51:00
Subject: ybsnarfz-0.0.2

see

http://www.ip-wars.net/comments/2005/6/13/1336/32119/6#6

for the README and where the tarball is.

It seems to all be running fairly smoothly now.

-- TWZ


Message ID: 274092
Posted By: ColonelZen
Posted On: 2005-06-16 00:20:00
Subject: ybsnarfz-0.1.0

A bit of code cleanup, et al.

The entire scox table, mysqldump'd, is 150 MB. It is a single, totally unnormalized table with a lot of redundancy in its columns, but it's worth noting how *small* all our work here has been over the last two years ;-)

I may add some trivialities to read the table in a fairly useful way now - others are invited to play as well if they choose.

Described with the link to get the tarball at:

http://www.ip-wars.net/comments/2005/6/13/1336/32119/7#7

-- TWZ


Message ID: 274096
Posted By: ColonelZen
Posted On: 2005-06-16 00:53:00
Subject: Re: ybsnarfz-0.1.0 missing posts

Speaking of which, there seem to be about 33k posts missing between 145000 and 178100.

Does anybody have any Clues?
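
For anyone wanting to check their own copy, here is a minimal sketch (Python, not part of ybsnarfz) of one way to list exactly which message numbers are absent from a range. The function name `find_gaps` and the toy ID list are assumptions for illustration; in practice the IDs would come from a query against whatever column holds the message number.

```python
def find_gaps(ids, first, last):
    """Return inclusive (start, end) ranges in [first, last] that are missing from ids."""
    present = sorted({i for i in ids if first <= i <= last})
    gaps = []
    expected = first
    for i in present:
        if i > expected:
            # Everything from `expected` up to the ID before this one is missing.
            gaps.append((expected, i - 1))
        expected = i + 1
    if expected <= last:
        gaps.append((expected, last))
    return gaps


# Toy data standing in for the message numbers stored in the table.
stored = [1, 2, 3, 7, 8, 12]
for start, end in find_gaps(stored, 1, 12):
    print(f"missing {start}..{end} ({end - start + 1} posts)")
```

Run against the real range (145000 to 178100), this would show whether the hole is one contiguous block or many small ones, which may hint at whether the posts are gone at the source or were skipped by the snarfer.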

-- TWZ


Message ID: 274560
Posted By: ColonelZen
Posted On: 2005-06-17 21:08:00
Subject: 0.1.1 of ybsnarfz is available

at

As described at

www.ip-wars.net/comments/2005/6/13/1336/32119/8#8

This is ybsnarfz, a rather simplistic package to snarf the Yahoo financial boards for any given stock....

-- TWZ


Message ID: 274843
Posted By: ColonelZen
Posted On: 2005-06-20 02:47:00
Subject: ybsnarfz new version

ybsnarfz-0.1.2 is now available at

http://mysite.verizon.net/~vze4v38p/ybsnarfz-0.1.2.tar.gz

There are some code fixes and a program, FixTime.pl, which you should run to fix bad times between midnight and 1 am. There is also a ybsnarfz.php program to display the data in the table.

Some sample output (at the same site):

ybs-config.html
ybs-msg-list.html
ybs-msg-thread.html
ybs-list.html
ybs-thread.html

It seems to all work, but it could definitely use some style sheets!

The config file for the PHP program is the same .properties file used by the Perl programs, but it needs to reside in the same directory as the PHP script (and be readable, which is a security opening, but your db user should be restricted to localhost anyway).
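
To illustrate why one .properties file can serve programs in two languages, here is a rough sketch (in Python, purely for illustration) of reading simple `key = value` lines. The exact format and key names ybsnarfz uses are not shown in these posts, so the keys below (`db.host`, `db.user`, `board`) are hypothetical.

```python
def parse_properties(text):
    """Parse simple key=value lines; lines starting with '#' or '!' are comments."""
    props = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line[0] in "#!":
            continue
        key, sep, value = line.partition("=")
        if not sep:
            continue  # ignore lines that have no '=' at all
        props[key.strip()] = value.strip()
    return props


# Hypothetical contents; the real ybsnarfz keys may differ.
sample = """
# database settings
db.host = localhost
db.user = ybs
board = scox
"""
config = parse_properties(sample)
print(config["db.host"], config["board"])
```

Any language that can split a line on the first `=` can read the same file, which is presumably the appeal of the format here.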

-- TWZ


Message ID: 275778
Posted By: ColonelZen
Posted On: 2005-06-23 00:02:00
Subject: OT RTP if using/thinking of ybsnarfz

Just want to get an idea of how much interest there is.

It works for my purposes but I've gotten only one person giving regular feedback and one passing mention of use by another correspondent.

Given past discussions of archiving, I thought there would be more interest. My question is: should I just put the final cleanup on what's there? That would be mostly cosmetics and trivial usage. I've made, but not yet published, parts for keeping an archive of the HTML of the posts, as per my correspondent's request; he wants the HTML filenames sortable (i.e. the number strings all the same length). There is also a minor polish to the PHP: enter a message number, and limit lists to a particular poster.

Or is there enough interest to push it a little further? The table as it is is *totally* unnormalized. Normalizing it would allow better calculation of thread and subthread rec totals and the like, as well as better per-user calcs.
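
As a rough sketch of what normalizing could buy, the following uses Python's built-in sqlite3 as a stand-in for MySQL. Every table and column name here (`poster`, `post`, `parent_id`, `recs`) is an assumption for illustration and not the actual ybsnarfz schema; "rec totals" is taken to mean summed recommendation counts. The recursive query is also just one way to walk a thread.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE poster (
    poster_id INTEGER PRIMARY KEY,
    name      TEXT UNIQUE NOT NULL
);
CREATE TABLE post (
    msg_id    INTEGER PRIMARY KEY,
    poster_id INTEGER NOT NULL REFERENCES poster(poster_id),
    parent_id INTEGER REFERENCES post(msg_id),   -- NULL for a thread root
    recs      INTEGER NOT NULL DEFAULT 0,
    subject   TEXT
);
""")
conn.execute("INSERT INTO poster (poster_id, name) VALUES (1, 'alice'), (2, 'bob')")
conn.executemany(
    "INSERT INTO post (msg_id, poster_id, parent_id, recs, subject) VALUES (?, ?, ?, ?, ?)",
    [
        (1, 1, None, 5, "topic"),
        (2, 2, 1, 3, "Re: topic"),
        (3, 1, 2, 2, "Re: topic"),
        (4, 2, None, 1, "other"),
    ],
)


def thread_recs(conn, root_id):
    """Sum recs over a post and all of its direct and indirect replies."""
    row = conn.execute(
        """
        WITH RECURSIVE thread(msg_id) AS (
            SELECT msg_id FROM post WHERE msg_id = ?
            UNION ALL
            SELECT p.msg_id FROM post p JOIN thread t ON p.parent_id = t.msg_id
        )
        SELECT COALESCE(SUM(recs), 0) FROM post
        WHERE msg_id IN (SELECT msg_id FROM thread)
        """,
        (root_id,),
    ).fetchone()
    return row[0]


def poster_totals(conn):
    """Return (name, post count, total recs) per poster, ordered by name."""
    return conn.execute(
        """
        SELECT name, COUNT(*), SUM(recs)
        FROM post JOIN poster USING (poster_id)
        GROUP BY name ORDER BY name
        """
    ).fetchall()


print(thread_recs(conn, 1))   # whole thread rooted at message 1
print(thread_recs(conn, 2))   # subthread rooted at message 2
print(poster_totals(conn))
```

With the poster name stored once and each reply pointing at its parent, thread, subthread, and per-user totals each become a single query instead of string matching on subjects.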

The cosmetics could happen in the next week; my thought is to "put it to bed" after that unless there is more interest.
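
On the sortable-filename request, a tiny sketch of the idea: zero-pad the message number to a fixed width so that plain alphabetical order matches numeric order. The helper name `html_filename` and the width of 7 digits are assumptions, not what ybsnarfz actually does.

```python
def html_filename(msg_id, width=7):
    """Build a fixed-width, zero-padded filename for a message number."""
    return f"{msg_id:0{width}d}.html"


ids = [9, 100, 273319, 1000]

# Unpadded names sort as text, so 1000 lands before 9.
plain = sorted(f"{i}.html" for i in ids)

# Padded names sort the same way as the underlying numbers.
padded = sorted(html_filename(i) for i in ids)

print(plain)
print(padded)
```

The width just needs to be at least as long as the largest message number expected; 7 digits leaves room well past the current ~275k posts.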

-- TWZ


The texts of these Yahoo Message Board posts have been licensed for copying and distribution by the Yahoo Message Board user "ColonelZen" under the following license: CCL Attribution-NonCommercial-ShareAlike v2.0.

Copyright 2005 Yahoo! SCOX. Messages are owned by the individual posters.