v 0.1.2

This is ybsnarfz, a rather simplistic package to snarf the yahoo
financial boards for any given stock.

v 0.1.2: Please run FixTime.pl to fix times between midnight and 1 am.

License:

All files, scripts and programs in this package are available under the
terms of the GPL v2 from the FSF. A copy of that license is in this
package.

Dependencies:

    perl. Obviously. I run 5.8.5, but the programs should work on
        somewhat earlier versions as well.
    Getopt::Long package for perl.
    Mysql. Nothing exotic, except that it uses "replace into", which
        may require v4 of mysql.
    DBI and DBD::mysql (perl database access for mysql).
    wget. This is how data is pulled from yahoo.
    linux/posix/unix. Internally there are some system calls to the
        standard posix/unix commands cat, ls, and mv.

Package Contents: for v 0.1.0

    CHANGELOG                  - List of changes.
    detag.pl                   - The html scanner program.
    FindBadDates.pl            - Refetch posts where the date is bad.
    FindMissing.pl             - Fetch any posts missing from the db (not deleted).
    FixTime.pl                 - Fix bad times between midnight and 1 am.
    GPLv2.txt                  - Text copy of the GPL v2 license.
    htdocs/                    - Directory for php scripts etc.
    htdocs/ybsnarfz.php        - A PHP script to display data.
    htdocs/scox-properties.php - A sample properties file.
    README                     - This file.
    sample-output/             - Some sample output from ybsnarfz.php.
    scox.properties            - Sample properties file.
    showMessages.pl            - Display (not ready for prime time).
    TODO                       - Things which might get done.
    UpdateRecs.pl              - Go back and update number of recs.
    work/                      - Working directory to accumulate messages.
    yahooGetLast.pl            - Scan to find last message number.
    yahooMsgScan.pl            - The message parser program.
    yahooRecsParse.pl          - Parse the message list for recs.
    yboard.tabdef              - The input to mysql to create the messages table.
    ybScan.pl                  - The program which saves messages to the db.
    YbsOptions.pm              - Header and common code used by other programs.
    ypull.pl                   - The program to pull messages from the yahoo board.

Installation and Operation:

Unpack the tarball:

    tar -xvzf ybsnarfz-X.X.X.tar.gz

This should create the directory ybsnarfz-X.X.X (the X's are the version
number, very low!). You may leave the files where they are or move
things as desired. The programs expect to find each other, the
properties files and the work directory relative to where they are run;
they will also write some transient files there (um, should be changed
to use a temp directory).

Create a database and a user to access it. Currently it will only build
one table. The example I've used is "yboard" for everything; if the user
yboard is not otherwise used, following this exactly should not be a
problem. Otherwise, or if paranoid, change as desired.

    mysqladmin -u<your master> -p<your masterpw> create yboard
    mysql -u<your master> -p<your masterpw> mysql
    grant all privileges on yboard.* to yboard@localhost identified by 'yboard';
    exit;

Create the table. cd into the install directory, then:

    mysql -uyboard -pyboard yboard <yboard.tabdef

Fix the configuration file. For SCOX the scox.properties file may look
like this:

    domain = Yahoo Finance Board
    locus = SCOX
    boardid = 1600684464
    boardname = cald
    workdir = work
    dbhost = localhost
    dbuser = yboard
    dbpass = yboard
    dbset = yboard

domain is the primary qualifier. locus is not really used but is there
just in case. boardid is required by wget to qualify the yahoo url, as
is boardname. Note that the domain and boardid, along with the message
number, form the primary key for the messages table. workdir is the name
of the work directory where pulled posts are stored until ybScan.pl
processes them. The dbxxxx settings are for access to your database.

All options:

    boardid   - The numeric id used by yahoo to uniquely identify a
                message board. Can be seen in the url. Required.
    boardname - The canonical name of the board in the yahoo url.
                Required.
    dbhost    - The hostname to connect to the database. Defaults to
                'localhost'.
    dbpass    - The password to connect to the database. Default is
                'yboard'.
    dbset     - The database name. Default is 'yboard'.
    dbuser    - The database connection id.
                Default is 'yboard'.
    detag     - The html scanner program. Defaults to "detag.pl" (part
                of this package) under the directory specified in
                execdir.
    domain    - The domain of messages. This is free text and can be
                any value. It is part of the primary key and will be
                the same for all rows. Required.
    execdir   - Defaults to ".", and tells the ybScan program where to
                find the html scanner and message parser. This is
                included so you can move the executables elsewhere.
    getlast   - Program to find the last post number (from the
                catenated output of detag). Defaults to
                "yahooGetLast.pl" (in package) under execdir.
    locus     - Just a name tag in the database for these scans. A
                descriptive name is suggested.
    parser    - The message parser program. Defaults to
                "yahooMsgScan.pl" (part of this package) under the
                directory specified in execdir.
    puller    - The puller program. Defaults to ypull.pl under execdir.
    scanner   - The message parser program. Defaults to yahooMsgScan.pl
                under execdir.
    tempdir   - Location where ypull writes some temporary files.
    workdir   - Location where ypull writes the message files of
                messages retrieved from yahoo. ybScan subsequently
                deletes them from here after writing them to the DB.

There are three basic programs, which are run from the directory where
they reside: ypull.pl, ybScan.pl and UpdateRecs.pl. All of these require
an argument giving the prefix of the properties file (viz
"./ypull.pl scox" for scox.properties). All of them accept a
-d <number> option for the debug level. -d 1 will give more info per
process; -d 2 will give a lot more in ybScan... you probably don't want
to do that unless you are fixing a code problem.

ypull.pl optionally takes additional arguments giving the first and last
message numbers to pull. A zero (or nothing specified) for the first
number tells it to look in the messages table for the maximum message
number for that domain and board id (the last will be rescanned).
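
Since every program is driven by a properties prefix, it can be handy to
check what a given properties file actually says before running
anything. The sketch below is only an illustration: prop_get is a
made-up helper, not part of this package, and it assumes the file holds
one "key = value" pair per line as in the sample above. It makes no
claim about how YbsOptions.pm reads the file.

```shell
#!/bin/sh
# Hypothetical helper, not part of ybsnarfz: print the value of one key
# from <prefix>.properties. Assumes one "key = value" pair per line.
prop_get() {
    # $1 = properties prefix (e.g. scox), $2 = key (e.g. boardid)
    sed -n "s/^[[:space:]]*$2[[:space:]]*=[[:space:]]*\(.*[^[:space:]]\)[[:space:]]*\$/\1/p" \
        "$1.properties" | head -n 1
}

# Demo against a throwaway copy of the sample settings.
demo=$(mktemp -d)
cat > "$demo/scox.properties" <<'EOF'
domain = Yahoo Finance Board
locus = SCOX
boardid = 1600684464
boardname = cald
workdir = work
EOF

echo "board id: $(prop_get "$demo/scox" boardid)"
echo "domain:   $(prop_get "$demo/scox" domain)"
rm -rf "$demo"
```

Run from the install directory, something like prop_get scox workdir
would then show which work directory ypull.pl is going to fill.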

ypull doesn't update the table; it writes the data into the work
directory (with a magic tag on the top giving the message number) as
<name>-<msgnum>.post.

ybScan.pl invokes detag and yahooMsgScan for all files in the work
directory with a prefix of the properties file name and a ".post"
suffix. After processing them successfully, ybScan deletes them from the
work directory.

UpdateRecs.pl will by default go back 2000 records from the most recent
and update the number of recs on posts. A starting message number may be
supplied as an optional parameter.

You may wish to set up a simple cron script which looks like this (vary
for version):

    cd ~/ybsnarfz-0.1.0
    ./ypull.pl scox
    ./ybScan.pl scox

... and have it run every day, every hour or however often you feel
like. Similarly, you may wish to run UpdateRecs.pl somewhat less often.

The Messages Table:

The full text of this definition is in yboard.tabdef.

    domain    varchar(64)   # A descriptive name for the domain.
    locus     varchar(64)   # A descriptive name for the particular part of domain.
    boardid   varchar(64)   # A unique identifier in the domain.
    msgn      bigint(20)    # The message number, unique in domain and boardid.
    poster    varchar(255)  # Who sent the message.
    posttime  datetime      # Date and time of posting.
    recs      int(11)       # How well regarded was this message.
    title     varchar(255)  # Subject line of the message.
    refmsg    bigint(20)    # Parent message number.
    refby     varchar(255)  # Parent message poster.
    msg       blob          # Text of the message.
    PRIMARY KEY (domain,boardid,msgn),
    KEY msgn (msgn),
    KEY poster (poster),
    KEY refmsg (refmsg),
    KEY postTime (postTime)

Usage Notes:

    sh-3.00$ ./ypull.pl -h
    usage is [path/]ypull.pl [-h][-d n] <prop> [start] [end]
    This program pulls messages from the Yahoo Finance board and
    saves them for further processing.
    A -h (or no arguments) will print this message.
    A -d n will turn on debugging if n is not zero.
    <prop> is the prefix name of a .properties file.
    "start" is the first message number; if zero or not specified
    ypull will find the last message in the database and restart
    from there.
    "end" is the last message number to pull. If unspecified or zero
    ypull will find the last message number currently available on
    the board and use that.

    sh-3.00$ ./ybScan.pl -h
    usage is [path/]ybScan.pl [-h][-d n] <prop>
    This program takes Yahoo Finance Board posts as captured by
    ypull.pl, parses out the information and saves it in a database
    table, removing the posts from the archive created by ypull.
    -h (or no arguments) causes this message to print.
    -d n will display some or lots of debug information if n is 1
    or higher.
    <prop> is the prefix of a .properties files specifying how to
    handle this information.

    sh-3.00$ ./UpdateRecs.pl
    usage is [path/]UpdateRecs.pl [-h][-d n] <prop> [start]
    This program updates the number of recs in the messages table.
    saves them for further processing.
    A -h (or no arguments) will print this message.
    A -d n will turn on debugging if n is not zero.
    <prop> is the prefix name of a .properties file.
    "start" is the first message number; if zero or not specified
    this will find the last message in the database and start from
    2000 back.

ybsnarfz.php

    This is a php script/program to display data in useful ways. It
    depends on a configuration file (<name>-properties.php) being in
    the same directory as the php script and readable by apache (or
    whatever web server you use). This configuration is identical to
    the .properties file used by the perl programs, save for the
    language changes needed for PHP. It displays in both list and
    threaded modes.

xybsnarfz.php

    Very similar to ybsnarfz.php, but it truncates the display of each
    message at 600 characters and includes an iframe back to the yahoo
    board.

Gotchas:

The primary gotcha in this system is that it does not preserve runs of
blank lines. More than one blank line in a row in the input will always
be reduced to a single blank line.
That could be fixed... do I want to? -- TWZ
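
To make the blank-line gotcha concrete, here is a small shell sketch
that mimics the effect on a sample input. It only illustrates the
behaviour described above; squeeze_blank is a made-up name, and this is
not how the perl scripts in the package do it.

```shell
#!/bin/sh
# Illustration only: mimic the blank-line collapsing described above.
# This is not the code used by detag.pl or yahooMsgScan.pl.
squeeze_blank() {
    # Print every non-empty line; print an empty line only when the
    # line before it was not empty.
    awk '
        /^$/ { if (!blank) print ""; blank = 1; next }
        { blank = 0; print }
    '
}

# Three blank lines between the two text lines go in, one comes out.
printf 'first paragraph\n\n\n\nsecond paragraph\n' | squeeze_blank
```

A post that used extra blank lines for spacing will therefore look more
compact in the database than it did on the board.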