The parsefile program parses files using different parsing algorithms, specified in library files.
It includes a command line, which is used to manage files, parsing algorithms, configurations and variables.
The program can also be run without any user input by using configuration files and command-line arguments.
It is meant to be small, simple and maximally extendable by using shared libraries and C++ classes when needed.
The program must be run in its program directory, which includes at least the following directories:
algos - for the algorithm files: libraries (*.so), descriptions (*.txt), temporary files (*.tmp)
conf - for configuration files (*.cfg)
output - for the output files (*.out)
Optional directories include:
src - the source files (*.cpp) and header files (*.h) of the program and a compile.sh for compiling and linking
algos/src - the source files (*.cpp) and header files (*.h) of the provided algorithms and a compile.sh for compiling
algos/src/templates - template files for the programming of new algorithms
input - different example files for parsing
Note: The program has to be run exactly in its program directory, not in one of these sub-directories!
The compiled and executable binary file parsefile is already included in the package.
If you want to recompile and link the source code of the main program, which is found in the src folder (if received), you can use the g++ compiler with the command-line options -ldl and -rdynamic, which are needed for the dynamic library loading functionality.
Alternatively, you can use src/compile.sh by opening a terminal in the main program folder (not the src folder) and typing sh src/compile.sh. The executable will be compiled, and you will have to enter your master password to copy it to /usr/bin/parsefile, so that you can use it without restrictions. Read the source of src/compile.sh first, because you should never give your master password away without knowing exactly what a program is about to do with it.
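The two linker options relate to run-time loading: -ldl links the library that provides dlopen and dlsym, and -rdynamic exports the symbols of the main program so that loaded libraries can use them. The following is only a minimal sketch of such loading with the POSIX functions, not the actual code of parsefile; the helper name load_symbol is hypothetical.

```cpp
#include <dlfcn.h>   // dlopen, dlsym, dlclose, dlerror (POSIX)
#include <cstdio>
#include <string>

// Hypothetical helper (not part of parsefile): opens a shared library and
// looks up one symbol in it. Returns nullptr and prints the reason if either
// step fails. On success, *handle holds the library handle so that the
// caller can dlclose() it later; on failure, *handle is set to nullptr.
void* load_symbol(const std::string& library, const char* symbol, void** handle)
{
    *handle = dlopen(library.c_str(), RTLD_LAZY);
    if (!*handle) {
        const char* reason = dlerror();
        std::fprintf(stderr, "could not load %s: %s\n",
                     library.c_str(), reason ? reason : "unknown error");
        return nullptr;
    }

    void* address = dlsym(*handle, symbol);
    if (!address) {
        const char* reason = dlerror();
        std::fprintf(stderr, "symbol %s not found: %s\n",
                     symbol, reason ? reason : "unknown error");
        dlclose(*handle);
        *handle = nullptr;
    }
    return address;
}
```

A file containing this helper would be compiled with something like g++ -rdynamic example.cpp -ldl, where example.cpp is a placeholder name.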
parsefile
Opens the command console.
parsefile <config>
Loads the configuration <config> and opens the command console.
Enter start to immediately start the parsing process, using this configuration.
parsefile <config> --start
parsefile <config> -s
Loads the configuration <config> and starts the parsing process immediately, using this configuration.
Note: Commands and arguments can be separated by spaces or tabs.
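As an illustration of this separation rule, here is a minimal sketch that splits one input line into a command and its argument. It is an assumption about how such a console could work, not the code from src/commands.cpp; the names Command and split_command are hypothetical. The argument is kept as one string, so that values containing spaces survive.

```cpp
#include <cstddef>
#include <string>

// Hypothetical result type: the command word and the rest of the line.
struct Command {
    std::string name;      // first word of the line
    std::string argument;  // everything after the separating spaces/tabs
};

// Splits a line at the first run of spaces and/or tabs.
// Leading separators are skipped; trailing ones are left in the argument.
Command split_command(const std::string& line)
{
    const char* const separators = " \t";
    Command result;

    const std::size_t begin = line.find_first_not_of(separators);
    if (begin == std::string::npos)
        return result;                       // empty or blank line

    const std::size_t end = line.find_first_of(separators, begin);
    if (end == std::string::npos) {
        result.name = line.substr(begin);    // command without argument
        return result;
    }
    result.name = line.substr(begin, end - begin);

    const std::size_t argument = line.find_first_not_of(separators, end);
    if (argument != std::string::npos)
        result.argument = line.substr(argument);
    return result;
}
```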
Here is the complete list of commands for the current version of parsefile:
help [<command>]
Displays the current list of commands, or - if specified - information about <command>.
Identical commands: ?, h, man
version
Displays the current version of the main program.
Identical commands: v, ver
addfile <file>
Adds <file> to the list of files to parse and displays the file number used by removefile.
Identical commands: f, af, addf, file
showalgos
Shows the list of available parsing algorithms.
Identical command: algos
addalgo <name>
Adds parsing algorithm <name> to the parser and displays the algorithm number used by removealgo. Note that for most parsing algorithms the order of the algorithms added is important. The algorithm added first will be run first, etc.
Identical commands: a, aa, adda, algo
removefile <number>
Removes file number <number> (see result of addfile or showconf) from the file list.
Identical commands: rf, rmvf, rmvfile
removealgo <number>
Removes algorithm number <number> (see result of addalgo or showconf) from the list of algorithms.
Identical commands: ra, rmva, rmvalgo
clear
Removes all added global variables, files and algorithms.
Identical command: clr
clearfiles
Removes all added files.
Identical commands: cf, clrf, clrfiles
clearalgos
Removes all added algorithms.
Identical commands: ca, clra, clralgos
clearvars
Removes all added global variables.
Identical commands: cv, clrv, clrvars
showconf
Shows the current configuration, which includes all file names, parsing algorithms and global variables. Includes also the file and algorithm numbers used by removefile and removealgo.
During the parsing process, the list of algorithms will not be shown.
Identical commands: sc, conf, confg, showconfg, showconfig
load <name>
Loads the configuration named <name> from its configuration file (conf/<name>.cfg).
Identical commands: l, o, ld, rd, open, read
save <name>
Saves the current configuration to a configuration file so that it can be loaded by load <name>.
Identical commands: w, sv, wr, write
set <name>[=<value>]
Sets the global variable <name> to <value>. If no value is given, an empty string will be set.
Identical command: st
unset <name>
Unsets the global variable <name>.
Identical commands: u, us, uset
get <name>
Gets the value of the global variable <name>.
Identical commands: g, gt
start
Starts the parsing and quits the program when finished.
Identical commands: r, s, rn, run, strt
exit
Quits the program without parsing.
Identical commands: c, e, q, x, cncl, quit, close, cancel
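The identical commands listed above can be thought of as aliases of one canonical command name. The following sketch shows one possible way to resolve them with a lookup table. It covers only a few of the aliases and is an assumption for illustration, not the mechanism used by parsefile; the function name canonical_command is hypothetical.

```cpp
#include <map>
#include <string>

// Hypothetical helper: maps some of the aliases from the command list to
// their canonical command names. Input that is not in the table (including
// canonical names themselves) is returned unchanged.
std::string canonical_command(const std::string& input)
{
    static const std::map<std::string, std::string> aliases = {
        {"?", "help"},      {"h", "help"},            {"man", "help"},
        {"v", "version"},   {"ver", "version"},
        {"sc", "showconf"}, {"showconfig", "showconf"},
        {"s", "start"},     {"run", "start"},
        {"q", "exit"},      {"quit", "exit"},         {"cancel", "exit"},
    };

    const auto found = aliases.find(input);
    return found == aliases.end() ? input : found->second;
}
```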
If you want to parse binary files, you have to put the program into binary mode.
Simply add a global variable named binary to your configuration, e.g. by using the set binary command.
This variable will - like all other variables - also be saved in any configuration file created by the save command.
The binary mode has been introduced in version 0.5a.
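The following sketch illustrates the idea that binary mode depends only on the existence of a variable named binary. It assumes that the value of the variable is irrelevant and models the global variables environment as a plain map; the names Variables, set_variable and input_mode are hypothetical and do not come from the vars class of the main program.

```cpp
#include <cstddef>
#include <fstream>
#include <map>
#include <string>

// Hypothetical stand-in for the global variables environment.
typedef std::map<std::string, std::string> Variables;

// Applies the argument of a "set" command: "<name>" or "<name>=<value>".
void set_variable(Variables& vars, const std::string& argument)
{
    const std::size_t equals = argument.find('=');
    if (equals == std::string::npos)
        vars[argument] = "";   // no value given: store an empty string
    else
        vars[argument.substr(0, equals)] = argument.substr(equals + 1);
}

// Chooses the mode for opening input files: binary as soon as a variable
// named "binary" exists, whatever its value (an assumption of this sketch).
std::ios::openmode input_mode(const Variables& vars)
{
    std::ios::openmode mode = std::ios::in;
    if (vars.count("binary"))
        mode |= std::ios::binary;
    return mode;
}
```

The returned mode could then be passed to a std::ifstream constructor when opening each input file.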
The following example parses an HTML file (located at input/example.htm), removes all HTML comments and keeps only the content of the <body> tag, which is written to the output file. When finished, it checks this HTML file for links and, using the HTMLSpider parsing algorithm, processes all linked files in the very same way.
Note: The example will not work if you only have the minimal package of parsefile, unless you create your own input file at input/example.htm.
To reach our goal, we will use four different parsing algorithms:
1. html_remove_comments
Will remove all HTML comments so that HTML parsing can start.
HTML parsing algorithms usually don't check whether a found tag is inside a comment. That is why html_remove_comments should be used before any other HTML parsing algorithm.
2. html_body
Leaves just the content of the <body> tag, the rest will be ignored.
We don't need more, because the links we are looking for are supposed to be inside the <body> tag.
3. print_content
For demonstration purposes, we will write the manipulated content into the output file(s).
4. HTMLSpider
After we're done, we let HTMLSpider check the content (which is now the content of the <body> tag only) for links.
The algorithm will automatically add files found in link tags, and the program will start the whole parsing process for them, too.
Setting up the example is easy. Start parsefile and enter the following commands (without the ": " in front, which just represents the command line prompt):
: addfile input/example.htm
We only want to parse HTML files, so we add a filter to HTMLSpider by setting a global variable:
: set html_spider.filter=htm,html,php,php5,
Note: The trailing comma means that files without an extension will be parsed, too. Some websites, such as wikis, provide HTML files without extensions.
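To make the role of the trailing comma concrete, here is a minimal sketch of how such a filter could be evaluated. It is an assumption for illustration, not the code of HTMLSpider; all three function names are hypothetical. The trailing comma produces an empty entry, which matches files that have no extension.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Splits a comma-separated filter such as "htm,html,php,php5," into its
// entries. A trailing comma yields a final empty entry.
std::vector<std::string> split_filter(const std::string& filter)
{
    std::vector<std::string> entries;
    std::size_t begin = 0;
    while (true) {
        const std::size_t comma = filter.find(',', begin);
        if (comma == std::string::npos) {
            entries.push_back(filter.substr(begin));
            break;
        }
        entries.push_back(filter.substr(begin, comma - begin));
        begin = comma + 1;
    }
    return entries;
}

// Returns the extension of the last path component without the dot,
// or an empty string if that component has no dot.
std::string extension_of(const std::string& path)
{
    const std::size_t slash = path.find_last_of('/');
    const std::size_t dot = path.find_last_of('.');
    if (dot == std::string::npos ||
        (slash != std::string::npos && dot < slash))
        return "";
    return path.substr(dot + 1);
}

// True if the extension of the path equals one of the filter entries.
bool passes_filter(const std::string& path, const std::string& filter)
{
    const std::string extension = extension_of(path);
    for (const std::string& entry : split_filter(filter))
        if (entry == extension)
            return true;
    return false;
}
```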
Add a comment to your new configuration by typing:
: set comment=Example configuration for README.txt
You can now enter the showconfig command to check the configuration you created. It should produce a similar result to the following, depending on the versions of your algorithms:
: showconfig
Note: The version variable is an internal read-only constant and contains the version of the main program.
To save the configuration, you can enter save example. It will be written into the config file conf/example.cfg and can easily be loaded by load example another time, to run it again or change it.
: save example
Now the moment has come to let parsefile do what it is supposed to do: parse the file. Enter start!
: start
The output can be similar to this, depending on the HTML files in your input directory:
Okay, let's go...
As you can see, we could have used the set command to set the global variable html_spider.max_files_to_add and thereby change the maximum number of files that HTMLSpider adds. The default value is 1,000 files.
In this example, HTMLSpider finds two additional files, because there is a link in the example.htm to the linktest.htm, in which (in the version used here) there are five links, all to test.htm. Nevertheless, test.htm will only be parsed once.
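The two behaviours described here, a limit on added files and parsing each linked file only once, can be sketched with a small container. This is an illustration under assumptions, not the filelist class of the main program; the class name FileQueue is hypothetical.

```cpp
#include <cstddef>
#include <set>
#include <string>
#include <vector>

// Hypothetical queue of files found by a spider-like algorithm.
// Each file name is accepted at most once, and no more than max_added
// files are accepted in total.
class FileQueue {
public:
    explicit FileQueue(std::size_t max_added) : max_added_(max_added) {}

    // Returns true if the file was queued, false if the limit has been
    // reached or the file was already queued before.
    bool add(const std::string& name)
    {
        if (files_.size() >= max_added_)
            return false;                    // limit reached
        if (!seen_.insert(name).second)
            return false;                    // duplicate link
        files_.push_back(name);
        return true;
    }

    std::size_t size() const { return files_.size(); }

private:
    std::size_t max_added_;
    std::set<std::string> seen_;             // names accepted so far
    std::vector<std::string> files_;         // names in the order found
};
```

With a limit of 1,000, adding linktest.htm once and test.htm five times leaves two files in the queue, which matches the behaviour described in this example.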
For all parsed files there should now be output files in the output folder. Open them with a text editor and you will find that all the comments have been removed (by html_remove_comments) and only the content of the <body> tag is left (by html_body). The copying of the created content into these output files was done by print_content.
Next time, you don't have to manually add the file and all the algorithms. Just use the load command:
: load example
0.1a - First alpha version. Can open local files and algorithms only. No download functionality yet.
0.2a - Global variables environment (vars class) has been added.
0.3a - Command line can be skipped by --start/-s argument.
0.4a - Updated global variables environment.
0.5a - Supports binary files.
0.6a - Bugfix and update of filelist and vars classes, change of configuration file format
0.7a - Algorithms can only be added once, unless they allow multiple instances
See Changelog for more detailed version information.
1.0b - First beta version. Will be able to download files, but can use local algorithms only.
1.0 - First release version. Will be able to download files and algorithms.
No ports to operating systems other than Linux are planned, but feel free to port it yourself and tell the world about your awesome work. See also Bugs & Contact.
As soon as you install a new version, check the change list for changes; new functionality may require you to change your algorithm or configuration files.
Changes from The global variables environment ( New command line commands are: Important change: The See Creating new algorithms for more details. Important change: The file format for the configuration files has been changed. It now includes the global variables.
However, older configuration files (from
Changes from Added the program arguments Removed bug from the parsing of program arguments that occurred when the name of the configuration file included spaces.
Changes from The global variables environment ( Important change: The See Creating new algorithms for more details. For compatibility reasons, the very same argument passed to the
Changes from Support for reading binary files has been added.
The output file will be written in binary mode, supporting binary output files, too.
The compatibility of algorithms will now be checked by introducing version constants (in If you want to use the new binary mode (all input files will be handled as binary files), you have to set the global variable
Important change: The Important change: The Important change: The All parsing algorithms from previous versions need to be changed by updating their member functions according to these changes. See Creating new algorithms for more information. Important change: The Important change: The The The
Changes from Important change: The Important change: The Important change: The Important change: The file format of the configuration files has been changed. Global variables will be read before the parser configuration (used algorithms), so that the variables will affect the added algorithms correctly. Older configuration files are not supported anymore. See Reference: Configuration file formats for more information. The new console command
Changes from Important change: The
You can find the changelogs of the algorithms inside their header files (*.h) located in algos/src/.
The program is easily extendable.
To add a new parsing algorithm, copy the shared library (*.so) file and (optionally) the description (*.txt) file of the new algorithm into the algos folder.
To add a new configuration, copy the configuration (*.cfg) file into the conf folder. Make sure that you have added all necessary algorithms before loading the configuration file, otherwise the load command will fail.
The included algorithms are already compiled into shared library (*.so) files.
If you want to recompile them (or compile new ones), you can use the g++ compiler, which has to be called twice for every algorithm: once to compile (step #1) and once to create the shared library (step #2). The following examples should be run from inside the algos/src folder.
g++ -Wall -fPIC -c <source code> [additional sources used]
This will create an object (*.o) file which has to be converted to a shared library (*.so) file.
Note: [additional sources used] can also be source code of the main program, for example filelist.cpp if the algorithm wants to use the filelist class to manipulate the list of files to be parsed, or vars.cpp if the algorithm wants to use the vars class to get access to the global variables environment (e.g. HTMLSpider uses both of them). Inside the algos/src folder, the path to the main program source (and header) files is ../../src/.
g++ -shared -o ../<name>.so <object file> [additional object files used]
This will save the shared library <name>.so in the directory above (algos).
Depending on the sources used in step #1, add the additional object files as needed, e.g. add filelist.o and/or vars.o if you want to use the filelist and/or vars classes of the main program and therefore created, in step #1, object files from the ../../src/filelist.cpp and/or ../../src/vars.cpp source code files.
The included default parsing algorithms provide a compile.sh, which is located in algos/src. Check out the source and run it from within this folder by opening the terminal in algos/src and typing sh compile.sh. It performs exactly step #1 and step #2 for all included default algorithms.
You can easily add your own algorithms by creating a child class of the algorithm class in src/algorithm.h.
The class has to have the following functions:
Will be called when the algorithm is added to a file. The The At the end of the parsing process, the output buffer (if existing) will be written to the output file.
Both text and binary output are supported since version clear function is called without a call to the parse function before.
Usually you will not need to use the output buffer in the The If you want to use global variables in your Please name your variables in the following format:
Return
Will be called to parse the content of the file. The Use the The The The At the end of the parsing process, the output buffer (if existing) will be written to the output file.
Both text and binary output are supported (since version The Note: The file name can be local (without directories or with sub-directories only) or global (complete path, url).
Basically, the program supports urls starting with The Return Returning
Is called when the algorithm is removed from the file or the parsing process is either cancelled or finished. You have to free used memory here, except for the content and output buffers, which are freed by the program.
These three functions should return static strings containing name, version and author of the algorithm. They will be used by the main program to show more information about the added algorithm.
This function has to return the version of the main program for which your algorithm has been written. Older versions of the main program will not be able to use your algorithm. If the version of the main program is higher than the version used by you, a warning message will be shown. Make sure to update your algorithm with each new version of the main program. Check the Changelog section of the readme file to learn about possible changes to make to your algorithm. The version constants that you should use are defined in the Example: If your algorithm source code is located inside the program's
to include the original (up-to-date) header file.
This function has to return whether multiple instances of the algorithm are allowed. Note that each instance has access to the same global variables environment. It is not possible to define different variables with the same name for different instances of your algorithm.
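The version rule described above (refuse the algorithm if the main program is older, warn if it is newer) can be sketched as a simple comparison. This sketch assumes that versions are encoded as increasing integers; the real version constants of the main program are not shown in this readme, and the names Compatibility and check_version are hypothetical.

```cpp
// Possible outcomes of comparing the main program version with the
// version an algorithm was written for.
enum Compatibility {
    COMPATIBLE,        // same version: load without a message
    PROGRAM_NEWER,     // main program is newer: load, but show a warning
    PROGRAM_TOO_OLD    // main program is older: the algorithm cannot be used
};

// Assumption of this sketch: a higher number means a later version.
Compatibility check_version(int program_version, int algorithm_version)
{
    if (program_version < algorithm_version)
        return PROGRAM_TOO_OLD;
    if (program_version > algorithm_version)
        return PROGRAM_NEWER;
    return COMPATIBLE;
}
```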
You also need to include construction and destruction functions in the source code of your class.
They are needed by the main program to load your class and have to be called create and destroy.
The create function creates an instance of the class and returns it as a pointer to the algorithm class.
The destroy function deletes the created instance of the class.
Just copy the following code to the end of the source file of your class and replace <classname> with the name of your class:
// creation and destruction "C" functions for dynamic loading of the class
If you want to write output to the console (stdout), please add a tag with the algorithm name in front, like:
printf("[<algoname>] This is output by the algorithm named <algoname>.");
The best way to find out more about programming parsing algorithms for parsefile is to check out the source code of the existing algorithms, which is located in the algos/src folder (if received).
To start a new algorithm, feel free to use the template files in algos/src/templates. Copy them into the algos/src folder, rename them to the name of your algorithm and open them to add your own code. To do so, follow the instructions inside the newly created files.
For information on how to compile your algorithms, see Compiling algorithms.
All your algorithms have to be licensed under the same GNU license as the program, see License.
Developed and tested with:
[v0.1a-v0.5a] g++ 4.5.2 on Ubuntu 11.04 (natty), Kernel Linux 2.6.38-11 generic, GNOME 2.32.1
[since v0.6a] g++ 4.6.1 on Ubuntu 11.10 (oneiric), Kernel Linux 3.0.0-12 generic, GNOME 3.2.0
No development software suite has been used due to the simplicity of the program. It can be discovered and extended by using a text editor (e.g. gedit) and the Ubuntu (or other Linux) command line tools.
The program website can be found at http://www.ghstyle.de/parsefile/.
Contact ans(at)ghstyle.de for bug reports, questions and remarks.
Please make sure that you choose a meaningful e-mail subject, because of spam detection.
Note that the program is mainly written for the author's own use and no detailed compatibility support can be given. To receive support for non-default parsing algorithms, contact the author(s) of the respective algorithm.
Please contact me if you make improvements or extensions to the main program or develop new parsing algorithms, so that these changes can be included in the original program package. This way, others will be able to reuse them and you will be able to receive the maximum credit for your work. Always remember: Sharing is caring!
Copyright (C) 2011 by Anselm Schmidt, www.ghstyle.de.
parsefile is free software; you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation; either version 3 of the License, or (at your option) any later version.
parsefile is distributed in the hope that it will be useful, but without any warranty; without even the implied warranty of merchantability or fitness for a particular purpose. See the GNU General Public License for more details.
This version of the configuration file format is out of date. It was used between parsefile v0.1a and v0.5a. The problem was that the global variables were loaded after the algorithms, so they could not influence the initialization of the saved algorithms. This file format can be updated manually to version 2 by swapping the part for global variables with the part for algorithms and changing the tag in the first line to [PFCFF2].
The format is in plain text and has the following structure:
[PFCFF]
<Main program version>
<Part for file list>
<Part for algorithms>
<Part for global variables>
The tag in the first line ([PFCFF]) stands for "parsefile configuration file format", version 1. The other parts are identical with version 2 (see below), just the order is different.
The last part has been added with parsefile v0.2a, which introduced the global variables environment.
Here is the configuration file conf/example.cfg created in the tutorial (see Tutorial) with an older version of parsefile, in this case version 0.4a:
[PFCFF]
|
Current version of the configuration file format, since parsefile v0.6a.
The format is in plain text and has the following structure:
[PFCFF2]
<Main program version>
<Part for file list>
<Part for global variables>
<Part for algorithms>
The tag in the first line ([PFCFF2]) stands for "parsefile configuration file format", version 2. The other parts are structured as follows:
<Main program version>: Single line starting with ver= and ending with the version of parsefile that created the configuration file. Example: ver=0.6a (ALPHA)
<Part for file list>: The first line contains the number of files in the file list (N), followed by N lines, each of them containing a file name. Empty lines are possible; they mark files that have already been removed.
<Part for global variables>: The first line contains the number of global variables (N), followed by N*2 lines, made of N pairs, each pair representing one variable. The first line of such a pair contains the name of the variable, the second line the value. Empty names are possible; they mark variables that have already been removed. The values of pairs with an empty name will be ignored. Empty values represent empty variables.
<Part for algorithms>: The first line contains the number of algorithms to add (N), followed by N lines, each of them containing the path of an algorithm library to be loaded (e.g. algos/html_spider.so). Empty lines are possible; they mark algorithms that have already been removed.
Here is the configuration file conf/example.cfg created in the tutorial (see Tutorial) with the current version of parsefile, version 0.6a:
[PFCFF2]
|
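As an illustration of the version 2 format described above, the following sketch reads such a file from a stream. It is not the loader of the main program, and it assumes that the parts appear in the order in which they are described (version, file list, global variables, algorithms). The names Configuration, read_block and read_configuration are hypothetical.

```cpp
#include <cstddef>
#include <cstdlib>
#include <istream>
#include <map>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical in-memory form of a configuration file.
struct Configuration {
    std::string version;
    std::vector<std::string> files;
    std::map<std::string, std::string> variables;
    std::vector<std::string> algorithms;
};

// Reads a count line (N) followed by N * lines_per_entry raw lines.
static bool read_block(std::istream& in, std::size_t lines_per_entry,
                       std::vector<std::string>& out)
{
    std::string line;
    if (!std::getline(in, line))
        return false;
    const unsigned long count = std::strtoul(line.c_str(), nullptr, 10);
    for (unsigned long i = 0; i < count * lines_per_entry; ++i) {
        if (!std::getline(in, line))
            return false;                    // file ends too early
        out.push_back(line);
    }
    return true;
}

// Returns false if the tag, the version line or one of the parts is missing.
bool read_configuration(std::istream& in, Configuration& conf)
{
    std::string line;
    if (!std::getline(in, line) || line != "[PFCFF2]")
        return false;
    if (!std::getline(in, line) || line.compare(0, 4, "ver=") != 0)
        return false;
    conf.version = line.substr(4);

    std::vector<std::string> files, pairs, algorithms;
    if (!read_block(in, 1, files) || !read_block(in, 2, pairs) ||
        !read_block(in, 1, algorithms))
        return false;

    // Empty lines and empty names mark removed entries and are skipped.
    for (const std::string& file : files)
        if (!file.empty())
            conf.files.push_back(file);
    for (std::size_t i = 0; i + 1 < pairs.size(); i += 2)
        if (!pairs[i].empty())
            conf.variables[pairs[i]] = pairs[i + 1];
    for (const std::string& algo : algorithms)
        if (!algo.empty())
            conf.algorithms.push_back(algo);
    return true;
}
```

The function can be fed from a std::ifstream opened on a file such as conf/example.cfg, or from a std::istringstream for testing.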
There are two packages available at the moment: The full package includes the source code of the program and the default parsing algorithms as well as some sample files. The minimal package contains the binary files only.
Full package
parsefile - executable program binary
README.txt - readme file
algos/ - folder for parsing algorithms
algos/command_line.so - the command_line parsing algorithm
algos/command_line.txt - short description of the command_line parsing algorithm
algos/count_strings.so - the count_strings parsing algorithm
algos/count_strings.txt - short description of the count_strings parsing algorithm
algos/count_strings_readme.txt - readme file for the count_strings parsing algorithm
algos/html_body.so - the html_body parsing algorithm
algos/html_body.txt - short description of the html_body parsing algorithm
algos/html_remove_comments.so - the html_remove_comments parsing algorithm
algos/html_remove_comments.txt - short description of the html_remove_comments parsing algorithm
algos/html_remove_tags.so - the html_remove_tags parsing algorithm
algos/html_remove_tags.txt - short description of the html_remove_tags parsing algorithm
algos/html_spider.so - the HTMLSpider parsing algorithm
algos/html_spider.txt - short description of the HTMLSpider parsing algorithm
algos/print_binary.so - the print_binary parsing algorithm
algos/print_binary.txt - short description of the print_binary parsing algorithm
algos/print_content.so - the print_content parsing algorithm
algos/print_content.txt - short description of the print_content parsing algorithm
algos/show_content.so - the show_content parsing algorithm
algos/show_content.txt - short description of the show_content parsing algorithm
algos/wait.so - the wait parsing algorithm
algos/wait.txt - short description of the wait parsing algorithm
algos/src - folder for the source code of the parsing algorithms
algos/src/command_line.cpp - source code file of the command_line parsing algorithm
algos/src/command_line.h - header file of the command_line parsing algorithm, including changelog
algos/src/compile.sh - shell file to compile the default parsing algorithms
algos/src/count_strings.cpp - source code file of the count_strings parsing algorithm
algos/src/count_strings.h - header file of the count_strings parsing algorithm, including changelog
algos/src/html_body.cpp - source code file of the html_body parsing algorithm
algos/src/html_body.h - header file of the html_body parsing algorithm, including changelog
algos/src/html_remove_comments.cpp - source code file of the html_remove_comments parsing algorithm
algos/src/html_remove_comments.h - header file of the html_remove_comments algorithm, including changelog
algos/src/html_remove_tags.cpp - source code file of the html_remove_tags parsing algorithm
algos/src/html_remove_tags.h - header file of the html_remove_tags algorithm, including changelog
algos/src/html_spider.cpp - source code file of the HTMLSpider parsing algorithm
algos/src/html_spider.h - header file of the HTMLSpider parsing algorithm, including changelog
algos/src/print_binary.cpp - source code file of the print_binary parsing algorithm
algos/src/print_binary.h - header file of the print_binary parsing algorithm, including changelog
algos/src/print_content.cpp - source code file of the print_content parsing algorithm
algos/src/print_content.h - header file of the print_content parsing algorithm, including changelog
algos/src/show_content.cpp - source code file of the show_content parsing algorithm
algos/src/show_content.h - header file of the show_content parsing algorithm, including changelog
algos/src/wait.cpp - source code file of the wait parsing algorithm
algos/src/wait.h - header file of the wait parsing algorithm, including changelog
algos/src/templates - templates for the development of new algorithms
algos/src/templates/algo.h - template of the header file of a new algorithm
algos/src/templates/algo.cpp - template of the source file of a new algorithm
conf/ - folder for configuration files
conf/example.cfg - configuration file as created by the tutorial in this readme file
conf/count_strings_example.cfg - configuration file as created by the tutorial in the count_strings readme
input/ - folder for sample input files
input/example.htm - example HTML file used by the tutorial in this readme file
input/linktest.htm - example HTML file indirectly used by the tutorial in this readme file
input/test.htm - example HTML file indirectly used by the tutorial in this readme file
output/ - folder for output files
src/ - folder for the source code of the main program
src/algorithm.h - header file of the algorithm class used as parent class for the individual parsing algorithms
src/commands.cpp - source code of the command line command functions
src/commands.h - header file of the command line command functions
src/compile.sh - shell file to compile the main program
src/filelist.cpp - source code of the filelist class used for the list of files to be parsed
src/filelist.h - header file of the filelist class used for the list of files to be parsed
src/functions.cpp - source code of different helper functions used by the program
src/functions.h - header file of different helper functions used by the program
src/main.cpp - source code of the main program including main function (program entry point)
src/parser.cpp - source code of the parser class used for adding algorithms and parsing files
src/parser.h - header file of the parser class used for adding algorithms and parsing files
src/vars.cpp - source code of the vars class used for managing the global variable environment
src/vars.h - header file of the vars class used for managing the global variable environment
src/version.h - header file containing version constants and the current version of the program
Minimal package
|
Version: 0.7a, last change: 23/10/2011