MuffinInfo
An information extraction system from FastQ data
Table of Contents
- Introduction
- System Requirements
- Tabs
- General Info Tab
- Quality Tab
- K-mers Tab
- GC Percentage Tab
- Reads Length Tab
- Duplicates Tab
- Adapters Tab
- Homopolymers Tab
- Menus
- File Menu
- Edit Menu
- Help Menu
- Technical Details
Introduction
MuffinInfo is an HTML5-powered system for extracting information from FastQ data.
It processes the files locally without the need for a server.
All one has to do is click the "Open" menu and browse for a local file.
It displays useful information grouped by category, in tabs.
Our opinion is that this analysis should be the first step before any other operation on a raw dataset.
The site should work with any HTML5-compliant web browser, including mobile browsers.
There is no installer and no special operating system required.
MuffinInfo was built with cross-browser, cross-device and cross-operating system compatibility in mind.
We limited ourselves to what exists in the HTML standard, avoiding third party plugins like Flash or Java.
MuffinInfo is a work in progress; it is therefore neither full-featured nor fully stable.
Due to the existence of many web browsers (and their versions) and different behaviour on different operating systems and/or devices, the website may not perform as advertised in all circumstances.
We try to keep up with the latest releases and modifications, but sometimes it is impossible to keep up with the vendors.
If you encounter a bug or you have a feature request, please do not hesitate to contact us using the "Contact" menu on the main page.
Please give us as many details as possible, such as the browser and its version, the operating system and an example file.
We cannot guarantee that the requested features will be added in future versions, but we will do our best.
System Requirements
The general requirement is an HTML5-compliant browser, desktop or mobile.
However, because the Web evolves quickly and browser developers need time to implement the standard, we had to single out minimum versions of the major browsers.
The feature that most constrains the minimum browser version is the native "Map" implementation, which we use to store key/value pairs.
It is available starting with the following desktop versions: Firefox 13, Chrome 38, Internet Explorer 11, Opera 25 and Safari 7.1.
On mobile, a user needs at least Chrome for Android 38, Firefox Mobile 13 or Safari Mobile 8.
We took these minimum versions from the MDN page "Map - JavaScript | MDN".
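A page can check for the feature itself instead of comparing browser version numbers. The following is a minimal sketch of such a check; the helper name hasNativeMap is hypothetical and is not taken from the MuffinInfo sources.

```javascript
// Hypothetical helper: reports whether the engine ships a usable native Map.
// "var" is used on purpose so that older engines can still parse the probe.
function hasNativeMap() {
  if (typeof Map !== "function") {
    return false;
  }
  try {
    var probe = new Map();
    probe.set("key", 1);
    return probe.get("key") === 1 && probe.size === 1;
  } catch (e) {
    // A partial or broken implementation counts as missing.
    return false;
  }
}
```

A caller could show a "browser not supported" message when the function returns false, before any file is opened.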
General Info Tab
The General Info tab informs the user about the general characteristics of the dataset, such as the number of bases and the number of reads.
When a dataset contains unknown values marked with 'N' or 'n', their count is taken into account.
At the moment, other characters used to denote an unknown value are simply ignored.
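The counting described above can be sketched as follows. This is an illustration only: the function name summarizeReads and the shape of the returned object are assumptions, and the sketch assumes that 'N'/'n' bases are included in the total base count and also reported separately.

```javascript
// Hypothetical helper: summarizes an array of read sequences.
function summarizeReads(reads) {
  const summary = { reads: 0, bases: 0, unknown: 0 };
  for (const read of reads) {
    summary.reads += 1;
    summary.bases += read.length;
    for (const base of read) {
      // Only 'N' and 'n' are treated as unknown values;
      // any other placeholder character is not counted as unknown.
      if (base === "N" || base === "n") {
        summary.unknown += 1;
      }
    }
  }
  return summary;
}
```

For example, summarizeReads(["ACGN", "nnA"]) reports 2 reads, 7 bases and 3 unknown values.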
Please note that the reported execution time does not include the time spent updating the UI.
Quality Tab
MuffinInfo can automatically detect the quality profile using the maximum and minimum values found in the dataset.
It also counts the number of bases with a certain quality and displays the result as a chart with the minimum and maximum values given by the automatically determined profile.
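One common way to do this kind of detection is to compare the smallest quality character against the known ASCII offsets. The sketch below uses that widely used heuristic (Phred+33 starts at ASCII 33, Solexa+64 at 59, Phred+64 at 64); the profile names, thresholds and function names are assumptions and may differ from the profiles MuffinInfo actually ships.

```javascript
// Hypothetical helper: guesses the quality encoding from the observed range.
function detectQualityProfile(qualityStrings) {
  let min = Infinity;
  let max = -Infinity;
  for (const quality of qualityStrings) {
    for (let i = 0; i < quality.length; i++) {
      const code = quality.charCodeAt(i);
      if (code < min) min = code;
      if (code > max) max = code;
    }
  }
  if (min === Infinity) {
    return null; // no quality characters seen (e.g. a fasta file)
  }
  if (min < 59) return { name: "Phred+33", offset: 33, min, max };
  if (min < 64) return { name: "Solexa+64", offset: 64, min, max };
  return { name: "Phred+64", offset: 64, min, max };
}

// Hypothetical helper: counts how many bases have each quality score.
function qualityHistogram(qualityStrings, offset) {
  const histogram = new Map();
  for (const quality of qualityStrings) {
    for (let i = 0; i < quality.length; i++) {
      const score = quality.charCodeAt(i) - offset;
      histogram.set(score, (histogram.get(score) || 0) + 1);
    }
  }
  return histogram;
}
```

Note that the heuristic can only be as good as the data: a dataset containing only high-quality bases may fall into more than one profile's range.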
K-mers Tab
This tab contains a chart depicting the k-mer spectrum, i.e. how many k-mers appear in how many reads.
A k-mer is a sub-segment of a read.
The set of k-mers for a read is generated by sliding a window of length k along the read.
At each step, the window moves by one nucleotide.
As a result, consecutive k-mers overlap by k-1 nucleotides: the suffix of a k-mer is the prefix of its successor.
The default value of k is 7.
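The sliding window can be sketched as below. The function names are hypothetical, and the spectrum follows one reading of the description above: for each k-mer we count the reads that contain it, then count how many k-mers share each read count.

```javascript
// Hypothetical helper: lists the k-mers of a read, moving one nucleotide at a time.
function kmersOf(read, k) {
  const kmers = [];
  for (let start = 0; start + k <= read.length; start++) {
    kmers.push(read.slice(start, start + k));
  }
  return kmers;
}

// Hypothetical helper: maps "number of reads" to "number of distinct k-mers
// that appear in exactly that many reads".
function kmerSpectrum(reads, k) {
  const readsPerKmer = new Map();
  for (const read of reads) {
    // A Set ensures a k-mer repeated inside one read is counted once for it.
    for (const kmer of new Set(kmersOf(read, k))) {
      readsPerKmer.set(kmer, (readsPerKmer.get(kmer) || 0) + 1);
    }
  }
  const spectrum = new Map();
  for (const count of readsPerKmer.values()) {
    spectrum.set(count, (spectrum.get(count) || 0) + 1);
  }
  return spectrum;
}
```

With k = 3, the read "ACGTA" yields ACG, CGT and GTA; reads shorter than k yield no k-mers.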
GC Percentage Tab
Another important metric is the GC percentage per read.
The chart on this tab depicts the percentage of GC found in a read against the number of reads with the respective percentage.
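A minimal sketch of the underlying computation follows. The function names are hypothetical, and rounding the percentage to the nearest integer before grouping is an assumption made here to keep the number of chart categories small.

```javascript
// Hypothetical helper: GC percentage of one read, rounded to an integer.
function gcPercentage(read) {
  if (read.length === 0) {
    return 0;
  }
  let gc = 0;
  for (const base of read) {
    if (base === "G" || base === "C" || base === "g" || base === "c") {
      gc += 1;
    }
  }
  return Math.round((gc / read.length) * 100);
}

// Hypothetical helper: maps each GC percentage to the number of reads having it.
function gcHistogram(reads) {
  const histogram = new Map();
  for (const read of reads) {
    const percent = gcPercentage(read);
    histogram.set(percent, (histogram.get(percent) || 0) + 1);
  }
  return histogram;
}
```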
Reads Length Tab
This statistic groups all the reads in the input file by their length.
The user can sort the items in ascending or descending order by using the little arrows displayed to the right of each column's name.
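The grouping and sorting can be illustrated as follows. This is only a sketch with hypothetical function and field names; in the application the sorting is driven by the table widget.

```javascript
// Hypothetical helper: one row per distinct read length.
function groupByLength(reads) {
  const counts = new Map();
  for (const read of reads) {
    counts.set(read.length, (counts.get(read.length) || 0) + 1);
  }
  return Array.from(counts, ([length, count]) => ({ length, count }));
}

// Hypothetical helper: returns a sorted copy; column is "length" or "count".
function sortRows(rows, column, ascending) {
  const direction = ascending ? 1 : -1;
  return rows.slice().sort((a, b) => direction * (a[column] - b[column]));
}
```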
Duplicates Tab
This statistic hunts for those reads appearing multiple times.
The user can select a threshold; all reads whose count exceeds it are listed.
Please understand that, for performance reasons, we are unable to store all reads and compare them pairwise.
Instead, we hash each read with Murmur Hash, using the implementation from https://github.com/garycourt/murmurhash-js.
We selected the 32-bit variant because JavaScript's bitwise operators work natively on 32-bit integers.
Murmur Hash offers a good tradeoff between performance and collisions, but one should be aware of possible false positives: two different reads may share the same hash.
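The idea of counting hashes instead of storing reads can be sketched as below. To keep the example self-contained it uses a 32-bit FNV-1a hash as a stand-in, not Murmur Hash; the function names and the shape of the result are assumptions.

```javascript
// Stand-in hash (32-bit FNV-1a), used here only for illustration.
// MuffinInfo itself relies on Murmur Hash.
function fnv1a32(text) {
  let hash = 0x811c9dc5;
  for (let i = 0; i < text.length; i++) {
    hash ^= text.charCodeAt(i);
    hash = Math.imul(hash, 0x01000193); // 32-bit multiplication
  }
  return hash >>> 0; // force an unsigned 32-bit result
}

// Hypothetical helper: lists hashes seen more than `threshold` times.
function findDuplicates(reads, threshold) {
  const seen = new Map(); // hash -> { example, count }
  for (const read of reads) {
    const key = fnv1a32(read);
    const entry = seen.get(key);
    if (entry) {
      entry.count += 1; // may be a false positive if two reads collide
    } else {
      seen.set(key, { example: read, count: 1 });
    }
  }
  return Array.from(seen.values()).filter(entry => entry.count > threshold);
}
```

Only one example read per hash is kept, so memory grows with the number of distinct hashes and not with the total number of reads.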
Adapters Tab
MuffinInfo tries to locate each of its predefined adapters/primers in every read it processes.
The extractor tries to match the forward sequence and the reverse complement of an adapter/primer at both ends of a read.
The user can set the number of bases that may precede the adapter/primer (when matching at the beginning of the read) or follow it (when matching at the end of the read).
At the moment, MuffinInfo doesn't support adapters/primers with errors.
As a result, it won't detect reads whose adapter/primer contains sequencing errors.
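The exact matching described above can be sketched as follows. The function names, the returned object and the assumption that reads and adapters are upper-case are all hypothetical choices made for the example.

```javascript
const COMPLEMENT = { A: "T", C: "G", G: "C", T: "A", N: "N" };

// Reverse complement of a sequence; unexpected characters become "N".
function reverseComplement(sequence) {
  let result = "";
  for (let i = sequence.length - 1; i >= 0; i--) {
    result += COMPLEMENT[sequence[i].toUpperCase()] || "N";
  }
  return result;
}

// Hypothetical helper: looks for an exact copy of the adapter (forward or
// reverse complement) at most `maxOffset` bases away from either end.
function findAdapter(read, adapter, maxOffset) {
  const candidates = [
    { strand: "forward", sequence: adapter },
    { strand: "reverse complement", sequence: reverseComplement(adapter) }
  ];
  for (const candidate of candidates) {
    const head = read.indexOf(candidate.sequence);
    if (head !== -1 && head <= maxOffset) {
      return { end: "start", strand: candidate.strand, position: head };
    }
    const tail = read.lastIndexOf(candidate.sequence);
    if (tail !== -1 && read.length - (tail + candidate.sequence.length) <= maxOffset) {
      return { end: "end", strand: candidate.strand, position: tail };
    }
  }
  return null; // no exact match: adapters with errors are not detected
}
```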
Homopolymers Tab
MuffinInfo generates the distribution of homopolymers, namely how many homopolymers of a certain length occur across all reads.
The user can set the minimum length for a homopolymer.
Furthermore, using the HighCharts' capabilities, one can select only the types of homopolymers (A, C, G, T or N) that (s)he is interested in.
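A minimal sketch of how such a distribution could be computed is given below. The function name and the nested result object (base, then run length, then count) are assumptions made for the example.

```javascript
// Hypothetical helper: counts runs of identical bases of at least minLength.
function homopolymerDistribution(reads, minLength) {
  const distribution = {};
  for (const read of reads) {
    let runStart = 0;
    for (let i = 1; i <= read.length; i++) {
      // A run ends at the end of the read or when the base changes.
      if (i === read.length || read[i] !== read[runStart]) {
        const runLength = i - runStart;
        if (runLength >= minLength) {
          const base = read[runStart];
          distribution[base] = distribution[base] || {};
          distribution[base][runLength] = (distribution[base][runLength] || 0) + 1;
        }
        runStart = i;
      }
    }
  }
  return distribution;
}
```

Each top-level key of the result can then be drawn as one chart series, which is what allows a single base type to be shown or hidden.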
File Menu
The File menu is home to all the features related to files.
The first command, "Clear", purges the data obtained from a run and re-initializes all components for a new run.
Next, the "Open Reads" sub-menu displays the open dialog, from which the user can select an input file in a supported format.
Please be aware that currently MuffinInfo determines the input type by looking at the selected file's extension.
As a result, a file ending with ".fa" or ".fasta" is treated as a fasta file, one ending with ".fq" or ".fastq" as a fastq file, and one ending with ".sam" as a SAM file.
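The extension check can be sketched as follows. The function name and the returned labels are hypothetical, and ignoring letter case is an assumption of this example.

```javascript
// Hypothetical helper: maps a file name to an input type, or null if unknown.
function detectInputType(fileName) {
  const name = fileName.toLowerCase();
  if (name.endsWith(".fa") || name.endsWith(".fasta")) return "fasta";
  if (name.endsWith(".fq") || name.endsWith(".fastq")) return "fastq";
  if (name.endsWith(".sam")) return "sam";
  return null;
}
```

Because only the name is inspected, a file with the wrong extension (for example a compressed "reads.fastq.gz") is not recognized by this sketch.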
The "Settings" option displays the settings dialog where the user can choose the desired values for a set of parameters for a run.
The values of the parameters are saved in the local storage.
One can always roll back to the initial configuration by pressing the "Default" button.
Please keep in mind that this option will not automatically save the default values in the local storage.
The user must use the "Save" button after using "Default" to store the predefined set of values.
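The interplay between "Default" and "Save" can be illustrated with the sketch below. The storage key, the settings object and the function names are hypothetical; only the default k-mer length of 7 comes from this document. The storage argument is expected to behave like the browser's localStorage (getItem and setItem).

```javascript
const SETTINGS_KEY = "muffininfo.settings";  // hypothetical storage key
const DEFAULT_SETTINGS = { kmerLength: 7 };  // other parameters omitted

// Returns the stored settings, or a copy of the defaults if none were saved.
function loadSettings(storage) {
  const raw = storage.getItem(SETTINGS_KEY);
  return raw === null ? Object.assign({}, DEFAULT_SETTINGS) : JSON.parse(raw);
}

// "Save": persists the given settings.
function saveSettings(storage, settings) {
  storage.setItem(SETTINGS_KEY, JSON.stringify(settings));
}

// "Default": returns the initial configuration without touching the storage.
function resetToDefaults() {
  return Object.assign({}, DEFAULT_SETTINGS);
}
```

In a browser one would call, for instance, saveSettings(window.localStorage, resetToDefaults()) to both restore and persist the initial values.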
The user is able to save and open a statistic file using the "Save Stat" and "Open Stat" sub-menus.
MuffinInfo stores the result of a run as a JSON file.
One of the advantages of this format is its ease of use with JavaScript, as JSON is derived from the language's own object notation.
Secondly, being a text file with named fields, it is easy for humans to read.
Finally, it can be loaded in third party applications as it is easily parseable.
Edit Menu
The Edit menu is home to the "Custom Statistic" option.
This option displays the form that enables the user to add a new statistic.
First of all, the reader must understand how MuffinInfo executes a statistic.
Execution is divided into four stages: initialization, loop entry processing, finalization and display.
The first stage deals with the declaration and initialization of all the properties required by the next stage.
At this stage the user can access the parser object with all its properties, as well as the parameters set in the Settings window (or their defaults).
The parser is the main engine of our software because it receives a chunk of text read from the disk and it extracts the entries (id + read + quality) from it.
The loop entry processing stage is the method called each time MuffinInfo extracts an entry from the input file.
Normally, a user would update the statistics at this stage without declaring any new properties on the parser object.
The custom code has access to three additional objects, namely the components of an entry.
Please bear in mind that the quality object can be null (as it happens for fasta files where there are no quality scores).
Finally, the parser invokes the finalization code when the stream reader has reached the end of the input file and all entries have been extracted and processed.
The statistics object becomes available at this stage, as it is the link between the parser and the rest of the program.
The user must declare new properties by attaching them directly to the statistics object.
Next, (s)he must copy, move or translate the statistics calculated during the previous stage into whatever (s)he wants displayed.
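The first three stages can be pictured with the sketch below, which counts reads. It is not the template shipped with MuffinInfo: the method names, the argument order and the small driver that imitates the pipeline are all assumptions made for illustration.

```javascript
// Hypothetical custom statistic, one method per stage.
const readCountStatistic = {
  // Stage 1: declare and initialize what the next stage needs.
  initialize(parser, settings) {
    parser.readCount = 0;
  },
  // Stage 2: called once per extracted entry; quality may be null (fasta).
  processEntry(parser, id, read, quality) {
    parser.readCount += 1;
  },
  // Stage 3: copy the result onto the statistics object for display.
  finalize(parser, statistics) {
    statistics.readCount = parser.readCount;
  }
};

// Hypothetical driver standing in for MuffinInfo's own pipeline.
function runStatistic(statistic, entries, settings) {
  const parser = {};
  const statistics = {};
  statistic.initialize(parser, settings);
  for (const entry of entries) {
    statistic.processEntry(parser, entry.id, entry.read, entry.quality);
  }
  statistic.finalize(parser, statistics);
  return statistics;
}
```

Keeping the running total on the parser object and copying it to the statistics object only at the end mirrors the separation between the second and third stages.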
The last stage moves away from the parser object into the display mechanism, where the user can display the custom statistic using one of three mutually exclusive methods: a list of fields, a table or a chart.
The list of fields uses a predeclared array called "list".
One can push fields onto it, each declared as an array with two elements.
The first element represents the name of the field, while the second is the value.
When the user wishes to generate a table, (s)he must use the table object, which contains two properties: "columnNames", an array with all the desired columns, and "datatablesInit", an object used to initialize a Datatables object.
The format of the latter can be obtained from the Datatables manual.
We pass the definition created by the user directly to the Datatables constructor.
This way we give full freedom to anyone wishing to use this display method.
Finally, one can create charts using the HighCharts library.
We offer a charts object which can be modified by the user at will using the library's documentation.
Due to the inner workings of MuffinInfo, we ask the user not to modify the "chart.renderTo", "chart.width" and "chart.height" options.
MuffinInfo will take care of the layout and arrangement of the chart.
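The three display methods can be sketched as below. In MuffinInfo the list, table and chart objects are predeclared and the user fills them in; here each function builds a fresh object only to keep the example self-contained. The statistics fields (readCount, lengths) and the function names are hypothetical, and the table options assume the option names of Datatables 1.10 or later.

```javascript
// List: each field is a two-element array, [name, value].
function buildList(statistics) {
  const list = [];
  list.push(["Read count", statistics.readCount]);
  return list;
}

// Table: column names plus an object handed to the Datatables constructor.
function buildTable(statistics) {
  return {
    columnNames: ["Length", "Reads"],
    datatablesInit: {
      data: statistics.lengths, // rows of [length, count]
      order: [[1, "desc"]],     // sort by the second column, descending
      paging: false
    }
  };
}

// Chart: HighCharts options; renderTo, width and height are left untouched.
function buildChart(statistics) {
  return {
    chart: { type: "column" },
    title: { text: "Reads per length" },
    xAxis: { categories: statistics.lengths.map(row => String(row[0])) },
    series: [{ name: "Reads", data: statistics.lengths.map(row => row[1]) }]
  };
}
```

Only one of the three would be used for a given custom statistic, since the methods are mutually exclusive.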
The custom statistics are saved in the local storage of the browser.
To aid the interested user, our program contains a template with a read count statistic as an example of how things should be handled.
In case of no defined custom method, the extractor will automatically load the template.
Please keep in mind that "Save" won't add the statistic to the execution pipeline.
One must use the "Add" button which will also save the statistic.
We also offer the possibility of prior validation of the introduced code before adding the statistic to the execution pipeline.
This way we hope to avoid crashing the program mid-run, after the user may have waited a long time for the process to complete, only to be forced to start again.
Help Menu
The help menu includes all options that explain different aspects and features of MuffinInfo.
Technical Details
One of the main characteristics of MuffinInfo is the use of the File API to read the input file.
Next Generation Sequencing datasets are usually quite large and the information can be processed sequentially, so loading them entirely into RAM makes no sense. Our application parses the input data in chunks of predetermined size.
To avoid blocking the user interface, we decided to parse the chunks in a separate thread, or in HTML5 jargon, a Web Worker.
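The main difficulty of chunked parsing is that a chunk boundary can fall in the middle of a line or of an entry. The sketch below shows one way to carry the incomplete part over to the next chunk. It is not MuffinInfo's parser: the function name is hypothetical, and it assumes fastq entries of exactly four lines (no wrapped sequences). In the browser, the chunks would come from slices of the selected file read inside the Web Worker, with results posted back to the page.

```javascript
// Hypothetical chunk parser: call push() for every chunk, then end() once.
function createFastqChunkParser(onEntry) {
  let leftover = ""; // incomplete last line of the previous chunk
  let pending = [];  // lines of the entry currently being assembled

  function consumeLine(rawLine) {
    const line = rawLine.replace(/\r$/, ""); // tolerate Windows line endings
    if (pending.length === 0 && line === "") {
      return; // skip blank lines between entries
    }
    pending.push(line);
    if (pending.length === 4) {
      onEntry({ id: pending[0].slice(1), read: pending[1], quality: pending[3] });
      pending = [];
    }
  }

  return {
    push(chunk) {
      const lines = (leftover + chunk).split("\n");
      leftover = lines.pop(); // may be a partial line; keep it for later
      lines.forEach(consumeLine);
    },
    end() {
      if (leftover !== "") {
        consumeLine(leftover);
      }
      leftover = "";
    }
  };
}
```

Because only the current chunk, one partial line and one partial entry are held at any time, memory use stays independent of the size of the input file.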