Ngram
From IntereditionWiki
The Ngram Extractor is an easy-to-use tool that lets anyone perform Ngram operations on their TEI-XML files.
Team
User Interface
- Google gadget
Source Code
- Web Service
https://bitbucket.org/hlynam/ngram [1]
- Google Gadget
https://github.com/jeffreycwitt/ngram [2]
Version 2.0 Google Gadget published to Internet:
http://jeffreycwitt.com/ngrams/ngram2.xml [3]
Version 3.0 Ngram Extractor published to Internet:
http://jeffreycwitt.com/ngrams/ngram3.php [4]
Wednesday Notes
- TEI-XML file location:
- dropbox location - dropbox folder - xslt file (could be in the dropbox folder)
- Select ngrams (range):
- min: 1 - max: 5
- Select stats operation:
- frequency - marginal frequency - create contingency table
- Output format:
- xml ngram - mysql dump - summary stats
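The ngram range and frequency options listed above can be illustrated with a minimal sketch. It assumes the text has already been split into word tokens; the function name `extract_ngrams` and the sample sentence are hypothetical and are not taken from the project's source code.

```python
from collections import Counter


def extract_ngrams(tokens, min_n=1, max_n=5):
    """Count every contiguous ngram whose length is in [min_n, max_n]."""
    counts = Counter()
    for n in range(min_n, max_n + 1):
        # Slide a window of n tokens across the sequence.
        for i in range(len(tokens) - n + 1):
            counts[tuple(tokens[i:i + n])] += 1
    return counts


# Hypothetical sample input, tokenized on whitespace.
tokens = "call me ishmael call me".split()
counts = extract_ngrams(tokens, min_n=1, max_n=2)
for ngram, freq in sorted(counts.items()):
    print(" ".join(ngram), freq)
```

Keeping the counts in a single `Counter` keyed by word tuples means the same structure could feed any of the output formats listed above.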
Thursday Notes
Parsed ngrams for Moby Dick and loaded them into MySQL. Currently running the Dice coefficient calculation over the entire book. Also loaded one million, ten million, and one hundred million trigrams from the Google Ngrams corpus into MySQL. Seeing 2-5 ms query times to find counts when words are specified in any position of the ngram.
This is too slow to process large corpora quickly. Moby Dick (200k words) would take around 15 minutes for the whole process.
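For reference, the Dice coefficient mentioned above can be sketched in memory for adjacent word pairs. This uses the standard bigram form, 2 * f(x, y) / (f(x, *) + f(*, y)), where the two terms in the denominator are the marginal frequencies of each word in its position. The notes do not say which variant was run against MySQL or how it was extended to trigrams, so the function name and sample text here are assumptions for illustration only.

```python
from collections import Counter


def dice_scores(tokens):
    """Return {bigram: Dice coefficient} for adjacent word pairs."""
    bigrams = Counter(zip(tokens, tokens[1:]))
    # Marginal frequencies: how often a word fills each bigram position.
    left = Counter()
    right = Counter()
    for (w1, w2), n in bigrams.items():
        left[w1] += n
        right[w2] += n
    return {
        (w1, w2): 2 * n / (left[w1] + right[w2])
        for (w1, w2), n in bigrams.items()
    }


# Hypothetical sample input, tokenized on whitespace.
tokens = "the white whale and the white sail".split()
scores = dice_scores(tokens)
for pair, score in sorted(scores.items()):
    print(" ".join(pair), round(score, 3))
```

As a rough check on the timing, and assuming one count lookup per token, 200,000 lookups at 2 to 5 ms each come to roughly 7 to 17 minutes, which is in line with the estimate above.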
Jeff and Henry created an iGoogle app called "ngram extractor".
The extractor makes a call with 5 parameters to Henry's site. Henry's site returns an XML file, which is then parsed in the iGoogle app, which displays the results for the user.
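The parsing step on the client side might look something like the sketch below. The actual gadget runs in the browser, and the element and attribute names of the returned XML are not recorded on this page, so the `<ngrams>`/`<ngram frequency="...">` layout, the sample data, and the function name are all hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical response shape; the real service's element names
# are not documented on this page.
SAMPLE_RESPONSE = """<ngrams>
  <ngram frequency="3">of the whale</ngram>
  <ngram frequency="1">the white whale</ngram>
</ngrams>"""


def parse_ngram_response(xml_text):
    """Return a list of (ngram text, frequency) pairs from the response."""
    root = ET.fromstring(xml_text)
    return [
        (node.text.strip(), int(node.get("frequency")))
        for node in root.iter("ngram")
    ]


rows = parse_ngram_response(SAMPLE_RESPONSE)
for text, freq in rows:
    print(f"{freq:>4}  {text}")
```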
Friday Notes
- MySQL Optimized Ngram Processor
- Implemented the ngram service in CoffeeScript/JavaScript with Node.js and Express, deployed it to dotCloud, and made it available online.
- Google Gadgets component is operational
- It communicates with the web service
- It correctly bypasses Google Gadget caching using a random variable addition
- It parses and displays the returned Ngram data
- It is nicely formatted
- Web Service
- It parses Ngrams with Ngram Length and the Collocation Span
- It calculates frequency
- It handles duplicate words in the collocation spans (almost)
- It produces an alphabetically sorted XML output of the Ngrams with their frequencies
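The web service behaviour listed above can be approximated with the following sketch. It assumes that "Collocation Span" means a window of consecutive tokens from which the words of an ngram are drawn in order; the page does not define the term, so this interpretation is an assumption. The function names and XML element names are likewise hypothetical.

```python
from collections import Counter
from itertools import combinations
import xml.etree.ElementTree as ET


def windowed_ngrams(tokens, length, span):
    """Count ngrams whose words appear in order within `span` tokens."""
    counts = Counter()
    for i, first in enumerate(tokens):
        # Words that may follow `first` inside the window.
        rest = tokens[i + 1:i + span]
        # combinations() works by position, so a word repeated inside
        # the window contributes one ngram per occurrence.
        for tail in combinations(rest, length - 1):
            counts[(first,) + tail] += 1
    return counts


def to_xml(counts):
    """Serialize counts as XML, sorted alphabetically by word sequence."""
    root = ET.Element("ngrams")  # hypothetical element names
    for words, freq in sorted(counts.items()):
        node = ET.SubElement(root, "ngram", frequency=str(freq))
        node.text = " ".join(words)
    return ET.tostring(root, encoding="unicode")


# Hypothetical sample input: bigrams within a span of 3 tokens.
counts = windowed_ngrams("a b c d".split(), length=2, span=3)
print(to_xml(counts))
```

With `span` equal to `length`, this reduces to ordinary contiguous ngrams. How repeated words inside a window should be counted is a design decision; the sketch counts each position separately, which may or may not match what the service does.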
Saturday Notes
- Web Service
Cleaned up function name and parameters
Deleted unused parameters
This allows us to use the ngram.asmx page as a test of POST functionality of the web service
Fixed Ngram extractor bug
Sorted by frequency
Read XML file from supplied URL
Extracted p data
- Google Gadget
Deleted unused parameters
Published this url as a Google Gadget
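The "read XML file from supplied URL" and "extracted p data" steps in the web service notes above could be sketched as follows. The service itself is an .asmx page, so this Python version is only an illustration; the function names are hypothetical, and the sketch assumes that the text of interest is whatever sits inside `<p>` elements, with or without the TEI namespace.

```python
import xml.etree.ElementTree as ET
from urllib.request import urlopen

TEI_NS = "{http://www.tei-c.org/ns/1.0}"


def paragraphs_from_tei(xml_text):
    """Return the plain text of every <p> element, with tags stripped."""
    root = ET.fromstring(xml_text)
    paragraphs = []
    for elem in root.iter():
        # Match <p> with or without the TEI namespace.
        if elem.tag in ("p", TEI_NS + "p"):
            # Join text from nested elements and normalize whitespace.
            text = " ".join("".join(elem.itertext()).split())
            if text:
                paragraphs.append(text)
    return paragraphs


def fetch_paragraphs(url):
    """Download a TEI-XML file and extract its paragraph text."""
    with urlopen(url) as response:
        return paragraphs_from_tei(response.read())


# Hypothetical inline sample, so the sketch runs without network access.
sample = """<TEI xmlns="http://www.tei-c.org/ns/1.0"><text><body>
<p>Call me <hi>Ishmael</hi>.</p>
<p>Some years ago.</p>
</body></text></TEI>"""
print(paragraphs_from_tei(sample))
```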
Demo of Ngram Extractor
- Overview of Ngram Extractor microservice (Henry)
The Google Gadget connects to the Web Service, which converts a TEI-XML file into Ngrams in XML format
Add the Google Gadget by searching for:
Ngram Extractor
Demo Values:
- URL Location: http://www.jeffreycwitt.com/plaoul/translation_english/trans_engl_prollecture1.xml
- Ngram Length: 3
- Collocation Span: 4
- Google Gadget User Interface (Jeff)
Talk about issues in creating and debugging a Gadget
cache, editing, debugging, publishing a gadget
Access an external TEI-XML file
what constitutes text, how to detect sentence boundaries
- Talk about Ngram algorithm and the parameters (Maria)
Ngram Length, Collocation Span
Frequencies, marginal frequencies
- Talk about web service (Henry)
Demo web service at:
http://henrylynam.com/ngram/ngram.asmx [5]
Demo Values:
- URL Location: http://www.jeffreycwitt.com/plaoul/translation_english/trans_engl_prollecture1.xml
- Ngram Length: 3
- Collocation Span: 4
Access web service from web server
Show string format of GET in web service
Show returned XML data
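A GET request string for the demo values above might be assembled as in this sketch. The real parameter names and the exact endpoint path of the web service are not given on this page, so `url`, `ngramLength`, `collocationSpan`, and `nocache` are hypothetical. The random value illustrates the cache-bypass approach noted on Friday.

```python
import random
from urllib.parse import urlencode

SERVICE_URL = "http://henrylynam.com/ngram/ngram.asmx"


def build_request_url(tei_url, ngram_length, collocation_span):
    """Build a GET URL for the service, with a random cache-busting value."""
    params = {
        # Parameter names are hypothetical; the page does not list them.
        "url": tei_url,
        "ngramLength": ngram_length,
        "collocationSpan": collocation_span,
        # A fresh random value makes each URL unique, so a cache
        # keyed on the URL cannot return a stale response.
        "nocache": random.randint(0, 10**9),
    }
    return SERVICE_URL + "?" + urlencode(params)


demo_url = build_request_url(
    "http://www.jeffreycwitt.com/plaoul/translation_english/trans_engl_prollecture1.xml",
    ngram_length=3,
    collocation_span=4,
)
print(demo_url)
```

`urlencode` percent-encodes the TEI file's URL, so it can be passed safely as a query parameter value.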
Lessons Learned
Google Gadget issues (versions)
Interoperability issues server side: web services to allow developers to work in parallel in different languages / frameworks?
Server monolithic framework issues
Overall: a great experience