Ngram

From IntereditionWiki

The Ngram Extractor is an easy-to-use tool that lets anyone perform ngram operations on their TEI-XML files.

Team

Maria Sukhareva, Russell Horton, Jeffrey C. Witt, Henry Lynam


User Interface

- Google gadget

Source Code

https://bitbucket.org/hlynam/ngram [1]

https://github.com/jeffreycwitt/ngram [2]

Version 2.0 Google Gadget published to Internet:

http://jeffreycwitt.com/ngrams/ngram2.xml [3]

Version 3.0 Ngram Extractor published to Internet:

http://jeffreycwitt.com/ngrams/ngram3.php [4]


Wednesday Notes

- dropbox location - dropbox folder - xslt file (could be in the dropbox folder)

- min: 1 - max: 5

- frequency - marginal frequency - create contingency table

- xml ngram - mysql dump - summary stats
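The min/max and frequency items above can be illustrated with a minimal sketch. It assumes the text has already been split into tokens and that "min: 1 - max: 5" refers to the range of ngram lengths to extract; the function name `extract_ngrams` is hypothetical and not taken from the project code.

```python
from collections import Counter


def extract_ngrams(tokens, min_n=1, max_n=5):
    """Count every contiguous ngram whose length is between min_n and max_n.

    Assumption: "min" and "max" in the notes are ngram lengths.
    """
    counts = Counter()
    for n in range(min_n, max_n + 1):
        # Slide a window of n tokens across the text.
        for i in range(len(tokens) - n + 1):
            counts[tuple(tokens[i:i + n])] += 1
    return counts


tokens = "call me ishmael call me".split()
counts = extract_ngrams(tokens, min_n=1, max_n=2)
```

Here `counts[("call", "me")]` is the frequency of that bigram; the same table could then be dumped to XML or loaded into MySQL.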


Thursday Notes

Parsed ngrams for Moby Dick and loaded them into MySQL. Currently running the Dice coefficient calculation over the entire book. Also loaded one million, ten million, and one hundred million trigrams from the Google Ngrams corpus into MySQL. Queries that look up counts with words specified in any position of the ngram take 2-5 ms each.

This is too slow to process large corpora quickly. Moby Dick (200k words) would take around 15 minutes for the whole process.
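For reference, a minimal sketch of a Dice coefficient calculation for bigrams, done in memory rather than in MySQL. The notes do not say which variant the team used; this assumes the common form 2·f(x,y) / (f(x) + f(y)), with plain unigram counts standing in for the marginal frequencies. The name `dice_scores` is hypothetical.

```python
from collections import Counter


def dice_scores(tokens):
    """Return the Dice coefficient for every adjacent word pair.

    Assumption: Dice(x, y) = 2 * f(x, y) / (f(x) + f(y)), where f(x) and
    f(y) are unigram counts used as marginal frequencies.
    """
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    return {
        (x, y): 2 * f_xy / (unigrams[x] + unigrams[y])
        for (x, y), f_xy in bigrams.items()
    }


tokens = "the white whale and the white sail".split()
scores = dice_scores(tokens)
```

A pair that always occurs together, such as ("the", "white") in this toy input, scores 1.0; pairs whose words also occur apart score lower.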

Jeff and Henry created an iGoogle app called "ngram extractor".

The extractor makes a call with 5 parameters to Henry's site. Henry's site returns an XML file, which is then parsed in the iGoogle app, which displays the results for the user.

Friday Notes

- Implemented the ngram service in CoffeeScript/JavaScript on Node.js with Express, deployed it to dotCloud, and made it available online.


- It communicates with the web service

- It correctly bypasses Google Gadget caching using a random variable addition

- It parses and displays the returned Ngram data

- It is nicely formatted
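The cache-bypass item above amounts to appending a throwaway random value to each request URL, so that a caching layer sees every request as new. The gadget itself runs in the browser; the following is only a sketch of the idea in Python, and the base URL and the parameter names (`url`, `length`, `span`, `nocache`) are hypothetical, not the ones used by the actual service.

```python
import random
from urllib.parse import parse_qs, urlencode, urlsplit


def build_request_url(base_url, params):
    """Build a GET URL with an extra random parameter to defeat caching."""
    query = dict(params)
    # Hypothetical parameter name; the service ignores it, but it makes
    # each URL unique so a cached response is never reused.
    query["nocache"] = random.randint(0, 10**9)
    return base_url + "?" + urlencode(query)


request_url = build_request_url(
    "http://example.org/ngram",  # placeholder, not the real endpoint
    {"url": "http://example.org/text.xml", "length": 3, "span": 4},
)
```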


- It parses Ngrams with Ngram Length and the Collocation Span

- It calculates frequency

- It handles duplicate words in the collocation spans (almost)

- It produces an alphabetically sorted XML output of the Ngrams with their frequencies
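As an illustration of the last item, here is a minimal sketch of writing ngram counts as alphabetically sorted XML. The element and attribute names (`ngrams`, `ngram`, `frequency`) are assumptions for this example; the schema of the real service output is not documented in these notes.

```python
import xml.etree.ElementTree as ET


def ngrams_to_xml(counts):
    """Serialize {ngram tuple: frequency} as XML, sorted alphabetically."""
    root = ET.Element("ngrams")  # hypothetical element names throughout
    for ngram in sorted(counts):
        item = ET.SubElement(root, "ngram", frequency=str(counts[ngram]))
        item.text = " ".join(ngram)
    return ET.tostring(root, encoding="unicode")


xml_text = ngrams_to_xml({("me", "ishmael"): 1, ("call", "me"): 2})
```

Sorting by frequency instead would only require a different sort key, for example `sorted(counts, key=counts.get, reverse=True)`.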

Saturday Notes

Cleaned up function name and parameters

Deleted unused parameters

This allows us to use the ngram.asmx page as a test of POST functionality of the web service

Fixed Ngram extractor bug

Sorted by frequency

Read XML file from supplied URL

Extracted p data

Deleted unused parameters

Published this url as a Google Gadget
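The "Extracted p data" step can be sketched as follows. This is not the service code (which ran behind ngram.asmx); it is a Python illustration that assumes the file declares the standard TEI namespace and that the running text lives in `p` elements. A document without the namespace would need the plain tag name instead.

```python
import xml.etree.ElementTree as ET

# Assumption: the document uses the TEI P5 namespace.
TEI_NS = "{http://www.tei-c.org/ns/1.0}"


def extract_paragraphs(tei_xml):
    """Return the text of every TEI <p>, with inline markup flattened."""
    root = ET.fromstring(tei_xml)
    paragraphs = []
    for p in root.iter(TEI_NS + "p"):
        # itertext() also collects text inside child elements such as <hi>.
        text = " ".join("".join(p.itertext()).split())
        if text:
            paragraphs.append(text)
    return paragraphs


sample = """<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <text><body>
    <p>Call me <hi>Ishmael</hi>.</p>
    <p>Some years ago.</p>
  </body></text>
</TEI>"""
paragraphs = extract_paragraphs(sample)
```

To read from a supplied URL instead of a string, the bytes returned by `urllib.request.urlopen(url).read()` can be passed to `extract_paragraphs` in the same way.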

Demo of Ngram Extractor

Google Gadget connects to Web Service which converts a TEI-XML file into Ngrams in XML format

Add the Google Gadget by searching for:

Ngram Extractor

Demo Values:

- URL Location: http://www.jeffreycwitt.com/plaoul/translation_english/trans_engl_prollecture1.xml

- Ngram Length: 3

- Collocation Span: 4


Talk about issues in creating and debugging a Gadget

cache, editing, debugging, publishing a gadget

Access an external TEI-XML file

what constitutes text, how to detect sentence boundaries


Ngram Length, Collocation Span

Frequencies, marginal frequencies
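To show how frequencies and marginal frequencies fit together, here is a minimal sketch of the standard 2x2 contingency table for a bigram (x, y). The function name and argument names are hypothetical, and nothing here is taken from the project code.

```python
def contingency_table(f_xy, f_x, f_y, total):
    """Build the 2x2 contingency table for the bigram (x, y).

    f_xy:  bigrams that are exactly (x, y)
    f_x:   bigrams with x in first position (marginal frequency)
    f_y:   bigrams with y in second position (marginal frequency)
    total: all bigrams in the corpus
    """
    return [
        [f_xy, f_x - f_xy],                       # x first: y second / other second
        [f_y - f_xy, total - f_x - f_y + f_xy],   # other first: y second / other second
    ]


table = contingency_table(f_xy=2, f_x=2, f_y=3, total=10)
```

The four cells always sum to `total`, and association measures such as the Dice coefficient can be computed from them.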


Demo web service at:

http://henrylynam.com/ngram/ngram.asmx [5]

Demo Values:

- URL Location: http://www.jeffreycwitt.com/plaoul/translation_english/trans_engl_prollecture1.xml

- Ngram Length: 3

- Collocation Span: 4


Access web service from web server

Show string format of GET in web service

Show returned XML data


Lessons Learned

Google Gadget issues (versions)

Interoperability issues server side: web services to allow developers to work in parallel in different languages / frameworks?

Server monolithic framework issues

Overall: a great experience

Retrieved from "http://beintereditioneu.huygens.knaw.nl/wiki/index.php/Ngram"

This page was last modified 10:19, 28 January 2012. Content is available under GNU Free Documentation License 1.2.

