digitalmars.D - [OT] All your medians are belong to me
- Andrei Alexandrescu (12/12) Nov 21 2016 Hey folks, I'm working on a paper for fast median computation and
- jmh530 (6/19) Nov 21 2016 You might find the following worthwhile.
- Andrei Alexandrescu (3/5) Nov 21 2016 I have that, too, but was looking for some real data as well. It would
- Patrick Schluter (12/18) Nov 21 2016 I don't really know what kind of data you would need but there
Hey folks, I'm working on a paper for fast median computation and https://issues.dlang.org/show_bug.cgi?id=16517 came to mind.

I see the Google ngram corpus has occurrences of n-grams per year. Is data aggregated for all years available somewhere? I'd like to compute e.g. "the word (1-gram) with the median frequency across all English books", so I don't need the frequencies per year, only the totals. Of course I can download the entire corpus and then do some processing, but that would take a long time.

Also, if you can think of any large corpus that would be pertinent for median computation, please let me know!

Thanks,

Andrei
Nov 21 2016
On Monday, 21 November 2016 at 17:39:40 UTC, Andrei Alexandrescu wrote:
> Hey folks, I'm working on a paper for fast median computation and https://issues.dlang.org/show_bug.cgi?id=16517 came to mind. I see the Google ngram corpus has occurrences of n-grams per year. Is data aggregated for all years available somewhere? I'd like to compute e.g. "the word (1-gram) with the median frequency across all English books" so I don't need the frequencies per year, only totals. Of course I can download the entire corpus and then do some processing, but that would take a long time. Also, if you can think of any large corpus that would be pertinent for median computation, please let me know! Thanks, Andrei

You might find the following worthwhile:

http://opendata.stackexchange.com/questions/6114/dataset-for-english-words-of-dictionary-for-a-nlp-project

I would just generate a bunch of integers randomly and use that, but I don't know if you specifically need to work with strings.
Nov 21 2016
On 11/21/2016 01:18 PM, jmh530 wrote:
> I would just generate a bunch of integers randomly and use that, but I don't know if you specifically need to work with strings.

I have that, too, but was looking for some real data as well. It would be a nice addition. -- Andrei
Nov 21 2016
On Monday, 21 November 2016 at 18:39:26 UTC, Andrei Alexandrescu wrote:
> On 11/21/2016 01:18 PM, jmh530 wrote:
>> I would just generate a bunch of integers randomly and use that, but I don't know if you specifically need to work with strings.
>
> I have that, too, but was looking for some real data as well. It would be a nice addition. -- Andrei

I don't really know what kind of data you would need, but there are the European Union's Language Technology Resources corpora, made available for the research community. There are several data sets, in different formats (documents, alignments, XML) and in all European languages, that can be used for experiments and real-world use. The data is in the public domain and free to use. The DGT-TM data set is compiled by myself and updated yearly. It consists of around 12 billion characters, or 1.8 billion words, or 111 million segments, in 28 languages.

https://ec.europa.eu/jrc/en/language-technologies
Nov 21 2016