digitalmars.D.learn - dataframe implementations
- Jay Norwood (23/23) Nov 02 2015 I was reading about the Julia dataframe implementation yesterday,
- Laeeth Isharc (19/44) Nov 02 2015 Hi Jay.
- Jay Norwood (4/9) Nov 02 2015 yes, thanks. I believe I did see your comments previously.
- Jay Norwood (22/22) Nov 17 2015 I looked through the dataframe code and a couple of comments...
- Laeeth Isharc (27/50) Nov 18 2015 Yes - I think that one will want to have a choice between this
- Laeeth Isharc (6/31) Nov 18 2015 What do you think about the use of NaN for missing floats? In
- Jay Norwood (17/22) Nov 18 2015 The julia discussions mention another dataframe implementation, I
- Jay Norwood (5/6) Nov 18 2015 Here are the two discussions I recall on the julia NA
- Jay Norwood (4/4) Nov 18 2015 One more discussion link on the NA subject. This one on the R
- jmh530 (3/7) Nov 18 2015 My sense is that any data frame implementation should try to
- Jay Norwood (11/13) Nov 18 2015 I've been watching that development, but I don't have a feel for
- John Colvin (3/16) Nov 19 2015 You might not build on the nd slice type itself, but implementing
- ZombineDev (4/17) Nov 19 2015 How about using a nd slice of Variant(s), or a more specialized
- Laeeth Isharc (4/26) Nov 21 2015 Not sure it is a great idea to use a variant as the basic option
- Jay Norwood (18/21) Dec 03 2015 I'm reading today about an n-dim extension to pandas named xray.
- jmh530 (2/6) Nov 20 2015 I meant in the sense that Pandas is built upon Numpy.
I was reading about the Julia dataframe implementation yesterday, trying to understand their decisions and how D might implement one. From my notes:

1. They are currently using a dictionary of column vectors.
2. For NA (not available) values they are currently using an array of bytes, effectively as a Boolean flag, rather than a bit vector, for performance reasons.
3. They are not currently implementing hierarchical headers.
4. They are transforming non-valid symbol header strings (read from csv, for example) into valid symbols, for example by replacing '.' with underscore and prefixing numbers with 'x'. This allows their use in expressions.
5. Along with 4, they currently have a with for DataVector, to allow expressions to use, for example, :symbol_name instead of dv[:symbol_name].
6. They have operator symbols for per-element operations on two vectors; for example, a ./ b applies the division element-wise.
7. They currently have only row indexes, no row names or symbols.

I saw someone posting that they were working on a DataFrame implementation here, but I haven't been able to locate any code on github, and was wondering what implementation decisions are being made. Thanks.
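The column-dictionary plus byte NA mask described in points 1, 2, and 6 could be sketched roughly as below. This is only an illustrative sketch; the names (Column, na, price) are invented for the example, not from any existing library:

```d
import std.stdio;

// Hypothetical sketch of the Julia-style layout: a typed vector per
// column, plus a parallel byte vector flagging NA cells (a whole byte
// rather than a bit, for faster access).
struct Column(T)
{
    T[] values;
    ubyte[] na; // 1 = missing
}

void main()
{
    Column!double price;
    price.values = [101.5, 0.0, 103.25];
    price.na     = [0, 1, 0]; // second cell is NA

    // NA-aware reduction: skip flagged cells
    double sum = 0;
    foreach (i, v; price.values)
        if (!price.na[i])
            sum += v;
    writeln(sum); // 204.75
}
```

A per-element operation like Julia's a ./ b would loop the same way over two columns, OR-ing the two NA masks into the result's mask.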
Nov 02 2015
On Monday, 2 November 2015 at 13:54:09 UTC, Jay Norwood wrote:
> I was reading about the Julia dataframe implementation yesterday, trying to understand their decisions and how D might implement. [...] I saw someone posting that they were working on DataFrame implementation here, but haven't been able to locate any code in github, and was wondering what implementation decisions are being made here.

Hi Jay. That may have been me. I have implemented something very basic, but you can read and write my proto dataframe to/from CSV and HDF5. The code is up here:

https://github.com/Laeeth/d_dataframes

You should think of it as a crude prototype that has nonetheless been useful for me, but it's done more in the old-school hacker spirit of getting something working first rather than being designed properly. The reason for that is I have a lot on my plate at the moment, and technology is only one of many of those things, although an important one. In time I may get someone else to work on dataframes and open-source the results, but that may be some months away. So I'd welcome any assistance, or even someone taking it over.

I haven't really done a good job of having idiomatic access, but it's something, and a start.

Laeeth.
Nov 02 2015
On Monday, 2 November 2015 at 15:33:34 UTC, Laeeth Isharc wrote:
> Hi Jay. That may have been me. I have implemented something very basic, but you can read and write my proto dataframe to/from CSV and HDF5. The code is up here: https://github.com/Laeeth/d_dataframes

Yes, thanks. I believe I did see your comments previously. That's great that you've already got support for hdf5. I'll take a look.
Nov 02 2015
I looked through the dataframe code and have a couple of comments.

I had thought perhaps an app could read the header info and type info from hdf5, and generate D struct definitions with column headers as symbol names. That would enable faster processing than with the associative arrays, as well as support the auto-completion that would be helpful in writing expressions.

The csv type info for columns could be inferred, or else stated in the reader call, as is done as an option in julia. In both cases the column names would have to be valid symbol names for this to work. I believe Julia also expects this, or else does some conversion on your column names to make them valid symbols. I think the D csv processing would also need to check whether the column names are valid symbols.

The jupyter interactive environment supports python pandas and Julia dataframe column names in its autocompletion, and so I think the D debugging environment would need to provide similar capability if it is to be considered as a fast-recompile substitute for interactive dataframe exploration.

It seems to me that your particular examples of stock data would eventually need to handle missing data, as supported in Julia dataframes and python pandas. They both provide ways to drop or fill missing values. Did you want to support that?
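As a rough illustration of the struct-generation idea: a frame whose column names and types are known at compile time could be a plain struct of typed arrays. The names close and volume below are invented sample columns, standing in for whatever a metadata-to-code generator would emit:

```d
import std.algorithm : sum;

// Hypothetical output of a metadata-to-code generator: one typed
// array per column, with the column header as the field name.
struct Frame
{
    double[] close;
    long[]   volume;
}

void main()
{
    Frame f;
    f.close  = [10.0, 11.0, 12.0];
    f.volume = [100, 200, 300];

    // Column access is a plain symbol, so an editor can auto-complete
    // it and the compiler catches typos; no AA lookup at runtime.
    auto avg = f.close.sum / f.close.length;
    assert(avg == 11.0);
}
```

The associative-array alternative (string key to column) trades that static checking and speed for being able to open files whose schema is only known at runtime.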
Nov 17 2015
On Tuesday, 17 November 2015 at 13:56:14 UTC, Jay Norwood wrote:
> I had thought perhaps an app could read in the header info and type info from hdf5, and generate D struct definitions with column headers as symbol names. That would enable faster processing than with the associative arrays, as well as support the auto-completion that would be helpful in writing expressions.

Yes - I think one will want a choice between this kind of approach and using associative arrays, because for some purposes it's not convenient to have to compile code every time you open a strange file, while on the other hand the hit with an AA will sometimes matter.

The situation at the moment is that I have very little time to work on a correct general solution to this problem myself (yet it's important for D that we do get to one). I also lack the experience with D to do it very well very quickly. I do have a couple of seasoned people from the community helping me with things, but dataframes won't be the first thing they look at, and it could be a while before we get to that. If we implement something for our own needs, then I will open-source it, as that is commercially sensible as well as the right thing to do. But that could be a year away. Vlad Levenfeld was also looking at this a bit.

> The csv type info for columns could be inferred, or else stated in the reader call, as is done as an option in julia. In both cases the column names would have to be valid symbol names for this to work. I believe Julia also expects this, or else does some conversion on your column names to make them valid symbols.
>
> I think the D csv processing would also need to check whether the column names are valid symbols. The jupyter interactive environment supports python pandas and Julia dataframe column names in its autocompletion, and so I think the D debugging environment would need to provide similar capability if it is to be considered as a fast-recompile substitute for interactive dataframe exploration.

Well, we don't need to get there in a single bound - just being able to do this at all is already a big improvement, and I am already using D with jupyter to do things.

> It seems to me that your particular examples of stock data would eventually need to handle missing data, as supported in Julia dataframes and python pandas. They both provide ways to drop or fill missing values. Did you want to support that?

Yes - we should do so eventually, and there's much more that could be done. But maybe a sensible basic implementation is a start and we can refine after that. I wrote the dataframe in a couple of evenings, so I am sure it can be improved, and even rearchitected. Pull requests welcomed, and maybe we should set up a Trello to organise ideas? Let me know if you are in.
Nov 18 2015
On Monday, 2 November 2015 at 13:54:09 UTC, Jay Norwood wrote:
> [...] 2. for NA (not available) they are currently using an array of bytes, effectively as a Boolean flag, rather than a bitVector, for performance reasons. [...]

What do you think about the use of NaN for missing floats? In theory I could imagine wanting to distinguish between a NaN in the source file and a missing value, but in my world I have never felt the need for this. For integers and bools, that is different of course.
Nov 18 2015
On Wednesday, 18 November 2015 at 17:15:38 UTC, Laeeth Isharc wrote:
> What do you think about the use of NaN for missing floats? In theory I could imagine wanting to distinguish between a NaN in the source file and a missing value, but in my world I never felt the need for this. For integers and bools, that is different of course.

The julia discussions mention another dataframe implementation, I believe for R, where NaN was used. There was some mention of the virtues of their own choice and the problems with NaN. I think the use of NaN there was a particular encoding of NaN. Other implementations they mentioned used some reserved value in each of the numeric data types to represent NA.

In the julia case, I believe what they use is a separate byte vector for each column that holds the NA status. They discussed some other possible enhancements, but I don't know what they implemented. For example, if the single byte holds the NA flag, the cell value can hold additional info ... maybe the reason for the NA. There was also some discussion of having the associated cell hold repeat counts for the NA status, which I suppose was meant to repeat it for the following cells in the column vector.

I'll try to find the discussions and post the link.
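The "byte holds more than a flag" idea could look like the sketch below. The reason codes here are invented for illustration; only the general scheme (one status byte per cell, zero meaning present) comes from the discussion:

```d
// Hypothetical NA status byte: zero means the value is present,
// non-zero values encode why the cell is missing.
enum NA : ubyte { none = 0, missing = 1, parseError = 2 }

struct MaskedColumn(T)
{
    T[]  values;
    NA[] status; // one byte per cell
}

void main()
{
    MaskedColumn!int c;
    c.values = [7, 0, 9];
    c.status = [NA.none, NA.parseError, NA.none];

    // dropna-style filter: keep only cells whose status is none
    int[] kept;
    foreach (i, v; c.values)
        if (c.status[i] == NA.none)
            kept ~= v;
    assert(kept == [7, 9]);
}
```

A fillna-style operation would instead overwrite flagged cells with a replacement value and reset their status to none.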
Nov 18 2015
On Wednesday, 18 November 2015 at 18:04:30 UTC, Jay Norwood wrote:
> I'll try to find the discussions and post the link.

Here are the two discussions I recall on the julia NA implementation:

http://wizardmac.tumblr.com/post/104019606584/whats-wrong-with-statistics-in-julia-a-reply
https://github.com/JuliaLang/julia/pull/9363
Nov 18 2015
One more discussion link on the NA subject. This one is on the R implementation of NA using a single encoding of NaN, as well as its treatment of a selected integer value as an NA.

http://rsnippets.blogspot.com/2013/12/gnu-r-vs-julia-is-it-only-matter-of.html
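R's "single encoding of NaN" approach could be emulated in D with the quiet-NaN payload helpers in std.math (NaN and getNaNPayload are real Phobos functions). The 1954 payload below mirrors the value R is reported to use for NA_real_; treating it as the NA marker is this sketch's assumption:

```d
import std.math : NaN, getNaNPayload, isNaN;

void main()
{
    // R encodes NA_real_ as a quiet NaN carrying the payload 1954.
    enum ulong naPayload = 1954;
    real na = NaN(naPayload);   // "missing" marker
    real computed = real.nan;   // ordinary NaN, e.g. from 0.0/0.0

    // Both behave as NaN in arithmetic and comparisons...
    assert(isNaN(na) && isNaN(computed));
    // ...but the payload distinguishes "missing" from "undefined".
    assert(getNaNPayload(na) == naPayload);
    assert(getNaNPayload(computed) != naPayload);
}
```

The downside the R discussion highlights is that the distinction is fragile: most floating-point operations are free to return any NaN, so the payload can be lost as soon as an NA participates in arithmetic.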
Nov 18 2015
On Monday, 2 November 2015 at 13:54:09 UTC, Jay Norwood wrote:
> I saw someone posting that they were working on DataFrame implementation here, but haven't been able to locate any code in github, and was wondering what implementation decisions are being made here. Thanks.

My sense is that any data frame implementation should try to build on the work that's being done with n-dimensional slices.
Nov 18 2015
On Wednesday, 18 November 2015 at 22:46:01 UTC, jmh530 wrote:
> My sense is that any data frame implementation should try to build on the work that's being done with n-dimensional slices.

I've been watching that development, but I don't have a feel for where it could be applied in this case, since it appears to be focused on multi-dimensional slices of the same data type, slicing up a single range. Dataframes often consist of different data types by column. How did you see the nd slices being used?

Maybe the nd slices could be applied if you considered each row to be the same structure, and sliced by rows rather than operating on columns. Pandas supports a multi-dimension panel; maybe that would be the application for nd slices by row.
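The rows-as-one-structure idea is essentially an array of structs, which ordinary D slicing (and by extension anything range-based) handles because every row then shares a single element type. Bar and its fields below are invented for the example:

```d
// Hypothetical per-row record: every row has one layout, so a range
// of rows has a single element type and can be sliced freely.
struct Bar
{
    double open;
    double close;
    long   volume;
}

void main()
{
    Bar[] frame = [
        Bar(10.0, 10.5, 100),
        Bar(10.5, 11.0, 200),
        Bar(11.0, 10.8, 300),
    ];

    auto window = frame[1 .. 3]; // slice by rows, no copying
    assert(window.length == 2);
    assert(window[0].close == 11.0);
}
```

The trade-off is that columnar operations (summing one field across all rows) then touch every struct, losing the contiguous per-column storage that the column-vector layout gives.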
Nov 18 2015
On Thursday, 19 November 2015 at 06:33:06 UTC, Jay Norwood wrote:
> On Wednesday, 18 November 2015 at 22:46:01 UTC, jmh530 wrote:
>> My sense is that any data frame implementation should try to build on the work that's being done with n-dimensional slices.
>
> I've been watching that development, but I don't have a feel for where it could be applied in this case, since it appears to be focused on multi-dimensional slices of the same data type, slicing up a single range. The dataframes often consist of different data types by column. How did you see the nd slices being used? [...]

You might not build on the nd slice type itself, but implementing the same API (where possible/appropriate) would be good.
Nov 19 2015
On Thursday, 19 November 2015 at 06:33:06 UTC, Jay Norwood wrote:
> The dataframes often consist of different data types by column. How did you see the nd slices being used? [...]

How about using an nd slice of Variant(s), or the more specialized Algebraic type?

[1]: http://dlang.org/phobos/std_variant
Nov 19 2015
On Thursday, 19 November 2015 at 22:14:01 UTC, ZombineDev wrote:
> How about using an nd slice of Variant(s), or the more specialized Algebraic type? [1]: http://dlang.org/phobos/std_variant

Not sure it is a great idea to use a variant as the basic option when very often you will know that every cell in a particular column will be of the same type.
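The trade-off can be made concrete with Phobos's std.variant.Algebraic (a real Phobos type); the column contents below are made-up sample data:

```d
import std.variant : Algebraic;

// A variant-typed cell can hold any column type, at the cost of a
// runtime tag and padding per element; a homogeneous column stores
// raw values contiguously with no per-cell overhead.
alias Cell = Algebraic!(long, double, string);

void main()
{
    Cell[] mixed = [Cell(1L), Cell(2.5), Cell("x")]; // tag per cell
    double[] typed = [1.0, 2.5, 3.0];                // no per-cell tag

    assert(mixed[1].get!double == 2.5); // runtime-checked extraction
    assert(Cell.sizeof > double.sizeof); // the tag and padding cost
}
```

A middle ground is a variant of typed columns rather than a column of variant cells: Algebraic!(long[], double[], string[]) keeps the runtime flexibility at the column level while each column's elements stay homogeneous and contiguous.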
Nov 21 2015
On Saturday, 21 November 2015 at 14:16:26 UTC, Laeeth Isharc wrote:
> Not sure it is a great idea to use a variant as the basic option when very often you will know that every cell in a particular column will be of the same type.

I'm reading today about an n-dim extension to pandas named xray. Maybe we should try to understand how that fits. They support io from netCDF, and are making extensions to support blocked input using dask, so they can process data larger than in-memory limits.

http://xray.readthedocs.org/en/stable/data-structures.html
https://www.continuum.io/content/xray-dask-out-core-labeled-arrays-python

In general, pandas and xray are built around the requirement of pulling in data from storage with initially unknown column and index names and data types. Julia throws in support for jit compilation and specialized operations for different data types. It seems to me that D's strength would be a quick compile, which would then allow you to replace the dictionary tag implementations and variants with something that used compile-time symbol names and data types. That seems like it would provide more efficient processing, as well as better tab-completion support when creating expressions.
Dec 03 2015
On Thursday, 19 November 2015 at 06:33:06 UTC, Jay Norwood wrote:
> Maybe the nd slices could be applied if you considered each row to be the same structure, and slice by rows rather than operating on columns. Pandas supports a multi-dimension panel. Maybe this would be the application for nd slices by row.

I meant in the sense that Pandas is built upon Numpy.
Nov 20 2015