digitalmars.D - GZip File Reading
- dsimcha (20/20) Mar 09 2011 I noticed last night that Phobos actually has all the machinations
- Daniel Gibson (7/27) Mar 09 2011 Maybe a proper stream API would help. It could provide ByLine etc, could...
- Jonathan M Davis (7/36) Mar 09 2011 That was my thought. We really need proper streams...
- Lars T. Kyllingstad (3/41) Mar 10 2011 Not gzip and bzip2 compressed files. They only contain a single file.
- Jonathan M Davis (5/46) Mar 10 2011 Ah. True. I'm too used to always using tar with them. ;)
- spir (7/47) Mar 10 2011 Yop, but the underlying 'file' is a tar-ed pack of files...
- Walter Bright (2/6) Mar 10 2011 Use ranges.
- dsimcha (3/11) Mar 10 2011 Ok, obviously. The point was trying to figure out how to maximize the
- Walter Bright (3/15) Mar 10 2011 It's not so obvious based on my reading of the other comments. For examp...
- dsimcha (4/21) Mar 10 2011 Ok, I see what you're saying. I was making the assumption that the
- Steven Schveighoffer (13/30) Mar 11 2011 C's FILE * interface is too limiting/low performing. I'm working to
- dsimcha (8/39) Mar 11 2011 Well, I certainly appreciate your efforts. IMHO the current state of
- Russel Winder (25/47) Mar 10 2011 But isn't a gzip (or zip, 7z, bzip2, etc., etc.) file actually a contai...
- Lars T. Kyllingstad (6/19) Mar 10 2011 Nope, a gzip or bzip2 file only contains a single file. To zip several
- Russel Winder (21/25) Mar 10 2011 Obviously ;-) I confused myself thinking of files with extension tgz.
- dsimcha (18/23) Mar 10 2011 This is **exactly** my point. These single-file gzip and bzip2 files
- Lars T. Kyllingstad (7/33) Mar 10 2011 Although I agree this would be nice, I don't think std.stdio.File is the...
- dsimcha (8/14) Mar 10 2011 Ok, this seems to be the general consensus based on the replies I've
- Jonathan M Davis (13/29) Mar 10 2011 There _have_ been some threads on designing a new stream API, and there ...
- dsimcha (3/15) Mar 10 2011 So I guess in such a design we would still have things like a decorator ...
- Jonathan M Davis (12/30) Mar 10 2011 I don't remember exactly what it's going to look like. IIRC, streams are...
- Stewart Gordon (13/17) Mar 11 2011 You don't seem to get how std.stream works.
- dsimcha (6/28) Mar 11 2011 But:
- Jonathan M Davis (9/44) Mar 11 2011 Technically speaking, I think that it's intended to be scheduled for dep...
- Steven Schveighoffer (11/13) Mar 14 2011 No. I/O Ranges should be based on streams. A stream is a low level
- dsimcha (3/6) Mar 14 2011 I don't get the concern about performance, for file I/O at least. Isn't...
- Daniel Gibson (2/8) Mar 14 2011 SSDs, RAID, RAM-disks, 10GBit (and faster) networking, ...
- Steven Schveighoffer (19/25) Mar 14 2011 That is solved by buffering, which would be done in either case.
- dsimcha (4/29) Mar 14 2011 Ok, makes sense. I sincerely hope your I/O library is good enough to ge...
- dsimcha (11/11) Mar 12 2011 Since it seems like the consensus is that streaming gzip support belongs...
I noticed last night that Phobos actually has all the machinations required for reading gzipped files, buried in etc.c.zlib. I've wanted a high-level D interface for reading and writing compressed files with an API similar to "normal" file I/O for a while. I'm thinking about what the easiest/best design would be. At a high level there are two designs:

1. Hack std.stdio.File to support gzipped formats. This would allow an identical interface for "normal" and compressed I/O. It would also allow reuse of things like ByLine. However, it would require major refactoring of File to decouple it from the C file I/O routines so that it could call either the C or gzip ones depending on how it's configured. Probably, it would make sense to make an interface that wraps I/O functions and make an instance for C and one for gzip, with bzip2 and other goodies possibly being added later.

2. Write something completely separate. This would keep std.stdio.File doing one thing well (wrapping C file I/O) but would be more of a PITA for the user and possibly result in code duplication.

I'd like to get some comments on what an appropriate API design and implementation for writing gzipped files would be. Two key requirements are that it must be as easy to use as std.stdio.File and it must be easy to extend to support other single-file compression formats like bz2.
Mar 09 2011
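[Editor's note: for a sense of what the "identical interface" goal of option 1 looks like in practice, here is a short sketch in Python, whose standard-library gzip module already gives compressed files the same call shape as ordinary open(). This is an analogy, not the proposed D API; the file name and contents are made up for illustration.]

```python
import gzip
import os
import tempfile

# Write a gzipped text file, then read it back line by line with the
# same call shape as ordinary open(); the compression is handled
# transparently by the wrapper object.
path = os.path.join(tempfile.mkdtemp(), "example.txt.gz")

with gzip.open(path, "wt") as f:
    f.write("ACGT\nGGCC\n")

with gzip.open(path, "rt") as f:
    lines = f.readlines()

print(lines)  # ['ACGT\n', 'GGCC\n']
```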
On 10.03.2011 05:53, dsimcha wrote:
<snip>
> I'd like to get some comments on what an appropriate API design and implementation for writing gzipped files would be.

Maybe a proper stream API would help. It could provide ByLine etc., could be used for any kind of compression format (as long as an appropriate input stream is provided), ... (analogous for writing)

Cheers,
- Daniel
Mar 09 2011
On Wednesday 09 March 2011 21:10:59 Daniel Gibson wrote:
<snip>
> Maybe a proper stream API would help. It could provide ByLine etc., could be used for any kind of compression format (as long as an appropriate input stream is provided), ... (analogous for writing)

That was my thought. We really need proper streams... The other potential issue with compressed files is that they can contain directories and such. A gzipped/bzipped file is not necessarily a file that you can read, even once it's been uncompressed. That may or may not matter for this particular application of them, but it is something to be aware of.

- Jonathan M Davis
Mar 09 2011
On Wed, 09 Mar 2011 21:34:29 -0800, Jonathan M Davis wrote:
<snip>
> The other potential issue with compressed files is that they can contain directories and such.

Not gzip and bzip2 compressed files. They only contain a single file.

-Lars
Mar 10 2011
On Thursday 10 March 2011 00:15:34 Lars T. Kyllingstad wrote:
<snip>
> Not gzip and bzip2 compressed files. They only contain a single file.

Ah. True. I'm too used to always using tar with them. ;) Actually, the fact that they're that way makes them _way_ more pleasant to deal with programmatically than zip...

- Jonathan M Davis
Mar 10 2011
On 03/10/2011 09:15 AM, Lars T. Kyllingstad wrote:
<snip>
> Not gzip and bzip2 compressed files. They only contain a single file.

Yop, but the underlying 'file' is a tar-ed pack of files...

Denis
--
_________________
vita es estrany
spir.wikidot.com
Mar 10 2011
On 3/9/2011 8:53 PM, dsimcha wrote:
<snip>
> I'd like to get some comments on what an appropriate API design and implementation for writing gzipped files would be.

Use ranges.
Mar 10 2011
On 3/10/2011 4:59 AM, Walter Bright wrote:
<snip>
> Use ranges.

Ok, obviously. The point was trying to figure out how to maximize the reuse of the infrastructure from std.stdio.File.
Mar 10 2011
On 3/10/2011 6:24 AM, dsimcha wrote:
<snip>
> Ok, obviously. The point was trying to figure out how to maximize the reuse of the infrastructure from std.stdio.File.

It's not so obvious based on my reading of the other comments. For example, we should not be inventing a streaming interface.
Mar 10 2011
On 3/10/2011 8:29 PM, Walter Bright wrote:
<snip>
> It's not so obvious based on my reading of the other comments. For example, we should not be inventing a streaming interface.

Ok, I see what you're saying. I was making the assumption that the streaming interface would be based on ranges, and it was more a matter of working out other details, like what decorators to provide.
Mar 10 2011
On Thu, 10 Mar 2011 20:29:55 -0500, Walter Bright <newshound2 digitalmars.com> wrote:
<snip>
> It's not so obvious based on my reading of the other comments. For example, we should not be inventing a streaming interface.

C's FILE * interface is too limiting/low-performing. I'm working to create a streaming interface to replace it, and then we can compare the differences. I think it's pretty obvious from Tango's I/O performance that a D-based stream interface is a better approach. Ranges should be built on top of that interface.

I won't continue the debate, since it's difficult to argue from a position of theory. However, I don't think it will be long before I can show some real numbers. I'm not expecting Phobos to adopt it, based on my experience with dcollections, but it should be seamlessly usable with Phobos, especially since range-based functions are templated.

-Steve
Mar 11 2011
On 3/11/2011 8:04 AM, Steven Schveighoffer wrote:
<snip>
> C's FILE * interface is too limiting/low-performing. I'm working to create a streaming interface to replace it, and then we can compare the differences.

Well, I certainly appreciate your efforts. IMHO the current state of file I/O for anything but uncompressed plain text in D is pretty sad. Even uncompressed plain text is pretty bad on Windows due to various bugs. IMHO one huge improvement that could be made to Phobos would be to create modules for reading the most common file formats (my personal list would be gzip, bzip2, png, bmp, jpeg and csv) with a nice high-level D interface.
Mar 11 2011
On Wed, 2011-03-09 at 23:53 -0500, dsimcha wrote:
<snip>
> I noticed last night that Phobos actually has all the machinations required for reading gzipped files, buried in etc.c.zlib. [...] At a high level there are two designs:

But isn't a gzip (or zip, 7z, bzip2, etc., etc.) file actually a container: a tree of files? So isn't it more a persistent data structure that has a rendering as a single flat file on the filestore, than being a partitioned flat file, which is what you will end up with if you head directly down the file/stream route?

--
Russel.
Dr Russel Winder      t: +44 20 7585 2200   voip: sip:russel.winder ekiga.net
41 Buckmaster Road    m: +44 7770 465 077   xmpp: russel russel.org.uk
London SW11 1EN, UK   w: www.russel.org.uk  skype: russel_winder
Mar 10 2011
On Thu, 10 Mar 2011 10:17:17 +0000, Russel Winder wrote:
<snip>
> But isn't a gzip (or zip, 7z, bzip2, etc., etc.) file actually a container: a tree of files?

Nope, a gzip or bzip2 file only contains a single file. To zip several files, you first make a tar archive, and then you run gzip or bzip2 on it. That's why most compressed archives targeted at the Linux platform have extensions like .tar.gz, .tar.bz2, and so on.

-Lars
Mar 10 2011
On Thu, 2011-03-10 at 10:57 +0000, Lars T. Kyllingstad wrote:
[ . . . ]
> Nope, a gzip or bzip2 file only contains a single file. To zip several files, you first make a tar archive, and then you run gzip or bzip2 on it. That's why most compressed archives targeted at the Linux platform have extensions like .tar.gz, .tar.bz2, and so on.

Obviously ;-) I confused myself thinking of files with extension tgz. Zip, Gzip: so similar, so different. Sorry for the noise. Everyone should go back to thinking of a transforming stream architecture for this problem.

--
Russel.
Mar 10 2011
On 3/10/2011 5:57 AM, Lars T. Kyllingstad wrote:
<snip>
> Nope, a gzip or bzip2 file only contains a single file.

This is **exactly** my point. These single-file gzip and bzip2 files should be usable with exactly the same API as uncompressed file I/O.

My personal use case for this is files that contain large amounts of DNA sequence. This compresses very well, since besides a little meta-info it's just a bunch of A's, C's, G's and T's. I want to be able to read in these huge files and decompress them transparently on the fly.

Another example (and the one that brought the subject of these non-tarred gzips to my attention) is the svgz format. This is an image format, and is literally just a gzipped SVG. Uncompressed SVG is a ridiculously bloated format but compresses very well, so the SVG standard requires that gzipped SVG files "just work" transparently with any SVG-compliant program. I recently added svgz support to plot2kill, and it was somewhat of a PITA because I had to find the C API buried in etc.c.zlib and then I got stuck using it instead of a nice D API.

The bigger point, though, is that use cases for non-tarred single-file gzips do exist and they should be handled transparently via an interface identical to normal file I/O.
Mar 10 2011
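[Editor's note: the transparent svgz behavior described above can be sketched outside of D as well. The following Python sketch (the helper name open_maybe_gzipped is invented for illustration) sniffs the two-byte gzip magic number, 0x1f 0x8b, and chooses plain or decompressing reads accordingly, which is essentially what an SVG-compliant reader has to do.]

```python
import gzip
import os
import tempfile

GZIP_MAGIC = b"\x1f\x8b"  # first two bytes of every gzip stream

def open_maybe_gzipped(path):
    """Open a text file, transparently decompressing it when it
    carries the gzip magic number (as svgz files do)."""
    with open(path, "rb") as f:
        magic = f.read(2)
    if magic == GZIP_MAGIC:
        return gzip.open(path, "rt")
    return open(path, "r")

# Write the same tiny document both plain and gzipped, then read both
# back through the one helper.
d = tempfile.mkdtemp()
plain, packed = os.path.join(d, "a.svg"), os.path.join(d, "a.svgz")
with open(plain, "w") as f:
    f.write("<svg/>")
with gzip.open(packed, "wt") as f:
    f.write("<svg/>")

with open_maybe_gzipped(plain) as f:
    p = f.read()
with open_maybe_gzipped(packed) as f:
    z = f.read()
print(p, z)  # <svg/> <svg/>
```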
On Thu, 10 Mar 2011 09:20:34 -0500, dsimcha wrote:
<snip>
> The bigger point, though, is that use cases for non-tarred single-file gzips do exist and they should be handled transparently via an interface identical to normal file I/O.

Although I agree this would be nice, I don't think std.stdio.File is the right place to put it. I think a general streaming framework should be in place first, and File be made to work with it. Then, working with a gzipped/bzipped file should be as simple as wrapping the raw File stream in a compression/decompression stream.

-Lars
Mar 10 2011
On 3/10/2011 9:45 AM, Lars T. Kyllingstad wrote:
<snip>
> I think a general streaming framework should be in place first, and File be made to work with it.

Ok, this seems to be the general consensus based on the replies I've gotten. Unfortunately, I have neither the time, the desire, nor the knowledge to design and implement a full-fledged stream API, whereas I have enough of all three to bolt gzip support onto std.stdio. I guess I'll solve my specific use case at a higher level by wrapping the C gzip stuff for my DNA sequence reader class, and let someone who knows something about good stream design solve the more general problem.
Mar 10 2011
On Thursday 10 March 2011 07:14:32 dsimcha wrote:
<snip>
> Unfortunately, I have neither the time, the desire, nor the knowledge to design and implement a full-fledged stream API...

There _have_ been some threads on designing a new stream API, and there are some preliminary designs, but as far as I know, they have yet to really go anywhere. I'm unaware of the stream API really having a champion per se. Andrei has done some preliminary design work, but I don't know if he intends to actually implement anything, and as far as I know, no one else has volunteered. So, a new std.stream is one of those things that we all agree we want but which hasn't happened yet, because no one has stepped up to do it.

And I do agree that the ability to deal with a compressed file should be part of the stream API (probably as some sort of adapter/wrapper which uncompresses the stream as you iterate through it). But the stream API needs to be designed and _implemented_ before we'll have that.

- Jonathan M Davis
Mar 10 2011
== Quote from Jonathan M Davis (jmdavisProg gmx.com)'s article
<snip>
> And I do agree that the ability to deal with a compressed file should be part of the stream API (probably as some sort of adapter/wrapper which uncompresses the stream as you iterate through it).

So I guess in such a design we would still have things like a decorator to iterate through a stream by chunk, by line, etc., with a range interface?
Mar 10 2011
On Thursday, March 10, 2011 09:16:35 dsimcha wrote:
<snip>
> So I guess in such a design we would still have things like a decorator to iterate through a stream by chunk, by line, etc., with a range interface?

I don't remember exactly what it's going to look like. IIRC, streams are the reason that Andrei was looking at a range API that gave you a T[] instead of T. Whatever you were doing would be built on top of that. So, grabbing it by line or whatever. But I would fully expect that you would be able to put a wrapper range in there which took the stream of bytes, treated them as if they were gzipped or bzipped or whatever, and gave you bytes (or chars or whatever was appropriate) as if you were reading from an uncompressed version of the file. So, the reading would be identical whether the file was compressed or not. It's just that in the case of a compressed stream/file, you'd have a decorator/wrapper which uncompressed it for you.

- Jonathan M Davis
Mar 10 2011
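The decorator/wrapper idea described above can be sketched concretely. The following is an illustrative Python rendering, not a proposed Phobos API; the name `gzip_chunks` and the chunk-iterator shape are invented for the example. It wraps a range of compressed byte chunks and yields decompressed chunks, so downstream code reads as if the file were never compressed:

```python
import zlib

def gzip_chunks(chunks):
    """Decorator over an iterator of gzip-compressed byte chunks:
    yields the decompressed bytes, hiding the compression entirely."""
    # wbits = 16 + MAX_WBITS tells zlib to expect a gzip header/trailer.
    d = zlib.decompressobj(16 + zlib.MAX_WBITS)
    for chunk in chunks:
        out = d.decompress(chunk)
        if out:
            yield out
    tail = d.flush()  # drain whatever is still buffered at end of stream
    if tail:
        yield tail
```

Anything built on the byte-range layer (byLine, byChunk, etc.) would then compose with this wrapper unchanged.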
On Thursday 10 March 2011 07:14:32 dsimcha wrote:
> On 3/10/2011 9:45 AM, Lars T. Kyllingstad wrote:
>> Although I agree this would be nice, I don't think std.stdio.File is the right place to put it. I think a general streaming framework should be in place first, and File be made to work with it. Then, working with a gzipped/bzipped file should be as simple as wrapping the raw File stream in a compression/decompression stream.
>> -Lars
> Ok, this seems to be the general consensus based on the replies I've gotten. Unfortunately, I have neither the time, the desire, nor the knowledge to design and implement a full-fledged stream API, whereas I have enough of all three of these to bolt gzip support onto std.stdio. I guess I'll solve my specific use case at a higher level by wrapping the C gzip stuff for my DNA sequence reader class, and let someone who knows something about good stream design solve the more general problem.

There _have_ been some threads on designing a new stream API, and there are some preliminary designs, but as far as I know, they have yet to really go anywhere. I'm unaware of the stream API really having a champion per se. Andrei has done some preliminary design work, but I don't know if he intends to actually implement anything, and as far as I know, no one else has volunteered. So, a new std.stream is one of those things that we all agree on that we want but which hasn't happened yet, because no one has stepped up to do it. And I do agree that the ability to deal with a compressed file should be part of the stream API (probably as some sort of adapter/wrapper which uncompresses the stream as you iterate through it). But the stream API needs to be designed and _implemented_ before we'll have that.

- Jonathan M Davis
Mar 10 2011
On 10/03/2011 04:53, dsimcha wrote:
<snip>
> I'd like to get some comments on what an appropriate API design and implementation for writing gzipped files would be. Two key requirements are that it must be as easy to use as std.stdio.File and it must be easy to extend to support other single-file compression formats like bz2.

You don't seem to get how std.stream works. The API is defined in the InputStream and OutputStream interfaces. Various classes implement this interface, generally through the Stream abstract class, to provide the functionality for a specific kind of stream. File is just one of these classes. Another is MemoryStream, to read from and write to a buffer in memory. A stream class used to work with gzipped files would be just another.

Indeed, we have FilterStream, which is a base class for stream classes that wrap a stream, such as a file or memory stream, to modify the data in some way as it goes in and out. Compressing or decompressing is an example of this - so I guess that GzipStream would be a subclass of FilterStream.

Stewart.
Mar 11 2011
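The FilterStream design Stewart describes can be sketched in miniature. This is a hypothetical Python rendering for illustration only; the class names mirror std.stream's Stream, MemoryStream, and FilterStream, but the code is not the actual Phobos implementation:

```python
import zlib

class Stream:
    """Abstract base: something you can read bytes from."""
    def read(self, n):
        raise NotImplementedError

class MemoryStream(Stream):
    """Reads from an in-memory buffer."""
    def __init__(self, data):
        self.data, self.pos = data, 0
    def read(self, n):
        chunk = self.data[self.pos:self.pos + n]
        self.pos += len(chunk)
        return chunk

class FilterStream(Stream):
    """Base for streams that wrap another stream and transform its data."""
    def __init__(self, source):
        self.source = source

class GzipStream(FilterStream):
    """FilterStream subclass: transparently decompresses gzip data
    read from the wrapped source stream."""
    def __init__(self, source):
        super().__init__(source)
        self._d = zlib.decompressobj(16 + zlib.MAX_WBITS)  # gzip format
        self._buf = b""
        self._eof = False
    def read(self, n):
        while len(self._buf) < n and not self._eof:
            raw = self.source.read(4096)
            if not raw:
                self._buf += self._d.flush()
                self._eof = True
            else:
                self._buf += self._d.decompress(raw)
        out, self._buf = self._buf[:n], self._buf[n:]
        return out
```

Because GzipStream exposes the same read interface as any other Stream, callers are oblivious to whether the underlying bytes are compressed, which is exactly the decorator point being made.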
On 3/11/2011 7:12 PM, Stewart Gordon wrote:
> On 10/03/2011 04:53, dsimcha wrote:
> <snip>
>> I'd like to get some comments on what an appropriate API design and implementation for writing gzipped files would be. Two key requirements are that it must be as easy to use as std.stdio.File and it must be easy to extend to support other single-file compression formats like bz2.
> You don't seem to get how std.stream works. The API is defined in the InputStream and OutputStream interfaces. Various classes implement this interface, generally through the Stream abstract class, to provide the functionality for a specific kind of stream. File is just one of these classes. Another is MemoryStream, to read from and write to a buffer in memory. A stream class used to work with gzipped files would be just another. Indeed, we have FilterStream, which is a base class for stream classes that wrap a stream, such as a file or memory stream, to modify the data in some way as it goes in and out. Compressing or decompressing is an example of this - so I guess that GzipStream would be a subclass of FilterStream.
> Stewart.

But:

1. std.stream is scheduled for deprecation IIRC.
2. std.stdio.File is what's now idiomatic to use.
3. Streams in D should be based on input ranges, not whatever crufty old stuff std.stream is based on.
Mar 11 2011
On Friday, March 11, 2011 16:27:21 dsimcha wrote:
> On 3/11/2011 7:12 PM, Stewart Gordon wrote:
>> On 10/03/2011 04:53, dsimcha wrote:
>> <snip>
>>> I'd like to get some comments on what an appropriate API design and implementation for writing gzipped files would be. Two key requirements are that it must be as easy to use as std.stdio.File and it must be easy to extend to support other single-file compression formats like bz2.
>> You don't seem to get how std.stream works. The API is defined in the InputStream and OutputStream interfaces. Various classes implement this interface, generally through the Stream abstract class, to provide the functionality for a specific kind of stream. File is just one of these classes. Another is MemoryStream, to read from and write to a buffer in memory. A stream class used to work with gzipped files would be just another. Indeed, we have FilterStream, which is a base class for stream classes that wrap a stream, such as a file or memory stream, to modify the data in some way as it goes in and out. Compressing or decompressing is an example of this - so I guess that GzipStream would be a subclass of FilterStream.
>> Stewart.
> But:
> 1. std.stream is scheduled for deprecation IIRC.

Technically speaking, I think that it's intended to be scheduled for deprecation as opposed to actually being scheduled for deprecation, but whatever. It's going to be phased out as soon as we have a replacement.

> 2. std.stdio.File is what's now idiomatic to use.

Well, more like it's the only solution we have which will be sticking around. Once we have a new std.stream, it may be the preferred solution.

> 3. Streams in D should be based on input ranges, not whatever crufty old stuff std.stream is based on.

Indeed. But the new API still needs to be fleshed out and implemented before we actually have even a _proposed_ new std.stream, let alone actually have it.

- Jonathan M Davis
Mar 11 2011
On Fri, 11 Mar 2011 19:27:21 -0500, dsimcha <dsimcha yahoo.com> wrote:
> 3. Streams in D should be based on input ranges, not whatever crufty old stuff std.stream is based on.

No. I/O ranges should be based on streams. A stream is a low-level construct that can read and/or write data. In essence it is an abstraction of the capabilities of the OS. BTW, that crufty old stuff probably way outperforms anything you could ever do with ranges as the base. The range interface simply isn't built to deal with I/O properly. For example, std.stdio.File is based on FILE *, which is an opaque stream interface.

There should probably be a RangeStream which would wrap a range in a stream interface, if you want to go that route.

-Steve
Mar 14 2011
On 3/14/2011 8:22 AM, Steven Schveighoffer wrote:
> BTW, that crufty old stuff probably way outperforms anything you could ever do with ranges as the base.

I don't get the concern about performance, for file I/O at least. Isn't the main bottleneck reading it off the disk platter?
Mar 14 2011
On 14.03.2011 14:17, dsimcha wrote:
> On 3/14/2011 8:22 AM, Steven Schveighoffer wrote:
>> BTW, that crufty old stuff probably way outperforms anything you could ever do with ranges as the base.
> I don't get the concern about performance, for file I/O at least. Isn't the main bottleneck reading it off the disk platter?

SSDs, RAID, RAM-disks, 10GBit (and faster) networking, ...
Mar 14 2011
On Mon, 14 Mar 2011 09:17:04 -0400, dsimcha <dsimcha yahoo.com> wrote:
> On 3/14/2011 8:22 AM, Steven Schveighoffer wrote:
>> BTW, that crufty old stuff probably way outperforms anything you could ever do with ranges as the base.
> I don't get the concern about performance, for file I/O at least. Isn't the main bottleneck reading it off the disk platter?

That is solved by buffering, which would be done in either case. With ranges, you likely have to copy things more than if you use a proper stream interface, or else make the interface very awkward. I don't know about you, but having to set the amount I want to read before I call front seems awkward. The library I'm writing optimizes the copying so there is very little copying from the buffer. If you look at Tango's I/O performance, it outperforms even C I/O, and it uses a class/interface hierarchy with delegates for reading data.

I think the range concept is good to paste on top of a buffered I/O stream, but not to use as the base. For example, byLine is a good example of an I/O range that would use a buffered I/O stream to do its work. See this message I posted a few months back:

http://www.digitalmars.com/webnews/newsgroups.php?art_group=digitalmars.D&article_id=119400

A couple replies later I outline how to write a byLine function (and easily a range) on top of such a stream. This is the basis for the I/O library I'm currently writing.

-Steve
Mar 14 2011
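Steve's layering, a byLine range built on top of a buffered stream rather than ranges at the bottom, can be sketched roughly as follows. This is an illustrative Python sketch; BufferedStream, read_until, and by_line are invented names, not the API of his library:

```python
import io

class BufferedStream:
    """Minimal buffered reader over a byte source (any file-like object)."""
    def __init__(self, source, bufsize=4096):
        self.source, self.bufsize = source, bufsize
        self.buf = b""
    def read_until(self, delim):
        """Return data up to and including delim, refilling as needed.
        At EOF, return whatever remains (possibly empty)."""
        while True:
            i = self.buf.find(delim)
            if i >= 0:
                line, self.buf = self.buf[:i + 1], self.buf[i + 1:]
                return line
            chunk = self.source.read(self.bufsize)
            if not chunk:  # EOF: hand back the leftover tail
                line, self.buf = self.buf, b""
                return line
            self.buf += chunk

def by_line(stream):
    """Range-like line iterator layered on top of the buffered stream."""
    while True:
        line = stream.read_until(b"\n")
        if not line:
            return
        yield line
```

Note how the range never copies the OS-level buffer itself; it only slices lines out of the stream's buffer, which is the low-copy design being argued for.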
== Quote from Steven Schveighoffer (schveiguy yahoo.com)'s article
> On Mon, 14 Mar 2011 09:17:04 -0400, dsimcha <dsimcha yahoo.com> wrote:
>> On 3/14/2011 8:22 AM, Steven Schveighoffer wrote:
>>> BTW, that crufty old stuff probably way outperforms anything you could ever do with ranges as the base.
>> I don't get the concern about performance, for file I/O at least. Isn't the main bottleneck reading it off the disk platter?
> That is solved by buffering, which would be done in either case. With ranges, you likely have to copy things more than if you use a proper stream interface, or else make the interface very awkward. I don't know about you, but having to set the amount I want to read before I call front seems awkward. The library I'm writing optimizes the copying so there is very little copying from the buffer. If you look at Tango's I/O performance, it outperforms even C I/O, and uses a class/interface hierarchy w/ delegates for reading data. I think the range concept is good to paste on top of a buffered I/O stream, but not to use as the base. For example, byLine is a good example of an I/O range that would use a buffered I/O stream to do its work. See this message I posted a few months back:
> http://www.digitalmars.com/webnews/newsgroups.php?art_group=digitalmars.D&article_id=119400
> A couple replies later I outline how to write a byLine function (and easily a range) on top of such a stream. This is the basis for my current I/O library I'm writing.
> -Steve

Ok, makes sense. I sincerely hope your I/O library is good enough to get adopted, then, because Phobos is in **serious** need of better I/O functionality.
Mar 14 2011
Since it seems like the consensus is that streaming gzip support belongs in a stream API, I guess we have yet another reason to get busy with the stream API. However, I'm wondering if std.file should support gzip and, if license issues can be overcome, bzip2. I'd love to be able to write code like this:

// Read and transparently decompress foo.txt, which is UTF-8 encoded.
auto foo = cast(string) gzippedRead("foo.txt.gz");

// Write a buffer to a gzipped file.
gzippedWrite("foo.txt.gz", buf);

This stuff would be trivial to implement in std.file and, IMHO, belongs there. What's the consensus?
Mar 12 2011
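For what it's worth, the proposed gzippedRead/gzippedWrite functions really are one-liners over a decompressing stream wrapper. A hypothetical Python equivalent using the stdlib gzip module (the snake_case names mirror the proposed D functions and are otherwise invented):

```python
import gzip

def gzipped_read(path):
    """Read a gzip file and return its decompressed contents
    (analog of the proposed gzippedRead)."""
    with gzip.open(path, "rb") as f:
        return f.read()

def gzipped_write(path, buf):
    """Compress buf and write it out as a gzip file
    (analog of the proposed gzippedWrite)."""
    with gzip.open(path, "wb") as f:
        f.write(buf)
```

Usage matches the D sketch above: gzipped_read("foo.txt.gz") yields the plain text, and the caller can decode it as UTF-8 or whatever encoding the file uses.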