What is meant by "number of characters extracted" in the std::ifstream gcount doc?

It is said in the documentation of the ifstream::getline method that:

The number of characters successfully read and stored by this function can be accessed by calling member gcount. https://cplusplus.com/reference/istream/istream/getline/

In any case, if count > 0, it then stores a null character CharT() into the next successive location of the array and updates gcount(). https://en.cppreference.com/w/cpp/io/basic_istream/getline

From both of the above resources documenting ifstream::getline it can be deduced that gcount is supposed to change even after the end of file (EOF) is encountered. That is because "in any case" includes the EOF case, and an update is only an update if it changes the target value.

It is said in the documentation of ifstream::gcount method that it:

Returns the number of characters extracted by the last unformatted input operation performed on the object. https://cplusplus.com/reference/istream/istream/gcount/

Returns the number of characters extracted by the last unformatted input operation, or the maximum representable value of std::streamsize if the number is not representable. https://en.cppreference.com/w/cpp/io/basic_istream/gcount

If it's the number of characters extracted from the ifstream, then the CPlusPlus.com documentation of getline must be wrong as it states "characters successfully read and stored".

Also, the CppReference.com would be wrong, because it states that "in any case ... updates gcount()" but gcount is not updated when an EOF is encountered before the line end delimiter.

If it's the number of characters written into the array buffer argument of ifstream::getline, then the standard library has a bug. When during the execution of ifstream::getline the line ends prematurely with end-of-file (EOF), the null character is appended to the end of the array buffer but gcount is not updated accordingly.

Here is the code that exemplifies the dilemma.

#include <cstdlib>
#include <iostream>
#include <array>
#include <fstream>
#include <limits>
#include <cstring>
#include <string>

int main(int argc, char **argv) {
    if (argc < 2) {
        std::cerr << "Usage: " << argv[0] << " file\n";
        return EXIT_FAILURE;
    }

    std::array<char, 10> buf;
    std::ifstream file;
    file.open(argv[1], std::ifstream::in);

    do {
        file.clear();
        file.getline(buf.data(), buf.size());
        std::streamsize gcount = file.gcount();

        if (file.bad() || gcount <= 0) {
            break;
        }

        if (!file.fail()) {
            std::cerr
                << "LINE: [" << buf.data() << "] gcount "
                << std::to_string(gcount) << ", strlen "
                << std::to_string(strlen(buf.data()))
                << (file.eof() ? " (EOF)\n" : "\n");

            continue;
        }

        // Buffer must have got full. Let's skip to the end of line.
        file.clear();
        file.ignore(std::numeric_limits<std::streamsize>::max(), '\n');
    }
    while (!file.eof() && !file.bad());

    file.close();

    return EXIT_SUCCESS;
}

Here is the output I get for a text file that does not have a newline character at the end of its last line.

LINE: [dgsagdsa] gcount 9, strlen 8
LINE: [test] gcount 5, strlen 4
LINE: [test123] gcount 8, strlen 7
LINE: [123test] gcount 8, strlen 7
LINE: [] gcount 1, strlen 0
LINE: [xxxxxxx] gcount 8, strlen 7
LINE: [yy] gcount 2, strlen 2 (EOF)

As you can see, there is a discrepancy between gcount and strlen on the last line of the output.

That said, let's come back to the main question now.

What is meant by the number of characters extracted in the documentation of std::ifstream::gcount?

The question has two parts to it.

  1. What is meant by a "character"?
  2. What is meant by "extraction"?

Is one character always one byte in this context? A Unicode character could consist of multiple bytes. A line-ending sequence could consist of multiple bytes too (CR+LF). Could it ever happen (perhaps in the future) that gcount is increased by 1 but multiple bytes were extracted? Could it ever happen that gcount is increased by 1 but multiple bytes were stored in the array buffer?

Answer by Dean Johnson

Let's take the last line in your example and walk through it - yy<eof>.

initial state: gcount = 0, strlen(inProgressBuf) == 0
yy<eof>

gcount = 1, strlen(inProgressBuf) == 1
yy<eof>
^

gcount = 2, strlen(inProgressBuf) == 2
yy<eof>
 ^

oh, hit EOF
yy<eof>
  ^

At the point of hitting EOF, two characters have been extracted and so gcount is 2. getline is now going to append a null character to your buffer - this has nothing to do with gcount. Only two characters were actually extracted.

In the case of a string with a delimiter, let's say yy<lf><eof>:

initial state: gcount = 0, strlen(inProgressBuf) == 0
yy<lf><eof>

gcount = 1, strlen(inProgressBuf) == 1
yy<lf><eof>
^

gcount = 2, strlen(inProgressBuf) == 2
yy<lf><eof>
 ^

gcount = 3, strlen(inProgressBuf) == 2
yy<lf><eof>
  ^

When the LF is hit, a character IS being extracted from the input, and so gcount is incremented. However, that extracted character matches the getline delimiter and so it is NOT added to your buffer. A null character gets added simply for null termination of the string.

EOF is not a character that can be extracted and so reaching it does not increment gcount.
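
The two walkthroughs above can be reproduced without a file. The following is a minimal sketch using std::istringstream in place of std::ifstream (both inherit getline and gcount from std::istream). The names GetlineResult and read_one_line are hypothetical helpers introduced only for this illustration.

```cpp
#include <cstddef>
#include <cstring>
#include <ios>
#include <sstream>
#include <string>

// Hypothetical helper type: what one getline call left behind.
struct GetlineResult {
    std::streamsize extracted;  // gcount() after the call
    std::size_t stored;         // strlen() of the buffer, terminator excluded
    bool hit_eof;               // whether eofbit was set by the call
};

// Hypothetical helper: runs a single getline on an in-memory stream.
GetlineResult read_one_line(const std::string& input) {
    std::istringstream in(input);
    char buf[16] = {};            // large enough for the short inputs used here
    in.getline(buf, sizeof buf);  // default delimiter is '\n'
    return {in.gcount(), std::strlen(buf), in.eof()};
}
```

Calling read_one_line("yy") should report 2 extracted and 2 stored with EOF set, while read_one_line("yy\n") should report 3 extracted and 2 stored: the delimiter is extracted and counted but never copied into the buffer.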

The only wording I can see on cppreference that could maybe be disputed is this excerpt from https://en.cppreference.com/w/cpp/io/basic_istream/getline:

In any case, if count > 0, it then stores a null character CharT() into the next successive location of the array and updates gcount().

You could read this as saying that gcount is updated because the null character was appended. However, I believe the intended meaning is that gcount is updated whenever count > 0, independently of the null character being stored.

Regarding the question of how to determine the number of bytes written, the suggestion in the comments seems appropriate:

It is gcount unless you hit eof then it is gcount + 1
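
As a rough sketch of that rule, the helper below turns gcount into the number of bytes getline wrote, terminator included. The function names are hypothetical and not part of the standard library. One assumption goes beyond the quoted comment: the "+ 1" is also applied when the buffer fills up (failbit set without EOF), since in that case no delimiter was extracted either. The badbit case is ignored.

```cpp
#include <cstring>
#include <ios>
#include <istream>
#include <sstream>
#include <string>

// Hypothetical helper. Call right after in.getline(buf, count) with count > 0.
std::streamsize bytes_written_by_getline(const std::istream& in) {
    const std::streamsize n = in.gcount();
    // good(): the delimiter was extracted and counted but not stored,
    // so the null terminator takes its place and n bytes were written.
    // Otherwise (EOF, or by assumption a full buffer): all n extracted
    // characters were stored, plus the terminator.
    return in.good() ? n : n + 1;
}

// Hypothetical check: compares the rule against strlen(buf) + 1 for one input.
bool rule_matches_buffer(const std::string& input) {
    std::istringstream in(input);
    char buf[8] = {};  // deliberately small so long lines overflow it
    in.getline(buf, sizeof buf);
    const auto expected = static_cast<std::streamsize>(std::strlen(buf) + 1);
    return bytes_written_by_getline(in) == expected;
}
```

The comparison with strlen only makes sense when the input contains no embedded null characters. If that holds for your data, std::strlen(buf) + 1 is a simpler way to get the same number.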