Getting std::ifstream to handle LF, CR, and CRLF?

lovepro 2020. 10. 12. 07:59

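Specifically, I'm interested in istream& getline ( istream& is, string& str );. Is there an option for the ifstream constructor that tells it to convert all newline encodings to '\n' internally? I want to be able to call getline and have it handle all line endings gracefully.

Update: To clarify, I want to be able to write code that compiles almost anywhere and accepts input from almost anywhere, including the rare files that have '\r' without '\n'. The goal is to minimize inconvenience for users of the software.

It's easy to work around the problem, but I'm curious about the right way, within the standard, to handle all text file formats flexibly.

getline reads a full line, up to a '\n', into a string. The '\n' is consumed from the stream, but getline doesn't include it in the string. That's fine so far, but there might be a '\r' just before the '\n', and that one does get included in the string.

There are three types of line endings seen in text files: '\n' is the conventional ending on Unix machines, '\r' was (I think) used on old Mac operating systems, and Windows uses a pair, '\r' followed by '\n'.

The problem is that getline leaves the '\r' on the end of the string.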

ifstream f("a_text_file_of_unknown_origin");
string line;
getline(f, line);
if(!f.fail()) { // a non-empty line was read
   // BUT, there might be an '\r' at the end now.
}

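Edit: Thanks to Neil for pointing out that f.good() isn't what I wanted. !f.fail() is what I want.

I can remove it manually myself (see the edit to this question), which is easy for Windows text files. But I'm worried that somebody will feed in a file containing only '\r'. In that case, I presume getline will consume the whole file, thinking that it is a single line!

.. and that's not even considering Unicode :-)

.. maybe Boost has a nice way to consume one line at a time from any text-file type?

Edit: I'm using this to handle Windows files, but I still feel I shouldn't have to! And this won't work for '\r'-only files.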

if(!line.empty() && *line.rbegin() == '\r') {
    line.erase( line.length()-1, 1);
}
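
The two failure modes described in the question can be reproduced without any files. The following is a minimal sketch, assuming a std::istringstream as a stand-in for a file opened without newline translation (for example, a Windows file read on Linux); the helper name readLinesWithGetline is hypothetical and not part of the original question.

```cpp
#include <sstream>
#include <string>
#include <vector>

// Hypothetical helper: splits `content` using plain std::getline,
// which only recognizes '\n' as the line delimiter.
std::vector<std::string> readLinesWithGetline(const std::string& content) {
    std::istringstream in(content);   // no text-mode translation happens here
    std::vector<std::string> lines;
    std::string line;
    while (std::getline(in, line))    // stops when no more characters can be read
        lines.push_back(line);
    return lines;
}
```

With the input "a\r\nb\r\n" this yields two lines, each still carrying a trailing '\r'. With the CR-only input "a\rb\rc" it yields a single line containing the whole text, which is exactly the case the manual erase above cannot repair.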

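As Neil pointed out, "the C++ runtime should deal correctly with whatever the line ending convention is for your particular platform."

However, people do move text files between different platforms, so that is not good enough. Here is a function that handles all three line endings ("\r", "\n" and "\r\n"):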

std::istream& safeGetline(std::istream& is, std::string& t)
{
    t.clear();

    // The characters in the stream are read one-by-one using a std::streambuf.
    // That is faster than reading them one-by-one using the std::istream.
    // Code that uses streambuf this way must be guarded by a sentry object.
    // The sentry object performs various tasks,
    // such as thread synchronization and updating the stream state.

    std::istream::sentry se(is, true);
    std::streambuf* sb = is.rdbuf();

    for(;;) {
        int c = sb->sbumpc();
        switch (c) {
        case '\n':
            return is;
        case '\r':
            if(sb->sgetc() == '\n')
                sb->sbumpc();
            return is;
        case std::streambuf::traits_type::eof():
            // Also handle the case when the last line has no line ending
            if(t.empty())
                is.setstate(std::ios::eofbit);
            return is;
        default:
            t += (char)c;
        }
    }
}

And here is a test program:

int main()
{
    std::string path = ...  // insert path to test file here

    std::ifstream ifs(path.c_str());
    if(!ifs) {
        std::cout << "Failed to open the file." << std::endl;
        return EXIT_FAILURE;
    }

    int n = 0;
    std::string t;
    while(!safeGetline(ifs, t).eof())
        ++n;
    std::cout << "The file contains " << n << " lines." << std::endl;
    return EXIT_SUCCESS;
}
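
For a quick check that does not depend on a file on disk, the same loop can be run over an in-memory stream. This is only a sketch: safeGetline is repeated from the answer above (comments dropped) so the block compiles on its own, and collectLines is a hypothetical helper introduced here for illustration.

```cpp
#include <istream>
#include <sstream>
#include <streambuf>
#include <string>
#include <vector>

// Same function as in the answer above, repeated so this block is self-contained.
std::istream& safeGetline(std::istream& is, std::string& t) {
    t.clear();
    std::istream::sentry se(is, true);
    std::streambuf* sb = is.rdbuf();
    for (;;) {
        int c = sb->sbumpc();
        switch (c) {
        case '\n':
            return is;
        case '\r':
            if (sb->sgetc() == '\n')
                sb->sbumpc();          // swallow the '\n' of a "\r\n" pair
            return is;
        case std::streambuf::traits_type::eof():
            if (t.empty())
                is.setstate(std::ios::eofbit);
            return is;
        default:
            t += static_cast<char>(c);
        }
    }
}

// Hypothetical helper: gathers every line of `content`, using the same
// loop condition as the test program above.
std::vector<std::string> collectLines(const std::string& content) {
    std::istringstream in(content);
    std::vector<std::string> lines;
    std::string t;
    while (!safeGetline(in, t).eof())
        lines.push_back(t);
    return lines;
}
```

Feeding it a string that mixes all three conventions, such as "a\nb\r\nc\rd", gives the four lines a, b, c and d, including the last line that has no terminator.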

The C++ runtime should deal correctly with whatever the endline convention is for your particular platform. Specifically, this code should work on all platforms:

#include <string>
#include <iostream>
using namespace std;

int main() {
    string line;
    while( getline( cin, line ) ) {
        cout << line << endl;
    }
}

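Of course, if you are dealing with files from another platform, all bets are off.

As the two most common platforms (Linux and Windows) both terminate lines with a newline character, with Windows preceding it with a carriage return, you can examine the last character of the line string in the above code to see if it is \r and, if so, remove it before doing your application-specific processing.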

For example, you could provide yourself with a getline style function that looks something like this (not tested, use of indexes, substr etc for pedagogical purposes only):

istream & safegetline( istream & is, string & line ) {
    string myline;
    if ( getline( is, myline ) ) {
       // strip a trailing '\r' left over from a "\r\n" line ending
       if ( myline.size() && myline[myline.size()-1] == '\r' ) {
           line = myline.substr( 0, myline.size() - 1 );
       }
       else {
           line = myline;
       }
    }
    return is;
}

Are you reading the file in BINARY or in TEXT mode? In TEXT mode (on a platform such as Windows) the carriage return/line feed pair, CRLF, is interpreted as a single end-of-line character, but in BINARY mode you fetch ONE byte at a time, which means the two characters arrive separately and the one you do not want must be skipped explicitly.

Carriage return means, on a typewriter, that the carriage, where the printing arm lies, has reached the right edge of the paper and is returned to the left edge. This is a very mechanical model, that of the mechanical typewriter. Then the line feed means that the paper roll is rotated a little bit up so the paper is in position to begin another line of typing. As far as I remember, one of the low codes in ASCII means move one character to the right without typing, the dead char, and of course \b means backspace: move the carriage one character back. That way you can add special effects, like underlining (type underscore), strikethrough (type minus), approximating different accents, or cancelling out (type X), without needing an extended keyboard, just by adjusting the position of the carriage along the line before entering the line feed. So you can use byte-sized ASCII voltages to control a typewriter automatically without a computer in between.

When the automatic typewriter was introduced, AUTOMATIC meant that once you reach the farthest edge of the paper, the carriage is returned to the left AND the line feed applied, that is, the carriage is assumed to be returned automatically as the roll moves up! So you do not need both control characters, only one, the \n, new line, or line feed.

This has nothing to do with programming, but ASCII is older and HEY! looks like some people were not thinking when they began doing text things! The UNIX platform assumes an electrical automatic typewriter; the Windows model is more complete and allows for control of mechanical machines, though some control characters become less and less useful in computers, like the bell character, 0x07 if I remember well... Some forgotten texts must have been originally captured with control characters for electrically controlled typewriters and it perpetuated the model...

Actually the correct variation would be to split on the \r (the carriage return) and treat a \n (the line feed) that follows it as optional, hence:

char c;
ifstream is;
is.open("",ios::binary);
...
is.getline(buffer, bufsize, '\r');

//ignore following \n or restore the buffer data
if ((c=is.get())!='\n') is.rdbuf()->sputbackc(c);
...

would be the most correct way to handle all types of files. Note however that \n in TEXT mode on Windows is actually written as the byte pair 0x0d 0x0a, and 0x0d IS just \r: \n includes \r in TEXT mode but not in BINARY, so \n and \r\n are equivalent... or should be. This is a very basic industry confusion actually, typical industry inertia, as the convention is to speak of CRLF on ALL platforms and then fall into different binary interpretations. Strictly speaking, files that use ONLY 0x0d (carriage return) as the line ending are malformed in TEXT mode (typewriter: just return the carriage and strike through everything...), and are a non-line-oriented binary format (either \n or \r\n meaning line oriented), so you are not supposed to read them as text! The code ought to fail, maybe with some user message. This does not depend on the OS only, but also on the C library implementation, adding to the confusion and possible variations... (particularly for transparent UNICODE translation layers, which add another point of articulation for confusing variations).

The problem with the previous code snippet (mechanical typewriter) is that it is very inefficient if there are no \r characters before the \n (automatic typewriter text), since each read then runs on until it finds one. It also assumes BINARY mode, where the C library is forced to ignore text interpretations (locale) and hand over the raw bytes. There should be no difference in the actual text characters between both modes, only in the control characters, so generally speaking reading in BINARY is better than TEXT mode. This solution is efficient for typical Windows text files read in BINARY mode, independently of C library variations, and inefficient for other platforms' text formats (including web translations into text). If you care about efficiency, the way to go is to use a function pointer: test for \r versus \r\n line controls whichever way you like, then store the best-suited getline routine in the pointer and call through it.
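
The function-pointer idea in the previous paragraph could look roughly like the following. This is a hedged sketch, not the answerer's code: the names LineReader, readLf, readCr, readCrLf and chooseReader are all hypothetical, and it assumes a seekable stream opened in binary mode whose line endings are consistent throughout.

```cpp
#include <istream>
#include <sstream>
#include <string>

// Pointer to a getline-style routine.
using LineReader = std::istream& (*)(std::istream&, std::string&);

std::istream& readLf(std::istream& is, std::string& line) {
    return std::getline(is, line, '\n');
}

std::istream& readCr(std::istream& is, std::string& line) {
    return std::getline(is, line, '\r');
}

std::istream& readCrLf(std::istream& is, std::string& line) {
    // Split on '\n', then drop the '\r' that precedes it.
    if (std::getline(is, line, '\n') && !line.empty() && line.back() == '\r')
        line.pop_back();
    return is;
}

// Scans ahead to the first line ending, picks a reader, then rewinds.
// Falls back to readLf when no line ending is found at all.
LineReader chooseReader(std::istream& is) {
    const std::streampos start = is.tellg();
    LineReader reader = readLf;
    char c;
    while (is.get(c)) {
        if (c == '\n') { reader = readLf; break; }
        if (c == '\r') {
            reader = (is.peek() == '\n') ? readCrLf : readCr;
            break;
        }
    }
    is.clear();        // drop eof/fail flags raised while scanning
    is.seekg(start);   // rewind so reading starts from the original position
    return reader;
}
```

The detection cost is paid once per stream, after which every line is read through the selected routine with no per-character branching. A file that mixes conventions would defeat this scheme, in which case a character-by-character reader is the safer choice.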

Incidentally I remember I found some \r\r\n text files too... which translates into double line text just as is still required by some printed text consumers.


Other than writing your own custom handler or using an external library, you are out of luck. The easiest thing to do is to check that line[line.length() - 1] is not '\r' (after confirming the line is not empty). On Linux, this is superfluous for native files, as most lines will end with just '\n', meaning you'll lose a fair bit of time if this is in a loop. On Windows, in text mode, this is also superfluous. However, what about classic Mac files, which end lines with '\r'? std::getline would not work for those files on Linux or Windows: it splits on '\n', which covers both '\n' and '\r\n' because both end with '\n', but a file that uses a bare '\r' contains no '\n' at all, so checking the last character for '\r' does not help and the whole file comes back as one line. Of course, then there exist the numerous EBCDIC systems, something that most libraries won't dare tackle.

Checking for '\r' is probably the best solution to your problem. Reading in binary mode would allow you to check for all three common line endings ('\r', '\r\n' and '\n'). If you only care about Linux and Windows as old-style Mac line endings shouldn't be around for much longer, check for '\n' only and remove the trailing '\r' character.


One solution would be to first search and replace all line endings to '\n' - just like e.g. Git does by default.
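
As a rough illustration of that normalization step, the sketch below rewrites an in-memory buffer so that "\r\n" and a lone '\r' both become '\n'. The function name normalizeNewlines is hypothetical, and the sketch assumes the whole file has already been read (in binary mode) into a std::string.

```cpp
#include <cstddef>
#include <string>

// Returns a copy of `in` in which every "\r\n" pair and every lone '\r'
// has been replaced by a single '\n'.
std::string normalizeNewlines(const std::string& in) {
    std::string out;
    out.reserve(in.size());
    for (std::size_t i = 0; i < in.size(); ++i) {
        if (in[i] == '\r') {
            out += '\n';
            // Skip the '\n' of a "\r\n" pair so it is not emitted twice.
            if (i + 1 < in.size() && in[i + 1] == '\n')
                ++i;
        } else {
            out += in[i];
        }
    }
    return out;
}
```

After this pass, plain std::getline on a std::istringstream built from the result splits the text correctly regardless of where the file came from, at the cost of holding the file in memory.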

Reference URL: https://stackoverflow.com/questions/6089231/getting-std-ifstream-to-handle-lf-cr-and-crlf
