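Evaluating Different Languages and Approaches Against a File and String Processing Benchmark
This article compares different approaches to loading a large amount of line-based, comma-separated text from a file into data structures and writing that data back out to another file. The performance profiles of the approaches are compared and conclusions are drawn.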
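Introduction
I was interested in comparing the performance of different programming languages, and of different approaches, on a basic programming problem. So I developed applications in C#, C++, and C, and drew conclusions about which language and approach to use in which circumstances.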
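The Benchmark
The problem I came up with is for a program to load an input CSV file into an array of data structures, then use those data structures to write the data back out to an output CSV file. It is your basic input/output data-handling problem, with no processing of the data in between. This is a file I/O and data structure serialization benchmark.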
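I chose Unicode-encoded text files for the CSV data (UTF-16 little-endian, which is what .NET's Encoding.Unicode produces), since C# is a Unicode language and C/C++ can handle Unicode data well at this point. I chose seven data fields per line of CSV text, typical demographic material: first name, last name, address line 1, address line 2, city, state, and zip code. To keep things simple, each field is limited to 127 characters of Unicode text.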
The CSV generation program (the "gen" project in the attached code) outputs 100,000 lines of this random CSV data to a file on the desktop.
// C# script to generate the data used by the test programs
using System.Text;

// These constants are shared with the C program that uses fixed-length fields
const int FIELD_LEN = 128;
const int FIELD_CHAR_LEN = FIELD_LEN - 1;
const int RECORD_COUNT = 100 * 1000;

var rnd = new Random(0); // fixed seed: same output each gen

// Build a random string of capital letters.
// Random.Next excludes its upper bound, so lengths run from 1 to
// FIELD_CHAR_LEN - 1 and the letters run from 'A' to 'Y'.
string rnd_str()
{
    StringBuilder sb = new StringBuilder(FIELD_CHAR_LEN);
    int rnd_len = rnd.Next(1, FIELD_CHAR_LEN);
    for (int r = 1; r <= rnd_len; ++r)
        sb.Append((char)((sbyte)'A' + rnd.Next(0, 25)));
    return sb.ToString();
}

string output_file_path =
    Path.Combine
    (
        Environment.GetFolderPath(Environment.SpecialFolder.Desktop),
        "recordio.csv"
    );
if (output_file_path == null)
    throw new NullReferenceException("output_file_path");

// Encoding.Unicode writes UTF-16 little-endian with a byte order mark
using (var output = new StreamWriter(output_file_path, false, Encoding.Unicode))
{
    for (int r = 1; r <= RECORD_COUNT; ++r)
        output.Write($"{rnd_str()},{rnd_str()},{rnd_str()},{rnd_str()}," +
                     $"{rnd_str()},{rnd_str()},{rnd_str()}\n");
}
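Let's Skip to the Punchline: What Are the Results?
Here are the timings for the load and write phases of the different programs. Each program runs its load/write cycle four times. For each program, I took the best load time and the best write time across all of its runs: the best of the best.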
Approach / Language | Load (ms) | Write (ms) |
Approach 1: C# with loops (net) | 317 | 178 |
Approach 2: C# batch (net2) | 223 | 353 |
Approach 3: C++ loops (class) | 2,489 | 1,379 |
Approach 4: C batch (struct) | 107 | 147 |
Approach 5: C++ batch (class2) | 202 | 136 |
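The C and C++ listings further down call a timer helper (timer t; followed by t.report("...")) whose definition lives in the attached code and is not reproduced in this article. To give a rough idea of what such a helper and the best-of-four selection could look like, here is a minimal sketch. The class body and the best_of function are assumptions made for illustration, not the attached implementation.

```cpp
#include <algorithm>
#include <cassert>
#include <chrono>
#include <cstdio>
#include <vector>

// Hypothetical stand-in for the timer helper used by the C and C++ listings
class timer
{
public:
    timer() : m_start(clock_type::now()) {}

    // Print the milliseconds elapsed since construction or the previous
    // report(), restart the clock, and return the value to the caller
    long long report(const char* label)
    {
        const auto elapsed = clock_type::now() - m_start;
        const long long ms =
            std::chrono::duration_cast<std::chrono::milliseconds>(elapsed).count();
        std::printf("%s: %lld ms\n", label, ms);
        m_start = clock_type::now();
        return ms;
    }

private:
    using clock_type = std::chrono::steady_clock;
    clock_type::time_point m_start;
};

// Best (smallest) time across a set of runs; the vector must not be empty
long long best_of(const std::vector<long long>& run_times_ms)
{
    assert(!run_times_ms.empty());
    return *std::min_element(run_times_ms.begin(), run_times_ms.end());
}
```

A program would collect the value returned by report() for each of its four iterations and pass the collected load times, and separately the write times, to best_of. steady_clock is used here because it never jumps backwards, which makes it a reasonable default for interval timing.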
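Conclusions and Points of Interest
The C# programs, loops and batch, are clean, easy to read, and perform well. The loop program uses StreamReader / StreamWriter and is intuitive and easy to develop and maintain. The batch program uses the File class functions ReadAllLines / WriteAllLines and is much faster than the C# loop program at reading, LINQ and all, and slower at writing. Given this, you'd use ReadAllLines / LINQ for loading and StreamWriter for writing.
One caveat when reading those two net2 numbers: LINQ's Select is lazily evaluated, and the net2 program as measured never materialized it before stopping the load clock. The Split calls and the construction of the info objects therefore ran during the write phase, so the 223 ms load figure covers little more than ReadAllLines itself, and the 353 ms write figure includes the parsing.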
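The big news is that something is very wrong with Approach 3: C++ loops (class). It comes down to the std::getline calls for loading and the stream output for writing; the rest of the code costs little. I'm interested in hearing from anyone who can reproduce these numbers and report back on how to solve these performance problems.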
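The C batch program handily beats the others at loading the data because it uses fixed-size strings packed together in structs that are stored sequentially in a single array, so data locality is excellent. We're reading 90 MB of CSV in about 100 ms. Wow! For some reason, the C program is a little slower than the C++ batch program at writing the data to the output file (147 ms versus 136 ms); the code looks clean, and I'm not sure why.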
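As a sanity check on that 90 MB figure, the expected file size can be worked out from the generator's parameters. The sketch below assumes that field lengths are uniformly distributed over 1 to 126 characters (Random.Next(1, 127) excludes its upper bound) and that each UTF-16 code unit takes 2 bytes. The names expected_csv_bytes, fixed_info, and record_array_bytes are illustrative only.

```cpp
#include <cassert>
#include <cstddef>

// Values taken from the generator listing
constexpr std::size_t FIELD_LEN    = 128;        // code units per field buffer, terminator included
constexpr std::size_t FIELD_COUNT  = 7;
constexpr std::size_t RECORD_COUNT = 100 * 1000;
constexpr std::size_t MIN_CHARS    = 1;          // rnd.Next(1, 127): lower bound is inclusive
constexpr std::size_t MAX_CHARS    = 126;        // rnd.Next(1, 127): upper bound is exclusive

// Expected size in bytes of the UTF-16 CSV file (assumption: uniform field lengths)
constexpr double expected_csv_bytes()
{
    const double avg_field = (MIN_CHARS + MAX_CHARS) / 2.0;   // 63.5 characters
    const double chars_per_line = FIELD_COUNT * avg_field     // field text
                                + (FIELD_COUNT - 1)           // commas
                                + 1;                          // newline
    return RECORD_COUNT * chars_per_line * sizeof(char16_t)
         + 2;                                                 // byte order mark
}

// Same layout idea as the C program's struct: seven fixed-length buffers
struct fixed_info
{
    char16_t fields[FIELD_COUNT][FIELD_LEN];
};

// Bytes occupied by the in-memory array of fixed-length records
constexpr std::size_t record_array_bytes()
{
    return RECORD_COUNT * sizeof(fixed_info);
}
```

Under those assumptions the file comes to roughly 90.3 MB, in line with the figure above, while the array of fixed-length records occupies about 179 MB, roughly twice the size of the file. That is the memory price of the packed layout.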
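The C++ batch program, inspired by the C program, doesn't come close to C at reading the data, but it beats C at writing it out. You'd choose C for reading and C++ for writing.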
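A hybrid of the batch approaches in C and C++ would yield 107 / 136, beating C#'s best of 223 / 178. The difference in write performance isn't significant. The 2X speedup in load performance can't be ignored. Getting there takes a little C: using fixed-length strings packed into the structs of a single array is something that no amount of C++ wstring or C# string storage can match.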
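My overall recommendation is that if you're in C/C++ code and want the best load performance, go with the C load algorithm and the C++ write algorithm, put them into a class with templates for reuse, and have the best of both worlds.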
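To make the "class with templates" idea a little more concrete, here is one possible shape for it. This is a sketch under assumptions, not code from the attached projects: the fixed_record name, its parse and append_to members, and the use of char16_t (rather than the wchar_t the listings use) are all illustrative choices, and it has not been benchmarked.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical reusable record: FieldCount fixed-length UTF-16 fields stored inline,
// so an array of these keeps all the text contiguous in memory
template <std::size_t FieldCount, std::size_t FieldLen>
struct fixed_record
{
    char16_t fields[FieldCount][FieldLen];

    // Parse one comma-separated line (without its newline) into the fields.
    // Missing trailing fields are left empty. Returns false if a field does
    // not fit in FieldLen - 1 characters or if the line has extra fields.
    bool parse(const char16_t* line)
    {
        for (std::size_t f = 0; f < FieldCount; ++f)
        {
            std::size_t len = 0;
            while (*line && *line != u',')
            {
                if (len + 1 >= FieldLen)
                    return false; // no room left for the terminator
                fields[f][len++] = *line++;
            }
            fields[f][len] = u'\0';
            if (*line == u',')
                ++line;
        }
        return *line == u'\0';
    }

    // Append the record to an output string as one CSV line
    void append_to(std::u16string& out) const
    {
        for (std::size_t f = 0; f < FieldCount; ++f)
        {
            out += fields[f];
            out += (f + 1 < FieldCount) ? u',' : u'\n';
        }
    }
};
```

For the benchmark data, the record type would be fixed_record<7, 128>. Loading follows the C program (copy characters into fixed buffers), and writing follows the C++ batch program (append everything to one pre-reserved string and write it in a single call).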
If you're in C# code, use File.ReadAllLines and LINQ to load the data and StreamWriter to write it back out, and get respectable performance with great productivity and safety.
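Here's more information on the different approaches.
Approach 1: C# with Loops
In the attached code, this is the "net" project. This is probably the most intuitive of the languages and approaches. You've got your StreamReader to gather up the data, and your StreamWriter to write it all out: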
// C# performance test program
using System.Diagnostics;
using System.Text;

var sw = Stopwatch.StartNew();
string input_file_path =
    Path.Combine
    (
        Environment.GetFolderPath(Environment.SpecialFolder.Desktop),
        "recordio.csv"
    );
if (input_file_path == null)
    throw new NullReferenceException("input_file_path");

for (int iter = 1; iter <= 4; ++iter)
{
    // Read lines into a list of objects
    var nets = new List<info>();
    using (var input = new StreamReader(input_file_path, Encoding.Unicode))
    {
        while (true)
        {
            string? line = input.ReadLine();
            if (line == null)
                break;
            else
                nets.Add(new info(line.Split(',')));
        }
    }
    Console.WriteLine($".NET load took {sw.ElapsedMilliseconds} ms");
    sw.Restart();

    // Write the objects to an output CSV file
    using (var output = new StreamWriter("output.csv", false, Encoding.Unicode))
    {
        foreach (var cur in nets)
            output.Write($"{cur.firstname},{cur.lastname},{cur.address1},{cur.address2}," +
                         $"{cur.city},{cur.state},{cur.zipcode}\n");
    }
    Console.WriteLine($".NET write took {sw.ElapsedMilliseconds} ms");
    sw.Restart();
}

// NOTE: Using struct did not change performance, probably because the
//       contents of the strings are not stored consecutively, so
//       any data locality with the array of info objects is irrelevant
class info
{
    public info(string[] parts)
    {
        firstname = parts[0];
        lastname = parts[1];
        address1 = parts[2];
        address2 = parts[3];
        city = parts[4];
        state = parts[5];
        zipcode = parts[6];
    }

    public string firstname;
    public string lastname;
    public string address1;
    public string address2;
    public string city;
    public string state;
    public string zipcode;
}
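Approach 2: C# Batch
In the attached source, this is the "net2" project. You might say to yourself, "Loops are tedious. I can use the File class function ReadAllLines to do things in bulk. In .NET I trust!" It's certainly less code...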
// C# performance test program
// (the info class is the same one shown in Approach 1)
using System.Diagnostics;
using System.Text;

var sw = Stopwatch.StartNew();
string input_file_path =
    Path.Combine
    (
        Environment.GetFolderPath(Environment.SpecialFolder.Desktop),
        "recordio.csv"
    );
if (input_file_path == null)
    throw new NullReferenceException("input_file_path");

for (int iter = 1; iter <= 4; ++iter)
{
    // Read CSV file into a list of objects.
    // ToList() forces the lazy Select to run here, so that parsing is timed
    // as part of the load. The program as measured omitted it, which pushed
    // the Split calls and info construction into the write phase below.
    var nets =
        File.ReadAllLines(input_file_path, Encoding.Unicode)
        .Select(line => new info(line.Split(',')))
        .ToList();
    Console.WriteLine($".NET 2 load took {sw.ElapsedMilliseconds} ms");
    sw.Restart();

    // Write the objects to an output CSV file.
    // WriteAllLines appends the line ending itself, so the strings carry no "\n".
    string[] strs = new string[nets.Count];
    int idx = 0;
    foreach (var cur in nets)
        strs[idx++] = $"{cur.firstname},{cur.lastname},{cur.address1},{cur.address2}," +
                      $"{cur.city},{cur.state},{cur.zipcode}";
    File.WriteAllLines("output.csv", strs, Encoding.Unicode);
    Console.WriteLine($".NET 2 write took {sw.ElapsedMilliseconds} ms");
    sw.Restart();
}
Approach 3: C++ Loops
C++ has come a long way with Unicode file I/O and streams. It's now easy to write C++ code that rivals C#'s loop-based Approach 1 for readability and simplicity:
// C++ loop performance test program
#include <codecvt> // note: deprecated since C++17
#include <fstream>
#include <iostream>
#include <locale>
#include <string>
#include <vector>

// FIELD_COUNT, FIELD_CHAR_LEN, and the timer class are not defined in this
// listing; they are assumed to come from a header in the attached code

// Our record type, just a bunch of wstrings
struct info
{
    info(const std::vector<std::wstring>& parts)
        : firstname(parts[0])
        , lastname(parts[1])
        , address1(parts[2])
        , address2(parts[3])
        , city(parts[4])
        , state(parts[5])
        , zipcode(parts[6])
    {
    }

    std::wstring firstname;
    std::wstring lastname;
    std::wstring address1;
    std::wstring address2;
    std::wstring city;
    std::wstring state;
    std::wstring zipcode;
};

// Split a string by a separator, returning a vector of substrings.
// An empty final field is dropped, which is fine for the generated data
// because every field holds at least one character.
std::vector<std::wstring> split(const std::wstring& str, const wchar_t separator)
{
    std::vector<std::wstring> retVal;
    retVal.reserve(FIELD_COUNT); // cheat...

    std::wstring acc;
    acc.reserve(FIELD_CHAR_LEN); // ...a little

    for (wchar_t c : str)
    {
        if (c == separator)
        {
            retVal.push_back(acc);
            acc.clear();
        }
        else
            acc.push_back(c);
    }

    if (!acc.empty())
        retVal.push_back(acc);

    return retVal;
}
int main(int argc, char* argv[])
{
    if (argc < 2)
    {
        std::cout << "Pass the input CSV file path as the first argument\n";
        return 1;
    }

    timer t;
    for (int iter = 1; iter <= 4; ++iter)
    {
        // Read the file into a vector of line strings
        std::vector<std::wstring> lines;
        {
            std::wifstream input(argv[1], std::ios::binary);
            input.imbue(std::locale(input.getloc(),
                new std::codecvt_utf16<wchar_t, 0x10ffff,
                    std::codecvt_mode(std::consume_header | std::little_endian)>));
            if (!input)
            {
                std::cout << "Opening input file failed\n";
                return 1;
            }

            std::wstring line;
            while (std::getline(input, line))
                lines.push_back(line);
        }

        // Process the lines into a vector of structs
        std::vector<info> infos;
        infos.reserve(lines.size());
        for (const auto& line : lines)
            infos.emplace_back(split(line, ','));
        t.report("class load ");

        // Write the structs to an output CSV file
        {
            std::wofstream output("output.csv", std::ios::binary);
            output.imbue(std::locale(output.getloc(),
                new std::codecvt_utf16<wchar_t, 0x10ffff,
                    std::codecvt_mode(std::generate_header | std::little_endian)>));
            if (!output)
            {
                std::cout << "Opening output file failed\n";
                return 1;
            }

            for (const auto& record : infos)
            {
                output
                    << record.firstname << ','
                    << record.lastname << ','
                    << record.address1 << ','
                    << record.address2 << ','
                    << record.city << ','
                    << record.state << ','
                    << record.zipcode << '\n';
            }
        }
        t.report("class write");
    }
    return 0;
}
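For the file load step, a quick timing check shows that all the time is spent in the std::getline() calls. And for the file write step, all the time is spent in the output loop; where else would it be? This mystery is left as an exercise for the reader. What's wrong with this simple code?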
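One experiment worth trying, offered as a hypothesis and not as a measured fix, is to take the locale facet out of the loop entirely: read the file's raw bytes in one go and decode the UTF-16LE code units by hand. The function name decode_utf16le_lines and the choice of std::u16string are assumptions made for this sketch, and it only handles '\n' line endings.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Decode a UTF-16LE byte buffer into lines without any iostream locale facet.
// Assumes '\n' line endings; a trailing odd byte is ignored.
std::vector<std::u16string> decode_utf16le_lines(const std::string& bytes)
{
    std::vector<std::u16string> lines;
    std::u16string cur;
    std::size_t pos = 0;

    // Skip the byte order mark (FF FE) if present
    if (bytes.size() >= 2 &&
        static_cast<unsigned char>(bytes[0]) == 0xFF &&
        static_cast<unsigned char>(bytes[1]) == 0xFE)
        pos = 2;

    for (; pos + 1 < bytes.size(); pos += 2)
    {
        // Little-endian: low byte first, then high byte
        const char16_t unit = static_cast<char16_t>(
            static_cast<unsigned char>(bytes[pos]) |
            (static_cast<unsigned char>(bytes[pos + 1]) << 8));

        if (unit == u'\n')
        {
            lines.push_back(cur);
            cur.clear();
        }
        else
            cur.push_back(unit);
    }

    if (!cur.empty())
        lines.push_back(cur); // last line without a trailing newline

    return lines;
}
```

The byte buffer could be filled by opening the file with std::ifstream in std::ios::binary mode and reading it whole, much as Approach 5 does. Whether this closes the gap to the other programs would need to be measured.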
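Approach 4: C Batch
If we're willing to load the entire text into memory, then we can slice and dice the data in place, taking advantage of fixed-length string buffers and character-level string operations that exploit data locality and unsafe operations. What fun!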
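// C-style batch test program. It is compiled as C++ because it uses
// references, nullptr, and std::max. The timer class comes from the
// attached code and is not shown here.
#include <stdio.h>
#include <stdlib.h>
#include <stdint.h>   // int64_t
#include <wchar.h>    // wcslen
#include <algorithm>  // std::max

// The input is UTF-16, so this code relies on a 2-byte wchar_t, as on Windows
static_assert(sizeof(wchar_t) == 2, "wchar_t must be 2 bytes to hold UTF-16 code units");

const size_t FIELD_LEN = 128;
const size_t FIELD_CHAR_LEN = FIELD_LEN - 1;
const size_t FIELD_COUNT = 7;
// Maximum length of one output record, in characters (not bytes)
const size_t RECORD_LEN = std::max(FIELD_COUNT * FIELD_LEN + 1, size_t(1024));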
// Struct with fixed char array fields
struct info
{
    wchar_t firstname[FIELD_LEN];
    wchar_t lastname[FIELD_LEN];
    wchar_t address1[FIELD_LEN];
    wchar_t address2[FIELD_LEN];
    wchar_t city[FIELD_LEN];
    wchar_t state[FIELD_LEN];
    wchar_t zipcode[FIELD_LEN];
};

// Read a comma-delimited string out of a buffer, advancing the input pointer.
// There is no bounds check: the field must fit in FIELD_CHAR_LEN characters.
void read_str(const wchar_t*& input, wchar_t* output)
{
    while (*input && *input != ',')
        *output++ = *input++;
    *output = '\0';

    if (*input == ',')
        ++input;
}

// Initialize a record using a buffer of text
void set_record(info& record, const wchar_t* buffer)
{
    read_str(buffer, record.firstname);
    read_str(buffer, record.lastname);
    read_str(buffer, record.address1);
    read_str(buffer, record.address2);
    read_str(buffer, record.city);
    read_str(buffer, record.state);
    read_str(buffer, record.zipcode);
}

// Copy one field plus its separator into the output buffer,
// returning the position just past what was written
wchar_t* add_to_buffer(const wchar_t* input, wchar_t* output, wchar_t separator)
{
    while (*input)
        *output++ = *input++;

    *output++ = separator;

    return output;
}

// Output a record to a buffer of text, returning the number of characters written
int64_t output_record(const info& record, wchar_t* buffer)
{
    const wchar_t* original = buffer;

    buffer = add_to_buffer(record.firstname, buffer, ',');
    buffer = add_to_buffer(record.lastname, buffer, ',');
    buffer = add_to_buffer(record.address1, buffer, ',');
    buffer = add_to_buffer(record.address2, buffer, ',');
    buffer = add_to_buffer(record.city, buffer, ',');
    buffer = add_to_buffer(record.state, buffer, ',');
    buffer = add_to_buffer(record.zipcode, buffer, '\n');

    return buffer - original;
}
int main(int argc, char* argv[])
{
    if (argc < 2)
    {
        printf("Pass the input CSV file path as the first argument\n");
        return 1;
    }

    timer t;
    for (int iter = 1; iter <= 4; ++iter)
    {
        // Open input file (fopen_s is the Microsoft CRT's checked fopen)
        FILE* input_file = nullptr;
        if (fopen_s(&input_file, argv[1], "rb") != 0)
        {
            printf("Opening input file failed\n");
            return 1;
        }

        // Compute file length in bytes
        fseek(input_file, 0, SEEK_END);
        long file_len = ftell(input_file);
        fseek(input_file, 0, SEEK_SET);
        if (file_len < 0)
        {
            printf("Measuring input file failed\n");
            return 1;
        }

        // Read file into memory, leaving room for a terminating null character
        wchar_t* file_contents = (wchar_t*)malloc(file_len + sizeof(wchar_t));
        if (file_contents == nullptr)
        {
            printf("Allocating input buffer failed\n");
            return 1;
        }
        if (fread(file_contents, file_len, 1, input_file) != 1)
        {
            printf("Reading input file failed\n");
            return 1;
        }
        size_t char_len = file_len / sizeof(wchar_t);
        file_contents[char_len] = '\0';
        fclose(input_file);
        input_file = nullptr;

        // Compute record count and delineate the line strings
        size_t record_count = 0;
        for (size_t idx = 0; idx < char_len; ++idx)
        {
            if (file_contents[idx] == '\n')
            {
                ++record_count;
                file_contents[idx] = '\0';
            }
        }

        // Allocate record array
        info* records = (info*)malloc(record_count * sizeof(info));
        if (records == nullptr)
        {
            printf("Allocating records list failed\n");
            return 1;
        }

        // Process memory text into records
        wchar_t* cur_str = file_contents;
        wchar_t* end_str = cur_str + char_len;
        size_t record_idx = 0;
        while (cur_str < end_str)
        {
            set_record(records[record_idx++], cur_str);
            cur_str += wcslen(cur_str) + 1;
        }
        if (record_idx != record_count)
        {
            printf("Record counts differ: idx: %d - count: %d\n",
                   (int)record_idx, (int)record_count);
            return 1;
        }
        t.report("struct load ");

        // Write output file. RECORD_LEN counts characters, so the byte size
        // of the buffer must be scaled by sizeof(wchar_t).
        wchar_t* file_output =
            (wchar_t*)malloc(record_count * RECORD_LEN * sizeof(wchar_t));
        if (file_output == nullptr)
        {
            printf("Allocating file output buffer failed\n");
            return 1;
        }
        size_t output_len = 0;
        for (size_t r = 0; r < record_count; ++r)
        {
            int64_t new_output = output_record(records[r], file_output + output_len);
            if (new_output < 0)
            {
                printf("Writing to output buffer failed\n");
                return 1;
            }
            output_len += new_output;
        }
        FILE* output_file = nullptr;
        if (fopen_s(&output_file, "output.csv", "wb") != 0)
        {
            printf("Opening output file failed\n");
            return 1;
        }
        if (fwrite(file_output, output_len * sizeof(wchar_t), 1, output_file) != 1)
        {
            printf("Writing output file failed\n");
            return 1;
        }
        fclose(output_file);
        output_file = nullptr;
        t.report("struct write");

        // Clean up
        free(file_contents);
        file_contents = nullptr;

        free(records);
        records = nullptr;

        free(file_output);
        file_output = nullptr;
    }
    return 0;
}
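The read_str function above trusts every field to fit in FIELD_LEN characters, which holds for the generated data but not for arbitrary input. A bounded variant might look like the following sketch. The name read_str_bounded and its capacity parameter are hypothetical, char16_t stands in for the 2-byte wchar_t that the listing assumes, and the cost of the extra check per character has not been timed.

```cpp
#include <cassert>
#include <cstddef>

// Bounded variant of read_str: stores at most capacity - 1 characters plus a
// terminator in output. The input pointer is always advanced past the whole
// field and its comma, so parsing of the following fields stays in sync.
// Returns false if the field had to be truncated.
bool read_str_bounded(const char16_t*& input, char16_t* output, std::size_t capacity)
{
    assert(capacity > 0);

    std::size_t copied = 0;
    bool fits = true;
    while (*input && *input != u',')
    {
        if (copied + 1 < capacity)
            output[copied++] = *input;
        else
            fits = false; // drop characters that do not fit
        ++input;
    }
    output[copied] = u'\0';

    if (*input == u',')
        ++input;

    return fits;
}
```

A caller could treat a false return as a malformed record and stop, which keeps the fixed-length layout while removing the risk of writing past the end of a field buffer.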
Approach 5: C++ Batch
Maybe that pile of C left a bad taste in your mouth. Can we apply the same batch approach in C++?
// C++ batch test program
// (the timer class comes from the attached code and is not shown here)
#include <fstream>
#include <iostream>
#include <string>
#include <vector>

// Our record type, just a bunch of wstrings
struct info
{
    info(const std::vector<wchar_t*>& parts)
        : firstname(parts[0])
        , lastname(parts[1])
        , address1(parts[2])
        , address2(parts[3])
        , city(parts[4])
        , state(parts[5])
        , zipcode(parts[6])
    {
    }

    std::wstring firstname;
    std::wstring lastname;
    std::wstring address1;
    std::wstring address2;
    std::wstring city;
    std::wstring state;
    std::wstring zipcode;
};

// Split a null-terminated buffer in place: each separator is overwritten with
// a null character, and a pointer to the start of each part is collected.
// Nothing is added after a separator that ends the buffer.
void parse_parts(wchar_t* buffer, wchar_t separator, std::vector<wchar_t*>& ret_val)
{
    ret_val.clear();

    ret_val.push_back(buffer); // start at the beginning

    while (*buffer)
    {
        if (*buffer == separator)
        {
            *buffer = '\0';
            if (*(buffer + 1))
                ret_val.push_back(buffer + 1);
        }
        ++buffer;
    }
}
int main(int argc, char* argv[])
{
    if (argc < 2)
    {
        std::cout << "Pass the input CSV file path as the first argument\n";
        return 1;
    }

    timer t;
    for (int iter = 1; iter <= 4; ++iter)
    {
        // Read the file into memory
        std::vector<char> file_contents;
        {
            std::ifstream file(argv[1], std::ios::binary | std::ios::ate);
            if (!file)
            {
                std::cout << "Opening input file failed\n";
                return 1;
            }
            std::streamsize size = file.tellg();
            file.seekg(0, std::ios::beg);
            // resize() zero-fills, so the two extra bytes null-terminate the text
            file_contents.resize(static_cast<size_t>(size) + 2);
            if (!file.read(file_contents.data(), size))
            {
                std::cout << "Reading file failed\n";
                return 1;
            }
        }

        // Get the lines out of the data.
        // The cast treats the bytes as 2-byte wchar_t characters, as on Windows.
        std::vector<wchar_t*> line_pointers;
        parse_parts(reinterpret_cast<wchar_t*>(file_contents.data()), '\n', line_pointers);

        // Process the lines into data structures
        std::vector<info> infos;
        infos.reserve(line_pointers.size());
        std::vector<wchar_t*> line_parts;
        for (wchar_t* line : line_pointers)
        {
            parse_parts(line, ',', line_parts);
            infos.emplace_back(line_parts);
        }
        t.report("C++ 2 load");

        // Write the structs to an output CSV file
        std::wstring output_str;
        output_str.reserve(file_contents.size() / 2);
        for (const auto& record : infos)
        {
            output_str += record.firstname;
            output_str += ',';
            output_str += record.lastname;
            output_str += ',';
            output_str += record.address1;
            output_str += ',';
            output_str += record.address2;
            output_str += ',';
            output_str += record.city;
            output_str += ',';
            output_str += record.state;
            output_str += ',';
            output_str += record.zipcode;
            output_str += '\n';
        }
        std::ofstream output_file("output.csv", std::ios::binary);
        if (!output_file)
        {
            std::cout << "Opening output file failed\n";
            return 1;
        }
        output_file.write(reinterpret_cast<const char*>(output_str.c_str()),
                          output_str.size() * sizeof(wchar_t));
        output_file.close();
        t.report("C++ 2 write");
    }
    return 0;
}
And that's that! Looking forward to the comments!