I want to keep a list of files of the SD card in RAM. So a kind of dictionary. The files can be in an arbitrary folders (or subfolders) in the SD card. In the code I want to quickly check if a given file name (with path) exists in the SD card.
Instead of storing a list of all file names and paths of the SD card in RAM, I was thinking of storing only a hash of each filepath to save RAM.
Do you know a fast and simple algorithm to convert a file path to a hash? The resulted hash id should be max 32-bit if possible.
Xorshift32 is guaranteed to have a full 32-bit period if used as the state transition function of a PRNG.
The generic algorithm has no guarantees, best measure it to make sure you’re happy with it.
(I used this hashing/checksum algorithm to detect when the EEPROM data had been corrupted.)
I was wondering if there was an easy-ish way to convert a const char* string to a 32-bit hash at compile time using constexpr.
I figured the almighty @Pharap would know a good way to do it. The idea is to be able to do the following with my DataPack library:
DataPack dataFile;
//the DataPack::getPackedFile would ideally take a 32-bit hash
//or convert the string at compile-time and call the appropriate overloaded function
DataPack::PackedFile packedFile = dataFile.getPackedFile("/path/to/file");
//I would also like to do the following
const uint32_t dataFiles[] =
{
DataPack::getHash("/path/to/file1"),
DataPack::getHash("/path/to/file2")
//...
};
The idea is to be able to call a function passing a path as a string constant and have it converted to a hash at compile time to save space. Also to store an array of files (so they can be referenced via an index) in flash space to save on RAM.
(Or using namespace DataPack; if they really wanted to bring in everything, including the kitchen sink. :P)
xorshift32 is guaranteed to be good for avoiding clashes because George Marsaglia says so and he knows more about that kind of stuff than I do. :P
The byte combining algorithm doesn’t come with that guarantee because I wrote it and I’m not a mathematician so you may want to run some tests, but generally it ought to be decent.
Be warned that no algorithm is going to guarantee no clashes simply due to the pigeon hole principle.
You’ll probably need to think of a way to handle clashes eventually.
One possibility, although not necessarily an easy one, is a compile-time hash table.
Sounds like a job for a file allocation table to me.
I.e. an array of data offsets/pointers.
This is really great, and pretty much exactly what I was looking for.
That is much appreciated as I’ll be able to more easily incorporate that into the library with less work.
Clashes will be easily handled by the data packing tool (the tool that creates the pack). The idea is the tool will create the hashes and either error out or at least warn of clashes. Then on the Pokitto side the library just needs to know the path and the hash will be calculated at compile time.
The alternative I have now is something like this for a start (the packer either embeds the file table in the file header, or spits out a c++ header with the following for now):
While arguably all files could be just coded with a resource ID (even using uin8_t for extra savings if there’s fewer than 256 files), but that gets harder to program and even harder to read (ie. dataPack.getPackedFile(idNumber) which is provided, but mostly for iteration convenience).
Ideally I would like the files to be able to be repacked (like if some changes were made) without having to recompile the code (thus the need for storing the file offsets inside the file itself). While a compile-time hashing function might not be considered “trivial” for some it makes the user-end portion far simpler. Also there’s the benefit of more readable code, like dataPack.getPackedFile("/path/to/file/as/string") could be a constexpr function that calls the overload dataPack.getPackedFile(hashValue) using the hash function. IMHO this would make the user’s code really easy to read/write and put all the complexity behind the curtains (fully accessible to the curious, but otherwise hidden from view).
Anyway, thanks a bunch for the help, I’ll have to implement this soon and get the DataPack library and packing tool on github ASAP.
Then users could do dataPack.getPackedFile(MyGameData::dataFile1).
Effectively you’d be replacing URLs encoded in string literals with name constants, which has the added advantages of:
You’ll know you’re getting them wrong at compile time when the editor tells you “this identifier doesn’t exist”
The various identifiers will show up in an editor’s autocomplete
If each constant represents an index into an allocation table stored in the file then there would be no need to recompile the code.
Also if you really want them readable you could produce nested namespaces such that the user writes something like dataPack.getPackedFile(files::path::to::file::as::string);.
The big difference here is that rather than getting a pseudorandom value (i.e. a hash) from a fixed value and being at the mercy of the hash, you get to dictate what the value is.
Of if you’re feeling particularly crazy you can overload the / operator and do some type magic so you end up writing dataPack.getPackedFile(files/path/to/file/as/string).
(If you want to see this kind of madness in action, have a peek at this, especially the examples.)
It’s nicer to look at but more cumbersome to write the boilerplate for. :P
(If you’re interested in any of these alternatives feel free to PM me -
I don’t want to clog up the hash function thread too much.)
Thank you @Pharap for all the info and help. I now have a fully functional v0.1 of my DataPack utility, just need to create an example demo project and put it up on github for others to use. Will create a dedicated topic for it once I’ve got it on github.
I went with this format for referencing files (snippet from my PreludeToADream game):
With this method it’s easy enough to create an array of all files, or separate arrays like I’ve done for individual types of resources (music, tilesets, etc.).
I don’t know if I’m crazy enough to go to that extent :P
Also clashes are handled by the packing utility, but I’ll explain a little more detail on using the packing tool, and using the DataPack library in the readme on github once I get it ready.