EmuCR:OpenMSX OpenMSX Git (2016/08/21) is compiled. OpenMSX is an open source MSX emulator which is free according to the Debian Free Software Guidelines, available under the GNU General Public License. For copyright reasons the emulator cannot be distributed with original BIOS ROM images. OpenMSX includes C-BIOS, a minimal implementation of the MSX BIOS, which allows you to play quite a few games without needing an original MSX BIOS ROM image. You can also use your own BIOS ROM image if you please.

OpenMSX Git Changelog:
* Fix (?) build for MSVC after adding DeltaBlock and TrackedRam
* [12/12] Added memory usage tracking
Like the previous patch, the code in this patch is disabled by default.
This patch adds detailed tracking of the memory usage for all DeltaBlocks
(ignoring allocation overhead). This feature together with the 'reverse debug'
functionality allowed me to measure that the new reverse snapshot system indeed
uses 2x-5x less memory compared to the old one.
* [11/12] Added (optional) debug checks
The last two patches in this series don't actually change anything. They only
add some code that is disabled by default. Though at some point (during
development) they were useful to me. So I prefer to have them in git history
(even if in the future we will remove these patches again).
This patch adds various checks for delta compression. E.g. it checks that
decompressing a delta-compressed block perfectly restores the initial block or
that null-compression is only done when the memory block indeed hasn't changed.
* [10/12] Optimize YMF278 sample ram serialize
Similar to the previous patches: Moonsound sample RAM can be large (640kB in
boosted machines, but can be up to 2MB) and doesn't change that often. So it's
a good fit for dirty tracking.
* [9/12] Switch YMF278 MemBuffer to Ram
Change the type of the YMF278::ram object from MemBuffer to Ram. This is mostly
a preparation for the next patch. But as a bonus it makes the YMF278 sample ram
debuggable (the full YMF278 memory and the ROM part were already debuggable).
* [8/12] Optimize Y8950Adpcm serialize if unchanged
MSX-Audio sample ram can be large (256kB in the boosted machines) and it
doesn't change that often (samples are loaded, then used for a relatively long
time). So it also benefits from dirty tracking and efficient serialize.
* [7/12] Optimize V9990VRAM serialize if unchanged
Gfx9000 has a relatively large amount of memory (512kB) and it is included in
our boosted machines. But MSX software for V9990 is relatively rare, so the
V9990VRAM content often doesn't change at all. This makes it a good candidate
for dirty tracking and null-serialization if not-dirty.
But even in case the user is running V9990 software with frequent VRAM writes
(of course V9990 commands also count as writes), the added overhead of
V9990VRAM dirty tracking should be negligible compared to the code surrounding
the write.
* [6/12] Optimize SRAM serialization if not changed
The 'KonamiUltimateCollection' example was already used a few times in this
patch series. Very often the flashrom content doesn't change between reverse
snapshots. And we don't want to spend much time (and memory) storing this
(often unchanged) flashrom in the snapshot.
The previous few patches already made some preparations. This patch actually
optimizes reverse snapshots for flashroms in case they didn't change.
Basically the only non-trivial thing this patch does is to replace indirect
writes (write via a pointer) with explicit write() calls.
Actually as an internal implementation detail our 'AmdFlash' class uses a
'SRAM' object for the actual storage. In general most SRAMs are not written
that frequently, so they all benefit from the more efficient serialization when
not dirty. And even when they are written frequently (e.g. in some specific
period), dirty-tracking is not that expensive, so it also doesn't hurt.
* [5/12] added TrackedRam wrapper around Ram
The previous patch added infrastructure to efficiently store a null-difference
in snapshots in case a memory block hasn't changed. This patch adds a helper
class that can actually detect when a Ram block hasn't changed.
This patch adds a 'TrackedRam' class. It offers almost the same interface as
the existing 'Ram' class, but it has an internal dirty-flag that keeps track of
whether there were any writes since the last reverse snapshot. So compared to
'Ram' it has a little overhead on each write operation.
IMHO an important property of this class is that when you change some existing
code from using 'Ram' to 'TrackedRam' it's impossible (or at least very hard)
to accidentally change the content of the Ram without also setting the internal
dirty flag. In case you do make a mistake (e.g. you miss some (indirect) write
when changing from Ram to TrackedRam) you get a compile error instead of wrong
runtime behavior.
Later patches in this series will actually start using this helper class.
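
To illustrate the compile-error property described above, here is a minimal
sketch of a dirty-tracking wrapper. It is not the actual openMSX 'TrackedRam'
class: the name 'TrackedRamSketch' and the members 'isDirty()' and
'clearDirty()' are assumptions made for this example. The idea is that reads
only hand out const access, so the only way to modify the content is write(),
which always sets the flag.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical sketch, NOT the real openMSX TrackedRam implementation.
class TrackedRamSketch {
public:
    explicit TrackedRamSketch(std::size_t size) : data(size, 0) {}

    // Reads return a const reference, so 'ram[addr] = x' does not compile.
    const std::uint8_t& operator[](std::size_t addr) const {
        assert(addr < data.size());
        return data[addr];
    }

    // The only way to change the content; it always marks the block dirty.
    void write(std::size_t addr, std::uint8_t value) {
        assert(addr < data.size());
        dirty = true;
        data[addr] = value;
    }

    std::size_t size() const { return data.size(); }
    bool isDirty() const { return dirty; }

    // Assumed to be called right after a reverse snapshot was taken.
    void clearDirty() { dirty = false; }

private:
    std::vector<std::uint8_t> data;
    bool dirty = true; // conservatively dirty until the first snapshot
};
```

With this shape, code that used to write through a pointer or a non-const
reference fails to build until it is changed to call write(), which matches
the behavior described above.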
* [4/12] extend serialize_blob() with no-changes flag
The previous patch explained how delta-compression loops over 2 input buffers.
In case of the 'KonamiUltimateCollection' mapper this means looping over 2 8MB
buffers only to conclude that both buffers are still identical (in the vast
majority of the cases). From a memory cache perspective this is not a good
idea. And, also explained in the previous patch, memory access is often very
important for good performance.
If we have an alternative mechanism that can tell us the memory block hasn't
changed since the last snapshot we can avoid the memory scan. And then we can
directly store a null-difference in the snapshot. (Actually we don't store a
null-difference, instead we simply repeat the previously stored DeltaBlock).
The details of how to detect that blocks haven't changed are left for later
patches in this series. Instead this patch focusses on:
- How to actually store null-differences (e.g. we don't only need to keep track
of a reference delta-block but also of the last delta-block). We also need to
be able to indicate to the serializer that a block hasn't changed.
- What does the 'previous snapshot' mean exactly. In-memory savestates are made
for more than one reason (e.g. loading a replay-file and actual reverse
snapshots). Only the latter should be taken into account when defining the
'previous snapshot'.
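
The two points above can be sketched as follows. This is only an illustration
under assumptions, not the real serialize_blob() code: the names
'DeltaBlockSketch' and 'BlobSerializerSketch' and the two boolean parameters
are made up for this example. It shows a block that is flagged as unchanged
being stored by repeating the previously stored block, and a non-reverse
savestate neither using nor updating that 'previous' block.

```cpp
#include <cstdint>
#include <memory>
#include <vector>

// Hypothetical stand-in for a stored (possibly delta-compressed) block.
struct DeltaBlockSketch {
    std::vector<std::uint8_t> payload;
};

// Hypothetical sketch, NOT the real openMSX serializer.
class BlobSerializerSketch {
public:
    using BlockPtr = std::shared_ptr<const DeltaBlockSketch>;

    BlockPtr serialize(const std::vector<std::uint8_t>& data,
                       bool unchanged, bool reverseSnapshot) {
        if (!reverseSnapshot) {
            // E.g. loading a replay file: does not define the
            // 'previous snapshot', so 'last' is left untouched.
            return make(data);
        }
        if (unchanged && last) {
            // Null-difference: repeat the previously stored block
            // without scanning the memory at all.
            return last;
        }
        last = make(data);
        return last;
    }

private:
    static BlockPtr make(const std::vector<std::uint8_t>& data) {
        return std::make_shared<DeltaBlockSketch>(DeltaBlockSketch{data});
    }

    BlockPtr last; // block stored by the last reverse snapshot
};
```

Sharing the block through a reference-counted pointer means an unchanged 8MB
buffer costs one pointer per snapshot in this sketch.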
* [3/12] Optimize delta calculation
A sub goal of this patch series is to improve the speed of the reverse system.
Before this patch this series actually made reverse slower (2%-50%(!), of
course very much depending on how/what you measure). With this patch I measured
a 4%-10% speedup. To be honest the speedups are not only the result of this
patch but also of some future patches in this series, but this patch had the
largest impact.
The core of the delta compression is a loop like this
while ((q != end) && (*p == *q)) { ++p; ++q; }
It searches in 2 byte-buffers for the first difference or until the end of the
buffer is reached. There is a similar loop that searches for equal bytes. These
loops can be sped up in two ways:
- Use sentinels to avoid the end-of-buffer check on each iteration.
- Compare words- instead of bytes-at-a-time.
See comments in the code for much more details.
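
As a rough illustration of both ideas (not the code from this patch), the
sketch below searches for the first difference in two ways. The function names
are made up for this example. The sentinel variant assumes both buffers have
one extra byte at index n and that this byte in 'b' may be overwritten
temporarily; whether openMSX arranges its buffers like this is an assumption
here.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>

// Word-at-a-time: returns the index of the first differing byte, or n.
std::size_t firstDifference(const std::uint8_t* a, const std::uint8_t* b,
                            std::size_t n) {
    std::size_t i = 0;
    // Compare 8 bytes per iteration; memcpy avoids unaligned-access issues.
    for (; i + sizeof(std::uint64_t) <= n; i += sizeof(std::uint64_t)) {
        std::uint64_t wa, wb;
        std::memcpy(&wa, a + i, sizeof(wa));
        std::memcpy(&wb, b + i, sizeof(wb));
        if (wa != wb) break; // the difference is somewhere in this word
    }
    // Locate the exact byte (also handles the tail shorter than one word).
    while (i < n && a[i] == b[i]) ++i;
    return i;
}

// Sentinel: no end-of-buffer check inside the loop.
// ASSUMPTION: a[n] is readable and b[n] is writable.
std::size_t firstDifferenceSentinel(const std::uint8_t* a, std::uint8_t* b,
                                    std::size_t n) {
    const std::uint8_t saved = b[n];
    b[n] = static_cast<std::uint8_t>(~a[n]); // force a mismatch at index n
    std::size_t i = 0;
    while (a[i] == b[i]) ++i; // guaranteed to stop at index n at the latest
    b[n] = saved;             // restore the overwritten byte
    return i;
}
```

Both functions return n when the first n bytes are identical.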
I measured that the optimized delta compression is indeed faster than the old
isolated blocks compression. But let's look at both from a distance:
Suppose we need to add a large block to the snapshot.
* When compressing isolated blocks, the compression routine reads the whole
input and produces some output. Some parts of the input might be read
multiple times, but let's assume the compression routine is well optimized
for memory caching and those secondary reads all come from L1 cache.
* When using delta compression we compare two buffers and produce some output.
Every byte of each input buffer is read exactly once. But still this routine
needs to read twice as much data.
Depending on how compressible the input data is and how much it changed
compared to the previous snapshot, the size of the produced output can greatly
vary in size. But for simplicity let's ignore the output part, because
typically it's anyway an order of magnitude smaller than the input.
Very often memory bandwidth is the most important factor when optimizing code,
and from that perspective compressing isolated blocks seems better: it only
touches half the amount of memory. On the other hand delta compression has a
very simple memory access pattern (linear scan) while regular compression also
does some backwards memory reads (but only in a limited window).
When comparing both compression routines, delta- looks vastly simpler than
regular compression (we currently use 'snappy' as our compression algorithm,
but algorithms like 'LZ4' behave very similarly). But still it only runs a little
faster (and the unoptimized version actually ran slower). So possibly this only
modest improvement can be explained by cache effects. There also is still room
to improve delta compression further, while snappy has reached its limits (or
is close to it).
* [2/12] Compress non-reference delta blocks
The previous patch explained the overall idea of delta-compression for reverse
snapshots. Let's go in a little more detail:
* When asked to include a memory block in a snapshot (an in-memory savestate)
we first check whether we already have a reference for this block.
* If not we create a 'DeltaBlockCopy' object. This is basically a memcpy of the
data.
* If we do have a reference we calculate the difference between that
and the current block and store the result in a 'DeltaBlockDiff' object.
* (Typically) over time a memory block will start to deviate more and more from
the reference, thus the delta between both becomes larger. When this
difference becomes too large (using some heuristics to determine what 'too
large' means exactly), instead of creating a new 'DeltaBlockDiff' we create a
'DeltaBlockCopy' and use that as the new reference for the future.
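
The steps above can be sketched roughly as follows. This is not the openMSX
implementation: 'SnapshotSketch', 'BlockHistorySketch', the (offset, value)
diff encoding and the 'a quarter of the block' threshold are all assumptions
chosen to keep the example short. The real code uses its own encoding and
heuristics.

```cpp
#include <cstddef>
#include <cstdint>
#include <memory>
#include <utility>
#include <vector>

using Bytes = std::vector<std::uint8_t>;

// Hypothetical: one stored block, expressed relative to a reference.
struct SnapshotSketch {
    std::shared_ptr<const Bytes> reference;
    // (offset, new value) pairs; empty when the snapshot IS the reference.
    std::vector<std::pair<std::size_t, std::uint8_t>> diff;
};

class BlockHistorySketch {
public:
    SnapshotSketch store(const Bytes& current) {
        if (reference && reference->size() == current.size()) {
            std::vector<std::pair<std::size_t, std::uint8_t>> diff;
            for (std::size_t i = 0; i < current.size(); ++i) {
                if ((*reference)[i] != current[i]) {
                    diff.emplace_back(i, current[i]);
                }
            }
            // Made-up heuristic for 'too large': more than 1/4 changed.
            if (diff.size() * 4 <= current.size()) {
                return {reference, diff}; // plays the DeltaBlockDiff role
            }
        }
        // No usable reference: copy the block and make it the new reference.
        reference = std::make_shared<Bytes>(current); // DeltaBlockCopy role
        return {reference, {}};
    }

private:
    std::shared_ptr<const Bytes> reference;
};

// Restoring needs only the reference plus one diff, never a chain.
Bytes restore(const SnapshotSketch& s) {
    Bytes out = *s.reference;
    for (const auto& d : s.diff) out[d.first] = d.second;
    return out;
}
```

Because every snapshot depends only on its reference, any individual snapshot
can be dropped without affecting the others.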
Once we switch to a new reference block, the old reference isn't used anymore.
Not even when going back in time. (In other words: we do use old reference
blocks to restore the machine state, but never as a reference to create new
delta-blocks). Instead, after a 'reverse goto' we always create new references.
We might lose a bit of compression efficiency because of this, but in return
we get a simpler implementation. (The underlying technical reason is that
blocks are identified based on their start address, and switching to a new
machine (internally done by 'reverse goto') creates new objects with different
start addresses for the memory blocks).
To be able to (efficiently) calculate the difference between memory blocks,
both blocks need to be present (uncompressed) in memory. Though after we switch
to a new reference we can compress the old reference block (see previous
paragraph). This gives the guarantee that the new reverse mechanism (delta
compression) never uses more memory compared to the old mechanism (compress
each block independently). (Modulo the size of a single snapshot and taking the
current heuristic into account. I'm not going into more detail).
I did some quick tests and the new mechanism indeed typically uses 2x-5x less
memory compared to the old one. As expected the gain is larger for more complex
MSX machines and/or when the MSX software doesn't use all included MSX devices.
* [1/12] Initial version of DeltaBlock based reverse snapshots
First the motivation for this change:
* We recently added the 'KonamiUltimateCollection' mapper. This contains an 8MB
flashrom. When reverse is enabled, each second, we compress this 8MB memory
block. In some cases this causes noticeable hiccups.
* On 'small' devices (e.g. Android) we don't enable reverse by default
because it takes a relatively large amount of memory and CPU time.
In case of 'KonamiUltimateCollection' the flashrom rarely changes, so for each
reverse snapshot we spend time compressing the exact same memory block. This is
true in general, flashroms rarely change. But also for many other, sometimes
large, memory blocks like MoonSound sample RAM (typically gets (re-)loaded
infrequently) or Gfx9000 VRAM in case you're emulating non-v9990 software. Our
boosted machine configs do include both MoonSound and Gfx9000.
A smarter way to store memory blocks in reverse snapshots could be to only
store the difference between successive snapshots. In case the blocks truly
don't change much, this difference can be encoded very compactly in memory,
resulting in lower overall memory consumption for the reverse system.
Calculating this difference is also much simpler compared to compressing a
block. So there's hope this can also be done faster.
There's one complication: if each snapshot stores the difference with the
previous snapshot, and you want to restore a specific snapshot, you have to
start from the initial snapshot and apply all differences in turn. This could
take a long time. Also to save memory we want to prune snapshots, that is from
recent history we keep more snapshots in memory compared to the more distant
history. But if all differences are linked in a chain it's impossible to drop
some of the snapshots.
To solve both problems we don't calculate the difference between two successive
snapshots but between a semi-fixed reference block and the current block. Once
this difference becomes too large we switch to a new reference.
This is only the first patch in a series. It adds the basic delta-compression
functionality. It's already fully working but:
* Memory consumption isn't very good yet (although already better than before
this patch. Of course very much depending on which emulated machine and
software you use to measure).
* Performance is actually a bit worse compared to before this patch.
Both issues will be addressed later in this patch series.
It's probably worth mentioning that this only affects the in-memory savestates
(the reverse snapshots). The on-disk savestate and replay format is unchanged.

Download: OpenMSX Git (2016/08/21) x86
Download: OpenMSX Git (2016/08/21) x64
Source: Here


