<!doctype html>
<html>
<head>
<meta charset="utf-8" />
<meta name="viewport" content="width=device-width, initial-scale=1.0" />
<title>Floating Point Representation — Imath Documentation</title>
<link rel="stylesheet" href="_static/pygments.css" type="text/css" />
<link rel="stylesheet" href="_static/bizstyle.css" type="text/css" />
<script id="documentation_options" data-url_root="./" src="_static/documentation_options.js"></script>
<script src="_static/jquery.js"></script>
<script src="_static/underscore.js"></script>
<script src="_static/doctools.js"></script>
<script async="async" src="https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.7/latest.js?config=TeX-AMS-MML_HTMLorMML"></script>
<script src="_static/bizstyle.js"></script>
<link rel="index" title="Index" href="genindex.html" />
<link rel="search" title="Search" href="search.html" />
<link rel="next" title="Box" href="classes/Box.html" />
<link rel="prev" title="half-float Conversion Configuration Options" href="half_conversion.html" />
<!--[if lt IE 9]>
<script src="_static/css3-mediaqueries.js"></script>
<![endif]-->
</head><body>
<div class="related" role="navigation" aria-label="related navigation">
<h3>Navigation</h3>
<ul>
<li class="right" style="margin-right: 10px">
<a href="genindex.html" title="General Index"
accesskey="I">index</a></li>
<li class="right" >
<a href="classes/Box.html" title="Box"
accesskey="N">next</a> |</li>
<li class="right" >
<a href="half_conversion.html" title="half-float Conversion Configuration Options"
accesskey="P">previous</a> |</li>
<li class="nav-item nav-item-0"><a href="index.html">Imath</a> »</li>
<li class="nav-item nav-item-this"><a href="">Floating Point Representation</a></li>
</ul>
</div>
<div class="document">
<div class="documentwrapper">
<div class="bodywrapper">
<div class="body" role="main">
<div class="section" id="floating-point-representation">
<h1>Floating Point Representation<a class="headerlink" href="#floating-point-representation" title="Permalink to this headline">¶</a></h1>
<p><strong>Representation of a 32-bit float:</strong></p>
<p>We assume that a float, f, is an IEEE 754 single-precision floating point number, whose bits are arranged as follows: <div class="highlight-default notranslate"><div class="highlight"><pre><span></span><span class="mi">31</span> <span class="p">(</span><span class="n">msb</span><span class="p">)</span>
<span class="o">|</span>
<span class="o">|</span> <span class="mi">30</span>     <span class="mi">23</span>
<span class="o">|</span> <span class="o">|</span>      <span class="o">|</span>
<span class="o">|</span> <span class="o">|</span>      <span class="o">|</span> <span class="mi">22</span>                    <span class="mi">0</span> <span class="p">(</span><span class="n">lsb</span><span class="p">)</span>
<span class="o">|</span> <span class="o">|</span>      <span class="o">|</span> <span class="o">|</span>                     <span class="o">|</span>
<span class="n">X</span> <span class="n">XXXXXXXX</span> <span class="n">XXXXXXXXXXXXXXXXXXXXXXX</span>
<span class="n">s</span> <span class="n">e</span>        <span class="n">m</span>
</pre></div>
</div>
s is the sign bit, e is the 8-bit exponent, and m is the 23-bit significand.</p>
<p>If e is between 1 and 254, f is a normalized number, where 1.m denotes the binary fraction formed by an implicit leading 1 followed by the bits of m: <div class="highlight-default notranslate"><div class="highlight"><pre><span></span><span class="n">f</span> <span class="o">=</span> <span class="p">(</span><span class="o">-</span><span class="mi">1</span><span class="p">)</span><span class="o">^</span><span class="n">s</span> <span class="o">*</span> <span class="mi">2</span><span class="o">^</span><span class="p">(</span><span class="n">e</span><span class="o">-</span><span class="mi">127</span><span class="p">)</span> <span class="o">*</span> <span class="mf">1.</span><span class="n">m</span>
</pre></div>
</div>
If e is 0, and m is not zero, f is a denormalized number, where 0.m has a leading 0 in place of the implicit 1: <div class="highlight-default notranslate"><div class="highlight"><pre><span></span><span class="n">f</span> <span class="o">=</span> <span class="p">(</span><span class="o">-</span><span class="mi">1</span><span class="p">)</span><span class="o">^</span><span class="n">s</span> <span class="o">*</span> <span class="mi">2</span><span class="o">^</span><span class="p">(</span><span class="o">-</span><span class="mi">126</span><span class="p">)</span> <span class="o">*</span> <span class="mf">0.</span><span class="n">m</span>
</pre></div>
</div>
If e and m are both zero, f is zero: <div class="highlight-default notranslate"><div class="highlight"><pre><span></span><span class="n">f</span> <span class="o">=</span> <span class="mf">0.0</span>
</pre></div>
</div>
If e is 255, f is an infinity when m is zero, or “not a number” (NAN) when m is nonzero.</p>
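<p>The rules above can be sketched in C++ as follows. This is a minimal illustration, not Imath code: <code class="docutils literal notranslate"><span class="pre">FloatFields</span></code>, <code class="docutils literal notranslate"><span class="pre">splitFloat</span></code>, <code class="docutils literal notranslate"><span class="pre">classify</span></code> and <code class="docutils literal notranslate"><span class="pre">valueOf</span></code> are hypothetical names introduced here, and the sketch assumes that <code class="docutils literal notranslate"><span class="pre">float</span></code> is a 32-bit IEEE 754 type.</p>

```cpp
#include <cmath>
#include <cstdint>
#include <cstring>
#include <string>

static_assert (sizeof (float) == 4, "this sketch assumes a 32-bit float");

// Hypothetical helper type: the three bit fields of a 32-bit float.
struct FloatFields
{
    std::uint32_t s; // sign bit (bit 31)
    std::uint32_t e; // exponent field (bits 30-23)
    std::uint32_t m; // significand field (bits 22-0)
};

// Split a float into s, e and m. memcpy copies the bit pattern without
// relying on pointer casts.
FloatFields splitFloat (float f)
{
    std::uint32_t bits;
    std::memcpy (&bits, &f, sizeof bits);
    return { bits >> 31, (bits >> 23) & 0xFFu, bits & 0x7FFFFFu };
}

// Name the category selected by e and m, following the rules above.
std::string classify (const FloatFields& x)
{
    if (x.e == 255) return x.m == 0 ? "infinity" : "NaN";
    if (x.e == 0) return x.m == 0 ? "zero" : "denormalized";
    return "normalized";
}

// Rebuild the value of a finite float (e < 255) from its fields.
// The result is not meaningful when e is 255.
double valueOf (const FloatFields& x)
{
    const double sign = x.s ? -1.0 : 1.0;
    if (x.e == 0) // zero or denormalized: 2^(-126) * 0.m
        return sign * std::ldexp (double (x.m), -126 - 23);
    // normalized: 2^(e-127) * 1.m, with the implicit 1 restored as bit 23
    return sign * std::ldexp (double (x.m | 0x800000u), int (x.e) - 127 - 23);
}
```

<p>For the value -124.0625, <code class="docutils literal notranslate"><span class="pre">splitFloat</span></code> yields s = 1, e = 133 and m = 0x782000, and <code class="docutils literal notranslate"><span class="pre">valueOf</span></code> reproduces -124.0625 from those fields.</p>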
<p>Examples: <div class="highlight-default notranslate"><div class="highlight"><pre><span></span><span class="mi">0</span> <span class="mi">00000000</span> <span class="mi">00000000000000000000000</span> <span class="o">=</span> <span class="mf">0.0</span>
<span class="mi">0</span> <span class="mi">01111110</span> <span class="mi">00000000000000000000000</span> <span class="o">=</span> <span class="mf">0.5</span>
<span class="mi">0</span> <span class="mi">01111111</span> <span class="mi">00000000000000000000000</span> <span class="o">=</span> <span class="mf">1.0</span>
<span class="mi">0</span> <span class="mi">10000000</span> <span class="mi">00000000000000000000000</span> <span class="o">=</span> <span class="mf">2.0</span>
<span class="mi">0</span> <span class="mi">10000000</span> <span class="mi">10000000000000000000000</span> <span class="o">=</span> <span class="mf">3.0</span>
<span class="mi">1</span> <span class="mi">10000101</span> <span class="mi">11110000010000000000000</span> <span class="o">=</span> <span class="o">-</span><span class="mf">124.0625</span>
<span class="mi">0</span> <span class="mi">11111111</span> <span class="mi">00000000000000000000000</span> <span class="o">=</span> <span class="o">+</span><span class="n">infinity</span>
<span class="mi">1</span> <span class="mi">11111111</span> <span class="mi">00000000000000000000000</span> <span class="o">=</span> <span class="o">-</span><span class="n">infinity</span>
<span class="mi">0</span> <span class="mi">11111111</span> <span class="mi">10000000000000000000000</span> <span class="o">=</span> <span class="n">NAN</span>
<span class="mi">1</span> <span class="mi">11111111</span> <span class="mi">11111111111111111111111</span> <span class="o">=</span> <span class="n">NAN</span>
</pre></div>
</div>
<strong>Representation of a 16-bit half:</strong></p>
<p>Here is the bit layout for a half number, h: <div class="highlight-default notranslate"><div class="highlight"><pre><span></span><span class="mi">15</span> <span class="p">(</span><span class="n">msb</span><span class="p">)</span>
<span class="o">|</span>
<span class="o">|</span> <span class="mi">14</span>  <span class="mi">10</span>
<span class="o">|</span> <span class="o">|</span>   <span class="o">|</span>
<span class="o">|</span> <span class="o">|</span>   <span class="o">|</span> <span class="mi">9</span>        <span class="mi">0</span> <span class="p">(</span><span class="n">lsb</span><span class="p">)</span>
<span class="o">|</span> <span class="o">|</span>   <span class="o">|</span> <span class="o">|</span>        <span class="o">|</span>
<span class="n">X</span> <span class="n">XXXXX</span> <span class="n">XXXXXXXXXX</span>
<span class="n">s</span> <span class="n">e</span>     <span class="n">m</span>
</pre></div>
</div>
s is the sign bit, e is the 5-bit exponent, and m is the 10-bit significand.</p>
<p>If e is between 1 and 30, h is a normalized number, where 1.m denotes the binary fraction formed by an implicit leading 1 followed by the bits of m: <div class="highlight-default notranslate"><div class="highlight"><pre><span></span><span class="n">h</span> <span class="o">=</span> <span class="p">(</span><span class="o">-</span><span class="mi">1</span><span class="p">)</span><span class="o">^</span><span class="n">s</span> <span class="o">*</span> <span class="mi">2</span><span class="o">^</span><span class="p">(</span><span class="n">e</span><span class="o">-</span><span class="mi">15</span><span class="p">)</span> <span class="o">*</span> <span class="mf">1.</span><span class="n">m</span>
</pre></div>
</div>
If e is 0, and m is not zero, h is a denormalized number, where 0.m has a leading 0 in place of the implicit 1: <div class="highlight-default notranslate"><div class="highlight"><pre><span></span><span class="n">h</span> <span class="o">=</span> <span class="p">(</span><span class="o">-</span><span class="mi">1</span><span class="p">)</span><span class="o">^</span><span class="n">s</span> <span class="o">*</span> <span class="mi">2</span><span class="o">^</span><span class="p">(</span><span class="o">-</span><span class="mi">14</span><span class="p">)</span> <span class="o">*</span> <span class="mf">0.</span><span class="n">m</span>
</pre></div>
</div>
If e and m are both zero, h is zero: <div class="highlight-default notranslate"><div class="highlight"><pre><span></span><span class="n">h</span> <span class="o">=</span> <span class="mf">0.0</span>
</pre></div>
</div>
If e is 31, h is an infinity when m is zero, or “not a number” (NAN) when m is nonzero.</p>
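<p>As a rough sketch, the half rules can be applied directly to a 16-bit pattern to obtain the value it represents. The function name <code class="docutils literal notranslate"><span class="pre">halfBitsToValue</span></code> is hypothetical and is not part of Imath; the code evaluates the formulas literally with <code class="docutils literal notranslate"><span class="pre">std::ldexp</span></code> and makes no attempt to be fast.</p>

```cpp
#include <cmath>
#include <cstdint>
#include <limits>

// Decode the 16 bits of a half into the value they represent.
float halfBitsToValue (std::uint16_t h)
{
    const int   s    = (h >> 15) & 0x1;  // sign bit (bit 15)
    const int   e    = (h >> 10) & 0x1F; // exponent field (bits 14-10)
    const int   m    = h & 0x3FF;        // significand field (bits 9-0)
    const float sign = s ? -1.0f : 1.0f;

    if (e == 31) // infinity when m is zero, NaN otherwise
        return m == 0 ? sign * std::numeric_limits<float>::infinity ()
                      : std::numeric_limits<float>::quiet_NaN ();

    if (e == 0) // zero or denormalized: 2^(-14) * 0.m
        return sign * std::ldexp (float (m), -14 - 10);

    // normalized: 2^(e-15) * 1.m, with the implicit 1 restored as bit 10
    return sign * std::ldexp (float (m | 0x400), e - 15 - 10);
}
```

<p>Written in hexadecimal, the pattern for 1.0 is 0x3C00 and the pattern for -124.0625 is 0xD7C1; passing either to <code class="docutils literal notranslate"><span class="pre">halfBitsToValue</span></code> returns the corresponding value.</p>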
<p>Examples: <div class="highlight-default notranslate"><div class="highlight"><pre><span></span><span class="mi">0</span> <span class="mi">00000</span> <span class="mi">0000000000</span> <span class="o">=</span> <span class="mf">0.0</span>
<span class="mi">0</span> <span class="mi">01110</span> <span class="mi">0000000000</span> <span class="o">=</span> <span class="mf">0.5</span>
<span class="mi">0</span> <span class="mi">01111</span> <span class="mi">0000000000</span> <span class="o">=</span> <span class="mf">1.0</span>
<span class="mi">0</span> <span class="mi">10000</span> <span class="mi">0000000000</span> <span class="o">=</span> <span class="mf">2.0</span>
<span class="mi">0</span> <span class="mi">10000</span> <span class="mi">1000000000</span> <span class="o">=</span> <span class="mf">3.0</span>
<span class="mi">1</span> <span class="mi">10101</span> <span class="mi">1111000001</span> <span class="o">=</span> <span class="o">-</span><span class="mf">124.0625</span>
<span class="mi">0</span> <span class="mi">11111</span> <span class="mi">0000000000</span> <span class="o">=</span> <span class="o">+</span><span class="n">infinity</span>
<span class="mi">1</span> <span class="mi">11111</span> <span class="mi">0000000000</span> <span class="o">=</span> <span class="o">-</span><span class="n">infinity</span>
<span class="mi">0</span> <span class="mi">11111</span> <span class="mi">1000000000</span> <span class="o">=</span> <span class="n">NAN</span>
<span class="mi">1</span> <span class="mi">11111</span> <span class="mi">1111111111</span> <span class="o">=</span> <span class="n">NAN</span>
</pre></div>
</div>
<strong>Conversion via Lookup Table:</strong></p>
<p>By default, conversion from half to float uses a lookup table. There are only 65,536 distinct half values; each has been converted in advance and stored in a table pointed to by the <code class="docutils literal notranslate"><span class="pre">imath_half_to_float_table</span></code> pointer.</p>
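<p>The idea of a table-driven conversion can be illustrated as follows. This is only a sketch: <code class="docutils literal notranslate"><span class="pre">buildHalfToFloatTable</span></code> and <code class="docutils literal notranslate"><span class="pre">halfToFloat</span></code> are hypothetical names, the table is built at run time in a <code class="docutils literal notranslate"><span class="pre">std::vector</span></code>, and nothing here is meant to describe how the library's own table is generated or stored.</p>

```cpp
#include <cmath>
#include <cstdint>
#include <limits>
#include <vector>

// Hypothetical stand-in for a precomputed table: one float per half pattern.
std::vector<float> buildHalfToFloatTable ()
{
    const float inf  = std::numeric_limits<float>::infinity ();
    const float qnan = std::numeric_limits<float>::quiet_NaN ();

    std::vector<float> table (65536);
    for (std::uint32_t h = 0; h < 65536; ++h)
    {
        const int   e    = (h >> 10) & 0x1F;
        const int   m    = h & 0x3FF;
        const float sign = (h & 0x8000u) ? -1.0f : 1.0f;

        if (e == 31) // infinity or NaN
            table[h] = (m == 0) ? sign * inf : qnan;
        else if (e == 0) // zero or denormalized
            table[h] = sign * std::ldexp (float (m), -24);
        else // normalized
            table[h] = sign * std::ldexp (float (m | 0x400), e - 25);
    }
    return table;
}

// Once the table exists, a conversion is a single indexed load.
inline float halfToFloat (const std::vector<float>& table, std::uint16_t h)
{
    return table[h];
}
```

<p>With 4-byte floats, a table of this shape occupies 65,536 x 4 bytes = 256 KiB, which is the memory cost traded for the fast lookup.</p>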
<p>Prior to Imath v3.1, conversion from float to half was accomplished with the help of an exponent lookup table; this has been replaced with explicit bit shifting.</p>
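<p>To give a feel for what a float-to-half conversion by bit shifting involves, here is a minimal sketch. It is not the Imath implementation: <code class="docutils literal notranslate"><span class="pre">floatToHalfBits</span></code> is a hypothetical name, rounding is assumed to be round-to-nearest with ties to even, values too large for half are assumed to become infinity, and the handling of NaN payloads is one arbitrary choice among several.</p>

```cpp
#include <cstdint>
#include <cstring>

// Convert a 32-bit float to the bit pattern of the nearest half.
std::uint16_t floatToHalfBits (float f)
{
    std::uint32_t x;
    std::memcpy (&x, &f, sizeof x);

    const std::uint32_t sign = (x >> 16) & 0x8000u; // sign moved to bit 15
    const std::uint32_t mag  = x & 0x7FFFFFFFu;     // float bits without sign

    if (mag >= 0x7F800000u) // float infinity or NaN (e == 255)
    {
        if (mag == 0x7F800000u) return std::uint16_t (sign | 0x7C00u);
        // Keep the top significand bits; force m nonzero so it stays a NaN.
        const std::uint32_t m = (mag >> 13) & 0x3FFu;
        return std::uint16_t (sign | 0x7C00u | m | (m == 0 ? 1u : 0u));
    }

    if (mag >= 0x38800000u) // at least 2^(-14): normalized half, or overflow
    {
        std::uint32_t v = mag - 0x38000000u; // rebias exponent from 127 to 15
        v += 0xFFFu + ((v >> 13) & 1u);      // round to nearest, ties to even
        std::uint32_t h = v >> 13;           // drop the 13 extra fraction bits
        if (h > 0x7C00u) h = 0x7C00u;        // too large for half: infinity
        return std::uint16_t (sign | h);
    }

    const std::uint32_t e = mag >> 23;
    if (e < 102) return std::uint16_t (sign); // below 2^(-25): rounds to zero

    // Remaining inputs become a denormalized half (or round up to 2^(-14)).
    const std::uint32_t m     = (mag & 0x7FFFFFu) | 0x800000u; // explicit 1.m
    const std::uint32_t shift = 126 - e;                       // 14 to 24
    std::uint32_t       h     = m >> shift;
    const std::uint32_t rem   = m & ((1u << shift) - 1u);      // discarded bits
    const std::uint32_t tie   = 1u << (shift - 1);             // exactly half
    if (rem > tie || (rem == tie && (h & 1u))) ++h;
    return std::uint16_t (sign | h);
}
```

<p>Because a carry out of the significand increments the exponent field, the same addition handles both ordinary rounding and rounding up to the next power of two. The largest finite half, 65504, maps to 0x7BFF, while 65520 rounds up to infinity (0x7C00).</p>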
<p><strong>Conversion via Hardware:</strong></p>
<p>As of Imath v3.1, the conversion routines use F16C SSE instructions whenever they are present and enabled by compiler flags.</p>
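<p>The following sketch shows what a hardware conversion can look like with the F16C intrinsics from <code class="docutils literal notranslate"><span class="pre">immintrin.h</span></code>. It is an illustration under stated assumptions, not the library's code: the function names are hypothetical, and the check relies on the <code class="docutils literal notranslate"><span class="pre">__F16C__</span></code> macro that GCC and Clang define when F16C is enabled (for example with <code class="docutils literal notranslate"><span class="pre">-mf16c</span></code>). When the macro is absent, the functions report failure so that a caller can fall back to a software path.</p>

```cpp
#include <cstdint>
#if defined(__F16C__)
#    include <immintrin.h>
#endif

// Returns true and writes the result if this translation unit was compiled
// with F16C enabled; returns false otherwise.
inline bool halfToFloatF16C (std::uint16_t h, float& out)
{
#if defined(__F16C__)
    const __m128i packed = _mm_cvtsi32_si128 (h);     // half bits in lane 0
    out = _mm_cvtss_f32 (_mm_cvtph_ps (packed));      // convert, read lane 0
    return true;
#else
    (void) h;
    (void) out;
    return false;
#endif
}

// Same contract for the opposite direction, rounding to nearest.
inline bool floatToHalfF16C (float f, std::uint16_t& out)
{
#if defined(__F16C__)
    const __m128i packed =
        _mm_cvtps_ph (_mm_set_ss (f), _MM_FROUND_TO_NEAREST_INT);
    out = std::uint16_t (_mm_cvtsi128_si32 (packed)); // low 16 bits of lane 0
    return true;
#else
    (void) f;
    (void) out;
    return false;
#endif
}
```

<p>A compile-time check of this kind only establishes that the compiler was allowed to emit F16C instructions; it does not verify that the CPU running the program supports them.</p>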
<p><strong>Conversion via Bit-Shifting:</strong></p>
<p>If F16C SSE instructions are not available, conversion can be accomplished by a bit-shifting algorithm. For half-to-float conversion, this is generally slower than the lookup table, but it may be preferable when memory limits preclude storing the 65,536-entry lookup table.</p>
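<p>A half-to-float conversion by bit shifting can be sketched as below. The name <code class="docutils literal notranslate"><span class="pre">halfBitsToFloat</span></code> is hypothetical, and the code is a straightforward rendering of the two bit layouts rather than the optimized routine used by the library. It assumes that <code class="docutils literal notranslate"><span class="pre">float</span></code> is a 32-bit IEEE 754 type.</p>

```cpp
#include <cstdint>
#include <cstring>

// Build the float bit pattern that has the same value as the half pattern h.
float halfBitsToFloat (std::uint16_t h)
{
    const std::uint32_t sign = std::uint32_t (h & 0x8000u) << 16; // to bit 31
    std::uint32_t       e    = (h >> 10) & 0x1Fu;
    std::uint32_t       m    = h & 0x3FFu;
    std::uint32_t       bits;

    if (e == 31) // infinity or NaN: the float exponent field becomes 255
        bits = sign | 0x7F800000u | (m << 13);
    else if (e != 0) // normalized: rebias the exponent by 127 - 15 = 112
        bits = sign | ((e + 112u) << 23) | (m << 13);
    else if (m == 0) // zero keeps only its sign
        bits = sign;
    else // denormalized half: every such value is a normalized float
    {
        e = 113; // float exponent field for 2^(-14)
        while ((m & 0x400u) == 0) // shift until the leading 1 reaches bit 10
        {
            m <<= 1;
            --e;
        }
        bits = sign | (e << 23) | ((m & 0x3FFu) << 13); // drop the leading 1
    }

    float f;
    std::memcpy (&f, &bits, sizeof f);
    return f;
}
```

<p>The 10 significand bits of a half are moved to the top of the float's 23-bit field by shifting left by 13, which is why the conversion in this direction is always exact.</p>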
<p>The lookup table symbol is included in the compilation even if <code class="docutils literal notranslate"><span class="pre">IMATH_HALF_USE_LOOKUP_TABLE</span></code> is false, because application code using the exported <code class="docutils literal notranslate"><span class="pre">half.h</span></code> may choose to enable the use of the table.</p>
<p>An implementation can eliminate the table from compilation by defining the <code class="docutils literal notranslate"><span class="pre">IMATH_HALF_NO_LOOKUP_TABLE</span></code> preprocessor symbol. Simply add: <div class="highlight-default notranslate"><div class="highlight"><pre><span></span><span class="c1">#define IMATH_HALF_NO_LOOKUP_TABLE</span>
</pre></div>
</div>
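before including <code class="docutils literal notranslate"><span class="pre">half.h</span></code>, or define the symbol on the compile command line (for example, <code class="docutils literal notranslate"><span class="pre">-DIMATH_HALF_NO_LOOKUP_TABLE</span></code> with GCC or Clang).</p>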
<p>Furthermore, an implementation wishing to receive <code class="docutils literal notranslate"><span class="pre">FE_OVERFLOW</span></code> and <code class="docutils literal notranslate"><span class="pre">FE_UNDERFLOW</span></code> floating point exceptions when converting float to half by the bit-shift algorithm can define the preprocessor symbol <code class="docutils literal notranslate"><span class="pre">IMATH_HALF_ENABLE_FP_EXCEPTIONS</span></code> prior to including <code class="docutils literal notranslate"><span class="pre">half.h</span></code>: <div class="highlight-default notranslate"><div class="highlight"><pre><span></span><span class="c1">#define IMATH_HALF_ENABLE_FP_EXCEPTIONS</span>
</pre></div>
</div>
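The flags named above are the standard ones from <code class="docutils literal notranslate"><span class="pre">&lt;cfenv&gt;</span></code>. As a rough sketch of how an application can observe them, the hypothetical helper below clears the flags, performs a computation, and then tests for <code class="docutils literal notranslate"><span class="pre">FE_OVERFLOW</span></code>. The float multiplication is only a stand-in for the conversion; no Imath call is shown, and the sketch assumes a platform that supports floating point exception flags.

```cpp
#include <cfenv>

// Report whether multiplying a by b raises the FE_OVERFLOW flag.
// In real use, the operation of interest would replace the multiplication.
bool raisesOverflow (float a, float b)
{
    std::feclearexcept (FE_ALL_EXCEPT); // start from a clean set of flags

    // volatile keeps the compiler from folding the product at compile time
    volatile float x       = a;
    volatile float y       = b;
    volatile float product = x * y;
    (void) product;

    return std::fetestexcept (FE_OVERFLOW) != 0;
}
```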
<strong>Conversion Performance Comparison:</strong></p>
<p>Measured on an Intel Core i9, the approximate timings are:</p>
<p>half-to-float:<ul class="simple">
<li><p>lookup table: 0.71 ns / call</p></li>
<li><p>bit shifting (no table): 1.06 ns / call</p></li>
<li><p>F16C: 0.45 ns / call</p></li>
</ul>
</p>
<p>float-to-half:<ul class="simple">
<li><p>original (exponent lookup table): 5.2 ns / call</p></li>
<li><p>bit shifting (no exponent table, optimized): 1.27 ns / call</p></li>
<li><p>F16C: 0.45 ns / call</p></li>
</ul>
</p>
<p><strong>Note:</strong> the timings above depend on the distribution of the input values.</p>
</div>
<div class="clearer"></div>
</div>
</div>
</div>
<div class="sphinxsidebar" role="navigation" aria-label="main navigation">
<div class="sphinxsidebarwrapper">
<p class="logo"><a href="index.html">
<img class="logo" src="_static/imath-logo-blue.png" alt="Logo"/>
</a></p>
<h4>Previous topic</h4>
<p class="topless"><a href="half_conversion.html"
title="previous chapter">half-float Conversion Configuration Options</a></p>
<h4>Next topic</h4>
<p class="topless"><a href="classes/Box.html"
title="next chapter">Box</a></p>
<div role="note" aria-label="source link">
<h3>This Page</h3>
<ul class="this-page-menu">
<li><a href="_sources/float.rst.txt"
rel="nofollow">Show Source</a></li>
</ul>
</div>
<div id="searchbox" style="display: none" role="search">
<h3 id="searchlabel">Quick search</h3>
<div class="searchformwrapper">
<form class="search" action="search.html" method="get">
<input type="text" name="q" aria-labelledby="searchlabel" />
<input type="submit" value="Go" />
</form>
</div>
</div>
<script>$('#searchbox').show(0);</script>
</div>
</div>
<div class="clearer"></div>
</div>
<div class="footer" role="contentinfo">
© Copyright 2021, Contributors to the OpenEXR Project.
Created using <a href="https://www.sphinx-doc.org/">Sphinx</a> 3.4.3.
</div>
</body>
</html>